Data Science Methodology

Coursera/IBM Data Science

Data Science Methodology - From Understanding to Preparation and From Modeling to Evaluation (1)

떼닝 2023. 12. 27. 07:49

From Understanding to Preparation

Data Understanding

Case Study : Understanding the Data

Descriptive Statistics:

- univariate statistics (univariate : involving a single variable)

- pairwise correlations (pairwise : in pairs)

- histogram

Histograms are a good way to understand:

- how values or variables are distributed

- what data preparation might be needed to make the variable more useful in a model
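A minimal sketch of the three descriptive summaries listed above, using only the Python standard library. The values in `ages` and `prior_admissions` are made up for illustration and are not data from the case study; `pearson` and `histogram` are hypothetical helper names.

```python
import statistics
from collections import Counter

# Hypothetical example values, not data from the case study.
ages = [54, 61, 67, 70, 72, 75, 78, 81, 84, 90]
prior_admissions = [0, 1, 0, 2, 1, 3, 2, 4, 3, 5]

# Univariate statistics: summarize one variable at a time.
age_summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
    "min": min(ages),
    "max": max(ages),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Pairwise correlation: how closely two variables move together.
age_vs_admissions = pearson(ages, prior_admissions)

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Histogram: how the values of one variable are distributed.
age_histogram = histogram(ages, bin_width=10)

print(age_summary)
print(round(age_vs_admissions, 3))
for lower_edge, count in age_histogram.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```

A skewed or lumpy histogram is often the first hint that a variable may need binning, a transformation, or consolidation of rare categories before modeling.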

Case Study : Looking at data quality

Data Quality:

- missing values

- invalid or misleading values
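The two data quality checks above could be sketched as follows. The records, the column names, and the plausibility ranges in `valid_ranges` are all assumptions made for illustration; in practice the valid ranges would come from domain experts.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"patient_id": 1, "age": 67, "length_of_stay": 4},
    {"patient_id": 2, "age": None, "length_of_stay": 2},
    {"patient_id": 3, "age": 212, "length_of_stay": 3},  # implausible age
    {"patient_id": 4, "age": 58, "length_of_stay": -1},  # negative stay
]

# Assumed plausibility ranges (inclusive), chosen only for this example.
valid_ranges = {"age": (0, 120), "length_of_stay": (0, 365)}

def quality_report(rows, ranges):
    """Count missing and out-of-range values per column."""
    report = {col: {"missing": 0, "invalid": 0} for col in ranges}
    for row in rows:
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is None:
                report[col]["missing"] += 1
            elif not low <= value <= high:
                report[col]["invalid"] += 1
    return report

report = quality_report(records, valid_ranges)
print(report)
```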

Case Study : This is an iterative process

- Iterative data collection and understanding : Refined definition of "CHF Admission"

From Understanding to Preparation

- Data understanding : What does it mean to "prepare" or "clean" data?

- Data preparation : What are ways in which data is prepared?

Examples of Data Cleansing

Using Domain Knowledge

- Feature engineering is the process of using domain knowledge of the data to create features that make the machine learning algorithms work

- Feature engineering is critical when machine learning tools are being applied to analyze the data

Data Preparation - Case Study

Case Study : Defining CHF admission

- Define "CHF admission" and "CHF readmission"

Case Study : Aggregating records

Transactional records:

- Claims : professional provider, facility, pharmaceutical

- Inpatient & outpatient records : diagnoses, procedures, prescriptions and more

- Possibly thousands per patient, depending on clinical history

Case Study : Aggregating to patient level

- Aggregate to patient level

- Roll up to 1 record per patient

Create new columns representing the transaction:

- Outpatient visits / inpatient episodes : frequency, recency, diagnoses/length of stay, procedures, prescriptions

- Comorbidities with CHF (comorbidity : a patient having two or more chronic conditions at the same time)
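As a rough illustration of rolling transactional records up to one record per patient, the sketch below derives frequency and recency columns from visit-level rows. The `visits` data, the column names, and `reference_date` are hypothetical, and the case study's actual variables are not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional records: one row per visit.
visits = [
    {"patient_id": "A", "kind": "inpatient", "date": date(2023, 1, 5), "length_of_stay": 4},
    {"patient_id": "A", "kind": "outpatient", "date": date(2023, 2, 10), "length_of_stay": 0},
    {"patient_id": "A", "kind": "inpatient", "date": date(2023, 3, 1), "length_of_stay": 6},
    {"patient_id": "B", "kind": "outpatient", "date": date(2023, 1, 20), "length_of_stay": 0},
]
reference_date = date(2023, 4, 1)  # assumed "as of" date for measuring recency

def roll_up(rows, as_of):
    """Aggregate visit-level rows to one record per patient."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_id"]].append(row)

    patients = {}
    for patient_id, patient_rows in grouped.items():
        inpatient = [r for r in patient_rows if r["kind"] == "inpatient"]
        last_visit = max(r["date"] for r in patient_rows)
        patients[patient_id] = {
            "outpatient_visits": len(patient_rows) - len(inpatient),  # frequency
            "inpatient_episodes": len(inpatient),                      # frequency
            "days_since_last_visit": (as_of - last_visit).days,       # recency
            "total_length_of_stay": sum(r["length_of_stay"] for r in inpatient),
        }
    return patients

patient_table = roll_up(visits, reference_date)
for patient_id, row in patient_table.items():
    print(patient_id, row)
```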

Case Study : More or less data needed?

- Literature review of important factors for CHF readmission

- Loop back to the data collection stage and add additional data, if needed

Case Study : Completing the data set

Merge all data into one table:

- one record per patient

- list of variables used in modeling (target : CHF readmission within 30 days following discharge from CHF hospitalization)
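One possible way to derive the target variable is sketched below: a patient is flagged if any CHF admission begins within 30 days of the previous CHF discharge. The `chf_stays` data and the helper `readmitted_within` are hypothetical, and the case study may define the index admission differently.

```python
from datetime import date

# Hypothetical CHF hospitalizations per patient as (admit_date, discharge_date),
# sorted by admit date.
chf_stays = {
    "A": [(date(2023, 1, 5), date(2023, 1, 9)), (date(2023, 1, 30), date(2023, 2, 3))],
    "B": [(date(2023, 1, 10), date(2023, 1, 14)), (date(2023, 4, 2), date(2023, 4, 6))],
    "C": [(date(2023, 2, 1), date(2023, 2, 4))],
}

def readmitted_within(stays, days=30):
    """True if any admission starts within `days` after the previous discharge."""
    for (_, discharge), (next_admit, _) in zip(stays, stays[1:]):
        if 0 <= (next_admit - discharge).days <= days:
            return True
    return False

# One target value per patient, matching the one-record-per-patient table.
target = {pid: readmitted_within(stays) for pid, stays in chf_stays.items()}
print(target)
```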

Case Study : Creating new variables

Merge all data into one table:

- one record per patient

- list of variables used in modeling

Case Study : using training sets

- cohort : 2,343 patients

- randomly divided into training and testing sets : 70% / 30% split

- training : 1,640 patients

- testing : 703 patients
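The 70% / 30% split above can be reproduced with a short sketch. The integer patient IDs and the seed value are placeholders; only the cohort size and the split ratio come from the case study.

```python
import random

cohort = list(range(2343))  # stand-in patient IDs for the 2,343-patient cohort

rng = random.Random(42)  # arbitrary fixed seed so the split is reproducible
shuffled = cohort[:]
rng.shuffle(shuffled)

# 2,343 * 0.70 = 1,640.1, truncated to 1,640 training patients.
n_train = int(len(shuffled) * 0.70)
training, testing = shuffled[:n_train], shuffled[n_train:]

print(len(training), len(testing))
```

Shuffling before slicing keeps the assignment random, and slicing one shuffled list guarantees that no patient appears in both sets.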

Practice Quiz : Lesson 1 From Understanding to Preparation

Q. In the case study, during the Data Understanding stage, data scientists discovered that not all the expected congestive heart failure admissions were being captured. What action did they take to resolve the issue?

A. The data scientists looped back to the Data Collection stage, added secondary and tertiary diagnoses, and built a more comprehensive definition of congestive heart failure admission.

Q. Select the correct statements that describe what data scientists do during the Data Preparation stage.

A. During the Data Preparation stage, data scientists define the variables to be used in the model.

During the Data Preparation stage, data scientists determine the timing of events.

Q. Select the correct statement about the Data Preparation stage of the data science methodology.

A. The Data Preparation stage involves handling missing and improperly coded data, and can include using text analysis to structure unstructured or semi-structured text data.