Data Science Methodology - From Understanding to Preparation and From Modeling to Evaluation (1)
Data Science Methodology
From Understanding to Preparation
Data Understanding
Case Study : Understanding the Data
Descriptive Statistics:
- univariate statistics (univariate : involving a single variable)
- pairwise correlations (pairwise : in pairs)
- histograms
Histograms are a good way to understand:
- how values or variables are distributed
- what data preparation might be needed to make the variable more useful in a model
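The three descriptive statistics listed above can be sketched in a few lines of pandas. This is only an illustration: the column names and values are made up and do not come from the case study.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [54, 61, 67, 70, 72, 75, 79, 83],
    "length_of_stay": [2, 3, 3, 4, 5, 6, 8, 12],
})

# Univariate statistics: count, mean, std, min, quartiles, max per variable.
summary = df.describe()

# Pairwise correlations: one Pearson coefficient per pair of variables.
correlations = df.corr()

# Histogram as counts per bin (a plotting library would draw these counts).
age_hist = pd.cut(df["age"], bins=[50, 60, 70, 80, 90]).value_counts().sort_index()

print(summary)
print(correlations)
print(age_hist)
```

A strongly skewed histogram, for example, may suggest transforming or binning the variable before modeling.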
Case Study : Looking at Data Quality
Data Quality:
- missing values
- invalid or misleading values
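A minimal sketch of the two data quality checks above, assuming a small hypothetical patient table. The valid age range (0 to 120) and the allowed sex codes are assumptions chosen for illustration, not rules from the case study.

```python
import pandas as pd

# Hypothetical records with deliberate quality problems.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [67, None, 210, 58, 73],     # None is missing, 210 is implausible
    "sex": ["F", "M", "M", "?", "F"],   # "?" is not a valid code
})

# Missing values: count of missing entries per column.
missing_per_column = df.isna().sum()

# Invalid or misleading values: present, but outside the assumed valid domain.
invalid_age = df["age"].notna() & ~df["age"].between(0, 120)
invalid_sex = ~df["sex"].isin(["F", "M"])

print(missing_per_column)
print(df[invalid_age | invalid_sex])
```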
Case Study : This is an iterative process
- Iterative data collection and understanding : Refined definition of "CHF Admission"
From Understanding to Preparation
- Data understanding : What does it mean to "prepare" or "clean" data?
- Data preparation : What are ways in which data is prepared?
Examples of Data Cleansing
Using Domain Knowledge
- Feature engineering is the process of using domain knowledge of the data to create features that make the machine learning algorithms work
- Feature engineering is critical when machine learning tools are being applied to analyze the data
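As a small illustration of feature engineering, the sketch below derives two features from raw columns. The table, the column names, and the age threshold of 65 are all hypothetical and chosen only to show the idea.

```python
import pandas as pd

# Hypothetical admission records; names and values are illustrative only.
admissions = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "admit_date": pd.to_datetime(["2023-01-02", "2023-01-10", "2023-02-01"]),
    "discharge_date": pd.to_datetime(["2023-01-05", "2023-01-11", "2023-02-09"]),
    "age": [58, 71, 84],
})

# Domain knowledge: length of stay is usually more useful to a model
# than the two raw dates it is derived from.
admissions["length_of_stay_days"] = (
    admissions["discharge_date"] - admissions["admit_date"]
).dt.days

# Assumed threshold: flag patients aged 65 or older.
admissions["age_65_plus"] = admissions["age"] >= 65

print(admissions[["patient_id", "length_of_stay_days", "age_65_plus"]])
```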
Data Preparation - Case Study
Case Study : Defining CHF admission
- Define "CHF admission" and "CHF readmission"
Case Study : Aggregating records
Transactional records:
- Claims : professional provider, facility, pharmaceutical
- Inpatient & outpatient records : diagnoses, procedures, prescriptions and more
- Possibly thousands per patient, depending on clinical history
Case Study : Aggregating to patient level
- Aggregate to patient level
- Roll up to 1 record per patient
Create new columns representing the transaction:
- outpatient visits / inpatient episodes : frequency, recency, diagnoses, length of stay, procedures, prescriptions
- Comorbidities with CHF (comorbidity : a patient having two chronic conditions at the same time)
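The roll-up from transactional records to one record per patient might look like the following sketch. The visit table, its column names, and the reference date used for recency are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactional records: several rows per patient.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "visit_type": ["outpatient", "inpatient", "outpatient",
                   "inpatient", "inpatient", "outpatient"],
    "visit_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-01",
        "2023-01-20", "2023-03-15", "2023-02-02",
    ]),
    "length_of_stay": [0, 4, 0, 6, 2, 0],
})
reference_date = pd.Timestamp("2023-04-01")  # assumed "as of" date

# Roll up to one row per patient with frequency and length-of-stay columns.
per_patient = visits.groupby("patient_id").agg(
    n_visits=("visit_date", "count"),
    n_inpatient=("visit_type", lambda s: int((s == "inpatient").sum())),
    last_visit=("visit_date", "max"),
    total_inpatient_days=("length_of_stay", "sum"),
).reset_index()

# Recency: days between the most recent visit and the reference date.
per_patient["days_since_last_visit"] = (
    reference_date - per_patient["last_visit"]
).dt.days

print(per_patient)
```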
Case Study : More or less data needed?
- Literature review of important factors for CHF readmission
- Loop back to the data collection stage and add additional data, if needed
Case Study : Completing the data set
Merge all data into one table:
- one record per patient
- list of variables used in modeling (target : CHF readmission within 30 days following discharge from CHF hospitalization)
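A hedged sketch of merging per-patient tables into a single modeling table with a 30-day readmission target. The three input tables and all column names are hypothetical. A missing `days_to_readmission` is taken to mean the patient was not readmitted.

```python
import pandas as pd

# Hypothetical per-patient tables (one row per patient in each).
demographics = pd.DataFrame({"patient_id": [1, 2, 3], "age": [67, 58, 73]})
utilization = pd.DataFrame({"patient_id": [1, 2], "n_inpatient": [2, 1]})
outcome = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "days_to_readmission": [12, None, 45],  # None: no readmission observed
})

# Target: readmitted within 30 days of discharge (missing compares as False).
outcome["readmit_30d"] = outcome["days_to_readmission"] <= 30

# Left-join everything onto the patient list; validate guards against
# accidental duplication of patient rows.
table = (
    demographics
    .merge(utilization, on="patient_id", how="left", validate="one_to_one")
    .merge(outcome[["patient_id", "readmit_30d"]],
           on="patient_id", how="left", validate="one_to_one")
)

# Patients without utilization records get a count of zero.
table["n_inpatient"] = table["n_inpatient"].fillna(0).astype(int)

print(table)
```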
Case Study : Creating new variables
Merge all data into one table:
- one record per patient
- list of variables used in modeling
Case Study : Using training sets
- cohort : 2,343 patients
- randomly divided into training and testing sets : 70% / 30% split
- training : 1,640 patients
- testing : 703 patients
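The arithmetic of the 70% / 30% split can be checked with a short sketch. The patient IDs and the random seed are placeholders. Only the cohort size and the split ratio come from the notes above.

```python
import random

n_patients = 2343
patient_ids = list(range(n_patients))  # placeholder IDs, not real data

# Shuffle with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = patient_ids[:]
rng.shuffle(shuffled)

# 70% of 2,343 is 1,640.1, which rounds to 1,640 training patients.
n_train = round(0.70 * n_patients)
train_ids = shuffled[:n_train]
test_ids = shuffled[n_train:]

print(len(train_ids), len(test_ids))
```

Splitting at the patient level, rather than at the record level, keeps all of one patient's information on the same side of the split.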
Practice Quiz : Lesson 1 From Understanding to Preparation
Q. In the case study, during the Data Understanding stage, data scientists discovered that not all the expected congestive heart failure admissions were being captured. What action did they take to resolve the issue?
A. The data scientists looped back to the Data Collection stage, adding secondary and tertiary diagnoses, and built a more comprehensive definition of congestive heart failure admission
Q. Select the correct statements that describe what data scientists do during the Data Preparation stage.
A. During the Data Preparation stage, data scientists define the variables to be used in the model.
During the Data Preparation stage, data scientists determine the timing of events.
Q. Select the correct statement about the Data Preparation stage of the data science methodology.
A. The Data Preparation stage involves handling missing and improperly coded data and can include using text analysis to structure unstructured or semi-structured text data