Summary of Type II Diabetes 30-Day Readmission Risk Prediction Model

Tobi Olatunji, MD, MSHI


Project Summary

As a quality measure, the Readmission Reduction Program administered by the Centers for Medicare and Medicaid Services (CMS) penalizes hospitals for excess readmissions within 30 days of discharge. Hospitals without a readmission reduction strategy could lose millions of dollars in CMS reimbursements on a yearly basis. I built a prediction model using open source Cerner data representing 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, including over 101,000 patient encounters with over 50 features, that correctly assigned 72.5% of patients to 2 groups (Readmitted, Not Readmitted) based on their risk of 30-day readmission. This model can help identify high-risk patients for special Transition of Care programs or better discharge planning that reduce their risk of returning within 30 days.


Readmission means being hospitalized again after discharge. It matters for two reasons: quality and cost of health care. A high readmission rate suggests relatively low quality of care and carries negative social impacts, so readmission is increasingly treated as an indicator for evaluating the overall health care environment.


The US Medicare Payment Advisory Commission stated that $12 billion per year is spent on preventable readmissions, and one study estimated the cost of readmission of Medicare patients at $17.4 billion per year. When the medical costs of admitted patients are analyzed, high-cost patients, who comprise 15% of all admitted patients, account for 55.3% of total medical costs, while readmitted patients account for the highest portion, 42.0%.


Project Description

The dataset used for this project represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 100,000 patient encounters with over 50 features representing patient and hospital outcomes. The project ultimately attempts to predict which patients are susceptible to readmission within 30 days, because of the impact of this measure on CMS reimbursements.


The objective is to build a model that predicts/identifies patients at risk for readmission (<30 days, >30 days).

Github Repo Link (here)

Approach/Modeling Process

  1. Read in data using pandas (Python).
  2. Preprocess data- clean and merge data, replace or remove missing values
  3. Data exploration- univariate, pairwise and 3D plots (histograms, boxplots, violin plots, density distributions, scatterplots), correlations (Spearman’s, Pearson’s) and summaries (mean, median, max, quartiles) to better understand the features; identify outliers, variable distributions, trends, patterns and class imbalance; and test assumptions (parametric versus nonparametric)
  4. Feature extraction- combine or transform some features (for example, categorize ICD 9/10 codes), and convert some categorical features to interval.
  5. Normalize: Scale quantitative features
  6. Feature Selection using a combination of the following approaches:
    1. Clinical domain knowledge- identify features that are known to contribute to target outcome (for example, age, serum glucose levels, HbA1c)
    2. Multiple Regression (Linear or Logistic)- Check for variable significance, coefficients, AIC, deviance and goodness of fit
    3. Step Function in R- Algorithm removes feature that contributes least to model in stepwise process using AIC as scoring function
    4. Random Forest feature importance- tree based algorithm that ranks feature importance
    5. Recursive Feature Elimination algorithm
    6. Univariate feature selection algorithms- SelectKBest, SelectPercentile, etc.
    7. Principal component analysis with varimax rotation- identifies the features each Principal component aligns with the most
  7. Split data into training and testing set using any of the following approaches:
    1. Usually a random 66.7/33.3, 75/25, 80/20 split depending on sample size,
    2. Split training set into training and validation set or
    3. Use cross validation on training set.
  8. Feed selected features (training set) into a number of algorithms, tuning the parameters (using a grid search or otherwise) to get highest accuracy, AUC, F1 score, sensitivity and specificity, precision and recall on the unseen data (test set)
    1. Logistic Regression
    2. Random Forest
    3. Support Vector Machines
    4. Neural Networks
    5. Naïve Bayes
    6. K-Nearest Neighbors
  9. Visualize evaluations using ROC curves
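
Steps 5 through 9 above can be sketched roughly as follows. This is only an illustration, not the code from the project: it uses synthetic data from scikit-learn's `make_classification` in place of the Cerner dataset, picks SelectKBest and logistic regression as one example from each list, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix (NOT the Cerner data).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.6, 0.4], random_state=0,
)

# Step 7: hold out 25% as an unseen test set, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 5, 6 and 8 chained in one pipeline so that scaling and feature
# selection are re-fit on each cross-validation training fold only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; the values are illustrative, not tuned for real data.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# Step 9 (numeric form): area under the ROC curve on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("best params:", search.best_params_)
print("test AUC: %.3f" % test_auc)
```

Swapping `LogisticRegression` for another estimator from step 8 only requires changing the `"model"` step and its grid entries.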



Challenges

  1. Computing power- running cross validation with machine learning algorithms like Support Vector Machines and Random Forests on large datasets on a standalone personal computer is very resource intensive. The SVM took about 12 hours to run without cross validation. I attempted to overcome this by experimenting with AWS clusters/instances.
  2. Multiple encounters for the same patient- the initial assumption was that there were over 100,000 unique patient records. Further analysis revealed that there were only about 70,000 unique patients.
  3. Class imbalance- “No readmission” accounted for almost 60% of patients, significantly skewing the models/predictions until measures were put in place to correct this.
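
As a minimal sketch of how the second and third points might be handled, the snippet below keeps one encounter per patient and derives balanced class weights. The post does not say which corrective measures were actually used, so this is just one common option. The toy rows, the column names `encounter_id`, `patient_nbr` and `readmitted`, and the idea that a lower `encounter_id` means an earlier visit are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy encounter table (made-up rows; column names are assumed).
encounters = pd.DataFrame({
    "encounter_id": [101, 102, 103, 104, 105, 106],
    "patient_nbr":  [1, 1, 2, 3, 3, 4],
    "readmitted":   [0, 1, 0, 0, 0, 1],
})

# Keep one row per patient so repeat visits do not leak between splits.
# Assumption: a lower encounter_id corresponds to an earlier encounter.
first_encounters = (
    encounters.sort_values("encounter_id")
    .drop_duplicates(subset="patient_nbr", keep="first")
    .reset_index(drop=True)
)

# "Balanced" weights are n_samples / (n_classes * class_count), so the
# minority class receives a proportionally larger weight.
labels = first_encounters["readmitted"].to_numpy()
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(len(encounters), "encounters ->", len(first_encounters), "patients")
print("class weights:", class_weight)
```

The resulting dictionary can be passed as the `class_weight` argument of scikit-learn estimators that accept it, such as `LogisticRegression`, `SVC` and `RandomForestClassifier`.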


Predictor Variables

  1. num_lab_procedures : Number of lab procedures
  2. num_medications: Number of medications
  3. diag2: Secondary diagnosis
  4. diag1: Primary diagnosis
  5. time_in_hospital: length of stay in days
  6. Age
  7. num_procedures: Number of non-lab procedures
  8. number_inpatient: number of in-patient visits in the year leading up to the index encounter
  9. number_diagnoses: number of diagnoses
  10. insulin: Presence of insulin treatment
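
Since diag1 and diag2 hold raw diagnosis codes, they need to be grouped before modeling. The helper below is a hedged sketch of one way to do that using standard ICD-9 chapter ranges. The function name `icd9_category`, the choice of groups, and the treatment of `"?"` as a missing-value marker are assumptions, not the project's actual mapping.

```python
def icd9_category(code):
    """Map a raw ICD-9 code to a coarse disease group (hypothetical helper)."""
    if code is None:
        return "Missing"
    text = str(code).strip()
    # Assumption: "?" marks a missing diagnosis in the raw data.
    if text in ("", "?"):
        return "Missing"
    # Supplementary V and E codes are not numeric chapters.
    if text[0] in ("V", "E"):
        return "Other"
    try:
        number = int(float(text))  # the chapter depends on the integer part
    except (ValueError, OverflowError):
        return "Other"
    if number == 250:
        return "Diabetes"
    if 140 <= number <= 239:
        return "Neoplasms"
    if 390 <= number <= 459:
        return "Circulatory"
    if 460 <= number <= 519:
        return "Respiratory"
    if 800 <= number <= 999:
        return "Injury"
    return "Other"


for raw in ["428", "250.83", "162", "486", "820", "V57", "?"]:
    print(raw, "->", icd9_category(raw))
```

With pandas, the same function could be applied column-wise, for example `df["diag1"].map(icd9_category)`, where `df` is a hypothetical DataFrame holding the encounters.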


Main Insights

Although the modeling process is still ongoing (to attempt to further improve accuracy), the following initial insights were considered worthy of note:

  1. HbA1c was not significant in this model- this was probably because about 84% of the patients had no recent HbA1c results available.
  2. Among all other disease categories, diagnoses of Circulatory and Neoplastic diseases were the most significant predictors of readmission in these diabetic patients.
  3. Secondary diagnoses of respiratory disease and traumatic injury were associated with the longest lengths of stay.
  4. Gender and race were not significant predictors of readmission, although the dominant population was Caucasian. It would be interesting to study whether the significant predictors differ among race and gender populations.
  5. Readmission rates were lower in those discharged home (7.2% vs 12.8%).
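
Figures like the HbA1c coverage in point 1 and the rate comparison in point 5 come from simple grouped summaries. Here is a small sketch with pandas on made-up rows. The numbers it prints are from the toy table, not the project's results, and the column names `discharge_group`, `readmitted_30d` and `A1Cresult` (with `"None"` meaning no test on record) are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the project's data.
encounters = pd.DataFrame({
    "discharge_group": ["home", "home", "home", "home",
                        "other", "other", "other", "other"],
    "readmitted_30d":  [0, 0, 0, 1, 0, 1, 1, 0],
    "A1Cresult":       ["None", "None", ">8", "None",
                        "None", "Norm", "None", "None"],
})

# Mean of a 0/1 flag per group is that group's readmission rate.
rate_by_group = encounters.groupby("discharge_group")["readmitted_30d"].mean()

# Share of encounters with no HbA1c result on record.
share_without_a1c = (encounters["A1Cresult"] == "None").mean()

print(rate_by_group)
print("share without HbA1c result: %.2f" % share_without_a1c)
```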



Lessons Learned

  1. A lot of time was lost modeling the data and getting skewed results before the class imbalance problem was identified.
  2. Initial models could have been built on smaller samples of the data rather than running resource-intensive algorithms on a local machine with limited processing power.
  3. At some point the accuracy seemed stuck at about 55%, then at about 62%, which made the feature selection process seem very unproductive.
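
The second point can be illustrated with a stratified subsample, which keeps the class ratio intact while cutting the row count. This is a sketch on synthetic data, not the project's code, and the 10% sample size is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the full dataset (NOT the Cerner data).
X, y = make_classification(
    n_samples=10000, n_features=20, weights=[0.6, 0.4], random_state=0
)

# Draw a 10% sample that preserves the class ratio of the full data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

# A slow algorithm such as an SVM fits quickly on the reduced sample,
# which is enough for a first look before scaling up.
quick_model = SVC(kernel="rbf").fit(X_small, y_small)

print("full rows:", len(X), "sample rows:", len(X_small))
print("positive share, full vs sample: %.3f vs %.3f" % (y.mean(), y_small.mean()))
```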