DATS 6202-010 MACHINE LEARNING I

FINAL PROJECT

Detecting the Presence of Diabetes Mellitus

Sabina Azim, Arushi Kapoor, Natasha Vij

Introduction:

The topic for this project was obtained from the annual Women in Data Science Datathon organized on Kaggle. ​The WiDS Datathon 2021 is a collaboration led by the WiDS Worldwide team at Stanford University, the West Big Data Innovation Hub, and the WiDS Datathon Committee.

The 2021 WiDS Datathon focused on “patient health, with an emphasis on the chronic condition of diabetes, through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative.”

The COVID-19 pandemic has forced the healthcare industry to get a rapid understanding of a patient’s overall health as hospitals around the world struggle with overloading of patients in critical condition. Therefore, knowledge about chronic conditions such as diabetes mellitus assist healthcare workers in making clinical decisions about patient care.

The purpose of this challenge is to “determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus.” The data has been collected from the first 24 hours of intensive care and labeled training data has been used for model development. The testing data was also provided by the organizing committee on Kaggle.

Citation: “WiDS Datathon 2021.” Kaggle, www.kaggle.com/c/widsdatathon2021.

For more information on our Kaggle Competition entry, please click here.

The code for this project was inspired by the Breast Cancer Wisconsin Case study completed by Professor Yuxiao here.

Importing Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import csv
import warnings
warnings.filterwarnings('ignore')
import warnings
warnings.filterwarnings("ignore")

Setting Display Options

In [2]:
pd.set_option("display.float.format", lambda x: "%.2f" % x)

Importing Tensorflow

In [3]:
# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x 
import tensorflow as tf
from tensorflow import keras

Setting the Random Seed

In [4]:
# The random seed
random_seed = 100

# Set random seed in tensorflow
tf.random.set_seed(random_seed)

# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)

Mounting drive/Setting Directory

In [5]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [6]:
abspath = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project/'

Setting the working directory to the absolute path and importing Professor's shallow utilities file so it can be used through this project.

In [7]:
# Change working directory to the absolute path
%cd $abspath

# Import the shallow utitilities
%run pmlm_utilities_shallow.ipynb
/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project

Experiment:

It was important to determine which model would best suit our dataset before starting any type of data preprocessing. While exploring this dataset, it was clear that the target variable for this project was diabetes_mellitus, with values either being 0 (meaning a patient does not have diabetes) or 1 (meaning a patient does have diabetes). With there being a set finite number of values within the target variable, it was clear that this problem would best be solved using a classification model. Specifically, since this model only had two target classes with a sample belonging to only one class, the best model for the purposes of this project was determined to be a binary classification logistic regression model.

This project was initially completed as a Kaggle competition where the test data was provided. The code that was run for the purposes of the Kaggle competition produced an accuracy score of .74. As a comparison, the provided Kaggle test data was omitted and new test data was extracted from the training dataset to observe how the two accuracy scores compared for the purposes of this final assignment.

With the model selected, the first step in preparing the dataset was to complete a thorough data preprocessing. There were ten steps included for the purposes of this project’s data preprocessing:

  1. Loading the data
  2. Splitting the data
  3. Handling uncommon features
  4. Handling identifiers
  5. Handling date time variables
  6. Handling missing data
  7. Encoding the data
  8. Splitting the feature and target
  9. Scaling the data
  10. Handling class imbalance

In keeping with a fair train/validation/test split, the training data was split 60 (training) : 40 (validation/testing). The validation and testing was split further into a 50:50 split. Therefore, the full training dataset was split as followed:

  • Training - 60%
  • Validation - 20%
  • Testing - 20%

For this project, the following files are pre-loaded from Professor Yuxiao Huang's Github repository:

Data Preprocessing

1. Loading the Data

Loading the training data and making a copy of the raw training data. Also dropping the unnamed column from the raw data.

In [8]:
# Load the raw training data
df_raw = pd.read_csv(abspath + 'TrainingWiDS2021.csv', header=0)

# Remove the unnamed column
df_raw = df_raw.drop(columns='Unnamed: 0')

# Make a copy of df_raw
df = df_raw.copy(deep=True)

Setting the target variable, which in the case of this project would be Diabetes Mellitus since we are trying to find out if individuals will have Diabetes Mellitus or not.

In [9]:
# Get the name of the target
target = 'diabetes_mellitus'

Getting the dimensions of the training data.

In [10]:
pd.DataFrame([[df.shape[0], df.shape[1]]], columns=['# rows', '# columns'])
Out[10]:
# rows # columns
0 130157 180

Previewing the first 5 rows of the training data.

In [11]:
df.head()
Out[11]:
encounter_id hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 214826 118 68.00 22.73 0 Caucasian M 180.30 Floor Floor 92 admit CTICU 0.54 0 73.90 2.30 113.00 502.01 0 0 0.40 31.00 2.51 nan 3.00 6.00 0.00 4.00 168.00 118.00 27.40 0 40.00 nan nan nan nan 36.00 134.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
1 246060 81 77.00 27.42 0 Caucasian F 160.00 Floor Floor 90 admit Med-Surg ICU 0.93 0 70.20 nan 108.00 203.01 0 0 nan 9.00 0.56 1.00 1.00 3.00 0.00 1.00 145.00 120.00 36.90 0 46.00 37.00 37.00 51.00 7.45 33.00 145.00 ... 11.30 11.30 36.90 36.90 1.30 1.30 3.50 3.50 557.00 557.00 4.20 4.20 145.00 145.00 12.70 12.70 37.00 37.00 7.45 7.45 51.00 51.00 54.80 51.00 37.00 37.00 7.45 7.45 51.00 51.00 51.00 51.00 0 0 0 0 0 0 0 1
2 276985 118 25.00 31.95 0 Caucasian F 172.70 Emergency Department Accident & Emergency 93 admit Med-Surg ICU 0.00 0 95.30 nan 122.00 703.03 0 0 nan nan nan nan 3.00 6.00 0.00 5.00 nan 102.00 nan 0 68.00 nan nan nan nan 37.00 nan ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 262220 118 81.00 22.64 1 Caucasian F 165.10 Operating Room Operating Room / Recovery 92 admit CTICU 0.00 0 61.70 nan 203.00 1206.03 1 0 nan nan nan 0.60 4.00 6.00 0.00 5.00 185.00 114.00 25.90 1 60.00 30.00 30.00 142.00 7.39 4.00 nan ... 11.60 11.60 34.00 34.00 1.60 1.10 nan nan 43.00 43.00 nan nan nan nan 8.80 8.80 37.00 27.00 7.44 7.34 337.00 102.00 342.50 236.67 36.00 33.00 7.37 7.34 337.00 265.00 337.00 337.00 0 0 0 0 0 0 0 0
4 201746 33 19.00 nan 0 Caucasian M 188.00 NaN Accident & Emergency 91 admit Med-Surg ICU 0.07 0 nan nan 119.00 601.01 0 0 nan nan nan nan nan nan nan nan nan 60.00 nan 0 103.00 nan nan nan nan 16.00 nan ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 180 columns

2. Splitting the Data

Using train_test_split from sklearn, the data is being split into 60% training data and 40% testing data.

The testing data is further split into 50% validation and 50% testing data.

In [12]:
from sklearn.model_selection import train_test_split

# Divide the data into training (60%) and test (40%)
df_train, df_test = train_test_split(df, 
                                     train_size=0.6, 
                                     random_state=random_seed, 
                                     stratify=df[target])

# Divide the test data into validation (50%) and test (50%)
df_val, df_test = train_test_split(df_test, 
                                   train_size=0.5, 
                                   random_state=random_seed, 
                                   stratify=df_test[target])

# Reset the index
df_train, df_val, df_test = df_train.reset_index(drop=True), df_val.reset_index(drop=True), df_test.reset_index(drop=True)

Getting the dimensions of df_train after the split from above.

In [13]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
Out[13]:
# rows # columns
0 78094 180

Getting the dimensions of df_val after the split from above.

In [14]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
Out[14]:
# rows # columns
0 26031 180
In [15]:
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
Out[15]:
# rows # columns
0 26032 180

3.Handling Uncommon Features

Using the common_var_checker to print the common variables between the df_train, df_val, df_test, and the target.

In [16]:
# Call common_var_checker
# See the implementation in pmlm_utilities.ipynb
df_common_var = common_var_checker(df_train, df_val, df_test, target)

# Print df_common_var
df_common_var
Out[16]:
common var
0 age
1 aids
2 albumin_apache
3 apache_2_diagnosis
4 apache_3j_diagnosis
... ...
175 temp_apache
176 urineoutput_apache
177 ventilated_apache
178 wbc_apache
179 weight

180 rows × 1 columns

Getting the uncommon features

Getting the features in the training data but not in the validation or test data.

In [17]:
uncommon_feature_train_not_val_test = np.setdiff1d(df_train.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_train_not_val_test, columns=['uncommon feature'])
Out[17]:
uncommon feature

Getting the features in the validation data but not in the training or test data.

In [18]:
uncommon_feature_val_not_train_test = np.setdiff1d(df_val.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_val_not_train_test, columns=['uncommon feature'])
Out[18]:
uncommon feature

Getting the features in the test data but not in the training or validation data.

In [19]:
uncommon_feature_test_not_train_val = np.setdiff1d(df_test.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_test_not_train_val, columns=['uncommon feature'])
Out[19]:
uncommon feature

Dropping the uncommon features

Using the drop functoin to remove the uncommon features from the training data, then previewing the df_train.

In [20]:
# Remove the uncommon features from the training data
df_train = df_train.drop(columns=uncommon_feature_train_not_val_test)

# Print the first 5 rows of df_train
df_train.head()
Out[20]:
encounter_id hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 173141 39 54.00 15.44 0 Caucasian F 162.60 Emergency Department Accident & Emergency 616 admit Neuro ICU 0.07 0 40.82 nan 119.00 601.01 0 0 nan 8.00 0.59 nan 4.00 6.00 0.00 5.00 137.00 90.00 33.40 0 50.00 nan nan nan nan 33.00 142.00 ... nan nan nan nan 0.90 0.90 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 169925 204 87.00 23.73 1 Caucasian M 184.20 Direct Admit Operating Room / Recovery 431 admit CSICU 0.18 0 80.50 nan 203.00 1206.03 1 0 nan 15.00 0.74 0.40 4.00 6.00 0.00 5.00 85.00 55.00 30.50 1 51.00 37.00 37.00 191.00 7.44 51.00 134.00 ... 10.30 10.30 30.50 30.50 1.40 1.40 nan nan 138.00 138.00 4.30 4.30 134.00 134.00 3.80 3.80 43.00 37.00 7.44 7.39 191.00 180.00 477.50 477.50 37.00 37.00 7.44 7.44 191.00 191.00 477.50 477.50 0 0 0 0 0 0 0 0
2 250908 109 21.00 30.00 0 Other/Unknown F 157.48 Emergency Department Accident & Emergency 429 admit Cardiac ICU 0.31 0 74.41 nan 305.00 901.03 0 1 nan 37.00 7.36 nan 4.00 6.00 0.00 5.00 84.00 118.00 24.70 0 146.00 nan nan nan nan 34.00 139.00 ... nan nan nan nan 1.20 1.20 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 155276 160 49.00 32.02 0 Caucasian M 172.70 Emergency Department Accident & Emergency 470 admit Med-Surg ICU 0.28 0 95.50 3.30 113.00 501.02 0 0 0.30 31.00 2.20 nan 4.00 6.00 0.00 5.00 139.00 119.00 34.80 0 62.00 nan nan nan nan 10.00 138.00 ... nan nan nan nan nan nan nan nan nan nan 3.30 3.30 134.00 134.00 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
4 247104 64 nan 29.87 0 African American F 167.60 NaN Floor 683 admit Med-Surg ICU 3.11 0 83.90 3.10 302.00 109.12 0 0 1.20 16.00 1.13 nan 3.00 6.00 0.00 5.00 106.00 96.00 41.00 0 59.00 nan nan nan nan 43.00 135.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 180 columns

Using the drop functoin to remove the uncommon features from the validation data, then previewing the df_val.

In [21]:
# Remove the uncommon features from the validation data
df_val = df_val.drop(columns=uncommon_feature_val_not_train_test)

# Print the first 5 rows of df_val
df_val.head()
Out[21]:
encounter_id hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 148622 86 67.00 37.56 0 Caucasian M 168.00 NaN Floor 1035 admit CCU-CTICU 0.01 0 106.00 nan 114.00 102.01 0 0 nan 20.00 0.73 nan 4.00 6.00 0.00 5.00 80.00 53.00 36.00 0 111.00 nan nan nan nan 27.00 137.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 203145 189 60.00 24.95 0 Caucasian F 165.10 Direct Admit Accident & Emergency 543 admit Med-Surg ICU 0.30 0 68.00 2.20 302.00 109.16 0 0 0.30 7.40 0.50 nan 4.00 6.00 0.00 5.00 79.00 64.00 24.70 0 45.00 nan nan nan nan 10.00 136.00 ... 8.80 8.80 24.80 24.80 1.50 1.40 nan nan 171.00 171.00 nan nan nan nan 10.14 10.14 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
2 227137 13 50.00 25.81 0 Caucasian M 177.80 NaN Floor 708 admit Med-Surg ICU 0.37 0 81.60 2.90 113.00 501.01 0 0 1.00 35.00 2.68 1.00 3.00 4.00 0.00 1.00 103.00 60.00 28.00 1 53.00 32.70 32.70 164.00 7.33 16.00 128.00 ... 11.50 11.40 34.50 34.30 1.20 1.20 nan nan 142.00 136.00 4.90 4.30 129.00 128.00 27.50 26.40 34.70 30.20 7.35 7.32 164.00 95.00 164.00 107.00 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 273697 21 58.00 34.58 0 Caucasian M 180.30 Other Hospital Other Hospital 512 admit CCU-CTICU 0.01 0 112.40 nan 112.00 107.01 0 0 nan 10.00 0.94 nan 4.00 6.00 0.00 5.00 137.00 58.00 35.30 0 52.00 nan nan nan nan 28.00 140.00 ... nan nan nan nan 1.18 1.18 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
4 252290 60 21.00 19.53 0 Other/Unknown F 152.40 Emergency Department Accident & Emergency 538 admit Med-Surg ICU 0.09 0 45.36 nan 123.00 702.01 0 0 nan 15.00 0.80 0.21 4.00 6.00 0.00 5.00 242.00 122.00 nan 0 72.00 24.70 24.70 110.00 7.27 13.00 152.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 25.90 24.70 7.31 7.27 112.00 110.00 533.33 523.81 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 180 columns

Using the drop functoin to remove the uncommon features from the test data, then previewing the df_test.

In [22]:
# Remove the uncommon features from the test data
df_test = df_test.drop(columns=uncommon_feature_test_not_train_val)

# Print the first 5 rows of df_test
df_test.head()
Out[22]:
encounter_id hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 254797 79 79.00 29.12 1 Other/Unknown M 167.60 Emergency Department Operating Room / Recovery 337 transfer Med-Surg ICU 0.43 0 81.80 1.30 308.00 1904.01 1 0 2.00 57.00 2.42 0.55 2.00 5.00 0.00 1.00 161.00 128.00 17.20 1 40.00 41.00 41.00 235.00 7.31 9.00 151.00 ... 5.80 5.80 17.20 17.20 nan nan 1.00 1.00 199.00 199.00 5.20 5.20 150.00 150.00 9.80 9.80 41.00 30.00 7.52 7.31 376.00 228.00 427.27 376.00 30.00 30.00 7.52 7.52 376.00 376.00 376.00 376.00 0 0 0 0 0 0 0 0
1 189332 118 63.00 29.34 1 Caucasian M 172.70 Operating Room Operating Room / Recovery 89 admit Neuro ICU 0.00 0 87.50 nan 218.00 1505.02 1 0 nan 16.00 0.70 nan 4.00 6.00 0.00 5.00 148.00 60.00 35.10 0 198.00 nan nan nan nan 4.00 140.00 ... nan nan nan nan 1.00 1.00 nan nan nan nan nan nan nan nan nan nan 44.00 44.00 7.39 7.39 115.00 115.00 287.50 287.50 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
2 265950 173 72.00 21.64 0 Caucasian M 175.30 NaN Other Hospital 962 admit Med-Surg ICU 0.00 0 66.50 1.90 119.00 601.05 0 0 1.30 9.00 0.65 nan 1.00 5.00 0.00 1.00 162.00 130.00 28.80 0 113.00 nan nan nan nan 43.00 138.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 189173 110 29.00 20.80 0 Caucasian F 165.10 Emergency Department Accident & Emergency 969 admit CCU-CTICU 0.02 0 56.70 nan 122.00 703.03 0 0 nan 6.00 0.58 0.25 2.00 3.00 0.00 1.00 92.00 106.00 nan 0 53.00 34.20 34.20 138.00 7.41 27.00 141.00 ... nan nan nan nan nan nan 0.80 0.80 nan nan 4.30 3.60 141.00 141.00 nan nan 34.20 34.00 7.41 7.41 141.00 138.00 552.00 470.00 34.00 34.00 7.41 7.41 141.00 141.00 470.00 470.00 0 0 0 0 0 0 0 0
4 261367 118 84.00 20.28 0 Asian M 157.50 Floor Floor 97 admit MICU 2.01 0 50.30 nan 301.00 410.01 0 0 nan 22.00 1.44 nan 4.00 5.00 0.00 2.00 154.00 127.00 21.90 0 68.00 nan nan nan nan 47.00 146.00 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 1 1 0 0 0 0

5 rows × 180 columns

4. Handling Identifiers

Combining the dataframes

Combining df_train, df_val, and df_test using a concat function before dropping the identifiers.

In [23]:
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

Getting the identifiers

Using the id_checker to get the id column from the combined dataframe we made above.

In [24]:
# Call id_checker on df
# See the implementation in pmlm_utilities.ipynb
df_id = id_checker(df)

# Print the first 5 rows of df_id
df_id.head()
Out[24]:
encounter_id
0 173141
1 169925
2 250908
3 155276
4 247104

Removing the identifiers

Use the drop function to remove the identifiers from the df_train, df_val, and df_test.

In [25]:
import numpy as np

# Remove identifiers from df_train
df_train.drop(columns=np.intersect1d(df_id.columns, df_train.columns), inplace=True)

# Remove identifiers from df_val
df_val.drop(columns=np.intersect1d(df_id.columns, df_val.columns), inplace=True)

# Remove identifiers from df_test
df_test.drop(columns=np.intersect1d(df_id.columns, df_test.columns), inplace=True)

Previewing df_train, df_val, and df_test after dropping identifiers

In [26]:
# Print the first 5 rows of df_train
df_train.head()
Out[26]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 39 54.00 15.44 0 Caucasian F 162.60 Emergency Department Accident & Emergency 616 admit Neuro ICU 0.07 0 40.82 nan 119.00 601.01 0 0 nan 8.00 0.59 nan 4.00 6.00 0.00 5.00 137.00 90.00 33.40 0 50.00 nan nan nan nan 33.00 142.00 36.70 ... nan nan nan nan 0.90 0.90 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 204 87.00 23.73 1 Caucasian M 184.20 Direct Admit Operating Room / Recovery 431 admit CSICU 0.18 0 80.50 nan 203.00 1206.03 1 0 nan 15.00 0.74 0.40 4.00 6.00 0.00 5.00 85.00 55.00 30.50 1 51.00 37.00 37.00 191.00 7.44 51.00 134.00 35.40 ... 10.30 10.30 30.50 30.50 1.40 1.40 nan nan 138.00 138.00 4.30 4.30 134.00 134.00 3.80 3.80 43.00 37.00 7.44 7.39 191.00 180.00 477.50 477.50 37.00 37.00 7.44 7.44 191.00 191.00 477.50 477.50 0 0 0 0 0 0 0 0
2 109 21.00 30.00 0 Other/Unknown F 157.48 Emergency Department Accident & Emergency 429 admit Cardiac ICU 0.31 0 74.41 nan 305.00 901.03 0 1 nan 37.00 7.36 nan 4.00 6.00 0.00 5.00 84.00 118.00 24.70 0 146.00 nan nan nan nan 34.00 139.00 36.60 ... nan nan nan nan 1.20 1.20 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 160 49.00 32.02 0 Caucasian M 172.70 Emergency Department Accident & Emergency 470 admit Med-Surg ICU 0.28 0 95.50 3.30 113.00 501.02 0 0 0.30 31.00 2.20 nan 4.00 6.00 0.00 5.00 139.00 119.00 34.80 0 62.00 nan nan nan nan 10.00 138.00 37.10 ... nan nan nan nan nan nan nan nan nan nan 3.30 3.30 134.00 134.00 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
4 64 nan 29.87 0 African American F 167.60 NaN Floor 683 admit Med-Surg ICU 3.11 0 83.90 3.10 302.00 109.12 0 0 1.20 16.00 1.13 nan 3.00 6.00 0.00 5.00 106.00 96.00 41.00 0 59.00 nan nan nan nan 43.00 135.00 36.60 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 179 columns

In [27]:
# Print the first 5 rows of df_val
df_val.head()
Out[27]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 86 67.00 37.56 0 Caucasian M 168.00 NaN Floor 1035 admit CCU-CTICU 0.01 0 106.00 nan 114.00 102.01 0 0 nan 20.00 0.73 nan 4.00 6.00 0.00 5.00 80.00 53.00 36.00 0 111.00 nan nan nan nan 27.00 137.00 36.40 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 189 60.00 24.95 0 Caucasian F 165.10 Direct Admit Accident & Emergency 543 admit Med-Surg ICU 0.30 0 68.00 2.20 302.00 109.16 0 0 0.30 7.40 0.50 nan 4.00 6.00 0.00 5.00 79.00 64.00 24.70 0 45.00 nan nan nan nan 10.00 136.00 35.80 ... 8.80 8.80 24.80 24.80 1.50 1.40 nan nan 171.00 171.00 nan nan nan nan 10.14 10.14 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
2 13 50.00 25.81 0 Caucasian M 177.80 NaN Floor 708 admit Med-Surg ICU 0.37 0 81.60 2.90 113.00 501.01 0 0 1.00 35.00 2.68 1.00 3.00 4.00 0.00 1.00 103.00 60.00 28.00 1 53.00 32.70 32.70 164.00 7.33 16.00 128.00 36.30 ... 11.50 11.40 34.50 34.30 1.20 1.20 nan nan 142.00 136.00 4.90 4.30 129.00 128.00 27.50 26.40 34.70 30.20 7.35 7.32 164.00 95.00 164.00 107.00 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 21 58.00 34.58 0 Caucasian M 180.30 Other Hospital Other Hospital 512 admit CCU-CTICU 0.01 0 112.40 nan 112.00 107.01 0 0 nan 10.00 0.94 nan 4.00 6.00 0.00 5.00 137.00 58.00 35.30 0 52.00 nan nan nan nan 28.00 140.00 36.60 ... nan nan nan nan 1.18 1.18 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
4 60 21.00 19.53 0 Other/Unknown F 152.40 Emergency Department Accident & Emergency 538 admit Med-Surg ICU 0.09 0 45.36 nan 123.00 702.01 0 0 nan 15.00 0.80 0.21 4.00 6.00 0.00 5.00 242.00 122.00 nan 0 72.00 24.70 24.70 110.00 7.27 13.00 152.00 36.40 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 25.90 24.70 7.31 7.27 112.00 110.00 533.33 523.81 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 179 columns

In [28]:
# Print the first 5 rows of df_test
df_test.head()
Out[28]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 79 79.00 29.12 1 Other/Unknown M 167.60 Emergency Department Operating Room / Recovery 337 transfer Med-Surg ICU 0.43 0 81.80 1.30 308.00 1904.01 1 0 2.00 57.00 2.42 0.55 2.00 5.00 0.00 1.00 161.00 128.00 17.20 1 40.00 41.00 41.00 235.00 7.31 9.00 151.00 36.00 ... 5.80 5.80 17.20 17.20 nan nan 1.00 1.00 199.00 199.00 5.20 5.20 150.00 150.00 9.80 9.80 41.00 30.00 7.52 7.31 376.00 228.00 427.27 376.00 30.00 30.00 7.52 7.52 376.00 376.00 376.00 376.00 0 0 0 0 0 0 0 0
1 118 63.00 29.34 1 Caucasian M 172.70 Operating Room Operating Room / Recovery 89 admit Neuro ICU 0.00 0 87.50 nan 218.00 1505.02 1 0 nan 16.00 0.70 nan 4.00 6.00 0.00 5.00 148.00 60.00 35.10 0 198.00 nan nan nan nan 4.00 140.00 36.30 ... nan nan nan nan 1.00 1.00 nan nan nan nan nan nan nan nan nan nan 44.00 44.00 7.39 7.39 115.00 115.00 287.50 287.50 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
2 173 72.00 21.64 0 Caucasian M 175.30 NaN Other Hospital 962 admit Med-Surg ICU 0.00 0 66.50 1.90 119.00 601.05 0 0 1.30 9.00 0.65 nan 1.00 5.00 0.00 1.00 162.00 130.00 28.80 0 113.00 nan nan nan nan 43.00 138.00 37.20 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 110 29.00 20.80 0 Caucasian F 165.10 Emergency Department Accident & Emergency 969 admit CCU-CTICU 0.02 0 56.70 nan 122.00 703.03 0 0 nan 6.00 0.58 0.25 2.00 3.00 0.00 1.00 92.00 106.00 nan 0 53.00 34.20 34.20 138.00 7.41 27.00 141.00 35.60 ... nan nan nan nan nan nan 0.80 0.80 nan nan 4.30 3.60 141.00 141.00 nan nan 34.20 34.00 7.41 7.41 141.00 138.00 552.00 470.00 34.00 34.00 7.41 7.41 141.00 141.00 470.00 470.00 0 0 0 0 0 0 0 0
4 118 84.00 20.28 0 Asian M 157.50 Floor Floor 97 admit MICU 2.01 0 50.30 nan 301.00 410.01 0 0 nan 22.00 1.44 nan 4.00 5.00 0.00 2.00 154.00 127.00 21.90 0 68.00 nan nan nan nan 47.00 146.00 35.60 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 1 1 0 0 0 0

5 rows × 179 columns

5. Handling Date/Time Variables

Setting the date time variables from the data. In this case, there are no date time variables, hence the brackets remain empty.

In [29]:
# Get the date time variables
datetime_vars = []

Calling the datetime_transfer on df_train, df_val, and df_test.

In [30]:
# Call datetime_transformer on df_train
# See the implementation in pmlm_utilities.ipynb
df_train = datetime_transformer(df_train, datetime_vars)

# Print the first 5 rows of df_train
df_train.head()
Out[30]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 39 54.00 15.44 0 Caucasian F 162.60 Emergency Department Accident & Emergency 616 admit Neuro ICU 0.07 0 40.82 nan 119.00 601.01 0 0 nan 8.00 0.59 nan 4.00 6.00 0.00 5.00 137.00 90.00 33.40 0 50.00 nan nan nan nan 33.00 142.00 36.70 ... nan nan nan nan 0.90 0.90 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 204 87.00 23.73 1 Caucasian M 184.20 Direct Admit Operating Room / Recovery 431 admit CSICU 0.18 0 80.50 nan 203.00 1206.03 1 0 nan 15.00 0.74 0.40 4.00 6.00 0.00 5.00 85.00 55.00 30.50 1 51.00 37.00 37.00 191.00 7.44 51.00 134.00 35.40 ... 10.30 10.30 30.50 30.50 1.40 1.40 nan nan 138.00 138.00 4.30 4.30 134.00 134.00 3.80 3.80 43.00 37.00 7.44 7.39 191.00 180.00 477.50 477.50 37.00 37.00 7.44 7.44 191.00 191.00 477.50 477.50 0 0 0 0 0 0 0 0
2 109 21.00 30.00 0 Other/Unknown F 157.48 Emergency Department Accident & Emergency 429 admit Cardiac ICU 0.31 0 74.41 nan 305.00 901.03 0 1 nan 37.00 7.36 nan 4.00 6.00 0.00 5.00 84.00 118.00 24.70 0 146.00 nan nan nan nan 34.00 139.00 36.60 ... nan nan nan nan 1.20 1.20 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 160 49.00 32.02 0 Caucasian M 172.70 Emergency Department Accident & Emergency 470 admit Med-Surg ICU 0.28 0 95.50 3.30 113.00 501.02 0 0 0.30 31.00 2.20 nan 4.00 6.00 0.00 5.00 139.00 119.00 34.80 0 62.00 nan nan nan nan 10.00 138.00 37.10 ... nan nan nan nan nan nan nan nan nan nan 3.30 3.30 134.00 134.00 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
4 64 nan 29.87 0 African American F 167.60 NaN Floor 683 admit Med-Surg ICU 3.11 0 83.90 3.10 302.00 109.12 0 0 1.20 16.00 1.13 nan 3.00 6.00 0.00 5.00 106.00 96.00 41.00 0 59.00 nan nan nan nan 43.00 135.00 36.60 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 179 columns

In [31]:
# Call datetime_transformer on df_val
# See the implementation in pmlm_utilities.ipynb
df_val = datetime_transformer(df_val, datetime_vars)

# Print the first 5 rows of df_val
df_val.head()
Out[31]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 86 67.00 37.56 0 Caucasian M 168.00 NaN Floor 1035 admit CCU-CTICU 0.01 0 106.00 nan 114.00 102.01 0 0 nan 20.00 0.73 nan 4.00 6.00 0.00 5.00 80.00 53.00 36.00 0 111.00 nan nan nan nan 27.00 137.00 36.40 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
1 189 60.00 24.95 0 Caucasian F 165.10 Direct Admit Accident & Emergency 543 admit Med-Surg ICU 0.30 0 68.00 2.20 302.00 109.16 0 0 0.30 7.40 0.50 nan 4.00 6.00 0.00 5.00 79.00 64.00 24.70 0 45.00 nan nan nan nan 10.00 136.00 35.80 ... 8.80 8.80 24.80 24.80 1.50 1.40 nan nan 171.00 171.00 nan nan nan nan 10.14 10.14 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
2 13 50.00 25.81 0 Caucasian M 177.80 NaN Floor 708 admit Med-Surg ICU 0.37 0 81.60 2.90 113.00 501.01 0 0 1.00 35.00 2.68 1.00 3.00 4.00 0.00 1.00 103.00 60.00 28.00 1 53.00 32.70 32.70 164.00 7.33 16.00 128.00 36.30 ... 11.50 11.40 34.50 34.30 1.20 1.20 nan nan 142.00 136.00 4.90 4.30 129.00 128.00 27.50 26.40 34.70 30.20 7.35 7.32 164.00 95.00 164.00 107.00 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 21 58.00 34.58 0 Caucasian M 180.30 Other Hospital Other Hospital 512 admit CCU-CTICU 0.01 0 112.40 nan 112.00 107.01 0 0 nan 10.00 0.94 nan 4.00 6.00 0.00 5.00 137.00 58.00 35.30 0 52.00 nan nan nan nan 28.00 140.00 36.60 ... nan nan nan nan 1.18 1.18 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
4 60 21.00 19.53 0 Other/Unknown F 152.40 Emergency Department Accident & Emergency 538 admit Med-Surg ICU 0.09 0 45.36 nan 123.00 702.01 0 0 nan 15.00 0.80 0.21 4.00 6.00 0.00 5.00 242.00 122.00 nan 0 72.00 24.70 24.70 110.00 7.27 13.00 152.00 36.40 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 25.90 24.70 7.31 7.27 112.00 110.00 533.33 523.81 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0

5 rows × 179 columns

In [32]:
# See the implementation in pmlm_utilities.ipynb
df_test = datetime_transformer(df_test, datetime_vars)

# Print the first 5 rows of df_test
df_test.head()
Out[32]:
hospital_id age bmi elective_surgery ethnicity gender height hospital_admit_source icu_admit_source icu_id icu_stay_type icu_type pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache ... h1_hemaglobin_max h1_hemaglobin_min h1_hematocrit_max h1_hematocrit_min h1_inr_max h1_inr_min h1_lactate_max h1_lactate_min h1_platelets_max h1_platelets_min h1_potassium_max h1_potassium_min h1_sodium_max h1_sodium_min h1_wbc_max h1_wbc_min d1_arterial_pco2_max d1_arterial_pco2_min d1_arterial_ph_max d1_arterial_ph_min d1_arterial_po2_max d1_arterial_po2_min d1_pao2fio2ratio_max d1_pao2fio2ratio_min h1_arterial_pco2_max h1_arterial_pco2_min h1_arterial_ph_max h1_arterial_ph_min h1_arterial_po2_max h1_arterial_po2_min h1_pao2fio2ratio_max h1_pao2fio2ratio_min aids cirrhosis hepatic_failure immunosuppression leukemia lymphoma solid_tumor_with_metastasis diabetes_mellitus
0 79 79.00 29.12 1 Other/Unknown M 167.60 Emergency Department Operating Room / Recovery 337 transfer Med-Surg ICU 0.43 0 81.80 1.30 308.00 1904.01 1 0 2.00 57.00 2.42 0.55 2.00 5.00 0.00 1.00 161.00 128.00 17.20 1 40.00 41.00 41.00 235.00 7.31 9.00 151.00 36.00 ... 5.80 5.80 17.20 17.20 nan nan 1.00 1.00 199.00 199.00 5.20 5.20 150.00 150.00 9.80 9.80 41.00 30.00 7.52 7.31 376.00 228.00 427.27 376.00 30.00 30.00 7.52 7.52 376.00 376.00 376.00 376.00 0 0 0 0 0 0 0 0
1 118 63.00 29.34 1 Caucasian M 172.70 Operating Room Operating Room / Recovery 89 admit Neuro ICU 0.00 0 87.50 nan 218.00 1505.02 1 0 nan 16.00 0.70 nan 4.00 6.00 0.00 5.00 148.00 60.00 35.10 0 198.00 nan nan nan nan 4.00 140.00 36.30 ... nan nan nan nan 1.00 1.00 nan nan nan nan nan nan nan nan nan nan 44.00 44.00 7.39 7.39 115.00 115.00 287.50 287.50 nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 1
2 173 72.00 21.64 0 Caucasian M 175.30 NaN Other Hospital 962 admit Med-Surg ICU 0.00 0 66.50 1.90 119.00 601.05 0 0 1.30 9.00 0.65 nan 1.00 5.00 0.00 1.00 162.00 130.00 28.80 0 113.00 nan nan nan nan 43.00 138.00 37.20 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 0 0 0 0
3 110 29.00 20.80 0 Caucasian F 165.10 Emergency Department Accident & Emergency 969 admit CCU-CTICU 0.02 0 56.70 nan 122.00 703.03 0 0 nan 6.00 0.58 0.25 2.00 3.00 0.00 1.00 92.00 106.00 nan 0 53.00 34.20 34.20 138.00 7.41 27.00 141.00 35.60 ... nan nan nan nan nan nan 0.80 0.80 nan nan 4.30 3.60 141.00 141.00 nan nan 34.20 34.00 7.41 7.41 141.00 138.00 552.00 470.00 34.00 34.00 7.41 7.41 141.00 141.00 470.00 470.00 0 0 0 0 0 0 0 0
4 118 84.00 20.28 0 Asian M 157.50 Floor Floor 97 admit MICU 2.01 0 50.30 nan 301.00 410.01 0 0 nan 22.00 1.44 nan 4.00 5.00 0.00 2.00 154.00 127.00 21.90 0 68.00 nan nan nan nan 47.00 146.00 35.60 ... nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 1 1 0 0 0 0

5 rows × 179 columns

6. Handling Missing Data

Combining the dataframes

In [33]:
df = pd.concat([df_train, df_val, df_test], sort=False)

Then, the Nan_checker is called to check for empty values.

In [34]:
# Call nan_checker on df
# See the implementation in pmlm_utilities.ipynb
df_nan = nan_checker(df)

# Print df_nan
df_nan
Out[34]:
var proportion dtype
0 h1_bilirubin_min 0.92 float64
1 h1_bilirubin_max 0.92 float64
2 h1_albumin_max 0.91 float64
3 h1_albumin_min 0.91 float64
4 h1_lactate_max 0.91 float64
... ... ... ...
155 d1_sysbp_min 0.00 float64
156 d1_heartrate_max 0.00 float64
157 d1_heartrate_min 0.00 float64
158 icu_admit_source 0.00 object
159 gender 0.00 object

160 rows × 3 columns

Getting the data types of the nan values in the combined dataframe, df.

In [35]:
pd.DataFrame(df_nan['dtype'].unique(), columns=['dtype'])
Out[35]:
dtype
0 float64
1 object
In [36]:
# Get the variables with missing values, their proportion of missing values and data type
df_miss = df_nan[df_nan['dtype'] == 'float64'].reset_index(drop=True)

# Print df_miss
df_miss
Out[36]:
var proportion dtype
0 h1_bilirubin_min 0.92 float64
1 h1_bilirubin_max 0.92 float64
2 h1_albumin_max 0.91 float64
3 h1_albumin_min 0.91 float64
4 h1_lactate_max 0.91 float64
... ... ... ...
151 d1_diasbp_min 0.00 float64
152 d1_sysbp_max 0.00 float64
153 d1_sysbp_min 0.00 float64
154 d1_heartrate_max 0.00 float64
155 d1_heartrate_min 0.00 float64

156 rows × 3 columns

In [37]:
# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]
In [38]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
Out[38]:
# rows # columns
0 78094 179
In [39]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
Out[39]:
# rows # columns
0 26031 179
In [40]:
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
Out[40]:
# rows # columns
0 26032 179

The missing data is imputed with the most_frequent values.

In [41]:
from sklearn.impute import SimpleImputer

# If there are missing values
if len(df_miss['var']) > 0:
    # The SimpleImputer
    si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

    # Impute the variables with missing values in df_train, df_val and df_test 
    df_train[df_miss['var']] = si.fit_transform(df_train[df_miss['var']])
    df_val[df_miss['var']] = si.transform(df_val[df_miss['var']])
    df_test[df_miss['var']] = si.transform(df_test[df_miss['var']])

7. Encoding Categorical Data

The training, validation and testing data is combined.

Then, the cat_var_checker is called.

In [42]:
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

# Print the unique data type of variables in df
pd.DataFrame(df.dtypes.unique(), columns=['dtype'])
Out[42]:
dtype
0 int64
1 float64
2 object
In [43]:
# Call cat_var_checker on df
# See the implementation in pmlm_utilities.ipynb
df_cat = cat_var_checker(df)

# Print the dataframe
df_cat
Out[43]:
var nunique
0 hospital_admit_source 16
1 icu_type 8
2 ethnicity 7
3 icu_admit_source 6
4 gender 3
5 icu_stay_type 3
In [44]:
# One-hot-encode the categorical features in the combined data
df = pd.get_dummies(df, columns=np.setdiff1d(df_cat['var'], [target]))

# Print the first 5 rows of df
df.head()
Out[44]:
hospital_id age bmi elective_surgery height icu_id pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache urineoutput_apache ventilated_apache wbc_apache d1_diasbp_invasive_max d1_diasbp_invasive_min d1_diasbp_max ... diabetes_mellitus ethnicity_African American ethnicity_Asian ethnicity_Caucasian ethnicity_Hispanic ethnicity_Native American ethnicity_Other/Unknown gender_F gender_M hospital_admit_source_Acute Care/Floor hospital_admit_source_Chest Pain Center hospital_admit_source_Direct Admit hospital_admit_source_Emergency Department hospital_admit_source_Floor hospital_admit_source_ICU hospital_admit_source_ICU to SDU hospital_admit_source_Observation hospital_admit_source_Operating Room hospital_admit_source_Other hospital_admit_source_Other Hospital hospital_admit_source_Other ICU hospital_admit_source_PACU hospital_admit_source_Recovery Room hospital_admit_source_Step-Down Unit (SDU) icu_admit_source_Accident & Emergency icu_admit_source_Floor icu_admit_source_Operating Room / Recovery icu_admit_source_Other Hospital icu_admit_source_Other ICU icu_stay_type_admit icu_stay_type_readmit icu_stay_type_transfer icu_type_CCU-CTICU icu_type_CSICU icu_type_CTICU icu_type_Cardiac ICU icu_type_MICU icu_type_Med-Surg ICU icu_type_Neuro ICU icu_type_SICU
0 39 54.00 15.44 0 162.60 616 0.07 0 40.82 3.10 119.00 601.01 0 0 0.40 8.00 0.59 1.00 4.00 6.00 0.00 5.00 137.00 90.00 33.40 0 50.00 38.00 38.00 78.00 7.36 33.00 142.00 36.70 0.00 0 9.88 74.00 46.00 98.00 ... 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
1 204 87.00 23.73 1 184.20 431 0.18 0 80.50 3.10 203.00 1206.03 1 0 0.40 15.00 0.74 0.40 4.00 6.00 0.00 5.00 85.00 55.00 30.50 1 51.00 37.00 37.00 191.00 7.44 51.00 134.00 35.40 1751.33 1 3.80 81.00 37.00 51.00 ... 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0
2 109 21.00 30.00 0 157.48 429 0.31 0 74.41 3.10 305.00 901.03 0 1 0.40 37.00 7.36 1.00 4.00 6.00 0.00 5.00 84.00 118.00 24.70 0 146.00 38.00 38.00 78.00 7.36 34.00 139.00 36.60 858.38 0 4.80 74.00 46.00 119.00 ... 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
3 160 49.00 32.02 0 172.70 470 0.28 0 95.50 3.30 113.00 501.02 0 0 0.30 31.00 2.20 1.00 4.00 6.00 0.00 5.00 139.00 119.00 34.80 0 62.00 38.00 38.00 78.00 7.36 10.00 138.00 37.10 0.00 0 7.80 74.00 46.00 96.00 ... 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
4 64 67.00 29.87 0 167.60 683 3.11 0 83.90 3.10 302.00 109.12 0 0 1.20 16.00 1.13 1.00 3.00 6.00 0.00 5.00 106.00 96.00 41.00 0 59.00 38.00 38.00 78.00 7.36 43.00 135.00 36.60 295.06 0 9.10 74.00 46.00 55.00 ... 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0

5 rows × 212 columns

The categorical variables are encoded using LabelEncoder()

In [45]:
from sklearn.preprocessing import LabelEncoder

# The LabelEncoder
le = LabelEncoder()

# Encode categorical target in the combined data
df[target] = le.fit_transform(df[target])

# Print the first 5 rows of df
df.head()
Out[45]:
hospital_id age bmi elective_surgery height icu_id pre_icu_los_days readmission_status weight albumin_apache apache_2_diagnosis apache_3j_diagnosis apache_post_operative arf_apache bilirubin_apache bun_apache creatinine_apache fio2_apache gcs_eyes_apache gcs_motor_apache gcs_unable_apache gcs_verbal_apache glucose_apache heart_rate_apache hematocrit_apache intubated_apache map_apache paco2_apache paco2_for_ph_apache pao2_apache ph_apache resprate_apache sodium_apache temp_apache urineoutput_apache ventilated_apache wbc_apache d1_diasbp_invasive_max d1_diasbp_invasive_min d1_diasbp_max ... diabetes_mellitus ethnicity_African American ethnicity_Asian ethnicity_Caucasian ethnicity_Hispanic ethnicity_Native American ethnicity_Other/Unknown gender_F gender_M hospital_admit_source_Acute Care/Floor hospital_admit_source_Chest Pain Center hospital_admit_source_Direct Admit hospital_admit_source_Emergency Department hospital_admit_source_Floor hospital_admit_source_ICU hospital_admit_source_ICU to SDU hospital_admit_source_Observation hospital_admit_source_Operating Room hospital_admit_source_Other hospital_admit_source_Other Hospital hospital_admit_source_Other ICU hospital_admit_source_PACU hospital_admit_source_Recovery Room hospital_admit_source_Step-Down Unit (SDU) icu_admit_source_Accident & Emergency icu_admit_source_Floor icu_admit_source_Operating Room / Recovery icu_admit_source_Other Hospital icu_admit_source_Other ICU icu_stay_type_admit icu_stay_type_readmit icu_stay_type_transfer icu_type_CCU-CTICU icu_type_CSICU icu_type_CTICU icu_type_Cardiac ICU icu_type_MICU icu_type_Med-Surg ICU icu_type_Neuro ICU icu_type_SICU
0 39 54.00 15.44 0 162.60 616 0.07 0 40.82 3.10 119.00 601.01 0 0 0.40 8.00 0.59 1.00 4.00 6.00 0.00 5.00 137.00 90.00 33.40 0 50.00 38.00 38.00 78.00 7.36 33.00 142.00 36.70 0.00 0 9.88 74.00 46.00 98.00 ... 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
1 204 87.00 23.73 1 184.20 431 0.18 0 80.50 3.10 203.00 1206.03 1 0 0.40 15.00 0.74 0.40 4.00 6.00 0.00 5.00 85.00 55.00 30.50 1 51.00 37.00 37.00 191.00 7.44 51.00 134.00 35.40 1751.33 1 3.80 81.00 37.00 51.00 ... 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0
2 109 21.00 30.00 0 157.48 429 0.31 0 74.41 3.10 305.00 901.03 0 1 0.40 37.00 7.36 1.00 4.00 6.00 0.00 5.00 84.00 118.00 24.70 0 146.00 38.00 38.00 78.00 7.36 34.00 139.00 36.60 858.38 0 4.80 74.00 46.00 119.00 ... 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
3 160 49.00 32.02 0 172.70 470 0.28 0 95.50 3.30 113.00 501.02 0 0 0.30 31.00 2.20 1.00 4.00 6.00 0.00 5.00 139.00 119.00 34.80 0 62.00 38.00 38.00 78.00 7.36 10.00 138.00 37.10 0.00 0 7.80 74.00 46.00 96.00 ... 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
4 64 67.00 29.87 0 167.60 683 3.11 0 83.90 3.10 302.00 109.12 0 0 1.20 16.00 1.13 1.00 3.00 6.00 0.00 5.00 106.00 96.00 41.00 0 59.00 38.00 38.00 78.00 7.36 43.00 135.00 36.60 295.06 0 9.10 74.00 46.00 55.00 ... 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0

5 rows × 212 columns

In [46]:
# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]
In [47]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
Out[47]:
# rows # columns
0 78094 212
In [48]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
Out[48]:
# rows # columns
0 26031 212
In [49]:
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
Out[49]:
# rows # columns
0 26032 212

8. Splitting the Feature & the Target

The feature and the target is split.

In [50]:
# Get the feature matrix
X_train = df_train[np.setdiff1d(df_train.columns, [target])].values
X_val = df_val[np.setdiff1d(df_val.columns, [target])].values
X_test = df_test[np.setdiff1d(df_test.columns, [target])].values

# Get the target vector
y_train = df_train[target].values
y_val = df_val[target].values
y_test = df_test[target].values

9. Scaling the Data

The training, validation and testing data is standardized using StandardScaler().

In [51]:
from sklearn.preprocessing import StandardScaler

# The StandardScaler
ss = StandardScaler()
In [52]:
# Standardize the training data
X_train = ss.fit_transform(X_train)

# Standardize the validation data
X_val = ss.transform(X_val)

# Standardize the test data
X_test = ss.transform(X_test)

10. Handling Class Imbalance

SMOTE is used to hande class imbalance.

In [53]:
pd.Series(y_train).value_counts()
Out[53]:
0    61203
1    16891
dtype: int64
In [54]:
from imblearn.over_sampling import SMOTE

# The SMOTE
smote = SMOTE(random_state=random_seed)

# Augment the training data
X_smote_train, y_smote_train = smote.fit_resample(X_train, y_train)
In [55]:
pd.Series(y_smote_train).value_counts()
Out[55]:
1    61203
0    61203
dtype: int64
In [56]:
X_smote_sub1, X_smote_sub2, y_smote_sub1, y_smote_sub2 = train_test_split(X_smote_train, y_smote_train, test_size=0.01, random_state=42)
In [57]:
# Using 1% of our total training set allowed for a faster graphical representation runtime
X_train_sub1, X_train_sub2, y_train_sub1, y_train_sub2 = train_test_split(X_train, y_train, 
                                     test_size=0.01, 
                                     random_state=random_seed, 
                                     stratify=y_train)
In [58]:
y_smote_gen_ori_train = separate_generate_original(X_smote_sub2, y_smote_sub2, X_train, y_train, 1)
In [59]:
# Plot the scatter plot using TSNE
# See the implementation in pmlm_utilities.ipynb
plot_scatter_tsne(X_smote_sub2,
                  y_smote_gen_ori_train, 
                  [0, 1, 2],
                  ['0', '1', '+1'],
                  ['blue', 'green', 'red'],
                  ['o', '^', 's'],
                  'bottom-right',
                  abspath,
                  'scatter_plot_smote.pdf',
                  random_seed)

Hyperparameter Tuning

The goal of hyperparameter tuning is to find and optimize the parameter values that leads to a higher accuracy score and a lower validation performance loss. The final parameters tested for this model are shown below:

In [60]:
# Change working directory to the absolute path of the shallow models folder
%cd $abspath

# Import the shallow models
%run pmlm_models_shallow.ipynb
/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project

Creating the dictionary of the models

In the dictionary:

1. the key is the acronym of the model <br>
2. the value is the model
In [61]:
from sklearn.linear_model import LogisticRegression

models = {'lr': LogisticRegression(class_weight='balanced', random_state=random_seed),
          'lr_mbgd': LogisticRegression_MBGD()}

Creating the dictionary of the pipelines

In the dictionary:

1. the key is the acronym of the model <br>
2. the value is the pipeline, which, for now, only includes the model
In [62]:
from sklearn.pipeline import Pipeline

pipes = {}

for acronym, model in models.items():
    pipes[acronym] = Pipeline([('model', model)])

Getting the predefined split cross-validator

In [63]:
# Get the:
# feature matrix and target velctor in the combined training and validation data
# target vector in the combined training and validation data
# PredefinedSplit
# See the implementation in pmlm_utilities.ipynb
X_train_val, y_train_val, ps = get_train_val_ps(X_smote_train, y_smote_train, X_val, y_val)

GridSearchCV

Creating the dictionary of the parameter grids. In the dictionary:

1. the key is the acronym of the model<br>
2. the value is the parameter grid of the model
In [64]:
param_grids = {}

The parameter grid for Logistic Regression

The hyperparameters we want to fine-tune are:

  1. tol_grid
  2. C_grid

The parameter grid for Logisitic Regression Mini-Batch Gradient Descent

The hyperparameters we want to fine-tune are:

  1. eta_grid
  2. alpha_grid
In [65]:
# The parameter grid of tol
tol_grid = [.5 * 10 ** -3, 10 ** -2, 2 * 10 ** -1] #Adding weights to parameter to change the accuracy score

# The parameter grid of C
C_grid = [.1, 1, 10]

# Update param_grids
param_grids['lr'] = [{'model__tol': tol_grid,
                      'model__C': C_grid}]
In [66]:
# The parameter grid of eta
eta_grid = [10 ** -3, 10 ** -2, 10 ** -1] 

# The parameter grid of alpha
alpha_grid = [0.1, 1, 10]

# Update param_grids
param_grids['lr_mbgd'] = [{'model__eta': eta_grid,
                           'model__alpha': alpha_grid}]

Creating the directory for the cv results produced by GridSearchCV

In [67]:
directory = os.path.dirname(abspath + '/result/dm2/cv_results/GridSearchCV/')
if not os.path.exists(directory):
    os.makedirs(directory)

Tuning the hyperparameters

The code below shows how to fine-tune the hyperparameters.

In [68]:
from sklearn.model_selection import GridSearchCV

# The list of [best_score_, best_params_, best_estimator_] obtained by GridSearchCV
best_score_params_estimator_gs = []

# For each model
for acronym in pipes.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipes[acronym],
                      param_grid=param_grids[acronym],
                      scoring='f1_weighted', # changed from 'f1_macro', helped drastically increase the accuracy score
                      n_jobs=2,
                      cv=ps,
                      return_train_score=True)
        
    # Fit the pipeline
    gs = gs.fit(X_train_val, y_train_val)
    
    # Update best_score_params_estimator_gs
    best_score_params_estimator_gs.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
    
    # Sort cv_results in ascending order of 'rank_test_score' and 'std_test_score'
    cv_results = pd.DataFrame.from_dict(gs.cv_results_).sort_values(by=['rank_test_score', 'std_test_score'])
    
    # Get the important columns in cv_results
    important_columns = ['rank_test_score',
                         'mean_test_score', 
                         'std_test_score', 
                         'mean_train_score', 
                         'std_train_score',
                         'mean_fit_time', 
                         'std_fit_time',                        
                         'mean_score_time', 
                         'std_score_time']
    
    # Move the important columns ahead
    cv_results = cv_results[important_columns + sorted(list(set(cv_results.columns) - set(important_columns)))]

    # Write cv_results file
    cv_results.to_csv(path_or_buf=abspath + 'result/dm2/cv_results/GridSearchCV/' + acronym + '.csv', index=False)

# Sort best_score_params_estimator_gs in descending order of the best_score_
best_score_params_estimator_gs = sorted(best_score_params_estimator_gs, key=lambda x : x[0], reverse=True)

# Print best_score_params_estimator_gs
pd.DataFrame(best_score_params_estimator_gs, columns=['best_score', 'best_param', 'best_estimator'])
Out[68]:
best_score best_param best_estimator
0 0.79 {'model__C': 1, 'model__tol': 0.0005} (LogisticRegression(C=1, class_weight='balance...
1 0.78 {'model__alpha': 0.1, 'model__eta': 0.001} (LogisticRegression_MBGD(alpha=0.1, batch_size...

Model Selection

Here we will select best_estimator_gs as the best model.

In [69]:
# Get the best_score, best_params and best_estimator obtained by GridSearchCV
best_score_gs, best_params_gs, best_estimator_gs = best_score_params_estimator_gs[0]

Model Evaluation

In [70]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score

# Get the prediction on the testing data using best_model
y_test_pred = best_estimator_gs.predict(X_test)

# Get the precision, recall, fscore, support
precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_test_pred)

# Get the auc
auc = roc_auc_score(y_test, y_test_pred)

# Get the dataframe of precision, recall, fscore and auc
pd.DataFrame([[precision, recall, fscore, auc]], columns=['Precision', 'Recall', 'F1-score', 'AUC'])
Out[70]:
Precision Recall F1-score AUC
0 [0.91215793366331, 0.48326434062684803] [0.7858543280070581, 0.7257548845470693] [0.844308696911451, 0.5801916932907348] 0.76

Conclusion

After running the model, it was concluded that the highest accuracy score obtained was .79. This final score is .08 higher than our orgignal baseline score of .71 that was obtained by running this model without changing any parameters from the code provided in the Breast Cancer Wisconsin case study. There were a few parameters that were changed during the trial and error period of hyperparameter tuning for the purposes of this project:

  • Random seed
  • Scoring factor
  • Tol grid
  • C grid
  • Eta grid
  • Alpha grid

Through trial and error, simply changing the random seed or any of the Gridsearch parameter grid did not help increase the score. However, changing the scoring from f1_macro significantly changed our accuracy score. Initially, the scoring was changed to "f1", producing a score of .56. Next, the scoring was changed to f1_weighted. This produced our final score of .79, a significant increase from the .71 baseline score that was initially reached.

Even though the parameter tuning alone did not vastly increase the score, this final score of .79 is still an acceptable score because all aspects of data preprocessing and model selection were properly applied. Therefore, it can be concluded that this model has a .79 accuracy score and a precision score between .48-.91.

In comparison, the score obtained from the live Kaggle competition in which the provided test data was gave a score of .74. That score was obtained used the "f1_macro" scoring that was initially tested. Therefore, between the two attempts of this model, it can be concluded that this final version of our Diabetes Mellitus prediction model is the more accurate model created.