The topic for this project was obtained from the annual Women in Data Science Datathon organized on Kaggle. The WiDS Datathon 2021 is a collaboration led by the WiDS Worldwide team at Stanford University, the West Big Data Innovation Hub, and the WiDS Datathon Committee.
The 2021 WiDS Datathon focused on “patient health, with an emphasis on the chronic condition of diabetes, through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative.”
The COVID-19 pandemic has forced the healthcare industry to gain a rapid understanding of each patient’s overall health as hospitals around the world struggle with an overload of patients in critical condition. Therefore, knowledge about chronic conditions such as diabetes mellitus assists healthcare workers in making clinical decisions about patient care.
The purpose of this challenge is to “determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus.” The data has been collected from the first 24 hours of intensive care and labeled training data has been used for model development. The testing data was also provided by the organizing committee on Kaggle.
Citation: “WiDS Datathon 2021.” Kaggle, www.kaggle.com/c/widsdatathon2021.
For more information on our Kaggle Competition entry, please click here.
The code for this project was inspired by the Breast Cancer Wisconsin case study completed by Professor Yuxiao Huang here.
Importing Packages
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import csv
import warnings
warnings.filterwarnings('ignore')
Setting Display Options
pd.set_option("display.float.format", lambda x: "%.2f" % x)
Importing Tensorflow
# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
Setting the Random Seed
# The random seed
random_seed = 100
# Set random seed in tensorflow
tf.random.set_seed(random_seed)
# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)
Mounting drive/Setting Directory
from google.colab import drive
drive.mount('/content/drive')
abspath = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project/'
Setting the working directory to the absolute path and importing the professor's shallow utilities file so that it can be used throughout this project.
# Change working directory to the absolute path
%cd $abspath
# Import the shallow utilities
%run pmlm_utilities_shallow.ipynb
Before starting any data preprocessing, it was important to determine which type of model would best suit the dataset. Exploring the dataset showed that the target variable for this project is diabetes_mellitus, which takes the value 0 (the patient does not have diabetes) or 1 (the patient does have diabetes). Because the target takes a finite set of values, the problem is best solved with a classification model. Specifically, since there are only two target classes and each sample belongs to exactly one of them, a binary logistic regression classifier was chosen for this project.
This project was initially completed as a Kaggle competition where the test data was provided. The code that was run for the purposes of the Kaggle competition produced an accuracy score of .74. As a comparison, the provided Kaggle test data was omitted and new test data was extracted from the training dataset to observe how the two accuracy scores compared for the purposes of this final assignment.
With the model selected, the first step in preparing the dataset was to complete a thorough data preprocessing. There were ten steps included for the purposes of this project’s data preprocessing:
To keep a fair train/validation/test split, the data was first split 60:40 into a training portion and a held-out portion. The held-out portion was then split 50:50 into validation and testing. Therefore, the full training dataset was split as follows: 60% training, 20% validation, and 20% testing.
For this project, the following files are pre-loaded from Professor Yuxiao Huang's Github repository:
Loading the training data and making a copy of the raw training data. Also dropping the unnamed column from the raw data.
# Load the raw training data
df_raw = pd.read_csv(abspath + 'TrainingWiDS2021.csv', header=0)
# Remove the unnamed column
df_raw = df_raw.drop(columns='Unnamed: 0')
# Make a copy of df_raw
df = df_raw.copy(deep=True)
Setting the target variable, which in the case of this project would be Diabetes Mellitus since we are trying to find out if individuals will have Diabetes Mellitus or not.
# Get the name of the target
target = 'diabetes_mellitus'
Getting the dimensions of the training data.
pd.DataFrame([[df.shape[0], df.shape[1]]], columns=['# rows', '# columns'])
Previewing the first 5 rows of the training data.
df.head()
Using train_test_split from sklearn, the data is being split into 60% training data and 40% testing data.
The testing data is further split into 50% validation and 50% testing data.
from sklearn.model_selection import train_test_split
# Divide the data into training (60%) and test (40%)
df_train, df_test = train_test_split(df,
train_size=0.6,
random_state=random_seed,
stratify=df[target])
# Divide the test data into validation (50%) and test (50%)
df_val, df_test = train_test_split(df_test,
train_size=0.5,
random_state=random_seed,
stratify=df_test[target])
# Reset the index
df_train, df_val, df_test = df_train.reset_index(drop=True), df_val.reset_index(drop=True), df_test.reset_index(drop=True)
Getting the dimensions of df_train after the split from above.
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
Getting the dimensions of df_val after the split from above.
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
Using the common_var_checker to print the common variables between the df_train, df_val, df_test, and the target.
# Call common_var_checker
# See the implementation in pmlm_utilities.ipynb
df_common_var = common_var_checker(df_train, df_val, df_test, target)
# Print df_common_var
df_common_var
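The implementation of common_var_checker is in the utilities notebook and is not reproduced here. A minimal sketch of what such a checker might do, assuming it returns a dataframe with a 'common var' column (consistent with how the result is used in this notebook); the function name `common_var_checker_sketch` and the toy dataframes are hypothetical:

```python
import numpy as np
import pandas as pd

def common_var_checker_sketch(df_train, df_val, df_test, target):
    """Hypothetical stand-in: columns shared by all three dataframes, plus the target."""
    common = np.intersect1d(np.intersect1d(df_train.columns, df_val.columns), df_test.columns)
    common = np.union1d(common, [target])
    return pd.DataFrame(common, columns=['common var'])

# Toy dataframes: 'extra' appears only in the training data
toy_train = pd.DataFrame({'age': [50], 'bmi': [27.1], 'extra': [1], 'diabetes_mellitus': [0]})
toy_val = pd.DataFrame({'age': [61], 'bmi': [30.4], 'diabetes_mellitus': [1]})
toy_test = pd.DataFrame({'age': [45], 'bmi': [22.0], 'diabetes_mellitus': [0]})

toy_common = common_var_checker_sketch(toy_train, toy_val, toy_test, 'diabetes_mellitus')
# Columns in the training data that are not shared by all three sets
toy_uncommon = np.setdiff1d(toy_train.columns, toy_common['common var'])
```

Because all three sets here were split from one dataframe, the uncommon-feature lists are expected to be empty in this project; the check matters more when the test file comes from a separate source.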
Getting the uncommon features
Getting the features in the training data but not in the validation or test data.
uncommon_feature_train_not_val_test = np.setdiff1d(df_train.columns, df_common_var['common var'])
# Print the uncommon features
pd.DataFrame(uncommon_feature_train_not_val_test, columns=['uncommon feature'])
Getting the features in the validation data but not in the training or test data.
uncommon_feature_val_not_train_test = np.setdiff1d(df_val.columns, df_common_var['common var'])
# Print the uncommon features
pd.DataFrame(uncommon_feature_val_not_train_test, columns=['uncommon feature'])
Getting the features in the test data but not in the training or validation data.
uncommon_feature_test_not_train_val = np.setdiff1d(df_test.columns, df_common_var['common var'])
# Print the uncommon features
pd.DataFrame(uncommon_feature_test_not_train_val, columns=['uncommon feature'])
Dropping the uncommon features
Using the drop function to remove the uncommon features from the training data, then previewing df_train.
# Remove the uncommon features from the training data
df_train = df_train.drop(columns=uncommon_feature_train_not_val_test)
# Print the first 5 rows of df_train
df_train.head()
Using the drop function to remove the uncommon features from the validation data, then previewing df_val.
# Remove the uncommon features from the validation data
df_val = df_val.drop(columns=uncommon_feature_val_not_train_test)
# Print the first 5 rows of df_val
df_val.head()
Using the drop function to remove the uncommon features from the test data, then previewing df_test.
# Remove the uncommon features from the test data
df_test = df_test.drop(columns=uncommon_feature_test_not_train_val)
# Print the first 5 rows of df_test
df_test.head()
Combining the dataframes
Combining df_train, df_val, and df_test using a concat function before dropping the identifiers.
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)
Getting the identifiers
Using the id_checker to get the id column from the combined dataframe we made above.
# Call id_checker on df
# See the implementation in pmlm_utilities.ipynb
df_id = id_checker(df)
# Print the first 5 rows of df_id
df_id.head()
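The id_checker helper is defined in the utilities notebook. One plausible rule, shown here only as an assumption, is to treat a column as an identifier when every row holds a distinct value; the function name `id_checker_sketch` and the toy values are hypothetical:

```python
import pandas as pd

def id_checker_sketch(df):
    """Hypothetical stand-in: keep columns that have one distinct value per row."""
    id_cols = [col for col in df.columns if df[col].nunique(dropna=False) == df.shape[0]]
    return df[id_cols]

# Toy data: only 'encounter_id' is unique for every row
toy = pd.DataFrame({'encounter_id': [101, 102, 103, 104],
                    'age': [50, 61, 50, 45],
                    'diabetes_mellitus': [0, 1, 0, 0]})
toy_id = id_checker_sketch(toy)
```

A rule like this can also flag continuous measurements that happen to be unique in every row, so the real helper may restrict the check further, for example by data type.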
Removing the identifiers
Use the drop function to remove the identifiers from the df_train, df_val, and df_test.
import numpy as np
# Remove identifiers from df_train
df_train.drop(columns=np.intersect1d(df_id.columns, df_train.columns), inplace=True)
# Remove identifiers from df_val
df_val.drop(columns=np.intersect1d(df_id.columns, df_val.columns), inplace=True)
# Remove identifiers from df_test
df_test.drop(columns=np.intersect1d(df_id.columns, df_test.columns), inplace=True)
Previewing df_train, df_val, and df_test after dropping identifiers
# Print the first 5 rows of df_train
df_train.head()
# Print the first 5 rows of df_val
df_val.head()
# Print the first 5 rows of df_test
df_test.head()
Setting the date time variables from the data. In this case, there are no date time variables, hence the brackets remain empty.
# Get the date time variables
datetime_vars = []
Calling datetime_transformer on df_train, df_val, and df_test.
# Call datetime_transformer on df_train
# See the implementation in pmlm_utilities.ipynb
df_train = datetime_transformer(df_train, datetime_vars)
# Print the first 5 rows of df_train
df_train.head()
# Call datetime_transformer on df_val
# See the implementation in pmlm_utilities.ipynb
df_val = datetime_transformer(df_val, datetime_vars)
# Print the first 5 rows of df_val
df_val.head()
# Call datetime_transformer on df_test
# See the implementation in pmlm_utilities.ipynb
df_test = datetime_transformer(df_test, datetime_vars)
# Print the first 5 rows of df_test
df_test.head()
Combining the dataframes
df = pd.concat([df_train, df_val, df_test], sort=False)
Then, nan_checker is called to check for missing values.
# Call nan_checker on df
# See the implementation in pmlm_utilities.ipynb
df_nan = nan_checker(df)
# Print df_nan
df_nan
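The nan_checker helper is also defined in the utilities notebook. A minimal sketch of a comparable check, assuming the output has 'var' and 'dtype' columns (both are used later in this notebook) and a missing-value proportion column whose name, 'proportion', is a guess:

```python
import numpy as np
import pandas as pd

def nan_checker_sketch(df):
    """Hypothetical stand-in: columns containing NaN, their missing proportion and dtype."""
    rows = [[col, df[col].isna().mean(), str(df[col].dtype)]
            for col in df.columns if df[col].isna().any()]
    out = pd.DataFrame(rows, columns=['var', 'proportion', 'dtype'])
    # Most incomplete variables first
    return out.sort_values(by='proportion', ascending=False).reset_index(drop=True)

# Toy data: 'glucose' is half missing, 'bmi' a quarter missing, 'age' complete
toy = pd.DataFrame({'glucose': [110.0, np.nan, 95.0, np.nan],
                    'bmi': [27.1, 30.4, np.nan, 22.0],
                    'age': [50, 61, 45, 38]})
toy_nan = nan_checker_sketch(toy)
```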
Getting the data types of the nan values in the combined dataframe, df.
pd.DataFrame(df_nan['dtype'].unique(), columns=['dtype'])
# Get the variables with missing values, their proportion of missing values and data type
df_miss = df_nan[df_nan['dtype'] == 'float64'].reset_index(drop=True)
# Print df_miss
df_miss
# Separating the training data (copied so that later column assignments do not act on a view of df)
df_train = df.iloc[:df_train.shape[0], :].copy()
# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :].copy()
# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :].copy()
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
The missing data is imputed with the most_frequent values.
from sklearn.impute import SimpleImputer
# If there are missing values
if len(df_miss['var']) > 0:
    # The SimpleImputer
    si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    # Impute the variables with missing values in df_train, df_val and df_test
    df_train[df_miss['var']] = si.fit_transform(df_train[df_miss['var']])
    df_val[df_miss['var']] = si.transform(df_val[df_miss['var']])
    df_test[df_miss['var']] = si.transform(df_test[df_miss['var']])
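The imputer is fitted on the training data only and then reused on the validation and test data, so the fill values never depend on held-out rows. A minimal sketch of that behaviour on toy arrays (the `toy_` names and values are illustrative assumptions, not project data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature arrays; 1.0 is the most frequent training value
toy_train = np.array([[1.0], [1.0], [2.0], [np.nan]])
toy_val = np.array([[np.nan], [3.0]])

toy_si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Learn the most frequent value from the training rows and fill them
toy_train_imp = toy_si.fit_transform(toy_train)
# Reuse the value learned from training to fill the validation rows
toy_val_imp = toy_si.transform(toy_val)
```

The missing validation entry is filled with the training mode rather than with anything computed from the validation rows.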
The training, validation and testing data is combined.
Then, the cat_var_checker is called.
# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)
# Print the unique data type of variables in df
pd.DataFrame(df.dtypes.unique(), columns=['dtype'])
# Call cat_var_checker on df
# See the implementation in pmlm_utilities.ipynb
df_cat = cat_var_checker(df)
# Print the dataframe
df_cat
# One-hot-encode the categorical features in the combined data
df = pd.get_dummies(df, columns=np.setdiff1d(df_cat['var'], [target]))
# Print the first 5 rows of df
df.head()
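One-hot encoding is applied to the combined dataframe so that the training, validation and test sets end up with the same dummy columns even when a category is absent from one of them. A small sketch of the difference, using a made-up 'icu_type' column with illustrative values:

```python
import pandas as pd

# Toy sets in which each one is missing a category that the other has
toy_train = pd.DataFrame({'icu_type': ['MICU', 'CCU']})
toy_test = pd.DataFrame({'icu_type': ['MICU', 'SICU']})

# Encoding each set separately produces mismatched columns
sep_train_cols = list(pd.get_dummies(toy_train, columns=['icu_type']).columns)
sep_test_cols = list(pd.get_dummies(toy_test, columns=['icu_type']).columns)

# Encoding the concatenation and slicing by position keeps the columns aligned
toy_all = pd.get_dummies(pd.concat([toy_train, toy_test], sort=False), columns=['icu_type'])
enc_train = toy_all.iloc[:len(toy_train), :]
enc_test = toy_all.iloc[len(toy_train):, :]
```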
The target is then encoded using LabelEncoder().
from sklearn.preprocessing import LabelEncoder
# The LabelEncoder
le = LabelEncoder()
# Encode categorical target in the combined data
df[target] = le.fit_transform(df[target])
# Print the first 5 rows of df
df.head()
# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]
# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]
# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])
# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
The features and the target are separated.
# Get the feature matrix
X_train = df_train[np.setdiff1d(df_train.columns, [target])].values
X_val = df_val[np.setdiff1d(df_val.columns, [target])].values
X_test = df_test[np.setdiff1d(df_test.columns, [target])].values
# Get the target vector
y_train = df_train[target].values
y_val = df_val[target].values
y_test = df_test[target].values
The training, validation and testing data is standardized using StandardScaler().
from sklearn.preprocessing import StandardScaler
# The StandardScaler
ss = StandardScaler()
# Standardize the training data
X_train = ss.fit_transform(X_train)
# Standardize the validation data
X_val = ss.transform(X_val)
# Standardize the test data
X_test = ss.transform(X_test)
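Standardization subtracts the training mean and divides by the training standard deviation; the validation and test data reuse those training statistics. The same arithmetic written out in NumPy on toy values (a sketch of the idea, not a replacement for StandardScaler):

```python
import numpy as np

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_val = np.array([[4.0]])

# Statistics come from the training data only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

toy_train_std = (toy_train - mu) / sigma
# The validation data reuse the training mean and standard deviation
toy_val_std = (toy_val - mu) / sigma
```

After the transform, the training feature has mean 0 and standard deviation 1, while held-out values may fall outside that range.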
SMOTE is used to handle class imbalance.
pd.Series(y_train).value_counts()
from imblearn.over_sampling import SMOTE
# The SMOTE
smote = SMOTE(random_state=random_seed)
# Augment the training data
X_smote_train, y_smote_train = smote.fit_resample(X_train, y_train)
pd.Series(y_smote_train).value_counts()
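SMOTE balances the classes by creating synthetic minority samples on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only; it is not imblearn's implementation, and the function name `smote_like_sample` and the toy points are assumptions:

```python
import numpy as np

rng = np.random.default_rng(100)
# Toy minority-class points at the corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(X, k, rng):
    """Interpolate between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    dist = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                      # interpolation weight in [0, 1)
    return X[i] + lam * (X[j] - X[i])

toy_synthetic = np.array([smote_like_sample(toy_minority, 2, rng) for _ in range(5)])
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.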
X_smote_sub1, X_smote_sub2, y_smote_sub1, y_smote_sub2 = train_test_split(X_smote_train, y_smote_train, test_size=0.01, random_state=42)
# Using 1% of our total training set allowed for a faster graphical representation runtime
X_train_sub1, X_train_sub2, y_train_sub1, y_train_sub2 = train_test_split(X_train, y_train,
test_size=0.01,
random_state=random_seed,
stratify=y_train)
y_smote_gen_ori_train = separate_generate_original(X_smote_sub2, y_smote_sub2, X_train, y_train, 1)
# Plot the scatter plot using TSNE
# See the implementation in pmlm_utilities.ipynb
plot_scatter_tsne(X_smote_sub2,
y_smote_gen_ori_train,
[0, 1, 2],
['0', '1', '+1'],
['blue', 'green', 'red'],
['o', '^', 's'],
'bottom-right',
abspath,
'scatter_plot_smote.pdf',
random_seed)
The goal of hyperparameter tuning is to find the parameter values that lead to a higher accuracy score and a lower validation loss. The final parameters tested for this model are shown below:
# Change working directory to the absolute path of the shallow models folder
%cd $abspath
# Import the shallow models
%run pmlm_models_shallow.ipynb
In the dictionary:
1. the key is the acronym of the model
2. the value is the model
from sklearn.linear_model import LogisticRegression
models = {'lr': LogisticRegression(class_weight='balanced', random_state=random_seed),
'lr_mbgd': LogisticRegression_MBGD()}
In the dictionary:
1. the key is the acronym of the model
2. the value is the pipeline, which, for now, only includes the model
from sklearn.pipeline import Pipeline
pipes = {}
for acronym, model in models.items():
    pipes[acronym] = Pipeline([('model', model)])
# Get the:
# feature matrix in the combined training and validation data
# target vector in the combined training and validation data
# PredefinedSplit
# See the implementation in pmlm_utilities.ipynb
X_train_val, y_train_val, ps = get_train_val_ps(X_smote_train, y_smote_train, X_val, y_val)
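The implementation of get_train_val_ps lives in the utilities notebook and is not shown here. A minimal sketch of how such a helper could be built with scikit-learn's PredefinedSplit, assuming it stacks the two sets and marks the validation rows as the single test fold (the function name `get_train_val_ps_sketch` and the toy arrays are hypothetical):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

def get_train_val_ps_sketch(X_train, y_train, X_val, y_val):
    """Hypothetical stand-in: stack train and validation data and build a one-fold split."""
    X_train_val = np.vstack((X_train, X_val))
    y_train_val = np.append(y_train, y_val)
    # -1 keeps a row in the training part of every split; 0 assigns it to validation fold 0
    test_fold = np.append(np.full(X_train.shape[0], -1), np.full(X_val.shape[0], 0))
    return X_train_val, y_train_val, PredefinedSplit(test_fold)

# Toy data: 4 training rows and 2 validation rows
toy_X_tv, toy_y_tv, toy_ps = get_train_val_ps_sketch(
    np.zeros((4, 2)), np.array([0, 1, 0, 1]), np.ones((2, 2)), np.array([0, 1]))
toy_train_idx, toy_val_idx = next(toy_ps.split())
```

With this kind of split, GridSearchCV evaluates every candidate once, training on the SMOTE-augmented rows and scoring on the untouched validation rows.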
Creating the dictionary of the parameter grids. In the dictionary:
1. the key is the acronym of the model
2. the value is the parameter grid of the model
param_grids = {}
The parameter grid for Logistic Regression
The hyperparameters we want to fine-tune are tol and C.
The parameter grid for Logistic Regression Mini-Batch Gradient Descent
The hyperparameters we want to fine-tune are eta and alpha.
# The parameter grid of tol
tol_grid = [.5 * 10 ** -3, 10 ** -2, 2 * 10 ** -1] #Adding weights to parameter to change the accuracy score
# The parameter grid of C
C_grid = [.1, 1, 10]
# Update param_grids
param_grids['lr'] = [{'model__tol': tol_grid,
'model__C': C_grid}]
# The parameter grid of eta
eta_grid = [10 ** -3, 10 ** -2, 10 ** -1]
# The parameter grid of alpha
alpha_grid = [0.1, 1, 10]
# Update param_grids
param_grids['lr_mbgd'] = [{'model__eta': eta_grid,
'model__alpha': alpha_grid}]
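Each grid above combines three values of one hyperparameter with three values of another, so the search evaluates nine candidates per model. A short sketch that enumerates the candidates for the logistic regression grid (the `toy_` names are assumptions that mirror the lists above):

```python
from itertools import product

# Same values as tol_grid and C_grid, copied under illustrative names
toy_tol_grid = [.5 * 10 ** -3, 10 ** -2, 2 * 10 ** -1]
toy_C_grid = [.1, 1, 10]

# Every (tol, C) pair becomes one candidate setting for the pipeline
toy_candidates = [{'model__tol': tol, 'model__C': C}
                  for tol, C in product(toy_tol_grid, toy_C_grid)]
```

The `model__` prefix routes each value to the pipeline step named 'model'.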
Creating the directory for the cv results produced by GridSearchCV
import os
# Create the output directory for the GridSearchCV results if it does not exist yet
directory = os.path.dirname(abspath + 'result/dm2/cv_results/GridSearchCV/')
if not os.path.exists(directory):
    os.makedirs(directory)
Tuning the hyperparameters
The code below shows how to fine-tune the hyperparameters.
from sklearn.model_selection import GridSearchCV
# The list of [best_score_, best_params_, best_estimator_] obtained by GridSearchCV
best_score_params_estimator_gs = []
# For each model
for acronym in pipes.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipes[acronym],
                      param_grid=param_grids[acronym],
                      scoring='f1_weighted', # changed from 'f1_macro', helped drastically increase the accuracy score
                      n_jobs=2,
                      cv=ps,
                      return_train_score=True)
    # Fit the pipeline
    gs = gs.fit(X_train_val, y_train_val)
    # Update best_score_params_estimator_gs
    best_score_params_estimator_gs.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
    # Sort cv_results in ascending order of 'rank_test_score' and 'std_test_score'
    cv_results = pd.DataFrame.from_dict(gs.cv_results_).sort_values(by=['rank_test_score', 'std_test_score'])
    # Get the important columns in cv_results
    important_columns = ['rank_test_score',
                         'mean_test_score',
                         'std_test_score',
                         'mean_train_score',
                         'std_train_score',
                         'mean_fit_time',
                         'std_fit_time',
                         'mean_score_time',
                         'std_score_time']
    # Move the important columns ahead
    cv_results = cv_results[important_columns + sorted(list(set(cv_results.columns) - set(important_columns)))]
    # Write cv_results file
    cv_results.to_csv(path_or_buf=abspath + 'result/dm2/cv_results/GridSearchCV/' + acronym + '.csv', index=False)
# Sort best_score_params_estimator_gs in descending order of the best_score_
best_score_params_estimator_gs = sorted(best_score_params_estimator_gs, key=lambda x : x[0], reverse=True)
# Print best_score_params_estimator_gs
pd.DataFrame(best_score_params_estimator_gs, columns=['best_score', 'best_param', 'best_estimator'])
Here we will select best_estimator_gs as the best model.
# Get the best_score, best_params and best_estimator obtained by GridSearchCV
best_score_gs, best_params_gs, best_estimator_gs = best_score_params_estimator_gs[0]
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score
# Get the prediction on the testing data using best_model
y_test_pred = best_estimator_gs.predict(X_test)
# Get the precision, recall, fscore, support
precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_test_pred)
# Get the auc
auc = roc_auc_score(y_test, y_test_pred)
# Get the dataframe of precision, recall, fscore and auc
pd.DataFrame([[precision, recall, fscore, auc]], columns=['Precision', 'Recall', 'F1-score', 'AUC'])
After running the model, the highest accuracy score obtained was .79. This final score is .08 higher than our original baseline score of .71, which was obtained by running this model without changing any parameters from the code provided in the Breast Cancer Wisconsin case study. A few parameters were changed during the trial-and-error period of hyperparameter tuning for this project:
Through trial and error, simply changing the random seed or any of the GridSearchCV parameter grids did not help increase the score. However, changing the scoring from f1_macro significantly changed our accuracy score. Initially, the scoring was changed to "f1", producing a score of .56. Next, the scoring was changed to f1_weighted. This produced our final score of .79, a significant increase from the .71 baseline score that was initially reached.
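The three scoring options differ only in how the per-class F1 scores are combined: "f1" reports the positive class alone, f1_macro takes the unweighted mean over classes, and f1_weighted weights each class by its number of samples. A minimal sketch on made-up imbalanced labels (the arrays and resulting numbers are illustrative assumptions, not results from this project):

```python
import numpy as np

# Toy labels: 8 negatives and 2 positives, with one error in each class
toy_y_true = np.array([0] * 8 + [1] * 2)
toy_y_pred = np.array([0] * 7 + [1] + [1, 0])

def f1_for_class(y_true, y_pred, label):
    """F1 score when `label` is treated as the positive class."""
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_f1_per_class = np.array([f1_for_class(toy_y_true, toy_y_pred, c) for c in (0, 1)])
toy_support = np.array([np.sum(toy_y_true == c) for c in (0, 1)])

toy_f1_binary = toy_f1_per_class[1]            # 'f1': positive class only
toy_f1_macro = toy_f1_per_class.mean()         # 'f1_macro': unweighted mean over classes
toy_f1_weighted = np.sum(toy_f1_per_class * toy_support) / toy_support.sum()  # 'f1_weighted'
```

On imbalanced data the weighted average is pulled toward the majority class, which is one plausible reason the three scorers can rank the same model so differently.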
Even though the parameter tuning alone did not vastly increase the score, this final score of .79 is still an acceptable score because all aspects of data preprocessing and model selection were properly applied. Therefore, it can be concluded that this model has a .79 accuracy score and a precision score between .48 and .91.
In comparison, the live Kaggle competition entry, which used the provided test data, gave a score of .74. That score was obtained using the "f1_macro" scoring that was initially tested. Therefore, between the two attempts, it can be concluded that this final version of our Diabetes Mellitus prediction model is the more accurate of the two.