DATS 6202-010 MACHINE LEARNING I

FINAL PROJECT

Detecting the Presence of Diabetes Mellitus

Sabina Azim, Arushi Kapoor, Natasha Vij

Introduction:¶

The topic for this project was obtained from the annual Women in Data Science Datathon organized on Kaggle. The WiDS Datathon 2021 is a collaboration led by the WiDS Worldwide team at Stanford University, the West Big Data Innovation Hub, and the WiDS Datathon Committee.

The 2021 WiDS Datathon focused on “patient health, with an emphasis on the chronic condition of diabetes, through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative.”

The COVID-19 pandemic has forced the healthcare industry to gain a rapid understanding of a patient’s overall health as hospitals around the world struggle with an overload of patients in critical condition. Knowledge about chronic conditions such as diabetes mellitus therefore assists healthcare workers in making clinical decisions about patient care.

The purpose of this challenge is to “determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus.” The data has been collected from the first 24 hours of intensive care and labeled training data has been used for model development. The testing data was also provided by the organizing committee on Kaggle.

Citation: “WiDS Datathon 2021.” Kaggle, www.kaggle.com/c/widsdatathon2021.

For more information on our Kaggle Competition entry, please click here.

The code for this project was inspired by the Breast Cancer Wisconsin case study completed by Professor Yuxiao Huang.

Importing Packages

import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import csv
import warnings
warnings.filterwarnings('ignore')

Setting Display Options

pd.set_option("display.float.format", lambda x: "%.2f" % x)

Importing Tensorflow

# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x 
import tensorflow as tf
from tensorflow import keras

Setting the Random Seed

# The random seed
random_seed = 100

# Set random seed in tensorflow
tf.random.set_seed(random_seed)

# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)

Mounting drive/Setting Directory

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

abspath = '/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project/'

Setting the working directory to the absolute path and importing Professor's shallow utilities file so it can be used through this project.

# Change working directory to the absolute path
%cd $abspath

# Import the shallow utilities
%run pmlm_utilities_shallow.ipynb

/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project

Experiment:¶

It was important to determine which model would best suit our dataset before starting any data preprocessing. While exploring the dataset, it was clear that the target variable for this project was diabetes_mellitus, which takes the value 0 (the patient does not have diabetes) or 1 (the patient has diabetes). Because the target takes a finite set of values, the problem is best treated as classification. Specifically, since there are only two target classes and each sample belongs to exactly one of them, a binary logistic regression classifier was chosen for this project.

This project was initially completed as a Kaggle competition where the test data was provided. The code that was run for the purposes of the Kaggle competition produced an accuracy score of .74. As a comparison, the provided Kaggle test data was omitted and new test data was extracted from the training dataset to observe how the two accuracy scores compared for the purposes of this final assignment.

With the model selected, the first step in preparing the dataset was to complete a thorough data preprocessing. There were ten steps included for the purposes of this project’s data preprocessing:

Loading the data
Splitting the data
Handling uncommon features
Handling identifiers
Handling date time variables
Handling missing data
Encoding the data
Splitting the feature and target
Scaling the data
Handling class imbalance

In keeping with a fair train/validation/test split, the training data was split 60 (training) : 40 (validation/testing). The validation/testing portion was then split 50:50. The full training dataset was therefore split as follows:

Training - 60%
Validation - 20%
Testing - 20%

For this project, the following files are pre-loaded from Professor Yuxiao Huang's Github repository:

Data Preprocessing¶

1. Loading the Data¶

Loading the training data and making a copy of the raw training data. Also dropping the unnamed column from the raw data.

# Load the raw training data
df_raw = pd.read_csv(abspath + 'TrainingWiDS2021.csv', header=0)

# Remove the unnamed column
df_raw = df_raw.drop(columns='Unnamed: 0')

# Make a copy of df_raw
df = df_raw.copy(deep=True)

Setting the target variable, which in the case of this project would be Diabetes Mellitus since we are trying to find out if individuals will have Diabetes Mellitus or not.

# Get the name of the target
target = 'diabetes_mellitus'

Getting the dimensions of the training data.

pd.DataFrame([[df.shape[0], df.shape[1]]], columns=['# rows', '# columns'])

Previewing the first 5 rows of the training data.

df.head()

2. Splitting the Data¶

Using train_test_split from sklearn, the data is split into 60% training data and 40% held-out data, stratified on the target.

The held-out data is then split evenly into validation and test data, again stratified on the target.

from sklearn.model_selection import train_test_split

# Divide the data into training (60%) and test (40%)
df_train, df_test = train_test_split(df, 
                                     train_size=0.6, 
                                     random_state=random_seed, 
                                     stratify=df[target])

# Divide the test data into validation (50%) and test (50%)
df_val, df_test = train_test_split(df_test, 
                                   train_size=0.5, 
                                   random_state=random_seed, 
                                   stratify=df_test[target])

# Reset the index
df_train, df_val, df_test = df_train.reset_index(drop=True), df_val.reset_index(drop=True), df_test.reset_index(drop=True)

Getting the dimensions of df_train after the split from above.

# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

Getting the dimensions of df_val after the split from above.

# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])
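The two-stage split above can be sanity-checked on a small synthetic frame. The sketch below is only illustrative: the `toy_*` names and the synthetic data (1,000 rows, 22% positives) are assumptions, not part of the project data. It shows that splitting 60/40 and then halving the remainder gives 60/20/20, and that stratifying keeps the positive rate the same in every subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1,000 rows, 22% positives
rng = np.random.RandomState(100)
toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'diabetes_mellitus': (np.arange(1000) % 50 < 11).astype(int),
})

# Stage 1: 60% training, 40% held out, stratified on the target
toy_train, toy_rest = train_test_split(
    toy, train_size=0.6, random_state=100, stratify=toy['diabetes_mellitus'])

# Stage 2: split the held-out 40% evenly into validation and test
toy_val, toy_test = train_test_split(
    toy_rest, train_size=0.5, random_state=100, stratify=toy_rest['diabetes_mellitus'])

# Summarize the size and positive rate of each subset
split_summary = pd.DataFrame(
    {'rows': [len(toy_train), len(toy_val), len(toy_test)],
     'positive rate': [toy_train['diabetes_mellitus'].mean(),
                       toy_val['diabetes_mellitus'].mean(),
                       toy_test['diabetes_mellitus'].mean()]},
    index=['train', 'val', 'test'])
print(split_summary)
```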

3. Handling Uncommon Features¶

Using the common_var_checker to print the common variables between the df_train, df_val, df_test, and the target.

# Call common_var_checker
# See the implementation in pmlm_utilities.ipynb
df_common_var = common_var_checker(df_train, df_val, df_test, target)

# Print df_common_var
df_common_var
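The implementation of common_var_checker lives in the utilities notebook and is not shown here. As a rough idea of what such a check can look like, the sketch below uses a hypothetical stand-in, `common_columns`, built on NumPy set operations. The function name, its exact behavior (the target is allowed to be missing from the test frame), and the toy frames are all assumptions.

```python
import numpy as np
import pandas as pd

def common_columns(df_a, df_b, df_c, target_name):
    """Hypothetical stand-in for a common-variable check.

    Returns the columns shared by all three frames; the target is kept
    even if the third frame (e.g. unlabeled test data) does not have it.
    """
    shared_ab = np.intersect1d(df_a.columns, df_b.columns)
    c_plus_target = np.union1d(df_c.columns, [target_name])
    return pd.DataFrame(np.intersect1d(shared_ab, c_plus_target),
                        columns=['common var'])

# Toy frames (assumption): one column exists only in the first frame
frame_a = pd.DataFrame(columns=['age', 'bmi', 'only_in_train', 'y'])
frame_b = pd.DataFrame(columns=['age', 'bmi', 'y'])
frame_c = pd.DataFrame(columns=['age', 'bmi'])

common = common_columns(frame_a, frame_b, frame_c, 'y')

# Same pattern as in the notebook: whatever is not common gets dropped
uncommon_in_a = np.setdiff1d(frame_a.columns, common['common var'])
print(list(common['common var']), list(uncommon_in_a))
```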

Getting the uncommon features

Getting the features in the training data but not in the validation or test data.

uncommon_feature_train_not_val_test = np.setdiff1d(df_train.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_train_not_val_test, columns=['uncommon feature'])

Getting the features in the validation data but not in the training or test data.

uncommon_feature_val_not_train_test = np.setdiff1d(df_val.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_val_not_train_test, columns=['uncommon feature'])

Getting the features in the test data but not in the training or validation data.

uncommon_feature_test_not_train_val = np.setdiff1d(df_test.columns, df_common_var['common var'])

# Print the uncommon features
pd.DataFrame(uncommon_feature_test_not_train_val, columns=['uncommon feature'])

Dropping the uncommon features

Using the drop function to remove the uncommon features from the training data, then previewing df_train.

# Remove the uncommon features from the training data
df_train = df_train.drop(columns=uncommon_feature_train_not_val_test)

# Print the first 5 rows of df_train
df_train.head()

Using the drop function to remove the uncommon features from the validation data, then previewing df_val.

# Remove the uncommon features from the validation data
df_val = df_val.drop(columns=uncommon_feature_val_not_train_test)

# Print the first 5 rows of df_val
df_val.head()

Using the drop function to remove the uncommon features from the test data, then previewing df_test.

# Remove the uncommon features from the test data
df_test = df_test.drop(columns=uncommon_feature_test_not_train_val)

# Print the first 5 rows of df_test
df_test.head()

4. Handling Identifiers¶

Combining the dataframes

Combining df_train, df_val, and df_test using a concat function before dropping the identifiers.

# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

Getting the identifiers

Using the id_checker to get the id column from the combined dataframe we made above.

# Call id_checker on df
# See the implementation in pmlm_utilities.ipynb
df_id = id_checker(df)

# Print the first 5 rows of df_id
df_id.head()
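The id_checker helper is defined in the utilities notebook. A common heuristic for this kind of check, shown here as a hypothetical stand-in named `find_identifier_columns`, is to flag non-float columns whose values are all distinct. Whether the real helper uses exactly this rule is an assumption.

```python
import pandas as pd

def find_identifier_columns(df):
    """Hypothetical stand-in for an identifier check.

    A column is treated as an identifier if it is not float-typed and
    every non-missing value in it is unique.
    """
    id_cols = [col for col in df.columns
               if not pd.api.types.is_float_dtype(df[col])
               and df[col].nunique(dropna=True) == df[col].notna().sum()]
    return df[id_cols]

# Toy frame (assumption) using a few column names from the dataset
toy_ids = pd.DataFrame({
    'encounter_id': [214826, 246060, 276985, 262220],  # unique per row
    'hospital_id': [118, 81, 118, 118],                # repeats, so not an id
    'age': [68.0, 77.0, 25.0, 81.0],                   # unique but float
})

id_frame = find_identifier_columns(toy_ids)
print(list(id_frame.columns))
```

Excluding float columns avoids flagging continuous measurements that happen to be unique in a small sample.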

Removing the identifiers

Use the drop function to remove the identifiers from the df_train, df_val, and df_test.

import numpy as np

# Remove identifiers from df_train
df_train.drop(columns=np.intersect1d(df_id.columns, df_train.columns), inplace=True)

# Remove identifiers from df_val
df_val.drop(columns=np.intersect1d(df_id.columns, df_val.columns), inplace=True)

# Remove identifiers from df_test
df_test.drop(columns=np.intersect1d(df_id.columns, df_test.columns), inplace=True)

Previewing df_train, df_val, and df_test after dropping identifiers

# Print the first 5 rows of df_train
df_train.head()

# Print the first 5 rows of df_val
df_val.head()

# Print the first 5 rows of df_test
df_test.head()

5. Handling Date/Time Variables¶

Setting the date time variables from the data. In this case, there are no date time variables, hence the brackets remain empty.

# Get the date time variables
datetime_vars = []

Calling the datetime_transformer on df_train, df_val, and df_test.

# Call datetime_transformer on df_train
# See the implementation in pmlm_utilities.ipynb
df_train = datetime_transformer(df_train, datetime_vars)

# Print the first 5 rows of df_train
df_train.head()

# Call datetime_transformer on df_val
# See the implementation in pmlm_utilities.ipynb
df_val = datetime_transformer(df_val, datetime_vars)

# Print the first 5 rows of df_val
df_val.head()

# Call datetime_transformer on df_test
# See the implementation in pmlm_utilities.ipynb
df_test = datetime_transformer(df_test, datetime_vars)

# Print the first 5 rows of df_test
df_test.head()

6. Handling Missing Data¶

Combining the dataframes

df = pd.concat([df_train, df_val, df_test], sort=False)

Then, the nan_checker is called to check for missing values.

# Call nan_checker on df
# See the implementation in pmlm_utilities.ipynb
df_nan = nan_checker(df)

# Print df_nan
df_nan

Getting the data types of the nan values in the combined dataframe, df.

pd.DataFrame(df_nan['dtype'].unique(), columns=['dtype'])

# Get the variables with missing values, their proportion of missing values and data type
df_miss = df_nan[df_nan['dtype'] == 'float64'].reset_index(drop=True)

# Print df_miss
df_miss
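For readers without the utilities notebook, a table like df_nan can be approximated with plain pandas. The helper `missing_summary` below is a hypothetical stand-in, and the toy frame is an assumption; the column names 'var', 'proportion' and 'dtype' are chosen to resemble the notebook output, but the real helper's output may differ.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Hypothetical stand-in: one row per column that has missing values."""
    proportion = df.isna().mean()
    summary = pd.DataFrame({'var': proportion.index,
                            'proportion': proportion.values,
                            'dtype': df.dtypes.astype(str).values})
    summary = summary[summary['proportion'] > 0]
    return summary.sort_values('proportion', ascending=False).reset_index(drop=True)

# Toy frame (assumption): two columns with gaps, one complete column
toy_missing = pd.DataFrame({
    'bmi': [22.73, np.nan, 31.95, np.nan],
    'ethnicity': ['Caucasian', None, 'Other/Unknown', 'Caucasian'],
    'age': [68.0, 77.0, 25.0, 81.0],
})

toy_nan = missing_summary(toy_missing)
print(toy_nan)
```

Note that the notebook keeps only the float64 rows of this table, so any non-float column with missing values is not passed to the imputer in the next step.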

# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]

# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])

The missing data is imputed with the most_frequent values.

from sklearn.impute import SimpleImputer

# If there are missing values
if len(df_miss['var']) > 0:
    # The SimpleImputer
    si = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

    # Impute the variables with missing values in df_train, df_val and df_test 
    df_train[df_miss['var']] = si.fit_transform(df_train[df_miss['var']])
    df_val[df_miss['var']] = si.transform(df_val[df_miss['var']])
    df_test[df_miss['var']] = si.transform(df_test[df_miss['var']])
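The imputer above is fitted on the training data only and then reused for the validation and test data. A tiny example with made-up values (the `imp_*` frames are assumptions) shows the effect: the gap in the validation column is filled with the training mode, even though the validation column has a different mode of its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values: the training mode is 4.0, the validation mode is 1.0
imp_train = pd.DataFrame({'gcs_eyes_apache': [4.0, 4.0, 3.0, np.nan]})
imp_val = pd.DataFrame({'gcs_eyes_apache': [np.nan, 1.0, 1.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Learn the most frequent value from the training data only
train_filled = imputer.fit_transform(imp_train)

# Reuse the training statistic for the validation data
val_filled = imputer.transform(imp_val)

print(imputer.statistics_)   # value learned from the training column
print(val_filled.ravel())    # first entry is filled with the training mode
```

Fitting on the training data alone keeps information from the validation and test sets out of the preprocessing.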

7. Encoding Categorical Data¶

The training, validation and testing data is combined.

Then, the cat_var_checker is called.

# Combine df_train, df_val and df_test
df = pd.concat([df_train, df_val, df_test], sort=False)

# Print the unique data type of variables in df
pd.DataFrame(df.dtypes.unique(), columns=['dtype'])

# Call cat_var_checker on df
# See the implementation in pmlm_utilities.ipynb
df_cat = cat_var_checker(df)

# Print the dataframe
df_cat

# One-hot-encode the categorical features in the combined data
df = pd.get_dummies(df, columns=np.setdiff1d(df_cat['var'], [target]))

# Print the first 5 rows of df
df.head()
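One reason to one-hot encode the combined frame, rather than each subset separately, is that a subset may not contain every category. The toy frames below are assumptions used to illustrate this: encoding the smaller frame alone drops a column that the combined encoding keeps.

```python
import pandas as pd

# Toy frames (assumption): the second frame never contains gender 'M'
ohe_a = pd.DataFrame({'age': [68.0, 25.0], 'gender': ['M', 'F']})
ohe_b = pd.DataFrame({'age': [81.0], 'gender': ['F']})

# Encoding the second frame alone yields no 'gender_M' column
separate_cols = list(pd.get_dummies(ohe_b, columns=['gender']).columns)

# Encoding the combined frame gives both subsets the same columns
ohe_all = pd.concat([ohe_a, ohe_b], sort=False)
ohe_encoded = pd.get_dummies(ohe_all, columns=['gender'])

# Split back by position, as the notebook does with iloc
enc_a = ohe_encoded.iloc[:len(ohe_a), :]
enc_b = ohe_encoded.iloc[len(ohe_a):, :]

print(separate_cols)
print(list(enc_b.columns))
```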

The categorical features were one-hot encoded above; the target is encoded using LabelEncoder().

from sklearn.preprocessing import LabelEncoder

# The LabelEncoder
le = LabelEncoder()

# Encode categorical target in the combined data
df[target] = le.fit_transform(df[target])

# Print the first 5 rows of df
df.head()

# Separating the training data
df_train = df.iloc[:df_train.shape[0], :]

# Separating the validation data
df_val = df.iloc[df_train.shape[0]:df_train.shape[0] + df_val.shape[0], :]

# Separating the test data
df_test = df.iloc[df_train.shape[0] + df_val.shape[0]:, :]

# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

# Print the dimension of df_test
pd.DataFrame([[df_test.shape[0], df_test.shape[1]]], columns=['# rows', '# columns'])

8. Splitting the Feature & the Target¶

The features and the target are split.

# Get the feature matrix
X_train = df_train[np.setdiff1d(df_train.columns, [target])].values
X_val = df_val[np.setdiff1d(df_val.columns, [target])].values
X_test = df_test[np.setdiff1d(df_test.columns, [target])].values

# Get the target vector
y_train = df_train[target].values
y_val = df_val[target].values
y_test = df_test[target].values

9. Scaling the Data¶

The training, validation and testing data is standardized using StandardScaler().

from sklearn.preprocessing import StandardScaler

# The StandardScaler
ss = StandardScaler()

# Standardize the training data
X_train = ss.fit_transform(X_train)

# Standardize the validation data
X_val = ss.transform(X_val)

# Standardize the test data
X_test = ss.transform(X_test)
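Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation come from the training data. The sketch below uses small made-up arrays (the `sc_*` names are assumptions) to check that StandardScaler applied to validation data matches this formula computed by hand with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column means of the training array are 2.0 and 30.0
sc_train = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 60.0]])
sc_val = np.array([[2.0, 30.0]])

scaler = StandardScaler()
sc_train_std = scaler.fit_transform(sc_train)  # learns mean and std from training data
sc_val_std = scaler.transform(sc_val)          # reuses the training mean and std

# The same transformation written out by hand (population std, ddof=0)
manual_val_std = (sc_val - sc_train.mean(axis=0)) / sc_train.std(axis=0)

print(sc_val_std, manual_val_std)
```

Because the validation row equals the training means, it maps to zeros in both columns.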

10. Handling Class Imbalance¶

SMOTE is used to handle class imbalance.

pd.Series(y_train).value_counts()

0    61203
1    16891
dtype: int64

from imblearn.over_sampling import SMOTE

# The SMOTE
smote = SMOTE(random_state=random_seed)

# Augment the training data
X_smote_train, y_smote_train = smote.fit_resample(X_train, y_train)

pd.Series(y_smote_train).value_counts()

1    61203
0    61203
dtype: int64
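The counts above show that the minority class grew from 16,891 to 61,203 samples, so 44,312 synthetic samples were generated. The core idea of SMOTE is to create each new point on the line segment between a minority sample and one of its nearest minority neighbors. The NumPy function below, `smote_like_samples`, is a simplified illustration of that idea under stated assumptions (brute-force neighbor search, random toy data); it is not the imblearn implementation used in the notebook.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=100):
    """Simplified SMOTE-style interpolation (illustrative, not imblearn)."""
    rng = np.random.RandomState(seed)
    n = X_minority.shape[0]
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point
    base_idx = rng.randint(0, n, size=n_new)
    neigh_idx = neighbors[base_idx, rng.randint(0, k, size=n_new)]

    # Interpolate: new = base + lambda * (neighbor - base), lambda in [0, 1)
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_minority[base_idx] + lam * (X_minority[neigh_idx] - X_minority[base_idx])

# Toy minority class (assumption): 20 points in two dimensions
X_minority_toy = np.random.RandomState(0).normal(size=(20, 2))
synthetic = smote_like_samples(X_minority_toy, n_new=30)
print(synthetic.shape)
```

Every synthetic point lies between two existing minority points, so it stays inside the range already covered by the minority class.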

X_smote_sub1, X_smote_sub2, y_smote_sub1, y_smote_sub2 = train_test_split(X_smote_train, y_smote_train, test_size=0.01, random_state=42)

# Using 1% of our total training set allowed for a faster graphical representation runtime
X_train_sub1, X_train_sub2, y_train_sub1, y_train_sub2 = train_test_split(X_train, y_train, 
                                     test_size=0.01, 
                                     random_state=random_seed, 
                                     stratify=y_train)

y_smote_gen_ori_train = separate_generate_original(X_smote_sub2, y_smote_sub2, X_train, y_train, 1)

# Plot the scatter plot using TSNE
# See the implementation in pmlm_utilities.ipynb
plot_scatter_tsne(X_smote_sub2,
                  y_smote_gen_ori_train, 
                  [0, 1, 2],
                  ['0', '1', '+1'],
                  ['blue', 'green', 'red'],
                  ['o', '^', 's'],
                  'bottom-right',
                  abspath,
                  'scatter_plot_smote.pdf',
                  random_seed)

Hyperparameter Tuning¶

The goal of hyperparameter tuning is to find the parameter values that lead to a higher accuracy score and a lower validation loss. The final parameters tested for this model are shown below:

# Change working directory to the absolute path of the shallow models folder
%cd $abspath

# Import the shallow models
%run pmlm_models_shallow.ipynb

/content/drive/My Drive/Colab Notebooks/teaching/gwu/machine_learning_I/project

Creating the dictionary of the models¶

In the dictionary:

1. the key is the acronym of the model
2. the value is the model

from sklearn.linear_model import LogisticRegression

models = {'lr': LogisticRegression(class_weight='balanced', random_state=random_seed),
          'lr_mbgd': LogisticRegression_MBGD()}

Creating the dictionary of the pipelines¶

In the dictionary:

1. the key is the acronym of the model
2. the value is the pipeline, which, for now, only includes the model

from sklearn.pipeline import Pipeline

pipes = {}

for acronym, model in models.items():
    pipes[acronym] = Pipeline([('model', model)])

Getting the predefined split cross-validator¶

# Get the:
# feature matrix in the combined training and validation data
# target vector in the combined training and validation data
# PredefinedSplit
# See the implementation in pmlm_utilities.ipynb
X_train_val, y_train_val, ps = get_train_val_ps(X_smote_train, y_smote_train, X_val, y_val)
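The get_train_val_ps helper comes from the utilities notebook. A predefined split of this kind can be built with scikit-learn's PredefinedSplit by marking training rows with -1 (never used for validation) and validation rows with 0 (the single validation fold). The function `train_val_predefined_split` below is a hypothetical stand-in, and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

def train_val_predefined_split(X_tr, y_tr, X_va, y_va):
    """Hypothetical stand-in: stack train and validation, mark which is which."""
    X_all = np.vstack((X_tr, X_va))
    y_all = np.concatenate((y_tr, y_va))
    # -1: always in the training part; 0: belongs to validation fold 0
    test_fold = np.concatenate((np.full(len(X_tr), -1),
                                np.zeros(len(X_va), dtype=int)))
    return X_all, y_all, PredefinedSplit(test_fold)

# Toy arrays (assumption): 4 training rows, 2 validation rows
X_tr_toy, y_tr_toy = np.arange(8.0).reshape(4, 2), np.array([0, 1, 0, 1])
X_va_toy, y_va_toy = np.arange(4.0).reshape(2, 2), np.array([1, 0])

X_all_toy, y_all_toy, ps_toy = train_val_predefined_split(
    X_tr_toy, y_tr_toy, X_va_toy, y_va_toy)

# There is exactly one split: fit on rows 0-3, score on rows 4-5
train_idx, val_idx = next(ps_toy.split())
print(ps_toy.get_n_splits(), train_idx, val_idx)
```

With this cross-validator, GridSearchCV scores every candidate on the same fixed validation rows instead of rotating folds.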

GridSearchCV¶

Creating the dictionary of the parameter grids. In the dictionary:

1. the key is the acronym of the model
2. the value is the parameter grid of the model

param_grids = {}

The parameter grid for Logistic Regression

The hyperparameters we want to fine-tune are:

tol_grid
C_grid

The parameter grid for Logistic Regression Mini-Batch Gradient Descent

The hyperparameters we want to fine-tune are:

eta_grid
alpha_grid

# The parameter grid of tol
tol_grid = [.5 * 10 ** -3, 10 ** -2, 2 * 10 ** -1]  # Scaled some powers of ten to see whether the score changed

# The parameter grid of C
C_grid = [.1, 1, 10]

# Update param_grids
param_grids['lr'] = [{'model__tol': tol_grid,
                      'model__C': C_grid}]

# The parameter grid of eta
eta_grid = [10 ** -3, 10 ** -2, 10 ** -1] 

# The parameter grid of alpha
alpha_grid = [0.1, 1, 10]

# Update param_grids
param_grids['lr_mbgd'] = [{'model__eta': eta_grid,
                           'model__alpha': alpha_grid}]
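Each grid above has three values for each of two hyperparameters, so each model is tried with 3 x 3 = 9 combinations. The sketch below counts them with scikit-learn's ParameterGrid; the `grid_*` variables simply restate the values from the notebook so that the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# The same grids as above, restated so this block is self-contained
grid_lr = [{'model__tol': [.5 * 10 ** -3, 10 ** -2, 2 * 10 ** -1],
            'model__C': [.1, 1, 10]}]
grid_lr_mbgd = [{'model__eta': [10 ** -3, 10 ** -2, 10 ** -1],
                 'model__alpha': [0.1, 1, 10]}]

# Number of hyperparameter combinations GridSearchCV will try per model
n_lr = len(ParameterGrid(grid_lr))
n_lr_mbgd = len(ParameterGrid(grid_lr_mbgd))

# List the first few combinations for the logistic regression grid
first_combinations = list(ParameterGrid(grid_lr))[:3]
print(n_lr, n_lr_mbgd, first_combinations)
```

With a single predefined train/validation split, each combination is fitted once during the search, followed by one final refit of the best combination.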

Creating the directory for the cv results produced by GridSearchCV

import os

# Create the output directory if it does not already exist
directory = os.path.dirname(abspath + '/result/dm2/cv_results/GridSearchCV/')
if not os.path.exists(directory):
    os.makedirs(directory)

Tuning the hyperparameters

The code below shows how to fine-tune the hyperparameters.

from sklearn.model_selection import GridSearchCV

# The list of [best_score_, best_params_, best_estimator_] obtained by GridSearchCV
best_score_params_estimator_gs = []

# For each model
for acronym in pipes.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipes[acronym],
                      param_grid=param_grids[acronym],
                      scoring='f1_weighted', # changed from 'f1_macro', helped drastically increase the accuracy score
                      n_jobs=2,
                      cv=ps,
                      return_train_score=True)
        
    # Fit the pipeline
    gs = gs.fit(X_train_val, y_train_val)
    
    # Update best_score_params_estimator_gs
    best_score_params_estimator_gs.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
    
    # Sort cv_results in ascending order of 'rank_test_score' and 'std_test_score'
    cv_results = pd.DataFrame.from_dict(gs.cv_results_).sort_values(by=['rank_test_score', 'std_test_score'])
    
    # Get the important columns in cv_results
    important_columns = ['rank_test_score',
                         'mean_test_score', 
                         'std_test_score', 
                         'mean_train_score', 
                         'std_train_score',
                         'mean_fit_time', 
                         'std_fit_time',                        
                         'mean_score_time', 
                         'std_score_time']
    
    # Move the important columns ahead
    cv_results = cv_results[important_columns + sorted(list(set(cv_results.columns) - set(important_columns)))]

    # Write cv_results file
    cv_results.to_csv(path_or_buf=abspath + 'result/dm2/cv_results/GridSearchCV/' + acronym + '.csv', index=False)

# Sort best_score_params_estimator_gs in descending order of the best_score_
best_score_params_estimator_gs = sorted(best_score_params_estimator_gs, key=lambda x : x[0], reverse=True)

# Print best_score_params_estimator_gs
pd.DataFrame(best_score_params_estimator_gs, columns=['best_score', 'best_param', 'best_estimator'])

Model Selection¶

Here we will select best_estimator_gs as the best model.

# Get the best_score, best_params and best_estimator obtained by GridSearchCV
best_score_gs, best_params_gs, best_estimator_gs = best_score_params_estimator_gs[0]

Model Evaluation¶

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score

# Get the prediction on the testing data using best_model
y_test_pred = best_estimator_gs.predict(X_test)

# Get the precision, recall, fscore, support
precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_test_pred)

# Get the auc
auc = roc_auc_score(y_test, y_test_pred)

# Get the dataframe of precision, recall, fscore and auc
pd.DataFrame([[precision, recall, fscore, auc]], columns=['Precision', 'Recall', 'F1-score', 'AUC'])
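The AUC above is computed from hard 0/1 predictions. In that case the ROC curve has a single interior point and the AUC equals the balanced accuracy, (TPR + TNR) / 2. If the selected model exposes predicted probabilities, passing those scores to roc_auc_score instead uses the full ranking. The sketch below illustrates both on synthetic data with a plain scikit-learn LogisticRegression. The data and the `clf` model are assumptions, and whether the custom LogisticRegression_MBGD class offers predict_proba is not known from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption), roughly 22% positives
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.78, 0.22], random_state=100)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy)

clf = LogisticRegression(class_weight='balanced', random_state=100,
                         max_iter=1000).fit(X_fit, y_fit)

# AUC from hard labels: equivalent to balanced accuracy
auc_from_labels = roc_auc_score(y_eval, clf.predict(X_eval))
bal_acc = balanced_accuracy_score(y_eval, clf.predict(X_eval))

# AUC from predicted probabilities of the positive class
auc_from_scores = roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])

print(auc_from_labels, bal_acc, auc_from_scores)
```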

Conclusion¶

After running the model, it was concluded that the highest accuracy score obtained was .79. This final score is .08 higher than our original baseline score of .71, which was obtained by running this model without changing any parameters from the code provided in the Breast Cancer Wisconsin case study. A few parameters were changed during the trial-and-error period of hyperparameter tuning for this project:

Random seed
Scoring factor
Tol grid
C grid
Eta grid
Alpha grid

Through trial and error, we found that changing the random seed or any of the GridSearchCV parameter grids alone did not increase the score. However, changing the scoring from f1_macro changed our reported score significantly. Initially, the scoring was changed to "f1", producing a score of .56. Next, the scoring was changed to f1_weighted. This produced our final score of .79, a significant increase over the initial .71 baseline.
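The three scoring options mentioned above average the per-class F1 values differently, which is one reason the reported number can move when only the scoring string changes. The toy labels below are assumptions chosen so the arithmetic is easy to follow: class 0 has F1 = 0.875 and class 1 has F1 = 0.5, and the same predictions give three different scores.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (assumption): 8 negatives, 2 positives
y_true_toy = np.array([0] * 8 + [1] * 2)
# 7 negatives correct, 1 false positive, 1 true positive, 1 false negative
y_pred_toy = np.array([0] * 7 + [1] + [1, 0])

# 'f1': F1 of the positive class only
f1_binary = f1_score(y_true_toy, y_pred_toy, average='binary')

# 'f1_macro': unweighted mean of the per-class F1 values
f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro')

# 'f1_weighted': per-class F1 values weighted by class frequency
f1_weighted = f1_score(y_true_toy, y_pred_toy, average='weighted')

print(f1_binary, f1_macro, f1_weighted)
```

On imbalanced data, the weighted average is dominated by the majority class, so it is usually the highest of the three when the majority class is predicted well.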

Even though parameter tuning alone did not vastly increase the score, the final score of .79 is still acceptable because all aspects of data preprocessing and model selection were properly applied. Therefore, it can be concluded that this model has a .79 accuracy score and a precision score between .48 and .91.

In comparison, the live Kaggle competition, in which the provided test data was used, gave a score of .74. That score was obtained using the "f1_macro" scoring that was initially tested. Therefore, between the two attempts of this model, it can be concluded that this final version of our Diabetes Mellitus prediction model is the more accurate one.

	encounter_id	hospital_id	age	bmi	elective_surgery	ethnicity	gender	height	hospital_admit_source	icu_admit_source	icu_id	icu_stay_type	icu_type	pre_icu_los_days	weight	albumin_apache	apache_2_diagnosis	apache_3j_diagnosis	apache_post_operative	bilirubin_apache	bun_apache	creatinine_apache	fio2_apache	gcs_eyes_apache	gcs_motor_apache	gcs_unable_apache	gcs_verbal_apache	glucose_apache	heart_rate_apache	hematocrit_apache	intubated_apache	map_apache	paco2_apache	paco2_for_ph_apache	pao2_apache	ph_apache	resprate_apache	sodium_apache	...	h1_hemaglobin_max	h1_hemaglobin_min	h1_hematocrit_max	h1_hematocrit_min	h1_inr_max	h1_inr_min	h1_lactate_max	h1_lactate_min	h1_platelets_max	h1_platelets_min	h1_potassium_max	h1_potassium_min	h1_sodium_max	h1_sodium_min	h1_wbc_max	h1_wbc_min	d1_arterial_pco2_max	d1_arterial_pco2_min	d1_arterial_ph_max	d1_arterial_ph_min	d1_arterial_po2_max	d1_arterial_po2_min	d1_pao2fio2ratio_max	d1_pao2fio2ratio_min	h1_arterial_pco2_max	h1_arterial_pco2_min	h1_arterial_ph_max	h1_arterial_ph_min	h1_arterial_po2_max	h1_arterial_po2_min	h1_pao2fio2ratio_max	h1_pao2fio2ratio_min	diabetes_mellitus
0	214826	118	68.00	22.73	0	Caucasian	M	180.30	Floor	Floor	92	admit	CTICU	0.54	73.90	2.30	113.00	502.01	0	0.40	31.00	2.51	nan	3.00	6.00	0.00	4.00	168.00	118.00	27.40	0	40.00	nan	nan	nan	nan	36.00	134.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1
1	246060	81	77.00	27.42	0	Caucasian	F	160.00	Floor	Floor	90	admit	Med-Surg ICU	0.93	70.20	nan	108.00	203.01	0	nan	9.00	0.56	1.00	1.00	3.00	0.00	1.00	145.00	120.00	36.90	0	46.00	37.00	37.00	51.00	7.45	33.00	145.00	...	11.30	11.30	36.90	36.90	1.30	1.30	3.50	3.50	557.00	557.00	4.20	4.20	145.00	145.00	12.70	12.70	37.00	37.00	7.45	7.45	51.00	51.00	54.80	51.00	37.00	37.00	7.45	7.45	51.00	51.00	51.00	51.00	1
2	276985	118	25.00	31.95	0	Caucasian	F	172.70	Emergency Department	Accident & Emergency	93	admit	Med-Surg ICU	0.00	95.30	nan	122.00	703.03	0	nan	nan	nan	nan	3.00	6.00	0.00	5.00	nan	102.00	nan	0	68.00	nan	nan	nan	nan	37.00	nan	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0
3	262220	118	81.00	22.64	1	Caucasian	F	165.10	Operating Room	Operating Room / Recovery	92	admit	CTICU	0.00	61.70	nan	203.00	1206.03	1	nan	nan	nan	0.60	4.00	6.00	0.00	5.00	185.00	114.00	25.90	1	60.00	30.00	30.00	142.00	7.39	4.00	nan	...	11.60	11.60	34.00	34.00	1.60	1.10	nan	nan	43.00	43.00	nan	nan	nan	nan	8.80	8.80	37.00	27.00	7.44	7.34	337.00	102.00	342.50	236.67	36.00	33.00	7.37	7.34	337.00	265.00	337.00	337.00	0
4	201746	33	19.00	nan	0	Caucasian	M	188.00	NaN	Accident & Emergency	91	admit	Med-Surg ICU	0.07	nan	nan	119.00	601.01	0	nan	nan	nan	nan	nan	nan	nan	nan	nan	60.00	nan	0	103.00	nan	nan	nan	nan	16.00	nan	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0

	common var
0	age
1	aids
2	albumin_apache
3	apache_2_diagnosis
4	apache_3j_diagnosis
...	...
175	temp_apache
176	urineoutput_apache
177	ventilated_apache
178	wbc_apache
179	weight

	encounter_id	hospital_id	age	bmi	elective_surgery	ethnicity	gender	height	hospital_admit_source	icu_admit_source	icu_id	icu_stay_type	icu_type	pre_icu_los_days	weight	albumin_apache	apache_2_diagnosis	apache_3j_diagnosis	apache_post_operative	arf_apache	bilirubin_apache	bun_apache	creatinine_apache	fio2_apache	gcs_eyes_apache	gcs_motor_apache	gcs_verbal_apache	glucose_apache	heart_rate_apache	hematocrit_apache	intubated_apache	map_apache	paco2_apache	paco2_for_ph_apache	pao2_apache	ph_apache	resprate_apache	sodium_apache	...	h1_hemaglobin_max	h1_hemaglobin_min	h1_hematocrit_max	h1_hematocrit_min	h1_inr_max	h1_inr_min	h1_lactate_max	h1_lactate_min	h1_platelets_max	h1_platelets_min	h1_potassium_max	h1_potassium_min	h1_sodium_max	h1_sodium_min	h1_wbc_max	h1_wbc_min	d1_arterial_pco2_max	d1_arterial_pco2_min	d1_arterial_ph_max	d1_arterial_ph_min	d1_arterial_po2_max	d1_arterial_po2_min	d1_pao2fio2ratio_max	d1_pao2fio2ratio_min	h1_arterial_pco2_max	h1_arterial_pco2_min	h1_arterial_ph_max	h1_arterial_ph_min	h1_arterial_po2_max	h1_arterial_po2_min	h1_pao2fio2ratio_max	h1_pao2fio2ratio_min	diabetes_mellitus
0	173141	39	54.00	15.44	0	Caucasian	F	162.60	Emergency Department	Accident & Emergency	616	admit	Neuro ICU	0.07	40.82	nan	119.00	601.01	0	0	nan	8.00	0.59	nan	4.00	6.00	5.00	137.00	90.00	33.40	0	50.00	nan	nan	nan	nan	33.00	142.00	...	nan	nan	nan	nan	0.90	0.90	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0
1	169925	204	87.00	23.73	1	Caucasian	M	184.20	Direct Admit	Operating Room / Recovery	431	admit	CSICU	0.18	80.50	nan	203.00	1206.03	1	0	nan	15.00	0.74	0.40	4.00	6.00	5.00	85.00	55.00	30.50	1	51.00	37.00	37.00	191.00	7.44	51.00	134.00	...	10.30	10.30	30.50	30.50	1.40	1.40	nan	nan	138.00	138.00	4.30	4.30	134.00	134.00	3.80	3.80	43.00	37.00	7.44	7.39	191.00	180.00	477.50	477.50	37.00	37.00	7.44	7.44	191.00	191.00	477.50	477.50	0
2	250908	109	21.00	30.00	0	Other/Unknown	F	157.48	Emergency Department	Accident & Emergency	429	admit	Cardiac ICU	0.31	74.41	nan	305.00	901.03	0	1	nan	37.00	7.36	nan	4.00	6.00	5.00	84.00	118.00	24.70	0	146.00	nan	nan	nan	nan	34.00	139.00	...	nan	nan	nan	nan	1.20	1.20	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0
3	155276	160	49.00	32.02	0	Caucasian	M	172.70	Emergency Department	Accident & Emergency	470	admit	Med-Surg ICU	0.28	95.50	3.30	113.00	501.02	0	0	0.30	31.00	2.20	nan	4.00	6.00	5.00	139.00	119.00	34.80	0	62.00	nan	nan	nan	nan	10.00	138.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	3.30	3.30	134.00	134.00	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1
4	247104	64	nan	29.87	0	African American	F	167.60	NaN	Floor	683	admit	Med-Surg ICU	3.11	83.90	3.10	302.00	109.12	0	0	1.20	16.00	1.13	nan	3.00	6.00	5.00	106.00	96.00	41.00	0	59.00	nan	nan	nan	nan	43.00	135.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0

	encounter_id	hospital_id	age	bmi	ethnicity	gender	height	hospital_admit_source	icu_admit_source	icu_id	icu_stay_type	icu_type	pre_icu_los_days	weight	albumin_apache	apache_2_diagnosis	apache_3j_diagnosis	bilirubin_apache	bun_apache	creatinine_apache	fio2_apache	gcs_eyes_apache	gcs_motor_apache	gcs_verbal_apache	glucose_apache	heart_rate_apache	hematocrit_apache	intubated_apache	map_apache	paco2_apache	paco2_for_ph_apache	pao2_apache	ph_apache	resprate_apache	sodium_apache	...	h1_hemaglobin_max	h1_hemaglobin_min	h1_hematocrit_max	h1_hematocrit_min	h1_inr_max	h1_inr_min	h1_lactate_max	h1_lactate_min	h1_platelets_max	h1_platelets_min	h1_potassium_max	h1_potassium_min	h1_sodium_max	h1_sodium_min	h1_wbc_max	h1_wbc_min	d1_arterial_pco2_max	d1_arterial_pco2_min	d1_arterial_ph_max	d1_arterial_ph_min	d1_arterial_po2_max	d1_arterial_po2_min	d1_pao2fio2ratio_max	d1_pao2fio2ratio_min	h1_arterial_pco2_max	h1_arterial_pco2_min	h1_arterial_ph_max	h1_arterial_ph_min	h1_arterial_po2_max	h1_arterial_po2_min	h1_pao2fio2ratio_max	h1_pao2fio2ratio_min
0	148622	86	67.00	37.56	Caucasian	M	168.00	NaN	Floor	1035	admit	CCU-CTICU	0.01	106.00	nan	114.00	102.01	nan	20.00	0.73	nan	4.00	6.00	5.00	80.00	53.00	36.00	0	111.00	nan	nan	nan	nan	27.00	137.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
1	203145	189	60.00	24.95	Caucasian	F	165.10	Direct Admit	Accident & Emergency	543	admit	Med-Surg ICU	0.30	68.00	2.20	302.00	109.16	0.30	7.40	0.50	nan	4.00	6.00	5.00	79.00	64.00	24.70	0	45.00	nan	nan	nan	nan	10.00	136.00	...	8.80	8.80	24.80	24.80	1.50	1.40	nan	nan	171.00	171.00	nan	nan	nan	nan	10.14	10.14	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
2	227137	13	50.00	25.81	Caucasian	M	177.80	NaN	Floor	708	admit	Med-Surg ICU	0.37	81.60	2.90	113.00	501.01	1.00	35.00	2.68	1.00	3.00	4.00	1.00	103.00	60.00	28.00	1	53.00	32.70	32.70	164.00	7.33	16.00	128.00	...	11.50	11.40	34.50	34.30	1.20	1.20	nan	nan	142.00	136.00	4.90	4.30	129.00	128.00	27.50	26.40	34.70	30.20	7.35	7.32	164.00	95.00	164.00	107.00	nan	nan	nan	nan	nan	nan	nan	nan
3	273697	21	58.00	34.58	Caucasian	M	180.30	Other Hospital	Other Hospital	512	admit	CCU-CTICU	0.01	112.40	nan	112.00	107.01	nan	10.00	0.94	nan	4.00	6.00	5.00	137.00	58.00	35.30	0	52.00	nan	nan	nan	nan	28.00	140.00	...	nan	nan	nan	nan	1.18	1.18	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
4	252290	60	21.00	19.53	Other/Unknown	F	152.40	Emergency Department	Accident & Emergency	538	admit	Med-Surg ICU	0.09	45.36	nan	123.00	702.01	nan	15.00	0.80	0.21	4.00	6.00	5.00	242.00	122.00	nan	0	72.00	24.70	24.70	110.00	7.27	13.00	152.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	25.90	24.70	7.31	7.27	112.00	110.00	533.33	523.81	nan	nan	nan	nan	nan	nan	nan	nan

	encounter_id	hospital_id	age	bmi	elective_surgery	ethnicity	gender	height	hospital_admit_source	icu_admit_source	icu_id	icu_stay_type	icu_type	pre_icu_los_days	weight	albumin_apache	apache_2_diagnosis	apache_3j_diagnosis	apache_post_operative	bilirubin_apache	bun_apache	creatinine_apache	fio2_apache	gcs_eyes_apache	gcs_motor_apache	gcs_verbal_apache	glucose_apache	heart_rate_apache	hematocrit_apache	intubated_apache	map_apache	paco2_apache	paco2_for_ph_apache	pao2_apache	ph_apache	resprate_apache	sodium_apache	...	h1_hemaglobin_max	h1_hemaglobin_min	h1_hematocrit_max	h1_hematocrit_min	h1_inr_max	h1_inr_min	h1_lactate_max	h1_lactate_min	h1_platelets_max	h1_platelets_min	h1_potassium_max	h1_potassium_min	h1_sodium_max	h1_sodium_min	h1_wbc_max	h1_wbc_min	d1_arterial_pco2_max	d1_arterial_pco2_min	d1_arterial_ph_max	d1_arterial_ph_min	d1_arterial_po2_max	d1_arterial_po2_min	d1_pao2fio2ratio_max	d1_pao2fio2ratio_min	h1_arterial_pco2_max	h1_arterial_pco2_min	h1_arterial_ph_max	h1_arterial_ph_min	h1_arterial_po2_max	h1_arterial_po2_min	h1_pao2fio2ratio_max	h1_pao2fio2ratio_min	hepatic_failure	immunosuppression	diabetes_mellitus
0	254797	79	79.00	29.12	1	Other/Unknown	M	167.60	Emergency Department	Operating Room / Recovery	337	transfer	Med-Surg ICU	0.43	81.80	1.30	308.00	1904.01	1	2.00	57.00	2.42	0.55	2.00	5.00	1.00	161.00	128.00	17.20	1	40.00	41.00	41.00	235.00	7.31	9.00	151.00	...	5.80	5.80	17.20	17.20	nan	nan	1.00	1.00	199.00	199.00	5.20	5.20	150.00	150.00	9.80	9.80	41.00	30.00	7.52	7.31	376.00	228.00	427.27	376.00	30.00	30.00	7.52	7.52	376.00	376.00	376.00	376.00	0	0	0
1	189332	118	63.00	29.34	1	Caucasian	M	172.70	Operating Room	Operating Room / Recovery	89	admit	Neuro ICU	0.00	87.50	nan	218.00	1505.02	1	nan	16.00	0.70	nan	4.00	6.00	5.00	148.00	60.00	35.10	0	198.00	nan	nan	nan	nan	4.00	140.00	...	nan	nan	nan	nan	1.00	1.00	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	44.00	44.00	7.39	7.39	115.00	115.00	287.50	287.50	nan	nan	nan	nan	nan	nan	nan	nan	0	0	1
2	265950	173	72.00	21.64	0	Caucasian	M	175.30	NaN	Other Hospital	962	admit	Med-Surg ICU	0.00	66.50	1.90	119.00	601.05	0	1.30	9.00	0.65	nan	1.00	5.00	1.00	162.00	130.00	28.80	0	113.00	nan	nan	nan	nan	43.00	138.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0	0	0
3	189173	110	29.00	20.80	0	Caucasian	F	165.10	Emergency Department	Accident & Emergency	969	admit	CCU-CTICU	0.02	56.70	nan	122.00	703.03	0	nan	6.00	0.58	0.25	2.00	3.00	1.00	92.00	106.00	nan	0	53.00	34.20	34.20	138.00	7.41	27.00	141.00	...	nan	nan	nan	nan	nan	nan	0.80	0.80	nan	nan	4.30	3.60	141.00	141.00	nan	nan	34.20	34.00	7.41	7.41	141.00	138.00	552.00	470.00	34.00	34.00	7.41	7.41	141.00	141.00	470.00	470.00	0	0	0
4	261367	118	84.00	20.28	0	Asian	M	157.50	Floor	Floor	97	admit	MICU	2.01	50.30	nan	301.00	410.01	0	nan	22.00	1.44	nan	4.00	5.00	2.00	154.00	127.00	21.90	0	68.00	nan	nan	nan	nan	47.00	146.00	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1	1	0

	var	proportion	dtype
0	h1_bilirubin_min	0.92	float64
1	h1_bilirubin_max	0.92	float64
2	h1_albumin_max	0.91	float64
3	h1_albumin_min	0.91	float64
4	h1_lactate_max	0.91	float64
...	...	...	...
155	d1_sysbp_min	0.00	float64
156	d1_heartrate_max	0.00	float64
157	d1_heartrate_min	0.00	float64
158	icu_admit_source	0.00	object
159	gender	0.00	object
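
A table like the one above (column name, proportion of missing values, dtype) could be produced along the following lines. This is a minimal sketch, not the notebook's own helper: the function name `nan_proportion_table`, the toy DataFrame, and the choice to drop columns with no missing values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def nan_proportion_table(df):
    """Hypothetical helper: one row per column that contains missing values."""
    # isna().mean() gives the fraction of missing entries in each column
    proportion = df.isna().mean()
    table = pd.DataFrame({
        "var": proportion.index,
        "proportion": proportion.values,
        "dtype": df.dtypes.values,
    })
    # Assumption: keep only columns that actually have missing values
    table = table[table["proportion"] > 0]
    # Most-missing columns first, with a fresh 0..n-1 index
    return table.sort_values("proportion", ascending=False).reset_index(drop=True)


# Toy data standing in for the real training set (values are made up)
toy = pd.DataFrame({
    "h1_bilirubin_min": [np.nan, np.nan, np.nan, 0.4],
    "gender": ["M", None, "F", "F"],
    "age": [67.0, 60.0, 50.0, 58.0],
})
result = nan_proportion_table(toy)
print(result)
```

Sorting by proportion makes it easy to see which variables might be dropped outright and which could be imputed.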

	best_score	best_param	best_estimator
0	0.79	{'model__C': 1, 'model__tol': 0.0005}	(LogisticRegression(C=1, class_weight='balance...
1	0.78	{'model__alpha': 0.1, 'model__eta': 0.001}	(LogisticRegression_MBGD(alpha=0.1, batch_size...
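
The `model__C` and `model__tol` keys in the table above follow scikit-learn's `step__parameter` naming for a pipeline step called `model`. The sketch below shows how such a search could be set up and summarized. It is an illustration under stated assumptions, not the notebook's code: the synthetic data, the `StandardScaler` step, the candidate values, the `roc_auc` scoring, and the 5-fold split are all assumed, and the custom `LogisticRegression_MBGD` estimator from the second row is not reproduced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and target (assumption)
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# The step name "model" is what produces the "model__" prefix in the grid keys
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Candidate values are illustrative only
param_grid = {"model__C": [0.1, 1, 10], "model__tol": [0.0005, 0.001]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="roc_auc",  # assumption: the notebook's metric is not shown here
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Summarize the winning configuration; the fitted pipeline itself stays
# available as search.best_estimator_
summary = pd.DataFrame(
    [[search.best_score_, search.best_params_]],
    columns=["best_score", "best_param"],
)
print(summary)
```

Running one such search per candidate model and stacking the summary rows gives a side-by-side comparison like the table above.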