Predicting Diabetes Using Logistic Regression with scikit-learn¶

The following analysis uses data from a hospital that tracks the prevalence of diabetes in its patients, along with measurements of suspected contributing factors. My goal is to help patients prevent diabetes, so I am going to create a logistic regression model that can predict when a patient is getting close to the diabetic range.

[Image: diabetes risk-factors graphic]

As the graphic shows, type 2 diabetes is a largely preventable disease, since most of the contributing factors can be controlled. The predictor variables in our dataset are:

  • Pregnancies
  • Glucose
  • BloodPressure
  • SkinThickness
  • Insulin
  • BMI
  • DiabetesPedigreeFunction - estimates the likelihood of diabetes based on the subject's age and family history of diabetes
  • Age
  • Outcome - 1 if the patient has diabetes and 0 if not

I will create multiple models and visualizations to find our best predictors of diabetes. Before I do this, I need to prepare the data.

Import Packages and Prepare the Data¶

To clean the data I will use pandas, numpy, and base Python. I will use scikit-learn to create the models and matplotlib/seaborn to visualize the results.

In [1]:
import warnings
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn import metrics 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

Let's take a look at our data.

In [2]:
# suppress warning output for a cleaner notebook
warnings.filterwarnings('ignore')

# load the dataset
diab_df = pd.read_csv("Hospital.csv")

diab_df.head()
Out[2]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

It is clear that Outcome will be our target variable; all of the other columns will be our features. Next, I will separate the features from the target and split the data into training and testing sets.
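Before modeling, it is also worth checking how balanced the target classes are, since accuracy alone is easy to misread on an imbalanced dataset. A quick sketch, reusing the diab_df loaded above:

In [ ]:
# count how many patients fall into each Outcome class;
# the test-set confusion matrices below suggest roughly a 2:1
# split of negatives (0) to positives (1)
diab_df.Outcome.value_counts()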

First Model¶

I will start with a single predictor. The DiabetesPedigreeFunction is supposed to be a good predictor of diabetes, so I will start with it.

In [3]:
diab_cols = ['DiabetesPedigreeFunction']

X = diab_df[diab_cols]  # features
y = diab_df.Outcome     # target variable

# hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# instantiate and fit the model
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

# predict on the test set and build a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[3]:
array([[124,   6],
       [ 55,   7]], dtype=int64)

Second Model¶

This model works well for patients without diabetes, classifying over 95% of the negative cases correctly. However, the number of false negatives is alarmingly high: 55 of the 62 actual diabetes cases are missed, which would not be acceptable. I am going to need to use a different variable, or several at once.
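As a quick check, here is a minimal sketch of how those per-class rates fall out of the confusion matrix, reusing the cnf_matrix computed above:

In [ ]:
# unpack the 2x2 confusion matrix: rows are actual classes, columns are predicted
tn, fp, fn, tp = cnf_matrix.ravel()

specificity = tn / (tn + fp)  # share of actual negatives classified correctly
recall = tp / (tp + fn)       # share of actual positives classified correctly

print(f"Specificity: {specificity:.3f}")  # about 0.954 here
print(f"Recall:      {recall:.3f}")       # about 0.113; most positives are missed

With only one in nine positives detected, adding more predictors is the obvious next step.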

In [4]:
# use three features this time
diab_cols = ['BMI', 'Age', 'DiabetesPedigreeFunction']

X = diab_df[diab_cols]  # features
y = diab_df.Outcome     # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# instantiate and fit the model
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

# predict on the test set and build a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[4]:
array([[121,   9],
       [ 45,  17]], dtype=int64)

Final Model and Visualization¶

Adding BMI and Age reduced the false negatives (from 55 to 45), but the rate is still poor. Next, I'm going to use all of the feature variables and analyze the results.

In [5]:
# use all available features
diab_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age', 'Glucose', 'BloodPressure', 'DiabetesPedigreeFunction']

X = diab_df[diab_cols]  # features
y = diab_df.Outcome     # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# instantiate and fit the model
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

# predict on the test set and build a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[5]:
array([[119,  11],
       [ 26,  36]], dtype=int64)

These are our best results by far: the model remains very good at identifying negatives and is now much better at identifying positives, detecting 36 of the 62 cases. I'm going to analyze this one further since it is yielding the best results.
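One quick way to see which features this model leans on is to inspect the fitted coefficients. A minimal sketch, reusing logreg and diab_cols from the cell above; note the features are unscaled, so coefficient magnitudes are only a rough guide:

In [ ]:
# larger absolute coefficients pull the prediction more strongly;
# positive values push toward Outcome = 1 (compare with caution, since
# the features are on very different scales)
for name, coef in zip(diab_cols, logreg.coef_[0]):
    print(f"{name:>26}: {coef: .4f}")

Next, a heatmap makes the confusion matrix easier to read: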

In [6]:
class_names = [0, 1]  # 0 = no diabetes, 1 = diabetes

fig, ax = plt.subplots()

# create heatmap of the confusion matrix
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g',
            xticklabels=class_names, yticklabels=class_names)

ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[6]:
Text(0.5, 427.9555555555555, 'Predicted label')

[Figure: confusion matrix heatmap]
In [7]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) 

print("Precision:",metrics.precision_score(y_test, y_pred)) 

print("Recall:",metrics.recall_score(y_test, y_pred)) 
Accuracy: 0.8072916666666666
Precision: 0.7659574468085106
Recall: 0.5806451612903226
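
These figures come from a single 75/25 split. As a sanity check, here is a quick cross-validation sketch, reusing the X and y from the final model, to confirm the accuracy is not an artifact of one particular split:

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the full feature set
cv_scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))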

The model is performing with roughly 80% accuracy. We can now use this model for patients who do not have diabetes yet but may be on the verge. By inputting their statistics into the model, we have a much better chance of preventing the disease when a patient is close to developing it. The last thing I want to do is take a look at how this would work: I am going to input three healthy patients and three unhealthy patients who are likely to develop diabetes.

In [8]:
# feature order: Pregnancies, Insulin, BMI, Age, Glucose, BloodPressure, DiabetesPedigreeFunction
healthy1 = logreg.predict([[1,148,11,23,199,76,0.6]])
healthy1
Out[8]:
array([0], dtype=int64)
In [9]:
healthy2=logreg.predict([[2,132,15,22,201,72,0.4]])
healthy2
Out[9]:
array([0], dtype=int64)
In [10]:
healthy3=logreg.predict([[0,110,14,25,180,71,0.3]])
healthy3
Out[10]:
array([0], dtype=int64)
In [19]:
unhealthy1=logreg.predict([[6,10,35.6,95,100,86,2]])
unhealthy1
Out[19]:
array([1], dtype=int64)
In [17]:
unhealthy2=logreg.predict([[5,20,37,36,90,92,3.2]])
unhealthy2
Out[17]:
array([1], dtype=int64)
In [23]:
unhealthy3=logreg.predict([[4,22,44,48,110,91,1.55]])
unhealthy3
Out[23]:
array([1], dtype=int64)

All six of these patients were correctly diagnosed, and we can see that the stats of the healthy and unhealthy patients vary greatly.
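Finally, since the goal is to catch patients who are getting close to the diabetic range, the predicted class probabilities are arguably more useful than hard 0/1 labels. A minimal sketch, reusing the fitted logreg; the patient values and the 0.4 threshold below are hypothetical choices for illustration:

In [ ]:
# hypothetical borderline patient; values must follow the diab_cols order:
# Pregnancies, Insulin, BMI, Age, Glucose, BloodPressure, DiabetesPedigreeFunction
borderline = [[3, 100, 30.0, 45, 125, 80, 0.8]]

proba = logreg.predict_proba(borderline)  # [[P(no diabetes), P(diabetes)]]
print("Probability of diabetes:", proba[0][1])

# flag anyone above a chosen risk threshold for follow-up
if proba[0][1] > 0.4:
    print("Patient is approaching the diabetic range; consider intervention")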
