Pima Diabetes Data Analytics – Neil¶

Here is the set of analyses that have been run on this data set:

Data Cleaning to remove zeros
Data Exploration for Y and Xs
Descriptive Statistics – Numerical Summary and Graphical (Histograms) for all variables
Screening of variables by segmenting them by Outcome
Check for normality of dataset
Study bivariate relationship between variables using pair plots, correlation and heat map
Statistical screening using Logistic Regression
Validation of the model, its precision, and plotting of the confusion matrix

Importing necessary packages¶

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
%matplotlib inline

Importing the Diabetes CSV data file¶

Import the data and test if all the columns are loaded
The Data frame has been assigned a name of ‘diab’

diab=pd.read_csv("diabetes.csv")
diab.head()

About data set¶

In this data set, Outcome is the dependent variable and the remaining 8 variables are independent variables.

Finding if there are any null and Zero values in the data set¶

diab.isnull().values.any()
## To check if data contains null values

False

Inference:¶

The data frame doesn’t have any NaN values
As a next step, we will do preliminary screening of descriptive stats for the dataset

diab.describe()
## To run numerical descriptive stats for the data set

Inference at this point¶

Minimum values for many variables are 0.
As biological parameters like Glucose, BP, Skin thickness, Insulin and BMI cannot be zero, it looks like missing values have been coded as zeros
As a next step, find out how many Zero values are included in each variable

(diab.Pregnancies == 0).sum(),(diab.Glucose==0).sum(),(diab.BloodPressure==0).sum(),(diab.SkinThickness==0).sum(),(diab.Insulin==0).sum(),(diab.BMI==0).sum(),(diab.DiabetesPedigreeFunction==0).sum(),(diab.Age==0).sum()
## Counting cells with 0 Values for each variable and publishing the counts below

(111, 5, 35, 227, 374, 11, 0, 0)
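
The same kind of count can be obtained in one step by comparing the whole frame to zero. This is a minimal sketch on a small made-up frame (`sample`) standing in for `diab`; the column names match the data set, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `diab`; values are invented for illustration.
sample = pd.DataFrame({
    "Pregnancies": [0, 2, 0, 5],
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 0],
})

# Boolean frame (True where a cell equals 0), summed column by column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

The result is a Series indexed by column name, so each count stays attached to its variable instead of relying on the order of a tuple.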

Inference:¶

As the zero counts of some variables are as high as 374 and 227 in a data set of 768 rows, it is better to remove the zeros uniformly for 5 variables (Glucose, BloodPressure, SkinThickness, Insulin and BMI; Pregnancies and Outcome can legitimately be zero)
As a next step, we’ll drop the rows with 0 values and create a new dataset which can be used for further analysis

## Creating a dataset called 'dia' from the original dataset 'diab' which excludes all rows that have zeros in Glucose, BloodPressure, SkinThickness, Insulin or BMI; other columns can legitimately contain zero values.
drop_Glu=diab.index[diab.Glucose == 0].tolist()
drop_BP=diab.index[diab.BloodPressure == 0].tolist()
drop_Skin = diab.index[diab.SkinThickness==0].tolist()
drop_Ins = diab.index[diab.Insulin==0].tolist()
drop_BMI = diab.index[diab.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
dia=diab.drop(diab.index[c])

dia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
Pregnancies                 392 non-null int64
Glucose                     392 non-null int64
BloodPressure               392 non-null int64
SkinThickness               392 non-null int64
Insulin                     392 non-null int64
BMI                         392 non-null float64
DiabetesPedigreeFunction    392 non-null float64
Age                         392 non-null int64
Outcome                     392 non-null int64
dtypes: float64(2), int64(7)
memory usage: 30.6 KB

Inference¶

As above, we created a cleaned-up data frame titled “dia” which has 392 rows of data instead of the 768 in the original data frame
We lost nearly 50% of the data, but the data set is now cleaner than before
The removed rows could still be used for testing during modeling, provided the model only uses columns that are valid in those rows. So we haven’t lost them completely.
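
An alternative way to express this kind of cleaning is to recode the zeros as missing values and let pandas drop the incomplete rows. The sketch below runs on an invented frame (`sample`) and is not the code that produced `dia`; the list `zero_is_missing` is a name chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `diab`; values are invented for illustration.
sample = pd.DataFrame({
    "Pregnancies": [0, 1, 8, 1],
    "Glucose": [137, 85, 0, 89],
    "Insulin": [168, 0, 0, 94],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero cannot be a real measurement.
zero_is_missing = ["Glucose", "Insulin"]

# Recode 0 as NaN only in those columns; Pregnancies and Outcome keep their zeros.
recoded = sample.copy()
recoded[zero_is_missing] = recoded[zero_is_missing].replace(0, np.nan)

# Drop every row that has a missing value in any of the flagged columns.
cleaned = recoded.dropna(subset=zero_is_missing)
print(cleaned)
```

Keeping the intermediate `recoded` frame leaves the door open to imputing the missing values instead of dropping rows, which may matter when half of the data would otherwise be lost.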

Performing Preliminary Descriptive Stats on the Data set¶

Performing 5 number summary
Usually, the first thing to do in a data set is to get a hang of vital parameters of all variables and thus understand a little bit about the data set such as central tendency and dispersion

dia.describe()

Split the data frame into two sub sets for convenience of analysis¶

As we wish to study the influence of each variable on Outcome (Diabetic or not), we can subset the data by Outcome
dia1 Subset : All samples with 1 values of Outcome
dia0 Subset: All samples with 0 values of Outcome

dia1 = dia[dia.Outcome==1]
dia0 = dia[dia.Outcome==0]

dia1

dia0

Graphical screening for variables¶

Now we will start graphical analysis of Outcome. As the data is nominal (binary), we will draw a count plot and compute the percentage of samples that are diabetic and non-diabetic

## creating count plot with title using seaborn
sns.countplot(x=dia.Outcome)
plt.title("Count Plot for Outcome")

Text(0.5, 1.0, 'Count Plot for Outcome')

# Computing the %age of diabetic (Outcome=1) and non-diabetic (Outcome=0) in the sample
Out1=len(dia[dia.Outcome==1])
Out0=len(dia[dia.Outcome==0])
Total=Out0+Out1
PC_of_1 = Out1*100/Total
PC_of_0 = Out0*100/Total
PC_of_1, PC_of_0

(33.16326530612245, 66.83673469387755)

Inference on screening Outcome variable¶

There are 33.2% 1’s (diabetic) and 66.8% 0’s (non-diabetic) in the data
As a next step, we will start screening variables
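
Class percentages can also be read directly from `value_counts`. A minimal sketch, assuming a made-up `outcome` Series in place of `dia.Outcome`:

```python
import pandas as pd

# Hypothetical Outcome column: 2 diabetic (1) and 4 non-diabetic (0) samples.
outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages.
pct = outcome.value_counts(normalize=True) * 100
print(pct.round(1))
```

Because each percentage is indexed by its own class label, it cannot end up attached to the wrong class.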

Graphical Screening for Variables¶

We will take each variable, one at a time and screen them in the following manner
Study the data distribution (histogram) of each variable – Central tendency, Spread, Distortion(Skewness & Kurtosis)
To visually screen the association between ‘Outcome’ and each variable by plotting histograms & Boxplots by Outcome value
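
Since the same three panels are drawn for every variable, the plotting steps could be wrapped in one helper. This is only a sketch: `screen_variable` is a hypothetical name, it uses plain matplotlib rather than the seaborn calls in the cells below, and the data is invented.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def screen_variable(df, column, target="Outcome"):
    """Draw an overall histogram, histograms split by target, and a boxplot by target."""
    fig, axes = plt.subplots(1, 3, figsize=(20, 6))

    # Panel 1: distribution of the variable over all samples.
    axes[0].hist(df[column], bins=20)
    axes[0].set_title(f"Histogram for {column}")

    # Panel 2: one semi-transparent histogram per target value.
    groups = {value: part[column].to_numpy() for value, part in df.groupby(target)}
    for value, data in groups.items():
        axes[1].hist(data, bins=20, alpha=0.6, label=f"{column} for {target}={value}")
    axes[1].set_title(f"Histograms for {column} by {target}")
    axes[1].legend()

    # Panel 3: one box per target value.
    axes[2].boxplot(list(groups.values()))
    axes[2].set_xticklabels(list(groups.keys()))
    axes[2].set_title(f"Boxplot for {column} by {target}")
    return fig, axes

# Invented data for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "Outcome": rng.integers(0, 2, 200),
})
fig, axes = screen_variable(demo, "Glucose")
```

With such a helper, each screening cell would reduce to a single call per column.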

Screening Variable – Pregnancies¶

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Pregnancies")
sns.distplot(dia.Pregnancies,kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Pregnancies,kde=False,color="Blue", label="Preg for Outcome=0")
sns.distplot(dia1.Pregnancies,kde=False,color = "Gold", label = "Preg for Outcome=1")
plt.title("Histograms for Preg by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Pregnancies)
plt.title("Boxplot for Preg by Outcome")

Text(0.5, 1.0, 'Boxplot for Preg by Outcome')

Inference on Pregnancies¶

Visually, the data is right skewed, as expected for count data such as number of pregnancies. A large proportion of the participants have zero pregnancies. As the data set includes only women aged 21 and over, it’s likely that many are unmarried
Looking at the segmented histograms, a hypothesis is that as the number of pregnancies increases, women are more likely to be diabetic
In the boxplots, we find a few outliers in both subsets. In particular, some non-diabetic women have had many pregnancies. This is not a concern.
To validate this hypothesis, it needs to be statistically tested.
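
One possible way to test such a hypothesis is a two-sample Mann-Whitney U test, which does not assume normality. This is not part of the original analysis; the sketch uses invented count data and the variable names are placeholders.

```python
import numpy as np
from scipy import stats

# Invented pregnancy counts for the two groups; not the Pima data.
rng = np.random.default_rng(42)
preg_nondiabetic = rng.poisson(lam=2.5, size=200)
preg_diabetic = rng.poisson(lam=4.5, size=100)

# Mann-Whitney U test: does one group tend to have larger values than the other?
# It does not assume normality, which suits skewed count data.
result = stats.mannwhitneyu(preg_diabetic, preg_nondiabetic, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.4g}")
```

On the real data the two inputs would be `dia1.Pregnancies` and `dia0.Pregnancies`; a p-value below 0.05 would support the hypothesis that the groups differ.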

Screening Variable – Glucose¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(dia.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Glucose)
plt.title("Boxplot for Glucose by Outcome")

Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')

Inference on Glucose¶

1st graph – Histogram of Glucose data is slightly skewed to the right. About a third of the sample is diabetic, and it’s likely that their Glucose levels were higher, which stretches the right tail. The overall mean of Glucose is about 122.
2nd graph – Clearly the diabetic group has higher glucose than the non-diabetic group.
3rd graph – In the boxplot, skewness visually seems acceptable (<2), and it’s also likely that the confidence intervals of the means do not overlap. So the hypothesis that Glucose is a predictor of Outcome is likely to be true, but it needs to be statistically tested.

Screening Variable – Blood Pressure¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BloodPressure, kde=False)
plt.title("Histogram for Blood Pressure")
plt.subplot(1,3,2)
sns.distplot(dia0.BloodPressure,kde=False,color="Gold",label="BP for Outcome=0")
sns.distplot(dia1.BloodPressure,kde=False, color="Blue", label="BP for Outcome=1")
plt.legend()
plt.title("Histogram of Blood Pressure by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.BloodPressure)
plt.title("Boxplot of BP by Outcome")

Text(0.5, 1.0, 'Boxplot of BP by Outcome')

Inference on Blood Pressure¶

1st graph – Distribution looks roughly normal. The mean value is about 71, well within the normal diastolic limit of 80. One would expect this data to be normal, but as we don’t know whether the participants are on antihypertensive medication, we can’t comment much.
2nd graph – Most non-diabetic women seem to have a typical value of around 70, and diabetic women seem to have higher BP.
3rd graph – There are a few outliers in the data; it’s likely that some people have low and some have high BP. So the association between diabetes (Outcome) and BP is suspect and needs to be statistically validated.

Screening Variable – Skin Thickness¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.SkinThickness, kde=False)
plt.title("Histogram for Skin Thickness")
plt.subplot(1,3,2)
sns.distplot(dia0.SkinThickness, kde=False, color="Gold", label="SkinThick for Outcome=0")
sns.distplot(dia1.SkinThickness, kde=False, color="Blue", label="SkinThick for Outcome=1")
plt.legend()
plt.title("Histogram for SkinThickness by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.SkinThickness)
plt.title("Boxplot of SkinThickness by Outcome")

Text(0.5, 1.0, 'Boxplot of SkinThickness by Outcome')

Inferences for Skinthickness¶

1st graph – Skin thickness seems to be slightly skewed.
2nd graph – As with BP, people who are not diabetic have lower skin thickness. This is a hypothesis that has to be validated, as the data for non-diabetic samples is skewed while the diabetic samples seem to be normally distributed.

Screening Variable – Insulin¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Insulin,kde=False)
plt.title("Histogram of Insulin")
plt.subplot(1,3,2)
sns.distplot(dia0.Insulin,kde=False, color="Gold", label="Insulin for Outcome=0")
sns.distplot(dia1.Insulin,kde=False, color="Blue", label="Insulin for Outcome=1")
plt.title("Histogram for Insulin by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.Insulin)
plt.title("Boxplot for Insulin by Outcome")

Text(0.5, 1.0, 'Boxplot for Insulin by Outcome')

Inference for Insulin¶

2-hour serum insulin is expected to be between 16 and 166. Clearly there are outliers in the data. These outliers are a concern, and most of the samples with higher insulin values are also diabetic. So this variable is suspect.

Screening Variable – BMI¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BMI, kde=False)
plt.title("Histogram for BMI")
plt.subplot(1,3,2)
sns.distplot(dia0.BMI, kde=False,color="Gold", label="BMI for Outcome=0")
sns.distplot(dia1.BMI, kde=False, color="Blue", label="BMI for Outcome=1")
plt.legend()
plt.title("Histogram for BMI by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.BMI)
plt.title("Boxplot for BMI by Outcome")

Text(0.5, 1.0, 'Boxplot for BMI by Outcome')

Inference for BMI¶

1st graph – There are a few outliers. The expected (healthy) range is between 18 and 25, but the sample mean is about 33, so in general the participants are obese
2nd graph – Diabetic people seem to be only on the higher side of BMI. They also contribute more of the outliers
3rd graph – Same inference as 2nd graph

Screening Variable – Diabetes Pedigree Function¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.DiabetesPedigreeFunction,kde=False)
plt.title("Histogram for Diabetes Pedigree Function")
plt.subplot(1,3,2)
sns.distplot(dia0.DiabetesPedigreeFunction, kde=False, color="Gold", label="PedFunction for Outcome=0")
sns.distplot(dia1.DiabetesPedigreeFunction, kde=False, color="Blue", label="PedFunction for Outcome=1")
plt.legend()
plt.title("Histogram for DiabetesPedigreeFunction by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.DiabetesPedigreeFunction)
plt.title("Boxplot for DiabetesPedigreeFunction by Outcome")

Text(0.5, 1.0, 'Boxplot for DiabetesPedigreeFunction by Outcome')

Inference of Diabetes Pedigree Function¶

I don’t know what this variable is. At first glance it doesn’t seem to contribute strongly to diabetes
Data is skewed. I don’t know if this parameter is expected to follow a normal distribution. Not all natural parameters are normal
As DPF increases, there seems to be a higher likelihood of being diabetic, but this needs statistical validation

Screening Variable – Age¶

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Age,kde=False)
plt.title("Histogram for Age")
plt.subplot(1,3,2)
sns.distplot(dia0.Age,kde=False,color="Gold", label="Age for Outcome=0")
sns.distplot(dia1.Age,kde=False, color="Blue", label="Age for Outcome=1")
plt.legend()
plt.title("Histogram for Age by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Age)
plt.title("Boxplot for Age by Outcome")

Text(0.5, 1.0, 'Boxplot for Age by Outcome')

Inference for Age¶

Age is skewed. As this is life data, it is likely to follow a Weibull distribution rather than a normal one
There is a tendency that as people age, they are more likely to become diabetic. This needs statistical validation
But diabetes itself doesn’t seem to have an influence on longevity. Maybe it impacts quality of life, which is not measured in this data set.

Normality Test¶

Inference: None of the variables are normally distributed (all p-values are below 0.05, so the null hypothesis of normality is rejected). Maybe the subsets are normal

## importing stats module from scipy
from scipy import stats
## retrieving p value from normality test function
PregnanciesPVAL=stats.normaltest(dia.Pregnancies).pvalue
GlucosePVAL=stats.normaltest(dia.Glucose).pvalue
BloodPressurePVAL=stats.normaltest(dia.BloodPressure).pvalue
SkinThicknessPVAL=stats.normaltest(dia.SkinThickness).pvalue
InsulinPVAL=stats.normaltest(dia.Insulin).pvalue
BMIPVAL=stats.normaltest(dia.BMI).pvalue
DiaPeFuPVAL=stats.normaltest(dia.DiabetesPedigreeFunction).pvalue
AgePVAL=stats.normaltest(dia.Age).pvalue
## Printing the values
print("Pregnancies P Value is " + str(PregnanciesPVAL))
print("Glucose P Value is " + str(GlucosePVAL))
print("BloodPressure P Value is " + str(BloodPressurePVAL))
print("Skin Thickness P Value is " + str(SkinThicknessPVAL))
print("Insulin P Value is " + str(InsulinPVAL))
print("BMI P Value is " + str(BMIPVAL))
print("Diabetes Pedigree Function P Value is " + str(DiaPeFuPVAL))
print("Age P Value is " + str(AgePVAL))

Pregnancies P Value is 6.155097831782508e-20
Glucose P Value is 1.3277887088487345e-05
BloodPressure P Value is 0.030164917115239397
Skin Thickness P Value is 0.01548332935449814
Insulin P Value is 8.847272035922274e-43
BMI P Value is 1.4285556992424915e-09
Diabetes Pedigree Function P Value is 1.1325395699626466e-39
Age P Value is 1.0358469089881947e-21
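
The per-variable normality test can also be written as a loop over columns, which avoids one variable and one print statement per column. A minimal sketch on invented data (`demo` and its column names are placeholders):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented data: one roughly symmetric column and one strongly skewed column.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "symmetric": rng.normal(70, 12, 500),
    "skewed": rng.exponential(100, 500),
})

# D'Agostino-Pearson test per column; the null hypothesis is that the data is normal.
pvalues = {col: stats.normaltest(demo[col]).pvalue for col in demo.columns}
for col, p in pvalues.items():
    verdict = "reject normality" if p < 0.05 else "no evidence against normality"
    print(f"{col}: p = {p:.3g} ({verdict})")
```

Printing the verdict next to each p-value makes the direction of the test explicit: a small p-value is evidence against normality.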

Screening of Association between Variables to study Bivariate relationship¶

We will use pairplot to study the association between variables – from individual scatter plots
Then we will compute pearson correlation coefficient
Then we will summarize the same as heatmap

sns.pairplot(dia, vars=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin", "BMI","DiabetesPedigreeFunction", "Age"],hue="Outcome")
plt.title("Pairplot of Variables by Outcome")

Text(0.5, 1.0, 'Pairplot of Variables by Outcome')

Inference from Pair Plots¶

From the scatter plots, only BMI & SkinThickness and Pregnancies & Age seem to me to have positive linear relationships. Another likely suspect is Glucose and Insulin.
There are no obvious non-linear relationships
Let’s check this with the Pearson correlation and plot a heat map

cor = dia.corr(method ='pearson')
cor

sns.heatmap(cor)

<matplotlib.axes._subplots.AxesSubplot at 0xc4a9350>

Inference from ‘r’ values and heat map¶

No two factors have a strong linear relationship
Age & Pregnancies (r = 0.68) and BMI & SkinThickness (r = 0.66) have a moderate positive linear relationship
Glucose & Insulin technically have a low correlation, but r = 0.58 is close to 0.6, so it can be treated as a moderate correlation
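
Reading the strongest pairs off a heat map by eye is error-prone, so it can help to list them in sorted order. The sketch below shows one way to do that on invented data; on the real data `corr` would be the `cor` frame computed above.

```python
import numpy as np
import pandas as pd

# Invented data: b is built from a, so that pair should rank first.
rng = np.random.default_rng(2)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=300),
    "c": rng.normal(size=300),
})

corr = demo.corr(method="pearson")

# Keep only the cells above the diagonal so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

This sorts by signed value; if strong negative correlations were of interest, the absolute values would need to be sorted instead.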

Final Inference before model building¶

The data set contains many zero values; rows with them have been removed, and the remaining data has been used for screening and model building
About 33% of participants in the cleaned sample are diabetic
Visual screening (boxplots and segmented histograms) shows that a few factors seem to influence the outcome
Moderate correlation exists between a few factors, and this has to be borne in mind while building the model. If correlated factors are included together, it might lead to variance inflation (multicollinearity).
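
Variance inflation can be quantified with the variance inflation factor (VIF). For a model with an intercept, the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses that identity on invented data; `vif_table` is a hypothetical helper and was not part of the original analysis.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """Variance inflation factor per column: the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF")

# Invented predictors: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.3, size=500),
    "x3": rng.normal(size=500),
})
vif = vif_table(demo)
print(vif.round(2))
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; common rules of thumb treat values above 5 or 10 as a sign of problematic collinearity.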

As a next step, a binary logistic regression model has been built

Logistic Regression¶

Logistic regression is used when the dependent variable is binary, ordinal or nominal and the independent variables are either continuous or discrete
In this scenario, a Logit model has been used to fit the data
In this case an event is defined as the occurrence of ‘1’ in Outcome
Basically, logistic regression models the log of the odds of the event as a linear function of the independent variables
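
The link between probability and log-odds that the Logit model relies on can be illustrated with two small functions. This is a generic sketch using only the standard library; the function names are chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Probability that corresponds to a log-odds value z."""
    return 1 / (1 + math.exp(-z))

# A probability of 0.5 means even odds, so its log-odds is 0.
print(logit(0.5))
# Going to log-odds and back recovers the original probability.
print(inv_logit(logit(0.8)))
```

The model fits a straight line on the log-odds scale, and `inv_logit` is what turns a fitted value back into a probability of Outcome = 1.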

cols=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin", "BMI","DiabetesPedigreeFunction", "Age"]
X=dia[cols]
y=dia.Outcome

## Importing stats models for running logistic regression
import statsmodels.api as sm
## Defining the model and assigning Y (Dependent) and X (Independent Variables)
logit_model=sm.Logit(y,X)
## Fitting the model and publishing the results
result=logit_model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.563677
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      384
Method:                           MLE   Df Model:                            7
Date:                Thu, 09 May 2019   Pseudo R-squ.:                  0.1128
Time:                        15:31:51   Log-Likelihood:                -220.96
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 8.717e-10
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Pregnancies                  0.1299      0.049      2.655      0.008       0.034       0.226
Glucose                      0.0174      0.005      3.765      0.000       0.008       0.026
BloodPressure               -0.0484      0.009     -5.123      0.000      -0.067      -0.030
SkinThickness                0.0284      0.015      1.898      0.058      -0.001       0.058
Insulin                      0.0019      0.001      1.598      0.110      -0.000       0.004
BMI                         -0.0365      0.022     -1.669      0.095      -0.079       0.006
DiabetesPedigreeFunction     0.4636      0.344      1.347      0.178      -0.211       1.138
Age                          0.0005      0.016      0.031      0.976      -0.031       0.032
============================================================================================

Inference from the Logistic Regression¶

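The pseudo R-squared of the model is 0.113, so the model explains only a small part of the variation in the dependent variable. (The 0.5637 reported as “Current function value” is the optimizer’s final objective, the mean negative log-likelihood, not an R-squared.)
To identify which variables influence the outcome, we will look at the p-value of each variable. We expect the p-value to be less than 0.05 (alpha risk)
When p-value < 0.05, we can say the variable influences the outcome
Hence we will eliminate the three variables with the largest p-values, Diabetes Pedigree Function, Age and Insulin, and re-run the model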

2nd iteration of the Logistic Regression with fewer variables¶

cols2=["Pregnancies", "Glucose","BloodPressure","SkinThickness","BMI"]
X=dia[cols2]

logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.569365
         Iterations 5
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.104     
Dependent Variable: Outcome          AIC:              456.3820  
Date:               2019-05-05 22:48 BIC:              476.2383  
No. Observations:   392              Log-Likelihood:   -223.19   
Df Model:           4                LL-Null:          -249.05   
Df Residuals:       387              LLR p-value:      1.5817e-10
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     5.0000                                       
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
Pregnancies        0.1291   0.0374  3.4489 0.0006  0.0557  0.2024
Glucose            0.0215   0.0040  5.4447 0.0000  0.0138  0.0293
BloodPressure     -0.0507   0.0089 -5.6868 0.0000 -0.0682 -0.0332
SkinThickness      0.0299   0.0149  2.0073 0.0447  0.0007  0.0592
BMI               -0.0313   0.0215 -1.4537 0.1460 -0.0734  0.0109
=================================================================

Inference from 2nd Iteration¶

BMI has a p-value of 0.146, which is greater than 0.05, so we will now eliminate BMI and re-run the model

3rd iteration of Logistic Regression¶

cols3=["Pregnancies", "Glucose","BloodPressure","SkinThickness"]
X=dia[cols3]
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.572076
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      388
Method:                           MLE   Df Model:                            3
Date:                Sun, 05 May 2019   Pseudo R-squ.:                 0.09956
Time:                        22:49:35   Log-Likelihood:                -224.25
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 9.769e-11
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Pregnancies       0.1403      0.037      3.820      0.000       0.068       0.212
Glucose           0.0199      0.004      5.297      0.000       0.013       0.027
BloodPressure    -0.0571      0.008     -7.242      0.000      -0.073      -0.042
SkinThickness     0.0160      0.011      1.404      0.160      -0.006       0.038
=================================================================================

Inference from 3rd Iteration¶

Now the p-value of SkinThickness (0.160) is greater than 0.05, hence we will eliminate it and re-run the model

4th Iteration of Logistic Regression¶

cols4=["Pregnancies", "Glucose","BloodPressure"]
X=dia[cols4]
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.574607
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      389
Method:                           MLE   Df Model:                            2
Date:                Sun, 05 May 2019   Pseudo R-squ.:                 0.09558
Time:                        22:49:53   Log-Likelihood:                -225.25
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 4.597e-11
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Pregnancies       0.1405      0.037      3.826      0.000       0.069       0.212
Glucose           0.0210      0.004      5.709      0.000       0.014       0.028
BloodPressure    -0.0525      0.007     -7.449      0.000      -0.066      -0.039
=================================================================================

Inference from 4th Run¶

Now the model is clear. We have 3 variables that influence the Outcome, and they are Pregnancies, Glucose and BloodPressure
Luckily, none of these 3 variables are strongly correlated with each other (pairwise r is about 0.2). Hence we can safely assume that the variance of the model is not inflated
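
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The sketch below applies this to the coefficients printed in the 4th-iteration summary; the dictionary is typed in by hand from that table.

```python
import math

# Coefficients copied from the 4th-iteration summary above.
coefs = {"Pregnancies": 0.1405, "Glucose": 0.0210, "BloodPressure": -0.0525}

# exp(coef) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio above 1 means the odds of Outcome = 1 rise with the variable, and a ratio below 1 means they fall, holding the other variables fixed.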

## Importing LogisticRegression from sklearn.linear_model, as the statsmodels function cannot give us a classification report and confusion matrix
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
cols4=["Pregnancies", "Glucose","BloodPressure"]
X=dia[cols4]
y=dia.Outcome
logreg.fit(X,y)
## Defining the y_pred variable for the predicted values. I have predicted on the same 392-row dia dataset used for fitting. We can also use a separate test dataset
y_pred=logreg.predict(X)
## Printing the precision, recall and f1-score of the model
from sklearn.metrics import classification_report
print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.89      0.84       262
           1       0.71      0.53      0.61       130

   micro avg       0.77      0.77      0.77       392
   macro avg       0.75      0.71      0.72       392
weighted avg       0.77      0.77      0.76       392
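
Scores computed on the same rows used for fitting tend to be optimistic. A held-out test set gives a more honest estimate; the sketch below shows the mechanics on invented data and is not a re-run of the Pima model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented data with three predictors; not the Pima data.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(400, 3))
true_logit = 1.2 * X_demo[:, 0] - 0.8 * X_demo[:, 1]
y_demo = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit on the training part only, then score on rows the model has not seen.
model = LogisticRegression().fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `dia[cols4]` and `dia.Outcome`.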


Accuracy of the model is 77%¶

from sklearn.metrics import confusion_matrix
## The confusion matrix gives the number of cases where the model predicts the outcome correctly (both 1 and 0) and the number of false positives and false negatives
## Storing the result under a different name (cm) so that the confusion_matrix function is not overwritten
cm = confusion_matrix(y, y_pred)
print(cm)

[[234  28]
 [ 61  69]]

Output of diab.describe() on the original data (768 rows):

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

Output of dia.describe() on the cleaned data (392 rows):

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	392.000000	392.000000	392.000000	392.000000	392.000000	392.000000	392.000000	392.000000	392.000000
mean	3.301020	122.627551	70.663265	29.145408	156.056122	33.086224	0.523046	30.864796	0.331633
std	3.211424	30.860781	12.496092	10.516424	118.841690	7.027659	0.345488	10.200777	0.471401
min	0.000000	56.000000	24.000000	7.000000	14.000000	18.200000	0.085000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	21.000000	76.750000	28.400000	0.269750	23.000000	0.000000
50%	2.000000	119.000000	70.000000	29.000000	125.500000	33.200000	0.449500	27.000000	0.000000
75%	5.000000	143.000000	78.000000	37.000000	190.000000	37.100000	0.687000	36.000000	1.000000
max	17.000000	198.000000	110.000000	63.000000	846.000000	67.100000	2.420000	81.000000	1.000000

Pearson correlation matrix of the cleaned data (output of dia.corr):

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
Pregnancies	1.000000	0.198291	0.213355	0.093209	0.078984	-0.025347	0.007562	0.679608	0.256566
Glucose	0.198291	1.000000	0.210027	0.198856	0.581223	0.209516	0.140180	0.343641	0.515703
BloodPressure	0.213355	0.210027	1.000000	0.232571	0.098512	0.304403	-0.015971	0.300039	0.192673
SkinThickness	0.093209	0.198856	0.232571	1.000000	0.182199	0.664355	0.160499	0.167761	0.255936
Insulin	0.078984	0.581223	0.098512	0.182199	1.000000	0.226397	0.135906	0.217082	0.301429
BMI	-0.025347	0.209516	0.304403	0.664355	0.226397	1.000000	0.158771	0.069814	0.270118
DiabetesPedigreeFunction	0.007562	0.140180	-0.015971	0.160499	0.135906	0.158771	1.000000	0.085029	0.209330
Age	0.679608	0.343641	0.300039	0.167761	0.217082	0.069814	0.085029	1.000000	0.350804
Outcome	0.256566	0.515703	0.192673	0.255936	0.301429	0.270118	0.209330	0.350804	1.000000

Output of diab.head():

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

Rows of the dia1 subset (Outcome = 1), truncated:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
4	0	137	40	35	168	43.1	2.288	33	1
6	3	78	50	32	88	31.0	0.248	26	1
8	2	197	70	45	543	30.5	0.158	53	1
13	1	189	60	23	846	30.1	0.398	59	1
14	5	166	72	19	175	25.8	0.587	51	1
16	0	118	84	47	230	45.8	0.551	31	1
19	1	115	70	30	96	34.6	0.529	32	1
24	11	143	94	33	146	36.6	0.254	51	1
25	10	125	70	26	115	31.1	0.205	41	1
31	3	158	76	36	245	31.6	0.851	28	1
39	4	111	72	47	207	37.1	1.390	56	1
43	9	171	110	24	240	45.4	0.721	54	1
53	8	176	90	34	300	33.7	0.467	58	1
56	7	187	68	39	304	37.7	0.254	41	1
70	2	100	66	20	90	32.9	0.867	28	1
88	15	136	70	32	110	37.1	0.153	43	1
99	1	122	90	51	220	49.7	0.325	31	1
109	0	95	85	25	36	37.4	0.247	24	1
110	3	171	72	33	135	33.3	0.199	24	1
111	8	155	62	26	495	34.0	0.543	46	1
114	7	160	54	32	175	30.5	0.588	39	1
120	0	162	76	56	100	53.2	0.759	25	1
125	1	88	30	42	99	55.0	0.496	26	1
128	1	117	88	24	145	34.5	0.403	40	1
130	4	173	70	14	168	29.7	0.361	33	1
132	3	170	64	37	225	34.5	0.356	30	1
152	9	156	86	28	155	34.3	1.189	42	1
159	17	163	72	41	114	40.9	0.817	47	1
165	6	104	74	18	156	29.9	0.722	41	1
171	6	134	70	23	130	35.4	0.542	29	1
…	…	…	…	…	…	…	…	…	…
584	8	124	76	24	600	28.7	0.687	52	1
588	3	176	86	27	156	33.3	1.154	52	1
595	0	188	82	14	185	32.0	0.682	22	1
603	7	150	78	29	126	35.2	0.692	54	1
606	1	181	78	42	293	40.0	1.258	22	1
611	3	174	58	22	194	32.9	0.593	36	1
612	7	168	88	42	321	38.2	0.787	40	1
614	11	138	74	26	144	36.1	0.557	50	1
638	7	97	76	32	91	40.9	0.871	32	1
646	1	167	74	17	144	23.4	0.447	33	1
647	0	179	50	36	159	37.8	0.455	22	1
648	11	136	84	35	130	28.3	0.260	42	1
655	2	155	52	27	540	38.7	0.240	25	1
659	3	80	82	31	70	34.2	1.292	27	1
662	8	167	106	46	231	37.6	0.165	43	1
663	9	145	80	46	130	37.9	0.637	40	1
689	1	144	82	46	180	46.1	0.335	46	1
693	7	129	68	49	125	38.5	0.439	43	1
695	7	142	90	24	480	30.4	0.128	43	1
696	3	169	74	19	125	29.9	0.268	31	1
709	2	93	64	32	160	38.0	0.674	23	1
715	7	187	50	33	392	33.9	0.826	34	1
716	3	173	78	39	185	33.8	0.970	31	1
722	1	149	68	29	127	29.3	0.349	42	1
730	3	130	78	23	79	28.4	0.323	34	1
732	2	174	88	37	120	44.5	0.646	24	1
740	11	120	80	37	150	42.3	0.785	48	1
748	3	187	70	22	200	36.4	0.408	36	1
753	0	181	88	44	510	43.3	0.222	26	1
755	1	128	88	39	110	36.5	1.057	37	1

Rows of the dia0 subset (Outcome = 0), truncated:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
3	1	89	66	23	94	28.1	0.167	21	0
18	1	103	30	38	83	43.3	0.183	33	0
20	3	126	88	41	235	39.3	0.704	27	0
27	1	97	66	15	140	23.2	0.487	22	0
28	13	145	82	19	110	22.2	0.245	57	0
32	3	88	58	11	54	24.8	0.267	22	0
35	4	103	60	33	192	24.0	0.966	33	0
40	3	180	64	25	70	34.0	0.271	26	0
50	1	103	80	11	82	19.4	0.491	22	0
51	1	101	50	15	36	24.2	0.526	26	0
52	5	88	66	21	23	24.4	0.342	30	0
54	7	150	66	42	342	34.7	0.718	42	0
57	0	100	88	60	110	46.8	0.962	31	0
59	0	105	64	41	142	41.5	0.173	22	0
63	2	141	58	34	128	25.4	0.699	24	0
68	1	95	66	13	38	19.6	0.334	25	0
69	4	146	85	27	100	28.9	0.189	27	0
71	5	139	64	35	140	28.6	0.411	26	0
73	4	129	86	20	270	35.1	0.231	23	0
82	7	83	78	26	71	29.3	0.767	36	0
85	2	110	74	29	125	32.4	0.698	27	0
87	2	100	68	25	71	38.5	0.324	26	0
91	4	123	80	15	176	32.0	0.443	34	0
92	7	81	78	40	48	46.7	0.261	42	0
94	2	142	82	18	64	24.7	0.761	21	0
95	6	144	72	27	228	33.9	0.255	40	0
97	1	71	48	18	76	20.4	0.323	22	0
98	6	93	50	30	64	28.7	0.356	23	0
103	1	81	72	18	40	26.6	0.283	24	0
105	1	126	56	29	152	28.7	0.801	21	0
…	…	…	…	…	…	…	…	…	…
673	3	123	100	35	240	57.3	0.880	22	0
679	2	101	58	17	265	24.2	0.614	23	0
680	2	56	56	28	45	24.2	0.332	22	0
682	0	95	64	39	105	44.6	0.366	22	0
685	2	129	74	26	205	33.2	0.591	25	0
688	1	140	74	26	180	24.1	0.828	23	0
692	2	121	70	32	95	39.1	0.886	23	0
698	4	127	88	11	155	34.5	0.598	28	0
700	2	122	76	27	200	35.9	0.483	26	0

The confusion matrix tells us that 234+69 = 303 predictions are correct and 61+28 = 89 predictions are incorrect.¶
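
The counts in the confusion matrix are enough to recompute the headline metrics by hand. The sketch below types the matrix in from the output above and assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual classes, columns are predicted classes.
cm = np.array([[234, 28],
               [61, 69]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted diabetics, how many really are
recall = tp / (tp + fn)           # of the actual diabetics, how many were found
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The recall for class 1 is the weakest of the three numbers: the model misses 61 of the 130 diabetic cases, which matters more than overall accuracy if the goal is screening.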