Pima Indian Diabetes Data Analysis in Python

Pima Diabetes Data Analytics – Neil

Here is the set of analytics that has been run on this data set:

  • Data Cleaning to remove zeros
  • Data Exploration for Y and Xs
  • Descriptive Statistics – Numerical Summary and Graphical (Histograms) for all variables
  • Screening of variables by segmenting them by Outcome
  • Check for normality of dataset
  • Study bivariate relationship between variables using pair plots, correlation and heat map
  • Statistical screening using Logistic Regression
  • Validation of the model, its precision, and plotting of the confusion matrix
 

Importing necessary packages

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline
 

Importing the Diabetes CSV data file

  • Import the data and test if all the columns are loaded
  • The Data frame has been assigned a name of ‘diab’
In [3]:
diab=pd.read_csv("diabetes.csv")
diab.head()
Out[3]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
 

About data set

In this data set, Outcome is the dependent variable and the remaining 8 variables are independent variables.

 

Finding if there are any null or zero values in the data set

In [29]:
diab.isnull().values.any()
## To check if data contains null values
Out[29]:
False
 

Inference:

  • The data frame doesn’t have any NaN values
  • As a next step, we will do preliminary screening of descriptive stats for the dataset
In [30]:
diab.describe()
## To run numerical descriptive stats for the data set
Out[30]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
 

Inference at this point

  • Minimum values for many variables are 0.
  • As biological parameters like Glucose, BP, Skin Thickness, Insulin & BMI cannot have zero values, it looks like missing values have been coded as zeros
  • As a next step, find out how many zero values are included in each variable
In [32]:
((diab.Pregnancies == 0).sum(), (diab.Glucose == 0).sum(),
 (diab.BloodPressure == 0).sum(), (diab.SkinThickness == 0).sum(),
 (diab.Insulin == 0).sum(), (diab.BMI == 0).sum(),
 (diab.DiabetesPedigreeFunction == 0).sum(), (diab.Age == 0).sum())
## Counting cells with 0 Values for each variable and publishing the counts below
Out[32]:
(111, 5, 35, 227, 374, 11, 0, 0)
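For reference, the same counts can be obtained in a single vectorized call; a minimal sketch (equivalent to the tuple above, not part of the original run):

## Count zeros in every column at once; returns a labeled Series
(diab == 0).sum()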
 

Inference:

  • As the zero counts of some of the variables are as high as 374 and 227 out of 768 rows, it is better to remove the zeros uniformly for 5 variables (excl. Pregnancies & Outcome)
  • As a next step, we’ll drop the 0 values and create a new dataset which can be used for further analysis
In [4]:
## Creating a dataset called 'dia' from the original dataset 'diab', excluding all rows that have zeros in Glucose, BloodPressure, SkinThickness, Insulin or BMI, as the other columns can legitimately contain zero values.
drop_Glu=diab.index[diab.Glucose == 0].tolist()
drop_BP=diab.index[diab.BloodPressure == 0].tolist()
drop_Skin = diab.index[diab.SkinThickness==0].tolist()
drop_Ins = diab.index[diab.Insulin==0].tolist()
drop_BMI = diab.index[diab.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
dia=diab.drop(diab.index[c])
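For reference, the same filter can be written as one boolean mask; a minimal sketch (not run here), using a hypothetical name dia_alt:

## Keep only rows where all five clinical columns are non-zero
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
dia_alt = diab[(diab[zero_cols] != 0).all(axis=1)]  ## same 392 rows as 'dia'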
In [35]:
dia.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
Pregnancies                 392 non-null int64
Glucose                     392 non-null int64
BloodPressure               392 non-null int64
SkinThickness               392 non-null int64
Insulin                     392 non-null int64
BMI                         392 non-null float64
DiabetesPedigreeFunction    392 non-null float64
Age                         392 non-null int64
Outcome                     392 non-null int64
dtypes: float64(2), int64(7)
memory usage: 30.6 KB
 

Inference

  • As above, we created a cleaned-up dataset titled “dia” which has 392 rows of data instead of the original 768
  • It looks like we lost nearly 50% of the data, but our data set is now cleaner than before
  • In fact, the removed values can be used for testing during modeling, so we haven’t really lost them completely.
 

Performing Preliminary Descriptive Stats on the Data set

  • Performing 5 number summary
  • Usually, the first thing to do with a data set is to get a feel for the vital parameters of all variables, and thus understand a little bit about the data set, such as central tendency and dispersion
In [19]:
dia.describe()
Out[19]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000 392.000000
mean 3.301020 122.627551 70.663265 29.145408 156.056122 33.086224 0.523046 30.864796 0.331633
std 3.211424 30.860781 12.496092 10.516424 118.841690 7.027659 0.345488 10.200777 0.471401
min 0.000000 56.000000 24.000000 7.000000 14.000000 18.200000 0.085000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 21.000000 76.750000 28.400000 0.269750 23.000000 0.000000
50% 2.000000 119.000000 70.000000 29.000000 125.500000 33.200000 0.449500 27.000000 0.000000
75% 5.000000 143.000000 78.000000 37.000000 190.000000 37.100000 0.687000 36.000000 1.000000
max 17.000000 198.000000 110.000000 63.000000 846.000000 67.100000 2.420000 81.000000 1.000000
 

Split the data frame into two subsets for convenience of analysis

  • As we wish to study the influence of each variable on Outcome (Diabetic or not), we can subset the data by Outcome
  • dia1 subset: all samples with Outcome = 1
  • dia0 subset: all samples with Outcome = 0
In [8]:
dia1 = dia[dia.Outcome==1]
dia0 = dia[dia.Outcome==0]
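A quick numeric complement to this split (a sketch, not part of the original run) is to compare per-variable means across the two groups in one call:

## Mean of every variable, grouped by Outcome
dia.groupby("Outcome").mean()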
In [21]:
dia1
Out[21]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
4 0 137 40 35 168 43.1 2.288 33 1
6 3 78 50 32 88 31.0 0.248 26 1
8 2 197 70 45 543 30.5 0.158 53 1
13 1 189 60 23 846 30.1 0.398 59 1
14 5 166 72 19 175 25.8 0.587 51 1
16 0 118 84 47 230 45.8 0.551 31 1
19 1 115 70 30 96 34.6 0.529 32 1
24 11 143 94 33 146 36.6 0.254 51 1
25 10 125 70 26 115 31.1 0.205 41 1
31 3 158 76 36 245 31.6 0.851 28 1
39 4 111 72 47 207 37.1 1.390 56 1
43 9 171 110 24 240 45.4 0.721 54 1
53 8 176 90 34 300 33.7 0.467 58 1
56 7 187 68 39 304 37.7 0.254 41 1
70 2 100 66 20 90 32.9 0.867 28 1
88 15 136 70 32 110 37.1 0.153 43 1
99 1 122 90 51 220 49.7 0.325 31 1
109 0 95 85 25 36 37.4 0.247 24 1
110 3 171 72 33 135 33.3 0.199 24 1
111 8 155 62 26 495 34.0 0.543 46 1
114 7 160 54 32 175 30.5 0.588 39 1
120 0 162 76 56 100 53.2 0.759 25 1
125 1 88 30 42 99 55.0 0.496 26 1
128 1 117 88 24 145 34.5 0.403 40 1
130 4 173 70 14 168 29.7 0.361 33 1
132 3 170 64 37 225 34.5 0.356 30 1
152 9 156 86 28 155 34.3 1.189 42 1
159 17 163 72 41 114 40.9 0.817 47 1
165 6 104 74 18 156 29.9 0.722 41 1
171 6 134 70 23 130 35.4 0.542 29 1
584 8 124 76 24 600 28.7 0.687 52 1
588 3 176 86 27 156 33.3 1.154 52 1
595 0 188 82 14 185 32.0 0.682 22 1
603 7 150 78 29 126 35.2 0.692 54 1
606 1 181 78 42 293 40.0 1.258 22 1
611 3 174 58 22 194 32.9 0.593 36 1
612 7 168 88 42 321 38.2 0.787 40 1
614 11 138 74 26 144 36.1 0.557 50 1
638 7 97 76 32 91 40.9 0.871 32 1
646 1 167 74 17 144 23.4 0.447 33 1
647 0 179 50 36 159 37.8 0.455 22 1
648 11 136 84 35 130 28.3 0.260 42 1
655 2 155 52 27 540 38.7 0.240 25 1
659 3 80 82 31 70 34.2 1.292 27 1
662 8 167 106 46 231 37.6 0.165 43 1
663 9 145 80 46 130 37.9 0.637 40 1
689 1 144 82 46 180 46.1 0.335 46 1
693 7 129 68 49 125 38.5 0.439 43 1
695 7 142 90 24 480 30.4 0.128 43 1
696 3 169 74 19 125 29.9 0.268 31 1
709 2 93 64 32 160 38.0 0.674 23 1
715 7 187 50 33 392 33.9 0.826 34 1
716 3 173 78 39 185 33.8 0.970 31 1
722 1 149 68 29 127 29.3 0.349 42 1
730 3 130 78 23 79 28.4 0.323 34 1
732 2 174 88 37 120 44.5 0.646 24 1
740 11 120 80 37 150 42.3 0.785 48 1
748 3 187 70 22 200 36.4 0.408 36 1
753 0 181 88 44 510 43.3 0.222 26 1
755 1 128 88 39 110 36.5 1.057 37 1

130 rows × 9 columns

In [36]:
dia0
Out[36]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
3 1 89 66 23 94 28.1 0.167 21 0
18 1 103 30 38 83 43.3 0.183 33 0
20 3 126 88 41 235 39.3 0.704 27 0
27 1 97 66 15 140 23.2 0.487 22 0
28 13 145 82 19 110 22.2 0.245 57 0
32 3 88 58 11 54 24.8 0.267 22 0
35 4 103 60 33 192 24.0 0.966 33 0
40 3 180 64 25 70 34.0 0.271 26 0
50 1 103 80 11 82 19.4 0.491 22 0
51 1 101 50 15 36 24.2 0.526 26 0
52 5 88 66 21 23 24.4 0.342 30 0
54 7 150 66 42 342 34.7 0.718 42 0
57 0 100 88 60 110 46.8 0.962 31 0
59 0 105 64 41 142 41.5 0.173 22 0
63 2 141 58 34 128 25.4 0.699 24 0
68 1 95 66 13 38 19.6 0.334 25 0
69 4 146 85 27 100 28.9 0.189 27 0
71 5 139 64 35 140 28.6 0.411 26 0
73 4 129 86 20 270 35.1 0.231 23 0
82 7 83 78 26 71 29.3 0.767 36 0
85 2 110 74 29 125 32.4 0.698 27 0
87 2 100 68 25 71 38.5 0.324 26 0
91 4 123 80 15 176 32.0 0.443 34 0
92 7 81 78 40 48 46.7 0.261 42 0
94 2 142 82 18 64 24.7 0.761 21 0
95 6 144 72 27 228 33.9 0.255 40 0
97 1 71 48 18 76 20.4 0.323 22 0
98 6 93 50 30 64 28.7 0.356 23 0
103 1 81 72 18 40 26.6 0.283 24 0
105 1 126 56 29 152 28.7 0.801 21 0
673 3 123 100 35 240 57.3 0.880 22 0
679 2 101 58 17 265 24.2 0.614 23 0
680 2 56 56 28 45 24.2 0.332 22 0
682 0 95 64 39 105 44.6 0.366 22 0
685 2 129 74 26 205 33.2 0.591 25 0
688 1 140 74 26 180 24.1 0.828 23 0
692 2 121 70 32 95 39.1 0.886 23 0
698 4 127 88 11 155 34.5 0.598 28 0
700 2 122 76 27 200 35.9 0.483 26 0
704 4 110 76 20 100 28.4 0.118 27 0
707 2 127 46 21 335 34.4 0.176 22 0
710 3 158 64 13 387 31.2 0.295 24 0
711 5 126 78 27 22 29.6 0.439 40 0
713 0 134 58 20 291 26.4 0.352 21 0
718 1 108 60 46 178 35.5 0.415 24 0
721 1 114 66 36 200 38.1 0.289 21 0
723 5 117 86 30 105 39.1 0.251 42 0
726 1 116 78 29 180 36.1 0.496 25 0
733 2 106 56 27 165 29.0 0.426 22 0
736 0 126 86 27 120 27.4 0.515 21 0
738 2 99 60 17 160 36.6 0.453 21 0
741 3 102 44 20 94 30.8 0.400 26 0
742 1 109 58 18 116 28.5 0.219 22 0
744 13 153 88 37 140 40.6 1.174 39 0
745 12 100 84 33 105 30.0 0.488 46 0
747 1 81 74 41 57 46.3 1.096 32 0
751 1 121 78 39 74 39.0 0.261 28 0
760 2 88 58 26 16 28.4 0.766 22 0
763 10 101 76 48 180 32.9 0.171 63 0
765 5 121 72 23 112 26.2 0.245 30 0

262 rows × 9 columns

 

Graphical screening for variables

  • Now we will start the graphical analysis of Outcome. As the data is nominal (binary), we will run a count plot and compute percentages of samples that are diabetic and non-diabetic
In [37]:
## creating count plot with title using seaborn
sns.countplot(x=dia.Outcome)
plt.title("Count Plot for Outcome")
Out[37]:
Text(0.5, 1.0, 'Count Plot for Outcome')
 
In [38]:
# Computing the %age of diabetic and non-diabetic in the sample
Out1=len(dia[dia.Outcome==1])
Out0=len(dia[dia.Outcome==0])
Total=Out0+Out1
PC_of_1 = Out1*100/Total
PC_of_0 = Out0*100/Total
PC_of_1, PC_of_0
Out[38]:
(33.16326530612245, 66.83673469387755)
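The same percentages can be computed more idiomatically; a minimal sketch assuming the same dia frame:

## value_counts with normalize=True returns proportions; the 0/1 index labels the groups
dia.Outcome.value_counts(normalize=True) * 100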
 

Inference on screening Outcome variable

  • There are 33.2% 1’s (diabetic) and 66.8% 0’s (non-diabetic) in the data
  • As a next step, we will start screening variables
 

Graphical Screening for Variables

  • We will take each variable, one at a time and screen them in the following manner
  • Study the data distribution (histogram) of each variable – Central tendency, Spread, Distortion(Skewness & Kurtosis)
  • To visually screen the association between ‘Outcome’ and each variable by plotting histograms & Boxplots by Outcome value
 

Screening Variable – Pregnancies

In [40]:
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Pregnancies")
sns.distplot(dia.Pregnancies,kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Pregnancies,kde=False,color="Blue", label="Preg for Outcome=0")
sns.distplot(dia1.Pregnancies,kde=False,color = "Gold", label = "Preg for Outcome=1")
plt.title("Histograms for Preg by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Pregnancies)
plt.title("Boxplot for Preg by Outcome")
Out[40]:
Text(0.5, 1.0, 'Boxplot for Preg by Outcome')
 
 

Inference on Pregnancies

  • Visually, the data is right skewed, as expected for count-of-pregnancies data. A large proportion of the participants have a zero pregnancy count. As the data set includes women > 21 yrs, it’s likely that many are unmarried
  • Looking at the segmented histograms, a hypothesis is that as pregnancies increase, women are more likely to be diabetic
  • In the boxplots, we find a few outliers in both subsets; in particular, some non-diabetic women have had many pregnancies. I wouldn’t be worried.
  • To validate this hypothesis, we need to test it statistically.
 

Screening Variable – Glucose

In [41]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(dia.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Glucose)
plt.title("Boxplot for Glucose by Outcome")
Out[41]:
Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')
 
 

Inference on Glucose

  • 1st graph – The histogram of Glucose data is slightly skewed to the right. About a third of the data set is diabetic, and it’s likely that their higher glucose levels form the right tail. The grand mean of Glucose is 122.
  • 2nd graph – Clearly the diabetic group has higher glucose than the non-diabetic group.
  • 3rd graph – In the boxplot, the skewness visually seems acceptable (<2), and it’s also likely that the confidence intervals of the means do not overlap. So a hypothesis that Glucose is a predictor of the outcome is likely to be true, but it needs to be statistically tested.
 

Screening Variable – Blood Pressure

In [18]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BloodPressure, kde=False)
plt.title("Histogram for Blood Pressure")
plt.subplot(1,3,2)
sns.distplot(dia0.BloodPressure,kde=False,color="Gold",label="BP for Outcome=0")
sns.distplot(dia1.BloodPressure,kde=False, color="Blue", label="BP for Outcome=1")
plt.legend()
plt.title("Histogram of Blood Pressure by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.BloodPressure)
plt.title("Boxplot of BP by Outcome")
Out[18]:
Text(0.5, 1.0, 'Boxplot of BP by Outcome')
 
 

Inference on Blood Pressure

  • 1st graph – The distribution looks normal. The mean value is 69, well within the normal diastolic value of 80. One would expect this data to be normal, but as we don’t know whether the participants are on hypertensive medication, we can’t comment much.
  • 2nd graph – Most non-diabetic women seem to have a nominal value around 69, and diabetic women seem to have higher BP.
  • 3rd graph – There are a few outliers in the data. It’s likely that some people have low and some have high BP. So the association between diabetes (Outcome) and BP is a suspect and needs to be statistically validated.
 

Screening Variable – Skin Thickness

In [21]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.SkinThickness, kde=False)
plt.title("Histogram for Skin Thickness")
plt.subplot(1,3,2)
sns.distplot(dia0.SkinThickness, kde=False, color="Gold", label="SkinThick for Outcome=0")
sns.distplot(dia1.SkinThickness, kde=False, color="Blue", label="SkinThick for Outcome=1")
plt.legend()
plt.title("Histogram for SkinThickness by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.SkinThickness)
plt.title("Boxplot of SkinThickness by Outcome")
Out[21]:
Text(0.5, 1.0, 'Boxplot of SkinThickness by Outcome')
 
 

Inferences for Skin Thickness

  • 1st graph – Skin thickness seems to be a bit skewed.
  • 2nd graph – Like BP, people who are not diabetic have lower skin thickness. This is a hypothesis that has to be validated, as the data for non-diabetics is skewed while the diabetic samples seem to be normally distributed.
 

Screening Variable – Insulin

In [42]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Insulin,kde=False)
plt.title("Histogram of Insulin")
plt.subplot(1,3,2)
sns.distplot(dia0.Insulin,kde=False, color="Gold", label="Insulin for Outcome=0")
sns.distplot(dia1.Insulin,kde=False, color="Blue", label="Insulin for Outcome=1")
plt.title("Histogram for Insulin by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.Insulin)
plt.title("Boxplot for Insulin by Outcome")
Out[42]:
Text(0.5, 1.0, 'Boxplot for Insulin by Outcome')
 
 

Inference for Insulin

  • 2-hour serum insulin is expected to be between 16 and 166. Clearly there are outliers in the data. These outliers are a concern for us, and most of those with higher insulin values are also diabetic. So this is a suspect.
 

Screening Variable – BMI

In [23]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BMI, kde=False)
plt.title("Histogram for BMI")
plt.subplot(1,3,2)
sns.distplot(dia0.BMI, kde=False,color="Gold", label="BMI for Outcome=0")
sns.distplot(dia1.BMI, kde=False, color="Blue", label="BMI for Outcome=1")
plt.legend()
plt.title("Histogram for BMI by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.BMI)
plt.title("Boxplot for BMI by Outcome")
Out[23]:
Text(0.5, 1.0, 'Boxplot for BMI by Outcome')
 
 

Inference for BMI

  • 1st graph – There are a few outliers; some participants in the dataset are severely obese. The expected range is between 18 and 25, so in general the participants are overweight or obese
  • 2nd graph – Diabetic people seem to be only on the higher side of BMI. They also contribute more of the outliers
  • 3rd graph – Same inference as the 2nd graph
 

Screening Variable – Diabetes Pedigree Function

In [24]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.DiabetesPedigreeFunction,kde=False)
plt.title("Histogram for Diabetes Pedigree Function")
plt.subplot(1,3,2)
sns.distplot(dia0.DiabetesPedigreeFunction, kde=False, color="Gold", label="PedFunction for Outcome=0")
sns.distplot(dia1.DiabetesPedigreeFunction, kde=False, color="Blue", label="PedFunction for Outcome=1")
plt.legend()
plt.title("Histogram for DiabetesPedigreeFunction by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.DiabetesPedigreeFunction)
plt.title("Boxplot for DiabetesPedigreeFunction by Outcome")
Out[24]:
Text(0.5, 1.0, 'Boxplot for DiabetesPedigreeFunction by Outcome')
 
 

Inference of Diabetes Pedigree Function

  • I don’t know what this variable represents, but it doesn’t seem to contribute strongly to diabetes
  • The data is skewed. I don’t know if this parameter is expected to follow a normal distribution; not all natural parameters are normal
  • As DPF increases, there seems to be a higher likelihood of being diabetic, but this needs statistical validation
 

Screening Variable – Age

In [25]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Age,kde=False)
plt.title("Histogram for Age")
plt.subplot(1,3,2)
sns.distplot(dia0.Age,kde=False,color="Gold", label="Age for Outcome=0")
sns.distplot(dia1.Age,kde=False, color="Blue", label="Age for Outcome=1")
plt.legend()
plt.title("Histogram for Age by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Age)
plt.title("Boxplot for Age by Outcome")
Out[25]:
Text(0.5, 1.0, 'Boxplot for Age by Outcome')
 
 

Inference for Age

  • Age is skewed. As this is life data, it is likely to follow a Weibull distribution rather than a normal one
  • There is a tendency that as people age, they are more likely to become diabetic. This needs statistical validation
  • But diabetes itself doesn’t seem to influence longevity. Maybe it impacts quality of life, which is not measured in this data set.
 

Normality Test

Inference: None of the variables is normally distributed (p < 0.05 for all, so normality is rejected). Maybe the Outcome subsets are normal (see the subset check after the printed p-values below).

In [43]:
## importing stats module from scipy
from scipy import stats
## retrieving p value from normality test function
PregnanciesPVAL=stats.normaltest(dia.Pregnancies).pvalue
GlucosePVAL=stats.normaltest(dia.Glucose).pvalue
BloodPressurePVAL=stats.normaltest(dia.BloodPressure).pvalue
SkinThicknessPVAL=stats.normaltest(dia.SkinThickness).pvalue
InsulinPVAL=stats.normaltest(dia.Insulin).pvalue
BMIPVAL=stats.normaltest(dia.BMI).pvalue
DiaPeFuPVAL=stats.normaltest(dia.DiabetesPedigreeFunction).pvalue
AgePVAL=stats.normaltest(dia.Age).pvalue
## Printing the values
print("Pregnancies P Value is " + str(PregnanciesPVAL))
print("Glucose P Value is " + str(GlucosePVAL))
print("BloodPressure P Value is " + str(BloodPressurePVAL))
print("Skin Thickness P Value is " + str(SkinThicknessPVAL))
print("Insulin P Value is " + str(InsulinPVAL))
print("BMI P Value is " + str(BMIPVAL))
print("Diabetes Pedigree Function P Value is " + str(DiaPeFuPVAL))
print("Age P Value is " + str(AgePVAL))
 
Pregnancies P Value is 6.155097831782508e-20
Glucose P Value is 1.3277887088487345e-05
BloodPressure P Value is 0.030164917115239397
Skin Thickness P Value is 0.01548332935449814
Insulin P Value is 8.847272035922274e-43
BMI P Value is 1.4285556992424915e-09
Diabetes Pedigree Function P Value is 1.1325395699626466e-39
Age P Value is 1.0358469089881947e-21
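To probe the “maybe the subsets are normal” idea, the same test can be re-run on the Outcome subsets; a sketch for one variable (not part of the original run):

## Same normaltest API, applied to the dia0/dia1 subsets
print("Glucose (Outcome=0) P Value is " + str(stats.normaltest(dia0.Glucose).pvalue))
print("Glucose (Outcome=1) P Value is " + str(stats.normaltest(dia1.Glucose).pvalue))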
 

Screening of Association between Variables to study Bivariate relationship

  • We will use pairplot to study the association between variables – from individual scatter plots
  • Then we will compute Pearson correlation coefficients
  • Then we will summarize the same as heatmap
In [49]:
sns.pairplot(dia, vars=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin", "BMI","DiabetesPedigreeFunction", "Age"],hue="Outcome")
plt.title("Pairplot of Variables by Outcome")
Out[49]:
Text(0.5, 1.0, 'Pairplot of Variables by Outcome')
 
 

Inference from Pair Plots

  • From the scatter plots, to me only BMI & SkinThickness and Pregnancies & Age seem to have positive linear relationships. Another likely suspect is Glucose and Insulin.
  • There are no obvious non-linear relationships
  • Let’s check it out with Pearson correlation and plot heat maps
In [54]:
cor = dia.corr(method='pearson')
cor
Out[54]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000000 0.198291 0.213355 0.093209 0.078984 -0.025347 0.007562 0.679608 0.256566
Glucose 0.198291 1.000000 0.210027 0.198856 0.581223 0.209516 0.140180 0.343641 0.515703
BloodPressure 0.213355 0.210027 1.000000 0.232571 0.098512 0.304403 -0.015971 0.300039 0.192673
SkinThickness 0.093209 0.198856 0.232571 1.000000 0.182199 0.664355 0.160499 0.167761 0.255936
Insulin 0.078984 0.581223 0.098512 0.182199 1.000000 0.226397 0.135906 0.217082 0.301429
BMI -0.025347 0.209516 0.304403 0.664355 0.226397 1.000000 0.158771 0.069814 0.270118
DiabetesPedigreeFunction 0.007562 0.140180 -0.015971 0.160499 0.135906 0.158771 1.000000 0.085029 0.209330
Age 0.679608 0.343641 0.300039 0.167761 0.217082 0.069814 0.085029 1.000000 0.350804
Outcome 0.256566 0.515703 0.192673 0.255936 0.301429 0.270118 0.209330 0.350804 1.000000
In [55]:
sns.heatmap(cor)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0xc4a9350>
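A more readable variant of the heat map (a sketch; annot, fmt and cmap are standard seaborn parameters) annotates each cell with its r value:

## Annotated heat map: print each correlation to 2 decimals inside its cell
plt.figure(figsize=(10, 8))
sns.heatmap(cor, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Heatmap of Pearson Correlations")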
 
 

Inference from ‘r’ values and heat map

  • No two factors have a strong linear relationship
  • Age & Pregnancies and BMI & SkinThickness have a moderate positive linear relationship
  • Glucose & Insulin technically has a low correlation, but 0.58 is close to 0.6 and can be treated as moderate
 

Final Inference before model building

  • The data set contains many zero values; they have been removed and the remaining data has been used for screening and model building
  • Nearly 66% of participants in the sample are non-diabetic; about 33% are diabetic
  • Visual screening (boxplots and segmented histograms) shows that a few factors seem to influence the outcome
  • Moderate correlation exists between a few factors, so this has to be borne in mind while building the model. If co-correlated factors are included, it might lead to variance inflation.

  • As a next step, a binary logistic regression model has been built
 

Logistic Regression

  • A logistic regression is used when the dependent variable is binary, ordinal or nominal and the independent variables are either continuous or discrete
  • In this scenario, a Logit model has been used to fit the data
  • In this case an event is defined as the occurrence of ‘1’ in Outcome
  • Basically, logistic regression uses the odds ratio to build the model (an odds-ratio sketch follows the regression output below)
In [5]:
cols=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin", "BMI","DiabetesPedigreeFunction", "Age"]
X=dia[cols]
y=dia.Outcome
In [7]:
## Importing stats models for running logistic regression
import statsmodels.api as sm
## Defining the model and assigning Y (Dependent) and X (Independent Variables)
logit_model=sm.Logit(y,X)
## Fitting the model and publishing the results
result=logit_model.fit()
print(result.summary())
 
Optimization terminated successfully.
         Current function value: 0.563677
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      384
Method:                           MLE   Df Model:                            7
Date:                Thu, 09 May 2019   Pseudo R-squ.:                  0.1128
Time:                        15:31:51   Log-Likelihood:                -220.96
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 8.717e-10
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Pregnancies                  0.1299      0.049      2.655      0.008       0.034       0.226
Glucose                      0.0174      0.005      3.765      0.000       0.008       0.026
BloodPressure               -0.0484      0.009     -5.123      0.000      -0.067      -0.030
SkinThickness                0.0284      0.015      1.898      0.058      -0.001       0.058
Insulin                      0.0019      0.001      1.598      0.110      -0.000       0.004
BMI                         -0.0365      0.022     -1.669      0.095      -0.079       0.006
DiabetesPedigreeFunction     0.4636      0.344      1.347      0.178      -0.211       1.138
Age                          0.0005      0.016      0.031      0.976      -0.031       0.032
============================================================================================
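Since logistic regression works on the odds scale, the fitted coefficients can be read as odds ratios by exponentiating them; a minimal sketch using the result object above:

## exp(coef) = multiplicative change in odds per unit increase of each variable
print(np.exp(result.params))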
 

Inference from the Logistic Regression

  • The pseudo R-squared of the model is 0.11, i.e., this model explains about 11% of the variation in the dependent variable (the “current function value” of 0.56 printed above is the optimizer’s loss, not an R-squared)
  • To identify which variables influence the outcome, we look at the p-value of each variable. We expect the p-value to be less than 0.05 (the alpha risk)
  • When p-value < 0.05, we can say the variable influences the outcome
  • Hence we will eliminate DiabetesPedigreeFunction, Age and Insulin and re-run the model
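One caveat worth noting: sm.Logit does not add an intercept automatically, so the models in this article are fit without a constant term. A sketch of the same fit with an intercept, for comparison (the coefficients and p-values would change):

## add_constant prepends a 'const' column of ones to X
X_const = sm.add_constant(X)
result_c = sm.Logit(y, X_const).fit()
print(result_c.summary())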
 

2nd iteration of the Logistic Regression with fewer variables

In [76]:
cols2=["Pregnancies", "Glucose","BloodPressure","SkinThickness","BMI"]
X=dia[cols2]
In [77]:
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
 
Optimization terminated successfully.
         Current function value: 0.569365
         Iterations 5
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.104     
Dependent Variable: Outcome          AIC:              456.3820  
Date:               2019-05-05 22:48 BIC:              476.2383  
No. Observations:   392              Log-Likelihood:   -223.19   
Df Model:           4                LL-Null:          -249.05   
Df Residuals:       387              LLR p-value:      1.5817e-10
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     5.0000                                       
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
Pregnancies        0.1291   0.0374  3.4489 0.0006  0.0557  0.2024
Glucose            0.0215   0.0040  5.4447 0.0000  0.0138  0.0293
BloodPressure     -0.0507   0.0089 -5.6868 0.0000 -0.0682 -0.0332
SkinThickness      0.0299   0.0149  2.0073 0.0447  0.0007  0.0592
BMI               -0.0313   0.0215 -1.4537 0.1460 -0.0734  0.0109
=================================================================

 

Inference from 2nd Iteration

  • BMI’s p-value (0.146) is now greater than 0.05, so we will eliminate BMI and re-run the model
 

3rd iteration of Logistic Regression

In [78]:
cols3=["Pregnancies", "Glucose","BloodPressure","SkinThickness"]
X=dia[cols3]
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())
 
Optimization terminated successfully.
         Current function value: 0.572076
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      388
Method:                           MLE   Df Model:                            3
Date:                Sun, 05 May 2019   Pseudo R-squ.:                 0.09956
Time:                        22:49:35   Log-Likelihood:                -224.25
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 9.769e-11
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Pregnancies       0.1403      0.037      3.820      0.000       0.068       0.212
Glucose           0.0199      0.004      5.297      0.000       0.013       0.027
BloodPressure    -0.0571      0.008     -7.242      0.000      -0.073      -0.042
SkinThickness     0.0160      0.011      1.404      0.160      -0.006       0.038
=================================================================================
 

Inference from 3rd Iteration

  • Now the p-value of SkinThickness (0.160) is greater than 0.05, hence we will eliminate it and re-run the model
 

4th Iteration of Logistic Regression

In [79]:
cols4=["Pregnancies", "Glucose","BloodPressure"]
X=dia[cols4]
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())
 
Optimization terminated successfully.
         Current function value: 0.574607
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                Outcome   No. Observations:                  392
Model:                          Logit   Df Residuals:                      389
Method:                           MLE   Df Model:                            2
Date:                Sun, 05 May 2019   Pseudo R-squ.:                 0.09558
Time:                        22:49:53   Log-Likelihood:                -225.25
converged:                       True   LL-Null:                       -249.05
                                        LLR p-value:                 4.597e-11
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Pregnancies       0.1405      0.037      3.826      0.000       0.069       0.212
Glucose           0.0210      0.004      5.709      0.000       0.014       0.028
BloodPressure    -0.0525      0.007     -7.449      0.000      -0.066      -0.039
=================================================================================
 

Inference from 4th Run

  • Now the model is clear. We have 3 variables that influence the Outcome, and they are Pregnancies, Glucose and BloodPressure
  • Luckily, none of these 3 variables are co-correlated. Hence we can safely assume that the model is not inflated (a quick VIF check is sketched below)
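One way to back the no-inflation claim with a number (a sketch, not part of the original run) is to compute variance inflation factors; values below 5 are commonly read as acceptable:

## VIF per column of the 3-variable design matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
X3 = dia[["Pregnancies", "Glucose", "BloodPressure"]]
print([round(variance_inflation_factor(X3.values, i), 2) for i in range(X3.shape[1])])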
In [34]:
## Importing LogisticRegression from the scikit-learn linear model, as the statsmodels function cannot give us a classification report and confusion matrix
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
cols4=["Pregnancies", "Glucose","BloodPressure"]
X=dia[cols4]
y=dia.Outcome
logreg.fit(X,y)
## Defining y_pred for the predicted values. I have scored on the same 392-row dia dataset; we can also use a separate test dataset (a held-out sketch follows this cell's output)
y_pred=logreg.predict(X)
## Calculating the precision of the model
from sklearn.metrics import classification_report
print(classification_report(y,y_pred))
 
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       262
           1       0.71      0.53      0.61       130

   micro avg       0.77      0.77      0.77       392
   macro avg       0.75      0.71      0.72       392
weighted avg       0.77      0.77      0.76       392

 
C:\Users\Neil\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
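As the comment above notes, a separate test dataset could be used instead of scoring on the training data; a held-out sketch (train_test_split and the solver argument are standard scikit-learn API; the 70/30 split is an assumption):

## Hold out 30% of the rows and evaluate on unseen data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg_ho = LogisticRegression(solver="liblinear")
logreg_ho.fit(X_train, y_train)
print(classification_report(y_test, logreg_ho.predict(X_test)))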
 

The weighted average precision of the model is 77%

In [35]:
from sklearn.metrics import confusion_matrix
## The confusion matrix gives the number of cases where the model accurately predicts the outcomes (both 1 and 0) and how many false positives and false negatives it produces
cm = confusion_matrix(y, y_pred)  ## avoid shadowing the imported function name
print(cm)
 
[[234  28]
 [ 61  69]]
 

The result tells us that 234 + 69 = 303 predictions are correct and 61 + 28 = 89 predictions are incorrect.
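The implied accuracy can be checked directly; a minimal sketch using the predictions above:

## (234 + 69) / 392 ≈ 0.773, matching sklearn's accuracy_score
from sklearn.metrics import accuracy_score
print(accuracy_score(y, y_pred))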

 

 
