Pima Diabetes Data Analytics – Neil¶

Here is the set of analyses that have been run on this data set:

  • Data Cleaning to remove zeros
  • Data Exploration for Y and Xs
  • Descriptive Statistics – Numerical Summary and Graphical (Histograms) for all variables
  • Screening of variables by segmenting them by Outcome
  • Check for normality of dataset
  • Study bivariate relationship between variables using pair plots, correlation and heat map
  • Statistical screening using Logistic Regression
  • Validation of the model, its precision, and plotting of the confusion matrix

Importing necessary packages¶

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
%matplotlib inline

Importing the Diabetes CSV data file¶

  • Import the data and test if all the columns are loaded
  • The Data frame has been assigned a name of ‘diab’
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
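The import cell itself is not shown above. A minimal sketch of the load-and-check step could look like the following, assuming pandas' `read_csv`. The in-memory sample rows stand in for the real CSV file, whose path is not given here, and `expected_cols` and `missing_cols` are names introduced only for illustration.

```python
import io
import pandas as pd

# Illustrative rows standing in for the real CSV file (its path is not shown here)
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)
diab = pd.read_csv(sample_csv)

# Test that all nine columns were loaded
expected_cols = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]
missing_cols = [c for c in expected_cols if c not in diab.columns]

print(diab.head())
print("Missing columns:", missing_cols)
```

With the real file, the `io.StringIO` object would simply be replaced by the path to the CSV.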
About data set¶

In this data set, Outcome is the dependent variable and the remaining 8 variables are independent variables.


Finding if there are any null and Zero values in the data set¶

## To check if data contains null values


  • Data frame doesn’t have any NAN values
  • As a next step, we will do preliminary screening of descriptive stats for the dataset
## To run numerical descriptive stats for the data set
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
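The code for the two checks above is not shown. As a hedged sketch, they can be done with `isnull()` and `describe()`. The `toy` frame is made up for illustration and deliberately contains one NaN and one zero, unlike the real data, which had no NaN values.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, np.nan],
})

# True if any cell in the frame is NaN
has_nulls = toy.isnull().values.any()
# Number of NaN cells per column
nulls_per_column = toy.isnull().sum()
# Numerical summary: count, mean, std, min, quartiles, max
summary = toy.describe()

print(has_nulls)
print(nulls_per_column)
print(summary.loc[["min", "max"]])
```

A minimum of 0 in the summary for a column such as Glucose is the kind of signal that prompts the zero check that follows.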
Inference at this point¶

  • Minimum values for many variables are 0.
  • As biological parameters like Glucose, BP, Skin thickness, Insulin & BMI cannot have zero values, it looks like null values have been coded as zeros
  • As a next step, find out how many Zero values are included in each variable
## Counting cells with 0 values for each variable and publishing the counts below
(diab.Pregnancies == 0).sum(),(diab.Glucose==0).sum(),(diab.BloodPressure==0).sum(),(diab.SkinThickness==0).sum(),(diab.Insulin==0).sum(),(diab.BMI==0).sum(),(diab.DiabetesPedigreeFunction==0).sum(),(diab.Age==0).sum()
(111, 5, 35, 227, 374, 11, 0, 0)
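The same count can be written more compactly by comparing the whole frame to zero at once. This is only a sketch on a made-up `toy` frame. On the real data frame the expression would be `(diab == 0).sum()`, which also counts the zeros in Outcome.

```python
import pandas as pd

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0, 5],
    "Glucose": [148, 0, 85, 89],
    "Insulin": [0, 0, 94, 0],
})

# Boolean frame of "cell equals 0", summed column by column
zero_counts = (toy == 0).sum()
print(zero_counts)
```

The result is a Series indexed by column name, so each count stays labelled instead of relying on the position in a tuple.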


  • As the zero counts of some of the variables are as high as 374 and 227 in a data set of 768 rows, it is better to remove the zeros uniformly for 5 variables (excluding Pregnancies & Outcome, where zero is a valid value)
  • As a next step, we’ll drop the rows with 0 values and create a new dataset which can be used for further analysis
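## Creating a dataset called 'dia' from the original dataset 'diab' that excludes all rows with zeros in Glucose, BP, SkinThickness, Insulin or BMI; other columns can legitimately contain zero values.
drop_Glu = diab.index[diab.Glucose == 0].tolist()
drop_BP = diab.index[diab.BloodPressure == 0].tolist()
drop_Skin = diab.index[diab.SkinThickness == 0].tolist()
drop_Ins = diab.index[diab.Insulin == 0].tolist()
drop_BMI = diab.index[diab.BMI == 0].tolist()
## Combining the row labels (a row can appear in several lists), dropping those rows and inspecting the result
c = sorted(set(drop_Glu + drop_BP + drop_Skin + drop_Ins + drop_BMI))
dia = diab.drop(index=c)
dia.info()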
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
Pregnancies                 392 non-null int64
Glucose                     392 non-null int64
BloodPressure               392 non-null int64
SkinThickness               392 non-null int64
Insulin                     392 non-null int64
BMI                         392 non-null float64
DiabetesPedigreeFunction    392 non-null float64
Age                         392 non-null int64
Outcome                     392 non-null int64
dtypes: float64(2), int64(7)
memory usage: 30.6 KB


  • As shown above, we created a cleaned-up data frame titled “dia” which has 392 rows of data instead of the 768 in the original
  • We lost nearly 50% of the data, but our data set is now cleaner than before
  • In fact, the removed rows can be used for testing during modeling, so we haven’t really lost them completely.
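An alternative way to express the same cleaning, shown here only as a sketch on a made-up `toy` frame, is to mark the impossible zeros as NaN and let `dropna` remove the rows. The names `cols_no_zero`, `marked`, `clean` and `removed` are introduced for illustration. The `removed` frame keeps the dropped rows, in line with the idea of reusing them later.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Pregnancies": [1, 2, 0, 3],
    "Glucose": [148, 0, 85, 89],
    "Insulin": [0, 130, 94, 168],
    "Outcome": [1, 0, 0, 1],
})

# Columns where a zero really means "not measured"
cols_no_zero = ["Glucose", "Insulin"]

# Mark those zeros as missing, then drop every row with a missing value
marked = toy.copy()
marked[cols_no_zero] = marked[cols_no_zero].replace(0, np.nan)
clean = marked.dropna(subset=cols_no_zero)

# Rows that were dropped, kept aside with their original values
removed = toy.loc[toy.index.difference(clean.index)]

print(clean)
print(removed)
```

Zeros in Pregnancies and Outcome are left alone because those columns are not listed in `cols_no_zero`.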

Performing Preliminary Descriptive Stats on the Data set¶

  • Performing 5 number summary
  • Usually, the first thing to do with a data set is to get the hang of the vital parameters of all variables, and thus understand a little about the data set, such as its central tendency and dispersion
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Split the data frame into two sub sets for convenience of analysis¶

  • As we wish to study the influence of each variable on Outcome (Diabetic or not), we can subset the data by Outcome
  • dia1 Subset : All samples with 1 values of Outcome
  • dia0 Subset: All samples with 0 values of Outcome
dia1 = dia[dia.Outcome==1]
dia0 = dia[dia.Outcome==0]
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
130 rows × 9 columns

  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
262 rows × 9 columns


Graphical screening for variables¶

  • Now we will start the graphical analysis of Outcome. As the data is nominal (binary), we will run a count plot and compute the percentages of samples that are diabetic and non-diabetic
## creating count plot with title using seaborn
sns.countplot(x=dia.Outcome)
plt.title("Count Plot for Outcome")
# Computing the %age of diabetic and non-diabetic in the sample
Out0 = len(dia0)
Out1 = len(dia1)
Total = Out0 + Out1
PC_of_1 = Out1*100/Total
PC_of_0 = Out0*100/Total
PC_of_1, PC_of_0
(33.16326530612245, 66.83673469387755)

Inference on screening Outcome variable¶

  • There are 33.2% 1’s (diabetic) and 66.8% 0’s (non-diabetic) in the data: 130 of the 392 rows have Outcome = 1 and 262 have Outcome = 0
  • As a next step, we will start screening variables
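The class shares can also be read off directly with `value_counts(normalize=True)`. The sketch below rebuilds the Outcome column from the subset sizes reported earlier (130 rows with Outcome = 1 and 262 rows with Outcome = 0) rather than from the real data frame.

```python
import pandas as pd

# Outcome labels rebuilt from the reported subset sizes (130 ones, 262 zeros)
outcome = pd.Series([1] * 130 + [0] * 262, name="Outcome")

# Share of each class, as a percentage of all rows
pct = outcome.value_counts(normalize=True) * 100
print(pct.round(1))
```

On the cleaned data frame the equivalent call would be `dia.Outcome.value_counts(normalize=True) * 100`.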

Graphical Screening for Variables¶

  • We will take each variable, one at a time, and screen it in the following manner
  • Study the data distribution (histogram) of each variable – Central tendency, Spread, Distortion (Skewness & Kurtosis)
  • Visually screen the association between ‘Outcome’ and each variable by plotting histograms & boxplots by Outcome value

Screening Variable – Pregnancies¶

## Creating 3 subplots: 1st for the histogram, 2nd for the histogram segmented by Outcome, and 3rd for the same segmentation as a boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
plt.title("Histogram for Pregnancies")
sns.distplot(dia.Pregnancies, kde=False)
plt.subplot(1, 3, 2)
sns.distplot(dia0.Pregnancies, kde=False, color="Blue", label="Preg for Outcome=0")
sns.distplot(dia1.Pregnancies, kde=False, color="Gold", label="Preg for Outcome=1")
plt.title("Histograms for Preg by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.Pregnancies)
plt.title("Boxplot for Preg by Outcome")
Text(0.5, 1.0, 'Boxplot for Preg by Outcome')

Inference on Pregnancies¶

  • Visually, the data is right skewed, as is typical for count data such as the number of pregnancies. A large proportion of the participants have had zero pregnancies. As the data set includes only women aged 21 and over, it is likely that many are unmarried
  • Looking at the segmented histograms, a hypothesis is that as the number of pregnancies increases, women are more likely to be diabetic
  • In the boxplots, we find a few outliers in both subsets. In particular, some non-diabetic women have had many pregnancies. I wouldn’t be worried.
  • To validate this hypothesis, we need to test it statistically.
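One possible way to test such a hypothesis on skewed count data is a Mann-Whitney U test. This is a sketch of that technique, not the test used in this analysis. The two samples below are synthetic Poisson draws with made-up rates, sized like the two subsets, and the names `preg_0` and `preg_1` are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic pregnancy counts (illustrative only, not the real data)
rng = np.random.default_rng(0)
preg_0 = rng.poisson(lam=3.0, size=262)   # stand-in for Outcome = 0
preg_1 = rng.poisson(lam=4.5, size=130)   # stand-in for Outcome = 1

# One-sided test: do values in preg_1 tend to be larger than in preg_0?
result = stats.mannwhitneyu(preg_1, preg_0, alternative="greater")
print("U statistic:", result.statistic)
print("P value:", result.pvalue)
```

On the real data the two inputs would be `dia1.Pregnancies` and `dia0.Pregnancies`. A P value below 0.05 would support the hypothesis.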

Screening Variable – Glucose¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
plt.title("Histogram for Glucose")
sns.distplot(dia.Glucose, kde=False)
plt.subplot(1, 3, 2)
sns.distplot(dia0.Glucose, kde=False, color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label="Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.Glucose)
plt.title("Boxplot for Glucose by Outcome")
Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')

Inference on Glucose¶

  • 1st graph – Histogram of Glucose data is slightly skewed to the right. About a third of the cleaned data set (130 of 392 rows) is diabetic, and it is likely that their Glucose levels were higher, which would stretch the right tail. The grand mean of Glucose is 122.
  • 2nd graph – Clearly, the diabetic group has higher glucose than the non-diabetic group.
  • 3rd graph – In the boxplot, skewness visually seems acceptable (<2), and it is also likely that the confidence intervals of the means do not overlap. So the hypothesis that Glucose is a measure of Outcome is likely to be true, but it needs to be statistically tested.
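Whether the confidence intervals of the two means overlap can be checked numerically. The sketch below uses a normal-approximation 95% interval (mean ± 1.96 standard errors) on synthetic samples. The helper `mean_ci95` and the sample parameters are assumptions made for illustration, not values taken from the data.

```python
import numpy as np

def mean_ci95(values):
    """Return the normal-approximation 95% confidence interval of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(values.size)  # standard error of the mean
    return mean - 1.96 * sem, mean + 1.96 * sem

# Synthetic glucose readings (illustrative only, not the real data)
rng = np.random.default_rng(1)
gluc_0 = rng.normal(110, 25, size=262)   # stand-in for Outcome = 0
gluc_1 = rng.normal(145, 30, size=130)   # stand-in for Outcome = 1

ci_0 = mean_ci95(gluc_0)
ci_1 = mean_ci95(gluc_1)

# Two intervals overlap if each one starts before the other ends
overlap = ci_0[1] >= ci_1[0] and ci_1[1] >= ci_0[0]
print("CI for Outcome=0:", ci_0)
print("CI for Outcome=1:", ci_1)
print("Intervals overlap:", overlap)
```

On the real data the inputs would be `dia0.Glucose` and `dia1.Glucose`. Non-overlapping intervals are a quick visual-style check, not a replacement for a formal test.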

Screening Variable – Blood Pressure¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.BloodPressure, kde=False)
plt.title("Histogram for Blood Pressure")
plt.subplot(1, 3, 2)
sns.distplot(dia0.BloodPressure, kde=False, color="Gold", label="BP for Outcome=0")
sns.distplot(dia1.BloodPressure, kde=False, color="Blue", label="BP for Outcome=1")
plt.title("Histogram of Blood Pressure by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.BloodPressure)
plt.title("Boxplot of BP by Outcome")
Text(0.5, 1.0, 'Boxplot of BP by Outcome')

Inference on Blood Pressure¶

  • 1st graph – Distribution looks normal. Mean value is 69, well within the normal diastolic limit of 80. One should expect this data to be normal, but as we don’t know whether the participants are on antihypertensive medication, we can’t comment much.
  • 2nd graph – Most non-diabetic women seem to have a nominal value of 69, and diabetic women seem to have higher BP.
  • 3rd graph – There are a few outliers in the data. It is likely that some people have low and some have high BP. So the association between diabetes (Outcome) and BP is suspect and needs to be statistically validated.

Screening Variable – Skin Thickness¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.SkinThickness, kde=False)
plt.title("Histogram for Skin Thickness")
plt.subplot(1, 3, 2)
sns.distplot(dia0.SkinThickness, kde=False, color="Gold", label="SkinThick for Outcome=0")
sns.distplot(dia1.SkinThickness, kde=False, color="Blue", label="SkinThick for Outcome=1")
plt.title("Histogram for SkinThickness by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.SkinThickness)
plt.title("Boxplot of SkinThickness by Outcome")
Text(0.5, 1.0, 'Boxplot of SkinThickness by Outcome')

Inferences for Skin Thickness¶

  • 1st graph – Skin thickness seems to be skewed a bit.
  • 2nd graph – As with BP, people who are not diabetic have lower skin thickness. This is a hypothesis that has to be validated, as the non-diabetic data is skewed while the diabetic samples seem to be normally distributed.

Screening Variable – Insulin¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.Insulin, kde=False)
plt.title("Histogram of Insulin")
plt.subplot(1, 3, 2)
sns.distplot(dia0.Insulin, kde=False, color="Gold", label="Insulin for Outcome=0")
sns.distplot(dia1.Insulin, kde=False, color="Blue", label="Insulin for Outcome=1")
plt.title("Histogram for Insulin by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.Insulin)
plt.title("Boxplot for Insulin by Outcome")
Text(0.5, 1.0, 'Boxplot for Insulin by Outcome')

Inference for Insulin¶

  • 2-hour serum insulin is expected to be between 16 and 166. Clearly there are outliers in the data. These outliers are a concern for us, and most of the participants with higher insulin values are also diabetic. So this variable is suspect.
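The outliers seen in a boxplot can be listed explicitly with the 1.5 × IQR rule, which is the default whisker rule for seaborn boxplots. The sketch below applies it to a small made-up `insulin` Series. The values and variable names are illustrative only.

```python
import pandas as pd

# Made-up insulin readings for illustration only (not the real data)
insulin = pd.Series([16, 50, 80, 100, 120, 140, 166, 846], name="Insulin")

# Interquartile range and the two fences of the 1.5 x IQR rule
q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = insulin[(insulin < lower_fence) | (insulin > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print(outliers)
```

On the real data, running this on `dia.Insulin` and cross-tabulating the flagged rows against Outcome would show how many of the outliers are diabetic.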

Screening Variable – BMI¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.BMI, kde=False)
plt.title("Histogram for BMI")
plt.subplot(1, 3, 2)
sns.distplot(dia0.BMI, kde=False, color="Gold", label="BMI for Outcome=0")
sns.distplot(dia1.BMI, kde=False, color="Blue", label="BMI for Outcome=1")
plt.title("Histogram for BMI by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.BMI)
plt.title("Boxplot for BMI by Outcome")
Text(0.5, 1.0, 'Boxplot for BMI by Outcome')

Inference for BMI¶

  • 1st graph – There are a few outliers with very high BMI. The expected range is between 18 and 25, so in general the participants are overweight or obese
  • 2nd graph – Diabetic people seem to be only on the higher side of BMI. They also contribute more of the outliers
  • 3rd graph – Same inference as 2nd graph

Screening Variable – Diabetes Pedigree Function¶

plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.DiabetesPedigreeFunction, kde=False)
plt.title("Histogram for Diabetes Pedigree Function")
plt.subplot(1, 3, 2)
sns.distplot(dia0.DiabetesPedigreeFunction, kde=False, color="Gold", label="PedFunction for Outcome=0")
sns.distplot(dia1.DiabetesPedigreeFunction, kde=False, color="Blue", label="PedFunction for Outcome=1")
plt.title("Histogram for DiabetesPedigreeFunction by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.DiabetesPedigreeFunction)
plt.title("Boxplot for DiabetesPedigreeFunction by Outcome")
Text(0.5, 1.0, 'Boxplot for DiabetesPedigreeFunction by Outcome')

Inference of Diabetes Pedigree Function¶

  • I don’t know what this variable is, but at first glance it doesn’t seem to contribute to diabetes
  • Data is skewed. I don’t know if this parameter is expected to follow a normal distribution. Not all natural parameters are normal
  • As DPF increases, there seems to be a higher likelihood of being diabetic, but this needs statistical validation

Screening Variable – Age¶

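plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.distplot(dia.Age, kde=False)
plt.title("Histogram for Age")
plt.subplot(1, 3, 2)
sns.distplot(dia0.Age, kde=False, color="Gold", label="Age for Outcome=0")
sns.distplot(dia1.Age, kde=False, color="Blue", label="Age for Outcome=1")
plt.title("Histogram for Age by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=dia.Outcome, y=dia.Age)
plt.title("Boxplot for Age by Outcome")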
Text(0.5, 1.0, 'Boxplot for Age by Outcome')

Inference for Age¶

  • Age is skewed. As this is life data, it is likely to follow a Weibull distribution rather than a normal one
  • There is a tendency that as people age, they are more likely to become diabetic. This needs statistical validation
  • But diabetes itself doesn’t seem to have an influence on longevity. Maybe it impacts quality of life, which is not measured in this data set.

Normality Test¶

Inference: None of the variables are normal (P < 0.05 for every variable). Maybe the subsets are normal.

## importing stats module from scipy
from scipy import stats
## retrieving p value from normality test function
## Printing the values
print("Pregnancies P Value is " + str(PregnanciesPVAL))
print("Glucose P Value is " + str(GlucosePVAL))
print("BloodPressure P Value is " + str(BloodPressurePVAL))
print("Skin Thickness P Value is " + str(SkinThicknessPVAL))
print("Insulin P Value is " + str(InsulinPVAL))
print("BMI P Value is " + str(BMIPVAL))
print("Diabetes Pedigree Function P Value is " + str(DiaPeFuPVAL))
print("Age P Value is " + str(AgePVAL))
Pregnancies P Value is 6.155097831782508e-20
Glucose P Value is 1.3277887088487345e-05
BloodPressure P Value is 0.030164917115239397
Skin Thickness P Value is 0.01548332935449814
Insulin P Value is 8.847272035922274e-43
BMI P Value is 1.4285556992424915e-09
Diabetes Pedigree Function P Value is 1.1325395699626466e-39
Age P Value is 1.0358469089881947e-21
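The lines that compute the P values are not shown above, so the exact test is unknown. As a hedged sketch, one common choice is the D'Agostino-Pearson test, `scipy.stats.normaltest`, where a P value below 0.05 rejects normality. The skewed sample below is synthetic and only illustrates the call.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (illustrative only, not the real data)
rng = np.random.default_rng(2)
skewed = rng.exponential(scale=100.0, size=392)

# Null hypothesis: the sample comes from a normal distribution
stat, pval = stats.normaltest(skewed)
is_normal = pval > 0.05

print("Test statistic:", stat)
print("P value:", pval)
print("Looks normal at the 5% level:", is_normal)
```

Under this assumption, a line such as `InsulinPVAL = stats.normaltest(dia.Insulin)[1]` would produce the kind of value printed above for each variable.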

Screening of Association between Variables to study Bivariate relationship¶

  • We will use pairplot to study the association between variables – from individual scatter plots
  • Then we will compute the Pearson correlation coefficient
  • Then we will summarize the same as heatmap
sns.pairplot(dia, vars=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin", "BMI","DiabetesPedigreeFunction", "Age"],hue="Outcome")
plt.title("Pairplot of Variables by Outcome")
Text(0.5, 1.0, 'Pairplot of Variables by Outcome')
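The correlation step listed above can be sketched with `DataFrame.corr`. The `toy` frame below is synthetic, with Pregnancies deliberately built from Age so that one strong correlation shows up. None of its values come from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration only (not the real data)
rng = np.random.default_rng(3)
age = rng.integers(21, 60, size=200)
toy = pd.DataFrame({
    "Age": age,
    # Built from Age on purpose, so the two columns are strongly correlated
    "Pregnancies": (age - 21) // 5 + rng.integers(0, 2, size=200),
    "BMI": rng.normal(33, 7, size=200),
})

# Pairwise Pearson correlation coefficients between all columns
corr = toy.corr(method="pearson")
print(corr.round(2))
```

On the real data the call would be `dia.corr()`, and the resulting matrix can be passed to `sns.heatmap(corr, annot=True)` to summarize it as a heat map.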