Full Lean Six Sigma Case Solution In Python

Clear Calls Case

  • Review content provided in PDF to understand the case study background.
  • This is an end to end case study. Some parts of the case study will be solved in XL and PPT. Statistical concepts are solved in Python below.
  • Refer to case Questions to understand solutions

Copyright 2012 @ Canopus Business Management Group. All Rights Reserved This case study is created by Canopus Business Management Group for indiviual use and learning purpose. Not be reproduced for any other purpose. Refer to www.collaborat.com for terms of use

Author : Nilakantasrinivasan (Neil) neil@collaborat.com

In [66]:
# Importing basic packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
%matplotlib inline
In [67]:
# Importing the data file for the Clear Calls Case Study
CS=pd.read_excel("Case Study Data XY for python GB.xls", index_col=0)
In [68]:
# Descriptive Statistics of the data
CS.describe().transpose()
Out[68]:
count mean std min 25% 50% 75% max
Installation TAT in Hours 93.0 41.612903 14.423737 18.0 29.0 41.00 51.000 88.0
Commission to SA 93.0 243.806452 139.052560 67.0 146.0 210.00 320.000 670.0
No. of installations requests in queue 92.0 10.304348 3.270713 3.0 8.0 10.00 13.000 20.0
Prcessing Time@ Back Office 92.0 2.280435 0.820244 0.0 1.9 2.05 2.500 5.4
Distance from the SA office 92.0 6.388043 1.802065 3.0 5.4 6.15 7.425 11.0
Distance travelled 93.0 7.507527 2.412343 3.0 5.6 7.00 9.200 14.5
Installation Time by SA 93.0 6.594624 1.504223 0.0 6.0 6.50 8.000 9.0

Data Clean up and Preparation for Analysis

In [69]:
# Checking for null values
CS.isnull().values.any()
Out[69]:
True
In [70]:
# Identifying rows with NAN values
CS[pd.isnull(CS).any(axis=1)]
Out[70]:
Date Installation TAT in Hours Defects / Non Defects Type of Technology (Modem, routers, etc) Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA Service Affiliate CODE Service Affiliate TEAMS
S.No.
71 2002-04-23 53 D NaN 245 NaN NaN NaN 9.5 6.0 NaN NaN
In [71]:
# Dropping rows with NAN Values
CS=CS.dropna(axis=0)
In [7]:
CS.describe()
Out[7]:
Installation TAT in Hours Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA
count 92.000000 92.000000 92.000000 92.000000 92.000000 92.000000 92.000000
mean 41.489130 243.793478 10.304348 2.280435 6.388043 7.485870 6.601087
std 14.453027 139.814440 3.270713 0.820244 1.802065 2.416454 1.511167
min 18.000000 67.000000 3.000000 0.000000 3.000000 3.000000 0.000000
25% 28.750000 145.250000 8.000000 1.900000 5.400000 5.600000 5.900000
50% 41.000000 210.000000 10.000000 2.050000 6.150000 7.000000 6.600000
75% 50.250000 321.000000 13.000000 2.500000 7.425000 9.050000 8.000000
max 88.000000 670.000000 20.000000 5.400000 11.000000 14.500000 9.000000
In [72]:
# Locating rows which have zero values
CS.loc[(CS==0).any(axis=1)]
Out[72]:
Date Installation TAT in Hours Defects / Non Defects Type of Technology (Modem, routers, etc) Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA Service Affiliate CODE Service Affiliate TEAMS
S.No.
7 2002-04-02 48 D B 143 14.0 0.0 5.4 8.3 0.0 C C1
In [74]:
# Drop 7th row with zero values
CS.drop(index=7, axis=0)
Out[74]:
Date Installation TAT in Hours Defects / Non Defects Type of Technology (Modem, routers, etc) Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA Service Affiliate CODE Service Affiliate TEAMS
S.No.
1 2002-04-01 25 N A 155 9.0 2.0 5.5 6.5 5.0 A A1
2 2002-04-01 55 D B 175 13.0 1.2 5.6 9.5 6.5 B B1
3 2002-04-01 40 D C 98 10.0 2.3 6.3 6.3 6.0 C C1
4 2002-04-02 35 D D 125 8.0 2.0 4.9 5.4 5.5 A A1
5 2002-04-02 65 D B 450 15.0 1.5 7.2 10.0 8.0 A A1
6 2002-04-02 75 D B 350 18.0 2.4 8.5 12.0 8.0 A A2
8 2002-04-03 22 N A 176 9.0 2.5 6.0 7.0 4.5 A A1
9 2002-04-03 35 D A 154 8.0 4.0 4.6 5.6 4.0 A A2
10 2002-04-03 88 D B 134 20.0 3.0 11.0 14.0 8.0 B B1
11 2002-04-04 52 D C 235 10.0 4.0 5.6 5.6 6.7 A A1
12 2002-04-04 26 N D 245 7.0 2.4 3.5 4.0 6.5 A A2
13 2002-04-04 28 N A 123 6.0 3.5 3.5 4.5 5.0 A A1
14 2002-04-05 56 D B 430 12.0 4.2 7.0 10.0 8.0 B B1
15 2002-04-05 37 D A 180 5.0 3.0 4.1 5.0 5.4 A A1
16 2002-04-05 38 D D 210 10.0 4.2 6.0 6.5 5.6 B B1
17 2002-04-06 45 D D 340 14.0 1.3 7.2 8.0 7.3 A A1
18 2002-04-06 54 D B 542 9.0 1.9 6.8 9.5 6.5 B B1
19 2002-04-06 49 D B 143 14.0 1.3 5.4 8.3 8.0 C C1
20 2002-04-07 47 D D 176 14.0 1.6 7.0 7.6 8.0 C C1
21 2002-04-07 35 D B 154 9.0 1.3 5.6 8.3 6.0 A A1
22 2002-04-07 56 D d 187 11.0 2.4 10.5 11.2 8.0 a a2
23 2002-04-07 55 D B 234 15.0 2.1 6.0 9.8 8.3 A A2
24 2002-04-08 29 N A 159 7.0 1.8 4.3 5.0 6.0 B B1
25 2002-04-08 48 D C 245 16.0 2.1 7.8 7.8 7.0 B B1
26 2002-04-08 37 D C 345 3.0 2.0 5.6 5.6 6.5 B B1
27 2002-04-09 45 D D 450 13.0 2.3 7.5 8.0 8.5 A A1
28 2002-04-09 34 D A 187 9.0 2.5 6.0 6.8 5.0 B B2
29 2002-04-09 43 D C 146 11.0 2.6 7.4 7.4 4.5 A A1
30 2002-04-10 85 D B 234 17.0 1.2 9.5 14.5 8.0 A A2
31 2002-04-10 38 D C 453 9.0 2.3 8.0 8.0 9.0 B B1
... ... ... ... ... ... ... ... ... ... ... ... ...
63 2002-04-21 47 D C 543 13.0 2.0 8.0 8.0 8.0 B B1
64 2002-04-21 48 D D 123 12.0 2.3 6.5 7.0 9.0 B B1
65 2002-04-21 54 D D 100 14.0 2.3 11.0 12.0 8.0 C C1
66 2002-04-22 29 N A 150 8.0 2.0 5.0 5.0 6.5 C C1
67 2002-04-22 33 D C 123 8.0 1.7 6.5 6.5 7.8 A A1
68 2002-04-22 46 D D 246 12.0 1.5 8.0 8.5 7.0 A A2
69 2002-04-23 46 D D 234 10.0 3.0 5.0 5.6 7.0 A A2
70 2002-04-23 27 N A 120 6.0 2.5 4.8 5.6 5.0 A A2
72 2002-04-23 56 D D 187 11.0 2.4 10.5 11.2 8.0 A A2
73 2002-04-24 48 D D 146 13.0 1.8 8.0 9.0 7.0 B B1
74 2002-04-24 23 N A 203 7.0 1.9 5.4 5.4 6.0 C C1
75 2002-04-24 34 D B 210 9.0 2.0 5.5 7.0 9.0 A A1
76 2002-04-25 23 N A 230 11.0 2.3 4.0 4.0 5.0 A A1
77 2002-04-25 24 N B 330 10.0 2.2 3.0 4.0 6.5 A A1
78 2002-04-25 29 N D 230 8.0 1.7 6.0 6.0 6.5 B B1
79 2002-04-26 22 N C 340 12.0 2.2 5.0 5.0 5.0 B B2
80 2002-04-26 58 D B 432 6.0 2.1 8.5 11.0 8.0 B B1
81 2002-04-26 33 D D 134 10.0 1.8 6.0 6.5 6.0 B B1
82 2002-04-27 34 D B 165 9.0 1.5 5.5 7.0 7.0 A A1
83 2002-04-27 31 D C 176 10.0 2.4 6.5 6.5 7.0 A A1
84 2002-04-27 23 N A 231 8.0 2.1 6.0 6.5 5.0 A A1
85 2002-04-28 18 N D 320 6.0 1.9 3.0 3.0 4.0 A A1
86 2002-04-28 31 D D 156 8.0 2.0 5.4 5.4 8.0 C C1
87 2002-04-28 28 N A 546 7.0 5.4 6.4 6.4 7.0 B B1
88 2002-04-29 51 D D 324 14.0 3.2 6.5 6.5 9.0 A A2
89 2002-04-29 51 D B 234 6.0 2.3 8.5 11.0 8.0 A A2
90 2002-04-29 18 N C 121 5.0 2.0 3.5 3.5 3.0 A A1
91 2002-04-30 25 N A 187 14.0 3.0 4.2 4.8 5.0 A A1
92 2002-04-30 57 D D 120 10.0 2.0 7.5 8.0 8.0 A A2
93 2002-04-30 38 D B 140 5.0 2.0 6.5 8.0 9.0 B B2

91 rows × 12 columns

Identify what labels are contained in each column

In [10]:
CS['Type of Technology (Modem, routers, etc)'].unique()
Out[10]:
array(['A', 'B', 'C', 'D', 'd'], dtype=object)
In [11]:
CS['Service Affiliate CODE'].unique()
Out[11]:
array(['A', 'B', 'C', 'a'], dtype=object)
In [12]:
CS['Service Affiliate TEAMS'].unique()
Out[12]:
array(['A1', 'B1', 'C1', 'A2', 'a2', 'B2', 'C2'], dtype=object)
In [13]:
CS.loc[CS['Type of Technology (Modem, routers, etc)'] == 'd', 'Type of Technology (Modem, routers, etc)'] = 'D'
In [14]:
CS['Type of Technology (Modem, routers, etc)'].unique()
Out[14]:
array(['A', 'B', 'C', 'D'], dtype=object)
In [15]:
CS.loc[CS['Service Affiliate CODE'] == 'a', 'Service Affiliate CODE'] = 'A'
In [16]:
CS['Service Affiliate CODE'].unique()
Out[16]:
array(['A', 'B', 'C'], dtype=object)
In [17]:
CS.loc[CS['Service Affiliate TEAMS'] == 'a2', 'Service Affiliate TEAMS'] = 'A2'
In [18]:
CS['Service Affiliate TEAMS'].unique()
Out[18]:
array(['A1', 'B1', 'C1', 'A2', 'B2', 'C2'], dtype=object)
In [77]:
# Group the dataframe into two based on Defects and Non Defects (For defectives here)
CSD=CS[CS['Defects / Non Defects']=='D']
CSD
Out[77]:
Date Installation TAT in Hours Defects / Non Defects Type of Technology (Modem, routers, etc) Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA Service Affiliate CODE Service Affiliate TEAMS
S.No.
2 2002-04-01 55 D B 175 13.0 1.2 5.6 9.5 6.5 B B1
3 2002-04-01 40 D C 98 10.0 2.3 6.3 6.3 6.0 C C1
4 2002-04-02 35 D D 125 8.0 2.0 4.9 5.4 5.5 A A1
5 2002-04-02 65 D B 450 15.0 1.5 7.2 10.0 8.0 A A1
6 2002-04-02 75 D B 350 18.0 2.4 8.5 12.0 8.0 A A2
7 2002-04-02 48 D B 143 14.0 0.0 5.4 8.3 0.0 C C1
9 2002-04-03 35 D A 154 8.0 4.0 4.6 5.6 4.0 A A2
10 2002-04-03 88 D B 134 20.0 3.0 11.0 14.0 8.0 B B1
11 2002-04-04 52 D C 235 10.0 4.0 5.6 5.6 6.7 A A1
14 2002-04-05 56 D B 430 12.0 4.2 7.0 10.0 8.0 B B1
15 2002-04-05 37 D A 180 5.0 3.0 4.1 5.0 5.4 A A1
16 2002-04-05 38 D D 210 10.0 4.2 6.0 6.5 5.6 B B1
17 2002-04-06 45 D D 340 14.0 1.3 7.2 8.0 7.3 A A1
18 2002-04-06 54 D B 542 9.0 1.9 6.8 9.5 6.5 B B1
19 2002-04-06 49 D B 143 14.0 1.3 5.4 8.3 8.0 C C1
20 2002-04-07 47 D D 176 14.0 1.6 7.0 7.6 8.0 C C1
21 2002-04-07 35 D B 154 9.0 1.3 5.6 8.3 6.0 A A1
22 2002-04-07 56 D d 187 11.0 2.4 10.5 11.2 8.0 a a2
23 2002-04-07 55 D B 234 15.0 2.1 6.0 9.8 8.3 A A2
25 2002-04-08 48 D C 245 16.0 2.1 7.8 7.8 7.0 B B1
26 2002-04-08 37 D C 345 3.0 2.0 5.6 5.6 6.5 B B1
27 2002-04-09 45 D D 450 13.0 2.3 7.5 8.0 8.5 A A1
28 2002-04-09 34 D A 187 9.0 2.5 6.0 6.8 5.0 B B2
29 2002-04-09 43 D C 146 11.0 2.6 7.4 7.4 4.5 A A1
30 2002-04-10 85 D B 234 17.0 1.2 9.5 14.5 8.0 A A2
31 2002-04-10 38 D C 453 9.0 2.3 8.0 8.0 9.0 B B1
32 2002-04-10 53 D B 245 13.0 3.4 6.8 9.5 6.0 C C1
33 2002-04-11 50 D D 121 12.0 2.0 6.0 6.4 7.0 A A1
34 2002-04-11 41 D B 75 6.0 2.0 5.6 8.5 7.0 B B1
35 2002-04-11 71 D C 98 14.0 1.5 6.3 6.3 6.0 A A2
... ... ... ... ... ... ... ... ... ... ... ... ...
47 2002-04-15 50 D B 234 16.0 1.3 7.3 10.0 6.0 B B1
48 2002-04-16 46 D D 256 12.0 2.5 6.5 7.0 6.0 B B1
50 2002-04-16 48 D B 120 10.0 2.0 7.5 9.8 7.0 A A2
51 2002-04-17 36 D C 231 8.0 2.0 7.4 7.4 6.0 C C1
52 2002-04-17 44 D B 342 12.0 2.0 6.5 9.2 8.0 B B1
53 2002-04-17 34 D A 120 9.0 3.2 6.5 7.4 6.0 B B1
55 2002-04-18 57 D B 89 15.0 2.3 7.0 10.0 5.5 B B1
56 2002-04-18 41 D C 78 8.0 2.0 5.7 5.7 6.0 B B1
57 2002-04-19 52 D B 67 12.0 1.8 8.5 11.0 6.0 A A2
59 2002-04-19 59 D B 78 11.0 3.8 9.2 13.0 7.8 B B1
60 2002-04-20 61 D D 670 14.0 4.2 7.5 8.0 8.0 A A2
61 2002-04-20 47 D B 234 9.0 1.6 7.5 10.0 9.0 A A2
63 2002-04-21 47 D C 543 13.0 2.0 8.0 8.0 8.0 B B1
64 2002-04-21 48 D D 123 12.0 2.3 6.5 7.0 9.0 B B1
65 2002-04-21 54 D D 100 14.0 2.3 11.0 12.0 8.0 C C1
67 2002-04-22 33 D C 123 8.0 1.7 6.5 6.5 7.8 A A1
68 2002-04-22 46 D D 246 12.0 1.5 8.0 8.5 7.0 A A2
69 2002-04-23 46 D D 234 10.0 3.0 5.0 5.6 7.0 A A2
72 2002-04-23 56 D D 187 11.0 2.4 10.5 11.2 8.0 A A2
73 2002-04-24 48 D D 146 13.0 1.8 8.0 9.0 7.0 B B1
75 2002-04-24 34 D B 210 9.0 2.0 5.5 7.0 9.0 A A1
80 2002-04-26 58 D B 432 6.0 2.1 8.5 11.0 8.0 B B1
81 2002-04-26 33 D D 134 10.0 1.8 6.0 6.5 6.0 B B1
82 2002-04-27 34 D B 165 9.0 1.5 5.5 7.0 7.0 A A1
83 2002-04-27 31 D C 176 10.0 2.4 6.5 6.5 7.0 A A1
86 2002-04-28 31 D D 156 8.0 2.0 5.4 5.4 8.0 C C1
88 2002-04-29 51 D D 324 14.0 3.2 6.5 6.5 9.0 A A2
89 2002-04-29 51 D B 234 6.0 2.3 8.5 11.0 8.0 A A2
92 2002-04-30 57 D D 120 10.0 2.0 7.5 8.0 8.0 A A2
93 2002-04-30 38 D B 140 5.0 2.0 6.5 8.0 9.0 B B2

66 rows × 12 columns

In [76]:
# Same as above ( for Non Defectives)
CSN=CS[CS['Defects / Non Defects']=='N']
CSN
Out[76]:
Date Installation TAT in Hours Defects / Non Defects Type of Technology (Modem, routers, etc) Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA Service Affiliate CODE Service Affiliate TEAMS
S.No.
1 2002-04-01 25 N A 155 9.0 2.0 5.5 6.5 5.0 A A1
8 2002-04-03 22 N A 176 9.0 2.5 6.0 7.0 4.5 A A1
12 2002-04-04 26 N D 245 7.0 2.4 3.5 4.0 6.5 A A2
13 2002-04-04 28 N A 123 6.0 3.5 3.5 4.5 5.0 A A1
24 2002-04-08 29 N A 159 7.0 1.8 4.3 5.0 6.0 B B1
38 2002-04-12 24 N A 220 10.0 2.0 3.5 4.3 5.0 A A1
39 2002-04-13 26 N A 190 6.0 2.4 4.0 4.8 5.0 A A1
43 2002-04-14 26 N A 600 7.0 2.5 5.5 6.5 7.5 C C2
44 2002-04-14 28 N A 573 8.0 2.6 5.5 6.8 5.6 C C1
46 2002-04-15 25 N C 231 8.0 2.0 5.0 5.0 5.0 A A1
49 2002-04-16 28 N A 320 7.0 2.0 4.0 5.0 4.5 A A1
54 2002-04-18 28 N C 657 8.0 4.3 5.9 5.9 7.0 A A1
58 2002-04-19 25 N A 453 10.0 1.9 6.0 6.8 6.0 B B2
62 2002-04-20 21 N A 456 6.0 1.5 7.0 8.0 6.0 A A1
66 2002-04-22 29 N A 150 8.0 2.0 5.0 5.0 6.5 C C1
70 2002-04-23 27 N A 120 6.0 2.5 4.8 5.6 5.0 A A2
74 2002-04-24 23 N A 203 7.0 1.9 5.4 5.4 6.0 C C1
76 2002-04-25 23 N A 230 11.0 2.3 4.0 4.0 5.0 A A1
77 2002-04-25 24 N B 330 10.0 2.2 3.0 4.0 6.5 A A1
78 2002-04-25 29 N D 230 8.0 1.7 6.0 6.0 6.5 B B1
79 2002-04-26 22 N C 340 12.0 2.2 5.0 5.0 5.0 B B2
84 2002-04-27 23 N A 231 8.0 2.1 6.0 6.5 5.0 A A1
85 2002-04-28 18 N D 320 6.0 1.9 3.0 3.0 4.0 A A1
87 2002-04-28 28 N A 546 7.0 5.4 6.4 6.4 7.0 B B1
90 2002-04-29 18 N C 121 5.0 2.0 3.5 3.5 3.0 A A1
91 2002-04-30 25 N A 187 14.0 3.0 4.2 4.8 5.0 A A1

Graphical Summary of All variables

In [80]:
# Histogram Ploting for Installation data
plt.title("Histogram for Installation TAT in Hours")
sns.distplot(CS['Installation TAT in Hours'], kde=False)
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0xe814d10>

Histogram and Boxplot for all the Xs Variables, one after another

In [81]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Commission to SA")
sns.distplot(CS['Commission to SA'],kde=False)
plt.subplot(1,3,2)
sns.distplot(CSN['Commission to SA'],kde=False,color="Blue", label="Commission for Non Defects")
sns.distplot(CSD['Commission to SA'],kde=False,color = "Gold", label = "Commission for Defects")
plt.title("Histograms for Commission to SA")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=CS['Defects / Non Defects'],y=CS['Commission to SA'])
plt.title("Segmented Boxplot for Commission")
Out[81]:
Text(0.5, 1.0, 'Segmented Boxplot for Commission')
In [22]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Items in Queue")
sns.distplot(CS['No. of installations requests in queue'],kde=False)
plt.subplot(1,3,2)
sns.distplot(CSN['No. of installations requests in queue'],kde=False,color="Blue", label="Items in Queue for Non Defects")
sns.distplot(CSD['No. of installations requests in queue'],kde=False,color = "Gold", label = "Items in Queue for Defects")
plt.title("Histograms for Items in Queue")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=CS['Defects / Non Defects'],y=CS['No. of installations requests in queue'])
plt.title("Segmented Boxplot for Items in Queue")
Out[22]:
Text(0.5, 1.0, 'Segmented Boxplot for Items in Queue')
In [23]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Processing Time@ Back Office")
sns.distplot(CS['Prcessing Time@ Back Office'],kde=False)
plt.subplot(1,3,2)
sns.distplot(CSN['Prcessing Time@ Back Office'],kde=False,color="Blue", label="Prcessing Time@ Back Office for Non Defects")
sns.distplot(CSD['Prcessing Time@ Back Office'],kde=False,color = "Gold", label = "Prcessing Time@ Back Office for Defects")
plt.title("Histograms for Prcessing Time@ Back Office")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=CS['Defects / Non Defects'],y=CS['Prcessing Time@ Back Office'])
plt.title("Segmented Boxplot for Prcessing Time@ Back Office")
Out[23]:
Text(0.5, 1.0, 'Segmented Boxplot for Prcessing Time@ Back Office')
In [24]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Distance from the SA office")
sns.distplot(CS['Distance from the SA office'],kde=False)
plt.subplot(1,3,2)
sns.distplot(CSN['Distance from the SA office'],kde=False,color="Blue", label="Distance from the SA office for Non Defects")
sns.distplot(CSD['Distance from the SA office'],kde=False,color = "Gold", label = "Distance from the SA office for Defects")
plt.title("Histograms for Distance from the SA office")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=CS['Defects / Non Defects'],y=CS['Distance from the SA office'])
plt.title("Segmented Boxplot for Distance from the SA office")
Out[24]:
Text(0.5, 1.0, 'Segmented Boxplot for Distance from the SA office')
In [25]:
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Distance travelled")
sns.distplot(CS['Distance travelled'],kde=False)
plt.subplot(1,3,2)
sns.distplot(CSN['Distance travelled'],kde=False,color="Blue", label="Distance travelled for Non Defects")
sns.distplot(CSD['Distance travelled'],kde=False,color = "Gold", label = "Distance travelled for Defects")
plt.title("Histograms for Distance travelled")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=CS['Defects / Non Defects'],y=CS['Distance travelled'])
plt.title("Segmented Boxplot for Distance travelled")
Out[25]:
Text(0.5, 1.0, 'Segmented Boxplot for Distance travelled')
In [82]:
# Ploting multiple boxplots and comparing the factors of two variables
plt.figure(figsize=(20, 6))
plt.subplot(1,2,1)
sns.boxplot(x=CS['Service Affiliate CODE'], y=CS['Installation TAT in Hours'])
plt.subplot(1,2,2)
sns.boxplot(x=CS['Service Affiliate TEAMS'], y=CS['Installation TAT in Hours'])
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0xf78ced0>
In [83]:
# Numercial Descripion of Installation TAT
CS['Installation TAT in Hours'].describe()
Out[83]:
count    92.000000
mean     41.489130
std      14.453027
min      18.000000
25%      28.750000
50%      41.000000
75%      50.250000
max      88.000000
Name: Installation TAT in Hours, dtype: float64

Normality Testing

In [28]:
from scipy import stats
stats.normaltest(CS['Installation TAT in Hours']).pvalue
Out[28]:
0.017485495462223954
In [54]:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
In [55]:
qqplot(CS['Installation TAT in Hours'], line ='s')
pyplot.show()

Stem and Leaf Plot for Installation TAT

In [56]:
import stemgraphic
from stemgraphic import stem_graphic
In [61]:
stemgraphic.stem_graphic(CS['Installation TAT in Hours'], scale = 10)
Out[61]:
(<Figure size 540x198 with 1 Axes>, <matplotlib.axes._axes.Axes at 0x5c6c3b0>)

Superimposing Customer Spec of 30 hours over the Installation TAT Distribution

In [64]:
sns.distplot(CS['Installation TAT in Hours'],hist=False)
plt.axvline(x=30, color='r')
Out[64]:
<matplotlib.lines.Line2D at 0x5e12570>

Scatter Diagrams to screen variables visually

In [85]:
sns.pairplot(CS, vars=['Installation TAT in Hours','Commission to SA'])
Out[85]:
<seaborn.axisgrid.PairGrid at 0xe874e70>
In [86]:
sns.pairplot(CS,vars=['Installation TAT in Hours','Commission to SA','No. of installations requests in queue','Prcessing Time@ Back Office','Distance from the SA office','Distance travelled','Installation TAT in Hours'])
plt.title("Pairplot of Variables by Outcome")
Out[86]:
Text(0.5, 1.0, 'Pairplot of Variables by Outcome')

Correlation Coefficient Computation

In [31]:
cor = CS.corr(method ='pearson')
cor
Out[31]:
Installation TAT in Hours Commission to SA No. of installations requests in queue Prcessing Time@ Back Office Distance from the SA office Distance travelled Installation Time by SA
Installation TAT in Hours 1.000000 -0.054352 0.704439 -0.026992 0.699643 0.789236 0.484246
Commission to SA -0.054352 1.000000 -0.022570 0.251247 0.002459 -0.027295 0.218077
No. of installations requests in queue 0.704439 -0.022570 1.000000 -0.094425 0.506817 0.554900 0.257616
Prcessing Time@ Back Office -0.026992 0.251247 -0.094425 1.000000 0.045785 -0.050316 0.084683
Distance from the SA office 0.699643 0.002459 0.506817 0.045785 1.000000 0.880675 0.515070
Distance travelled 0.789236 -0.027295 0.554900 -0.050316 0.880675 1.000000 0.463018
Installation Time by SA 0.484246 0.218077 0.257616 0.084683 0.515070 0.463018 1.000000
In [32]:
sns.heatmap(cor)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0xbfaec10>

Multiple Linear Regression

In [87]:
# Run No.1
inputs=['No. of installations requests in queue','Distance travelled','Distance from the SA office']
X=CS[inputs]
In [34]:
import statsmodels.api as sm
X = sm.add_constant(X)
y= CS['Installation TAT in Hours']
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Out[34]:
OLS Regression Results
Dep. Variable: Installation TAT in Hours R-squared: 0.726
Model: OLS Adj. R-squared: 0.716
Method: Least Squares F-statistic: 77.54
Date: Tue, 02 Jul 2019 Prob (F-statistic): 1.27e-24
Time: 12:42:24 Log-Likelihood: -316.29
No. Observations: 92 AIC: 640.6
Df Residuals: 88 BIC: 650.7
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -1.6937 3.255 -0.520 0.604 -8.161 4.774
No. of installations requests in queue 1.7028 0.297 5.734 0.000 1.113 2.293
Distance travelled 3.4980 0.731 4.782 0.000 2.044 4.952
Distance from the SA office -0.0859 0.947 -0.091 0.928 -1.967 1.795
Omnibus: 9.702 Durbin-Watson: 1.732
Prob(Omnibus): 0.008 Jarque-Bera (JB): 10.399
Skew: 0.614 Prob(JB): 0.00552
Kurtosis: 4.098 Cond. No. 60.5


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [88]:
# Run No.2
inputs1=['No. of installations requests in queue','Distance travelled']
X=CS[inputs1]
X = sm.add_constant(X)
y= CS['Installation TAT in Hours']
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Out[88]:
OLS Regression Results
Dep. Variable: Installation TAT in Hours R-squared: 0.726
Model: OLS Adj. R-squared: 0.719
Method: Least Squares F-statistic: 117.6
Date: Tue, 02 Jul 2019 Prob (F-statistic): 1.03e-25
Time: 15:58:35 Log-Likelihood: -316.29
No. Observations: 92 AIC: 638.6
Df Residuals: 89 BIC: 646.2
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -1.8144 2.954 -0.614 0.541 -7.684 4.056
No. of installations requests in queue 1.7015 0.295 5.768 0.000 1.115 2.288
Distance travelled 3.4425 0.399 8.622 0.000 2.649 4.236
Omnibus: 9.524 Durbin-Watson: 1.729
Prob(Omnibus): 0.009 Jarque-Bera (JB): 10.134
Skew: 0.608 Prob(JB): 0.00630
Kurtosis: 4.080 Cond. No. 49.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [47]:
# Run No.3 (Without Y Intercept - 'C' in Y=mx+C will be not included in the model now)
inputs1=['No. of installations requests in queue','Distance travelled']
X=CS[inputs1]
y= CS['Installation TAT in Hours']
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Out[47]:
OLS Regression Results
Dep. Variable: Installation TAT in Hours R-squared: 0.970
Model: OLS Adj. R-squared: 0.970
Method: Least Squares F-statistic: 1478.
Date: Tue, 02 Jul 2019 Prob (F-statistic): 1.48e-69
Time: 13:04:57 Log-Likelihood: -316.49
No. Observations: 92 AIC: 637.0
Df Residuals: 90 BIC: 642.0
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
No. of installations requests in queue 1.6168 0.260 6.222 0.000 1.101 2.133
Distance travelled 3.3344 0.357 9.336 0.000 2.625 4.044
Omnibus: 9.282 Durbin-Watson: 1.707
Prob(Omnibus): 0.010 Jarque-Bera (JB): 9.843
Skew: 0.595 Prob(JB): 0.00729
Kurtosis: 4.074 Cond. No. 7.28


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Hypothesis Testing

Each variable (discrete) will be checked whether it influences Installation TAT or not in the following tests

ANOVA using Statsmodels

Data has to be stacked

In [89]:
from statsmodels.formula.api import ols      # For n-way ANOVA
from statsmodels.stats.anova import anova_lm # For n-way ANOVA

ANOVA for TAT impact by different technologies

Ho : Installation TAT for different technologies is same Ha : Installation TAT for different technologies is different - Type of Modem

In [39]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=CS['Type of Technology (Modem, routers, etc)'], y=CS['Installation TAT in Hours'])
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0xcf47810>
In [40]:
y= CS['Installation TAT in Hours']
X=CS['Type of Technology (Modem, routers, etc)']
formula = 'y ~ X'
model = ols(formula, CS).fit()
aov_table = anova_lm(model)
print(aov_table)
            df        sum_sq      mean_sq          F        PR(>F)
X          3.0   7672.377384  2557.459128  19.852175  6.469518e-10
Residual  88.0  11336.611747   128.825133        NaN           NaN

ANOVA for TAT impact by different Service Affilitates

Ho : Installation TAT for different Service Affiliates is same Ha : Installation TAT for different Service Affiliates is different

In [41]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=CS['Service Affiliate CODE'], y=CS['Installation TAT in Hours'])
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c7b10>
In [42]:
SA_CODE=CS['Service Affiliate CODE']
formula = 'y ~ SA_CODE'
model = ols(formula, CS).fit()
aov_table = anova_lm(model)
print(aov_table)
            df        sum_sq     mean_sq        F    PR(>F)
SA_CODE    2.0    375.955797  187.977899  0.89787  0.411095
Residual  89.0  18633.033333  209.359925      NaN       NaN

ANOVA for TAT impact by different Service Affilitates Team

  • Ho : Installation TAT for different teams within Service Affiliate the same
  • Ha : Installation TAT for different teams within Service Affiliate is different
In [43]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=CS['Service Affiliate TEAMS'], y=CS['Installation TAT in Hours'])
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c78f0>
In [44]:
y= CS['Installation TAT in Hours']
X=CS['Service Affiliate TEAMS']
formula = 'y ~ X'
model = ols(formula, CS).fit()
aov_table = anova_lm(model)
print(aov_table)
            df       sum_sq      mean_sq         F    PR(>F)
X          5.0   6008.93610  1201.787220  7.950252  0.000003
Residual  86.0  13000.05303   151.163407       NaN       NaN

Test of Homogenity of Variance (Pre-requisite for ANOVA)

This test is conducted for one variable Service Affiliate Code as an example here.

In [45]:
# Test of Homogenity of Variance
w, pvalue = stats.levene(CS['Installation TAT in Hours'][CS['Service Affiliate CODE'] == 'A'],
             CS['Installation TAT in Hours'][CS['Service Affiliate CODE'] == 'B'],
             CS['Installation TAT in Hours'][CS['Service Affiliate CODE'] == 'C'])
print(w, pvalue)
1.4773284413305592 0.2337891705413206

2t test

Pre and Post data comparison after improvement to validate if statistical different

In [103]:
Prepost= pd.read_excel('F:/2019 GB Python/ClearCalls Pre-Post Data for Py .xlsx')
Prepost.head()
Out[103]:
Pre_TAT Post_TAT
0 25 22.8061
1 55 23.5905
2 40 23.0000
3 35 19.4463
4 65 20.5000
In [104]:
sns.boxplot(x='variable', y='value', data = pd.melt(Prepost), width =0.3)
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fc4470>
In [105]:
from scipy import stats
from scipy.stats import ttest_1samp, ttest_ind,mannwhitneyu,levene,shapiro,wilcoxon
In [106]:
t_statistic, p_value = ttest_ind(Prepost['Pre_TAT'], Prepost['Post_TAT'])
print(t_statistic, p_value)
11.569845231962535 1.911732901563976e-23