slide 1: © 2013  2016 ExcelR Solutions. All Rights Reserved
Data Science using R Minitab XLMiner
R Minitab XLMiner for Forecasting
slide 2:
My Introduction
Name: Bharani Kumar
Education: IIT Hyderabad, Indian School of Business
Professional certifications:
PMP: Project Management Professional
PMI-ACP: Agile Certified Practitioner
PMI-RMP: Risk Management Professional
CSM: Certified Scrum Master
LSSGB: Lean Six Sigma Green Belt
LSSBB: Lean Six Sigma Black Belt
SSMBB: Six Sigma Master Black Belt
ITIL: Information Technology Infrastructure Library
Agile PM: Dynamic System Development Methodology Atern
slide 3:
My Introduction
HSBC: driven using UK policies
ITC Infotech: driven using Indian policies (SME)
Infosys: driven using Indian policies (large enterprises)
Deloitte: driven using US policies
DATA SCIENTIST: research in analytics, deep learning, IoT
slide 4:
Tuckman Model
slide 5:
AGENDA
Data Visualization using Tableau
Data Mining – Supervised, Unsupervised
Machine Learning
Text Mining, NLP
slide 6:
What does it take to be a DATA SCIENTIST?
Successful Data Scientist:
All Agenda Topics
Domain Knowledge
Practice
Statistical Analysis
Data Mining
Forecasting
Data Visualization
slide 7:
Welcome to the Information Age …
… drowning in data and starving for Knowledge
slide 8:
BIG DATA
500 million tweets every day; 1.3 billion accounts
YouTube users upload 100 hours of video every minute
306 items are purchased every second
26.6 million transactions per day
100 terabytes of data uploaded daily
http://www.dnaindia.com/scitech/reportfacebooksawonebillionsimultaneoususersonaug242119428
Processing 100 petabytes a day (1 petabyte = 1000 terabytes)
More than 1 million customer transactions every hour
https://www.techinasia.com/alibabacrushesrecordsbrings143billionsinglesday
slide 9:
Why Tableau
slide 10:
Why Tableau
slide 11:
Why Tableau
slide 12:
Why Tableau
slide 13:
Agenda – Basic Statistics
1. Data Types – Continuous, Discrete, Nominal, Ordinal, Interval, Ratio; Random Variable; Probability; Probability Distribution
2. First, second, third, and fourth moment business decisions
3. Graphical representation – Barplot, Histogram, Boxplot, Scatter diagram
4. Simple Linear Regression
5. Hypothesis Testing
slide 14:
Data Types – Continuous, Discrete
slide 15:
Data Types – Preliminaries
slide 16:
Random Variable
slide 17:
Probability
slide 18:
Probability Distribution
slide 19:
Probability Applications
slide 20:
Sampling Funnel
Population → Sampling Frame → SRS (simple random sampling) → Sample
slide 21:
Measures of Central Tendency
Central Tendency | Population | Sample
Mean / Average | $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ | $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median | Middle value of the data
Mode | Most occurring value in the data
“Every American should have above average income, and my Administration is going to see they get it.” – American President
slide 22:
Measures of Dispersion
slide 23:
Measures of Dispersion
Dispersion | Population | Sample
Variance | $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ | $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation | $\sigma = \sqrt{\sigma^2}$ | $s = \sqrt{s^2}$
Range | Max – Min
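The measures of central tendency and dispersion can be sketched with Python's standard `statistics` module. The data values below are hypothetical and are not taken from the slides.

```python
import statistics

# Hypothetical observations, used only for illustration
data = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(data)            # average
median = statistics.median(data)        # middle value
mode = statistics.mode(data)            # most occurring value

pop_var = statistics.pvariance(data)    # population variance (divide by N)
sample_var = statistics.variance(data)  # sample variance (divide by n - 1)
sample_sd = statistics.stdev(data)      # sample standard deviation
data_range = max(data) - min(data)      # range = max - min

print(mean, median, mode, pop_var, sample_var, sample_sd, data_range)
```

Note that the sample variance divides by n − 1, so it is always slightly larger than the population variance computed on the same numbers.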
slide 24:
Expected Value
For a probability distribution, the mean of the distribution is known as the expected value
The expected value intuitively refers to what one would find if they repeated the experiment an infinite number of times and took the average of all of the outcomes
Mathematically, it is calculated as the weighted average of each possible value
The formula for calculating the expected value for a discrete random variable X, denoted by μ, is: $\mu = E(X) = \sum_{x} x \, P(X = x)$
The variance of a discrete random variable X, denoted by σ², is: $\sigma^2 = \sum_{x} (x - \mu)^2 \, P(X = x)$
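As a small worked example of the weighted-average definition, the sketch below computes the expected value and variance of a fair six-sided die. The die is an illustrative assumption, not a case from the slides.

```python
# Fair six-sided die: each face has probability 1/6 (illustrative assumption)
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Expected value: probability-weighted average of the possible values
mu = sum(x * p for x, p in zip(values, probs))

# Variance: probability-weighted average of squared deviations from mu
sigma_sq = sum((x - mu) ** 2 * p for x, p in zip(values, probs))

print(mu, sigma_sq)
```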
slide 25:
Graphical Techniques – Bar Chart
slide 26:
Graphical Techniques – Histogram
A histogram represents the frequency distribution, i.e. how many observations take a value within a certain interval.
slide 27:
Third and Fourth moments
Skewness
• A measure of asymmetry in the distribution
• Mathematically it is given by $E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$
• Negative skewness implies the mass of the distribution is concentrated on the right
Kurtosis
• A measure of the “peakedness” of the distribution
• Mathematically it is given by $E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3$
• For symmetric distributions, negative kurtosis implies a wider peak and thinner tails
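The two moment formulas can be sketched directly in Python. This version uses population moments (dividing by n); statistical packages often apply small-sample corrections, so their output may differ slightly. The data sets are hypothetical.

```python
from math import sqrt

def standardized_moment(data, k):
    """Average of ((x - mean) / sd) ** k, using the population sd."""
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(((x - mean) / sd) ** k for x in data) / n

def skewness(data):
    return standardized_moment(data, 3)

def excess_kurtosis(data):
    return standardized_moment(data, 4) - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```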
slide 28:
Graphical Techniques – Box Plot
Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of the data points. It displays two common measures of the variability or spread in a data set:
Range: It is represented on a box plot by the distance between the smallest value and the largest value, including any outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers.
Interquartile Range (IQR): The middle half of a data set falls within the interquartile range.
slide 29:
Normal Distribution
The probability associated with any single value of a continuous random variable is always zero
Area under the entire curve is always equal to 1
The normal random variable takes values from −∞ to +∞
slide 30:
Normal Distribution
Characterized by a bell-shaped curve
Has the following properties:
68.27% of values lie within ±1σ from the mean
95.45% of the values lie within ±2σ from the mean
99.73% of the values lie within ±3σ from the mean
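The coverage within ±1σ, ±2σ, and ±3σ can be checked numerically with `statistics.NormalDist` from the Python standard library. This is a quick sketch for verification rather than part of the original slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def coverage_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: coverage_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within ±{k}σ: {p:.4%}")
```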
slide 31:
Normal Distribution
Characterized by mean µ and standard deviation σ
$X \sim N(\mu, \sigma)$
slide 32:
Z scores – Standard Normal Distribution
• For every value x of the random variable X we can calculate the Z score: $Z = \frac{X - \mu}{\sigma}$
• Interpretation − How many standard deviations away is the value from the mean?
slide 33:
Calculating Probability from Z distribution
Suppose GMAT scores can be reasonably modelled using a normal distribution
− µ = 711, σ = 29
What is P(X ≤ 680)?
Step 1: Calculate the Z score corresponding to 680
− Z = (680 − 711)/29 = −1.07
Step 2: Calculate the probabilities using Z-tables
− P(Z ≤ −1.07) = 0.14
slide 34:
Calculating Probability from Z distribution
• What is P(697 ≤ X ≤ 740)?
• Step 1: Use P(x1 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ x1)
• Step 2: Calculate P(X ≤ x2) and P(X ≤ x1) as before
P(X ≤ 740) = P(Z ≤ 1) = 0.84; P(X ≤ 697) = P(Z ≤ −0.5) = 0.31
• Step 3: Calculate P(697 ≤ X ≤ 740) = 0.84 − 0.31 = 0.53
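The two GMAT calculations can be reproduced without a Z-table. The sketch below uses `statistics.NormalDist` with the µ = 711 and σ = 29 given on the slides; the software result differs slightly from the table lookup because the Z scores are not rounded.

```python
from statistics import NormalDist

gmat = NormalDist(mu=711, sigma=29)

# P(X <= 680): Z score first, then the cumulative probability
z_680 = (680 - 711) / 29
p_below_680 = gmat.cdf(680)

# P(697 <= X <= 740) = P(X <= 740) - P(X <= 697)
p_between = gmat.cdf(740) - gmat.cdf(697)

print(round(z_680, 2), round(p_below_680, 2), round(p_between, 2))
```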
slide 35:
Normal Quantile (Q-Q) Plot
Axes: Theoretical Quantiles, Sample Quantiles
slide 36:
Sampling variation
Sample mean can be and most likely is different from the population mean
Sample mean varies from one sample to another
Sample mean is a random variable
slide 37:
Central Limit Theorem
The standard error of the mean estimates the variability between samples, whereas the standard deviation measures the variability within a single sample
The distribution of the sample mean $\bar{X}$
− will be normal when the distribution of data in the population is normal
− will be approximately normal even if the distribution of data in the population is not normal, if the sample size is fairly large
Mean of $\bar{X}$ = µ (the same as the population mean of the raw data)
Standard deviation of $\bar{X}$ = $\frac{\sigma}{\sqrt{n}}$, where σ is the population standard deviation and n is the sample size
− This is referred to as the standard error of the mean
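A short simulation can illustrate the theorem. The sketch below draws repeated samples from a skewed (exponential) population and compares the spread of the sample means with σ/√n. The population, sample size, number of samples, and seed are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

n = 50              # size of each sample (assumed)
num_samples = 2000  # number of repeated samples (assumed)

# Exponential population with rate 1: mean = 1, standard deviation = 1
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)  # should be close to the population mean
observed_se = stdev(sample_means)   # spread of the sample means
theoretical_se = 1.0 / sqrt(n)      # sigma / sqrt(n)

print(mean_of_means, observed_se, theoretical_se)
```

Even though the population is strongly right skewed, the sample means cluster around 1 with a spread close to the standard error formula.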
slide 38:
Sample Size Calculation
A sample size of 30 is considered large enough, but that may or may not be adequate
More precise conditions:
− $n > 10\,K_3^2$, where $K_3$ is the sample skewness, and
− $n > 10\,|K_4|$, where $K_4$ is the sample kurtosis
slide 39:
Confidence Interval
• What is the probability of tomorrow’s temperature being 42 degrees?
Probability is ‘0’
• Can it be between 50⁰C and 100⁰C?
slide 40:
Case Study: Confidence Interval
• A university with 100,000 alumni is thinking of offering a new affinity credit card to its alumni.
• Profitability of the card depends on the average balance maintained by the card holders.
• A market research campaign is launched in which about 140 alumni accept the card in a pilot launch.
• The average balance maintained by these is 1990 and the standard deviation is 2833. Assume that the population standard deviation is 2500 from previous launches.
• What can we say about the average balance that will be held after a full-fledged market launch?
slide 41:
Interval estimates of parameters
• Based on sample data
− The point estimate for mean balance = 1990
− Can we trust this estimate?
• What do you think will happen if we took another random sample of 140 alumni?
• Because of this uncertainty, we prefer to provide the estimate as an interval (range) and associate a level of confidence with it
Interval Estimate = Point Estimate ± Margin of Error
slide 42:
Confidence Interval for the Population Mean
Start by choosing a confidence level (1 − α), e.g. 95%, 99%, 90%
Then the population mean will be within $\bar{X} \pm Z_{1-\alpha}\frac{\sigma}{\sqrt{n}}$, where $Z_{1-\alpha}$ satisfies $P(-Z_{1-\alpha} \le Z \le Z_{1-\alpha}) = 1 - \alpha$
Margin of error depends on the underlying uncertainty, confidence level, and sample size
Interval Estimate = Point Estimate ± Margin of Error
slide 43:
Calculate Z value: 90%, 95%, 99%
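The Z values for the three confidence levels can be obtained from the inverse of the standard normal CDF. This is a minimal sketch using the standard library; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z with P(-z <= Z <= z) = confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

z_values = {level: z_critical(level) for level in (0.90, 0.95, 0.99)}
for level, z in z_values.items():
    print(f"{level:.0%}: z = {z:.3f}")
```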
slide 44:
Confidence Interval Calculation
• Based on the survey and past data
− n = 140, σ = 2500, $\bar{X}$ = 1990
− $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2500}{\sqrt{140}} = 211.29$
• Construct a 95% confidence interval for the mean card balance and interpret it
• Construct a 90% confidence interval for the mean card balance and interpret it
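The two intervals asked for above can be computed as follows, using the slide's values n = 140, σ = 2500, and a sample mean of 1990. The helper name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the mean when the population sigma is known."""
    standard_error = sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return xbar - margin, xbar + margin

ci95 = z_interval(1990, 2500, 140, 0.95)
ci90 = z_interval(1990, 2500, 140, 0.90)

print([round(v) for v in ci95])
print([round(v) for v in ci90])
```

The 90% interval is narrower than the 95% interval: lower confidence buys a smaller margin of error.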
slide 45:
Confidence Interval Interpretation
Consider the 95% confidence interval for the mean balance: (1576, 2404)
Does this mean that
− The mean balance of the population lies in the range?
− The mean balance is in this range 95% of the time?
− 95% of the alumni have a balance in this range?
Interpretation 1: Mean of the population has a 95% chance of being in this range for a random sample
Interpretation 2: Mean of the population will be in this range for 95% of the random samples
slide 46:
What if we don’t know Sigma
• Suppose that the alumni of this university are very different and hence the population standard deviation from previous launches cannot be used
We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample:
Calculate $T = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
• If the underlying population is normally distributed, T is a random variable distributed according to a t-distribution with n − 1 degrees of freedom, $T \sim t_{n-1}$
• Research has shown that the t-distribution is fairly robust to deviation of the population from the normal model
slide 47:
Student’s t-distribution
As n → ∞, $t_n \to N(0, 1)$, i.e. as the degrees of freedom increase, the t-distribution approaches the standard normal distribution
slide 48:
Confidence Interval for mean with unknown Sigma
slide 49:
Calculating t-value
• Construct a 95% confidence interval for the mean card balance and interpret it
− n = 140, s = 2833, $\bar{X}$ = 1990
− $\sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{2833}{\sqrt{140}} = 239.43$
Calculate $t_{0.95,\,139} = 1.98$
Then the 95% confidence interval for balance is (1516, 2464)
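The interval with unknown sigma can be reproduced in a few lines. The standard library has no t-distribution, so this sketch takes the critical value 1.98 from the slide rather than computing it; a statistics package would normally supply that value.

```python
from math import sqrt

n = 140
s = 2833       # sample standard deviation (used because sigma is unknown)
xbar = 1990    # sample mean
t_crit = 1.98  # t critical value for 139 degrees of freedom, as given on the slide

standard_error = s / sqrt(n)
margin = t_crit * standard_error
t_interval = (xbar - margin, xbar + margin)

print(round(standard_error, 2), [round(v) for v in t_interval])
```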
slide 50:
Hypothesis Testing
Start with Hypothesis about a Population Parameter → Collect Sample Information → Reject/Do Not Reject Hypothesis
Decision | Ho is TRUE | H1 is TRUE
Fail to Reject Ho | Right Decision (Confidence, 1 − α) | Type II error (β)
Reject Ho | Type I error (α) | Right Decision (Power, 1 − β)
The factors that affect the power of a test include sample size, effect size, population variability, and α.
Power and α are related, as increasing α decreases β. Since power is calculated as 1 − β, if you increase α you also increase the power of a test. The maximum power a test can have is 1, whereas the minimum value is 0.
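The link between α and power can be illustrated for an upper-tailed one-sample Z test. All numbers below (hypothesized mean 150, true mean 152, σ = 4, n = 25) are hypothetical, and the helper name is an assumption for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tailed_z(mu0, mu1, sigma, n, alpha):
    """Power (1 - beta) of an upper-tailed one-sample Z test when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # rejection cutoff on the Z scale
    shift = (mu1 - mu0) * sqrt(n) / sigma     # true effect measured in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

power_05 = power_upper_tailed_z(150, 152, 4, 25, alpha=0.05)
power_01 = power_upper_tailed_z(150, 152, 4, 25, alpha=0.01)

print(round(power_05, 3), round(power_01, 3))
```

With everything else fixed, the larger α gives the larger power, which matches the relationship described above.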
slide 51:
Hypothesis Testing
• Our quality will not improve after the consulting project
• We will acquire 8000 new customers if I open a store in this area
• We will need 400 more person hours to finish this project
• The retail market will grow by 50% in the next 5 years
• Our potential customers do not spend more than 60 minutes on the web every day
• Less than 5 clients will default on their loans
slide 52:
Hypothesis Testing
slide 53:
Hypothesis Testing
slide 54:
1-Sample Z test
The lengths of 25 samples of a fabric are taken at random. The mean and standard deviation from the historic 2-year study are 150 and 4 respectively. Test if the current mean is greater than the historic mean. Assume α to be 0.05.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1-Sample Z Test: Stat > Basic Statistics > 1-Sample Z
Data set: Fabric Data
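Outside Minitab, the same upper-tailed test can be sketched in Python. The fabric data set is not reproduced in these slides, so the sample mean of 151.2 below is a hypothetical placeholder; µ0 = 150, σ = 4, n = 25, and α = 0.05 come from the problem statement.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 150            # historic mean
sigma = 4            # known population standard deviation
n = 25               # number of fabric samples
alpha = 0.05
sample_mean = 151.2  # hypothetical value; replace with the mean of the fabric data

# Ho: mu <= 150, Ha: mu > 150 (upper-tailed test)
z_stat = (sample_mean - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z_stat)
reject_h0 = p_value < alpha

print(round(z_stat, 2), round(p_value, 4), reject_h0)
```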
slide 55:
1-Sample Z test – Write Hypothesis
slide 56:
Y: Fabric Length is continuous
X: Discrete (1 Population)
We are comparing the mean with an external standard of 150mm
Data was shown to be normal
Population standard deviation is known (4)
slide 57:
1-Sample t Test
The mean diameter of the bolt manufactured should be 10mm to be able to fit into the nut. 20 samples are taken at random from the production line by a quality inspector. Conduct a test to check with 95% confidence that the mean is not different from the specification value.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1-Sample t Test: Stat > Basic Statistics > 1-Sample t
Data set: Bolt Diameter
slide 58:
1-Sample t Test – Write Hypothesis
slide 59:
Y: Bolt Diameter is continuous
X: Discrete (1 Population)
We are comparing the mean with an external standard of 10mm
Data was given to be normal
Population standard deviation is NOT known
slide 60:
1-Sample Sign Test
The scores of 20 students for the statistics exam are provided. Test if the current median is not equal to the historic median of 82. Assume α to be 0.05.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
3. 1-Sample Sign Test: Stat > Non Parametric > 1-Sample Sign
Data set: Student Scores
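The sign test only counts how many observations fall above and below the hypothesized median, then refers that count to a Binomial(n, 0.5) distribution. The sketch below implements an exact two-sided version; the 20 scores are hypothetical stand-ins, since the student data set is not included in the slides.

```python
from math import comb

def sign_test(data, hypothesized_median):
    """Exact two-sided sign test; observations equal to the median are dropped."""
    above = sum(1 for x in data if x > hypothesized_median)
    below = sum(1 for x in data if x < hypothesized_median)
    n = above + below
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, below, min(1.0, 2 * tail)

# Hypothetical exam scores for 20 students
scores = [71, 75, 78, 79, 80, 81, 83, 84, 85, 86,
          86, 87, 88, 89, 90, 91, 92, 94, 95, 97]

above, below, p_value = sign_test(scores, 82)
print(above, below, round(p_value, 4))
```

For these made-up scores the p-value exceeds 0.05, so the null hypothesis of a median of 82 would not be rejected.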
slide 61:
1-Sample Sign Test – Write Hypothesis
slide 62:
2-Sample t Test
A financial analyst at a financial institute wants to evaluate a recent credit card promotion. After this promotion, 450 cardholders were randomly selected. Half received an ad promoting a full waiver of interest rate on purchases made over the next three months, and half received a standard Christmas advertisement. Did the ad promoting the full interest rate waiver increase purchases?
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > Basic Statistics > 2 Variance
3. 2-Sample t Test: Stat > Basic Statistics > 2-Sample t
Data set: Marketing Strategy
slide 63:
2-Sample t Test – Write Hypothesis
slide 64:
Hypothesis Testing
slide 65:
Paired T Test
• This test is used to compare the means of two sets of observations when all the other external conditions are the same
• This is a more powerful test, as the variability in the observations that is due to differences between the people or objects sampled is factored out
Example: To find out if medication A lowers blood pressure
slide 66:
Trigger your thoughts
• Comparing the performance of machine A vs. machine B by feeding different raw materials to each machine
• Compare the performance of machine A vs. machine B when the same raw material is fed to each machine
• Compare the power output of a wind mill when you use motor A for 1 month and motor B for 1 month
• Compare the power output of two wind mills next to each other simultaneously when you use motor A on one wind mill and motor B on another
• Identifying resistor defects and capacitor defects in the same PCB by collecting such data using 20 PCB units
• Identifying resistor defects on 20 PCBs and capacitor defects on 20 different PCBs
slide 67:
2-Sample t test or Paired T test
Effect of a fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive.
Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive. What method will you choose to find the result?
2-Sample t test
Paired T test
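The difference between the two designs shows up clearly in the test statistics. The sketch below computes both on the same hypothetical mileage figures (not the slides' data set): the paired statistic works on per-vehicle differences, while the two-sample statistic (unpooled, Welch form) treats the two columns as independent groups.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical mileage for 10 vehicles, before and after the additive
before = [20.1, 18.4, 22.3, 19.8, 21.0, 17.5, 23.2, 20.6, 19.1, 21.7]
after = [20.9, 19.0, 23.4, 20.3, 21.9, 18.1, 24.0, 21.5, 19.6, 22.6]
n = len(before)

# Paired t: vehicle-to-vehicle variation cancels out in the differences
diffs = [a - b for a, b in zip(after, before)]
paired_t = mean(diffs) / (stdev(diffs) / sqrt(n))

# Two-sample t: vehicle-to-vehicle variation stays in the denominator
standard_error = sqrt(variance(before) / n + variance(after) / n)
two_sample_t = (mean(after) - mean(before)) / standard_error

print(round(paired_t, 2), round(two_sample_t, 2))
```

The same average improvement yields a far larger statistic in the paired analysis, which is why pairing is the more powerful choice when the same vehicles are measured twice.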
slide 68:
Mann-Whitney test
Effect of a fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Mann-Whitney test for Medians: Stat > Non Parametric > Mann Whitney
Data set: Vehicle with/without Additives
slide 69:
Mann-Whitney Test – Write Hypothesis
slide 70:
Paired T test
Effect of a fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Paired T Test: Stat > Basic Statistics > Paired T
Data set: Vehicle with/without Additives
• Since the data was not normal, the cause of non-normality was investigated and it was found that the first data point for “with additive” was wrongly entered. This value should have been 20. Now proceed with the rest of the analysis.
• If the data were truly non-normal, our analysis would stop here.
slide 71:
Paired T test – Write Hypothesis
slide 72:
One-Way ANOVA
A marketing organization outsources their back-office operations to three different suppliers. The contracts are up for renewal and the CMO wants to determine whether they should renew contracts with all suppliers or any specific supplier. The CMO wants to renew the contract of the supplier with the least transaction time. The CMO will renew all contracts if the performance of all suppliers is similar.
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > ANOVA > Test for Equal Variances
3. ANOVA: Stat > ANOVA > One-Way…
Data set: Contract Renewal
slide 73:
Example: More weight reduction programs
• Suppose the nutrition expert would like to do a comparative evaluation of three diet programs (Atkins, South Beach, GM)
• She randomly assigns an equal number of participants to each of these programs from a common pool of volunteers
• Suppose the average weight losses in each of the groups (arms) of the experiment are 4.5kg, 7kg, 5.3kg
• What can she conclude?
slide 74:
Two kinds of variation matter
• Not every individual in each program will respond identically to the diet program
• Easier to identify variations across programs if variations within programs are smaller
• Hence the method is called Analysis of Variance (ANOVA)
• Within group variation (Experimental Error)
• Between group variation
slide 75:
Formalizing the intuition behind variations
• It should be obvious that for every observation: $Tot_{ij} = t_i + e_{ij}$
• What is more surprising and useful is: $\sum_{i}\sum_{j} Tot_{ij}^2 = \sum_{i}\sum_{j} t_i^2 + \sum_{i}\sum_{j} e_{ij}^2$
slide 76:
Statistically test for equality of means
• n subjects equally divided into r groups
• Hypothesis
− H0: μ1 = μ2 = μ3 = … = μr
− Ha: Not all μi are equal
• Calculate
− Mean Square Treatment: MSTR = SSTR / (r − 1)
− Mean Square Error: MSE = SSE / (n − r)
− The ratio of two mean squares: f = MSTR / MSE = Between group variation / Within group variation
− Strength of this evidence: p-value = Pr(F(r − 1, n − r) ≥ f)
• Reject the null hypothesis if p-value < α
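The calculation steps can be sketched on a tiny data set. The individual weight losses below are hypothetical; they were chosen only so that the group means match the 4.5kg, 7kg, and 5.3kg quoted in the diet example. The p-value step needs an F-distribution, which the standard library does not provide, so the sketch stops at the f statistic.

```python
from statistics import mean

# Hypothetical weight losses (kg), 4 participants per diet program
groups = {
    "Atkins": [3.0, 5.0, 4.0, 6.0],
    "South Beach": [6.0, 8.0, 7.0, 7.0],
    "GM": [5.0, 5.5, 5.2, 5.5],
}

all_values = [x for g in groups.values() for x in g]
n, r = len(all_values), len(groups)
grand_mean = mean(all_values)

# Between-group (treatment) and within-group (error) sums of squares
sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
sst = sum((x - grand_mean) ** 2 for x in all_values)

mstr = sstr / (r - 1)
mse = sse / (n - r)
f_stat = mstr / mse

print(round(sstr, 2), round(sse, 2), round(sst, 2), round(f_stat, 3))
```

The printed totals also confirm the decomposition SST = SSTR + SSE for this data.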
slide 77:
Analysis of variance (ANOVA)
• ANOVA can be used to test equality of means when there are more than 2 populations
• ANOVA can be used with one or two factors
• If only one factor is varying, then we would use a one-way ANOVA
– Example: We are interested in comparing the mean performance of several departments within a company. Here the only factor is the name of the department
– If there are two factors, we would use a two-way ANOVA. Example: One factor is department and the second factor is the shift (day vs. night)
slide 78:
Analysis of variance (ANOVA)
One Way ANOVA
Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Between Treatments | SSFactor | k − 1 | MSFactor = SSFactor / DFFactor | F = MSFactor / MSError
Within Treatment | SSError | N − k | MSError = SSError / DFError |
Total | SSTotal | N − 1 | |
Two Way ANOVA
Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Factor A | SS_A | n_A − 1 | MS_A = SS_A / (n_A − 1) | F_A = MS_A / MS_E
Factor B | SS_B | n_B − 1 | MS_B = SS_B / (n_B − 1) | F_B = MS_B / MS_E
Interaction AB | SS_AB | (n_A − 1)(n_B − 1) | MS_AB = SS_AB / [(n_A − 1)(n_B − 1)] | F_AB = MS_AB / MS_E
Error | SS_E | n − n_A n_B | MS_E = SS_E / (n − n_A n_B) |
Total | SS_T | n − 1 | |
slide 79:
Dichotomies
2-Sample t-test:
• Is the transaction time dependent on whether person A or B processes the transaction?
• Is medicine 1 effective or medicine 2 at reducing heart stroke?
• Is the new branding program more effective in increasing profits?
ANOVA – One Way:
• Does the productivity of employees vary depending on the three levels (Beginner, Intermediate, and Advanced)?
• Three different sale closing methods were used. Which one is most effective?
• Four types of machines are used. Is the weight of the Rugby ball dependent on the type of machine used?
slide 80:
Non-Parametric equivalent to ANOVA
• When the data are not normal, or if the data points are too few to figure out if the data are normal, and we have more than 2 populations, we can use the Mood’s Median or Kruskal Wallis test to compare the populations
Ho: All the medians are the same
Ha: One of the medians is different
• Mood’s median assigns the data from each population that is higher than the overall median to one group and all points that are equal or lower to another group. It then uses a Chi-Square test to check if the observed frequencies are close to expected frequencies
• Kruskal Wallis is another test that is a non-parametric equivalent of ANOVA. Kruskal Wallis is the extension of the Mann-Whitney test
slide 81:
Mood’s Median, Kruskal Wallis
Growth is measured for three treatments as shown in the case study. Compare the effect of the three treatments on growth.
1. Mood’s Median – handles outliers well: Stat > Nonparametric > Mood’s Median
2. Kruskal Wallis – more powerful than Mood’s Median: Stat > Nonparametric > Kruskal Wallis
Data set: Height Growth
slide 82:
Hypothesis Testing
slide 83:
1-Proportion Test
• A poll is carried out to find the acceptability of a new football coach by the people. It was decided that if the support rate for the coach for the entire population was truly less than 25%, the coach would be fired
• 2000 people participated and 482 people supported the new coach
• Conduct a test to check if the new coach should be fired with 95% level of confidence
Data set: Football Coach
1-Proportion Test: Stat > Basic Statistics > 1-Proportion
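The football coach poll can be worked through with a normal-approximation test for one proportion. This is a sketch of that approximation only; Minitab can also run an exact binomial version, whose p-value may differ slightly.

```python
from math import sqrt
from statistics import NormalDist

n = 2000         # people polled
supporters = 482
p0 = 0.25        # threshold support rate
alpha = 0.05     # 95% level of confidence

p_hat = supporters / n

# Ho: p >= 0.25, Ha: p < 0.25 (lower-tailed test)
standard_error = sqrt(p0 * (1 - p0) / n)
z_stat = (p_hat - p0) / standard_error
p_value = NormalDist().cdf(z_stat)
fire_coach = p_value < alpha

print(p_hat, round(z_stat, 3), round(p_value, 3), fire_coach)
```

A sample support rate of 24.1% is below 25%, but not by enough to conclude that the true rate is below 25%, so under this approximation the evidence does not justify firing the coach.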
slide 84:
2-Proportion Test
Johnnie Talkers soft drinks division sales manager has been planning to launch a new sales incentive program for their sales executives. The sales executives felt that adults 40 yrs won’t buy, children will, and hence requested the sales manager not to launch the program. Analyze the data and determine whether there is evidence at 5% significance level to support the hypothesis
Data set: Johnnie Talkers
Ho: Proportion A = Proportion B
Ha: Proportion A ≠ Proportion B
Check p-value: if p-value < alpha, we reject Ho
slide 85:
Chi-Square Test
How can you determine whether the distribution of defects in your product or service has changed from the historic distribution over time or exceeds an industry standard?
• Do you think the mean is more significant, or the variance?
Comparing a population’s variance to a standard value involves calculating the chi-square test statistic
We can also:
• Determine whether one variable is dependent on another
• Compare observed and expected frequencies where variance is unknown. This is called a goodness-of-fit test
• Compare multiple proportions
slide 86:
Chi-Square Goodness-of-fit test
Goodness-of-fit test is to test assumptions about the distributions that fit the process data
Are observed frequencies (O) the same as or different from historical, expected, or theoretical frequencies (E)?
If there’s a difference between them, this suggests that the distribution model expressed by the expected frequencies does not fit the data
slide 87:
Chi-Square Test
• A city has a newly opened nuclear plant and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. To make a strong case, he wants to prove with numbers that an exposure to radiation levels is leading to an increase in diseased population. He formulates a contingency table of exposure and disease.
• Does the data suggest an association between the disease and exposure?
Exposure | Disease: Yes | Disease: No | Total
Yes | 37 | 13 | 50
No | 17 | 53 | 70
Total | 54 | 66 | 120
slide 88:
Chi-Square Test
Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes and no) if the probabilities were the same
If there were no effect of exposure, the probabilities should be the same and the chi-squared statistic would have a very low value.
Proportion of population exposed = 50/120 = 0.42
Proportion of population not exposed = 70/120 = 0.58
Thus expected values (using the unrounded proportions):
Population with disease = 54
Exposure Yes: 54 × (50/120) = 22.5
Exposure No: 54 × (70/120) = 31.5
Population without disease = 66
Exposure Yes: 66 × (50/120) = 27.5
Exposure No: 66 × (70/120) = 38.5
slide 89:
Chi-Square Test
• Calculate the Chi-squared statistic
$\chi^2 = \sum \frac{(O - E)^2}{E} = 29.1$
• Calculate the degrees of freedom:
(Number of rows – 1) × (Number of columns – 1)
df = (2 – 1) × (2 – 1) = 1
• Calculate the p-value from the Chi-squared table
For chi-squared value 29.1 and degrees of freedom 1, from the table the p-value is < 0.001
• Interpretation: There is a less than 0.001 chance of obtaining such discrepancies between expected and observed values if there is no association
• Conclusion: There is an association between the exposure and disease
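The whole calculation for the exposure and disease table can be reproduced in Python. The sketch below builds the expected counts from the row and column totals and sums (O − E)²/E. For 1 degree of freedom the chi-squared tail probability equals erfc(√(χ²/2)), which avoids the need for a table; that shortcut applies only to df = 1.

```python
from math import erfc, sqrt

# Rows: exposure yes / no; columns: disease yes / no (from the contingency table)
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under no association = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Tail probability of a chi-squared variable, valid for df = 1 only
p_value = erfc(sqrt(chi_sq / 2))

print(expected, round(chi_sq, 1), df, p_value)
```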
slide 90:
Chi-Square Test
Bahamantech Research Company uses 4 regional centers in South Asia (India, China, Sri Lanka, and Bangladesh) to input data of questionnaire responses. They audit a certain % of the questionnaire responses versus data entry. Any error in data entry renders it defective. The chief data scientist wants to check whether the defective % varies by country. Analyze the data at 5% significance level and help the manager draw appropriate inferences. ‘1’ means not defective, ‘0’ means defective
Data set: Bahaman Research
Ho: All proportions are equal
Ha: Not all proportions are equal
Check p-value: if p-value < alpha, we reject Ho
slide 91:
Non-Parametric Tests
• Referred to as “distribution free” as they don’t involve making assumptions about the distribution of the data
• They have lower power than the parametric tests and hence are always given second preference after the parametric tests
• These tests are typically focused on the median rather than the mean
• They involve straightforward procedures like counting and ordering
slide 92:
Thank You
slide 93:
Probability Distributions
Lognormal:
• Fits many kinds of failure data
• Used for reliability analysis: cycles-to-failure, loading variables, fatigue stress
• Tensile strength of fibers, breaking strength of concrete
• Environment data such as random quantities of pollutants in water or air
• Economic variables such as per capita income
• Extreme values are well managed; the log transform makes the data normal
• μ and σ are the mean and standard deviation of the natural logarithms
Data | Log transformed
12 | 2.48
28 | 3.33
87 | 4.47
143 | 4.96
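The log-transformed column is simply the natural logarithm of each data value. The short sketch below reproduces it and then computes μ and σ of the logs, which are the lognormal parameters described above (σ is computed here as a sample standard deviation, which is an assumption).

```python
from math import log
from statistics import mean, stdev

data = [12, 28, 87, 143]  # values from the table above

log_data = [log(x) for x in data]         # natural logarithms
rounded = [round(v, 2) for v in log_data]

mu_log = mean(log_data)      # mean of the natural logarithms
sigma_log = stdev(log_data)  # sample standard deviation of the natural logarithms

print(rounded, round(mu_log, 3), round(sigma_log, 3))
```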
slide 94:
Probability Distributions
Lognormal:
• This distribution is right skewed
• Skewness increases as the value of σ increases
• The pdf starts at zero, increases to its mode, and then decreases
• If time-to-failure has a lognormal distribution, then the logarithm of time-to-failure has a normal distribution
slide 95: © 2013  2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Exponential:
• Length of time between check-ins at a reception desk, calls at a call center, customers at a cashier
• Used when events occur continuously and independently at a constant average rate
• Used to model how long it takes for a change to occur at a given rate
• How long equipment will keep working with proper maintenance and part replacement
• Used to model the behavior of independent variables that have a constant rate
• The number of occurrences is described by a Poisson distribution, but the times between occurrences
are described by an Exponential distribution
• If events follow a Poisson process with rate λ per unit time, the time between consecutive events is
exponentially distributed with mean 1/λ
• Number of arrivals at a checkout counter, number of product failures over time – Poisson
• Length of time between events, i.e. from one arrival or failure to the next – Exponential distribution
• Exponential distribution can model the interval between random events
• λ = failure rate, θ = mean (θ = 1/λ), x = random variable
• Used to model mean time between occurrences
• In an exponential population, about 63.2% of observations are below the mean and
about 36.8% are above
• Uses a constant failure rate
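The share of an exponential population that falls below its mean can be checked directly with pexp. The rate used below is a hypothetical value; the result does not depend on it.

```r
lambda <- 0.5         # hypothetical failure rate (events per unit time)
theta  <- 1 / lambda  # mean time between events

# P(X <= mean) for an exponential distribution
p_below_mean <- pexp(theta, rate = lambda)
print(p_below_mean)      # 1 - exp(-1), about 0.632
print(1 - p_below_mean)  # about 0.368 above the mean

# Simulated gaps between events; their average should be close to theta
set.seed(1)
gaps <- rexp(10000, rate = lambda)
print(mean(gaps))
```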
slide 96: © 2013  2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Weibull:
• Models failure rate when the rate is not constant
• Models time to failure, time to repair, and material strength
• As a system/item ages, its failure rate increases or decreases
• Can model different distributions because it has shape, scale, and location parameters
• Can approximate the Lognormal, Exponential, and many other distributions
• Used widely in reliability and statistical applications
• Weibull and Lognormal are often considered together; both can be used to assess a dataset that
contains values close to the average, not too high or too low
• However, Weibull is a better fit when the majority of the data falls on the higher side
• Lognormal is a better fit when the majority of the data falls on the lower side
slide 97: © 2013  2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Weibull:
• β (beta) is the shape parameter, also called the slope; it determines the shape of the distribution
When β = 1, the distribution is the exponential distribution
When β is 3 to 4, the distribution approximates the normal distribution
Several β values can approximate the lognormal distribution
• η (eta) is the scale parameter; it determines the spread or width of the distribution
• γ (gamma) is the location parameter, which may be nonzero; it is the point below which there are no
failures, and changing its value moves the distribution to the right or left
γ > 0: there is a period when no failures occur
γ < 0: failures have occurred before time equals zero,
e.g. defective raw materials or failure during transportation
When γ = 0, η is called the characteristic life
• Regardless of the specific value of β, 63.2% of values fall below the characteristic life
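The 63.2% property and the β = 1 special case can be verified with pweibull. This is a minimal sketch assuming a location parameter of zero; the scale and shape values are hypothetical.

```r
eta   <- 100                # hypothetical scale parameter (characteristic life)
betas <- c(0.5, 1, 2, 3.5)  # a few hypothetical shape values

# P(T <= eta) for each shape value, with location gamma = 0
p_below_eta <- pweibull(eta, shape = betas, scale = eta)
print(p_below_eta)  # each entry equals 1 - exp(-1), about 0.632

# With beta = 1 the Weibull reduces to an exponential with rate 1 / eta
t_values <- c(10, 50, 200)
same_as_exponential <- all.equal(
  pweibull(t_values, shape = 1, scale = eta),
  pexp(t_values, rate = 1 / eta)
)
print(same_as_exponential)
```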
slide 98: © 2013  2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Bivariate Normal Distribution:
• Used when 2 normally distributed variables may be totally independent or may be correlated to
some degree
• A joint distribution of two variables that simultaneously
cross-classifies the data
• Can be discrete or continuous
• 3D plot resembling mountain terrain
• X and Y axes represent the two variables
• Z axis shows either
frequency (for discrete data) or
probability (for continuous data)
• The maximum or peak occurs when X1 = Mu1 and X2 = Mu2. You can take a
“slice” anywhere along the distribution by fixing one of the variables. This
is known as a conditional distribution
slide 99: © 2013  2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Bivariate Normal Distribution:
• Can help determine items of critical importance:
• Causality – examine the joint frequencies to investigate if the second variable changes in a
systematic way when the first variable changes
• Predictions – reviewing outcomes from one variable as the other changes
• Importance – if two variables are causally related they should have a statistically significant impact
slide 100: © 2013  2016 ExcelR Solutions. All Rights Reserved
Scatter Diagram
Scatter diagrams (or plots) provide a graphical representation of the relationship between two
continuous variables
Be careful: correlation does not guarantee causation. Correlation by itself does not
imply a cause-and-effect relationship
Judge the strength of the relationship by the width or tightness of the scatter
Determine the direction of the relationship, e.g. if X increases and Y decreases, the correlation is
negative; similarly, if X increases and Y increases, the correlation is positive
slide 102: © 2013  2016 ExcelR Solutions. All Rights Reserved
Correlation Analysis
Correlation Analysis measures the degree of linear relationship between two variables
Range of correlation coefficient: −1 to +1
Perfect positive relationship: +1
Perfect negative relationship: −1
No linear relationship: 0
If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a
good relationship
• Example: r = 0.87, r = 0.9, r = −0.9, r = −0.87 describe a good relationship
• Example: r = 0.5, r = −0.5, r = 0.28 describe a poor relationship
Correlation values of −1 or +1 imply an exact linear relationship. However, the real value of
correlation is in quantifying less-than-perfect relationships
If the correlation between the 2 variables is good, we can perform regression analysis, which
attempts to further describe this type of relationship
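A minimal sketch of the correlation coefficient in R, using the first ten observations of the Waist / AT data set that appears later in this deck. The 0.85 cut-off is the slide's rule of thumb; the variable names are illustrative.

```r
# First ten observations of the Waist / AT data set from this deck
waist <- c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2)
at    <- c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22)

# Pearson correlation coefficient, always between -1 and +1
r <- cor(waist, at)

# Rule of thumb from the slide: |r| > 0.85 indicates a good relationship
is_good <- abs(r) > 0.85
print(round(r, 3))
print(is_good)
```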
slide 103: © 2013  2016 ExcelR Solutions. All Rights Reserved
Correlation Analysis
slide 104: © 2013  2016 ExcelR Solutions. All Rights Reserved
Linear Regression Model
The equation that represents how an independent variable is related to a dependent variable
and an error term is a regression model
$y = \beta_0 + \beta_1 x + \varepsilon$
where $\beta_0$ and $\beta_1$ are called the parameters of the model, and
$\varepsilon$ is a random variable called the error term.
slide 105: © 2013  2016 ExcelR Solutions. All Rights Reserved
Linear Regression Model
[Figure: the straight line defined by the equation $y = \beta_0 + \beta_1 x$ on X and Y axes, marking the y-intercept $\beta_0$, the slope $\beta_1$, a specific value $x_0$ of the independent variable x, the mean value of y when x equals $x_0$, an observed value of y when x equals $x_0$, and the error term]
Fitting a straight line by least squares: $\hat{Y} = \hat{b}_0 + \hat{b}_1 X$
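The least-squares line can be computed by hand and compared with lm. This is a sketch using the first ten Waist / AT observations from the data set shown later in this deck; the closed-form formulas are the standard simple-regression estimates, not something stated on the slide.

```r
# First ten observations of the Waist / AT data set from this deck
waist <- c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2)
at    <- c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22)

# Closed-form least-squares estimates: slope = Sxy / Sxx
b1_hat <- sum((waist - mean(waist)) * (at - mean(at))) /
  sum((waist - mean(waist))^2)
# Intercept: the fitted line passes through the point of means
b0_hat <- mean(at) - b1_hat * mean(waist)

# The same estimates from R's built-in linear model
fit <- lm(at ~ waist)
print(c(b0 = b0_hat, b1 = b1_hat))
print(coef(fit))
```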
slide 106: © 2013  2016 ExcelR Solutions. All Rights Reserved
Regression Analysis
R-squared, also known as the coefficient of determination, represents the proportion of variation in
the output (dependent) variable that is explained by the input variable(s), i.e. the percentage of
response-variable variation that is explained by its relationship with one or more predictor variables
The higher the R², the better the model fits your data
R² is always between 0 and 100%
R-squared between 0.65 and 0.8: moderate correlation
R-squared greater than 0.8: strong correlation
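As a sketch of what R-squared measures, the block below computes it as 1 − SSE/SST and compares it with the value reported by summary. It uses the first ten Waist / AT observations from this deck's data set; the variable names are illustrative.

```r
# First ten observations of the Waist / AT data set from this deck
waist <- c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2)
at    <- c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22)

fit <- lm(at ~ waist)

sse <- sum(residuals(fit)^2)    # variation left unexplained by the model
sst <- sum((at - mean(at))^2)   # total variation in the response
r_squared <- 1 - sse / sst      # share of variation explained

print(r_squared)
print(summary(fit)$r.squared)   # same value as reported by R
print(cor(waist, at)^2)         # with one predictor, R-squared equals r squared
```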
slide 107: © 2013  2016 ExcelR Solutions. All Rights Reserved
Regression Analysis
Prediction intervals and confidence intervals are two types of intervals used for
predictions in regression and other linear models
Prediction interval: Represents a range within which a single new observation is likely to fall, given
specified settings of the predictors
Confidence interval of the prediction: Represents a range within which the mean response is likely
to fall, given specified settings of the predictors
The prediction interval is always wider than the corresponding confidence interval because
of the added uncertainty involved in predicting a single response versus the mean response
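The difference between the two intervals can be seen with predict on a fitted lm. The sketch uses the first ten Waist / AT observations from this deck's data set; the data frame name wcat and the new waist value of 80 are hypothetical choices for illustration.

```r
# First ten observations of the Waist / AT data set from this deck
wcat <- data.frame(
  Waist = c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2),
  AT    = c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22)
)
fit <- lm(AT ~ Waist, data = wcat)

new_patient <- data.frame(Waist = 80)  # hypothetical new predictor setting

# Interval for the mean response at Waist = 80
conf_int <- predict(fit, newdata = new_patient,
                    interval = "confidence", level = 0.95)
# Interval for a single new observation at Waist = 80
pred_int <- predict(fit, newdata = new_patient,
                    interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(conf_int)
print(pred_int)
print(pred_width > conf_width)
```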
slide 108: © 2013  2016 ExcelR Solutions. All Rights Reserved
Regression Techniques – Simple Linear Regression
Y: Continuous, X: Single Continuous → Simple Linear Regression
Y: Continuous, X: Single Discrete → Create Dummy Variable → Simple Linear Regression
slide 109: 109 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Dummy Variable
Gender Dummy Variable
Male 1
Female 0
Male 1
Female 0
Male 1
Male 1
Female 0
Male 1
Male 1
Female 0
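The coding in the table can be sketched in R with ifelse. The vector reproduces the Gender column from the slide; the variable names are illustrative.

```r
# Gender column from the slide
gender <- c("Male", "Female", "Male", "Female", "Male",
            "Male", "Female", "Male", "Male", "Female")

# Dummy variable: 1 for Male, 0 for Female
gender_dummy <- ifelse(gender == "Male", 1, 0)
print(data.frame(Gender = gender, Dummy = gender_dummy))
```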
slide 110: 110 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – R
A business problem:
The Waist Circumference – Adipose Tissue data
• Studies have shown that individuals with excess Adipose Tissue (AT) in the abdominal region have a higher risk of
cardiovascular diseases
• Computed Tomography, commonly called the CT scan, is the only technique that allows for the precise and
reliable measurement of the AT at any site in the body
• The problems with using the CT scan are:
• Many physicians do not have access to this technology
• Irradiation of the patient suppresses the immune system
• Expensive
• Is there a simpler yet reasonably accurate way to predict the AT area, i.e. one that is:
• Easily available
• Risk free
• Inexpensive
• A group of researchers conducted a study with the aim of predicting abdominal AT area using simple
anthropometric measurements, i.e. measurements on the human body
• The Waist Circumference – Adipose Tissue data is a part of this study, wherein the aim is to study how well waist
circumference (WC) predicts the AT area
slide 111: 111 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Data Set
Observation Waist AT Observation Waist AT Observation Waist AT
1 74.75 25.72 38 103 129 75 108 217
2 72.6 25.89 39 80 74.02 76 100 140
3 81.8 42.6 40 79 55.48 77 103 109
4 83.95 42.8 41 83.5 73.13 78 104 127
5 74.65 29.84 42 76 50.5 79 106 112
6 71.85 21.68 43 80.5 50.88 80 109 192
7 80.9 29.08 44 86.5 140 81 103.5 132
8 83.4 32.98 45 83 96.54 82 110 126
9 63.5 11.44 46 107.1 118 83 110 153
10 73.2 32.22 47 94.3 107 84 112 158
11 71.9 28.32 48 94.5 123 85 108.5 183
12 75 43.86 49 79.7 65.92 86 104 184
13 73.1 38.21 50 79.3 81.29 87 111 121
14 79 42.48 51 89.8 111 88 108.5 159
15 77 30.96 52 83.8 90.73 89 121 245
16 68.85 55.78 53 85.2 133 90 109 137
17 75.95 43.78 54 75.5 41.9 91 97.5 165
18 74.15 33.41 55 78.4 41.71 92 105.5 152
19 73.8 43.35 56 78.6 58.16 93 98 181
20 75.9 29.31 57 87.8 88.85 94 94.5 80.95
21 76.85 36.6 58 86.3 155 95 97 137
22 80.9 40.25 59 85.5 70.77 96 105 125
23 79.9 35.43 60 83.7 75.08 97 106 241
24 89.2 60.09 61 77.6 57.05 98 99 134
25 82 45.84 62 84.9 99.73 99 91 150
26 92 70.4 63 79.8 27.96 100 102.5 198
27 86.6 83.45 64 108.3 123 101 106 151
28 80.5 84.3 65 119.6 90.41 102 109.1 229
29 86 78.89 66 119.9 106 103 115 253
30 82.5 64.75 67 96.5 144 104 101 188
31 83.5 72.56 68 105.5 121 105 100.1 124
32 88.1 89.31 69 105 97.13 106 93.3 62.2
33 90.8 78.94 70 107 166 107 101.8 133
34 89.4 83.55 71 107 87.99 108 107.9 208
35 102 127 72 101 154 109 108.5 208
36 94.5 121 73 97 100
37 91 107 74 100 123
slide 112: 112 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Transformation
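# Assumes the columns AT and Waist are available in the workspace (e.g. via attach())
reg <- lm(AT ~ Waist)                      # Linear Regression
summary(reg)
confint(reg, level = 0.95)
predict(reg, interval = "prediction")
reg_log <- lm(AT ~ log(Waist))             # Regression using Logarithmic Transformation
summary(reg_log)
confint(reg_log, level = 0.95)
predict(reg_log, interval = "prediction")
reg_exp <- lm(log(AT) ~ Waist)             # Regression using Exponential Transformation
summary(reg_exp)
confint(reg_exp, level = 0.95)
predict(reg_exp, interval = "prediction")  # predictions are on the log(AT) scale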
slide 113: 113 Footer Copyright © 2015 ExcelR . All rights reserved.
Regression Techniques – Multiple Linear Regression
Y: Continuous, X: Multiple Continuous → Multiple Linear Regression
Y: Continuous, X: Multiple Discrete → Create Dummy Variable → Multiple Linear Regression
slide 114: 114 Footer Copyright © 2015 ExcelR . All rights reserved.
Multiple Linear Regression – Dummy Variable
Make of car | Dummy Variable_Petrol | Dummy Variable_Diesel | Dummy Variable_CNG | Dummy Variable_LPG
Petrol      | 1 | 0 | 0 | 0
Diesel      | 0 | 1 | 0 | 0
CNG         | 0 | 0 | 1 | 0
LPG         | 0 | 0 | 0 | 1
Diesel      | 0 | 1 | 0 | 0
CNG         | 0 | 0 | 1 | 0
Petrol      | 1 | 0 | 0 | 0
LPG         | 0 | 0 | 0 | 1
Petrol      | 1 | 0 | 0 | 0
LPG         | 0 | 0 | 0 | 1
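One way to build these columns in R is model.matrix on a factor. This is a sketch, not necessarily how the slide's table was produced; the column prefix "Dummy_" is an assumption. Note that a regression with an intercept keeps only three of the four dummies, because the fourth is implied by the others.

```r
# "Make of car" column from the slide
fuel <- factor(
  c("Petrol", "Diesel", "CNG", "LPG", "Diesel",
    "CNG", "Petrol", "LPG", "Petrol", "LPG"),
  levels = c("Petrol", "Diesel", "CNG", "LPG")
)

# One 0/1 column per category (no intercept column)
dummies <- model.matrix(~ fuel - 1)
colnames(dummies) <- paste0("Dummy_", levels(fuel))
print(dummies)

# With an intercept, R drops the first level as the baseline category
design_with_intercept <- model.matrix(~ fuel)
print(colnames(design_with_intercept))
```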
slide 115: 115 Footer Copyright © 2015 ExcelR . All rights reserved.
Multiple Regression Model
DATA: CARS, 81 observations (“cars.csv”)
• VOL: cubic feet of cab space
• HP: engine horsepower
• MPG: average miles per gallon
• SP: top speed (miles per hour)
• WT: vehicle weight (hundreds of pounds)
Our interest is to model the MPG of a car based on the other variables
slide 116: 116 Footer Copyright © 2015 ExcelR . All rights reserved.
Model and Assumptions
Our Model: $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + e$
① Linearity: assumptions about the form of the model:
◦ Linear in parameters
② Assumptions about the errors:
◦ IID Normal (independently and identically distributed)
◦ Zero mean
◦ Constant variance (homoscedasticity)
◦ If the variance is not constant, it is called HETEROSCEDASTICITY
◦ Independent of each other. If not independent, it is called the AUTOCORRELATION problem
③ Assumptions about the predictors:
◦ Nonrandom
◦ Measured without error
◦ Linearly independent of each other. If not, it is called the COLLINEARITY problem
④ Assumptions about the observations:
◦ Equally reliable
Summary (LINE): Linear, Independent, Normal, Equal Variance
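A minimal sketch of fitting this model in R and checking two of the assumptions. The data frame cars_sim is simulated (it is NOT the cars.csv data; only the column names and the row count follow the slide), and the coefficients used to generate it are arbitrary assumptions.

```r
set.seed(42)
n <- 81  # same row count as the CARS data; all values below are simulated

# Simulated stand-in for cars.csv (NOT the real data)
cars_sim <- data.frame(
  VOL = runif(n, 50, 160),
  HP  = runif(n, 50, 320)
)
cars_sim$SP  <- 90 + 0.2 * cars_sim$HP + rnorm(n, sd = 5)
cars_sim$WT  <- 0.3 * cars_sim$VOL + rnorm(n, sd = 2)  # deliberately collinear with VOL
cars_sim$MPG <- 60 - 0.1 * cars_sim$HP - 0.5 * cars_sim$WT + rnorm(n, sd = 3)

# Y = b0 + b1*X1 + ... + bk*Xk + e
fit <- lm(MPG ~ VOL + HP + SP + WT, data = cars_sim)
print(coef(fit))

# Residuals of a least-squares fit with an intercept average to zero
print(mean(residuals(fit)))

# Collinearity check: variance inflation factor for WT, 1 / (1 - R^2),
# where R^2 comes from regressing WT on the other predictors
r2_wt  <- summary(lm(WT ~ VOL + HP + SP, data = cars_sim))$r.squared
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)
```

A large variance inflation factor for WT here reflects the collinearity with VOL that was built into the simulated data.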
slide 117: 117 Footer Copyright © 2015 ExcelR . All rights reserved.
Techniques used for Discrete Output
• Logit Analysis
• Probit Analysis
• Logistic Regression
slide 118: 118 Footer Copyright © 2015 ExcelR . All rights reserved.
Regression Techniques – Simple Logistic Regression
Y: Discrete, X: Single Continuous → Simple Logistic Regression
Y: Discrete, X: Single Discrete → Create Dummy Variable → Simple Logistic Regression
slide 119: 119 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
• The Logistic Regression model predicts the probability associated with each
category of the dependent variable
How does it do this?
• It finds a linear relationship between the independent variables and a link
function of these probabilities. Then the link function that provides the
best goodness-of-fit for the given data is chosen
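As a sketch of the link-function idea, the block below fits the same simulated binary data with a logit and a probit link and compares them by AIC. The data, the true coefficients, and the use of AIC as the goodness-of-fit measure are all assumptions made for illustration.

```r
set.seed(7)
x <- runif(200, 0, 10)                # simulated predictor
p <- plogis(-3 + 0.6 * x)             # assumed true probabilities
y <- rbinom(200, size = 1, prob = p)  # simulated binary outcome

fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# The logit link maps each fitted probability onto the linear predictor
link_check <- all.equal(
  unname(qlogis(fitted(fit_logit))),
  unname(predict(fit_logit, type = "link"))
)
print(link_check)

# Compare goodness of fit; a lower AIC indicates a better fit
print(c(logit = AIC(fit_logit), probit = AIC(fit_probit)))
```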
slide 120: 120 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
The Multiple Logistic Regression Model is quite similar to the Multiple
Linear Regression Model; only the β coefficients vary
slide 121: 121 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
slide 122: 122 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression Methods
slide 123: 123 Footer Copyright © 2015 ExcelR . All rights reserved.
Assumptions in Logistic Regression
1. Only one outcome per event, like pass or fail
2. The outcomes are statistically independent
3. All relevant predictors are in the model
4. One category at a time: mutually exclusive and collectively exhaustive
5. Sample sizes are larger than for linear regression
slide 124: 124 Footer Copyright © 2015 ExcelR . All rights reserved.
Steps in Logistic Regression
1. Collect and organize sample data
2. Formulate the Logistic Regression Model
3. Check the model’s validity
4. Determine probabilities using the probability equation
5. Compile the results
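The five steps can be sketched in R with glm. The data below are simulated stand-ins (the names inspection_time and detected, the sample size, and the generating coefficients are hypothetical), and the likelihood-ratio test is just one possible validity check.

```r
set.seed(123)

# Step 1: collect and organize sample data (simulated here, NOT real data)
n <- 55
inspection_time <- runif(n, 1, 10)
detected <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * inspection_time))
qa <- data.frame(inspection_time, detected)

# Step 2: formulate the logistic regression model
model <- glm(detected ~ inspection_time, data = qa, family = binomial)

# Step 3: check validity with a likelihood-ratio test against the null model
lr_stat <- model$null.deviance - model$deviance
lr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Step 4: determine probabilities for chosen inspection times
new_times <- data.frame(inspection_time = c(2, 5, 8))
prob <- predict(model, newdata = new_times, type = "response")

# Step 5: compile the results
results <- cbind(new_times, probability = prob)
print(lr_p)
print(results)
```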
slide 125: 125 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression Example
Imagine that you are a Data Scientist at a very-large-scale integration (VLSI) circuit
manufacturing company. You want to know whether or not the time spent
inspecting each product impacts the quality assurance department’s ability to
detect a design error in the circuit
→ Step 1: Collect and organize the sample data
→ Number of Observations
→ Error Identification
→ Inspection Time
Number of Observations: 55 Observations of circuits with errors and
determine whether those errors were detected by QA