Understanding Hypothesis Testing and P-values
Hypothesis testing and p-values are core tools for a data scientist. You need this skill in order to verify the hypotheses or claims made by your company or your competitors.
Hypothesis Testing -
Note - This is just a statistical example and in no way reflects the actual scenario of demonetization.
Let us say that a government claims that during demonetization the average time a person spent in a queue was 1 hour.
Here we have a claim. As data scientists, our job is to support or reject this claim. The claim is hard to verify absolutely, because millions of people stood in queues and it is impossible for us to interview each one of them. A simple way to approach the problem is to sample a few people, say 100, from the population. There are various ways to sample data, but let's not go into the details of those techniques.
The first step to approach this problem is to form a "Null Hypothesis". Here our Null Hypothesis will be - "Mean/average time spent in the queue was 1 hour".
The alternative hypothesis will be that the mean time spent was more than 1 hour. Because it only looks in one direction (more than 1 hour), this is a one-sided test.
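In symbols, writing $\mu$ for the true mean time spent in the queue in hours (the symbol is notation assumed here, not part of the original claim), the two hypotheses could be stated as:

```latex
H_0:\ \mu = 1 \qquad \text{versus} \qquad H_1:\ \mu > 1
```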
We now have to check whether the sample data is consistent with the null hypothesis (i.e. the hypothesis which supports the claim).
Let us assume that the mean time for the sample of 100 people was 1.2 hours, with a standard deviation of 0.5 hours.
The job is to decide whether a sample mean this far from 1 hour could plausibly arise by chance if the null hypothesis were true.
Let us calculate the standard deviation of the sampling distribution of the sample mean (also called the standard error). We don't know the standard deviation of the population, so we will estimate it using the sample's standard deviation.
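As a worked equation, using $s$ for the sample standard deviation and $n$ for the sample size (both symbols are notation assumed here), the sample values from this example give:

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}} = \frac{0.5}{\sqrt{100}} = 0.05 \text{ hours}
```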
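We will now calculate the z-score, which tells us how many standard errors the sample mean lies from the mean claimed under the null hypothesis: z = (1.2 - 1) / (0.5 / √100) = 0.2 / 0.05 = 4.
The z-score is positive, which means that the mean waiting time in our sample is above the claimed population mean, by 4 standard errors.
A z-score of 4 leaves a probability of about 0.003% in the upper tail of the standard normal distribution. (Many printed z-score tables stop before 4, in which case the tail is simply smaller than the last tabulated value.) This essentially means that if we assume that the null hypothesis is true, then the probability of getting a sample mean at least this extreme is about 0.003%! (This is what the p-value is.)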
Assume that the level of significance chosen was 0.05 (a threshold picked before looking at the data). Since the p-value is less than the level of significance, we reject the null hypothesis.
Hence the data supports the alternative hypothesis: the mean time spent in the queue was very likely more than 1 hour!
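The whole test can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `one_sided_z_test` and its parameters are hypothetical, it uses only the standard library, it computes the z-score as (sample mean - claimed mean) / standard error, and it assumes the sample is large enough for the normal approximation to hold.

```python
import math


def one_sided_z_test(sample_mean, claimed_mean, sample_sd, n, alpha=0.05):
    """Upper-tailed z-test of H0: mean == claimed_mean vs H1: mean > claimed_mean.

    Hypothetical helper for illustration; returns (z, p_value, reject_null).
    """
    # Standard error: estimated standard deviation of the sample mean,
    # using the sample standard deviation in place of the unknown population one.
    standard_error = sample_sd / math.sqrt(n)

    # Number of standard errors between the sample mean and the claimed mean.
    z = (sample_mean - claimed_mean) / standard_error

    # Upper-tail probability P(Z >= z) of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))

    # Reject the null hypothesis when the p-value is below the significance level.
    return z, p_value, p_value < alpha


# Numbers from the queue example: mean 1.2 h, claimed 1 h, sd 0.5 h, n = 100.
z, p_value, reject_null = one_sided_z_test(1.2, 1.0, 0.5, 100)
print(f"z = {z:.2f}, p-value = {p_value:.2e}, reject H0: {reject_null}")
```

With the example's numbers this gives a z-score of 4 and a p-value of roughly 0.00003 (about 0.003%), well below the 0.05 level of significance, so the null hypothesis is rejected.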