Statistics For Data Science: Hypothesis Testing


Inferential statistics, where we take a sample from a population and use that sample to draw conclusions about the population, relies heavily on hypothesis testing.

For example, suppose we have been told that the mean height of 25-year-old males in the US is 172 cm (= 𝜇). We could take a random sample of 100 25-year-old males and find the mean height of the sample, which comes out to, say, 175 cm (= x̄). We then choose a confidence level of 90%, 95% or 99%, construct a confidence interval around the sample mean, and state whether our sample, which gave x̄ = 175 cm, is consistent with the claim that the population mean is 𝜇 = 172 cm.

While doing hypothesis testing, it is very important to make the distinction between proving a claim and providing support for a claim. In inferential statistics, we’re usually not able to prove something with certainty. Instead, we use the data to support a theory we have. The data can provide strong or confident support for our theory, but we still can’t necessarily prove it.

The steps involved in Hypothesis testing are as follows :

1. State the null and alternate hypotheses.
2. Determine the confidence level.
3. Calculate the test-statistic
4. Find the critical values and determine the regions of acceptance and rejection
5. State the conclusion.

Let’s go through the steps one by one using the example of male heights given above.

1. State the null and alternate hypothesis

In any hypothesis test, the first thing we want to do is state the null and alternate hypotheses.

The alternate hypothesis Ha is the abnormality we are looking for in the data: it is the significance we are hoping to find. Once we have an alternative hypothesis, we always want to state the opposite claim, which we call the null hypothesis Ho.

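For example, in our experiment involving the average heights of 25-year-old US males, the claim being tested is that the population mean is 172 cm, so our null hypothesis will be

Ho : 𝜇 = 172 cm, and

the alternate hypothesis will be given by

Ha: 𝜇 ≠ 172 cm.

Note that the hypotheses are stated in terms of the claimed population mean (172 cm), not the sample mean (175 cm). The sample mean is the evidence we use to test the claim.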

One thing to note is that the null hypothesis always includes equality, whether as a pure ‘=’ sign or as a ‘≤’ or ‘≥’ sign, whereas the alternate hypothesis uses the ‘>’, ‘<’, or ‘≠’ sign.

2. Determine the confidence level

The next step is to choose a confidence level for our test, usually 90%, 95% or 99%. Choosing a confidence level of 99% means we want to be more confident about our results than if we choose, say, 90%. But a higher confidence level comes at the cost of a wider confidence interval and a larger margin of error. This means that with a higher confidence level, we are less likely to detect a true difference between our sample statistic and the hypothesized value when that difference actually exists.

Level of significance, 𝛼

The level of significance, 𝛼 is given by

𝛼 = 1 - CL ,

where CL is the confidence level.
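As a rough illustration of the relationship above, the following Python sketch converts each common confidence level into 𝛼 and into the cutoff z value that leaves 𝛼/2 in each tail. It assumes the test statistic follows a standard normal distribution; the function names are made up for this example.

```python
from statistics import NormalDist


def significance_level(confidence_level: float) -> float:
    """Return alpha = 1 - CL, with CL given as a fraction (e.g. 0.95)."""
    return 1.0 - confidence_level


def two_tailed_critical_z(confidence_level: float) -> float:
    """Return the positive z cutoff that leaves alpha/2 in each tail.

    Assumes a standard normal distribution (an assumption of this sketch).
    """
    alpha = significance_level(confidence_level)
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for cl in (0.90, 0.95, 0.99):
    alpha = significance_level(cl)
    z_crit = two_tailed_critical_z(cl)
    print(f"CL = {cl:.0%}  alpha = {alpha:.2f}  critical z = ±{z_crit:.3f}")
```

The cutoffs come out at roughly 1.645, 1.960 and 2.576, so the cutoff moves further from zero as the confidence level rises, which is why a higher confidence level gives a wider interval.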

3. Calculate the test-statistic

Once we have determined the confidence level, the next step is to calculate the test statistic for our hypothesis test. How we interpret that test statistic, however, depends on whether we are running a two-tailed or a one-tailed test.

To determine this, we need to look at our null and alternate hypotheses defined above:

If the null hypothesis and the alternate hypothesis contain the = and ≠ signs, we will use a two-tailed test. On the other hand, if the null hypothesis contains the ≥ or ≤ sign, with the alternate hypothesis having only the < or > sign, then we use the one-tailed test.

In a two-tailed test, we have two regions of rejection, one in each tail of the distribution.

A one-tailed test has only one region of rejection, in either the left tail or the right tail of the distribution, depending on whether we predict the population parameter is less than or greater than the stated value.

A two-tailed test is more conservative than a one-tailed test because the area of rejection in either tail is only half of what it would have been in a one-tailed test at the same 𝛼. So unless we are extremely confident about the directionality, we should play it safe and use the more conservative two-tailed test.
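To make the difference concrete, here is a small Python sketch that checks whether a given test statistic lands in the rejection region for each kind of test. It assumes a standard normal test statistic; the function name, the tail labels and the value 1.8 are all hypothetical choices for this example.

```python
from statistics import NormalDist


def rejects_null(z: float, alpha: float, tail: str) -> bool:
    """Return True if z falls in the rejection region.

    tail is "two", "left" or "right" (labels invented for this sketch).
    Assumes the test statistic is standard normal under the null hypothesis.
    """
    std_normal = NormalDist()
    if tail == "two":
        # alpha is split evenly between the two tails
        cutoff = std_normal.inv_cdf(1.0 - alpha / 2.0)
        return abs(z) >= cutoff
    if tail == "right":
        # all of alpha sits in the right tail
        return z >= std_normal.inv_cdf(1.0 - alpha)
    if tail == "left":
        # all of alpha sits in the left tail
        return z <= std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'left' or 'right'")


z = 1.8  # hypothetical test statistic
print("one-tailed (right):", rejects_null(z, 0.05, "right"))
print("two-tailed:", rejects_null(z, 0.05, "two"))
```

With 𝛼 = 0.05, the same statistic of 1.8 is rejected by the right-tailed test (cutoff about 1.645) but not by the two-tailed test (cutoff about 1.960), which is what "more conservative" means in practice.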

We can proceed to find out the test-statistic as follows :

a) If the population standard deviation is known, the test statistic is given by

z = (x̄ - 𝜇) / (𝜎 / √n) ,

where z is the test statistic,

x̄ is the sample mean,

𝜇 is the population mean, as per our null hypothesis,

𝜎 is the population standard deviation, and

n is the sample size.
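As a worked sketch of case (a), the z statistic is the difference between the sample mean and the hypothesized mean, divided by the standard error 𝜎/√n. The sample mean (175 cm), hypothesized mean (172 cm) and sample size (100) come from the height example; the population standard deviation of 10 cm is an assumed value chosen only for illustration, since the example does not give one.

```python
import math


def z_statistic(sample_mean: float, hypothesized_mean: float,
                population_sd: float, n: int) -> float:
    """One-sample z statistic: (sample mean - mu) / (sigma / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# population_sd = 10.0 is a hypothetical value, not taken from the example
z = z_statistic(sample_mean=175.0, hypothesized_mean=172.0,
                population_sd=10.0, n=100)
print(z)  # 3.0
```

Under that assumed standard deviation, the sample mean sits three standard errors above the hypothesized mean.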

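b) If the population standard deviation is unknown, then the test statistic is given by

t = (x̄ - 𝜇) / (s / √n) ,

where x̄ is the sample mean, 𝜇 is the population mean as per our null hypothesis, ‘s’ is the sample standard deviation and n is the sample size. This statistic is compared against the Student’s t-distribution with n - 1 degrees of freedom.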
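For case (b), the sketch below computes the t statistic directly from raw data, using the sample standard deviation (with an n - 1 denominator) in place of 𝜎. The five heights are invented purely for illustration and are far fewer than the 100 in the running example; the function name is also hypothetical.

```python
import math
import statistics


def t_statistic(sample: list[float], hypothesized_mean: float) -> tuple[float, int]:
    """Return the one-sample t statistic and its degrees of freedom (n - 1)."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, n - 1 denominator
    standard_error = s / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error, n - 1


# Hypothetical heights in cm, made up for this example
heights = [170.0, 174.0, 176.0, 178.0, 177.0]
t, df = t_statistic(heights, hypothesized_mean=172.0)
print(f"t = {t:.3f}, degrees of freedom = {df}")
```

Turning this t value into a p-value needs the t-distribution with the reported degrees of freedom, which the Python standard library does not provide; a t-table or a statistics package would be used for that step.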

4. Find the critical values and determine the regions of acceptance and rejection

Once we have calculated the test statistic, we use it to find the corresponding area under the distribution, using the normal distribution for a z statistic or the Student’s t-distribution for a t statistic. This area is known as the p-value: the probability, assuming the null hypothesis is true, of getting a test statistic at least as extreme as the one we observed.

For a one-tailed test, the p-value is the area in the tail beyond the test statistic, in the direction of the alternate hypothesis. We can find it using a statistical calculator, a z-table or a t-table.

For a two-tailed test, this is slightly more complicated:

If the test statistic is negative, we find the area to its left and multiply it by 2 to account for both tails, which gives the p-value.

If the test statistic is positive, we need the area to its right, which is 1 minus the area to the left of the test statistic (the value a standard table gives), and then we double the result to get the p-value.
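The rules above can be sketched in Python for a z statistic, assuming it follows a standard normal distribution. The function name and tail labels are made up for this example. Because the normal distribution is symmetric, the negative and positive two-tailed cases collapse into a single expression using the absolute value of z.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str = "two") -> float:
    """Return the p-value for a z statistic under a standard normal.

    tail is "left", "right" or "two" (labels invented for this sketch).
    """
    cdf = NormalDist().cdf  # area to the left of a given z
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1.0 - cdf(z)
    if tail == "two":
        # double the area in the tail beyond |z|; covers both signs of z
        return 2.0 * (1.0 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")


print(f"{p_value_from_z(3.0, 'two'):.4f}")   # 0.0027
print(f"{p_value_from_z(-3.0, 'two'):.4f}")  # 0.0027
```

A positive and a negative statistic of the same size give the same two-tailed p-value, as the two rules above imply.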

Once we have the p-value, it is relatively simple to decide whether or not we should reject the null hypothesis, using:

if p ≤ 𝛼, reject the null hypothesis

if p > 𝛼, do not reject the null hypothesis.

We can state our final conclusion based on the p-value, which is whether the data provides significant evidence to reject the null hypothesis or not.
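Putting the five steps together, here is a minimal end-to-end sketch of a two-tailed one-sample z-test on the height example: a claimed mean of 172 cm, a sample mean of 175 cm and a sample of 100. The population standard deviation of 10 cm and the 95% confidence level are assumptions made for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean: float, hypothesized_mean: float,
                      population_sd: float, n: int,
                      confidence_level: float = 0.95) -> tuple[float, float, bool]:
    """Two-tailed one-sample z-test. Returns (z, p_value, reject_null)."""
    # Step 2: significance level from the confidence level
    alpha = 1.0 - confidence_level
    # Step 3: test statistic
    z = (sample_mean - hypothesized_mean) / (population_sd / math.sqrt(n))
    # Step 4: two-tailed p-value under a standard normal
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # Step 5: decision rule
    return z, p_value, p_value <= alpha


# Step 1: Ho: mu = 172 cm, Ha: mu != 172 cm
# population_sd = 10.0 is an assumed value, not given in the example
z, p, reject = one_sample_z_test(175.0, 172.0, population_sd=10.0, n=100)
verdict = "reject" if reject else "do not reject"
print(f"z = {z:.2f}, p = {p:.4f}: {verdict} the null hypothesis")
```

Under these assumptions the p-value is about 0.0027, well below 𝛼 = 0.05, so the data would provide significant evidence against the claim that the mean height is 172 cm. A different assumed standard deviation could change that conclusion.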