Hypothesis Testing

By: Asher Noel & Leo Saenger


If you have no statistics background, these links are a good place to start: linear regression and hypothesis testing

Hypothesis testing is a way to test and compare the validity of hypotheses. It has enjoyed the spotlight of much research, but it is not without its flaws: it is a method, not magic.


Generally, say that we partition a parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test two hypotheses against each other: specifically, that the parameter of interest is in one disjoint set or the other. Then we have two hypotheses:

$H_0 : \theta \in \Theta_0$ and $H_1 : \theta \in \Theta_1$.

We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.

To test these hypotheses, we need data. Let $X$ be a random variable representing our data. We test a hypothesis by finding a subset of outcomes $R$ in the range of $X$ called the rejection region. If $X \in R$, then we reject the null hypothesis; otherwise, we do not reject the null hypothesis.


We never accept the null or alternative hypothesis; we only ever reject $H_0$ or retain $H_0$.

We define the rejection region $R$ as the region where a test statistic $T$ exceeds a critical value $c$:

$R = \{x : T(x) > c\}$.

The central problem in hypothesis testing is finding an appropriate test statistic $T$ and critical value $c$.
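As a minimal sketch of these definitions, consider a hypothetical one-sided z-test with known variance (the values below are illustrative assumptions, not from the text). The rejection region $R = \{x : T(x) > c\}$ becomes a single comparison:

```python
import math

def one_sided_z_test(x_bar, sigma, n, mu0=0.0, c=1.645):
    """Reject H0: mu = mu0 when the z statistic exceeds the critical value c.

    c = 1.645 is the 0.95 quantile of the standard normal, which gives a
    size-0.05 one-sided test. All numbers here are hypothetical."""
    t = (x_bar - mu0) / (sigma / math.sqrt(n))  # test statistic T(x)
    return t > c                                # True iff x lands in R

print(one_sided_z_test(x_bar=0.5, sigma=1.0, n=25))  # t = 2.5 > 1.645, so True
```

Choosing a different $c$ trades off the two error types discussed below: a larger $c$ shrinks the rejection region, lowering the false-positive rate but raising the false-negative rate.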


Often, estimation and confidence intervals are better tools than hypothesis testing. Only use hypothesis testing when you want to test a well-defined hypothesis.


Fields are moving away from hypothesis testing. One example is the Machine Learning research community. There, models are often compared on the basis of performance on specific datasets. Measures of uncertainty and statistical rigor are as important as ever, but hypothesis testing is not.


There are two common errors: false positives, also referred to as type I errors, where we reject $H_0$ when $H_0$ is true, and false negatives, or type II errors, where we retain $H_0$ when $H_1$ is true.


The power function of a test with rejection region $R$ is $\beta(\theta) = P_\theta(X \in R)$.

The size of a test is $\alpha = \sup_{\theta \in \Theta_0} \beta(\theta)$. The supremum of a set is its least upper bound. In other words, the size of a test is the largest probability of rejecting $H_0$ when $H_0$ is true.

A test has level $\alpha$ if its size is less than or equal to $\alpha$.
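To make these definitions concrete, here is a sketch for a hypothetical binomial test: toss a coin $n$ times and reject $H_0 : p \le 1/2$ when the number of heads is at least $k$. The values $n = 12$ and $k = 9$ are illustrative assumptions:

```python
from math import comb

def power(p, n=12, k=9):
    """beta(p) = P_p(S >= k): the probability of rejecting H0 when the
    true heads probability is p. Hypothetical binomial example."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k, n + 1))

# Size: the sup of beta(theta) over Theta_0 = {p <= 1/2}. Here beta is
# increasing in p, so the supremum is attained at the boundary p = 1/2.
size = power(0.5)
print(round(size, 4))  # prints 0.073
```

So this test has size about 0.073, meaning it is a level-0.1 test but not a level-0.05 test.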


A level $\alpha$ test rejects $H_0 : \theta = \theta_0$ if and only if the $1-\alpha$ confidence interval does not contain $\theta_0$.

This is important for two reasons. Suppose we reject $H_0 : \theta = \theta_0$, so $\theta_0$ falls outside the confidence interval. If the interval sits just outside $\theta_0$, the estimated value of $\theta$ is close to $\theta_0$, so the finding is probably of little practical value. If the interval is far from $\theta_0$, the estimated effect is large, so the finding may have real scientific value. This shows that statistical significance does not imply scientific importance, and that confidence intervals can be more informative than tests.
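The duality can be checked directly. This sketch assumes a two-sided z-test with known $\sigma$ and hypothetical values; the test rejects exactly when $\theta_0$ falls outside the matching 95% confidence interval:

```python
import math

def ci_and_test(x_bar, sigma, n, theta0, z=1.96):
    """Two-sided level-0.05 z-test and the matching 95% CI (sketch).

    z = 1.96 is the 0.975 standard-normal quantile; inputs are hypothetical."""
    half = z * sigma / math.sqrt(n)
    ci = (x_bar - half, x_bar + half)                          # 95% CI
    reject = abs((x_bar - theta0) / (sigma / math.sqrt(n))) > z  # level-0.05 test
    in_ci = ci[0] <= theta0 <= ci[1]
    return reject, in_ci  # reject is True exactly when theta0 is outside the CI

print(ci_and_test(x_bar=0.5, sigma=1.0, n=25, theta0=0.0))  # (True, False)
```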


Often, researchers report more than whether they reject or retain the null. Usually, there is a smallest $\alpha$ at which the test rejects the null: we call this the p-value.



If a p-value is large, this has two interpretations: either $H_0$ is true, or $H_0$ is false but the test has low power. A large p-value is not strong evidence in favor of $H_0$.


The p-value is not the probability that the null hypothesis is true! The p-value is the probability, under the null, of observing a value of the test statistic the same as or more extreme than what was actually observed.
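For a one-sided binomial test with statistic $t(s) = s$, that definition is just a tail sum under the null. A sketch, using hypothetical numbers:

```python
from math import comb

def binom_p_value(s_obs, n, p0=0.5):
    """One-sided p-value: the probability, under H0 (heads prob = p0), of
    observing a head count as large as or larger than s_obs."""
    return sum(comb(n, s) * p0**s * (1 - p0)**(n - s)
               for s in range(s_obs, n + 1))

print(round(binom_p_value(9, 12), 3))  # prints 0.073
```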


Hypothesis testing is useful when there is evidence to reject $H_0$. If $H_0$ is the status quo, then this makes sense. We cannot use it to prove that $H_0$ is true: failure to reject $H_0$ can occur because $H_0$ is true or because the test has low power.


P-values are also susceptible to p-hacking: making choices about the data or the tests that push the p-value toward significance, usually to increase the chance of publication.


Different fields have different standards of significance. Physicists often aim for much stronger findings than $\alpha = 0.05$, whereas psychologists have accrued a poor reputation for sketchy science.


As a final point about the problems with p-values, they are susceptible to decisions you make about when to stop collecting data, even if those decisions do not change the data you actually observe.

For example, if you toss a coin $n = 12$ times and observe $s = 9$ heads, then under the null hypothesis that the coin is fair, the one-sided test with statistic $t(s) = s$ gives a p-value of 0.073. This is larger than the magical and arbitrary 5% threshold.

If instead the modeler kept tossing the coin until they observed $n - s = 3$ tails, then the data-generating distribution is negative binomial. Under this model and the same null hypothesis, the one-sided p-value is 0.0327. All of a sudden, without changing the data, there is “significant” evidence of bias in the coin! Long live Bayes :).
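Both p-values above can be reproduced exactly, since under a fair coin each is a finite sum of binomial coefficients:

```python
from math import comb

# Binomial model: n = 12 tosses fixed in advance, observed s = 9 heads.
# p-value = P(S >= 9) for S ~ Binomial(12, 1/2).
p_binom = sum(comb(12, s) for s in range(9, 13)) / 2**12

# Negative binomial model: toss until the 3rd tail appears. Needing 12 or
# more tosses means at most 2 tails showed up in the first 11 tosses.
# p-value = P(at most 2 tails in 11 tosses) under a fair coin.
p_negbin = sum(comb(11, t) for t in range(0, 3)) / 2**11

print(round(p_binom, 4), round(p_negbin, 4))  # prints 0.073 0.0327
```

The data (9 heads, 3 tails) are identical in both cases; only the stopping rule differs, yet one analysis crosses the 5% threshold and the other does not.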


For further reading, check out Harvard’s Statistics 111 course materials, “All of Statistics” by Larry Wasserman, and “Machine Learning: A Probabilistic Perspective” by Kevin Murphy.