By: Asher Noel & Leo Saenger
Hypothesis testing is a way to test and compare the validity of hypotheses. It has enjoyed the spotlight of much research, but it is not without its flaws: it is a method, not magic.
Generally, say that we partition a parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test two hypotheses against each other: specifically, that the parameter of interest $\theta$ is either in one disjoint set or the other. Then we have two hypotheses:
$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1.$$
We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.
To test these hypotheses, we need data. Let $X$ be a random variable representing our data. We test a hypothesis by finding a subset of outcomes $R$ in the range of $X$ called the rejection region. If $X \in R$, then we reject the null hypothesis: otherwise, we do not reject the null hypothesis.
We never accept the null or the alternative hypothesis; we only ever reject or retain the null.
We define the rejection region as the region where a test statistic $T$ is above a critical value $c$:
$$R = \{x : T(x) > c\}.$$
The problem in hypothesis testing is that of finding an appropriate test statistic $T$ and critical value $c$.
Often, estimation and confidence intervals are better tools than hypothesis testing. Only use hypothesis testing when you want to test a well-defined hypothesis.
Some fields are moving away from hypothesis testing. One example is the machine learning research community, where models are often compared on the basis of performance on specific datasets. Measures of uncertainty and statistical rigor are as important as ever, but hypothesis testing itself is not.
There are two common errors: false positives, also referred to as type I errors, where we reject $H_0$ when $H_0$ is true, and false negatives, or type II errors, where we retain $H_0$ when $H_1$ is true.
The power function of a test with rejection region $R$ is $\beta(\theta) = P_\theta(X \in R)$.
The size of a test is $\alpha = \sup_{\theta \in \Theta_0} \beta(\theta)$. The supremum of a set is the least upper bound. In other words, the size of a test is the largest probability of rejecting $H_0$ when $H_0$ is true.
A test has level $\alpha$ if its size is less than or equal to $\alpha$.
A level $\alpha$ test rejects $H_0: \theta = \theta_0$ if and only if the $1 - \alpha$ confidence interval does not contain $\theta_0$.
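To make the power function and the size concrete, here is a small sketch for a coin with unknown probability of heads. It assumes 12 tosses, a test that rejects when more than 9 heads appear, and a null hypothesis that the probability of heads is at most 0.5; these numbers, and the grid used to approximate the supremum, are illustrative choices.

```python
from math import comb


def power(theta, n=12, c=9):
    """Probability that the number of heads exceeds c when P(heads) = theta."""
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(c + 1, n + 1)
    )


# Size: the largest rejection probability over the null set (theta <= 0.5 here).
# The supremum is approximated by a maximum over a grid that includes 0.5.
null_grid = [i / 100 for i in range(0, 51)]
size = max(power(theta) for theta in null_grid)

# Type II error at one hypothetical alternative, theta = 0.8.
type_ii_error = 1 - power(0.8)

print(f"size = {size:.4f}")
print(f"type II error at theta = 0.8: {type_ii_error:.4f}")
```

Because the rejection probability grows with the probability of heads, the maximum over the null set is attained at the boundary value 0.5, so this test has level 0.05.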
This is important for two reasons. Consider two cases in which the confidence interval excludes $\theta_0$: one where the interval lies close to $\theta_0$ and one where it lies far away. In the first case, the estimated value of $\theta$ is close to $\theta_0$, so the finding is probably of little value. In the second case, the estimated value is far from $\theta_0$, so the finding has scientific value. This shows that statistical significance does not imply scientific importance, and that confidence intervals can be more informative than tests.
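The contrast between the two cases can be sketched with a Wald test for a coin's probability of heads, which rejects exactly when the normal-approximation confidence interval excludes the null value. The Wald construction, the null value of 0.5, and both data sets are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_test(heads, n, theta0=0.5, alpha=0.05):
    """Return (reject, interval) for a two-sided Wald test of theta = theta0."""
    theta_hat = heads / n
    # Estimated standard error; assumes 0 < heads < n so that it is nonzero.
    se = sqrt(theta_hat * (1 - theta_hat) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    interval = (theta_hat - z * se, theta_hat + z * se)
    reject = abs(theta_hat - theta0) / se > z
    return reject, interval


# Hypothetical data: a huge sample with a tiny effect, a small sample with a big one.
for heads, n in [(501_500, 1_000_000), (70, 100)]:
    reject, (lo, hi) = wald_test(heads, n)
    print(f"n = {n}: reject = {reject}, interval = ({lo:.4f}, {hi:.4f})")
```

Both tests reject at the 0.05 level, but only the intervals reveal that the first estimate sits within a fraction of a percentage point of 0.5 while the second is far from it.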
Often, researchers report more than whether they reject or retain the null. Usually, there is a smallest level $\alpha$ at which the test rejects the null: we call this the p-value.
If a p-value is large, this has two interpretations: either $H_0$ is true, or $H_0$ is false but the test has low power. A large p-value is not strong evidence in favor of $H_0$.
The p-value is not the probability that the null hypothesis is true! The p-value is the probability under the null of observing a value of the test statistic the same as or more extreme than what was actually observed.
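The definition can be checked by simulation: generate many data sets under the null and count how often the test statistic is at least as extreme as the observed one. The sketch below assumes a fair-coin null, a one-sided test on the number of heads, and hypothetical data of 15 heads in 20 tosses; the function names are illustrative.

```python
import random
from math import comb


def simulated_p_value(observed_heads, n, num_sims=20_000, seed=0):
    """Fraction of fair-coin experiments with at least observed_heads heads."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_sims):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads >= observed_heads:
            extreme += 1
    return extreme / num_sims


def exact_p_value(observed_heads, n):
    """Exact one-sided tail probability P(heads >= observed_heads) for a fair coin."""
    return sum(comb(n, k) for k in range(observed_heads, n + 1)) / 2**n


print(f"simulated: {simulated_p_value(15, 20):.4f}")
print(f"exact:     {exact_p_value(15, 20):.4f}")  # 21700 / 2**20, about 0.0207
```

Nothing in this calculation involves a probability that the null is true: the null is assumed throughout, and only the data are treated as random.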
Hypothesis testing is useful when there is evidence to reject $H_0$. If $H_0$ is the status quo, then this makes sense. We cannot use it to prove that $H_0$ is true. Failure to reject $H_0$ can occur because $H_0$ is true or because the test has low power.
P-values are also susceptible to p-hacking. This refers to making choices about the data or the test, often after seeing the results, that push the p-value in a more favorable direction, usually to increase the chance of publication.
Different fields have different standards of significance. Physicists often aim for much stronger findings than $\alpha = 0.05$, whereas psychologists have accrued a poor reputation for sketchy science.
As a final point about the problems with p-values, they depend on decisions you make about when to stop collecting data, even if those decisions do not change the data you actually observe.
For example, if you toss a coin $n = 12$ times and observe $s = 9$ heads, then if the null hypothesis is that the coin is fair, the one-sided test statistic $t(s) = s$ leads to a p-value of $P(S \geq 9 \mid H_0) = 0.073$, where $S \sim \text{Bin}(12, 0.5)$ under the null. This is larger than the magical and arbitrary 5% threshold.
If instead the modeler kept tossing the coin until they observed $f = 3$ tails, then the data-generating distribution is negative binomial. Under this model and the same null hypothesis, we get that the one-sided p-value is 0.0327. All of a sudden, without changing the data, there is “significant” evidence of bias in the coin! Long live Bayes :).
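Both p-values can be reproduced in a few lines. The sketch assumes the data are 9 heads and 3 tails in 12 tosses, which are the counts that yield the quoted 0.073 and 0.0327; the function names are illustrative.

```python
from math import comb


def binomial_p_value(n, s, p=0.5):
    """P(at least s heads in a fixed number n of tosses)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def negative_binomial_p_value(f, s, p=0.5):
    """P(at least s heads occur before the f-th tail), tossing until f tails."""
    # Complement of seeing k < s heads before the f-th tail.
    below = sum(comb(k + f - 1, f - 1) * p**k * (1 - p) ** f for k in range(s))
    return 1 - below


# Same observed data under two stopping rules.
p_fixed_n = binomial_p_value(n=12, s=9)
p_until_tails = negative_binomial_p_value(f=3, s=9)

print(f"fixed 12 tosses:    p = {p_fixed_n:.4f}")      # 0.0730
print(f"toss until 3 tails: p = {p_until_tails:.4f}")  # 0.0327
```

The observed sequence is identical in both cases; only the set of outcomes counted as "more extreme" changes with the stopping rule, and that alone moves the p-value across the 5% line.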
For further reading, check out Harvard’s Statistics 111 course materials, “All of Statistics” by Larry Wasserman, and “Machine Learning: A Probabilistic Perspective” by Kevin Murphy.