An important task in experimental science is to determine whether a change in observed variables arises from randomness or from a real causal effect. Providing a statistical answer to the question of vaccine efficacy takes a similar form. Such questions are formalized as a choice between two hypotheses: the null hypothesis $H_0$, that the observed change is due to chance alone, and the alternative hypothesis $H_1$, that a real effect is present.
A foundational idea in hypothesis testing is randomization - what would happen if there were no dependence between the experimental variables and the observations were randomly assigned across all experimental variables?
We construct a distribution over a problem-specific test statistic (e.g. the sample mean, or the difference of sample means) and compute the p-value - the cumulative probability that the null statistic (the test statistic under the assumption that $H_0$ is true) is at least as extreme as the observed statistic^{1}. A low p-value is considered statistically significant - we reject $H_0$ and accept $H_1$.
We distinguish between a population statistic $p$ and a sample statistic $\hat{p}$, both proportions. For instance, $p$ can be the historical complication rate of a surgery over a country's population, and $\hat{p}$ the complication rate reported by a neighborhood surgeon, merely a sample of the whole population. One can now pose a statistical question: is $\hat{p} < p$?
Assuming the null hypothesis is true, for each observation in the sample we simulate a complication with probability $p$, and compute the null statistic $\hat{p}_\mathrm{sim}^{(1)}$. By many repeated simulations, we construct an empirical distribution over the null statistic $\hat{p}_\mathrm{sim}$. For $S$ simulations, the p-value is computed as,

$$\text{p-value} = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left[ \hat{p}_\mathrm{sim}^{(s)} \le \hat{p} \right]$$
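The simulation above can be sketched in a few lines of NumPy. The numbers here are purely illustrative assumptions: a historical complication rate of $p = 0.10$ and a surgeon's sample of $n = 62$ surgeries with 3 complications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers (assumptions, not from the text):
p = 0.10       # historical population complication rate, under H0
n = 62         # surgeon's sample size
p_hat = 3 / n  # observed sample complication rate

S = 10_000  # number of simulations
# Under H0 each of the n observations is a complication with
# probability p, so each null statistic is a Binomial(n, p) count
# divided by n.
sims = rng.binomial(n, p, size=S) / n

# One-sided p-value: fraction of null statistics at least as
# extreme (here, as small) as the observed statistic.
p_value = np.mean(sims <= p_hat)
```

With these illustrative numbers the p-value lands around 0.12, so we would not reject $H_0$ at the usual 5% level.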
The p-value here is an estimate, and the variability in the statistic is assumed to be normal. Such an assumption is valid when the observations are independent and the sample size is large.
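Concretely, this is the usual normal approximation to the sampling distribution of a proportion: for $n$ independent observations,

$$\hat{p}_\mathrm{sim} \;\overset{\cdot}{\sim}\; \mathcal{N}\!\left(p,\; \frac{p(1-p)}{n}\right)$$

which is why large, independent samples justify treating the null statistic as approximately normal.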
Often such a situation arises in treatment studies where, under a randomized controlled trial, we want to compare the efficacy of a "treatment" $\hat{p}_T$ against a "control" $\hat{p}_C$.
The null hypothesis $H_0$ is that there is no difference between treatment and control. Here again, we re-randomize the outcome for each unit (person) in the trial and construct a distribution of the null statistic, i.e. the difference between the means of the treatment and control groups.
We compute the p-value of the observed statistic under the null distribution and go through the same accept/reject decision.
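This re-randomization (permutation) procedure can be sketched as follows. The trial outcomes below are invented for illustration: 60 treated units with 45 recoveries, and 60 controls with 30 recoveries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial outcomes (1 = recovered, 0 = not); these
# counts are illustrative assumptions, not real data.
treatment = np.array([1] * 45 + [0] * 15)  # p_hat_T = 0.75
control = np.array([1] * 30 + [0] * 30)    # p_hat_C = 0.50

observed = treatment.mean() - control.mean()

# Under H0 there is no difference, so group labels are
# exchangeable: pool the outcomes and repeatedly re-randomize
# them into two groups of the original sizes.
pooled = np.concatenate([treatment, control])
n_T = len(treatment)

S = 10_000
null_stats = np.empty(S)
for s in range(S):
    rng.shuffle(pooled)
    null_stats[s] = pooled[:n_T].mean() - pooled[n_T:].mean()

# Two-sided p-value: how often the re-randomized difference is
# at least as extreme as the observed difference.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
```

With these illustrative counts the observed difference of 0.25 is far in the tail of the null distribution, so the p-value is small and we would reject $H_0$.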
To get a confidence interval, we simply bootstrap - resample with replacement - from both the treatment sample and the control sample many times, to construct an estimate of the distribution over the difference of means (proportions).
Section 4 of Raschka^{2} provides a good summary of the possibilities in classical hypothesis testing.
Chapters 1-5 of Box & Tiao^{3} for a deeper dive.
Williams' test for comparing two dependent correlations that share a variable.
The Permutation Test: A Visual Explanation of Statistical Testing - Jared Wilber (2019)
Raschka, S. (2018). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. ArXiv, abs/1811.12808. https://arxiv.org/abs/1811.12808
Box, G.E., & Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. International Statistical Review, 43, 242.