Hypothesis Testing

Living document on hypothesis testing from scratch.


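An important task in experimental science is to determine whether a change in observed variables comes from randomness or from a real causal effect. Providing a statistical answer to the question of vaccine efficacy takes a similar form. Such questions are formalized as choosing between two hypotheses: the null hypothesis $H_0$, under which the observed change is due to chance alone, and the alternative hypothesis $H_1$, under which there is a real effect.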

A foundational idea in hypothesis testing is randomization: what would happen if the observations did not depend on the experimental variables and were instead randomly assigned across them?

We construct a distribution over a problem-specific test statistic (e.g. sample mean, difference of sample means) under the assumption that $H_0$ is true; a draw from this distribution is called the null statistic. We then compute the p-value: the probability, under this null distribution, of a statistic at least as extreme as the observed statistic [1]. A low p-value is considered statistically significant: we reject $H_0$ in favor of $H_1$.

Binary Random Variable

We distinguish between a population statistic $p$ and a sample statistic $\hat{p}$, both proportions. For instance, $p$ can be the historical complication rate of a surgery over a country’s population and $\hat{p}$ can be the complication rate reported by a neighborhood surgeon, merely a sample of the whole population. One can now pose a statistical question: is $\hat{p} < p$?

Parametric Bootstrap Simulation

Assuming the null hypothesis is true, for each observation in the sample we simulate a complication with probability $p$ and construct the null statistic $\hat{p}_\mathrm{sim}^{(1)}$. By repeating the simulation many times, we construct an empirical distribution over the null statistic $\hat{p}_\mathrm{sim}$. For $S$ simulations, the p-value is computed as,

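$$
\text{p-value} = \frac{1}{S} \sum_{i=1}^S \mathbb{I}[\hat{p}_{\mathrm{sim}}^{(i)} \leq \hat{p}]
$$

The indicator compares each simulated null statistic with the observed sample statistic $\hat{p}$, so the p-value is the fraction of simulations at least as extreme as what was observed.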
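As a minimal sketch of the simulation described above, the Python snippet below draws $S$ simulated samples under $H_0$ and reports the fraction whose proportion is at most the observed $\hat{p}$. The function names, the seed, and the example numbers (3 complications in 62 surgeries against a historical rate of 10%) are illustrative assumptions, not values taken from this document.

```python
import random


def simulate_null_count(n, p, rng):
    """Simulate n binary outcomes, each a complication with probability p,
    and return the number of complications."""
    return sum(rng.random() < p for _ in range(n))


def parametric_bootstrap_p_value(observed_count, n, p, num_simulations=10_000, seed=0):
    """Estimate the one-sided p-value P(p_hat_sim <= p_hat) under H0.

    Comparing counts is equivalent to comparing proportions, because both
    sides are divided by the same sample size n.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(num_simulations):
        if simulate_null_count(n, p, rng) <= observed_count:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_simulations


# Hypothetical example: 3 complications in 62 surgeries, historical rate 0.10.
p_value = parametric_bootstrap_p_value(observed_count=3, n=62, p=0.10)
print(f"estimated p-value: {p_value:.3f}")
```

Because the result is a Monte Carlo estimate, it changes slightly with the seed and becomes more stable as `num_simulations` grows.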

The p-value here is estimated from the simulations. When the variability in the statistic is instead assumed to be normal, that assumption is valid when the observations are independent and the sample size is large.
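For reference, the normal approximation mentioned here can be written out as follows. This is a sketch of the standard textbook form, not a derivation from this document: the symbols $n$ (sample size), $z$, and $\Phi$ (the standard normal CDF) are introduced here, and the one-sided direction assumes the alternative is that the true rate is below $p$.

$$
\hat{p} \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(p, \frac{p(1-p)}{n}\right) \text{ under } H_0,
\qquad
z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}},
\qquad
\text{p-value} \approx \Phi(z)
$$

A common rule of thumb for "large enough" is that both $np$ and $n(1-p)$ are at least 10; when that fails, the simulation-based estimate is the safer choice.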

Two Binary Random Variables

Such a situation often arises in treatment studies where, under a randomized controlled trial, we want to compare the efficacy of a “treatment” $\hat{p}_T$ against a “control” $\hat{p}_C$.

The null hypothesis $H_0$ is that there is no difference between treatment and control. Here again, we re-randomize the outcome for each unit (person) in the trial and construct a distribution of the null statistic, i.e. the difference in the means of the treatment and control groups.

We compute the p-value of the observed statistic under the null distribution and go through the same accept/reject decision.
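The re-randomization procedure for two groups can be sketched as a permutation test. The snippet below shuffles the group labels (equivalent to re-assigning outcomes to units at random) and counts how often the shuffled difference in proportions is at least as large as the observed one. The function names, the one-sided direction (treatment greater than control), and the data (14 of 20 events in treatment, 6 of 20 in control) are assumptions made for illustration.

```python
import random


def diff_in_proportions(outcomes, labels):
    """Treatment proportion minus control proportion; labels are 'T' or 'C'."""
    treated = [y for y, g in zip(outcomes, labels) if g == "T"]
    control = [y for y, g in zip(outcomes, labels) if g == "C"]
    return sum(treated) / len(treated) - sum(control) / len(control)


def permutation_p_value(outcomes, labels, num_permutations=10_000, seed=0):
    """One-sided p-value for H1: treatment proportion > control proportion."""
    rng = random.Random(seed)
    observed = diff_in_proportions(outcomes, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        # Under H0 the labels carry no information, so any assignment of
        # labels to outcomes is equally likely.
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point round-off on ties.
        if diff_in_proportions(outcomes, shuffled) >= observed - 1e-12:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / num_permutations


# Hypothetical trial: 1 marks the event of interest, 0 its absence.
outcomes = [1] * 14 + [0] * 6 + [1] * 6 + [0] * 14
labels = ["T"] * 20 + ["C"] * 20
observed_diff, p_value = permutation_p_value(outcomes, labels)
print(f"observed difference: {observed_diff:.2f}, estimated p-value: {p_value:.4f}")
```

Shuffling keeps the group sizes and the total number of events fixed, which is what distinguishes this from simulating each outcome independently.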

To get a confidence interval, we bootstrap from both the treatment sample and the control sample many times to construct an estimate of the distribution over the difference of means (proportions).
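A minimal sketch of this bootstrap is shown below. It resamples each group with replacement, records the difference in proportions, and reads the interval off the empirical percentiles. The percentile method is one of several ways to turn a bootstrap distribution into an interval and is an assumption here, as are the function name, the 95% level, and the example data.

```python
import random


def bootstrap_diff_ci(treatment, control, num_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for (treatment proportion - control proportion)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(num_resamples):
        # Resample each group independently, with replacement, at its original size.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    alpha = 1 - level
    lower_index = max(0, round(alpha / 2 * num_resamples))
    upper_index = min(num_resamples - 1, round((1 - alpha / 2) * num_resamples) - 1)
    return diffs[lower_index], diffs[upper_index]


# Hypothetical data: 14 of 20 events in treatment, 6 of 20 in control.
treatment = [1] * 14 + [0] * 6
control = [1] * 6 + [0] * 14
lower, upper = bootstrap_diff_ci(treatment, control)
print(f"95% bootstrap interval for the difference: [{lower:.2f}, {upper:.2f}]")
```

Unlike the permutation test, the groups are resampled separately here, since the goal is to estimate the variability of the observed difference rather than to simulate the null hypothesis.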


Section 4 of Raschka [2] provides a good summary of the possibilities in classical hypothesis testing.

Chapters 1-5 of Box & Tiao [3] offer a deeper dive.

William’s Test between two dependent correlations sharing a variable.

The Permutation Test: A Visual Explanation of Statistical Testing by Jared Wilber (2019)

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Everything is a Linear Model by Daniel Roelfs (2022)

Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv (2019)

Nonparametric Statistics Book [4]


  1. https://www.openintro.org/book/ims/

  2. Raschka, S. (2018). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. ArXiv, abs/1811.12808. https://arxiv.org/abs/1811.12808

  3. Box, G.E., & Tiao, G.C. (1973). Bayesian inference in statistical analysis. International Statistical Review, 43, 242.

  4. Hollander, M., Wolfe, D.A., & Chicken, E. (1973). Nonparametric Statistical Methods. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119196037