An important task in experimental science is to determine whether a change in observed variables arises from randomness or from a real causal effect. Providing a statistical answer to the question of vaccine efficacy takes a similar form. Such questions are formalized as a choice between two hypotheses: the null hypothesis $H_0$, that the observed change is due to chance alone, and the alternative hypothesis $H_1$, that a real effect is present.
A foundational idea in hypothesis testing is randomization - what would happen if there were no dependence between the experimental variables and the observations were randomly assigned across all experimental variables?
We construct a distribution over a problem-specific test statistic (e.g. the sample mean, or the difference of sample means) and compute the p-value - the cumulative probability that the null statistic (the test statistic under the assumption that $H_0$ is true) is at least as extreme as the observed statistic^{1}. A low p-value is considered statistically significant - we reject $H_0$ and accept $H_1$.
We distinguish between a population statistic $p$ and a sample statistic $\hat{p}$, both proportions. For instance, $p$ can be the historical complication rate of a surgery over a country's population, and $\hat{p}$ the complication rate reported by a neighborhood surgeon, merely a sample of the whole population. One can now pose a statistical question: is $\hat{p} < p$?
Assuming the null hypothesis is true, for each observation in the sample we simulate a complication with probability $p$, and compute the null statistic $\hat{p}_\mathrm{sim}^{(1)}$. By many repeated simulations, we construct an empirical distribution over the null statistic $\hat{p}_\mathrm{sim}$. For $S$ simulations, the p-value is computed as,

$$\text{p-value} = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left[ \hat{p}_\mathrm{sim}^{(s)} \le \hat{p} \right]$$
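The simulation above can be sketched in a few lines of NumPy. The numbers here are purely illustrative assumptions: a historical complication rate of $p = 0.10$ and a surgeon's sample of $n = 62$ surgeries with 3 complications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers (assumptions, not from the text):
p = 0.10       # historical population complication rate, under H0
n = 62         # surgeon's sample size
p_hat = 3 / n  # observed sample complication rate

S = 10_000  # number of simulations
# Under H0 each of the n observations is a complication with
# probability p, so each null statistic is a Binomial(n, p) count
# divided by n.
sims = rng.binomial(n, p, size=S) / n

# One-sided p-value: fraction of null statistics at least as
# extreme (here, as small) as the observed statistic.
p_value = np.mean(sims <= p_hat)
```

With these illustrative numbers the p-value lands around 0.12, so we would not reject $H_0$ at the usual 5% level.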
The p-value here is an estimate, and the variability in the statistic is assumed to be normal. Such an assumption is valid when the observations are independent and the sample size is large.
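Concretely, this is the usual normal approximation to the sampling distribution of a proportion: for $n$ independent observations,

$$\hat{p}_\mathrm{sim} \;\overset{\cdot}{\sim}\; \mathcal{N}\!\left(p,\; \frac{p(1-p)}{n}\right)$$

which is why large, independent samples justify treating the null statistic as approximately normal.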
Often such a situation arises in treatment studies where, under a randomized controlled trial, we want to compare the efficacy of a "treatment" $\hat{p}_T$ against a "control" $\hat{p}_C$.
The null hypothesis $H_0$ is that there is no difference between treatment and control. Here again, we re-randomize the outcome for each unit (person) in the trial and construct a distribution of the null statistic, i.e. the difference between the means of the treatment and control groups.
We compute the p-value of the observed statistic under the null distribution and go through the same accept/reject decision.
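This re-randomization (permutation) procedure can be sketched as follows. The trial outcomes below are invented for illustration: 60 treated units with 45 recoveries, and 60 controls with 30 recoveries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial outcomes (1 = recovered, 0 = not); these
# counts are illustrative assumptions, not real data.
treatment = np.array([1] * 45 + [0] * 15)  # p_hat_T = 0.75
control = np.array([1] * 30 + [0] * 30)    # p_hat_C = 0.50

observed = treatment.mean() - control.mean()

# Under H0 there is no difference, so group labels are
# exchangeable: pool the outcomes and repeatedly re-randomize
# them into two groups of the original sizes.
pooled = np.concatenate([treatment, control])
n_T = len(treatment)

S = 10_000
null_stats = np.empty(S)
for s in range(S):
    rng.shuffle(pooled)
    null_stats[s] = pooled[:n_T].mean() - pooled[n_T:].mean()

# Two-sided p-value: how often the re-randomized difference is
# at least as extreme as the observed difference.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
```

With these illustrative counts the observed difference of 0.25 is far in the tail of the null distribution, so the p-value is small and we would reject $H_0$.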
To get a confidence interval, we simply bootstrap - resample with replacement - from both the treatment sample and the control sample many times, to construct an estimate of the distribution over the difference of means (proportions).
Section 4 of Raschka^{2} provides a good summary of the possibilities in classical hypothesis testing.
Chapters 1-5 of Box & Tiao^{3} for a deeper dive.
Williams' test for comparing two dependent correlations that share a variable.
The Permutation Test: A Visual Explanation of Statistical Testing - Jared Wilber (2019)
Raschka, S. (2018). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. ArXiv, abs/1811.12808. https://arxiv.org/abs/1811.12808
Box, G.E., & Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. International Statistical Review, 43, 242.