Statistics, Origins and People

Science is perhaps not as pure as it is touted to be.

Jul 21, 2020

2 mins

The origins of statistics feel rather uneasy. In 1877, Francis Galton presented at the ritualistic Friday Evening Discourse at the Royal Institution of Great Britain. His talk was titled “Typical Laws of Heredity”. That’s right, statistics started with eugenics. He brought along a physical apparatus called a quincunx, now known as the Galton board.

A Galton Box (CC-BY-SA-4.0 BY Rodrigo Tetsuo Argenton)

The Galton board was a physical realization of the extraordinary result of the central limit theorem. Balls would fall through a Pascal’s triangle of pins into vertical columns and form a bell-shaped distribution at the bottom. This was the setting in which the concepts of “regression to the mean” and “co-relation” (modern correlation) were first developed. While each ball’s path seems chaotic, the population of balls shows order. This was beautiful.
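To make the board's behaviour concrete, here is a minimal simulation sketch. It assumes an idealized board where every pin sends the ball left or right with equal probability; the function name `simulate_galton_board` and the default parameters (10,000 balls, 12 rows of pins, a fixed seed) are my own illustrative choices, not anything from Galton's apparatus.

```python
import random
from collections import Counter


def simulate_galton_board(num_balls=10_000, num_rows=12, seed=0):
    """Drop balls through an idealized board and count balls per column.

    Assumption: each pin deflects the ball left or right with probability 1/2,
    independently of every other pin.
    """
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(num_balls):
        # A ball's final column is the number of rightward bounces it made.
        column = sum(rng.random() < 0.5 for _ in range(num_rows))
        bins[column] += 1
    # Columns are numbered 0..num_rows (all-left through all-right).
    return [bins[k] for k in range(num_rows + 1)]


counts = simulate_galton_board()
for column, count in enumerate(counts):
    # Crude text histogram: one '#' per 50 balls.
    print(f"{column:2d} {'#' * (count // 50)}")
```

Each individual path is a coin-flip sequence, yet the column counts pile up around the middle in the familiar bell shape, which is the point the physical board was built to make.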

Nonetheless, it pains me to see how science isn’t as pure as it is touted to be and cannot be separated from the people who do it. In Judea Pearl’s words,

It is an irony of history that Galton started out in search of causation and ended up discovering correlation, a relationship that is oblivious of causation.

Galton had an everlasting influence on his disciple, Karl Pearson. Pearson subsequently became an extremely powerful scientific figure by pretty much creating the field of statistics. He went on to create the well-known journal Biometrika. He would be described by his biographer as a “zealot”. The scientific discourse wasn’t as clean as one might have wanted it to be. Dissenters were largely sidelined, and proponents of competing ideas found it very hard to publish in Biometrika, which during its formative years was personally edited by Pearson himself.

The philosophical predispositions of Pearson (and Galton) blinded them to conspicuous failings of correlation and led them to downgrade causation to a special case of it. Today, we know that the opposite is true. Pearson himself contributed a vast number of examples of “spurious correlations”. One of them, as early as 1899, was based on what is now known as Simpson’s paradox: mixing different populations can lead to reversed conclusions. However, he dismissed such cases as a statistical artifact of inappropriate mixing. This was a very big missed opportunity.
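A small worked example shows how mixing populations can reverse a conclusion. The counts below are invented purely for illustration (they are not Pearson's 1899 data), and the labels "A", "B", "population 1" and "population 2" are hypothetical.

```python
# Hypothetical (successes, trials) counts for two options in two populations.
groups = {
    "population 1": {"A": (90, 100), "B": (800, 1000)},
    "population 2": {"A": (300, 1000), "B": (20, 100)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


def pooled(groups, option):
    """Add up successes and trials for one option across all populations."""
    successes = sum(g[option][0] for g in groups.values())
    trials = sum(g[option][1] for g in groups.values())
    return successes, trials


# Within each population, compare the two options.
for name, g in groups.items():
    print(f"{name}: A={rate(*g['A']):.2f}  B={rate(*g['B']):.2f}")

# After mixing the populations, compare them again.
print(f"pooled: A={rate(*pooled(groups, 'A')):.2f}  B={rate(*pooled(groups, 'B')):.2f}")
```

With these made-up numbers, A has the higher rate inside each population, but B has the higher rate once the two are pooled, because A was mostly tried on the population where success is rare. Whether the pooled or the separated comparison is the right one is not something the counts alone can settle.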

Something similar happened with probabilistic methods for decades before the 1980s, when rule-based logic systems dominated discourse in AI research. Later, neural networks suffered a similar fate until the turn of the 21st century, when impressive results finally compelled the community to take note.

Science is a social endeavor. The sociological side-effects are inseparable. People will always make mistakes and bad judgements, often for questionable ulterior motives. I wish scientific discourse were as pure as I imagined it to be as a child. Maybe there is a silver lining: science corrects itself given enough time. But is it worth wasting years fighting the wrong battles?