Origin of the p-value

The story of the p-value used in statistics starts with William Sealy Gosset, better known under the pseudonym Student. Gosset worked at the Guinness brewery and was tasked with maintaining quality by assessing the appearance and fragrance of the plants used for flavoring. To do this, he had to understand how representative a small sample really is. Put differently: what do the results from a sample of size 1 or 2 say about a population of millions? In statistical terms, you could answer that question by calculating the error distribution of the sample mean and comparing it across small and large sample sizes.

In other words…

Gosset would collect a few samples and average the results. Yet the average never quite matched the result found in larger samples; there was always some error. He wanted to know how the sample size would affect that error. For example, he wanted to know how many sample results one needs in order to make good predictions about the results of a million samples.

In more practical language…

Gosset needed to determine the quality of beer. Let's say the company produced 100 bottles of beer. Gosset could not test all the bottles, so he wanted to know the minimum number of bottles one should test for quality in order to make accurate predictions about the quality of all 100 bottles.

One could ask the question: what is the chance that all 100 bottles are perfect when only 5 tested bottles turn out to be perfect? Gosset turned the question around: how many bottles that turn out to be perfect should I test before I can say with a certainty of x% that all 100 bottles are perfect? Here "perfect" refers to the degrees saccharine of the malt extract, which had to be within 0.5 degrees of the targeted 133 degrees.

Gosset calculated the accuracy of estimates from different sample sizes and tabulated the results. We now call this the t-table, and the method is known as Student's t-test. He found that with only 4 samples, he could get within the 0.5 degrees 92% of the time.
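The idea can be sketched with a small simulation. Note the assumptions: the article does not report the actual spread of the measurements, so the population standard deviation (`sigma`) below is a hypothetical value of about 0.57 degrees, chosen because it happens to land roughly on the 92% figure for four samples.

```python
import random
import statistics

def coverage(sample_size, sigma, tolerance=0.5, true_mean=133.0,
             trials=20000, seed=42):
    """Fraction of trials in which the sample mean lands within
    `tolerance` degrees of the true mean. `sigma` is a hypothetical
    population spread, not a figure from Gosset's data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(sample_size)]
        if abs(statistics.fmean(sample) - true_mean) <= tolerance:
            hits += 1
    return hits / trials

# Coverage grows with sample size: more bottles, tighter estimate.
for n in (1, 2, 4, 8):
    print(f"n={n}: within 0.5 degrees {coverage(n, sigma=0.57):.0%} of the time")
```

Running this shows the same pattern Gosset tabulated: the chance that a small sample's mean sits close to the true mean rises quickly as the sample grows.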

The Student pseudonym

Gosset was not allowed to publish under his own name, since Guinness did not want rival breweries to learn how efficiently it tested for quality. He was, however, allowed to publish under a pseudonym: Student. The publication of his work went largely unnoticed. Some people, among them R.A. Fisher, nevertheless thought Gosset's ideas were brilliant and could be used to determine whether the results of two groups differed in a statistically significant way. In 1925 Fisher published an influential work [1]. If a result has a probability of less than 5% of occurring by chance, it is said to be statistically significant. The choice of 5%, or 0.05 (the now-famous p-value threshold), was arbitrary and became quite controversial. Medical, economic and psychology journals all use the p-value in research.
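Fisher's criterion can be illustrated with a short sketch. The permutation test below is one way to estimate how often a difference at least as large as the observed one would arise by chance alone; the two groups of measurements are made up purely for illustration.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, trials=10000, seed=0):
    """Two-sided permutation test: the p-value is the fraction of random
    relabelings of the pooled data that produce a difference in means at
    least as extreme as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical measurements from two groups.
a = [5.1, 4.9, 5.3, 5.2, 5.0]
b = [5.6, 5.8, 5.5, 5.9, 5.7]
p = permutation_p_value(a, b)
print(f"p = {p:.4f} -> {'significant' if p < 0.05 else 'not significant'} at the 5% level")
```

A p-value below 0.05 here means fewer than 5% of chance relabelings reproduce a gap that large, which is exactly the sense in which Fisher's threshold declares a result "statistically significant".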

Crisis in science

However, science has lately been going through a crisis [2]. Not only because of so-called p-hacking, but also because of low statistical power, publication bias, lack of transparency and so on, all of which make reproducing studies a major difficulty. According to several studies in medicine, psychology and economics, more than half of published studies are not reproducible [3][4].

With these issues in mind, a p-value threshold of 0.005 has been proposed. That value, however, is just as arbitrary. Perhaps we should follow Gosset's example instead. He was a pragmatic man: he did not set out to prescribe how research ought to be done or to define criteria for assessing quality. No, he was more interested in solving problems.

P-values have no significance when they do not solve problems. Significant p-values have no value when they do not solve problems. 


Articles used:

Researchers want to redefine the threshold for scientific discovery from 0.05 to 0.005
The Guinness brewer who revolutionized statistics

  1. Statistical Methods for Research Workers
  2. http://www.mrc-cbu.cam.ac.uk/wp-content/uploads/2016/09/Bishop_CBUOpenScience_November2016.pdf
  3. http://science.sciencemag.org/content/349/6251/aac4716
  4. http://www.sciencemag.org/news/2016/03/about-40-economics-experiments-fail-replication-survey