Disraeli Had It Right

I only have an elementary grasp of statistics, but I know enough to become frustrated when I read some article proclaiming that the results of some test were "statistically significant," as if that statement were, well, significant. Not only is it often insignificant (as the word is commonly used), but it is also an incomplete statement. Say we have a woman, Ms. T, who claims to have a palate so precise that she can discern whether her tea was made by pouring hot water over the tea bag or by adding the tea bag to the hot water. If we wanted to test her claim, basic statistical methodology would be (roughly):

  1. Form a hypothesis, such as "Ms. T cannot tell the difference in how her tea is prepared." This hypothesis, called the null hypothesis, is formed for the express purpose of being disproved.
  2. Form an alternative hypothesis, such as "Ms. T can tell the difference in how her tea is prepared," in the event the null hypothesis is rejected.
  3. Set up an experiment, such as one where Ms. T is blindfolded, given fifty cups of tea prepared either by adding the tea bag or the hot water first, and asked to identify which method was used for each cup.
  4. Decide in advance how unlikely Ms. T's results would have to be, under the null hypothesis (she can't tell the difference), before we are willing to conclude they did not happen merely by chance. These thresholds are called significance levels (standard values are 5%, 1% and 0.1%, though there's no particular reason for these numbers). If the probability of our results arising by chance is lower than the chosen level, we reject the null hypothesis and take the results as support for the alternative hypothesis, because we believe they are too improbable to be mere coincidence. If the probability is higher than our significance level, we cannot reject the null hypothesis.
  5. Compare Ms. T's results with what ordinary guessers would achieve. Let's say she correctly identifies 49 of the 50 cups, and we calculate that the chance of guessing that well at random is only 3% (a rough sketch of this calculation follows the list).
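For the curious, here is a minimal sketch, in Python, of how one might carry out step 5's calculation under the null hypothesis that Ms. T is simply guessing each cup. The function name is my own invention, and the 3% figure above is purely illustrative rather than the output of this exact computation.

```python
from math import comb

def chance_of_guessing(correct, total, p_guess=0.5):
    """Probability of getting at least `correct` cups right out of `total`
    by pure chance, where each guess succeeds with probability p_guess
    (the null hypothesis: Ms. T can't tell the difference)."""
    return sum(
        comb(total, k) * p_guess**k * (1 - p_guess)**(total - k)
        for k in range(correct, total + 1)
    )

# Ms. T identifies 49 of the 50 cups correctly.
print(chance_of_guessing(49, 50))
```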

What happens now? Well, remember, our significance levels were 5%, 1% and 0.1%. Since the actual probability was determined to be 3%, our results are "statistically significant" at the 5% level, but are not significant at the 1% and 0.1% levels. "Significant" in statistics does not mean "important" or "considerable," but rather "indicative" or "expressive." In plain English, all our results have shown is that if we decide in advance that any outcome with less than a 5% chance of occurring at random is too unlikely to be a coincidence, then a 3% result clears that bar, so the null hypothesis should be rejected.
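Put in code, the decision rule is nothing more exotic than a comparison against whichever thresholds we chose beforehand. Again, just a sketch, reusing the post's illustrative 3% figure rather than the number computed above.

```python
p_value = 0.03  # the illustrative probability of Ms. T's results occurring by chance

for level in (0.05, 0.01, 0.001):
    verdict = "reject" if p_value < level else "cannot reject"
    print(f"At the {level:.1%} significance level: {verdict} the null hypothesis")
```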

But remember, the 5% significance level we selected is purely arbitrary--others might think 0.1% is the better benchmark, that is, that we should only reject our null hypothesis if there is less than one chance in a thousand that Ms. T's results could have happened by coincidence. Furthermore, some researchers have been known to set the significance level after they've conducted the experiment; that way, they can choose a level ex post that makes their results "significant."

At best, most journalists' claims about statistical significance are incomplete because they fail to include the benchmark that determines significance, and at worst, they grossly misinterpret "statistically significant" as meaning "statistically important."

I find the philosophy behind this type of statistical method to be interesting. I might write a future post on how we subscribe to the same philosophy when determining the burden of proof necessary to convict a criminal. If nothing else, I trust this post has convinced readers that journalists have the third type of lie down pat, whether they realize it or not.