Yes, We Need Statistical Significance Testing

A rule of thumb may yield correct results but can’t be scientifically credible.

Benjamin Marie
Towards AI


Illustration by the author.

Take any research paper or blog post presenting a new method for AI, and you'll very probably find a statement similar to this:

[…] a significant improvement over previous work.

If the method is applied to language generation tasks (automatic summarization, paraphrasing, machine translation, conversational AI, etc.), the "improvement" is almost always an increase in some evaluation metric score: for instance, BLEU, ROUGE, BLEURT, or BERTScore.

But are these improvements really significant? Are we really improving over previous work? Is it not just luck?

To answer these questions, we can perform statistical significance testing. Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In natural language processing (NLP), or even AI in general, this often means checking that a proposed system A is not better by chance than a baseline system B.

The usefulness of statistical significance testing in science is regularly debated (see, for instance, Wasserstein et al. (2019)). Nonetheless, in NLP, there is evidence that it helps to make more credible claims and more insightful comparisons between systems (see, for instance, Koehn (2004), Riezler and Maxwell (2005), Dror et al. (2018), and Kocmi et al. (2021)).

So first, let’s have a look at how often claims in NLP research are made with the support of statistical significance testing.

The Decline Of Statistical Significance Testing

The work of Dror et al. (2018) is one of the most recent and cited studies on the use of statistical significance testing in NLP. One of their major observations is that NLP researchers put a lot of emphasis on empirical results and draw conclusions from them without checking their statistical significance. After analyzing a sample of 213 papers published at ACL and TACL 2017, they found that only 36.6% (78 papers) of them claimed that they performed statistical significance testing, meaning that 63.4% didn’t check whether their observations were coincidental.

I extended this analysis to the last 12 years, focusing on the machine translation research papers (my area of expertise) published at conferences organized by the ACL. In total, I manually annotated 913 papers. In all the annotated papers, the authors compared machine translation systems using some automatic evaluation metric such as BLEU.

Percentage of annotated machine translation papers performing statistical significance testing, per year. Illustration by the author.

We can clearly see that, in machine translation, the use of statistical significance testing peaked more than 10 years ago, in 2011, then declined until 2018, and now seems to be slowly bouncing back. Last year, only 41% of the publications performed statistical significance testing.

If most studies in machine translation don’t perform statistical significance testing, how can they claim significant improvements over previous work?

Simply by relying on the rule of thumb that an improvement of X BLEU points is always significant, where X is usually a value greater than 0.5.

To assess whether we can “guess”, based on a rule of thumb, that a system is significantly better than another one, I performed the first large-scale empirical study on statistical significance testing in machine translation.

An Empirical Study Of Statistical Significance Testing

The goal of the following experiments is to empirically assess when the difference between two systems, in terms of BLEU, is significant. In particular, I will check whether an improvement of 0.5 BLEU points or more can be considered significant, as is now commonly assumed in research papers.

Since I want my observations to be as general and replicable as possible, I chose to run experiments with all the translation outputs submitted to the WMT21 and WMT22 main translation tasks. These tasks received a total of 486 submissions, yielding 2,706 possible system pairs for which we can test the statistical significance of the difference.

I used SacreBLEU with its default parameters to perform significance testing with paired bootstrap resampling and approximate randomization.
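
To make the procedure concrete, here is a minimal Python sketch of paired bootstrap resampling for BLEU, loosely following Koehn (2004). The file names are hypothetical placeholders and the test is deliberately simplified; for real experiments, SacreBLEU's own paired tests should be preferred.

    # A minimal sketch of paired bootstrap resampling for BLEU (simplified from
    # Koehn, 2004). File names are placeholders; SacreBLEU's built-in paired
    # tests are the more robust choice in practice.
    import random

    import sacrebleu


    def paired_bootstrap_p(sys_a, sys_b, refs, n_samples=1000, seed=12345):
        """Estimate how often system B matches or beats system A on resampled test sets."""
        rng = random.Random(seed)
        n = len(refs)
        wins_b = 0
        for _ in range(n_samples):
            # Draw a test set of the same size by sampling sentence indices with replacement.
            idx = [rng.randrange(n) for _ in range(n)]
            bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in idx], [[refs[i] for i in idx]]).score
            bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in idx], [[refs[i] for i in idx]]).score
            if bleu_b >= bleu_a:
                wins_b += 1
        # The fraction of resamples where B is at least as good as A approximates
        # the p-value of the claim "A is better than B".
        return wins_b / n_samples


    if __name__ == "__main__":
        sys_a = open("sysA.txt", encoding="utf-8").read().splitlines()  # translations of system A
        sys_b = open("sysB.txt", encoding="utf-8").read().splitlines()  # translations of system B
        refs = open("ref.txt", encoding="utf-8").read().splitlines()    # reference translations
        print(f"p-value (paired bootstrap, BLEU): {paired_bootstrap_p(sys_a, sys_b, refs):.4f}")

Intuitively, if system B catches up with system A on many resampled test sets, the observed BLEU gap could easily be an artifact of the particular test set, and the difference shouldn't be called significant.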

The following charts give the p-value for each BLEU score difference for each system pair. Lower p-values mean more significant differences. Note that not all the system pairs are presented here: for better readability, I zoomed in on the charts to show only the area of interest. The dotted lines indicate interesting thresholds for the discussion.

p-values computed with paired bootstrap resampling for the difference in BLEU for each system pair. Note: with paired bootstrap resampling, SacreBLEU performs an odd approximation near 0 for the p-value, which is why no delta BLEU gets a p-value of exactly 0.0. — Illustration by the author.
p-values computed with approximate randomization for the difference in BLEU for each system pair. — Illustration by the author.

In NLP, common thresholds for p-values are used to decide whether an improvement is significant: 0.001, 0.01, and 0.05. Authors arbitrarily decide on the threshold under which they will consider their results statistically significant. For each one of these thresholds, we can make the following observations:

  • p-value < 0.001: A delta BLEU higher than 2.7 is always significant at WMT21/22. Only a few points aren't significant above 2.0. These observations are similar for both types of statistical significance tests. At this threshold of p-value, 35.0% and 36.1% of the delta BLEU between 1.0 and 2.0 aren't significant according to paired bootstrap resampling and approximate randomization, respectively. To claim a significant improvement with p-value < 0.001, statistical significance testing appears necessary for differences lower than 2.0 BLEU points.
  • p-value < 0.01: Except for 3 points with paired bootstrap resampling, a delta BLEU higher than 1.5 is significant. At this threshold of p-value, 10.0% and 13.9% of the delta BLEU between 1.0 and 1.5 aren’t considered significant according to paired bootstrap resampling and approximate randomization, respectively. To claim significant improvements with a p-value < 0.01, statistical significance testing appears necessary for differences lower than 1.5 BLEU points.
  • p-value < 0.05: Except for 2 points with approximate randomization, a delta BLEU higher than 1.0 is significant at WMT21/22. At this threshold of p-value, 20.4% and 42.0% of the delta BLEU between 0.5 and 1.0 aren’t considered significant according to paired bootstrap resampling and approximate randomization, respectively. To claim significant improvement with p-value < 0.05, statistical significance testing appears necessary for differences lower than 1.0 BLEU points.

In other words, there seem to be some clear thresholds that we can use to claim significant differences without even performing statistical significance testing. To claim a significant improvement over previous work in BLEU, the difference should be greater than 1.0 BLEU points. If we assume that “very significant” is “p-value < 0.001,” then the improvement should be 2.0 BLEU points or greater.

Of course, this is just an empirical experiment, and there will be rare situations where these thresholds lead to wrong conclusions. I could extend these experiments to all WMT and WAT campaigns to get many more points, but I think for the purpose of this article, this is already convincing enough.

All of these observations are useful, but only if we are very happy with BLEU and want to continue using it in AI research papers… Given its poor correlation with human judgments, we may want to use other metrics.

But what happens if we change the metric?

For instance, since a 1.0 BLEU point difference appears significant, does that mean a 1.0 point difference with another metric is also significant?

Obviously, no. But let’s draw another chart to check it, this time using COMET, a state-of-the-art metric for machine translation. The same WMT21/22 systems are considered, but only with paired bootstrap resampling. Note: I used comet-compare for statistical significance testing with COMET, using the model wmt20-comet-da.

p-values computed with paired bootstrap resampling for the difference in COMET for each system pair. — Illustration by the author.

We clearly see here that a 1.0 COMET point improvement is often insignificant. Even a 4.0 COMET point difference might not be significant. COMET scores are not bounded between 0 and 100, and WMT21/22 systems can reach hundreds of absolute COMET points. The scale also varies a lot depending on the COMET model used: the model wmt21-comet-da often gives scores lower than 10 for systems that would get, for instance, 100 COMET points with the model wmt20-comet-da. I discussed this peculiarity of COMET more extensively in one of my previous articles.
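
The same bootstrap idea transfers to COMET by resampling segment-level scores (the system score is their average). Below is a hedged sketch using the unbabel-comet Python package; the model identifier matches the one used above, but the file names are placeholders and the predict() interface assumed here is that of recent package versions, so comet-compare remains the safer, better-tested option.

    # A rough sketch: paired bootstrap resampling over COMET segment scores.
    # Assumes the unbabel-comet package (roughly version 2.0 or later); file
    # names are placeholders, and comet-compare offers an equivalent,
    # better-validated procedure out of the box.
    import random

    from comet import download_model, load_from_checkpoint


    def comet_segment_scores(model, srcs, hyps, refs):
        """Score each segment of one system with a COMET model."""
        data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
        return model.predict(data, batch_size=8, gpus=0).scores


    def paired_bootstrap_p(scores_a, scores_b, n_samples=1000, seed=12345):
        """Fraction of resampled test sets on which system B matches or beats system A."""
        rng = random.Random(seed)
        n = len(scores_a)
        wins_b = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]
            mean_a = sum(scores_a[i] for i in idx) / n
            mean_b = sum(scores_b[i] for i in idx) / n
            if mean_b >= mean_a:
                wins_b += 1
        return wins_b / n_samples


    if __name__ == "__main__":
        model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))
        srcs = open("src.txt", encoding="utf-8").read().splitlines()   # source sentences
        refs = open("ref.txt", encoding="utf-8").read().splitlines()   # reference translations
        sys_a = open("sysA.txt", encoding="utf-8").read().splitlines() # translations of system A
        sys_b = open("sysB.txt", encoding="utf-8").read().splitlines() # translations of system B
        scores_a = comet_segment_scores(model, srcs, sys_a, refs)
        scores_b = comet_segment_scores(model, srcs, sys_b, refs)
        print(f"p-value (paired bootstrap, COMET): {paired_bootstrap_p(scores_a, scores_b):.4f}")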

These observations show that we shouldn't make any assumptions about the significance of the difference between two systems. We can't just guess whether the difference is significant, especially with new metrics.

We should always apply statistical significance testing. Currently, most research papers don’t.

So Why Don't Most Research Papers Do Statistical Significance Testing?

Statistical significance testing isn’t widely used in NLP research for many reasons. Here are some of them.

First, it is often considered useless, based on the assumption that, for some tasks such as machine translation, an improvement will always be found statistically significant. Empirically, we saw in the previous section that this is only true above certain thresholds, and that these thresholds can't be generalized.

Second, in NLP, a "large" difference between two metric scores is currently good enough for publication at top-tier conferences and journals. So, if we get a great X points of improvement over previous work, why bother checking whether it is statistically significant if no one asks for it? If NLP research weren't so focused on positive results, this wouldn't be a problem: empirical results with a high p-value should be accepted if the remainder of the paper is correct and insightful for the research community.

Finally, in many papers, statistical significance testing can't technically be performed. A common trend in AI research is to simply copy scores from previously published papers. In this situation, without the outputs of all the compared systems, statistical significance testing isn't possible. This is one of the many reasons why I fight against copying and comparing previously published scores. Requiring statistical significance testing de facto calls for better evaluation practices.

To give some contrast to this article, I invite you to read some counterpoints in this blog post by Mathias Müller.

If you want to support this work, follow me on Medium.
