Friday 23 August 2013

Why the Bonferroni correction is a mistake (almost always)

Why statistical adjustment for multiple comparisons (eg. the Bonferroni correction) is almost always a mistake

Bruce G Charlton


One thing that everybody who has ever done a course on statistics apparently remembers is that there is 'a problem' with using multiple tests for statistical significance on a single data set.

In a nutshell, if you keep looking for significant differences, or significant correlations, by two-way comparisons - then you will eventually find one by chance.

So if you were to seek the 'cause' of fingernail cancer by measuring 20 biochemical variables, then you would expect one of these variables to correlate with the diagnosis at the p=0.05 level of significance - on the basis that p=0.05 represents a one-in-twenty probability.
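The arithmetic behind that intuition can be sketched in a few lines of Python (a minimal illustration, assuming 20 independent tests of true null hypotheses):

```python
alpha = 0.05   # the conventional one-in-twenty significance threshold
m = 20         # number of biochemical variables tested

# Expected number of 'significant' correlations when every null is true:
expected_false_positives = m * alpha   # about one per study

# Chance of at least one 'significant' result somewhere in the study:
fwer = 1 - (1 - alpha) ** m
print(expected_false_positives, round(fwer, 2))  # 1.0 0.64
```

So a trawl of 20 unrelated variables will, on average, turn up one 'significant' correlation, and will do so in roughly two studies out of three.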


For some reason, everybody remembers this problem. But what to do about it is where the trouble starts - especially considering that most research studies measure more than a pair of variables and consequently want to make more than one comparison.


Increasing the 'stringency' of the statistical test by demanding a smaller p value in proportion to the number of comparisons - the greater the number of comparisons, the smaller the p value required before 'significance' is reached (eg the Bonferroni 'correction') - is probably the commonest suggestion - but it is the wrong answer.
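Concretely, the Bonferroni rule divides the usual significance threshold by the number of comparisons; a minimal sketch (the numbers are illustrative):

```python
alpha = 0.05   # the usual threshold for a single comparison

# The Bonferroni rule: with m comparisons, demand p < alpha / m
for m in (1, 5, 20, 100):
    print(m, alpha / m)
# With 100 comparisons a result must reach p < 0.0005 to count as
# 'significant' - a far more stringent demand than p < 0.05.
```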

Or rather it is the answer to a different question, because (as it is used) it is trying to provide a statistical solution to a scientific problem - thus it is trying to replace science by statistical obfuscation, which cannot be done: thus the Bonferroni 'correction' is (in common usage) an error based upon incompetence.

As I will try to explain, in reality, the Bonferroni correction has no plausible role in mainstream research - what it does is something that almost never needs to be done.


Statistical tests are based on the idea that the investigator has taken a random sample from a population, and wishes to generalise from that sample to the whole population. Each random sample is a microcosm of the population from which it is drawn, so by measuring a variable in the sample one can make an estimate of the size of that variable in the population.

For example, the percentage of intended Republican voters in a random sample from the US state of Utah is an estimate of the percentage of Republican voters in the whole of Utah.

If the sample is small, then the estimate is imprecise and will have a large confidence interval - and as the size of a sample gets bigger, then the properties of the population from which it is drawn become more apparent, hence the confidence interval gets smaller and the estimate is regarded as more precise.


For instance, if there were only ten people randomly sampled in an opinion poll, then obviously this cannot give a precise estimate of the true proportion that would vote Republican compared with Democrat, and Libertarian voters would probably be missed-out, or if included over-estimated as a proportion, and the tiny proportion intending to vote for the Monster Raving Looney Party would almost certainly be unrepresented.

But a random sample of 1000 will yield a highly precise estimate of the major parties' support, and will let you know whether the MRLP voters are few enough to be ignored.

The statistical null hypothesis assumes that a comparison - say, between two opinion polls - is a comparison between random samples drawn from a single population. The p value estimates how likely it is that measurements as extreme as ours would arise if this null hypothesis were true.


Now, suppose we want to compare Democrat support in Utah and Massachusetts, to see if it is different. This is the kind of question being asked in almost all science where statistics are used. 

Random samples of opinion are taken in the two states, and it looks as if there are a higher proportion of potential Democrat voters in Massachusetts. A t-test is used to ask how likely it is that the apparent difference in Democrat support could in fact have arisen by random chance in randomly drawing two samples from the same population (taking into account that the samples have a particular size, are normally distributed, and are characterized by these particular mean and standard deviation values).
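One way to see what such a test is doing is a permutation version of the same question (the t-test itself uses a formula, but the logic is identical): pool the two samples, reshuffle the labels, and see how often a gap as large as the observed one appears by pure chance. The poll numbers below are invented for illustration:

```python
import random

random.seed(1)

# Hypothetical polls (illustrative numbers, not real data):
# 1 = intends to vote Democrat, 0 = otherwise.
utah = [1] * 30 + [0] * 70          # 30% Democrat in a 100-person sample
mass = [1] * 60 + [0] * 40          # 60% Democrat in a 100-person sample

observed = abs(sum(mass) / len(mass) - sum(utah) / len(utah))

# If both samples really came from one population, shuffling the state
# labels should often produce a gap as large as the one we observed.
pooled = utah + mass
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:100], pooled[100:]
    if abs(sum(a) / 100 - sum(b) / 100) >= observed:
        count += 1

p_value = count / trials
print(p_value)  # essentially zero: a 30-point gap almost never arises by reshuffling
```

The tiny p value says the same thing the t-test says: it is very hard to get these two samples by randomly splitting one population.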

It turns out that the p value is low, which means that the difference in intended Democrat voting between samples from Utah and Massachusetts is large enough to make it improbable that the samples were drawn from the same population (ie. it is more probable that the two samples were drawn from different populations).


What is happening here is that we have decided that there is a significant difference between a microcosm of Massachusetts voting patterns and a microcosm of Utah voting patterns. It seems very unlikely that they could have been so different simply by random chance of sampling from a single population. The p value merely summarizes the probabilities relating to this state of affairs.

In this example, the p value is affected only by the characteristics of the Utah and Massachusetts samples, and we use it to decide how big a sample is needed before we accept that voting patterns in Massachusetts really are different from those in Utah.

The necessary size of the sample (to make this decision of difference) depends on how big the difference is (the bigger the difference, the smaller the sample needed to discover it), and on the scatter around the mean (the more scattered the variation around the mean, the bigger the sample needed to discover the true mean value which is being obscured by the scatter).
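That trade-off can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions (a textbook approximation; the specific percentages are invented for illustration):

```python
def approx_n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Textbook normal-approximation sample size per group for comparing
    two proportions at roughly 5% significance and 80% power."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# A big gap needs only a small sample; a small gap needs a huge one:
print(round(approx_n_per_group(0.30, 0.60)))  # ≈ 39 per state
print(round(approx_n_per_group(0.45, 0.50)))  # ≈ 1560 per state
```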

But the sample size needed to decide that Utah and Massachusetts are different does not (of course!) depend upon how many other samples are being compared, in different studies, involving different places.


Supposing we had done opinion polls in both Utah and Massachusetts, and that there really was a difference between the microcosms of Utah and Massachusetts voting intentions.

Suppose then that someone goes and does an opinion poll in Texas. Naturally, this makes no difference to our decision regarding the difference between Massachusetts and Utah.

Even if opinion polls were conducted on intended Democrat voters in every state of the Union, and these other pollsters performed statistical tests to see whether these other states differed from one another - this would have no effect whatsoever on our initial inference concerning the difference between Massachusetts and Utah.

In particular, there would be no reason to feel that it was now necessary either to take much larger samples of the Utah and Massachusetts populations, or to demand much bigger differences in measured voting patterns, in order to feel the same level of confidence that the difference was real. Because information on voting in other places is irrelevant to the question of voting in Massachusetts and Utah.

Yet this is what the Bonferroni 'correction' imposes: it falsely assumes that the addition of other comparisons somehow means that we do need a larger sample from Massachusetts and Utah in order to conclude the same as we did before. This is just plain wrong!


In sum, the appropriate p value which results from comparing the Utah and Massachusetts samples ought to derive solely from the characteristics of those samples (ie. the size of the samples, and the measured proportion of Democrat voters); and is not affected by the properties of other samples, nor the number of other samples. Obviously not!


So, what assumptions are being made by procedures for 'correction' of p values for multiple comparisons, such as the Bonferroni procedure? The procedure demands a smaller p value to count as a significant difference according to the number of comparisons. It therefore assumes that the reality of a significant difference in voting intentions between Utah and Massachusetts is - somehow! - affected by the voting intentions in other states...

From where does this confusion arise?  The answer is that the multiple comparisons procedure has a different null hypothesis. The question being asked is different.

The question implicitly being asked when using the Bonferroni 'correction' is no longer 'what is the likelihood that these two opinion polls were performed on the same population' - but instead something along the lines of 'how likely is it that we will find differences between any two opinion polls - no matter how many opinion polls we do on a single population'.


In effect, the Bonferroni procedure incrementally adjusts the p values, such that every time we take another opinion poll, the p value that counts as significant gets smaller, so that the probability of finding a statistically significant difference between polls remains constant no matter how many polls we do.
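This can be verified numerically: with the per-test threshold shrunk to alpha/m, the chance of any false positive across m independent null tests never rises above the original alpha (an illustrative sketch):

```python
alpha = 0.05
for m in (1, 5, 20, 100):
    adjusted = alpha / m
    # Probability of at least one false positive across m independent
    # null tests, each judged at the shrunken per-test threshold:
    fwer = 1 - (1 - adjusted) ** m
    print(m, round(fwer, 4))  # stays just under 0.05 however large m gets
```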

In other words, the Bonferroni 'correction' is based upon the correct but irrelevant fact that the more states we compared by opinion polls, the greater the chance that we would find some two states with apparently different voting intentions by sheer chance variation in sampling.

But this has precisely nothing to do with a comparison of voting patterns between Utah and Massachusetts.


In effect, the Bonferroni procedure (Bp) takes account of the fact that a sample may be large enough to provide a sufficiently precise microcosm of the voting intentions in two states, but the imprecision of each two-way comparison is compounded whenever we do another, and another, such comparison. So with the Bonferroni 'correction' in operation - no matter how many opinion polls are compared, we are no more likely to find a difference than if only two samples were compared.


But why or when might this kind of statistical correction be useful and relevant?

Beats me!...

I cannot imagine any plausible scientific situation in which it would be legitimate to apply the Bonferroni procedure in practice.

I am not saying there aren’t any situations when the Bp is appropriate – but I cannot think of any, and surely such situations must be very rare indeed.


Yet in practice the Bonferroni procedure is used - hence mis-used - a lot; and in fact the Bp is often insisted-upon by supposed statistical experts as a condition of refereeing in academic and publishing situations (e.g. use the Bp or you won't pass your PhD, use the Bp or your paper will be rejected).

The usual situation is that the non-scientific (hence anti-scientific) statistical incompetents (which is to say, nearly everybody) believe that the Bonferroni correction is merely a more rigorous use of significance testing - a marker of a competent researcher; when in fact it is (almost always) the marker of somebody who hasn't a clue what they are doing or why.

This situation is a damning indictment of the honesty and competence of modern researchers - who are prepared to use and indeed impose a procedure they don't understand - and apparently don't care about enough to make the ten minutes of effort necessary to understand; but instead just 'go along with' a prevailing trend that is not just arbitrary but pernicious - multiply pernicious not only in terms of systematically misinterpreting research results, but also in coercing people formally to agree to believing in nonsense; and thereby publicly to join-in with a purposive destruction of real science and the motivations of real science. 


So, given that applying the Bonferroni procedure is not a ‘correction’ but a nonsensical and distorting misunderstanding; then what should be done about the problem of multiple comparisons? Because the problem is real, even though the suggested answer has nothing to do with the problem.

Supposing you had been trawling data to find possibly important correlations between any of a large number of measured variables. A perfectly legitimate scientific procedure. Well, there may be some magnitude of significance (e.g. p value) at which your attention would be attracted to a specific pair of variables from among all the others.

If, in the above hypothetical data trawl, fingernail cancer showed a positive association with nose-picking at a p value of 0.05 - then that is the size of the association: an association of a size that would occur by random sampling of a single population only once in twenty times. It doesn't matter how many other variables were considered alongside nose-picking in looking for causes of fingernail cancer - one, two, four or a hundred and twenty-eight: if that association between NPicking and FNail Ca is important enough to be interesting, then it is important enough to be interesting.
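The data-trawl arithmetic can be checked by simulation. Under a true null hypothesis the p value of a test is uniformly distributed on [0, 1], so with 128 unrelated variables we expect about 128 × 0.05 ≈ 6.4 spurious 'hits' per trawl (a sketch using simulated null p values, not real data):

```python
import random

random.seed(42)

m, alpha, reps = 128, 0.05, 2000

# Each simulated trawl draws m null p values and counts 'significant' hits.
hit_counts = []
for _ in range(reps):
    hits = sum(1 for _ in range(m) if random.random() < alpha)
    hit_counts.append(hits)

average_hits = sum(hit_counts) / reps
print(round(average_hits, 1))  # close to the expected 6.4
```

None of this changes the size or meaning of any one pairwise association; it only tells you how many accidental ones to expect in the trawl as a whole.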

If a pairwise correlation or difference between two populations is big enough to be interesting, but you are unsure whether or not it might be due to repeated random sampling and multiple comparisons within a single population - then further statistical analysis of that same set of data cannot help you resolve the uncertainty.


But - given the reality of the problem of multiple comparisons - what to do?

The one and only rigorous answer is (if possible) to check your measurement using new data.

If you are unsure whether Utah and Massachusetts voting patterns really are different, then don't fiddle around with the statistical analysis of that poll - go out and do another poll.

And keep making observations until you are sure enough.

It's called doing science. 


baduin said...

You seem to have this quite backwards.

If I were required to check whether the result of my data trawling is true, I would find nearly always that it is wrong. And here goes my dissertation.

Someone who would make such an error will never succeed at a scientific career.

Bruce Charlton said...

@b - I see you are one step ahead of me...

dearieme said...

Mind you, there are those who doubt the whole rationale of significance tests anyway, and not just because the use of confidence intervals will often be a superior alternative. When I was young and took a shufti at Bayesian statistics I found the argument quite attractive but for one thing: where on earth was I to get a "prior" from? Perhaps the field has advanced since then.

Bruce Charlton said...

@d - When I was an epidemiologist people used to talk about confidence intervals being superior, but in practice they always used them as significance tests (non-overlapping CIs being regarded as equivalent to p less than 0.05).

In other words, they wanted the stats to tell them what was *true* - which they never can.

I agree wrt Bayesian stats - if you already know the priors, you already know the answer - I think it is essentially a nonsensical attempt to replace science by pseudo-numbers purporting to represent science.

Frequentist type traditional stats are fine if used, and not used, in the way they were up to about the mid-1960s - i.e. by real scientists, who regard stats as mostly a convenient summary display of the results, and the main thing being the honest exercise of informed judgment to interpret them.