**Why statistical adjustment for multiple comparisons (eg. the Bonferroni correction) is almost always a mistake**

Bruce G Charlton

*

One thing that everybody who has ever done a course on statistics apparently remembers is that there is ‘a problem’ with using multiple tests for statistical significance on a single data set.

In a nutshell, if you keep
looking for significant differences, or significant correlations, by two-way comparisons - then you will
eventually find one by chance.

So that if you were to seek for the 'cause' of fingernail cancer by measuring 20 biochemical variables, then you would expect one of these variables to correlate with the diagnosis at the p=0.05 level of significance - on the basis that p=0.05 is a one-in-twenty probability statistic.
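
The arithmetic behind this worry can be sketched in a few lines of Python. This is only an illustration, and it rests on assumptions the example above does not state: that the 20 comparisons are independent of one another, and that none of the variables has any real association with the diagnosis. The function names are made up for the sketch.

```python
def expected_false_positives(m, alpha=0.05):
    """Average number of 'significant' results among m tests
    when no variable has any real association."""
    return m * alpha


def chance_of_any_false_positive(m, alpha=0.05):
    """Chance that at least one of m independent tests comes out
    'significant' when no variable has any real association."""
    return 1.0 - (1.0 - alpha) ** m


# The fingernail cancer example: 20 variables, each tested at p=0.05.
print(expected_false_positives(20))
print(round(chance_of_any_false_positive(20), 3))
```

On these assumptions the expected number of chance 'findings' is one, and the chance of getting at least one is a little under two in three.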

*

For some reason, everybody remembers this problem. But *what to do about it* is where the trouble starts - especially considering that most research studies measure more than a pair of variables and consequently want to make more than one comparison.

*

Increasing the 'stringency' of the statistical test by demanding a smaller p value in proportion to the number of comparisons - so that the greater the number of comparisons, the smaller the p value must be before ‘significance’ is reached (eg the Bonferroni 'correction') - is probably the commonest suggestion. But this is the wrong answer.

http://en.wikipedia.org/wiki/Bonferroni_correction
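
For readers who have not met it, the procedure itself is simple enough to write down. The sketch below assumes the usual starting level of 0.05; the function name is invented for illustration.

```python
def bonferroni_threshold(m, alpha=0.05):
    """The p value demanded of each comparison when m comparisons are made."""
    return alpha / m


# The more comparisons, the smaller the p value that counts as 'significant'.
for m in (1, 2, 20, 100):
    print(m, bonferroni_threshold(m))
```

With 20 comparisons, each one is required to reach roughly p=0.0025 rather than p=0.05.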

Or rather it is the answer to a different question, because (as it is used) it is trying to provide a statistical solution to a scientific problem - thus it is trying to *replace* science by statistical obfuscation, which cannot be done: thus the Bonferroni 'correction' is (in common usage) *an error based upon incompetence*.

As I will try to explain, in reality, the Bonferroni correction has no plausible role in mainstream research - what it does is something that almost never needs to be done.

*

Statistical tests are based
on the idea that the investigator has taken a random sample from a population,
and wishes to generalise from that sample to the whole population.

*Each random sample is a microcosm of the population from which it is drawn*, so by measuring a variable in the sample one can make an estimate of the size of that variable in the population.

For example, the percentage of intended Republican voters in a random sample from the US state of Utah is an estimate of the percentage of Republican voters in the whole of Utah.

If the sample is small, then
the estimate is imprecise and will have a large confidence interval - and as the
size of a sample gets bigger, then the properties of the population from which
it is drawn become more apparent, hence the confidence interval gets smaller and
the estimate is regarded as more precise.

*

For instance, if only ten people were randomly sampled in an opinion poll, then obviously this cannot give a precise estimate of the true proportion that would vote Republican compared with Democrat. Libertarian voters would probably be missed out or, if included, over-estimated as a proportion; and the tiny proportion intending to vote for the Monster Raving Loony Party would almost certainly be unrepresented.

But a random sample of 1000
will yield a highly precise estimate of the major parties' support, and will let
you know whether the MRLP voters are few enough to be ignored.
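
To put rough numbers on the difference between a poll of ten and a poll of 1000, here is a minimal sketch. It uses the simple normal-approximation confidence interval for a proportion, which is itself crude for a sample of ten. The 50% observed support and the 95% level are assumptions chosen for illustration, not figures from any real poll.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(p_hat, n, confidence=0.95):
    """Half-width of the normal-approximation confidence interval
    for a proportion p_hat observed in a random sample of size n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


# Assumed for illustration: half of each sample intends to vote Republican.
for n in (10, 1000):
    print(n, round(ci_half_width(0.5, n), 3))
```

On these assumptions the sample of ten gives an estimate of 50% plus or minus about 31 percentage points, while the sample of 1000 gives plus or minus about 3.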

The statistical null
hypothesis assumes that comparisons between Republican and Democrat are
comparisons between random samples drawn from a single population. The size of
the p value estimates how likely it is that measurements like those we have
obtained from our sample would arise if this null hypothesis were true.

*

Now, suppose we want to compare Democrat support in Utah and Massachusetts, to see if it is different. This is the kind of question being asked in almost all science where statistics are used.

Random samples of opinion are taken in the two states, and it looks as if there is a higher proportion of potential Democrat voters in Massachusetts. A t-test is used to ask how likely it is that the apparent difference in Democrat support could in fact have arisen by random chance in randomly drawing two samples from the same population (taking into account that the samples have a particular size, are normally distributed, and are characterized by these particular mean and standard deviation values).

It turns out that the p value
is low, which means that the difference in intended Democrat voting between
samples from Utah and Massachusetts is large enough to make it improbable that
the samples were drawn from the same population (ie. it is more probable that
the two samples were drawn from different populations).
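
A comparison of this kind can be sketched as follows. Note the swap: because poll results are proportions, the sketch uses a two-proportion z-test rather than the t-test mentioned above, though the logic is the same. The poll counts are entirely hypothetical, and the function name is invented for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p value for the difference between two sample proportions,
    x1 out of n1 and x2 out of n2, using the pooled z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples unanimous: no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts of intended Democrat voters, 1000 people polled per state.
utah_democrats, massachusetts_democrats = 300, 550
print(two_proportion_p_value(utah_democrats, 1000, massachusetts_democrats, 1000))
```

Notice that the function takes only four inputs: the two counts and the two sample sizes. Nothing about any other state appears in the calculation.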

*

What is happening here is that we have decided that there is *a significant difference between a microcosm of Massachusetts voting patterns and a microcosm of Utah voting patterns*. It seems very unlikely that they could have been so different simply by random chance of sampling from a single population. The p value merely summarizes the probabilities relating to this state of affairs.

In this example, the p value is affected only by the characteristics of the Utah and Massachusetts samples, and we use it to decide how big a sample is needed before we accept the probability that voting patterns in Massachusetts really are different from Utah.

The necessary size of the sample (to make this decision of difference) depends on how big the difference is (the bigger the difference, the smaller the sample needed to discover it), and on the scatter around the mean (the more scattered the variation around the mean, the bigger the sample needed to discover the true mean value which is being obscured by the scatter).
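
This dependence can be made concrete with the standard textbook approximation for comparing two means. The sketch below is an illustration only: the 80% power, the two-sided 0.05 level, the equal group sizes and the normal approximation are all assumptions added here, and the function name is made up.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(difference, scatter, alpha=0.05, power=0.80):
    """Approximate number needed in each of two groups to detect a given
    difference in means, where scatter is the standard deviation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * scatter / difference) ** 2)


print(sample_size_per_group(difference=1.0, scatter=1.0))  # big difference
print(sample_size_per_group(difference=0.5, scatter=1.0))  # smaller difference
print(sample_size_per_group(difference=1.0, scatter=2.0))  # more scatter
```

Halving the difference, or doubling the scatter, roughly quadruples the sample needed. The number of other comparisons being made elsewhere does not enter into the formula at all.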

But the sample size needed to decide that Utah and Massachusetts are different does **not** (of course!) depend upon how many other samples are being compared, in different studies, involving different places.

*

Supposing we had done opinion polls in both Utah and Massachusetts, and that there really was a difference between the microcosms of Utah and Massachusetts voting intentions.

Suppose then that someone goes and does an opinion poll in Texas. Naturally, this makes no difference to our decision regarding the difference between Massachusetts and Utah.

Even if opinion polls were conducted on intended Democrat voters in every state of the Union, and these other pollsters performed statistical tests to see whether these other states differed from one another - this would have no effect whatsoever on our initial inference concerning the difference between Massachusetts and Utah.

In particular, there would be
no reason to feel that it was now necessary either to take much larger samples
of the Utah and Massachusetts populations, or to demand much bigger
differences between measured voting patterns, in order to feel the same level
of confidence that the difference was real - because information on voting in
other places is irrelevant to the question of voting in Massachusetts and Utah.

Yet this is what the Bonferroni 'correction' imposes: it falsely assumes that the addition of other comparisons somehow means that we *do* need to have a larger sample from Massachusetts and Utah in order to conclude the same as we did before. This is *just plain wrong!*

*

In sum, the appropriate p value which results from comparing the Utah and Massachusetts samples ought to derive solely from the characteristics of those samples (ie. the size of the samples, and the measured proportion of Democrat voters); and is not affected by the properties of other samples, nor the number of other samples. Obviously not!

*

So, what assumptions are being made by the procedures for ‘correction’ of p values for multiple comparisons, such as the Bonferroni procedure? This procedure demands a smaller p value to count as a significant difference according to the number of comparisons. And therefore it assumes that the reality of a significant difference in voting intentions between Utah and Massachusetts is - somehow! - affected by the voting intentions in other states...

From where does this
confusion arise? The answer is that the
multiple comparisons procedure has a different null hypothesis. The question
being asked is different.

The question implicitly being asked when using the Bonferroni ‘correction’ is no longer 'what is the likelihood that these two opinion polls were performed on the same population' - but instead something on the lines of 'how likely is it that we will find differences between *any two opinion polls* - no matter how many opinion polls we do on a single population'.

*

In effect, the Bonferroni procedure incrementally adjusts the p values, such that every time we take another opinion poll, the p value that counts as significant gets smaller, so that the probability of finding a statistically significant difference between polls remains constant no matter how many polls we do.

In other words, the Bonferroni ‘correction’ is based upon the *correct but irrelevant* fact that the more states we compared by opinion polls, the greater the chance that we would find any two states with different voting intentions by sheer chance variation in the validity of polls. *But this has precisely nothing to do with a comparison of voting patterns between Utah and Massachusetts.*

*

In effect, the Bonferroni procedure (Bp) takes account of the fact that a sample may be large enough to provide a sufficiently precise microcosm of the voting intentions in two states, but the imprecision of each two-way comparison is multiplied whenever we do another, and another, such comparison. So with the Bonferroni ‘correction’ in operation - no matter how many opinion polls are compared, we are no more likely to find a difference than if only two samples were compared.
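
What the procedure holds constant can be seen in a small simulation. This is a sketch under stated assumptions, not a description of any real polling: six polls of 400 people are all drawn from one population with 40% Democrat support, every pair of polls is compared with a two-proportion z-test, and the whole exercise is repeated 500 times. All the numbers and function names are invented for the illustration.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def poll(n, true_share, rng):
    """Number of 'Democrat' answers in one simulated poll of n people."""
    return sum(rng.random() < true_share for _ in range(n))


def p_value(x1, x2, n):
    """Two-sided two-proportion z-test for two polls of equal size n."""
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (x1 - x2) / n / se
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def any_pair_differs_rate(k, threshold, n=400, true_share=0.4, trials=500, seed=1):
    """Fraction of simulated runs in which at least one pair among k polls,
    all taken from the SAME population, differs 'significantly'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [poll(n, true_share, rng) for _ in range(k)]
        if any(p_value(a, b, n) < threshold for a, b in combinations(counts, 2)):
            hits += 1
    return hits / trials


k = 6
m = k * (k - 1) // 2  # 15 pairwise comparisons among 6 polls
uncorrected = any_pair_differs_rate(k, 0.05)
corrected = any_pair_differs_rate(k, 0.05 / m)
print("any pair 'different', uncorrected:", uncorrected)
print("any pair 'different', Bonferroni: ", corrected)
```

The quantity being controlled is the chance that *some* pair, somewhere among all the polls, looks different. The Bonferroni threshold keeps that chance near the original 0.05. Whether that is the quantity anybody actually wanted to know is the point at issue.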

*

But why or when might this kind of statistical correction be useful and relevant?

Beats me!...

I cannot imagine any plausible scientific situation in which it would be legitimate to use the Bonferroni procedure in practice.

I am not saying there aren’t *any* situations when the Bp is appropriate - but I cannot think of any, and surely such situations must be very rare indeed.

*

Yet in practice the Bonferroni procedure is used - hence mis-used - a lot; and in fact the Bp is often insisted upon by supposed statistical experts as a condition of refereeing processes in academic and publishing situations (e.g. use the Bp or you won't pass your PhD, use the Bp or your paper will be rejected).

The usual situation is that the non-scientific (hence anti-scientific) statistical incompetents (which is to say, nearly everybody) believe that the Bonferroni correction is merely a more rigorous use of significance testing - a marker of a competent researcher; when in fact it is (almost always) the marker of somebody who hasn't a clue what they are doing or why.

This situation is a *damning indictment* of the honesty and competence of modern researchers - who are prepared to use and indeed impose a procedure they don't understand - and apparently don't care about enough to make the ten minutes of effort necessary to understand; but instead just 'go along with' a prevailing trend that is not just arbitrary but *pernicious* - multiply pernicious not only in terms of systematically misinterpreting research results, but also in coercing people formally to agree to believing in nonsense; and thereby publicly to join in with a purposive destruction of real science and the motivations of real science.

*

So, given that applying the Bonferroni procedure is not a ‘correction’ but a nonsensical and distorting misunderstanding; then what *should* be done about the problem of multiple comparisons? Because the problem is real, even though the suggested answer has nothing to do with the problem.

Suppose you had been trawling data to find possibly important correlations between any of a large number of measured variables - a perfectly legitimate scientific procedure. Well, there may be some magnitude of significance (e.g. p value) at which your attention would be attracted to a specific pair of variables from among all the others.

If, in the above hypothetical data trawl, fingernail cancer showed a positive association with nose-picking with a p value of 0.05 - then that is the size of the association - of a size in which it would occur by random sampling of the same population only once in twenty times. It doesn't matter how many other variables were considered as well as nose-picking in looking for causes of fingernail cancer, one, two, four or a hundred and twenty eight; but if that nature of association between NPicking and FNail Ca is important enough to be interesting, then it is important enough to be interesting.

If a pairwise correlation or difference between two populations is big enough to be interesting, but you are unsure whether or not it might be due to repeated random sampling and multiple comparisons within a single population - then *further statistical analysis of that same set of data cannot help you resolve the uncertainty*.

*

But - given the reality of the problem of multiple comparisons - what to do?

The one and only rigorous answer is (if possible) to *check your measurement using new data*.

If you are unsure whether Utah and Massachusetts voting patterns really are different, then don't fiddle around with the statistical analysis of that poll - *go out and do another poll*.

And keep making observations until you *are* sure enough.

It's called doing science.
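
What checking with new data buys you can be sketched with a small simulation. The assumptions here are mine, for illustration: each study of a variable is boiled down to a single z statistic, a variable with no real association has a true effect of 0, and a hypothetical variable with a real association has a true effect of 3 on that scale.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def is_significant(true_effect, rng, alpha=0.05):
    """One study of one variable: draw a z statistic and test it two-sided."""
    z = rng.gauss(true_effect, 1.0)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z))) < alpha


def replication_rate(true_effect, trials=20000, seed=2):
    """Among first-study 'hits', the fraction that are hits again
    when the same variable is measured on fresh data."""
    rng = random.Random(seed)
    hits = repeats = 0
    for _ in range(trials):
        if is_significant(true_effect, rng):
            hits += 1
            if is_significant(true_effect, rng):
                repeats += 1
    return repeats / hits if hits else 0.0


print("chance finding replicates:", replication_rate(0.0))
print("real association replicates:", replication_rate(3.0))
```

On these assumptions, a finding thrown up by chance in a data trawl survives a second look only about one time in twenty, whereas a real association of that assumed size turns up again most of the time. No re-analysis of the first data set can separate the two.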

*

You seem to have this quite backwards.

If I were required to check whether the result of my data trawling is true, I would find nearly always that it is wrong. And here goes my dissertation.

Someone who would do such an error will never succeed at a scientific career.

@b - I see you are one step ahead of me...

Mind you, there are those who doubt the whole rationale of significance tests anyway, and not just because the use of confidence intervals will often be a superior alternative. When I was young and took a shufti at Bayesian statistics I found the argument quite attractive but for one thing: where on earth was I to get a "prior" from? Perhaps the field has advanced since then.

@d - When I was an epidemiologist people used to talk about confidence intervals being superior, but in practice they always used them as significance tests (non-overlapping CIs being regarded as equivalent to p less than 0.05).

In other words, they wanted the stats to tell them what was *true* - which they never can.

I agree wrt Bayesian stats - if you already know the priors, you already know the answer - I think it is essentially a nonsensical attempt to replace science by pseudo-numbers purporting to represent science.

Frequentist type traditional stats are fine if used, and not used, in the way they were up to about the mid-1960s - i.e. by real scientists, who regard stats as mostly a convenient summary display of the results, and the main thing being the honest exercise of informed judgment to interpret them.