Multiple Comparisons Problem and Baseball Statistics

One of the most fundamental statistical concepts is the idea of statistical significance. Statisticians learn how to conduct hypothesis tests to determine whether an observed effect is real or likely just random noise. However, when many tests are conducted simultaneously, we may inadvertently run into the multiple comparisons problem. A great example of this is looking at player improvement in baseball.

When unaccounted for, the end effect of the multiple comparisons problem is claiming statistical significance too often. In baseball, this could manifest as falsely identifying more players as “having improved” than actually did in a given year.

There are a number of ways to address the multiple comparisons problem (sometimes also called the multiple testing problem). In this article we’ll explain where the problem comes from, the various attempts at fixing it, and how it applies to baseball hitters.


What is Hypothesis Testing?

For a full introduction, see our previous article about hypothesis testing in sports.

Let’s use baseball to explain the problem. Hitters sometimes go on hot streaks or cold streaks, but these usually balance out, so a hitter’s overall quality can be measured by their batting average.

But hitters don’t always stay the same quality their entire career. Young players tend to get better, older players tend to get worse, and sometimes players just “figure something out” midway through their career. How can we tell the difference between a player getting better and just being on a hot streak? Similarly, how can we differentiate a cold streak from genuinely getting worse?

The answer is hypothesis tests! Hypothesis testing is all about determining the statistical significance of a change in a hitter’s performance. Two factors combine to determine whether or not something is statistically significant:

  1. The length of time the improvement has gone on for (the sample size)
  2. The size of the improvement

If a player goes from a .250 to a .290 hitter over the course of a whole season we’re probably fairly comfortable saying they got better. However, if they went from a .250 to a .290 hitter from April to May, that might not be as convincing.

Consider the same .250 hitter, but suppose that the next year they improve only to the tune of a .255 average. Do we think they actually got demonstrably better? The 5-point improvement is much less impressive than a 40-point improvement. The larger the difference, the more confident we can be that someone actually got better.

Hypothesis testing is a way to combine the sample size with the size of the improvement to make a judgment on whether a player’s improvement (a) is real or (b) is just statistical noise.
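As a concrete sketch of how those two factors combine, here is a simple two-proportion z-test in Python. This is only an illustration: the batting_change_test name and the at-bat counts below are our own, not the exact test used in the analysis later in this article.

```python
import math
from scipy.stats import norm

def batting_change_test(hits1, ab1, hits2, ab2):
    """Two-sided two-proportion z-test for a change in batting average.

    Returns the p-value: the chance of seeing a difference this large
    if the hitter's true talent never changed.
    """
    p1, p2 = hits1 / ab1, hits2 / ab2
    pooled = (hits1 + hits2) / (ab1 + ab2)   # average under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / ab1 + 1 / ab2))
    z = (p2 - p1) / se
    return 2 * norm.sf(abs(z))

# The same .250 -> .290 jump over a full season vs. a single month:
print(batting_change_test(125, 500, 145, 500))  # ~0.15: more convincing
print(batting_change_test(25, 100, 29, 100))    # ~0.52: easily noise
```

Notice that the identical 40-point jump produces a much smaller p-value over the larger sample, which is exactly the sample-size effect described above.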

What is the Multiple Comparisons Problem?

In general there is no way to perfectly tell the difference between noise and actual improvement. However, as the sample size grows and the size of the change grows, the change becomes less likely to be the product of noise. That is, it is harder for random noise to consistently benefit a player over a long period of time.

In designing hypothesis tests, we typically pick a value (often called α) that sets the bar for this judgment. This value α is the Type I error rate and can be thought of as the “false positive” rate. In baseball terms, it is the probability of saying a player has improved when really the change was due to statistical noise. Typical values for α are 1-5%.

When looking at one player, a 5% false alarm rate is pretty small. We can be pretty confident that if the stats say a player got better, they actually did! However, when looking at the entire league, a 5% false positive rate might end up falsely concluding that 20+ players have gotten better when they haven’t.

This is the multiple comparisons problem. When running many, many hypothesis tests, the number of false positives will increase. This makes it difficult to answer questions like “how many players get better from year to year” or “did pitchers get better relative to hitters from year to year”.
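To see this numerically, here is a small simulation, reusing the batting_change_test function from the sketch above. Every simulated player has the same unchanging .250 talent, yet a league-sized batch of tests still flags a couple dozen of them. The player count and at-bat numbers are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_players, at_bats, talent, alpha = 417, 400, 0.250, 0.05

# Two seasons of hits for players whose true talent never changes
season_1 = rng.binomial(at_bats, talent, n_players)
season_2 = rng.binomial(at_bats, talent, n_players)

false_alarms = sum(
    batting_change_test(h1, at_bats, h2, at_bats) < alpha
    for h1, h2 in zip(season_1, season_2)
)
print(false_alarms)  # hovers around 0.05 * 417, i.e. ~21 players
```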

Let’s take a look at some data from some recent baseball seasons.

Multiple Testing Problem in Baseball

We gathered data from the 2017, 2018, and 2019 MLB seasons to highlight why the multiple comparisons problem is important. By virtue of the sheer number of professional baseball players, if we try to statistically determine who got better or worse, we’ll inevitably set off some false alarms.

To show this, we compared each player’s 2017 and 2018 stats to see whose performance changed enough to be called statistically significant at a false alarm rate of 5%. Then we used the 2019 data to estimate how many of these changes were likely false alarms.

We pruned the data to hitters who had at least 15 at bats in each of the 2017, 2018, and 2019 seasons, leaving 417 players to examine. With a false alarm rate of 5%, we would expect about 0.05 × 417 ≈ 21 players to be flagged as having gotten better or worse from 2017 to 2018 purely by virtue of statistical noise.
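A rough sketch of that screening step is below, assuming a hypothetical data file and column names (“player”, “season”, “hits”, “at_bats”) and reusing batting_change_test from earlier.

```python
import pandas as pd

# Hypothetical file: one row per player per season
df = pd.read_csv("batting_seasons.csv")

# Keep hitters with at least 15 at bats in each of 2017-2019
recent = df[df["season"].isin([2017, 2018, 2019]) & (df["at_bats"] >= 15)]
counts = recent.groupby("player")["season"].nunique()
players = counts[counts == 3].index

# Pivot to one row per player, then flag significant 2017 -> 2018
# changes with the two-proportion z-test
wide = recent[recent["player"].isin(players)].pivot(
    index="player", columns="season", values=["hits", "at_bats"]
)
flagged = [
    p for p in wide.index
    if batting_change_test(
        wide.loc[p, ("hits", 2017)], wide.loc[p, ("at_bats", 2017)],
        wide.loc[p, ("hits", 2018)], wide.loc[p, ("at_bats", 2018)],
    ) < 0.05
]
print(len(flagged))  # the real data produced 29 such hitters
```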

The plot below shows the 2017 and 2018 batting averages for those players who got significantly better or worse from one year to the next.

[Plot: 2017 vs. 2018 batting averages for hitters flagged as significantly better or worse]

Notice the blank space along the line y=x above. Remember that we only plotted those players who got significantly better or worse from 2017 to 2018; the line y=x corresponds to those hitters who stayed the same.

There were 29 hitters who qualified as having gotten significantly better or worse. We actually expect a lot of these to be false alarms!

To identify false alarms, we looked at whether a player’s 2019 average was closer to their 2018 average or their 2017 average. If it was closer to their 2017 average, we declared the 2018 season a fluke. Otherwise, we concluded that the player’s performance truly changed.
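That decision rule takes only a few lines; the averages in the usage examples below are made up for illustration.

```python
def fluke_or_real(avg_2017, avg_2018, avg_2019):
    """Label a flagged 2017 -> 2018 change: if the 2019 average sits
    closer to 2017 than to 2018, call the 2018 season a fluke."""
    if abs(avg_2019 - avg_2017) < abs(avg_2019 - avg_2018):
        return "fluke"
    return "real change"

print(fluke_or_real(0.250, 0.295, 0.252))  # fluke: reverted toward 2017
print(fluke_or_real(0.250, 0.295, 0.301))  # real: the improvement held up
```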

Of the 29 hitters whose performance changed significantly from 2017 to 2018, only 18 are likely to have truly changed. The other 11 reverted closer to their 2017 performance the following season.

That means that nearly 40% (!!) of the players who look like they got better actually just had a statistically good season. If you ever wonder why teams will not pay players for showing one good season in a contract year, look no further.

Solutions to the Multiple Comparisons Problem

There are some ways to try to fix the multiple comparisons problem. No fix will be perfect, however. The real world is just random enough that false positives will always be a problem.

One such solution to the multiple comparisons problem is the so-called Bonferroni correction. The multiple testing problem arises because a 5% false alarm rate applied to hundreds of tests results in many, many false alarms. The Bonferroni correction attempts to solve this by dividing the 5% threshold by the number of tests, so that the chance of even one false alarm across the whole batch stays near the desired rate.
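In code, the correction itself is one line. Applied to our 417 hitters, each individual test would have to clear a much stricter bar:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold keeping the chance of even one false alarm
    across all n_tests at (roughly) alpha."""
    return alpha / n_tests

print(bonferroni_threshold(0.05, 417))  # ~0.00012 instead of 0.05
```

The trade-off is power: a bar that strict will also miss many players who genuinely did get better or worse.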

A more interesting example is “Holm’s step-down procedure”. Rather than holding every test to the same strict Bonferroni bar, Holm’s method sorts the results from most significant to least significant and steps through them, starting at the strictest threshold and relaxing it one step at a time.

For example, with 30 tests at an overall 5% level, the most significant result must clear 5%/30, the next 5%/29, and so on, stopping at the first result that fails. This recovers some of the detections that the plain Bonferroni correction would throw away while still keeping the overall false alarm rate in check.
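Here is a minimal sketch of Holm’s procedure (our own implementation, not taken from a statistics library):

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Sort the p-values from smallest to largest, test the k-th smallest
    against alpha / (m - k), and stop at the first one that fails.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = set()
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            rejected.add(i)   # clears the (progressively looser) bar
        else:
            break             # everything less significant fails too
    return rejected

print(holm_rejections([0.001, 0.013, 0.04, 0.20]))  # -> {0, 1}
```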

No matter the choice, something must be done to counteract the multiple testing problem if the statistics are to remain valuable.

Comments

Before concluding, we wanted to say more about the baseball example above and multiple testing in sports. In the baseball example, we identified fewer likely false alarms (11) than predicted (21).

There could be a few reasons for this. Most notably, pitching got better relative to hitting from 2017 to 2018: the league-wide batting average dropped from .255 to .248 that year. If pitchers truly improved relative to batters, then our hypothesis tests are flawed, since they treat each hitter’s environment as unchanged.

If pitchers got better, then a hitter who hit .260 in both 2017 and 2018 may actually have improved in 2018! If you look back at the plot above, many more players got worse from 2017 to 2018 than got better. This shifting of the league average makes the calculations we were trying to do even harder.

The multiple testing problem shows up throughout sports. In addition to comparing baseball players’ averages over the years, the multiple comparisons problem can show up in the following settings:

  • Comparing basketball players’ shooting percentages from one year to the next to see if they improved meaningfully
    • This can be complicated by better shooters drawing better defensive players/more attention, see our article on usage rate.
  • Comparing golfers’ performances course-by-course from one year to the next to identify up-and-comers
  • Comparing hockey goalies’ save percentages year over year
  • And many more

In sports, statistics don’t always tell the truth. Random variation is enough to confuse the issue of who is getting better or worse. Knowing how to deal with the multiple comparisons problem is an important tool.