How to Use the Autocorrelation Function in Baseball
One of the most under-taught tools in mathematics is the autocorrelation function. Autocorrelation functions are extremely useful whenever data are related to time or are otherwise ordered.
The autocorrelation function reveals how distinct events are related in time. Are close-by events correlated with each other? Are they independent of each other? Is there some other complicated, time-based relationship between different events? No matter what, the autocorrelation function of a time series can help discern what governs the temporal behavior of data.
In this article we apply the autocorrelation function and its related statistical properties to study at-bats in baseball. I have often posited that at-bats in baseball can be modeled with binomial and multinomial distributions. However, this requires the outcomes of consecutive at-bats to be independent: how likely a player is to get a hit shouldn’t depend on what happened in his last at-bat. To put that another way, we want to use autocorrelation functions to study whether or not hot streaks actually exist in baseball.
If you poll different analysts, you’ll probably get different guesses and instincts as to whether or not baseball at-bats are independent of one another. More traditional sports personalities might argue that of course hot and cold streaks exist in at-bats. I am, a priori, not so convinced.
In this article, we’ll use the autocorrelation function to study independence and correlation between consecutive at-bats in baseball.
Asking the Right Question
Being a mathematician is more about asking the right question than it is about knowing everything there is to know about math. Often I take this to the extreme and claim that mathematicians are best described as generalists whose main skill and training is to ask precise, important, and compelling questions. An important aspect of this is to explain why we’re asking a question and why it’s interesting.
Before turning to our specific application – temporal correlation between hits in baseball – let’s start with a simpler example.
If you’re at a casino and see that red has come up 8 times in a row on a roulette table, do you immediately run over and bet everything on red? Certainly red must be hot, right? Or do you bet on black because it’s due?
Of course, neither of those intuitions is correct. The outcomes of previous spins in roulette have absolutely no bearing on the likelihood of future ones. In roulette, consecutive events are independent.
In sports, though, things get more complicated because of psychological factors. If a player with 8 straight hits comes to the plate with runners on 1st and 2nd and your team up 2 runs, do you intentionally walk him?
If at-bats are independent, then the answer doesn’t depend at all on the fact that the player is on a hitting streak. It’s identical to the roulette example. You can make the decision for yourself using, for example, RE24. However, we don’t know if at-bats are independent! That’s the point of all this.
A Soft Introduction to the Autocorrelation Function in Baseball
The autocorrelation function of a sequence of data tells you many things about it. Most importantly, it tells us how samples that are close together in time are related to each other.
Consider the following two sequences of 0s and 1s. In our baseball example, a 1 could represent a hit and a 0 an out. In each case there are an equal number of 0s and 1s (so the hitters have the same batting average), but their “time signature” is different.
- Series A: 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
- Series B: 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0
In Series A, nearby samples are closely related to each other. We stand a much better chance of predicting the outcome of an at-bat for the player corresponding to Series A. If that player just got a hit, watch out. They’re likely to get another.
The Series B player, though, seems much more random. Even though both these players hit .500 in the example above, the player in Series A is much easier to predict. The autocorrelation function is a way to take these informal observations and formalize them.
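These informal observations can be made concrete in a few lines of code. The following is a minimal sketch (using NumPy, with the exact series values listed above) that computes the sample autocorrelation of each series:

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation: mean-subtracted and normalized so the
    value at lag 0 is 1, as ACF plots typically report."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(d * d)
    return np.array([np.sum(d[:len(d) - k] * d[k:]) / denom
                     for k in range(nlags + 1)])

series_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
series_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]

acf_a = sample_acf(series_a, 5)   # lag-1 value is large: the streaky hitter
acf_b = sample_acf(series_b, 5)   # lag-1 value is near zero: looks random
```

For these particular series, the lag-1 autocorrelation of Series A works out to about 0.81, while Series B’s is close to 0, matching the intuition above.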
The next section can be entirely skipped by those who just want to hear about baseball.
A Mathematical Treatment of Autocorrelation Functions
The autocorrelation function is a function of a time offset \tau, often called the delay or lag. The autocorrelation function R(\tau) at delay \tau tells us the average value of a data set multiplied by a version of itself delayed by \tau units. Explicitly, if x(t) is a time-indexed data set, then the autocorrelation function is given by R(\tau) = E[x(t)x(t+\tau)], where E means expected value.
Depending on the context, the expected value E can mean different things and be interpreted in different ways. For example, if x(t) is a signal (think radio waves!), then the autocorrelation function of x(t) is given by R(\tau) =\int x(t) x(t+\tau) dt (I am omitting the integration bounds for simplicity).
If, instead, t is a discrete time-index, then the autocorrelation function is given by R(\tau) = \sum x(t)x(t+\tau) (again, omitting the summation bounds for simplicity).
In the language commonly used in the signal processing world, the autocorrelation function is the convolution of a function with a time-reversed version of itself. That is, R(\tau) = x(t) * x(-t). For more reading about how to analyze autocorrelation functions in this context, see for example the Wikipedia article on the Wiener–Khinchin theorem.
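The discrete sum and the convolution view agree, which is easy to check numerically. A sketch (the 8-sample sequence here is arbitrary, chosen purely for illustration):

```python
import numpy as np

x = np.array([1., 0., 0., 1., 0., 1., 1., 0.])  # arbitrary illustrative sequence
n = len(x)

def raw_acf(x, tau):
    """Discrete autocorrelation sum R(tau) = sum_t x(t) x(t + tau)."""
    return float(np.sum(x[:len(x) - tau] * x[tau:]))

# The convolution/correlation view gives the same values: np.correlate in
# 'full' mode returns all lags from -(n-1) to n-1, with lag 0 at index n-1.
full = np.correlate(x, x, mode="full")
```

Here `full[n - 1]` equals R(0), `full[n]` equals R(1), and so on; by symmetry the negative-lag half mirrors the positive-lag half.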
Using Autocorrelation Functions for Bernoulli Sequences
While we could continue down the mathematical rabbit hole, let’s move back to the scenario we care about and look at some examples. Recall Series A and Series B defined above. Let’s look at the autocorrelation functions (interpreted in the convolutional sense) of these two data series, starting with Series A.
Sequences with strong correlation tend to have this type of signature: large for small lags, decreasing slowly as the lag increases. Intuitively, this means that for small time differences, the outcome of an event is predictive of future events. Mathematically, it means that the sequence and its time-advanced version are highly correlated and have a large inner product. The blue bars indicate the threshold for statistical significance. Let’s look at Series B.
Notice here that at 0 lag the autocorrelation is still 1. However, as the lag increases, the autocorrelation function is very small, staying between the blue bars, and its values appear random. Because the ACF never passes outside the blue bars, whatever correlation is measured between the series and shifted versions of itself is not statistically significant.
Here are more examples. Let’s create three sequences of 0s and 1s randomly. In the first, consecutive elements have a 95% chance of staying the same. In the second, this probability decreases to 75%. In the third, the probability is 50% – equivalent to a totally random sequence of 0s and 1s with no temporal correlation. Each of these sequences will have on average the same number of 0s and 1s but have dramatically different time-dependencies between consecutive samples. The autocorrelation function of these sequences reveals these differences.
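One simple way to generate such sequences is a two-state Markov chain with a “stay” probability; this is an assumed construction for illustration, though any mechanism with the stated stay probabilities behaves similarly. The lag-1 autocorrelation of such a chain is 2p − 1, so the three cases land near 0.9, 0.5, and 0:

```python
import numpy as np

def sticky_sequence(n, p_stay, rng):
    """0/1 sequence in which each element repeats the previous one
    with probability p_stay (a two-state Markov chain)."""
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    for t in range(1, n):
        x[t] = x[t - 1] if rng.random() < p_stay else 1 - x[t - 1]
    return x

def lag1_autocorr(x):
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

rng = np.random.default_rng(0)
r95 = lag1_autocorr(sticky_sequence(2000, 0.95, rng))  # near 2(0.95) - 1 = 0.9
r75 = lag1_autocorr(sticky_sequence(2000, 0.75, rng))  # near 2(0.75) - 1 = 0.5
r50 = lag1_autocorr(sticky_sequence(2000, 0.50, rng))  # near 0
```

The stickier the chain, the larger the lag-1 autocorrelation, which is exactly the ordering the three ACF plots display.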
Notice that the signal with the strongest temporal dependence, the 95% sequence, has the “widest” autocorrelation function. For lags up to about 20, the autocorrelation function remains statistically significant. That 20 is no coincidence: with a 95% chance of staying the same, the average run length is 1/(1 − 0.95) = 20 samples.
The middle version, with a 75% chance of staying the same, still shows statistically significant correlation at small lags. However, for delays larger than about 4, the autocorrelation function drops below the statistical significance line. Again, this is no coincidence: the average run length is 1/(1 − 0.75) = 4 samples.
The last figure is the autocorrelation function for a completely random sequence of 1s and 0s. Notice that only once does the autocorrelation function pass outside the statistical significance lines. This is to be expected: with a 95% significance band, roughly 1 in 20 lags will look significant by pure chance, a consequence of the multiple comparisons problem.
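That roughly-1-in-20 false-positive rate can itself be checked by simulation. A sketch that repeatedly generates pure-noise 0/1 sequences and counts how often a lag’s sample autocorrelation falls outside the approximate 95% band:

```python
import numpy as np

rng = np.random.default_rng(1)
n, max_lag, trials = 500, 20, 200
band = 1.96 / np.sqrt(n)   # approximate white-noise significance band

false_hits = 0
for _ in range(trials):
    x = rng.integers(2, size=n).astype(float)   # fair-coin 0/1 noise
    d = x - x.mean()
    denom = np.sum(d * d)
    for k in range(1, max_lag + 1):
        r = np.sum(d[:n - k] * d[k:]) / denom
        if abs(r) > band:
            false_hits += 1

rate = false_hits / (trials * max_lag)   # hovers around 0.05
```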
Using the autocorrelation function for a sequence of hits and looking at the statistical significance of the ACF values can tell us whether hitters in baseball actually are streaky.
Using Autocorrelation Functions in Baseball
We’re going to look at a hitter’s at-bats and see if an at-bat’s outcome has any relationship to the outcomes of the previous at-bats. That is, we’re going to look at the autocorrelation function of a hitter’s hit/out sequence.
If streaks exist, then we will see statistically significant autocorrelation values at small lags, and the autocorrelation function will look like the 75% or 95% plots from the last section. If hot streaks DON’T exist, then the autocorrelation function will have a peak at 0 lag and stay close to 0 everywhere else, like the “totally random” plot from above.
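A sketch of this test follows, using a hypothetical independent .300 hitter in place of the real at-bat data (which is not reproduced here):

```python
import numpy as np

def significant_lags(hits, max_lag=20, z=1.96):
    """Lags whose sample autocorrelation falls outside the approximate
    white-noise significance band (the "blue bars" in the plots)."""
    x = np.asarray(hits, dtype=float)
    d = x - x.mean()
    denom = np.sum(d * d)
    band = z / np.sqrt(len(x))
    return [k for k in range(1, max_lag + 1)
            if abs(np.sum(d[:len(d) - k] * d[k:]) / denom) > band]

# Hypothetical independent .300 hitter over 600 at-bats (simulated, not real data)
rng = np.random.default_rng(42)
hits = (rng.random(600) < 0.300).astype(int)
streak_free = significant_lags(hits)   # expect few or no significant lags
```

Running this on an extremely streaky sequence instead (e.g. 300 hits followed by 300 outs) flags lag 1 immediately, so the test does detect streakiness when it is present.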
We pulled the data from Mike Trout’s 2021 season and recorded his sequence of “hit” or “no hit”. Then, we computed this sequence’s autocorrelation function as described in the last few sections. Here is what we found.
This autocorrelation function shows the expected peak at 0 lag. However, even for a shift of 1 at-bat, the autocorrelation is not statistically significant. For Mike Trout in 2021, there was no statistically significant correlation between the outcomes of consecutive at-bats. To put that another way, Mike Trout showed no evidence at all of being streaky.
Let’s try somebody else, Aaron Judge for example. Performing the same analysis, here is what the data shows.
Again, there is no evidence that Aaron Judge’s at-bats that happen close together in time are correlated with each other.
What if we go to the extreme? Javier Baez is often cited as one of professional baseball’s ‘streakiest’ hitters. What happens when we look at Baez’s autocorrelation function?
Still, there is no evidence of correlation between consecutive at-bats, even for one of baseball’s most notoriously streaky hitters.
Takeaways
This study was meant to understand whether there was any correlation between consecutive at-bats in professional baseball. If a player got a hit last time up, is he more dangerous the next time? If a player has gotten out 10 times in a row, is he an easy out?
By and large, the results of this study suggest that this effect is psychological on the viewer’s part. There is no statistical basis to conclude that there is temporal correlation between at-bats in baseball. In fact, it is probably sufficient to model the outcome of an at-bat as independent of the outcomes of recent at-bats.
So, next time you hear somebody talk about a specific hitter being streaky, you can confidently look them in the eyes and say “you’re making things up”.