Modeling Batters Faced Per Inning with 9 Mixed Negative Binomial Distributions
In studying the statistics of baseball, one of the most fundamental objects we could model is the number of batters faced per inning. From predicting the number of innings a pitcher will last to predicting the winner of an individual pitching matchup, this is an important concept. A more accurate way of modeling the number of batters faced per inning can lead to better models for many baseball events. However, this is one of those modeling problems that requires a bit of creativity to get it right.
In this article, we’ll look at a few different statistical models to estimate the number of batters faced per inning. Then, we’ll compare each model’s output to observed data from the 2021 season.
Batters Faced Per Inning in 2021
The first step in any modeling problem is getting raw data to compare your model’s predictions against. Luckily for us, data from the MLB is readily available all over the internet. Typically we build scrapers with Python’s BeautifulSoup package to grab data from the HTML and JavaScript on a website. For baseball, though, Retrosheet has condensed, compiled, accurate, and easy-to-parse data for a significant portion of the games in baseball history. With this, computing the average number of batters faced per inning in the 2021 season is a walk in the park.
This chart says that roughly 33% of all innings were 3 up, 3 down. In a further (roughly) 27% of innings, 4 batters made it to the plate. After this point, the probability of more and more batters coming to the plate decays smoothly. We also compared the National League and the American League and did not notice a significant difference.
Why Find a Model / Why Not Use Empirical Probabilities?
Our goal in developing a model for the number of batters in an inning is to use it to eventually predict winners in a game. While the raw data over the course of the season describes global, meta properties of the batters faced per inning distribution, it is actually not the most helpful in modeling the outcome of any particular game. Here’s an example.
If the best pitcher starts against the worst hitting team, we would expect the odds of a 3-batter inning to increase relative to the season-long averages. We would also expect the odds of really long innings to decrease when the pitcher is much better than the opposing team’s hitters. The goal of this analysis is to try to uncover the mechanism of action which governs the batters faced distribution so that it can be adapted to predict the outcomes of individual games with knowledge of the quality of pitchers and batters. This model is meant to be simpler and more tractable than our previous attempt at predicting baseball winners.
The Negative Binomial Distribution
The most natural solution to modeling the batters faced per inning distribution is to use the well-known negative binomial distribution. In any intro stats class in either high school or college, the negative binomial is one of the big four discrete probability distributions you learn, along with the binomial, the hypergeometric, and the Poisson.
The negative binomial distribution deals with the question ‘how many times do we need to repeat a process until a set number of successes or failures occurs?’ For example, if we want to roll a die until we roll the third six, the negative binomial distribution will tell us the probability that we would stop on the third roll, the fourth roll, or the tenth roll.
If you’re paying close attention it is fairly straightforward to see how this might apply to baseball when modeling batters faced per inning. To translate into statistical parlance, a baseball inning is nothing more than repeating a process (at-bats) until a set number (3) of failures (outs) occurs. So the simplest model one might try when predicting batters faced per inning is to use the negative binomial. Doing so results in the following chart where the red dots show modeled probabilities and the black dots the actual data.
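To make the translation concrete, here is a minimal sketch of the negative binomial probabilities for batters faced, assuming every batter has the same probability of making an out. The function name `batters_faced_pmf` and the value `p_out = 0.68` are illustrative assumptions, not the fitted values behind the chart.

```python
from math import comb


def batters_faced_pmf(n, p_out, outs_needed=3):
    """P(exactly n batters come to the plate) under a negative binomial model.

    Assumes each batter independently makes an out with probability p_out.
    """
    if n < outs_needed:
        return 0.0
    # The last batter must make the final out; the other outs_needed - 1 outs
    # can fall anywhere among the first n - 1 batters.
    ways = comb(n - 1, outs_needed - 1)
    return ways * p_out**outs_needed * (1 - p_out) ** (n - outs_needed)


p_out = 0.68  # hypothetical out probability, chosen only for illustration
for n in range(3, 10):
    print(n, round(batters_faced_pmf(n, p_out), 4))
```

The same function covers the dice example: `batters_faced_pmf(n, 1 / 6)` gives the probability that the third six arrives on roll `n`.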
While the general shape of the curve is ok, I claim that there is enough disagreement in the above chart to encourage further study. Most notably, the negative binomial model underestimates the probability of “3 up, 3 down” innings and overestimates the probability of innings of length 4. This model does pretty well, but I am hoping that we can do better.
The Problem with the Negative Binomial in Baseball
One of the key assumptions in the negative binomial distribution is that the probability of failure and success is the same for each event. In the dice example, this was clearly satisfied because the probability of rolling a 6 is identical from roll to roll. However, in baseball this is not the case. The probability of any individual hitter recording an out is specific to that hitter’s abilities. The probability of an out changes in each at bat.
That is, the negative binomial distribution is not a good model for batters faced per inning because the probability of an out changes from hitter to hitter and inning to inning. If we want a better model, we need a way to incorporate this. Unfortunately, to my knowledge there are no well-studied distributions that allow for repeated events with probabilities changing throughout the course of the experiment. This is why in our previous iteration of modeling baseball innings we suggested using Monte Carlo simulation. To avoid that, we’re going to have to be creative.
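For reference, the Monte Carlo approach mentioned above might look something like the following sketch. It is not the code from the earlier article. It assumes each batter either makes exactly one out or reaches base (so no double plays), that every inning starts at the top of the lineup, and that the out probabilities in `lineup` are made-up numbers.

```python
import random
from collections import Counter


def simulate_inning(out_probs, rng, outs_needed=3):
    """Return the number of batters in one simulated inning.

    out_probs holds one out probability per lineup spot; the lineup wraps
    around if more batters come up than there are spots.
    """
    outs = 0
    batters = 0
    while outs < outs_needed:
        p_out = out_probs[batters % len(out_probs)]
        batters += 1
        if rng.random() < p_out:
            outs += 1
    return batters


def simulate_distribution(out_probs, n_innings=100_000, seed=0):
    """Estimate P(batters faced = n) by simulating many innings."""
    rng = random.Random(seed)
    counts = Counter(simulate_inning(out_probs, rng) for _ in range(n_innings))
    return {n: counts[n] / n_innings for n in sorted(counts)}


# Hypothetical out probabilities for lineup spots 1 through 9.
lineup = [0.64, 0.66, 0.65, 0.67, 0.69, 0.70, 0.71, 0.72, 0.75]
dist = simulate_distribution(lineup)
for n, prob in dist.items():
    print(n, round(prob, 4))
```

This handles a different out probability for every batter, but it only produces estimates, and it has to be rerun for every new lineup.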
The Beta Negative Binomial Distribution
The problem with modeling batters faced per inning with a negative binomial is that the probability changes from at-bat to at-bat. The closest statistical model that I know of is one which allows for changing probability from inning to inning. While not exactly what we’re looking for, it is certainly closer. The beta negative binomial distribution is nothing more than a regular negative binomial distribution but where the probability is chosen randomly before the beginning of the inning from a beta distribution.
A beta distribution is a natural choice in many statistical applications to model random probabilities because the support of the beta distribution is on the interval [0,1]. The beta distribution requires the choice of two shape parameters, alpha and beta. Choosing these parameters in the correct way can ensure that the probabilities of recording an out, though now random, accurately model reality.
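As a rough sketch of what this looks like in practice, the probability of exactly n batters is the negative binomial probability averaged over the beta-distributed out probability, which works out to a ratio of beta functions. The function names and the shape parameters `alpha = 68`, `beta = 32` below are assumptions for illustration, not the parameters used for the chart.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_nb_pmf(n, alpha, beta, outs_needed=3):
    """P(exactly n batters) when the out probability is drawn from Beta(alpha, beta).

    Averaging p**r * (1 - p)**(n - r) over the beta distribution gives
    B(alpha + r, beta + n - r) / B(alpha, beta), with r = outs_needed.
    """
    if n < outs_needed:
        return 0.0
    log_ratio = log_beta(alpha + outs_needed, beta + n - outs_needed) - log_beta(alpha, beta)
    return comb(n - 1, outs_needed - 1) * exp(log_ratio)


alpha, beta = 68.0, 32.0  # hypothetical shape parameters; mean out probability 0.68
for n in range(3, 10):
    print(n, round(beta_nb_pmf(n, alpha, beta), 4))
```

Holding the mean alpha / (alpha + beta) fixed while shrinking both parameters spreads the out probability more widely from inning to inning; making both very large pulls the model back toward the plain negative binomial.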
That was a lot of math. Let’s see how the model does when we plug in the actual numbers. The chart below shows the real data in black, the same negative binomial model from above in red, and the new beta negative binomial model in blue.
The beta negative binomial distribution is very good at accurately modeling the probability of a specific number of batters coming to the plate in any given inning. It is better than the vanilla negative binomial model because it takes into account the variability in the expected on-base performance based on matchups and pitchers.
Why Not the Beta Negative Binomial?
In the last section we hypothesized that the number of batters faced per inning deviated from our model because the quality of pitchers and batters varies from inning to inning. While the beta negative binomial distribution explained the observed data quite well, it won’t be terribly helpful in our model going forward.
The beta negative binomial distribution works by assuming that the probability of recording an out varies from inning to inning and that this probability can be modeled via a random selection before the inning. However, in real life we can use information about who is pitching and who is hitting to make informed choices about how the probability of an out should be modeled.
The most complicated model we could do is to use the exact lineup and pitcher statistics to simulate every at bat in the coming inning to estimate the relevant probabilities. I claim that is too complicated. Instead, I think just using the pitcher and the inning number will give us good enough results.
If you are interested in why using the inning is valuable, the following chart shows how the on-base percentage varies throughout the game.
The chart puts an asterisk after OBP because we aren’t exactly measuring OBP here. What we’re measuring is a modified version of on-base percentage. This modified version is defined to be ‘getting on base but not getting into a double play or caught stealing’. It isn’t a very valuable statistic to evaluate player quality, but it is valuable in estimating the inning length.
Modeling OBP in this way shows that the probability of recording an out can vary by nearly 10% at different points in the game.
Batters Faced Per Inning with a Mixed Negative Binomial
Our total model for batters faced per inning is a mixture of the individual negative binomial models for each inning. That is, if we compute all 9 innings’ curves like we did in the previous chart and average the curves out, we should get a better prediction of the raw data across all innings.
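A minimal sketch of that averaging step is below. It assumes one out probability per inning and, by default, equal weight for each of the 9 innings. The values in `inning_out_probs` are placeholders, not the modified OBP figures from the chart.

```python
from math import comb


def nb_pmf(n, p_out, outs_needed=3):
    """Negative binomial probability of exactly n batters in an inning."""
    if n < outs_needed:
        return 0.0
    ways = comb(n - 1, outs_needed - 1)
    return ways * p_out**outs_needed * (1 - p_out) ** (n - outs_needed)


def mixture_pmf(n, out_probs_by_inning, weights=None):
    """Weighted average of the per-inning negative binomial curves at n."""
    if weights is None:
        # Equal weights: every inning contributes the same share.
        weights = [1 / len(out_probs_by_inning)] * len(out_probs_by_inning)
    return sum(w * nb_pmf(n, p) for w, p in zip(weights, out_probs_by_inning))


# Hypothetical out probabilities for innings 1 through 9.
inning_out_probs = [0.66, 0.69, 0.68, 0.67, 0.67, 0.67, 0.69, 0.69, 0.70]
for n in range(3, 10):
    print(n, round(mixture_pmf(n, inning_out_probs), 4))
```

The optional `weights` argument is there in case the innings should not count equally, for example if some innings appear less often in the data than others.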
In an upcoming analysis, we’ll look at how to incorporate a pitcher’s quality to remove the need for the beta distribution in our model. That is, we’ll use the inning number and the pitcher who started the inning to predict the probability of an individual batter recording an out. Then, we’ll feed this probability into a negative binomial model to predict the number of batters in the inning. Finally, we’ll use this data to predict the runs scored in an inning as accurately as possible.