Modeling Runs Per Inning and MLB Runs per Game the Right Way
In our last article, we looked at how best to model the number of batters faced per inning in a typical baseball game. In this article we’re going to push this a step further and look at modeling the runs per inning distribution based on what we’ve learned from last time. This is the second step along the way in building our baseball model to predict winners and probabilities in every game.
The goal is to model the number of runs scored in a given inning in a way that can be adapted to whoever is pitching and whoever is hitting. That is, we want a runs per inning model that can be adjusted given the expected on-base percentage for a particular pitcher and hitting team. At the end of the article, we’ll have developed the model that leads into development of the GIF below.
Let’s start by looking at two very simple models many people might be tempted to use and show why fitting models often requires a bit more finesse. That is, we’re going to show why not to use these models and why you can’t in general pick a model out of a hat.
Runs Per Inning and Runs Per Game Naively
Two of the most basic discrete probability distributions are the Poisson distribution and the geometric distribution. For many, these two distributions are the go-to choices for modeling discrete random variables that can take the values {0,1,2,…}. Because the number of runs in an inning is greater than or equal to 0 and must be an integer, these seem like a natural choice.
However, just because a distribution might work doesn’t mean it is the correct way to go. The Poisson distribution is used to model the number of occurrences of some discrete event over an interval of time. For example, it is widely used to model the number of customers expected to arrive at a store in an hour.
On the other hand, the geometric distribution is used to model the number of times an experiment has to be repeated to obtain a positive result. For example, you could use a geometric distribution to estimate how many times you would need to roll a die before a six comes up.
Neither of these stories describes how runs are scored in an inning of baseball. To show the mismatch, we plotted both distributions on top of the raw data. The following curves are fit by maximum likelihood estimation.
The raw data is in black, the Poisson model in red, and the geometric model in blue. We notice that neither of these models does a particularly good job of predicting the probabilities of zero or one runs in an inning. Both models underestimate the probability of scoreless innings. Both models overestimate the probability of one run innings.
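For readers who want to reproduce this kind of fit, here is a minimal sketch of the two maximum likelihood estimates. It assumes the geometric distribution is used in its "failures before the first success" form, so that it lives on {0,1,2,…} like run totals do. The `runs_by_inning` list is made-up placeholder data, not the MLB data behind the chart, and the function names are our own illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def geometric_pmf(k, p):
    """P(X = k) for the 'failures before first success' form, k = 0, 1, 2, ..."""
    return p * (1.0 - p) ** k

def fit_poisson(samples):
    """Maximum likelihood estimate of lam: the sample mean."""
    return sum(samples) / len(samples)

def fit_geometric(samples):
    """Maximum likelihood estimate of p on {0, 1, 2, ...}: 1 / (1 + sample mean)."""
    return 1.0 / (1.0 + sum(samples) / len(samples))

# Placeholder data only: made-up runs in 20 innings, NOT real MLB results.
runs_by_inning = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 2]

lam_hat = fit_poisson(runs_by_inning)
p_hat = fit_geometric(runs_by_inning)

# Compare the observed frequency of each run total with both fitted models.
for k in range(4):
    observed = runs_by_inning.count(k) / len(runs_by_inning)
    print(k, round(observed, 3),
          round(poisson_pmf(k, lam_hat), 3),
          round(geometric_pmf(k, p_hat), 3))
```

Both estimates depend on the data only through the sample mean, which is part of why neither model has the flexibility to match the zero-run and one-run probabilities at the same time.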
These models don’t accurately describe the scoring distribution in one inning, so we shouldn’t expect them to generalize to predicting the number of runs per game well, either. We can use these “per-inning” models to obtain “per-game” models by adding nine copies of them together to simulate the sum of runs in nine innings.
The sum of nine independent geometric random variables (each with parameter p) follows a negative binomial distribution (with parameters 9 and p). The sum of nine independent Poisson random variables (each with parameter λ) is itself Poisson (with parameter 9λ). Modeling the runs per game distribution in these ways results in the following chart.
Overall, the shapes of the red (Poisson) and blue (negative binomial) distributions are quite a bit different from the black (raw data). In particular, these models dramatically underestimate the probability of very low scoring games. Two or fewer runs were scored nearly 31% of the time last year. The Poisson model puts this probability at 16% and the negative binomial model at 22%, both far from reality.
OK. So the Poisson and geometric/negative binomial are not good choices for modeling the runs per game distribution. How should we proceed?
Runs Per Inning via Batters Faced Per Inning
Because we’ve previously studied how to model batters faced per inning, it is natural to use this to aid our analysis. The proposed workflow is as follows:
- Model batters faced per inning (which we already know how to do)
- Use the number of batters faced to predict the number of runs scored. This is essentially modeling the number of runners left on base.
The reason this will work so well is that the relationship between batters faced and runs scored is quite easy to model. The GIF below shows how this distribution changes as the number of batters faced increases.
The GIF shows that as the number of batters faced increases, the runs scored distribution shifts to the right in a fairly predictable way. Towards the end of the GIF, the probability estimates get noisier because we didn’t have enough data for innings where 10+ batters came to the plate.
This leads to a method of estimating the probability of a certain number of runs being scored in an inning using the law of total probability. First we estimate the probability of facing a given number of batters using our mixed negative binomial model from last time. Then, for each number of batters faced, we compute the probability that a given number of runners are left on base. Every batter in a completed inning is put out, scores, or is left on base, so knowing the number of batters who come to the plate and the number left on base tells us the number who scored: runs = batters faced - 3 - left on base.
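The calculation can be sketched in a few lines. This is an illustration under stated assumptions rather than our production code: it assumes a completed three-out inning, so that runs = batters faced - 3 - left on base, and every probability in `batters_pmf` and `lob_given_batters` is a made-up placeholder standing in for the fitted batters faced model and the left on base model. All names are hypothetical.

```python
from collections import defaultdict

OUTS_PER_INNING = 3

def runs_pmf(batters_pmf, lob_given_batters):
    """Law of total probability:
    P(R = r) = sum over b of P(B = b) * P(L = b - 3 - r | B = b),
    where B is batters faced and L is runners left on base."""
    result = defaultdict(float)
    for batters, p_batters in batters_pmf.items():
        for left_on_base, p_lob in lob_given_batters[batters].items():
            runs = batters - OUTS_PER_INNING - left_on_base
            result[runs] += p_batters * p_lob
    return dict(result)

# Placeholder inputs only (NOT fitted values).
# P(B = b): probability of facing b batters in the inning.
batters_pmf = {3: 0.5, 4: 0.3, 5: 0.2}

# P(L = l | B = b): probability of stranding l runners given b batters.
lob_given_batters = {
    3: {0: 1.0},                     # nobody reached base
    4: {0: 0.1, 1: 0.9},             # the one runner scored or was stranded
    5: {0: 0.05, 1: 0.35, 2: 0.6},
}

inning_runs = runs_pmf(batters_pmf, lob_given_batters)
for runs in sorted(inning_runs):
    print(runs, round(inning_runs[runs], 4))
```

Note that walk-off innings end before the third out, so the identity used here would need adjusting for them.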
Runs Per Inning Model Performance
Using our mixed negative binomial model to estimate the probability of facing a given number of batters and the new runners left on base model, we can estimate the probability of scoring any number of runs in a given inning. The chart below shows the raw data in black and our model outputs in blue.
Our model is accurate enough that the blue dots are almost entirely on top of the black dots in the above image, close enough that we can hardly see that there are two data sets. Plotting the model errors makes things a bit clearer. The red dots in the image below show the difference between the raw probabilities and our model’s predicted probabilities.
The error between our predicted probability and the actual probability is never more than one percentage point. In fact, it is never more than 0.8 percentage points. This model is far more accurate at predicting runs per inning than anything else we’ve considered.
Runs Per Game Model Performance
It seems straightforward to generalize a runs per inning model into a runs per game model, but some care must be taken. The precise mathematical operation needed to go from “per-inning” models to “per-game” models is convolution, in particular 9-fold convolution. Convolution gives the distribution of the sum of two independent random variables. For us, we take the runs scored distribution for one inning (last section) and convolve nine copies of it together to represent 9 innings of baseball being played.
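As a rough sketch of what 9-fold convolution looks like in practice, the code below builds a per-game distribution from a per-inning one. The `inning_pmf` values are made-up placeholders, not our model's output, and the approach assumes the nine innings are independent and share the same distribution.

```python
def convolve(a, b):
    """Distribution of X + Y for independent X ~ a and Y ~ b,
    where a[i] = P(X = i) and b[j] = P(Y = j)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def n_fold_convolution(pmf, n):
    """Distribution of the sum of n independent copies of pmf."""
    result = [1.0]  # point mass at zero: the sum of no innings
    for _ in range(n):
        result = convolve(result, pmf)
    return result

# Placeholder per-inning probabilities of 0, 1, 2, 3, 4 runs (NOT model output).
inning_pmf = [0.73, 0.15, 0.07, 0.03, 0.02]

game_pmf = n_fold_convolution(inning_pmf, 9)

inning_mean = sum(k * p for k, p in enumerate(inning_pmf))
game_mean = sum(k * p for k, p in enumerate(game_pmf))
print("mean runs per inning:", round(inning_mean, 3))
print("mean runs per game:  ", round(game_mean, 3))
print("P(two or fewer runs):", round(sum(game_pmf[:3]), 3))
```

The same two functions work if the nine innings have different distributions: convolve the nine per-inning lists one after another instead of repeating a single list.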
The plot below shows the raw MLB runs per game data in black and our model’s predicted probabilities in blue obtained via 9-fold convolution of the runs per inning model.
The errors here are visible, unlike in the per-inning case. Even so, they are much smaller than the errors from using the Poisson or negative binomial distributions to model runs per game. Some possible reasons for the remaining error are listed below.
First, this model assumes that all innings have the same scoring distribution, which we know is not true. Second, it assumes a constant on-base percentage regardless of who is pitching and who is batting. Updating the model to account for these factors should make it more accurate. This is the content of a future article, but we are curious to see what happens in the runs per inning distribution if we change the on-base percentage.
Runs Per Inning when Varying OBP
On-base percentage is the key factor which governs our batters faced per inning model and, as a result, our runs per inning model. Different pitchers throwing against different hitting teams will result in different expected on-base percentages. If we want to predict runs scored for any given matchup, we need to account for this fact.
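To get a feel for how strongly on-base percentage drives batters faced, here is a deliberately simplified sketch. It is not the mixed negative binomial model from the previous article. It assumes every batter reaches base independently with probability `obp` and otherwise makes exactly one out, ignoring double plays, outs on the bases, and anything else that breaks that rule. Under those assumptions the number of batters who reach before the third out is negative binomial, and the expected number of batters faced is 3 / (1 - obp). The function name and the OBP values are illustrative choices of ours.

```python
import math

def batters_faced_pmf(obp, max_batters=40):
    """Simplified batters faced distribution for one inning.
    ASSUMPTION: each batter reaches base with probability obp, independently,
    and otherwise makes exactly one out. Then
    P(B = 3 + k) = C(k + 2, k) * obp^k * (1 - obp)^3 for k = 0, 1, 2, ...
    The distribution is truncated at max_batters."""
    out_prob = 1.0 - obp
    return {3 + k: math.comb(k + 2, k) * obp ** k * out_prob ** 3
            for k in range(max_batters - 2)}

def expected_batters(pmf):
    """Mean of a distribution stored as {value: probability}."""
    return sum(b * p for b, p in pmf.items())

# Illustrative on-base percentages (hypothetical matchups, not real teams).
for obp in (0.28, 0.32, 0.36):
    pmf = batters_faced_pmf(obp)
    print(obp, round(expected_batters(pmf), 3), round(3 / (1 - obp), 3))
```

Even in this stripped-down version, a few points of OBP change the expected number of batters per inning, and more batters per inning means more chances for runs to score.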
The following GIF shows how the runs scored per inning distribution evolves with an increasing on-base percentage.
This is great because now we know how to estimate the number of runs that will be scored given the expected on-base percentage in a particular at bat or inning. The last step in designing our baseball model is in studying how to estimate OBP knowing the pitcher, the inning, and how good the hitting team is.
Really, being able to change our beliefs about the number of runs which will be scored based on on-base percentage is the key to our analysis. The only remaining question is how to model OBP. Stay tuned.