Mathematical Modeling Tutorial: The TDJ Baseball Model (Part 1)
Check out this updated content in which we describe how to model the number of batters faced per inning using a mixture of negative binomial distributions.
Simulating Pitcher-Batter Matchups
This article will be the first in a series in which I will showcase the process of making a mathematical model from beginning to end. This mathematical modeling tutorial will cover the entire process of creation from conception to data gathering, from analysis to automation.
While the focus of the article is ‘how to make a mathematical model’ or ‘what is a mathematical model’, I will also try to provide insights about the underlying sport that are independently interesting. In this work we’ll develop the basics for our model and tackle the first mathematical question: how do we model the pitcher-batter interaction? More specifically, if a player bats .280 while a pitcher allows only a .220 batting average against, what is the probability that the batter gets a hit against this pitcher?
What is a Mathematical Model?
Every mathematical modeling project starts with an idea. A mathematical model is nothing more than a defined set of steps that transforms some input into a desired output. For example, we can transform polling data into a prediction of who will win the election. We can turn baseball batting averages and pitcher ERAs into a prediction of which team will win a given matchup. In general, a mathematical model arises whenever you ask the question ‘How can we use tools from statistics, calculus, or other mathematical fields to transform what we have seen through observed data and make predictions about future occurrences?’
Perhaps my favorite example is FiveThirtyEight’s election model. Most simply, the input to this model is polling and economic data and the output is a probability of an individual candidate winning the presidential election. Along the way, they bake in the possibility of different events. What if the economy crashes? What if a major disaster happens between now and the election? There is a lot of time for things to change but by studying past elections and polling trends they can make predictions about what is likely and what is not. The specifics of how they implement all these different scenarios is what we call ‘the model’. The specifics of creating this object is called ‘mathematical modeling’.
The Stages of Creating a Mathematical Model
While mathematical modeling is a broad and diverse subject, it can be boiled down to a few key ideas. Every model has to start with an idea. The first step is to find some phenomenon that can be described mathematically. From predicting elections with polling data to investing money using stock market data, almost anything can suffice. While mathematical modeling is as much an art as a science, I can briefly describe a general outline of the steps I take when I am building a model.
- (Ideation) Come up with an idea for a way to study something
- (Definition) Define your model inputs and outputs
- (Analysis Step) Use your mathematical knowledge to create a well-defined ‘pipeline’ that transforms the inputs to the outputs.
- (Implementation Step) Figure out a way (usually computer programming) to implement the input/output transformation of step 3.
- (Verification Step) Make sure your model is at least reasonably accurate on previously observed inputs and outputs.
- (Refinement Step) Think about what simplifications you made in step 3 and ask if they perfectly describe the situation. Hint: they don’t. Determine a way your mathematical modeling could have been more accurate and go back to step three.
And that is really all it takes. Steps one through four give you a model. Step five makes sure your model works. Step six can be repeated as many times as is needed to transform a basic model into an advanced model.
The First Step of Mathematical Modeling
As I just suggested, building a mathematical model starts with one simple idea. The basic ‘unit’ or ‘event’ of a baseball game is an at-bat. Baseball can be thought of as a sequence of these basic units. The idea that leads to my model is this: If I can simulate very accurately the outcome of an at-bat, I can simulate innings, whole games, final season standings, etc. What do I mean by simulate here? I mean that for a specific pitcher-batter matchup, I can say there is an x% chance of recording an out, a y% chance of a single, a z% chance of a walk, etc.
That’s it. Step 1 of mathematical modeling, having an idea, is done for us. But wait, you may ask, baseball has way more intricacies than just simulating at bats. What about injuries? What about managerial decisions, defense, base running subtleties, trades, etc.? What if players over/under perform their expectations? While these all certainly have an impact on the outcome of a season, they don’t belong here. They belong in Step 6, refinement.
If you start a mathematical modeling project with the idea that you need to model injuries, defense, trades, and other minutiae in addition to simulating at-bats, you’ll get overwhelmed and get nowhere. If I start with the idea that I only need to be able to simulate a single at-bat, then the project is much more manageable. In some sense, the first iteration of your model should involve the minimum possible complexity that gets meaningful results. Mathematical modeling can be as complex as you like, but the key is to start simple.
Model Input and Output
The model output is pretty easy for us to define here. We want to predict playoff chances and championship probabilities for every MLB team in the given year. The output is usually just the question that you wanted to answer in the first place.
What about the model inputs? Oftentimes, the model inputs will change with how complex the model is. That is, as the refinement step happens over and over again, we may need to add more and more inputs to the model. For instance, if we want to incorporate the possibility of trades occurring, we may need access to salary information, past buying tendencies, and minor league prospect ratings. If we don’t want to incorporate trading, we probably won’t need any of those things.
Our initial model is based on simulating an individual at-bat for a given pitcher and batter. Therefore, to simulate the entire season we at minimum need to know
- Every team’s roster
- Pitching box statistics
- Batting box statistics
- The MLB schedule.
With these four things, we can reasonably simulate every at bat in every game and see what tends to happen. Mathematical modeling can get extremely complex and the resulting models can be huge. However, it is important to start small when building a mathematical model.
Pitcher Batter Matchups
At this point we move into the third step of mathematical modeling. Our goal is to transform the inputs defined above into predictions about the outcome of the baseball season. As discussed previously, the most important step is to be able to predict the probabilities for a specific outcome of an at bat given knowledge of the pitcher and the batter. Here is our first attempt which is, admittedly, quite simple.
Suppose Jesse Winker is at bat against Yu Darvish. Winker has been batting .293 this year. The most literal interpretation of this .293 number is ‘In 29.3% of Winker’s at bats so far this season he has recorded a hit’. We could also interpret this number in a subtly different way: Winker may be expected to record a hit in 29.3% of his future at bats. We have discussed this distinction between ‘a guy hitting .300’ and ‘a true .300 hitter’ in our previous article about the effects of the shortened baseball season.
Now, while Winker may get a hit in 29.3% of his at bats, that number doesn’t take into account the quality of pitcher he is facing. Yu Darvish, for instance, has only allowed opponents to hit .200 against him this year. That is significantly better than a league average pitcher. We probably wouldn’t expect Winker to get as many hits against Yu Darvish as against an average pitcher. So, how might we determine the chance of Winker getting a hit when we know that he is batting specifically against Darvish?
This is where the creativity comes in. The way I choose to model this interaction is to measure both Winker’s batting average and Darvish’s opponents’ batting average relative to league average. In particular, the league average hitter in 2019 hit about .250. That means Winker is about 43 points above average for a hitter. Yu Darvish, on the other hand, is about 50 points better than the average pitcher. Combining these two numbers means that the matchup should favor Darvish by about 7 points relative to league average. Therefore, we can estimate Winker’s chance of getting a hit against Darvish as about 24.3%, which is 0.7 percentage points below league average.
Remember, this is just a first iteration; we can improve upon this estimate later! However, it should be good enough as a first try. I can summarize the previous discussion with one formula. For a specific batter-pitcher matchup, we let BA denote the batter’s average and OBA denote the pitcher’s opponents’ batting average. Then, the chance the batter records a hit in his at bat is approximately:
BA + OBA - .250
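As a quick check of this formula, here is a minimal Python sketch that reproduces the Winker-Darvish estimate. The function name `hit_probability` and the `league_avg` parameter are my own labels for illustration; the .250 default is the league average quoted above.

```python
def hit_probability(ba, oba, league_avg=0.250):
    """First-iteration estimate of a hit: BA + OBA - league average."""
    return ba + oba - league_avg


# Winker (.293) facing Darvish (.200 opponents' batting average)
p_hit = hit_probability(0.293, 0.200)
print(f"{p_hit:.3f}")  # prints 0.243
```

Note that a league-average batter facing a league-average pitcher comes out at exactly the league average, which is the sanity check we would want from any formula of this kind.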
Simulating At Bat Outcomes
In the simplest case, there are essentially seven things that can happen in an at-bat:
- Out (We will add logic for double plays in the future)
- Single
- Double
- Triple
- HR
- BB
- HBP
Sure, we will eventually need to be able to incorporate sacrifices, stolen bases, and any other long-tail events. But for now, to be able to accurately simulate a plate appearance, it is enough to assign the outcome of the PA to one of the above seven categories. We can use the ideas from the prior section to estimate the probability a player records a single, a double, etc.
For example, let SP denote the batter’s single percentage (the number of singles divided by total plate appearances) and let OSP denote the pitcher’s opponents’ single percentage. Moreover, let LASP denote the league-average single percentage. Then, the probability that our specific batter records a single against this specific pitcher is, just as above, approximately
SP + OSP - LASP
Repeating this for doubles, triples, etc., we can accurately simulate the outcome of a specific batter-pitcher matchup.
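To make this concrete, here is a rough Python sketch that applies the same formula to all seven categories and then draws one random outcome. The league-average and Darvish rates are taken from the table later in this article. The batter line is hypothetical and exists only to make the example run. The clipping and rescaling step is also my own assumption: the additive formula can in principle produce a negative number for rare events, and the article does not say how to handle that.

```python
import random

OUTCOMES = ["1B", "2B", "3B", "HR", "BB", "HBP", "Out"]

# Rates per plate appearance, from the table later in the article.
LEAGUE = {"1B": 0.135, "2B": 0.043, "3B": 0.003, "HR": 0.035,
          "BB": 0.092, "HBP": 0.013, "Out": 0.679}
DARVISH = {"1B": 0.136, "2B": 0.019, "3B": 0.005, "HR": 0.019,
           "BB": 0.051, "HBP": 0.009, "Out": 0.762}

# HYPOTHETICAL batter line, invented for illustration only.
HYPOTHETICAL_BATTER = {"1B": 0.150, "2B": 0.050, "3B": 0.005, "HR": 0.045,
                       "BB": 0.100, "HBP": 0.010, "Out": 0.640}


def matchup_probabilities(batter, pitcher, league):
    """Apply batter + pitcher - league to every outcome category."""
    raw = {k: batter[k] + pitcher[k] - league[k] for k in OUTCOMES}
    # Assumption (not from the article): clip negatives to zero and
    # rescale so the seven probabilities sum to exactly one.
    clipped = {k: max(v, 0.0) for k, v in raw.items()}
    total = sum(clipped.values())
    return {k: v / total for k, v in clipped.items()}


def simulate_plate_appearance(probs, rng=random):
    """Draw a single outcome according to the matchup probabilities."""
    weights = [probs[k] for k in OUTCOMES]
    return rng.choices(OUTCOMES, weights=weights, k=1)[0]


probs = matchup_probabilities(HYPOTHETICAL_BATTER, DARVISH, LEAGUE)
rng = random.Random(0)  # seeded so repeated runs draw the same outcomes
print(simulate_plate_appearance(probs, rng))
```

If each input line sums to one, the raw results also sum to one (1 + 1 - 1), so the rescaling only matters when rounding or clipping gets in the way.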
Simulating Innings, Games and Seasons
Knowing who is pitching and the batting order, it is straightforward to simulate an inning. We just simulate at-bats until we have recorded three outs. Then, having a good estimate of when pitchers might change, we can simulate entire games. One step further, knowing the MLB schedule we can simulate entire seasons. For us, the mathematical modeling hinges entirely on simulating an at bat.
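The "simulate at-bats until three outs" loop can be sketched in Python as follows. The article has not yet specified how runners move, so the base-running rule here is purely my own simplifying assumption: on a hit, every runner advances exactly as many bases as the batter, and a walk or hit-by-pitch moves only the runners who are forced. The names `advance_runners` and `simulate_half_inning` are illustrative, and the example lineup simply uses the league-average rates from the table later in this article for all nine batters.

```python
import random

OUTCOMES = ["1B", "2B", "3B", "HR", "BB", "HBP", "Out"]
BASES_GAINED = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def advance_runners(bases, outcome):
    """Return (new_bases, runs) for a non-out outcome.

    bases is a tuple of three booleans: (first, second, third).
    ASSUMPTION: runners advance as many bases as the batter on a hit;
    walks and HBP advance forced runners only.
    """
    first, second, third = bases
    if outcome in ("BB", "HBP"):
        runs = 1 if (first and second and third) else 0
        new_third = third or (first and second)
        new_second = second or first
        return (True, new_second, new_third), runs
    gained = BASES_GAINED[outcome]
    # Position 0 is the batter; positions 1-3 are the occupied bases.
    positions = [0] + [i + 1 for i, occupied in enumerate(bases) if occupied]
    moved = [p + gained for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    return tuple(b in moved for b in (1, 2, 3)), runs


def simulate_half_inning(lineup_probs, start_index=0, rng=random):
    """Simulate plate appearances until three outs are recorded.

    lineup_probs holds one outcome-probability dict per batter, already
    adjusted for the current pitcher. Returns (runs, next_batter_index).
    """
    outs, runs, bases = 0, 0, (False, False, False)
    i = start_index
    while outs < 3:
        probs = lineup_probs[i % len(lineup_probs)]
        weights = [probs[k] for k in OUTCOMES]
        outcome = rng.choices(OUTCOMES, weights=weights, k=1)[0]
        if outcome == "Out":
            outs += 1
        else:
            bases, scored = advance_runners(bases, outcome)
            runs += scored
        i += 1
    return runs, i % len(lineup_probs)


LEAGUE_AVERAGE = {"1B": 0.135, "2B": 0.043, "3B": 0.003, "HR": 0.035,
                  "BB": 0.092, "HBP": 0.013, "Out": 0.679}
lineup = [LEAGUE_AVERAGE] * 9
runs, next_batter = simulate_half_inning(lineup, rng=random.Random(0))
print(runs, next_batter)
```

Returning the index of the next batter lets a game-level loop carry the batting order over from one inning to the next.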
Simulating seasons hundreds or thousands of times gives us an idea of what types of results for a season are typical. We can count how many times a team makes the playoffs, divide by the total number of simulated seasons, and report this number as ‘playoff chances’. We can do the same for championship percentages.
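The counting step is simple enough to sketch in a few lines of Python. Since the full season simulator has not been built at this point, the example uses a toy stand-in that is entirely hypothetical: a team that wins each of 162 games independently with probability 0.55 and "makes the playoffs" with at least 90 wins. Those numbers are invented for illustration and are not part of the model.

```python
import random


def estimate_chance(simulate_once, n_sims, rng):
    """Fraction of simulated seasons in which simulate_once(rng) is True."""
    successes = sum(1 for _ in range(n_sims) if simulate_once(rng))
    return successes / n_sims


def toy_season_makes_playoffs(rng, win_prob=0.55, games=162, wins_needed=90):
    """HYPOTHETICAL stand-in for a full season simulation."""
    wins = sum(rng.random() < win_prob for _ in range(games))
    return wins >= wins_needed


rng = random.Random(2020)  # seeded so the estimate is reproducible
playoff_chance = estimate_chance(toy_season_makes_playoffs, 1000, rng)
print(f"Estimated playoff chance: {playoff_chance:.1%}")
```

In the real model, `simulate_once` would run every game on the schedule and check the final standings; the counting logic around it would stay the same.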
Even better, as the season starts, we can update our simulations to take into account what has already happened during the season. For instance, we might have reported the Padres’ championship chances to be fairly low before the season. However, after their hot start and big moves at the trade deadline, our model would likely reflect increased certainty of San Diego making the playoffs and winning the title.
Simulating Cubs – Reds on Sep. 9
In this section we’ll describe a bit more specifically and in detail what simulating a game will look like for our first model iteration. I’ve chosen to stick with the above example with Winker facing Darvish and will simulate Cubs-Reds on September 9th. We will use the following simplistic model for pitching changes.
We suppose that the two starters, Bauer and Darvish, each go 6 innings. Then, each is replaced by a reliever. Instead of trying to implement logic to decide which reliever will take their place, we’ll suppose that an ‘average’ Reds reliever and an ‘average’ Cubs reliever take over. The necessary data for these players is as follows (data from Baseball-Reference.com).
Player | 1B % | 2B % | 3B % | HR % | BB % | HBP % | Out % |
League Average | 13.5% | 4.3% | 0.3% | 3.5% | 9.2% | 1.3% | 67.9% |
Bauer | 8.7% | 1.5% | 1.0% | 3.0% | 6.6% | 1.5% | 77.7% |
Darvish | 13.6% | 1.9% | 0.5% | 1.9% | 5.1% | 0.9% | 76.2% |
Reds’ Reliever | 12.5% | 3.3% | 0.6% | 3.2% | 6.6% | 1.5% | 72.5% |
Cubs’ Reliever | 13.1% | 4.4% | 0.3% | 3.4% | 8.6% | 1.0% | 69.2% |
Then, we need to compute the same table for the Reds’ and Cubs’ starting lineups. Then, we can model every at-bat in the game according to the probabilities discussed above. Our results are described in the next section.
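For reference, the pitcher table and the six-inning rule above could be encoded along these lines in Python. The names `RATES` and `pitcher_for_inning` are my own, and the sketch ignores extra innings and any other wrinkle the simple rule does not cover.

```python
COLUMNS = ["1B", "2B", "3B", "HR", "BB", "HBP", "Out"]

# Percentages per plate appearance, copied from the table above.
ROWS_PCT = {
    "League Average": [13.5, 4.3, 0.3, 3.5, 9.2, 1.3, 67.9],
    "Bauer":          [8.7, 1.5, 1.0, 3.0, 6.6, 1.5, 77.7],
    "Darvish":        [13.6, 1.9, 0.5, 1.9, 5.1, 0.9, 76.2],
    "Reds' Reliever": [12.5, 3.3, 0.6, 3.2, 6.6, 1.5, 72.5],
    "Cubs' Reliever": [13.1, 4.4, 0.3, 3.4, 8.6, 1.0, 69.2],
}

# Convert each row to a dict of probabilities keyed by outcome.
RATES = {
    name: {col: pct / 100 for col, pct in zip(COLUMNS, row)}
    for name, row in ROWS_PCT.items()
}

STARTER_INNINGS = 6  # each starter is assumed to pitch six innings


def pitcher_for_inning(inning, starter, reliever,
                       starter_innings=STARTER_INNINGS):
    """Return the name of the pitcher on the mound in a given inning."""
    return starter if inning <= starter_innings else reliever


# Who pitches for the Reds in each of the nine innings?
for inning in range(1, 10):
    name = pitcher_for_inning(inning, "Bauer", "Reds' Reliever")
    print(inning, name, RATES[name]["Out"])
```

Because of rounding in the source data, a couple of rows sum to slightly more than 100%, so it is worth rescaling the matchup probabilities before sampling from them.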
Mathematical Modeling Results
We ran 384 simulations of the above game. The results suggest that the game was much closer than one might expect. Because Bauer and Darvish have been so dominant this year, the simulated games saw very, very few hits. In fact, 80% of the simulated games ended with a 1-0 score. Here is a graphical summary of the distribution of scores in simulations of this game.
This graphic tells us that, at a very basic level, our model works. Bauer and Darvish are two of the best pitchers in the league. We expect the game to be a pitcher’s duel and that is exactly what we see. However, this raises some concerns for us to address. This model suggests there is about an 80% chance that the game ends with a 1-0 score. Even though these pitchers are Cy Young candidates, this still feels like it favors the pitchers a bit too much. This is something we might revisit when we revise our model in the next few articles.
Conclusions
This article is meant mostly as a mathematical modeling tutorial. We tackle the subject of predicting the outcome of a baseball game through purely statistical techniques. I outline the steps I would take in constructing a model and we go over the first few in a real world setting. This mathematical modeling example is meant to be instructive in applied mathematics as much as it is to be informative about baseball.