Intro to Time Series and Sports
In many real-life applications – stock markets, weather modeling, and sales forecasting to name a few – data has a natural sequential ordering that can be exploited for powerful analysis. Such analysis is called time series analysis. It would seem natural to combine time series and sports because sports data is naturally sequential. One thing happens, then another, then another.
However, time series analysis in the sports world is relatively scant. It is used much less often than other common techniques like machine learning, regression models, sabermetrics, and Bayesian statistics.
In this article, we want to look at the interaction between time series and sports. What are they? What tools are used to analyze them? And why have they not been widely employed for sports data?
As usual, all the code used to generate the figures and analysis in this article is available in a Jupyter notebook on The Data Jocks GitHub.
What is a Time Series? Intuitively!
In some data sets, the points have no relationship to each other. As an example, if we ask 100 people who they’re voting for, it doesn’t matter if the 3rd person says a democrat and the 4th person says a republican or if those answers come in the opposite order. These types of datasets are unordered.
Time series is what we call data where the order does matter. Consider stock market data. It matters whether the stock went from 3,000 to 3,005 the next day or from 3,005 to 3,000. For an investor, the first is good and the second is bad.
Usually the phrase time series refers to a dataset where the data points are sampled over time. This feature naturally leads to the sequential structure which is their hallmark. It shouldn’t be too much of a stretch to apply these ideas to sports.
Time Series Examples in Sports
Sports data naturally has a time component to it. Players evolve over time. Teams get better or worse. Strategies change as events come to pass. A few specific examples in sports are below.
- A player’s wOBA is a measure of their offensive output that balances power and consistency. A higher wOBA means a player had a better game. wOBA happens to be a stat that has high variance from game to game, though. Looking at the time series of a player’s wOBA can help identify trends and determine if a player is meaningfully improving or not.
- The number of points scored by a fantasy player over a series of games is a time series. One of the best ways to win your fantasy league is to predict which players will be better or worse than last year. Analysis of the time series of fantasy points scored (including things like age) can help provide an edge.
- Evaluating prospects isn’t always about who is the best player at the current moment. It is more about predicting who will be the best professional player. A player’s college performance is a time series. Looking at the trajectory of their production, extrapolations can be made about their professional career.
In each of these examples, it is easy to associate sports data with a time series. Framing datasets as time series allows one to use the huge toolbox of time series analysis tools. Let’s look at a few below.
Time Series Analysis
There are many things one can do with time series. To me, the types of analysis fall under two main categories: descriptive and inferential. The difference between these two is the goal. In descriptive statistics, one attempts to describe and analyze existing data to explain what happened. In inferential statistics, one may use the same data set to make predictions about the future. I’ll highlight one example of each.
A type of descriptive statistic is time series segmentation. Segmenting a time series refers to splitting it into multiple parts where the behavior is distinctly different. A classic example is splitting a phone call into portions where one party is talking, the other is talking, or neither is talking.
In sports, we can segment possessions in basketball to try to determine stretches where we played better or our opponent did. Then, by looking at various factors like who was on the court and which plays were run, conclusions can be drawn to hopefully improve future performance.
Inferential statistics are generally much more difficult – in time series this is commonly called “forecasting”. The classic example is the stock market. It is more interesting to predict future stock market values than to describe what happened in the past.
In sports, time series forecasting can be applied to project player career trajectories. Whether preparing for the upcoming draft, making trades for new players, or deciding how much to pay your current players, knowing how good somebody is in the future is of crucial importance. In fact, I would say that forecasting player career trajectories is THE most important analytical technique in sports.
We’ve talked previously about one of the key time series analysis tools: the autocorrelation function in sports. Let’s talk more specifically now about time series forecasting.
Time Series Forecasting in Sports
Time series forecasting has a long, long history. Some of the most popular methods include:
- Moving average models
- Autoregressive integrated moving average models (ARIMA)
- ML methods like recurrent neural networks
Let’s look at the simplest of these, the moving average model, with an example. We took the game-by-game RE24 for Aaron Judge’s historic 2022 season. For a refresher on RE24, check out our past article!
For now, it suffices to know that RE24 is essentially a measure of runs created via the player’s at bats. A higher RE24 means a player had a better offensive game. It is also important to know that a value of 0 is league average – positive means net positive contributions to your team. Shown below is the time series of Aaron Judge’s RE24 from 2022.
Notice that this plot captures the natural variation in the stat over the season. Without the red line, it can be hard to tell whether this is a good season or a bad season just by looking. However, by using the simplest possible time series analysis technique and computing the series’ average, we can tell that over the course of the season Judge added about 0.5 runs per game to his team’s total.
We’re going to look at how moving average models may increase the predictive power in forecasting. But first, what is the moving average model?
Moving Average Model for Sports Forecasting
Let’s start this section with some math.
The moving average model predicts the next value using both (a) the long term mean of the time series and (b) the recent behavior of the time series. Combining these two things causes the model to stay close to the long-term behavior of the series while also respecting recent trends.
Formally, let \mu be the time series average value over a long time. Also let X_{-1},\dots, X_{-N} be the N most recent observations in the time series. For each index, let r_{-i} = X_{-i}- \mu be the difference between the observed value and the mean.
The r_{-i} values have some intuition. If many recent r_{-i} values are positive, then the time series has been larger than average recently. If many of them are negative, then the time series has been smaller than average recently. Using these trends can help predict future values.
The moving average model predicts the next value X_0 of the time series using only previously observed values. For some choice of parameters \theta_1,\dots,\theta_N , we predict X_0 = \mu + \sum_{i=1}^N \theta_i r_{-i} . The values of \theta_i can be chosen to place varying amounts of emphasis on more recent observations.
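To make the prediction rule above concrete, here is a minimal Python sketch. The function name `moving_average_forecast` and all the numbers are hypothetical, chosen only for illustration; they are not taken from the notebook or from Judge's actual RE24 values.

```python
import numpy as np

def moving_average_forecast(recent, mu, theta):
    """Predict the next value as mu + sum_i theta_i * (X_{-i} - mu).

    recent[0] is the most recent observation X_{-1}, recent[1] is X_{-2},
    and so on. theta[i] is the weight applied to the matching residual.
    """
    recent = np.asarray(recent, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if recent.shape != theta.shape:
        raise ValueError("recent and theta must have the same length")
    residuals = recent - mu  # r_{-i} = X_{-i} - mu
    return mu + float(np.dot(theta, residuals))

# Hypothetical inputs, for illustration only.
mu = 0.5                    # assumed long-term average
recent = [1.5, 0.5, -0.5]   # X_{-1}, X_{-2}, X_{-3}
theta = [0.5, 0.25, 0.25]   # heavier weight on the latest game
prediction = moving_average_forecast(recent, mu, theta)
print(prediction)           # 0.75
```

One choice worth noting: with equal weights \theta_i = 1/N , the \mu terms cancel and the prediction is simply the mean of the last N observations.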
We used a few different moving average models to predict Aaron Judge’s game-by-game RE24. The first one, shown below, uses the past 8 games to predict his performance:
At first glance it looks like this moving average model is pretty good at fitting the very noisy time series. For example, it correctly captures the hot streak from games 85 to 110. We built a similar moving average model with a 15 game window.
Because this model uses a longer window, it is slower to react to hot streaks. For example, just visually, the model seems “late to the show” in catching Judge’s midseason hot streak.
We built one more very basic model for comparison. This model uses Judge’s year-to-date RE24 average to predict performance. Very slow to react, but it takes in as much data as possible.
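The windowed and year-to-date forecasts described above could be sketched as follows. This assumes equal weights within each window, which the article does not specify; the function names and the toy series are hypothetical.

```python
import numpy as np

def rolling_mean_forecast(series, window):
    """Forecast game t as the mean of the previous `window` games.

    Games without a full window of history get NaN (no forecast).
    """
    x = np.asarray(series, dtype=float)
    preds = np.full(x.shape, np.nan)
    for t in range(window, len(x)):
        preds[t] = x[t - window:t].mean()
    return preds

def year_to_date_forecast(series):
    """Forecast game t as the mean of every game before t.

    The first game has no history, so its forecast is NaN.
    """
    x = np.asarray(series, dtype=float)
    preds = np.full(x.shape, np.nan)
    for t in range(1, len(x)):
        preds[t] = x[:t].mean()
    return preds

# Toy series, for illustration only (not real RE24 data).
toy = [1.0, 3.0, 2.0, 6.0, 4.0]
short_window = rolling_mean_forecast(toy, window=2)  # [nan, nan, 2.0, 2.5, 4.0]
ytd = year_to_date_forecast(toy)                     # [nan, 1.0, 2.0, 2.0, 3.0]
```

Each forecast uses only games strictly before the one being predicted, so no information leaks from the future into the prediction.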
At first glance, it seems like the 8-game model does best. To check, we evaluated each model’s forecasting ability by computing the mean squared error (MSE) in predicting each game’s RE24. Here is what we found.
Model | MSE |
---|---|
Moving Average (8) | 1.37 |
Moving Average (15) | 1.30 |
Year-to-Date Average | 1.26 |
Interestingly, the moving average model with an 8 game window did the worst of the bunch. The best model was one that reacted very slowly and didn’t use any recency bias at all! This is further evidence of something we’ve claimed a lot: hot streaks don’t exist in sports.
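For reference, a mean squared error like the ones in the table can be computed with a few lines of Python. This is a sketch rather than the notebook's code: `forecast_mse` is a hypothetical helper, the arrays are toy values, and skipping games with no forecast (NaN) is an assumption about how missing predictions are handled.

```python
import numpy as np

def forecast_mse(actual, predicted):
    """Mean squared error over the games that have a forecast.

    Games whose forecast is NaN (not enough history) are skipped.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~np.isnan(predicted)
    return float(np.mean((actual[mask] - predicted[mask]) ** 2))

# Toy values, for illustration only (not real RE24 data).
actual = [1.0, 3.0, 2.0, 6.0, 4.0]
predicted = [np.nan, 1.0, 2.0, 2.0, 3.0]
score = forecast_mse(actual, predicted)
print(score)  # 5.25
```

When comparing models with different windows, it is worth scoring them on the same set of games, since a longer window produces its first forecast later in the season. As a sanity check on the ordering in the table: if game-to-game values were independent draws with a fixed mean and variance \sigma^2 , an equal-weight N-game mean would have expected squared error \sigma^2 (1 + 1/N) , so longer windows would be expected to score better.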
This very basic time series model didn’t work here, but maybe applying some more advanced models might help. We’ll return to this idea in future articles.
Final Thoughts
Time series are extremely important mathematical objects that describe datasets with an order. It seems they have a natural application in sports. We showed one very basic time series model and why proper application of time series isn’t necessarily straightforward.