Sports Analytics Example: NFL Historical Betting Analysis
From time to time I want to write articles that focus on teaching analytical methods as much as on sharing the insights drawn from them. I have done at least one sports analytics example before, but this time I am going to focus on historical betting analysis using a database of Vegas NFL lines, paying as much attention to the process of analysis as to the results, so that this serves as an example of what a data science project can look like.
The question I am going to answer is quite simple: convert from a team being favored by X points to a percentage chance of winning the game. For example, if an NFL team is favored by 7.5 points, do they have a 70% chance of winning the game? An 80% chance?
We’ll go through the entire process in this sports analytics example: data cleaning/pre-processing, initial analysis, model selection, and final model computation. We’ll be using the programming language R throughout, and the model we are going to focus on is a logistic regression model.
If you would like to simply skip ahead to the results, click here for the conversion table from Vegas line to winning percentage. Otherwise, continue reading this sports analytics example to see what we can say about betting in the NFL.
Initial Data Exploration
The first step in any data analysis problem is trying to get a feel for what your data looks like. My data is from this Kaggle competition. Here is a snapshot of what the data looks like:
Just browsing through this data, I first noticed something: the home and away teams are given by their full names, but the favorite in the Vegas line is given by an abbreviation. Our whole analysis is about seeing how often the Vegas favorite wins. Therefore, we will at some point need our program to look at the ‘team_favorite_id’ column and tell whether the home team or the away team is favored. That is, we need a list of conversions from team IDs to team names.
Many problems in data science can be solved algorithmically BUT this particular sports analytics example shows that sometimes doing things by hand is easiest. We could conceivably do some analysis so that our program could ‘learn’ that IND translates to Indianapolis Colts, but if we simply take two and a half minutes to fill these in by hand, we have solved the conversion problem.
The Conversion Problem
Before we can do any historical betting analysis, we need to solve the team name to team ID conversion problem. First, I need to know which names I need conversions for and which IDs appear in the data. The ‘unique’ function in R returns the unique elements of a vector, and I have loaded our dataset into a data frame called ‘data’. Then, the unique team names and unique IDs can be obtained as follows.
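A minimal sketch of that step (the file name and the ‘team_home’/‘team_away’ column names are my assumptions; ‘team_favorite_id’ is the column mentioned above):

```r
# Load the dataset (file name assumed)
data <- read.csv("nfl_lines.csv", stringsAsFactors = FALSE)

# Team names appear in both the home and away columns
team_names <- unique(c(data$team_home, data$team_away))

# Unique favorite IDs
team_ids <- unique(data$team_favorite_id)

length(team_names)  # 40 unique team names
length(team_ids)    # 33 unique IDs
```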
I quickly notice that there are 40 team names but only 33 IDs, so multiple teams must share the same ID, and I need to look into what is going on. If we look back at the screenshot above, you notice that the Los Angeles Raiders have the ‘OAK’ abbreviation. Moreover, if you look closer, the St. Louis Rams have the abbreviation ‘LAR’. This tells me that if a team changes city or branding, the ID in the data set is taken to be their current abbreviation. Therefore, I can make the following .csv file by hand (I wrote the unique team names to a .csv with write.csv, then made the ID column in Excel) to let my program convert between team names and IDs:
If I hadn’t looked at the number of distinct teams, I probably would have mistakenly put ‘SD’ for the San Diego Chargers and gotten run-time errors later. One last thing: why are there 33 IDs and not 32? Interestingly, if you print out the list of unique IDs from the data set, you see ‘PICK’ as one of the IDs, which corresponds to a line of 0 points or a ‘pick ‘em’ – nobody is favored.
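With the conversion table in hand, here is a sketch of the round trip (again, file and column names beyond ‘team_favorite_id’ are my own) and of how the table lets us flag whether the home team is favored:

```r
# Write the unique team names to a .csv, then fill in the ID column by hand
write.csv(data.frame(team_name = team_names), "team_ids.csv", row.names = FALSE)

# Read the completed conversion table back in ('team_id' column added by hand)
conversions <- read.csv("team_ids.csv", stringsAsFactors = FALSE)

# Map each game's home team name to its ID and compare with the favorite
data$home_id <- conversions$team_id[match(data$team_home, conversions$team_name)]
data$home_favored <- data$home_id == data$team_favorite_id
```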
Initial Analysis
Now that I have the conversion problem solved, I can meaningfully work with my data. My first step in getting a feel for what is going on is to compute the observed winning percentage for a given line. For each game in my data set, I record the line and whether or not the favorite won. Then, knowing how many times a given spread showed up, I divide the number of wins by the favorite at that spread by the number of times that spread was given for a game, which yields a winning percentage for each spread.
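A minimal sketch of this computation, reusing the ‘home_favored’ flag from above (‘score_home’, ‘score_away’, and ‘spread_favorite’ are assumed column names for the final scores and the Vegas line):

```r
# Drop 'pick 'em' games, which have no favorite
games <- subset(data, team_favorite_id != "PICK")

# Did the favorite win?
fav_score <- ifelse(games$home_favored, games$score_home, games$score_away)
dog_score <- ifelse(games$home_favored, games$score_away, games$score_home)
games$fav_won <- as.numeric(fav_score > dog_score)

# Observed winning percentage for each distinct spread
win_pct <- aggregate(fav_won ~ spread_favorite, data = games, FUN = mean)
```

Here is what I found: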
At first glance, this looks pretty good. We expect that as the spread gets larger (-15 implies the favorite is expected to win by 15 points), the winning percentage generally gets larger. However, two things are evident. First of all, if we want to do a proper analysis, we need to introduce some ‘smoothing’ effect.
If we simply look at this chart and take it as Gospel, then we would say that being a 17-point favorite means you have a lower chance of winning than if you were a 15-point favorite. This is nonsensical. The reason 17-point favorites have won less often than 15-point favorites is that being a 17-point favorite is relatively rare, so the small sample size leads to relatively large errors. Moreover, the 20-25 point favorite range all has a computed 100% winning rate. These teams certainly are not guaranteed to win (as a 100% win rate would imply), but the win probability is close enough to 100% that, in our small sample of 20-25 point favorites, they have won every time. It would be nice if we could replace the observed 100% win rate of, say, 24-point favorites with a more realistic value like 98 or 99%, and have that 98 or 99% be based entirely on a model.
The second thing we observe is that this is only half the graph. The true graph also has win rates for underdogs. I have included the full graph below. Budding data scientists should recognize this approximate shape: the logistic function.
Logistic Regression
There are many different curves we could call ‘the’ logistic curve, but shown below is one example highly reminiscent of the above curve.
The function for the logistic curve is given by y=\frac{1}{1+e^{-k(x-x_0)}} where
- k determines the steepness of the curve and whether it rises left-to-right or right-to-left, and
- x_0 determines the midpoint of the curve
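To make this concrete, here is a quick R sketch (parameter values chosen purely for illustration) of a logistic curve that falls left-to-right like the full win-percentage graph above:

```r
# The logistic curve: k controls steepness and direction, x0 the midpoint
logistic <- function(x, k, x0) 1 / (1 + exp(-k * (x - x0)))

# A negative k gives a curve that falls as the spread increases,
# matching the shape of the full win-percentage graph
curve(logistic(x, k = -0.15, x0 = 0), from = -25, to = 25,
      xlab = "Spread", ylab = "Win probability")
```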
Logistic regression is the process of fitting a logistic curve to available data. In particular, given a set of observed data we find the values of k and x_0 so that the curve best matches what we see.
A note for the technically minded: we need to be quite careful when performing logistic regression here. Logistic regression is not the process of fitting a curve to the ‘Winning Percentage v. Spread’ chart above. Logistic regression requires a binary response variable with a continuous predictor variable. Therefore, the data set we perform the regression on is labelled training data where the input is a team’s spread and the output is 1 if that team won and 0 if that team lost.
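In R this is a one-liner with glm. A sketch using the favorite-oriented columns built earlier (so each row’s input is the favorite’s spread and the output is whether the favorite won):

```r
# Fit the logistic regression on the binary win/loss outcome
fit <- glm(fav_won ~ spread_favorite, data = games, family = binomial)
summary(fit)

# Recover the curve parameters from the fitted coefficients:
# k  = coef(fit)[2]                  (slope)
# x0 = -coef(fit)[1] / coef(fit)[2]  (midpoint)
```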
Now that we understand logistic regression, we can continue our sports analytics example for historical betting analysis in NFL games.
Results
The table below has our results, converting from Vegas line information to winning probability in the NFL. This was the goal of our historical betting analysis: it tells us how likely a team is to win based on the stated line.
[Table: conversion from Vegas line to winning percentage]
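For completeness, here is a sketch of how such a table can be generated from the fitted model (the half-point grid is illustrative):

```r
# Convert a grid of spreads into fitted win probabilities
spreads <- seq(-24, -0.5, by = 0.5)
probs <- predict(fit, newdata = data.frame(spread_favorite = spreads),
                 type = "response")
round(data.frame(spread = spreads, win_pct = 100 * probs), 1)
```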
Analysis
One thing to ask is whether our choice of a logistic model was correct. There are many different curves that ‘look’ like the logistic curve (so-called sigmoidal functions, if you are familiar with neural network terminology). For example, what if we used a shifted and scaled version of arctangent to fit our data? That function has largely the same shape as the logistic function; perhaps it fits our data better. One needs to think quite carefully about which function to use because the particular choice will have an effect on our results.
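For concreteness, one such alternative (my own example, not fit here) is y=\frac{1}{2}+\frac{1}{\pi}\arctan\left(k(x-x_0)\right), which also runs from 0 to 1 with steepness k and midpoint x_0, but approaches its limits much more slowly than the logistic curve, so it would assign heavy favorites slightly less extreme win probabilities.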
A quick and easy way to check the appropriateness of a model choice is to inspect the ‘residuals’. A residual is the difference between the observed value and the value the model predicts. For example, if we predicted 3.5-point favorites to win 61.2% of the time but in our observed data they won 62.3% of the time, we have a small error of about 1.1%, which we call ‘a residual’. A good general rule of thumb is that if the residuals are small in magnitude (the model and the observed data are close to each other) and the residuals have no pattern, then our model is fairly strong. A quick note about ‘no pattern’: if there is a pattern in the residuals, it is something we could have ‘found’ by using a more appropriate model. If there is no pattern, then there is no more information to be gained by using a more complex model.
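As a sketch, the residuals against the per-spread observed win rates computed earlier can be plotted like this:

```r
# Observed minus fitted win rate at each distinct spread
win_pct$fitted <- predict(fit, newdata = win_pct, type = "response")
win_pct$residual <- win_pct$fav_won - win_pct$fitted
plot(win_pct$spread_favorite, win_pct$residual,
     xlab = "Spread", ylab = "Residual (observed - fitted)")
abline(h = 0, lty = 2)  # reference line at zero
```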
Here is a graph of both the modelled winning percentages and the empirical (observed) winning percentages on the same plot. This first plot shows that, at first glance, our model describes the observed data quite well.
Now, let us look at the residuals:
Again, these residuals are fairly indicative of a good model. I will point out two bits of structure (or patterns) one might observe above and try to explain why they do not worry me when it comes to the accuracy of my model.
First, the residuals are symmetric about the origin: rotate the picture by 180 degrees (around an axis sticking straight out of your screen) and it looks exactly the same. However, this symmetry is built into the data itself: a team favored by X points wins exactly when its opponent, an X-point underdog, loses, so the two halves of the graph mirror each other by construction. Therefore, any model we use will have this type of symmetry.
Second, there is a strong pattern towards the ends of the data set (spreads beyond about ±17), where the residuals look almost linear. However, I claim this artifact is attributable to small sample sizes in the data set rather than to the model. Recall that as we get further from the center of the graph, the sample size decreases dramatically; in our data, heavy underdogs always lost and heavy favorites always won, so the observed rates sit at exactly 0% and 100% while the model’s rates drift away from those extremes, producing the near-linear tails. Moreover, any of the other sigmoidal models I suggested would show this same artifact. A larger sample of large-spread games could help address this error.
Commentary
This concludes my sports analytics example. In this article, we examined betting information from Vegas and conducted our historical betting analysis for the NFL. I encourage you to think about the conclusions we made and whether you would have done anything differently or could improve the analysis in any way.