Augmenting NFL Scores to Improve Model Accuracy

Sometimes the final score doesn’t tell the whole story of an NFL game. In fact, it usually doesn’t. The final score can change dramatically based on just a few plays. Here are a couple of examples.

The January 2023 wild card game between the Bengals and Ravens was a very tough matchup. In the fourth quarter, the Ravens were poised to take a touchdown lead. However, at the goal line, Sam Hubbard scooped up a fumble and returned it 98 yards for a Bengals touchdown. The Bengals ended up winning by 7, but without that one play Baltimore may have won by 7! That single play was nearly a 14 point swing.

Another example is Super Bowl XLI (February 2007) between the Bears and Colts. On the very first play of the game, Devin Hester returned the opening kickoff 92 yards for a touchdown, giving the Bears an early lead. The rest of the game, though, was pure dominance by the Colts. They outgained the Bears by a wide margin and held the ball almost twice as long. However, the final score of 29-17 suggested a reasonably competitive game. Hester’s single play changed things enough that the final score didn’t tell the story of the game.

We want to answer two questions:

Can we, and should we, systematically augment the final scores of NFL games to better describe what happened in each game?

“Can we” is a question of whether this is possible to do in an unbiased way. “Should we” is a question of whether or not doing this provides any value. That is, if we figure out how to accurately augment NFL scores, does this make our models more accurate?

Before diving into the details, let me give you the answers. Can we? Yes. Should we? Yes.

Part 1: How to Augment NFL Scores to Tell a Better Story

If you look back at the two examples I gave in the introduction, you’ll notice that they share a common theme: in both games, a single big play shifted the final score by a large margin and kept it from representing the flow of the game.

I claim that the correct way to augment NFL scores is to reduce the impact of big plays! Some people might stop right here and argue that “some teams thrive on big plays”. And this is certainly true: some teams do get more big plays than others. The question, though, is whether these teams have been lucky and gotten more big plays by random chance, or whether the way they play systematically generates more big plays. This is the topic of Part 2 of this article: whether or not omitting big plays is a good idea.

For now, though, we want to study how to do this. How can we limit the effect of the outlier plays: kick return TDs, 99 yard pick sixes, missed field goals from inside the 20 yard line?

How to Measure Big Plays

The examples I gave above are clearly big plays. 99 yard pick sixes swing the score by 10+ points. Kick return touchdowns turn an ordinary drive into a guaranteed score, adding a ton of value in the process.

But what is the cutoff? A 99 yard touchdown run by Derrick Henry is certainly a “big play”. But what about an 80 yarder? 60? 40? 20? Certainly a 4th and 15 conversion late in the game is a big play, but what about 4th and 10? 4th and 2?

The key to identifying big plays in a game is to assign a points value to every play. This lets us compare different plays in an apples-to-apples way. This is the entire idea behind the increasingly common NFL stat called “expected points added”.

Expected points added assigns a value to every individual play based on whether it helped or hurt the team’s scoring chances: it is the change in the team’s expected points from before the play to after it. A 99 yard pick six probably has an expected points added of 12 points or so. A 99 yard touchdown run probably adds about 8 points.

Using the expected points added of a play removes any ambiguity from defining what a big play is. Once we can compare all the plays of a game on the same scale, we can reduce the impact of the outliers.
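To make this concrete, here is a minimal sketch of pulling per-play EPA values for a single game. It assumes the nfl_data_py package and nflfastR-style play-by-play columns (game_id, posteam, home_team, epa); the specific game ID is just a placeholder for illustration.

```python
import numpy as np
import nfl_data_py as nfl

# Load one season of nflfastR-style play-by-play data (includes an `epa` column).
pbp = nfl.import_pbp_data([2022])

# Keep the plays from a single game that have a possession team and an EPA value.
game = pbp[
    (pbp["game_id"] == "2022_01_BUF_LA")
    & pbp["posteam"].notna()
    & pbp["epa"].notna()
]

# Express every play's EPA from the home team's perspective: plays where the
# home team has the ball keep their sign, plays where the away team has the
# ball are flipped (a good play for the away offense hurts the home team).
sign = np.where(game["posteam"] == game["home_team"], 1.0, -1.0)
home_epa = sign * game["epa"].to_numpy()

print(f"{len(home_epa)} plays, total EPA from the home team's view: {home_epa.sum():+.1f}")
```

With every play expressed as a point value from one team’s perspective, the rest of the article is just a question of how we summarize that list of numbers.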

How to Reduce the Impact of Outlier Plays

In any statistics course, we are taught to limit the impact of outliers. Outliers can skew a dataset and give false conclusions about the population from which the dataset came. The last section helped us convert NFL plays to numerical values that can be analyzed just like any other stats data set.

There are tons of ways to reduce the impact of outliers on a dataset. The most obvious is to use the dataset median instead of the average (or mean). But there are other ways. Sometimes students are taught to use the “1.5 × IQR” rule in high school stats classes. In the NFL application, this would look like ignoring any plays that added or subtracted too many points.

In general, we want to look at the expected points added of every play in a game and figure out what a typical play from that game looked like. In this article, we’ll do that by computing various measures of central tendency of the game’s EPA distribution.

Here is an example. Typically, when analyzing the result of a game, people use the margin of victory. You can (approximately) recover the margin of victory by multiplying the number of plays by the mean expected points added per play, since mean EPA times the number of plays is just the game’s total EPA.

This method leaves in the outliers, though. An alternative is to augment the final score by multiplying the number of plays by the median of the expected points added per play. Or, you can use a different measure of central tendency.
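As a rough sketch of what these options look like in code, the helper below turns one game’s home-perspective EPA values into an augmented margin of victory for several measures of central tendency. The trimean and truncated-mean formulas are standard, but the exact trimming fractions used below for the “90%” and “75%” truncated means are my assumptions, not a fixed recipe.

```python
import numpy as np

def augmented_margin(epa: np.ndarray, method: str = "median") -> float:
    """Augmented margin of victory: (measure of central tendency) x (number of plays).

    `epa` holds one game's per-play EPA values from the home team's perspective.
    """
    if method == "mean":          # keeps the outliers: roughly the true margin of victory
        center = np.mean(epa)
    elif method == "median":
        center = np.median(epa)
    elif method == "trimean":     # Tukey's trimean: (Q1 + 2 * median + Q3) / 4
        q1, q2, q3 = np.percentile(epa, [25, 50, 75])
        center = (q1 + 2 * q2 + q3) / 4
    elif method == "trunc90":     # average of the middle 90% of plays (trim 5% per tail)
        lo, hi = np.percentile(epa, [5, 95])
        center = np.mean(epa[(epa >= lo) & (epa <= hi)])
    elif method == "trunc75":     # average of the middle 75% of plays (trim 12.5% per tail)
        lo, hi = np.percentile(epa, [12.5, 87.5])
        center = np.mean(epa[(epa >= lo) & (epa <= hi)])
    else:
        raise ValueError(f"unknown method: {method}")
    return float(center) * len(epa)
```

Plugging in the home_epa array from the earlier sketch, something like augmented_margin(home_epa, "trimean") would stand in for the raw final margin when we fit the ratings model in Part 2.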

Either way, you can change the final score of an NFL game to better reflect what happened by using different measures of central tendency. The plot below shows the relationship between the true and augmented margin of victory using the median EPA.

[Plot: augmenting NFL scores using median EPA is not the best idea]

If this plot looks weird, it should. By far the most common single-play value is 0.0 EPA, so for many games the median EPA is exactly 0.0. All the dots on the vertical line in the above plot are 0-median-EPA games adjusted by a 2 point home field advantage. Augmenting using the median is a bit silly. But using other measures of central tendency gives nice results:

[Plot: augmenting scores with the trimean is pretty good]

Notice that most games lie either in quadrant 1 above the red line or in quadrant 3 below the red line. This tells us that augmenting game scores tends to reduce the margin of victory. In other words, teams that win tend to win by less once we take the biggest plays out of the game.

We still don’t necessarily know that this is a good thing, but this plot does show how augmenting game scores changes things. Let’s turn to the most important question: should we be augmenting game scores in the first place?

Part 2: How Augmented NFL Scores Improve Model Performance

The proof is in the pudding. We can argue at length about whether or not removing big plays from a game’s final score is a good idea. But like most sports arguments, we would probably resort to impassioned emotional arguments and anecdotal evidence.

The real way to test whether or not augmenting NFL scores is a good idea is to see if it helps our models make better predictions. So, we tested this idea by using a very simple NFL model to see if our predictive accuracy increases.

The Model

One simple way to give NFL teams ratings is to infer them from game results. If the Packers beat the Bears by 6 points, this is evidence that their rating is about 6 points higher than the Bears’ rating.

After a few weeks, teams accumulate enough games that we can give every team a rating that reflects its margins of victory. Specifically, we want to assign ratings so that, across all games played, the difference between two teams’ ratings predicts the margin of victory as accurately as possible. This is a straightforward least squares problem that can be solved directly or with gradient methods.
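Here is a minimal sketch of that ratings fit, assuming each game is represented as a (home index, away index, home margin) tuple and a single shared home field advantage term. It solves the least squares problem directly with numpy rather than with gradient methods, which gives the same answer at this scale.

```python
import numpy as np

def fit_ratings(games, n_teams):
    """Least squares team ratings.

    games: list of (home_idx, away_idx, home_margin) tuples, where home_margin
    can be either the raw or the augmented margin of victory.
    Returns (ratings, home_field_advantage).
    """
    X = np.zeros((len(games), n_teams + 1))
    y = np.zeros(len(games))
    for row, (home, away, margin) in enumerate(games):
        X[row, home] = 1.0    # home team's rating counts positively
        X[row, away] = -1.0   # away team's rating counts negatively
        X[row, -1] = 1.0      # shared home field advantage term
        y[row] = margin
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ratings, hfa = beta[:-1], beta[-1]
    ratings -= ratings.mean()  # ratings are only defined up to a constant; center them
    return ratings, hfa
```

The model’s predicted spread for a future matchup is then ratings[home] - ratings[away] + hfa.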

If you look back at that setup, though, you’ll see that the margin of victory feeds directly into the model. What we want to do in the next section is compare whether or not augmenting NFL margins of victory leads to a more accurate model.

The Results

The key to testing NFL model accuracy is to use only data from previous weeks to predict the current week’s games. That is, if we’re trying to predict the outcomes of week 7 of the 2022-2023 NFL season, we should only use data from weeks 1-6 to make those predictions.
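A sketch of this walk-forward evaluation is below. It reuses the fit_ratings helper sketched above; the tuple layout and the choice to start predicting in week 4 are my assumptions for illustration rather than the exact setup used here.

```python
import numpy as np

def walk_forward(season_games, n_teams, start_week=4):
    """Week-by-week backtest: fit on earlier weeks only, then score the predictions.

    season_games: list of (week, home_idx, away_idx, fit_margin, actual_margin)
    tuples, where fit_margin is the (possibly augmented) margin fed to the model
    and actual_margin is the real final margin used for scoring.
    """
    correct, errors = 0, []
    for week in sorted({g[0] for g in season_games}):
        if week < start_week:
            continue  # skip the earliest weeks: too little data to fit ratings
        past = [(h, a, fm) for w, h, a, fm, _ in season_games if w < week]
        ratings, hfa = fit_ratings(past, n_teams)
        for w, home, away, _, actual in season_games:
            if w != week:
                continue
            spread = ratings[home] - ratings[away] + hfa   # predicted home margin
            correct += int(np.sign(spread) == np.sign(actual))
            errors.append(spread - actual)
    accuracy = correct / len(errors)
    rms_error = float(np.sqrt(np.mean(np.square(errors))))
    return accuracy, rms_error
```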

We did this in both the 2022-23 and 2023-24 seasons to see how our models did. The tables below show some quick summary stats for each season. The column and row headings can be interpreted as follows:

  • Accuracy: % of game winners predicted correctly, using only data from previous weeks
  • RMS prediction error: predict the margin of victory, then take the square root of the mean of the squared prediction errors (in points)
  • Regular game scores: feed the raw final score into the model
  • Augmented game scores (XYZ): feed the augmented final score into the model, using XYZ as the measure of central tendency

2022-2023

                                           Accuracy (%)    RMS prediction error (points)
Regular Game Scores                            58.9                14.3
Augmented Game Scores (Median)                 59.5                12.5
Augmented Game Scores (Trimean)                60.8                12.9
Augmented Game Scores (90% Truncated)          61.2                14.0
Augmented Game Scores (75% Truncated)          63.8                13.6

Here we see that not only does augmenting the final game score make us predict more games correctly, it also lets us predict the margin of victory a lot more accurately. To see how good some of these numbers are, take a look at the NFL prediction accuracy results on ThePredictionTracker.com for the 2022-2023 season. The very basic model we use here begins to get close to “state of the art” models.

What I find very interesting is that margin prediction and winner accuracy prefer different measures of central tendency. The median throws away a ton of data, but it is also the most conservative model: it predicts the smallest spreads and gives the best RMS prediction error. The truncated means stay closer to the actual margin of victory and do really well at picking winners straight up. We see a very similar story in the 2023-2024 data.

2023-2024

                                           Accuracy (%)    RMS prediction error (points)
Regular Game Scores                            57.4                16.2
Augmented Game Scores (Median)                 59.3                14.4
Augmented Game Scores (Trimean)                58.8                14.0
Augmented Game Scores (90% Truncated)          61.2                15.3
Augmented Game Scores (75% Truncated)          62.2                14.9

Again, the truncated means do much better at predicting winners, but the L-estimators (the median and trimean) do better at predicting the margin of victory.

Conclusions

No matter how you slice it, it is very hard to argue against augmenting game scores when building NFL models. Across two seasons of data, we showed that augmenting scores with several different measures of central tendency consistently improves the accuracy of our basic NFL model.