Reverse Engineering Strength of Record in College Football?

College football can be particularly divisive when fans debate the resume of one team against another's. One of the tools ESPN uses to compare teams in college football and basketball is strength of record, or SOR.

The naming similarity between strength of record and strength of schedule is not accidental. Strength of record goes one step further than strength of schedule by contextualizing how impressive a team's actual record is. It does this by looking at the opponents, not just the record. For example, a 5-5 record against all top 10 opponents is more impressive than a 5-5 record against middle-of-the-pack teams!

Strength of record tries to capture this difference in a precise mathematical way. In this article, we want to dig into SOR and understand how it works.

There is one big problem, though: strength of record is ESPN's proprietary information, and how it is calculated isn't public. However, by looking at other ESPN stats and the public-facing description of the statistic, and by drawing on experience as a mathematician and sports analyst, we're going to make some educated guesses about how it works.

In this article we will try to reverse engineer the formula for SOR. I'll approach the problem by looking at the definitions ESPN provides for strength of record and asking myself how I would design formulas and algorithms to match them. Then we'll talk a little about the pros and cons of strength of record as a metric.

Because we're reverse engineering a metric instead of explaining it from scratch, this will read quite differently from our other exploratory articles on QBR vs. Passer Rating, Usage Rate, and PER. At the end of the day, though, the goal is the same: to help you better understand how advanced stats and metrics work.



ESPN’s Definition of Strength of Record

There are two definitions on ESPN's website describing strength of record. Paraphrasing them:

  1. A measure of how difficult a team's W/L record is to achieve.
  2. The chance that an average top 25 team would have the team's record or better, given the schedule.

First, why do these make sense as methods for measuring team quality? Let’s start with the first one – how difficult the W/L record is to achieve. Ask yourself why we ever talk about a team’s strength of schedule in the first place.

Strength of schedule does not tell you how good a team is. It provides context for a team’s record. Just having a hard schedule doesn’t make a team good. They still need to win games. Strength of schedule is only a tool that, when paired with a team’s record, provides information about overall quality.

Strength of record tries to get there in one step. If team A has a harder schedule than team B and the two have the same record, team A is the better team. If team A's schedule is twice as hard as team B's but team A has one fewer win, team A might still be better! ESPN's first definition is all about combining record with strength of schedule to come up with a way to rank teams in college football and college basketball.

The second definition gets more into specifics, though. ESPN actually provides precise mathematical language for what the strength of record metric is meant to do. This precise mathematical language is a huge clue to how the metric is calculated.

There are two key phrases used. First, the metric talks about “average top 25 team”. Second, the definition talks about the chance or probability of attaining a given record. Let’s start looking more closely at these phrases. First, we’re going to give an arbitrary name to the “average top 25 team” for reference – how about something especially vague like Northern State University.

If we want to compare the strength of record of team A and team B, ESPN’s metric proposes to estimate what would happen if Northern State University (NSU) played team A’s schedule and team B’s schedule.

We start by computing the probability that NSU would match or beat team A's record when playing team A's schedule. We do the same for team B. If the probability of NSU outperforming A is larger than the probability of NSU outperforming B, that means team B was better. This is because team B's accomplishments are harder to replicate, which is reflected in the smaller probability. For example, if NSU matches A's record 35% of the time but matches B's only 12% of the time, B's resume is the stronger one.

On the other hand, if the probability of NSU outperforming A is smaller than the probability of outperforming B, this means team A's performance is harder to replicate. Therefore, team A is the better team. Either way, we end up with two probabilities that let us rank the teams.

Got it?

If not, read that again and again because this understanding is the basis for the work we’re going to do in the next section.

Summary of Results: How SOR is Calculated

Before digging into the explanation, I'm going to quickly give my best guess as to how this metric is computed. The next few sections will explain why I think the way I do.

Putting this in bullet points might make it more straightforward. I think SOR is calculated using the following steps (a code sketch after the list makes them concrete).

  1. Calculate the average Football Power Index (FPI) of the top 25 teams. This tells us the quality of “an average top 25 team”. For reference let us assume a fake school (NSU!) has this FPI score.
    • These FPI values give us probabilities of one team beating another team
  2. To compute the SOR for team A, simulate NSU playing against team A’s schedule some number (typically tens of thousands) of times.
    • This uses FPI to determine the probability of NSU winning in each game
  3. If team A won X games, count the percentage of times that NSU won at least X games in our Monte Carlo simulations.
  4. This percentage is the strength of record of team A.
  5. Do this for each team and sort; the lower the probability, the better the team.
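
To make these steps concrete, here is a minimal sketch in Python. To be clear, everything in it is my assumption, not ESPN's actual implementation: the conversion from FPI gap to win probability (a normal CDF with a standard deviation of about 13 points, a common rule of thumb for college football point spreads), the 20,000-simulation count, and all of the names are mine.

```python
import numpy as np
from scipy.stats import norm

# Assumed conversion: treat the FPI gap as an expected point spread and
# push it through a normal CDF. A spread SD of ~13 points is a common
# rule of thumb for college football; ESPN's true method is not public.
SPREAD_SD = 13.0

def win_probability(fpi_a, fpi_b):
    """P(the team rated fpi_a beats the team rated fpi_b) on a neutral field."""
    return norm.cdf(fpi_a - fpi_b, scale=SPREAD_SD)

def strength_of_record(team_wins, opponent_fpis, top25_fpis,
                       n_sims=20_000, seed=0):
    """Estimate P(an average top 25 team wins >= team_wins vs. this schedule)."""
    rng = np.random.default_rng(seed)
    nsu_fpi = np.mean(top25_fpis)        # Step 1: "NSU", the average top 25 team
    probs = np.array([win_probability(nsu_fpi, f) for f in opponent_fpis])
    # Steps 2-3: simulate NSU playing the full schedule n_sims times
    sim_wins = (rng.random((n_sims, probs.size)) < probs).sum(axis=1)
    return (sim_wins >= team_wins).mean()  # Step 4: the SOR probability
```

Sorting every team by this value, ascending, would then produce the SOR rankings (step 5).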

Let’s see exactly why I think this is how things are done.

Reverse Engineering Strength of Record

Reverse engineering a metric from its definition is largely the same as designing a metric from scratch. However, reverse engineering only has one right answer while designing from scratch can have many. In either case, the key is to ask what core mathematical object we need to measure.

Based on ESPN’s definition of SOR, our goal is to measure how likely it is for an average top 25 team to obtain a specific record against a specific schedule. Let’s break this down into its constituent parts.

Step 1: Identify Key Features/Eliminate Extraneous Information

We want to measure the likelihood of an average top 25 team obtaining a specific record against a specific schedule. At each step of designing a metric you must ask yourself: is there a straightforward way to do this?

At this point, the answer is no.

Whenever the answer is no, one must reduce the problem to something simpler that we might have a chance of building a metric for. Repeated reductions to simpler problems eventually lead to solutions. This is exactly the way mathematicians learn to think when proving theorems. A great book that teaches this intuition is the famous “How to Prove It” by Daniel Velleman. It is as close as anything comes to required reading for mathematicians and analysts.

This is the art of data science and sports analytics. You need a gut feeling for what is easy to measure and what must be abstracted away, and a sense of what the core of the problem is before you start peeling the onion.

When I thought about this problem, I made the following simplifications to get to a point where our metric was easy to define.

  1. [Hardest/Most Abstract] The probability of an average top 25 team obtaining a specific record against a specific schedule.
  2. The probability of a specific team obtaining a specific record against a specific schedule.
  3. [Easiest/Smallest Scope] The probability of a specific team winning in any individual game against a specific opponent.

Notice how at each step the problem gets simpler. Also notice how at each step, it is clear how the previous step is related. We started with a very difficult thing – strength of record – and ended up with something much simpler.

Step 2: Devise a Metric for the Core Issue

My gut instinct tells me that we've reduced the problem enough. Estimating “the probability of a specific team winning in any individual game against a specific opponent” is simple enough to start with. In fact, this is one of the most commonly studied problems in college football and, more generally, in every sport.

Take a look, for example, at this website, which links and tracks many, many different sites that predict the probability of one team winning against another in any given game. You could pick any of these methods to estimate the probability of one team beating another, and they would all lead to a reasonable metric. At the end of the day they are all estimates anyway, so no one of them should be dramatically better than another.

My best guess for SOR is that ESPN uses its own proprietary metric, the Football Power Index (FPI), to estimate the probability of any one team beating another.

FPI estimates how far above or below average a team is. Comparing two teams' FPI scores yields a predicted point spread for a hypothetical matchup. Taking into account home-field advantage and other factors, such as major injuries, then gives the probability of one team beating the other in a hypothetical college football matchup.
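
ESPN has not published the exact mapping from FPI gap to win probability. A common stand-in, and the one assumed in the sketch above, treats the gap as an expected point spread and converts it with a normal CDF. Here is a quick numeric example with made-up ratings:

```python
from scipy.stats import norm

# Illustrative ratings only -- not real FPI values.
fpi_a, fpi_b = 18.5, 4.0   # team A rates ~14.5 points better on a neutral field
home_edge = 2.5            # assumed home-field bump for team A

spread = (fpi_a + home_edge) - fpi_b     # predicted spread: 17.0 points
p_a_wins = norm.cdf(spread, scale=13.0)  # ~0.90 with a 13-point spread SD
print(f"Predicted spread: {spread:.1f}, P(A wins): {p_a_wins:.2f}")
```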

I think FPI is a pretty good guess for how ESPN estimates the probability of one team beating a specific other team. Next, we have to undo our simplification process and climb back up the abstraction hierarchy to get back to Strength of Record.

Step 3: Reverse the Simplification Process to Obtain a Metric

At this point we have solved the third problem posed in Step 1 of this reverse engineering game. We need to know how to get from problem 3 back to problem 2, and then from problem 2 back to problem 1. The first of these abstractions is extremely straightforward: the answer is Monte Carlo simulation.

Monte Carlo simulation refers to the process of repeating random events thousands (or millions!) of times in order to estimate probabilities and averages. If you look back at this page about FPI, you'll see that ESPN uses the phrase “20,000 simulations” – a telltale sign that they're using Monte Carlo methods.

(A quick aside to any current or aspiring data scientists or mathematicians: Monte Carlo is by far the best tool you can learn to bootstrap your abilities. I use it nearly constantly in my everyday life. This book is a pretty good introduction to how it works.)

In this case, we can use Monte Carlo simulation to model a team playing a specific schedule and get the probability of each number of wins. We can estimate the probability that a team gets 8 wins, 9 wins, 10 or more wins, etc. against that schedule. To do this, you simulate the season thousands of times and count the percentage of simulations in which each win total occurs.
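
As a small sketch of that counting step (the per-game win probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical per-game win probabilities for a 12-game schedule.
p = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70,
              0.65, 0.60, 0.55, 0.50, 0.40, 0.30])

# Each row is one simulated season; each entry is one game.
season_wins = (rng.random((100_000, p.size)) < p).sum(axis=1)
for k in (8, 9, 10):
    print(f"P(at least {k} wins) ~ {(season_wins >= k).mean():.3f}")
```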

The last question is how we go from problem 2 back to problem 1 – the transition from a specific team to an average top 25 team. There are lots of ways to accomplish this. Perhaps the simplest – and what I bet ESPN does – is to average the FPI scores of the top 25 teams. The simulations only require knowing the FPI of each team, and using the average top 25 FPI tells us how an average top 25 team would fare.

And we’re done. Really.

We started with the simple question – the probability of one specific team beating another in a single game – for which metrics already exist. Then we used standard mathematical and statistical techniques to build that simple metric up into what we actually want to compute. This is the general process for designing metrics, and I am fairly certain it is extremely close to what ESPN actually does.
