How to Measure and Compute Strength of Schedule

When somebody asks me what a mathematician actually does all day, I often bring up strength of schedule in sports. Strength of schedule is one of those things that is “intuitive” but is actually extremely hard to measure. One of the best skills mathematicians provide is understanding the core of an issue and defining metrics to answer questions.

Strength of schedule (SOS) is a classic example of this problem.

Everyone knows what strength of schedule should mean. If team A and team B have the same record but team A’s strength of schedule was harder, then team A “did better” than team B. To put that another way, team A’s wins meant more and their record indicates this because it was earned against better teams.

In this article I want to explain exactly why strength of schedule is so hard to define. We’ll look at a few example definitions. Next, as mathematicians are often wont to do, we’ll provide some counterexamples that show why each of these definitions might not work how we want them to.

We study how to measure and compute strength of schedule.

Contents

1 Why Strength of Schedule is Hard to Measure

1.1 Potential Definition 1

1.2 Potential Definition 2

1.3 Potential Definition 3

2 How Strength of Schedule is Calculated Today

3 Strength of Schedule in Different Sports

4 How TDJ Might Calculate SOS in the Future


Why Strength of Schedule is Hard to Measure

Let’s start by talking about why strength of schedule is so difficult to measure. The chief problem in coming up with metrics is distilling what we actually want to measure. So ask yourself the question: what should strength of schedule actually measure? If a specific team has a harder strength of schedule, what precisely does that mean?

I bet some of you out there rolled your eyes and said “SOS should measure whose schedule was harder”. The problem with this is when you take that a step further. What does harder mean? Metrics have to be specific and measurable. One thing mathematicians do all day is define metrics to quantify and measure intuitive concepts.

This might seem like an ethereal, philosophical argument, but it is crucially important. What does harder mean? I’m going to go through a few examples of what harder could mean and show why this is a subtle concept.

Potential Definition 1

“A harder strength of schedule means that your opponents were better on average”

What do you mean by “were better on average”? Suppose team A played the best team 3 times and the worst team 3 times. Team B, on the other hand, played 6 middle-of-the-pack teams. Whose schedule was harder?

Maybe you’ll say they were the same. But team A going undefeated would be more impressive than team B going undefeated because they beat the best team 3 times while team B only defeated average teams! Beating Alabama 3 times is much more impressive than any number of wins against Tulsa. This must mean team A is better! Therefore, team A must have had the harder schedule!

On the other hand, what if both these teams lose all their games? Now team B lost 6 times to average teams while team A lost 3 times to the worst team in the league! Of course that is worse for team A; team B is the better team. Because they had the same record, that means that team B had the harder schedule!

See why this is difficult? In the two cases above, the schedules are the same, but we arrived at opposite conclusions about whose schedule was harder.
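To make this concrete, here is a toy model. It is only a sketch: it assumes every team can be summarized by a single rating and that win probability follows a logistic curve in the rating difference. The ratings and the `win_prob` function are illustrative assumptions of mine, not part of the argument above.

```python
import math

def win_prob(team_rating, opp_rating):
    """Assumed logistic model: chance that the team beats the opponent."""
    return 1.0 / (1.0 + math.exp(-(team_rating - opp_rating)))

def expected_wins(team_rating, schedule):
    """Expected number of wins against a list of opponent ratings."""
    return sum(win_prob(team_rating, opp) for opp in schedule)

# Hypothetical ratings: best team = +2, worst team = -2, average team = 0.
schedule_a = [2, 2, 2, -2, -2, -2]   # best team 3 times, worst team 3 times
schedule_b = [0, 0, 0, 0, 0, 0]      # six middle-of-the-pack teams

for label, rating in [("strong team", 2), ("average team", 0), ("weak team", -2)]:
    a = expected_wins(rating, schedule_a)
    b = expected_wins(rating, schedule_b)
    print(f"{label}: schedule A {a:.2f} wins, schedule B {b:.2f} wins")
```

Both schedules have the same average opponent rating. Under this model, though, the strong team expects fewer wins against schedule A while the weak team expects fewer wins against schedule B. Which schedule is “harder” depends on who is playing it.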

Clearly a vague measure like “whose opponents were better” can lead to contradictory results. What if we get more specific?

Potential Definition 2

“A harder strength of schedule means that my opponents won more games in total than your opponents did.”

This definition doesn’t work because it relies on your opponents’ strengths of schedule too! This is the “RPI” problem in college basketball (to read more, check out this article and ctrl+f for RPI).

A good example to see why this won’t work is to look back to the days when there was no interleague play in Major League Baseball. The American League and the National League existed largely independently of one another until the World Series.

What would happen if you computed the strengths of schedule of the AL/NL champions by adding up the total wins of their opponents? Because they played everyone in their respective leagues, the numbers would be nearly identical. However, this doesn’t mean that the American League and the National League were of the same quality.

Adding up your opponents’ wins only works if your opponents themselves played comparably difficult strengths of schedule. In some sense, this is a self-referential problem. We need to know our opponents’ strengths of schedule in order to compute our own strength of schedule.
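Here is a small sketch of the baseball example. The two four-team leagues, their strength numbers, and the rule that the stronger team always wins are all hypothetical simplifications, chosen only to make the arithmetic obvious.

```python
from itertools import combinations

def play_league(strengths, rounds=2):
    """Return (winner, loser) pairs; the stronger team wins every game (toy assumption)."""
    games = []
    for a, b in combinations(strengths, 2):
        winner, loser = (a, b) if strengths[a] > strengths[b] else (b, a)
        games.extend([(winner, loser)] * rounds)
    return games

def opponents_total_wins(team, games):
    """Sum of season win totals over the distinct opponents that `team` faced."""
    wins = {}
    opponents = set()
    for winner, loser in games:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        if team == winner:
            opponents.add(loser)
        elif team == loser:
            opponents.add(winner)
    return sum(wins[opp] for opp in opponents)

# Hypothetical leagues that never play each other; the AL is far stronger.
al = {"AL1": 100, "AL2": 90, "AL3": 80, "AL4": 70}
nl = {"NL1": 40, "NL2": 30, "NL3": 20, "NL4": 10}

al_sos = opponents_total_wins("AL1", play_league(al))
nl_sos = opponents_total_wins("NL1", play_league(nl))
print(al_sos, nl_sos)
```

The two champions get the same opponents’ win total even though every AL team is far stronger than every NL team. Wins inside a closed league always add up to the number of games played, so this count cannot see differences between leagues.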

Things are getting fishy. Let’s try a third definition.

Potential Definition 3

“Team A having a harder strength of schedule than team B means that if they switched schedules, team A’s record would improve and team B’s record would worsen.”

Nice, this is very specific. Good metrics should be overwhelmingly, painfully, and exhaustingly specific in their definitions. And, by all accounts, it is a pretty good definition.

When people use strength of schedule they are using it to compare the quality of two teams with otherwise similar records. However, even this definition has some shortcomings.

First, how can we tell if a team’s record would be better or worse if they played a different schedule? The only thing we know is how well a team did against their own schedule. Knowing how well a team would do against a different schedule requires an estimate of each team’s overall quality. This is kind of the point of trying to measure strength of schedule in the first place.

Second, this idea works well when we’re debating between two teams. But what happens when there are three teams? We can compare each pair of them separately, but what happens if we get conflicting results?

Suppose teams A, B, and C have the same record and we can compare them by switching schedules. What if our results say A>B, B>C, but C>A? How do we resolve this issue? Is this even possible? The mathematically inclined reader might notice I’m talking about the transitive property. If we come up with a way to measure strengths of schedule, we would want it to be transitive.
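If we had the results of such pairwise comparisons, checking them for this kind of conflict is straightforward. The sketch below assumes the comparisons are stored as a set of `(X, Y)` pairs meaning “X’s schedule was judged harder than Y’s”. That representation and the `find_cycles` helper are hypothetical.

```python
from itertools import permutations

def find_cycles(harder_than):
    """Return triples (x, y, z) with x > y, y > z and z > x in the pairwise results."""
    teams = {t for pair in harder_than for t in pair}
    cycles = []
    for x, y, z in permutations(sorted(teams), 3):
        # Report each cycle once, starting from its alphabetically first team.
        if x < y and x < z and {(x, y), (y, z), (z, x)} <= harder_than:
            cycles.append((x, y, z))
    return cycles

# Hypothetical outcomes of three pairwise schedule swaps.
conflicting = {("A", "B"), ("B", "C"), ("C", "A")}
consistent = {("A", "B"), ("B", "C"), ("A", "C")}

print(find_cycles(conflicting))  # [('A', 'B', 'C')]
print(find_cycles(consistent))   # []
```

Any definition that assigns each schedule a single number avoids this problem automatically, because numbers are always ordered consistently. A definition built only on pairwise comparisons has no such guarantee.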

How Strength of Schedule is Calculated Today

So how is strength of schedule measured today? I’ve argued that it is really difficult to compute, but strength of schedule is a critical metric used in many sports.

For example, the March Madness selection committee has, in the past, used RPI (a measure of strength of schedule!) to select the tournament field. In college football, the selection committee explicitly includes strength of schedule in their playoff rankings. Clearly, strength of schedule matters. How is strength of schedule computed, then?

Let’s start with college basketball. Until 2018, RPI was used as a tool to select the tournament field. The formula for RPI is $RPI = 0.25 \cdot WP + 0.5 \cdot OWP + 0.25 \cdot OOWP$ where WP is a team’s winning percentage, OWP is the winning percentage of their opponents, and OOWP is the winning percentage of all their opponents’ opponents.
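As a rough sketch, the formula can be computed from a list of `(winner, loser)` results. This follows the plain description above and is a simplification: adjustments used in practice, such as excluding head-to-head games from OWP or weighting home and away results, are left out. The function names and the sample games are made up for illustration.

```python
from collections import defaultdict

def build_records(games):
    """Map each team to its win count and to its opponents (one entry per game)."""
    wins = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)
    return wins, opponents

def rpi(team, games):
    """Simplified RPI = 0.25*WP + 0.5*OWP + 0.25*OOWP."""
    wins, opponents = build_records(games)

    def wp(t):    # the team's own winning percentage
        return wins[t] / len(opponents[t])

    def owp(t):   # average winning percentage of the team's opponents
        return sum(wp(o) for o in opponents[t]) / len(opponents[t])

    def oowp(t):  # average OWP of the team's opponents
        return sum(owp(o) for o in opponents[t]) / len(opponents[t])

    return 0.25 * wp(team) + 0.5 * owp(team) + 0.25 * oowp(team)

# Hypothetical results, each written as (winner, loser).
games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]
for team in "ABCD":
    print(team, round(rpi(team, games), 4))
```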

The idea behind RPI is that (a) you need to win a lot of games yourself, (b) your opponents need to win a lot of games, and (c) to control for your opponents potentially having easy schedules, your opponents’ opponents need to win a lot of games. Looking at your own record is a “first order” estimate of how good you are. Looking at how good your opponents were and how good your opponents’ opponents were is a higher order approximation. Seems like a good idea, right?

It is a bad system. This is why the selection committee replaced RPI with NET ratings. NET ratings are a way of indirectly measuring strength of schedule and are nearly identical to our own metric, Ensemble Ratings.

The college football playoffs are even worse. The committee is meant to take into account the strength of schedule of the candidate teams. However, no methodology is specified. This leaves room for bad methodology. This is especially true when you consider the following quote directly from the CFP website: “Nuanced mathematical formulas ignore some teams that ‘deserve’ [emphasis theirs] to be selected”.

Translating for everyone out there, “we would rather pick our favorite and most profitable teams to make the playoffs than teams that are actually better”. Math is not to be trusted, of course, because everyone knows that statistics lie!

Sigh.

Strength of Schedule in Different Sports

Calculating strength of schedule is incredibly important in some sports but not nearly as important in others. In MLB and the NBA, hardly anyone talks about strength of schedule. In the NFL, it is mildly more important. In college sports, it is of paramount importance.

Why is this?

In baseball and professional basketball, the seasons are sufficiently long that it doesn’t matter. If the season is sufficiently long, it means that each individual team will basically play all the other teams. So while everyone doesn’t play the same exact schedule, the schedules are pretty comparable. That means record by itself is a pretty good indicator of quality.

In the NFL, each team plays roughly half the league and will necessarily play a mix of good and bad teams. So while a division winner will have a harder schedule than a team that finished in last place last year, the difference isn’t that huge. It matters, but not much more than a game or two in a team’s record.

However, in college sports, strengths of schedule can vary much more wildly. In football, a team from one of the group-of-six conferences might not play a single top 25 team all year, while a team from the SEC might have to play 6, 7, or 8 of them in the regular season. This means that a team’s record is much less indicative of their overall quality. Strength of schedule is a crucial piece of the puzzle.

In general, you can tell how important discussions of strength of schedule are in a given league by looking at (a) how many games are played, (b) how big the league is, and (c) how representative the schedules are of the entire league.
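One crude way to put a number on (b) and (c) together is the fraction of the rest of the league that a team actually faces. The `schedule_coverage` helper is my own hypothetical proxy, and the league sizes and opponent counts below are approximate, illustrative inputs rather than official figures.

```python
def schedule_coverage(distinct_opponents, league_size):
    """Fraction of the rest of the league that a team actually plays."""
    return distinct_opponents / (league_size - 1)

# Approximate, illustrative inputs: (distinct opponents, teams in the league).
leagues = {
    "NBA": (29, 30),
    "NFL": (14, 32),
    "FBS college football": (12, 130),
}

for name, (opps, size) in leagues.items():
    print(f"{name}: plays about {schedule_coverage(opps, size):.0%} of the league")
```

The lower this fraction, the less a raw record tells you by itself, and the more the strength-of-schedule conversation matters.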

SOS is more important in college football than in college basketball because something like 75% of a football team’s schedule is played within its own conference. In college basketball, this number is closer to half, which leaves more room for out-of-conference games that provide data points for measuring overall quality.

How TDJ Might Calculate SOS in the Future

Personally, I don’t think strength of schedule is a valuable metric to compute at all. SOS is not important on its own. Rather, it is only valuable as context for interpreting a team’s record. Instead of looking at the strength of schedule explicitly, I think it is better to come up with team-quality metrics that explicitly take into account the quality of opponents.
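Elo is probably the most familiar rating of this kind, so here is a minimal sketch of its standard update rule. The K-factor of 32 and the starting ratings are common illustrative choices, not values taken from any particular rating system.

```python
def elo_expected(rating_a, rating_b):
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(winner, loser, k=32):
    """Return updated (winner, loser) ratings after one game with no ties."""
    expected = elo_expected(winner, loser)
    delta = k * (1.0 - expected)  # bigger reward for a less expected win
    return winner + delta, loser - delta

print(elo_update(1500, 1500))  # evenly matched teams
print(elo_update(1500, 1700))  # upset: the lower-rated team wins
```

The quality of the opponent is baked into every update: beating a higher-rated team moves your rating more than beating a lower-rated one, so no separate strength-of-schedule adjustment is needed.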

Things like Elo, our Ensemble ratings, NET ratings, and other similar metrics are things we would recommend instead of measuring strength of schedule directly. However, if our arm were twisted and we were forced to measure strength of schedule….

(Warning for some fun [for me] math ahead)

I think that there is a graph-theoretic approach that could work, and I hope to explore it in the future. Measuring strength of schedule requires consideration of the overall structure of “who played whom” and how well mixed the schedules of the leagues are.

This is a natural problem for graph theory, which models a group of objects (in our case, the teams in the league) and relationships between pairs of them (the outcome of a game played between a pair of teams).

The tools and language of graph theory allow extraction of global structure that takes into account all the local information simultaneously. Perhaps algorithms for calculating sources and sinks in weighted (multi)graphs will be helpful. Perhaps the language of connectivity of a graph will help us measure how confident we are in a team’s strength of schedule.
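I don’t have the full formulation yet, but the connectivity idea is easy to illustrate. The sketch below builds the “who played whom” graph from a hypothetical list of games and finds its connected components with a breadth-first search. It only shows the kind of structure I have in mind and is not a strength-of-schedule metric.

```python
from collections import defaultdict, deque

def connected_components(games):
    """Group teams into components of the 'who played whom' graph."""
    neighbors = defaultdict(set)
    for a, b in games:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            team = queue.popleft()
            component.add(team)
            for nxt in neighbors[team]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical schedule with no interleague play.
games = [("AL1", "AL2"), ("AL2", "AL3"), ("NL1", "NL2"), ("NL2", "NL3")]
print(len(connected_components(games)))                     # 2
print(len(connected_components(games + [("AL1", "NL1")])))  # 1
```

With two components there is no chain of games linking the leagues, so no schedule-based metric can compare teams across them. A single interleague game connects the graph, but only barely, which is where a finer notion of connectivity might tell us how much confidence a comparison deserves.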

I’m not sure exactly the correct formulation, but I think something interesting can be said. Moreover, I don’t think this is merely an academic exercise. I think that this approach can provide the optimal solution for calculating strength of schedule directly. I hope to be able to present this approach in the future.