What are Elo Ratings in Sports Analytics and Gaming?

From its classical application in professional chess, to FiveThirtyEight’s basketball and baseball models, to online esports and games like League of Legends, the Elo rating system is ubiquitous. This raises some natural questions: What is Elo? How does Elo work in sports analytics? How does Elo work in gaming? And why do Elo ratings work well in such drastically different applications?

In this article we’ll help you understand the Elo rating system. Just like anything in sports analytics, understanding Elo ratings requires three things:

An intuitive explanation
A robust theoretical mathematical backing
Clear and compelling examples

We’ll try to include all three so you never have to ask again “What are Elo Ratings?”

A History of Applications of Elo Ratings

I am compelled, as always, to talk first about the name Elo. I, myself, was guilty of assuming that Elo was an acronym standing for something (like Electric Light Orchestra, maybe). As a result, I often wrote the system as ELO ratings instead of the correct form, Elo ratings.

Elo is actually named after a Physics Professor named Arpad Elo. Therefore, the rating system might have been better-named “Elo’s rating system” instead of the “Elo rating system”.

Arpad Elo first developed Elo to rate chess players. The point was to assign every player a number so that somebody’s strength could be inferred before you even played them. This led to better tournament seeding and fairer matches in club play, for example. In fact, it worked so well that Elo ratings entered into other realms.

Notably for fans of sports analytics, the website FiveThirtyEight uses Elo-like ratings to make most of their major predictions. In fact, one of their flagship models was called “Carmelo”, which was perhaps better stylized “carmElo”. FiveThirtyEight also uses Elo ratings with quarterback and pitcher adjustments to predict championship odds and individual game winners in both football and baseball.

The esports world is one of the latest to adopt Elo rating systems. For those not in the know, esports is the collective name for online, competitive video games like StarCraft, League of Legends, and Counter-Strike. Whether or not you want to call these sports, their inherently competitive nature, significant training hours, and large payouts make the comparison valid in my eyes. In many of these online games, Elo rating systems determine rankings so that the best players can be recognized.

Elo rating systems have become more and more commonplace simply because they work and they’re extremely versatile. In the next section we’ll give a short, intuitive explanation for how and why Elo ratings work.

An Intuitive Approach to Elo Ratings

An Elo rating is a single number that is meant to represent the overall quality of a player or a team. A higher Elo rating means the player or team is stronger; a lower Elo rating means they are weaker.

When two competitors go head-to-head, the team with the higher Elo rating is favored and should be expected to win more often than not. The larger the gap between the two competitors, the more lopsided the matchup should be. Put another way, as the gap grows, the probability that the higher-Elo team wins grows larger as well.

At the end of the day, Elo ratings are meant to be predictive of future matchups between competitors. But the real art and beauty of Elo ratings is in how they are determined. Part of the reason sports are played in the first place is that we can’t really know who will win on any given Sunday. So, how could we possibly come up with ratings that predict winners in an efficient manner?

Elo ratings are determined by looking at the results of already completed games and matchups. A team that wins lots of games against good opponents should have a high Elo rating. A team that loses to bad opponents should have a low Elo rating. This still doesn’t tell us how to actually compute the ratings.

The easiest way to understand how to compute an Elo rating is that when two teams play, the winner “steals” some of the loser’s points. Suppose, for example, that team A is rated 1600 and team B is rated 1550. If team A beats team B, then team A’s rating will increase by a small amount, maybe 5 points. Consequently, team B’s rating should decrease by 5 points. At the end of the day, the total number of points didn’t change; the points just moved from team B to team A.

Notice that the more a team wins, the more points they will steal. This naturally causes the good teams to have high ratings and the bad teams to have low ratings.

The Mathematics Behind an Elo Rating: Expected Score

Floating in the background behind Elo ratings are normal distributions, parameter estimation, and logistic curves. The central tenet is this: each player or team’s performance in a given matchup is approximately normally distributed. Sometimes they’ll play above their level, sometimes at their level, and sometimes they’ll just plain stink. Even if one team is better, that doesn’t mean they are 100% guaranteed to win. However, the larger the skill gap, the closer the probability of the favorite winning gets to 100%.

The key concept to understand related to Elo ratings is the idea of expected score. It may seem complicated, but in a win-loss game, a player’s expected score is simply the probability that they win the matchup. In fact, the probability of winning in a given matchup can be computed very simply. If player A’s rating is $R_A$ and player B’s rating is $R_B$, then player A’s expected score is $\frac{1}{1+10^{(R_B-R_A)/400}}$.

Here’s an example. If player A is rated 400 points higher than player B, then $(R_B-R_A)/400 = -1$ and $10^{-1} = 0.1$. Therefore, the probability that player A wins against player B is $\frac{1}{1+0.1}\approx 91\%$. Equivalently, player A’s expected score is 0.91. Player B’s winning probability is the complement of this, just 9%. You can check for yourself that as A’s rating grows larger compared to B’s, the expected score approaches 1 for player A and 0 for player B.
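The expected score calculation can be sketched in a few lines of Python. This is a minimal illustration, not any particular site’s implementation; the function name `expected_score` and the `scale` parameter are names chosen here for the example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B (logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# A 400-point favorite, the corresponding underdog, and an even matchup.
print(round(expected_score(1600, 1200), 3))  # 0.909
print(round(expected_score(1200, 1600), 3))  # 0.091
print(expected_score(1500, 1500))            # 0.5
```

Note that the two players’ expected scores always add up to 1, which is why player B’s chances are just the complement of player A’s.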

The choice of the normalizing factor of “400” that shows up is entirely arbitrary. Different Elo rating systems can use different numbers. The net effect is that the range of ratings will be different.

The Mathematics Behind the Elo Rating System: Updating Ratings

The more interesting part of Elo ratings is how your rating changes in response to results. Comparing the outcome of a game (1 for a win, 0 for a loss) to the expected score tells us to what extent a player over- or under-performed relative to their expectations. For example, if player A is expected to win 80% of the time and ends up winning, then they have over-performed by $1-0.8 = 0.2$ points. The other player, even though they were the underdog, still underperformed relative to expectations. They scored $0 - 0.2 = -0.2$ points relative to expectation.

In the same example, if the underdog had won, then they would have performed $1-0.2=0.8$ points above expectation. Notice that the underdog winning gets a better “relative to expectation” score than the favorite winning.

Finally, a player’s Elo rating change is calculated by multiplying their “performance relative to expectation” number by a scaling constant, typically denoted $K$. In symbols, if $S_A$ is player A’s actual score and $E_A$ is their expected score, the new rating is $R_A + K(S_A - E_A)$. For example, in chess there are many scaling constants used, but $K=20$ is common.
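Putting the expected score and the scaling constant together, a single rating update might look like the following sketch. The function names and the default of `k=20.0` are assumptions made for illustration, and the example ratings are the 1600 and 1550 teams from the intuitive explanation earlier.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B (logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 20.0) -> tuple[float, float]:
    """Return the new (A, B) ratings after one game.

    score_a is 1 if A won and 0 if A lost.
    """
    # How far A's result landed above or below expectation, scaled by K.
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total number of points is conserved.
    return rating_a + delta, rating_b - delta


new_a, new_b = update_ratings(1600, 1550, score_a=1)
print(round(new_a, 1), round(new_b, 1))
```

With these numbers the favorite gains roughly 8.6 points for the win and the underdog loses the same amount. Had the underdog won instead, the transfer would have been larger, because an upset beats expectation by a wider margin.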

The graphic below summarizes how the outcome of a match and the two competitors’ ratings combine to yield ratings gains and losses for the players.

Why Elo Ratings in Chess and Gaming are Difficult to Use

The hardest part of an Elo rating system to get right is the scaling constant that determines how many points are gained. A larger constant means more points are won and lost as a result of each individual game. A smaller constant means fewer points are transferred between the players. Choosing the correct constant is a balancing act.

If the constant is too small, then player ratings cannot change very quickly. This means that a great many games are required for a player’s rating to reflect their true skill. This is especially a problem when players are either (a) new and assigned a provisional rating which may not reflect their skill, or (b) quickly getting better or worse at the game. Moreover, in games and sports where it is difficult to play a large number of games (for example, American football), this is an even bigger problem.

On the other hand, if the constant is chosen too large, then the variance in a player’s rating will become too large. Consider what happens in an extreme case: you can’t tell if a low-rated player is actually low-skill or if they simply lost one or two previous games and now their rating is extremely low as a result. To say this another way, when the scaling constant is too large, the temporal correlation between rating and recent results is also large. More recent games can have an unduly large impact on a player’s Elo rating.
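A small simulation can make this trade-off concrete. The sketch below rests on several simplifying assumptions that are not from any real rating system: a player whose true strength is 1700 starts at a provisional 1500, every opponent is rated exactly 1500, and opponents’ ratings are never updated. The function name `simulate` and all of its parameters are hypothetical.

```python
import random
import statistics


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B (logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def simulate(k: float, games: int, true_rating: float = 1700.0,
             start_rating: float = 1500.0, opponent_rating: float = 1500.0,
             seed: int = 0) -> list[float]:
    """Track one player's rating over many games against a fixed opponent."""
    rng = random.Random(seed)
    # Results are driven by the player's true skill, not their current rating.
    win_prob = expected_score(true_rating, opponent_rating)
    rating = start_rating
    history = [rating]
    for _ in range(games):
        score = 1.0 if rng.random() < win_prob else 0.0
        rating += k * (score - expected_score(rating, opponent_rating))
        history.append(rating)
    return history


for k in (4.0, 40.0):
    history = simulate(k, games=3000)
    tail = history[-1000:]  # long after the provisional rating has washed out
    print(f"K={k:>4}: after 30 games {history[30]:7.1f}, "
          f"late-run mean {statistics.mean(tail):7.1f}, "
          f"late-run spread {statistics.pstdev(tail):5.1f}")
```

With a small constant, the rating can move at most `k` points per game, so it is still far from 1700 after 30 games. With a large constant, the rating gets near 1700 quickly but keeps bouncing around it, which shows up as a larger late-run spread.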

Designing an appropriate Elo rating system requires balancing these competing concerns.

Why Elo Ratings in Traditional Sports are Difficult to Use

Aside from the difficulties in designing an Elo system from a theoretical math perspective, there are also difficulties in applying Elo ratings to traditional physical sports. The biggest reason is that a team’s quality may vary significantly from game to game. While Elo is designed to accommodate normal variance in quality from game to game, some sports break the mold.

Let’s use the 2021 Baltimore Ravens as an example. The team started the season very well: an 8-4 record in their first 12 games. By all accounts, they were one of the top teams in the league. Then, their quarterback Lamar Jackson got hurt. They went on to lose the rest of the games on their schedule.

If we blindly used Elo ratings, we would inaccurately have thought that the Ravens were better than they actually were going into week 13. The information that Lamar Jackson got injured wouldn’t be incorporated into predicting their week 13 score. This is because Elo ratings only use previously played games. As a result, Elo ratings can be inefficient in their use of information. They require games to be played to infer team quality after the fact. They struggle to take into account external factors.

Another example of this effect is in baseball. A team’s overall quality changes from game to game based on who the starting pitcher is. Therefore, traditional Elo ratings cannot fully capture a team’s overall quality, because that quality is not constant.

Some outlets counteract this effect using adjustments. For example, FiveThirtyEight’s rating system includes a “quarterback adjustment” which changes a team’s overall Elo rating based on who their starting quarterback is. In football this makes sense because the quarterback position is by far the most important.
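One simple way to picture such an adjustment is as a number of points added to or subtracted from the team’s base rating before the expected score is computed. The sketch below is only an illustration of that idea and is not FiveThirtyEight’s actual method; the function `adjusted_rating` and every number in it are invented for the example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of team A against team B (logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def adjusted_rating(team_rating: float, starter_adjustment: float) -> float:
    """Shift a team's base rating by a (hypothetical) per-starter adjustment."""
    return team_rating + starter_adjustment


# Hypothetical numbers, invented purely for illustration.
team_base = 1600.0
opponent = 1550.0
backup_qb_adjustment = -80.0  # assumed drop-off when the backup starts

with_starter = expected_score(team_base, opponent)
with_backup = expected_score(adjusted_rating(team_base, backup_qb_adjustment), opponent)
print(f"Win probability with the usual starter: {with_starter:.2f}")
print(f"Win probability with the backup:        {with_backup:.2f}")
```

In this made-up example the adjustment flips the team from a modest favorite to a modest underdog before a single game has been played without the starter, which is exactly the information a plain Elo rating would have missed.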

Final Thoughts

There is a reason that Elo ratings are used so much: they are simple and they work. Moreover, for the people playing the game Elo ratings can be very satisfying because they can see themselves improve in real time.

Elo ratings are also interesting from a mathematical and statistical perspective; I hope to be able to write in the future about parameter estimation, Cramér-Rao lower bounds, and how good the Elo rating system is at accomplishing the goal it sets out to do.