Ensemble Learning in Sports – Polls, Fantasy Rankings, and Vegas Lines

Everyone talks about models in sports, but far fewer people realize that ensemble learning in sports is something we have unwittingly been doing for years. Even a casual sports enthusiast knows that Vegas lines, coaches polls, and consensus fantasy football rankings are often exceptionally accurate at predicting winners, good basketball teams, and good fantasy players. Most people attribute this effect to the ‘wisdom of the masses’. But the more time you spend talking to ‘the masses’, the more you realize that the average sports bettor, college coach, or ESPN fantasy expert doesn’t really know that much.

It turns out that we can – in a quite satisfying way – talk about the accuracy of these things using language and ideas from machine learning. In particular, we’re going to highlight in this article that each of these three objects – college sports polls, fantasy rankings, and Vegas lines – can be described as a special instance of a broad idea known as ensemble learning.

Model Complexity

A model in its simplest sense is a way of converting from observable data to meaningful conclusions about the object we’re studying. For example, in the sports world we might like to convert from a team’s win-loss resume and their players’ stats into a prediction of who might win in a given matchup.

An important consideration with any model is its complexity. In the most general sense, a model’s complexity is a measure of how complicated it is to actually do the conversion from data to predictions. A model with low complexity could be ‘pick the team with the better record to win’ or ‘pick the home team to win’. If, on the other hand, we combined every box score of every game with historical trends before comparing our results to expert consensus, our model would be quite a bit more complex.

There is an interesting relationship between the complexity of your model and the accuracy of your model. Naively, one might expect that as model complexity grows unbounded, so too might predictive accuracy. However, this is not what we observe nor what we know to be true. If your model is too simple, you will fail to extract all possible information and nuance from the available data. If your model is too complex, you will find patterns and trends in the past that are coincidences and are not indicative of future trends. When a model is too complex, the model will sacrifice too much future predictive ability in order to explain the past data as well as possible.
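To make this tradeoff concrete, here is a small sketch in Python. The data, the linear ground truth, and the three models are all made up for illustration: a model that is too simple (always predict the average), one of reasonable complexity (a least-squares line), and one that is too complex (memorize the training set via nearest neighbor), compared on held-out data.

```python
import random

random.seed(0)

# Made-up data: a true linear relationship (y = 2x) plus noise.
def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Too simple: always predict the training mean, ignoring x entirely.
mean_y = sum(train_y) / len(train_y)
def too_simple(x):
    return mean_y

# Reasonable complexity: a least-squares line fit.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x
def linear(x):
    return slope * x + intercept

# Too complex: memorize the training set (1-nearest-neighbor),
# which chases the noise in individual training points.
def too_complex(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

print(mse(too_simple, test_x, test_y))   # largest error: underfits
print(mse(linear, test_x, test_y))       # smallest error: near the noise floor
print(mse(too_complex, test_x, test_y))  # worse than linear: overfits
```

On held-out data, the overly simple model misses the signal entirely, while the overly complex one pays a penalty for having memorized noise.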

One of the main challenges in machine learning is balancing these competing effects and finding the optimal model complexity. Enter ensemble learning.

Ensemble Learning in Sports

Ensemble learning is the idea of aggregating many weak (or, ‘not complex’) models (or ‘learners’) to form a single, highly accurate model. Because the basic building block of our model is inherently ‘weak’, there is little risk of overfitting; each learner tends not to find artificial patterns. And because we build many different weak learners, there are many more chances for an individual learner, or a subset of the learners, to pick up on subtle trends.
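A minimal simulation of this idea, assuming each weak learner picks the right winner only 60% of the time (an illustrative number) and the ensemble takes a majority vote:

```python
import random

random.seed(42)

# A weak learner: gets the true outcome right only 60% of the time.
# (The 60% figure is an illustrative assumption.)
def weak_prediction(truth):
    return truth if random.random() < 0.6 else 1 - truth

# The ensemble: poll many independent weak learners and take a majority vote.
def ensemble_prediction(truth, n_learners=25):
    votes = [weak_prediction(truth) for _ in range(n_learners)]
    return 1 if sum(votes) > n_learners / 2 else 0

# Compare accuracy over many simulated games with a 50/50 true outcome.
games = [random.randint(0, 1) for _ in range(2000)]
single_acc = sum(weak_prediction(g) == g for g in games) / len(games)
ensemble_acc = sum(ensemble_prediction(g) == g for g in games) / len(games)

print(single_acc)    # hovers around 0.60
print(ensemble_acc)  # well above any individual learner
```

No individual voter got any smarter; the majority vote alone pushes the accuracy well past 60%.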

Ensemble learning has a natural interpretation as the ‘wisdom of the masses’. The idea is much the same as in the previous paragraph: individuals are good – but generally not great – at making inferences from a set of data. Different individuals will place varying amounts of emphasis on different aspects of a data set and come to different conclusions.

For example, suppose you ask a group of people, ‘who is going to win the Super Bowl in 2020?’ You will get many different answers for many different reasons. Some might say the Steelers because they have the best record. Some might say the Chiefs because they won last year and are arguably better this year. Some might say the Packers, Saints, or Seahawks for various other reasons.

The point is – each of these people has an opinion and a prediction based on data. Each of them has their own model. They convert their observed data into a prediction based on their own interpretation. The people are the weak learners. If we can aggregate their opinions in some way, we have a way of performing ensemble learning in sports.

It turns out, this is already done quite a bit and we don’t even realize that it is happening. I am going to discuss – in increasing order of complexity – how we may interpret college polls, fantasy rankings, and Vegas lines as instances of ensemble learners. The basis for all of our conclusions is that people and their opinions can reasonably be interpreted as weak learners.

College Polls are Ensemble Learners

This is perhaps the simplest observation we make here. Polls are one of the main ways we determine who the best teams throughout college sports are. Because Division I athletics is so large, record doesn’t indicate quality as well as it does in professional sports. That is why polls, both the coaches poll and the AP poll, have been conducted for years.

How accurate are these polls? Reasonably accurate, actually. Here is a quick case study. In college football, most good teams’ seasons end with a bowl game. These bowl games are designed to match up teams of seemingly equal quality; the games should be as close to a toss-up as possible. According to this source, both the coaches and AP polls correctly predict winners at a better rate than a toss-up would suggest. That is, using polls actually is a reasonably accurate way of predicting winners even in tight contests.

The coaches poll and AP poll each rank the top 25 teams in both college basketball and football by aggregating individual rankings from either a group of college coaches (in, obviously, the coaches poll) or a group of sports writers from across the nation (in the AP poll). Each of the people who vote in these polls can be considered a weak learner. They have their biases, they focus on different things, and they might emphasize some games more than others (recency bias, for instance). But if we average all these opinions together, we get a pretty good picture of who the good teams, and the best teams, in college sports are. Polls in college sports are ensemble learners.
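Here is a toy sketch of that averaging, with made-up ballots. Real polls award points per ballot position (25 for first place, and so on), but a plain average of each team’s rank across ballots captures the same idea:

```python
# Made-up ballots: each voter ranks the same small set of teams,
# best team first.
ballots = [
    ["Alabama", "Georgia", "Ohio State", "Clemson"],
    ["Georgia", "Alabama", "Clemson", "Ohio State"],
    ["Alabama", "Ohio State", "Georgia", "Clemson"],
]

# Average each team's ballot position (lower is better),
# then sort by that average to produce the consensus poll.
teams = ballots[0]
avg_rank = {
    team: sum(b.index(team) + 1 for b in ballots) / len(ballots)
    for team in teams
}
consensus = sorted(teams, key=lambda t: avg_rank[t])

print(consensus)  # ['Alabama', 'Georgia', 'Ohio State', 'Clemson']
```

No single voter produced the consensus ordering; it emerges only from the average.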

Fantasy Rankings are Ensemble Learners

The second instance of ensemble learning I have seen in sports is in fantasy rankings. Anyone who has played fantasy football has observed the following phenomenon. When you form a new league, there is always at least one person who either auto-drafts their team or just picks the best available player in each round. Those of us who spend countless hours reading fantasy articles tend to get tilted, because that same guy wins an inordinate amount of the time.

The guys you were low on that he took at their ADP? They all had good years! The guys you reached for two rounds early? Yeah, turns out they weren’t that good. This phenomenon happens quite often. Auto-drafting works well. I have even verified this effect by simulating hundreds of thousands of drafts: average draft position is a reasonably strong barometer of how good of a fantasy asset a particular player is. It turns out, we can explain this level of accuracy by appealing, again, to an application of ensemble learning in sports.

ADP, or average draft position, is computed by taking the average of where a player was taken in every draft in a given year. Again, we appeal to the idea that humans can act as weak learners. When someone is preparing for a fantasy football draft, they study, ingest information, and prepare to make predictions about how the players should be ranked before the season starts. This is an organic machine learning algorithm. Computing the average draft position is an unweighted combination of all these weak learners in order to form a consensus of how good a player might be.
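The computation itself is trivial; the predictive power comes entirely from the aggregation. A sketch with made-up players and draft slots:

```python
# Made-up draft results: each list holds the pick number at which
# a player was taken in one draft.
draft_positions = {
    "Player A": [1, 2, 1, 3, 2],
    "Player B": [4, 3, 5, 2, 4],
    "Player C": [2, 5, 3, 6, 5],
}

# ADP is the unweighted average of each player's draft slots.
adp = {p: sum(picks) / len(picks) for p, picks in draft_positions.items()}

# Sorting by ADP yields the consensus ranking.
rankings = sorted(adp, key=adp.get)

print(adp)       # {'Player A': 1.8, 'Player B': 3.6, 'Player C': 4.2}
print(rankings)  # ['Player A', 'Player B', 'Player C']
```

Each individual drafter is a noisy signal; the average washes the noise out.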

Outside of ADP, we can find many other instances of ensemble learning throughout fantasy sports. As just one example, Boris Chen’s fantasy site forms an ensemble model by aggregating many experts’ rankings. Again, the wisdom of the masses idea is present throughout sports.

Vegas Lines are Ensemble Learners

This last instance of ensemble learning in sports is in the accuracy of Vegas lines. Vegas lines are notoriously difficult for the average-Joe, everyday bettor to beat consistently. Many people do make a living out of sports betting, but for most others sports betting is just a hobby that tends to lose money over time. This means that Vegas lines are fairly good predictors of what is actually going to happen. This effect can be explained by viewing an individual bettor as a weak learner and the Vegas line as an (inherently stronger) ensemble learner. How can we formalize the idea that Vegas lines are ensemble methods?

Initial Vegas lines are set, people bet, the line moves, and we end up with a closing line. The way the line moves can be thought of as the line ‘aggregating’ individual predictions – an ensemble model. In this setting, though, the actual predictions made by the weak learners are a bit less precise than in our previous two examples.

Suppose a Vegas line has team A as a 6.5 point favorite. If I bet on team A, my prediction – the output of the weak learner that is myself – is that team A will win by more than 6.5 points. In reality, lines move so that Vegas has roughly the same payout on both sides, but that movement can be thought of as the line adjusting to the increased information of who bet on which side. Put another way: the Vegas line moves and updates to reflect how many people bet on each side. Once roughly the same number of people are betting on either side of the line, the line stabilizes. The closing line, then, reflects all the bets; it reflects all the predictions made by the weak learners. The closing line is an aggregate of all the individual people who have bet on a game. The closing line is an ensemble learner.
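We can sketch this stabilizing process as a toy market simulation. This is not how sportsbooks actually set lines; it just assumes each bettor holds a private prediction of the favorite’s winning margin (drawn from a made-up distribution), bets the favorite whenever that prediction exceeds the current line, and the book nudges the line a fixed step toward whichever side the money is coming in on:

```python
import random
import statistics

random.seed(7)

# Made-up crowd: each bettor privately predicts the favorite's winning
# margin. The 7-point mean and 3-point spread are illustrative numbers.
predictions = [random.gauss(7, 3) for _ in range(5000)]

line = 4.0   # opening line, deliberately set too low
step = 0.01  # how far each bet nudges the line

for margin in predictions:
    if margin > line:
        line += step  # bet on the favorite to cover: the spread grows
    else:
        line -= step  # bet on the underdog: the spread shrinks

# The line settles where the bets split roughly 50/50, i.e. near the
# median of the crowd's predictions.
print(round(line, 1))
print(round(statistics.median(predictions), 1))
```

Under these assumptions the closing line lands near the median of the crowd’s predictions, which is exactly the ‘aggregate of weak learners’ interpretation.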
