How to Pick March Madness Upsets

Everyone says they love making a bracket, but really what everyone loves is picking the right March Madness upsets. Be honest, that is the fun part. Nobody brags to their friends about correctly picking the third seed to advance to the Sweet 16. But you can bet that I still tell my friends that I screamed loud and clear that Abilene Christian was going to beat Texas last year.

What goes into picking the right March Madness upsets? Is there a secret formula? Something we should be looking for? If I shake the data enough, will the good teams separate themselves from the rest like gold sifted from sand?

In this article we’ll look at a few different metrics to see if they are helpful in predicting whether a team is more or less likely to be a Cinderella. To skip ahead to the results, jump to the Summary at the end of the article.

How to Make Predictions with Data

First I want to start with something very simple: given a data set and two groups to classify the data into (in our case we want to label each team as ‘Cinderella’ or ‘Not a Cinderella’), how can this classification be done?

This topic is one of the two main focuses of machine learning. For all the attention it gets in pop-science media, machine learning is largely about predicting which of a few groups a certain object will fall into. There are myriad ways to do this, from the simpler decision trees and logistic regression to the more complicated neural networks.

No matter what you do, everything boils down to finding some difference in the data associated with the two groups. If we can identify some way – some stat or trait – in which the two groups differ, then we can use this difference to make predictions. The picture below shows what it looks like when one variable differs significantly for two groups.

An example of making predictions using two distributions

If I wanted to predict which group a random object fell into using the above picture, I would look at the predictor variable shown on the x-axis. If this variable is larger than 2, I would probably pick the purple group B and if it’s smaller than 2 I would pick the blue group A. In college basketball language, if we can find a way (a stat) in which typical Cinderella teams differ from their compatriots, then we have a better chance at predicting Cinderellas in the future.
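The threshold rule described above can be written down in a few lines. This is only a minimal Python sketch on made-up data: the two groups are assumed to be normally distributed around 1 and 3, which is my stand-in for the picture and not the data behind it, and `classify` is a hypothetical helper name.

```python
import random

def classify(x, threshold=2.0):
    """Pick group B when the predictor is above the threshold, else group A."""
    return "B" if x > threshold else "A"

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical data: group A centered at 1, group B centered at 3.
group_a = [random.gauss(1.0, 1.0) for _ in range(1000)]
group_b = [random.gauss(3.0, 1.0) for _ in range(1000)]

# Count how often the threshold rule recovers the true group.
correct = sum(classify(x) == "A" for x in group_a)
correct += sum(classify(x) == "B" for x in group_b)
accuracy = correct / (len(group_a) + len(group_b))

print(f"accuracy of the threshold rule: {accuracy:.2f}")
```

The more the two distributions overlap, the lower this accuracy gets, which is exactly the problem with most basketball statistics.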

Before getting further into our specific discussion I’ll offer this: predicting Cinderellas is very hard. Almost no statistic will help us reliably separate the Cinderellas from the typical first-round fodder. However, we did find a few statistics which had a slightly different distribution for Cinderellas than for everyone else. The distribution of one of these mystery statistics is shown below, the data split between Cinderellas and otherwise.

This mystery predictor helps predict March Madness upsets

Remember: If we can find a variable that looks different for the two groups, it can help us predict which group an object falls into. The above picture – even though I’ve intentionally obfuscated the axis labels – shows a statistic where the distribution is different between Cinderellas and non-Cinderellas. We’ll return to this later.

Candidate Metrics for Predicting March Madness Upsets

In order to predict March Madness upsets, we need to identify some candidate statistics to look at. Traditionally, teams are seeded according to how good the selection committee thinks they are. In principle, that means all the 11-seeds are about as good as one another. Beyond just overall team quality, are there other traits that make a team a dangerous opponent?

There are a few common statistics that analysts and commentators like to use to pick March Madness upsets. The one I hear most often is pace. A team’s pace is how many possessions they get per game. Using pace to predict Cinderellas is predicated on the idea that the underdog needs to force the favorite out of their comfort zone in order to win. If a team’s pace is much higher than average or much lower than average, the favorite may be in uncomfortable territory. This may just be enough to win.

A second and third statistic often used to predict March Madness upsets are offensive rating and three point rate. Offensive rating is a measure of how many points a team scores per 100 possessions, while three point rate is the share of a team’s shot attempts that are threes. Both of these stats measure the ability of a team to be dangerous on offense. Shooting a lot of threes or having an otherwise good offense means a team has the ability to ‘go off’. This is a good recipe for a March Madness upset.

The full set of statistics we investigate to see if there is any meaningful difference between Cinderellas and non-Cinderellas is:

  • Pace
  • Offensive Rating
  • Three Point Rate
  • Free Throw Rate
  • SRS (Simple Rating System) – an unbiased measure of a team’s quality independent of strength of schedule
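For readers who want to compute the first four statistics themselves, here is a rough Python sketch working from season box score totals. Several things here are my assumptions and not taken from the analysis above: possessions are estimated with the common box score approximation (field goal attempts minus offensive rebounds plus turnovers plus a fraction of free throw attempts), the 0.475 free throw weight is one commonly used college value, free throw rate is taken to be free throw attempts per field goal attempt, and the season totals are invented.

```python
def possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box score totals (an approximation)."""
    return fga - orb + tov + ft_weight * fta

def team_metrics(points, fga, fg3a, fta, orb, tov, games):
    """Return pace, offensive rating, three point rate, and free throw rate."""
    poss = possessions(fga, orb, tov, fta)
    return {
        "pace": poss / games,                      # possessions per game
        "offensive_rating": 100.0 * points / poss, # points per 100 possessions
        "three_point_rate": fg3a / fga,            # share of shots that are threes
        "free_throw_rate": fta / fga,              # free throw attempts per shot
    }

# Hypothetical season totals for a single team.
season = team_metrics(points=2400, fga=1900, fg3a=700, fta=600,
                      orb=300, tov=400, games=32)

for name, value in season.items():
    print(f"{name}: {value:.3f}")
```

SRS is left out on purpose, since it depends on every team’s schedule and cannot be computed from one team’s box scores alone.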

Using Various Statistics to pick Cinderellas

In each of the following sections, we look at the various statistics to see if they are meaningful in helping to predict Cinderellas. The data used are all March Madness tournaments since 2010, because before that point the advanced metrics are not easily available. A team is defined to be a Cinderella if it meets any of the following conditions:

  • It wins at least one game as a 12+ seed
  • It wins at least two games as an 8+ seed (which in general requires beating a 1 or 2 seed in the second round for the teams that are seeded 8-11)
  • It makes the Final Four as a 5+ seed
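The definition above translates directly into a small function. This is a Python sketch, with two assumptions of mine: `wins` counts tournament wins starting from the round of 64 (so First Four play-in games are excluded), and making the Final Four is treated as four such wins. The name `is_cinderella` is hypothetical.

```python
def is_cinderella(seed, wins):
    """Apply the three-part Cinderella definition.

    seed: the team's tournament seed (1-16).
    wins: wins from the round of 64 onward (assumption: First Four excluded).
    """
    if seed >= 12 and wins >= 1:   # any win as a 12+ seed
        return True
    if seed >= 8 and wins >= 2:    # two or more wins as an 8+ seed
        return True
    if seed >= 5 and wins >= 4:    # Final Four as a 5+ seed
        return True
    return False

# A few illustrative cases.
print(is_cinderella(14, 1))  # 14 seed that wins a game
print(is_cinderella(11, 1))  # 11 seed that wins only one game
print(is_cinderella(5, 4))   # 5 seed that reaches the Final Four
```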

The point isn’t to have a perfect definition of Cinderella. Our definition simply calls a team a Cinderella if it significantly outperforms the expectations for its seed. This definition also allows for enough Cinderellas to meaningfully look at the data. All images are made with the fantastic package in R called ‘lattice’.

Pace

Before starting, I expected pace to be an interesting stat in predicting March Madness upsets. If you control the pace of the game, you can make the other team feel significantly out of control. Shown below are two plots. The first is a one-dimensional scatter plot split by a team’s seed. The color of the dot indicates whether or not a team was a Cinderella.

A strip plot showing pace versus seed

This plot is called a strip plot because it plots one dimensional strips of data for each value of ‘Seed’. While the effect is neither strong nor obvious in the above plot, it does seem that more of the blue dots (which correspond to Cinderella teams) have a slower pace. If we aggregate all the data points across their seed (think squishing the y-axis of the above plot) and plot the distributions of the variables, we get the following result.

A density plot showing pace and its ability to pick March Madness upsets

These curves are empirical estimates of the probability density function of a team’s pace, conditioned on whether or not the team was a Cinderella, but I’ll simply call this plot a density plot. The density plot for pace shows that more of the Cinderellas tend to fall in the ‘slower than average’ category. In the figure, this is evident in the blue curve generally sitting to the left of the purple curve. A full 62% of Cinderellas have a below average pace for their seed.
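A figure like that 62% comes from comparing each Cinderella with the average for its own seed. Below is a minimal Python sketch of that calculation. The six teams are invented purely to show the mechanics, and the field names (`seed`, `pace`, `cinderella`) and the function name are my own choices, not the layout of the real data set.

```python
from collections import defaultdict
from statistics import mean

def share_below_seed_average(teams, stat="pace"):
    """Fraction of Cinderellas whose stat is below the average for their seed."""
    by_seed = defaultdict(list)
    for team in teams:
        by_seed[team["seed"]].append(team[stat])
    seed_avg = {seed: mean(values) for seed, values in by_seed.items()}

    cinderellas = [t for t in teams if t["cinderella"]]
    if not cinderellas:
        raise ValueError("no Cinderellas in the data")
    below = sum(t[stat] < seed_avg[t["seed"]] for t in cinderellas)
    return below / len(cinderellas)

# Hypothetical teams, NOT real tournament data.
teams = [
    {"seed": 12, "pace": 64.0, "cinderella": True},
    {"seed": 12, "pace": 70.0, "cinderella": False},
    {"seed": 12, "pace": 69.0, "cinderella": False},
    {"seed": 13, "pace": 72.0, "cinderella": True},
    {"seed": 13, "pace": 66.0, "cinderella": False},
    {"seed": 13, "pace": 67.0, "cinderella": False},
]

share = share_below_seed_average(teams, stat="pace")
print(f"share of Cinderellas below their seed average: {share:.0%}")
```

Comparing against the seed average, and not the overall average, keeps the comparison fair when a statistic also varies with team quality.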

While the distinction in pace between Cinderellas and non-Cinderellas isn’t earth shattering, it is noticeable. Pace should be considered when picking your March Madness upsets.

Offensive Rating

A team’s offensive rating is a measure of how many points they score per 100 possessions. Teams with better offensive ratings will score more points. They might be more dangerous as an underdog because of a proclivity to get hot. The plot below shows offensive rating for various seeds with Cinderellas and non-Cinderellas colored differently. We also include on the right the normalized offensive rating, measured in units of ‘standard deviations above/below the mean’ for the team’s seed.

Two strip plots looking at offensive rating against seed

Unlike in the discussion involving pace, there isn’t much correlation between offensive rating and being a Cinderella. This picture is a good example of why normalization is helpful. Better seeds have higher offensive ratings because they score more points. However, when we change the data to measure offensive rating relative to the average for the seed, we can collapse the data along the y-axis and form a density plot.
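Normalizing relative to the seed average can be sketched as a z-score computed within each seed. The Python below is an illustration only: the ratings are invented, the function name is hypothetical, and using the population standard deviation (`pstdev`) is my assumption, since the exact normalization behind the plots is not spelled out.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_by_seed(rows, stat):
    """Add '<stat>_z': standard deviations above or below the seed's mean."""
    by_seed = defaultdict(list)
    for row in rows:
        by_seed[row["seed"]].append(row[stat])

    normalized = []
    for row in rows:
        values = by_seed[row["seed"]]
        mu, sigma = mean(values), pstdev(values)
        # Guard against a seed where every team has the same value.
        z = (row[stat] - mu) / sigma if sigma > 0 else 0.0
        normalized.append({**row, f"{stat}_z": z})
    return normalized

# Hypothetical offensive ratings, NOT real data.
rows = [
    {"seed": 1, "ortg": 118.0},
    {"seed": 1, "ortg": 114.0},
    {"seed": 1, "ortg": 116.0},
    {"seed": 12, "ortg": 104.0},
    {"seed": 12, "ortg": 108.0},
]

for row in normalize_by_seed(rows, "ortg"):
    print(row["seed"], row["ortg"], round(row["ortg_z"], 2))
```

After this step a 1 seed and a 12 seed are on the same scale, so all the seeds can be pooled into a single density plot.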

A density plot showing offensive rating and its ability to pick March Madness Cinderellas

This chart shows nearly indistinguishable normalized offensive ratings between Cinderellas and non-Cinderellas. As a result, offensive rating is not a good tool to predict March Madness upsets.

Three Point Rate

Sometimes analysts like to identify potential Cinderellas by finding teams that take a lot of three point attempts. Teams that shoot a lot of threes have the potential to be “variancy”. If they have even just a mildly above average shooting day, the volume of threes they take can add up quickly. The strip plot for three point attempt rate is shown below.

A strip plot for 3 point rate against NCAA tournament seed

The strip plot here does not reveal much of a distinction in three point rate between Cinderella and non-Cinderella teams. The density plot shows largely the same information.

A density plot showing the lack of ability of 3 point rate to predict NCAA tournament upsets

These distributions might look ever so slightly different, but I do not think the difference is anything other than small sample noise. That is, in the tournaments since 2010, three point rate has not been a consistently good predictor of March Madness upsets.

Free Throw Rate

Admittedly, free throw rate is a statistic I thought would not be particularly valuable in predicting Cinderellas. It is hard for me to come up with a justification or an explanation of why free throw rate might matter. The data support this expectation. Here is the strip plot for free throw rates.

A strip plot of free throw rate against seed

It is difficult to see much correlation between free throw rate and Cinderellas on the strip plot. The density plot below gives a clearer view.

Density plot for free throw rate

If anything the density plot shows that a slightly lower free throw rate is beneficial. Roughly 56% of the Cinderella teams in the last 12 years have had a below average free throw rate. While I don’t necessarily think this effect is anything statistically significant, it is at least worth reporting. I do not think free throw rate is a good predictor of March Madness upsets, but that may be for you to decide.
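One way to check whether a split like 56% is distinguishable from a coin flip is an exact sign test. The Python sketch below assumes that, if the statistic had no effect, each Cinderella would be equally likely to land above or below its seed average (strictly true for the median, only roughly true for the mean). The count of 50 Cinderellas is a made-up placeholder, since the actual number is not given here.

```python
from math import comb

def sign_test_p_value(k, n):
    """Exact two-sided p-value for k 'below average' teams out of n,
    under the null hypothesis that each side is equally likely."""
    extreme = max(k, n - k)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)

# Hypothetical sample size: 28 of 50 Cinderellas (56%) below average.
p_value = sign_test_p_value(28, 50)
print(f"two-sided p-value: {p_value:.3f}")
```

With a sample this small, a 56% split is well within what chance alone produces, which matches the caution above.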

SRS

The last statistic we look at is a team’s Simple Rating System (SRS) score. SRS is a measure of how good a team is relative to ‘league average’. My initial instinct going into this analysis was that SRS measured relative to the average for a team’s seed would be a good predictor of which teams will be Cinderellas. This would mean that teams that are better than their seed suggests are more likely to go on runs. Let’s see if this is the case. The strip plots below show SRS as well as the values normalized to seed average.

Two strip plots showing normalized and unnormalized SRS against seed

First of all, SRS is interesting to look at because we can very clearly see the distinction in quality between the different seeds. Better teams have better SRS values. What is also fairly clear is that the blue dots tend to aggregate on the right side of the SRS region. This indicates that my instinct was correct: teams that are under seeded are more likely to be Cinderellas. The density plot shows this fact even more clearly.

A density plot showing that SRS normalized to seed is a good predictor of Cinderellas

The blue curve is noticeably shifted to the right of the purple curve. This means that the distribution of SRS values relative to seed average is higher for Cinderellas than for non-Cinderellas. In fact, this effect is the most significant we’ve seen so far. 72% of Cinderella teams in the last 12 years had better SRS values than the average for their seed. This is certainly the best metric we’ve looked at to predict March Madness upsets.

Summary

We looked at teams’ pace, three point rate, free throw rate, offensive rating, and SRS value to see if we could predict whether or not a team was likely to be a Cinderella. We found:

  • Pace and SRS were valuable in predicting Cinderellas.
    • 62% of Cinderellas had a below average pace for their seed
    • 72% of Cinderellas had an above average SRS for their seed
  • Three point rate, offensive rating, and free throw rate were not found to have a significant relationship with a team having a Cinderella run or not.

Stay tuned in the next few days for our analysis and articles discussing our predictions for both the Men’s and Women’s March Madness tournaments. We’ll incorporate these findings along with our own team quality metric (Bayes’ Ensemble) to predict the most likely March Madness upsets and the most likely overall champions.
