March Madness Machine Learning 2020

No single sporting event causes as much debate, excitement, talk of ‘strength of schedule’, and interest in analytics as March Madness. Every budding data scientist or mathematician tries their hand at building a March Madness machine learning model. There is even a March Madness Machine Learning 2020 competition hosted by Google every year where contestants don’t just pick a bracket; they have to estimate win probabilities for every game. People take this stuff seriously.

As we’ll see shortly, simply telling you ‘which teams are likely to win each individual matchup’ is neither that interesting nor that helpful in creating a good bracket. In fact, most machine learning models end up being pure chalk. So, instead of telling you which teams are likely to win each game, I will provide the probability of each team making the second round, Sweet Sixteen, etc. Then, you can use information like ‘Abilene Christian has a 20% chance of making the Sweet Sixteen – highest of any 10+ seed’ to pick reasonable upsets. (Note: That number is actually true. Abilene Christian is my super sleeper pick).

First, though, let me explain my methods. If you would rather skip ahead, my Cinderella picks are in the final section.

My March Madness Machine Learning Model

The model I use in this article is not so different from my Bayes Ensemble method that ranks NBA teams. Long story short: I generate team ratings for everyone in the field using a mixture of observed game scores and recorded Las Vegas line data. Obviously, using games played this season gives us a very good estimate of how good teams are relative to one another. But why Vegas data? I wrote an article a few months ago about how Vegas lines are interpretable as ensemble learners and, therefore, are extremely accurate measures of how much better one team is than another. Mixing Vegas data and game scores should give us a very good (though potentially biased) way of determining the relative quality of teams.

What do these rankings tell us? Very simply: the difference between two teams’ ratings is my predicted margin of victory. Gonzaga is roughly a 26, Illinois a 21. Therefore, I would predict Gonzaga to beat Illinois by 5 points on a neutral court on average.

However, just using the ratings to predict who wins and who loses is boring. If I did that, all I would know is that the most likely final four is all one seeds and the most likely champion is Gonzaga. That doesn’t tell me much. Rather, I want to know how likely each of these things is. How likely is it that Abilene Christian makes the elite 8? How likely is it that Gonzaga wins?

One way to do this is to look up a table that converts lines to winning probabilities. You can look at my piece performing a historical betting analysis on NFL data to see how one might do this using logistic regression.

A second possibility for converting predicted margin to winning probability is to use a normality assumption on the game score. For instance, I think that Gonzaga will beat Illinois by 5 on average. However, it is fairly reasonable to assume that Gonzaga’s actual margin of victory will resemble a normally distributed random variable with mean 5 and some standard deviation.

Without worrying too much about how, I determined a standard deviation of about 8.5 points fit the data best. This means that if a team is favored by 8.5 points, I can compute their win probability by querying the cumulative normal distribution for z=1. Using this, we can see that a line of 8.5 points means the favorite should win about 84% of the time, roughly matching what we see in the tables.
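To make the conversion concrete, here is a minimal sketch of the normal-margin calculation described above. The function name `win_probability` is my own label for illustration; the only inputs taken from the article are the 8.5-point standard deviation and the example margins.

```python
from math import erf, sqrt

def win_probability(margin, sigma=8.5):
    """P(favorite wins) when the actual margin is Normal(margin, sigma).

    The favorite wins whenever the realized margin is above zero, which is
    the standard normal CDF evaluated at z = margin / sigma.
    """
    z = margin / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# A team favored by exactly one standard deviation (z = 1).
print(f"{win_probability(8.5):.1%}")
# Gonzaga over Illinois on a neutral court, predicted margin of about 5.
print(f"{win_probability(5.0):.1%}")
```

The second call lands in the 70-75% range quoted for Gonzaga against the next best team later in the article.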

In exactly this way I can compute the probability of any tournament team beating anyone else in the field. How, then, can we get to things like ‘What is the probability that Abilene Christian makes the Sweet Sixteen’? We return to old faithful: Monte Carlo simulation.

I set up the bracket and simulated the tournament 40,000 times. For each game in each simulation, I picked the winner according to the probabilities computed from my rankings system. This way, I don’t pick Gonzaga to win in the first round every time. They actually lost in 4 of these 40,000 simulations. As Virginia taught us, this is a distinct possibility and we cannot simply discard it as unlikely.

Finally, I counted how many times each team advanced to each phase of the tournament. This is what will help us make our bracket picks.
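The simulate-and-count loop can be sketched in a few lines. This is not the code behind the article's numbers, just an illustration under stated assumptions: the helper names are hypothetical, the field is cut down to a single four-team pod (the first four rows of the ratings table, which appear to be listed in bracket order), and win probabilities come from the normal-margin model with a standard deviation of 8.5.

```python
import random
from math import erf, sqrt

SIGMA = 8.5  # standard deviation of the game margin quoted in the article

def win_prob(rating_a, rating_b, sigma=SIGMA):
    """P(team A beats team B) under the normal-margin assumption."""
    z = (rating_a - rating_b) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def simulate_bracket(teams, ratings, rng):
    """Play one single-elimination bracket; return the survivors of each round."""
    rounds = []
    alive = list(teams)
    while len(alive) > 1:
        winners = []
        # Adjacent teams in bracket order play each other.
        for a, b in zip(alive[0::2], alive[1::2]):
            p = win_prob(ratings[a], ratings[b])
            winners.append(a if rng.random() < p else b)
        alive = winners
        rounds.append(alive)
    return rounds

def advancement_rates(teams, ratings, n_sims=40_000, seed=0):
    """Fraction of simulations in which each team survives each round."""
    rng = random.Random(seed)
    n_rounds = len(teams).bit_length() - 1  # len(teams) must be a power of two
    counts = {t: [0] * n_rounds for t in teams}
    for _ in range(n_sims):
        for r, survivors in enumerate(simulate_bracket(teams, ratings, rng)):
            for t in survivors:
                counts[t][r] += 1
    return {t: [c / n_sims for c in cs] for t, cs in counts.items()}

# Ratings copied from the table; insertion order doubles as bracket order.
ratings = {"Gonzaga": 26.47, "Norfolk/App": -4.33, "Oklahoma": 12.84, "Missouri": 11.05}
rates = advancement_rates(list(ratings), ratings)
for team, (r32, s16) in rates.items():
    print(f"{team:12s} R32={r32:.4f} S16={s16:.4f}")
```

Extending this to the full tournament only requires passing all 64 teams in bracket order; each additional round then adds one more entry to every team's list.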

Using These Results

Like I said, just picking the better team in every game is going to get you a bracket where you almost always just pick the higher seed. Nobody likes that guy in their bracket pool. Doing this is boring. If you’re anything like me, you have a visceral need to make ignorant picks and talk about how right you were for five years after. I still talk about my Loyola-Chicago call from a few years ago.

So, how can we use these computer results (which simply tell us ‘higher seeds are better’) to make a fun, reasonable bracket? It’s easy actually: let the model help you identify which teams are better/worse than their seed and which teams are more likely to advance further than their seed would suggest.

It is more likely than not that Texas will beat Abilene Christian in the first round. They are the better team. However, my model thinks Texas is way overrated and Abilene is underrated. Put all this together and we get Abilene with a 44% chance of beating Texas!

March Madness Machine Learning Team Rankings

Before moving on to discuss probabilities and Cinderellas, I think it is best to simply present the rankings and ratings of each team in the field. Remember, the numbers below are predicted margin of victory against a league average opponent on neutral court.

The number in each column represents the probability of a team making it to that particular round. R32 is the probability of making it to the round of 32, S16 is probability of advancing to sweet 16, etc.

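Team | Rating | R32 | S16 | E8 | F4 | Final | Championship | Seed
Gonzaga | 26.47 | 0.9999 | 0.9528 | 0.8567 | 0.6812 | 0.5461 | 0.4278 | 1
Norfolk/App | -4.33 | 0.0001 | 0 | 0 | 0 | 0 | 0 | 16
Oklahoma | 12.84 | 0.5804 | 0.0335 | 0.013 | 0.0033 | 0.001 | 0.0001 | 8
Missouri | 11.05 | 0.4196 | 0.0137 | 0.004 | 0.001 | 0.0001 | 0 | 9
Creighton | 16.08 | 0.9023 | 0.5035 | 0.0636 | 0.0253 | 0.009 | 0.0026 | 5
UC Santa Barbara | 5.23 | 0.0977 | 0.0165 | 0.0004 | 0.0001 | 0 | 0 | 12
Virginia | 16 | 0.8723 | 0.4591 | 0.0619 | 0.0231 | 0.0088 | 0.003 | 4
Ohio | 6.43 | 0.1277 | 0.0209 | 0.0004 | 0 | 0 | 0 | 13
USC | 15.34 | 0.8071 | 0.4398 | 0.1423 | 0.023 | 0.0088 | 0.0028 | 6
Wichita St. | 8.02 | 0.1929 | 0.0473 | 0.0053 | 0.0002 | 0 | 0 | 11
Kansas | 15.24 | 0.9093 | 0.502 | 0.1504 | 0.0238 | 0.0087 | 0.0035 | 3
Eastern Washington | 3.73 | 0.0907 | 0.0109 | 0.0005 | 0 | 0 | 0 | 14
Oregon | 12.78 | 0.5644 | 0.1013 | 0.0403 | 0.0043 | 0.0009 | 0.0003 | 7
VA Commonwealth | 11.38 | 0.4356 | 0.0581 | 0.0201 | 0.0015 | 0.0004 | 0.0001 | 10
Iowa | 21.16 | 0.984 | 0.8386 | 0.6408 | 0.2132 | 0.1287 | 0.0754 | 2
Grand Canyon | 2.45 | 0.016 | 0.002 | 0.0003 | 0 | 0 | 0 | 15
Michigan | 21.39 | 0.9976 | 0.7455 | 0.5627 | 0.4137 | 0.1616 | 0.096 | 1
Texas Southern | -6.0 | 0.0024 | 0.0001 | 0 | 0 | 0 | 0 | 16
LSU | 13.42 | 0.3727 | 0.074 | 0.0324 | 0.0148 | 0.0024 | 0.0007 | 8
St. Bonaventure | 12.84 | 0.6273 | 0.1804 | 0.0998 | 0.0567 | 0.0114 | 0.0041 | 9
Colorado | 16.04 | 0.9103 | 0.5941 | 0.2084 | 0.1147 | 0.0266 | 0.0107 | 5
Georgetown | 9.17 | 0.0897 | 0.0179 | 0.0009 | 0.0001 | 0 | 0 | 12
Florida St. | 14.64 | 0.7422 | 0.3289 | 0.0893 | 0.0388 | 0.0057 | 0.0017 | 4
NC Greensboro | 4.41 | 0.2578 | 0.0591 | 0.0065 | 0.0014 | 0.0001 | 0 | 13
BYU | 12.69 | 0.7274 | 0.4035 | 0.1968 | 0.0678 | 0.0137 | 0.0039 | 6
Michigan St. | 10.88 | 0.2726 | 0.0904 | 0.0247 | 0.005 | 0.0003 | 0 | 11
Texas | 15.11 | 0.564 | 0.2982 | 0.1299 | 0.042 | 0.007 | 0.002 | 3
Abilene Christian | 6.35 | 0.436 | 0.2079 | 0.0787 | 0.0198 | 0.0024 | 0.0011 | 14
Connecticut | 14.94 | 0.4947 | 0.1179 | 0.0418 | 0.0067 | 0.0006 | 0.0001 | 7
Maryland | 13.23 | 0.5053 | 0.115 | 0.0397 | 0.0079 | 0.0008 | 0.0002 | 10
Alabama | 17.15 | 0.9827 | 0.7661 | 0.4882 | 0.2105 | 0.0549 | 0.0224 | 2
Iona | -1.02 | 0.0173 | 0.001 | 0.0002 | 0.0001 | 0 | 0 | 15
Baylor | 21.97 | 0.9509 | 0.7896 | 0.6099 | 0.4613 | 0.2641 | 0.1056 | 1
Hartford | -2.57 | 0.0491 | 0.0111 | 0.0025 | 0.0003 | 0 | 0 | 16
North Carolina | 14.6 | 0.5263 | 0.1113 | 0.0488 | 0.0209 | 0.0057 | 0.0007 | 8
Wisconsin | 17.29 | 0.4737 | 0.088 | 0.0373 | 0.0133 | 0.0024 | 0.0004 | 9
Villanova | 17.39 | 0.7894 | 0.4738 | 0.1659 | 0.09 | 0.034 | 0.0068 | 5
Winthrop | 5.96 | 0.2106 | 0.0646 | 0.0067 | 0.0014 | 0.0001 | 0 | 12
Purdue | 15.22 | 0.8863 | 0.4488 | 0.1279 | 0.0596 | 0.0198 | 0.0031 | 4
North Texas | 9.68 | 0.1137 | 0.0128 | 0.001 | 0 | 0 | 0 | 13
Texas Tech | 16.1 | 0.5814 | 0.2632 | 0.0894 | 0.0232 | 0.0054 | 0.0008 | 6
Utah St. | 10.91 | 0.4186 | 0.1525 | 0.0411 | 0.0091 | 0.0018 | 0 | 11
Arkansas | 15.48 | 0.8505 | 0.5421 | 0.2476 | 0.0832 | 0.0256 | 0.004 | 3
Colgate | 14.06 | 0.1495 | 0.0422 | 0.0048 | 0.0002 | 0 | 0 | 14
Florida | 11.69 | 0.5767 | 0.2318 | 0.1291 | 0.0412 | 0.0132 | 0.0033 | 7
Virginia Tech | 11.63 | 0.4233 | 0.1436 | 0.0724 | 0.0206 | 0.0051 | 0.0006 | 10
Ohio St. | 18.28 | 0.9848 | 0.6241 | 0.4156 | 0.1757 | 0.0697 | 0.017 | 2
Oral Roberts | 0.32 | 0.0152 | 0.0005 | 0 | 0 | 0 | 0 | 15
Illinois | 21.96 | 0.9898 | 0.817 | 0.6604 | 0.4685 | 0.3075 | 0.133 | 1
Drexel | 2.05 | 0.0102 | 0.0007 | 0 | 0 | 0 | 0 | 16
Loyola-Chicago | 14.68 | 0.654 | 0.1409 | 0.0716 | 0.0287 | 0.0089 | 0.0013 | 8
Georgia Tech | 11.43 | 0.346 | 0.0414 | 0.0154 | 0.0041 | 0.001 | 0.0002 | 9
Tennessee | 16.01 | 0.8754 | 0.6062 | 0.1839 | 0.0869 | 0.036 | 0.0081 | 5
Oregon St. | 6.06 | 0.1246 | 0.0328 | 0.0017 | 0.0001 | 0 | 0 | 12
Oklahoma St. | 12.65 | 0.84 | 0.3416 | 0.0659 | 0.0222 | 0.0058 | 0.0008 | 4
Liberty | 4.17 | 0.16 | 0.0194 | 0.0011 | 0 | 0 | 0 | 13
San Diego St. | 12.93 | 0.5769 | 0.2578 | 0.0735 | 0.0158 | 0.0042 | 0.0005 | 6
Syracuse | 11.19 | 0.4231 | 0.1561 | 0.0374 | 0.0064 | 0.0009 | 0 | 11
West Virginia | 14.53 | 0.9659 | 0.5836 | 0.2001 | 0.0559 | 0.0222 | 0.0038 | 3
Morehead St. | -1.06 | 0.0341 | 0.0025 | 0.0001 | 0 | 0 | 0 | 14
Clemson | 11.17 | 0.3657 | 0.0659 | 0.0284 | 0.0046 | 0.0016 | 0 | 7
Rutgers | 14.08 | 0.6343 | 0.1779 | 0.0928 | 0.0258 | 0.0092 | 0.0012 | 10
Houston | 19.25 | 0.9968 | 0.7562 | 0.5677 | 0.281 | 0.1558 | 0.0503 | 2
Cleveland St. | -3.12 | 0.0032 | 0 | 0 | 0 | 0 | 0 | 15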

For the remainder of the article, I am going to split the field up into four sections: The longshots (seeds 13-16), the Cinderellas (seeds 9-12), the spoilers (seeds 5-8), and the favorites (seeds 1-4). For each group, I’ll highlight a few teams that are likely to over/under perform their seeds and make a nice run.

It is important to note that sorting teams by ‘Rating’ is only half the story. The other half of the story is quality of opponent. An upset alert takes only a mild mixture of an over-seeded favorite and an under-seeded underdog. In each of the sections below I’ll look at both ‘who the best teams are’ and ‘who is most likely to make a run’. The second takes into account strength of region and difficulty of matchup.

The Longshots

Abilene Christian is my team. They have the highest chance of making the Sweet 16 that I can remember seeing from a 14 seed. Not only is Abilene about 12-seed good, their potential opponents – Texas and BYU – are significantly overseeded. Disclaimer: I haven’t watched Abilene Christian play a single game, but that isn’t the point of my blog. I am not an eye-test guy. I am a ‘here is what the numbers are screaming to me’ guy. And the numbers, for whatever reason, think Abilene Christian is way better than a 14 seed.

Nobody else in this group seems to have much of a chance. I’ll let you figure out what is going on for yourself using the table below.

The number in each column represents the probability of a team making it to that particular round. R32 is the probability of making it to the round of 32, S16 is probability of advancing to sweet 16, etc.

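Team | Rating | R32 | S16 | E8 | F4 | Final | Championship | Seed
Norfolk St. | -4.33 | 0.0001 | 0 | 0 | 0 | 0 | 0 | 16
Ohio | 6.43 | 0.1277 | 0.0209 | 0.0004 | 0 | 0 | 0 | 13
Eastern Washington | 3.73 | 0.0907 | 0.0109 | 0.0005 | 0 | 0 | 0 | 14
Grand Canyon | 2.45 | 0.016 | 0.002 | 0.0003 | 0 | 0 | 0 | 15
Texas Southern | -6 | 0.000 | 0 | 0 | 0 | 0 | 0 | 16
NC Greensboro | 4.41 | 0.2578 | 0.0591 | 0.0065 | 0.0014 | 0.0001 | 0 | 13
Abilene Christian | 6.35 | 0.436 | 0.2079 | 0.0787 | 0.0198 | 0.0024 | 0.0011 | 14
Iona | -1.02 | 0.0173 | 0.001 | 0.0002 | 0.0001 | 0 | 0 | 15
Hartford | -2.57 | 0.0491 | 0.0111 | 0.0025 | 0.0003 | 0 | 0 | 16
North Texas | 9.68 | 0.1137 | 0.0128 | 0.001 | 0 | 0 | 0 | 13
Colgate | 14.06 | 0.1495 | 0.0422 | 0.0048 | 0.0002 | 0 | 0 | 14
Oral Roberts | 0.32 | 0.0152 | 0.0005 | 0 | 0 | 0 | 0 | 15
Drexel | 2.05 | 0.0102 | 0.0007 | 0 | 0 | 0 | 0 | 16
Liberty | 4.17 | 0.16 | 0.0194 | 0.0011 | 0 | 0 | 0 | 13
Morehead St. | -1.06 | 0.0341 | 0.0025 | 0.0001 | 0 | 0 | 0 | 14
Cleveland St. | -3.12 | 0.0032 | 0 | 0 | 0 | 0 | 0 | 15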

The Cinderellas

This is typically the most fun group to play with. Want to send a 9 seed to the elite 8? It can happen. A 12 seed makes the sweet 16? Almost every year. This is the group that has the memorable runs deep into the tournament where they have no business being. Those types of events are extremely hard to predict. However, what I can do is help you understand which teams are probably better than their seed indicates so you can make your own picks.

The number in each column represents the probability of a team making it to that particular round. R32 is the probability of making it to the round of 32, S16 is probability of advancing to sweet 16, etc.

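Team | Rating | R32 | S16 | E8 | F4 | Final | Championship | Seed
Missouri | 11.05 | 0.4196 | 0.0137 | 0.004 | 0.001 | 0.0001 | 0 | 9
UC Santa Barbara | 5.23 | 0.0977 | 0.0165 | 0.0004 | 0.0001 | 0 | 0 | 12
Drake | 8.02 | 0.1929 | 0.0473 | 0.0053 | 0.0002 | 0 | 0 | 11
VA Commonwealth | 11.38 | 0.4356 | 0.0581 | 0.0201 | 0.0015 | 0.0004 | 0.0001 | 10
St. Bonaventure | 12.84 | 0.6273 | 0.1804 | 0.0998 | 0.0567 | 0.0114 | 0.0041 | 9
Georgetown | 9.17 | 0.0897 | 0.0179 | 0.0009 | 0.0001 | 0 | 0 | 12
MSU/UCLA | 10.88 | 0.2726 | 0.0904 | 0.0247 | 0.005 | 0.0003 | 0 | 11
Maryland | 13.23 | 0.5053 | 0.115 | 0.0397 | 0.0079 | 0.0008 | 0.0002 | 10
Wisconsin | 17.29 | 0.4737 | 0.088 | 0.0373 | 0.0133 | 0.0024 | 0.0004 | 9
Winthrop | 5.96 | 0.2106 | 0.0646 | 0.0067 | 0.0014 | 0.0001 | 0 | 12
Utah St. | 10.91 | 0.4186 | 0.1525 | 0.0411 | 0.0091 | 0.0018 | 0 | 11
Virginia Tech | 11.63 | 0.4233 | 0.1436 | 0.0724 | 0.0206 | 0.0051 | 0.0006 | 10
Georgia Tech | 11.43 | 0.346 | 0.0414 | 0.0154 | 0.0041 | 0.001 | 0.0002 | 9
Oregon St. | 6.06 | 0.1246 | 0.0328 | 0.0017 | 0.0001 | 0 | 0 | 12
Syracuse | 11.19 | 0.4231 | 0.1561 | 0.0374 | 0.0064 | 0.0009 | 0 | 11
Rutgers | 14.08 | 0.6343 | 0.1779 | 0.0928 | 0.0258 | 0.0092 | 0.0012 | 10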

The Spoilers

The best of this group looks to be Villanova, with a bunch of other teams close behind. However, in this range, matchups are everything. It seems like the surest bets to make the Sweet 16 are Tennessee and Colorado. If you want to pick a team from this group to make a Final Four appearance, I would suggest Colorado. They have by far the easiest route. No huge upsets are lurking in this group. San Diego State is the most likely of the 5 and 6 seeds to get bounced in the first round, but they are still favored. Connecticut is the best relative to their seed. BYU is the worst relative to their seed.

The number in each column represents the probability of a team making it to that particular round. R32 is the probability of making it to the round of 32, S16 is probability of advancing to sweet 16, etc.

Team | Rating | R32 | S16 | E8 | F4 | Final | Championship | Seed
Oklahoma | 12.84 | 0.5804 | 0.0335 | 0.013 | 0.0033 | 0.001 | 0.0001 | 8
Creighton | 16.08 | 0.9023 | 0.5035 | 0.0636 | 0.0253 | 0.009 | 0.0026 | 5
USC | 15.34 | 0.8071 | 0.4398 | 0.1423 | 0.023 | 0.0088 | 0.0028 | 6
Oregon | 12.78 | 0.5644 | 0.1013 | 0.0403 | 0.0043 | 0.0009 | 0.0003 | 7
LSU | 13.42 | 0.3727 | 0.074 | 0.0324 | 0.0148 | 0.0024 | 0.0007 | 8
Colorado | 16.04 | 0.9103 | 0.5941 | 0.2084 | 0.1147 | 0.0266 | 0.0107 | 5
BYU | 12.69 | 0.7274 | 0.4035 | 0.1968 | 0.0678 | 0.0137 | 0.0039 | 6
Connecticut | 14.94 | 0.4947 | 0.1179 | 0.0418 | 0.0067 | 0.0006 | 0.0001 | 7
North Carolina | 14.6 | 0.5263 | 0.1113 | 0.0488 | 0.0209 | 0.0057 | 0.0007 | 8
Villanova | 17.39 | 0.7894 | 0.4738 | 0.1659 | 0.09 | 0.034 | 0.0068 | 5
Texas Tech | 16.1 | 0.5814 | 0.2632 | 0.0894 | 0.0232 | 0.0054 | 0.0008 | 6
Florida | 11.69 | 0.5767 | 0.2318 | 0.1291 | 0.0412 | 0.0132 | 0.0033 | 7
Loyola-Chicago | 14.68 | 0.654 | 0.1409 | 0.0716 | 0.0287 | 0.0089 | 0.0013 | 8
Tennessee | 16.01 | 0.8754 | 0.6062 | 0.1839 | 0.0869 | 0.036 | 0.0081 | 5
San Diego St. | 12.93 | 0.5769 | 0.2578 | 0.0735 | 0.0158 | 0.0042 | 0.0005 | 6
Clemson | 11.17 | 0.3657 | 0.0659 | 0.0284 | 0.0046 | 0.0016 | 0 | 7

The Favorites

I need to start by talking about Gonzaga. Gonzaga is a heavy, heavy favorite this year. I ran a similar (but simpler) analysis last year to simulate the missing March Madness, and the most likely champion only won something like 20% of the time. Gonzaga is more than twice as likely to win this year as a normal ‘first overall seed’. Gonzaga wins over FORTY percent of my simulations. This is absurd. It’s boring, it’s blasé, it’s vanilla, but I am taking Gonzaga to win every bracket I enter. My model favors Gonzaga by about 5 points over the next best team. This translates to somewhere between a 70% and 75% chance of beating the second best team in the country on any given night. I am not the only numbers junkie to favor Gonzaga this highly; KenPom agrees.

This group is pretty much what you would expect the top 4 to look like. My model thinks Virginia is closer to a high 2 than a 4 seed. My model thinks Texas and WVU should be 4’s not 3’s. Small things like that. However, if you sort the table below by Sweet 16 chances, we see some interesting things.

It is more likely than not that Texas, Florida St., OK St., Purdue, and Virginia get bounced before the Sweet 16. Texas in particular (shoutout Abilene Christian) looks to be in danger. I found only about a 56% chance that Texas makes it out of the first round.
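The sort-and-filter step behind that list is easy to reproduce. The sketch below copies the S16 column for the 1-4 seeds from the favorites table and flags every team whose chance of reaching the Sweet 16 is under 50%; the variable names are illustrative only.

```python
# Sweet 16 probabilities for seeds 1-4, copied from the favorites table.
s16 = {
    "Gonzaga": 0.9528, "Virginia": 0.4591, "Kansas": 0.502, "Iowa": 0.8386,
    "Michigan": 0.7455, "Florida St.": 0.3289, "Texas": 0.2982, "Alabama": 0.7661,
    "Baylor": 0.7896, "Purdue": 0.4488, "Arkansas": 0.5421, "Ohio St.": 0.6241,
    "Illinois": 0.817, "Oklahoma St.": 0.3416, "West Virginia": 0.5836,
    "Houston": 0.7562,
}

# Teams more likely than not to be out before the Sweet 16, shakiest first.
at_risk = sorted((team for team, p in s16.items() if p < 0.5), key=s16.get)

for team in at_risk:
    print(f"{team}: {s16[team]:.1%} to reach the Sweet 16")
```

Kansas, at 50.2%, only just escapes the list.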

The number in each column represents the probability of a team making it to that particular round. R32 is the probability of making it to the round of 32, S16 is probability of advancing to sweet 16, etc.

Team | Rating | R32 | S16 | E8 | F4 | Final | Championship | Seed
Gonzaga | 26.47 | 0.9999 | 0.9528 | 0.8567 | 0.6812 | 0.5461 | 0.4278 | 1
Virginia | 16 | 0.8723 | 0.4591 | 0.0619 | 0.0231 | 0.0088 | 0.003 | 4
Kansas | 15.24 | 0.9093 | 0.502 | 0.1504 | 0.0238 | 0.0087 | 0.0035 | 3
Iowa | 21.16 | 0.984 | 0.8386 | 0.6408 | 0.2132 | 0.1287 | 0.0754 | 2
Michigan | 21.39 | 0.9976 | 0.7455 | 0.5627 | 0.4137 | 0.1616 | 0.096 | 1
Florida St. | 14.64 | 0.7422 | 0.3289 | 0.0893 | 0.0388 | 0.0057 | 0.0017 | 4
Texas | 15.11 | 0.564 | 0.2982 | 0.1299 | 0.042 | 0.007 | 0.002 | 3
Alabama | 17.15 | 0.9827 | 0.7661 | 0.4882 | 0.2105 | 0.0549 | 0.0224 | 2
Baylor | 21.97 | 0.9509 | 0.7896 | 0.6099 | 0.4613 | 0.2641 | 0.1056 | 1
Purdue | 15.22 | 0.8863 | 0.4488 | 0.1279 | 0.0596 | 0.0198 | 0.0031 | 4
Arkansas | 15.48 | 0.8505 | 0.5421 | 0.2476 | 0.0832 | 0.0256 | 0.004 | 3
Ohio St. | 18.28 | 0.9848 | 0.6241 | 0.4156 | 0.1757 | 0.0697 | 0.017 | 2
Illinois | 21.96 | 0.9898 | 0.817 | 0.6604 | 0.4685 | 0.3075 | 0.133 | 1
Oklahoma St. | 12.65 | 0.84 | 0.3416 | 0.0659 | 0.0222 | 0.0058 | 0.0008 | 4
West Virginia | 14.53 | 0.9659 | 0.5836 | 0.2001 | 0.0559 | 0.0222 | 0.0038 | 3
Houston | 19.25 | 0.9968 | 0.7562 | 0.5677 | 0.281 | 0.1558 | 0.0503 | 2

My Machine Learning March Madness Cinderella Picks

I am planting my flag on Abilene Christian. First of all, my model thinks that Abilene is better than their seed and Texas is worse than their seed. Even better, the potential second round matchup – BYU – is also bad for their seed. Normally a 14 seed has to beat a 3 and a 6 to make the Sweet 16. Tall task. However, my ratings think Abilene is 12-seed capable. Moreover, Texas is of 5-6 seed quality and BYU is of 9 seed quality. In these terms, this Cinderella story seems much closer to possible.

Will it happen? Probably not. The odds still say it is more likely than not that Abilene Christian loses in the first round and none of this happens. But this is March Madness; weirder things have happened.

The other 10+ seeds I think have a very good chance of making the Sweet 16 and beyond are Rutgers, Syracuse, Utah State, and Virginia Tech. My model thinks that each of these teams has a 14% or greater chance of making the Sweet 16. I would be willing to bet a good sum that at least one – and maybe two – from this group make it that far.
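For anyone who wants to price that bet, the individual Sweet 16 probabilities can be combined directly. The sketch below assumes the four outcomes are independent, which seems reasonable here because the four teams sit in four different first-weekend pods in the ratings table and the simulation draws every game independently. The variable names are illustrative only.

```python
from math import prod

# Sweet 16 probabilities copied from the ratings table.
s16 = {
    "Rutgers": 0.1779,
    "Syracuse": 0.1561,
    "Utah St.": 0.1525,
    "Virginia Tech": 0.1436,
}

# P(at least one advances) = 1 - P(all four fall short), assuming independence.
p_none = prod(1.0 - p for p in s16.values())
p_at_least_one = 1.0 - p_none

# Expected number of these four teams in the Sweet 16 (no independence needed).
expected = sum(s16.values())

print(f"P(at least one reaches the Sweet 16): {p_at_least_one:.1%}")
print(f"Expected number reaching the Sweet 16: {expected:.2f}")
```

On these numbers the chance that at least one of the four gets through comes out close to even, so treat the wager as a fun flyer rather than a lock.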

Other than that, I will let you draw your own conclusions from the bounty of numbers I’ve provided. I’m picking Gonzaga to win, and I think you should too, but I’ll leave it up to you. Use these numbers as a guide, as a suggestion, on which scenarios are most likely. At the end of the day, though, remember this: there is absolutely no way to find clarity in the madness.