Multicollinearity in Sports and NBA Box Plus Minus

Sometimes terms and ideas in math can be quite a mouthful. We have things like endomorphisms and pseudofunctors in category theory as well as homoscedasticity and multicollinearity in machine learning.

Mathematicians name things intelligently, though; we aren’t picking random words. Pseudofunctors are objects that are kind of like functors (which are, in turn, kind of like functions). In the same way, multicollinearity in machine learning and sports analytics is the idea of many things (multi) all existing on the same (co) line (linearity).

What this means and why it is an important machine learning concept is the topic of today’s article. We’re also going to touch on why this concept is important in sports analytics. In particular, we’re going to take a look at how multicollinearity shows up in stats based on +/- in the NBA. For a refresher, check out our past article on real plus minus.

How does multicollinearity show up in sports? And why is it a problem?


What is Multicollinearity?

The entire goal of machine learning is to predict the outcome of some event given data about that event. Maybe we want to predict which stocks will go up based on past data. Maybe we want to predict which marketing campaign will drive the most sales. Or maybe we want to predict which team will win the Super Bowl using data from the season.

Part of the art of data science and machine learning is deciding which variables (observables) to use to predict the outcome. In the stock market case, maybe we have to decide whether to use

  • The history of the stock in question,
  • The history of the particular stock and all related stocks,
  • The history of all stocks, or
  • The history of all stocks plus Twitter data indicating sentiment.

And we can go on and on. More variables isn’t always better. The more variables you use, the higher the chance that your model finds spurious, random relationships in the training data. This can cause overfitting.
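As a quick illustration of that risk, here is a sketch using purely synthetic data: the more random, irrelevant features we hand an ordinary least squares model, the better it appears to fit pure noise on its own training data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50
y = rng.normal(size=n_samples)  # pure noise: nothing should be able to predict this

r2_by_features = {}
for n_features in (2, 10, 45):
    X = rng.normal(size=(n_samples, n_features))  # random, irrelevant features
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2_by_features[n_features] = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    print(n_features, "features -> training R^2 =", round(r2_by_features[n_features], 2))
```

With 45 random features and only 50 samples, the training fit looks excellent even though the target is noise. That apparent fit is entirely spurious, which is exactly the overfitting danger described above.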

Multicollinearity is a phenomenon that happens in data science when two or more variables are inextricably linked and their effects cannot be separated. Let’s look at an example.

Multicollinearity Example

I want to give two examples of multicollinearity, one in sports and one not. Remember, the multicollinearity phenomenon shows up when two or more variables are capturing the same information in a model.

In healthcare, doctors usually start an exam by taking a patient’s history. They’ll ask things like age, weight, exercise level, blood pressure, etc. They often collect this data in order to get more context for deciding what ailment a patient actually has.

However, it is fairly likely that a doctor could predict blood pressure just by knowing the age, weight, and exercise level of the patient. That is, blood pressure is correlated with the other variables. This is an example of multicollinearity.

In sports, someone might try to build a model using both “whether or not a team won” and “how much they won by”. However, knowing by how much a team won or lost actually tells you whether or not they won. That is, one variable tells you about another. This is another example of multicollinearity.
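This particular pair is easy to see in code. In the sketch below (the game margins are made up for illustration), the win flag is computed directly from the point margin, so the two variables carry overlapping information:

```python
import numpy as np

# Hypothetical point margins for ten games (positive = win)
margins = np.array([12, -5, 3, -8, 20, -2, 7, -15, 1, 9])

# "Did the team win?" is fully determined by the margin
wins = (margins > 0).astype(float)

# The win flag adds nothing the margin doesn't already contain,
# and the two variables are strongly correlated
corr = np.corrcoef(margins, wins)[0, 1]
print("correlation between margin and win flag:", round(corr, 2))
```

Feeding both variables into one model is feeding it the same information twice.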

Why Does Multicollinearity Matter in Machine Learning?

OK, so far we understand multicollinearity and we’ve seen examples of it. But why does it matter? Why is it something we talk about?

In short, when variables are multicollinear, it can be hard to separate their effects. If two variables are very highly correlated, it is hard to tell their individual contributions apart.

In the healthcare example above, weight and blood pressure are highly correlated. Because they often go hand in hand, it can be difficult to tell whether a patient’s weight or their blood pressure is causing a medical ailment.

In the science of model building, multicollinear variables can behave erratically. In linear regression, a variable’s impact is measured by its coefficient. For variables that are multicollinear, the coefficients can vary dramatically depending on the data set. This is also related to the bias-variance tradeoff that shows up throughout machine learning (which I’ve written about here).

In particular, multicollinear variables will sometimes exhibit huge coefficients when a model is built, due to this instability.
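Here is a small synthetic sketch of that instability. Below, `x2` is nearly a copy of `x1`, and fitting the same regression on two halves of the data tends to produce very different individual coefficients, even though their sum stays close to the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1: severe multicollinearity
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly matters

X = np.column_stack([x1, x2])

# Fit the same model on two halves of the data
coef_a, *_ = np.linalg.lstsq(X[:100], y[:100], rcond=None)
coef_b, *_ = np.linalg.lstsq(X[100:], y[100:], rcond=None)
print("first half coefficients: ", np.round(coef_a, 1))
print("second half coefficients:", np.round(coef_b, 1))

# The individual coefficients are unstable (their standard errors are huge),
# but their *sum* stays near the true combined effect of 3
print("sums:", round(coef_a.sum(), 1), round(coef_b.sum(), 1))
```

The model can only pin down the combined effect of the correlated pair; how that effect gets split between the two coefficients is essentially arbitrary, which is why the individual values jump around.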

More analysis of the effects of multicollinearity can be found at this great link.

Multicollinearity and Correlation vs. Causation

There is an example of multicollinearity that I still remember to this day from a long-ago stats class. The teacher was explaining the concept of confounding variables, which are closely related to multicollinearity. Confounding variables are those whose effects are hard to separate in studies, very similar to what we’re talking about here.

The teacher got up in front of the class and said, “you can look at the data as long as you want, but no matter how you cut it, it is a FACT that areas with more churches have more crime”. Coming from a teacher at a Catholic school, this was a wild statement to make.


Obviously the first thought is that churches cause crime. But this is precisely not the case. In fact, there is an underlying variable that explains away the issue. It turns out that areas with high population have higher crime. It also turns out that areas with high population also have a lot of churches.

Churches and crime rates are correlated because they both depend on the same underlying variable: population. Churches do not cause crime.

The same type of confusion shows up in models where variables exhibit multicollinearity. Because they are related, because they are correlated, it is hard to separate the effects of one from the effects of another. For this reason it can be difficult to determine causation in a model with multicollinearity.

Where Multicollinearity Shows Up in Sports Analytics

This is a sports analytics blog, so I always want to bring things back to that world. One of my favorite NBA stats to talk about is box plus minus and its variants. Plus minus is such a simple stat: determine how good a player is by looking at how a team’s quality changes when they are on or off the court.

The problem with plus-minus based stats is that when a player is on the court, they are not solely responsible for what happens in those minutes. They are just one of 10 players on the court.

For example, in the 2021-2022 season, Jae Crowder had the 12th best +/- in the league. He is definitely not the 12th best player in the league. Crowder simply benefited from playing on the Suns during their dominant run. Three Suns (Bridges, Booker, and CP3) finished above him that year in +/-.

On-Off Splits

The first solution to the plus-minus problem above that most people think of is to take into account the other players on the court. Yes, Jae Crowder was a net +395 in his minutes in the 21-22 season, but CP3 was +460 and Booker was +469.

Per game, this is roughly a net +5 rating for each of these three players. This means that Crowder + Booker + CP3 together are a +5. This is very different from just observing that Crowder alone was a +5. Their combined abilities led to that point differential, not Jae Crowder’s skill alone.

The key is to try to figure out how much of that +5 was Paul’s, how much of it was Booker’s, and how much was Crowder’s. The way to do this is to look at how subsets of these three played without the others. If CP3 and Booker were also +5 when Crowder was off the court, we can tell that Crowder didn’t contribute much; he is probably a +0 player.

However, if CP3 and Booker were only +3 when Crowder was off the court, that means Crowder was responsible for a +2 rating. Going through this same calculation for all lineups lets us figure out exactly how much each player is worth.
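The arithmetic of that on-off comparison is simple (using the hypothetical per-game numbers from the example above):

```python
# On-off logic from the example above (per-game numbers, hypothetical)
lineup_with_crowder = 5      # CP3 + Booker + Crowder lineups are a net +5
lineup_without_crowder = 3   # CP3 + Booker without Crowder are only a +3

# Crowder's estimated individual contribution is the difference
crowder_rating = lineup_with_crowder - lineup_without_crowder
print("Crowder's estimated rating:", crowder_rating)
```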

At least in theory. This is where multicollinearity rears its head.

Multicollinearity and Plus Minus Stats

The problem that we’re trying to solve is determining each individual player’s +/- rating. We do this by looking at how different lineups played together. If the five players on the home team have plus-minus ratings h_1,\dots, h_5, then the overall plus-minus for that lineup should be \sum_{i=1}^5 h_i . We can do the same thing with the away team’s ratings to get an overall team +/- of \sum_{i=1}^5 a_i .

Then, if the home team outscored the opponents by x points with these lineups, we know that \sum_{i=1}^5 h_i - \sum_{i=1}^5 a_i = x . If we do this for all lineups that occur over the course of the season, we get a system of equations. Solving this system lets us determine the values h_i, a_i that are the ratings of the individual players.
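On a toy example, this system can be set up as a matrix with one column per player and one row per lineup stint, then solved with least squares. The stints and margins below are made up for illustration, with only four players and 2-on-2 "lineups" to keep the matrix small:

```python
import numpy as np

# Toy example: four players. Each row is one lineup stint: +1 if the player
# was on the "home" side, -1 if on the "away" side. In a real model there is
# one column per player in the league and one row per stint all season.
rows = np.array([
    [+1, +1, -1, -1],   # players 0 and 1 vs players 2 and 3
    [+1, -1, +1, -1],   # players 0 and 2 vs players 1 and 3
    [+1, -1, -1, +1],   # players 0 and 3 vs players 1 and 2
])
margins = np.array([4.0, 2.0, 2.0])  # net points for the "home" side in each stint

# Least squares finds the ratings that best explain the observed margins
ratings, _, rank, _ = np.linalg.lstsq(rows, margins, rcond=None)
print("estimated ratings:", np.round(ratings, 2))
print("rank of the system:", rank)
```

One design detail worth noting: adding the same constant to every player's rating changes no margin (each row's entries sum to zero), so this system is never full rank; `lstsq` handles that by returning the minimum-norm solution.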

However, if you try this on real NBA data, you’re going to get nonsensical answers. I encourage you to code it up and see what happens. The reason is multicollinearity.

Typically, lineups don’t change much game-to-game. Sometimes players will play all of their minutes alongside the same teammates; this is sometimes called a “platoon” system. The problem is that if Jae Crowder and Chris Paul play all of their minutes together, it is impossible to tell which one of them was truly responsible for the +/- that happened while they were on the court. Their impacts are indistinguishable.
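We can see the platoon problem directly in a lineup matrix like the one sketched earlier. In the made-up stints below, players 0 and 1 always appear together, so their columns are identical and no solver can split credit between them:

```python
import numpy as np

# Hypothetical stints for five players. Players 0 and 1 are "platooned":
# they appear in exactly the same stints, so their columns are identical.
rows = np.array([
    [+1, +1, -1, -1,  0],
    [+1, +1,  0, -1, -1],
    [+1, +1, -1,  0, -1],
])

print("columns 0 and 1 identical:", bool(np.all(rows[:, 0] == rows[:, 1])))
print("matrix rank:", np.linalg.matrix_rank(rows), "for", rows.shape[1], "players")

# Any solution can shift value between the two platooned players without
# changing a single equation: if (h0, h1) fits the data, so does
# (h0 + c, h1 - c) for any constant c. Only their sum is identifiable.
```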

To say this statistically: players who play a significant share of their minutes together suffer from an identifiability problem. This makes it difficult to estimate their individual quality.

The problem persists even when players merely play most of their minutes together. The small number of minutes they play separately can have an unduly large effect on their ratings. This is why plus-minus stats are hard to build, and why there are so many variants of this NBA stat.

Future Work

The reason we decided to write about this is that we’re attempting to build a model that solves this problem. One way to solve it is using priors: essentially, guesses based on other stats that tell us how good a player is. This is how real plus minus works.

We’re going to try a different approach. We will assume that most players are basically a 0 rating. Instead of giving ratings to everyone in the league, we’re only giving ratings to the top 30 or 60 players. Then, everyone else rolls up into a “rest of the team” rating. For example, maybe the Lakers’ rating is a sum of LeBron’s rating, Anthony Davis’ rating, and the rest of the team combined.

This is a work in progress, so check back soon!