What is Data Leakage in Machine Learning?
Nearly every data scientist or sports analyst has learned a painful first-hand lesson about data leakage. Let me paint a picture.
You come up with a great idea, spend hours refining and building your model, and finally test it against real data. You compare your results with other well-known models or Vegas odds and realize: oh my gosh, I can make a lot of money, I'm so smart.
And then you go to start using your model in real life and it doesn’t work as well as it should. You go back and look at your code and realize that through an extremely complicated turn of events, you were using information about the result of the game to predict the result of the game.
To say that another way, without meaning to, you used information that you shouldn't have had access to when making predictions.
This is what we mean when we talk about data leakage in machine learning. What is data leakage? Why is it a problem? And, most importantly, how can you avoid it?
What is Data Leakage?
In order to understand what data leakage is, we first need to make sure everyone is on the same page when it comes to the goals of machine learning. It turns out the goals when building sports analytics models are basically the same.
In building sports models, and in machine learning more generally, the goal is to predict the future given past data. A good model is one that predicts the future more accurately than others. Because every modeler works from essentially the same data, the craft of modeling is figuring out how to extract as much information as possible from that data.
Data leakage happens when we accidentally let our model look at more data than it should be allowed to look at. This could be information from the future or information that is otherwise hidden (like cards in poker).
Remember, the point is to make predictions about the future. Data leakage happens during model training when information about the thing we're trying to predict is accidentally reflected in the data we're using to make the predictions. Sometimes an example makes this clearest.
An example of data leakage would be a computer poker bot that knows some of the cards you’re holding in your hand. The computer has more information than it should so of course it will win more than it should.
Let's try another example. Anyone who hasn't been living under a rock the last year knows the Astros cheated on their way to the 2017 World Series title. They got in trouble for sign stealing: knowing what the next pitch will be before it is even thrown. This is data leakage; you have access to information before you're supposed to. And just like before, the leakage makes the job easier for whoever the data is leaked to.
Sometimes, though, data leakage can be really subtle. The next section shares my own experience with data leakage and how sometimes you don’t realize what you’re doing.
My Own Example
A few years ago I was trying to build a baseball model, kind of like the one I describe here. The model was Monte Carlo based and tried to predict baseball outcomes by simulating games thousands of times.
I thought it was a pretty good idea. After all, I have always thought that baseball could be modeled perfectly by simulation and that hot streaks and personality factors don’t really matter (a perhaps controversial opinion that I’ve backed up with data a few times).
To do this, I needed to use individual player stats. I needed the probability of individual pitchers allowing hits, walks, and home runs. I needed the same data for hitters. I combined this data to model an at bat.
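Here's a rough sketch of what one simulated at bat could look like. The rates below are made up, and blending the batter's and pitcher's probabilities with a plain average is just one simple choice for illustration; methods like log5 combine the rates more carefully.

```python
import random
from collections import Counter

# Made-up per-plate-appearance rates; in the real model these came from
# year-to-date stats. "hit" here means a non-home-run hit.
batter  = {"hit": 0.20, "walk": 0.08, "home_run": 0.04}  # remainder = out
pitcher = {"hit": 0.18, "walk": 0.07, "home_run": 0.03}

def simulate_at_bat(batter, pitcher):
    """Simulate one plate appearance by blending batter and pitcher rates.

    A plain average of the two rates is used purely for illustration;
    odds-ratio ("log5") style combinations are more principled.
    """
    p_hr   = (batter["home_run"] + pitcher["home_run"]) / 2
    p_hit  = (batter["hit"] + pitcher["hit"]) / 2
    p_walk = (batter["walk"] + pitcher["walk"]) / 2
    r = random.random()
    if r < p_hr:
        return "home_run"
    if r < p_hr + p_hit:
        return "hit"
    if r < p_hr + p_hit + p_walk:
        return "walk"
    return "out"

# Monte Carlo: run thousands of at bats and look at the outcome mix.
print(Counter(simulate_at_bat(batter, pitcher) for _ in range(10_000)))
```

Chain enough of these together and you have a game; repeat the game thousands of times and you have a win probability.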
This is where the data leakage problem started rearing its head. To keep the simulation accurate, I used each player's year-to-date averages when making predictions. Here is the problem: I used a player's averages from after the game I was trying to predict instead of from before it.
This counts as data leakage because we used information we shouldn’t have had access to. We should only have used the player’s stats from before the game, not from after. This is because when trying to predict the outcome of the game, we only have access to the older data.
Does this actually matter though? It turns out it does. My model was beating Vegas by about 2-3%. This is enough to make a ton of money. Most of my successes happened early in the season. This should make sense because one game of data contributes a lot more to a player’s stats early in the season than late in the season.
For example, if it was a pitcher’s first start and I used their ERA for predictions, then I basically told my model how many runs the opposing team scored in the game. No wonder it was overly accurate.
The point is that data leakage can be very subtle. I used data through game n when I should have used data only through game n-1. And that made all the difference.
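If I were redoing it today, the fix is about one line of pandas: compute each player's running average and shift it back one game, so the feature for game n only ever sees games 1 through n-1. The game log below is made up for illustration.

```python
import pandas as pd

# Made-up game log for one pitcher; swap in your own data source.
games = pd.DataFrame({
    "game_num":     [1, 2, 3, 4, 5],
    "runs_allowed": [6, 2, 3, 1, 4],
})

# LEAKY: the full-season average includes the game being predicted.
games["avg_leaky"] = games["runs_allowed"].mean()

# CORRECT: expanding average shifted one game back, so the feature for
# game n uses only games 1..n-1 (game 1 has no prior data, hence NaN).
games["avg_correct"] = games["runs_allowed"].expanding().mean().shift(1)

print(games)
```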
Examples of Data Leakage in Sports
- Using correlated data is a form of leakage. Suppose we want to predict how many hits a player gets in a game. It might seem like the number of outs the player made in that game is fine to use as a training feature, but it isn't. In the extreme case, if a player made no outs, you know they got quite a few hits, probably 4 or 5. Knowing how many outs a player recorded hands the model part of the answer. This is an example of data leakage in sports because the number of outs a player recorded wasn't available before the game.
- Sometimes a simple assumption can lead to data leakage. Suppose we're trying to predict how many points an NBA player will score in their career. Because the answer depends so much on how long they play, we let our model know how many years the player's career lasted. But career length isn't known until the career is over, and players with long careers tend to be much better. Knowing a player lasted 20 years makes it far more likely they were a high scorer than a bench player.
- Suppose we want to predict the outcome of playoff games in baseball or basketball. In both sports, playoff winners are determined by who wins a series of games. If you build a model using a randomly selected subset of playoff games, that subset may include games 5, 6, or 7, which aren't guaranteed to happen. This is an example of data leakage in sports because knowing those games happened inadvertently gives away information about the outcomes of games 1 through 4. The fix is to split by series rather than by individual game, as in the sketch after this list.
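Here's a small sketch of that series-aware split. The data is random placeholder data; the point is that scikit-learn's GroupShuffleSplit keeps every game from a given series on the same side of the train/test boundary.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder playoff data: each row is one game, tagged with its series.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                    # 12 games, 3 features
y = rng.integers(0, 2, size=12)                 # 1 = home team won
series_id = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# Split whole series, not individual games, so game 7 of a series can
# never sit in training while game 3 of the same series sits in testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=series_id))
print("train series:", sorted(set(series_id[train_idx])))
print("test series: ", sorted(set(series_id[test_idx])))
```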
The point of these examples is to illustrate how easy it is to introduce data leakage into a machine learning model without intending to. It isn't about ill will or academic dishonesty; sometimes data leakage happens simply because we don't realize what we're doing.
What is the Difference Between Data Leakage and Overfitting?
While data leakage in machine learning is a common topic, more common still is the idea of overfitting. Overfitting is what happens when a model is trained too long, or given too much flexibility, and memorizes the quirks of the specific training data it sees instead of learning patterns that generalize. It would be like training a model only on the 1960s and 1970s NBA and using it to make predictions for the 2023 season.
Overfitting is a big problem in machine learning; it is one of the motivations for the regularization and ensembling built into methods like XGBoost. Data leakage is a problem too, but it isn't quite as common.
The reason data leakage and overfitting get confused in machine learning circles is that they cause the same effect in model validation: both show up as a model performing better on training data than on testing data. They are different causes that produce the same symptom.
It is important to remember, though, that data leakage and overfitting are very different. Data leakage is basically cheating: giving your model information it shouldn't have. Overfitting isn't cheating; it is just what happens when you don't quite train your model correctly.
How to Avoid Data Leakage
We’ve spent this entire time talking about what data leakage is and why it is bad. But that doesn’t really help machine learning practitioners who are scared of data leakage showing up in their own work. I’ve found there are two ways to easily avoid data leakage.
The first way to avoid data leakage is to look at your training accuracy. Sometimes your data leakage is so extreme that you accidentally predict 99 or 100% of cases perfectly. This usually happens when you’re accidentally using something like “total points” to predict “points per game”.
If you see your model with a 99 or 100% accuracy during training, alarm bells NEED to be going off in your head. Unless you’re doing a very easy prediction problem, the odds your model is this good are nearly 0.
There are two problems with relying only on training accuracy to fend off data leakage. First, if your training accuracy is 99 or 100%, you may simply be overfitting the training data; it isn't a guarantee of leakage. Second, data leakage often doesn't cause an effect this dramatic. Usually it is responsible for only a few percentage points of extra accuracy. Subtle, and easy to miss.
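Still, the check is cheap enough that it's worth wiring into your workflow. Here's one way it could look; the logistic regression and the 99% threshold are arbitrary choices of mine, and X and y stand in for whatever features and labels you're using.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def leakage_smoke_test(X, y, threshold=0.99):
    """Crude alarm: near-perfect training accuracy on a nontrivial problem
    usually means leakage or severe overfitting; investigate either way."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
    if train_acc >= threshold:
        print("ALARM: training accuracy is suspiciously high.")
    return train_acc, test_acc
```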
The most foolproof way to eliminate data leakage in your machine learning work is to actually try to "productionize" your model. In sports, this would look like trying to predict tonight's games before they're played. If you can do this, then there is absolutely no way you are cheating and using "after the fact" information (unless you're a time traveler).
If your model works to make predictions about the actual future, then by definition there is a 0% chance you're experiencing data leakage.
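The discipline that makes this work is an "as of" cutoff: every feature that goes into tonight's prediction is stamped with when it became known, and anything stamped after first pitch gets filtered out. A tiny sketch of that idea, with made-up data:

```python
from datetime import datetime

def features_as_of(stat_rows, game_start):
    """Keep only stats recorded strictly before the game starts.

    stat_rows: (timestamp, value) pairs; a stand-in for however your
    pipeline actually stores player stats.
    """
    return [(ts, value) for ts, value in stat_rows if ts < game_start]

# Made-up example: tonight's game can only see stats through yesterday.
rows = [
    (datetime(2023, 5, 1, 22, 0), 0.251),  # batting avg after May 1 game
    (datetime(2023, 5, 8, 22, 0), 0.260),  # after tonight's game: leakage!
]
print(features_as_of(rows, game_start=datetime(2023, 5, 8, 19, 0)))
# -> only the May 1 row survives
```

If every feature has to pass through a filter like this before prediction time, information from the future simply has no way in.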