What is XGBoost and a Python XGBoost Tutorial
Neural networks may have dominated the early 2010s, but no algorithm dominates modern machine learning discussions quite like XGBoost. XGBoost is a wildly powerful variant of decision trees and random forests that often takes the top prize in online competitions. We’re going to tackle this algorithm by explaining how it works and walking you through a python XGBoost tutorial.
All the code used in this article is going to be available on The Data Jocks Github in the XGBoost Tutorial repository. More interestingly, look for XGBoost to be applied in some upcoming models TDJ is building.
What is XGBoost?
XGBoost stands for extreme gradient boosting. The basis for this algorithm is actually quite complicated and involves three machine learning fundamentals:
- Decision trees
- Random forests
- Gradient Boosting
We’ll quickly describe each of these building blocks and then explain how XGBoost combines all three.
Decision Trees
Decision trees are a machine learning algorithm based on “if…then” statements. For example, if we want to pick whether team X or team Y is more likely to win, a very basic prediction algorithm would look like “if team X is the home team, they will win”.
This is a pretty bad algorithm and is only roughly 53-55% accurate. However, we can nest successive if-then statements inside each other. This leads to a decision hierarchy often called a decision tree, and it can lead to a better model. For example:
- If Team X’s average margin of victory is more than 3 points larger than Team Y’s, then Team X will win.
- If Team X’s average margin of victory is within 3 points of Team Y’s:
  - If Team X is home, Team X will win.
  - If Team Y is home, Team Y will win.
- Otherwise, Team Y wins.
Notice now we’re using two criteria – average margin of victory and home court advantage – to make our decisions. These prediction algorithms can be arranged graphically, and as more levels are added they begin to look very “tree-like”. The same example we just wrote out is shown graphically below.
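If you prefer code to pictures, the same nested rule can be written as a few lines of Python. This is just a toy sketch of the tree above; the margin-of-victory and home-court arguments are made-up inputs, not columns from our actual dataset.

```python
def predict_winner(x_avg_margin, y_avg_margin, x_is_home):
    """Toy decision tree: returns the predicted winner, "X" or "Y"."""
    diff = x_avg_margin - y_avg_margin
    if diff > 3:                  # Team X is clearly stronger
        return "X"
    elif diff >= -3:              # the teams are within 3 points of each other
        return "X" if x_is_home else "Y"   # home team wins
    else:                         # Team Y is clearly stronger
        return "Y"

# Example: evenly matched teams with Team X at home
print(predict_winner(5.2, 4.1, x_is_home=True))  # -> "X"
```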
Random Forests
The main problem with decision trees is that they can very easily overfit to the data they are trained on. That is, they work well on previously seen data but may not work very well when making predictions in the future.
Decision trees have a bad habit of “going deeper” and finding artifacts in the data that explain past events but not future events. We can think of this like cherry picking stats. For example, the following criterion would help us more accurately predict the Heat-Mavs series: “If Team X has Dirk and Team Y has LeBron, then Team X will win”.
This criterion would have done really well in the past. Dirk’s team beat LeBron’s every (one) time they met in the Finals. Therefore, this criterion will increase the accuracy of our prediction model on the training data.
However, if the Mavs and Heat repeated their series there would have been no reason to expect this trend to continue.
Random forests purposely limit the “depth” of each tree so this can’t happen. Instead of creating one big tree, a random forest uses lots of shallow trees to make its prediction. Each individual, shallow tree gets a vote; the votes are tallied and the group decides who wins. Hence the name random forest: it’s just a group of trees.
The various trees are built by taking random subsets of the training data and random subsets of the decision criteria (called features) for each tree. The idea behind this is to limit the overfitting problem. And it works really well.
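As a rough sketch of the idea (this is illustrative and not pulled from our notebook), scikit-learn’s RandomForestRegressor exposes exactly these knobs: many shallow trees, each trained on a random subset of rows and a random subset of features.

```python
from sklearn.ensemble import RandomForestRegressor

# Many shallow trees, each seeing a random subset of training rows (bootstrap)
# and a random subset of features (max_features) at every split.
forest = RandomForestRegressor(
    n_estimators=200,     # number of trees that get a "vote"
    max_depth=3,          # keep each tree shallow to limit overfitting
    max_features="sqrt",  # random subset of decision criteria per split
    bootstrap=True,       # random subset of training rows per tree
    random_state=0,
)
# forest.fit(X_train, Y_train)  # training data is introduced later in the tutorial
```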
Gradient Boosting and XGBoost are just one step past random forests.
Gradient Boosting
Gradient boosting is a variant on how random forests work. As described above, random forests work by generating lots of trees that each vote on the winner. Gradient boosting changes the game just a bit.
The first tree predicts the winner. The second tree predicts how wrong the first tree will be. The third tree predicts how wrong the first two trees combined still are, and so on. The idea is that by adding all of these trees together, we get a pretty good picture of what is going on.
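To make the residual-fitting idea concrete, here is a minimal sketch using two plain scikit-learn decision trees on made-up data. It illustrates the principle only; it is not how the XGBoost library is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # toy features
y = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # toy target

# Tree 1 predicts the target directly.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)

# Tree 2 predicts how wrong tree 1 was (its residuals).
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# The combined prediction is the sum of the two trees' outputs.
combined = tree1.predict(X) + tree2.predict(X)
print(np.mean((y - tree1.predict(X)) ** 2),  # error of tree 1 alone
      np.mean((y - combined) ** 2))          # smaller error for the pair
```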
Finally, XGBoost can be viewed as a specific, highly optimized implementation of gradient boosting with decision trees. The XGBoost algorithm incorporates some other advanced techniques such as regularization, second order optimization methods, and other features which have proven valuable.
The technical details of how XGBoost differs from basic gradient boosting are not essential to implementing such a model. Understanding that the model (a) is decision tree/random forest based and (b) fits subsequent trees on the residuals of earlier trees gives the user a pretty good understanding of the underlying model.
Now that we know a bit about how XGBoost works, we’re going to look at how the XGBoost library’s scikit-learn-style interface works.
Python XGBoost Tutorial
We are going to try to predict how many wins a team will have in the NBA playoffs using their regular season stats and a python XGBoost model. In this Jupyter notebook, we’ll touch on regularization and feature engineering as well. All the code can be found on our Github so you too can play with hyper-parameter tuning.
All the heavy work is done by the python XGBoost library which we will import to use later.
Preliminaries
Actually building a python XGBoost model is astoundingly easy. The most important step is the pre-processing of the data to get it into a form usable by the python XGBoost libraries. While most data pre-processing happens at the data collection step, there is still a small amount of work to be done splitting the data into training and validation sets.
First, we load the relevant packages we need. The imports, sketched in code after the list below, are:
- Pandas for efficient handling of csv files as data frames
- The train_test_split method from sklearn for pre-processing data
- The python XGBoost package itself
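A minimal version of that import block might look like this (the exact code lives in the notebook on our Github):

```python
import pandas as pd                                    # data frames for the csv files
from sklearn.model_selection import train_test_split  # splitting data into train/test sets
from xgboost import XGBRegressor                       # the python XGBoost model class
```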
At this point there is a very important distinction to be made. We decided to import the XGBRegressor class and not the XGBClassifier class. The difference is whether we want our model to perform regression or classification. Classification tasks can be thought of as labeling, while regression tasks are more continuous and granular. We’re predicting playoff wins, which naturally live on a numeric scale rather than in a small set of labels. Therefore the python XGBoost regressor class is more appropriate.
Next, we get rid of columns that we don’t think should be used for prediction. Then, we split off the independent and dependent variables into X/Y variables. Finally, we split these into training and testing sets.
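A sketch of those three steps might look like the following. The file name and column names here are placeholders; the real ones are in the XGBoost Tutorial repository.

```python
# Load the data. The file name and column names below are placeholders;
# the real ones live in the XGBoost Tutorial repository.
data = pd.read_csv("nba_team_stats.csv")

# Get rid of columns that shouldn't be used for prediction (e.g., team name, season).
data = data.drop(columns=["TEAM", "SEASON"])

# Split off the dependent variable (playoff wins) from the independent variables.
Y = data["PLAYOFF_WINS"]
X = data.drop(columns=["PLAYOFF_WINS"])

# Hold out 25% of the rows as a test set the model never sees during training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42
)
```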
The point of splitting into training and testing sets is that we can evaluate not just how our model performs on the data it has seen, but also how good it is at making predictions on data it has never seen before!
The model is trained using only 75% of the data, then evaluated using the remaining 25%. The next step is to actually do this training. Luckily, the python XGBoost implementation makes this part really easy.
Training a Basic Python XGBoost Model
Creating and fitting a basic model takes only two lines of code. First, we create a blank model by instantiating an XGBoost regressor object. Then, we train it using the X/Y data we’ve gathered.
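In code, those two lines look something like this:

```python
model = XGBRegressor()       # a blank model with default hyper-parameters
model.fit(X_train, Y_train)  # train it on the 75% training split
```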
That is truly all it takes to create a python XGBoost model. The rest is just icing and analysis.
Comparison to Other Models
It can always be helpful to compare fancy models like XGBoost to simpler models like linear regression. By using the predict method for both the python XGBoost and Sklearn linear regression models, we can see how well they do when predicting NBA playoff wins.
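Here is a sketch of that comparison, reusing the training and test splits from above and using mean squared error as the yardstick:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a plain linear regression on the same training data for comparison.
linear = LinearRegression().fit(X_train, Y_train)

# Compare test-set error for both models.
xgb_mse = mean_squared_error(Y_test, model.predict(X_test))
lin_mse = mean_squared_error(Y_test, linear.predict(X_test))
print(f"XGBoost test MSE: {xgb_mse:.2f}")
print(f"Linear regression test MSE: {lin_mse:.2f}")
```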
Notice that the linear model does better than the python XGBoost model. Is this a negative for our fancy model when linear regression works better? Not necessarily. Let’s look at some more ways to improve the XGBoost model.
Overfitting Evaluation
One thing that can happen to complex models is overfitting. This means that the model finds peculiarities in the training data that don’t extrapolate outside of that dataset. Overfit models get so comfortable in the world they know that new information confuses them.
To test for overfitting, we need to make predictions on both the training and test splits of the data and compare the results. If the results are comparable, the model is probably not overfit. If they are wildly divergent, the model is probably overfit. This analysis can be done with the predict method of the python XGBoost model. Then, we need to compute the error between the predicted and true values of the Y data.
For us, this means that we are going to predict how many playoff wins a team had given their regular season box score data. Then, we’re going to compare this to their actual wins. We’ll do this for data the model has seen before and for data the model has never encountered.
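A sketch of that check, reusing the mean_squared_error helper imported above:

```python
# Error on data the model has already seen (the training split)...
train_mse = mean_squared_error(Y_train, model.predict(X_train))
# ...versus error on data it has never seen (the test split).
test_mse = mean_squared_error(Y_test, model.predict(X_test))
print(f"Train MSE: {train_mse:.2g}, Test MSE: {test_mse:.2g}")
```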
Here is what we find:
- XGBoost test set mean squared error = 12.7
- XGBoost training set mean squared error = 5e-7
This means that the model is probably overfit because it basically perfectly fits the training data but doesn’t do very well on the test data.
Regularization
Perhaps the best way to fight overfitting is regularization. I am not going to focus on this too much in this article because it is complex. However, a short description will get the point across.
Regularization is a way to decrease the model complexity so that it can’t fit the training set as well. This typically forces the model to find more “global” patterns so that it extrapolates better to test data. This is a gross oversimplification of the process, but it suffices for now.
Our model can be regularized by:
- Limiting the max depth of each tree learner
- Including ridge (L2) or LASSO (L1) penalties
- Changing the learning rate
- Etc.
To see how these work, take a look at the Jupyter notebook we’ve included.
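As a rough sketch, those knobs map directly onto XGBRegressor constructor arguments. The particular values below are illustrative; the tuned values are in the notebook.

```python
regularized_model = XGBRegressor(
    max_depth=3,         # limit the depth of each tree learner
    reg_lambda=1.0,      # ridge (L2) penalty
    reg_alpha=0.5,       # LASSO (L1) penalty
    learning_rate=0.05,  # shrink each new tree's contribution
    n_estimators=300,    # more, smaller steps to compensate for the low learning rate
)
regularized_model.fit(X_train, Y_train)
```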
Feature Engineering
Finally, sometimes further pre-processing of the data helps a lot. This is a way to use domain knowledge to improve a model’s performance. For example, as basketball fans we know that the difference between a team’s offensive and defensive ratings is a better predictor of their quality than either of these two metrics alone. Therefore, that difference might be a valuable data point for our model to have.
The python XGBoost model doesn’t know to look at that difference; it only sees the base columns it is given. Therefore, introducing these engineered features gives our model some help. We did this by using the linear model’s output as a feature that the XGBoost model can use.
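Here is a sketch of both ideas. The ORTG/DRTG column names are hypothetical stand-ins for whatever the dataset actually calls offensive and defensive rating.

```python
X_train_fe = X_train.copy()
X_test_fe = X_test.copy()

# Hypothetical engineered feature: net rating (offensive minus defensive rating).
X_train_fe["NET_RATING"] = X_train_fe["ORTG"] - X_train_fe["DRTG"]
X_test_fe["NET_RATING"] = X_test_fe["ORTG"] - X_test_fe["DRTG"]

# Feed the linear model's prediction to XGBoost as an extra feature.
X_train_fe["LIN_PRED"] = linear.predict(X_train)
X_test_fe["LIN_PRED"] = linear.predict(X_test)

fe_model = XGBRegressor(max_depth=3, learning_rate=0.05, n_estimators=300)
fe_model.fit(X_train_fe, Y_train)
```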
We trained a model using feature engineering and some regularization techniques. With these included, our model outperforms the linear model quite significantly on the test data. Notice also that the gap between our model’s training and test set performance is much less pronounced.
You can use this same process to do your own analysis when building models!
Bonus: 2023 NBA Playoff Favorites
We used our best python XGBoost model to predict the outcome of the 2023 NBA playoffs. The favorites we uncovered were, in order:
- Golden State Warriors
- Denver Nuggets
- Philadelphia 76ers