Sklearn Linear Regression Tutorial for Sports
Linear regression is often the first, easiest, and most versatile tool that statisticians and data scientists learn. The Sklearn linear regression Python class provides the easiest way to implement this powerful technique. With the ability to implement multiple linear regression and Lasso regularization, as well as to report summary statistics like the R2 score, the Sklearn linear regression class is a great tool.
In this article we’re going to use NBA team ratings as an example to study the Sklearn linear regression class. Click here to read our previous article about the Sklearn logistic regression class!
All of the code used in this example is available on our github here!
What is Linear Regression? A Basic Example
The simplest way to describe the technique we’re using here is that linear regression computes the “line of best fit” to observed data. In plain language: if you have a bunch of data points, drawing a line that matches them helps us identify the relationship between the two variables and the strength of that relationship. Below is an example.
In this graphic, we plotted a team’s offensive rating against their average margin of victory in the 2022-23 NBA season. Intuition tells us that teams with better offenses tend to win by more points. The blue linear regression line in the above plot supports this intuition with data in two ways:
- Because the slope of the line is positive, an increase in offensive rating is associated with an increase in average margin of victory. Better offenses lead to winning by more on average.
- Because the points are relatively close to the line and tend to follow the trend well, we are confident in the conclusions made in the first point above.
Linear regression is often the first analytical technique people learn when starting out in data analytics. It is simple, powerful, interpretable, and versatile. The math to compute the line of best fit is not really that complicated – it uses an idea called the Moore-Penrose pseudoinverse. However, the Sklearn linear regression class makes things even easier by letting you skip all of the math and get straight to the results.
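As a quick aside for the mathematically curious, here is a minimal sketch of computing a line of best fit by hand with the pseudoinverse; the data values below are made up purely for illustration:

```python
import numpy as np

# Made-up example data: offensive ratings (x) and margins of victory (y)
x = np.array([110.0, 112.5, 115.3, 117.8, 118.3])
y = np.array([-2.1, 0.5, 2.3, 4.0, 6.5])

# Design matrix with a column of ones so the line gets an intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares coefficients via the Moore-Penrose pseudoinverse
intercept, slope = np.linalg.pinv(X) @ y
print(intercept, slope)
```

The Sklearn class computes the same coefficients for you behind the scenes.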
Now we can really dig into the Sklearn linear regression tutorial.
How to Run the Sklearn Linear Regression Model
All the code found in this tutorial can be found on our Github here. I’ll assume that you have pandas, numpy, matplotlib, and Sklearn installed. I’ll also assume that you are comfortable with Jupyter notebooks; if not, the video below can help.
Building a Basic Model
Building a basic Sklearn linear regression model requires very little work. There are three basic steps to do this when you are using Pandas data frames:
- Load the data
- Reshape the data
- Fit the model
Loading the data is trivial using the built-in functionality of pandas. An especially helpful tool is the “head” method of a data frame, which shows what your data looks like in the first few rows.
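A minimal sketch of this step, assuming the team ratings are stored in a CSV file (the file name here is hypothetical):

```python
import pandas as pd

# Load the 2022-23 NBA team ratings (hypothetical file name)
df = pd.read_csv("nba_team_ratings.csv")

# Peek at the first few rows to see what the data looks like
print(df.head())
```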
For us, we’ll start by looking at the relationship between a team’s offensive rating (ORtg) and their average margin of victory (MOV). To do this, we need to grab these two columns and convert them into numpy arrays. Then, the arrays need to be reshaped into two-dimensional column vectors of shape (n, 1). If you forget this step, don’t worry: the error message raised by scikit-learn suggests the correct fix!
A second method to get the data into the correct format is shown later when we perform multiple linear regression in Sklearn.
After converting the data, all you need to do is instantiate an object of the Sklearn linear regression class and fit the model on the data we just converted to the right format! These steps are shown in the screen grab below.
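In code, these steps look roughly like the following sketch, assuming the columns are named ORtg and MOV as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Grab the two columns as numpy arrays and reshape them into
# two-dimensional column vectors of shape (n, 1)
X = df["ORtg"].to_numpy().reshape(-1, 1)
y = df["MOV"].to_numpy().reshape(-1, 1)

# Instantiate the model object and fit it to the data
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)
```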
At this point, our model has been computed and the hard work of model building is done. That’s right, it only took us a few lines to fit a model! The rest of the fun with Sklearn linear regression is still to come though. First, let’s look at how well the model worked and evaluate its fit.
Evaluating Your Sklearn Linear Regression Model with R2 Score
The first thing you should always do when fitting a Sklearn linear regression model is to look at how it fits the data. That is, we just computed a line of best fit, so we had better plot it to see what it looks like!
The linear regression scikit-learn object has a method called “predict”. This method takes an input and returns the model’s predicted output for it. You can use the predict method on the training data to see how the model looks. For us, this results in the following plot (after plotting with matplotlib).
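Continuing the sketch from above, the plot can be produced roughly like this:

```python
import matplotlib.pyplot as plt

# Predict the margin of victory for every team in the training data
y_pred = model.predict(X)

# Scatter the raw data and overlay the fitted line of best fit
plt.scatter(X, y, label="Teams")
plt.plot(X, y_pred, color="blue", label="Line of best fit")
plt.xlabel("Offensive Rating (ORtg)")
plt.ylabel("Margin of Victory (MOV)")
plt.legend()
plt.show()
```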
Recall from earlier that we said a model is better when the training data is closer to the line of best fit. The coefficient of determination – alternatively called the R2 score – measures this effect.
While not the focus of this article, we’ll include a few sentences on R^2. The usual interpretation is “the amount of variation in y which is explained by the variation in x”. I like to explain it in a few more words.
The y values of our data have a certain variance – variance measures spread around the mean, and so describes how hard the value is to predict with no other information. However, using the model we can predict the y values more accurately. That is, our errors in predicting y values when we have the x values and the line of best fit will be smaller. The R^2 value is the fractional decrease in variance when predicting with the line compared to predicting without it.
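In symbols (a standard formula, where y_i are the observed values, ŷ_i the model’s predictions, and ȳ the mean of the observed values):

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```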
The coefficient of determination can be computed using the “score” method of the object. This is shown below:
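Continuing our sketch from above:

```python
# R^2 score of the fitted model on the training data
r2 = model.score(X, y)
print(r2)
```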
In our example, the R2 score is about 0.68, which indicates a moderately strong relationship. In the next section we’ll study more complex models and how we can use R^2 to draw conclusions.
Regularization in Scikit-Learn for Multiple Linear Regression
Building a multiple linear regression model in Sklearn is no more difficult. Let’s see how it is done below.
Sklearn Multiple Linear Regression
Just like last time, we need to separate the independent and dependent variables into usable data types. Previously we showed the method of converting the data into numpy arrays. Now we’ll show the method of feeding pandas dataframes into the model directly.
Last time we predicted the margin of victory using offensive rating. Now, let’s add in defensive rating to see how much better our model is. To do this, we need to subset the pandas dataframe and retain only the offensive/defensive rating columns. We do the same for the margin of victory column in a separate variable. Then, we fit the model exactly like before!
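A sketch of this, assuming the defensive rating column is named DRtg:

```python
from sklearn.linear_model import LinearRegression

# Keep both rating columns as features and MOV as the target
X = df[["ORtg", "DRtg"]]
y = df["MOV"]

# Pandas data frames can be passed to the fit method directly
model = LinearRegression()
model.fit(X, y)

print(model.score(X, y))
```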
Notice how this isn’t any more difficult than before! In fact, this method of passing in data frames to the “fit” method might even be easier.
You might be looking at the R2 score and thinking that this model is incredible. However, this is where we need to use domain knowledge to build better models. Margin of victory is a linear function of a team’s offensive and defensive ratings. Therefore the Sklearn linear regression model fits the data essentially perfectly; the only deviation from a perfect fit here comes from rounding errors.
So while our model is nearly perfect, it isn’t that impressive. One way to describe what is happening here is that the offensive and defensive ratings contain all the information that margin of victory contains.
Lasso Regularization
Margin of victory is a perfect linear function of the offensive and defensive rating variables. Therefore, if we use additional features to try to predict margin of victory, we shouldn’t expect any better performance. Look at what happens below!
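A sketch of this step; the adjusted rating column names ORtg_A and DRtg_A are hypothetical placeholders for whatever your data source calls them:

```python
# Add the advanced rating columns alongside the originals
X4 = df[["ORtg", "DRtg", "ORtg_A", "DRtg_A"]]
y = df["MOV"]

model4 = LinearRegression()
model4.fit(X4, y)

# The R^2 score creeps slightly above the two-feature model's score
print(model4.score(X4, y))
```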
Our model actually got better when we added in an advanced version of offensive and defensive rating. You can tell because the R^2 score increased, even if slightly. This shouldn’t happen, right?
Remember that the only errors in our two-feature model were due to rounding errors in the data. Adding in the other two features – though it adds no new information – gives the model more ability to decrease this rounding error.
“Lasso” in a Sklearn linear regression model is designed to counteract this exact effect. The mathematics behind it are complicated, yet well understood in the mathematical community. Lasso uses L1 regularization, which is inspired by the theory of compressed sensing. Compressed sensing is the study of algorithms for recovering sparse vectors from various measurement schemes. Compressed sensing happens to be one of the topics I studied when getting my degree.
The point of Lasso in building statistical models is that it helps us determine which variables are actually important and which are redundant. Look at what happens when we build a Lasso Sklearn linear regression model on the same dataset as before.
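A sketch of the Lasso fit; the regularization strength alpha is an assumed value here, not one taken from a particular run:

```python
from sklearn.linear_model import Lasso

# Same four features as before (ORtg_A and DRtg_A are hypothetical names)
X4 = df[["ORtg", "DRtg", "ORtg_A", "DRtg_A"]]
y = df["MOV"]

# alpha controls the strength of the L1 penalty
lasso = Lasso(alpha=0.1)
lasso.fit(X4, y)

# Inspect the fitted coefficients; the redundant features get weight 0
print(lasso.coef_)
```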
Using our domain knowledge and expertise, we argued that the third and fourth features were superfluous and didn’t add any predictive value over the first two. The Lasso Sklearn linear regression model has verified this intuition by computing coefficients of 0 for the last two variables! This means that neither of these two variables comes into play at all when making predictions.
Lasso is incredibly powerful because it lets us identify those features which are actually important in the prediction process. When doing multiple linear regression in Sklearn, Lasso should be a tool to remember.
Why No Train/Test Split?
Some data scientists reading this article might complain that we messed up because we didn’t split the data into a training set and a testing set. But we encourage the reader to think about the difference between descriptive statistics and inferential statistics.
Descriptive statistics are used to describe what happened in the past. Inferential statistics are used to predict what might happen in the future. In the world of inferential statistics, we are often concerned with models overfitting the data. To control for this effect, we often split the data into training, validation, and testing sets to aid the model design.
However, in this article we only ever used our stats in a descriptive way. We described the relationships that existed between the data; we weren’t interested in predicting how those relationships will continue into the future. It is for this reason that we did not use a train/test split.
Conclusions
Linear regression is the simplest yet possibly most powerful tool a data scientist can learn. Linear models are both broadly applicable and surprisingly powerful. By combining these models with tailored regularization techniques such as Lasso, many problems become easy to solve. The Sklearn linear regression class is an excellent tool to add to every data scientist’s repertoire.
To receive email updates when new articles are posted, use the subscription form below!