Sklearn Logistic Regression Example in Sports

Python has become the tool of choice for nearly every budding data scientist. Sklearn, including the Sklearn logistic regression module, is one of the most important tools for a data scientist to be fluent with. The goal of this article is to show how to use the Sklearn logistic regression module and apply it to an example sports analytics question.

All the code and data shown here to implement an Sklearn logistic regression model are available on our GitHub here.

We fit an Sklearn logistic regression model to NFL historical data


What is Logistic Regression?

Logistic regression is a specific type of regression model. Regression models are ways to “fit curves” to observed data. The most common regression model, one nearly everybody has seen, is the least squares regression line (LSRL), also known as a “line of best fit”.

Every regression model has its use cases, situations where it is best applied. For example, the LSRL is only appropriate when:

  • The independent variable (the x axis) is continuous
  • The dependent variable – the quantity to be predicted, AKA the y axis – is also continuous
  • The relationship between the two is approximately linear, meaning a unit change in the input always results in a proportional change in the output

The following shows a least squares regression line from a previous article about DFS strategies including boom bust players and stacking.

Linear regression is just one type of regression model

Logistic regression is simply a different type of regression problem with different goals. Logistic regression is most appropriately used when:

  • The independent variable is continuous
  • The dependent variable is Boolean (true/false)
  • The probability of a positive outcome varies smoothly with the input, with the log-odds approximately linear in the input

For example, logistic regression might be appropriate if we’re predicting whether a basketball shot is made or missed (a Boolean value) given the distance from which the shot was taken (a continuous value). Or, you might want to predict how likely a chess player is to win a match given the two players’ Elo ratings.

Logistic regression gives us the tools to build such a model. There is one subtlety, though. When building a logistic regression model, the required data is a set of pairs (X, Y) where X is continuous and Y ∈ {0, 1} is Boolean. Using X, we want to predict Y. However, the logistic regression model doesn’t output a Boolean. Rather, it outputs the probability that the Boolean is true.

Therefore, logistic regression models map continuous inputs to probabilities of events happening. These probabilities can be converted to Boolean predictions by thresholding at 50%! Before moving on to a discussion of the Sklearn logistic regression toolbox, we’re going to include an example of how logistic regression is used in sports.
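As a minimal sketch of that mapping (the intercept and slope below are made-up illustrative values, not fitted ones), here is how a continuous input becomes a probability, and then a Boolean via the 50% threshold:

```python
import math

def logistic(x, intercept=0.0, slope=1.0):
    """Map a continuous input x to a probability via the logistic curve."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def predict_bool(x, intercept=0.0, slope=1.0):
    """Convert the probability to a Boolean prediction by thresholding at 50%."""
    return logistic(x, intercept, slope) >= 0.5

prob = logistic(2.0)       # a probability between 0 and 1
pred = predict_bool(2.0)   # a hard True/False prediction
```

Fitting a logistic regression model amounts to finding the intercept and slope that best match the observed data; that is exactly what Sklearn automates for us below.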

Logistic Regression Example

Logistic regression is so useful that it is one of the most commonly used tools in all data science. In sports this is especially true. Almost everything that happens in sports is Boolean – a shot is made or missed, a team wins or loses, a golfer sinks a putt or doesn’t.

A cool example is the stat expected goals, or xG, in soccer. Some shots on goal in soccer are more likely to go in than others. Certain factors, like distance from the goal, how close the nearest defender is, and the goalie’s positioning, make the shot attempt easier or harder. xG is a logistic regression model that takes all these factors and estimates the probability of the shot going in.

Expected goals is the output of a logistic regression model. It can be used for many analytics problems including:

  • Estimating which team should have won the game
  • Computing who the best shooters are
  • Computing who the best goalies are

In the next section we’ll talk about how the Sklearn logistic regression module works before starting our step-by-step example.

The Sklearn Logistic Regression Module

This tutorial assumes that you already have a working knowledge of Python and have Sklearn installed with pip. We’ll also need pandas to work with our data in a dataframe. If that is not the case, this YouTube tutorial can help you get started.

Our scripts to fit a logistic regression model are so simple that there are only four steps:

  1. Load the required modules
  2. Load and pre-process data (This is usually 99% of the work)
  3. Fit the regression object
  4. Predict and analyze the results

Loading Modules and Data

The Sklearn logistic regression model is implemented as a Python class. Again, check here for a quick primer on how classes work in Python. First, we need to load the following modules:

  • Pandas for efficient handling of data frames
  • The actual LogisticRegression class which lives inside of sklearn.linear_model
  • The train_test_split function from sklearn.model_selection
  • The matplotlib pyplot module to visualize our results
We start building a logistic regression model in Sklearn by importing all the relevant modules
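The four imports listed above look like this:

```python
# Pandas for efficient handling of data frames
import pandas as pd

# The LogisticRegression class lives inside sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# train_test_split lives inside sklearn.model_selection
from sklearn.model_selection import train_test_split

# The matplotlib pyplot module for visualizing results
import matplotlib.pyplot as plt
```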

Load and pre-process data

Now, data is overwhelmingly often stored in the .csv format. You can read data in this format using pandas, as shown in the following code snippet. The “head” method of a pandas dataframe is helpful to “see” what the data looks like.

The second part in this step-by-step logistic regression in Sklearn example is to get a feel for the data
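A sketch of this step is below. For a self-contained example we read from an in-memory string; in practice you would pass a file path (the file name and column names here are hypothetical stand-ins for your own data):

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("nfl_games.csv")  # hypothetical file name
csv_text = "spread,favorite_won\n3.0,1\n7.5,1\n1.0,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Peek at the first few rows to see what the data looks like
print(df.head())
```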

There is a hidden step that comes immediately after this in almost any data analytics project. It is necessary to convert the raw data into a format usable by the modules you’ve loaded.

For example, when trying to use the Sklearn logistic regression module, a common mistake might be to pass in a list of numbers as your independent variable and a list of Booleans as your dependent variable. However, doing this will lead to the following error:

Luckily, the fix is provided in the error message!
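The following sketch (with made-up dummy numbers) shows both the mistake and the fix. Sklearn expects the independent variable to be a 2D array with one row per observation, which `reshape(-1, 1)` produces from a flat list:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

spreads = [1.0, 3.0, 7.0, 10.0]           # continuous inputs (dummy values)
favorite_won = [False, True, True, True]  # Boolean outcomes (dummy values)

model = LogisticRegression()
# model.fit(spreads, favorite_won) would raise a ValueError asking us to
# reshape the data using array.reshape(-1, 1)

X = np.array(spreads).reshape(-1, 1)      # 2D array: one column, one row per game
model.fit(X, favorite_won)
```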

After we have either a 2D array or, alternatively, a list of 1-tuples, we can fit the Sklearn logistic regression object to the data! However, it is good practice to first split the data into training and test sets. The train_test_split function from sklearn.model_selection lets us randomly split the data into subsets of a specified size.

An example of this tailored to our specific sports analytics example is shown below.

We need to split the data into training and testing datasets
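A sketch of that split, using dummy stand-ins for the real spreads and outcomes, looks like this. `test_size=0.25` holds out a quarter of the games for testing, and `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real data (illustrative values only)
X = np.arange(20, dtype=float).reshape(-1, 1)  # 2D array of "spreads"
y = np.arange(20) % 2 == 0                     # Boolean "favorite won" outcomes

# Randomly split into 75% training and 25% testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```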

To see a full example of how we had to pre-process the raw data, take a look at the Jupyter notebook on our GitHub here. For now, though, we move on to fitting the Sklearn logistic regression model.

Fitting the Sklearn Logistic Regression Object

After all the pre-processing and loading the appropriate modules and functions, actually training the logistic regression model is trivial.

First, we need to instantiate a LogisticRegression object. Then, we simply use the “fit” method of this object and provide it the training data, both the independent and dependent variables. The following code snippet shows how this can be done.

Actually building the Sklearn logistic regression object only takes two lines of code!
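Those two lines look like this (the training arrays here are dummy values standing in for the real training split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Dummy training data: spreads (a 2D array) and whether the favorite won
X_train = np.array([[1.0], [3.0], [6.0], [10.0], [14.0], [2.0]])
y_train = np.array([0, 0, 1, 1, 1, 0])

# Line 1: instantiate the object; line 2: fit it to the training data
model = LogisticRegression()
model.fit(X_train, y_train)
```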

The LogisticRegression constructor also accepts a solver argument allowing us to specify which optimization method is used. For example, there are options to train the model using Newton’s method (“newton-cg”) or the memory-efficient L-BFGS method (“lbfgs”, the default). These are best saved for an advanced study of logistic regression in Sklearn. For now, we’re ready to move on to the last step.

Prediction and Plotting

Finally, now that our model is trained, it can be used to make predictions on previously unseen data. There are two types of predictions possible. First, we can have our model output Boolean values corresponding to the prediction related to an input. This is done using the “predict” method of the logistic regression object.

Second, and perhaps more interestingly, we can have the model output the probability associated with the Boolean value. This is done using the “predict_proba” method of the logistic regression object.

An example showing this on a set of dummy data points is shown below. Notice how the format of the input data is a list of 1-tuples, exactly the same as the data the model was trained on to begin with.

We use the model to predict probabilities on dummy data
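A sketch of both prediction methods on dummy data is below. Note that the new inputs use the same 2D, list-of-1-tuples format as the training data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on dummy data (same shapes as in the rest of the tutorial)
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# New inputs, formatted exactly like the training data: a list of 1-tuples
new_spreads = [[2.5], [9.5]]

booleans = model.predict(new_spreads)     # hard 0/1 predictions
probs = model.predict_proba(new_spreads)  # one row per input: P(0), P(1)
```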

Now, let’s look at all of these steps together to see how we can use logistic regression in Sklearn for a sports analytics example.
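Put end to end, the four steps can be sketched as follows. Since the real NFL dataset lives in the notebook on our GitHub, this sketch generates synthetic spread/outcome data with an assumed underlying relationship purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 2 (stand-in): synthetic spreads and win/loss outcomes.
# The 0.25 slope is an assumed "true" relationship, not a fitted NFL value.
spreads = rng.uniform(0.5, 20.0, size=400)
p_win = 1.0 / (1.0 + np.exp(-0.25 * spreads))
favorite_won = rng.random(400) < p_win

X = spreads.reshape(-1, 1)  # 2D array: one feature column

# Step 2 (continued): split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, favorite_won, test_size=0.25, random_state=0
)

# Step 3: fit the regression object
model = LogisticRegression().fit(X_train, y_train)

# Step 4: predicted probability that, e.g., a 10-point favorite wins
p10 = model.predict_proba([[10.0]])[0, 1]
```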

Applied Logistic Regression in Sklearn

Our example is understanding point spreads and winning probabilities in the NFL. Sometimes teams are favored to win by 2 points, sometimes by 6 points or 10 points. As the spread becomes larger, it is more and more likely that the favored team wins.

If a team is favored by 1 point, then the game is almost a tossup. We would expect the favorite to win just a little more than 50% of the time, maybe 52 or 53%. If a team is favored by 20 points, then we expect that they’re virtually guaranteed to win. Maybe they win 95 or 97% of the time.

We want to put exact numbers on this. Having exact numbers is important when building more complex sports prediction models. We looked through this data set containing all past NFL games as an example. For each game in which the data was available, we recorded (a) how much the favorite was favored by (aka the spread) and (b) whether or not the favorite won. The raw data is shown below.

Raw NFL win probabilities given spread

This data is roughly what we would expect. As the spread gets smaller, the probability that the favorite wins decreases towards 50%. For larger spreads, we see a 100% probability of the favorite winning.

However, the raw data is imperfect because of relatively small sample sizes. The probability should never increase as the spread gets closer to 0, as it sometimes does in this plot. And heavy favorites surely don’t actually win 100% of the time. The logistic regression model can help us smooth this data out.

The result of fitting an Sklearn logistic regression model to the above data is shown in the plot below.

The result of using logistic regression in Sklearn to solve a sports analytics question.

Notice how well this model fits the data. The blue line, the output of our logistic regression model, is a smoothed version of the raw data. We can use the blue logistic regression curve to more accurately predict the probability of a team winning.

Conclusions

The Sklearn logistic regression module provides seamless and fast fitting of a powerful machine learning model. This is one of the most fundamental data science models there is. Learning logistic regression in Sklearn is one of the most important tools a new data scientist can have in their toolbox. We encourage you to explore more complex models and variants of the logistic regression paradigm on your own.
