Teaching the Poisson Distribution with Sports
In this edition of Teaching Math with Sports, we look at the Poisson distribution in sports. The Poisson distribution is a fundamental distribution in statistics and is closely related to the binomial distribution. Armed with these two distributions, many sports modeling problems become much simpler.
The Poisson distribution is, in many senses, the continuous version of the binomial distribution. The binomial counts how many times something happens out of a fixed number of trials. The Poisson distribution counts how many times something happens over an interval of time. As a result, the binomial distribution is much more naturally applied to discrete games like baseball, football, or golf. The Poisson distribution arises more often in continuous settings without set “plays” – it is useful in hockey, soccer, or the NBA as a result.
In this article we look at three examples and some non-examples showing how to study the Poisson distribution in sports. We also include some challenge questions with each example of teaching the Poisson distribution with sports.
Poisson Distribution Background
The Poisson distribution counts the number of times that a countable event happens during a fixed period of time. In order to use the Poisson distribution, you have to know the average number of occurrences over this period of time. Moreover, over the entire time interval, the probability of the events occurring at any individual time must remain the same. Finally, the events must happen independently; the fact that the event just happened doesn’t make it any more or less likely to happen again soon.
The Poisson distribution is parameterized by only one parameter, \lambda , the average number of occurrences. A Poisson distribution with parameter \lambda is denoted by P(\lambda) .
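For reference, here is the standard formula for the probability of seeing exactly k occurrences under a P(\lambda) distribution. The symbols X (the random count) and k (a particular count) are notation introduced here for illustration; the rest of the article does not depend on them.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
```

Both the mean and the variance of this distribution are equal to \lambda.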
A common introductory example is modeling the number of phone calls that a call center may receive during a fixed hour. Here, \lambda is the average number of calls they expect to receive in an hour. Then the Poisson distribution can help the call center figure out how much staff they need by telling them the probability of having extremely busy hours.
Perhaps confusingly, the Poisson distribution is actually a discrete probability distribution even though I said above that it is a continuous version of the binomial distribution. The Poisson distribution is discrete because it is counting a number of occurrences; it can only take the values 0, 1, 2, etc. I called it continuous because it is applied to model situations where things can happen over a continuous interval. It is useful in modeling games where play is continuous (like basketball, hockey, and soccer). This distinction between continuous settings and discrete settings is key in choosing the right model.
The Poisson distribution is actually obtained by taking an appropriate limit of a binomial distribution. However, the specific theory about how to do this is not important to the sports examples below. So, we’ll skip the most mathematically fascinating part and move straight into looking at some examples of how the Poisson distribution shows up in sports.
Example 1: Modeling Goals Scored in a Game of Hockey
In the NHL, about 6.5 goals are scored per game by both teams combined. Hockey is continuous and it is fast; goals can be scored in the blink of an eye. Using the Poisson distribution to model goals being scored in a game of hockey is a reasonable approximation. The number of goals scored in a hockey game should follow a P(6.5) distribution.
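As a rough sketch of what a P(6.5) model predicts, the snippet below tabulates the probability of each combined goal total from 0 through 12. The helper name `poisson_pmf` and the cutoff at 12 goals are our own choices for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


GOALS_PER_GAME = 6.5  # combined average quoted in the article

# Probability of each combined goal total from 0 through 12.
goal_probs = {k: poisson_pmf(k, GOALS_PER_GAME) for k in range(13)}

for k, p in goal_probs.items():
    print(f"{k:2d} goals: {p:.3f}")

# The single most likely total under this model.
most_likely = max(goal_probs, key=goal_probs.get)
print("most likely total:", most_likely)
```

Totals above 12 goals are possible under the model but carry only a small amount of probability, which is why the table stops there.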
The following plot shows a comparison between the predicted goals per game (black) and the observed goals per game (red) from the last 2 seasons.
One of the key features of the Poisson distribution is that we can change the length of the time interval and still use a related model. For example, if goals in an entire 60-minute game of hockey follow a P(6.5) distribution, goals in the first half of the game (the first 30 minutes) should be expected to follow a P(3.25) distribution.
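One way to convince yourself of this rescaling is to check it numerically. The sketch below assumes the two halves of the game are independent, each with rate 3.25, and confirms that combining them reproduces the full-game P(6.5) probabilities. The function names are ours, chosen for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


FULL_GAME_RATE = 6.5
half_rate = FULL_GAME_RATE * 0.5  # half the time, half the expected goals


def total_from_two_halves(n: int) -> float:
    """P(n goals in the game), built from two independent halves."""
    return sum(
        poisson_pmf(j, half_rate) * poisson_pmf(n - j, half_rate)
        for j in range(n + 1)
    )


for n in range(11):
    direct = poisson_pmf(n, FULL_GAME_RATE)
    combined = total_from_two_halves(n)
    print(f"{n:2d} goals: direct={direct:.4f}  from halves={combined:.4f}")
```

The two columns agree, which is the sense in which a shorter interval "still uses a related model": only the rate parameter changes.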
Challenge Question 1: In the above graphic we excluded games going into overtime and shootouts. Why shouldn’t the goals scored in a sudden death overtime period be modeled with a Poisson distribution? How might you model goals scored during a shootout? For example, how would you model a shootout where each team takes 3 shots and 50% of the shots are made?
Example 2: Time of possession in soccer
In soccer, time of possession is a big predictor of winning the game. In the MLS, roughly 3 goals are scored on average in a 90 minute game. That means each team scores about 1.5 goals in a game. When a team possesses the ball, that team and that team only has a chance to score.
Scoring happens at an average rate of 3 goals per 90 minutes. That means that a team which possesses the ball for 45 minutes will score an average of 1.5 goals. The number of goals scored by a team in their 45 minutes of possession can be modeled by a P(1.5) distribution.
However, time of possession is rarely evenly split. Suppose that Team A has the ball for 50 minutes (and, therefore, Team B has it for 40 minutes). How would you compute the average goals scored for Team A and Team B now?
The answer: goals are scored at an average rate of one every 30 minutes of game time. Therefore, if Team A has the ball for 50 minutes, their goals can be modeled by the Poisson distribution P(1.67). Team B’s goals in this scenario must be modeled by a P(1.33) distribution.
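The arithmetic here is short enough to sketch in a few lines. The snippet below converts minutes of possession into a Poisson rate using the article's figure of 3 goals per 90 minutes, then lists the chance of each team scoring 0 to 3 goals. The names `expected_goals` and `poisson_pmf` are illustrative helpers, not part of any library.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# One goal per 30 minutes of possession, from 3 goals per 90 minutes.
GOALS_PER_MINUTE = 3 / 90


def expected_goals(possession_minutes: float) -> float:
    """Poisson rate for a team holding the ball this many minutes."""
    return GOALS_PER_MINUTE * possession_minutes


rate_a = expected_goals(50)  # Team A: about 1.67
rate_b = expected_goals(40)  # Team B: about 1.33

for k in range(4):
    print(
        f"{k} goals: Team A {poisson_pmf(k, rate_a):.3f}, "
        f"Team B {poisson_pmf(k, rate_b):.3f}"
    )
```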
Challenge Question 2: How would you compute the probability of Team A beating Team B, provided they have the ball for 50 minutes and assuming each team’s goals follow the appropriate Poisson distribution?
Example 3: Fouls in the NBA
Fouling in the NBA is another event that can be modeled with a Poisson distribution. Fouls occur roughly randomly and at roughly equal rates throughout the game. You can use a player’s average foul rate and average minutes played to compute their probability of fouling out using the Poisson cumulative distribution function!
Last year Dwight Howard recorded 200 fouls while playing 69 games for 17 minutes per game. This means he averaged roughly one foul every 6 minutes. Julius Randle recorded 225 fouls while playing in 71 games at 37.5 minutes per game. Randle averaged a foul roughly every 12 minutes. Which of these players is more likely to foul out?
Let’s assume both players have heavy-minutes nights and Howard plays 20 minutes while Randle plays 42 minutes – both roughly a 15% increase over their average minutes.
To compute Howard’s fouls, we take his average fouls per minute and multiply by 20 minutes to get the Poisson distribution rate parameter. This number is the expected fouls by Dwight Howard in his 20 minutes of playing time. Then, we do the same thing for Julius Randle and multiply by 42 minutes. Howard’s fouls should follow a P(3.4) distribution while Randle’s will follow a P(3.55) distribution.
The probability that Dwight Howard fouls out (that is, picks up six or more fouls) would be about 13%, while the probability that Julius Randle fouls out is about 15%. This means that even though Julius Randle is going to play more than twice as many minutes as Dwight Howard, their probabilities of fouling out will be approximately equal. This is because Howard’s foul rate is so much higher.
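The calculation above can be sketched in code as follows. The one ingredient not spelled out in the text is the foul limit: we assume the NBA rule that a player fouls out on their sixth personal foul, so fouling out means six or more fouls. The function names are ours.

```python
import math

FOUL_LIMIT = 6  # assumption: a player fouls out on the sixth personal foul


def poisson_cdf(k: int, lam: float) -> float:
    """Probability of k or fewer events when the expected count is lam."""
    return sum(
        lam ** i * math.exp(-lam) / math.factorial(i) for i in range(k + 1)
    )


def foul_out_probability(
    fouls: int, games: int, minutes_per_game: float, minutes_tonight: float
) -> float:
    """P(at least FOUL_LIMIT fouls) given season totals and tonight's minutes."""
    fouls_per_minute = fouls / (games * minutes_per_game)
    lam = fouls_per_minute * minutes_tonight  # expected fouls tonight
    return 1.0 - poisson_cdf(FOUL_LIMIT - 1, lam)


howard = foul_out_probability(200, 69, 17, 20)
randle = foul_out_probability(225, 71, 37.5, 42)

print(f"Howard: {howard:.1%}")
print(f"Randle: {randle:.1%}")
```

Using the cumulative distribution function up to five fouls and subtracting from one avoids summing the infinite upper tail directly.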
Challenge Question 3: If you look at the data, players foul out much less often than a Poisson distribution would suggest. Why do you think this is? We can think of two reasons, both with a related hint: neither the player nor the coach wants that particular player to foul out.
Example 4: Non-Examples
Above we talked about how the probability of winning in soccer changes in favor of teams with more time of possession. We argued that you can use a Poisson distribution to model this effect.
The same claim about time of possession tends to be true in professional football: teams with more time of possession tend to win the game more often. However, the reasons are much different. In this case, the Poisson distribution does not help us answer why.
The Poisson distribution is used to model the occurrences of an event over a period of time. One important assumption is that the event can happen any number of times. In soccer, if you have the ball for 45 minutes you could score 1 time, 2 times, 5 times, 10 times, or never at all. To use the Poisson distribution, that is how it must be.
In football, this requirement is not satisfied. On each possession you can only score once, so the count of scores during a stretch of possession cannot take any value. Therefore, the Poisson distribution is not a good model here.
Even worse, in football “scoring” can mean one of a few different things. You can score a touchdown and an extra point, a touchdown and a two-point conversion, or a field goal. Because the Poisson distribution is incapable of modeling these differences, it is an inappropriate tool to use to predict winning solely based on time of possession.
Challenge Question 4: Suppose you are staffing the finish line of a running race and want to make sure you have enough volunteers to hand out medals and water. Your plan is to divide the number of runners by the time between the first and last finish to figure out, on average, how many runners cross the line per minute. Then, you want to use a Poisson distribution to predict a normal range of outcomes for the number of runners who finish in any given minute. Why will this not work?
Challenge Question Answers
Question 1: In overtime the game ends as soon as the first goal is scored. Therefore, the period of time over which we’re counting events is not fixed. A better model would be to use the exponential distribution to find the expected waiting time until the first goal.
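As a minimal sketch of the exponential model, the snippet below computes the expected wait for the first goal and the chance that a goal arrives within an overtime period. It leans on three assumptions that are ours, not the article's: regulation lasts 60 minutes, overtime lasts 5 minutes, and the regulation scoring rate of 6.5 goals per game carries over unchanged into overtime.

```python
import math

GOALS_PER_GAME = 6.5     # combined average quoted in the article
REGULATION_MINUTES = 60  # assumption: length of regulation play
OVERTIME_MINUTES = 5     # assumption: length of the overtime period

# Assumption: goals arrive in overtime at the same rate as in regulation.
rate_per_minute = GOALS_PER_GAME / REGULATION_MINUTES

# Mean of an exponential waiting time is 1 / rate.
expected_wait = 1 / rate_per_minute

# P(first goal arrives before overtime expires) = 1 - exp(-rate * t).
p_goal_in_overtime = 1 - math.exp(-rate_per_minute * OVERTIME_MINUTES)

print(f"expected wait for first goal: {expected_wait:.1f} minutes")
print(f"chance overtime ends on a goal: {p_goal_in_overtime:.1%}")
```

If the overtime scoring rate differs from regulation, only `rate_per_minute` needs to change.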
To model shootouts, we would use the binomial distribution.
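For the shootout described in the question (3 shots per team, each made with probability 50%), a binomial model can be sketched as below. This simplified version assumes every shot is independent and that both teams take all 3 shots; the variable names are ours.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


SHOTS = 3
P_MAKE = 0.5

# Distribution of goals for one team: 0, 1, 2, or 3 made shots.
goal_dist = [binomial_pmf(k, SHOTS, P_MAKE) for k in range(SHOTS + 1)]
print("goals distribution:", goal_dist)  # [0.125, 0.375, 0.375, 0.125]

# Both teams share the same distribution, so a tie means matching totals.
p_tie = sum(q * q for q in goal_dist)
p_one_team_wins = (1 - p_tie) / 2  # by symmetry between the two teams

print(f"tied after {SHOTS} shots each: {p_tie:.4f}")      # 0.3125
print(f"a given team leads: {p_one_team_wins:.5f}")       # 0.34375
```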
Question 2: We need to compute the probability that a P(\lambda_1) distributed random variable is larger than a P(\lambda_2) distributed one. There is no simple “closed form” solution to this problem, but it can be computed easily either by hand or by writing code. We need to use conditional distributions to help.
First, in the \lambda_1 distribution, find the probability of scoring k goals. Then, using the \lambda_2 distribution, compute the cumulative probability of scoring fewer than k goals. The product of these numbers is the probability of the first team winning AND scoring k goals. Summing over all possible values of k will give you the probability of winning via the law of total probability.
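The procedure just described can be sketched as a short loop. The version below truncates the sum at 40 goals per team (an arbitrary cutoff of ours; the probability beyond it is negligible for rates this small) and uses the rates from Example 2, 50/30 for Team A and 40/30 for Team B. It also assumes the two teams' goal counts are independent.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def p_first_beats_second(lam_1: float, lam_2: float, max_goals: int = 40) -> float:
    """P(X > Y) for independent X ~ P(lam_1) and Y ~ P(lam_2)."""
    p_win = 0.0
    p_second_below_k = 0.0  # running value of P(Y < k)
    for k in range(max_goals + 1):
        # P(X = k) times P(Y < k): first team wins while scoring exactly k.
        p_win += poisson_pmf(k, lam_1) * p_second_below_k
        p_second_below_k += poisson_pmf(k, lam_2)
    return p_win


rate_a = 50 / 30  # Team A, 50 minutes of possession
rate_b = 40 / 30  # Team B, 40 minutes of possession

p_a_wins = p_first_beats_second(rate_a, rate_b)
p_b_wins = p_first_beats_second(rate_b, rate_a)
p_draw = 1.0 - p_a_wins - p_b_wins

print(f"Team A wins: {p_a_wins:.3f}")
print(f"Team B wins: {p_b_wins:.3f}")
print(f"Draw:        {p_draw:.3f}")
```

Swapping the two arguments gives the other team's win probability, and whatever is left over is the probability of a draw.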
Question 3: There are two reasons. First, if a player gets into foul trouble, the coach is likely to bench them, resulting in them getting fewer minutes, effectively decreasing their per-game foul rate. Second, a player who is in foul trouble is likely to play more conservatively, effectively decreasing their per-minute foul rate.
Either way, when a player gets into foul trouble, the rate parameter \lambda decreases as the number of accrued fouls increases. This means that a Poisson distribution is actually not the best model. For some cool theory, read my previous work about non-homogeneous Poisson processes and fouling in the NBA.
Question 4: You should not use a Poisson distribution in this setting because the rate of arrival of runners changes over time. Very few runners cross the finish line at the beginning and the end while the highest rate of arrival is in the middle.
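A small simulation makes the problem visible. The race below is entirely hypothetical: 1,000 runners whose finish times are drawn from a bell curve centered at 60 minutes with a 10-minute spread (our assumption, chosen only to mimic a crowded middle and sparse ends). For a true Poisson count the variance equals the mean, so a per-minute variance far above the per-minute mean signals that a single-rate Poisson model does not fit.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

NUM_RUNNERS = 1000
# Hypothetical finish times in minutes: few early, many mid-pack, few late.
finish_times = [random.gauss(60, 10) for _ in range(NUM_RUNNERS)]

first, last = min(finish_times), max(finish_times)
num_minutes = int(last - first) + 1

# Count how many runners finish in each one-minute window.
counts = [0] * num_minutes
for t in finish_times:
    counts[int(t - first)] += 1

mean_per_minute = statistics.mean(counts)
variance_per_minute = statistics.pvariance(counts)

print(f"average finishers per minute: {mean_per_minute:.1f}")
print(f"variance across minutes:      {variance_per_minute:.1f}")
print(f"busiest minute:               {max(counts)} finishers")
```

Staffing based on the average rate alone would leave the finish line badly short-handed during the busiest minutes.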
Jon…I accidentally discovered (ha) the Poisson distribution and applied it to predicting the number of fire and EMS emergencies across a given time period for a given jurisdiction. Is the Poisson the correct stat?
Yes, it is. The only thing to perhaps be careful of is that the average number of fires per hour might not be constant over time!