Teaching Hypothesis Tests for Means and Proportions with Sports
In this edition of Teaching Math With Sports we look at hypothesis tests. In particular, we explore a few examples of hypothesis testing for means and proportions applied to sports analytics questions.
Hypothesis tests are often applied to (and taught with!) medical examples: did this medicine work better than a placebo? While this is admittedly a compelling application, hypothesis testing can also be used to answer questions related to sports. While many people – professional data scientists especially – will scoff at hypothesis tests as an inexact science, I argue that they provide great value when applied in the right settings.
In this article we look at three examples – including one non-example – showing how sports analytics questions can (and cannot) be studied using hypothesis tests.
Background Theory: Hypothesis Tests for Means and Proportions
Hypothesis tests try to determine if there exists a statistically significant difference between one population and another. Here are some common examples:
- Administer a new medicine to one group and a placebo to a control group and see which group responds better
- Try marketing campaign A versus marketing campaign B in two different groups to see which sells more products
- New polling comes out that shows a candidate has gained ground in an upcoming election. Is this a sign of a real trend or just normal, random polling error?
In each of these examples the goal is the same. We compare two groups, look for a difference, and try to suss out whether that difference falls outside the normal variance we would expect from random errors alone. The factor that differs between the two groups is often called the treatment, a term motivated by the medical example above.
We expect the two groups to behave differently, and there are two possible explanations for that behavior: “random statistical errors caused the observed differences between the two groups” versus “the applied treatment caused the observed differences between the two groups”. In statistical parlance, we’re comparing a null hypothesis (differences = random noise) to an alternative hypothesis (differences = the treatment).
The null hypothesis is that the two groups actually performed the same and any observed differences are the result of random variation. The alternative hypothesis is that the treatment applied to one group genuinely produced a different response. In the examples above, the alternative hypothesis would be that “the medicine works”, “one marketing campaign is more effective than the other”, or “the candidate is actually gaining support”. You can see why this is valuable.
The key statistical tool used for decision making in hypothesis tests is the p-value. While the specifics of calculating p-values are admittedly complicated, their interpretation is straightforward. The correct interpretation of p-values is often one of the main takeaways from a statistics course. Here are a few different ways to interpret a p-value, in order of decreasing formality.
A p-value is a probability. What is it the probability of?
- (Formal mathematical interpretation, in my opinion meaningless to a non-statistically literate audience) The p-value is the probability of observing an effect at least as large as the one we observed assuming the null hypothesis is true.
- (Still correct, more commonspeak) The p-value is the probability that the difference observed between the two groups is attributable to random statistical errors.
- (NOT CORRECT, but good enough for general audiences, i.e. how it should probably be presented to stakeholders) The p-value is the probability that the treatment caused no difference between the two groups.
I like to start with the last bullet point to get people thinking about things in the right way, then refine their understanding toward the more correct interpretations higher on the list. A smaller p-value is stronger evidence that the alternative hypothesis is true and that whatever was done to the treated group had an effect. The smaller the p-value, the more confident we are that the treatment worked and that there is a real difference between the two groups.
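To make the formal interpretation concrete, here is a minimal simulation sketch: in an assumed null world where both groups share the same true success rate, the p-value is just the fraction of simulated experiments that produce a difference at least as large as the one observed. All numbers here are illustrative assumptions, not data from this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null-hypothesis world: both groups share the same true success rate.
true_rate, n = 0.35, 150   # illustrative assumptions
observed_diff = 0.05       # the difference we actually observed

# Simulate many pairs of groups under the null hypothesis.
sims = 100_000
group_a = rng.binomial(n, true_rate, sims) / n
group_b = rng.binomial(n, true_rate, sims) / n

# The (one-sided) p-value is the fraction of null-world simulations that
# produce a difference at least as large as the one we observed.
p_value = np.mean(group_a - group_b >= observed_diff)
print(f"simulated p-value ≈ {p_value:.3f}")  # roughly 0.18 here
```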
Example 1: Shooters Getting Better in the NBA
Suppose I’m an NBA GM interested in trading for Ben Simmons. His problem has been shooting. Whether he has the yips, is shooting with the wrong hand, or just isn’t a good shooter, something is off about him. However, if he were able to increase his shooting accuracy he would be a fantastic NBA player.
Here’s the setting: I’m a GM and I notice that in the last half of the NBA season Simmons has improved his shooting percentage on long 2’s. I want to see if this improvement is random noise or if Simmons has actually gotten better, and I want to use a hypothesis test to study this problem. To eliminate the effect of defense, the data we use looks only at Ben Simmons’ accuracy on wide-open shots between 20 and 22 feet.
Here’s the data (note: this is made up!). In Simmons’ last 150 attempts from this range he has shot 39%. In the 150 attempts before that, he shot 34%. Let’s use hypothesis tests to see if this difference is significant enough for us to be confident that Simmons’ shooting has gotten better.
The null hypothesis is that Ben Simmons has not gotten better; the alternative hypothesis is that he has actually improved. Amongst all possible hypothesis tests, the correct one to use here is a test for a difference between two proportions. If you run the numbers (as in the sketch below), this difference is not statistically significant at the \alpha = 0.05 level.
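For concreteness, here is a rough Python sketch of that two-proportion z-test on the made-up numbers above. Since 39% of 150 isn’t a whole number, rounding the recent makes to 58 is my assumption.

```python
from math import sqrt
from scipy.stats import norm

# Made-up data from the article: ~39% on the last 150 attempts,
# 34% on the 150 attempts before that.
makes_recent, n_recent = 58, 150
makes_prior, n_prior = 51, 150

p_recent = makes_recent / n_recent   # ≈ 0.387
p_prior = makes_prior / n_prior      # = 0.340

# Pooled proportion under the null hypothesis (no improvement).
p_pool = (makes_recent + makes_prior) / (n_recent + n_prior)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_recent + 1 / n_prior))

z = (p_recent - p_prior) / se
p_value = norm.sf(z)  # one-sided: did he get *better*?
print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # z ≈ 0.84, p ≈ 0.20
```

A p-value around 0.20 is nowhere near the conventional 0.05 threshold, which is why the GM should not read much into the improvement.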
From the GM’s perspective, this means that I should not be confident that Ben Simmons’ shooting has improved based on this small difference over a small sample size.
Challenge Question 1: Suppose I wanted to perform a similar analysis to see if Cade Cunningham became a better or worse shooter in his NBA rookie year. Why might it not be a good idea to compare his stats from the current year to his stats from the previous year?
Example 2: Baseball Pitchers & Sticky Stuff
A year or two ago, Trevor Bauer sent some tweets suggesting that a pitcher’s spin rate could more or less be used as a proxy to determine whether or not they were cheating. By cheating we mean using foreign substances to improve grip on the baseball in order to throw better pitches, AKA using “sticky stuff”.
Some people argue that a pitcher’s spin rate is like a fingerprint – it is specific to the pitcher and can’t really be changed. The only known way to change spin rate is to use sticky stuff. Therefore, if we look for pitchers whose spin rate changes dramatically, we can identify cheaters. That is the theory, at least.
Suppose we want to investigate a specific player. We compare two samples of 500 pitches each, between which the pitcher’s spin rate has jumped from 2400 rpm to 2600 rpm. In each case, the standard deviation in spin rate is about 3%. Here, a t-test for a difference in means is the correct hypothesis test to use to determine if this player is cheating; a sketch follows below.
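Here is a hedged sketch of that t-test using scipy’s summary-statistics helper. Translating the “about 3%” standard deviation into rpm units (3% of each sample’s mean) is my assumption.

```python
from scipy.stats import ttest_ind_from_stats

# Article setup: two samples of 500 pitches, spin rate jumping from
# 2400 rpm to 2600 rpm, standard deviation about 3%. Converting the
# "3%" into rpm (my assumption): 0.03 * 2400 = 72 and 0.03 * 2600 = 78.
result = ttest_ind_from_stats(
    mean1=2600, std1=78, nobs1=500,
    mean2=2400, std2=72, nobs2=500,
    equal_var=False,  # Welch's t-test, since the spreads differ slightly
)
print(result)  # t ≈ 42, p-value vanishingly small
```

With samples this large and a jump this big, the test screams “significant”. Whether that significance should be read as “cheating” is exactly what the challenge question below probes.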
Challenge Question 2: Suppose we were in the MLB league office and wanted to crack down on sticky stuff. A junior data scientist proposes using this hypothesis test at the \alpha = 0.01 level (to be safe!) and asserting that anyone who fails this test is using sticky stuff. Why is this a bad idea?
(Non) Example 3: Improvement in Running Ability
Here is an example of when you should not use a hypothesis test. Suppose a casual observer has been watching professional cross country running and thinks a certain athlete is getting much faster. They want to use some hypothesis tests to actually see if this is the case.
The casual observer tests their instinct by running a hypothesis test to see if the runner’s last five races were faster than their previous five races. I claim that this technique is not statistically sound. The reason for this is that the conclusions you can draw from hypothesis tests are actually quite subtle.
In cross country, the course difficulty has a significant effect on a runner’s time. A hypothesis test only tells you whether or not a difference exists, not what the difference is attributable to. That means that even if the hypothesis test indicates that there is a statistically significant difference, we cannot necessarily attribute this difference to any cause.
Put another way, hypothesis testing can’t help you decipher whether the runner got better or whether easier courses caused the runner’s times to improve.
Challenge Question Answers
- Hypothesis tests only tell you whether or not a difference exists between two groups. Hypothesis testing does not tell you anything about what the difference is attributable to. Therefore, comparing Cade Cunningham’s rookie year to his college performance runs a particularly high risk of misattributing any change in performance to change in skill level. In fact, it is more likely that any differences are attributable to the vast differences between the college and professional game.
- This question asks you to think about the definition of \alpha, the probability of erroneously rejecting the null hypothesis. Moreover, it specifically concerns the multiple testing problem. Hypothesis testing is an inexact science; there is quantifiable error. In our case, the hypothesis tests were designed to allow for a 1% chance of false positives. If we use this methodology to test every pitcher in baseball – of which there are well over 100 – then it becomes quite likely that we erroneously accuse someone of cheating, as the quick calculation below shows.
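To put a number on “quite likely”, here is a quick back-of-the-envelope calculation, assuming the tests are independent and taking 100 pitchers as a round illustrative figure:

```python
# Family-wise error: chance of at least one false positive when testing
# many pitchers independently, each at the alpha = 0.01 level.
alpha, n_pitchers = 0.01, 100  # 100 is a round illustrative figure

fwer = 1 - (1 - alpha) ** n_pitchers
print(f"{fwer:.1%}")  # about 63.4%: a false accusation is more likely than not
```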