Curve Fitting with the Tails of Distributions
Curve fitting is all about determining the relationship between two variables. More specifically, fitting a curve to data is about using one observed number to predict the value of another unobserved number. It might not seem like it, but a majority of applied stats and machine learning problems boil down to curve fitting.
For example, maybe you want to predict the number of people at the beach using the temperature. Or, maybe you want to predict the price of concert tickets using Twitter mentions to gauge hype. Certainly somebody out there is interested in predicting how good a player will be before the NBA or NFL draft using college data.
In each of these examples, one variable is easy to measure – temperature, Twitter mentions, college performance – while the other is not. Prediction problems are everywhere, and curve fitting is a way to extrapolate from small sample sizes to make accurate predictions about what is likely to happen.
Almost everyone’s first experience with curve fitting is finding a “line of best fit” that is meant to explain the relationship between two variables. However, curve fitting problems can get quite a bit more complicated. In this article we’ll take a look at a specific curve fitting problem: fitting curves to tails of probability distributions.
A fair warning to anyone more sports-oriented than math-oriented, the focus of this article is nearly purely mathematical. Though the next section will serve to motivate why we at TheDataJocks care about curve fitting with the tails of distributions, the applications of these techniques will be in later articles.
Why is This an Interesting Problem?
Before explaining what curve fitting is, I want to motivate why fitting a curve to the tail of a distribution is an interesting problem, both mathematically and from a sports analytics perspective. Let's start with the mathematical side.
Usually curve fitting identifies a relationship between two variables. However, when we are working with the tail of a distribution, there is only one variable present. So, instead of fitting a curve to a set of points of the form (x,y) , we are actually fitting a curve to just a sample of points x_1,x_2,\dots, x_N.
Sound impossible? It turns out that the assumption that the x_i are the tail of a distribution is enough to do the rest. We’ll deal with that later.
Now, why do we care about fitting curves to tails of distributions on a sports analytics blog? Sometimes the tail of a distribution is the part that is easiest to find data on and get a handle on. The tail of a distribution consists of the “outlier events”; they stand out from the crowd and are therefore more noteworthy. Also, sometimes the tail of a distribution (a) has more structure and (b) is better behaved statistically. We can think of two specific applications for this (and at least the first of these will lead to future articles).
- When measuring how ‘unbreakable’ a record is, we only have a list of the all-time leaders in a particular stat. The all-time leaders are the tail of a distribution that is difficult to get a grasp on. Fitting a curve to this dataset gives us a measure of how unbreakable a record is.
- When evaluating player quality, most models are built for the ‘average joe’ player. But when you’re dealing with superstars, you often hear about players who ‘break the mold’. Having a separate model – arising from fitting a curve to the tail of a distribution – to capture aspects of elite performance can lead to better insights.
In the next section, we’ll develop a bit of mathematical background and language to describe curve fitting.
What is Curve Fitting?
Remember, curve fitting is all about predicting the relationship between two variables. By sampling values from the population, information can be inferred about the relationship between the two values. The example plot below shows what this could look like. The green diamonds are sampled noisy values from a distribution and the solid black line represents a curve that fits the data.
In the example above, I started with the line and generated noisy points distributed around it. In real life, though, the opposite setting is more common: we observe the points but don't know the underlying relationship between the X and Y variables.
By looking at the shape the green points above make, we can make some good guesses about the shape of the unknown curve – the black line. We know it is decreasing and flattening out as x increases. As we have more and more green points, we gather more information and can more accurately estimate the form of the black function.
How is this done in general? There are a variety of methods, some more applicable in certain settings than others. In statistics and parameter estimation you'll commonly see maximum likelihood estimation used. However, we'll use what is perhaps the most common method for curve fitting: the method of “least squares”.
Least squares is based upon the principle that the best curve to fit to the data is the one that minimizes the distances between itself and the sampled points. Let’s look at the gory details.
Let f_\alpha(x) be a family of curves indexed by the parameter(s) \alpha. We want to find the choice of \alpha (call it \alpha’) that best fits the data. Then, the function f_{\alpha’} is the curve of best fit.
Given data points (x_i,y_i), 1\leq i \leq n, the parameter \alpha can be determined via least squares regression by minimizing the sum \sum_{i=1}^n (f_\alpha(x_i) - y_i)^2. Notice that y_i is the truth value associated to the input x_i while f_\alpha(x_i) is the predicted value at this input.
The difference f_\alpha(x_i)-y_i is the prediction error associated to the input x_i which is incurred by the model f_\alpha . The sum of the squares of these errors is a measure of how well the model (or curve) f_\alpha fits the entire data set. Therefore, picking \alpha that minimizes this sum is equivalent to curve fitting the best possible model to the observed data.
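To make this concrete, here is a minimal Python sketch of the least squares procedure for a hypothetical one-parameter family f_\alpha(x) = e^{-\alpha x}. The family, the “true” value \alpha = 0.7, and the noise level are assumptions made purely for illustration; scipy.optimize.curve_fit is one standard routine that minimizes the sum of squared errors described above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model family f_alpha(x) = exp(-alpha * x), indexed by alpha.
def f(x, alpha):
    return np.exp(-alpha * x)

# Synthetic data: noisy observations scattered around the curve with alpha = 0.7.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = f(x, 0.7) + rng.normal(scale=0.05, size=x.size)

# curve_fit chooses alpha to minimize sum_i (f_alpha(x_i) - y_i)^2.
alpha_hat, _ = curve_fit(f, x, y, p0=[1.0])
print(f"estimated alpha: {alpha_hat[0]:.3f}")
```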
Tails of Distributions and the Probability Integral Transform
In the previous section, we fit parameterized models f_\alpha to pairs of data (x_i,y_i) . Now we move to the setting where we only observe sampled values x_i which constitute the tail of some otherwise unknown probability distribution. We have no y_i values anymore! What do we do?
One of the most fantastic facts in all of statistics is the universal property of the uniform distribution. This is sometimes called the probability integral transform. The basic idea is that any distribution with a continuous cumulative distribution function can be obtained as a transform of the uniform distribution.
The simplest possible way to state this theorem is as follows: if you sample from a distribution and record the percentile of each observation, then those percentiles are uniformly distributed. That is, we are just as likely to get a sample in the 95th-100th percentile range as we are to get a sample in the 0-5th percentile range.
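As a quick numerical sanity check of this fact (a sketch assuming NumPy and SciPy are available), we can draw samples from a continuous distribution, push them through that distribution's own CDF, and confirm that the resulting percentiles look uniform on [0, 1]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=3.0, size=100_000)

# Probability integral transform: apply the distribution's own CDF to each sample.
percentiles = stats.norm.cdf(samples, loc=2.0, scale=3.0)

# Uniform(0, 1) has mean 1/2 and variance 1/12; the transformed samples should match.
print(percentiles.mean(), percentiles.var())
```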
Because of this, we can estimate the quantile of each of our sampled data points. If we use the quantile as a y_i value to pair with each x_i value, then we can do curve fitting in the same way as before!
An Example
To show how this might work, we'll work an example with synthetic data. We generated 999 samples from a standard normal distribution and looked at the 10 largest data points. (As a challenge problem, ask yourself why I sampled 999 values and not 1000. The answer involves order statistics and the uniform distribution.) We would expect these points to be located at the 99^{th}, 99.1^{st},\dots, and 99.9^{th} percentiles of the underlying distribution.
We did this twice and plotted the two sets of points in different colors in the figure below. Notice that these pairs of points are noisy versions of the cumulative distribution function of the standard normal.
In order to do curve fitting as described in the previous section, we need to assume a model. That is, we need to have a family of functions f_\alpha from which to choose the optimal curve. If we don’t sufficiently restrict the size of the family of candidate functions, we may run afoul of the bias-variance tradeoff. Let’s suppose now that we know these values are normally distributed with mean 0 but we don’t know the standard deviation.
Here, the family of functions is \Phi_\sigma, the cumulative distribution functions of the N(0,\sigma^2) distribution. The optimal choice of \sigma can be found in many ways, but for simplicity we perform a direct search over the interval [0.5,1.5] to illustrate the point. The resulting curves which best fit the underlying data are shown below.
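For readers who want to reproduce something like this, here is a rough sketch of the fitting step in Python. It is my reconstruction of the setup described above rather than the original code: 999 standard normal samples, the 10 largest kept as the observed tail, each paired with its expected quantile, and \sigma chosen by a direct grid search over [0.5, 1.5]. The estimate will vary from run to run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, k = 999, 10

# Draw 999 samples and keep the 10 largest (the observed tail), sorted ascending.
samples = np.sort(rng.standard_normal(N))
tail = samples[-k:]

# Expected quantiles of the top-10 order statistics: 0.990, 0.991, ..., 0.999.
quantiles = np.arange(N - k + 1, N + 1) / (N + 1)

# Direct search over sigma in [0.5, 1.5]: minimize the sum of squared errors
# between Phi_sigma(x_i) and the paired quantiles y_i.
sigmas = np.linspace(0.5, 1.5, 1001)
errors = [np.sum((stats.norm.cdf(tail, scale=s) - quantiles) ** 2) for s in sigmas]
sigma_hat = sigmas[int(np.argmin(errors))]
print(f"estimated sigma: {sigma_hat:.3f}")
```

A grid search is obviously crude; any one-dimensional minimizer would do the same job, but the grid makes the least squares objective easy to inspect.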
In both cases, the data are drawn from a standard normal distribution with variance 1. For the blue curve, the standard deviation estimate is 0.968; the green curve results from the estimate \sigma = 1.01.
Discussion
In essence, the process described above is a form of parameter estimation. I would be very curious to see how our curve fitting approach compares to more traditional parameter estimation like, for example, maximum likelihood estimation. Traditional log-likelihood differentiation approaches might work better for our problem, though I am uncertain to what extent the inherently biased sampling method impacts the results.
The whole point of this analysis is to prepare for coming articles discussing the most unbreakable records in various sports. We’ve developed a way to fit a cumulative distribution function to top-10 or top-50 all-time lists. Then, by looking at the percentile of the #1 spot on the list along with the age of the league, we can estimate how likely the record is to be broken.