Thursday, April 15, 2010

Football Predictions and Statistical Significance


As a student of football statistics and of statistical methods, I have always enjoyed studying the archived predictions at Todd Beck's Prediction Tracker. About a year ago, I put together a simple model to help evaluate football predictions. Essentially, the model represents the errors in any two sets of predictions as a combination of common errors (that is, “common” between the two sets) and independent errors. Applying this model is particularly intriguing when one of those sets of “predictions” is the “betting line” (the “gold standard” of predictions). In doing so, we find that to “beat the line,” a set of predictions must be relatively close to the actual game outcomes but, at the same time, not too close to the betting line. In fact, if it were somehow possible to generate a set of predictions whose errors were completely uncorrelated with the errors in the betting line (i.e., no “common” errors), such predictions wouldn't even have to be very “points accurate” to make football wagering a winning proposition. In case you're wondering, that is all but impossible to achieve in practice.
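A toy version of this idea can be sketched in Python. The parameterization here is an assumption made purely for illustration and is not necessarily the exact form of the model described above: the line's error is Gaussian with standard deviation `sigma_line`, and the prediction's error is `common_fraction` times the line's error plus an independent Gaussian error with standard deviation `sigma_indep`. A bet "wins" when the prediction and the actual outcome fall on the same side of the line.

```python
import math
import random

def win_probability(sigma_line, sigma_indep, common_fraction):
    """Closed-form chance of beating the spread under a toy Gaussian model.

    Assumed (hypothetical) error model:
      line error        e_L ~ N(0, sigma_line)
      prediction error  e_P = common_fraction * e_L + e_I,  e_I ~ N(0, sigma_indep)
    """
    a = 1.0 - common_fraction
    denom = math.sqrt(a * a * sigma_line ** 2 + sigma_indep ** 2)
    if denom == 0.0:
        return 0.5  # prediction equals the line, so there is no side to take
    # Correlation between (P - L) and (S - L), where P - L = e_P - e_L
    # and S - L = -e_L.
    rho = a * sigma_line / denom
    # For two zero-mean jointly normal variables, the probability that they
    # share a sign is 1/2 + arcsin(rho) / pi.
    return 0.5 + math.asin(rho) / math.pi

def simulate(sigma_line, sigma_indep, common_fraction, n_games=200_000, seed=0):
    """Monte Carlo check of the closed form, with the true outcome S set to 0."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        e_line = rng.gauss(0.0, sigma_line)
        e_pred = common_fraction * e_line + rng.gauss(0.0, sigma_indep)
        # With S = 0: L = e_line and P = e_pred.
        if (e_pred - e_line) * (0.0 - e_line) > 0:
            wins += 1
    return wins / n_games

# Fully independent errors that are twice as large as the line's errors:
print(win_probability(1.0, 2.0, 0.0))
# Errors that fully contain the line's error (a "fuzzed up" line):
print(win_probability(1.0, 2.0, 1.0))
```

Under these assumptions, predictions with fully independent errors twice the size of the line's still win roughly 65% of the time, while predictions that carry all of the line's error plus noise win exactly half the time.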

Since I don’t gamble, you might wonder why I care. Well, I do have a competitive streak, and I enjoy the challenge of trying to generate good football predictions. However, I soon discovered that there are some on the web (I won’t name names) who lean heavily on the betting line to “improve” the points accuracy of their predictions. The net effect is that various metrics commonly used to evaluate the quality of a set of predictions (such as the metrics on Prediction Tracker like “mean square error” and “percent correct”) tend to look pretty good when one’s predictions are essentially a “fuzzed up” version of the betting line. The only metric that does not benefit from this tactic is, not surprisingly, one’s win percentage “against the spread.” However, since most predictions are no better than a coin toss against the line anyway, I realized that it is actually rather difficult to tell who is particularly good at making predictions and who isn’t. So, I began to wonder whether, buried in all the noise, there was some indicator that some predictions did, in fact, add value over and above the betting line. The answer turned out to be “yes.”

Anyway, I began to run with the aforementioned model and soon found that I could generate an expected probability of a win “against the spread” that was a better long-term indicator than the actual win percentage itself. If you want to know more about this, email me. However, while the expected win probability was interesting, it failed to account for the number of games and didn’t produce a true “significance measure.” So, I translated it to the following metric that I call a “significance score”:

where L is the line, P are the predictions, S are the actual outcomes (as spreads: home score minus away score), N is the number of games, angle brackets indicate averages over the N games, and the function indicated by the Greek phi is the standard normal cumulative distribution function (CDF). The equation is an approximation (albeit a very good one) that assumes that the difference between P and L is relatively small compared to the other two differences.

If a set of predictions is essentially random noise (or any mix of the betting line and noise), the argument in the CDF above will tend to be a standard normally distributed random variable (mean zero, standard deviation one). As a result, the score P will be uniformly distributed between zero and one (or 0% and 100%). If, however, a set of predictions can manage to eliminate a source of error not corrected by the line, then the argument in the CDF will tend to be increasingly positive and the score will tend to be higher than 50%, potentially even much higher.
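The uniform-under-chance behavior follows from a general fact: applying the standard normal CDF to a standard normal variable yields a uniformly distributed value. A quick simulation illustrates this. The `shift` parameter is a hypothetical stand-in for a system with real added value (it simply moves the mean of the CDF argument upward) and is not a quantity taken from the model.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def simulated_scores(n_systems, shift=0.0, seed=1):
    """Scores for imaginary systems whose CDF argument is N(shift, 1)."""
    rng = random.Random(seed)
    return [PHI(rng.gauss(shift, 1.0)) for _ in range(n_systems)]

def fraction_above(scores, threshold):
    """Share of scores strictly above the given threshold."""
    return sum(s > threshold for s in scores) / len(scores)

chance = simulated_scores(100_000)               # pure noise
skilled = simulated_scores(100_000, shift=1.0)   # hypothetical added value

# Under pure chance, about 10% of scores land above 90%, as expected for a
# uniform distribution; with a positive shift, noticeably more do.
print(fraction_above(chance, 0.90))
print(fraction_above(skilled, 0.90))
```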

OK, so what about the results? I pulled in the data from Todd Beck’s Prediction Tracker for the period from Week 5 of the 2007 season through the end of the 2009 season. I excluded the first four weeks of the 2007 season since I had just had Atomic Football added to Prediction Tracker and was still monkeying with the algorithm during that period. Since Week 5 of 2007, my algorithm has changed relatively little. Note that the numbers that follow were computed against the opening lines. I wanted that comparison since I publish Atomic Football’s predictions on Sunday (sometimes very early) and generally do not update them during the week. Most other participants on Prediction Tracker also publish early in the week and don’t update their predictions thereafter. So, here you are…

  System                Score (P)
  Stat Fox              99.9948%
  Atomic Football       99.9945%
  Edward Kambour        99.971%
  Nutshell Sports       99.76%
  Stephen Kerns         99.71%
  Nutshell Sports Retro 98.6%
  Born Power Index      98.0%
  Pigskin Index         97.9%
  Moore Power Ratings   97.8%
  Sagarin Predictive    96.6%
  System Median         96.4%
  Keeper                96.3%
  System Average        95.7%
  Super List            95.2%
  Dokter Entropy        93.9%
  Dunkel Index          93.7%
  Lee Burdorf           93.6%
  Dave Congrove         93.1%
  CPA Rankings          93.1%
  Ashby AccuRatings     90.4%
  Covers.com            89.1%
  Least Squares         89.0%
  Bassett Model         87.6%
  Bihl System           86.3%
  Harmon Forecast       85.1%
  Massey BCS            80.8%
  Tom Benson            80.6%
  Laz Index             71.1%
  Howell                66.8%
  Massey Consensus      65.5%
  Beck Elo              63.8%
  Marsee                63.7%
  Hank Trexler          59.8%
  Logistic Regression   57.2%
  Least Squares w/ HFA  57.1%
  Sagarin               53.8%


The score column is a percentile that indicates the relative difficulty of achieving the performance strictly by chance. For example, a score of 99% indicates a level that could be exceeded by chance alone only one time out of one hundred. Oh, and if you’re curious, the top six systems all scored greater than 92% against the updated (“Saturday morning”) line, with the next closest being below 82%. By the way, the average score across all the systems (including those not listed above) against the updated line was 44.5%, indicating that the average computer prediction may even be worse than a coin toss come game day.
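To get a feel for the scale of these scores, the "one time out of N" reading can be computed directly, along with the value of the CDF argument that a given score implies. This is a small sketch using a few scores copied from the table above; the helper names are made up for illustration.

```python
from statistics import NormalDist

def one_in_n(score):
    """How rarely pure chance would exceed this score, as 'one time in N'."""
    return 1.0 / (1.0 - score)

def implied_z(score):
    """CDF argument (in standard deviations) that corresponds to the score."""
    return NormalDist().inv_cdf(score)

# A few scores taken from the table, expressed as fractions.
examples = [
    ("Stat Fox", 0.999948),
    ("Sagarin Predictive", 0.966),
    ("Sagarin", 0.538),
]

for name, score in examples:
    print(f"{name}: about 1 in {one_in_n(score):,.0f}, z = {implied_z(score):.2f}")
```

Because the scale is so compressed near 100%, the gap between 99.99% and 98% is far larger in these terms than the gap between 98% and 90%.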

As I stated at the outset, I don’t gamble. One reason is that a lot of other very interesting things come out of this model. I won’t go into details here except to say that even when one can “expect” (in the statistical sense) a positive net return, there is a serious problem with managing risk. In the end, managing the volatility means limiting returns to the point that the stock market eventually still looks better. So, I still prefer to bet on teams like Walmart or Apple.
