Tuesday, November 25, 2008

Why Ranking Violations are a Flawed Metric

I hate to criticize anything without being prepared to offer an alternative. So, I'll let you know up front: I will offer an alternative (at the end of this post).

But first, for those who might have no idea what "ranking violations" are, here is a very brief tutorial...

Let's say John Doe has made his own football rankings. Is there an easy way to see if they make sense? A popular approach is to calculate the frequency of "ranking violations." A ranking violation occurs when the loser of a played game is ranked higher than the winner. Now why, you might ask, would any rankings ever do that? The answer is that once you're about halfway into the season, there's no way around it: there is simply no way to rank the teams so that every winner is ranked above every team it beat. Eventually, some 2-5 team beats a 5-2 team, and making the ranking violations go away becomes impossible. If you would like to see some ranking violation stats in action, check out Ken Massey's College Football Ranking Comparison page (scroll to the bottom of http://www.mratings.com/cf/compare.htm).
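To make the counting concrete, here is a minimal sketch (plain Python, with hypothetical team names and results) of how ranking violations might be tallied. Nothing here is specific to any real ranking system.

```python
# Minimal sketch: tallying ranking violations for a set of game results.
# 'ranking' lists teams from best to worst; 'games' is a list of
# (winner, loser) pairs. Team names and results are hypothetical.

def ranking_violations(ranking, games):
    pos = {team: i for i, team in enumerate(ranking)}  # index 0 = best
    # A violation is any game whose loser is ranked above its winner.
    return sum(1 for winner, loser in games if pos[loser] < pos[winner])

# A three-team "cycle" -- no ordering of these teams can reach zero violations.
games = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
print(ranking_violations(["X", "Y", "Z"], games))  # -> 1
```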

OK, enough on what ranking violations are. If you're still unclear, google it. Next...

Now, if we can't make ranking violations go away, then it would seem to make sense to rank teams so as to keep them to a minimum, right? That way, we don't have to listen to folks invoke the "head-to-head" argument. I think I preached on that in another post, so I won't go down that road here. The short answer to "should we minimize ranking violations?" is... "No."

So, I've made the beginnings of an argument in support of minimizing ranking violations and now I'm suggesting it's a bad idea. Why? The reason is that it's almost, but not quite, the best metric. The problem is a little complicated, so bear with me.

Let's take a sample problem. It's not terribly realistic, but it's been designed to make a point. We have three teams in a conference -- A, B, and C. Teams A and B play each other ten times during the regular season, and A wins every time. I know this wouldn't happen in the real world; I'm only making the point that A is clearly better than B. If you have a problem with this, then the alternative is that A and B play a common set of opponents: team A wins all of their games and B loses all of theirs. Better? OK, now introduce team C. C plays two games, beating team A and losing to team B.

Now it's time to rank the teams. Obviously, we rank A ahead of B. But what about C? We can minimize the ranking violations by ranking C above A (first in the conference) or below B (last in the conference). Strange: our minimum-ranking-violations approach has just told us that team C is probably either the best or the worst team in the conference, but probably not in the middle. If this makes sense to you, then quit now -- there's no hope. Otherwise, read on...
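For the skeptical, here is a quick sketch (assuming the ten-game version of the A-vs-B schedule) that tallies ranking violations for every ordering of the three teams; the minimum is reached only with C first or C last.

```python
from itertools import permutations

# Hypothetical schedule: A beats B ten times, C beats A, B beats C.
games = [("A", "B")] * 10 + [("C", "A"), ("B", "C")]

def ranking_violations(ranking, games):
    pos = {team: i for i, team in enumerate(ranking)}  # index 0 = best
    return sum(1 for winner, loser in games if pos[loser] < pos[winner])

for order in permutations("ABC"):
    print("".join(order), ranking_violations(order, games))
# CAB and ABC tie for the minimum (1 violation each);
# putting C in the middle (ACB) costs 2.
```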

OK, it would seem reasonable (both subjectively and from a "maximum likelihood" viewpoint -- we won't dive into the math on that here) that team C probably belongs between A and B, but how can we express that mathematically? The solution I propose is an alternative to ranking violations that I've dubbed "record violations" (I have also referred to it as "schedule violations"). It goes like this...

Team C's record is 1-1. If we rank C between A and B, one of its opponents is ranked higher (what I'll call the "higher") and one is ranked lower (the "lower"). Thus, C's lower/higher is 1-1. Because its W/L (win-loss) record matches its L/H (lower/higher) split, C has zero record violations.* You can check out our L/H numbers on our Atomic Football ranking page.
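Here is a rough sketch of that bookkeeping for a single team, reading "record violations" as the mismatch between the team's win-loss record and its lower/higher split -- equivalently, the wins over higher-ranked opponents left over after cancelling them against losses to lower-ranked opponents (or vice versa). The function name is mine, purely for illustration.

```python
def record_violations(team, ranking, games):
    """Record violations for one team: the absolute difference between its
    wins over higher-ranked opponents and its losses to lower-ranked ones.
    Zero means the W/L record matches the L/H (lower/higher) split."""
    pos = {t: i for i, t in enumerate(ranking)}  # index 0 = best
    wins_over_higher = sum(1 for w, l in games if w == team and pos[l] < pos[team])
    losses_to_lower  = sum(1 for w, l in games if l == team and pos[w] > pos[team])
    return abs(wins_over_higher - losses_to_lower)

games = [("A", "B")] * 10 + [("C", "A"), ("B", "C")]
print(record_violations("C", ["A", "C", "B"], games))  # -> 0: C fits between A and B
```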

I first proposed this metric to Ken Massey in late 2006, and I'm hopeful he will find the time to add it to his comparison page. Here is the text from my original message:

-------------

Ken,

I wanted to suggest a variant on the ranking violation metric.

Consider a team that has beaten #13, #15, #17, and #19 and lost to #1, #3, #5, and #6**. In addition, the team has beaten #9 and lost to #11. Being 5-5 against teams of average rank #10, 1-4 against teams ranked #1-#9, and 4-1 against teams ranked #11-#20, it would seem reasonable to rank this team #10.

However, doing so yields two ranking violations. One of the violations could be alleviated by moving the team up to #8 or down to #12. This is obviously a counterintuitive situation (and one I discussed in my recent paper). Now consider an alternative metric.

If we retain the #10 ranking, then this hypothetical team is 5-5, with 5 opponents ranked higher and 5 ranked lower.

Thus, if Wins-Losses is the same as Lower-Higher (lower being the number of teams*** ranked lower and higher being the number ranked higher), then we would say that we have zero "Record Violations" (if you have an alternative name, please let me know). In other words, with this metric we allow a ranking violation corresponding to a win against a higher-ranked team to cancel a ranking violation corresponding to a loss against a lower-ranked team. Thus, for this team we find:

Rank    Ranking Violations    Record Violations
 #2             4                     4
 #4             3                     3
 #6             2                     2
 #8             1                     1
#10             2                     0
#12             1                     1
#14             2                     2
#16             3                     3
#18             4                     4

As you can see, ranking violations have two local optima, whereas record violations do not.

To put things on the same percentage scale as the traditional ranking violation metric, we will continue to normalize by the number of games, since the maximum number of record violations for a given team is equal to the number of games played by that team.

Obviously, record violations will always be equal to or less than ranking violations, since we begin with the ranking violations but allow some to cancel out others. The purpose of this metric is to prevent the obviously nonsensical situation mentioned above in my opening example. For this reason, I think it is a slightly superior metric. I would certainly love to see the results of it on your comparison page by year's end. If you do choose to employ this metric, I would also appreciate a reference. Lastly, I did not get a reply from you on my previous message. I know this is a busy time for you, so I understand...

Thanks for all your hard work in this most important field of endeavor (I say this tongue in cheek, of course).

Jim

----------

*For those who might run with the math, yes, if you consider the record violations for all three teams, you get a minimum of two violations for any of these orderings -- ABC, ACB, or CAB. The point is, record violations, unlike ranking violations, don't force you to one of the extremes.
**This was supposed to say #7.
***Opponents.
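For anyone who wants to check the table in that message, here is a short sketch that recomputes both columns from the opponent list above (with the #7 correction from the footnote): wins over #9, #13, #15, #17, and #19, and losses to #1, #3, #5, #7, and #11.

```python
# Recompute the table from the message above. Opponent ranks reflect the
# footnote correction (#6 should have been #7): a 5-5 record with wins over
# #9, #13, #15, #17, #19 and losses to #1, #3, #5, #7, #11.
wins   = [9, 13, 15, 17, 19]
losses = [1, 3, 5, 7, 11]

print("Rank  RankingViolations  RecordViolations")
for rank in range(2, 19, 2):
    wins_over_higher = sum(1 for r in wins if r < rank)    # wins that "violate"
    losses_to_lower  = sum(1 for r in losses if r > rank)  # losses that "violate"
    ranking_v = wins_over_higher + losses_to_lower
    record_v  = abs(wins_over_higher - losses_to_lower)    # cancellation allowed
    print(f"#{rank:<3}  {ranking_v:^17}  {record_v:^16}")
```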

2 comments:

Steve Wrathell, CPA said...

There are well over 100 ranking systems on Kenneth Massey's Comparison, now. Why? It's because each ranker believes that his method does something better than everyone else's (or that they are trying to make it better). How can there be so many opinions about how to rank football (& other sports) teams? Different rankers have different objectives. Some want to be predictive -- they might want to excel at picking straight up winners, ATS winners, minimize predictive mean error, etc. Others prefer a retrodictive outlook. They might want to minimize ranking violations, weighted errors, or "record violations." Each ranker looks at his rankings & says "These represent quality, as I see it." Are all wrong except for one? No.

Who's the best passer in the NFL? The QB with the most yards, the most TD passes, the fewest sacks, the fewest interceptions, the most completions, the highest completion percentage, the best QB rating (calculated differently in the NFL vs. the NCAA), or the one that wins the Super Bowl? Whether it's rating ratings or rating QB's, we can notice that some are rated highly under various criteria and some rate poorly under all criteria. Personally, I felt that when the NY Times was in the BCS, its ratings were among the worst on the Comparison.

We can see stats that evaluate rankings under various criteria on web sites from Massey, Beck, Kislanko, Wobus, and (sometimes) Russia's own Mr. Potemkin. My own CPA Retrodiction Rankings (non-MOV) ("CPR") have been #1 for ranking violation in past years although Coleman, TSR Slots, & some others will now finish ahead of CPR. Yet, they have not done as well in Kislanko's and Potemkin's weighted error calculations. Self ("SEL") & Rothman (now calculated by Wolfe) ("RTH") have done very well for weighted error, while not being at the top of the RV list. Yet CPR has also done well at calculations of weighted error (#1 a few times) and for ranking violations.

CPA Rankings (MOV) ("CPA") has won several Prediction Tracker Awards from Todd Beck -- in the NCAA & NFL, in both predictive & retrodictive categories. CPR has had some success with the Prediction Tracker, too -- including success in predictive categories -- not bad for a non-MOV system. CPA & Pigskin Index ("PIG") have done quite well with preseason rankings in the Wobus calculations. ARGH ("ARG") has done very well all year long per Wobus on a consistent basis – although the Comparison’s Consensus dominates the season as a whole.

As far as Ashburn's example of A beating B and C beating A & losing to B, some might choose to minimize the RV by not putting C in the middle. But maybe those that put C in the middle might minimize weighted errors. Thus, to evaluate a system, one should not consider just one method of measuring success, but one should look at various measurements. There is no reason for the best RV system, the best Wtd Err system, the best predictive win system, & the best ATS system to put down the systems that have had success by other measurements. Remember, too, that most of us doing rankings make no money from this & are doing it out of our love of the game (& of math). I should also state that I think the important thing is that most computer rankers have better precision & objectivity and are superior to the opinion polls that get so much more attention.
- Steve Wrathell, CPA
CPA Rankings
CPA Retrodiction Rankings

Anonymous said...

Ranking violations cannot be a "flawed" metric, since they are just a counting stat, and if 1+1=2 there's no "flaw." I think what you mean is that they shouldn't be used as an arbitrary measure of "goodness."

Potemkin and others have created variations on the basic RRV metric, and those also accurately measure what they purport to measure.

That some systems have a high RRV% or even weighted RRV% does not mean those are not good systems - they may not have been designed to achieve a low RRV%.

That's fine. But don't say it's a "flawed metric" - it measures what it is intended to measure just fine. Say rather one should not put too much emphasis on it.