Elo Rating and the Brier score
The Elo rating system is a popular method to measure the relative skill of players. While originally proposed as an improved rating system for chess, many sports have adapted Elo to account for game-specific traits. For instance, the ICC, the governing body for cricket, adds heuristics to account for draws, weighs recent matches higher, and so on. FIFA, the governing body for world football, also uses a modified version of the Elo system that accounts for the importance of matches.
The absolute Elo rating does not mean much and can only be judged in the context of contemporary players.[^1] Only two things matter when working with Elo ratings: (i) the difference between player ratings, and (ii) the update rule for a player's rating after each game.
Are the Elo update rule and the gradient of the Brier score related?
The Elo Rating System
Elo rating predicts the probability of a win for a player $A$ with rating $R_A$ against a player $B$ with rating $R_B$. In its simplest form, we define this win probability in terms of the rating of $A$ w.r.t. $B$ as,

$$
p = \frac{1}{1 + e^{-(R_A - R_B)/s}} \tag{1}
$$

where $s$ is a positive parameter that modulates the scale of the rating difference. For instance, if $s = 400$, then a rating difference of $R_A - R_B = 400$ implies an approximately $73\%$ winning chance for player $A$.

This equation is intuitive in the sense that the probability of $A$ beating $B$ increases as the rating difference grows, and vice versa. This equation is also familiarly called the sigmoid function, whose range is always between $0$ and $1$, and can therefore be interpreted as a probability.
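As a concrete illustration, here is a minimal Python sketch of equation $(1)$. The function name and the default $s = 400$ are my own choices for illustration, not anything canonical:

```python
import math

def win_probability(r_a: float, r_b: float, s: float = 400.0) -> float:
    """Probability that the player rated r_a beats the player rated r_b,
    per equation (1): the sigmoid of the scaled rating difference."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b) / s))

# With s = 400, a 400-point gap gives sigmoid(1) ~ 0.73.
print(win_probability(1600.0, 1200.0))  # ~0.731
```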
Now, let $S$ be a binary outcome variable that is $1$ if player $A$ wins and zero otherwise. In its simplest form, Elo ratings prescribe the rating $R_A$ to be updated using the rule,

$$
R_A \leftarrow R_A + K\,(S - p) \tag{2}
$$

where $K$ is a positive parameter that dictates the maximum rating update possible.
Consider the case where $S = 1$, i.e. player $A$ wins. The update rule dictates that we must increase the rating of player $A$ by $K\,(1 - p)$. When $S = 0$, i.e. if player $A$ loses, we decrease their rating by $K\,p$. A reasonably intuitive outcome.
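A sketch of update rule $(2)$ in the same vein, with $K = 32$ as an arbitrary illustrative default (a common choice in chess, though nothing here depends on it):

```python
import math

def elo_update(r_a: float, r_b: float, outcome: int,
               k: float = 32.0, s: float = 400.0) -> float:
    """Player A's new rating after one game, per equation (2).
    outcome is 1 if A wins and 0 otherwise."""
    p = 1.0 / (1.0 + math.exp(-(r_a - r_b) / s))  # equation (1)
    return r_a + k * (outcome - p)

# An underdog win moves the rating by nearly the full K ...
print(elo_update(1200.0, 1600.0, outcome=1))  # ~1223.4
# ... while a heavy favourite gains almost nothing from a win.
print(elo_update(1600.0, 1200.0, outcome=1))  # ~1608.6
```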
Therefore, to roll out our own Elo rating system, we need:
- An initial rating for each player. Since the absolute values do not mean much, this can be an arbitrary number.
- The choice of $s$, which intuitively relates the magnitude of rating differences to win probabilities.
- The choice of $K$, which is the largest rating change that a player can receive after each game.
Each of these three design decisions involves game-specific heuristics. For instance, $K$ can be increased for players returning after a long injury break to avoid staleness in the ratings. More generally, much of the complexity of devising Elo ratings lies in clever heuristics for these parameters.[^2]
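Putting the three pieces together, a toy rollout might look like the following sketch. The initial rating of $1500$ is arbitrary (only differences matter), and I update both players symmetrically so that ratings are zero-sum, a common but not mandatory convention:

```python
import math
from collections import defaultdict

def rate_games(games, initial=1500.0, k=32.0, s=400.0):
    """Roll Elo ratings over games, a sequence of (winner, loser) names."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in games:
        # Winner's win probability, per equation (1).
        p = 1.0 / (1.0 + math.exp(-(ratings[winner] - ratings[loser]) / s))
        delta = k * (1.0 - p)     # winner's gain, per equation (2)
        ratings[winner] += delta  # the loser pays the same amount, keeping
        ratings[loser] -= delta   # the total rating pool constant
    return dict(ratings)

print(rate_games([("ann", "bob"), ("ann", "cam"), ("bob", "cam")]))
```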
Brier Scores
Brier scores measure the accuracy of probability predictions. In fact, the Brier score is a proper scoring rule, such that optimizing for the Brier score implies learning well-calibrated probabilities. In other words, if a model predicts a $70\%$ probability of a win, then we should observe a win $70\%$ of the time in the real world for the model to be well-calibrated. A Brier score of zero corresponds to perfect calibration.
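In code, the Brier score for a single game is just the squared error between the forecast and the $0/1$ outcome (a sketch; over many games one averages this quantity):

```python
def brier_score(p: float, outcome: int) -> float:
    """Squared error between forecast probability p and the binary outcome."""
    return (outcome - p) ** 2

# Confident and right: near-zero score. Confident and wrong: near one.
print(brier_score(0.9, 1))  # ~0.01
print(brier_score(0.9, 0))  # ~0.81
```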
Let $S \in \{0, 1\}$ represent whether player $A$ wins against player $B$. The error in the forecast of $A$'s win probability $p$ is given by the Brier score,

$$
\mathcal{B}(p) = (S - p)^2 \tag{3}
$$

Put this way, the Brier score is a functional of the win probability $p$. To minimize the Brier score, we move in the direction opposite to its functional derivative,

$$
\frac{\partial \mathcal{B}}{\partial p} = -2\,(S - p)
$$
This iterative approach is popularly known as gradient descent.
The Functional Derivative
More generally, each step of gradient descent involves a learning rate $\eta$, such that the update rule for the probability-of-win functional is,

$$
p \leftarrow p + \eta\,(S - p) \tag{4}
$$

where we absorb the constant factor of $2$ into $\eta$ for convenience.
Now, consider $S = 1$: we increase the probability of a win by $\eta\,(1 - p)$. With $S = 0$, we decrease the probability of a win by $\eta\,p$.
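A quick numerical sanity check of update $(4)$: run it on simulated games with a fixed true win rate, and the forecast $p$ drifts to (and then hovers around) that rate, which is exactly what calibration asks for. The true rate of $0.7$ and $\eta = 0.05$ are arbitrary choices for this sketch:

```python
import random

random.seed(0)
p, eta, true_rate = 0.5, 0.05, 0.7
for _ in range(2000):
    outcome = 1 if random.random() < true_rate else 0
    p += eta * (outcome - p)  # one gradient step on the Brier score, eq. (4)
print(p)  # hovers around 0.7, the true win frequency
```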
Is there an equivalence?
On the surface, equations $(2)$ and $(4)$ above look exactly the same up to the constant factors $K$ and $\eta$. However, for quite obvious reasons, these equations are not update rules for the same quantities: equation $(2)$ is an update rule for the rating, whereas equation $(4)$ is an update rule for the probability (functional) of winning.
Nevertheless, are these similar-looking equations merely a coincidence, or is there more thought behind the rating update rule? Thinking out loud,
- Could we draw an equivalence if the ratings were constrained to be probabilities?
- Can the learning rate $\eta$ be reinterpreted as $K$, the maximum rating update possible after any game?
- Proponents of the Elo score often tout that the win probabilities are well-calibrated, and I have seen calibration reported in these circles using the Brier score. Could this functional equivalence be reinterpreted to explain the improved calibration?
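One way to start on the first question: the win probability $p$ in equation $(1)$ is itself a function of the rating $R_A$, so a gradient step on the Brier score with respect to the rating follows from the chain rule. A sketch:

$$
\frac{\partial \mathcal{B}}{\partial R_A}
= \frac{\partial \mathcal{B}}{\partial p}\,\frac{\partial p}{\partial R_A}
= -2\,(S - p)\,\frac{p\,(1 - p)}{s}
$$

so gradient descent on the rating itself would give $R_A \leftarrow R_A + \eta'\,(S - p)\,p\,(1 - p)$, which is the Elo rule $(2)$ with the constant $K$ replaced by a confidence-dependent factor $\eta'\,p\,(1 - p)$. Close, but not exactly an equivalence.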
Maybe, or maybe not. I need to think more.
Footnotes
[^1]: Consequently, it is unwise to compare Elo ratings across different generations of players. Depending on the game, it may even be problematic to compare between different formats of the game. For instance, comparing players across Test and ODI cricket is certainly wrong owing to different playing conditions and players' consistency.

[^2]: TrueSkill extends rating systems to more than two simultaneous players.