Elo Rating and the Brier score

Is the Elo update rule related to the Brier score?

5 mins
🧮 math

The Elo rating system is a popular method to measure the relative skill level of players. While originally proposed as an improved rating system for chess, many sports have adapted Elo to account for game-specific traits - for instance, the ICC, the governing body for cricket, adds heuristics accounting for draws, weighs recent matches higher, and so on. FIFA, the governing body for world football, also uses a modified version of the Elo system to account for the importance of matches.

The absolute Elo rating does not mean much and can only be judged in the context of contemporary players.¹ Only two things matter when working with Elo ratings - (i) the difference between player ratings, and (ii) the update rule for players' ratings after each game.

Are the Elo update rule and the gradient of the Brier score related?

The Elo Rating System

Elo rating predicts the probability of a win for a player $A$ with rating $R_A$ against a player $B$ with rating $R_B$. In its simplest form, we define this win probability in terms of the rating of $A$ w.r.t. $B$ as,

$$p(A \mathrm{~beats~} B) = \frac{1}{1 + 10^{-(R_A - R_B)/\kappa}} \triangleq p_A,$$

where $\kappa$ is a positive parameter that modulates the scale of the rating difference. For instance, if $\kappa = 400$, then a rating difference of $R_A - R_B = 400$ implies an approximately $90\%$ winning chance for player $A$.
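
As a quick check, plugging $R_A - R_B = 400$ and $\kappa = 400$ into the formula above gives,

$$p_A = \frac{1}{1 + 10^{-400/400}} = \frac{1}{1 + 0.1} \approx 0.91.$$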

This equation is intuitive in the sense that the probability of $A$ beating $B$ increases as the rating difference $R_A - R_B$ increases, and vice versa. This is the familiar sigmoid function, whose range is always between $0$ and $1$ and can therefore be interpreted as a probability.

Now, let $O_A$ be a binary outcome variable that is $1$ if player $A$ wins and zero otherwise. In its simplest form, Elo ratings prescribe the rating to be updated using the rule,

$$\Delta R_A = - \alpha (p_A - O_A), \tag{1}$$

where $\alpha$ is a positive parameter that dictates the maximum rating update possible.

Consider the case where $O_A = 1$, i.e. player $A$ wins. The update rule dictates that we must increase the rating of player $A$ by $(1 - p_A)\cdot\alpha$. When $O_A = 0$, i.e. if player $A$ loses, we decrease their rating by $p_A\cdot\alpha$. A reasonably intuitive outcome.
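
As a concrete illustration, with $\alpha = 32$ (a value chosen here purely for the example) and a pre-game forecast of $p_A = 0.75$,

$$\Delta R_A = \begin{cases} -32\,(0.75 - 1) = +8 & \text{if } A \text{ wins}, \\ -32\,(0.75 - 0) = -24 & \text{if } A \text{ loses}, \end{cases}$$

so the favorite gains little from an expected win but loses considerably more from an upset.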

Therefore, to roll out our own Elo rating system, we need:

  1. An initial rating for each player. Since the absolute values do not mean much, this can be an arbitrary number.
  2. The choice of $\kappa$, which intuitively relates the magnitude of rating differences to win probabilities.
  3. The choice of $\alpha$, which is the largest rating change that a player can receive after each game.

Each of these three design decisions involves game-specific heuristics. For instance, $\alpha$ can be increased for players returning after a long injury break to avoid staleness in the ratings. More generally, much of the complexity of devising Elo ratings is about clever heuristics for these parameters.² The sketch below puts the three pieces together.
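
Here is a minimal Python sketch of a two-player Elo system, assuming an initial rating of $1500$, $\kappa = 400$, and $\alpha = 32$ purely for illustration; the function names are mine, not part of any standard implementation.

```python
# A minimal two-player Elo sketch. The constants below are illustrative
# choices, not values mandated by any particular federation.
INITIAL_RATING = 1500.0  # arbitrary starting point; only differences matter
KAPPA = 400.0            # scale relating rating differences to win probabilities
ALPHA = 32.0             # largest possible rating change per game

def win_probability(r_a: float, r_b: float) -> float:
    """p(A beats B) as a function of the rating difference, as defined above."""
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / KAPPA))

def update(r_a: float, r_b: float, outcome_a: int) -> tuple[float, float]:
    """Apply update rule (1); outcome_a is 1 if A wins and 0 if A loses."""
    delta = -ALPHA * (win_probability(r_a, r_b) - outcome_a)
    # Applying rule (1) to B gives exactly the opposite change, so the
    # update is zero-sum between the two players.
    return r_a + delta, r_b - delta

# Example: both players start equal; A wins twice, then loses once.
ratings = {"A": INITIAL_RATING, "B": INITIAL_RATING}
for outcome_a in (1, 1, 0):
    ratings["A"], ratings["B"] = update(ratings["A"], ratings["B"], outcome_a)
    print(ratings)
```

With equal starting ratings the first win is worth $16$ points, and as $A$'s rating climbs, further wins against $B$ earn progressively less.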

Brier Scores

Brier scores measure the accuracy of probabilistic predictions. In fact, the Brier score is a proper scoring rule, so optimizing for the Brier score implies learning well-calibrated probabilities. In other words, if a model predicts an $80\%$ probability of a win, then we should observe a win $80\%$ of the time in the real world for the model to be well-calibrated. A Brier score of zero corresponds to perfect predictions.

Let $O_A$ represent whether player $A$ wins against player $B$. The error in the forecast of $A$'s win probability $p_A$ is given by the Brier score,

$$\mathrm{BS}(p_A) = (p_A - O_A)^2.$$
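
For example, for an (arbitrary) forecast of $p_A = 0.8$,

$$\mathrm{BS}(0.8) = \begin{cases} (0.8 - 1)^2 = 0.04 & \text{if } A \text{ wins}, \\ (0.8 - 0)^2 = 0.64 & \text{if } A \text{ loses}, \end{cases}$$

so a confident forecast scores well when it is right and is penalized heavily when it is wrong.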

Put this way, the Brier score is a functional of the win probability $p_A$. To minimize the Brier score, we move in the direction opposite to its functional derivative,

$$-\nabla_{p_A} \mathrm{BS}(p_A) = -2(p_A - O_A).$$

This iterative approach is popularly known as gradient descent.

The Functional Derivative

More generally, each step of gradient descent involves a learning rate $\alpha$, such that the update rule for the win probability functional is,

$$\Delta p_A = - \alpha \nabla_{p_A} \mathrm{BS}(p_A) = - \alpha (p_A - O_A), \tag{2}$$

where we absorb the constant factor $2$ into $\alpha$ for convenience.

Now, consider $O_A = 1$; then we increase the probability of win by $(1 - p_A)\cdot\alpha$. With $O_A = 0$, we decrease the probability of win by $p_A\cdot\alpha$.
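
A minimal Python sketch of this iteration, with the starting forecast and learning rate chosen arbitrarily for illustration:

```python
# Gradient descent on the Brier score for a single observed outcome,
# following update rule (2). Starting forecast and learning rate are arbitrary.
alpha = 0.1    # learning rate (the constant factor of 2 is absorbed into it)
p_a = 0.5      # initial forecast of A's win probability
outcome_a = 1  # observed outcome: A wins

for step in range(10):
    p_a = p_a - alpha * (p_a - outcome_a)  # Delta p_A = -alpha (p_A - O_A)
    print(f"step {step + 1}: p_A = {p_a:.3f}")
# The forecast moves monotonically toward the observed outcome O_A = 1.
```

Each step shrinks the gap $p_A - O_A$ by a factor of $(1 - \alpha)$, so the forecast converges geometrically to the observed outcome.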

Is there an equivalence?

On the surface, equations $(1)$ and $(2)$ above look exactly the same up to the constant $\alpha$. However, for quite obvious reasons, these equations are not update rules for the same quantities - equation $(1)$ is an update rule for the rating, whereas equation $(2)$ is an update rule for the probability (functional) of winning.

Nevertheless, are these similar-looking equations merely a coincidence, or is there more thought behind the rating update rule? Thinking out loud,

Maybe, or maybe not. I need to think more.

Footnotes

  1. Consequently, it is unwise to compare Elo ratings across different generations of players. Depending on the game, it may even be problematic to compare between different formats of the game. For instance, comparing players across Test and ODI cricket is certainly wrong owing to different playing conditions and players' consistency.

  2. TrueSkill extends rating systems for more than two simultaneous players.