Elo ratings

Elo systems are used for ranking players in many competitions, including chess, sports like football (American or otherwise), e-sports, and board games. There are many variants in existence, but they ultimately derive from the version invented by Arpad Elo and implemented starting in the 1960s.

It’s easy to find sources explaining how to use an Elo system, but I’ve had a hard time finding a mathematical rationale for it. This is a bit unfortunate, because contrary to its predecessor, the Harkness system, it has a relatively solid statistical basis.

This post details one way to derive it from reasonable enough assumptions. The later sections of this post are inspired by Steven Morse’s Elo as a statistical learning model post.

What do we expect from a rating system?

Let’s consider a game with two players, where each match ends with exactly one winner (no ties are allowed).

A rating system should assign scores to each player which represent their relative strength. How do we measure relative strength? One way to do it is to say that the relative strength of two players is the probability that the first will win a game against the second. If that probability is close to 50%, both players have the same strength, if it’s close to 0% or 100%, one is much stronger than the other.

Most rating systems, including Elo, are based on a relation that looks like $\mathbb P [i \,\mathrm{beats}\, j] = f(\mathrm{rating}_i - \mathrm{rating}_j).$

Where $f$ is some function that depends on the specific rating system, but in any case relates a difference in ratings to a winning probability. Obviously, its image must be a subset of $[0, 1]$.

The how

The Elo system follows these intuitions. There are a few different variants of Elo systems, but they’re generally along these lines:

The win probability is given by
$\mathbb P [i \,\mathrm{beats}\, j] = \frac 1 {1 + 10^{-(\mathrm{rating}_i - \mathrm{rating}_j)/400}}.$
In particular, a score difference of 400 points translates to a win probability of 1/11 ≈ 10% for the lower rated player, a difference of 800 points into a win probability 1/101 ≈ 1%, etc.
After each game, points are updated according to the rule
$\mathrm{rating}_i^\mathrm{new} = \mathrm{rating}_i^\mathrm{old} + 30\left(r_{ij} - \mathbb P [i \,\mathrm{beats}\, j]^\mathrm{old}\right)$
where $r_{ij}$ = 1 if the match was won, and 0 otherwise. The same rule (reversed where needed) is used to update the other player’s score.

Most actual chess (or other sports) organizations do not use this exact type of Elo rating; there are many variants to address possible shortcomings with this basic version.

The why

The Elo rules do conform to what we expect of a rating system, but they can seem a bit arbitrary (at least they did to me!). So let’s explore what the underlying assumptions are, and how we can derive the Elo system ourselves.

Let’s assume we have a set of players, and consider a given player, which we’ll call $i$.

Every time they play a game, we assume they play with a performance $\mathrm{Perf}_i\in\mathbb R$, which changes for each game and reflects how well they are playing in this particular game. If player $i$’s performance is greater than their adversary’s, player $i$ will win; otherwise, the adversary will win. For the sake of simplicity, we assume there are no draws. We can’t directly measure performance, so it may not seem useful as a concept, but it provides a theoretical basis for what follows.

A player’s performance in a given match is of course neither perfectly predictable nor completely random. We can model it as being sampled, for each game, from a Gumbel distribution. This is a modeling assumption, and there are other possible choices; the original Elo book¹Elo, Arpad E. The rating of chessplayers, past and present. Arco Pub., 1978. instead models the performance as a normally distributed variable. Gumbel is more or less what the US Chess Federation uses, it’s not too far from a normal, and it makes the math neater, so I’m using it here.

A Gumbel distribution has two parameters: $\mu$ controls the location of its peak, and $\beta$ controls its variance, or how far from the peak we can expect a random sample to be. We will assume that each player has a static, intrinsic strength $\mu_i$ and use it as the $\mu$ parameter to the Gumbel distribution that controls their strength. On the other hand, we will assume all players have the same variance $\beta$, which controls how irregular their performance is (this is another modeling assumption; in reality, some players are likely to be more regular than others).

In summary, in each game we have two players $i$ and $j$, performances $\mathrm{Perf}_i \sim \mathrm{Gumbel}(\mu_i, \beta)$, and $\mathrm{Perf}_j \sim \mathrm{Gumbel}(\mu_j, \beta)$. Given this, what is the probability that $i$ wins the game? That probability is $\mathbb P [\mathrm{Perf}_i > \mathrm{Perf}_j ]$, or $\mathbb P [\mathrm{Perf}_i - \mathrm{Perf}_j > 0]$.

$\mathrm{Perf}_i - \mathrm{Perf}_j$, as a difference of random variables, is itself a random variable, but how does it behave?

Wikipedia helpfully gives us this theorem:

If $X \sim \mathrm{Gumbel}(\mu_X, \beta)$ and $Y \sim \mathrm{Gumbel}(\mu_Y, \beta)$ then $X-Y \sim \mathrm{Logistic}(\mu_X-\mu_Y,\beta)$.

Applied to our problem, this means that $D = \mathrm{Perf}_i - \mathrm{Perf}_j$ follows a logistic distribution of parameters ($\mu_i - \mu_j$, $\beta$). (The logistic distribution also has two parameters).

Let’s call $F$ this logistic distribution’s cumulative distribution function. By definition we have $\mathbb P [D>0] = F(0)$, and so

$\begin{aligned} \mathbb P [i \,\mathrm{beats}\, j] &= \mathbb P [\mathrm{Perf}_i - \mathrm{Perf}_j>0] \\ &= F(0) \\ &= \frac{1}{1+e^{-(\mu_i - \mu_)/\beta}} \\ &= \sigma(\mu_i - \mu_j/\beta) \end{aligned}$

Where σ is the sigmoid function: $x \mapsto \frac{1}{1+e^{-x}}$. If we had chosen a normal distribution instead of a Gumbel here, we would have ended up with an error function instead; it’s a bit harder to deal with because it doesn’t have a nice mathematical representation.

This clearly passes several sanity checks:

If $\mu_i$ is much larger than $\mu_j$, $\mathbb P [i \,\mathrm{\, jwins}]$ is close to 1; if instead $\mu_j$ is much larger, then it’s close to 0.
If $\mu_i = \mu_j$, both players’ intrinsic strengths are the same, and the probability of victory is 0.5.
If $\beta$ is very large, player performance is almost random, and the win probability will be close to 0.5.

You can take a look at the interactive plot below to get a better feel for this.

$\beta$ =		Score difference:
$\mu_1$ =		Player 1 win probability:
$\mu_2$ =

Elo scores

As far as I could find, FIDE or USCF do not officially say how to convert an Elo difference into a win probability, but, as we’ve seen, various other sources will tell you that for players with Elo ratings $e_i$ and $e_j$, $i$’s win probability is $\frac 1 {1+10^{-(e_i - e_j) / 400}}\,,$ which is eerily similar to what we’ve found in the formula above, so $\mu_i$ and $\mu_j$ are just Elo scores, up to a scaling factor.

Parameter reduction

If you play with the parameters above, you’ll notice that there is no real difference in terms of win probability between higher-rated players that play with a larger variance, or lower-rated players that play with a smaller variance.

More formally, if we multiply the strengths of both players by some factor $s$, and also multiply the variability of their strength by a factor of $s$, we get a new probability

$\begin{aligned} \mathbb P [i \,\mathrm{beats}\, j] &= \frac{1}{1+e^{-(s\mu_i - s\mu_j)/(s\beta)}} \\ &= \frac{1}{1+e^{-(\mu_i - \mu_j)/\beta}} \end{aligned}$

which is exactly the same as the win probability for unmodified strengths and variances. Similarly, we can add a constant strength factor $c$ to both $\mu$s, and we’ll end up with exactly the same score.

What this means is that, for the purposes of ranking, we can arbitrarily pick a $\beta$ and a reference strength $\mu_1$; what we pick will change other player’s strength values, but it will not change the winning probabilities. In fact, USCF uses (more or less) $\beta = 400/\ln 10$, and $\mu_1 = 1500$.

In the following I will take $\beta = 1$, $\mu_1 = 10$ for simplicity.

Summary

The bottom line is that making some assumptions about the performance of players in games, we find that we should be able to assign each of them a score such that their probability of winning looks like

$\mathbb P [i \,\mathrm{beats}\, j] = \frac{1}{1+e^{-(\mathrm{score}_i - \mathrm{score}_j)/\beta}}\,,$

where we can pick $\beta$ arbitrarily.

Elo computation

This justifies why it makes sense to talk about Elo scores and what they mean, but doesn’t tell us how to compute them. So now that we have the model in place, let’s look at how we can compute Elo scores.

In this first part, we’ll assume that we have a fixed set of game results, and we want to find the best possible estimates of the true player strengths based on these results. We’ll later see how to adapt this model to a continuous stream of results, like you would have for players on a chess site or in a chess federation.

Two players

If we only have two players $A$ and $B$, the only relevant data is the total number of games $n$, and the number of games won by $A$, $w$ (like in the rest of this post, I’m still assuming there are no draws, so $B$ has won $n-w$ games). As before, we pick $\beta = 1$ and $A$ will be the reference player, so we set $\mu_A = 10$.

Using our formula above, the expected number of games $W$ won by $A$ over $n$ games is:

$\begin{aligned} \mathbb E[W|\mu_B] &= \frac n {1+e^{-(\mu_A - \mu_B)/\beta}} \\ &= \frac n {1+e^{-(10 - \mu_B)}} \end{aligned}$

We can now make a maximum likelihood estimation of $\mu_B$: $\mu_B = \ln(n/w - 1) + 10$

Once again, we can sanity-check this formula:

If $w = n/2$, we get $\mu_B = 10$, so both players have the same strength
If $w = 0$, the formula for $\mu_B$ is undefined, because we know that $B$ is much stronger than $A$, but we have no idea by how much. As $w$ gets closer to 0, $\mu_B$ gets closer to infinity.
Symmetrically, if $w = n$, $B$ is a much worse player than $A$, but it’s not clear how much worse. The formula above has a limit of $-\infty$ as $w$ approaches $n$.

For practical applications, if we want to avoid undefined quantities, we could add small uncertainty factors, e.g. $\mu_B = \ln(n/(w + \varepsilon) - 1 + \varepsilon) +10$ for some small value of $\varepsilon$.

Several players

Let’s now assume again that we have $N$ players, numbered 1 to $N$. For each pair of distinct players $(i, j)$, we have win rate of $i$ over $j$: $r_{ij}$. We obviously have $r_{ij} = 1 - r_{ji}$ (or I guess that’s only obvious if there are no draws, which I’m still assuming).

For instance, here’s what it could look like for 3 players:

	1	2	3
1		1/2	3/4
2	(1/2)		2/3
3	(1/4)	(1/3)

There are 3 data points in this setup; the numbers between parentheses are completely determined by the other 3. However, the model only has two real parameters, $\mu_2$ and $\mu_3$, which are 2 and 3’s respective strengths relative to 1. This means that contrary to the 2-player model, we can’t expect to find parameters that always exactly match the observed results. For instance, using the formula above, we could deduce $\mu_2 = 10$ from $r_{12} = 1/2$, and $\mu_3 = 8.9$ from $r_{23} = 3/4$, but we would have ignored the extra datapoint that $r_{23} = 2/3$ (and, in fact, our model would predict a win rate $r_{23} = 3/4$, which is wrong).

Aside: gradient descent

Gradient descent is a method for optimizing functions, which is to say finding inputs to a function that make its output as close as possible to a desired output. There are lots of good intros out there, but I’ll recap the basics with the same notation I use below.

Say you have a function $f(x_1,\ldots,x_n)$ with $n$ different arguments, and some target value $o$ (which can be a scalar or a vector, but obviously has to have the same type as the output of $f$). We want to find values for $x_1,\ldots, x_n$ such that $f(x_1,\ldots,x_n)$ is as close as possible to $o$.

The idea behind gradient descent is the following.

First, we pick an error metric $E(x, y)$. This metric takes two arguments: the value of $f$ at some given parameters, and the target value $o$, and it tells us how far off we are: the lower the value of $E$, the better. For instance, if $o$ is a real number, we could simply use the absolute value: $E(x,y) = \lvert x - y \rvert$.

Then, we combine $E$ and $f$ to obtain the cost function $c$. We define

$c(x_1,\ldots,x_n) = E(f(x_1,\ldots,x_n), o).$

Our initial problem is now equivalent to finding parameters $(x_1,\ldots,x_n)$ that minimize the value of $c$.

Next, we pick some arbitrary initial values for $(x_1,\ldots,x_n)$. We’re going to progressively refine these values until their image through $c$ is sufficiently small. To do so, we compute the gradient of $c$ at $(x_1,\ldots,x_n)$. The gradient tells us, starting from the point at which we compute it, which small change in the $(x_i)$ will produce the fastest growth of $c$. For instance, if the gradient of $c$ at $(1, 1)$ is $(-0.3, 1.2)$, we can conclude that, for a small enough $\alpha > 0$, $c(1 - 0.3\alpha, 1 + 1.2\alpha)$ is larger than $c(1, 1)$. Conversely, the opposite of the gradient tells us the direction of fastest decrease of $c$.

This is the algorithm of gradient descent: start from a set of values (aka a position in the input space), compute the gradient at this position, update the position with the gradient. Denoting the gradient as $\nabla c(x_1,\cdots,x_n)$, the update step is

$(x_1,\ldots,x_n)^\mathrm{new} = (x_1,\ldots,x_n)^\mathrm{old} - \alpha\nabla c(x_1,\ldots,x_n)^\mathrm{old}.$

(There are various heuristics for choosing $\alpha$, which I will not get into at all because they’re not very important for what follows. Also note that there are many conditions for this method to work, such as $c$ being convex and differentiable, which I will assume are true.)

Wikipedia's illustration of gradient descent

Wikipedia’s illustration of gradient descent. The blue circles are the level lines of the cost function, just like on a map; the optimum is in the smallest circle. $x_0$ is the starting position, and the red arrow starting from it is the gradient at $x_0$ (multiplied by $\alpha$). We add this gradient to our current position and end up at $x_1$, a bit closer to the minimum. We then repeat this process until we’re tired.

Back to Elo

So how does gradient descent apply to Elo? We have

parameters we want to optimize, these are $\mu_2,\cdots,\mu_N$ (recall that I’m taking $\mu_1 = 1500$ as above, so we are not considering $\mu_1$ as a parameter),
a function $f$ that takes in our parameters and outputs the probability of win for each player $i$ against each other player $j$, by applying the Elo formula above
an objective $o$, which is the set of actual win rates $r_{ij}$ for each pair of players.

We still need to decide on an error metric to optimize: an error function $E$, where we will want to minimize the cost²This expression assumes we want to apply $E$ coordinate-by-coordinate, which is not technically required, but which we’ll do here; a good reason is that in practice we don’t necessarily have values for all the $r_{ij}$, so we might want to omit some terms. By applying $E$ to each term separately, we can just replace components with missing values with 0.

$c(\boldsymbol\mu) = \sum_{i=1}^N \sum_{j=i+1}^N E\left(r_{ij}, \frac 1 {1 + e^{-(\mu_i - \mu_j)}}\right)$

Once again, there are various possible choices for $E$. However, since $r_{ij}$ and $\sigma(\mu_i - \mu_j)$ are both probabilities, it’s natural to consider the cross-entropy error:

$E(x, y) = - \left(x\ln y + (1-x)\ln(1-y)\right),$

which in our case gives

$c(\boldsymbol\mu) = -\sum_{i=1}^N \sum_{j=i+1}^N r_{ij} \ln\sigma(\mu_i - \mu_j) + (1-r_{ij})\ln(1-\sigma(\mu_i-\mu_j))$

From there, we can use gradient descent. We compute the gradient coordinate by coordinate:

$\begin{aligned} \frac{\partial c}{\partial \mu_k}(\boldsymbol\mu) &= -\sum_{i=1, i\not=k}^N r_{ki}(1-\sigma(\mu_k-\mu_i))-(1-r_{ki})\sigma(\mu_k-\mu_i) \\ &= \sum_{i=1, i\not=k}^N \sigma(\mu_k-\mu_i) - r_{ki} \end{aligned}$

And the full gradient is:

$\nabla c(\boldsymbol\mu) = \begin{pmatrix} \sum_{i=2}^N \sigma(\mu_1-\mu_i) - r_{1i} \\ \vdots \\ \sum_{i=1}^{N-1} \sigma(\mu_N-\mu_i) - r_{Ni} \end{pmatrix}$

What this all means is that for each player, we will need to adjust their score based on the differences between their expected and actual win rate, summed across all opponents.

We need to pick initial parameters; I’m going with $\mu_k = 10$ for all $k$. We also pick a learning rate, I’ve arbitrarily chosen $\alpha = 2$. Then we can just let gradient descent do its thing! After 25 iterations, I get $\mu_2 = 9.88$ and $\mu_3 = 9.05$ (again, I’m taking $\mu_1 = 10$ as a reference and excluding that player from the calculation).

Turning these back into probabilities, we have a win rate of 53% for 1 against 2, 75% for 1 against 3, and 69% for 2 against 3, which is not far from the observed probabilities that we started with!

Online computation

In practice, Elo ratings are rarely computed by grouping all available data and performing one big gradient descent on everybody’s rating. Instead, online (in the sense of “live ranking”, no relation with the Internet) Elo rating updates are typically done after each game.

There is a theoretical basis for that too: stochastic gradient descent. This is the method that most people are really talking about when they mention gradient descent in machine learning, and the idea is to replace our full gradient computation at each iteration step with some sort of partial approximation. In the context of neural networks, this is akin to updating the weights for each training example or batch of training examples, instead of computing the aggregate error for the entire training dataset before any updates.

The gradient estimation that we use for online Elo is essentially the full gradient formula filtered down to only the most recent game. More specifically, in every sum in the gradient, we simply remove any term that includes a player that was not involved in the game. So the gradient above becomes:

$\nabla c(\boldsymbol\mu) \approx \begin{pmatrix} 0\\ \vdots \\ 0 \\ \mathrm{row}\,i\mathrm{:}\;\sigma(\mu_i-\mu_j) - r_{ij} \\ 0\\ \vdots \\ 0 \\ \mathrm{row}\,j\mathrm{:}\;\sigma(\mu_j-\mu_i) - r_{ji} \\ 0\\ \vdots \\ 0 \\ \end{pmatrix}$

With $r_{ij} = 1$ if $i$ won, and 0 otherwise; in any case, $r_{ji} = 1 - r_{ij}$ (this can be extended to be a win ratio instead of 0 or 1, if scoring e.g. the result of a contest or of several games in a row; $\alpha$ might need to be modified too).

The gradient descent update rule becomes:

$\begin{aligned} \mu_i^\mathrm{new} = \mu_i^\mathrm{old} + \alpha(r_{ij} - \sigma(\mu_i^\mathrm{old}-\mu_j^\mathrm{old})) \\ \mu_j^\mathrm{new} = \mu_j^\mathrm{old} + \alpha(r_{ji} - \sigma(\mu_j^\mathrm{old}-\mu_i^\mathrm{old})) \end{aligned}$

Which is pretty much exactly the Elo update rule!

As a side note, contrary to what we were doing before, this update rule does not take one player as a reference point (this would require changing most players’ ratings after each match). Instead, as we’ll see below, we are now holding the average rating constant.

Elo dynamics

To conclude, let’s look at a few practical questions for Elo ratings.

Drift (inflation or deflation)

Some people³https://lichess.org/@/gaboose argue that current chess champions’ ratings are inflated, being higher than those of comparably-strong past champions. Can we analyze this phenomenon?

If you look at the update rule just above, you can notice that the updates are symmetric:

$r_{ij} - \sigma(\mu_i^\mathrm{old}-\mu_j^\mathrm{old}) = -(r_{ji} - \sigma(\mu_j^\mathrm{old}-\mu_i^\mathrm{old})),$

So with each update, the sum of ratings does not change (the winner gains as much as the loser loses). Similarly, the average doesn’t change. So if players are all added to the pool with the same initial score (say, 1500), the average rating will remain constant at 1500.

On the other hand, if we allow removal of players from the pool, or only consider active players, the average rating may change over time: upwards if players generally retire with scores under 1500, downwards otherwise.

Some people have looked at FIDE ratings specifically,⁴Regan, Kenneth W., and Guy M. Haworth. “Intrinsic chess rating.” (2011): 834-839.. and evaluated games from different periods with chess engines in order to get absolute ratings; they found good correlation with Elo ratings, suggesting that there hasn’t been much inflation. Chess experts today may instead be stronger in an absolute sense, probably thanks to the help of computer-assisted training.

Comparison across pools

It’s obviously not possible to directly compare Elo ratings of players within different pools, even if they use exactly the same rating method, because the average absolute strengths within each pool may not be the same.

As an example, imagine the extreme case where three players A is 10 times stronger than B, and B 10 times stronger than C. A pool with just A and B would give respective rankings of A: 1700 and B: 1300, but a pool with just B and C would now give B a rating of 1700!

In addition different organizations use different scoring methods (for instance, assuming a normal distribution of player performances instead of Gumbel), making the scores non-comparable even if the player pools are the same.

But wait this model makes no sense

A significant issue with this model is that gradient descent generally assumes a static objective; in other words, the entire model assumes that people’s real strength does not change. In practice, stochastic gradient descent is tolerant to lots of errors, but this is still not super ideal. The Glicko rating system, which is significantly more complex, has been developed to solve this problem.

Conclusion

The main takeaway from this is that Elo is not magic, and there’s a nice machine-learningy way to derive it from first principles! Although it’s not often advertised, it’s also interesting to see how we can turn scores into win probabilities.

Elo, Arpad E. The rating of chessplayers, past and present. Arco Pub., 1978. ↩
This expression assumes we want to apply $E$ coordinate-by-coordinate, which is not technically required, but which we’ll do here; a good reason is that in practice we don’t necessarily have values for all the $r_{ij}$, so we might want to omit some terms. By applying $E$ to each term separately, we can just replace components with missing values with 0. ↩
https://lichess.org/@/gaboose ↩
Regan, Kenneth W., and Guy M. Haworth. “Intrinsic chess rating.” (2011): 834-839.. ↩

Terbium