Bayesian Elo Rating

by Rémi Coulom.

Bayeselo is a freeware tool to estimate Elo ratings. It can read a file containing game records in PGN format, and produce a rating list.

Contents

Download Software

History

Usage Documentation

Bayeselo is a command-line tool. This section presents typical examples of use.

The example below shows how to compute ratings from a PGN file (wbec.pgn, located in the directory where the program is executed).

ResultSet>readpgn wbec.pgn
37 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm
Iteration 100: 1.60455e-005 
00:00:00,00
ResultSet-EloRating>exactdist
00:00:00,04              
ResultSet-EloRating>ratings
Rank Name                  Elo    +    - games score oppo. draws 
   1 Dragon 4.7.5          120  167  148     8   75%     5   50% 
   2 Cerebro 2.05a          60  229  211     4   63%     1   25% 
   3 Movei 0.08.336          9  138  138    12   50%     6   17% 
   4 Zarkov 4.67             2  146  150     9   44%    18   44% 
...

If you wish to get the output of the "ratings" command in a file, you may redirect its output like this: ratings >myfile.txt

bayeselo can produce a likelihood-of-superiorty matrix:

ResultSet-EloRating>los 0 5 3
                     Dr Ce Mo Za Co
Dragon 4.7.5            59 81 85 71
Cerebro 2.05a        40    57 60 68
Movei 0.08.336       18 42    51 51
Zarkov 4.67          14 39 48    50
Comet B.68           28 31 48 49

Bayeselo can also produce predictions for round-robin tournaments. For instance, if wbec1to9.pgn contains games of many players, it is possible to predict the outcome of a round-robin tournament between 4 of them and an unknown "Toto", with this kind of script:

readpgn wbec1to9.pgn
elo
 mm
 prediction
  rounds 2 ;This indicates that each player plays 4 games, 2 with each color.
  results
   addplayer Esc 1.16
   addplayer Pharaon 2.62
   addplayer Gandalf 4.32h
   addplayer TheCrazyBishop 0045
   addplayer Toto
  x
  players ;Displays the list of players
  simulate ;This runs 100000 random simulations of the tournament
  x
 x
x

Sending the commands above to bayeselo will produce an output that looks like:

Num                 Name      Elo Std.Dev.
  0             Esc 1.16  205.142  13.6535
  1         Pharaon 2.62   464.88  13.7103
  2        Gandalf 4.32h  537.769  15.2113
  3  TheCrazyBishop 0045  340.416  15.5046
  4                 Toto        0     1000
(5 players)

Rank Player name           Points EPoints  StdDev   ERank    1  2  3  4  5
   1 Gandalf 4.32h            0.0   11.55    2.06    1.59   52 37  9  1  0
   2 Pharaon 2.62             0.0   10.14    2.14    2.20   18 49 28  5  0
   3 TheCrazyBishop 0045      0.0    7.59    2.19    3.32    1  9 51 34  4
   4 Toto                     0.0    5.73    6.44    3.60   29  4  4  6 58
   5 Esc 1.16                 0.0    4.99    2.12    4.29    0  0  7 55 38

The EPoints column indicates the expected final score. ERank is the expected final rank. The matrix to the right indicates the probability in percent for every player and every rank.

This prediction tool may also be applied to a running tournament, where some of the games have already been played. In order to do this, simply replace the addplayer commands in the script by

readpgn partial.pgn

, where partial.pgn contains the current games of the round robin. If partial.pgn does not contain all the players of the tournament yet, you may add more players with the addplayer command. In case the rating of one participant is unknown, you can set it manually with the elo command at the level of the prediction interface.

The prediction tool also lets you change the number of points awarded for a win, a loss, and a draw. This way, it can be applied to generate predictions in the French Football Championship (soccer in the US), where a victory is 3 points, a draw 1 point and a loss 0 point. Default values are 1, 0.5, and 0.

If you want more usage information, you can get a list of available commands with the ? command when running bayeselo.

Theory

The fundamental formula of Elo theory gives the expected result (E) of a game as a function of the rating difference (D) of players. For instance:

E=1/(1+10^(D/400))

The fundamental assumption of the Elo rating system is that the strength of a player can be described by a single value, and that game results are drawn according to the formula above.

The problem of Elo evaluation consists in estimating the Elo rating of a set of players, from the observation of results of their games.

Elostat approach

The fundamental Elo formula can be reversed to obtain an estimation of the rating difference between two players, as a function of the average score. This is the basis of the Elostat approach, that works in two steps:

The main flaw of this approach is that the estimation of uncertainty does as if a player had played against one opponent, whose Elo is equal to the mean Elo of the opponents. This assumption has bad consequences for the estimation of ratings and uncertainties:

Also, another problem is that the estimation of uncertainty in Elostat does as if the rating of opponents are their true ratings. But those ratings also have some uncertainty that should be taken into consideration.

Bayesian approach

All these problems of the Elostat approach can be solved using a Bayesian approach. The principle of the Bayesian approach consists in choosing a prior likelihood distribution over Elo ratings, and computing a posterior distribution as a function of the observed results.

The principle of Bayesian inference is based on Bayes's formula:

P(Elos|Results) = P(Results|Elos)P(Elos)/P(Results)

P(Elos) is the prior distribution. It will be chosen to be uniform in the rest of this discussion. So we get:

P(Elos|Results) proportional to P(Results|Elos)

In order to perform this calculation, it is necessary to assume a little more than the usual ELO formula. The expected score as a function of the Elo difference is not enough. We need the probability of a win, a draw and a loss as a function of the Elo difference.

This can be done this way:

eloAdvantage indicates the advantage of playing first. eloDraw indicates how likely draws are. The default values in the program were obtained by finding their maximum-likelihood values over 29,610 games of Leo Dijksman's WBEC. The value measured, with 95% confidence intervals are:

bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below.

Is Bayeselo Really Better than Elostat?

In this section, I will present some facts that highlight the differences between the two programs, that, I hope, should convince most readers that bayeselo is better than elostat.

Still, I do not claim that bayeselo is perfect, and criticism is welcome. Bayeselo has already benefited a lot from the feedback of its users, and I thank them for that. If you find a situation where the output of bayeselo looks bad or strange, do not hesitate to let me know.

Bayeselo takes color into consideration

In chess, playing with the white pieces is an advantage estimated to be worth about 33 Elo points. Bayeselo takes this into consideration. For instance, after a single draw between two players A and B, A playing white, here are the outputs of elostat and bayeselo:

Elostat:

  Program    Elo    +   -   Games   Score   Av.Op.  Draws
1 A        :    0    0   0     1    50.0 %      0   100.0 %
2 B        :    0    0   0     1    50.0 %      0   100.0 % 
Bayeselo:

Rank Name   Elo    +    - games score draws
   1 B        5  146  146     1   50%  100%
   2 A       -5  146  146     1   50%  100% 

Note that the difference in estimated playing strength according to bayeselo is relatively small compared to the 33 Elo-point value of playing first. That is because of a mechanism of bayeselo that requires many games to confirm a rating difference, detailed in the next subsection. After many such draws, bayeselo's estimated rating difference would be 33 points.

For bayeselo, 10-0 is not the same as 1-0

Bayeselo uses a prior distribution over ratings, that increases the likelihood that the ratings of players are close to each other. The consequence is that a high rating difference has to be deserved, and requires many more games than with Elostat.

Elostat:

  Program    Elo    +   -   Games   Score   Av.Op.  Draws
1 A        :  300    0   0     1   100.0 %   -300    0.0 %
2 B        : -300    0   0     1     0.0 %    300    0.0 %

  Program    Elo    +   -   Games   Score   Av.Op.  Draws
1 A        :  300    0   0    10   100.0 %   -300    0.0 %
2 B        : -300    0   0    10     0.0 %    300    0.0 % 
Bayeselo:

Rank Name   Elo    +    - games score draws
   1 A       41  181  152     1  100%    0%
   2 B      -41  152  181     1    0%    0%

Rank Name   Elo    +    - games score draws
   1 A      169  172   99    10  100%    0%
   2 B     -169   99  172    10    0%    0% 

Bayeselo behaves correctly when opponents' ratings are far apart

A big source of problems in elostat is that it assumes that many games against many opponents is equivalent to as many games against one opponent whose rating in the average of ratings. This is very wrong, and fails badly in situations where opponent's ratings are far apart. If we consider the simple case where A beats B, B draws C, C beats D, here are the outputs of the two programs:

Elostat:

  Program    Elo    +   -   Games   Score   Av.Op.  Draws
1 A        :  709    0   0     1   100.0 %    109    0.0 %
2 B        :  109  259 409     2    25.0 %    300   50.0 %
3 C        : -109  409 259     2    75.0 %   -300   50.0 %
4 D        : -709    0   0     1     0.0 %   -109    0.0 % 
Bayeselo:

Rank Name   Elo    +    - games score draws
   1 A       96  329  254     1  100%    0%
   2 C        8  195  180     2   75%   50%
   3 B       -8  180  195     2   25%   50%
   4 D      -96  254  329     1    0%    0% 

Note that elostat gives a 218-point difference between B and C. This is really completely wrong since the only information we have about their relative strength is that B drew C. So their ratings should be close. Bayeselo gives a small advantage to C, because it drew B while playing as black.

This is probably the most severe weakness of elostat. It does not only show with that kind of artificial situation, but also in real tournaments. The games of the 10th edition of the WBEC Ridderkerk tournament produce these ratings:

Elostat:
    Program              Elo    +   -   Games   Score   Av.Op.  Draws

104 NullMover 0.25     :  -92   58  58   124    39.1 %    -15   15.3 %
109 Natwarlal 0.12     : -412   73  70    94    72.9 %   -584   16.0 %
112 Alarm 0.93.1       : -495   69  67    94    61.7 %   -579   12.8 %
114 NagaSkaki 3.07     : -531   65  64    94    56.4 %   -576   19.1 %
Bayeselo:

Rank Name                Elo    +    - games score draws
  60 Natwarlal 0.12      259   69   69    94   73%   16%
  97 Alarm 0.93.1        173   66   66    94   62%   13%
 106 NullMover 0.25      138   55   55   124   39%   15%
 109 NagaSkaki 3.07      130   63   63    94   56%   19%

These ratings show big differences between bayeselo and elostat. These 4 players all participated in the "Promo D" tournament between 3rd and 4th divisions. Natwarlal, Alarm, and NagaSkaki come from division 4. NullMover from division 3. Results of the promotion tournament indicate that NullMover is weaker than Natwarlal (NullMover scored 12.5/32 whereas Natwarlal scored 23.5/32). So the ratings of bayeselo look OK according to the results of the promotion tournament, whereas those of elostat are completely wrong.

Links