by Rémi Coulom.

Bayeselo is a freeware tool to estimate Elo ratings. It can read a file containing game records in PGN format, and produce a rating list.

- bayeselo.exe: Windows binary
- bayeselo.tar.bz2: Source code (distributed under the terms of the GNU General Public License)

- 2010.03.31:
  - Minor fixes to the source code so that it compiles with fewer warnings and errors.

- 2007.01.30:
  - New optional parameter to the `ratings` command to indicate whether the rank shown is the rank among the players in the names file, or the global rank among all players.
  - The `offset` command can be used to set the rating of one of the players. For instance, `offset 2800 Fruit` will offset ratings so that the rating of Fruit is 2800.
  - Fix to the `elostat` command, to produce a better imitation of Elostat's rating calculation (thanks to Brian Mottershead).
- 2006.04.17:
  - The `ratings` command now takes an additional optional parameter that indicates the name of a file from which to read the names of the players whose ratings are to be displayed.
  - New `names` command in the `ResultSet` interface: get the list of players in alphabetical order (can be used to write a file for the `ratings` command as indicated in the previous item).
- 2006.02.08:
  - Better predictions: simulations take rating uncertainty into consideration.
  - New `pairstats` command to get statistics for every pair of players.
  - The `ratings` command now displays the average opponent.
  - No need to explicitly run `covariance` before `los`.

- 2005.12.18:
  - New `scale` command to scale ratings. By default, maximum-likelihood ratings are now scaled down so that they look more like Elostat/SSDF ratings.
  - New `prediction` command to predict the outcome of round-robin tournaments.
- 2005.09.29: Elostat confidence intervals + improved Elostat accuracy
- 2005.09.27: Fixed a bug that affected rating calculations
- 2005.09.24:
  - Elostat algorithm implemented
  - Better rounding of rating estimations
  - Sorted opponents in the `details` command
  - Removed confusing debug output of the `players` command
  - Removed dependencies: smaller source and binary
  - Compilation date/time added in case I forget to increment the version

- 2005.09.22: Fixed a potential segmentation fault + link to source
- 2005.09.21: Better ratings thanks to the use of a proper prior (see Winboard Forum discussion for details)
- 2005.02.13: Many changes:
  - "exactdist" no longer requires an amount of memory proportional to the number of players, so it can now be applied to a set of results with hundreds of thousands of players.
  - Deep internal changes, resulting in a different organization of commands.
  - "drawdist" and "advdist" are twice as fast.
  - New function to get the proportion of draws as a function of the average rating of players.
  - New function to estimate confidence intervals by inverting the Hessian of the log-likelihood. A simplified diagonal version is also available.
  - Tables of "likelihood that A is stronger than B".
  - Some minor problems were fixed.

- 2005.01.15: Fixed a PGN-parsing bug (ÿ was considered as EOF)
- 2005.01.14: Fixed a bug that could cause the program to crash when a player is not MM-connected to anybody.

Bayeselo is a command-line tool. This section presents typical examples of use.

The example below shows how to compute ratings from a PGN file (wbec.pgn, located in the directory where the program is executed).

```
ResultSet>readpgn wbec.pgn
37 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm
Iteration 100: 1.60455e-005
00:00:00,00
ResultSet-EloRating>exactdist
00:00:00,04
ResultSet-EloRating>ratings
Rank Name            Elo    +    - games score oppo. draws
   1 Dragon 4.7.5    120  167  148     8   75%     5   50%
   2 Cerebro 2.05a    60  229  211     4   63%     1   25%
   3 Movei 0.08.336    9  138  138    12   50%     6   17%
   4 Zarkov 4.67       2  146  150     9   44%    18   44%
...
```

If you wish to get the output of the "ratings" command in a file, you may
redirect its output like this: `ratings >myfile.txt`

Bayeselo can produce a likelihood-of-superiority matrix:

```
ResultSet-EloRating>los 0 5 3
                 Dr  Ce  Mo  Za  Co
Dragon 4.7.5         59  81  85  71
Cerebro 2.05a    40      57  60  68
Movei 0.08.336   18  42      51  51
Zarkov 4.67      14  39  48      50
Comet B.68       28  31  48  49
```

Bayeselo can also produce predictions for round-robin tournaments. For instance, if wbec1to9.pgn contains games of many players, it is possible to predict the outcome of a round-robin tournament between 4 of them and an unknown "Toto", with this kind of script:

```
readpgn wbec1to9.pgn
elo
mm
prediction
rounds 2   ;This indicates that each player plays 4 games, 2 with each color.
results
addplayer Esc 1.16
addplayer Pharaon 2.62
addplayer Gandalf 4.32h
addplayer TheCrazyBishop 0045
addplayer Toto
x
players    ;Displays the list of players
simulate   ;This runs 100000 random simulations of the tournament
x
x
x
```

Sending the commands above to bayeselo will produce an output that looks like:

```
Num Name                  Elo     Std.Dev.
  0 Esc 1.16              205.142 13.6535
  1 Pharaon 2.62          464.88  13.7103
  2 Gandalf 4.32h         537.769 15.2113
  3 TheCrazyBishop 0045   340.416 15.5046
  4 Toto                  0       1000
(5 players)
Rank Player name           Points EPoints StdDev ERank   1   2   3   4   5
   1 Gandalf 4.32h            0.0   11.55   2.06  1.59  52  37   9   1   0
   2 Pharaon 2.62             0.0   10.14   2.14  2.20  18  49  28   5   0
   3 TheCrazyBishop 0045      0.0    7.59   2.19  3.32   1   9  51  34   4
   4 Toto                     0.0    5.73   6.44  3.60  29   4   4   6  58
   5 Esc 1.16                 0.0    4.99   2.12  4.29   0   0   7  55  38
```

The `EPoints` column indicates the expected final score. `ERank` is the expected final rank. The matrix to the right indicates the probability in percent for every player and every rank.
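The `EPoints` and `ERank` numbers can be understood as plain Monte Carlo: sample each game from the win/draw/loss probabilities implied by the ratings, play the whole round robin many times, and average. Below is a simplified Python sketch of that idea. It is not bayeselo's actual code: it ignores rating uncertainty (which bayeselo does take into account), and the first-move-advantage and draw constants are merely illustrative.

```python
import random

def simulate_round_robin(elos, rounds=2, n_sims=10000, seed=0):
    """Monte Carlo estimate of expected final scores in a round robin.

    elos: dict mapping player name -> rating.  Every pair plays `rounds`
    games with each color.  Rating uncertainty is ignored, and the
    first-move-advantage / draw parameters are illustrative constants.
    """
    rng = random.Random(seed)
    adv, drw = 32.8, 97.3  # illustrative eloAdvantage / eloDraw values

    def f(delta):
        return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

    names = list(elos)
    totals = {name: 0.0 for name in names}
    for _ in range(n_sims):
        for i, p1 in enumerate(names):
            for p2 in names[i + 1:]:
                for g in range(2 * rounds):
                    # Alternate colors so each side plays White `rounds` times.
                    white, black = (p1, p2) if g % 2 == 0 else (p2, p1)
                    p_white = f(elos[black] - elos[white] - adv + drw)
                    p_black = f(elos[white] - elos[black] + adv + drw)
                    r = rng.random()
                    if r < p_white:
                        totals[white] += 1.0
                    elif r < p_white + p_black:
                        totals[black] += 1.0
                    else:
                        totals[white] += 0.5
                        totals[black] += 0.5
    return {name: totals[name] / n_sims for name in names}
```

Averaging the final ranks over the same simulations would give the `ERank` column, and counting how often each player finishes at each rank gives the probability matrix.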

This prediction tool may also be applied to a running tournament, where some of the games have already been played. In order to do this, simply replace the `addplayer` commands in the script by

```
readpgn partial.pgn
```

where `partial.pgn` contains the current games of the round robin.
If `partial.pgn` does not contain all the players of the tournament yet,
you may add more players with the `addplayer` command. In case the
rating of one participant is unknown, you can set it manually with the
`elo` command at the level of the `prediction` interface.

The prediction tool also lets you change the number of points awarded for a win, a loss, and a draw. This way, it can be applied to generate predictions in the French Football Championship (soccer in the US), where a victory is worth 3 points, a draw 1 point, and a loss 0 points. The default values are 1 for a win, 0.5 for a draw, and 0 for a loss.

If you want more usage information, you can get a list of available commands with the `?` command when running bayeselo.

The fundamental formula of Elo theory gives the expected result (E) of a game as a function of the rating difference (D) between the players, where D is the opponent's rating minus the player's:

E = 1 / (1 + 10^(D/400))

The fundamental assumption of the Elo rating system is that the strength of a player can be described by a single value, and that game results are drawn according to the formula above.
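The formula is short enough to express directly in code. A minimal Python sketch, using the sign convention above (D is the opponent's rating minus the player's, so a negative D means the player is stronger):

```python
def expected_score(d):
    """Expected score of a game, where d = opponent's Elo minus the player's Elo.

    A player 400 points stronger than the opponent (d = -400) is expected
    to score 10/11, i.e. about 91%.
    """
    return 1.0 / (1.0 + 10.0 ** (d / 400.0))
```

For equally rated players (`d = 0`) this gives 0.5, as expected.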

The problem of Elo evaluation consists in estimating the Elo rating of a set of players, from the observation of results of their games.

The fundamental Elo formula can be reversed to obtain an estimate of the rating difference between two players as a function of the average score. This is the basis of the Elostat approach, which works in two steps:

- An iterative method that solves a fixed-point equation, so that the rating of every player agrees with the reverse Elo formula, assuming an expected result equal to the player's average score against an opponent whose Elo is the average of the opponents' Elos. This is done under the constraint of a given average Elo over all the players.
- A measurement of the variance of the score to estimate uncertainty.
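Inverting E = 1/(1+10^(D/400)) gives the rating advantage implied by an average score, which is the core of the first step. A minimal sketch (the helper name is illustrative, not Elostat's own):

```python
import math

def elo_difference(score):
    """Rating advantage implied by an average score (0 < score < 1).

    Inverts E = 1 / (1 + 10^(D/400)); a positive result means the player
    is stronger than the (average) opponent.
    """
    return 400.0 * math.log10(score / (1.0 - score))
```

For example, a 75% average score maps to an advantage of 400·log10(3), roughly 191 Elo points.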

The main flaw of this approach is that the estimation of uncertainty behaves as if a player had played all games against a single opponent whose Elo is equal to the mean Elo of the opponents. This assumption has bad consequences for the estimation of ratings and uncertainties:

- The expected result against two players is not equal to the expected result against one single player whose rating is the average of the two players.
- Estimation of uncertainty is wrong, because 10 wins and 10 losses against a 1500-Elo opponent should result in less uncertainty than 10 wins against a 500-Elo opponent and 10 losses against a 2500-Elo opponent.
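The first point can be checked directly with the expected-score formula. In the sketch below (a self-contained Python illustration, reusing the formula given earlier), a 0-rated player faces opponents rated +100 and +500; the average of the two expected scores differs from the expected score against a single +300 opponent, because the Elo curve is not linear in the rating difference:

```python
def expected_score(d):
    """Expected score, where d = opponent's Elo minus the player's Elo."""
    return 1.0 / (1.0 + 10.0 ** (d / 400.0))

# Expected score of a 0-rated player against opponents rated +100 and +500:
mixed = (expected_score(100) + expected_score(500)) / 2.0   # about 0.21
# Expected score against a single opponent at their average rating, +300:
single = expected_score(300)                                # about 0.15
```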

Another problem is that the estimation of uncertainty in Elostat behaves as if the ratings of the opponents were their true ratings. But those ratings also have uncertainty of their own, which should be taken into consideration.

All these problems of the Elostat approach can be solved using a Bayesian approach. The principle of the Bayesian approach consists in choosing a prior distribution over Elo ratings, and computing a posterior distribution as a function of the observed results.

The principle of Bayesian inference is based on Bayes's formula:

P(Elos|Results) = P(Results|Elos)P(Elos)/P(Results)

P(Elos) is the prior distribution. It will be chosen to be uniform in the rest of this discussion. So we get:

P(Elos|Results) ∝ P(Results|Elos)

In order to perform this calculation, it is necessary to assume a little more than the usual Elo formula. The expected score as a function of the Elo difference is not enough: we need the probabilities of a win, a draw, and a loss as a function of the Elo difference.

This can be done this way:

- f(Delta) = 1 / (1 + 10^(Delta/400))
- P(WhiteWins) = f(eloBlack - eloWhite - eloAdvantage + eloDraw)
- P(BlackWins) = f(eloWhite - eloBlack + eloAdvantage + eloDraw)
- P(Draw) = 1 - P(WhiteWins) - P(BlackWins)

eloAdvantage indicates the advantage of playing first. eloDraw indicates how likely draws are. The default values in the program were obtained by finding their maximum-likelihood values over 29,610 games of Leo Dijksman's WBEC. The values measured, with 95% confidence intervals, are:

- eloAdvantage = 32.8 +/- 4
- eloDraw = 97.3 +/- 2
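This model is easy to express in code. A minimal Python sketch, plugging the measured default values into the formulas above:

```python
ELO_ADVANTAGE = 32.8  # advantage of playing first (White), in Elo points
ELO_DRAW = 97.3       # how likely draws are, in Elo points

def f(delta):
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def game_probabilities(elo_white, elo_black):
    """Return (P(white wins), P(draw), P(black wins)) for one game."""
    p_white = f(elo_black - elo_white - ELO_ADVANTAGE + ELO_DRAW)
    p_black = f(elo_white - elo_black + ELO_ADVANTAGE + ELO_DRAW)
    p_draw = 1.0 - p_white - p_black
    return p_white, p_draw, p_black
```

For two equally rated players, the first-move advantage makes P(white wins) larger than P(black wins), with a substantial draw probability in between.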

`bayeselo` finds the maximum-likelihood ratings, using a
minorization-maximization (MM) algorithm. A description of this algorithm
is available in the Links section below.
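To give the flavor of MM, here is a sketch for the plain Bradley-Terry model, without draws or first-move advantage (the actual bayeselo model handles both; see Hunter's paper linked below for the full derivation). Each iteration updates every player's strength from their win count and pairwise game counts:

```python
import math

def mm_ratings(wins, n_iter=100):
    """Maximum-likelihood Bradley-Terry strengths via minorization-maximization.

    wins[i][j] = number of games player i won against player j.
    Simplified sketch: no draws, no color advantage, and every player is
    assumed to have at least one win (otherwise the ML strength is zero).
    """
    n = len(wins)
    gamma = [1.0] * n
    for _ in range(n_iter):
        new_gamma = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of player i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (gamma[i] + gamma[j])
                for j in range(n) if j != i
            )
            new_gamma.append(w_i / denom if denom > 0 else gamma[i])
        gamma = new_gamma
        # Normalize so the geometric mean stays at 1 (Elo ratings sum to 0).
        g = math.exp(sum(math.log(x) for x in gamma) / n)
        gamma = [x / g for x in gamma]
    # Convert Bradley-Terry strengths to an Elo-like scale.
    return [400.0 * math.log10(x) for x in gamma]
```

Each update is guaranteed not to decrease the likelihood, which is what makes MM algorithms robust in practice.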

In this section, I will present some facts that highlight the differences between the two programs and that, I hope, will convince most readers that bayeselo is better than Elostat.

Still, I do not claim that bayeselo is perfect, and criticism is welcome. Bayeselo has already benefited a lot from the feedback of its users, and I thank them for that. If you find a situation where the output of bayeselo looks bad or strange, do not hesitate to let me know.

In chess, playing with the white pieces is an advantage estimated to be worth about 33 Elo points. Bayeselo takes this into consideration. For instance, after a single draw between two players A and B, A playing white, here are the outputs of elostat and bayeselo:

Elostat:

```
  Program  Elo   +   - Games  Score  Av.Op.   Draws
1 A :        0   0   0     1 50.0 %       0 100.0 %
2 B :        0   0   0     1 50.0 %       0 100.0 %
```

Bayeselo:

```
Rank Name  Elo   +   - games score draws
   1 B       5 146 146     1   50%  100%
   2 A      -5 146 146     1   50%  100%
```

Note that the difference in estimated playing strength according to bayeselo is relatively small compared to the 33 Elo-point value of playing first. That is because of a mechanism of bayeselo that requires many games to confirm a rating difference, detailed in the next subsection. After many such draws, bayeselo's estimated rating difference would be 33 points.

Bayeselo uses a prior distribution over ratings that increases the likelihood that the ratings of players are close to each other. The consequence is that a high rating difference has to be earned, and requires many more games than with Elostat. For instance, here are the outputs of the two programs after A beats B in a single game, and then after A beats B in ten games:

Elostat:

```
  Program  Elo   +   - Games   Score Av.Op. Draws
1 A :      300   0   0     1 100.0 %   -300 0.0 %
2 B :     -300   0   0     1   0.0 %    300 0.0 %

  Program  Elo   +   - Games   Score Av.Op. Draws
1 A :      300   0   0    10 100.0 %   -300 0.0 %
2 B :     -300   0   0    10   0.0 %    300 0.0 %
```

Bayeselo:

```
Rank Name  Elo   +   - games score draws
   1 A      41 181 152     1  100%    0%
   2 B     -41 152 181     1    0%    0%

Rank Name  Elo   +   - games score draws
   1 A     169 172  99    10  100%    0%
   2 B    -169  99 172    10    0%    0%
```

A big source of problems in Elostat is that it assumes that many games against many opponents are equivalent to as many games against one opponent whose rating is the average of their ratings. This is very wrong, and fails badly in situations where the opponents' ratings are far apart. If we consider the simple case where A beats B, B draws C, and C beats D, here are the outputs of the two programs:

Elostat:

```
  Program  Elo   +   - Games   Score Av.Op.  Draws
1 A :      709   0   0     1 100.0 %    109  0.0 %
2 B :      109 259 409     2  25.0 %    300 50.0 %
3 C :     -109 409 259     2  75.0 %   -300 50.0 %
4 D :     -709   0   0     1   0.0 %   -109  0.0 %
```

Bayeselo:

```
Rank Name  Elo   +   - games score draws
   1 A      96 329 254     1  100%    0%
   2 C       8 195 180     2   75%   50%
   3 B      -8 180 195     2   25%   50%
   4 D     -96 254 329     1    0%    0%
```

Note that Elostat gives a 218-point difference between B and C. This is completely wrong, since the only information we have about their relative strength is that B drew C, so their ratings should be close. Bayeselo gives a small advantage to C, because C drew B while playing Black.

This is probably the most severe weakness of Elostat. It shows up not only in that kind of artificial situation, but also in real tournaments. The games of the 10th edition of the WBEC Ridderkerk tournament produce these ratings:

Elostat:

```
    Program            Elo   +   - Games  Score Av.Op.  Draws
104 NullMover 0.25 :   -92  58  58   124 39.1 %    -15 15.3 %
109 Natwarlal 0.12 :  -412  73  70    94 72.9 %   -584 16.0 %
112 Alarm 0.93.1   :  -495  69  67    94 61.7 %   -579 12.8 %
114 NagaSkaki 3.07 :  -531  65  64    94 56.4 %   -576 19.1 %
```

Bayeselo:

```
Rank Name             Elo   +   - games score draws
  60 Natwarlal 0.12   259  69  69    94   73%   16%
  97 Alarm 0.93.1     173  66  66    94   62%   13%
 106 NullMover 0.25   138  55  55   124   39%   15%
 109 NagaSkaki 3.07   130  63  63    94   56%   19%
```

These ratings show big differences between bayeselo and Elostat. These 4 players all participated in the "Promo D" tournament between the 3rd and 4th divisions. Natwarlal, Alarm, and NagaSkaki come from division 4; NullMover comes from division 3. Results of the promotion tournament indicate that NullMover is weaker than Natwarlal (NullMover scored 12.5/32, whereas Natwarlal scored 23.5/32). So the ratings of bayeselo look right according to the results of the promotion tournament, whereas those of Elostat are completely wrong.

- ELOStat algorithm ?: a discussion in the Winboard forum. This discussion led to the development of this web page, and contains a lot of additional information about bayeselo.
- MM Algorithms for Generalized Bradley-Terry Models, by David R. Hunter, presents the Minorization-Maximization algorithm used in bayeselo to estimate the maximum-likelihood ratings.
- Hand-written notes with detailed calculations for Hunter's paper, as well as the derivation of the bayeselo algorithm (ties combined with home-field advantage).
- Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung, by Ernst Zermelo, introduced the principles on which bayeselo is based, in 1929!
- Go players have been using maximum-likelihood ratings for a long time, in particular with mlrate. The rating system of KGS is also based on maximum likelihood.
- Step-by-step instruction to produce predictions of the WBEC-Ridderkerk computer-chess tournament.