Until now I've been benchmarking my players by running matches against various opponents - either my own benchmark players or standards like PubEval.

A better benchmark is the "benchmark databases" created by the folks at GNUbg. This post describes the benchmark and how to use it, and subsequent posts will show results and compare it to the other benchmarks I have been using.

They gathered many hundreds of thousands of representative positions, chose both random and interesting ones (based on 0-ply vs 2-ply equity differences) to include in a benchmark list, and then rolled out the equity for different potential moves for a random dice roll for each.

The benchmark calculation is the following: for each position+dice roll in the database, use your strategy to choose a move. Then find the strategy's choice in the list of rolled-out moves and add the equity difference vs the best rolled-out move to a sum. Then at the end divide that sum by the total number of elements to get the average equity error.

This has the advantage that it does not depend on dice rolls in a set of simulated games, so removes an element of randomness. Plus the list of positions is a "real world" set, generated from games played by both humans and bots on FIBS. The alternative would be to choose a list of positions from bot self play, which runs the risk of bias due to playing style of that bot.

It does depend on the accuracy of the rolled-out equities, which were calculated with GNUbg's 0-ply player. But those should be quite accurate in most cases, especially as a broad metric of goodness of play.

The GNUbg benchmarks are split into three databases: crashed, contact, and race. They are available via FTP. I have also mirrored them at dropbox: crashed, contact, and race. There are 100,000 position+move elements for crashed positions; 107,485 for contact; and 111,008 for race.

The files have quite a lot of information about the included positions, but the subset you need for the benchmark are lines that look like:

m OAHDPAABDAOAHDPAABDA 2 4 JIGHPAABDAOAHDPAABDA 0.133905 OAHDMJABDAOAHDPAABDA 0.0835662 OAHDOEABCBOAHDPAABDA 0.111429 OAHDOBABCEOAHDPAABDA 0.124954 OAHDPAABAJOAHDPAABDA 0.189375

Any line that starts with an 'm' is a move line, and those are the relevant ones for this calculation. The strings are board representations (more on that later). The first string is the representation of the starting board; the next two numbers are the dice rolls. So in the example above, the roll is a 2 and a 4. The next two elements represent the best move from the rollout: the string is the board representation and the number is the equity in that position. The rest of the line represents the other moves, ordered by equity. The string is the board representation and the number is the equity difference vs the best move. So the second element in the list of moves is the board represented by the string OAHDMJABDAOAHDPAABDA and its equity is 0.0835662 worse than the best move.

All equities in the benchmark files are cubeless.

Only the top five moves are included in the list, and it is possible that the strategy's choice is outside that list. If this happens, I assumed that the equity error for the strategy's choice is equal to the worst of the rolled-out equities. This makes the average error look a bit better than it really should, but hopefully this doesn't happen very often.

The string board representation is a way of serializing a board layout. It starts with the GNUbg position key and translates the resulting 10-byte number into a 20-character string so it can be represented as text. This is

Other lines in the benchmark files show GNUbg system settings; details of the rollouts for positions (lines starting with '#R' show the rolled-out probability of win, probability of gammon win, probability of backgammon win, probability of gammon loss, and probability of backgammon loss); and benchmark cube decisions (lines starting with 'o'). Those are not required for the benchmark calculation described above.

A better benchmark is the "benchmark databases" created by the folks at GNUbg. This post describes the benchmark and how to use it, and subsequent posts will show results and compare it to the other benchmarks I have been using.

They gathered many hundreds of thousands of representative positions, chose both random and interesting ones (based on 0-ply vs 2-ply equity differences) to include in a benchmark list, and then rolled out the equity for different potential moves for a random dice roll for each.

The benchmark calculation is the following: for each position+dice roll in the database, use your strategy to choose a move. Then find the strategy's choice in the list of rolled-out moves and add the equity difference vs the best rolled-out move to a sum. Then at the end divide that sum by the total number of elements to get the average equity error.

This has the advantage that it does not depend on dice rolls in a set of simulated games, so removes an element of randomness. Plus the list of positions is a "real world" set, generated from games played by both humans and bots on FIBS. The alternative would be to choose a list of positions from bot self play, which runs the risk of bias due to playing style of that bot.

It does depend on the accuracy of the rolled-out equities, which were calculated with GNUbg's 0-ply player. But those should be quite accurate in most cases, especially as a broad metric of goodness of play.

The GNUbg benchmarks are split into three databases: crashed, contact, and race. They are available via FTP. I have also mirrored them at dropbox: crashed, contact, and race. There are 100,000 position+move elements for crashed positions; 107,485 for contact; and 111,008 for race.

The files have quite a lot of information about the included positions, but the subset you need for the benchmark are lines that look like:

m OAHDPAABDAOAHDPAABDA 2 4 JIGHPAABDAOAHDPAABDA 0.133905 OAHDMJABDAOAHDPAABDA 0.0835662 OAHDOEABCBOAHDPAABDA 0.111429 OAHDOBABCEOAHDPAABDA 0.124954 OAHDPAABAJOAHDPAABDA 0.189375

Any line that starts with an 'm' is a move line, and those are the relevant ones for this calculation. The strings are board representations (more on that later). The first string is the representation of the starting board; the next two numbers are the dice rolls. So in the example above, the roll is a 2 and a 4. The next two elements represent the best move from the rollout: the string is the board representation and the number is the equity in that position. The rest of the line represents the other moves, ordered by equity. The string is the board representation and the number is the equity difference vs the best move. So the second element in the list of moves is the board represented by the string OAHDMJABDAOAHDPAABDA and its equity is 0.0835662 worse than the best move.

All equities in the benchmark files are cubeless.

Only the top five moves are included in the list, and it is possible that the strategy's choice is outside that list. If this happens, I assumed that the equity error for the strategy's choice is equal to the worst of the rolled-out equities. This makes the average error look a bit better than it really should, but hopefully this doesn't happen very often.

The string board representation is a way of serializing a board layout. It starts with the GNUbg position key and translates the resulting 10-byte number into a 20-character string so it can be represented as text. This is

*not*the usual GNUbg position ID - it uses a different serialization. The translation from that string to the board position is slightly complex but the code is in the open source GNUbg repository, in gnubg.c, function ParsePosition. (Thanks to Phillippe Michel and Michael Petch for pointing this out to me.)Other lines in the benchmark files show GNUbg system settings; details of the rollouts for positions (lines starting with '#R' show the rolled-out probability of win, probability of gammon win, probability of backgammon win, probability of gammon loss, and probability of backgammon loss); and benchmark cube decisions (lines starting with 'o'). Those are not required for the benchmark calculation described above.

## No comments:

## Post a Comment