Each LLM is given the same 1000 chess puzzles to solve; see puzzles.csv. Benchmarked on Mar 25, 2024.
| Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
|---|---|---|---|---|---|
| gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
| gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
| claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
| claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
| claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
| gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
| claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
| mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
| mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
| gemini-1.5-pro-latest* | FAIL | - | - | - | - |
Published by the CEO of Kagi!
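Each percentage is just the raw count over the 1000 puzzles (e.g. 229/1000 = 22.9% solved). For anyone curious what the tallying side of such a harness looks like, here is a minimal sketch; the puzzles.csv column names (`fen`, `best_move`) and the `query_model` stub are my assumptions, not the article's actual code, and legality checking leans on the python-chess library:

```python
import csv

import chess  # python-chess, used only to check move legality


def query_model(model_name: str, fen: str) -> str:
    """Hypothetical stand-in for the API call that asks a model
    for a single move (in SAN) given the puzzle position."""
    raise NotImplementedError


def score_model(model_name: str, puzzle_file: str = "puzzles.csv") -> dict:
    solved = illegal = total = 0
    with open(puzzle_file, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: fen, best_move
            total += 1
            board = chess.Board(row["fen"])
            reply = query_model(model_name, row["fen"])
            try:
                # parse_san raises a ValueError subclass on illegal or garbled moves
                move = board.parse_san(reply.strip())
            except ValueError:
                illegal += 1
                continue
            if board.san(move) == row["best_move"]:
                solved += 1
    return {
        "solved": solved,
        "solved_pct": 100 * solved / total,
        "illegal": illegal,
        "illegal_pct": 100 * illegal / total,
    }
```

The Adjusted Elo column is the article's own derived metric; the solved/illegal counts alone aren't enough to reproduce it.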
Likely close to 100%. If you read the (rather good) article, a little further down they test whether the LLM can play an extremely simplistic "Connect 4" game they devise, as a way of isolating reasoning ability specifically (sketched below).
It cannot.
Chess puzzles, in particular, are frequently shared and discussed in online chess spaces, so the LLM has a significant amount of material to draw on when it predicts a response to the prompt.
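To give a sense of how little machinery that Connect 4 test needs, here is a sketch of a referee in the same spirit, tracking illegal moves the same way the chess benchmark does. The board dimensions, the move format (a bare column number), and all function names are my assumptions, not the article's actual harness:

```python
ROWS, COLS = 6, 7  # standard Connect 4 dimensions; the article's variant may differ


def new_board():
    return [["." for _ in range(COLS)] for _ in range(ROWS)]


def drop(board, col, piece):
    """Drop a piece into a column; return the landing row, or None if invalid/full."""
    if not 0 <= col < COLS:
        return None
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = piece
            return row
    return None


def wins(board, piece):
    """Check every cell and direction for four in a row."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                if all(
                    0 <= r + i * dr < ROWS
                    and 0 <= c + i * dc < COLS
                    and board[r + i * dr][c + i * dc] == piece
                    for i in range(4)
                ):
                    return True
    return False


def referee_move(board, reply: str, piece: str) -> str:
    """Classify one model reply: 'illegal' if unparseable, out of range, or the
    column is full; 'win' if it completes four in a row; otherwise 'ok'."""
    try:
        col = int(reply.strip())
    except ValueError:
        return "illegal"
    if drop(board, col, piece) is None:
        return "illegal"
    return "win" if wins(board, piece) else "ok"
```

Unlike chess puzzles, a made-up game like this is unlikely to have a large corpus of transcribed positions in the training data, which is what makes it the cleaner probe of reasoning.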