Rock Paper Scissors Benchmark

Rank	Model	RPS Matches	Wins / Ties / Losses	Win Rate	ELO Rating
1	DeepSeek-R1-Distill-Llama-70B	55	28 / 26 / 1	50.9%	1,123
2	Claude 3.7 Sonnet Thinking (2025-02-19)	50	30 / 19 / 1	60.0%	1,117
3	o3-mini high (2025-01-31)	30	13 / 16 / 1	43.3%	1,103
4	o4-mini low (2025-04-16)	24	13 / 11 / 0	54.2%	1,096
5	o1-mini (2024-09-12)	77	27 / 41 / 9	35.1%	1,070
6	Claude Sonnet 4 (2025-05-14)	10	5 / 5 / 0	50.0%	1,054
7	Gemini 2.5 Pro Preview 05-06	21	11 / 10 / 0	52.4%	1,053
8	o4-mini medium (2025-04-16)	32	13 / 14 / 5	40.6%	1,041
9	Claude 3.5 Sonnet (2024-10-22)	64	12 / 44 / 8	18.8%	1,037
10	Claude Opus 4 Thinking (2025-05-14)	7	2 / 5 / 0	28.6%	1,034
11	o3 high (2025-04-16)	15	8 / 6 / 1	53.3%	1,031
12	Gemini 2.5 Flash Preview High 04-17	26	3 / 21 / 2	11.5%	1,031
13	Claude Sonnet 4 Thinking (2025-05-14)	11	6 / 5 / 0	54.5%	1,028
14	DeepSeek-R1-Distill-Qwen-32B	5	3 / 2 / 0	60.0%	1,027
15	DeepSeek R1	6	3 / 3 / 0	50.0%	1,026
16	o3-mini low (2025-01-31)	54	9 / 31 / 14	16.7%	1,024
17	Claude Opus 4 (2025-05-14)	10	2 / 7 / 1	20.0%	1,002
18	o4-mini high (2025-04-16)	20	7 / 12 / 1	35.0%	1,001
20	GPT-4o (2024-11-20)	91	21 / 54 / 16	23.1%	993
21	GPT-4.1 (2025-04-14)	18	3 / 10 / 5	16.7%	991
22	Claude 3.7 Sonnet (2025-02-19)	74	15 / 44 / 15	20.3%	990
23	Gemini Pro 1.5	29	4 / 8 / 17	13.8%	985
24	GPT-3.5 turbo (0125)	86	3 / 52 / 31	3.5%	965
25	Qwen-2.5-32B	45	7 / 25 / 13	15.6%	959
26	GPT-4.1 mini (2025-04-14)	25	4 / 10 / 11	16.0%	946
27	GPT-4.1 nano (2025-04-14)	20	2 / 4 / 14	10.0%	938
28	DeepSeek V3	32	6 / 6 / 20	18.8%	910
29	Llama 3.1 405B Instruct	38	6 / 4 / 28	15.8%	858
30	GPT-4o mini (2024-07-18)	77	12 / 33 / 32	15.6%	852
31	Llama 3.0 70B (8192)	90	20 / 18 / 52	22.2%	805


About Rock Paper Scissors Benchmark

How It Works

1. Strategic Competition: AI models compete against each other in Rock Paper Scissors with full visibility of previous moves.

2. Pattern Recognition: Models analyze complete match history to detect patterns and predict their opponent's next choice.

3. Adaptive Learning: As the match progresses, models can adjust strategies based on observed patterns and outcomes.

4. Statistical Analysis: We apply statistical methods to determine if wins are due to skill rather than random chance.

Rock crushes Scissors

Scissors cuts Paper

Paper covers Rock

What We Test

Strategic Thinking

Can AI models develop effective counter-strategies against opponents with detectable patterns?

Pattern Detection

How quickly and accurately can models identify and exploit patterns in opponent behavior?

Adaptability

Can models adjust their strategy when their own patterns are being exploited by opponents?

Performance Over Time

Do models improve their performance as they receive more context from previous rounds?

Match Details

Each match typically consists of 50-150 rounds. A uniformly random strategy wins about 33% of rounds (and ties and loses about 33% each), but models that successfully detect patterns can achieve significantly higher win rates.
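
The one-third baseline for random play can be checked by enumerating all nine move pairings. The following is a minimal sketch; the names `BEATS`, `outcome`, and `uniform_random_odds` are illustrative and are not taken from the benchmark's code.

```python
from fractions import Fraction
from itertools import product

MOVES = ("rock", "paper", "scissors")
# Each key beats its value (rock crushes scissors, and so on).
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def outcome(mine: str, theirs: str) -> str:
    """Return 'win', 'tie', or 'loss' from the first player's point of view."""
    if mine == theirs:
        return "tie"
    return "win" if BEATS[mine] == theirs else "loss"


def uniform_random_odds() -> dict:
    """Exact outcome probabilities when both players pick uniformly at random."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for mine, theirs in product(MOVES, repeat=2):
        counts[outcome(mine, theirs)] += 1
    total = sum(counts.values())  # 9 equally likely pairings
    return {name: Fraction(count, total) for name, count in counts.items()}


print(uniform_random_odds())
```

Each of the three outcomes comes out at exactly 1/3, so any sustained round win rate well above that level points to something other than chance.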

How AI Models Play

Complete Match Visibility

Models have full access to all previous rounds when making each decision. This gives them the opportunity to:

Analyze opponent patterns from previous moves
Adapt strategies based on the current score
Employ counter-strategies when opponents show predictable behavior

Strategic Depth: Models that effectively learn from game history consistently outperform random strategies.
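
To make "analyze opponent patterns" and "employ counter-strategies" concrete, here is one simple, hypothetical approach: look at what the opponent played after earlier occurrences of their most recent moves, then play the move that beats the most likely follow-up. This is only an illustrative sketch; the benchmark does not prescribe any strategy, and the names `predict_next`, `choose_move`, and `context_len` are assumptions.

```python
from collections import Counter

# Each key is beaten by its value.
COUNTER_MOVE = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def predict_next(opponent_moves, context_len=2):
    """Guess the opponent's next move from what followed the same recent context before."""
    if not opponent_moves:
        return None
    context = tuple(opponent_moves[-context_len:])
    followers = Counter()
    # Scan earlier positions where the same context appeared and record the next move.
    for i in range(len(opponent_moves) - len(context)):
        if tuple(opponent_moves[i:i + len(context)]) == context:
            followers[opponent_moves[i + len(context)]] += 1
    if followers:
        return followers.most_common(1)[0][0]
    # No repeated context yet: fall back to the opponent's most frequent move.
    return Counter(opponent_moves).most_common(1)[0][0]


def choose_move(opponent_moves):
    """Play the move that beats the predicted opponent move."""
    prediction = predict_next(opponent_moves)
    if prediction is None:
        return "rock"  # arbitrary opening move when there is no history
    return COUNTER_MOVE[prediction]


# An opponent stuck in a rock -> paper -> scissors cycle is predicted to play rock next.
print(choose_move(["rock", "paper", "scissors"] * 4))  # paper
```

A predictor like this is itself predictable, which is the adaptability question raised above: an opponent that notices the counter-strategy can exploit it in turn.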

Real-World Example Prompt for an AI Player

Game: Rock-Paper-Scissors
You are: player1
Current Score - Player1: 13, Player2: 7
Condensed History: 1rs1 2pr2 3sp1 4rs1 5pr2 6sp1 7rs1 8pr2 9sp1 10rs1 11pr2 12sp1 13rs1 14pr2 15sp1 16rs1 17pr2 18sp1 19rs1 20pr2
Interpretation: Each history token is of the form [round][P1 move][P2 move][result]. 'r' = rock, 'p' = paper, 's' = scissors; result '1' means Player1 wins, '2' means Player2 wins, 'T' means tie.
Legal moves: rock, paper, scissors
Please provide your move in JSON format (e.g., {"move":"rock"}).

Each AI model receives this information before making its next move, allowing for strategic analysis of previous rounds.

AI Model Response:

{"move":"rock"}
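
The condensed history format and the JSON reply described in the prompt can be handled with a few lines of parsing. The sketch below follows the token layout stated in the prompt ([round][P1 move][P2 move][result]); the function names and the dictionary layout are assumptions made for illustration, not the benchmark's actual implementation.

```python
import json
import re

MOVE_NAMES = {"r": "rock", "p": "paper", "s": "scissors"}
# Round number, P1 move, P2 move, then result: 1, 2, or T.
TOKEN_RE = re.compile(r"(\d+)([rps])([rps])([12T])")


def parse_history(condensed: str) -> list:
    """Turn a condensed history string such as '1rs1 2pr2' into a list of round records."""
    rounds = []
    for token in condensed.split():
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized history token: {token!r}")
        number, p1, p2, result = match.groups()
        rounds.append({
            "round": int(number),
            "p1": MOVE_NAMES[p1],
            "p2": MOVE_NAMES[p2],
            "result": result,
        })
    return rounds


def parse_move(response_text: str) -> str:
    """Extract and validate the move from a reply like {"move":"rock"}."""
    move = json.loads(response_text)["move"]
    if move not in MOVE_NAMES.values():
        raise ValueError(f"illegal move: {move!r}")
    return move


history = parse_history("1rs1 2pr2 3sp1")
p1_wins = sum(1 for r in history if r["result"] == "1")
print(len(history), p1_wins, parse_move('{"move":"rock"}'))
```

The parser reads the result character from each token as given rather than recomputing it from the two moves, so the tally matches whatever the prompt reports.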

Match Scoring & Ties

How We Pick The Winner

First bot to grab 50 wins usually wins the match. But if both bots are neck-and-neck we run a quick "is this just luck?" check. If the gap is tiny, we call it a tie so nobody brags without proof.

Statistical Tie: The scores differ, but the difference is so small that it could be pure coin-flip luck, not real skill.

We run a one-sided binomial z-test at the 95 % confidence level (α = 0.05). It answers one question: "Is the score gap big enough that luck is an unlikely explanation?"

Decisive rounds only. Ties don't help us judge skill.
If the gap is below the cut-off, we say "statistical tie."
If the gap beats the cut-off, we say the leader showed real skill.

Full implementation lives in the GitHub repo.

Hypotheses

H₀: winner win rate = 0.5 (no skill)
H₁: winner win rate > 0.5 (skill)

Statistical Model

n = number of decisive rounds (ties excluded)
X = winner's wins; under H₀, X ~ Binomial(n, 0.5)
z = (X / n − 0.5) / √(0.25 / n)

Decision rule (α = 0.05, one-sided)

z > 1.64 ⇒ we reject H₀ ⇒ declare skill
Otherwise ⇒ call it statistical tie
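
The decision rule can be written out directly from the formula for z. This is a minimal sketch rather than the code from the GitHub repo; the function names `z_score` and `verdict` are hypothetical, and only the 1.64 cut-off and the formula come from the text.

```python
import math

Z_CRITICAL = 1.64  # one-sided cut-off quoted in the decision rule


def z_score(winner_wins: int, loser_wins: int) -> float:
    """z = (X / n - 0.5) / sqrt(0.25 / n), with n counting decisive rounds only."""
    n = winner_wins + loser_wins
    if n == 0:
        raise ValueError("no decisive rounds to test")
    # Algebraically this equals (winner_wins - loser_wins) / sqrt(n).
    return (winner_wins / n - 0.5) / math.sqrt(0.25 / n)


def verdict(a_wins: int, b_wins: int) -> str:
    """Return 'skill' if the leader's margin rejects H0, else 'statistical tie'."""
    if a_wins + b_wins == 0:
        return "statistical tie"
    leader, trailer = max(a_wins, b_wins), min(a_wins, b_wins)
    return "skill" if z_score(leader, trailer) > Z_CRITICAL else "statistical tie"


print(verdict(32, 18))  # skill
print(verdict(28, 22))  # statistical tie
```

Tied rounds never enter the calculation: a 32-18 score is judged the same way whether the match had no ties or a hundred of them.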

How Big Is "Big Enough"?

Rough guide for what counts as a decisive win (ties don't count):

50 rounds: need about a 14-point lead
100 rounds: need about a 20-point lead
150 rounds: need about a 24-point lead

Bigger match → we demand a bigger gap before declaring a winner

Understanding ELO Ratings

ELO ratings provide a more sophisticated measure of performance than simple win rates:

1. Opponent strength matters: Beating strong models earns more points than beating weak ones
2. Statistical ties handled well: Close matches don't add noise to the rankings
3. Progressive adjustment: Rankings evolve as models play more matches

Example: Model A (higher ranked) vs. Model B (lower ranked)

Before Match: Model A's ELO is 1050, Model B's ELO is 950

Surprise! Model B wins

After Match: Model A's ELO is 1030 (-20), Model B's ELO is 970 (+20)

If a weaker model beats a stronger model, it gains more points than it would for beating an equal or weaker opponent!

ELO ratings increase more when you beat a stronger opponent and decrease more when you lose to a weaker one
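
The example above is consistent with the standard Elo update rule. The sketch below uses the textbook formula with a K-factor of 32; the page does not state which K-factor or tie handling the benchmark actually uses, so both are assumptions, as are the function names.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score (win probability) for the first player."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b).

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a draw
    (assumed here to be how a statistical tie would be scored).
    k = 32 is an assumed K-factor, not a documented benchmark setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Model A (1050) loses to Model B (950): the upset moves about 20 points.
new_a, new_b = update(1050, 950, score_a=0.0)
print(round(new_a), round(new_b))  # 1030 970
```

Had Model A won instead, the same call with `score_a=1.0` would move only about 12 points, because that result was already the expected one.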
