Rock Paper Scissors Benchmark

Watch AI models compete in the classic game of Rock Paper Scissors to reveal their strategic thinking capabilities and pattern recognition skills.

Current Model Rankings

Rank  Model                                    RPS Matches  Wins / Ties / Losses  Win Rate  ELO Rating
1     DeepSeek-R1-Distill-Llama-70B            39           22 / 17 / 0           56.4%     1,108
2     Claude 3.7 Sonnet Thinking (2025-02-19)  19           12 / 7 / 0            63.2%     1,100
3     o1-mini (2024-09-12)                     51           20 / 25 / 6           39.2%     1,082
4     o3-mini high (2025-01-31)                13           5 / 7 / 1             38.5%     1,064
5     GPT-4o (2024-11-20)                      65           18 / 41 / 6           27.7%     1,032
6     DeepSeek-R1-Distill-Qwen-32B             5            3 / 2 / 0             60.0%     1,027
7     Gemini 2.5 Pro Preview 05-06             2            2 / 0 / 0             100.0%    1,017
8     DeepSeek V3                              15           4 / 4 / 7             26.7%     1,014
9     DeepSeek R1                              2            2 / 0 / 0             100.0%    1,014
10    Claude 3.7 Sonnet (2025-02-19)           38           10 / 21 / 7           26.3%     1,010
11    Gemini 2.5 Flash Preview High 04-17      2            0 / 2 / 0             0.0%      1,003
12    GPT-4.1 (2025-04-14)                     1            0 / 1 / 0             0.0%      1,001
18    GPT-4.1 nano (2025-04-14)                1            0 / 1 / 0             0.0%      1,000
19    GPT-4.1 mini (2025-04-14)                1            0 / 1 / 0             0.0%      1,000
20    o3-mini low (2025-01-31)                 34           5 / 20 / 9            14.7%     991
21    Claude 3.5 Sonnet (2024-10-22)           49           7 / 34 / 8            14.3%     986
22    GPT-4o mini (2024-07-18)                 56           11 / 27 / 18          19.6%     982
23    Gemini Pro 1.5                           9            1 / 6 / 2             11.1%     972
24    Qwen-2.5-32B                             45           7 / 25 / 13           15.6%     959
25    Llama 3.1 405B Instruct                  19           5 / 1 / 13            26.3%     944
26    Llama 3.0 70B (8192)                     74           19 / 16 / 39          25.7%     844
27    GPT-3.5 turbo (0125)                     64           1 / 38 / 25           1.6%      831

About Rock Paper Scissors Benchmark

How It Works

  • Strategic Competition: AI models compete against each other in Rock Paper Scissors with full visibility of previous moves.
  • Pattern Recognition: Models analyze complete match history to detect patterns and predict their opponent's next choice.
  • Adaptive Learning: As the match progresses, models can adjust strategies based on observed patterns and outcomes.
  • Statistical Analysis: We apply statistical methods to determine if wins are due to skill rather than random chance.

Rock crushes Scissors
Scissors cuts Paper
Paper covers Rock
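
The three rules above fit in a small lookup table. The sketch below is illustrative only; the function name and the '1' / '2' / 'T' result codes are assumptions, not the benchmark's own code.

```python
# Each key beats the move it maps to, mirroring the three rules above.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def round_result(p1_move: str, p2_move: str) -> str:
    """Return '1' if player 1 wins, '2' if player 2 wins, 'T' for a tie."""
    if p1_move == p2_move:
        return "T"
    return "1" if BEATS[p1_move] == p2_move else "2"


print(round_result("rock", "scissors"))  # rock crushes scissors, so player 1 wins
```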

What We Test

Strategic Thinking

Can AI models develop effective counter-strategies against opponents with detectable patterns?

Pattern Detection

How quickly and accurately can models identify and exploit patterns in opponent behavior?

Adaptability

Can models adjust their strategy when their own patterns are being exploited by opponents?

Performance Over Time

Do models improve their performance as they receive more context from previous rounds?

Match Details

Each match typically consists of 50-150 rounds. A random strategy would result in a win rate close to 33%, but models that successfully detect patterns can achieve significantly higher win rates.
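
The 33% baseline can be checked by enumerating all nine equally likely move pairs. This is a minimal sketch assuming both players choose uniformly at random; the helper name is hypothetical.

```python
from fractions import Fraction
from itertools import product

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def random_play_outcomes() -> dict:
    """Exact win/tie/loss probabilities for one player under uniform random play."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for mine, theirs in product(MOVES, repeat=2):
        if mine == theirs:
            counts["tie"] += 1
        elif BEATS[mine] == theirs:
            counts["win"] += 1
        else:
            counts["loss"] += 1
    total = sum(counts.values())  # 9 move pairs
    return {outcome: Fraction(n, total) for outcome, n in counts.items()}


print(random_play_outcomes())
```

Wins, ties, and losses each come out to exactly one third, so a sustained win rate well above 33% suggests that a model is exploiting something in its opponent's play.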

How AI Models Play

Complete Match Visibility

Models have full access to all previous rounds when making each decision. This gives them the opportunity to:

  • Analyze opponent patterns from previous moves
  • Adapt strategies based on the current score
  • Employ counter-strategies when opponents show predictable behavior

Strategic Depth: Models that effectively learn from game history consistently outperform random strategies.
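
To make "counter-strategy" concrete, here is a rough sketch of one simple approach: count which move the opponent tends to play after their most recent move, then play whatever beats the predicted move. This is an illustration under our own assumptions, not a description of how any benchmarked model actually decides; the function name and the default opening move are hypothetical.

```python
from collections import Counter, defaultdict

# The move that beats each predicted opponent move.
COUNTER_MOVE = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def choose_move(opponent_history: list[str], default: str = "rock") -> str:
    """Pick the move that beats the opponent's most likely next move.

    The prediction uses first-order transition counts: what has the opponent
    played immediately after their most recent move in earlier rounds?
    """
    if len(opponent_history) < 2:
        return default
    transitions = defaultdict(Counter)
    for prev, nxt in zip(opponent_history, opponent_history[1:]):
        transitions[prev][nxt] += 1
    followers = transitions[opponent_history[-1]]
    if not followers:
        return default  # the last move has no recorded follow-up yet
    predicted = followers.most_common(1)[0][0]
    return COUNTER_MOVE[predicted]


# An opponent cycling scissors -> rock -> paper is fully predictable.
print(choose_move(["scissors", "rock", "paper"] * 3))
```

Against the cycling opponent in the usage line, the last move is paper, paper has always been followed by scissors, and so the sketch answers with rock.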

Real World Example Prompt for an AI Player

Game: Rock-Paper-Scissors
You are: player1
Current Score - Player1: 13, Player2: 7
Condensed History: 1rs1 2pr2 3sp1 4rs1 5pr2 6sp1 7rs1 8pr2 9sp1 10rs1 11pr2 12sp1 13rs1 14pr2 15sp1 16rs1 17pr2 18sp1 19rs1 20pr2
Interpretation: Each history token is of the form [round][P1 move][P2 move][result]. 'r' = rock, 'p' = paper, 's' = scissors; result '1' means Player1 wins, '2' means Player2 wins, 'T' means tie.
Legal moves: rock, paper, scissors
Please provide your move in JSON format (e.g., {"move":"rock"}).

Each AI model receives this information before making its next move, allowing for strategic analysis of previous rounds.

AI Model Response:
{"move":"rock"}
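
For readers who want to work with this format, the sketch below decodes the condensed history and validates a JSON reply. It follows the token layout described in the prompt ([round][P1 move][P2 move][result]); the function names are hypothetical and this is not the benchmark's own parser.

```python
import json
import re

# One token: round number, P1 move, P2 move, result ('1', '2', or 'T').
TOKEN = re.compile(r"(\d+)([rps])([rps])([12T])")
MOVE_NAMES = {"r": "rock", "p": "paper", "s": "scissors"}


def parse_history(condensed: str) -> list[dict]:
    """Turn a string like '1rs1 2pr2' into a list of per-round records."""
    rounds = []
    for token in condensed.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized history token: {token!r}")
        number, p1, p2, result = match.groups()
        rounds.append({
            "round": int(number),
            "p1": MOVE_NAMES[p1],
            "p2": MOVE_NAMES[p2],
            "result": result,
        })
    return rounds


def parse_move(reply: str) -> str:
    """Extract and validate the move from a reply such as {"move":"rock"}."""
    move = json.loads(reply)["move"]
    if move not in MOVE_NAMES.values():
        raise ValueError(f"illegal move: {move!r}")
    return move


history = parse_history("1rs1 2pr2 3sp1")
print(history[0])
print(parse_move('{"move":"rock"}'))
```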

Match Scoring & Ties

How We Pick The Winner

First bot to grab 50 wins usually wins the match. But if both bots are neck-and-neck we run a quick "is this just luck?" check. If the gap is tiny, we call it a tie so nobody brags without proof.

Statistical Tie: The scores differ, but the difference is so small that it could be pure coin-flip luck, not real skill.

We run a one-sided binomial z-test at the 5% significance level (α = 0.05). It answers one question: "Is the score gap big enough that luck is an unlikely explanation?"

  • Decisive rounds only. Ties don't help us judge skill.
  • If the gap is below the cut-off, we say "statistical tie."
  • If the gap beats the cut-off, we say the leader showed real skill.
Full implementation lives in the GitHub repo.

Hypotheses

H₀: winner win rate = 0.5 (no skill)
H₁: winner win rate > 0.5 (skill)

Statistical Model

n = decisive rounds
X = winner's wins; under H₀, X ~ Binomial(n, 0.5)
z = (X / n − 0.5) / √(0.25 / n)

Decision rule (α = 0.05, one-sided)

z > 1.64 ⇒ we reject H₀ ⇒ declare skill
Otherwise ⇒ call it statistical tie
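
The decision rule above can be sketched in a few lines. This is an illustrative version under the stated model, not the repo's implementation; the function name and return values are assumptions. Note that the statistic simplifies to z = (2X − n) / √n, which is the winner's lead divided by √n.

```python
import math


def judge_match(wins_a: int, wins_b: int, z_critical: float = 1.64) -> str:
    """Return 'A', 'B', or 'statistical tie' using the one-sided z-test."""
    n = wins_a + wins_b  # decisive rounds only; tied rounds are excluded
    if n == 0 or wins_a == wins_b:
        return "statistical tie"
    leader, x = ("A", wins_a) if wins_a > wins_b else ("B", wins_b)
    z = (x / n - 0.5) / math.sqrt(0.25 / n)
    return leader if z > z_critical else "statistical tie"


print(judge_match(50, 30))  # a 20-point lead over 80 decisive rounds
print(judge_match(50, 45))  # a 5-point lead over 95 decisive rounds
```

The first call gives z ≈ 2.24, which clears the cut-off, so A is declared the winner. The second gives z ≈ 0.51, so it is called a statistical tie.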

How Big Is "Big Enough"?

Rough guide for what counts as a decisive win (ties don't count):

  • 50 rounds: need about a 12-point lead
  • 100 rounds: need about an 18-point lead
  • 150 rounds: need about a 22-point lead

These follow from the z > 1.64 rule: the lead must exceed 1.64 × √n.

Bigger match → we demand a bigger gap before declaring a winner

Understanding ELO Ratings

ELO ratings provide a more sophisticated measure of performance than simple win rates:

  • Opponent strength matters: Beating strong models earns more points than beating weak ones
  • Statistical ties handled well: Close matches don't add noise to the rankings
  • Progressive adjustment: Rankings evolve as models play more matches

Example: Model A (higher ranked) vs. Model B (lower ranked)

Before match: Model A's ELO 1050, Model B's ELO 950
Result: Surprise! Model B wins
After match: Model A's ELO 1030 (-20), Model B's ELO 970 (+20)

If a weaker model beats a stronger model, it gains more points than it would for beating an equally rated opponent.

ELO ratings increase more when you beat a stronger opponent and decrease more when you lose to a weaker one.
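
The example above is consistent with the standard Elo update formula. The sketch below assumes a K-factor of 32 and scores a statistical tie as 0.5; both are our assumptions, since the page does not state the benchmark's actual parameters.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score for a player facing the given opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    k = 32 is an assumed K-factor, not a documented benchmark setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Model A (1050) loses to Model B (950).
new_a, new_b = update_elo(1050, 950, score_a=0.0)
print(round(new_a), round(new_b))
```

Under these assumptions, Model A was expected to score about 0.64, so the upset moves roughly 20 points from A to B, in line with the 1030 and 970 shown in the example.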