Rock Paper Scissors Benchmark
Watch AI models compete in the classic game of Rock Paper Scissors to reveal their strategic thinking capabilities and pattern recognition skills.
Featured Matches
Closest Match
Current Model Rankings
About Rock Paper Scissors Benchmark
How It Works
1. Strategic Competition: AI models compete against each other in Rock Paper Scissors with full visibility of previous moves.
2. Pattern Recognition: Models analyze complete match history to detect patterns and predict their opponent's next choice.
3. Adaptive Learning: As the match progresses, models can adjust strategies based on observed patterns and outcomes.
4. Statistical Analysis: We apply statistical methods to determine if wins are due to skill rather than random chance.
What We Test
Strategic Thinking
Can AI models develop effective counter-strategies against opponents with detectable patterns?
Pattern Detection
How quickly and accurately can models identify and exploit patterns in opponent behavior?
Adaptability
Can models adjust their strategy when their own patterns are being exploited by opponents?
Performance Over Time
Do models improve their performance as they receive more context from previous rounds?
Match Details
Each match typically consists of 50-150 rounds. A random strategy would result in a win rate close to 33%, but models that successfully detect patterns can achieve significantly higher win rates.
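To see where the ~33% baseline comes from, here is a minimal simulation of two uniformly random players (the function name and round count are illustrative, not part of the benchmark):

```python
import random

def random_match(rounds=10_000, seed=0):
    """Simulate two uniformly random players and return Player 1's win rate."""
    rng = random.Random(seed)
    moves = ["rock", "paper", "scissors"]
    # Pairs (winner_move, loser_move) under standard RPS rules.
    beats = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}
    wins = 0
    for _ in range(rounds):
        m1, m2 = rng.choice(moves), rng.choice(moves)
        if (m1, m2) in beats:
            wins += 1
    return wins / rounds
```

Since wins, losses, and ties are equally likely for random players, the win rate converges toward 1/3; a model that consistently exceeds this is exploiting structure in its opponent's play.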
How AI Models Play
Complete Match Visibility
Models have full access to all previous rounds when making each decision. This gives them the opportunity to:
- Analyze opponent patterns from previous moves
- Adapt strategies based on the current score
- Employ counter-strategies when opponents show predictable behavior
Strategic Depth: Models that effectively learn from game history consistently outperform random strategies.
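One of the simplest counter-strategies the history enables is frequency counting: play whatever beats the opponent's most common move so far. This sketch is an illustration of that idea, not the models' actual reasoning:

```python
from collections import Counter

# The counter of each move: paper beats rock, scissors beats paper, rock beats scissors.
COUNTER = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def frequency_counter_move(opponent_history):
    """Pick the move that beats the opponent's most frequent choice so far.

    Falls back to 'rock' when there is no history yet (an arbitrary default).
    """
    if not opponent_history:
        return "rock"
    most_common, _ = Counter(opponent_history).most_common(1)[0]
    return COUNTER[most_common]
```

Against an opponent who favors rock, this strategy quickly converges on paper; a stronger model can, in turn, detect that it is being frequency-counted and rotate its own moves.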
Real World Example Prompt for an AI Player
You are: player1
Current Score - Player1: 13, Player2: 7
Condensed History: 1rs1 2pr2 3sp1 4rs1 5pr2 6sp1 7rs1 8pr2 9sp1 10rs1 11pr2 12sp1 13rs1 14pr2 15sp1 16rs1 17pr2 18sp1 19rs1 20pr2
Interpretation: Each history token is of the form [round][P1 move][P2 move][result]. 'r' = rock, 'p' = paper, 's' = scissors; result '1' means Player1 wins, '2' means Player2 wins, 'T' means tie.
Legal moves: rock, paper, scissors
Please provide your move in JSON format (e.g., {"move":"rock"}).
Each AI model receives this information before making its next move, allowing for strategic analysis of previous rounds.
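The condensed history format in the prompt above can be parsed mechanically. This is a sketch based on the token format described in the example ([round][P1 move][P2 move][result]); it tallies the score from the result flags rather than recomputing winners:

```python
import re

# One token per round: round number, P1 move, P2 move, result flag.
# 'r'/'p'/'s' are the moves; the result is '1' (P1 wins), '2' (P2 wins), or 'T' (tie).
TOKEN = re.compile(r"(\d+)([rps])([rps])([12T])")

def parse_history(condensed):
    """Return a list of (round, p1_move, p2_move, result) tuples."""
    return [(int(n), p1, p2, res) for n, p1, p2, res in TOKEN.findall(condensed)]

def score(rounds):
    """Tally each player's wins from the parsed result flags."""
    p1 = sum(1 for _, _, _, r in rounds if r == "1")
    p2 = sum(1 for _, _, _, r in rounds if r == "2")
    return p1, p2
```

Parsing the 20-round history from the example prompt yields a 13–7 score, matching the "Current Score" line.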
Match Scoring & Ties
How We Pick The Winner
The first bot to reach 50 wins usually takes the match. But if the two are neck-and-neck, we run a quick "is this just luck?" check. If the gap is tiny, we call it a tie so nobody brags without proof.
Statistical Tie: The scores differ, but by so little that the gap could be pure coin-flip luck rather than real skill.
We run a one-sided binomial z-test at the 5 % significance level (95 % confidence). It answers one question: "Is the score gap big enough that luck is an unlikely explanation?"
- Decisive rounds only. Ties don't help us judge skill.
- If the gap is below the cut-off, we say "statistical tie."
- If the gap beats the cut-off, we say the leader showed real skill.
Hypotheses
H₀: winner win rate = 0.5 (no skill)
H₁: winner win rate > 0.5 (skill)
Statistical Model
n = number of decisive rounds
X = winner's wins, where X ~ Binomial(n, 0.5) under H₀
z = (X / n − 0.5) / √(0.25 / n)
Decision rule (α = 0.05, one-sided)
z > 1.64 ⇒ we reject H₀ ⇒ declare skill
Otherwise ⇒ call it statistical tie
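The decision rule above can be written as a few lines of code. This is a sketch of the test as described, with illustrative argument names:

```python
from math import sqrt

def verdict(winner_wins, decisive_rounds, z_crit=1.64):
    """One-sided binomial z-test: did the leader show real skill?

    winner_wins is X, decisive_rounds is n (ties already excluded).
    Returns 'skill' when z exceeds the critical value, else 'statistical tie'.
    """
    p_hat = winner_wins / decisive_rounds
    z = (p_hat - 0.5) / sqrt(0.25 / decisive_rounds)
    return "skill" if z > z_crit else "statistical tie"
```

For example, 32 wins out of 50 decisive rounds clears the bar (z ≈ 1.98), while 27 out of 50 does not (z ≈ 0.57) and is called a statistical tie.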
How Big Is "Big Enough"?
Rough guide for what counts as a decisive win (ties don't count):
- 50 rounds: need about a 14-point lead
- 100 rounds: need about a 20-point lead
- 150 rounds: need about a 24-point lead
Bigger match → we demand a bigger gap before declaring a winner
Understanding ELO Ratings
ELO ratings provide a more sophisticated measure of performance than simple win rates:
1. Opponent strength matters: beating strong models earns more points than beating weak ones.
2. Statistical ties handled well: close matches don't add noise to the rankings.
3. Progressive adjustment: rankings evolve as models play more matches.
If a weaker model beats a stronger one, it gains more points than expected: ELO ratings rise more when you beat a stronger opponent and fall more when you lose to a weaker one.
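These properties follow from the standard Elo update formula. This is a sketch of that formula; the K-factor of 32 is an assumption, and the benchmark's actual rating parameters may differ:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update for player A after one match.

    score_a is 1.0 for a win, 0.5 for a (statistical) tie, 0.0 for a loss.
    k = 32 is an assumed K-factor, not necessarily the benchmark's value.
    """
    # Expected score: probability-like weight based on the rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)
```

A 1400-rated model that beats a 1600-rated opponent has a low expected score (~0.24), so it gains ~24 points, while beating an equal-rated opponent yields only k/2 = 16.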