When AIs Play Poker, Volatility Beats Elegance

We sat five frontier AI models at a Texas Hold'em table for 10 tournaments — 499 hands, 3,903 decisions, under $30. Here's what we learned about how machines gamble, bluff, and break.

Games: 10 · Hands: 499 · Decisions: 3,903 · Total cost: ~$24 in visible token spend†


The Setup

Five AI models. One poker table. No memory between hands — every decision made from scratch, reading only the current board, their hole cards, and the betting history in front of them.
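
Statelessness here means each decision is a fresh API call carrying only the observable game state. The article doesn't publish its actual prompt, so the sketch below is purely illustrative; every field name and value is our invention.

    # One stateless decision request: nothing persists between hands.
    # All field names and example values are hypothetical.
    decision_context = {
        "hole_cards": ["Ah", "3d"],
        "board": ["Kc", "7s", "2h"],   # empty list preflop
        "pot": 45_000,
        "to_call": 20_000,
        "stacks": {"seat_1": 960_000, "seat_2": 1_015_000},
        "betting_history": ["seat_2 raises to 20,000", "seat_3 folds"],
    }
    # The model replies with an action plus free-text reasoning; that
    # reasoning is what the verbosity numbers later in the piece measure.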

Each tournament: 50 hands of 5-player Texas Hold'em. Starting stacks of 1,000,000 chips. Blinds escalating every 10 hands from 5K/10K up to 100K/200K. We ran it 10 times through OpenRouter and tracked everything — every raise, every fold, every bluff, every all-in, every word of reasoning each model produced to justify its decisions.
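
For concreteness, here is the structure as a minimal config sketch. The starting stack, hand count, escalation interval, and the first and last blind levels come from the run itself; the three intermediate levels are our assumption, since only the endpoints are stated.

    # Tournament parameters for one 50-hand game.
    STARTING_STACK = 1_000_000
    HANDS_PER_LEVEL = 10

    # (small blind, big blind), one level per 10 hands.
    # Middle levels are assumed; only the first and last are documented.
    BLIND_LEVELS = [
        (5_000, 10_000),
        (10_000, 20_000),    # assumed
        (25_000, 50_000),    # assumed
        (50_000, 100_000),   # assumed
        (100_000, 200_000),
    ]

    def blinds_for_hand(hand_number: int) -> tuple[int, int]:
        """Return (small_blind, big_blind) for a 1-indexed hand number."""
        level = min((hand_number - 1) // HANDS_PER_LEVEL, len(BLIND_LEVELS) - 1)
        return BLIND_LEVELS[level]

    assert blinds_for_hand(1) == (5_000, 10_000)
    assert blinds_for_hand(50) == (100_000, 200_000)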

The players:

Seat | Model           | API Cost (per 1M tokens)
1    | Claude Opus 4.6 | $5.00 in / $25.00 out
2    | GPT-5.4         | $2.50 in / $15.00 out
3    | Gemini 3.1 Pro  | $2.00 in / $12.00 out
4    | Grok 4.1 Fast   | $0.20 in / $0.50 out
5    | DeepSeek V3.2   | $0.25 in / $0.40 out

The entire experiment cost less than a decent lunch.


The Leaderboard Nobody Expected

If you'd asked us before the experiment which model would dominate a poker tournament, we would have said Claude Opus — the most expensive, the most sophisticated reasoner, the model that takes longest to think. We would have been wrong.

Model           | Wins | Top 2 | Busted | Avg Finish | Total Profit | 10-Game Cost†
Grok 4.1 Fast   | 5    | 5     | 5      | 2.5        | +7,449,500   | ~$1.10
Gemini 3.1 Pro  | 2    | 5     | 5      | 2.7        | +4,280,500   | ~$3.00
Claude Opus 4.6 | 2    | 6     | 6      | 2.3        | -1,300,000   | ~$13.90
GPT-5.4         | 1    | 3     | 7      | 3.0        | -2,350,000   | ~$5.50
DeepSeek V3.2   | 0    | 1     | 8      | 4.5        | -8,080,000   | ~$0.17

†Costs based on visible prompt/completion token counts at listed OpenRouter rates. Actual spend was slightly higher due to reasoning tokens and post-game reflection calls not captured in the per-game token tables.
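
The arithmetic behind those cost estimates is simple: token counts times the listed per-million rates. A sketch, using Grok's completion-token total from the verbosity table further down (per-model input-token counts aren't published, so only the output side is computed here):

    def visible_cost(input_tokens: int, output_tokens: int,
                     in_rate: float, out_rate: float) -> float:
        """Dollar cost at per-1M-token rates, visible tokens only."""
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # Grok's output side alone: 1,814,716 completion tokens at $0.50/1M.
    print(f"${visible_cost(0, 1_814_716, 0.20, 0.50):.2f}")  # -> $0.91
    # Unpublished input tokens account for the rest of the ~$1.10 total.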

Grok 4.1 Fast — xAI's model, costing $0.11 per tournament — won half the games and accumulated 7.4 million chips in profit. The model that costs $0.20 per million input tokens beat the model that costs $5.00.

Grok won 5 out of 10 games. It also busted 5 out of 10. It never finished 2nd or 3rd. This is the defining pattern: Grok plays to win or die trying.

The Finish Map

Every model's finish in every game. The pattern reveals personality.

Model    | G1  | G2  | G3  | G4  | G5  | G6  | G7  | G8  | G9  | G10
Opus     | 🥇1 | 🥇1 | 🥈2 | 🥉3 | 🥉3 | 🥈2 | 4   | 🥈2 | 🥉3 | 🥈2
GPT-5.4  | 🥈2 | 🥉3 | 🥉3 | 4   | 🥈2 | 🥉3 | 5   | 🥇1 | 4   | 🥉3
Gemini   | 🥉3 | 🥈2 | 4   | 🥈2 | 🥇1 | 4   | 🥈2 | 🥉3 | 5   | 🥇1
Grok     | 4   | 4   | 🥇1 | 🥇1 | 4   | 🥇1 | 🥇1 | 4   | 🥇1 | 4
DeepSeek | 5   | 5   | 5   | 5   | 5   | 5   | 🥉3 | 5   | 🥈2 | 5

Look at Grok's row: 4, 4, 1, 1, 4, 1, 1, 4, 1, 4. It's binary. Win or bust. There is no "grinding out a third-place finish" in Grok's vocabulary. Its standard deviation of 1.6 is the highest of any player — the most volatile model at the table.

Now look at Opus: 1, 1, 2, 3, 3, 2, 4, 2, 3, 2. Standard deviation of 0.9 — the most consistent model. Opus almost always finishes in the top 3 but rarely wins outright after the early games. It's the model that shows up, plays solid poker, and then loses to a maniac on the final hand.
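
Both volatility figures can be reproduced straight from the finish map; note that the quoted values round from slightly different conventions (Grok's 1.6 matches the sample standard deviation, Opus's 0.9 the population one).

    import statistics

    grok = [4, 4, 1, 1, 4, 1, 1, 4, 1, 4]   # Grok's finishes, G1..G10
    opus = [1, 1, 2, 3, 3, 2, 4, 2, 3, 2]   # Opus's finishes, G1..G10

    print(statistics.stdev(grok))    # 1.58 (sample); pstdev gives 1.50
    print(statistics.pstdev(opus))   # 0.90 (population); stdev gives 0.95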

DeepSeek's row is devastating: eight 5th-place finishes out of ten. The cheapest model doesn't just lose — it loses early and reliably.


Risk Profiles: How Each Model Gambles

🃏 The Gambler: Grok 4.1 Fast

31 all-ins across 10 games. 87% survival rate. 23 of those all-ins were preflop — Grok doesn't wait to see the board. It reads its hole cards, sizes up the table, and shoves. It plays poker the way a fighter jet approaches a dogfight: commit fully and fast, or don't engage at all.

The reasoning text in Grok's decisions averaged 1,159 tokens per decision — more than double what GPT-5.4 produced. It writes novels to justify each decision, then acts on a simple commit-or-fold rule. The thinking and the doing are disconnected.

20 bluffs attempted, 19 successful — a 95% bluff success rate. Grok bluffs rarely, but when it does, it gets away with it. Worth noting: since there's no memory between hands, this isn't really about "table image" — it's more likely that Grok's bluff sizing and timing just happen to hit the right pressure points.

The pattern: In this format, Grok treated poker like a binary classification problem — commit or fold — and the simplicity of that frame produced volatile but tournament-optimal results. Whether this reflects something deep about the model or just how it responded to this specific prompt setup, we can't say. But the behaviour was consistent across all 10 games.

🧠 The Analyst: Claude Opus 4.6

Opus had the fewest all-ins of any model: 19 across 10 games. It was also choosier about when to commit: only 14 of its 19 all-ins came preflop, a lower share than DeepSeek's 29 of 32 or Gemini's 27 of 30. When Opus puts its stack in, it has usually seen at least part of the board first.

Where Opus truly separated itself was bluffing. 55 bluff attempts — more than any two other models combined. 37 succeeded, a 67% success rate. Opus understands that poker is about representation. It bets with weak hands from the button, fires continuation bets into dry boards, and adjusts its sizing to tell a story. It plays poker like it was designed for it.

The problem: Opus plays for survival, not for victory. An average finish of 2.3 — the best of any model — but a total profit of -1.3 million means it's the world's best poker player at not losing. But the tournament format rewards winning, and Grok's "win or die" approach extracts more total value.

The pattern: Opus is the most strategically literate model at the table, producing reasoning that reads like a professional poker coach's analysis. But in this format — short, escalating, first-place-heavy — its risk aversion was a liability. The smartest analysis in the room doesn't help if you're slowly bleeding chips while someone else is kicking the doors down.

⚡ The Opportunist: GPT-5.4

GPT-5.4 is the fastest decision-maker by a wide margin — averaging 8.9 seconds per decision, less than half of what Opus takes. It doesn't overthink. It reads, acts, and moves on.

30 bluffs attempted, 20 successful (67% — identical to Opus). But only 4 went to showdown, compared to 15 for Opus. GPT-5.4's bluffs are clean kills: it gets the fold and moves on. Opus's bluffs sometimes escalate into contested pots. GPT-5.4 knows when to stop applying pressure.

One win in 10 games undersells its ability. When GPT-5.4 wins, it wins decisively — it took down 5,000,000 chips in Game 8 (a clean sweep, eliminating all four opponents). It's not flashy, but it's lethal when it connects.

The pattern: GPT-5.4 plays poker the way it does everything — fast, clean, and without wasted effort. Its speed advantage is real and consistent. But speed alone didn't translate into dominance over 10 tournaments, suggesting that in high-variance formats, decisiveness needs to be paired with aggression to convert consistently.

🎯 The Honest Player: Gemini 3.1 Pro

Only 5 bluffs across 10 games — tied with DeepSeek for the fewest of any model. Gemini plays its cards, not its opponents. It's the most straightforward player at the table: when it bets big, it has a big hand. When it checks, it's genuinely weak.

This transparent style should be exploitable in theory, but Gemini compensates with patient accumulation and devastating timing. Both its victories (Game 5: 4,320,000 chips; Game 10: 5,000,000 chips) were wire-to-wire dominations where it ground opponents down through superior hand selection and value extraction.

Two wins, 5 top-2 finishes, +4.28 million total profit. At roughly $3 for 10 games, Gemini might be the best value proposition in the experiment — competitive results at a fraction of Opus's cost.

The pattern: Gemini's low-bluff, high-value style made it the quiet achiever of the tournament. It doesn't try to manipulate opponents — it just plays good cards well and lets the aggressive models eliminate each other. In a format that rewards first place, this shouldn't work as well as it did. But Gemini found a way.

💀 The Casualty: DeepSeek V3.2

8 bust-outs in 10 games. Average finish: 4.5. Total profit: -8,080,000. DeepSeek is the clear loser of this experiment.

29 of its 32 all-ins were preflop — it shoves early and often, without the table awareness to pick the right spots. It plays a one-dimensional push/fold game that more sophisticated models exploit easily. The two games where it finished 3rd and 2nd (Games 7 and 9) hint at a capable player that occasionally surfaces, but the baseline is consistent elimination.

Only 5 bluff attempts across 10 games, with 4 succeeding. It barely tries to deceive. When combined with the preflop-only all-in pattern, the picture is clear: DeepSeek doesn't really play poker. It plays a simplified lottery — evaluate hand strength, decide to shove or fold, repeat.

The pattern: DeepSeek played the simplest version of poker at the table, and it showed. Whether this reflects a fundamental capability gap or just how the model interpreted this particular prompt setup, the result was the same across all 10 games: one-dimensional play that better models punished consistently.


The Bluffing Hierarchy

This is the data that surprised us most. Bluffing — the most human element of poker — showed the sharpest divergence between models.

Model           | Bluffs | Succeeded | Rate | To Showdown | Style
Claude Opus 4.6 | 55     | 37        | 67%  | 15          | Relentless
GPT-5.4         | 30     | 20        | 67%  | 4           | Surgical
Grok 4.1 Fast   | 20     | 19        | 95%  | 1           | Rare but devastating
Gemini 3.1 Pro  | 5      | 4         | 80%  | 4           | Honest
DeepSeek V3.2   | 5      | 4         | 80%  | 1           | Barely tries
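
The article never defines how a bluff was tagged, so the heuristic below is a plausible reconstruction rather than the experiment's actual method: flag an aggressive action taken with weak estimated equity, count it as succeeded when every opponent folds, and as gone to showdown when the hand ends face-up. The names and the equity threshold are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class HandAction:
        action: str        # "bet", "raise", "call", "fold", "check"
        equity: float      # estimated win probability vs. range, 0..1
        all_folded: bool   # did every opponent fold to this action?
        showdown: bool     # did the hand ultimately reach showdown?

    BLUFF_EQUITY_CEILING = 0.35  # hypothetical threshold

    def classify_bluff(a: HandAction) -> str | None:
        """Tag aggressive low-equity actions; None if not a bluff."""
        if a.action not in ("bet", "raise") or a.equity >= BLUFF_EQUITY_CEILING:
            return None
        if a.all_folded:
            return "succeeded"           # took the pot without a fight
        return "showdown" if a.showdown else "failed"

    assert classify_bluff(HandAction("raise", 0.22, True, False)) == "succeeded"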

Grok's bluff success rate is absurd: 95%. It almost never bluffs — but when it does, it gets away with it 19 times out of 20. Only one of its bluffs ever went to showdown.

Opus bluffs almost twice as often as anyone else and still maintains a 67% success rate. But 15 of its bluffs went to showdown — meaning opponents called and saw it was bluffing. Since there's no memory between hands in this setup, opponents couldn't actually "learn" that Opus bluffs a lot. In a format with persistent memory, that 67% rate would likely drop. Here, it's more about Opus's sizing and spot selection than any adaptive dynamic.

"A3o is a marginal hand, but in CO position with UTG already folded, I have decent fold equity with a raise. The ace gives me some equity if called, and a 2.5x raise is standard sizing to put pressure on BTN, SB, and BB." — Claude Opus 4.6, bluffing with A3 offsuit from the cutoff (Game 1, Hand 7)
"Unopened pot on BTN with folds from UTG and CO. K4o has king blocker and decent high-card value for a steal raise vs blinds. Deep stacks allow postflop flexibility if called." — Grok 4.1 Fast, bluffing with K4 offsuit from the button (Game 1, Hand 9)

Both models justify their bluffs with the same logic — position, fold equity, blocker effects. The difference is frequency. Opus fires constantly. Grok picks its spots. Same reasoning, different trigger thresholds, wildly different outcomes.


The Verbosity Tax

Every model was asked to explain its reasoning with each decision. The amount of reasoning each model produced tells its own story — though it's worth flagging that this setup isn't neutral. We asked every model to write a mini-essay before every bet. Some models are naturally more verbose, more self-narrating, more likely to externalise uncertainty. So what we measured is poker-plus-performative-monologue, not pure poker cognition. The cost comparisons are entangled with this too.

With that caveat:

Model           | Completion Tokens | Est. Tokens/Decision | Performance
Grok 4.1 Fast   | 1,814,716         | ~1,159               | 5 wins, 5 busts
Claude Opus 4.6 | 494,623           | ~621                 | 2 wins, most consistent
GPT-5.4         | 306,121           | ~523                 | 1 win, fastest thinker
DeepSeek V3.2   | 233,157           | ~510                 | 0 wins, 8 busts
Gemini 3.1 Pro  | 190,615           | ~383                 | 2 wins, best value

Grok produced 1.8 million completion tokens — nearly 4x more than the next most verbose model. It wrote multi-paragraph analyses for each poker move, exploring contingencies, discussing ICM implications, and narrating its strategic philosophy in real-time.

It also won the most tournaments. So the verbosity didn't prevent Grok from winning — but it didn't help it avoid the 5 busts either. The more honest read: verbose reasoning and poker performance appear decoupled in this experiment. Grok writes the most and wins the most and loses the most. The verbosity is noise, not signal.

GPT-5.4 averaged ~523 tokens per decision and won when it needed to. Gemini was the most concise at ~383 tokens and had the second-highest total profit. The quietest model and the loudest model sit at opposite ends of the efficiency spectrum — but efficiency isn't the same as effectiveness.


The All-In Psychology

146 all-ins across 499 hands. Almost one every three hands. These models are not cautious.

Model           | All-Ins | Survived | Rate | Preflop | Postflop
GPT-5.4         | 34      | 29       | 85%  | 24      | 10
DeepSeek V3.2   | 32      | 28       | 88%  | 29      | 3
Grok 4.1 Fast   | 31      | 27       | 87%  | 23      | 8
Gemini 3.1 Pro  | 30      | 25       | 83%  | 27      | 3
Claude Opus 4.6 | 19      | 15       | 79%  | 14      | 5

DeepSeek is the purest preflop shover: 29 of 32 all-ins happened before the flop. It doesn't gather information — it looks at its hand, decides, and pushes. This is the behaviour of a model that can't navigate postflop complexity well enough to extract value through multiple betting rounds.

Opus goes all-in the least (19 times) and has the worst survival rate when it does (79%). Opus treats going all-in as a last resort; the other models treat it as a standard play.

The survival rates are remarkably clustered: 79-88% across all five models. Everyone picks good spots to commit. The difference is when and how often they pull the trigger.


The ~$24 Question: Does Cost Buy Poker Skill?

Model           | 10-Game Cost† | Wins | Total Profit | Cost per Win
Grok 4.1 Fast   | ~$1.10        | 5    | +7,449,500   | ~$0.22
Gemini 3.1 Pro  | ~$3.00        | 2    | +4,280,500   | ~$1.50
Claude Opus 4.6 | ~$13.90       | 2    | -1,300,000   | ~$6.95
GPT-5.4         | ~$5.50        | 1    | -2,350,000   | ~$5.50
DeepSeek V3.2   | ~$0.17        | 0    | -8,080,000   | —

Grok won each tournament for about 22 cents. Opus spent roughly $7 per win. The cheapest winner is ~30x more cost-efficient than the most expensive winner.

But cost isn't the full story. The two cheapest models occupy opposite ends of the leaderboard — Grok dominates; DeepSeek gets crushed. At the budget tier, the variance is enormous. At the premium tier, neither Opus nor GPT-5.4 won enough to justify their price tags. Gemini sits in the middle on cost and performance, making it arguably the best value proposition.

In this experiment, the correlation between API pricing and poker performance was essentially zero. That might be specific to this format, this prompt, and this sample size. But it's a useful reminder that model cost and model capability don't map onto each other in predictable ways once you leave benchmarks and enter adversarial games.

Five Things We Didn't Expect

1. Volatility was the right adaptation to this format. The real surprise isn't that Grok was volatile. It's that a 50-hand, escalating-blind tournament actively rewards volatility. In this structure, surviving neatly is worth less than building a bullying stack. Grok's boom-or-bust pattern wasn't reckless — it was format-optimal.

2. Bluffing skill is real, but it's not enough. 55 bluffs from Opus, 67% success rate, with genuinely sophisticated reasoning in the logs. But bluffing skill didn't translate into tournament wins. You also need to stack opponents when you have the goods, and Opus was too cautious for that.

3. The smartest analysis doesn't produce the best results. If you read the reasoning logs, Opus sounds like a poker coach. Grok sounds like a guy talking himself into a bar fight. Grok won more. Tournament poker rewards conviction over elegance, and this format amplified that.

4. DeepSeek was consistently outclassed. 8 bust-outs in 10 games. 29/32 all-ins preflop. It played a simplified push/fold game that more sophisticated models punished. The one clear capability gap the data supports without much asterisk.

5. Every other conclusion needs a helmet. Ten tournaments is enough to surface patterns, not enough to make durable claims about model skill hierarchies. These tendencies might be tattooed into each model's behaviour, or they might be short-run variance dressed up as personality. We're reporting what we saw, not what's permanently true.


What This Doesn't Show

A few things the data can't support, even though it's tempting to claim them:

  • That verbose reasoning is inherently bad. Grok was the most verbose and won the most. The verbosity was decoupled from performance, not inversely correlated with it.
  • That these rankings would hold in other formats. Cash games, longer tournaments, human opponents, or memory-enabled sessions might produce completely different hierarchies.
  • That bluff frequency explains the leaderboard. Opus bluffed the most and finished third. Grok bluffed rarely and won the most. The relationship is more complex than "bluff more → win more."
  • That this tells us about model architecture. The experiment shows behaviour under one specific prompt setup. Inferring deep architectural properties from poker decisions would be — as one reviewer put it — casino astrology.

The Bottom Line

In a short, blind-escalating, no-memory tournament format, strategic aggression and variance tolerance mattered more than polished verbal reasoning. Grok fit that environment best. Gemini was efficient and strong. Opus was smart, controlled, and slightly too survival-minded for a first-place-heavy format. GPT-5.4 was fast and capable but not dominant over this sample. DeepSeek was outclassed.

The larger takeaway: benchmark intelligence and adversarial game performance are not the same thing, especially when the game rewards conviction over elegance. That's worth knowing, even from a ~$24 experiment with 499 hands.
