Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-5-3eb7f54c-e29d-4c15-9b7f-5697b308ea24-input.json

Pretty-print

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Balatro Gaming Benchmark # Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
</topic>

<comments_about_topic>
1. Even before this, Gemini 3 has always felt unbelievably 'general' for me.
It can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

1. It's an LLM, not something trained to play Balatro specifically

2. Most (probably >99.9%) players can't do that at the first attempt

3. I don't think there are many people who posted their Balatro playthroughs in text form online

I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.

[0]: https://balatrobench.com/

2. Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.

3. It beats ante eight 9 times out of 15 attempts. I do consider 60% winning chance very good for a first time player.

The average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell Invisible Joker (a valid move)[0]. That being said, Gemini made a big mistake in round 6 that would have costed it the run at higher difficulty.

[0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated.

4. https://balatrobench.com/

5. Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).

6. Thank you for the site! I've got a few suggestions:

1. I think winrate is more telling than the average round number.

2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.

3. Instead of giving them "strategy" like "flush is the easiest hand..." it's fairer to clarify some mechanisms that confuse human players too. e.g. "played" vs "scored".

Especially, I think this kind of prompt gives LLM an unfair advantage and can skew the result:

> ### Antes 1-3: Foundation

> - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker

7. I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data.

8. Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

9. Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

10. If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

11. I think I weakly disagree. Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

12. >Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.

13. Grok (rank 6) and below didn't beat the game even once.

Edit: in my original comment I said it wrong. I meant to say Deepseek can't beat Balatro at all, not can't play. Sorry

14. > . I don't think there are many people who posted their Balatro playthroughs in text form online

There are * tons * of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model.

15. Yeah, or just the steam text guides would be a huge advantage.

I really doubt it's playing completely blind

16. Not sure it's 99.9%. I beat it on my first attempt, but that was probably mostly luck.

17. How does it do on gold stake?

18. > Most (probably >99.9%) players can't do that at the first attempt

Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Balatro Gaming Benchmark # Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task

commentCount

← Back to job