llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-12-c12a37ba-fb40-4c38-9928-f7c42c5aa3ac-input.json
The following is content for you to summarize. Do not respond to the comments—summarize them.
<topic>
Test Time Compute Approaches # Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
</topic>
<comments_about_topic>
1. Agreed. Gemini 3 Pro for me has always felt like it has had a pretraining alpha, if you will. And many data points continue to support that. Even flash, which was post-trained with different techniques than pro, is good or equivalent at tasks which require post-training, occasionally even beating pro (e.g. in apex bench from mercor, which is basically a tool calling test, simplifying: flash beats pro). The score on arc agi2 is another data point in the same direction. Deepthink is sort of parallel test time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding), same as gpt-5.2-pro, and can extract more because of pretraining datasets.
(I am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl post-training of models, and this score just shows how "skilled" the base model was or how strong the priors were. I apologize if this is not super clear, happy to expand on what I am thinking.)
2. This seems fine...?
https://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...
I don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly ("If the marking is UM16 on an SMA/DO-214AC package...") and reads the graph in Fig. 1 correctly.
Of course, it took 18 minutes of crunching to get the answer, which seems a tad excessive.
3. But it might be true if we can't find any tasks where it's worse than average -- though I do think if the task takes several years to complete it might be possible, because currently there's no test-time learning.
4. Do opus 4.6 or gemini deep think really use test time adaptation ? How does it work in practice?
5. Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
6. Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think.
7. And it made almost zero impact; it was just a bigger version of DeepSeek V2 and went mostly unnoticed because its performance wasn't particularly notable, especially for its size.
It was R1 with its RL training that made the news and crashed the stock market.
8. I'm having trouble just keeping track of all these different types of models.
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
9. What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?
10. Their models are absolutely not impressive.
Not a single person is using it for coding (outside of Google itself).
Maybe some people on a very generous free plan.
Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
But that isn't "the model"; that's an old model backed by massive money.
11. It's very hard to tell the difference between bad models and stinginess with compute.
I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).
If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and it gives smarter answers.
I shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini deciding to nerf the reasoning effort to save compute, to TPUs being faster, to Gemini being worse, to this being my idiosyncratic experience, all fit the same data, and are all plausible.
12. You nailed it. Gemini 3 Pro seems very "lazy" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs.
13. No projects, completely forgets context mid-dialog, mediocre responses even on thinking, research got kneecapped somehow and is completely useless now, uses Russian propaganda videos as the search material (what's wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs overnight, Mac is almost completely frozen because 10 tabs consumed 8 GBs of RAM doing nothing. It's a complete joke.
14. Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then run some basic text post-processing to see where the 4 responses agree vs disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed. But if all 4 agree on some substring it's almost certainly a correct transcription. Wouldn't be too hard to get codex to vibe code all this.
15. They could likely increase their budget slightly and run an LLM-based judge.
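A rough sketch of both ideas combined (the get_response call is a placeholder for whatever provider SDK you use; difflib is from the Python stdlib):

```python
import difflib

def get_response(model: str, prompt: str) -> str:
    # Placeholder: swap in a real call via your provider's SDK.
    raise NotImplementedError

def agreement_report(prompt: str, models: list[str], runs: int = 2) -> None:
    # Identical prompt, multiple models, multiple runs each.
    responses = [get_response(m, prompt) for m in models for _ in range(runs)]
    base = responses[0]
    for i, other in enumerate(responses[1:], start=2):
        # Substrings where the responses disagree are where scrutiny
        # (or a second-pass LLM judge) is needed.
        sm = difflib.SequenceMatcher(None, base, other)
        for tag, a1, a2, b1, b2 in sm.get_opcodes():
            if tag != "equal":
                print(f"1 vs {i} [{tag}]: {base[a1:a2]!r} <-> {other[b1:b2]!r}")

agreement_report("Transcribe the attached label.",  # hypothetical task
                 models=["gemini-3.0-pro", "gpt-5.2-thinking"])
```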
16. I think this is 3.1 (3.0 Pro with the RL improvements of 3.0 Flash).
But they probably decided to market it as Deep Think because why not charge more for it.
17. The Deep Think moniker is for parallel compute models though, not long CoT like pro models.
It's possible though that deep think 3 is running 3.1 models under the hood.
18. I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.
19. OT but my intuition says that there’s a spectrum
- non thinking models
- thinking models
- best of N models like deep think and gpt pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to linear, quadratic, and n^3 respectively.
I think there is a certain class of problems that can't be solved without thinking, because it necessarily involves writing in a scratchpad. And the same goes for best of N, which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non-thinking model perform the same as a smaller thinking one?
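A toy sketch of the three rungs, just to pin down what scales with what (generate() and score() are hypothetical stand-ins for a model call and an answer selector):

```python
def generate(prompt: str) -> str: ...   # hypothetical: one model call
def score(answer: str) -> float: ...   # hypothetical: verifier / selector

def non_thinking(prompt):
    # One pass: compute scales with the answer alone.
    return generate(prompt)

def thinking(prompt):
    # Serial scratchpad first; the answer conditions on it.
    scratchpad = generate(prompt + "\nThink step by step.")
    return generate(prompt + scratchpad)

def best_of_n(prompt, n=8):
    # Parallel thinking runs plus a selection step on top.
    return max((thinking(prompt) for _ in range(n)), key=score)
```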
20. The difference between thinking and no-thinking models can be a little blurry. For example, when doing coding tasks, Anthropic models in no-thinking mode tend to use a lot of comments to act as a scratchpad. In contrast, models in thinking mode don't do this because they don't need to.
Ultimately, the only real difference between no-thinking and thinking models is the number of tokens used to reach the final answer. Whether those extra scratchpad tokens are between <think></think> tags or not doesn't really matter.
21. I think step 4 is the agent swarm. Manager model gets the prompt and spins up a swarm of looping subagents, maybe assigns them different approaches or subtasks, then reviews results, refines the context files and redeploys the swarm on a loop till the problem is solved or your credit card is declined.
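As a sketch (every function name here is hypothetical, it's just the shape of the loop):

```python
def plan(prompt: str, context: str) -> list[str]: ...  # manager: split into subtasks
def run_subagent(task: str, context: str) -> str: ...  # hypothetical subagent call
def review(prompt: str, results: list[str]) -> tuple[bool, str]: ...  # solved?, new context

def swarm(prompt: str, max_rounds: int = 10) -> str:
    context = ""
    for _ in range(max_rounds):  # ...or until the credit card is declined
        subtasks = plan(prompt, context)
        results = [run_subagent(t, context) for t in subtasks]  # fan out
        solved, context = review(prompt, results)               # fan in, refine
        if solved:
            break
    return context
```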
22. > best of N models like deep think an gpt pro
Yeah, these are made possible largely by better behavior at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best of n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is; all you care about is that you find more as you spend more time. Literally throwing money at the problem.
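Under the (optimistic) assumption that attempts are independent with per-attempt success rate p, the arithmetic is simply:

```python
# pass@n = 1 - (1 - p)**n under independence; real attempts are
# correlated, so treat this as an upper-ish bound on the scaling.
def pass_at_n(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(pass_at_n(0.02, 1))    # 0.02  -- a weak pass@1...
print(pass_at_n(0.02, 100))  # ~0.87 -- ...still compounds if checks are cheap
```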
23. > can a sufficiently large non-thinking model perform the same as a smaller thinking one?
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
24. It's interesting that Opus 4.6 added a parameter to make it think extra hard.
25. It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.
26. OpenRouter is pretty great, but I think litellm does a very good job and it's not a platform middleman, just a Python library. That being said, I haven't tried it with the deep think models.
https://docs.litellm.ai/docs/
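For reference, the core call is just this (model ids below are guesses; check litellm's provider list for the real ones):

```python
from litellm import completion  # pip install litellm

# One function, many providers; litellm translates to each vendor's API
# and returns an OpenAI-style response object.
response = completion(
    model="gemini/gemini-3-pro",  # hypothetical provider/model id
    messages=[{"role": "user", "content": "Hello from litellm"}],
)
print(response.choices[0].message.content)
```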
27. I just tested it on a very difficult Raven matrix that the old version of DeepThink, as well as GPT 5.2 Pro, Claude Opus 4.6, and pretty much every other model, failed at.
This version of DeepSeek got it first try. Thinking time was 2 or 3 minutes.
The visual reasoning of this class of Gemini models is incredibly impressive.
28. Deep Think not DeepSeek
29. I feel like a luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, broad knowledge, good built in web search tool, etc. Oh, and it is fast and cheap.
I really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for super scalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."
30. I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].
And I wonder how Gemini Deep Think will fare. My guess is that it will get halfway on some problems. But we will have to take an absence of results as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
[1] https://1stproof.org/
31. This benchmark outcome is actually really impressive given the difficulty of this task. It shows that this particular model manages to "think" coherently and maintain useful information in its context for what has to be an insane overall amount of tokens, likely across parallel "thinking" chains. Likely also has access to SVG-rendering tools and can "see" and iterate on the result via multimodal input.
32. I can't shake off the feeling that Google's Deep Think models are not really different models, but just the old ones being run with a higher number of parallel subagents, something you can do by yourself with their base model and opencode.
33. And after I do that, how do I combine the output of 1000 subagents into one output? (I'm not being snarky here, I think it's a nontrivial problem.)
34. You just pipe it to another agent to do the reduce step (i.e. fan-in) of the mapreduce (fan-out)
It's agents all the way down.
35. The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results aren't conflicting, they are complementary. And you just have a system that merges them, likely another agent.
36. Claude Cowork does this by default and you can see how exactly it is coordinating them etc.
37. Start with 1024 and use half the number of agents each turn to distill the final result.
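That's a log-depth reduction, something like this (merge_pair() standing in for an agent call that reconciles two outputs):

```python
def merge_pair(a: str, b: str) -> str:
    ...  # hypothetical: an agent that merges two outputs into one

def tournament_reduce(outputs: list[str]) -> str:
    # 1024 -> 512 -> 256 -> ... -> 1: log2(N) rounds, each round
    # using half as many merge agents as the one before.
    while len(outputs) > 1:
        merged = [merge_pair(outputs[i], outputs[i + 1])
                  for i in range(0, len(outputs) - 1, 2)]
        if len(outputs) % 2:          # odd one out gets a bye
            merged.append(outputs[-1])
        outputs = merged
    return outputs[0]
```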
38. They could do it this way: generate 10 reasoning traces and then every N tokens they prune the 9 that have the lowest likelihood, and continue from the highest likelihood trace.
This is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses.
10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.
That's something you can't replicate without access to the network's output before token sampling.
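One way to read that scheme (everything here is hypothetical pseudo-API; as noted above, it needs per-token logprobs that providers don't expose):

```python
def extend(trace: str, n_tokens: int) -> tuple[str, float]:
    ...  # hypothetical: sample n_tokens more, return (text, summed logprob)

def pruned_search(prompt: str, width: int = 10, chunk: int = 256,
                  rounds: int = 8) -> str:
    # Start width identical traces; stochastic sampling makes them diverge.
    traces = [(prompt, 0.0)] * width
    for _ in range(rounds):
        scored = []
        for text, logprob in traces:
            new_text, delta = extend(text, chunk)
            scored.append((new_text, logprob + delta))
        # Prune all but the highest-likelihood trace, then re-branch it
        # back out to full width for the next chunk.
        best = max(scored, key=lambda t: t[1])
        traces = [best] * width
    return max(traces, key=lambda t: t[1])[0]
```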
39. It's incredible how fast these models are getting better. I thought for sure a wall would be hit, but these numbers smash previous benchmarks. Anyone have any idea what the big unlock is that people are finding now?
40. It's correlated with the ability to solve logic puzzles.
It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
41. Do we know what model is used by Google Search to generate the AI summary?
I've noticed this week the AI summary now has a loader "Thinking…" (no idea if it was already there a few weeks ago). And after "Thinking…" it says "Searching…" and shows a list of favicons of popular websites (I guess it's generating the list of links on the right side of the AI summary?).
42. Always the same with Google.
Gemini has been way behind from the start.
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.
</comments_about_topic>
Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.
Test Time Compute Approaches # Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones