The following is content for you to classify. Do not respond to the comments—classify them.
<topics>
1. ARC-AGI Benchmark Validity
Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>
<comments_to_classify>
[
{
"id": "47001752",
"text": "The difference between thinking and no-thinking models can be a little blurry. For example, when doing coding tasks Anthropic models with no-thinking mode tend to use a lot of comments to act as a scratchpad. In contrast, models in thinking mode don't do this because they don't need to.\n\nUltimately, the only real difference between no-thinking and thinking models is the amount of tokens used to reach the final answer. Whether those extra scratchpad tokens are between <think></think> tags or not doesn't really matter."
}
,
{
"id": "46994725",
"text": "I think step 4 is the agent swarm. Manager model gets the prompt and spins up a swarm of looping subagents, maybe assigns them different approaches or subtasks, then reviews results, refines the context files and redeploys the swarm on a loop till the problem is solved or your credit card is declined."
}
,
{
"id": "46999038",
"text": "So Google Answers is coming back?!?!?!"
}
,
{
"id": "46994801",
"text": "i think this is the right answer\n\nedit: i don't know how this is meaningfully different from 3"
}
,
{
"id": "46992180",
"text": "> best of N models like deep think an gpt pro\n\nYeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with \"1M context\", but their usefulness after 100k-200k is iffy.\n\nWhat's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem."
}
,
{
"id": "46992275",
"text": "> can a sufficiently large non thinking model perform the same as a smaller thinking?\n\nModels from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking)."
}
,
{
"id": "46992359",
"text": "its interesting that opus 4.6 added a paramter to make it think extra hard."
}
,
{
"id": "46993617",
"text": "It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier \"deep think\" models have been increasingly requiring the use of their own platform."
}
,
{
"id": "46993973",
"text": "OpenRouter is pretty great but I think litellm does a very good job and it's not a platform middle man, just a python library. That being said, I have tried it with the deep think models.\n\nhttps://docs.litellm.ai/docs/"
}
,
{
"id": "46995726",
"text": "Part of OpenRouter's appeal to me is precisely that it is a middle man. I don't want to create accounts on every provider, and juggle all the API keys myself. I suppose this increases my exposure, but I trust all these providers and proxies the same (i.e. not at all), so I'm careful about the data I give them to begin with."
}
,
{
"id": "46996260",
"text": "Unfortunately that's ending with mandatory-BYOK from the model vendors. They're starting to require that you BYOK to force you through their arbitrary+capricious onboarding process."
}
,
{
"id": "46998621",
"text": "Will still be able to use open weights models, which is what I use openrouter primarily for anyway"
}
,
{
"id": "46996618",
"text": "The golden age is over."
}
,
{
"id": "46999291",
"text": "I just tested it on a very difficult Raven matrix, that the old version of DeepThink, as well as GPT 5.2 Pro, Claude Opus 4.6, and pretty much every other model failed at.\n\nThis version of DeepSeek got it first try. Thinking time was 2 or 3 minutes.\n\nThe visual reasoning of this class of Gemini models is incredibly impressive."
}
,
{
"id": "47001591",
"text": "Deep Think not DeepSeek"
}
,
{
"id": "46993674",
"text": "it is interesting that the video demo is generating .stl model.\nI run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com text-to-CAD AI editor) and Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So, I had to implement a full fledged \"screenshot-vibe-coding\" workflow where you draw arrows on 3d model snapshot to explain to LLM what is wrong with the geometry. Without human in the loop, all top tier LLMs hallucinate at debugging 3d geometry in agentic mode - and fail spectacularly."
}
,
{
"id": "46994929",
"text": "Hey, my 9 year old son uses modelrift for creating things for his 3d printer, its great! Product feedback:\n1. You should probably ask me to pay now, I feel like i've used it enough.\n2. You need a main dashboard page with a history of sessions. He thought he lost a file and I had to dig in the billing history to get a UUID I thought was it and generate the url. I would say naming sessions is important, and could be done with small LLM after the users initial prompt.\n3. I don't think I like the default 3d model in there once I have done something, blank would be better.\n\nWe download the stl and import to bambu. Works pretty well. A direct push would be nice, but not necessary."
}
,
{
"id": "47001451",
"text": "Thank you for this feedback, very valuable!\nI am using Bambu as well - perfect to get things printed without much hassle. Not sure if direct push to printer is possible though, as their ecosystem looks pretty closed. It would be a perfect use case - if we could use ModelRift to design a model on a mobile phone and push to print.."
}
,
{
"id": "46994645",
"text": "Yes, I've been waiting for a real breakthrough with regard to 3D parametric models and I don't think think this is it. The proprietary nature of the major players (Creo, Solidworks, NX, etc) is a major drag. Sure there's STP, but there's too much design intent and feature loss there. I don't think OpenSCAD has the critical mass of mindshare or training data at this point, but maybe it's the best chance to force a change."
}
,
{
"id": "47000354",
"text": "yes, i had the same experience. As good as LLMs are now at coding - it seems they are still far away from being useful in vision dominated engineering tasks like CAD/design. I guess it is a training data problem. Maybe world models / artificial data can help here?"
}
,
{
"id": "46998957",
"text": "I was looking for your GitHub, but the link on the homepage is broken: https://github.com/modelrift"
}
,
{
"id": "47001452",
"text": "right, I need to fix this one"
}
,
{
"id": "46994751",
"text": "If you want that to get better, you need to produce a 3d model benchmark and popularize it. You can start with a pelican riding a bicycle with working bicycle."
}
,
{
"id": "47001465",
"text": "building a benchmark is a great idea, thanks, maybe I will have a couple of days to spend on this soon"
}
,
{
"id": "46991427",
"text": "According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.\n\nGoogle has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions)."
}
,
{
"id": "46993072",
"text": "Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior."
}
,
{
"id": "46991722",
"text": "The general purpose ChatGpt 5.3 hasn’t been released yet, just 5.3-codex."
}
,
{
"id": "46992179",
"text": "It's ahead in raw power but not in function. Like it's got the worlds fast engine but one gear! Trouble is some benchmarks only measure horse power."
}
,
{
"id": "46992324",
"text": "> Trouble is some benchmarks only measure horse power.\n\nIMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on \"agentic this\" or \"specialised that\", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it."
}
,
{
"id": "46994660",
"text": "> especially for biology where it doesn't refuse to answer harmless questions\n\nUsually, when you decrease false positive rates, you increase false negative rates.\n\nMaybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible."
}
,
{
"id": "46993111",
"text": "Google models and CLI harness feels behind in agentic coding compared OpenAI and Antrophic"
}
,
{
"id": "46992354",
"text": "I gather that 4.6 strengths are in long context agentic workflows? At least over Gemini 3 pro preview, opus 4.6 seems to have a lot of advantages"
}
,
{
"id": "46992527",
"text": "It's a giant game of leapfrog, shift or stretch time out a bit and they all look equivalent"
}
,
{
"id": "46991450",
"text": "The comparison should be with GPT 5.2 pro which has been used successfully to solve open math problems."
}
,
{
"id": "46994346",
"text": "The problem here is that it looks like this is released with almost no real access. How are people using this without submitting to a $250/mo subscription?"
}
,
{
"id": "47001972",
"text": "I have some very difficult to debug bugs that Opus 4.6 is failing at. Planning to pay $250 to see if it can solve those."
}
,
{
"id": "46995144",
"text": "People are paying for the subscriptions."
}
,
{
"id": "46995564",
"text": "I gather this isn't intended a consumer product. It's for academia and research institutions."
}
,
{
"id": "46995040",
"text": "Gemini has always felt like someone who was book smart to me. It knows a lot of things. But if you ask it do anything that is offscript it completely falls apart"
}
,
{
"id": "46995279",
"text": "I strongly suspect there's a major component of this type of experience being that people develop a way of talking to a particular LLM that's very efficient and works well for them with it, but is in many respects non-transferable to rival models. For instance, in my experience, OpenAI models are remarkably worse than Google models in basically any criterion I could imagine; however, I've spent most of my time using the Google ones and it's only during this time that the differences became apparent and, over time, much more pronounced. I would not be surprised at all to learn that people who chose to primarily use Anthropic or OpenAI models during that time had an exactly analogous experience that convinced them their model was the best."
}
,
{
"id": "46998719",
"text": "We train the AI. The AI then trains us."
}
,
{
"id": "46995078",
"text": "I'd rather say it has a mind of its own; it does things its way. But I have not tested this model, so they might have improved its instruction following."
}
,
{
"id": "46995158",
"text": "Well, one thing i know for sure: it reliably misplaces parentheses in lisps."
}
,
{
"id": "46995248",
"text": "Clearly, the AI is trying to steer you towards the ML family of languages for its better type system, performance, and concurrency ;)"
}
,
{
"id": "46998074",
"text": "I made offmetaedh.com with it. Feels pretty great to me."
}
,
{
"id": "46996077",
"text": "It found a small but nice little optimization in Stockfish: https://github.com/official-stockfish/Stockfish/pull/6613\n\nPrevious models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out."
}
,
{
"id": "46996807",
"text": "I feel like a luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, broad knowledge, good built in web search tool, etc. Oh, and it is fast and cheap.\n\nI really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for super scalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title \"Old Luddite.\""
}
,
{
"id": "47001313",
"text": "3 Flash is criminally under appreciated for its performance/cost/speed trifecta. Absolutely in a category of its own."
}
,
{
"id": "46992634",
"text": "I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].\n\nAnd I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.\n\n[1] https://1stproof.org/"
}
,
{
"id": "47000189",
"text": "As a non-mathematician, reading these problems feels like reading a completely foreign language.\n\nhttps://arxiv.org/html/2602.05192v1"
}
]
</comments_to_classify>
Based on the comments above, assign each to up to 3 relevant topics.
Return ONLY a JSON array with this exact structure (no other text):
[
{
"id": "comment_id_1",
"topics": [
1,
3,
5
]
}
,
{
"id": "comment_id_2",
"topics": [
2
]
}
,
{
"id": "comment_id_3",
"topics": [
0
]
}
,
...
]
Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment
Remember: Output ONLY the JSON array, no other text.