Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-15-3b8cdd20-b02e-4df4-b7c6-1cbfe80b82e5-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Spatial Reasoning Limitations # Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
</topic>

<comments_about_topic>
1. I suspect the non-spikey part is the more interesting comparison

Why is it so easy for me to open the car door, get in, close the door, and buckle up? You can do this in the dark and without looking.

There are an infinite number of little things like this that you think zero about and that take near zero energy, yet which are extremely hard for AI.

2. > There's a term for this, but I can't think of it at the moment.

Moravec's paradox: https://epoch.ai/gradient-updates/moravec-s-paradox

3. I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".

I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.

Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.

4. Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators, and the fact that the token generators are somehow beating it anyway really says something.

5. Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand, or to a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?

6. That's a bit like saying just give blind people cameras so they can see.

7. I mean, no, not really. These models can see; you're giving them eyes to connect to that part of their brain.

8. They should train more on sports commentary, perhaps that could give spatial reasoning a boost.

9. Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.

It's completely misnamed. It should be called useless visual puzzle benchmark 2.

Firstly, it's a visual puzzle, making it way easier for humans than for models trained on text. Secondly, it's not really that obvious or easy for humans to solve themselves!

So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing, basically, other than that the models can now solve "Arc-AGI".

10. One discovery I've made with Gemini is that OCR accuracy is much higher when the document is perfectly aligned at 0 degrees. When we provided Gemini with images of handwritten text that were rotated (90 or 180 degrees), it had lots of issues reading dates, names, etc. Then we used the PaddleOCR image orientation model to find the orientation and rotate the image, which solved most of our issues with OCR.
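
A minimal sketch of the fix described in comment 10, assuming an orientation classifier is available: detect_rotation() below is a hypothetical stand-in for whatever model you use (the commenter used PaddleOCR's image-orientation model), and only the rotation/normalization step is shown concretely with Pillow. The resulting upright image is then sent to Gemini or any other OCR backend.

# Sketch: normalize page orientation before OCR, per the workflow above.
# detect_rotation() is a hypothetical placeholder -- plug in PaddleOCR's
# image-orientation classifier (or any equivalent) here. It should return
# the counter-clockwise rotation, in degrees, needed to make the text upright.
from PIL import Image

def detect_rotation(path: str) -> int:
    """Placeholder for an orientation classifier; returns 0, 90, 180, or 270."""
    raise NotImplementedError

def normalize_orientation(path: str, out_path: str) -> str:
    angle = detect_rotation(path)
    img = Image.open(path)
    if angle:
        # PIL rotates counter-clockwise; expand=True keeps the full page in frame.
        img = img.rotate(angle, expand=True)
    img.save(out_path)
    return out_path

# The upright image is then passed to Gemini (or any OCR backend) as usual.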

11. I just tested it on a very difficult Raven matrix that the old version of DeepThink, as well as GPT 5.2 Pro, Claude Opus 4.6, and pretty much every other model, failed at.

This version of DeepSeek got it first try. Thinking time was 2 or 3 minutes.

The visual reasoning of this class of Gemini models is incredibly impressive.

12. It is interesting that the video demo is generating an .stl model.
I run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com, a text-to-CAD AI editor), and the Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So I had to implement a full-fledged "screenshot-vibe-coding" workflow where you draw arrows on a 3D model snapshot to explain to the LLM what is wrong with the geometry. Without a human in the loop, all top-tier LLMs hallucinate at debugging 3D geometry in agentic mode and fail spectacularly.
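
A rough sketch of the kind of human-in-the-loop cycle described in comment 12, not modelrift.com's actual pipeline: render the current OpenSCAD source to a snapshot with the openscad CLI, let a person annotate the snapshot with arrows, then send the annotated image plus the source back to a vision-capable LLM for a revision. ask_llm_for_fix() is a hypothetical stand-in, and the CLI flags assume a typical OpenSCAD install.

# Sketch of a "screenshot-vibe-coding" cycle for OpenSCAD (illustrative only).
import subprocess
from pathlib import Path

def render_snapshot(scad_file: str, png_file: str) -> None:
    """Render the model to a PNG snapshot using the OpenSCAD CLI."""
    subprocess.run(
        ["openscad", "-o", png_file, "--imgsize=1024,768", scad_file],
        check=True,
    )

def ask_llm_for_fix(scad_source: str, annotated_png: str) -> str:
    """Hypothetical stand-in: send the source plus the human-annotated
    screenshot (arrows marking bad geometry) to a vision-capable LLM and
    return revised OpenSCAD source."""
    raise NotImplementedError

def iterate_once(scad_file: str, annotated_png: str) -> None:
    source = Path(scad_file).read_text()
    revised = ask_llm_for_fix(source, annotated_png)
    Path(scad_file).write_text(revised)
    # Produce the next snapshot for the human to mark up.
    render_snapshot(scad_file, "preview.png")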

13. Yes, I've been waiting for a real breakthrough with regard to 3D parametric models and I don't think this is it. The proprietary nature of the major players (Creo, Solidworks, NX, etc.) is a major drag. Sure there's STP, but there's too much design intent and feature loss there. I don't think OpenSCAD has the critical mass of mindshare or training data at this point, but maybe it's the best chance to force a change.

14. Yes, I had the same experience. As good as LLMs are now at coding, it seems they are still far away from being useful in vision-dominated engineering tasks like CAD/design. I guess it is a training data problem. Maybe world models / artificial data can help here?

15. If you want that to get better, you need to produce a 3D model benchmark and popularize it. You can start with a pelican riding a bicycle, with a working bicycle.

16. Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.

17. Not very likely?

ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.

18. How is spatial reasoning useless??
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Spatial Reasoning Limitations # Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions

commentCount

18
