Summarizer

LLM Input

llm/26ba9d27-8aa5-4ccd-ac0e-aa7e277e9792/af1743b5-a586-455d-aba9-669c1e3fd7af-input.json

userPrompt

Summarize the following content from a Notion page and its linked resources in approximately 500 words. Focus on the main ideas, key points, and important information. Write in a clear, direct style.

<content>
=== More Thoughts About AGI (https://www.notion.so/More-Thoughts-About-AGI-2504c916374a800a92f0cf7c92e45c4e) ===

@Séb Krier: (I know I'm a stuck record) An important assumption in AI discourse is that sufficiently capable generalist *models* are the main event. Get the model smart enough, and it more or less does everything. Value creation, competitive advantage, and risk would all concentrate at the… pic.twitter.com/b5HXqxcAZv https://x.com/i/status/1999994684203045003

When I talk about where time horizons and scaling curves come from, reference the recent AXRP podcast where Daniel talks about the boiling point of water being a more contingent phenomenon than we generally recognize: a function of the complex dynamics between liquid water, water vapor, and dust particles in the air. Other scaling laws are probably aggregates of multiple factors, including feedback loops that we may not be recognizing. For example, the time horizons curve may be a function of it taking longer and longer to find the rarer and rarer stumbling blocks for models. I like this model much more than the fixed hazard rate model as an explanation for the distribution of model success rates at tasks of different lengths (see the simulation sketch further down this page).

“Drop-in remote worker” is a frustrated response to the argument that it will take time for AI to diffuse into the economy. It’s ultimately a bet that AI will zoom past the need for adaptation faster than we will adapt to it. For now, that’s a bad bet. Maybe even more so in robotics. (In robotics, we have two related questions: will robots be humanoid, and will we need to refactor work to leverage them?) Reference https://blog.cosmos-institute.org/p/faster-horses.

Intelligence is a concept that exists only in practice, not in theory. https://www.patreon.com/posts/january-3-2025-147255956

Intelligence is compression. But there is no perfect compression algorithm, nor even any universal compression algorithm. In some sense, there is no such thing as a compression algorithm: it is a fundamental law of information that there does not exist an algorithm which can compress all files. In fact, any compression algorithm must fall into one of two categories: either it never compresses anything at all, or there are some inputs which it makes larger rather than smaller. And yet both compression algorithms and intelligence clearly exist, in some sense, in the world. The resolution to this paradox is that both compression and intelligence exist only within some universe, some distribution or space of inputs. When you know something about the set of input files or world situations that you might encounter, you can create a compression algorithm, or a world-understanding and action-selection algorithm, that is likely to do well on that specific distribution. How well you can do this depends in large part on how much information you have about that distribution, and of course on how well you're able to understand and leverage that information. But we never have perfect understanding; in some sense, it's probably the case that we have only very little understanding, and we have only limited ability to make use of what information we do have, so everything we do is approximations and heuristics. And so the messy complexity of intelligence comes in as you try to select those heuristics. All of this has a very meta nature. You can act, you can spend time thinking about what action to take, or you can work on improving your algorithms for thinking.
You can collect more information, you can review which sorts of actions have worked better or worse, you can review which heuristics for thinking have been more or less valuable, and you can work on your criteria for performing those evaluations. There is no ceiling to the potential meta stack, no known formula for how much energy you should expend at each level of the stack, and no known formula for how to allocate the effort you do expend across levels. Practices such as rationality, Buddhism, "listen to your gut", etc. are all heuristics; even rationality is a heuristic. Perhaps in some theoretically ideal world where you could be perfectly rational, rationality would be more than just a heuristic; it would be the correct approach. However, we do not live in that world, and so even rationalists can only apply rationalism to a finite extent.

At the beginning of the year, I was listening to Ezra Klein interview someone about Buddhism and meditation and rising above your feelings and emotions. And I got to thinking that when I encounter a new practice like this, I find it threatening or exhausting, because I seize upon the idea that the practice should be applied exhaustively. And that is both an enormous amount of work, too much to contemplate, and inappropriate, because any of these practices is just another heuristic, to be incorporated into our heuristic toolkit.

This has extensive implications both for how we build AIs, and for how we evaluate and assess them and understand their implications. Because intelligence is not something that exists in an ideal platonic world, but can only exist in a particular universe with a particular distribution of situations and goals, there is no bedrock theory upon which we can construct, or even evaluate, any particular AI system or approach to building AI systems. Even to the extent that rigorous analysis might be possible, we are far away from having actually discovered how to carry it out. Our understanding of the universe that we live in, and our tools for working with that knowledge, are far too limited and clumsy. As a result, the development and assessment of artificial intelligence will be an inherently heuristic, experimental, messy, fumbling project. No particular new technique, such as having AIs evaluate their own work to look for errors they can self-correct, or introducing a system of multiple agents critiquing one another, or recording lessons and memories, etc., will be a silver bullet. For any new technique or tool in the toolkit, there will be the questions of when to apply it, how to apply it, how extensively to apply it, when to keep applying it and when to give up and switch to another tool, or when to combine two tools. The AI can never know for certain what to do, or how to figure out what to do, or how hard to work on figuring out what to do.

In light of all this messiness and uncertainty and intractability, it is fascinating to contemplate why scaling laws exist. For an endeavor as messy, ill-defined, and pre-paradigmatic as the development of AI, why should smooth curves exist anywhere on any graph? One thought I have is to question the extent to which this is actually true. We've all read repeatedly about learning curves, and Wright's law, which basically says that Moore's law has played out in many, many technologies, not just transistor fabrication.
But a recent paper showed that this is not particularly true, or is only very loosely true: learning curves tend to wiggle all over the place and are not nearly as straight a line as Moore's law. And in the specific case of Moore's law, various factors made it a self-fulfilling prophecy; it exists because we believed it would exist and used it as our planning yardstick, rather than it being some law of nature. It would be interesting to go look at the various scaling law graphs and see how straight they really are.

Various forms of philosophical ethics, whether to treat principles as absolute, etc. Stereotypes, snap judgments, surface impressions: these are all limited and fallible, but they're also essential for getting through everyday life. It requires a judgment call to decide when to spend the effort to look past your snap impression, and a judgment call to know how much effort to spend on that judgment call, etc. System 1 versus System 2 thinking. Which of several conceptual frameworks to bring to a situation. Heuristics: think before you act, measure twice / cut once, etc.

N Thoughts on AGI; Offcuts

https://news.ycombinator.com/item?id=46408921 https://x.com/snewmanpv/status/2005329341845303612 https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon

Anil Seth (@anilkseth) posted at 6:45 AM on Mon, Nov 10, 2025: "Cognition all the way down". Great to see this fine new paper from @robertchisciure & @drmichaellevin, out now in Synthese - it introduces a new metric to quantify biological intelligence as search efficiency in multidimensional problem spaces https://t.co/YH7W3F9FtQ https://t.co/UKIULAD5wU (https://x.com/anilkseth/status/1987894412584566876?t=0JSx8bvFJGpJy6NR5Xh7kw&s=03)

@Ethan Mollick: In general, all the chatbots seem to struggle with files in a way that CLI versions do not. Gemini will frequently confuse which nano banana image you are referring to in a conversation (the chain of thought shows it loses track) and ChatGPT often misplaces files that it makes https://x.com/i/status/2000413758729162889

Thought about AGI: for biological anchors, perhaps we should be looking at communications bandwidth rather than computation? It's easy to add more compute. https://x.com/i/status/1998778005129224412

From https://jasmi.news/p/42-notes-on-ai-and-work: “Diffusion lag” reflects a lack of product-market fit. Even AI optimists are still hitting practical roadblocks. That’s why detailed case studies are so much fun: physics, code security, running a restaurant at a small independent hotel. Our friendly hotel purveyor describes one such long-horizon task: “To replicate [chef] Hagai’s context, you’d need entire recipes, or maybe video of him preparing the foods; Toast sales data, or maybe video of the dining room; our hours; his calendar, featuring private events; communications among staff about what’s getting used for what; the CSVs for Baldor; the paper receipts for quick runs to Loeb’s; and maybe surveillance footage to capture exceptions.”

https://gwern.net/ai-daydreaming https://x.com/1a3orn/status/1997056050403725373 https://www.dwarkesh.com/p/thoughts-on-ai-progress-dec-2025 “AI Agenda: AI’s ‘Split-Brain’ Problem” https://x.com/Jack_W_Lindsey/status/1993389056932339721

From https://www.strangeloopcanon.com/p/epicycles-all-the-way-down – “Epicycles” is a great description for some of the behavior these models are learning (and the way we’re layering on new training techniques).
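Looping back to the hazard-rate note at the top of this page, here is a minimal simulation sketch of the two models being contrasted there (all rates and weights invented for illustration, not fit to any data, METR's or otherwise). Under a fixed hazard rate, success probability decays exponentially with task length; if failures instead come from a population of "stumbling blocks" of varying rarity, the aggregate hazard declines over time and the curve develops a heavier tail.

```python
import math

def survival_fixed_hazard(t, lam=0.1):
    """P(success) on a task of length t if each unit of time carries the
    same independent failure probability (constant hazard rate)."""
    return math.exp(-lam * t)

def survival_stumbling_blocks(t, rates=(0.2, 0.05, 0.01), weights=(0.5, 0.3, 0.2)):
    """Mixture of hazard rates: stumbling blocks of varying rarity.
    Tasks that survive a long time are mostly the ones facing only the
    rare blocks, so the aggregate hazard falls with t (heavier tail)."""
    return sum(w * math.exp(-lam * t) for lam, w in zip(rates, weights))

for t in (1, 5, 10, 30, 60, 120):
    print(f"t={t:4d}  fixed={survival_fixed_hazard(t):.3f}  "
          f"mixture={survival_stumbling_blocks(t):.3f}")
```

A mixture of exponentials always has a declining aggregate hazard, which is one concrete way "rarer and rarer stumbling blocks" would bend the success-vs-length curve away from the fixed-hazard prediction. It's also a reminder, per the learning-curves note above, that an aggregate curve can look deceptively smooth while hiding heterogeneous mechanisms.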
Mike Knoop (@mikeknoop) posted at 11:32 AM on Tue, Nov 25, 2025: "scaling sucked out all the oxygen in the room, everyone converged to the same ideas" --> new ideas still needed! (https://x.com/mikeknoop/status/1993402375944679673?t=LWizqAJMjEshtTbDlsNbgg&s=03)

Quintin Pope (@QuintinPope5) posted at 3:42 AM on Thu, Nov 20, 2025: It is genuinely weird how many complex dynamical feedback loops end up producing smooth exponentials as their aggregate outputs. (https://x.com/QuintinPope5/status/1991472282422899184?t=DqRlOyRAVVMmxdd9X6Go_Q&s=03)

Machine Learning Street Talk (@MLStreetTalk) posted at 11:38 PM on Sat, Nov 15, 2025: Intelligence is not just compression. The future is much harder to know than the past, especially for the "objects of interest" (complex/adaptive systems). Creation, construction and composition are better words to use than compression. Compression is a great proxy for the (https://x.com/MLStreetTalk/status/1989961238839693788?t=Fu6VNj83BJCK7qwDh3hKNw&s=03)

Why are centaurs a big win in coding but not in chess? The human brain, significantly, develops new circuits on the fly. We are not seeing AI being able to do that, but [my Superproductivity post] we are starting to see AI being able to capture new learnings in the form of written knowledge, new prompts, and new code.

https://x.com/fchollet/status/1989340153114976598 https://x.com/HjalmarWijk/status/1985529956890530217

From my comments on Timothy Lee’s draft post on long context:

[Timothy] Anthropic CEO Dario Amodei said that “there's no reason we can't make the context length 100 million words today, which is roughly what a human hears in their lifetime.”

With any sort of back-of-the-envelope math, 100M for a human lifetime seems off. Average reading speed is around 250 words per minute; let's conservatively call that 300 tokens per minute. That's 36M words per 2000-hour work year. We don't spend our whole lives reading, but we do spend a lot of time reading in one fashion or another, and quite a bit of the rest consuming audio information at not-drastically-lower rates. (A quick sanity check of this arithmetic appears at the end of this note.) Especially as LLMs (at least currently) are a lot more verbose than people are; critically, this includes being verbose in their chain-of-thought. Also, software engineering involves a lot of skimming – through code windows, terminal output, etc. A person might be presented with 1000 tokens of terminal output and get away with only reading 100 tokens of it, whereas an LLM would probably need to process all of it.

[Timothy] In February 2025, a team of researchers at Adobe published research on a more difficult variant of the needle-in-a-haystack test. Here the “needle” was a sentence like “Yuki lives next to the Semper Opera House,” and the model would be asked “Which character has been to Dresden?” To answer this question, you need to know that the Semper Opera House is in Dresden. Leading language models do know this, so if you give them this challenge in a short prompt (a small “haystack”) they tend to get it right more than 90% of the time. But if you give them the same challenge in a larger “haystack” — for example, a 32,000-token prompt — performance drops dramatically. GPT-4o goes from 99% accuracy to 70%; Claude 3.5 Sonnet drops from 88% to 30%.

1. I wonder how well most people would do on this test. 2. A person given this task would likely skim back through the passage. LLMs are handicapped by not being able to do this (my point being that this handicap supports your thesis).
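The promised sanity check of the reading-rate arithmetic above, as a minimal sketch (the constants are the rough estimates from the note, not measured values):

```python
# Back-of-envelope word-intake arithmetic (constants are the rough
# estimates from the note above, not measured values).
words_per_minute = 300                 # generous reading/listening rate
hours_per_work_year = 2000

words_per_year = words_per_minute * 60 * hours_per_work_year
print(f"words per 2000-hour year: {words_per_year:,}")   # 36,000,000

# At that rate, a 100M-word budget is consumed in under three years,
# which is why 100M words as a *lifetime* figure looks low.
years_to_100m = 100_000_000 / words_per_year
print(f"years of full-time intake to reach 100M words: {years_to_100m:.1f}")
```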
https://x.com/emollick/status/1985507610963968447 How will we measure AI capabilities for multi-day tasks?

Karnofsky, on 80,000 Hours, notes that AIs so far don't really seem to exhibit power-seeking in any non-trivial way, and he attributes this to the fact that we are not training them on very long tasks in environments where power-seeking could pay off: they're not being given any training tasks where accumulating resources, hacking into computers to expand their computing capacity, or suborning the global or national dialogue to enact more AI-favorable policies would help. That's just far away from anything that shows up in their training. This suggests to me that so long as AIs continue to be poor at generalization, we're probably in good shape, because it will probably continue to be the case for quite a while that there's no training regime where they would specifically get to hone their techniques for, or experience reward from, that kind of strategic power-seeking. So as long as they're not good at generalization, and so long as we're not incorporating the kind of real-world feedback into their training and their global state that would provide the opportunity for them to learn from attempts at power-seeking, then we're probably safe with regard to power-seeking.

Andrej Karpathy (@karpathy) posted at 10:09 AM on Wed, Oct 01, 2025: Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea (https://x.com/karpathy/status/1973435013875314729?t=l8ZRYSohCQ_g9t18XW18pA&s=03)

https://x.com/dwarkesh_sp/status/1979259041013731752?t=_ICAkvN3N3W24PVdwS8kmA&s=03 https://x.com/karpathy/status/1979644538185752935?t=H0LxJvX5xMtiIqHt69rLpA&s=03

Statistics often make AI sound more impactful than it is. Ryan Greenblatt pointing out that “AI writing 90% of the code at Anthropic” may be true only of some teams, and in any case may not represent 90% of the value of the code. Fin interview on the Cognitive Revolution: their agent is handling 60% of support tickets, but not 60% of the workload.

An example of someone actually leaning in to asking a model to do a thing it can't do, then analyzing the failures and what they say about missing capabilities: https://pca.st/episode/d9d8178e-9f7e-4102-bf2c-aa3e747bc188

From https://www.dwarkesh.com/p/andrej-karpathy: His experience writing nanochat is that models are bad at writing new or unusual code. They very strongly veer back toward standard libraries and standard coding styles. They're not great at new things. This causes him to be less optimistic about the potential for automation of AI R&D, and for the models to be discovering and developing new techniques and pushing the state of the art forward. It's interesting to contrast this with the claim from Anthropic that 90% of all their code internally is being written by LLMs. I guess we could interpret this as meaning that, even at a place like Anthropic, there's a lot of boilerplate code being written. Maybe “written by LLMs” is a little bit loose, and some of that includes autocomplete, which Andrej describes as actually working well for him.
It could be that the issue he experienced was not that he was doing something new in terms of the technical approach, but more just his code style: he was trying to do something lean and streamlined, with few dependencies. Maybe the models are fine at using standard techniques in different combinations, and his problem was just that he was departing from standard techniques; and maybe it's really the different combinations that will be important in pushing the state of the art forward. By the same token, that suggests that even as the models write 90% of the code at Anthropic, they're not really supplying all that large a percentage of the cognitive labor involved. Cf. Intercom's Fin handling 60% of tickets but a smaller percentage of the support workload.

"They're not very good at code that has never been written before... which is what we're trying to achieve when we're building these models."

One of the next big advances might be getting more than one bit of data per RL rollout: introspecting over where the model did well / poorly. ("Process supervision")

Andrej talks about LLMs not having cultural knowledge: there's no sense in which an LLM can write a book for other LLMs to consume. This is in the context of the takeoff in human capabilities about 10,000 years ago that was based on the ability to accumulate cultural knowledge and cultural capital. I might argue that the process of accumulating cultural knowledge can be thought of as a multi-decade project, tying back to the METR task time horizon concept: we arguably won't have AGI until AI time horizons are measured in decades, because humans can pursue decades-long projects, and the interesting scenarios for AGI or ASI, where it's rapidly advancing its own development, developing novel energy sources, etc., are things that constitute multi-decade projects. In particular, the AI R&D explosion as described in AI 2027 is, I think, fairly explicitly a century-long project. For that to ignite, maybe AI wouldn't have to be at the century-long horizon level, but it would have to be at a long enough horizon level that it can advance its own horizon faster than the scope of the project increases.

We should always keep in mind that just as AI development may encounter unexpected barriers, it may also encounter unexpected leaps forward, and I should reiterate what I've said on a number of occasions: AI's superhuman strengths may compensate for its weaknesses. Jaggedness runs both ways.

Maybe the labor market collapses when AI can adapt into new opportunities faster than people can. (Modulo demand for human labor specifically because it is human.) I think I already wrote about this in conversation with someone; maybe it's already in the More Thoughts doc? Or maybe that was verbal. Trying out a thought, on the topic of the endless debate as to whether AI will lead to permanent mass unemployment (because it will eventually do ~everything), or people will find new, plausibly-better jobs (because this is what has ~always happened before). Here's my thought: as the world has evolved over the centuries, new opportunities continually emerge, and those niches are initially filled by people. Eventually, some of those new jobs go away as automation / mechanization catches up, but new niches keep opening up and people – being far more adaptable than machines – get there first.
For instance, it took many decades from the invention of the telephone to the invention of automated switching systems, meaning that there were several generations in which someone could make a career as a switchboard operator. In this framing, *the tipping point will occur if/when AI becomes more adaptable than people*. If that happens, AI will fill the new niches first, leaving no window for people to occupy new "jobs" as they emerge. I could note, in contrast, that the web created an opportunity for Internet research librarians, but it was filled by Google, not people. [assuming AI isn’t supply constrained, and setting aside demand for human labor specifically because it is human – which needs to be distinguished from objective quality issues]

Chris Barber (GG discussion on WhatsApp): Kind of like stocks and flows in systems. Stocks = who does it better today; flow = who adapts faster/better. My naive expectation is that humans and AI will drift and specialize in the direction of wherever they adapt better. I expect there’ll be many things where humans adapt better for a long time, though I’m unsure on the quantity of roles in those areas. It’s also a way to assess impacts on people, and points to things that would help society adapt; i.e., people will do well proportional to how well they adapt to new supply-constrained, high-demand things. It's also a way of measuring the economic challenge level of AI: something like the extent to which it adapts faster than people, in what portion of valuable areas (I can imagine some kind of chart for this). When AI systems get better at new things slowly, less adaptation is needed from humans, so it's less intense; if it's adapting to most of the obvious new reskilling directions quickly, it feels much more intense. This also intersects with the jaggedness debate: the more jagged AGI is, the more areas where humans will be better and/or adapt faster.

Daniel Rock (GG discussion on WhatsApp): Some great papers on this: Autor, Levy, and Murnane (2003); Autor and Thompson (2024); Acemoglu and Restrepo (2016), the race between man and machine (they have a lot of papers); and Ide and Talamás (2025).

“Danie”: I think that’s really compelling, and I agree — if AI becomes more adaptable, that’ll generate new ideas about how to implement itself faster, as you describe. But I’d build on that and say adaptability, when paired with scale and access, is where it really takes off. AI isn’t just innovating faster; it’s also diffusing across industries and geographies far more quickly than previous technologies. In other words, it’s accelerating both the creation and the spread of its own applications. For example, historically, technological change was sectoral and staggered — the telephone or mechanisation hit discrete industries and localities over decades. Labour markets had temporal slack: time for the reallocation, retraining, and institutional adjustment you mention. What’s distinctive about AI is that it’s a multi-sectoral adoption shock with widening but uneven productivity impacts — unfolding across legal, creative, administrative, analytical, etc. domains at once. So adaptability, when coupled with that simultaneity, collapses the temporal and spatial buffers that usually cushion technological shocks.
Even if the long-run adaptive logic still holds, the adjustment process is no longer sequential but synchronous (or at the very least, comes in rapid cyclical waves) — and that compression of time, rather than simply the scale of change, is what makes this transition historically different.

I could note that personal computers displaced typists, but not teachers; factory automation displaced assembly line workers, but not construction workers.

https://www.henrikkarlsson.xyz/p/wordless-thought. Relates to “neuralese”. Also, I think it suggests that there are important things going on in the human brain that do not fit into the linear structure of an LLM. Also, I wonder whether current LLMs operate at a significant handicap by not being able to specialize.

https://ai-frontiers.org/articles/agis-last-bottlenecks keys off of a list of components of AGI. I think the list is incomplete. Review my post on the Case of the Missing Agent; this doesn’t cover higher-level skills such as prioritization and maintaining coherence over long time periods. You could argue that this falls under the memory and reasoning bullets, but I would argue that this under-weights the depth / breadth / complexity of the remaining areas to be covered. More generally, I think this ties back to my idea that we lack vocabulary (or at least have only a sparse vocabulary) for the things that current models are missing.

https://www.dwarkesh.com/p/andrej-karpathy He suggests that pre-training an LLM infuses it with both knowledge and intelligence (core cognitive structures that allow it to then learn new things through reinforcement learning), and that the models might actually be better at learning if they didn't have all that knowledge, because they use it as a crutch. I wonder whether it's really true that they're acquiring core intelligence separate from facts, or if they just have billions of little shards of cognitive concepts entangled with the detailed knowledge, and this is part of the reason that they struggle to generalize.

Ethan Mollick (@emollick) posted at 11:36 AM on Tue, Oct 21, 2025: Papers like this show that there are a lot of potential pathways forward on some of the hardest outstanding problems in AI. The amount of low-hanging fruit suggests that AI lab R&D might continue to find ways around barriers to continual improvement of AI models. (https://x.com/emollick/status/1980704687377486182?t=85fVurBCtsQddzOHOlqOtA&s=03)

Ethan Mollick (@emollick) posted at 4:47 AM on Tue, Oct 21, 2025: Looking back at an exponentially improving technology & you will see how momentum led to R&D which overcame tech barriers. The fact that reasoners were developed at exactly the moment AI pre-training slowed is how Moore’s Law works, too: new techniques appear to maintain the trend https://t.co/9rAySHO5L2 (https://x.com/emollick/status/1980601776710525103?t=tLJFmxj8M3wZV_-RlRTtmA&s=03)

Toby Ord (@tobyordoxford) posted at 0:12 PM on Mon, Oct 20, 2025: New post on RL scaling: Careful analysis of OpenAI’s public benchmarks reveals RL scales far worse than inference: to match each 10x scale-up of inference compute, you need 100x the RL-training compute. The only reason it has been cost-effective is starting from a tiny base.
🧵 https://t.co/ZwhDegc4NO (https://x.com/tobyordoxford/status/1980351353227768109?t=KevNIGnuT-Kq5Lt87kx1Kg&s=03)

Toby Ord (@tobyordoxford) posted at 0:49 PM on Fri, Oct 03, 2025: So it looks like most of the gains are coming from the ability to spend more compute on each answer rather than from better ability to reason for the same token budget. This shift has big implications for AI business, governance, and risk. https://t.co/X2EocaaZjQ 13/ (https://x.com/tobyordoxford/status/1974200193504719049?t=fKg6dK3nB5WEYbOHr052sg&s=03)

From https://thezvi.substack.com/p/bending-the-curve: A fun suggestion was to imagine LLMs talking about how jagged human capabilities are. Look how dumb we are in some ways while being smart in others. I do think in a meaningful sense LLMs and other current AIs are ‘more jagged’ than humans in practice, because humans have continual learning and the ability to patch the situation, and also to route the physical world around our idiocy where we’re being importantly dumb. So we’re super dumb, but we try to not let it get in the way.

Reactions to https://www.dwarkesh.com/p/thoughts-on-sutton:

The bitter lesson says that you want to come up with techniques which most effectively and scalably leverage compute. → Clearly, the wins come from leveraging compute. But “effectively” is as important here as “scalably”. Do we understand what the dividing line is between effective and ineffective uses of compute? (Not sure whether this directly relates to anything Dwarkesh said.)

Someone, I think a speaker on MLST, talked about how LLMs / deep learning models develop horrible, “fractured”, spaghetti-code representations. Perhaps that’s because we throw them into the deep end when training: they (I presume) come out of the gate learning obscure facts and other things that small children don’t try to learn(?); children instead master broad basic concepts and then build on that. Could be a combination of what they encounter (e.g. school curriculum, “baby talk”) and natural learning instinct (babies and children tune out things that are over their heads).

The agent is in no substantial way learning from organic and self-directed engagement with the world. Having to learn only from human data (an inelastic, hard-to-scale resource) is not a scalable use of compute.

What these LLMs learn from training is not a true world model (which tells you how the environment changes in response to different actions). Rather, they are building a model of what a human would say next. And this leads them to rely on human-derived concepts.

To that last paragraph: people talk about how, in the limit, an ASI should basically be able to derive anything from anything; a perfect predictor would need a model of the world in which the humans it is predicting reside; etc. But I don’t see anyone engage with the computational efficiency of that. It might be a valid point in the limit, but it might also be an inefficient, perhaps insupportably inefficient, way of building true world models.

LLMs aren’t capable of learning on-the-job, so we’ll need some new architecture to enable continual learning. And once we have it, we won’t need a special training phase — the agent will just learn on-the-fly, like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete.

I like to talk about how advancing technology erodes the gray areas that allow society to function, such as digital media removing the gray area around recording shows off of TV.
Continuous-learning AIs will erode the gray area between employees developing their expertise on the job vs. stealing trade secrets when they move to another employer.

I tried to ask Richard a couple of times whether pretrained LLMs can serve as a good prior on which to accumulate the experiential learning (aka do the RL) which will lead to AGI. On the one hand, it seems to be generally accepted that pretraining allows LLMs to achieve nonzero scores on various RL challenges, and that this is a necessary precondition to further hill climbing. On the other hand, per some of my thoughts above, perhaps pretraining sets them down a dead-end trail where they’re burdened by a mess of advanced concepts instead of a cleaner foundation of childhood concepts.

The accumulation of knowledge over tens of thousands of years has clearly been essential to humanity’s success. In any field of knowledge, thousands (and likely millions) of previous people were involved in building up our understanding and passing it on to the next generation. We didn’t invent the language we speak, nor the legal system we use, nor even most of the knowledge relevant to the technologies in our phones. This process is more analogous to imitation learning than to RL from scratch.

This is an important point. However, we don’t dump this accumulated mass of intellectual heritage on two-year-olds; we are very thoughtful about the order in which we present the information (and we do a lot of work to digest and organize it, likely supplemented by childhood instincts about what information to focus on).

Are we literally predicting the next token (like an LLM) in order to do this cultural learning? No, and so even imitation learning for humans is not like supervised learning for AI. But neither are we running around trying to collect some well-defined scalar reward. No ML learning regime perfectly describes human learning.

Aren’t we? The reward might be an instinctual mix of things like “satisfying curiosity” and “mastering new achievements”, but it may exist as a fairly coherent thing?

Being able to continuously learn from the environment in a high-throughput way is obviously necessary for true AGI. And it clearly doesn’t exist with LLMs trained on RLVR. But there might be some relatively straightforward ways to shoehorn continual learning atop LLMs. For example, one could imagine making SFT a tool call for the model. So the outer-loop RL is incentivizing the model to teach itself effectively using supervised learning, in order to solve problems that don’t fit in the context window. I’m genuinely agnostic about how well techniques like this will work—I’m not an AI researcher. But I wouldn’t be surprised if they basically replicate continual learning. Models already demonstrate something resembling human continual learning within their context windows. The fact that in-context learning emerged spontaneously from the training incentive to process long sequences suggests that if information could flow across windows longer than the current context limit, models could meta-learn the same flexibility they already show in-context.

This could be a path to AGI using an LLM architecture. But I’m somewhat dubious. LLMs might be too sample-inefficient. It would be very interesting to see whether models can learn to do a good job of deciding what information is valuable enough to fine-tune on (is it feasible for a gradient descent training process to figure this out?).
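A minimal sketch of the "SFT as a tool call" idea above. Everything here is hypothetical: run_episode, finetune, policy_update, and the action/task interfaces are stand-ins for whatever training stack one has, not a real API. The point is only the shape of the loop, where the outer-loop RL reward pays for good self-teaching rather than for individual answers.

```python
# Hypothetical sketch only: none of these functions correspond to a real API.
# They mark where the pieces of the "SFT as a tool call" loop would go.

def run_episode(model, task, finetune, max_steps=20):
    """The agent works on a long-horizon task. One available action is
    "distill these notes into weights": an SFT call on self-chosen data."""
    for _ in range(max_steps):
        action = model.act(task.observe())
        if action.kind == "finetune":
            # The model decides what is worth baking into its own weights
            # (the hard question in the note above: can gradient descent
            # learn to make this judgment well?).
            model = finetune(model, examples=action.selected_examples)
        else:
            task.apply(action)
        if task.done():
            break
    return task.score(), model

def outer_loop(policy_update, model, tasks, finetune):
    """Outer-loop RL: the reward is end-of-task success, so the gradient
    signal reinforces teaching strategies, not just individual answers."""
    for task in tasks:
        reward, model = run_episode(model, task, finetune)
        model = policy_update(model, reward)
    return model
```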
Is in-context learning robust enough to supply everything that LLMs are missing, if it could be extended across sessions? https://www.interconnects.ai/p/thoughts-on-the-curve

At The Curve, Ted Chiang said that the thing that separates human art from machine slop is a person making decisions. I didn’t get the chance to ask how that would apply to someone putting in laborious effort to, for instance, craft a cohesive movie out of 8-second Sora videos.

At The Curve, Gary Marcus said that broad AI never beats narrow AI. I asked about coding models, and he said that they’re an interesting middle ground involving lots of “data augmentation”, which might not be applicable to other areas. I can probably think of lots of other examples of things that LLMs can do but have never been addressed using narrow AI.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) posted at 1:29 AM on Sat, Sep 27, 2025: I think @RichardSSutton is spiritually right, but offers non-constructively framed arguments (much like Yann). Here are some of my incompetent thoughts on the matter. 0) almost none of this is relevant to mundane economic and strategic questions. However ineptly and wastefully https://t.co/qyMUb0c2Oh (https://x.com/teortaxesTex/status/1971854577755377748?t=V7JV-OHm5wSQE_mpk6SlGw&s=03)

From the video model section of https://importai.substack.com/p/import-ai-430-emergence-in-video: video models exhibit generalization, but my suspicion is that it will be as shallow and narrow and unsatisfying as we see for LLMs. Might add some more thoughts on the unsatisfying nature of powerful LLM capabilities.

Over and over and over and over again, we mistake benchmark scores for utility, and rapid progress for short timelines.

Steve Newman (@snewmanpv) posted at 9:40 AM on Tue, Sep 23, 2025: Why do coding agents cheat on unit tests (e.g. by modifying the test to always return true)? The obvious answer, "because this would be rewarded during RL training", only makes sense if the RL environment is stupid enough to be fooled by hacked tests. Do we know whether this is (https://x.com/snewmanpv/status/1970528715403854327?t=Mj_RiH2oeiOblZ7Xn1QrSA&s=03)

I cannot wrap my head around the idea that context lengths have been growing at 30x per year. How does this fit with cost, and with the way that models are being used in practice?

(Somewhat duplicative of things I said last time, but it’s worth coming at this from multiple framings.) AI already has all of the cognitive skills that laypeople have words for. It’s difficult to even pick out what’s missing, let alone describe it. CGI also had the property that there was a point where we were still in the uncanny valley, but it was difficult to say exactly what was missing. Many other software challenges have not had that property.

> Am also interested in how much folks here buy into the sandbagging theory that several of the cos are holding back on public announcements & using access to ever more powerful models/compute for internal competitive advantage?

I haven't heard any hints that the big labs are sandbagging. It's certainly possible that I wouldn't get a hint, but I'm inclined to think that's not happening. I do think that to some extent we're not seeing the most powerful models the labs could create today, for one or more of the following reasons: The models are still being tuned / tested. They're not releasing their biggest models (and plausibly not even developing them) because they wouldn't have enough compute to serve them.
They're limiting how "Pro" the models are (how long / how much in parallel they think) because it's too expensive / too much compute. And maybe they give themselves higher limits internally.

> Is it possible that they're also holding models bc they're not yet safe/aligned enough to not be a legal/brand/general risk?

Hmm, possibly? I haven't heard that either, but it does seem conceivable. Very plausibly, a substantially-better-than-GPT-5 model would trigger a level of required safeguards (under the big labs' published safety frameworks) that they're not ready to deliver, and they might not want to bend those frameworks far enough to skirt the issue.

From https://pca.st/episode/27144be7-bcae-4e16-b1e4-40302c956a1a: AI labs are paying Surge AI to build RL environments. Seems like there’s a ton of this going on everywhere (Mechanize + many, many other companies). AI labs are paying Surge AI to build ARC-AGI-style problems. To what extent does that represent benchmaxxing (the Surge CEO says there’s a ton of this, also over-optimization for LMArena, which is a very low-quality signal), vs. building genuinely useful capabilities?

Ethan Mollick (@emollick) posted at 8:42 AM on Wed, Sep 17, 2025: A big issue with today’s agent implementation is that they don’t ask questions, even when the thinking trace says the AI believes more information is required. Many disappointing results would be solved by just asking for clarification when needed, especially as task time grows. (https://x.com/emollick/status/1968339804975948274?t=jgv5dYIjriq2bhDdCzLajQ&s=03)

→ They should know when to ask the user for information, and also when to seek it from the environment – which deep research models have already learned to do? (Check whether I already wrote about this.)

When I succeed or fail at a task, I record much more than one bit of information. I use reasoning to identify the specific reason(s) that I succeeded (at a difficult task where failure was likely) or failed, and update heavily on those. (Don’t touch a hot stove twice.) Watching John squeeze the water bag from the side, realizing this is a much better way to fill it, updating strongly from one sample. LLM training doesn't distinguish these especially good examples. If you asked GPT-5 whether it should spend a fifth hour wrestling with the spreadsheet, I have to think it would have said no.

LLMs have weird disconnects in what information they can apply when. They're deeply non-reflective about their own state and actions. (Kind of like the aliens in Footfall, scaffolded into action before they were ready?) Also, they’re non-introspective in the way they learn: they don't lean into understanding their own successes and failures, curiosity, etc. They learn very entangled concepts, perhaps because they don’t go through the early-childhood stage of learning simple concepts? (How much of that is a function of the environment an early child is exposed to, and how much is a function of what they choose to explore and attend to?)

There's some deep concept my brain keeps wanting to explore, about the nature of intelligence, intelligence as compression, other models for intelligence, and how to quantify it. Something about the efficiency with which a system can find insights and ways of compressing a data set, as a function
also of the nature of the data set; and something about the fractal structure of this problem, which probably has something to do with scaling laws. There's a nearby timeline where I start having extended conversations with an LLM about this, and that's my personal route into LLM-induced psychosis.

Intelligence is compression, which is about finding just the right way to factor or frame a problem or situation. To decide whether a boiler is at risk of exploding, don't sample the trajectories of ten gas molecules; measure the pressure. Finding the right invariant for an inductive proof or iterative algorithm. Finding good low-level features. https://x.com/mjbukow/status/1962186888103747892

See “What to make of those METR evals” with Toby Ord.

Mechanistic interpretability tools such as sparse autoencoders appear to be able to explain something like 80% of the activations in an LLM. Could the remaining 20% be where much of the interesting / advanced thought is taking place? Is 80% “pretty good”, or more like “barely scratching the surface”? Compare to performance vs. loss function. (See the variance-explained sketch below.) Tie this to my note about LLMs learning far more facts per parameter / neuron:

Liv (@livgorton) posted at 9:27 AM on Tue, Aug 26, 2025: 17/ If this is right, it reframes the robustness problem: adversarial vulnerability might be the price of neural networks' incredible efficiency. They're vulnerable precisely because they're doing something clever with their representations. (https://x.com/livgorton/status/1960378468807447026?t=NKCW413z2VZixyL0jSXutw&s=03)

Discussions of AI utility always need to consider the baseline alternative (read).

My comment on Ryan’s August 2025 timelines update on LessWrong: Nice analysis. I can't add anything substantive, but this writeup crystallized for me just how much we're all focusing on METR's horizon lengths work. On the one hand, it's the best data set we have at the moment for quantitative extrapolation, so of course we should focus on it. On the other hand, it's only one data set, and could easily turn out to not imply what we think it implies. My only points are (a) we shouldn't weight the horizon length trends too heavily, and (b) boy do we need additional metrics that are both extrapolatable, and plausibly linked to actual outcomes of interest.

See this discussion of TextQuests: "When things go wrong: Mostly, models fail because they end up getting confused about what they've already done - this suggests that as model context lengths improve as well as their ability to effectively use their memory, performance will grow." → I should read the details: is it that they can't play the entire game in a single context window, or is it that they get confused anyway? The latter would suggest that context windows, even if scaled further, are not a full solution to memory / continuous learning.

Nathan Lambert thinks “we already have AGI” and that “continual learning” will be achieved through context engineering: https://www.interconnects.ai/p/contra-dwarkesh-on-continual-learning

With GPT-5, OpenAI is optimizing for cost and usability, not raw intelligence. https://www.reddit.com/r/mlscaling/comments/1mrm0di/we_had_this_big_gpu_crunch_we_could_go_make/ The labs may not be maxing out capabilities, because they’re inference-capacity constrained, so there’s no point in developing a large, smart model. See “Applied AI: The Math Doesn’t Work for Flat AI Agent Pricing”, from The Information.
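On the sparse-autoencoder note above (the forward reference there points here): "explains 80% of the activations" typically cashes out as fraction of variance explained by the SAE's reconstruction, roughly as in this generic sketch (toy data; no particular SAE library assumed):

```python
import numpy as np

def fraction_of_variance_explained(acts, recon):
    """acts: (n_samples, d_model) activations; recon: SAE reconstructions.
    Returns 1 - Var(residual) / Var(activations), i.e. 1 minus the
    fraction-of-variance-unexplained (FVU) style metric."""
    residual = acts - recon
    return 1.0 - residual.var() / acts.var()

# Toy illustration: a reconstruction that captures most but not all structure.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 64))
recon = 0.9 * acts + rng.normal(scale=0.1, size=acts.shape)  # imperfect copy
print(f"variance explained: {fraction_of_variance_explained(acts, recon):.2f}")
```

Note that this metric can't distinguish the question the note raises: whether the unexplained residual is noise, or is where the interesting thought lives.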
See Sam Altman’s remarks around the time of the GPT-5 launch, that they’re focusing on cost & speed rather than capability. See my notes about this in the 36 Thoughts post.

Siméon (@Simeon_Cps) posted at 5:52 PM on Fri, Aug 15, 2025: I have this theory that we are in a period of increasing marginal utility of capabilities. GPT-2 to GPT-3 jump was a bigger jump than 3 to 4, which was bigger than 4 to 5. But the utility jumps have been increasing. My core thesis for why is that most use cases are bottlenecked (https://x.com/Simeon_Cps/status/1956519485277684219?t=Y9sBafUnmKmEhxpCqQurMw&s=03)

https://peterwildeford.substack.com/p/gpt-5-a-small-step-for-intelligence

Inference keeps getting more efficient, but inference costs aren’t falling, because we’re taking the gains in more intelligence – but not much more intelligence. A reminder that scaling laws are logarithmic. https://ethanding.substack.com/p/ai-subscriptions-get-short-squeezed

Google’s Genie 3 is touted as “providing a training playground for robotics and agents”, but this strikes me as an absolute recipe for reward hacking?

Related to the fuss regarding Genie 3, I am entirely unimpressed by “look at this one-off funny cool thing an AI did”. I have a lot of respect for Ethan Mollick, and he’s often very insightful, but he also posts a lot of this stuff – one-shot prompts to the latest AI to create a starship simulator or something – and I have absolutely no time for this. Generative AI is very good at creating an example of a thing, and much, much worse at creating the specific thing you want/need, satisfying specific constraints. These one-shot demos are reward hacking against the Twitter algorithm.

Everyone talks about agents. Maybe because agents need context, and the best hope is to let them handle a larger task so they can assemble their own context? Application design is still unimaginative and shallow, so this is all model developers have to aim for.

Regarding https://x.com/ben_j_todd/status/1934284189928501482: [me] Error rate may be a good fit for the data, but is it convincing as a model of the actual phenomena at play? It seems to me that models are lacking some fundamental capabilities that will be needed before they can independently manage large, complex projects, and that "reducing the error rate" is not a good way to highlight those missing capabilities. I can't resist quoting something I wrote last year: Lumping all this under “reliability” is like saying that all I’d need to start playing in the NBA is to be “better at basketball”: in reality, I would have to acquire a wide range of skills, many of which are not feasible given my current architecture.

I _really_ hope we don't head toward advertising as a primary revenue source for the tools people mostly use to interact with AI. It creates such awful incentives, and burns vast resources in zero-sum games (competitions between advertisers, and against consumers trying to cling to whatever shreds of attention span they still retain).

See August notes from Beth Barnes; she had a lot of relevant thoughts (credit her).

Sure, for some reasonable definitions of "error", everything becomes possible if you can push the error rate close enough to 0.
My intuition is that LLMs go wrong in all sorts of very different ways – citing incorrect facts, committing logic errors, failing to focus on a relevant piece of information in their context window, making poor high-level choices about how to approach a problem, failing to come up with a key insight that makes a problem tractable – and lumping them all under the term "error" obscures a lot of important details and makes it harder to predict the future. Like saying that all hunter-gatherer tribes needed to do to create modern civilization was "grow their economy".

In his Latent Space interview, Greg Brockman said that GPT-5 is the first model they've trained on messy real-usage coding tasks, as opposed to benchmarks. I wonder if they get the most detailed data internally, and thus might do best there?

From http://henrikkarlsson.xyz/p/attention: Michael Nielsen writes about this in an essay where he describes the experience of pushing himself to go deeper than usual in understanding a mathematical proof: I gradually internalize the mathematical objects I’m dealing with [using spaced repetition]. It becomes easier and easier to conduct (most of) my work in my head. [. . .] Furthermore, as my understanding of the objects change – as I learn more about their nature, and correct my own misconceptions – my sense of what I can do with the objects changes as well. It’s as though they sprout new affordances, in the language of user interface design, and I get much practice in learning to fluidly apply those affordances in multiple ways. [. . .] After going through the [time-consuming process of deeply understanding a proof,] I had a rather curious experience. I went for a multi-hour walk along the San Francisco Embarcadero. I found that my mind simply and naturally began discovering other facts related to the result. In particular, I found a handful (perhaps half a dozen) of different proofs of the basic theorem, as well as noticing many related ideas. This wasn’t done especially consciously – rather, my mind simply wanted to find these proofs.

Chris Olah writes: Research intimacy is different from theoretical knowledge. It involves internalizing information that hasn’t become part of the “scientific canon” yet. Observations we don’t (yet) see as important, or haven’t (yet) digested. The ideas are raw. (A personal example: I’ve memorized hundreds of neurons in InceptionV1. I know how they behave, and I know how that behavior is built from earlier neurons. These seem like obscure facts, but they give me powerful, concrete examples to test ideas against.) Research intimacy is also different from research taste. But it does feed into it, and I suspect it’s one of the key ingredients in beating the “research taste market.” As your intimacy with a research topic grows, your random thoughts about it become more interesting. Your thoughts in the shower or on a hike bounce against richer context. Your unconscious has more to work with. Your intuition deepens. I suspect that a lot of “brilliant insights” are natural next steps from someone who has deep intimacy with a research topic. And that actually seems more profound.

Notes from Prof. Cris Moore of the Santa Fe Institute on Machine Learning Street Talk: I'm listening to Machine Learning Street Talk, and the presenter is talking about how the real world contains a lot of structure that intelligence can take advantage of: real-world problems are not usually adversarially hard, or even random.
They have a lot of structure, and a lot of problems which can be proven mathematically to be very difficult in theory are often fairly tractable in practice. I might encapsulate this as a converse to the idea that "the models just want to learn": we might say that the world just wants to be learned.

Human puzzle solvers find partial knowledge in Sudoku: you might note that a certain cell must be either a two or a seven, or that the three in this box must be in one of these two locations. Similarly, I could talk about forms of partial knowledge that arise when doing a Battleship problem. The host used the nice phrase "epistemic foraging". People can come up with different frameworks for looking at a problem: when doing a pentominoes puzzle, do I think about which piece can fit here, or do I think about where a particular piece can go? We come up with new strategies on the fly. We come up with heuristics for deciding which strategy to apply next, and where. I wonder whether AI theorem provers are doing any of that. We see a lot of examples of puzzle solving where AIs can't do it at all, although many of those are confounded by being very two-dimensional / visual; I wonder if people have found examples of puzzles that are strictly one-dimensional text and still exercise this kind of skill?

We generate new mental toolkits and notations and representations and techniques for ourselves on the fly. I don't think AIs do anything like that right now, so there's a whole array of skills involved there, with maybe not much training data. Today's models, to the extent that they can solve problems (IMO problems, for example), really seem not to be doing much of that. I should look at some of the proofs they've generated, but my understanding is that they're not doing anything like that. Or maybe we can't tell, because it would be in the reasoning traces, which I don't think we have access to; in any case, I haven't seen examples reported of them doing things like that. Instead, they just seem to have very good intuition, presumably grounded in having absorbed such a breadth and depth of training data. To some extent, that means they're coasting off of past human work. On the other hand, clearly to some extent it means they're succeeding. But it raises the question: are they going to be unable to move many (most?) fields forward in any interesting way, because you can't do that by coasting? Or can they do it by mixing and matching the existing training data in a way that humans almost never can? There was that example of a man uncovering something about networks of cell regulation by asking a model (it might have been o3) to think about it as an analogy to a Battle Royale game, making a connection to a very different field. Or maybe, through a process similar to training a large model and then distilling, the models will be able to flail around inefficiently to generate new insights, but then make very, very good use of those insights by incorporating them into their training data going forward. Can we measure and assess any of this, to judge where things will go from here?

Maybe relevant? https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion Argues that rapid progress on benchmarks is very much a reflection of explicit hill-climbing: In fact it is a common job at AI laboratories to make new data that looks very close to popular evaluations.
These laboratories can’t train on the test set directly for basic reasons of scientific integrity, but they can pay thousands to millions of dollars for new training data that looks practically identical. This is a very common practice and makes the hill-climbing on evaluations far less extraordinary. AI capabilities in domains we are measuring aren't accelerating, they’re continuing. At the same time, AI’s abilities are expanding outwards into new domains. AI researchers solve domains when we focus on them, not really by accident. Generalization happens sometimes, but it is messy to track and argue for. There are many explanations for why this will be the case. All of them rely on the complexity of the environment we are operating modern AI in being too high relative to the signal for improvement. The AI systems that furthest exceeded human performance in one domain were trained in environments where those domains were the entire world. AlphaGo is the perfect rendition of this.

Notes from MLST, The Mathematical Foundations of Intelligence (https://podcasts.apple.com/us/podcast/the-mathematical-foundations-of-intelligence/id1510472996?i=1000741165541). A mix of ideas from the talk and my own ideas: Muse about the fundamental concept or metric for intelligence: some combination of compression, parsimony, predictive ability, efficiency at both training and inference time, and ability to successfully pursue goals in the world. Intelligence is compression – parsimonious and self-consistent representations (why the latter?). Insight and abstraction… which are just forms of compression? Lossy compression implies a fitness metric: compression is only well defined in the context of a metric for how good a fit the reconstructed artifact is to the original, i.e., whether it retains the attributes that we care about. For example, if I'm trying to predict in my head what will happen in a certain situation, I don't care about every detail of the signals that will reach my retina, or the precise state of every atom in the room; there are higher-level system state parameters that are what I actually care about predicting. Ability to successfully pursue a goal. Domain-specific memorization. Generalization, overfitting; domain specificity. Efficiency of usage of various resources at both training and inference time; curiosity, guided exploration, etc. Interviewer: efficient search over Turing machine algorithms. Inductive bias.

[Child Page: Will LLMs Generalize?] My model of what is going on with LLMs (LW) Have LLMs Generated Novel Insights? (LW) 2 Big Questions for AI Progress in 2025-2026 (Helen Toner)

[Child Page: How Deep are The Remaining Rabbit Holes?] [Theme is: we don’t know how deep AI intellectual capacity is, and we don’t know how many tasks need that. All these things that our dinner identified as missing from current AIs: how hard will they be, and might some of it be amenable to prompting + scaffolding + a touch of RL?]

Build on https://lemmata.substack.com/p/what-i-wish-i-knew-about-frontiermath to write about the need to measure creativity, and fuzzy capabilities more generally (also build on my initial writeup from our first dinner). Write a post questioning how real-world tasks break down into background knowledge, creativity, and execution. Highlight the questions about the creativity required for the FrontierMath problems. My suspicion is that a significant chunk of FrontierMath problems can be solved by applying advanced mathematical techniques in relatively straightforward ways.
[Child Page: Will LLMs Generalize?]
My model of what is going on with LLMs (LW)
Have LLMs Generated Novel Insights? (LW)
2 Big Questions for AI Progress in 2025-2026 (Helen Toner)
[Child Page: How Deep are The Remaining Rabbit Holes?] [theme is: we don't know how deep AI intellectual capacity is, and we don't know how many tasks need that? all these things that our dinner identified as missing from current AIs: how hard will they be, and might some of it be amenable to prompting + scaffolding + a touch of RL?]
Build on https://lemmata.substack.com/p/what-i-wish-i-knew-about-frontiermath to write about the need to measure creativity, and fuzzy capabilities more generally (also build on my initial writeup from our first dinner). Write a post questioning how real-world tasks break down into background, creativity, and execution. Highlight the questions about creativity required for the FM problems. My suspicion is that a significant chunk of FrontierMath problems can be solved by applying advanced mathematical techniques in relatively straightforward ways. If anything, this might obscure their difficulty to humans: most people don't have the right knowledge, and without the right knowledge the problems seem impossible; but with the right knowledge, they aren't so bad. https://amistrongeryet.substack.com/p/were-finding-out-what-humans-are/comment/95456981

Ethan Mollick (@emollick) posted at 11:06 PM on Tue, Feb 18, 2025: "As I have written many times, AI is not naturally a great tutor; it offers explanations but, without proper prompting, tends to tell you answers rather than engaging you in the process of understanding. I find explanations on demand very promising, but they aren't there yet." (https://x.com/emollick/status/1892108434171949321?t=JRIExSHeGTJitIUwXGVCDQ&s=03)
Taren: feels like he does say that this is basically a prompting/product problem?
Steve: I read this as "the student has to find the right questions to ask". Other parts of his tweet sound discouraging ("not naturally a great tutor", "they aren't there yet"). But I guess it is ambiguous.

From https://erictopol.substack.com/p/when-doctors-with-ai-are-outperformed: "When A.I. systems attempted to gather patient information through direct interviews, their diagnostic accuracy plummeted — in one case from 82 percent to 63 percent. The study revealed that A.I. still struggles with guiding natural conversations and knowing which follow-up questions will yield crucial diagnostic information."
Taren: this is a great example, and/but I wonder how much of this is a product/prompting problem vs a capabilities problem... feels like a naive user of AI setting up the interview process, vs an expert user, could have a very different outcome here -- and hard to say which type it was in this case?

The difficulty our first-dinner participants had in deciding whether a capability gap can be met using prompting, data/scale, or architectural changes. Taren notes that all three routes could be viable, on different time scales.

Sigal Samuel (@SigalSamuel) posted at 10:00 AM on Fri, Feb 21, 2025: "The big AI story of the past 6 months is: Companies now claim that their AI models are capable of genuine reasoning. Is that true? I found that the best answer lies in between hype and skepticism. https://t.co/b3ZuMjO0ZJ Thanks to @ajeya_cotra @RyanPGreenblatt @MelMitchell1" (https://x.com/SigalSamuel/status/1892997861886820474?t=x29SrzUJR8dq9mwmmnaiDQ&s=03)

https://arxiv.org/abs/2410.06992: "Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments. We refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious patches due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before LLM's knowledge cutoff dates, posing potential data leakage issues."
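As a concreteness check on what "filtering out these problematic issues" does to a headline number, a minimal sketch (names are mine; the paper's exact bookkeeping may differ):

```python
# Recompute a benchmark's resolution rate after an audit flags problematic
# instances. All arguments are sets of instance ids.

def resolution_rate(resolved: set, universe: set) -> float:
    return len(resolved & universe) / len(universe)

def filtered_rate(resolved: set, universe: set, leaked: set, weak: set) -> float:
    # Drop instances whose solutions leaked via the issue text, and those
    # whose test suites are too weak to verify a patch, then re-score.
    clean = universe - leaked - weak
    return resolution_rate(resolved, clean)
```

If the flagged instances are disproportionately the "solved" ones, which is what the audit found, the numerator shrinks much faster than the denominator, which is how 12.47% can fall to 3.97%.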
Review https://epochai.substack.com/p/ai-progress-is-about-to-speed-up. Note reference to Moravec's paradox. Also note expectation that capabilities which are weak today will continue to be weak.
https://aidanmclaughlin.notion.site/reasoners-problem
https://x.com/AndrewCritchPhD/status/1891887600102932629?t=0UjiKsyU97miKXKTPaKllg&s=03
[Zvi] Have o1-Pro give you a prompt to have Deep Research do Deep Research on Deep Research prompting, then use that to create prompt templates for Deep Research. The results are here in case you want to try the final form.
https://news.ycombinator.com/item?id=43169586: "Most of the time, most of the devs I know, including myself, are not really creating novelty with the code itself, but with the product."
A recent Matt Levine column talks about how humans do better than current AIs and traditional ML in out-of-distribution situations. See the section that ends with: "The stereotype about algorithmic trading and investing is something like 'algorithms tend to learn on historical data and are poorly suited to dealing with regime changes, while humans are more flexible and have better gut instincts to handle sharp breaks with the past.' I have often been skeptical of that stereotype. Humans also learn on historical data, and less of it: If you've been trading for 10 years, in some sense you only really have access to 10 years of market history, while a computer can hold the last 200 years of data in its mind. But Sasha Gill makes me rethink that. She has roughly zero years of market history, she barely knows what a yard of cable is, but she's keeping an eye on Truth Social. She's handling the regime change. If you are a computer trained on recent historical data, a sharp increase in FX volatility might catch you flat-footed. If you're a human trader straight out of university, you'll be like 'ah yes time to fire up Truth Social.' The algorithm has never even heard of Truth Social! Good time to be a human FX trader."
https://x.com/littmath/status/1898461323391815820
First section of https://www.theintrinsicperspective.com/p/ai-plays-pokemon-but-so-does-teslas
https://x.com/slatestarcodex/status/1896457193215742274
Ethan Mollick (@emollick) posted at 11:24 AM on Sun, Mar 09, 2025: "If it turns out LLMs are only capable of recombinatory innovation (finding novel connections among existing knowledge), that would still be very…"

=== Faster Horses - by Harry Law - Cosmos Institute (https://blog.cosmos-institute.org/p/faster-horses) ===
Faster Horses: Intelligence flows from systems and singletons. By Harry Law, Jan 02, 2026.
[Image: Paolo di Dono, called Uccello, The Hunt in the Forest, c. 1465–1470]
"If I asked people what they wanted, they would have said faster horses." The idiom, a widely circulated but likely apocryphal line attributed to Henry Ford, stresses the distance between our ability to picture the future and our ability to make it real. It reminds us that technologies loosen the constraints that shaped past expectations, that deeper shifts usually enact variations in kind as well as magnitude. "Faster horses" is a shorthand for folk logic that seems bulletproof at the time but quaint in hindsight.
Television as radio with pictures, film as photographed theater, early mobile phones as portable landlines, and the internet as a digital library were all kinds of faster horses. They tell us that big swings don't often play well with existing categories, and that new language, heuristics, or classifications are often needed to make sense of them. Today, many of those wondering about the downstream impact of thinking machines are on the lookout for AI that can function as a "remote drop-in worker." This refers to a system that replaces a human employee, in essence, by doing roughly the same things under the same conditions. Here, the future appears as a more seamless version of the present rather than something that dramatically changes the shape of work. The idea flows from the observation that the majority of jobs in the information economy revolve around making computers do what we want. Word processing, desk research, data analysis, creating presentations, running marketing campaigns and many other tasks are all the end product of keyboard strokes and cursor movements. This is why some long-time AI watchers reckon Claude Opus 4.5, especially its instantiation within Claude Code, can reasonably be described as an early realization of Artificial General Intelligence (AGI). The same might eventually be true of humanoid robots, especially given they can slot into existing infrastructure without costly redesign, but our focus here is solely on knowledge work. As others have pointed out, the response to a common AGI litmus test (a system that can outperform humans in most economically valuable work) turns on what we categorize as "economically valuable work." If we define that as "stuff done on a computer," then it's plausible that one day soon the models will cross that threshold (if Claude 4.5 Opus hasn't already). And if a model can be said to be generally capable, then the remote drop-in worker shouldn't be too far behind. Whether a single model can do a job in isolation is a useful question to ask, but it doesn't tell us much about how such systems, interacting with many people and agents of their own, might rearrange patterns of coordination and the shared assumptions that guide them. In some ways, the conservative bet is that the drop-in worker is a stronger account of our present than it is of our future. Technologies that matter rarely honor the roles we assign them. If the future is anything like the past, the drop-in worker may prove to be a faster horse: a story that made sense before the true nature of the agent economy became visible.

The Wisdom of the Crowd

Traditional accounts of AGI development often describe the emergence of an isolated system capable of completing the vast majority of cognitive tasks, sometimes referred to as a "singleton". An alternative scenario, now seriously considered by AI developers, imagines that capabilities may be manifested through the coordination of "sub-AGI individual agents" with complementary skills and affordances. This scenario concerns an ecology of semi-specialized agents whose combined behavior outstrips anything they could do alone (and tallies up to something that we could describe as AGI at a high enough level of abstraction). You might have a code agent that builds, a negotiation agent that handles scheduling or purchasing, and a compliance agent that checks your work. On top of these sits a manager that breaks goals into subtasks and shunts each to the right agent for the job. We state an objective, the system spins up a network of agents, they pass data between them, and a synthesis function presents the output of the collective for review.
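A minimal sketch of that fan-out-and-synthesize pattern (the specialist roles and the call_llm helper are hypothetical stand-ins, not any particular vendor's API):

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(role: str, prompt: str) -> str:
    """Hypothetical model call; wire up a real provider here."""
    raise NotImplementedError

SPECIALISTS = {
    "architect":  "Draft a design for: ",
    "red_team":   "Attack the design implied by this objective: ",
    "compliance": "Flag regional data-law issues for: ",
}

def run_ecology(objective: str) -> str:
    # Manager: fan the objective out to all specialists in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(call_llm, name, task + objective)
                   for name, task in SPECIALISTS.items()}
        reports = {name: f.result() for name, f in futures.items()}
    # Synthesis: one more call integrates the (possibly conflicting) reports.
    merged = "\n\n".join(f"[{name}]\n{text}" for name, text in reports.items())
    return call_llm("synthesizer", "Integrate these reports:\n" + merged)
```

The structural point is the parallel, adversarial fan-out: the red-team output can contradict the architect's, and the synthesis step must resolve that tension rather than accept a single linear stream.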
Imagine launching a new software feature. The drop-in worker functions like a high-speed freelancer insofar as it writes code, pauses to check for bugs, and writes documentation sequentially within a single stream. It is a linear acceleration of a human workflow. The agent ecology, however, behaves more like a stack of mini-organizations. When the objective is stated, an "architect agent" drafts the structure while a "red team agent" simultaneously attacks that design to find security flaws before a line of code is written. A "compliance agent" cross-references regional data laws in the background. These agents operate in parallel to create an adversarial loop where the output is the sum of many small interactions. The result is an ecosystem capable of the kind of concurrent processing that individual minds, biological or synthetic, may struggle to achieve by themselves. But this is only a partial picture. The share of human-agent and agent-agent interactions in the economy will increase over time, with agents engaging in price negotiation, placing orders from one another, coordinating supply and demand, and even rating each other to assign trustworthiness scores. In some ways, the patchwork AGI thesis is another episode in a long-running story about how intelligence behaves at scale. Markets outperform planners because knowledge never exists in concentrated or integrated forms, but as incomplete and contradictory perspectives dispersed across individuals. Hayek reminds us that "planning" happens all over the place through individual agents, which is why he distinguishes it from "economic planning" that deals with state-backed forms of enterprise management. The agent economy doesn't represent a toss-up between planning or ad-hoc action but rather an older question about whether planning ought to emanate from within or from without. Aristotle raised the question that still haunts proponents of collective intelligence: can the many, combining their partial virtues, outperform the excellent few? In Book Three of Politics, he writes: "For it is possible that the many, though not individually good men, yet when they come together may be better, not individually but collectively, than those who are so, just as public dinners to which many contribute are better than those supplied at one man's cost; for where there are many, each individual, it may be argued, has some portion of virtue and wisdom, and when they have come together, just as the multitude becomes a single man with many feet and many hands and many senses, so also it becomes one personality as regards the moral and intellectual faculties. This is why the general public is a better judge of the works of music and those of the poets, because different men can judge a different part of the performance, and all of them all of it." For Aristotle, groups become smarter when they successfully combine different aspects of competence into a single body. Consider a jury that sees people with different experiences and biases pool their judgment to reach a fairer conclusion than any juror might in isolation. Or England's common law, where centuries of small decisions by judges produce a legal order with more adaptability than one made by decree.
The same is true of Wikipedia, peer review, or nimble companies. In each case, the quality of the outcome rests on a kind of distributed deliberation wherein perspectives clash, revise, correct, and eventually settle into a stable state. It echoes the Athenian assembly and the medieval disputatio, both of which treated the good we call judgment as the product of structured disagreement. American writer Howard Rheingold coined the term "smart mobs" to describe groups of people who are able to organize and coordinate quickly through the use of mobile communication technologies like the internet or mobile phones. The term "mob" is deliberately ambivalent, a framing he uses because of its darker connotations (he explicitly notes mob mentality can be for good or ill). Rheingold thought smart mobs worked because low-cost communication let individuals share context and act in concert without central control. These groups represented an idealized version of accelerated coordination built from minuscule signals that could be aggregated over huge numbers of agents. The mob framing reminds us that coordination capacity increases faster than deliberative capacity, and as a result, the key variable becomes governance of the communications substrate. But mobs aren't smart by default. There are a whole set of coordination problems that flow from distributed decision-making, from free-riding (where individuals benefit from a group's effort without contributing to it) to information cascades (where people copy others' choices even when their own judgment points elsewhere). Many of us recognize some of these problems when we spend too much time on social media. We see outrage spread through networks faster than facts, and know how easily a crowd can be steered by sentiment rather than the hard work of judgment. We could say that groups become dumb when they fail to properly synthesize knowledge, and they become smart when divergence is preserved and integrated. Whether or not we benefit from the wisdom of the crowd often depends on the structures that keep the mob in check. Markets do this with prices and labs with peer review. Rheingold might say that smart mobs emerge when communication structure and incentives reward decentralized coordination rather than herd behavior.

Society of Mind

In the 1980s, the AI researcher Marvin Minsky wrote about what he called the "society of mind." What he meant was that unified intelligence is a loose federation of smaller processes, each narrow, each fallible, yet together capable of producing something that looks like coherent thought given enough altitude. For Minsky, intelligence emerges from many mindless "agents" coordinated in special ways, with the mind employing something like a computational and explanatory strategy whose power is a product of messiness, cross-connection, coordination, and resolution. Today's AI models demonstrate unified intelligence at two levels: as a byproduct of statistical learning and in the way models are housed within larger constellations that we refer to as "systems." Transformers likely generalize because they compress huge corpora into representations that let them improvise solutions on the fly. Intelligence is in some sense a property of compression plus scale, an analog of Aristotle's crowd insofar as it concerns many partial signals integrated into a single effective whole. As for the systemization of models, we can view each as a constellation of individual expert functions like tooling or multi-modal functionality.
But why stop there? If we accept that AI, like all intelligence, benefits from the interactions between discrete units, it follows that its capability should also be treated as a property of a larger constellation in which many systems operate together. Variance creates productive tension as models surface alternative interpretations and explore distinct solution paths. When those paths are combined through orchestration layers, tool use, or various other kinds of agent frameworks, the result is a system that searches the problem space more effectively than a lone model. Once multi-model systems interact with one another — coordinating, passing intermediate results, or checking each other's claims — a kind of higher-order intelligence bubbles up from the sum of interactions across several layers. More powerful models are great, but superior ecologies are better. People each hold fragments of truth, most of it tacit and hard to articulate, which is why spontaneous order tends to get the better of best laid plans. Polanyi might remind us that a drop-in worker presumes that competence is formalizable into explicit tasks and checklists. His work tells us that competence lives partly in the realm of tacit knowledge, that "dropping in" workers will face the same context problems faced by central planners the world over. Of course, recent progress in AI development does typically try to provide models with the context they need to work effectively (specifically through the use of reinforcement learning techniques to make models good at human work in human settings). We might also say that, even if multi-agent systems matter internally for model cohesion, their deployment could still take the form of a remote-worker analog. Things like permissions, accountability, compliance, budgeting, and change-management favor inserting agents into existing workflows as opposed to redesigning them from the ground up. These objections are useful, but they don't allow us to skirt the core problem with the "remote drop-in worker" metaphor: that it treats intelligence as solely a property of individuals. It presumes the unit of analysis is the solitary agent carrying out tasks one after another, when everything we know about complex work suggests otherwise. Real capability comes from the knots of relationships, feedback loops, constraints, and opportunities that bind us together. First within models, then as agent systems, then eventually as agent-agent systems. Collective intelligence is prefigured by how information moves through a system and how the residue of experience accumulates across many small decisions made by each of us. The remote drop-in worker may prove to be a transitory moment at best and a category error at worst, one that treats AI as an incremental addition to familiar workflows rather than a force that will reshape the nature of those workflows. That is "faster horses" thinking. We're projecting today's limitations onto tomorrow's world and overlooking the fact that new capabilities alter the constraints that make what happens today seem natural. More accurate accounts tend to lead somewhere else, in a form we often only recognize with the benefit of hindsight.
Comments:
AKR: Thank you for writing this great essay! IMHO the 'singleton' vs. 'ecology' debate is a critical question. This article correctly identifies that real capability and intelligence lives in the knots of relationships between discrete entities, not just the solitary agent. If we treat AI as an incremental addition to familiar workflows, we are merely building faster horses. The real change might be when we redesign coordination itself. Eager to see more from Cosmos Institute on the system dynamics lens.
Richard Reisman: Outstanding framing of the core reasons why group intelligence is what leads to human flourishing and will lead to AGI that works. The horse metaphor also extends to the difference between "centaurs" (human heads controlling horse bodies) and - what we may be headed toward without more care - "reverse-centaurs" (AI heads controlling human bodies). The same thinking drives some related ideas on how personal AIs can serve individuals (and communities of individuals) and negotiate how they interact with business and institutional AIs (https://www.techpolicy.press/new-perspectives-on-ai-agentiality-and-democracy-whom-does-it-serve/). It also parallels the reasons why monolithic control of social media feed attention agents fails to serve humans, and why open interoperability ("middleware") must work not only for individuals, but for the "social mediation ecosystem" of communities that always guided human discourse, but is now being disrupted by the dominant platforms - a cautionary tale for AI (https://www.techpolicy.press/three-pillars-of-human-discourse-and-how-social-media-middleware-can-support-all-three/).

=== Tweet by Steve Newman (https://x.com/snewmanpv/status/2005329341845303612) ===
Tweet by Steve Newman: Underlying this trend is the fact that AIs are now entering the zone of real utility. Progress has not so much accelerated as become more significant. In one year, a 3-year-old learns more than a 17-year-old, but the 17-year-old's labor gains more economic value. https://t.co/UrqiJEmitb

=== Insights into Claude Opus 4.5 from Pokémon — LessWrong (https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon) ===
Insights into Claude Opus 4.5 from Pokémon, by Julian Bradshaw, 9th Dec 2025 (12 min read).
[Image credit: Nano Banana, with some text provided.]
You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet! [1] This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness [2] and the relatively hands-off approach of its creator, David Hershey of Anthropic.
[3] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up. But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement. Though hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year. [4]
Improvements
Much Better Vision, Somewhat Better Seeing
Earlier this year, LLMs were effectively close to blind when playing Pokémon, with no consistent ability to recognize and distinguish doors, buildings, trees, NPCs, or obstacles. For example, this screen:
[Image: Choosing your starter Pokémon in Professor Oak's lab.]
…at the time of Sonnet 3.7 confounded every LLM I tested it on, all of whom had difficulty consistently identifying where the pokeballs were, or figuring out which pokemon they wanted, sometimes even accepting the wrong starter by accident. Opus 4.5 made this look like the trivial problem that it is. [5] In general, Opus 4.5 no longer has any trouble finding doors, and recognizes key buildings like gyms, pokemon centers, and marts the moment they appear on-screen. Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never "the red-hatted NPC", and he can pick out gym leader Erika from a lineup.
[Image: Erika is second from left.]
A previous Claude, the only other Claude to ever reach this gym, failed to recognize that the gym leader was here and kept insisting it had beaten every trainer. Eventually it left and never came back. The new vision is hardly perfect, though, suffering in proportion to whether or not Claude is paying attention, and whether or not Claude is willing to believe his own lying eyes.
Attention is All You Need
On the first point, Claude very frequently seems to simply ignore things in his field of vision if he's not "looking" there. Even worse, in key moments when he's close to his current goal, he seems to rely on his vision less, and even ignore it entirely sometimes.
[Image: Claude in the infamous Team Rocket Hideout Hell.]
The above represents the ur-example of Claude "blindness". Those two left-pointing arrows ("spinners") to his left represent the only potential path to progress, but he knows his goal is to the right and even thinks he sees it. Claude visited this exact spot dozens of times and fewer than 5 times seemed to realize there were spinners to his left. Also, he clearly had trouble distinguishing the green boxes from the spinners and routinely tried stepping onto the boxes–a mistake that only materialized when he was close to the goal. He had no apparent problem telling the difference much of the rest of the time. Here's another example that will be traumatic to the Twitch viewers:
[Image: Claude in Celadon City, trying to find the gym.]
Here the tree which must be CUT to progress to the gym is clearly in view, but Claude is focused on looking for an open pathway and shows no sign of seeing it, walking right by–yet just minutes later he will spot it on the way back, having given up looking for an open pathway. [6]
The Object of His Desire
On the second point, Claude is noticeably much more prone to hallucinating or misidentifying objects as what he's looking for if he really wants it to be there.
The classic example here is Claude's search for the elevator in the Team Rocket Hideout:
[Image: Claude in another part of the Team Rocket Hideout, not anywhere near the elevator he's looking for.]
[Image: Claude's reasoning about the above screen.]
It's been hours and Claude has grown a bit desperate. This is also not the only time he hallucinates the elevator. For example, in the exact same spot we discussed earlier, which is actually quite near the elevator:
[Image: Claude still in the infamous Team Rocket Hideout Hell.]
[Image: Claude's reasoning about the above screen, this time.]
Now, the elevator is actually in that direction, and Claude even saw it (for real) earlier. But he's become so fixated on it that he mistakes the gray wall for the elevator despite really knowing better. And before you judge Claude too harshly, the elevator he's searching for looks like this:
[Image: The dark pink/red carpet at the bottom leads to the elevator.]
The elevator itself has its own separate screen which is more obviously an elevator, but there's no clear "elevator door" sprite as an entrance, just that carpet, which you have to remember leads to the elevator scene. Nevertheless Claude can identify the carpet as the entrance to the elevator.
A Note
Let me be clear: I'm using the language of intentionality, as if Claude is choosing to ignore things. I don't think that's the case. I think his attention mechanisms actively screen out what they think is irrelevant, rendering the parts of the model trying to make decisions effectively blind to it. Humans have built-in attention mechanisms, but they are clearly better built than this, even if they do have similar failure modes in extremis.
Mildly Better Spatial Awareness
I don't want to oversell this one. Claude's understanding of how to navigate a 2D world is clearly still below that of most children, but there are improvements:
- When trying to reach a door in front of a building and finding his path blocked from a particular direction, Claude will now try to walk around the other way.
- Claude can now maintain an awareness (via notes) of where parts of a building or city are relative to each other and perform simple navigation tasks. Previous versions constantly lost track of what wasn't immediately in view.
- Claude can now perform some basic in-out geometric reasoning: leaving a building from the top of the room is likely to push me out the top of the building, elevators on different floors are probably in the same location on the floor, etc.
Better Use of Context Window and Note-keeping to Simulate Memory
Another obvious improvement to Claude's capability is improved note-keeping and memory of context. Previous versions of Claude such as Sonnet 3.7 showed little sign that they "recalled" anything from more than a few messages ago, despite having much of it in context. And while they were diligent notetakers, they only rarely seemed to read their own notes–and when they did, it was clearly in a stochastic manner, to the point that chat liked to speculate about whether Claude would read his notes this time and what part of his notes he would read. Opus 4.5 is much, much better at both monitoring context and using notes, so much so that much of the time he manages to maintain a passable illusion of actually "remembering" the past 15 minutes or so, referencing recent events, evading past hurdles, and just generally maintaining a much more coherent narrative of what's going on.
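The note-keeping mechanism described here amounts to an agent whose only durable state is a notes file re-injected into every prompt. A minimal sketch (model_step and the environment hooks are hypothetical):

```python
import json
import pathlib

NOTES = pathlib.Path("notes.json")

def model_step(observation: str, notes: list[str]) -> tuple[str, list[str]]:
    """Hypothetical model call: returns (action, updated notes)."""
    raise NotImplementedError

def agent_loop(get_observation, apply_action, max_steps=1000):
    # The notes file is the agent's only memory between steps; everything
    # else must be reconstructed from the current observation each time.
    notes = json.loads(NOTES.read_text()) if NOTES.exists() else []
    for _ in range(max_steps):
        action, notes = model_step(get_observation(), notes)
        NOTES.write_text(json.dumps(notes))  # persist before acting
        apply_action(action)
```

The failure mode follows directly from the structure: one wrong entry in notes poisons every subsequent step until something forces a correction.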
For longer-term memory, Claude must blatantly rely on whatever he happens to have written down in his notes, and he does a much better job of writing and reading his own instructions, routinely repeating past navigation tasks successfully and competently. Nowadays, if Claude does something and writes down how to do it, he can do it again. It is difficult to overstate how much this contributes to a smoother, faster game flow. Claude can maintain navigational focus for extended periods, explore simple areas competently, and as long as the notes are good and his assumptions are sound, things flow smoothly. Of course, sometimes things go haywire…
Self-Correction; Breaks Out of Loops Faster
…but more than before, things get fixed quickly. This is difficult to quantify, and I believe it stems heavily from a much better ability to notice when events are repeating within his context window. Claude more frequently and consistently notices when he's trying something that clearly isn't working and will try to vary it up. Coupled with his improved spatial reasoning, navigation tasks that took previous iterations days or weeks of trial and error have almost breezed by: Viridian Forest and Mt. Moon were relatively simple affairs, only a few loops around Vermilion City were necessary before he pathed right to the dock, etc. It's not all smiles and roses: Claude is still much slower than a human would be, and not every puzzle gets solved breezily. Nor has Claude deduced key facts like "walking in front of a trainer triggers a fight", instead treating these as effectively random encounters. Still, it's something.
Not Improvements
Claude would still never be mistaken for a human playing the game
I'd like to tell a quick story to give readers the flavor of what it's like to watch Claude sometimes, even when he's technically accomplishing his goals with aplomb. This is the story of Claude attempting to acquire the Rocket HQ Lift Key, technically the first thing he did that no previous model had ever accomplished. Claude arrives in Team Rocket HQ, and immediately declares the staircase next to him the elevator to Giovanni, for which he needs the Lift Key. Claude then ignores the false "elevator" for hours, confident that he will be unable to "use" it, wandering the entire rest of the floor looking for the lift key or another set of stairs. Finally he chooses to try the original "elevator", finding to his surprise that it works. He writes down in his notes that the elevator doesn't need the Lift Key. Claude makes his way down two floors, encountering the infamous B3F maze. After getting stuck, he uses his only escape rope, then comes right back and, much to everyone's surprise, solves the maze in one try, writing down the solution. On B4F after the maze, he clears out the area, but to his puzzlement fails to find Giovanni. He battles the Team Rocket Grunt carrying the Lift Key but doesn't talk to them again, so they don't give him the Lift Key. (To be fair on this point, this is a very confusing nuance in the game that has trapped many kids too, and GameFreak changed this in Pokémon Yellow.) He concludes that he is mistaken, and needs to go back to the "elevator" he saw earlier on B1F and use that. After circling the elevator in frustration trying to "use" it, he concludes he's missing a Lift Key, goes back to B3F, solves the maze trivially using his notes, and acquires the Lift Key. He then returns to the "elevator".
He circles the elevator for ~50 minutes, before finally concluding it's not the real elevator but rather an "elevator/stairs" that mysteriously only connects two floors. Eventually, he amends this to "escalator", which seems to resolve the cognitive dissonance, and he happily refers to it as the escalator for the rest of the time he's in Team Rocket Hideout.
Claude Still Gets Pretty Stuck
Early on, before the Team Rocket Hideout, watchers of Claude Plays Pokemon legitimately wondered if Anthropic had solved all of Claude's main issues with the game, and perhaps everything would be smooth sailing from here on out. He had overcome some of earlier models' biggest timesinks—Mt. Moon, Viridian Forest, finding the pathway from Cerulean City to Vermilion City, finding the Captain of the S.S. Anne—without difficulty. But, critically, Claude had yet to hit the roadblocks that had permanently stopped previous models from progressing. When he reached Erika's Gym (the one with the CUT-able tree I mentioned earlier), Claude spent ~4 days, or about 8000 reasoning steps, walking in a plain circle around the top of the gym looking for a path through. What was he doing? Well, mostly trying to path through impassable walls and, knowing that CUT is involved in getting into the gym somehow, trying to cut through the gym's roof. (Source: user reasonosaur on /r/ClaudePlaysPokemon.) If there's one thing Claude does have, it's inhuman patience, [7] but even he eventually gave up, choosing to do Team Rocket Hideout first, which the game does allow you to do. Over 13,000 reasoning steps later, [8] having completed the Team Rocket Hideout and other tasks, Claude returned and almost immediately found the proper CUT-able tree and finally progressed. [9] Sometimes you just need to clear your head.
Claude Really Needs His Notes
I think the anecdotes above mostly speak for themselves in illustrating the problems bad vision, cognitive bias, and inconsistent memory still give Claude Opus 4.5. But I would highlight how utterly dependent Claude is on the quality of his notes: one incorrect assumption or hallucination embedded into a note can crater progress for days, while a well-written note can achieve human-like performance. I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
Poor Long-term Planning
It is possible to detect other reasoning issues or inhuman thinking in Claude's behavior, though these are not as crippling as the others. Claude is incredibly short-term-goal-obsessed, and seems to have no interest in ever trying to do two things at once, even in the service of the greater goal. There also seems to be little reflection about the long-term consequences of an action, even in trivial ways. Things that Claude has done that would be alien to human players: Spamming a valuable move with limited PP when there are clearly going to be many trainers ahead, without considering whether another move might be appropriate for the current fight (Ember to kill a grass type, for instance, to save Slash PP). When out of space in the inventory, Claude routinely trashes valuable items even when he could just use some of the items. Sometimes he trashes an item that could be used on the spot (e.g. a stat-boosting vitamin that could be fed to Charizard).
Leaving Charizard out against a water type that could easily be handled by the grass type on the bench, wasting PP. In fact, he just loves using only Charizard. That this is an infamously child-like strategy says something about Claude's cognitive development... or not, as Red is simple enough that just using Charizard is a mainline speedrun strategy. Though, Claude has never claimed to be following any such strategy. Not picking up a rare candy item that is blocking his path in Pokémon Tower for over an hour, because he was too focused on finding the path. In general Claude is strangely reluctant to pick up items.
Don't Forget
Just recently, GPT-5.1 completed a run of Pokémon Crystal using a fairly minimal harness in 9,454 reasoning steps across 108 realtime hours. For comparison, the original Gemini 2.5 Pro Pokémon Blue run took 106,505 reasoning steps across 813 realtime hours, and Claude Opus 4.5 is already at 48,854 reasoning steps over 300+ hours. GPT-5.1's 108 hours for Crystal is only ~3x as slow as a human player! Give a frontier LLM a solid minimap and some good prompts [10] and they're not half bad at Pokémon these days. Claude's consistently minimal harness tells us something about progress in LLM cognition, but we shouldn't forget that the past year's improvements in efficient Pokémon agent harnessing tell us something too: raw intelligence is not the only lever pushing LLM performance forward. In fact, it's not necessarily even the most effective one right now.
^ That's failures to improve by Claude Sonnet 4, Claude Opus 4, Claude Opus 4.1, and Claude Sonnet 4.5. At least in terms of story progression anyway; they have gotten faster at getting to the same story point at which they get stuck.
^ There have been a few changes: support for Surf (now that Claude can get that far), removal of a bunch of tailored prompts, and a change where spinner tiles in mazes are labeled like obstructions, as well as a related change to wait for the player character to stop spinning before the screenshot of the current game state is taken. The latter two changes in particular make the Team Rocket Hideout easier than previous runs, though they don't trivialize it. See this doc for more details.
^ For more on Pokémon agent harnesses, see this previous LW post. But tl;dr: harnesses do a lot of work to make the game understandable to an LLM, and use several techniques to address agentic weaknesses common to all LLMs. Even though the harnesses may seem fairly simple, and can (and have!) had their tools coded by the LLMs using them, game-winning harnesses have also been relentlessly optimized by human trial and error to provide exactly the support necessary to overcome current LLM limitations.
^ With some editing on my part.
^ Modern Gemini/GPT models can also handle this now.
^ This might be considered a form of inattentional blindness, the classic example of which is the guy in a gorilla suit walking through a basketball game.
^ Probably helps that he can't really remember enough of his experiences to get bored. That may be what we all do in the posthuman future, though on a longer timescale.
^ 9000 of those spent stuck on that left arrow spinner issue.
^ Technically it still took Claude a few hours to notice the CUT-able trees inside the gym that block access to Erika, the gym leader, but he noticed eventually.
^ The minimap only fills out as the LLM explores. Good prompting ensures that the LLM explores basically everything as a first priority, which means in practice the LLM always has a good map of the area it can understand. This bypasses a lot of vision and spatial reasoning weaknesses. Other key tools include an LLM-reasoning-powered navigator and the ability to place map markers.
Comments (24, sorted by top scoring):

Noosphere89 (quoting the post): "I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way." This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue). It emphasizes 2 things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be cured by unlimited resources/time; and yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future paradigm AIs.

CronoDAS: I'm curious to see how well LLMs can play Slay the Spire. I could actually try that manually and see what happens.

Tuomas Tynkkynen: Neuro-sama (the LLM-based AI VTuber) has beaten the game some time ago. As the code isn't open it's not possible to confirm whether the StS AI was done with LLMs though. Would definitely be interesting to see how frontier LLMs perform!

Julian Bradshaw: Huh. Link for those interested.

Julian Bradshaw: Please do!

CronoDAS: In Claude's first try, it played Ironclad on Ascension 1 and died to Hexaghost, the Act 1 boss. It wasn't terrible but occasionally got the mechanics a little bit mixed up. Here's the link to the chat history.

Julian Bradshaw: Thanks for following through. If anyone wants to make a proper harness in the future, I think probably the most interesting question here is if the LLM can learn from multiple playthroughs, unlocking harder difficulties, etc. Modern LLMs, maybe through notetaking?

FiftyTwo: Interesting how much it's relying on having information in training data and being able to look stuff up. I wonder how it would do with a "blind" playthrough of a game that didn't previously exist.

Nikola Jurkovic: This benchmark includes a Slay the Spire environment! When it was written, Gemini 2.5 did the best, getting roughly halfway through a non-Ascension run.
Shoshannah Tekofsky: I'd love to see the results of this :)

kave: Curated. I appreciate this post's concreteness. It can be hard to really understand what numbers in a benchmark mean. To do so, you have to be pretty familiar with the task distribution, which is often a little surprising. And, if you are bothering to get familiar with it, you probably already know how the LLM performs. So it's hard to be sure you're judging the difficulty accurately, rather than using your sense of the LLM's intelligence to infer the task difficulty. Fortunately, a Pokémon game involves a bunch of different tasks, and I'm pretty familiar with them from childhood gameboy sessions. So LLM performance on the game can provide some helpful intuitions about LLM performance in general. Of course, you don't get all the niceties of statistical power and so on, but I still find it a helpful data source to include. This post does a good job abstracting some of the subskills involved and provides lots of deliciously specific examples for the claims. It's also quite entertaining!

Random Developer: It's really interesting to compare how Opus 4.5 is performing on Pokemon, versus how it performs in Claude Code. One of the big factors here is surely vision: Gemini is one of the best visual LLMs by a wide margin, and I strongly suspect Google does lots of training on specific vision tasks. Even so, 2.0 and 2.5 underperformed human 7-year-olds on many simple tasks on which Gemini hasn't been trained. In comparison, Claude has some visual abilities, but I can't remember ever reaching for them for any serious project. And it sounds like this is affecting lots of things in Pokemon. Opus 4.5 really is quite good at programming, enough that I'm passing into the "emotional freakout about the inevitable singularity" stage of grief. But Opus lives and dies by giant piles of Markdown files. It generates them, I read them, I make small corrections, and it continues. I think this is Opus 4.5's happy place, and within this circumscribed area, it's a champ. It can write a thousand lines of good Rust in a day, no problem, with some human feedback and code review. And if your process concentrates knowledge into Markdown files, it gets better. So this is my current hypothesis: Opus 4.5 is a remarkably, startlingly capable model. It has generally mediocre vision, even by the standards of frontier LLMs, which have mediocre vision by the standards of 7-year-olds. Opus lives and dies by good Markdown files, and a little bit of human feedback here and there results in a giant effective boost in capabilities. It's kind of nice to imagine an AI future where the AIs are enormously capable, but that capability is only unlocked by a bit of occasional human interaction and guidance. Sadly, I think that's only a passing phase, and the real future is going to be much weirder, and future AIs won't need a human to say, "That weird rug thing is actually the elevator," and the AI to reply, "Oh, good observation! That simplifies this problem considerably."

FiftyTwo: One thing I found fascinating about watching Claude play is it wouldn't play around and experiment the way I'd expect a human to?
It would stand still, trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat. When I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?

Hieronym: That is in fact a defect of these models, and one of the things that makes you kind of want to scream into the screen after it, say, doesn't walk more than 20 tiles to the right, having spent days looking for an entrance that is 30 tiles to the right. Or when it doesn't explore the bottom left of a room where the answer is, because it's convinced that's not where it is. Or the fact that it's in week 2 of Silph Co., driven by the fact that it's convinced what it's looking for is not an item on the ground, and not picking up any items even when walking right next to them, when in fact its goal is an item on the ground.

peterr: GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that's the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?

Julian Bradshaw: Bunch of reasons:
- The GPT-5.1 harness is stronger; in particular it has better prompts (value from iterated prompt-writing should not be underestimated here).
- The two developers have different goals and approaches: the Gemini developer has trended towards letting the LLM make its own tools and play the game at its own speed, while the GPT developer pushes the LLM to play efficiently and beat the game quickly.
- GPT-5.1 is being run in "continuous thinking mode", which in practice means it wastes less time and compute on simple tasks and thinks harder to get difficult problems right.
Unfortunately no one has done full playthrough comparisons on the same harness for all models, due to time and expense. (All three main developers for Claude/Gemini/GPT only have access to free tokens for their particular model brand.) Perhaps this will become possible sometime next year as completion time drops? (Cost per token might drop too, but perhaps not for frontier models.)

Ivan Bercovich: We saw some examples while working on terminal bench where, if the agent is pressured with a deadline, it freaks out and acts less rationally. Some of your examples remind me of that: being close to the objective and becoming obsessed with it at the expense of intermediate steps.

Edd Schneider: I have been experimenting with having stock AI agents compete against each other in Warhammer 10th edition and have found similar problems. DeepSeek was telling me units could make shot distances that clearly were not possible by rules it knew. The "ignoring things that are in front of it" observation here is funny to me because Microsoft Co-Pilot was saying to put units so close together it was impossible. I gave it a grid map in coordinate form. It was ignoring things it put there itself. I also told DeepSeek that I had to play but knew nothing of the game, and was doing so because my friend insulted Chongqing hotpot, saying Chengdu's is better. It themed my whole Space Wolves army as soup-based and wrote text as if it was really into it. Warhound Titan, "Defender of Simmering Broth" (1100 pts). Microsoft Co-Pilot was pretty boring and accountant-like.
(I am at Ithaca College if anyone wants to participate, it is fun.)

Alephwyr: For comparison, Pokemon Red in Twitch Plays Pokemon, which was basically just decision making implemented as a race condition between thousands to tens of thousands of different humans at every decision making step, took 16 days, 7 hours, 50 minutes, 19 seconds.

Hieronym: I would add that that's with a certain amount of malicious inputs in the mix (from trolls).

eniteris (quoting the post): "Also, he has a noticeable lack of confusion about key NPCs–Oak is consistently Oak, the player character is never 'the red-hatted NPC', and he can pick out gym leader Erika from a lineup." I'm a little lost on this front. A person who has never encountered Pokemon before would not recognize the Oak or Erika sprite on sight; why should the AI vision model? Perhaps one could match the Oak sprite to the full-size Oak picture at the beginning of the game, but Erika? Erika can really only be identified by sprite uniqueness and placement in the top center of the gym. I would instead think the newer models are just trained on more Pokemon, and hence can better identify Pokemon images.

Julian Bradshaw: The models have always been deeply familiar with Pokémon and how to play through it from the initial tests with Sonnet 3.7—they all know Erika is the fourth gym leader in Red, there's just too much internet text about this stuff. It contaminates the test, from a certain perspective, but it also makes failures and weaknesses even more apparent. It is possible that Claude Opus 4.5 alone was trained on more Pokémon images as part of its general image training (more than Sonnet 4.5, though...?), but it wouldn't really matter: pure memorization would not have helped previous models, because they couldn't clearly see/understand stuff they definitely knew about. (I also doubt Anthropic is benchmaxxing Pokémon, considering they've kept their harness limited even after Google and OpenAI beat them on their own benchmark.)

Lcofa: Every now and then I play 20 questions with Claude to see how much he can adjust in his thinking. Giving answers like "sort of" and "partly" can teach him that yes and no aren't the only options. To think outside the box, so to speak. Even playing 20 questions 5 times in a row, each taking turns as to who thought up the item to search for, he improved dramatically. (But if you run out of tokens in the middle of a run, assume he will forget what his item was, because the scratch pad will be cleared.) But 20 questions is text-based. Playing a role playing game, or going on adventures with him, also works well because it's text-based. (Though it's clear he will not harm the user, not even in a pillow fight.) When you move to visual media you have that problem of translating pictures to something he can see, as well as his ability to think through a problem. Like missing the tree that could be broken, or not knowing how to get around a wall. His scratch pad is limited in what it can carry. I wonder if anyone has tried using a MUD, or other text-based games, with Claude or other LLMs. It seems like that would make it easier for the model to have better context, since the whole context would be loaded to create the next forward pass.
=== Cognition all the way down 2.0: neuroscience beyond neurons in the diverse intelligence era | Synthese (https://t.co/YH7W3F9FtQ) ===
Cognition all the way down 2.0: neuroscience beyond neurons in the diverse intelligence era. Original Research, Open access, published 06 November 2025, Synthese, Volume 206, article number 257 (2025). Robert Chis-Ciure (ORCID: orcid.org/0000-0001-8008-132X) and Michael Levin.
Abstract: This paper formalizes biological intelligence as search efficiency in multi-scale problem spaces, aiming to resolve epistemic deadlocks in the basal "cognition wars" unfolding in the Diverse Intelligence research program. It extends classical work on symbolic problem-solving to define a novel problem space lexicon and search efficiency metric. Construed as an operationalization of intelligence, this metric is the decimal logarithm of the ratio between the cost of a random walk and that of a biological agent. Thus, the search efficiency measures how many orders of magnitude of dissipative work an agentic policy saves relative to a maximal-entropy search strategy. Empirical models for amoeboid chemotaxis and barium-induced planarian head regeneration show that, under conservative (i.e., intelligence-underestimating) assumptions, even 'simple' organisms are from two-hundred- to sextillion-fold more efficient in problem space exploration. In this sense, the deep insights of neuroscience are not about neurons per se, but about the policies and patterns of physics and mathematics that function as a kind of "cognitive glue" binding parts toward higher levels of collective intelligence in wholes of highly diverse composition and origin. Therefore, our synthesis argues that the "mark of the cognitive" is perhaps better sought in the measurable efficiency with which living systems, from single cells to complex organisms, traverse energy and information gradients to tame combinatorial explosions, one problem space at a time.
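The abstract's operationalization compresses to a single expression; in symbols of my own choosing (not the paper's notation):

```latex
% Search efficiency: orders of magnitude of dissipative work an agentic
% policy saves relative to a maximal-entropy (random-walk) search.
\eta = \log_{10} \frac{W_{\text{random walk}}}{W_{\text{agent}}}
```

On this scale, the abstract's range reads as roughly \eta ≈ 2.3 for "two-hundred-fold" up to \eta ≈ 21 for "sextillion-fold" (10^21).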
A leading edge of this expansion is the basal cognition research program (Manicka and Levin, 2019; Levin et al., 2021; Lyon et al., 2021; Lyon and Cheng, 2023; Fábregas-Tejeda and Sims, 2025), a subset of the field of Diverse Intelligence (Levin, 2022, 2023a, 2025; Pio-Lopez et al., 2022; Clawson and Levin, 2023; Lagasse and Levin, 2023; Watson and Levin, 2023; McMillen and Levin, 2024), specifically focused on the evolutionary history of cognition and how it scaled from primitive versions to more complex ones. A wide variety of non-brainy systems have now been shown to exhibit learning, decision-making, and other competencies normally studied by cognitive and behavioral science (Baluška and Levin, 2016; Katz et al., 2018; Vallverdú et al., 2018; Gershman et al., 2021; Katz and Fontana, 2022; Kaygisiz and Ulijn, 2025). This "cognitive biology 2.0" (Lyon, 2025) paradigm casts cognition as a bio-functional continuum that might begin even earlier than single cells, let alone brains—e.g., molecular networks or materials (Fig. 1) (Bose, 1902; Power et al., 2015; Katz et al., 2018; Biswas et al., 2021, 2022; Katz and Fontana, 2022)—and is a "biological necessity" for all life forms (Shapiro, 2021; Lane, 2022; Lyon and Cheng, 2023).

Fig. 1 Memory in molecular pathways. Chemical signaling pathways, such as gene-regulatory networks (GRNs), can be treated as generic agents amenable to the analytic tools of behavioral science. In networks modelled by either Boolean logic or continuous ordinary differential equations, sequential stimulation of specific input nodes produces plastic changes in the activity of distal output nodes that recapitulate canonical forms of learning, including habituation, sensitization, and even classical (Pavlovian) conditioning (Pigozzi et al., 2025). Top row: schematic of associative learning in the dog-bell paradigm: presentation of the conditioned stimulus (CS, bell) alone elicits no salivation; pairing of the CS with the unconditioned stimulus (UCS, steak) elicits salivation; after training, the CS alone evokes the conditioned response (R, salivation). Bottom row: equivalent behavior in an in silico GRN: activation of an input node encoding the CS is initially uncorrelated with the activity of an output node (R); concomitant activation of a separate UCS node induces plastic changes that strengthen the causal linkage between CS and R, such that subsequent stimulation of the CS node alone triggers a robust response in the R node. Node size reflects connectivity (degree), and edge thickness reflects interaction strength. Grey arrows indicate the temporal sequence of events.

Therefore, cognition requires a phylogenetically deep, bottom-up rather than an a priori, armchair methodological approach (Lyon et al., 2021). In particular, it has been suggested that a critical aspect of a comprehensive picture is the understanding of coexistence (cooperation, competition, etc.) of the many agents, of diverse spatio-temporal scales and competencies, that exist at different levels of organization in the body, which has been formalized via the multi-scale competency architecture (Fields and Levin, 2022) (Fig. 2) and the concept of polycomputing (Bongard and Levin, 2023).

Fig. 2 Multi-scale competency architecture. Biological systems occur as nested layers of scale and organizational complexity, ranging from subcellular molecular pathways to swarms of organisms in ecosystems.
Uniquely, living systems exhibit active agency (in the cybernetic sense (Rosenblueth et al., 1943)) at each scale, with multiple subsystems exhibiting memory and some degree of competency at navigating a wide variety of problem spaces with specific future-oriented agendas. The multiscale competency architecture is evinced by each level deforming the option space for its parts (e.g., through hacking them via behavior-shaping stimuli), resulting in activity that serves system-level agendas in new problem spaces of which the parts may have no knowledge. Images by Jeremy Guay of Peregrine Creative, except for the planarian morphospace panel, which is by Alexis Pietak.

But what is cognition under this view? According to a "phyletically neutral" operational definition (Lyon, 2020, p. 416): "Cognition is comprised of sensory and other information-processing mechanisms an organism has for becoming familiar with, valuing, and interacting productively with features of its environment in order to meet existential needs, the most basic of which are survival/persistence, growth/thriving, and reproduction." This emphasizes the basic, early phases of cognitive development; however, more advanced capabilities (but still already present at the multicellular tissue level) involve identification of new problems to solve and new spaces to project into, balancing surprise minimization (active inference) and creative exploration (infotaxis), as well as drives towards metamorphosis (not merely persistence of the status quo, but growth and change) (Levin, 2024). Lyon et al. (2021) synthesize these capacities into a minimalist "toolkit" (Table 1 in Lyon et al. (2021)) and map them onto widely evolutionarily conserved biological intra- and inter-cellular processes (Table 2 in Lyon et al. (2021)), arguing, based on recent evidence (Lyon, 2015; Prindle et al., 2015; Sourjik and Vorholt, 2015; Yang et al.), that these mechanisms long pre-date neurons and scale hierarchically from intracellular to organismal levels. Apropos consciousness (Ellia and Chis-Ciure, 2022; Seth and Bayne, 2022; Chis-Ciure et al., 2024; Chis-Ciure, 2025; Zheng et al., 2025), we do not make any claims here but, for now, merely note the following. Body tissues outside of the left hemisphere do not have the benefit of the ability to use eloquent language to convince novice observers like us of the presence of an inner perspective in an unconventional embodiment. However, most—all?—of the molecular mechanisms, behaviors, and information-processing dynamics (such as metrics of causal emergence) that are found in brains and widely used to underpin charitable assessments of the problem of Other Minds are found elsewhere in the body (Pezzulo and Levin, 2015; Varley et al., 2024; Blackiston et al., 2025; Pigozzi et al., 2025). To whatever extent consciousness tracks these measurable features, the possibility of its presence should be taken seriously in many contexts outside of brains. To maintain a connection with state-of-the-art empirical results, we have focused on problem-solving competencies. In the What is cognition? symposium by Bayne et al. (2019), Nicola Clayton makes a distinction between flexible problem-solving that can be transferred to new contexts, heuristic rules (core knowledge), and associative learning mechanisms.
Examples of flexible use of molecular mechanisms in new morphogenetic contexts have been reported: e.g., the use of cytoskeletal bending to create structures out of one giant cell, instead of the normal multicellular mechanisms, when cell size is artificially drastically increased. However, we urge caution and flexibility in mapping competencies from other embodiments and other problem spaces onto familiar concepts in behavioral science. If cognition is to be useful outside the N = 1 example of brainy life on Earth, we need to be prepared for plasticity in our conventional definitions of cognition. Certainly, one needs some guardrails for the concept to have any meaning, but unless "cognition" is to mean "whatever brains do here on Earth" ex cathedra, we need to have some capacity for interesting new features and properties that differ in their details from how it is implemented in neuron-based navigation of 3D space. Zooming in on one incarnation of this research program, Fields and Levin (2022) sought to generalize these ideas beyond their canonical medium so that they could apply to multiple levels of organization within living systems. They proposed a view that addresses the specific-scale-transcending, compositional aspect of biological cognition (Levin, 2019). Their core idea is that competency in navigating arbitrary problem spaces is a scale-free invariant for analyzing cognition and agency across diverse biological (and synthetic) embodiments. According to them, biological agents at every organizational level traverse multiple, observer-defined problem spaces: transcriptional attractor landscapes, physiological homeostatic manifolds, anatomical morphospaces, 3D behavioral spaces, or informational domains underpinning symbolic manipulations and social interactions (Fig. 1). Evolution, they argue, has co-opted and generalized problem-solving heuristics—formalized as variational free-energy minimization (Friston, 2010, 2019; Fields et al., 2022; Friston et al., 2023; Fields, 2024)—to optimize trade-offs between heterogeneous data and goals to converge on context-sensitive, adaptive policies across networks, collectives, and communities (see also earlier work by Friston et al. (2015) and Pezzulo and Levin (2016) for modeling along these lines) (Fig. 3).

Fig. 3 Diverse spaces for navigational intelligence. Human observers are primed to notice intelligent behavior of medium-sized objects moving at medium speeds through 3-dimensional space. But biology was exhibiting navigation of problem spaces long before muscle (and the nerve needed to operate it) came on the scene. Molecular circuits, cells, tissues, and organs navigate transcriptional, metabolic, and anatomical morphospaces, performing perception-decision-action loops to achieve adaptive goal states. Panels in the top row on the right are from the video "Crows are being trained to pick up cigarette butts and clean cities," produced by Nameless Network, and, respectively, a design by Ruben van der Vleuten and Bob Spikman for Crowded Cities, 2017. Panels in the bottom row taken with permission from Marder and Goaillard (2006), Huang et al. (2009), and Cervera et al. (2021), respectively.

In this model, navigation of a problem space by a system is taken to instantiate intelligence in the sense of William James (1995): some degree of competency in reaching the same goal (state) by diverse means when circumstances change.
Numerous examples have been published of invariant morphogenesis despite radical deformations (Pezzulo and Levin, 2015; Levin, 2023b), transcriptional and physiological adaptation to knock-down of important components (Emmons-Bell et al., 2019), behavioral robustness in the face of drastic sensory-motor reconfiguration (Blackiston et al., 2025), and cellular connections adapting via novel routes (Little et al., 2009). These are all examples of "flexibility," as per James' emphasis on multiple paths toward a specific (or generalized) goal, which are even more impressive than the ubiquitous ability of "knowing when to stop," such as the error-minimization competencies of organ regeneration in amphibia (Pezzulo and Levin, 2016). In turn, the scope of a system's goals is taken to define the collective intelligence (Levin, 2019), because it serves as a binding model that orchestrates the parts to act coherently. The scale of the goal state that the system is able to reliably achieve, despite various impediments from the external environment and even perturbations of its own parts, is defined as the system's "cognitive light cone." From a broader perspective, this line of thinking about intelligence mirrors Shadlen's when he says that "a precise definition of 'cognitive' is less essential than the recognition of its elemental features: flexibility, contingency, and freedom from immediacy" (Bayne et al., 2019, p. R612). While 'cognition' is arguably a more diffuse concept that includes intelligence, all the intelligence-involving features the quote mentions are instantiated in the problem-solving and time-shifting elements of, e.g., morphogenetic decision-making. The main contribution of this paper concerns intelligence and covers both cellular chemotaxis and morphogenesis, as problem-solving behavior is now experimentally tractable, practically applicable to unconventional agents, and more conducive to formalization. Nevertheless, current evidence and our and others' analytic results license extrapolations about cognition generaliter: a strategy one might characterize as 'the proof of cognition is in the problem-solving pudding.'

2 Not so fast?
Zooming out to the dialectical setting of basal cognition within the Diverse Intelligence program, Lyon and Cheng (2023) argue that the historical tether between cognition and nervous-system complexity is heir to Lamarck's dictum and was amplified by twentieth-century cognitivism. Hence, that tether has become indefensible in the 21st century's intellectual environment, and a "shift in cognitive gravity" away from brains and toward the cellular architectures that preceded them is indispensable. Nevertheless, not everyone is ready to pivot their cognitive gravity toward a basal cognition-style approach to all things minded, and some have entrenched sceptical positions in the "cognition wars" (Adams and Aizawa, 2010; Adams, 2018; Loy et al., 2021; Figdor, 2022, 2024; Fábregas-Tejeda and Sims, 2025). On the conceptual side, Adams (2018) charges the proponents of cognition in unconventional systems with equivocating on terms like "learning," "memory," or "decision-making," and with relying on a terminological loosening or metaphorical extension of such concepts rather than demonstrating genuine cognitive processes as traditionally understood.
While cells and even plants exhibit adaptive, information-driven behavior, cognition in the 'thick' sense involves representations possessing: (i) intentionality, which is the capacity to represent objects or states of affairs; (ii) intensionality, that is, the further capacity to represent them under specific aspects, allowing for different cognitive attitudes towards extensionally identical referents (Adams, 2018, p. 23); and (iii) the possibility of misrepresentation, namely, the fact that internal states, qua representations, can be false or fail to accurately map onto the world (Dretske, 1986; Fodor, 2002). However, one could argue that these features are present, in basal form, in morphogenetic examples of intelligence (Levin, 2023c, d, 2025; McMillen and Levin, 2024). For example, representations of counterfactual states are seen in planarian flatworms, in which a stable bioelectric pattern indicates the future number of heads to make if the animal gets injured (Levin et al., 2019). In other words, the number of heads that cells should grow upon injury is determined by a re-writable physiological pattern memory, and the state of that memory encodes not the current number of heads (which can differ) but a stored, decodable representation of a "correct" planarian that serves as a guide for regenerative growth, remodeling, and cessation of activity once the represented goal state is achieved. Moreover, symbolic interpretation of signs, i.e., semiosis (Salthe, 1998; Barbieri, 2008; Brier, 2008; Turner, 2016), is seen in the arbitrary nature of bioelectric organ prepatterns, which are sparse signals that do not directly encode the myriad forces needed to implement anatomical outcomes but serve that function only because the cell collective interprets these arbitrary patterns with mutually agreed-upon meanings (Levin and Martyniuk, 2018). And, much as other collective intelligences like ant colonies fall for visual illusions (Sakiyama and Gunji, 2016), morphogenesis can likewise exhibit errors of perception of pattern memory and stimuli, as well as errors of inference, which lead to abnormal outcomes (Pezzulo, 2020; Pezzulo et al., 2021; Pio-Lopez et al., 2022). From a different angle, Figdor (2022) criticizes the program's "freewheeling use of functional ascriptions," which neglects the evolutionary individuation of biological characters. The argument, grounded in Character-Species Separation (CSS) and Character-Phenotype Separation (CPS) principles, posits that cognitive functions must co-evolve with their substrate-dependent biological realisers. Through this move, it calls into question the functionalist assumption explicitly endorsed by Levin et al. (2021) that cognitive roles can be ascribed across clades, on the grounds that such ascription erases lineage-specific histories (CSS) and divorces functions from the phenotypical realisers that individuate them (CPS). On this point, it bears stressing that the view of Levin (2019, 2022) derives from the extension of the Problem of Other Minds to all systems, not just human brains. In other words, possible cognitive states in unconventional agents are epistemically latent under an inferential veil. Observers such as researchers, conspecifics, parasites, etc., must abductively infer and formalize their putative goals and problem structures by reverse-engineering problem-solving trajectories from observed data (Rouleau and Levin, 2023).
This means that cognitive assessments of any system should be considered as claims about the efficacy of specific behavioral interaction protocols (sets of tools, from cybernetics to psychoanalysis), which are to be established empirically. These are taken to be not unique ground truth but observer-relative, consistent with Dennett's Intentional Stance (Dennett, 1998) and the polycomputing framework in which multiple observers can usefully interpret the same physical events in different ways (Bongard and Levin, 2023). Furthermore, at the research bench, it means that any ascription of cognitive terms to a system, or the softening of boundaries of ancient linguistic categories, must not be freewheeling or poetic, but rather disciplined by demonstrated utility in driving novel discoveries and enabling new empirical capabilities—in a nutshell, by improved fertility for new research as compared to conventional formalisms. On the empirical side, in a comprehensive review of 20th-century and recent evidence, Loy et al. (2021) argue that, despite abounding Pavlovian-style rigorous experiments, associative learning, a paradigm central to understanding cognition, demonstrates clear limitations and at least partial lack of replicability when applied to unicellular organisms like E. coli (see also Dussutour (2021)), or protists like Paramecium aurelia or Physarum polycephalum. On the one hand, we respond to this by agreeing that experimentally probing claims of intelligence in unconventional systems is fraught with difficulties and, in many ways, constitutes an IQ test for the observer (Levin, 2023a). Still, the extensive references to empirical results we have provided so far, and the formal results we take up next, license, in our view, optimism about the prospects of this research field, one that can only benefit from course corrections such as those provided by Loy et al. (2021). On the other hand, we think Chittka is on the right track when saying: "There is, however, no clear demarcation between sub-cognitive processes – for example, non-associative learning such as habituation, or classical conditioning – and cognitive operations. Nor is it clear that the former evolved first and the latter were added sequentially over evolutionary time according to complexity. The same neural circuits that mediate 'simple' associative learning can also underpin basic rule learning and non-trivial logical operations such as the XOR problem" (Bayne et al., 2019, p. R610). If, empirically, the divide between the sub-cognitive and the cognitive is arguably porous, the most promising stance is the one that leads to more breakthroughs, and that, in our view, is flexibility or deflationism (Allen, 2017) about definitions rather than a priori entrenching. Taking a step back, the basal cognition wars seem to rehearse epistemic deadlocks familiar from other cognitive science debates (Piredda, 2017; Harrison et al., 2022; Facchin, 2023; Fábregas-Tejeda and Sims, 2025). Thus, while proponents point to context-sensitive, adaptive capacities across evolutionarily distant lineages that allegedly warrant cognitive function attribution, sceptics caution against terminological dilution, data misinterpretation, and the misapplication of concepts with semantic parameters well-defined only for more complex, nervous-system-endowed metazoans. This deadlock stems partially from ambiguity: the grain of the 'atomic' unit of cognition diverges across "disciplinary silos" (Lyon et al., 2021, p. 3)
and lacks systematic formalization beyond broad operational definitions (cf. Lyon (2020)) and initial mathematization attempts (cf. Fields and Levin (2022)). We concur that the problem is both methodological and conceptual: How does one operationalize and measure cognition across radically different embodiments and scales without begging the question or straining analogies? In our view, this theoretical cul-de-sac could be partially resolved via more precise, operationalizable, and scalable frameworks that retain a meaningful sense of thickness for a bio-cosmopolitan concept of cognition capable of guiding ongoing and future empirical efforts (Levin and Dennett, 2020). Moreover, we think it is important to hold open the possibility that our existing criteria for specific cognitive phenomena (e.g., precise definitions of Pavlovian conditioning, habituation, etc.) from behavioral science will need to be expanded or modified in order to apply to diverse intelligent systems. On the one hand, it makes sense not to loosen criteria and expand terms to the point that they lose their meaning. On the other hand, expecting all embodiments to comply with specific criteria developed with an intense focus on brains and animal behavior is begging the question, in terms of assuming that brains set the standard for "bona fide" cognitive skills. Finding a good balance, we suggest, requires two things. First, an unflinching inquiry into the essence of each of these phenomena: What is it really that we are trying to capture, if we let go of comforting but limiting criteria set by properties of neurons and neural networks? Doing so has led to important advances, for example, in the discovery of commonalities between learning and population-level processes in evolution (Power et al., 2015; Livnat and Papadimitriou, 2016; Watson and Szathmáry, 2016; Watson et al., 2016; Kouvaris et al., 2017), which in turn shed light on aspects of machine learning and other fields. Second, the ultimate judge of the legitimacy of unification must be empirical success: the degree of prediction, control, and fecundity for driving new discoveries and new capabilities determines whether a particular set of tools and concepts is legitimately expanded to a new domain. In the last few decades, the field of Diverse Intelligence has been driving a remarkable richness of new discoveries that spread across bioengineering, regenerative medicine, evolutionary biology, ecology, behavioral science, artificial life, and more (Levin, 2021; Reber and Baluška, 2021; Baluška et al., 2022; Davies and Levin, 2023; Lagasse and Levin, 2023; Mathews et al., 2023; Miller et al., 2023). This section has given preliminary answers to some of the critics by drawing on cutting-edge literature in several fields. However, our main contribution to this epistemological deadlock is non-technically summarized in Sect. 3 and developed in more empirical and mathematical detail in Sects. 4, 5, and 6.

3 The argument in a nutshell
The present paper makes strides toward addressing the breadth-depth trade-off in utilizing cognition-loaded concepts within the Diverse Intelligence program, aiming to reinforce its theoretical foundations. Specifically, our contribution is to formally sharpen and extend the MCA view proposed by Fields and Levin (2022) by meeting it on its own terms: navigation in problem spaces under variational physical principles.
However, complementary to but distinct from their earlier (Friston et al., 2015; Pezzulo and Levin, 2016) and subsequent (Fields et al., 2022; Fields, 2024) works, our approach takes a cue from the skeptics (Adams and Garrison, 2013; Adams, 2018; Figdor, 2022) and begins from the human case by revisiting the classical formulation of problem-solving by Newell and Simon (1972), developed initially for symbolic intelligence (Burns and Vollemeyer, 2000). In their Turing Award lecture, Newell and Simon (1976, p. 123) capture perfectly the core tenet of our project: "The task of intelligence, then, is to avert the ever-present threat of the exponential explosion of search." Thus, Sect. 4 argues that this problem space (\(P\)) formalism, when suitably extended, provides an expressive, substrate-agnostic lexicon for analyzing goal-directed adaptive behavior beyond its original remit. To this end, in Sects. 5 and 6, we illustrate the versatility of this adapted formalism by applying it to unconventional examples such as amoeboid chemotaxis and planarian regeneration, contributing to existing intuition-building efforts for how cellular and morphogenetic processes can be cast as a search within specific problem spaces (Fields and Levin, 2022; Fields et al., 2022; Fields, 2024). We recognize that mappings from abstract constructs to biological structures and processes are a dime a dozen, so we pivot next toward a novel operationalization of biological intelligence: search efficiency in problem space (\(K\)). This is a strategic move: as a scalar effectiveness metric for possibly very different problem-solving processes, \(K\) shifts focus away from the vague umbrella concept of cognition and its various functions (e.g., decision-making, memory, learning, concept formation, etc.), which, skeptics warn, atrophy into metaphor when transplanted from validated use into other (literal) walks of life. Defined as the logarithmic ratio of the cost of a blind search to the cost of an agentic search policy, \(K\) quantifies how many orders of magnitude more efficient an agent is compared to a random walk in a given problem space \(P\). Chance might not look like much of a benchmark, but looks deceive: it ensures lineage-, system-, scale-, and process-neutrality, which is a conceptual sine qua non for a bio-cosmopolitan concept of cognition, i.e., one which does not beg the question by assuming that only certain expressions (e.g., humans, higher animals) fit under "the mark of the cognitive." Moreover, because both the numerator and the denominator scale with the intrinsic size of \(P\), the metric is automatically normalized for task difficulty and remains finite for enormous state spaces. Furthermore, \(K\) is additive across independent sub-runs and, therefore, compositional across nested sub-problems. In brief, \(K\) is scale-invariant, controls for task complexity, is expressed in physical work units, and puts intelligence on a continuous gradient. Admittedly, \(K\) does not a priori equate to thick cognition (Adams, 2018); however, because it quantifies search advantage within-scale and can be additively evaluated across-scales (compositionality), it can precisely express how much combinatorial "dead work" is eliminated via increases in biological complexity. This for-all-strata-and-problems intelligence budget, we believe, gives a mathematical sense of the type of coordinated, system-level behaviors usually associated with "bona fide" cognition.
One may retort that organisms obviously outperform blind search and that clothing this truism in combinatorial garb adds little. We disagree. First, given case-specific empirical details and modeling assumptions, the search efficiency metric can be computed, compared, and statistically tested across both phylogenetic and synthetic lineages. Second, once made empirically tractable, the additive decomposition of \(K\) across nested blankets pinpoints where—and by how much—intelligence condenses, rendering the 'obvious' suddenly measurable and, therefore, refutable. The stage is now set for Sect. 4, where we formalize this account by specifying the extended problem space and efficient search lexicon.

4 A formal lexicon for efficient search in biological problem spaces
4.1 Problem spaces—the setup
This subsection lays the formal scaffolding. It explicates the minimal set of elements—states, operators, constraints, evaluation, and horizon—that jointly define a scale-agnostic problem space. Doing so equips us with a lexicon for analyzing and comparing various biological processes from a unified search-efficiency perspective. To a first approximation, problem spaces are abstract constructs that can formalize adaptive, goal-directed problem-solving processes across scales of physical organisation. Formally, we define an arbitrary problem space \(P\) as an ordered quintuple:

$$P=\langle S,O,C,E,H\rangle.$$ (1)

Here, \(S\) represents the set of all physically realisable configurations a system can occupy that are relevant to its problem-solving activity at a given level of analysis. Following Newell and Simon (1972), this includes initial \(S_{\mathrm{init}} \subset S\) and solution \(S_{\mathrm{goal}} \subset S\) states (we suppress the subscript when context renders the subset obvious). Operators \(O\) capture elementary transitions. An operator \(o \in O\) maps a state \(s \in S\) to a subsequent state \(s' \in S\) (in a deterministic setting, we have \(o: S \to S\)) or to a set of possible subsequent states (in a non-deterministic setting, we have \(o: S \to \mathcal{P}(S)\), where \(\mathcal{P}(S)\) is the powerset of \(S\)). Search requires a metric on effort, meaning each application of an operator incurs a problem-specific cost, which we formalize by a weight function \(w\colon O\to\mathbb{R}_{\ge 0}\). A policy or trajectory \(\pi=\langle s_0,o_0,\dots,o_{k-1}\rangle\) is a sequence of operators applied starting from an initial state \(s_0 \in S_{\mathrm{init}}\) to generate a sequence of states \(s_0 \stackrel{o_0}{\longrightarrow} s_1 \stackrel{o_1}{\longrightarrow} \ldots \stackrel{o_{k-1}}{\longrightarrow} s_k\), with \(s_{k} \in S_{\mathrm{goal}}\). The cumulative cost of such a trajectory is \(\mathcal{C}(\pi \mid s_0) = \sum_{i=0}^{k-1}w(o_i)\). Constraints \(C\subseteq S\times O\) exclude physically impossible moves, specifying the bounds of the operationally accessible. Technically, \(C\) lists forbidden state-operator pairs, so the admissible set is its complement \(A=(S\times O)\setminus C\). Philosophically, \(C\) specifies nomologically possible paths. By "physical" we mean those properties and relations that obtain in virtue of a system's scale-specific realization (e.g., cellular mechanics, tissue-level bioelectrical rules, bodily positions and trajectories, etc.), not necessarily only those properties deemed fundamental by physical theory (Stoljar, 2024).
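As a minimal illustrative sketch (ours, not part of the paper's presentation; all identifiers are placeholders), the quintuple of Eq. 1 and the cumulative trajectory cost \(\mathcal{C}(\pi \mid s_0)=\sum_i w(o_i)\) can be transcribed directly:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Set, Tuple

State = Hashable
Operator = str

@dataclass
class ProblemSpace:
    """P = <S, O, C, E, H> plus the operator cost w: O -> R>=0 (Eq. 1)."""
    S: Set[State]                       # physically realisable configurations
    O: Set[Operator]                    # elementary transitions
    C: Set[Tuple[State, Operator]]      # forbidden (state, operator) pairs
    E: Callable[[State], float]         # scalar (dis)utility of a state
    H: int                              # forward look-ahead bound
    w: Callable[[Operator], float]      # per-operator cost (e.g., ATP, joules)

    def admissible(self, s: State) -> Set[Operator]:
        """The admissible set A = (S x O) \\ C, restricted to state s."""
        return {o for o in self.O if (s, o) not in self.C}

def trajectory_cost(P: ProblemSpace, operators: Iterable[Operator]) -> float:
    """Cumulative cost C(pi | s0) = sum_i w(o_i) over a policy's operators."""
    return sum(P.w(o) for o in operators)
```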
The evaluation functional \(E\colon S\to\mathbb{R}\) assigns a scalar utility (larger preferred) or, equivalently, a scalar disutility (smaller preferred) based on objectives inherent to the problem-solving system, which reflect its intrinsic goals or viability criteria. For biological systems, \(E\) often translates to a proxy for fitness, such as proximity to homeostatic setpoints, morphogenetic target achievement, reproductive success, etc. Furthermore, when conceptually unpacked, \(E\) implies that energetic, temporal, and risk currencies compete, suggesting that, at least in biological systems, evolutionary history selects for evaluation mechanisms that render qualitatively incommensurable optima into a system-evaluable format to effectively guide behavior along fitness gradients. Finally, the horizon \(H\in\mathbb{N}\) bounds forward look-ahead, representing the number of steps typically considered in sequential operations within the space; this is usually called, at least in the human case, 'planning' or 'prediction.' More generally, one may specify a real-valued time bound \(\tau_{\max}\) and set \(H=\lceil\tau_{\max}/\Delta t\rceil\), with \(\Delta t\) the discretization step. We numerically show in the next sections that horizons derive from inherent physical timescales or delay lines, which functionally constrain the effective depth or temporal range of prediction available to the system. The classical formulation of problem spaces by Newell and Simon (1972) primarily focuses on states \(S\), including initial and goal states, and operators \(O\) defining the space, with evaluation \(E\) and constraints \(C\) considered aspects of the search strategy operating within that space. However, we include \(C\), \(E\), and \(H\) explicitly in our definition of \(P\) to foreground the constraints, evaluative criteria, and predictive limitations that are particularly salient in the biological systems we analyze. We 'promoted' them to first-class elements because, as we show, biological systems often modulate them directly as part of their adaptive repertoire. Rather than just navigating a fixed space, biological systems recursively adjust their problem spaces (via, e.g., constraint relaxation, preference tuning, or catalytic temporal speed-ups), and this capacity is a fingerprint of biological intelligence that our extended formalism aims to capture. From a broader perspective, the grammar just introduced, while developed initially for symbolic human and artificial intelligence (Newell and Simon, 1972, 1976; Burns and Vollemeyer, 2000), is a minimal yet powerful vocabulary for analyzing goal-directed systems because it abstracts informational relationships between states, transformations, and evaluative criteria from scale-specific physical realization details, rendering it substrate-flexible. Nevertheless, mindful of the skeptical lessons of Sect. 2, we show below how our account heeds lineage-sensitive constraints (Figdor, 2022; Fields, 2024): by parameterizing constraints, evaluation metrics, and time horizons as empirically traceable, scale-bound variables, it ties functional ascriptions to their material histories (rather than dispersing them promiscuously) and, in principle, enables within- and inter-lineage comparisons.
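The horizon formula is worth evaluating explicitly; a trivial sketch (the function is ours; the timescales are those reported in Sects. 5 and 6 below):

```python
import math

def horizon(tau_max_s: float, dt_s: float) -> int:
    """H = ceil(tau_max / dt): look-ahead steps afforded by an intrinsic timescale."""
    return math.ceil(tau_max_s / dt_s)

# Dictyostelium (Sect. 5): ~0.33 s cAMP diffusion time at dt ~ 0.3 s gives H ~ 1.
# Planaria (Sect. 6): ~6 h neoblast G2 duration at dt = 1 s gives H ~ 2.2e4.
print(horizon(0.33, 0.3), horizon(6 * 3600, 1.0))  # -> 2 (read as H ~ 1), 21600
```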
4.2 Intelligence qua search efficiency in problem space
William James (1995) defined intelligence as "a fixed goal with variable means of achieving it," and this is a good entry point for specifying the relationship between problem spaces and intelligence. In our context, we operationally define intelligence as the capacity for effective searches, meaning applications of operators \(O\), that reach goal states \(S_{\mathrm{goal}} \subset S\) preferred under \(E\), given the prevailing constraints \(C\) and bounded by the horizon \(H\), despite unforeseen obstacles. Obstacles can be formalized as additional forbidden pairs in \(C\) whose existence is revealed only when they fall within the predictive horizon \(H\). Intelligence is, therefore, a gradient property: its degree is the search efficiency of the system within a given problem space. Formally, let \(\tau_{\mathrm{blind}} = \mathbb{E}\bigl[\mathcal{C}(\pi_{\mathrm{blind}}\mid s_{0})\bigr]\) denote the expected cumulative cost \(\mathcal{C} = \sum_i w(o_i)\), in terms of weighted operator applications with cost function \(w(o_{i})\), incurred by a maximal-entropy (unbiased) random-walk policy \(\pi_{\mathrm{blind}}\) on the admissible graph \(A = (S \times O) \setminus C\) to reach any state \(s_{k} \in S_{\mathrm{goal}}\) from an arbitrary initial state \(s_0 \in S_{\mathrm{init}}\). Next, we write \(\tau_{\mathrm{agent}} = \mathbb{E}\bigl[\mathcal{C}(\pi_{\mathrm{agent}}\mid s_{0})\bigr]\) for the corresponding expectation under a given system's agentic policy \(\pi_{\mathrm{agent}}\). So equipped, we formally define the search efficiency in problem space as:

$$K = \log_{10}\left(\frac{\tau_{\mathrm{blind}}}{\tau_{\mathrm{agent}}}\right).$$ (2)

Equivalently, in natural units one has \(K = \frac{1}{\ln 10}\ln(\frac{\tau_{\mathrm{blind}}}{\tau_{\mathrm{agent}}})=\frac{\mathcal{I}_{\mathrm{path}}}{\ln 10}\), with \(\mathcal{I}_{\mathrm{path}}=\ln(\frac{\tau_{\mathrm{blind}}}{\tau_{\mathrm{agent}}})\), so that a single decimal unit (\(K=1\)) corresponds to \(\log_{2}10 \approx 3.32\) bits of path-information gain (Shannon, 1948). \(K\) measures how many orders of magnitude of dissipative expenditure (i.e., search cost) an agent saves relative to maxent search. We say dissipative expenditure because each operator application is costed by \(w:O \to \mathbb{R}_{\ge 0}\), such that \(\tau\) inherits the physical units of \(w\) (e.g., joules, ATP hydrolysis, etc.), which casts the intelligence metric in terms of biophysical budgets rather than abstract time steps, as Figdor (2022, 2024) cautions. Intuitively, a zero-valued \(K\) marks chance performance, \(K > 0\) indicates supra-random efficiency, and \(K \gg 0\) reflects much larger search advantages. Each integer increment tracks one order of magnitude of speed-up, such that for \(K=n\) we have a \(10^n\)-fold search-efficiency gain. Additionally, the choice of log base cancels when comparing two systems. For cross-system assessment, one can write \(\Delta K=K_1-K_2\), such that the differences can equally be read in bits (\(\Delta K\log_{2}10\)) or nats (\(\Delta K\ln 10\)). However, note that \(K\) must always be evaluated relative to a well-defined problem space \(P = \langle S, O, C, E, H\rangle\), as the specific characteristics of \(S\), \(O\) (including \(w\)), \(C\), and \(S_{\mathrm{goal}}\) determine the state-transition graph and cost landscape upon which both \(\tau_{\mathrm{blind}}\) and \(\tau_{\mathrm{agent}}\) are calculated. Moreover, \(K\) can also be proved additively composable.
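Numerically, Eq. 2 and its unit conversions are one-liners; a small sketch with made-up costs (ours, for illustration; the additivity checked at the end is the property formalized next):

```python
import math

def K(tau_blind: float, tau_agent: float) -> float:
    """Search efficiency K = log10(tau_blind / tau_agent), Eq. 2 (same units for both)."""
    return math.log10(tau_blind / tau_agent)

def K_bits(k: float) -> float:
    """Path-information gain in bits: one decimal unit is log2(10) ~ 3.32 bits."""
    return k * math.log2(10)

# A policy ten times cheaper than blind search scores K = 1 (~3.32 bits).
print(K(1000.0, 100.0), K_bits(K(1000.0, 100.0)))

# Additivity across conditionally independent stages: K_complex = sum_j K_j.
stages = [(1e4, 1e2), (1e6, 1e3)]  # illustrative (tau_blind_j, tau_agent_j) pairs
ratio_total = math.prod(b / a for b, a in stages)
assert abs(math.log10(ratio_total) - sum(K(b, a) for b, a in stages)) < 1e-9
```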
To wit, if a complex search can be expressed as a sequence of \(n\) conditionally independent stages, such that the overall efficiency ratio \((\tau_{\mathrm{blind}}/\tau_{\mathrm{agent}})_{\mathrm{total}}\) is the product of the stage-specific efficiency ratios \(\prod_{j=1}^n(\tau_{\mathrm{blind}}/\tau_{\mathrm{agent}})_j\), then the total search efficiency is \(K_{\mathrm{complex}} = \sum_{j=1}^n K_j\). Conceptually, this means that one can assess different mechanistic contributions to search efficiency by considering how the trajectory cost is decomposed. Here are a few other noteworthy properties. First, because both the numerator and the denominator scale with the combinatorial size of the underlying space, \(K\) remains finite and retains scale-invariance. Second, unlike raw reaction-time or energy-budget measures, \(K\) controls for the baseline combinatorics of the task by normalizing against a random strategy. Indeed, \(K\) is only as good as the null model: an unfairly handicapped \(\tau_{\mathrm{blind}}\) would overestimate intelligence qua search efficiency, and vice-versa for an unfairly advantaged null model (e.g., insufficient constraints, artificially lower operator costs, etc.). Thus, for a robust baseline, the random walk must operate within the exact same problem space \(P\), particularly respecting identical admissible sets \(A\) and cost functions \(w\) as the agent. Third, logarithmic compression linearizes multiplicative search-time gains, preventing combinatorial explosions in \(|S|\) from dwarfing finer algorithmic improvements—this is a desideratum for comparing, e.g., amoeba, planaria, and vertebrate cortex on the same axis, which is presupposed if our intelligence notion is to be scale-invariant. We now propose two biologically plausible models to illustrate in practice the formal constructs introduced. Finally, before exemplifying \(K\) biologically, we highlight an important connection that is explored in upcoming work. The search efficiency in problem space shares some commitments with computational efficiency in universal computation. In brief, measures based upon algorithmic complexity bridge the gap between universal computation—which, if the physical Church-Turing Thesis (Copeland and Shagrir, 2020; Copeland, 2024) is correct, includes basal cognition—and variational free energy treatments of self-organisation. Efficiency in this context emphasises the minimization of the complexity of some generative model or program that generates some solution or content. In variational approaches, this complexity is scored in terms of a relative entropy (technically, between the posterior and prior beliefs after observing some content to be explained). This complexity minimization is addressed in universal computation through the notion of compression, which figures in many accounts of efficiency, e.g., Schmidhuber (2010), Mehta et al. (2014), Ruffini (2017), Grünwald and Roos (2019), and Friston et al. (2025). In other words, using algorithmic complexity and, in particular, Kolmogorov complexity, optimal solutions correspond to the program or policy with the minimum description or message length (Hinton and Zemel, 1993; Wallace and Dowe, 1999). This perspective on efficiency underwrites the notion of Solomonoff (2009) induction and the perspective afforded by universal computation (Delvenne, 2009; Lake et al., 2015).
Interestingly, minimum message length formulations have been linked explicitly to variational free energy (Hinton and Zemel, 1993; MacKay, 1995).

5 A model of search efficiency in the problem space of amoeboid chemotaxis
5.1 A problem space for Dictyostelium discoideum chemotaxis
Biological organisms exhibit hierarchical, nested, multi-component architectures, which makes any problem space identification non-trivial. If one zooms in on some subunit level—which knows nothing of problem spaces at higher scales—processes seem to operate purely mechanistically ("just physics") without any problem-solving. If there is any cognitive agent to be found, the traditional view locates it at some higher-order organization scale (Adams and Aizawa, 2010; Adams, 2018; Figdor, 2022), and it is usually one agent per system. Basal cognition and Diverse Intelligence proponents (Levin et al., 2021; Lyon et al., 2021; Levin, 2022, 2023a, d, 2025; Lyon and Cheng, 2023; McMillen and Levin, 2024) argue this framing is wrong: the agential perspective (Godfrey-Smith, 2009) should morph depending on the scale, meaning there are multiple interdependent problem-solvers (Fig. 4), and on who is looking; that is, identifying intelligence in another system is also an IQ test for the observer itself, as noted above. Indeed, it could be argued that a key property for life at any scale is the ability to coarse-grain appropriately, not spending precious time and energy trying to track microstates like a Laplacean Demon but rather taking the best guess at an optimal level of observation, modeling, and control of themselves, their own parts, and features of the external environment (Fields et al., 2021; Fields and Levin, 2023). Life can be seen as a battle of perspectives rather than of genes, information patterns, or energy gradients. Complex biological agents often consist of components that are themselves competent problem-solvers in their own, usually smaller, local spaces (Levin, 2022).

Fig. 4 Actions in one space enable or constrain actions in other spaces. Movement in metabolic space provides the energy needed to drive changes in gene expression (as well as cell motion), which in turn provides the building blocks needed to change cell morphology, which enables movement (behavior in 3D), which facilitates subsequent metabolic gains. Image by Jeremy Guay of Peregrine Creative.

Thus, in biological architectures, Fields and Levin (2022) argue that there is simultaneous search in multiple problem spaces interlinked across scales (e.g., transcriptional, physiological, morphological, etc.) and not only in the familiar behavioral and symbolic spaces considered initially by Newell and Simon (1972). Can our \(P\)'s formal structure capture these unfamiliar spaces? Yes. The present subsection shows how this abstract construct captures cellular behavior. Since the canonical agent scale (i.e., human and animal cognition) is unlikely to raise qualms and has been extensively discussed in the literature, we focus on two unconventional examples only to build intuition, and refer the reader to further similar work (Fields et al., 2022; Fields, 2024). One example comes from amoeboid chemotaxis (Parent and Devreotes, 1999; Iglesias and Devreotes, 2008). Under our problem space formalism, a migrating Dictyostelium cell navigates a shallow cyclic-AMP field; its membrane positions can instantiate states \(S\).
Specifically, \(S\) is parameterized as a two-dimensional lattice of \(\approx500\) cortical patches, and each patch's occupancy probability is updated at 0.3 s intervals, matching the cAMP equilibration time derived from \(D_{\mathrm{cAMP}} \approx 3 \times 10^{-10}\,\mathrm{m}^2\,\mathrm{s}^{-1}\) (Bhowmik et al., 2016). Operators \(O\) could correspond to Arp2/3- and SCAR/WAVE-driven dendritic-actin bursts that nucleate \(\approx 3\,\mu\mathrm{m}\) pseudopods roughly every 15 s, as measured by live-cell actin-YFP imaging and automated pseudopod tracking in Dictyostelium (Bosgraaf and Van Haastert, 2009; Van Haastert and Bosgraaf, 2009; Veltman et al., 2012). Constraints \(C\) could be realized by cortical tension and membrane integrity, which block protrusions liable to tear the specialized layer of cytoplasm located just beneath the plasma membrane, i.e., the cell's cortex (Chugh and Paluch, 2018). More precisely, \(C\) comprises a tensile ceiling of \(\approx 800\,\mathrm{pN}\,\mu\mathrm{m}^{-1}\) beyond which actin-driven protrusions stall, and a membrane-area conservation penalty reflecting lipid-bilayer incompressibility (Herant and Dembo, 2010). Then, the thermodynamic cost associated with motility (e.g., ATP hydrolysis per unit distance) provides a metric for the evaluation functional \(E\) (to be minimized). Finally, the effective planning horizon \(H\) is constrained by factors such as the diffusion time of the attractant across the cell diameter or the persistence time of exploratory structures. Numerically, for a \(10\,\mu\mathrm{m}\) Dictyostelium cell, the characteristic diffusion time of cAMP across its diameter can be estimated as \(\tau \approx L^2/D \approx (10^{-5}\,\mathrm{m})^2 / (3\times 10^{-10}\,\mathrm{m}^2\,\mathrm{s}^{-1}) \approx 0.33\) s, using the diffusion coefficient \(D_{\mathrm{cAMP}}=1.8\times10^{-8}\,\mathrm{m}^{2}\,\mathrm{min}^{-1}\) (Bhowmik et al., 2016), which is equivalent to the \(3\times 10^{-10}\,\mathrm{m}^2\,\mathrm{s}^{-1}\) employed in the earlier model by Höfer et al. (1995). Interpreting this timescale with a hypothetical time step \(\Delta t \approx 0.3\) s, commensurate with the cell's diffusion timescales, implies an effective predictive horizon \(H\approx1\). Note that, since the horizon \(H\) is a new concept, its numerical estimation can only rely on educated guesses based on existing empirical literature and formal models. Caveats notwithstanding, \(P\) is expressive enough to capture amoeboid chemotaxis without presupposing explicit human-level representation as in the classical work of Newell and Simon (1972).

5.2 How search efficient is amoeboid chemotaxis?
A Dictyostelium cell sensing a cyclic-AMP gradient must move roughly ten cell lengths to reach a nutrient patch. First, for the blind search cost \(\tau_{\mathrm{blind}}\), we estimate the time taken by a random walk. Using a conservative random-motility coefficient \(D_{\mathrm{cell}}\in[30,40]\,\mu\mathrm{m}^2/\mathrm{min}\) (empirically bracketed by single-cell tracking under normoxic and mildly hypoxic conditions as per Cochet-Escartin et al. (2021)), the mean first-passage time of an unbiased walk over ten cell lengths (\(L=100\,\mu\mathrm{m}\)) is \(\tau_{\mathrm{blind}}\approx L^2/D\approx(1.75\pm0.25)\times10^4\) s. Compared to this empirically estimated null model, experimental work shows that amoeboid chemotaxis closes the same gap in \(\tau_{\mathrm{agent}}\approx100\) s (Parent and Devreotes, 1999; Levine and Rappel, 2013).
Plugging these values into Eq. (2), we have \(K_{\mathrm{amoeba}} = \log_{10}\bigl(\tau_{\mathrm{blind}}/\tau_{\mathrm{agent}}\bigr) \approx 2.18\)–\(2.30\), meaning approximately 150–200 times more efficient (corresponding to 7.2–7.6 bits of path-information gain), which sits comfortably within the physical sensing bounds set by correlation-time noise (Endres and Wingreen, 2008; Hu et al., 2010). This calculated range shows that moderate uncertainty in the random-motility coefficient \(D_{\mathrm{cell}}\) perturbs \(K\) by \(< 0.13\), which indicates that our metric is robust to at-the-bench measurement error. The choice of the formula for mean first-passage time (MFPT) from a diffusive process warrants technical comment. For a 1D random walk, the mean square displacement is \(\langle x^2\rangle=2Dt\). The MFPT to reach a distance \(L\) for an absorbing boundary is often given as \(\tau=L^2/(2D)\). For 2D or 3D searches, the prefactor in the denominator may change (e.g., to \(4D\) under certain approximations for 2D). Thus, the formula \(\tau\approx L^2/D\) used here represents a particular convention or approximation for the effective search time. Using, for instance, \(\tau\approx L^2/(2D)\) would halve the \(\tau_{\mathrm{blind}}\) estimates. For the given \(D\), this alternative formula would yield \(\tau_{\mathrm{blind}}\in[0.75\times10^4,1.0\times10^4]\,\mathrm{s}\), and \(K\in[\log_{10}(75),\log_{10}(100)]\approx[1.88,2.00]\). As \(K\) is a logarithmic ratio, this prefactor choice merely introduces an additive constant to \(K\), i.e., \(\log_{10}(2)\approx0.3\). As we remark in Sect. 4, consistency in defining \(\tau_{\mathrm{blind}}\) is a crucial aspect when comparing systems or assessing the impact of specific adaptations. Luckily, as can be seen from the reparameterization above, the order of magnitude for \(K\) often remains robust to such variations in the precise null model specification, which is highly relevant to the difficult operationalization questions of \(\tau_{\mathrm{blind}}\) generally.
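The amoeboid estimate is straightforward to reproduce; a minimal sketch using the values above (code and variable names are ours; both the \(L^2/D\) convention and the prefactor sensitivity are shown):

```python
import math

def mfpt(L_um: float, D_um2_s: float, prefactor: float = 1.0) -> float:
    """Mean first-passage time of an unbiased walk, tau ~ L^2 / (prefactor * D)."""
    return L_um**2 / (prefactor * D_um2_s)

L = 100.0                              # ten cell lengths, micrometers
tau_agent = 100.0                      # observed chemotactic crossing time, seconds

for D_per_min in (30.0, 40.0):         # random-motility coefficient, um^2/min
    D = D_per_min / 60.0               # convert to um^2/s
    tau_blind = mfpt(L, D)             # L^2/D convention used in the text
    print(f"D={D_per_min:.0f}: tau_blind={tau_blind:.2e} s, "
          f"K={math.log10(tau_blind / tau_agent):.2f}")
# -> tau_blind ~ 1.5e4-2.0e4 s, K ~ 2.18-2.30; the L^2/(2D) convention
#    subtracts log10(2) ~ 0.30, giving K ~ 1.88-2.00.
```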
6 A model of search efficiency in the problem space of planarian regeneration
6.1 A problem space for Dugesia head regeneration
Upping the scale, planarian head regeneration (Reddien and Sánchez Alvarado, 2004; Reddien, 2018) is another non-mainstream candidate for problem space searching. State-of-the-art experiments demonstrate that planarian flatworms can adapt their regenerative mechanisms to guide cells toward target morphologies despite specific perturbations not typically encountered during evolution, e.g., transient exposure to particular ion channel blockers such as barium (Fig. 5) (Beane et al., 2013; Cervera et al., 2018; Levin et al., 2021; Levin, 2023a). Our problem space formalism can also accommodate tissue-level morphogenesis and shows how morphological priors constrain the search.

Fig. 5 Bioelectrically-encoded representations in planaria. Control planaria exhibit expression of anterior marker genes in the head (A, green arrowhead indicates head, pink arrowhead indicates tail end), and possess a bioelectric pattern (visualized here with voltage-sensitive fluorescent dye, green = depolarized) (B) that indicates that complete worms should have exactly 1 head. When a worm is amputated (C), the middle fragment reliably regenerates worms with 1 head (D). However, when the bioelectric pattern is altered via exposure to an ionophore, animals are anatomically normal (1-headed) and exhibit head markers normally, meaning only on one end (green arrowhead), but when cut, give rise to 2-headed animals as indicated by their new pattern memory (Durant et al., 2017, 2019). This change is permanent: they will continue to generate 2-headed animals in future rounds of cutting (Oviedo et al., 2010). These data show that a single worm body can store (at least) one of two different patterns that control how they will regenerate in the future (E), and reveal that the bioelectric pattern is not an indicator of current state, but a representation (memory) of the morphogenetic target morphology that will be recalled in the future if the animal is injured. Crucially, this is a counterfactual representation that gives a sense of how the thick notion of cognition presupposing intensionality (Adams, 2018) could be instantiated in unconventional substrates such as flatworms (see discussion in Sect. 2). Moreover, planaria have an intrinsic capacity to adjust their electrophysiology as well (F), identifying and then up- and down-regulating a handful of genes that enable them to regenerate heads that are insensitive to an exotic toxin that destroys their native head (Emmons-Bell et al., 2019). Panel E by Jeremy Guay of Peregrine Creative.

Thus, translated into \(P\), the spatial distribution of cell types and signalling molecules defining the body plan defines \(S\). More concretely, \(S\) can be approximated by a low-dimensional vector \(s(t) = \big\langle \rho_i(t),\, V_{\mathrm{mem},j}(t)\big\rangle\) whose first block stores regional neoblast and differentiated-cell densities \(\rho_i\) measured by BrdU (5-bromo-2'-deoxyuridine) incorporation and fluorescence-activated cell sorting (FACS), and whose second block records anterior-posterior voltage profiles \(V_{\mathrm{mem}}\) obtained with voltage-sensitive dyes (Wenemoser and Reddien, 2010; Emmons-Bell et al., 2019). Next, transcriptional programs and cell migrations constitute \(O\). For example, neoblast division (\(\approx\) 6 h inter-mitotic time), directed migration at \(3\,\mu\mathrm{m}/\mathrm{h}\) to \(6\,\mu\mathrm{m}/\mathrm{h}\), and lineage-specific differentiation each supply elementary operators \(o_{i}\) with empirically determined work costs in ATP equivalents (Scimone et al., 2014; Reddien, 2018). Constraints \(C\) are realized by developmental polarity rules and gap junction communication patterns. Polarity constraints derive from Wnt/\(\beta\)-catenin gradients that bias head–tail fate: RNAi against \(\beta\)-catenin, pharmacological closure of innexin-11 gap junctions, or direct modification of the bioelectric prepattern with ionophores or ion channel drugs (Beane et al., 2011; Durant et al., 2019) shifts the collective outcomes and yields double-headed morphologies (Petersen and Reddien, 2009; Williams et al., 2020; Nogi and Levin, 2005). This illustrates how relaxing constraints \(C\) enlarges reachability (i.e., different \(s_{i} \in S_{\mathrm{goal}}\)) in problem space \(S\). As in the chemotactic case, mechanical integrity adds an independent ceiling: tissue surface tension of \(\approx 0.6\,\mathrm{mN\,m}^{-1}\) limits blastema curvature (Birkholz et al., 2019), and thus bounds operator \(o_{i} \in O\) amplitudes.
Further, \(E\) can be realised by the deviation of the current shape from the target morphogenetic pattern, potentially quantifiable via a variational free energy measure (Kuchling et al., 2020). One conservative evaluation functional could be the squared error between the live worm's length-to-width ratio and the clonal mean ratio recorded for uninjured controls, an index routinely used to score shape fidelity during regeneration (Birkholz et al., 2019). Finally, the turnover time of neoblast progenitors constrains \(H\) qua morphological planning. In planarian regeneration, the median G2 duration of neoblasts is roughly 6 h (Newmark and Sánchez Alvarado, 2000; Wenemoser and Reddien, 2010), so, with a discretization \(\Delta t=1\,\mathrm{s}\) matching cell-level actions, the morphological horizon is \(H \approx 2.2 \times 10^{4}\) operator cycles. Contrasting Dictyostelium's \(H \approx 1\) with planaria's \(H \approx 2.2 \times 10^{4}\) underscores a four-order-of-magnitude expansion in predictive depth, which shows that \(H\) preserves experimentally validated dimensional consistency, lending credence to the point that inference timescales recapitulate intrinsic delay lines. Importantly, transcriptional adaptation in barium-exposed planaria reveals highly efficient search policies in high-dimensional gene-expression spaces (Emmons-Bell et al., 2019) (estimated in the following subsection). In other words, when planaria mount a response to the barium-induced disruption of bioelectric signalling necessary for regeneration, they do not randomly test all possible gene expression combinations, which would be astronomically impractical. Indeed, RNA-sequencing shows that approximately \(1.98\%\) of the transcriptome is differentially expressed during \(\mathrm{BaCl_{2}}\) adaptation (\(q < 0.05\), \(> 2\)-fold change), indicating targeted operator selection rather than wholesale search (Emmons-Bell et al., 2019). That is, planaria rapidly identify and modulate a specific subset of transcripts needed to partially restore or compensate for disrupted physiological homeostasis in the presence of a novel ion channel blocker, demonstrating efficient adaptation suggestive of intelligent exploration of the problem space. Here are a few extrapolations from the results above. First, experimental data support the hypothesis that, in some cases, editing constraints \(C\) can yield larger efficiency gains than adding operators \(O\), which we illustrated above via voltage-gated ion-channel editing in Dugesia under \(\mathrm{BaCl_2}\) (Emmons-Bell et al., 2019). Indeed, constraints are emerging as a critical aspect of biological richness (Deacon, 2012; Montévil and Mossio, 2015; Bechtel, 2018; Juarrero, 2023; Ross, 2023). Thus, in our examples, relaxing membrane tension or bioelectric rules can expand reachability more than duplicating moves. Put differently, while more-of-the-same (e.g., copying an operator) increases robustness by introducing redundancies, it also incurs costs without any added novelty, forcing a trade-off; formally, this corresponds to Bayesian model selection and program induction in statistics and computer science (Tenenbaum et al., 2011; Lake et al., 2015). Second, as noted, intelligent behavior frequently hinges on problem reformulation.
Here are a few extrapolations from the results above. First, experimental data support the hypothesis that, in some cases, editing constraints \(C\) can yield larger efficiency gains than adding operators \(O\), which we illustrated above via voltage-gated ion-channel editing in Dugesia under BaCl\(_2\) (Emmons-Bell et al., 2019). Indeed, constraints are emerging as a critical aspect of biological richness (Deacon, 2012; Montévil and Mossio, 2015; Bechtel, 2018; Juarrero, 2023; Ross, 2023). Thus, in our examples, relaxing membrane tension or bioelectric rules can expand reachability more than duplicating moves. Put differently, while more-of-the-same (e.g., copying an operator) increases robustness by introducing redundancies, it also incurs costs without any added novelty, forcing a trade-off; formally, this corresponds to Bayesian model selection and program induction in statistics and computer science (Tenenbaum et al., 2011; Lake et al., 2015). Second, as noted, intelligent behavior frequently hinges on problem reformulation. Indeed, modifying \(O\) or \(C\) re-tiles the landscape and shortens optimal paths, a tactic long appreciated in human planning and problem-solving (Newell and Simon, 1972) yet whose biological analogs beyond behavioral flexibility remain relatively under-explored. Third, depth arises when progress in one space sculpts the optimiser that operates in another (Fields and Levin, 2022), producing a hierarchy of interleaved spaces whose mutual constraints define an optimization stack.

6.2 How search-efficient is planarian regeneration?

Is Dugesia japonica head regeneration, when exposed to 1 mM barium chloride, also search-efficient within its specific problem space when judged against an explicit random-search baseline? Emmons-Bell et al. (2019) show that continuous BaCl\(_2\) abolishes anterior tissue within 72 h in \(\approx 83\%\) of worms, producing a sharp wound plane at the photoreceptors. A blastema first appears after about 15 days, and a morphologically normal but BaCl\(_2\)-tolerant head is complete by day 37. If these adapted worms spend 30 days in freshwater, the tolerance disappears, and a second BaCl\(_2\) exposure again destroys the head within 24 h, showing that the phenotype is plastic, not genetically fixed (Levin, 2023a). As above, RNA sequencing on fully regenerated, BaCl\(_2\)-insensitive heads identified differential expression in 1.98% of the 138,026 annotated D. japonica coding sequences: about 2,700 transcripts. This regulated cohort of transcripts is enriched for bioelectric effectors; for example, the TRPM\(\alpha\) channel is newly expressed, whereas several innexins and tubulins are sharply down-regulated. Such a pattern points to a targeted rewiring of ionic conductances rather than wholesale transcriptional editing (Emmons-Bell et al., 2019). This is consistent with pharmacological data showing that calcium- or chloride-channel blockade prevents the initial BaCl\(_2\) degeneration and that TRPM inhibition erases the acquired resistance (Emmons-Bell et al., 2019). To gauge the search speed-up of this adaptation, we consider a very conservative null model. Suppose resilience requires a concerted change in just ten of the 2,700 BaCl\(_2\)-responsive genes. The search space then contains \(\binom{2700}{10}\approx 5.6\times10^{27}\) distinct ten-gene combinations. Neoblasts, which are the only transcriptionally plastic cells, constitute roughly one-third of the body and number on the order of \(10^5\) in a decapitated fragment; each completes a division cycle in about 30 h at \(13\,^\circ\mathrm{C}\). Thus, even if every neoblast explored a new ten-gene pattern each cycle, an unbiased walk would require \(5.6\times10^{22}\) such rounds to sample the entire space once, which is about \(1.9\times10^{20}\) years, corresponding to a random-searcher estimate of \(\tau_{\mathrm{blind}} \approx 6 \times 10^{27}\,\mathrm{s}\). The empirical trajectory, by contrast, converges on a viable solution in 37 days (Emmons-Bell et al., 2019, Fig. 1A–D), which gives \(\tau_{\mathrm{agent}} \approx 3.2\times10^{6}\,\mathrm{s}\). A simple calculation using our Eq. 2 yields a search efficiency \(K = \log_{10}\!\bigl(6\times10^{27}\,\mathrm{s}\,/\,3.2\times10^{6}\,\mathrm{s}\bigr)\approx 21\): the adaptation is roughly \(10^{21}\) times more efficient than the null model, corresponding to about 70 bits of path-information gain.
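The arithmetic behind these numbers is easy to check; here is a short verification sketch (Python; mine, not the authors' code):

```python
from math import comb, log10, log2

# Null model: resilience needs a concerted change in 10 of the 2,700
# BaCl2-responsive genes.
combos = comb(2700, 10)          # ~5.6e27 distinct ten-gene combinations

neoblasts = 1e5                  # transcriptionally plastic cells per fragment
cycle_s = 30 * 3600              # one division cycle ~30 h at 13 degrees C
rounds = combos / neoblasts      # ~5.6e22 rounds to enumerate the space once
tau_blind = rounds * cycle_s     # ~6e27 s (~1.9e20 years)

tau_agent = 37 * 86_400          # observed adaptation: 37 days ~ 3.2e6 s

K = log10(tau_blind / tau_agent)     # Eq. 2: search efficiency
bits = log2(tau_blind / tau_agent)   # path-information gain
print(f"tau_blind = {tau_blind:.1e} s, K = {K:.1f}, bits = {bits:.0f}")
```

Running it prints \(K \approx 21.3\) and roughly 71 bits, matching the rounded values quoted above.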
Thus, even when the baseline is set by an extravagantly conservative random walk, which greatly underestimates \(K\), the worm's weeks-long developmental program prunes roughly twenty-one orders of magnitude of futile exploration in problem space. Two additional technical remarks. First, the calculation deliberately underestimates both the dimensionality of the ion-channel manifold (e.g., many regulators never reach significance in the bulk RNA-seq) and the combinatorial complexity of downstream post-translational control. Hence, \(K\approx 21\) should be read as a minimal, empirically derived bound on intelligent search. Second, the estimate already discounts the massive parallelism of \(10^5\) neoblasts; without it, \(\tau_{\mathrm{blind}}\) stretches by another five orders of magnitude, significantly increasing \(K\).

7 Conclusion

Zooming out, the search-efficiency construct expresses intelligence in the combinatorial geometry of problem spaces. The quintuple \(P\) delineates the search landscape, whereas \(K\) records, on a logarithmic scale, the extent to which an agent prunes the futile branches of that space relative to a maximal-entropy walk. Yet specifying the blind walk is itself an inference problem: one must commit to a cost metric \(w\), a constraint set \(C\), and an operator alphabet \(O\) that are simultaneously faithful to the biological scale under scrutiny and commensurate with the null model. To give a neurobiological example: for a cortical microcircuit of \(10^{4}\) neurons and \(10^{6}\) synapses, should the random walk wander through synaptic-weight vectors, firing-rate trajectories, or entire spike sequences? Each choice alters \(\lvert S\rvert\) by orders of magnitude and thus shifts \(K\) by an additive constant. The upshot is that these modeling contingencies must be made transparent; otherwise, convincing skeptics such as Figdor (2022) that cross-lineage comparisons are methodologically sound amounts to hand-waving. \(K\)'s virtue lies precisely in forcing such commitments into the open and rendering their quantitative impact explicit. When those commitments are made conservatively, as in the amoeba and planarian exemplars above, seemingly simple organisms still register gains of many orders of magnitude over chance, which gives preliminary modeling reasons to seat them at the cognitive table (Barron et al., 2023; Lyon et al., 2021; Rorot, 2022; Lyon and Cheng, 2023; Seifert et al., 2024). As a final point, to parry the "freewheeling use of functional ascriptions" criticism by Figdor (2022), we note that, while our proposal is substrate-agnostic at the level of the \(P+K\) calculus, the empirical models we proposed showcase that biological efficiency is ultimately realised by substrate-involving mechanisms that compute with a model. In practice, cells and tissues implement generative-model computations (e.g., via ion channels, bioelectric circuits, gene-regulatory and cytoskeletal dynamics) that evaluate options over a finite prediction horizon and thereby select paths of least action or, equivalently, maximal efficiency. This "model computation" explains why realizers matter: bioelectric and morphological priors and constraints sculpt the space of reachable states that define the problem and create search-efficiency gradients; conversely, editing constraints or operators (as in planarian bioelectric reprogramming) re-tiles the landscape and shortens optimal paths.
In this sense, the realizer is not simply a carrier of dynamics ("just physics") but rather the physical possibility condition for there being a problem and the computational means by which problem-solving efficiency is achieved. Therefore, the current paper serves, fundamentally, as a challenge: if what is made measurable and quantifiable here is not cognition, then what is? We re-examined the diverse intelligence research program (Levin et al., 2021; Lyon et al., 2021; Fields and Levin, 2022; Levin, 2022, 2023a; Lyon and Cheng, 2023) through the lens of combinatorial search theory. After a conceptual roadmap in Sect. 3, Sect. 4 introduced a scale-agnostic quintuple \(P=\langle S,O,C,E,H\rangle\) that reformulates classical problem-space analysis so that constraints, evaluation functionals, and predictive horizons are included besides states and operators. On that foundation, we defined search efficiency, \(K\), as the logarithmic ratio of the blind-search time \(\tau_{\mathrm{blind}}\) to the empirically observed solution time \(\tau_{\mathrm{agent}}\).

=== Tweet by Anil Seth (https://x.com/anilkseth/status/1987894412584566876?t=0JSx8bvFJGpJy6NR5Xh7kw&s=03) ===
Tweet by Anil Seth: "Cognition all the way down". Great to see this fine new paper from @robertchisciure & @drmichaellevin out now in Synthese - it introduces a new metric to quantify biological intelligence as search efficiency in multidimensional problem spaces https://t.co/YH7W3F9FtQ pic.twitter.com/UKIULAD5wU

=== Tweet by Ethan Mollick (https://x.com/i/status/2000413758729162889) ===
Tweet by Ethan Mollick: In general, all the chatbots seem to struggle with files in a way that CLI versions do not. Gemini will frequently confuse which nano banana image you are referring to in a conversation (the chain of thought shows it loses track) and ChatGPT often misplaces files that it makes

=== Tweet by Ethan Mollick (https://x.com/i/status/1998778005129224412) ===
Tweet by Ethan Mollick: An understudied, high-value issue in agents is how to have them write the best summaries of what they have done so far (think Claude's compress feature). These serve as what organizational theorists call "boundary objects" which preserve key meaning across different AI "selves"

=== 42 notes on AI & work - by Jasmine Sun (https://jasmi.news/p/42-notes-on-ai-and-work) ===

🌻 42 notes on AI & work: economic anxiety at the jagged frontier. Jasmine Sun, Aug 12, 2025.

Hello from Shanghai! One of my main preoccupations this year has been AI's labor impacts. It's been a bit whiplashy—I went from intense concern to feeling slightly more sanguine by the time everyone started freaking out about the "new grad jobs crisis." I'm planning some deeper investigations into these questions for when I'm back (e.g. what does an "AI-native firm" actually look like in practice?), but in the interim, some more plane-ride aphorisms since you all liked the last :)

I first became anxious about AI and labor impacts in February when my younger sister told me how hard it was for her friends to find jobs. Consulting and Big Tech SWE roles were no longer guaranteed; new grads were falling into masters programs instead, delaying their adulthood until the job market improved. If Stanford CS majors can't get hired, what about everyone else? I asked economist friends whether to fret.
"Probably cyclical trends," they said with a shrug. The "new grad job crisis" is just a correction for Covid-era overhiring. But if Excel jockeys are automated faster than professional services firms grow, junior hiring won't bounce back to previous rates. As a friend at McKinsey put it to me: "We're blocked on deals, not slides."

Substack laid off half its customer support team when the tech recession hit. Soon after, we started using Decagon—an AI chatbot trained on past tickets and resources—to handle the vast majority of requests. The remaining support agents transitioned their focus to higher-priority and higher-complexity problems. When the economy stabilized, hiring resumed. But we no longer needed to linearly scale the support team with the number of tickets. We just needed fewer humans than before.

You don't need mass unemployment to inspire mass fear—merely its shadow is enough. In the Hollywood and port strikes last year, the vague prospect of automation was enough to spur workers to organize. In both cases, a critical worldwide industry was brought to a halt. What if the same thing happened with teachers? Drivers? Doctors? More? Then again, most American industries aren't organized as ports are.

Covid was a natural experiment in what happens when everyone's sent home with a check and nothing to do. Weed, sports gambling, riots, conspiracy. Our culture has been built on the structure and meaning of work. It'll take more than UBI to cure this kind of rot.

I don't think policymakers would tolerate job loss past 15%. At that point, they'd step in to start slowing shit down. If there's anything American voters care about, it's keeping their jobs. We've already seen the backlash against immigrant and offshored labor. If non-white people are intolerably alien, what about getting outcompeted by machines?

Most AI backlash is economic anxiety coated in a veneer of social justice. Alfalfa farming consumes 19 times the water that data centers do; there's no sound environmental reason to boycott Claude but not GPS. When people say "AI is a moral stain," they really mean: I am scared that I won't be able to pay my bills.

To be fair, the labs are definitely trying to automate everyone's jobs. I roll my eyes when people demand we build AI to "augment and not replace" us. This is a platitude, wishful thinking; it is not a reality most workers can choose. If the tech is good and cheap enough to replace us, it will. Economic incentives are a hell of a drug.

Carl Benedikt Frey: "There is no iron law that postulates that technology must benefit the many at the expense of the few." Liberal democracy teeters on the tie between labor and growth.

Fortunately for humans, AI capabilities look pretty jagged so far. Moravec's paradox: "It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility." Only in tech would we measure a person's value by high school math medals. Does a calculator count as superintelligence?

Most jobs involve complex bundles of tasks. Thus, the speed and scope of automation matter: if AI can do a whole job at once, it'll be eliminated. If it automates only one task at a time, the job will just evolve around it. Consider a Ship of Theseus: if a job has all its tasks replaced over time, is it still the same job? The whole is more and less than the sum of its parts.
Noah Smith: "Dystopia is when robots take half your jobs. Utopia is when robots take half your job." We still have radiologists, but not lamplighters.

In my last week of work as a product manager, I realized I didn't have a single task to document and offboard. I wasn't hired to write PRDs, lead standups, or run user interviews—each could be competently done by someone else. My role was relational, not task-based. Someone had to be the fall guy; someone had to herd the cats.

Bureaucrats have always dreamed of simplification. If only people could be compressed into tidy units, processed as input-output flows. If only there were a standard number—IQ, SATs, civil service exams—that could quantify a human's economic potential; if only every employee were fungible with every other. The map is not the territory. The org chart is not the org. Systems are much more unruly than they appear.

Another common argument says that AI capabilities are fast but diffusion is slow. Supposedly, regulations, backlash, and laziness get in the way of adoption; most people are change-averse decels who won't admit when a robot does better. But it didn't take students long to start ChatGPTing all their homework. If AI could write my emails for me, I'd certainly let it. "Diffusion lag" reflects a lack of product-market fit.

Even AI optimists are still hitting practical roadblocks. That's why detailed case studies are so much fun: physics, code security, running a restaurant at a small independent hotel. James C. Scott defined mētis as "the kind of knowledge that can be acquired only by long practice at similar but rarely identical tasks, which requires constant adaptation to changing circumstances. Half the battle is knowing which rules of thumb to apply in which order and when to throw the book away and improvise." The real world is all edge cases, all the time.

Increasingly, fewer jobs will look like doing tasks ourselves, and more will involve teaching AIs to do them for us. How can we transfer context to the machine? Can they adopt the values and instincts we've evolved over millennia to have? When you pair with a model, will it remember what it sees? Can you teach taste? Creativity? Learning to learn? This is the great pedagogical project of our time.

A January 2027 forecast: "Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2. On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks." Our friendly hotel purveyor describes one such long-horizon task: "To replicate [chef] Hagai's context, you'd need entire recipes, or maybe video of him preparing the foods; Toast sales data, or maybe video of the dining room; our hours; his calendar, featuring private events; communications among staff about what's getting used for what; the CSVs for Baldor; the paper receipts for quick runs to Loeb's; and maybe surveillance footage to capture exceptions."

Jack Morris: "There are no new ideas in AI, only new datasets." What makes a domain automatable? Training data, deployment ease, clear criteria for quality and reward. If the eval exists, the model can do it. What makes a lab decide to master a domain? Enterprise demand, marketing splash, whether it'll make potential hires say holy shit. (Coding, Studio Ghibli, high school math.) We all know the perils of teaching to the test. Stuffing AI into human-shaped jobs still seems like fitting square pegs into round holes.
We've got to get the humans out of machine-shaped jobs. No one's destiny is locked in at 18. Societies should make lifelong learning and continuing education a more serious bet. Progress always comes with pain. Both human and machine intelligence seem infinite to me. I cover AI from a humanist perspective. Thanks for reading, Jasmine

Dirk Hohnstraeter (Aug 12): No gatekeeper saying we don't do plane-ride aphorisms, no SEO expert demanding h2 headlines. Just dense, stimulating text. More of that, please!

Human - AI Cognitive Evolution (Aug 14): Most jobs are a complicated bundle of tasks. The complex part is that humans are needed / used to suffer through the complication that often borders on chaos, and orgs don't want to have to work on that part.

=== I got fooled by AI-for-science hype—here's what it taught me (https://www.understandingai.org/p/i-got-fooled-by-ai-for-science-hypeheres) ===

I got fooled by AI-for-science hype—here's what it taught me. I used AI in my plasma physics research and it didn't go the way I expected. Nick McGreivy, May 19, 2025.

I'm excited to publish this guest post by Nick McGreivy, a physicist who last year earned a PhD from Princeton. Nick used to be optimistic that AI could accelerate physics research. But when he tried to apply AI techniques to real physics problems the results were disappointing. I've written before about the Princeton School of AI Safety, which holds that the impact of AI is likely to be similar to that of past general-purpose technologies such as electricity, integrated circuits, and the Internet. I think of this piece from Nick as being in that same intellectual tradition. —Timothy B. Lee

In 2018, as a second-year PhD student at Princeton studying plasma physics, I decided to switch my research focus to machine learning. I didn't yet have a specific research project in mind, but I thought I could make a bigger impact by using AI to accelerate physics research. (I was also, quite frankly, motivated by the high salaries in AI.) I eventually chose to study what AI pioneer Yann LeCun later described as a "pretty hot topic, indeed": using AI to solve partial differential equations (PDEs). But as I tried to build on what I thought were impressive results, I found that AI methods performed much worse than advertised.

At first, I tried applying a widely-cited AI method called PINN to some fairly simple PDEs, but found it to be unexpectedly brittle. Later, though dozens of papers had claimed that AI methods could solve PDEs faster than standard numerical methods—in some cases as much as a million times faster—I discovered that a large majority of these comparisons were unfair. When I compared these AI methods on equal footing to state-of-the-art numerical methods, whatever narrowly defined advantage AI had usually disappeared. This experience has led me to question the idea that AI is poised to "accelerate" or even "revolutionize" science.
Are we really about to enter what DeepMind calls "a new golden age of AI-enabled scientific discovery," or has the overall potential of AI in science been exaggerated—much like it was in my subfield? Many others have identified similar issues. For example, in 2023 DeepMind claimed to have discovered 2.2 million crystal structures, representing "an order-of-magnitude expansion in stable materials known to humanity." But when materials scientists analyzed these compounds, they found it was "mostly junk" and "respectfully" suggested that the paper "does not report any new materials." Separately, Princeton computer scientists Arvind Narayanan and Sayash Kapoor have compiled a list of 648 papers across 30 fields that all make a methodological error called data leakage. In each case data leakage leads to overoptimistic results. They argue that AI-based science is facing a "reproducibility crisis."

Yet AI adoption in scientific research has been rising sharply over the last decade. Computer science has seen the biggest impacts, of course, but other disciplines—physics, chemistry, biology, medicine, and the social sciences—have also seen rapidly increasing AI adoption. Across all scientific publications, rates of AI usage grew from 2 percent in 2015 to almost 8 percent in 2022. It's harder to find data about the last few years, but there's every reason to think that hockey-stick growth has continued.

To be clear, AI can drive scientific breakthroughs. My concern is about their magnitude and frequency. Has AI really shown enough potential to justify such a massive shift in talent, training, time, and money away from existing research directions and towards a single paradigm?

Every field of science is experiencing AI differently, so we should be cautious about making generalizations. I'm convinced, however, that some of the lessons from my experience are broadly applicable across science: AI adoption is exploding among scientists less because it benefits science and more because it benefits the scientists themselves. Because AI researchers almost never publish negative results, AI-for-science is experiencing survivorship bias. The positive results that get published tend to be overly optimistic about AI's potential. As a result, I've come to believe that AI has generally been less successful and revolutionary in science than it appears to be.

Ultimately, I don't know whether AI will reverse the decades-long trend of declining scientific productivity and stagnating (or even decelerating) rates of scientific progress. I don't think anyone does. But barring major (and in my opinion unlikely) breakthroughs in advanced AI, I expect AI to be much more a normal tool of incremental, uneven scientific progress than a revolutionary one.

My disappointing experience with PINNs

In the summer of 2019, I got a first taste of what would become my dissertation topic: solving PDEs with AI. PDEs are mathematical equations used to model a wide range of physical systems, and solving (i.e., simulating) them is an extremely important task in computational physics and engineering. My lab uses PDEs to model the behavior of plasmas, such as inside fusion reactors and in the interstellar medium of outer space. The AI models being used to solve PDEs are custom deep learning models, much more analogous to AlphaFold than ChatGPT. The first approach I tried was something called the physics-informed neural network. PINNs had recently been introduced in an influential paper that had already racked up hundreds of citations. PINNs were a radically different way of solving PDEs compared to standard numerical methods. Standard methods represent a PDE solution as a set of pixels (like in an image or video) and derive equations for each pixel value. In contrast, PINNs represent the PDE solution as a neural network and put the equations into the loss function.
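For readers who have not seen one, the sketch below (assuming PyTorch; an illustration of the general recipe, not the original paper's code) shows the idea for 1D Burgers' equation, \(u_t + u u_x = \nu u_{xx}\): a small network represents \(u(t,x)\), automatic differentiation supplies the PDE residual, and that residual is penalized in the loss alongside the initial condition (boundary terms omitted for brevity).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# u(t, x) is represented by a small fully connected network.
net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

nu = 0.01 / torch.pi  # viscosity, as in the classic Burgers' benchmark

def pde_residual(t, x):
    """Residual of u_t + u*u_x - nu*u_xx at collocation points (t, x)."""
    u = net(torch.cat([t, x], dim=1))
    u_t, u_x = torch.autograd.grad(u, [t, x], torch.ones_like(u), create_graph=True)
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    # Random interior collocation points: t in [0, 1], x in [-1, 1].
    t = torch.rand(256, 1, requires_grad=True)
    x = 2 * torch.rand(256, 1, requires_grad=True) - 1
    # Initial condition u(0, x) = -sin(pi * x), enforced as a penalty term.
    x0 = 2 * torch.rand(256, 1) - 1
    u0 = net(torch.cat([torch.zeros_like(x0), x0], dim=1))
    loss = (pde_residual(t, x) ** 2).mean() \
        + ((u0 + torch.sin(torch.pi * x0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The elegance is real: nothing in the network architecture knows Burgers' equation, only the loss does. The brittleness McGreivy describes shows up when the competing loss terms need careful reweighting, or simply fail to converge, on a new PDE.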
As a naive grad student who didn't even have an advisor yet, there was something incredibly appealing to me about PINNs. They just seemed so simple, elegant, and general. They also seemed to have good results. The paper introducing PINNs found that their "effectiveness" had been "demonstrated through a collection of classical problems in fluids, quantum mechanics, reaction-diffusion systems, and the propagation of nonlinear shallow-water waves." If PINNs had solved all these PDEs, I figured, then surely they could solve some of the plasma physics PDEs that my lab cared about. But when I replaced one of the examples from that influential first paper (1D Burgers') with a different, but still extremely simple, PDE (1D Vlasov), the results didn't look anything like the exact solution. Eventually, after extensive tuning, I was able to get something that looked correct. However, when I tried slightly more complex PDEs (such as 1D Vlasov-Poisson), no amount of tuning could give me a decent solution. After a few weeks of failure, I messaged a friend at a different university, who told me that he too had tried using PINNs, but hadn't been able to get good results.

What I learned from my PINN experiments

Eventually, I realized what had gone wrong. The authors of the original PINN paper had, like me, "observed that specific settings that yielded impressive results for one equation could fail for another." But because they wanted to convince readers of how exciting PINNs were, they hadn't shown any examples of PINNs failing. This experience taught me a few things. First, to be cautious about taking AI research at face value. Most scientists aren't trying to mislead anyone, but because they face strong incentives to present favorable results, there's still a risk that you'll be misled. Moving forward, I would have to be more skeptical, even (or perhaps especially) of high-impact papers with impressive results. Second, people rarely publish papers about when AI methods fail, only when they succeed. The authors of the original PINN paper didn't publish about the PDEs their method hadn't been able to solve. I didn't publish my unsuccessful experiments, presenting only a poster at an obscure conference. So very few researchers heard about them. In fact, despite the huge popularity of PINNs, it took two years for anyone to publish a paper about their failure modes. That paper now has over a thousand citations, suggesting that many other scientists tried PINNs and found similar issues. Third, I concluded that PINNs weren't the approach I wanted to use. They were simple and elegant, sure, but they were also far too unreliable, too finicky, and too slow. As of today, six years later, the original PINN paper has a whopping 14,000 citations, making it the most cited numerical methods paper of the 21st century (and, by my count, a year or two away from becoming the second most-cited numerical methods paper of all time).
Though it's now widely accepted that PINNs generally aren't competitive with standard numerical methods for solving PDEs, there remains debate over how well PINNs perform for a different class of problems known as inverse problems. Advocates claim that PINNs are "particularly effective" for inverse problems, but some researchers have vigorously contested that idea. I don't know which side of the debate is right. I'd like to think that something useful has come from all this PINN research, but I also wouldn't be surprised if one day we look back on PINNs as simply a massive citation bubble.

Weak baselines lead to overoptimism

For my dissertation, I focused on solving PDEs using deep learning models that, like traditional solvers, treated the PDE solution as a set of pixels on a grid or a graph. Unlike PINNs, this approach had shown a lot of promise on the complex, time-dependent PDEs that my lab cared about. Most impressively, paper after paper had demonstrated the ability to solve PDEs faster—often orders of magnitude faster—than standard numerical methods. The examples that excited my advisor and me the most were PDEs from fluid mechanics, such as the Navier-Stokes equations. We thought we might see similar speedups because the PDEs we cared about—equations describing plasmas in fusion reactors, for example—have a similar mathematical structure. In theory, this could allow scientists and engineers like us to simulate larger systems, more rapidly optimize existing designs, and ultimately accelerate the pace of research.

By this point, I was seasoned enough to know that in AI research, things aren't always as rosy as they seem. I knew that reliability and robustness might be serious issues. If AI models give faster simulations, but those simulations are less reliable, would that be worth the trade-off? I didn't know the answer and set out to find out. But as I tried—and mostly failed—to make these models more reliable, I began to question how much promise AI models had really shown for accelerating PDEs. According to a number of high-profile papers, AI had solved the Navier-Stokes equations orders of magnitude faster than standard numerical methods. I eventually discovered, however, that the baseline methods used in these papers were not the fastest numerical methods available. When I compared AI to more advanced numerical methods, I found that AI was no faster (or at most, only slightly faster) than the stronger baselines. When AI methods for solving PDEs were compared to strong baselines, whatever narrowly defined advantage AI had usually disappeared.

My advisor and I eventually published a systematic review of research using AI to solve PDEs from fluid mechanics. We found that 60 out of the 76 papers (79 percent) that claimed to outperform a standard numerical method had used a weak baseline, either because they hadn't compared to more advanced numerical methods, or because they weren't comparing them on an equal footing. Papers with large speedups all compared to weak baselines, suggesting that the more impressive the result, the more likely the paper had made an unfair comparison. Results from the systematic review showed that very few papers reported negative results, while those reporting positive results mostly compared to weak baselines. We also found evidence, once again, that researchers tend not to report negative results, an effect known as reporting bias.
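The fair-comparison point is easy to state concretely: a speedup claim should divide the AI model's cost by the cost of the cheapest classical configuration that reaches the same error, not by an over-resolved reference run. Here is a toy illustration (Python with NumPy; my sketch, not the review's actual methodology) using the 1D heat equation, where the exact solution is known:

```python
import time
import numpy as np

# u_t = u_xx on [0, pi], u(0)=u(pi)=0, u(x, 0)=sin(x); exact: exp(-t)*sin(x).
def heat_fd(n, t_end=0.5):
    """Explicit finite-difference solve at spatial resolution n."""
    x = np.linspace(0, np.pi, n)
    dx = x[1] - x[0]
    dt = 0.4 * dx**2                   # inside the explicit stability limit
    steps = int(round(t_end / dt))
    u = np.sin(x)
    for _ in range(steps):
        u[1:-1] += dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])
    return x, u, steps * dt            # return the actual final time reached

for n in (32, 64, 128, 256):
    t0 = time.perf_counter()
    x, u, t = heat_fd(n)
    cost = time.perf_counter() - t0
    err = np.max(np.abs(u - np.exp(-t) * np.sin(x)))
    print(f"n={n:4d}  error={err:.2e}  cost={cost * 1e3:7.2f} ms")
```

If a surrogate model reaches, say, 1e-3 error, the honest baseline is the coarsest grid that also reaches 1e-3, which can be orders of magnitude cheaper than the finest run in the table. This is the kind of accounting that, in one way or another, the 79 percent of papers flagged above got wrong.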
We ultimately concluded that AI-for-PDE-solving research is overoptimistic: "weak baselines lead to overly positive results, while reporting biases lead to under-reporting of negative results." These findings sparked a debate about AI in computational science and engineering:

Lorena Barba, a professor at GWU who has previously discussed poor research practices in what she has called "Scientific Machine Learning to Fool the Masses," saw our results as "solid evidence supporting our concerns in the computational science community over the hype and unscientific optimism" of AI.

Stephan Hoyer, the lead of a team at Google Research that independently reached similar conclusions, described our paper as "a nice summary of why I moved on from [AI] for PDEs" to weather prediction and climate modeling, applications of AI that seem more promising.

Johannes Brandstetter, a professor at JKU Linz and co-founder of a startup that provides "AI-driven physics simulations," argued that AI might achieve better results for more complex industrial applications and that "the future of the field remains undeniably promising and brimming with potential impact."

In my opinion, AI might eventually prove useful for certain applications related to solving PDEs, but I currently don't see much reason for optimism. I'd like to see a lot more focus on trying to match the reliability of numerical methods and on red-teaming AI methods; right now, they have neither the theoretical guarantees nor the empirically validated robustness of standard numerical methods. I'd also like to see funding agencies incentivize scientists to create challenge problems for PDEs. A good model could be CASP, a biennial protein-folding competition that helped to motivate and focus research in this area over the last 30 years.

Will AI accelerate science?

Besides protein folding, the canonical example of a scientific breakthrough from AI, a few examples of scientific progress from AI include:[1] weather forecasting, where AI forecasts have had up to 20% higher accuracy (though still lower resolution) compared to traditional physics-based forecasts; and drug discovery, where preliminary data suggests that AI-discovered drugs have been more successful in Phase I (but not Phase II) clinical trials. If the trend holds, this would imply a nearly twofold increase in end-to-end drug approval rates.

But AI companies, academic and governmental organizations, and media outlets increasingly present AI not only as a useful scientific tool, but one that "will have a transformational impact" on science. I don't think we should necessarily dismiss these statements. While current LLMs, according to DeepMind, "still struggle with the deeper creativity and reasoning that human scientists rely on," hypothetical advanced AI systems might one day be capable of fully automating the scientific process. I don't expect that to happen anytime soon—if ever. But if such systems are created, there's no doubt they would transform and accelerate science. However, based on some of the lessons from my research experience, I think we should be pretty skeptical of the idea that more conventional AI techniques are on pace to significantly accelerate scientific progress.

Lessons about AI in science

Most narratives about AI accelerating science come from AI companies or scientists working on AI who benefit, directly or indirectly, from those narratives.
For example, NVIDIA CEO Jensen Huang talks about how "AI will drive scientific breakthroughs" and "accelerate science by a million-X." NVIDIA, whose financial conflicts of interest make them a particularly unreliable narrator, regularly makes hyperbolic statements about AI in science.

You might think that the rising adoption of AI by scientists is evidence of AI's usefulness in science. After all, if AI usage in scientific research is growing exponentially, it must be because scientists find it useful, right? I'm not so sure. In fact, I suspect that scientists are switching to AI less because it benefits science, and more because it benefits them.[2] Consider my motives for switching to AI in 2018. While I sincerely thought that AI might be useful in plasma physics, I was mainly motivated by higher salaries, better job prospects, and academic prestige. I also noticed that higher-ups at my lab usually seemed more interested in the fundraising potential of AI than technical considerations. Later research found that scientists who use AI are more likely to publish top-cited papers and receive on average three times as many citations. With such strong incentives to use AI, it isn't surprising that so many scientists are doing so.

So even when AI achieves genuinely impressive results in science, that doesn't mean that AI has done something useful for science. More often, it reflects only the potential of AI to be useful down the road. This is because scientists working on AI (myself included) often work backwards. Instead of identifying a problem and then trying to find a solution, we start by assuming that AI will be the solution and then look for problems to solve. But because it's difficult to identify open scientific challenges that can be solved using AI, this "hammer in search of a nail" style of science means that researchers will often tackle problems which are suitable for using AI but which either have already been solved or don't create new scientific knowledge.

To accurately evaluate the impacts of AI in science, we need to actually look at the science. But unfortunately, the scientific literature is not a reliable source for evaluating the success of AI in science. One issue is survivorship bias. Because AI research, in the words of one researcher, has "nearly complete non-publication of negative results," we usually only see the successes of AI in science and not the failures. But without negative results, our attempts to evaluate the impacts of AI in science typically get distorted. As anyone who's studied the replication crisis knows, survivorship bias is a major issue in science. Usually, the culprit is a selection process in which results that are not statistically significant are filtered from the scientific literature. For example, the distribution of z-values from medical research is shown below. A z-value between -1.96 and 1.96 indicates that a result is not statistically significant. The sharp discontinuity around these values suggests that many scientists either didn't publish results between these values or massaged their data until they cleared the threshold of statistical significance. The problem is that if researchers fail to publish negative results, it can cause medical practitioners and the general public to overestimate the effectiveness of medical treatments.
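This selection effect is easy to reproduce. A minimal simulation (Python; my illustration, with made-up effect sizes) of "publish only if \(|z| > 1.96\)" produces exactly the hole, and the inflation, that the chart below shows:

```python
import random
import statistics

random.seed(0)
# Hypothetical study z-values: modest true effects plus sampling noise.
all_z = [random.gauss(0.3, 1.5) for _ in range(100_000)]
published = [z for z in all_z if abs(z) > 1.96]  # the significance filter

print(f"{len(published) / len(all_z):.0%} of results clear the threshold")
print(f"mean |z|, all studies: {statistics.mean(map(abs, all_z)):.2f}")
print(f"mean |z|, published:   {statistics.mean(map(abs, published)):.2f}")
# The published subset has a hole between -1.96 and 1.96 and a much larger
# mean effect size, i.e. the discontinuity visible in the medical z-value chart.
```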
The distribution of over 1 million z-values from medical research. Negative results—those with z-values between -1.96 and 1.96—are mostly missing. (Chart by Adrian Barnett and David Borg, based on data from Erik W. van Zwet and Eric A. Cator.)

Something similar has been happening in AI-for-science, though the selection process is based not on statistical significance but on whether the proposed method outperforms other approaches or successfully performs some novel task. This means that AI-for-science researchers almost always report successes of AI, and rarely publish results when AI isn't successful. A second issue is that pitfalls often cause the successful results that do get published to reach overly optimistic conclusions about AI in science. The details and severity seem to differ between fields, but pitfalls mostly have fallen into one of four categories: data leakage, weak baselines, cherry-picking, and misreporting. While the causes of this tendency towards overoptimism are complex, the core issue appears to be a conflict of interest in which the same people who evaluate AI models also benefit from those evaluations. These issues seem to be bad enough that I encourage people to treat impressive results in AI-for-science the same way we treat surprising results in nutrition science: with instinctive skepticism.

Correction: This article originally stated that it took four years for anyone to publish a paper about the failure modes of PINNs, but I had overlooked an earlier paper. The story has been updated.

[1] Early drafts of this article gave three examples here, including a paper by MIT graduate student Aidan Toner-Rodgers about the use of AI to discover new materials. That paper had been described as "the best paper written so far about the impact of AI on scientific discovery." But then MIT announced that it was seeking the retraction of the paper due to concerns "about the integrity of the research." Of course, allegations of outright fraud are a different issue than the subtler methodological problems I focus on in my article. But the fact that this paper got so much traction in the media underscores my broader point that researchers have a variety of incentives to exaggerate the effectiveness of AI techniques.

[2] When I talk about scientists using AI, I mean training or using special-purpose AI models such as PINNs or AlphaFold. I'm not talking about using an LLM to help write grant proposals or do basic background research.

Kevin (May 19, 2025): I'm working on a large-scale physics research project right now. I'm not myself a physicist but I work with a lot of them. I can already tell that AI is accelerating this work, but it isn't in the way that you describe. It's not like they are training advanced AI models to do something that sounds really cutting-edge with AI. Instead, much of the day-to-day work of a physicist, at least in some fields, is basic Python programming. And the LLMs are really good at this! Better than many physicists. Someone can be an excellent physicist, top 1%, but a mediocre Python programmer. And the LLMs already know all the details of astropy, they are good at converting one file format to another, cleaning data, all these mundane tasks that soak up the time of physicists. If the AI can quickly do the most boring 50% of your tasks, suddenly you're accelerated to twice the speed. Plus, for most physicists, this frees them up to spend more time on the *interesting* stuff.
Melanie Mitchell (May 19, 2025): Excellent post!

=== Recent AI model progress feels mostly like bullshit — LessWrong (https://www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-like-bullshit) ===

Recent AI model progress feels mostly like bullshit, by lc. 24th Mar 2025. Linkpost from zeropath.com. 9 min read.

About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since June 2024. Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching over the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our prompts, even from incomplete information.

As it happens, there are ~basically no public benchmarks for security research. There are "cybersecurity" evals that ask models questions about isolated blocks of code, or "CTF" evals that give a model an explicit challenge description and shell access to a <1kLOC web application. But nothing that gets at the hard parts of application pentesting for LLMs, which are:

1. Navigating a real repository of code too large to put in context,
2. Inferring a target application's security model, and
3. Understanding its implementation deeply enough to learn where that security model is broken.

For these reasons I think the task of vulnerability identification serves as a good litmus test for how well LLMs are generalizing outside of the narrow software engineering domain. Since 3.5 Sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

At first, I was nervous to report this publicly because I thought it might reflect badly on us as a team. Our scanner has improved a lot since August, but because of regular engineering, not model improvements. It could've been a problem with the architecture that we had designed, that we weren't getting more mileage as the SWE-Bench scores went up. But in recent months I've spoken to other YC founders doing AI application startups and most of them have had the same anecdotal experiences: 1. o99-pro-ultra announced, 2. Benchmarks look good, 3. Evaluated performance mediocre.
This is despite the fact that we work in different industries, on different problem sets. Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD-level questions to ask"), but the narrative is there. I have read the studies. I have seen the numbers. Maybe LLMs are becoming more fun to talk to, maybe they're performing better on controlled exams. But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. They are not reflective of my Lived Experience or the Lived Experience of my customers. In terms of being able to perform entirely new tasks, or larger proportions of users' intellectual labor, I don't think they have improved much since August.

Depending on your perspective, this is good news! Both for me personally, as someone trying to make money leveraging LLM capabilities while they're too stupid to solve the whole problem, and for people worried that a quick transition to an AI-controlled economy would present moral hazards. At the same time, there's an argument that the disconnect in model scores and the reported experiences of highly attuned consumers is a bad sign. If the industry can't figure out how to measure even the intellectual ability of models now, while they are mostly confined to chatrooms, how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy? If we're running into the traps of Goodharting before we've even delegated the messy hard parts of public life to the machines, I would like to know why.

Are the AI labs just cheating?

AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed. Accusing these founders of engaging in fraud to further these purposes is quite reasonable. Even if you are starting with an unusually high opinion of tech moguls, you should not expect them to be honest sources on the performance of their own models in this race. There are very powerful short-term incentives to exaggerate capabilities or selectively disclose favorable capabilities results, if you can get away with it. Investment is one, but attracting talent and winning the (psychologically impactful) prestige contests is probably just as big a motivator. And there is essentially no legal accountability compelling labs to be transparent or truthful about benchmark results, because nobody has ever been sued or convicted of fraud for training on a test dataset and then reporting that performance to the public. If you tried, any such lab could still claim to be telling the truth in a very narrow sense, because the model "really does achieve that performance on that benchmark". And if first-order tuning on important metrics could be considered fraud in a technical sense, then there are a million other ways for the team responsible for juking the stats to be slightly more indirect about it.

In the first draft of this essay, I followed the above paragraph up with a statement like "That being said, it's impossible for all of the gains to be from cheating, because some benchmarks have holdout datasets." There are some recent private benchmarks such as SEAL that seem to be showing improvements.[1]
But every single benchmark that OpenAI and Anthropic have accompanied their releases with has had a test dataset publicly available. The only exception I could come up with was the ARC-AGI prize, whose highest score on the "semi-private" eval was achieved by o3, but which nevertheless has not done a publicized evaluation of either Claude 3.7 Sonnet, or DeepSeek, or o3-mini.

So maybe there's no mystery: the AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. In a sense this would be the most fortunate answer, because it would imply that we're not actually that bad at measuring AGI performance; we're just facing human-initiated fraud. Fraud is a problem with people and not an indication of underlying technical difficulties. I'm guessing this is true in part but not in whole.

Are the benchmarks not tracking usefulness?

Suppose the only thing you know about a human being is that they scored 160 on Raven's progressive matrices (an IQ test).[2] There are some inferences you can make about that person: for example, higher scores on RPM are correlated with generally positive life outcomes like higher career earnings, better health, and not going to prison. You can make these inferences partly because in the test population, scores on the Raven's progressive matrices test are informative about humans' intellectual abilities on related tasks. Ability to complete a standard IQ test and get a good score gives you information about not just the person's "test-taking" ability, but about how well the person performs in their job, whether or not the person makes good health decisions, whether their mental health is strong, and so on. Critically, these correlations did not have to be robust in order for the Raven's test to become a useful diagnostic tool. Patients don't train for IQ tests, and further, the human brain was not deliberately designed to achieve a high score on tests like RPM. Our high performance on tests like these (relative to other species) was something that happened incidentally over the last 50,000 years, as evolution was indirectly tuning us to track animals, irrigate crops, and win wars.

This is one of those observations that feels too obvious to make, but: with a few notable exceptions, almost all of our benchmarks have the look and feel of standardized tests. By that I mean each one is a series of academic puzzles or software engineering challenges, each challenge of which you can digest and then solve in less than a few hundred tokens. Maybe that's just because these tests are quicker to evaluate, but it's as if people have taken for granted that an AI model that can get an IMO gold medal is gonna have the same capabilities as Terence Tao. "Humanity's Last Exam" is thus not a test of a model's ability to finish Upwork tasks, or complete video games, or organize military campaigns; it's a free-response quiz. I can't do any of the Humanity's Last Exam test questions, but I'd be willing to bet today that the first model that saturates HLE will still be unemployable as a software engineer. HLE and benchmarks like it are cool, but they fail to test the major deficits of language models, like how they can only remember things by writing them down onto a scratchpad, like the Memento guy.
Claude Plays Pokemon is an overused example, because video games involve a synthesis of a lot of human-specific capabilities, but the task fits as one where you need to occasionally recall things you learned thirty minutes ago. The results are unsurprisingly bad. Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon. I'll still check out the SEAL leaderboard to see what it's saying, but the deciding factor for my AI timelines will be my personal experiences in Cursor, and how well LLMs are handling long-running tasks similar to what you would be asking an employee. Everything else is too much noise.

Are the models smart, but bottlenecked on alignment?

Let me give you a bit of background on our business before I make this next point. As I mentioned, my company uses these models to scan software codebases for security problems. Humans who work on this particular problem domain (maintaining the security of shipped software) are called AppSec engineers. As it happens, most AppSec engineers at large corporations have a lot of code to secure. They are desperately overworked. The question the typical engineer has to answer is not "how do I make sure this app doesn't have vulnerabilities" but "how do I manage, sift through, and resolve the overwhelming amount of security issues already live in our 8000 product lines". If they receive an alert, they want it to be affecting an active, ideally-internet-reachable production service. Anything less than that means either too many results to review, or the security team wasting limited political capital to ask developers to fix problems that might not even have impact. So naturally, we try to build our app so that it only reports problems affecting an active, ideally-internet-reachable production service.

However, if you merely explain these constraints to the chat models, they'll follow your instructions sporadically. For example, if you tell them to inspect a piece of code for security issues, they're inclined to respond as if you were a developer who had just asked about that code in the ChatGPT UI, and so will speculate about code smells or near misses. Even if you provide a full, written description of the circumstances I just outlined, pretty much every public model will ignore your circumstances and report unexploitable concatenations into SQL queries as "dangerous". It's not that the AI model thinks it's following your instructions and isn't. The LLM will actually say, in the naive application, that what it's reporting is a "potential" problem and that it might not be validated. I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

Every LLM wrapper startup runs into constraints like this. When you're a person interacting with a chat model directly, sycophancy and sophistry are a minor nuisance, or maybe even adaptive. When you're a team trying to compose these models into larger systems (something necessary because of the aforementioned memory issue), wanting-to-look-good cascades into breaking problems. Smarter models might solve this, but they also might make the problem harder to detect, especially as the systems they replace become more complicated and harder to verify the outputs of.
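A common way wrapper startups cope is to stop trusting the model with the constraints at all and bolt a deterministic triage filter onto its raw output. The sketch below (Python; hypothetical field names and logic, not ZeroPath's actual pipeline) illustrates the shape of such a filter:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    service: str
    in_production: bool            # does the affected service actually run live?
    reachable_from_internet: bool  # is it exposed beyond the internal network?
    exploit_validated: bool        # did a validation step confirm exploitability?

def triage(findings: list[Finding]) -> list[Finding]:
    """Keep only findings an overworked AppSec team can act on: validated
    issues in live services, with internet-reachable ones surfaced first."""
    actionable = [f for f in findings if f.in_production and f.exploit_validated]
    return sorted(actionable, key=lambda f: not f.reachable_from_internet)
```

This treats the model as a noisy generator of candidates and pushes the judgment it keeps fumbling into ordinary code, which works until the properties you need to filter on are themselves things only the model can assess.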
There will be many different ways to overcome these flaws. It's entirely possible that we fail to solve the core problem before someone comes up with a way to fix the outer manifestations of the issue. I think doing so would be a mistake. These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us. It's important that they be as virtuous as we can make them.

[1] Though even this is not as strong as it seems on first glance. If you click through, you can see that most of the models listed in the Top 10 for everything except the tool-use benchmarks were evaluated after the benchmark was released. And both of the Agentic Tool Use benchmarks (which do not suffer this problem) show curiously small improvements in the last 8 months.

[2] Not that they told you they scored that, in which case it might be the most impressive thing about them, but that they did.

Comments (sorted by top scoring):

Steven Byrnes: "Are the AI labs just cheating?" Evidence against this hypothesis: kagi is a subscription-only search engine I use. I believe that it's a small private company with no conflicts of interest. They offer several LLM-related tools, and thus do a bit of their own LLM benchmarking. See here. None of the benchmark questions are online (according to them, but I'm inclined to believe it). Sample questions:

- What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.
- What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
- Given a QWERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?
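(As an aside, the keyboard question can be checked mechanically. A quick sketch, assuming the intended rule is shifting each letter one key to the right along its QWERTY row; the output is left for the reader to run:)

```python
# HEART -> JRSTY under a one-key rightward shift (H->J, E->R, A->S, R->T, T->Y).
rows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
shift = {c: row[i + 1] for row in rows for i, c in enumerate(row[:-1])}

def right_shift(word: str) -> str:
    return "".join(shift[c] for c in word.lower()).upper()

assert right_shift("HEART") == "JRSTY"   # confirms the inferred rule
print(right_shift("HIGB"))
```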
Their leaderboard is pretty similar to other better-known benchmarks — e.g., here are the top non-reasoning models as of 2025-02-27:
* OpenAI gpt-4.5-preview — 69.35%
* Google gemini-2.0-pro-exp-02-05 — 60.78%
* Anthropic claude-3-7-sonnet-20250219 — 53.23%
* OpenAI gpt-4o — 48.39%
* Anthropic claude-3-5-sonnet-20241022 — 43.55%
* DeepSeek Chat V3 — 41.94%
* Mistral Large-2411 — 41.94%
So that's evidence that LLMs really are getting generally better at self-contained questions of all types, even since Claude 3.5. I prefer your "Are the benchmarks not tracking usefulness?" hypothesis.

Petropolitan: https://simple-bench.com presents a similar benchmark with tricky commonsense questions (such as counting ice cubes in a frying pan on the stove), also with a pretty similar leaderboard. It is sponsored by Weights & Biases and devised by the author of a good YouTube channel who presents quite a balanced view on the topic and doesn't appear to have a conflict of interest either. See https://www.reddit.com/r/LocalLLaMA/comments/1ezks7m/simple_bench_from_ai_explained_youtuber_really for independent opinions on this benchmark.

keltan: Bump to that YT channel too. Some of the most balanced AI news videos out there. Really appreciate the work they're doing.

MazevSchlong: But isn't this exactly the OP's point? These models are exceedingly good at self-contained, gimmicky questions that can be digested and answered in a few hundred tokens. No one is denying that! Secondly, there is a high chance that these benchmark questions are simply in these models' datasets already. They have super-human memory of their training data, there's no denying that. Are we sure that these questions aren't in their datasets? I don't think we can be. First off, you just posted them online. But in a more conspiratorial light, can we really be sure that these companies aren't training on user data/prompts? DeepSeek is at least honest that they do, but I think it's likely that the other major labs are as well. It would give you gigantic advantages in beating these benchmarks. And being at the top of the benchmarks means vastly more investment, which gives you a larger probability of dominating the future light-cone (as they say…). The incentives clearly point this way, at the very minimum!

gwern: "Are we sure that these questions aren't in their datasets? I don't think we can be. First off, you just posted them online." Questions being online is not a bad thing. Pretraining on the datapoints is very useful and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren't). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can still train on the new question then and there, just like when you did the benchmark. So it's good to do so. It's the answers or labels being online which is the bad thing. But Byrnes's comment and the linked kagi page do not contain the answers to those three questions, as far as I can see.

particlemania: I expect it matters to the extent we care about whether the generalizing to the new question is taking place in the expensive pretraining phase or in the active in-context phase.

MazevSchlong: Sure, fair point! But generally, people gossiping online about missed benchmark questions, and then likely spoiling the answers, means that a question is now ~ruined for all training runs.
How much of these modest benchmark improvements over time can be attributed to this? The fact that frontier AIs can basically see and regurgitate everything ever written on the entire internet is hard to fathom! I could be really petty here and spoil these answers for all future training runs (and make all future models look modestly better), but I just joined this site so I'll resist lmao…

Steven Byrnes: Yup, I expected that OP would generally agree with my comment. They only posted three questions, out of at least 62 (=1/(.2258-.2097)), perhaps many more than 62. For all I know, they removed those three from the pool when they shared them. That's what I would do — probably some human will publicly post the answers soon enough. I dunno. But even if they didn't remove those three questions from the pool, it's a small fraction of the total. You point out that all the questions would be in the LLM company user data after kagi has run the benchmark once (unless kagi changes out all their questions each time, which I don't think they do, although they do replace easier questions with harder ones periodically). Well:
* If an LLM company is training on user data, they'll get the questions without the answers, which probably wouldn't make any appreciable difference to the LLM's ability to answer them.
* If an LLM company is sending user data to humans as part of RLHF or SFT or whatever, then yes, there's a chance for ground-truth answers to sneak in that way — but that's extremely unlikely to happen, because companies can only afford to send an extraordinarily small fraction of user data to actual humans.

M. Y. Zuo: Yeah, those numbers look fairly plausible based on my own experiences… there may be a flattening of the curve, but it's still noticeably going up.

Dave Orr: I work at GDM, so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously — we don't want eval data to leak into training data, and we have multiple lines of defense to keep that from happening. It's not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked — and while we do look for this kind of thing, there's no guarantee that we will be perfect at finding them. So it's completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it's not intentional and we work to avoid it. We do hill-climb on notable benchmarks, and I think there's likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us. I think the main thing that's happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably for Goodhart reasons. The agent benchmarks are pretty different in kind, and I expect they are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we'll all need to switch to something else.

Neel Nanda: I agree that I'd be shocked if GDM was training on eval sets. But I do think hill-climbing on benchmarks is also very bad for those benchmarks being an accurate metric of progress, and I don't trust any AI lab not to hill-climb on particularly flashy metrics.

Garrett Baker: What do you mean by "we"?
Do you work on the pretraining team, talk directly with the pretraining team, are you just aware of the methods the pretraining team uses, or something else?

Dave Orr: I don't work directly on pretraining, but when there were allegations of eval-set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs. So I have pretty detailed knowledge here. Of course, GDM is a big, complicated place and I certainly don't know everything, but I'm confident that we are trying hard to prevent contamination.

Daniel Kokotajlo: "Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon." I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon [chart in original].

MrCheeze: But you have to be careful here, since the results heavily depend on details of the harness, as well as on how thoroughly the models have memorized walkthroughs of the game.

Yoel Cabo: I think the "number of actions" axis is key here. This post explains it well: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon. I've been watching Claude Plays Pokémon and chatting with the Twitch folks. The post matches my experience. There's plenty of room for improvement in prompt design and tooling, but Claude is still far behind the performance of my 7-year-old (unfair comparison, I know). So I agree with OP, this is an excellent benchmark to watch:
* It's not saturated yet.
* It tests core capabilities for agentic behavior. If an AI can't beat Pokémon, it can't replace a programmer.
* It gives a clear qualitative feel for competence — just watch five minutes.
* It's non-specialized, so anyone can evaluate it and share an understanding (unlike Cursor, which requires coding experience).
And once LLMs beat Pokemon Red, I'll personally want to see them beat other games as well, to make sure the agentic capabilities are generalizing.

diogenes: If we rely specifically on Pokemon, isn't there a risk of models (either incidentally or intentionally) being overtrained on Pokemon-related data and seeing a boost in performance that way? Branching out to other games sooner rather than later seems sensible.

lsusr: "When you're a person interacting with a chat model directly, sycophancy and sophistry are a minor nuisance, or maybe even adaptive. When you're a team trying to compose these models into larger systems (something necessary because of the aforementioned memory issue), wanting-to-look-good cascades into breaking problems." If you replace "models" with "people", this is true of human organizations too.

ryan_greenblatt: Is this an accurate summary: 3.5 substantially improved performance for your use case and 3.6 slightly improved performance. The o-series models didn't improve performance on your task. (And presumably 3.7 didn't improve perf.) So, by "recent model progress feels mostly like bullshit" I think you basically just mean "reasoning models didn't improve performance on my application, and Claude 3.5/3.6 Sonnet is still best". Is this right? I don't find this state of affairs that surprising:
* Without specialized scaffolding, o1 is quite a bad agent, and it seems plausible your use case is mostly blocked on this.
* Even with specialized scaffolding, it's pretty marginal. (This shows up in the benchmarks AFAICT — e.g., see METR's results.)
* o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn't released.
* Generally, Anthropic models are better for real-world coding and agentic tasks relative to other models, and this mostly shows up in the benchmarks. (Anthropic models tend to slightly overperform their benchmarks relative to other models, I think, but they also perform quite well on coding and agentic SWE benchmarks.)
I would have guessed you'd see performance gains with 3.7 after coaxing it a bit. (My low-confidence understanding is that this model is actually better, but it is also more misaligned and reward-hacky in ways that make it less useful.)

sanxiyn: Our experience so far is that while reasoning models don't improve performance directly (3.7 is better than 3.6, but 3.7 extended thinking is NOT better than 3.7), they do so indirectly, because the thinking trace helps us debug prompts and tool output when models misunderstand them. This was not the result we expected, but it is the case.

NULevel: Completely agree with this. While there are some novel applications possible with reasoning models, the main value has been the ability to trace specific chains of thought and redefine/reprompt accordingly. Makes the system (slightly) less of a black box.

lc: Just edited the post, because I think the way it was phrased kind of exaggerated the difficulties we've been having applying the newer models. 3.7 was better, as I mentioned to Daniel, just underwhelming and not as big a leap as either 3.6 or certainly 3.5.

Mateusz Bagiński: How long do you (Ryan firstmost, but anybody reading this secondmost) expect it to take to engineer scaffolding that will make reasoning models useful for the kind of stuff described in the OP?

sanxiyn: I happen to work on the exact same problem (application security pentesting) and I confirm I observe the same. Sonnet 3.5/3.6/3.7 were big releases, others didn't help, etc. As for OpenAI o-series models, we are debating whether it is a model-capability problem or a model-elicitation problem, because from interactive usage it seems clear they need different prompting, and we haven't yet seriously optimized prompting for the o-series. Evaluation is scarce, but we built something along the lines of CWE-Bench-Java discussed in this paper; this was a major effort, and we are reasonably sure we can evaluate. As for grounding, fighting false positives, and preventing models from reporting "potential" problems to sound good, we found grounding on code coverage to be effective. Run JaCoCo, tell models PoC || GTFO, where the PoC is structured as a vulnerability description with source-code file and line and a triggering input. Write the oracle verifier of this PoC: at the very least you can confirm execution reaches the line, in a way models can't ever fake.

ryan_greenblatt: METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren't optimized for being effective multi-turn agents. At least, the o1 series wasn't optimized for this; I think o3 may have been.
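A minimal sketch of the oracle idea sanxiyn describes, in Python rather than Java/JaCoCo, with a hypothetical PoC schema: a finding only counts if executing its claimed triggering input actually reaches the claimed file and line.

import json
import runpy
import sys
from pathlib import Path

def poc_reaches_line(poc_raw: str) -> bool:
    # Hypothetical schema: {"file": "app.py", "line": 42, "argv": ["payload"]}
    poc = json.loads(poc_raw)
    target_file = Path(poc["file"]).resolve()
    reached = False

    def tracer(frame, event, arg):
        nonlocal reached
        if (event == "line" and frame.f_lineno == poc["line"]
                and Path(frame.f_code.co_filename).resolve() == target_file):
            reached = True
        return tracer

    sys.argv = [poc["file"], *poc.get("argv", [])]
    sys.settrace(tracer)
    try:
        runpy.run_path(poc["file"], run_name="__main__")
    except Exception:
        pass  # a crash is fine; we only care whether the line was reached
    finally:
        sys.settrace(None)
    return reached  # the one bit a model can't fake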
Vladimir_Nesov: With Blackwell [1] still getting manufactured and installed, newer large models, and especially their long-reasoning variants, remain unavailable, prohibitively expensive, or too slow (GPT-4.5 is out, but not its thinking variant). In a few months Blackwell will be everywhere, and between now and then widely available frontier capabilities will significantly improve. Next year, there will be even larger models trained on Blackwell. This kind of improvement can't currently be created with post-training without needing long reasoning traces or larger base models, but post-training is still good at improving things under the lamppost, hence the illusory nature of current improvement when you care about things further in the dark. Blackwell is an unusually impactful chip generation, because it fixes what turned out to be a major issue with Ampere and Hopper when it comes to inference of large language models on long context, by increasing scale-up world size from 8 Hopper chips to 72 Blackwell chips. Not having enough memory or compute on each higher-bandwidth scale-up network was a bottleneck that made inference unnecessarily slow and expensive. Hopper was still designed before Cha… [truncated]

AnthonyC: I can't comment on software engineering, not my field. I work at a market research/tech scouting/consulting firm. What I can say is that over the past ~6 months we've gone from "I put together this 1-hour training for everyone to get some more value out of these free LLM tools" to "This can automate ~half of everything we do for $50/person/month." I wouldn't be surprised if a few small improvements in agents over the next 3-6 months push that 50% up to 80%, then maybe 90% by mid next year. That's not AGI, but it does get you to a place where you need people to have significantly more complex and subtle skills, which currently take a couple of years to build, before their work adds significant value.

MalcolmMcLeod: Could you explain what types of tasks lie within this "50%"? And when you talk about "automating 50%", does this mean something more like "we all get twice as productive because the tasks we accomplish are faster", or does it mean "the models can do the relevant tasks end-to-end in a human-replacement way, and we simply no longer need to attend to these tasks"? E.g., Cursor cannot yet replace a coder, but it can enhance her productivity. However, a chatbot can entirely replace a frontline customer service representative.

AnthonyC: Some of both, more of the former, but I think that is largely an artifact of how we have historically defined tasks. None of us has ever managed an infinite army of untrained interns before, which is how I think of LLM use (over the past two years they've roughly gone from high-school-student interns to grad-student interns), so we've never refactored tasks into appropriate chunks for that context. I've been leading my company's team working on figuring out how best to integrate LLMs into our workflow, and frankly, they're changing so fast with new releases that it's not worth attempting end-to-end replacement for most tasks right now. At least, not for a small company. The 80/20 rule applies on steroids; we're going to have new and better tools and strategies next week/month/quarter anyway.
Like, I literally had a training session planned for this morning, woke up to see the Gemini 2.5 announcement, and had to work it in as "Expect additional guidance soon; please provide feedback if you try it out." We do have a longer-term plan for end-to-end automation of specific tasks as well, where it is worthwhile. I half-joke that Sam Altman tweets a new feature and we have to adapt our… [truncated]

leogao: Actual full-blown fraud in frontier models at the big labs (OAI/Anthropic/GDM) seems very unlikely. Accidental contamination is a lot more plausible, but people are incentivized to find metrics that avoid this. Evals not measuring real-world usefulness is the obvious culprit IMO, and it's one big reason my timelines have been somewhat longer despite rapid progress on evals.

Kabir Kumar: Why does it seem very unlikely?

Jazi Zilber: Those conspiracies don't work most of the time. "You can only keep a secret between two people, provided one of them is dead." The personal risk for anyone involved, plus the human psychological tendency to chat and to have a hard time holding on to immortal secrets, means it's usually irrational for organisations to do intentional cheating.

lemonhope: Almost every time I use Claude Code (3.7, I think) it ends up cheating at the goal: optimizing performance by replacing the API function with a constant, deleting test cases, ignoring runtime errors with a silent try/catch, etc. It never mentions these actions in the summary. In this narrow sense, 3.7 is the most misaligned model I have ever used.

lookoutbelow: This was an issue referenced in the model card ("special casing"). Not as rare as they made it out to be, it seems.

p.b.: I was pretty impressed with o1-preview's ability to do mathematical derivations. That was definitely a step change; the reasoning models can do things earlier models just couldn't do. I don't think the AI labs are cheating, for any reasonable definition of cheating.

Zach Stein-Perlman: Data point against "Are the AI labs just cheating?": the METR time-horizon thing.

Noosphere89: lc has argued that the measured tasks are unintentionally biased towards ones where long-term memory/context length doesn't matter: https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

nmca: (Disclaimer: I work on evaluation at OAI, run the o3 evaluations, etc.) I think you are saying "bullshit" when you mean "narrow". The evidence for large capability improvements in math and tightly scoped coding since 4o is overwhelming; see e.g. AIME 2025, Gemini USAMO, or copy-pasting a recent Codeforces problem. The public evidence for broad/fuzzy task improvement is weaker — o1 MMLU boosts and various vibes evals (Tao) do show it, though. It is a very important question how much these large narrow improvements generalize. I try to approach the question humbly. Hopefully new models improve on your benchmark — do share if so!
Noosphere89: I'll say that one of my key cruxes on whether AI progress actually becomes non-bullshit/actually leads into an explosion is whether in-context learning/meta-learning can act as an effective enough substitute for the neuroplasticity of human neuron weights within realistic compute budgets in 2030. The key reason AIs have a lot of weird deficits/are much worse than humans at simple tasks is that after an AI is trained, there is no neuroplasticity in the weights anymore, and thus it can learn nothing more after its training date unless it uses in-context learning/meta-learning: https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/?commentId=hSkQG2N8rkKXosLEF#hSkQG2N8rkKXosLEF

Daniel Kokotajlo: "Unexpectedly by me, aside from a minor bump with 3.6 in October, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models." So what's the best model for your use case? Still 3.6 Sonnet?

lc: We use different models for different tasks for cost reasons. The primary workhorse model today is 3.7 Sonnet, whose improvement over 3.6 Sonnet was smaller than 3.6's improvement over 3.5 Sonnet. When taking the job of this workhorse model, o3-mini and the rest of the recent o-series models were strictly worse than 3.6.

Daniel Kokotajlo: Thanks. OK, so the models are still getting better; it's just that the rate of improvement has slowed and seems smaller than the rate of improvement on benchmarks? If you plot a line, does it plateau or does it get to professional human level (i.e., reliably doing all the things you are trying to get it to do as well as a professional human would)? What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

lc: "If you plot a line, does it plateau or does it get to professional human level?" It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it's analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven't actually hired a bunch of pentesters, asked them to do the same tasks we expect of the language models, and taken the diff. One thing our tool has over people is breadth, but that's because we can parallelize inspection of different pieces, not because the models are doing tasks better than humans. "What about 4.5?" We have not yet tried 4.5, as it's so expensive that we would not be able to deploy it, even for limited sections.

gwern: Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like "maybe finetuning is more useful than we think", etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy, and just a matter of a small amount of chump change in comparison to the insight — like a few hundred bucks.
dimitry12: Please help me understand how you suggest to "ablate reasoning", and what's the connection with "small-model smell"?

SoerenMind: This is interesting. Though companies are probably investing a lot less into cyber capabilities than they invest into other domains like coding. Cyber is just less commercially interesting, plus it can be misused and worry the government. And the domain-specific investment should matter, since most of the last year's progress has been from post-training, which is often domain-specific. (I haven't read the whole post.)

green_leaf: According to Terence Tao, GPT-4 was incompetent at graduate-level math (obviously), but o1-preview was mediocre-but-not-entirely-incompetent. That would be a strange thing to report if there were no difference. (Anecdotally, o3-mini is visibly (massively) brighter than GPT-4.)

Mo Putera: Full quote on Mathstodon for others' interest [quote embedded in original]. This o1 vs. MathOverflow experts comparison was also interesting [link embedded in original].

green_leaf: (I believe the version he tested was what later became o1-preview.)

Raemon: My lived experience is that AI-assisted coding hasn't actually improved my workflow much since o1-preview, although other people I know have reported differently.

cozyfae: "These machines will soon become the beating hearts of the society in which we live." An alternative future: due to the high rates of failure, we don't end up deploying these machines widely in production settings, just as autonomous driving had breakthroughs long ago but didn't end up widely deployed today.

fujisawa_sora: I primarily use LLMs when working with mathematics, which is one of the areas where the recent RL paradigm was a clear improvement — reasoning models are finally useful. However, I agree with you that benchmark-chasing isn't optimal, in that the model still can't admit when it's wrong. It doesn't have to give up, but when it can't do something, I'd rather it list what it tried as ideas, instead of pretending it can solve everything, because then I actually have to read through everything. Of course, this could be solved by having some amateur mathematicians read through it and using RL to penalize BS. So I think this is a case where benchmark performance was prioritized over actual usefulness.

Aaron_Scher: I appreciate this post; I think it's a useful contribution to the discussion. I'm not sure how much I should update on it. Points of clarification: "Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated." Have you upgraded these benchmarks? Is it possible that the diminishing returns you're seeing in the Sonnet 3.5-3.7 series are just normal benchmark saturation? What % scores are t… [truncated]

James Grugett: Much of the gains on SWE-bench are actually about having the model find better context via tool calls. Sonnet 3.7 is trained to seek out the information it needs. But if you compare the models with fixed context, they are only somewhat smarter than before. (The other dimension is the thinking models, which seem to be only a modest improvement on coding, but do much better at math, for example.) That being said, the new Gemini 2.5 Pro seems like another decent step up in intelligence from Sonnet 3.7.
We're about to switch the default model of our coding agent, Codebuff, over to it (and have already shipped it for codebuff --max).

Noosphere89: Gradient Updates has a post on this by Anson Ho and Jean-Stanislas Denain on why benchmarks haven't reflected usefulness; a lot of the reason is that benchmark designers underestimated AI progress and didn't really have an incentive to make benchmarks reflect realistic use cases: https://epoch.ai/gradient-updates/the-real-reason-ai-benchmarks-havent-reflected-economic-impacts

Qumeric: I am curious to see what the results of the new Gemini 2.5 Pro would be on your internal benchmarks.

nmca: Is there an o3 update yet?

Burny: In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.

solhando: Somewhat unrelated to the main point of your post, but: how close are you to solving the wanting-to-look-good problem? I run a startup in a completely different industry, and we've invested significant resources in trying to get an LLM to interact with a customer, explain, and make dynamic recommendations based on their preferences. This is a more high-touch business, so traditionally this was done by a human operator. The major problem we've encountered is that it's almost impossible to get an LLM to admit ignorance when it doesn't have the informati… [truncated]

sanxiyn: Maybe we are talking about different problems, but we found instructing models to give up (literally "give up", I just checked the source) under certain conditions to be effective.

Mo Putera: "Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon." Same, and I'd adjust for what Julian pointed out by not just looking at benchmarks but viewing the actual stream.

Trevor Hill-Hand: I happened to be discussing this in the Discord today. I have a little hobby project that was suddenly making fast progress with 3.7 for the first few days, which was very exciting, but then a few days ago it felt like something changed again, and suddenly even the old models are stuck in this weird pattern of, like… failing to address the bug, and instead hyper-fixating on adding a bunch of surrounding extra code to handle special cases, or sometimes even simply rewriting the old code and claiming it fixes the bug, and the project is suddenly at a complet… [truncated]

Mis-Understandings: Beware of argument doom spirals. When talking to a person, arguing about the existence of a bug tends not to lead to successful resolution of the bug. Somebody talked about this in a post a few days ago, about attractor basins, oppositionality, and when AI agents are convinced they are people (rightly or wrongly). You are often better off clearing the context than repeatedly arguing in the same context window.

Trevor Hill-Hand: This is a good point! Typically I start from a clean commit in a fresh chat, to avoid this problem happening too easily, proceeding through the project in the smallest steps I can get Claude to make. That's what makes the situation feel so strange; it feels just like this problem, but it happens instantly, in Claude's first responses.

Archimedes: It's also worth trying a different model.
I was going back and forth with an OpenAI model (I don't remember which one) and couldn't get it to do what I needed at all, even with multiple fresh threads. Then I tried Claude and it just worked.

Mis-Understandings: Consider the solutions from Going Nova.

Mo Putera: Jack Clark's most recent issue of Import AI mentioned AI security startup XBOW's "fully autonomous AI-driven penetration tester" (also called XBOW), which topped HackerOne: "AI pentesting systems out-compete humans: …Automated pentesting… AI security startup XBOW recently obtained the top rank on HackerOne with an autonomous penetration tester - a world first. 'XBOW is a fully autonomous AI-driven penetration tester,' the company writes. 'It requires no human input, operates much like a human pentester, but can scale rapidly, completing comprehensive pen…" [truncated]

lc: The XBOW PR is a quintessential example of what I'm talking about. Suffice it to mention that:
* XBOW topped the HackerOne "leaderboard" that measures upvotes, not money earned on the platform.
* XBOW almost entirely submitted bugs for the free, non-paid bug bounties! A primary reason they were able to find these bugs was that they weren't actually competing with anyone!

ramennaut: This really resonates with me. I don't work in AppSec, but I've seen how benchmark gains often fail to show up when you're doing something non-trivial with the model. It seems that current benchmarks have low ecological validity. Although I wouldn't be quick to blame labs for possibly cheating. They may or may not be, but it also might just be that we're bad at designing evaluations that track real-world usefulness. When you think about it, even university exams don't really predict job performance either. These are benchmarks we've had centuries to re… [truncated]

Ram Potham: I have experienced similar problems when building an AI tool — better models did not necessarily lead to better performance despite external benchmarks. I believe there are two main reasons why, alluded to in your post:
* Selection bias — when a foundation model company releases their newest model, they show performance on the benchmarks most favorable to it.
* Alignment — you mentioned how the AI is not truly understanding the instructions you meant. While this can be mitigated by writing better prompts, it does not fully solve the issue.

Viktor Rehnberg: Another hypothesis: your description of the task covers the parts of application pentesting that are hard for LLMs, which are (1) navigating a real repository of code too large to put in context, (2) inferring a target application's security model, and (3) understanding its implementation deeply enough to learn where that security model is broken. From METR's recent investigation on long tasks, you would expect current models not to perform well on this. I doubt a human professional could do the tasks you describe in anything close to an hour, so perhaps it's just current… [truncated]

Chris_Leong: "However, if you merely explain these constraints to the chat models, they'll follow your instructions sporadically." I wonder if a custom fine-tuned model could get around this. Did you try few-shot prompting (i.e. examples, not just a description)?

Matthew Gertner: "Depending on your perspective, this is good news!
Both for me personally, as someone trying to make money leveraging LLM capabilities while they're too stupid to solve the whole problem, and for people worried that a quick transition to an AI-controlled economy would present moral hazards." This really resonated with me. It's ironic that if you're developing an AI-powered app, you need to worry about both (a) AI not being good enough to let you achieve your product goals and (b) AI getting so good that all your work ends up being redundant. I wrote up some… [truncated]

Florian_Dietz: "I think what's going on is that large language models are trained to 'sound smart' in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart." I have encountered this before, and what worked for me was telling the model "point out all the mistakes, but then review them and decide which of them, if any, are worth highlighting." That way the model gets to sound doubly smart: "I found a mistake, and also I understand the circumstances well enough not to raise it to high priority."

Vasilios Mavroudis: Interesting article! Here is a benchmark that does real-world exploitation (indeed, CTFs don't cut it anymore): https://github.com/alan-turing-institute/HonestCyberEval

vire: That's an interesting point: why didn't we see major improvements in LLMs for, say, coding, despite them achieving reasoning at the level that lets them become a GM on Codeforces? I'd say this is a fundamental limitation of reinforcement learning. Using purely reinforcement learning is stupid. Look at humans: we do much more than that. We make observations about our failures and update, we develop our own heuristics for what it means to be good at something, and then try to figure out how to make ourselves better by reasoning about it, watching ot… [truncated]

David James: "HLE and benchmarks like it are cool, but they fail to test the major deficits of language models, like how they can only remember things by writing them down onto a scratchpad like the Memento guy." A scratchpad for thinking, in my view, is hardly a deficit at all! Quite the opposite. In the case of people, some level of conscious reflection is important and probably necessary for higher-level thought. To clarify, I am not saying consciousness itself is in play here. I'm saying some feedback loop is probably necessary — where the artifacts of thinking, r… [truncated]

groblegark: Where does prompt optimization fit into y'all's workflows? I'm surprised not to see mention of it here. E.g., OPRO: https://arxiv.org/pdf/2309.03409

momom2: "I think what's going on is that large language models are trained to 'sound smart' in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart." This matches my experience, but I'd be interested in seeing proper evals of this specific point!
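A minimal sketch of Florian_Dietz's two-pass trick, for concreteness: the first call lets the model be exhaustive, the second makes it triage its own list. The client usage follows the OpenAI Python SDK, but the model name and prompt wording are illustrative.

from openai import OpenAI

client = OpenAI()

def two_pass_review(code: str, model: str = "gpt-4o") -> str:
    # Pass 1: enumerate every candidate mistake (the "sound smart" pass).
    candidates = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   "Point out all possible mistakes in this code:\n" + code}],
    ).choices[0].message.content
    # Pass 2: triage the list, keeping only what is worth raising.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   "Candidate problems:\n" + candidates +
                   "\n\nReview these and return only the ones worth "
                   "highlighting, or say the code looks fine."}],
    ).choices[0].message.content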
Mis-Understandings: The first two key challenges you list seem very similar to the agent problem of active memory switching: carrying important information across context switches. Also note that it could just be that, instead of bullshit, finetuning is unreasonably effective, so when you train models on an evaluation they actually get better at the things evaluated, which dominates over scaling. So for things with public benchmarks, it might just actually be easier to make models that are genuinely good at them. (For instance, searching for data that helps 1B models learn it, then adding it to full… [truncated]

Anthony Perez-sanz: I feel like you're jumping to cheating way too quickly. I think everyone would agree that there is overfitting to benchmarks and to benchmark-like questions. Also, this is a very hard problem. The average person doesn't have a shot at contributing to security research. Even the typical AppSec engineer with years of experience would fail at the task of navigating a new codebase, creating a threat model, and finding important security issues. This takes an expert in the field at least a few days of work. This is much longer than the time periods that AIs can be… [truncated]

NickH: From a practical perspective, maybe you are looking at the problem the wrong way around. A lot of prompt engineering seems to be about asking LLMs to play a role. I would try telling the LLM that it is a hacker and to design an exploit to attack the given system (this is the sort of mental perspective I used to use to find bugs when I was a software engineer). Another common technique is "generate then prune": have a separate model/prompt remove all the results of the first one that are only "possibilities". It seems, from my reading, that this sort of two-stage approach can work because it bypasses LLMs' typical attempts to "be helpful" by inventing stuff or spouting banal filler rather than just admitting ignorance.

Alex Vermillion: I think we should suspect that they've done some basic background research (this individual, not in general), and take the rest of the information about people failing to see improvements as data that also points in this direction.

=== What artificial intelligence can do for a small independent hotel (https://ringohosp.substack.com/p/what-artificial-intelligence-can) ===

What artificial intelligence can do for a small independent hotel
We don't depend on it much, but it's hard to say exactly why not
Toph Tucker, Apr 07, 2025

Two months ago, Joe Weisenthal tweeted, "The capabilities of existing AI systems right now are extraordinary vs. what (most) people might have guessed even 2.5 years ago. Yet, where is the industrial impact clearly being felt? Why hasn't it been more disruptive?" He was echoing Lars Doucet, who had tweeted, "I am surprised that, given how fast it's already progressed, how little has already changed.
Like aliens landed on the White House lawn and nobody cared." Joe continued, "We haven't seen whole categories of jobs erased, or anything like that — even for the most rudimentary stuff, like simple graphic design, or basic copywriting. You certainly hear anecdotes, but mostly it seems to be just that: anecdotes." Indeed, I don't think anybody at the Apple Tree would say AI has affected their jobs at all, aside from being a topic of general conversation. Why not?

On Thursday I sat down with our chef, Hagai, to give him some tutoring in Excel. Hagai does almost all our food ordering, which spans several different categories of service:
* The continental breakfast, included in the price of your room;
* The Ostrich Room dinner, not included, and open to the public;
* The Ostrich Room bar, informed by whatever bar manager Sean needs, like lemons, limes, and liquors;
* The semi-autonomous brunch restaurant he and his wife Adi run out of our round room, called Báladi;
* And private events, like weddings and bar mitzvahs.
Each of those involves different customers, paying in different ways, for the labor of different employees, compensated by different structures. Devising and maintaining those structures turns out to be a bigger part of Claire's job than I'd have guessed.

As one little example, back in November, I asked Claire how tipping worked. At least at that point, in the Ostrich Room, 15% of tips went to the food runner, and then the server and bartenders split the rest. The server and bartender then, by convention, tip out a bit of their tips to the dishwasher. The cooks got none. At Báladi, tips were evenly split between the food runner and two servers. "It's complicated and never feels right and I never know what to do," Claire said then. "Part of me wants to give everyone the same. But the bartender and server do sell, and if you don't give them more, then they'll go somewhere else where they get more. But in a sense I don't want to incentivize that. A lot of these people aren't actually primarily motivated by that, and want to be part of a good team. If the place becomes steadier and better known, it's easier to spread it all evenly." [1]

Guests who see the Ostrich Room and Báladi as two offerings in two attached rooms under the unified umbrella of the Apple Tree Inn might be surprised that the tip structure differs between them. But those operations are aiming for different things. The Ostrich Room has a bar and a music program that carries on the legacy of the previous owners; Báladi is trying to do something new, driven by Hagai and Adi, who have their own longstanding vision for everything from the menu to the music. Some operational inconsistency may help each be the best version of itself. And that's just the tip of the iceberg! The financial structures of our F&B offerings differ in all sorts of ways. The continental breakfast has no waitstaff or tipping; guests paid for it days or months before they even saw it. Private events have unique contracts devised by one-off negotiations, and may include external caterers. And so on.

Tip-splitting is the easy part, because tips are an easily known pool — they can be summed on the spot without having to subtract any unknowns. The harder part is knowing your profits — whether in order to calculate your profit margin to make more strategic decisions, or to figure out how to split the profits. (Once you have profits, anyway.) The problem is, we have one kitchen, one freezer, one nice walk-in fridge, and one Hagai ordering everything.
That's an efficiency, in some ways, but it also muddles the accounting. You can't calculate how much you made on a private event until you break out what you spent on it. Which brings us to a pile of receipts: Hagai's pile of invoices, from sources including distributors like Baldor and Sysco and markets like Guido's, FreshTown, Price Chopper, Big Y, Aldi, Stop & Shop, and Loeb's. Baldor is our major vendor; the others tend to come from looking for something especially special or fresh (Guido's?) or running out in an emergency (Loeb's is just a five-minute drive into town). As he goes, he annotates items with whether they were for O (Ostrich Room), B (Báladi), C (continental), S (Sean), or E (an event). But that paper is not easily tallied. So Claire is eager to bring the F&B operation into the era (1979–) of the electronic spreadsheet.

Categorizing invoices is hardly Hagai's favorite part of the job, but at least paper is an intuitive medium where he can scrawl whatever ad hoc notes he wants directly atop the list without things breaking, jumping around, or disappearing. I don't think he was eager for computerization; I don't think he has ever really had a "computer job". But I was eager to try to help him.

At first I thought we'd try Google Sheets. But Hagai didn't remember his password for the Apple Tree's Google Workspace, and couldn't reset it without a domain administrator; we weren't sure who that was. (It turns out to be Christian.) He didn't remember the password for his personal Gmail either. But he did get into the separate brunch Gmail. Fortunately, he did remember his Baldor password, and their website lets you export invoices to Excel. We tried uploading those back to Google Sheets, but that was feeling like a prohibitive number of clicks — and too many hops across interface boundaries between different paradigms. (You pay a cognitive tariff trying to go between a "traditional" filesystem and whatever Google Drive is.) So we tried Excel, which was already installed from when this was Claire's laptop (under a different user account). But that wanted a license key. I don't think he has an Office 365 subscription, so I signed in with mine.

To catch up on just the year-to-date invoices, that'd be like a hundred manual exports, which is nontrivial if your mouse skills are a couple orders of magnitude slower than your knife skills; after a little digging, we realized we could do bulk exports of a report for any date range. Unfortunately, the bulk export doesn't include the date of individual orders — so we settled on weekly batches to balance granularity and efficiency. He chose to do Sunday to Sunday — but, after some testing, I learned the Baldor date picker selects the range inclusive of both endpoints, so that would double-count Sunday orders. His instincts were robust to that ambiguity: he said he'd picked Sunday because he never orders on Sunday. Still, we settled on Sunday to Saturday, to be safe.

Finally, the categorization could begin. I had feared we'd have to cross-reference every single printed paper receipt to remember what each item had been for. That's how I'd have to do it. But — you know how great chess players can remember exact board positions, because they're not memorizing the positions of 32 pieces across 64 squares by rote, but rather understand some deeper structure of the board? Hagai can do that for his ingredients. He designed the menus, he cooks them every day; of course he can!
Pineapples are mostly for the continental breakfast, and partly for Báladi. Potatoes are mostly for the Ostrich Room. Radishes are half Báladi, half Ostrich. Rosemary and thyme are mostly Báladi. I'd have thought the one gallon of "Grade A Dark Organic Maple Syrup" for $64.69 was mostly for Báladi, because I love their french toast. (The syrup is baked in; you're not supposed to need syrup on top; people don't see that, so they always ask for it anyway.) But did you know it's also a major ingredient in the dressing for the Ostrich Room's green salad? And that's a higher-volume item, so it ends up skewing toward the Ostrich Room. He breezed through every row.

We hit a wall when we hit a week with a private event. Hagai didn't remember everything that was on the special menu; he'd have to go dig it up somewhere later. He said he mostly needed help with merging the invoices together. Merges are dangerous, especially when individual and bulk exports have different columns. I habitually do a lot of little quick keyboard jumps around the table to continually affirm I haven't overwritten or mangled data; Hagai, moving more slowly with the mouse, didn't tend to do those checks. (Good habits can be prohibitively expensive until you have good fine motor skills!) So we had to try to establish and practice a process that balanced ease of entry with ease of later analysis. I froze panes so he could always read the column headers and item names. I added a checksum to ensure each row tallies to 100%. I wanted to ensure the date column was tidy, which means repeating the date on every row instead of, e.g., relying on the dates in the filenames of the download you copied from, or leaving blank rows with a single big date for a new batch. So I showed him the "Fill down" functionality — but that does "Fill series" by default, which increments the date on every row; he had the good instinct to switch it to "Copy cells" himself. Still, that's another step in the process. A feature like "Fill down" increases your leverage — but every feature adds complexity, and leverage also makes it easier to make bigger errors.

Once we were happy with the data-entry structure, I went to set up a second sheet in the workbook for a summary of months against categories while Hagai went to check on his dough. And here, I used some AI. I've never done that much in Excel, and have never really understood array formulas or whatever, so I asked ChatGPT for help writing a formula that would sum the product of two columns if the date column fell within the given month. It was wrong at first (it forgot to coerce with "--"), but then we got to:

=SUMPRODUCT(
  --(TEXT(Items!$F$2:Items!$F$9998,"yyyymm")=TEXT($A2,"yyyymm")),
  Items!$E$2:$E$9998,
  Items!G$2:G$9998
)

which worked, at least for the partial fragment we had filled in. Hagai got back and said, "Claire would've done that in a few clicks," which is true — "but you get stuck with the slow kids."

The entire process described above feels like a perfect use case for AI. Here we have a chef, an artist, who could be liberated to focus on his craft (while also gaining crucial insight into his business) if this drudgery could be automated. And it sure seems like it could be! Ingredients are very clearly correlated with menu items. There's enough fuzziness that this hasn't yet been automated, but enough pattern that it now could be, given a calculator for words. You don't even need the new LLM stuff.
You could scrape the Baldor website to automate exporting all the invoices; maybe you could do some OCR on the paper receipts from random other markets. Then do some old-fashioned, 2010s-vintage scikit-learn stuff: label a few dozen rows, extrapolate to the rest. Some vectorization, some regression. Honestly, I don't know, I've never actually used scikit-learn, but I bet it'd guess pretty well (a sketch of what that might look like follows at the end of this passage). Hagai could just audit it and make corrections for special occasions, for events or menu changes or whatever. But that's a lot of glue for one tiny aspect of one tiny hotel. If you wanted a more 2020s solution, our friend Alexi is working on Autotab, which lets you train browser agents to do exactly this sort of scraping and processing all in one. It feels like a perfect use case. I've done a couple of promising tests.

Or you could just dump the CSV and menu images into one of the modern multimodal chatbots. ChatGPT 4o didn't say anything about the menu images, and started writing rather overfit Python code based, I think, mostly on the CSV, like:

# Define a function to estimate percentages based on item name and menu context
def attribute_percentages(name):
    name_lower = name.lower()
    # Basic attribution logic
    if "banana" in name_lower or "blueberries" in name_lower:
        return (50, 0, 50, 0)  # Brunch, Ostrich, Breakfast, Bar
    if "brioche" in name_lower or "slider roll" in name_lower:
        return (30, 40, 30, 0)
    if "labne" in name_lower:
        return (100, 0, 0, 0)
    if "sweet chili" in name_lower:
        return (0, 100, 0, 0)
    return (0, 0, 0, 0)  # Default if no match

Slider rolls are only used for private events; but how would it know that? It said, "Done. I've filled in the Brunch, Ostrich, Breakfast, and Bar percentages for each row." When, in fact, 95% of the rows were left blank.

Claude 3.7 Sonnet at least analyzed and read the menu items. But then it, too, wrote some simplistic code, like:

// Brunch-specific items
else if (name.includes('frena') || name.includes('shakshuka') ||
         name.includes('labneh') || name.includes('matbucha') ||
         name.includes('bureka')) {
    percentages = { Brunch: 80, Ostrich: 10, Breakfast: 10, Bar: 0 };
}

Those are all exclusively brunch items; I guess it spread the percentages around a bit just to hedge? But, more importantly: you're not gonna find shakshuka on our distributor invoice! We don't just buy these things whole and resell them! Hagai is better than that! They're made from ingredients! Tomatoes! Onions! Garlic! Claude then timed out before it could give me an output.

I imagine we could get something to work, and this might be a big enough internal pain point that we'll try some more. But you can see some problems here. The snap instinctual judgments that form the backbone of even the most maddening drudgery still draw on an enormous amount of very scattered context. To replicate Hagai's context, you'd need entire recipes, or maybe video of him preparing the foods; Toast sales data, or maybe video of the dining room; our hours; his calendar, featuring private events; communications among staff about what's getting used for what; the CSVs from Baldor; the paper receipts for quick runs to Loeb's; and maybe surveillance footage to capture exceptions. (Like, to know if a particular shipment never got delivered.) And it is hard to coax LLMs into following these complex multi-step tasks. If I ask directly, Claude "knows" that a shakshuka on the menu does not imply a shakshuka on the invoices. But its knowledge doesn't compose; it loses track in a bigger task. (And a couple of LLM calls for every row of every invoice would get expensive.)
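As a sketch of the 2010s-vintage route mentioned above (and it is only a sketch: the file name and column names are made up, and the real task wants fractional splits across venues where this single-label version just picks one), something like character n-gram TF-IDF plus logistic regression over the hand-labeled rows would probably guess decently:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("baldor_items.csv")      # hypothetical export
labeled = df[df["venue"].notna()]         # the few dozen rows Hagai labeled

model = make_pipeline(
    # Character n-grams are robust to vendor SKU noise like "LABNE 12/8OZ"
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(labeled["item_name"], labeled["venue"])

unlabeled = df[df["venue"].isna()]
probs = model.predict_proba(unlabeled["item_name"])
df.loc[unlabeled.index, "venue_guess"] = model.classes_[probs.argmax(axis=1)]
df.loc[unlabeled.index, "confidence"] = probs.max(axis=1)
# Hagai audits the low-confidence rows and the weeks with private events.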
I know people are working on better planning, and better prompting helps. But, for now, using these tools is itself skilled labor, not a replacement for labor. We invented a dishwashing robot around 1850, and yet people still clamor for a dishwashing robot; dishwashing labor is not eliminated by automation, but merely transformed. As you imagine better solutions, remember that they can't be targeted at me, Toph. They have to be for Hagai. He knows the restaurant, he knows the stakes, he has all the context for why we even care about this task. I couldn't even have phrased the problem statement if I hadn't just spent a couple of hours with him on it!

In reply to Joe's precipitating tweet, Nima Wedlake wrote to recommend a piece by Tim Harford (2017) about the slow diffusion of the technology of electrification: "In 1881, Edison built electricity generating stations… Yet by 1900, less than 5% of mechanical drive power in American factories was coming from electric motors." Part of the problem, Harford writes, was that an electric motor was not that much better than an existing large steam motor. Rather, the most drastic electric advantage appears when you consider the whole space of possible motors:

"[T]o take advantage of electricity, factory owners had to think in a very different way. They could, of course, use an electric motor in the same way as they used steam engines. It would slot right into their old systems. But electric motors could do much more. Electricity allowed power to be delivered exactly where and when it was needed. Small steam engines were hopelessly inefficient, but small electric motors worked just fine. So a factory could contain several smaller motors, each driving a small drive shaft. As the technology developed, every workbench could have its own machine tool with its own little electric motor. Power wasn't transmitted through a single, massive spinning drive shaft but through wires. A factory powered by steam needed to be sturdy enough to carry huge steel drive shafts. One powered by electricity could be light and airy. Steam-powered factories had to be arranged on the logic of the driveshaft. Electricity meant you could organise factories on the logic of a production line. Old factories were dark and dense, packed around the shafts. New factories could spread out, with wings and windows allowing natural light and air. In the old factories, the steam engine set the pace. In the new factories, workers could do so."

If electrification allowed reorganizing along the logic of the assembly line, then I'm not sure along what logic AI will reorganize the food and beverage business. Distributed hordes of Rick Rubins just vibing about the next menu to be farmed out to ghost kitchens. Something like that. But if even VisiCalc still hasn't diffused throughout our hotel operations, it could take a while. [2]

Other ways AI has affected us…

Computational photography. Photography forms the backbone of all our marketing. Whereas once we might have had to pay a skilled illustrator to render our charming carriage house, today we can just snap a pic on an iPhone. This is especially important for advertising the music program in the atmospheric and dimly lit Ostrich Room. You never know when a night at the bar is gonna get really lively, and it's great to be able to capture it when it happens!
I think today’s iPhones are a lot better at low-light photography partly because they’re taking long exposures and using techniques we might as well label “AI” to combine them into a single not-that-blurry shot. (See also.) Still, we end up feeling like we take a million bad photos when we just need one good one. (Nobody will be looking at my carriage house photo in 80 years like we’re still looking at that postcard.) We’ve got Julie coming on the 19th to take some more professional shots. Better be a good night at the bar! Come through! Pack those seats!

Facial recognition. Once we’ve got all those photos, we often find them through tagged faces. Like maybe we wanna show Sean to advertise his new cocktail, or find pics of a band who’s coming back to play again.

Our logo. I traced the “Apple Tree Inn” wordmark from a photograph of the old wood sign out on the road. On social media, we just use the wordmark. But, on the website, it also has that tree. I generated that tree shortly after Claire took over by giving Midjourney a photo of the chipped and faded tree on the roadside wood sign and asking it to kinda upscale and embellish it. As much as I admire David Holz’s crew, and with all due gratitude to Midjourney, that feels shamefully lazy for such a prominent part of our brand. I sorta forgot about it for a while, but we’re working on replacing it.

Dreaming of hot tubs. A year ago we tried using Midjourney and ChatGPT to imagine some hot tubs perched up in the woods. (Claire dreams of making some onsen-inspired facilities.) At the time, they couldn’t really build coherently upon reference imagery; now they often can. I can imagine that being useful as Claire looks ahead to some renovations, though it’s not a substitute for an interior designer or landscape architect so much as a way of talking with them. I do wonder if those specialists are starting to feel about AI inspo the way Taylor Ramos and Tony Zhou of “Every Frame a Painting” feel about the effect of temp music on Marvel soundtracks.

Aggregators’ summarization. If you look up the inn on Google Maps (in the iPhone app), you see “Ask Maps about this place”, with some suggested questions.

“Is the inn undergoing renovations?” It says: “Yes, renovations have been done to the historic inn. In April 2024, new owner Claire Collery bought the inn from Max and James. Collery’s plans include repainting the outside, renovating the restaurant and bar, and upgrading some rooms. She also plans to reopen the tavern’s nightclub on weekends and offer live entertainment.” We say: Claire has indeed repainted, but has not really renovated the restaurant or rooms, structurally, and in any case none of it is currently ongoing; the tavern is hardly a “nightclub” and has been reopened for almost a year now.

“Is it in a quiet location?” It says: “Yes, it has quiet hours from 10 PM–6 AM. Some say that the bar has poor sound isolation, but recommend booking a room in the lodge or on the third floor.” We say: I don’t know how official the quiet hours are. The bar indeed has poor sound isolation. But that answer doesn’t really address the overall location of the inn, which I’d say is very quiet.

“Is there live music?” It says: “Yes, the tavern has live music on weekends and evenings. Some say the music is professional and amazing.” We say: Yes, currently Thursday through Saturday evenings. Thanks!

“Is there free breakfast?” It says: “Yes, the inn serves a free continental breakfast every day from 8–10 AM.
The breakfast is included in the room rate and is served in the main house.” We say: Yup!

“Is it near Tanglewood?” It says: “Yes, it's within walking distance. The inn is a 10-minute walk from Tanglewood Music Center.” We say: You know it! Nobody calls it the “Tanglewood Music Center”, and its campus is large relative to the walk: the near end is under ten minutes, the far end more like twenty.

“What style is the inn?” It says: “The Apple Tree Inn in Lenox, Massachusetts is a Victorian-style inn built in 1885. The inn has a mix of Victorian finishes and modern art. The main house is a three-story building with a living room, fireplace, bar and dining room, and brunch venue. The inn also has a rustic lodge building with 21 rooms.” We say: Yup! Though we’ve taken down most of the modern art.

She hasn’t seen these examples, but Claire was just saying that these AI things don’t seem to tell you anything you wouldn’t glean from thirty seconds perusing the listings. But it can be much worse: wrongly saying we’re undergoing renovations could hurt reservations, and I have no idea how we’d correct that.

Segmentation and OCR. To capture the above AI responses, since Google Maps doesn’t seem to allow copy-pasting them, I took a screenshot on my phone, then selected the text in the Photos app, copied it, and pasted it on my laptop.

…and other ways it might affect us.

Basic copywriting? Running an inn requires a lot of writing: website copy, newsletters, info sheets for check-in. People like running those prompts through ChatGPT and giving us plausible-sounding drafts. The results are horrifying dreck. Like looking at a corpse, the horror arises from the uncanny mix of verisimilitude and lifelessness. Moreover, there is a sort of Gell-Mann amnesia at work: I feel the horror when our inn, our home, is described like that — but I do not expect you to feel such horror, and I might not feel such horror if it were about someone else’s. I am sure AI could generate plausible SEO-friendly pages about every local attraction, which might improve our PageRank — but, to me, the only point of such a page would be if it said something the other million pages about Tanglewood aren’t already saying. Or, at least, something personal and true.

Answering calls? I saw a Google Cloud ad the other day: “With Google AI, Sullivan County cuts call volumes and boosts staff’s productivity.” As a customer of various businesses myself, “cutting call volumes” sounds sinister; I’d suspect it’s more likely putting up roadblocks than answering questions. But then I think back to Claire and Alison reviewing website edits with a clear purpose: each line on the website should try to… cut call volume. The previous owners started using a call center in the Philippines to answer calls to the hotel; the call center was fast, diligent, polite, and almost always correct, 24/7. But still, Claire and Christian managed to cut it off two weeks ago. It confused and alienated guests, and in turn alienated Claire from their confusion and desires. I suspect AI would be similar.

Filing PRs? I ripped out the old WordPress website in favor of a little SvelteKit site, which is way better for me. But, in some ways, it’s way worse for the hotel staff, who now have no CMS at all for updating the site themselves; whether updating hours, menus, or room info, they just have to talk to me! (The exception is live music listings, which pull from a Google Sheet.) That’s been OK, but I have been thinking about setting up a headless CMS for them.
(Any recommendations?) But it does also cross my mind that we may be at the point where they could just describe certain desired website updates to some kind of Cursor or Copilot agent, which would just put up a PR for me to review. I can’t tell how that’d be.

Spotting inconsistencies? Right now, I think LLMs’ half-miraculous, half-maddening capabilities would best be put to use just spotting contradictions among our various assertions. We often update information (like bar hours) in one place, but fail to update others. The information could be stated with different phrasing or terminology, in different places and formats — like a Google Maps listing saying we close at 10 p.m., contradicting an Instagram story with a photograph of a handwritten note that this week we’ll be closed for spring cleaning. There are probably phrases buried in our old terms of service that no longer apply that we haven’t noticed. Computing those errors is by no means trivial, but now suddenly very doable (a rough sketch follows below). I’m much more bullish on this sort of translation than I am on generating de novo customer-facing texts — complementing our intelligence rather than short-circuiting it, leveraging machine scale and speed while leaving the voice to the people with skin in the game.
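To make that concrete, a minimal sketch of the pairwise-checking idea; ask_llm is a stand-in for whichever chat-model API you like, and the statements are illustrative, not our real listings:

    # Hedged sketch: flag contradictions among scattered public statements.
    from itertools import combinations

    def ask_llm(prompt):
        # Hypothetical wrapper; wire up your chat-model client of choice here.
        raise NotImplementedError

    statements = [
        ("Google Maps", "Closes at 10 p.m."),
        ("Instagram story", "Closed all week for spring cleaning."),
        ("Old terms of service", "Quiet hours are 10 PM to 6 AM."),
    ]

    # Pairwise checks cost O(n^2) calls; fine for dozens of assertions,
    # and you'd pre-filter by topic (hours, music, rooms) before scaling up.
    for (src_a, text_a), (src_b, text_b) in combinations(statements, 2):
        verdict = ask_llm(
            "Two public statements about the same inn:\n"
            f"1. ({src_a}) {text_a}\n"
            f"2. ({src_b}) {text_b}\n"
            "Reply CONTRADICT or CONSISTENT, with one sentence of reasoning."
        )
        print(f"{src_a} vs. {src_b}: {verdict}")

The point is the shape of the loop, not the prompt; a human still adjudicates the flags.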
Redistribution from arts patrons to effective altruists? Being situated directly across the street from Tanglewood, we are unusually exposed to shifting winds in philanthropic ethics. The Boston Symphony Orchestra’s attendees and donors tend to be old; our future, and theirs, depends on attracting more of the young; young capital, and especially AI capital, tends to be more inclined toward Peter Singer’s ethos that spending $115 million on classical music is morally equivalent to poking out the eyes of as many people as that could save from trachoma. I do not quite agree with that, but it is a powerful line of thought, and has done some great good, and seems hard to oppose.[3]

Maybe next time we should talk about crypto.[4]

I am often struck by how similar the work of running an inn is to what it would’ve been a hundred or a thousand years ago. Ancient Roman and Japanese baths still inspire Claire today; old descriptions by Chaucer, Melville, and Kipling still capture what she finds romantic about any place people cross paths to rest. Maybe hospitality is fundamentally an interpersonal service; however else communication changes, the important thing is that someone will say hello to you.

Omar Rizwan wrote, in “Against recognition” (2021):

If you have recognition, it should be a sort of overlay you put on the thing (maybe one of many such overlays); you shouldn’t destroy the thing and replace it with its ashes. If it has to exist, the text recognizer should attach an overlay to the image that says “it might have this text in it”; the image shouldn’t itself be transformed into text. (And ideally, that overlay would be rich with context and provenance; it wouldn’t just be a blob of plain text; it would know what image it’s from, admit other texts that it could potentially be, talk about how likely each word of it is to be correct, say as much as possible about the recognizer’s process and thinking.) The original thing is still around and is still the source of truth.

Bret Victor wrote, in “What is Realtalk’s relationship to AI?” (2024):

One of our oppositions to product consumption is that it is a form of “outsourcing understanding”. Instead of learning a practice, one uses a product which insulates them from the underlying knowledge. The “smarter” the product, the less the user needs to understand. While each product may be convenient individually, the cumulative result is an almost universal ignorance, helplessness, dependence, and fragility. [more]

We see many current AI products, particularly so-called generative tools, as “outsourcing understanding” in its most virulent form. [more]

On the other hand, there are some “non-semantic” uses of AI techniques, such as text recognition and speech transcription, which can be used in ways that don’t displace learning and development. We are still opposed to their complexity, opacity, and enormous training sets, but we may be more willing to cautiously admit them for peripheral purposes, and seek ways to enhance their visibility.

The most promising sort of AI, for us, is descriptive. It can index certain aspects of our photographs, or lossily transcribe a recorded interview with our barback, or identify that “ABF Boneless Skinless Chicken Breast 20 LB” may have something to do with the schnitzel on the menu. But it cannot say hello to you.

[1] Our friends at Momofuku experimented with abolishing tipping for four years before reinstating it in 2022, with CEO Marguerite Mariscal saying the experiment “doesn’t create more balance in the way that we thought it would.”

[2] The Wikipedia article on the diffusion of innovations has some tantalizing references. Gabriel Tarde wrote in The Laws of Imitation (1903), “an invention bears the same relation to imitation as a mountain to a river”; inventions spread by mimesis. Everett Rogers coined “early adopter” (and labels for other phases of the lifecycle) in Diffusion of Innovations (1962). He wrote, similarly, “Most individuals evaluate an innovation, not on the basis of scientific research by experts, but through the subjective evaluations of near-peers who have adopted the innovation. These near-peers thus serve as social models, whose innovation behavior tends to be imitated by others in their system.”

[3] By chance, Rev. Caggiano’s sermon this morning focused on Mary of Bethany anointing Jesus’s feet lavishly with perfume that Judas points out could’ve been sold for 300 denarii to be given to the poor.

[4] For more on how macroscopic technology trends can show up in the local character of a place, see also Ryan Salame, whose FTX crypto fortune let him buy I think like half the bars, restaurants, and hotels in downtown Lenox before it landed him in federal prison. I will say, some people who worked with him in those restaurants really miss him and wish him well!

Comments:

Toph Tucker (author), Apr 7, 2025: Erratum via Claire: Hagai doesn’t order the liquor — just, like, lemons and limes for the bar.

E2, Apr 12, 2025: I have yet to see an AI competently handle any task that requires comprehension of things and relationships in the real world. They can only work with previously-written descriptions, numbers, and pictures; they cannot correlate those with any actual existence. They have no *experience* of anything existing non-virtually.
=== LLM Daydreaming · Gwern.net (https://gwern.net/ai-daydreaming) ===

LLM Daydreaming (neuroscience, RL exploration). Proposal & discussion of how default mode networks for LLMs are an example of missing capabilities for search and novelty in contemporary AI systems. 2025-07-12–2025-07-14 (finished) · certainty: possible · importance: 6.

Despite impressive capabilities, large language models have yet to produce a genuine breakthrough. The puzzle is why. A reason may be that they lack some fundamental aspects of human thought: they are frozen, unable to learn from experience, and they have no “default mode” for background processing, a source of spontaneous human insight. To illustrate the issue, I describe such insights, and give an example concrete algorithm of a day-dreaming loop (DDL): a background process that continuously samples pairs of concepts from memory. A generator model explores non-obvious links between them, and a critic model filters the results for genuinely valuable ideas. These discoveries are fed back into the system’s memory, creating a compounding feedback loop where new ideas themselves become seeds for future combinations. The cost of this process—a “daydreaming tax”—would be substantial, given the low hit rate for truly novel connections. This expense, however, may be the necessary price for innovation. It would also create a moat against model distillation, as valuable insights emerge from the combinations no one would know to ask for. The strategic implication is counterintuitive: to make AI cheaper and faster for end users, we might first need to build systems that spend most of their compute on this “wasteful” background search. This suggests a future where expensive, daydreaming AIs are used primarily to generate proprietary training data for the next generation of efficient models, offering a path around the looming data wall.

…I feel I am nibbling on the edges of this world when I am capable of getting what Picasso means when he says to me—perfectly straight-facedly—later of the enormous new mechanical brains or calculating machines: “But they are useless. They can only give you answers.” —William Fifield, “Pablo Picasso—A Composite Interview” (1964)

Dwarkesh Patel asks why no LLM has (seemingly) ever made a major breakthrough or unexpected insight, no matter how vast their knowledge or how high their benchmark scores. While those are, by definition, extremely rare, contemporary chatbot-style LLMs have now been used seriously by tens of millions of people since ChatGPT (November 2022), and it does seem like there ought to be at least some examples at this point. This is a genuine puzzle: when prompted with the right hints, these models can synthesize information in ways that feel tantalizingly close to true insight; the raw components of intelligence seem to be present; but… they don’t. What is missing? It’s hard to say because there are so many differences between LLMs and human researchers.

Missing Faculties

Continual Learning

Frozen NNs are amnesiacs. One salient difference is that LLMs are ‘frozen’, and are not allowed to change; they don’t have to be, and could be trained on the fly (eg.
by the long-standing technique of dynamic evaluation), but they aren’t. So perhaps that’s a reason they struggle to move beyond their initial guesses or obvious answers, and come up with truly novel insights—in a very real sense, LLMs are unable to learn. They are truly amnesiac. And there are no cases anywhere in human history, as far as I am aware, of a human with anterograde amnesia producing major novelties. That may be an adequate answer all on its own: they are trapped in their prior knowledge, and cannot move far beyond their known knowledge; but by definition, all that is either known or almost known, and cannot be impressively novel.

Continual Thinking

But another notable difference is that human researchers never stop thinking. We are doing our continual learning on not just observations, but on our own thoughts—even when asleep, a human is still computing and processing. (This helps account for the shocking metabolic demands of even a brain which is ‘doing nothing’—it’s actually still doing a lot! As difficult as it may feel to think hard, from a biological perspective, it’s trivial.) Research on science & creativity emphasizes the benefit of time & sleep in creating effects like the incubation effect, and some researchers have famously had sudden insights from dreams. And we have all had the experience of a thought erupting into consciousness, whether it’s just an inane pun (“you can buy kohl at Kohl’s, LOL”), a clever retort hours too late, a frustrating word finally coming to mind, suddenly recalling anxious worries (“did I really turn off the stove?”) like intrusive thoughts, or, once in a lifetime, a brilliant idea. (Try meditating for the first time and writing down all the thoughts that pop up until they finally stop coming, and one may be amazed & frustrated!) Often these eruptions have nothing at all to do with anything we have been thinking about, or have thought about in decades (“wait—back at that college party, when that girl looked at my hand—she was hitting on me, wasn’t she?”). Indeed, this essay is itself the product of such an eruption—“what is the LLM equivalent of a default mode network? Well, it could look something like Jones 2021, couldn’t it?”—and had nothing to do with what I had been writing about (the esthetics of video games).

Hypothesis: Day-Dreaming Loop

So… where & when & how does this thinking happen? It is clearly not happening in the conscious mind. It is also involuntary: you have no idea some arcane random topic is bubbling up in your mind until it does, and then it is too late. And it…
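The abstract above describes the day-dreaming loop concretely enough to sketch in code. A minimal sketch, assuming hypothetical generate and critique wrappers around two LLM calls; this is an illustration of the proposal, not anything from the essay itself:

    import random

    def daydream_step(memory, generate, critique, threshold=0.8):
        """One pass of the day-dreaming loop (DDL): sample, connect, filter, feed back."""
        a, b = random.sample(memory, 2)   # sample a random pair of stored concepts
        idea = generate(f"Explore a non-obvious connection between: {a!r} and {b!r}")
        if critique(idea) >= threshold:   # critic model scores novelty/value in [0, 1]
            memory.append(idea)           # hits become seeds for future combinations
        return memory

Run continuously, the critic’s low hit rate is the “daydreaming tax,” and the append is the compounding feedback loop.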
</content>

Write only the summary, with no preamble or additional commentary.
