Might reference Levels of AGI.

https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures/: Running all these models at once and properly parallelizing it is an insanely difficult systems and infrastructure problem. For example, all the different models must run on various GPUs, and the results must be routed to the appropriate next stage in the pipeline, while also updating multiple models' weights and making sure the workload is load balanced appropriately. Furthermore, the RL training functional verifier "sandboxes" often don't run well on GPUs, which means they often get offloaded to the CPU.

https://twitter.com/Miles_Brundage/status/1862070079581782248
https://twitter.com/DKokotajlo67142/status/1862095827700793499
https://twitter.com/deanwball/status/1861975137425244200
https://twitter.com/random_walker/status/1866858354942832966

Gwern on test-time compute being used for training: https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu?s=03#comments. He doesn't seem to address the question of how much capability it's possible to train into a model of a given size, in the limit. Also, training on distilled / perfected chains-of-thought might not give the model the ability to correct its own trains of thought when they go awry.

https://news.ycombinator.com/item?id=42565606: 30% drop in o1-preview accuracy when Putnam problems are slightly varied.

https://www.aisnakeoil.com/p/is-ai-progress-slowing-down, "Let's stop deferring to insiders": Not only is it strange that the new narrative emerged so quickly, it's also interesting that the old one persisted for so long, despite the potential limitations of model scaling being obvious. The main reason for its persistence is the assurances of industry leaders that scaling would continue for a few more years. In general, journalists (and most others) tend to defer to industry insiders over outsiders. But is this deference justified? Industry leaders don't have a good track record of predicting AI developments. A good example is the overoptimism about self-driving cars for most of the last decade. (Autonomous driving is finally real, though Level 5 — full automation — doesn't exist yet.) As an aside, in order to better understand the track record of insider predictions, it would be interesting to conduct a systematic analysis of all predictions about AI made in the last 10 years by prominent industry insiders.

There are some reasons why we might want to give more weight to insiders' claims, but also important reasons to give less weight to them. Let's analyze these one by one. It is true that industry insiders have proprietary information (such as the performance of as-yet-unreleased models) that might make their claims about the future more accurate. But given how many AI companies are close to the state of the art, including some that openly release model weights and share scientific insights, datasets, and other artifacts, we're talking about an advantage of at most a few months, which is minor in the context of, say, 3-year forecasts. Besides, we tend to overestimate how much additional information companies have on the inside — whether in terms of capability or (especially) in terms of safety. Insiders warned for a long time that "if only you knew what we know..." but when whistleblowers finally came forward, it turned out that they were mostly relying on the same kind of speculation that everyone else does.
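A minimal sketch of the GPU/CPU split described in the SemiAnalysis note above: policy rollouts come from the accelerator side, while the functional-verifier "sandboxes" run in ordinary CPU worker processes. All names here (generate_rollout, run_verifier) are hypothetical stand-ins, not any lab's actual pipeline.

```python
# Sketch: route RL rollouts from a (stubbed) GPU generator to CPU-side verifier
# processes. generate_rollout() and run_verifier() are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor

def generate_rollout(prompt: str) -> str:
    # Stand-in for GPU-side policy sampling (a real system would call the model here).
    return f"candidate solution for: {prompt}"

def run_verifier(rollout: str) -> float:
    # Stand-in for a CPU sandbox (compile/run code, check unit tests, etc.).
    # Returns a scalar reward later used for the policy update.
    return float(len(rollout) % 2)  # dummy pass/fail signal

def collect_rewards(prompts: list[str]) -> list[tuple[str, float]]:
    rollouts = [generate_rollout(p) for p in prompts]   # "GPU" stage
    with ProcessPoolExecutor() as pool:                 # CPU sandbox stage
        rewards = list(pool.map(run_verifier, rollouts))
    return list(zip(rollouts, rewards))

if __name__ == "__main__":
    print(collect_rewards(["prove 1+1=2", "sort a list in O(n log n)"]))
```

Even in this toy version, the load-balancing issue is visible: generation throughput and verification throughput are set by completely different hardware pools.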
https://x.com/snewmanpv/status/1870496367233438200
https://twitter.com/S_OhEigeartaigh/status/1868606482507616353

From https://thezvi.substack.com/p/ai-94-not-now-google: How fast does the 'capability density' of LLMs increase over time, meaning how much you can squeeze into the same number of parameters? A new paper proposes a scaling law for this, with capability density doubling every 3.3 months.

In https://www.hyperdimensional.co/p/thresholds, Dean writes about threshold effects and notes: There are an enormous number of vectors for improvement on the table, and many are low-hanging fruit. All of this should combine to make you think that the next 18-24 months of AI progress will be more rapid than the last 18-24 months. It might also be less controllable: I've written before that the o1 models fundamentally challenge some of the fundamental assumptions of post-ChatGPT AI policy in general and the compute thresholds in particular.

Tsarathustra (@tsarnick) posted at 6:54 PM on Wed, Nov 27, 2024: Yann LeCun says his estimate for the creation of human-level AI is not that different from Sam Altman or Demis Hassabis - it is possible within 5-10 years https://t.co/7gG705nmNL (https://x.com/tsarnick/status/1861921602235150545?t=bJerP2YQgBMQrnJMcG0ttA&s=03)

Richard Ngo continues to think of AGI in terms of time horizons - a 'one-minute AGI' can outperform a human on tasks that take a human one minute, with the real craziness coming around a 1-month AGI, which he predicts for 6-15 years from now. Richard expects maybe 2-5 years between each of 1-minute, 1-hour, 1-day and 1-month periods, whereas Daniel Kokotajlo points out that these periods should shrink as you move up: if you do have the 1-day AGI, then that seems like it should greatly accelerate your path to the 1-month one.

https://www.theintrinsicperspective.com/p/great-scientists-follow-intuition

Prepare for Takeoff section of Zvi, including: I'm not sure that's what this study means? Yes, they could improve their scores over more time, but there is a very easy way to improve score over time when you have access to a scoring metric as they did here - you keep sampling solution attempts, and you do best-of-k, which seems like it wouldn't score that dissimilarly from the curves we see. And indeed, we see a lot of exactly this 'trial and error' approach, with 25-37 attempts per hour.

From https://thezvi.substack.com/p/ai-91-deep-thinking, the section beginning with "Here's another perspective on why people might be underestimating AI progress?"

https://www.newcomer.co/p/our-biggest-takeaways-from-cerebral?s=03

From https://www.theintrinsicperspective.com/p/nonpolitical-content-online-gets, the section "Why did AI companies stop competing in e-sports?" notes that AI has not gone superhuman in complex games.

Dean W. Ball (@deanwball) posted at 4:14 PM on Fri, Nov 22, 2024: Counterintuitively, AI research may end up being one of the easiest tasks to substantively automate. Maybe not 100%, but a very significant majority of the work. (https://x.com/deanwball/status/1860114571111084250?t=mHKNdAI88B-x5PDm5Q7DXA&s=03)
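Quick back-of-envelope arithmetic on the "capability density doubles every 3.3 months" claim in the Zvi note above; the growth factors below are just the implied exponentials, not numbers from the paper.

```python
# If capability density doubles every 3.3 months, the implied growth over
# longer horizons is:
DOUBLING_MONTHS = 3.3
for months in (12, 24, 36):
    factor = 2 ** (months / DOUBLING_MONTHS)
    print(f"{months} months -> ~{factor:.0f}x capability per parameter")
# -> ~12x at 12 months, ~155x at 24 months, ~1923x at 36 months
```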
Eli Lifland (@eli_lifland) posted at 2:25 PM on Fri, Nov 22, 2024: Takeaways re: AI R&D performance: 1. Claude 3.5 Sonnet reaches ~50th percentile human baseline 8-hour performance. 2. Sonnet Old -> New is a 0.2 jump in 4 months. We're 0.6 away from 90th percentile baselines. I think this significantly shortens my timelines. (caveats in reply) https://t.co/VzldSvcMCU (https://x.com/eli_lifland/status/1860087262849171797?t=GRtV2rrGVdp0agNmzzFIGA&s=03)

https://www.understandingai.org/p/are-transformer-based-foundation

From Zvi: Eduard Harris (CTO Gladstone): There's a big and growing disconnect between the AI models you and I are using, and the versions major labs are keeping for themselves internally. Internal versions are more capable. Be cautious when claiming AI can't do something solely based on trying it with a public model. https://x.com/polynoamial/status/1855037689533178289?t=fO5427vG6vqJkDJbvh4O_g&s=03

Zvi: Tear Down This Wall. wh (@nrehiew_) posted at 0:39 PM on Tue, Nov 26, 2024: If you actually want to learn about test time scaling, just watch this talk instead. Significantly higher levels of alpha than "o1-replication" papers https://t.co/lcy9sVjC4D (https://x.com/nrehiew_/status/1861464774904537215?t=6xU6YvYS-Jpbfj3-fPZu0Q&s=03)

From https://thezvi.substack.com/p/ai-91-deep-thinking: Humans attach too much importance to when AIs fail tasks that are easy for humans, and are too impressed when they do things that are hard for humans, paper confirms. You see this all over Twitter, especially on new model releases - "look at this idiot model that can't even do [X]." As always, ask what it can do, not what it can't do, but also don't be too impressed if it's something that happens to be difficult for humans.

Review The Last Mile.

RE-Bench is of limited fidelity: https://x.com/BethMayBarnes/status/1860065450824204686?t=HZ53dk_vC5XccxYX3VjT5Q&s=03

https://www.interconnects.ai/p/scaling-realities

Ethan Mollick (@emollick) posted at 10:42 PM on Wed, Nov 13, 2024: For what it is worth, the idea that there is no barrier to scaling to AGI is something multiple heads of AI labs have been saying publicly - Dario took his victory lap on scaling & Kevin Scott of Microsoft said similar things. We will find out whether they are right soon, I guess. (https://x.com/emollick/status/1856950856395984940?t=dvS4aVH3AkDKOrgolUcQKA&s=03)

Miles Brundage (@Miles_Brundage) posted at 1:22 PM on Wed, Nov 13, 2024: Betting against AI scaling continuing to yield big gains is a bad idea. Would recommend that anyone staking their career, reputation, money etc. on such a bet reconsider it. (https://x.com/Miles_Brundage/status/1856809970089721991?t=kZV4vNQ3O_9ig9VYzvIKfQ&s=03)

swyx (@swyx) posted at 11:10 AM on Wed, Nov 13, 2024: balanced thoughts on the "ai is hitting a wall" meme: - gpt5, 3 opus, gemini 2 delays are real. no >2T models have made it past release 1.5yrs after GPT4 - its poignant that Nvidia is hyping FP4 perf just as research comes out saying FP4 is quantizing too far - ilya, noam, (https://x.com/swyx/status/1856776660986859632?t=OCywwz3dX18kIpNi7zwy6A&s=03)

https://www.theinformation.com/articles/following-openai-google-changes-tack-to-overcome-slowdown-in-ai-improvement

Erik Hoel's skeptical take, arguing that we're now moving back to the classic sorts of search algorithms that never really worked in the past: https://www.theintrinsicperspective.com/p/ai-progress-has-plateaued-at-gpt?utm_source=post-email-title&publication_id=332996&post_id=144605076

From The Information, paths forward that aren't simple scaling: Welcome back! Kalley, Rocket and Jon here. Now that we know OpenAI isn't the only artificial intelligence developer seeing a slowdown in improvements using traditional "scaling" methods, it's worth looking at all the ways companies are trying to make up for that.
There are tried and true methods to make large language models perform better, such as tweaking the parameters—settings that determine how the models "learn" from data and answer questions. And there are loads of other changes developers are making to models after they are initially trained, using a bevy of data and computing power, including asking armies of humans to rate their answers and steer them toward better ones. There are also some newer strategies, including one that's hot off the research presses and has turned some heads this week.

First, let's look at some common techniques. One way that staff at Google have been trying to eke out gains is by focusing more on settings that determine how a model learns from data during pre-training, a technique known as hyperparameter tuning. It differs from finetuning, which involves making changes to models after pre-training so they are better at specific tasks, such as coding.

Some AI researchers are also trying to remove duplicates from training data because they suspect that repeated information could be hurting performance. But preventing copies at such a large scale is inherently difficult, given that the models are essentially learning from the entire internet. Google, for its part, says this is not a new challenge.

Another strategy is focusing on post-training, when a model learns to follow instructions and provide responses that humans prefer through steps such as finetuning. Post-training doesn't appear to be slowing in improvement or facing data shortages, AI researchers tell us, in part because finetuning relies on data that people have annotated to help a model perform a particular task. That would suggest that AI developers could improve their models' performance by adding more and better annotations to their data. (This story explains more about the gains AI firms are making via post-training tweaks, especially in coding.)

AI developers are also using older models to generate training data for newer models, with the hope that AI-generated, or synthetic, data can make up for the dearth of other types of data. Meta Platforms, for example, used Llama 3.1 to generate post-training data for Llama 3.2. But there may be limits to synthetic data generation. Google hasn't seen significant improvements from using synthetic data for Gemini. Meanwhile, OpenAI employees have concerns that the company's unreleased next big large language model, Orion, may resemble earlier models because those models generated data for Orion.

If the well of training improvements runs dry, developers will likely continue throwing computing power into models after they have been trained, to improve performance. One simple approach, sampling, involves asking a model the same question dozens or even a hundred times, and then picking the best answer.

Another way to use more computing power after the pre-training phase involves reasoning, when a model takes more time to "think" when answering questions—a concept called test-time compute. OpenAI and Google both are working on reasoning, which essentially gets baked into LLMs, and Meta focused on improving reasoning capabilities in its recent model Llama 3. Many people at OpenAI believe the new reasoning paradigm will make up for the limits it is facing in the training phase—so we look forward to seeing more results from the full version of o1, and then o2—what Orion may eventually be called, as we discussed Monday.

In an apparent nod to this idea, CEO Sam Altman tweeted late Wednesday: "there is no wall." And over the past week, an idea has emerged for how to improve reasoning: doing a tiny bit of finetuning before the model gives an answer. In a preprint of a paper released Monday, researchers from the Massachusetts Institute of Technology said they made gains by training Llama models on a few modified versions of each question before getting an answer. OpenAI researcher Noam Brown said on X that he was "excited" by this approach, so we may hear more about it soon.
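A minimal sketch of the "sampling" approach described above (ask the same question many times, keep the best-scored answer). Here sample_answer and score are hypothetical stand-ins for a model call and a verifier or reward model.

```python
import random

def sample_answer(question: str) -> str:
    # Stand-in for one stochastic model completion.
    return f"answer #{random.randint(0, 999)} to {question!r}"

def score(question: str, answer: str) -> float:
    # Stand-in for a verifier, unit tests, or a reward model.
    return random.random()

def best_of_n(question: str, n: int = 64) -> str:
    # Generate n candidates, return the one the scorer likes best.
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: score(question, a))

print(best_of_n("What is 17 * 24?"))
```

The "trial and error" curves Zvi mentions earlier (25-37 attempts per hour) are essentially this loop run under a time budget instead of a fixed n.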
From a draft paper Sayash shared (thread "Dinner the night before The Curve", search for "wanted to share a draft of a paper where Arvind and I try to get at the root of the differences in worldviews on the future of AI") – could reiterate that there's a long history of claiming that we're "almost there" in AI. (See also Cyc.) AI pioneers considered the two big challenges of AI (what we now call AGI) to be (what we now call) hardware and software. Having built programmable machines, there was a palpable sense that AGI was close. The Dartmouth conference organizers hoped to make significant progress toward the goal through a "2-month, 10-man" effort. Today we've climbed many more rungs of the ladder of generality. We often hear that all that is needed to build AGI is scaling, or generalist AI agents, or sample-efficient learning. But it is useful to keep in mind that what appears to be a single step might not be so. For example, there may not exist one single breakthrough algorithm that enables sample-efficient learning across all contexts. Indeed, in-context learning in LLMs is already "sample efficient", but only works for a limited set of tasks [30].

Clean up these notes, then file under Paths to AGI or something:
- Metacognition at both train and inference time
- Self-curating training data
- Sample efficiency
- Credit assignment; figuring out which level of the system to update
- Why are people able to learn tasks that don't have clear right & wrong answers, but we can't do that for LLMs with RL right now? Comes back to credit assignment? Or maybe if you frame everything as a prediction problem then you have right & wrong answers: was the prediction correct? → email Sam about how Yann LeCun's thing, being predictor-based, may get around the RL-on-unclear-outcomes and credit-assignment problems. (Maybe still no metacognition though?)
- Metacognition == consciousness? Sam Hammond talking about consciousness needing coherence; maybe it is how we seek coherence.

From https://podcasts.apple.com/us/podcast/its-not-about-scale-its-about-abstraction-francois-chollet/id1510472996?i=1000672837092: GPT-4 can solve Caesar ciphers, but only for n=3 or 5. This gets at what level of generality you learn, and inductive biases.

Will it get harder to measure progress? Ethan Mollick (@emollick) posted at 4:28 PM on Fri, Oct 18, 2024: Some comments seem to be missing some of the point - AI has always had mixed performance on tasks, the "jagged frontier." But it has been relatively easy to evaluate AI performance on lower level work, whereas when it comes to advanced problems, only a handful of people can test AI. (https://x.com/emollick/status/1847419607281213444?t=hA8v5VKwLJnS3yewLVHuug&s=03)

Future of Life Institute interview with Tamay Besiroglu: he talks about the potential difficulty of generating synthetic training data for tasks with no clear right / wrong outcomes, but then says scaling laws tell us we're going to progress rapidly? Maybe ask him about how to evaluate the y axis of the scaling laws.
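On the Chollet point above (GPT-4 handling Caesar ciphers only for shifts that are common in training data, like n=3): the task itself is trivial to specify, which is what makes the failure at arbitrary n interesting. A reference implementation for comparison:

```python
def caesar(text: str, shift: int) -> str:
    # Shift each letter by `shift` positions, wrapping within the alphabet.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar("hello world", 3))    # "khoor zruog" -- the familiar shift
print(caesar("khoor zruog", -3))   # decodes back to "hello world"
print(caesar("hello world", 11))   # an arbitrary shift, reportedly much harder for the model
```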
https://dynomight.substack.com/p/data-wall

Timelines / two-paths post: Untitled, https://www.cantorsparadise.com/gpt-4-is-amazing-but-still-struggles-at-high-school-math-competitions-cbc2e73738e, Untitled, Two Paths To AGI Substack draft, My Forecast For AI Milestones draft, Can We Set Goalposts That We Won't Need To Move? draft.

The fact that LLMs lack a constrained inner monologue feels like a fairly catastrophic weakness that needs to be changed. They're certainly given time to think between tokens, but that thinking is fixed compute, brief, and left totally undiscretised. (https://x.com/aidangomez/status/1749760985102164118?t=2BSnLpqow0PhZcoZj6UuMQ&s=03)

From https://importai.substack.com/p/import-ai-357-facebooks-open-source, the section on "Facebook bootstraps LLaMa 2": if this "self-training" approach has legs, it could remove the data barrier to going past human capability.

https://www.threads.net/@yannlecun/post/C8bXz_Huu2M/?xmt=AQGzf-Q_XBZq9jdUNN_1B-oXNBvsi2-pW4J-Wrp_UD-iyw

Dan Hendrycks tweet on algorithmic progress.

Perhaps the path to AGI will require models to have some sort of metacognitive ability to recognize when they're going wrong (or being messed with) and need to slow down and avoid running on autopilot. Progress on this could serve as an early warning sign for progress toward AGI: Marcello Herreshoff explains the idea of 'template fixation.' If your question is close enough to a sufficiently strong cliche, the cliche gets applied even if it does not make sense. Hence the stupid answers to river crossing questions or water pouring tests or other twists on common riddles. If we can't find a way to avoid this, math is going to remain tough. It is easy to see why this would happen. Another: https://twitter.com/DanHendrycks/status/1798390576922178004

Notes while listening to https://axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html: Some thoughts on why LLMs are less sample efficient than people. People's learning is driven by curiosity: we will deliberately seek out new information that directly bears on things we don't know yet, and maybe that are specifically on the edge of what we're ready to learn. We can explicitly reason about our own learning process. We know what we've learned and what we haven't. We can potentially reason about what works well in our own learning: what is our learning style, maybe even specifically for different kinds of tasks? We can see where we are or are not making progress on learning a particular thing, so we can constantly optimize our own learning process. This paper talks about extracting latent knowledge from models, which is interesting. In a human teacher-student relationship, the student can talk about what they've learned, what their tentative understanding is. Eventually, if they begin to surpass their teacher, they can talk to the teacher about what they've learned. We don't really have anything like that with these models yet.

https://thegradientpub.substack.com/p/update-73-language-erasure-long-context contains evidence that models can't really make great use of their advertised token buffers. Even the new RULER benchmark proposed here seems unrealistically simple.

From this Zvi post, review the section beginning "What would automated R&D look like? Epoch AI reports on some speculations."
Engage with Samuel Härgestam (Halcyon retreat, has short timelines).

Could reach out to the author of this excellent comment from an engineer who is already being heavily accelerated by Copilot, to probe the question of how things might unfold from here. If Copilot is doing "80% of the typing", how significant is that really?

Review https://thezvi.substack.com/p/on-devin. E.g.: I do notice that this is exactly the lowest-resistance default shortest path to ensuring that AI agents exist and have the capabilities necessary to cause serious trouble at the earliest possible moment, when sufficient improvements are made in various places. Our strongest models are optimizing for writing code, and we are working on how to have them write code and form and execute plans for making software without humans in the loop. You do not need to be a genius to know how that story might end.

Review: Epoch AI's Research - LINK. Epoch AI is a research org that investigates the trajectory of AI and its implications for society. I'd recommend clicking around their site, which is particularly helpful for building intuitions on AI timelines.

Flow Engineering and Code Integrity at Scale with Itamar Friedman, CEO of Codium AI: not worth reviewing at length, but the middle and second half of this interview were interesting. To solve coding challenges, they're building nontrivial workflows involving many LLM invocations: analyze a problem, analyze the test cases, think about how the problem might be solved, write some more tests, write the code, run the tests, iterate. This is an example of a current trend: layering hand-coded System 2 workflows + RAG on top of LLMs. For timelines, a key question is how far this will carry us.

Review this section from Zvi #52 on what slow vs. fast takeoff looks like.

The irresistible force vs. immovable object: scaling laws vs. the AI Progress Paradox / slow adoption / everything always takes longer than you think.

Do LLMs build world models? Yes: the recent results of chess at ~1800 Elo. No: inability to generalize from "A relation B" to "B inverse-relation A". Net: probably, when it's some combination of relatively-easy and strongly-encouraged-by-the-training-data?

https://www.reddit.com/r/math/comments/19fg9rx/some_perspective_on_alphageometry/: Most of the lifting is not done by AI. The big news here is that these geometry problems are more amenable to a particular form of brute force + algorithmic search than was understood. Seems unlikely to generalize to a broad class of math problems. The interesting question is how many other problem domains will turn out to be amenable to similar "too simple" attacks.
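A stripped-down sketch of the flow-engineering loop described in the Codium AI note above (analyze, draft tests, write code, run tests, iterate). The llm() function is a hypothetical stand-in for a model call; the real pipeline has many more stages.

```python
import subprocess
import sys
import tempfile

def llm(prompt: str) -> str:
    # Stand-in for a model call; a real pipeline would hit an LLM API here.
    return "def solve(x):\n    return x * 2\n"

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    # Execute candidate code plus tests in a subprocess and capture failures.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return result.returncode == 0, result.stderr

def solve_with_flow(problem: str, max_iters: int = 5) -> str:
    analysis = llm(f"Analyze this problem and its edge cases:\n{problem}")
    tests = "assert solve(2) == 4\nassert solve(0) == 0\n"  # would also be LLM-generated
    code = llm(f"Write solve() for:\n{problem}\nAnalysis:\n{analysis}")
    for _ in range(max_iters):
        ok, errors = run_tests(code, tests)
        if ok:
            return code
        code = llm(f"Tests failed:\n{errors}\nFix the code:\n{code}")
    return code

print(solve_with_flow("Given x, return 2*x."))
```

How far this kind of hand-coded System 2 scaffolding carries us is exactly the timelines question the note raises.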
Think about responding to Daniel Kokotajlo.

Tabs I had open:
- https://amistrongeryet.substack.com/p/e492b0ae-d637-4604-81d5-efdc9c1e0fbf
- https://amistrongeryet.substack.com/p/cfc9096d-a38c-476c-9eb2-e2bfd156953e
- https://amistrongeryet.substack.com/p/c06d9966-3cca-4168-9bb9-95abc3dbf537
- https://amistrongeryet.substack.com/p/684cd573-defd-4a5c-875e-9fc1071de140
- https://www.notion.so/The-Long-Road-From-ChatGPT-To-AGI-309b276186a24e78bad6a5cba754cbd9
- https://arxiv.org/pdf/2310.08560.pdf
- https://www.notion.so/A-Lower-Bound-On-AGI-Timelines-04135c7de1ce4f8aaffcfc3ed6e550ee
- https://www.notion.so/Two-Paths-to-AGI-1d698bcf62a444b69accf376da688ebd
- https://www.notion.so/The-Future-Of-Training-Data-beafc63e62394174a2518adaec770cbc
- https://www.notion.so/Other-People-s-Estimates-6e9f8a6a18414f37b37a40d9ffc0dda6
- https://www.notion.so/My-Forecast-For-AI-Milestones-c5a5ade965854bab8efcb5165e792ff8
- https://www.notion.so/My-Estimate-254770a7d6504b25b8d85c3471054225
- https://www.notion.so/Paths-for-Near-Term-Progress-1e3c0d5017e74394b5f8a23339d8ea22
- https://www.notion.so/More-On-Timelines-And-Limitations-Of-LLMs-and-Missing-Capabilities-c48fa565774d4634abfc2667799eea28
- Levels of AGI: Operationalizing Progress on the Path to AGI

Also review Untitled.

https://www.theintrinsicperspective.com/p/the-1-ai-cant-teach-basic-stuff-like: AIs are still bad at some fairly basic things (some of the examples may be down to tokenization, but still).

Distilling System 2 into System 1. Dean: "If we see GPT-5, and it has the rumored reasoning capabilities, does that mean that it can solve Olympiad math problems but still doesn't know how many r's are in 'strawberry', or does it extrapolate to situations where there is no ground truth? The reason they're able to get so good with math and code is because there's ground truth. If they go beyond that, we're in a different regime."

From https://thezvi.substack.com/p/ai-74-gpt-4o-mini-me-and-llama-3: Andrej Karpathy: LLM model size competition is intensifying… backwards! My bet is that we'll see models that "think" very well and reliably that are very very small. There is most likely a setting even of GPT-2 parameters for which most people will consider GPT-2 "smart". The reason current models are so large is because we're still being very wasteful during training - we're asking them to memorize the internet and, remarkably, they do and can e.g. recite SHA hashes of common numbers, or recall really esoteric facts. (Actually LLMs are really good at memorization, qualitatively a lot better than humans, sometimes needing just a single update to remember a lot of detail for a long time). But imagine if you were going to be tested, closed book, on reciting arbitrary passages of the internet given the first few words. This is the standard (pre)training objective for models today. The reason doing better is hard is because demonstrations of thinking are "entangled" with knowledge, in the training data. Therefore, the models have to first get larger before they can get smaller, because we need their (automated) help to refactor and mold the training data into ideal, synthetic formats. It's a staircase of improvement - of one model helping to generate the training data for next, until we're left with "perfect training set". When you train GPT-2 on it, it will be a really strong / smart model by today's standards. Maybe the MMLU will be a bit lower because it won't remember all of its chemistry perfectly. Maybe it needs to look something up once in a while to make sure. Maybe. Somewhat.
I see a lot of post-hoc reasoning, or making a virtue of what happened to happen, going on in there. The story might also be a lot less complicated than that. The story could be mostly about cost and speed, and thus this is how we are choosing to spend our algorithmic bounty. Being smarter than the average bear or model is still highly useful, and I assume I will be switching to Opus 3.5 for personal (non-API) use the moment it is available unless GPT-5 (or Gemini-2 or something) comes out first and is even better. I am surprised we are not doing more to build multi-step queries and other trickery to get more out of the smaller stuff in combination with the big stuff and work around weaknesses. I suppose things aren't standing still long enough to allow it. The question increasingly becomes: where are the bigger, smarter models? Claude 3.5 Sonnet is impressive, but shouldn't we have a Claude 3.5 Opus or a GPT-4.5 or Gemini Advanced 1.5?

Ajeya Cotra: I think this is true, but what's even more important is when GPT-2-sized models are as smart as GPT-4 is today, GPT-4-sized models will be *much smarter.* I think discussion of the "miniaturization trend" doesn't emphasize that enough. I think there will still be reason to train and use ever bigger models, even when day-to-day work can be done by much smaller and cheaper models: the biggest models at any given time will be the best for some especially difficult tasks like R&D.

Gallabytes: this does feel like the thing to bet on and yet so far we're really not seeing it? I have the same intuition you do here but wonder how long to keep holding that intuition in the face of evidence to the contrary. wdyt?

The bigger runs are getting actually expensive. If you do a 'yolo run' of such a model, and fail, it hurts even if nothing dangerous happens, whereas with smaller attempts you can safely fail and iterate. Safely in the economic sense, and also in other senses.

Also from https://thezvi.substack.com/p/ai-74-gpt-4o-mini-me-and-llama-3:

James Campbell: it's just so weird how the people who should have the most credibility--sam, demis, dario, ilya, everyone behind LLMs, scaling laws, RLHF, etc--also have the most extreme views regarding the imminent eschaton, and that if you adopt their views on the imminent eschaton, most people in the field will think *you're* the crazy one. it's like, "i'm crazy? no you're crazy! ilya fucking sutskever, the guy behind alexnet and openai, created a company called safe superintelligence! sam altman is raising $7 trillion to build The Final Invention. but yeah, i'm sure they're all definitely 100% wrong without a second thought, just keep working on your langchain b2b saas app or graph neural network theory" i'm all for people forming their own idiosyncratic view of general intelligence and what it takes to get there. but the burden of proof is on you when most of the staff at the secretive top labs are seriously planning their lives around the existence of digital gods in 2027

Anton: My theory of why people inside the labs have very different timelines from people outside is because it's a lot easier to believe in continued model improvement when you see it happening in front of your eyes with every training run. Conversely, relative to the promise, outside the labs the immediate impact of ai has so far been fairly limited. Most people aren't using what exists today effectively and find it hard to conceptualize what they'd do with it if it got better. They think it's for writing essays.
I do think the people at the labs largely believe their hype. And yes, they have insider information. That can help you. It can also blind you, and put you in an echo chamber.

And again from https://thezvi.substack.com/p/ai-74-gpt-4o-mini-me-and-llama-3: the other angle i have been thinking a lot about is the separation of reasoning from knowledge. RAG/memory plugs knowledge easily but not reasoning. 82 MMLU is plenty. you can get it up to 90, but it's not going to be appreciably smarter in normal use without advancing other metrics. So in 2025 we're likely to evolve towards 0) context utilization (RULER), 1) instruction following (IFEval), 2) function calling (Gorilla), 3) multistep reasoning (MUSR), 4) coding ability (SciCode), 5) vision understanding (VibeEval?) for all the stuff that RAG can't do.

See the "Reality bites" section of https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter; also the discussion of Eyeballvul, and in particular the poor performance of current models.

See the "ROBUST ROBOTS THAT WORK EVERYTIME" section of https://rodneybrooks.com/rodney-brooks-three-laws-of-robotics

https://www.ben-evans.com/benedictevans/2024/7/9/the-ai-summer talks about slow adoption and the false promise of early technology; there are lots of great quotes, here are a couple: "LLMs look like they work, and they look generalised, and they look like a product - the science of them delivers a chatbot and a chatbot looks like a product. You type something in and you get magic back! But the magic might not be useful, in that form, and it might be wrong. It looks like product, but it isn't." "LLMs skipped that part, where you work out what this is and what it's for, and went straight to 'it's for everything!' before meeting an actual user."

https://nicholas.carlini.com/writing/2024/how-i-use-ai.html talks about the current (August 2024) utility of LLMs. Some snippets: "I'm an emacs person. And I have my environment set up so that, whenever I run a program and it exits with a non-zero status code (meaning something went wrong), it automatically invokes whatever the latest fastest LLM is and asks it to explain the answer, and in parallel asks for a patch that can be directly applied to fix the bug in the code." "language models are not yet good enough that they can solve the interesting parts of my job as a programmer. And current models can only solve the easy tasks." "But five years ago, the best an LLM could do was write a plausibly-English sounding paragraph. And we were amazed when they could form coherent ideas from one sentence to the next. Their practical utility was exactly zero. Today, though, they've improved my productivity at the programming aspects of my job by at least 50% on the average project, and have removed enough of the drudgery that I built several things I would never have attempted otherwise."

How do people come up with research ideas in AI? Will the "AI Scientist" finally make me work full-time on my chicken farm?
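A rough sketch of the Carlini-style setup quoted above (his is wired into emacs; this is a generic command wrapper), with explain_with_llm() as a hypothetical placeholder rather than a real API:

```python
import subprocess
import sys

def explain_with_llm(command: list[str], stderr: str) -> str:
    # Placeholder: a real version would send the failing command and its stderr
    # to whatever the latest fast LLM is and return its explanation / patch.
    return f"(LLM explanation of why `{' '.join(command)}` failed would go here)"

def run_and_explain(command: list[str]) -> int:
    # Run the command; on a non-zero exit code, ask the LLM to explain the failure.
    result = subprocess.run(command, capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    if result.returncode != 0:
        sys.stderr.write(result.stderr)
        print(explain_with_llm(command, result.stderr))
    return result.returncode

if __name__ == "__main__":
    run_and_explain([sys.executable, "-c", "raise ValueError('boom')"])
```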
https://x.com/IntuitMachine/status/1843238594669932686?t=VsIgmshZYkyf9O-pYedugg&s=03 touches on synthetic data.

RSI may not be fast (see the discussion of AlphaChip at https://importai.substack.com/p/import-ai-386-googles-chip-designing).

A comment I wrote on how talking about the "reliability" of an LLM for multi-step reasoning or agentic operations is oversimplified: https://www.interconnects.ai/p/how-scaling-changes-model-behavior/comment/72027805

Teortaxes▶️ (@teortaxesTex) posted at 1:46 PM on Sun, Oct 20, 2024: btw I've started having doubts about near-term transformative AGI. Timelines up to 2035 or 2040 seem plausible (my mainline scenario is ≤2028 still). We have scant theory of human scientific genius and near zero reliable data on its inner operation. «A genius is just a https://t.co/aHFcjlVguy (https://x.com/teortaxesTex/status/1848103571310747978?t=UZY6DVSv-yZMKSu6BMV1zQ&s=03)

[Child Page: Agent Timelines]

CAIS: Izzy Barrass asked about this on 6/5/24: Does anyone have any good papers/citations/data points on why we're at a tipping point for AI agents? Need to find data points for a NeurIPS workshop proposal. → https://on.ft.com/3yWP1kV → https://www.insightpartners.com/ideas/ai-agents-disrupting-automation/ [Steven Basart] @Zifan Wang is working on AI Agents so he's a good (primary) source. Steven Basart also pointed to https://www.swebench.com/

Dan Hendrycks, 6/2/24 on #random: Some timelines analysis: Around ~15 trillion tokens are available through scraping and OCRing PDFs. This is enough for GPT-4.5 (10x compute GPT-4). It seems plausible GPT-4.5 will unlock adequate agentic capabilities, but other capabilities are unclear. There will be multiple GPT-4.5-level compute models at the end of the year. GPT-5 (100x compute GPT-4) would need more tokens. The main path is synthetic data, which may end up just being multiple translations of existing tokens. It is not clear how much models will learn from these tokens. A GPT-5 could definitely be trained just by continuing to train the GPT-4.5 for another ~9 months, so 100x compute of GPT-4 is feasible for fall of 2025. If synthetic data isn't enough, then a better pretraining algorithm could make all the difference. Consequently, the ML community needs to find a better pretraining algorithm fairly quickly for scaling to continue if synthetic data does not help that much.

[Child Page: More Material for Timelines]

Ethan Mollick (@emollick) posted at 10:16 PM on Wed, Dec 04, 2024: It is worth noting that this is an increasingly common message from insiders at the big AI labs. It isn't unanimous, and you absolutely don't have to believe them, but I hear the same confidence privately as they are broadcasting publicly. (Still not sure how you plan for this) (https://x.com/emollick/status/1864509024214905060?t=ZVSDTAuR3RAtnjMFEDvyhQ&s=03)

From a recent Dwarkesh podcast, I think https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken: Most intelligence is pattern matching, and you can cover a lot of ground with associative hierarchical retrieval. (Intelligence is compression.) Maybe the tricky bit is in learning which associations to follow. Tie into the common refrain that for long-task agents, we "just" need better reliability; where does that come from, and why should it be just one thing? The key bottleneck in advancing AI is not coming up with ideas, nor implementing them. It's intuitively selecting which ideas to try, and then evaluating imperfect data to judge what did or did not work and why. Bound by compute to run experiments, and taste to select them.
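Back-of-envelope scaling arithmetic for the Hendrycks note above, using the common C ≈ 6·N·D approximation and compute-optimal (Chinchilla-style) allocation, under which parameters and tokens each grow roughly as the square root of the compute multiple. This is a sketch of the shape of the argument, not anyone's actual training plan:

```python
# Compute ~= 6 * N (params) * D (tokens); under compute-optimal scaling,
# N and D each grow roughly as sqrt(compute multiple). This is why a 100x
# compute jump wants roughly 10x the tokens, pushing toward synthetic data.
for multiple in (10, 100):
    scale = multiple ** 0.5
    print(f"{multiple:>3}x compute -> roughly {scale:.1f}x parameters "
          f"and {scale:.1f}x training tokens")
```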
As architectures improve, the model becomes a better and better map of the training data, so it's all about the data, or we'll hit a ceiling based on the data, or we need to make our own data / transcend the data. This is how Olympiad geometry problems were solved. Larger models seem to be more sample efficient. Superposition suggests that current models may be dramatically under-parameterized for the amount of data we're training them on.

From https://dblalock.substack.com/p/2024-4-7-arxiv-roundup-dbrx-backlog – file for a later post: People sometimes have this impression that big LLMs are like this space race where we stand around at whiteboards having breakthroughs and the company with the most talent or insights gets the best model. The reality is much less sexy than that. We're basically all getting the same power law scaling behavior, and it's just a matter of:
- Parameter count
- Training data quantity / epoch count
- Training data quality
- Whether it's an MoE model or not
- Hyperparameter tuning

The real innovation is in dataset construction and the systems you build to scale efficiently. It also takes immense skill to debug the training, with all sorts of subtleties arising from all over the stack. E.g., a lot of weird errors turn out to be symptoms of expert load imbalance and the resulting variations in memory consumption across devices and time steps. Another fun lesson was that, if your job restarts at exactly the same rate across too many nodes, you can DDOS your object store—and not only that, but do so in such a way that the vendor's client library silently eats the error. I could list war stories like this for an hour (and I have).

Another non-obvious point about building huge LLMs is that there are a few open design/hparam choices we all pay attention to when another LLM shop releases details about their model. This is because we all assume they ablated the choice, so more labs making a given decision is evidence that it's a good idea. E.g., seeing that MegaScale used parallel attention increased our suspicion that we could get away with it too (although so far we've always found it's too large a quality hit).

It's also an interesting datapoint around AI progress. DBRX would have been the world's best LLM 15 months ago—and it could be much better if we had just chosen to throw more money at it. Since OpenAI had GPT-3 before we even existed as a company, this indicates that the gap between the leaders and others is narrowing (as measured in time, not necessarily model capabilities). My read is that everyone is hitting the same power law scaling, and there was just a big ramp-up time for people to build good training and serving infra initially. That's not to say that no meaningful innovations will happen—just that they're extremely rare (e.g., MoE) or incremental (e.g., the never-ending grind of data quality).

From https://thezvi.substack.com/p/ai-58-stargate-agi, a recent example of eking out extra performance: New paper suggests using evolutionary methods to combine different LLMs into a mixture of experts. As Jack Clark notes, there is likely a large capabilities overhang available in techniques like this. It is obviously a good idea if you want to scale up effectiveness in exchange for higher inference costs. It will obviously work once we figure out how to do it well, allowing you to improve performance in areas of interest while minimizing degradation elsewhere, and getting 'best of both worlds' performance on a large scale.
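A concrete form of the "same power law scaling behavior" the dblalock quote above refers to is the Chinchilla-style parametric loss fit; the constants below are roughly the published Hoffmann et al. values, included only to make the shape concrete.

```python
def chinchilla_loss(params: float, tokens: float) -> float:
    # L(N, D) = E + A / N^alpha + B / D^beta
    # Approximate published fit from Hoffmann et al. 2022 (Chinchilla).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / params**alpha + B / tokens**beta

# Same law, different points on the curve:
print(chinchilla_loss(7e9, 140e9))     # a smaller run
print(chinchilla_loss(70e9, 1.4e12))   # ~Chinchilla scale (70B params, 1.4T tokens): lower loss
```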
[Child Page: More On Timelines And Limitations Of LLMs and Missing Capabilities]

In case I ever want to revisit this topic, e.g. to respond to pushback…

Inside Views, Impostor Syndrome, and the Great LARP – dives a bit deeper into "insight" (though doesn't use that term, and doesn't discuss it as an AI capability).

https://sideways-view.com/2018/02/24/takeoff-speeds/ — good 2018 analysis of takeoff speeds.

The Generative AI Paradox. As I said to Sam: This was interesting (note, should link to the middle section of a longer post). A study claiming to demonstrate that LLMs are better at generating text than understanding text. I always question the details on these things, but the high-level concept is interesting, plausible, and may provide another perspective on the challenge of long-form thinking: if indeed an LLM has trouble understanding what it has done, then it will also struggle to make good choices about how to proceed further.

Faith and Fate: Limits of Transformers on Compositionality provides evidence that merely scaling transformers won't get us to AGI. I've only skimmed the first couple of pages. From the paper: We propose two hypotheses. First, Transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized path matching. This contrasts with the systematic multi-step reasoning approach that learns to apply underlying computational rules required for building correct answers [54, 32, 24]. Shortcut learning [26] via pattern-matching may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples. Second, due to error propagation, Transformers may have inherent limitations on solving high-complexity compositional tasks that exhibit novel patterns. Errors in the early stages of the computational process can lead to substantial compounding errors in subsequent steps, preventing models from finding correct solutions.

Empirical results show that training on task-specific data leads to near-perfect performance on in-domain instances and under low compositional complexity, but fails drastically on instances outside of this region. This substantial gap suggests that systematic problem-solving capabilities do not emerge from maximum likelihood training [5] on input-output sequences, even when prompted or trained with human-like reasoning steps (i.e., a linearization of computation graphs; §3.1). Our careful study based on the computation graph and analyses demonstrates that Transformers can often solve multi-step compositional problems by collapsing the depth of the compositional operations via analogical pattern matching. More broadly, our findings suggest that the strong performance of Transformers should be taken with a certain grain of salt: Despite initially appearing challenging, certain tasks may not possess the inherent compositionality they seem to have. This is due to the fact that desired solutions could be readily derived from input-output sequences present in the training data, allowing for shortcut pattern matching to produce acceptable solutions.
[Child Page: How Long Do We Have To Take Action / Timeline Until No Return] The date of AI Takeover is not the day the AI takes over – the critical date is not foom, it is the point where we no longer have enough time + influence to change course