Work from AI Watersheds Phase 2: Indicators

[Child Page: Steve’s Notes from the Phase 2 Brainstorming Doc]

Brainstorming doc

Why Do Measurements?
Thoughts on the purpose of forecasting AI’s trajectory:
- General trend visibility for several years (important for lots of policy + other questions)
- At least several years’ notice of transformational change
- Estimate likely steepness & extent of transformational change
- Ask panel for feedback on this
- Josh: interested in the connection between current policy questions and our questions here. Would be great to get more specific / explicit about the connections with policy questions.

General thoughts on what to measure
- Practical utility is a function of both AI model capability and complementary app innovation (coding models vs. Cursor).

Approaches to measurement
- Josh (+’d by Helen, Nikola, Sayash), noting a cross-cutting theme: more high-quality surveys of companies would generally be valuable. A lot of ideas involve sensitive data, so they would require controlled access (and/or governmental involvement?).
- Using AIs to qualitatively grade lots of screen recordings and other messy high-volume data (a toy sketch appears after this list).
- Find ways of calibrating / correlating benchmark scores with measurements we care about more directly (METR time horizons being an example in this direction).
- Generally, use lots of different measurements and try to map them to common units for cross-checking.
- Surveys / prediction markets; focus real-world measurements (e.g. uplift studies) on areas of disagreement. Or get both sides of a dispute to agree on a cheaper proxy measurement.
- Case studies: how firms in various sectors are using AI; scale of tasks assigned to AI / nature + granularity of interaction with humans; degree of trust / freedom of action given to AI.
- Deep study of revenue / spending / profit across the AI value chain (and by domain). Also: number of “complementary innovation” (Sayash’s term) AI-powered apps, adoption rate, # of customers, surveys of customers to see how they are improving their workflows, etc.
  - Helen: something about revenue from AI more broadly vs. revenue of frontier model providers. What can we say about open models eating into OpenAI / Google / Anthropic revenue?
- Broad metrics (a la Epoch), e.g. electricity used by AI, global private investment.
- Keep pushing benchmarks toward greater realism of both input (task specification + context) and output (not just “passes tests”, but “PR is mergeable” / other measures of quality).
  - Nikola: need more realistic benchmarks. Source PRs (ideally from inside AI labs) and test AIs on them.
- Analysis of usage at the model level (Anthropic Economic Index).
- Logs of agent usage, i.e. usage of Model Context Protocol servers, pulls [Sayash].
- Analysis of AI logs from AIs being asked to solve various tasks (we’re doing this with 10B tokens of HAL agent log data, but could imagine increasing it many fold).
- Social media is subject to transparency measures; AI companies currently are not. It’d be great to have controlled researcher access to e.g. ChatGPT transcripts.
- Lots of general data sets for public analysis.
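A minimal sketch of the LLM-grading idea (mine, just to make the approach concrete): the rubric, the scoring dimensions, and the model name are all placeholders, and it assumes transcripts have already been extracted from the recordings.

```python
# Sketch: use an LLM to grade messy, high-volume qualitative data against a rubric.
# Assumes OPENAI_API_KEY is set; the rubric and model name are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate this work-session transcript 1-5 on each dimension:
- autonomy: how independently did the AI operate?
- task_size: how large/complex was the delegated task?
- success: did the human accept the AI's output?
Respond as JSON, e.g. {"autonomy": 3, "task_size": 2, "success": 4}."""

def grade_transcript(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript[:100_000]},  # crude truncation
        ],
    )
    return resp.choices[0].message.content

# Run over thousands of transcripts, parse the JSON, then track score
# distributions over time.
```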
- AIs really like to email Eliezer; that’s an indicator of various types of strange AI behavior. Ajeya: there was an NYTimes article about a journalist getting similar emails. Could there be a “weird AI shit” incident tracker that would categorize reports + report trends?
  - Josh: MIT AI incident repository? Populated from public reports – links to existing reports. Relevant here: the AI Risk Repository (https://airisk.mit.edu/) and the AI Incident Database (https://incidentdatabase.ai/).
  - Ajeya: current databases are too hard to make sense of, and they don’t track veracity. Needs a larger team sifting the signal from the noise, doing better synthesis, + writing blog posts.
- Make an open-source AI company where the company itself is open source. Have it try to use AI as much as possible, make all its logs about everything (Slack etc.) public, and study that.
- Sayash: there’s been a lot of work on the science of AI evaluations. We don’t know the shape of the DAG – how AI inputs translate into outputs. We could do more to distinguish between different DAGs, which is a core source of disagreement; for instance, Sayash has a very different DAG in mind. Inputs might include things like data and compute; output is measured in the real world. Econometrics.
- Josh: meta measurements, e.g. a big increase in expected OpenAI revenue, according to a prediction market, based on a reported algorithmic breakthrough.
  - Pushback from Ryan + Ajeya: limitations of prediction markets mean it’ll be tough for them to pick this up reliably.
  - Surveys at frontier AI companies
  - Surveys of the broader workforce
  - Data center construction spend → currently dependent on media reporting; is there some way to get better / more consistent data on this?
  - Trying to unpack inputs/outputs, per Sayash’s point about DAGs – understanding how improvements in one element flow through to improvements in other elements.
  - Lots of (different types of) surveys
- Helen: important to track military applications (broadly – decision support, targeting, etc.). Important to track freedom of action here.

Measuring Utility
- Metrics about fluid intelligence, messy tasks
- Benchmarks for domains vs. skills
  - Domains: cyber / law / professional labor markets
  - Skills: context awareness, reliability, long-context reasoning, sample efficiency
- Need deep info on each sector before assessing results. E.g. persuasion surveys overestimate results because opinion change dissipates six weeks later. I like Steve’s note that AI can continually be used cheaply.
- Uplift RCTs in real-world settings, measuring real-world outcomes (ideally: profit!) – e.g. getting an actual large company to do a randomized staged rollout of enterprise LLMs would be great.
  - Ajeya: three-armed uplift trials – AI / human / cyborg.
- Josh, on persuasion: cost for AI to swing a vote, relative to other leading interventions (Josh: lots of poli-sci studies here)
- AI-bio: % of amateurs that can synthesize flu
- Look for leapfrog companies that get on board with a new technology ahead of the incumbents; that could highlight utility ahead of adoption. E.g. fintech cos using AI in ways that let them overtake traditional banks, or healthcare startups using AI in ways that let them bypass doctors’ offices.
- Power user case studies – screen recordings + deep interviews of people who (think they) are getting enormous uplift.
  - Organization version of this: find those YC startups that supposedly do 95% of everything with AI, and study them (“power organizational users”).
- Field trials (a la AI Village) and realistic benchmarks
- Do a time horizon analysis (a la METR), but on real-world usage? (A toy version of the fit is sketched below.) Also, qualitative analysis of the size + difficulty + nature of tasks successfully delegated to AI (and dig into ranges of “successfully”).
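A toy version of the time-horizon fit, on synthetic data: fit a logistic curve of success rate against log task duration, then invert it at a chosen reliability level. Real usage data would replace the made-up numbers; this is a sketch, not METR’s actual methodology.

```python
# Sketch: estimate a "time horizon" from (task duration, success rate) data,
# a la METR's Measuring AI Ability to Complete Long Tasks. Data is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_minutes, a, b):
    return 1.0 / (1.0 + np.exp(-(a - b * log_minutes)))

# Hypothetical usage records: task length in minutes, success rate per bucket.
minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
success_rate = np.array([0.98, 0.95, 0.9, 0.85, 0.7, 0.55, 0.4, 0.25, 0.12, 0.05])

(a, b), _ = curve_fit(logistic, np.log(minutes), success_rate, p0=[1.0, 1.0])

def horizon_at(reliability):
    # Invert the logistic: the duration at which P(success) = reliability.
    return np.exp((a - np.log(reliability / (1 - reliability))) / b)

print(f"50% horizon: {horizon_at(0.5):.0f} min, 80% horizon: {horizon_at(0.8):.0f} min")
```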
- Building on Josh’s work test idea: could a METR-like org collect work test setups from a range of places (under NDA, so as not to break the work tests) and centralize the work of eliciting good performance, figuring out measurement, reporting results, etc.?
- Ryan: I think uplift will be messy (because humans are messy, and because of the possibility of phase changes in how AI should be used productively), so I’m more into end-to-end automation of actual tasks. Also, we ultimately care most about full automation regimes and predicting when this will happen. Uplift studies correspondingly seem somewhat worse than looking at full automation of tasks of varying difficulty/size (and benchmark transfer to actual things people are doing in their jobs, so we don’t need to constantly run these tests).
- Maybe you can get AI companies to run semi-informal uplift experiments internally and publish results: many companies have effectively committed to doing this eventually due to their safety policies. I think Anthropic might be open to this, but there are various sources of trickiness here.
- Early reports of AI doing remarkable things (e.g. major scientific discovery / insight; solving a Millennium Prize Problem in mathematics)
- pass@any for existing benchmarks can predict pass@1 in the future (see the pass@k sketch after this list).
- Helen: where is the ceiling on various capabilities (headroom)? This seems under-investigated.
  - Sayash: are most tasks like chess, or like writing? (Ryan: unclear whether writing is the right contrast here.) Are there less controversial examples than writing? Taking out the trash: saturates.
  - Helen: the writing example highlights that we’re way under-theorized on this. Quality of parenting is another example. “Persuasion” is poorly specified.
  - Abi: a scientific breakthrough requires an insight that goes against dogma. Could be considered an aspect of very high-quality writing.
  - Ryan: I care most about questions like “will energy production 100x within 5 years of full AI R&D automation, if people want this to happen?”. This seems to have many possible routes in terms of capabilities headroom.
  - Sayash: three broad categories: computational limits (chess), intrinsic limits / saturation (writing?), and knowledge limits (need new breakthroughs to exceed past performance). Better AI can help with computational limits, but not intrinsic limits.
- Work tests for orgs in the space: the GiveWell lit review test, Open Philanthropy work tests, high-quality summaries of a conversation (AI is still failing our first-round work test at FRI on this).
- Abi: “I liked how [this] list broke down sectors as each having their separate indicators. Very societal frictions approach.”
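For the pass@any → pass@1 idea: the standard unbiased pass@k estimator (from OpenAI’s HumanEval paper) turns n sampled attempts with c successes into pass@k for any k ≤ n, so today’s high-k curves can later be compared against a newer model’s pass@1. A minimal sketch:

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# attempts succeeds, given n attempts with c observed successes.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 200 attempts per task, 3 successes observed.
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 3, k):.3f}")
```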
Measuring Adoption / Impact
- YouTube resulted in some homeowners not calling plumbers for small jobs. What are the AI equivalents, and how should they be measured?
- Compare metrics across AI-feasible vs. AI-infeasible sectors: employment (incl. hiring patterns + plans), productivity, profit, qualitative measures of AI usage, …
- Revenue of AI service providers
- Studies of AI impact across various fields:
  - What percent of sales of the U.S. pharmaceutical industry is generated by AI-discovered drugs and products derived from these?
  - What percent of publications in the fields of Chemistry, Physics, Materials Science, and Medicine will be “AI-engaged” (as measured in this study or a replication study) in 2030?
  - How many hours per week on average will K-12 students in G7 countries use AI-powered tutoring or teaching tools, as reported by their school systems or education ministries?
- A September 2024 study by the Federal Reserve Bank of St. Louis, based on the Real-Time Population Survey (N=3,216), estimated that 0.5%–3.5% of all U.S. work hours were assisted by generative AI.

Measuring Diffusion
- Ajeya: I want to know about adoption within AI companies – e.g. surveys about how much compute they’re running for inference, freedom-of-action surveys like the ones suggested below, internal estimates of how much AIs speed them up, etc. (Also, I don’t think I literally care only about AI companies; I’ll probably also care about USG.)

Measuring Freedom of Action
- Surveys of workers: how often do you have to approve an AI decision, and how much thought do you put into each approval (e.g. level of review / pass rate of AI pull requests)?
- Monitoring progress on agent infrastructure → if AI agents are operating online, are there meaningful constraints on them? Does e.g. Cloudflare have any meaningful controls, or is it all a chaotic mess?
- Something military or military-adjacent: reports of smaller militaries using autonomous / semi-autonomous systems..?
- We can also make a big list of discrete flags and ask people about them, e.g. “Is your AI allowed to spend money on business expenses? Up to how much?” or “Is your AI allowed to search the internet freely in the course of completing a task?” or “Are AIs at your company allowed to talk to employees other than the human that started the task? What about external people?”
- White-hat reports of prompt injections paid out by bug bounties at the top (100) websites across domains over time (as a proxy for how many applications allow AI systems to take actions that can exfiltrate user data)
- Black-hat exploits of AIs (only possible if AIs are in a position to take important actions without adequate review?)

Curve-Bending Mechanisms
Note that the impact of AI can be difficult to predict. E.g. the predicted flood of AI misinformation in the 2024 election didn’t pan out; conversely, sycophancy’s mental-health impact was an event that wasn’t widely predicted.
- Intelligence explosion (software and/or hardware)
  - Measure both the initial speedup, and whether r > 1 (the condition for runaway acceleration).
  - Surveys of AI companies on internal AI use (bunch of ideas for granular questions, e.g. subjective sense of speedup, what tasks AIs are used for now, what tasks they aren’t used for, hypothetical questions)
  - Internal measures of the absolute rate of algorithmic progress at AI companies (e.g. how much compute does it take to get GPT-4-level perf); watch for that trend accelerating (a toy version of this trend check is sketched below).
  - Ryan: get people who have done the takeoff modeling to look at the data on algo progress (both historical data from humans and data under an automation regime). Unclear whether this data will resolve disagreements, but it seems like it can help. (Tom, Daniel, maybe some people from Epoch.)
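A toy version of the trend check above, with made-up numbers: fit an exponential to “compute needed to reach a fixed performance level” over time, and compare the halving time in an early window vs. a recent window.

```python
# Sketch: is algorithmic progress accelerating? Fit "compute to reach a fixed
# performance level" vs. time on a log scale; a shorter halving time in the
# recent window than in the early window suggests acceleration. Synthetic data.
import numpy as np

years = np.array([2020, 2021, 2022, 2023, 2024, 2025], dtype=float)
# Hypothetical training FLOP needed to match a fixed (GPT-4-level) benchmark score:
flop = np.array([3e25, 1.2e25, 4e24, 1.2e24, 2.5e23, 4e22])

def halving_time(t, c):
    # Fit log2(compute) = a + b*t; the halving time is -1/b years.
    b, a = np.polyfit(t, np.log2(c), 1)
    return -1.0 / b

early, late = halving_time(years[:4], flop[:4]), halving_time(years[2:], flop[2:])
print(f"halving time, early window: {early:.2f} yr; late window: {late:.2f} yr")
print("accelerating" if late < early else "not accelerating")
```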
- Algorithmic breakthrough / new approach (e.g. neuralese, in-context learning, long-term memory, recurrence, brain emulation) – either known, or unknown unknown
  - Study new model releases to look for signatures of known ideas, or look for an increase in papers on some approach.
  - Discontinuous jumps across a range of difficult benchmarks
  - Could ask companies key questions, e.g. “are you still using English CoT?”, and make them report.
  - Survey expectations at frontier labs (also covers intelligence explosion)
  - Analysis attributing performance gains to different sources (e.g. model size scale-up, data quality improvement, RL, etc.)
  - Ryan: really big breakthroughs are like AlexNet, GPT-1 / scale-up, scaling RL (o1, o3). It’s easy to notice that something is turning into a big deal, hard to tell where on the Richter scale it will land. Could track the current 5 most promising potential trend-breakers.
  - Ajeya: track the breakdown of capability improvements over time: how much is from scaling pretraining, scaling RL post-training, etc.
- Threshold effects / phase changes (e.g. the moment when people are no longer needed; maybe time horizon scaling suddenly becomes much easier)
  - Phase changes in behavior of AI Village, or other observations of AI capability / usage in real or realistic contexts
  - Phase change in qualitative analysis of uplift trials
- Some resource (data, compute, electricity, intelligence headroom) is exhausted
  - Note: this may lead to workarounds rather than an end to progress. E.g. pretraining data gets harder to scale → scale RL instead. Hard to predict the effect.
  - Low-hanging fruit is exhausted; training for long-horizon tasks is expensive.
- Some important missing capability doesn’t yield to scaling + progress (e.g. sample-efficient learning, “judgement”, long-time-horizon skills)
- Economics of AI not working out → slowdown in investment
- AI winter – we run out of ideas for improving AI
  - Josh: FRI expert panel. Could ask them “what’s the probability of an AI winter in the next 5 years?”
- External event (market downturn; war in Taiwan; public backlash / disaster → regulation; focused national effort / Manhattan Project)
  - Track public sentiment and other political indicators.
- Applications of models are a lagging indicator relative to model capabilities. The first sign of a slowdown would be the time horizon graph slowing down.

Other
- Nikola: if timelines are long enough to enable human emulations (Ems), that would make a difference, because we’d have been able to explore more implications. More generally: does other transformative tech come first? Possible techs: Ems, nanotech (unlikely), genetic engineering (seems like this could radically transform society in ~30 years if heavily invested in now, but in practice might not happen because of e.g. legal/regulatory blockers).

[Child Page: Notes for Early Indicators]

- David Langer at Lionheart suggested that they might be in a position to help us measure some things.
- https://x.com/StringChaos/status/1928476388274716707
- https://x.com/Simeon_Cps/status/1900218546904293863
- Relevant indicators hinted at here: https://www.theinformation.com/articles/openai-says-business-will-burn-115-billion-2029
- Example of finding useful data sets: https://mikelovesrobots.substack.com/p/wheres-the-shovelware-why-ai-coding

Plan
For this sort of project, serious work requires multiple rounds of deep thought. This can’t be done in a compressed period of time, so doing it collaboratively requires an extended multi-turn conversation. I have not had success in getting most people to engage with this in a group email or Slack. The only way I’ve found success is doing 1:1 engagement with each participant, over email and occasional calls, with a lot of nagging. This is annoying but does seem to work.
I’ve witnessed one counterexample (Helen’s CSET workshop) and heard about at least one other (something Sayash worked on); these were both collaborative projects centered around a multi-hour (or multi-day) synchronous group discussion. The CSET workshop did generate some new ideas, but there was also a lot of aggregation of people’s existing thoughts. Maybe we should just embrace this: perhaps our niche is projects that require lots of multi-turn 1:1 interaction.

Phase 2 of this specific project is more about aggregating ideas than deep analysis, so perhaps getting a group of people together on a Zoom to brainstorm could work. Brainstorming and ideation were, I think, the focus of the two success stories I referred to. We could try to get a bunch of people on a call without being too fussy about exactly which people (though it would be really good to have someone from METR, someone from Epoch, and perhaps also the Forecasting Research Institute). This could be a meeting to put a cap on phase 1 (without necessarily completely locking it down; I’ll continue to get input from a few people I’m only connecting with now) and to do the initial brainstorming for phase 2. I can seed the conversation with a few ideas shared in advance. Note that our measurable indicators don’t need to directly answer a crux; we’re just looking for data that will be helpful in future discussions of those cruxes.

→ Share thoughts with Taren, Ajeya, Helen, ? Ask Helen how far she’s gotten in turning the discussion into a paper, and whether she wishes she could do more rounds with at least some folks.

Ajeya: might be productive to do this in two rounds. Do the fuzzy thing, then get someone who’s interested in doing the legwork to do the first round of coming up with concrete experiments – Ajeya might help with this – and take that back to the participants to rate them.
- The group lists the fuzzy questions.
- An individual – Ajeya or Ryan Greenblatt – proposes a list of concrete experiments.
- Go back to the group to rate the proposals.
Realtime, concentrated bursts of time are so much more productive that it’s better to ask people to come to an event that sounds cool & fun and spend the first hour reading the doc. Or you can get people to pre-read more reliably if you get them to agree to come to an event – that’s what she did for the loss-of-control workshop. And keep the writeup short.

Siméon (@Simeon_Cps) posted at 5:31 PM on Fri, Aug 15, 2025: “I've entertained that theory for a few years and have been confused since then why people expected confidently so much GDP growth. Basically prices of goods should crash so fast that the question of ‘how do you count inflation’ will become the first order parameter of whether and” (https://x.com/Simeon_Cps/status/1956514153528742168?t=qJkEFlWoQAD6z6g4HMhNAw&s=03)

Notes with Taren, 8/8/25
- Post four brainstorming time slots and let people sign up; probably don’t want >4 people per session (though we could do breakout rooms).
- Taren, and probably many people, will do better in a brainstorming session with people with different expertise.
- Could let each group decide which two cruxes they’ll talk about; if we wind up with a gap, do something about it at the end. Or maybe assign cruxes to time slots.
- Doing an in-person session could be fun; might be more trouble than it’s worth, but might not be trouble at Constellation – ask them to help organize & recruit people, e.g. two 4-person groups. Try to do one in the South Bay?
- Could include Sam.
- Ask Ajeya how she’d like to participate.
- Talk to Helen Toner about how to make sure what I’m doing is complementary.

Other people / groups to include
- DC – government does a lot of data gathering – propose content for legislation – Abi could help? Taren could do something in DC.
- Oliver Stephenson (FAS)
- Elham, or the guy who worked for her?
- FAI
- IFP
- CHT (CBT?)
- FAS?
- Some Horizon fellows who are placed to draft a bill?
- Plan a third stage where we produce a piece of draft legislation?
  - Discuss with Abi
  - Discuss with Victoria next Friday
  - Taren to discuss with some people in DC
  - Should discuss with Helen Toner

Next steps
- First step should be some Zooms; invite everyone. We won’t exhaust potential participants from Constellation.
- Taren to talk to Abi about whether to make this the theme for the dinner they’re organizing; if not, Taren will convene some other small meeting while she’s in DC.
- Pre-brainstorm: me, Taren, maybe Ajeya, maybe Abi? Chris Painter? Ryan Greenblatt? Josh Rosenberg? Someone with economics expertise (a grad student from David Autor’s lab)?
- More focus on identifying kinds of measurements (quantitative, qualitative, horizontal, vertical, etc.) to seed later conversations (+ as pre-read).

Participants in Stage 2
- Jaime Sevilla (Epoch)

Content Notes
- See Untitled
- Review the Cruxes list in the early writeup.
- Incorporate this idea from my discussion with Nick Allardice: our economy and decision-making processes are so fragmented and messy that trying to answer my crux questions at a societal level is unhelpfully generalizing. It is more tractable & beneficial to pick some sectors, industries, and types of problems, and come up with ways of measuring change in those, as leading indicators for what might happen in other sectors and industries. E.g. software engineering may be one of the more tractable problem spaces for AI; he’d be very interested in tracking diffusion here: hiring practices, how much autonomy is being granted, what level of productivity it is unlocking. If CS is a fast example, find a few slow examples, and get super specific & granular about measuring diffusion. Get an idea of the uneven distribution.
- Carey: I don’t know where this fits, but I think the question of “what are the most common failure modes that prevent current models from excelling at practical tasks?” would be a relevant crux, or a root cause for your cruxes.
- Anna Makanju: it’s hard to break down someone’s usage of chatbots into productivity vs. other uses. In the last year it’s flipped from predominantly productivity usage to companionship. If you measure usage, you need to try to disentangle the nature of that usage – apply a classifier to chat histories, or focus on measuring enterprise accounts. Could work with universities and government agencies, who will have enterprise accounts and might be more willing to share data for a study. Also see notes from my 8/11 conversation with Anna.

Why Focus on Early Indicators?
If you want to make predictions about a future that is similar to the present, you might be able to simply extrapolate from past values of the variable you need to predict. For instance, Moore’s Law was an observation about trends in transistor counts, and for many decades it provided excellent forecasts of future transistor counts [FOOTNOTE: Though this may be a story about self-fulfilling prophecies as much as about the tendency of important variables to follow predictable trends.].
If you need to make predictions about a future that looks quite different from the present, you can’t get by with simple extrapolation. You need a model of how the future is going to unfold, and you need data to calibrate your model. For instance, if you want to predict the potential of fusion power, you can’t extrapolate the graph of historical electricity generation; for fusion, that graph is flatlined at 0. But if you understand the path that current fusion efforts are following, you can extrapolate metrics like “triple product” and “Q” [FOOTNOTE: I got help from Claude Opus 4 on this; the answer matches my vague recollection well enough that I’m not bothering to fact-check it: The most critical metrics for tracking progress toward practical fusion power are the fusion energy gain factor (Q), which measures the ratio of fusion power output to heating power input and must exceed 10-20 for commercial viability; the triple product (density × temperature × confinement time), which needs to reach approximately 10²¹ keV·s/m³ to achieve ignition conditions; and the reactor’s availability factor or duty cycle, measuring the percentage of time the reactor can operate continuously, as commercial plants will need to run reliably for months at a time rather than just achieving brief fusion pulses.] to get an idea of how close we are to a functioning generator.

Verify that an early application of the steam engine was to pump water out of coal mines. Make a reference to this being a sort of recursive self-improvement. Observe that if you had wished to measure the uptake of steam engines for pumping water out of coal mines, you could have looked at inputs to the process (such as the amount of coal being consumed by steam engines), outputs (such as the amount of water being pumped), or impact (such as the lowering of the water level within the mines).
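Restating the footnote’s thresholds in symbols (this is just the footnote’s content in equation form, not an independent fact-check):

```latex
Q = \frac{P_{\text{fusion}}}{P_{\text{heating}}} \gtrsim 10\text{--}20,
\qquad
n \, T \, \tau_E \gtrsim 10^{21}\ \mathrm{keV \cdot s / m^{3}}
```

where n is plasma density, T is temperature, and τ_E is the energy confinement time.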
- When will “superhuman coders” and “superhuman AI researchers”, as defined in AI 2027, emerge?
- How is the task horizon identified in Measuring AI Ability to Complete Long Tasks progressing? What are we learning about time horizons for higher reliability levels (I presume reliability much higher than 80% will be necessary)?
- How fundamental is the gap between performance on benchmarks and real-world tasks? Is it growing or shrinking? [QUESTION: Does this cover the “capability-reliability gap” described in AI as Normal Technology, or do we need to expand the description?] [This might belong better under the question regarding advances in domains other than coding and AI research.]
- Are any skills under coding or AI research emerging as long poles (more difficult to automate), and if so, are there feasible ways of compensating (e.g. by relying more on other skills)?
- What does the plot of AI research effort vs. superhuman performance look like? How does this vary according to the nature of the cognitive task – in particular, for tasks such as “research taste” that are critical to accelerating AI R&D?
- Basically redundant with previous cruxes, but perhaps worth listing as something that could be independently measured: across the broad range of squishy things that people do every day, how rapidly will the set of tasks for which AI provides significant uplift grow, and what are the contours of that set (what separates tasks experiencing uplift from tasks which are not)?
- Can AIs think their way around the compute bottleneck? If the R&D labor input races ahead of compute and data, how much progress in capabilities will that yield? To what extent does this depend on the {quantity, speed, quality/intelligence} of the AI workers? Does this apply to all compute-intensive aspects of AI R&D?
- Is Ege’s forecast for NVIDIA revenue bearing out? Is his model relating NVIDIA revenue to real-world impact of AI valid? Could rapid algorithmic improvements (driven by a software explosion) decouple impact from NVIDIA revenue? Could adoption lags result in revenue lagging capabilities?
- Possibly other questions as to whether various current trends continue – I’m not sure whether there are any other cruxes lurking here. For instance: is general progress in LLM capabilities speeding up or slowing down? Are breakthroughs emerging that accelerate the curve? Is RL for reasoning tasks hitting a ceiling? Are any of Thane’s bear-case predictions bearing out? Etc.
- As the automation frontier advances into increasingly long-horizon, messy, judgement-laden tasks, will the speed and cost advantages of AI (vs. humans) erode, to the point where AIs aren’t significantly faster or cheaper than humans for advanced tasks (and a few years of optimization doesn’t fix the problem)?

Resources from the CSET conference
- Helen’s working doc
- My slides
- Ryan’s slides and notes

- It’s important to detect when the most senior skills start to become automated. This could indicate a tipping point, both for progress at the big labs and for the possibility of a breakout by a rogue actor who doesn’t have access to senior talent. Perhaps we can look at the percentage of impactful ideas that come from unassisted AI, or at the ratio of major, paradigm-changing ideas to other inputs.
- Look for additional domains in which to measure sample-efficient learning. In other domains, look at the ratio of spending on real-world data collection versus in silico data generation.

ARC Prize (@arcprize) posted at 10:21 AM on Tue, Sep 09, 2025: “ARC Prize Foundation @ MIT. We're hosting an evening with top researchers to explore measuring sample efficient [learning] in humans and machines. Join us to hear from Francois Chollet along with a world class panel: Josh Tenenbaum, Samuel Gershman, Laura Schulz, Jacob Andreas https://t.co/dq7NJyXkNk” (https://x.com/arcprize/status/1965465501079142814?t=L7y6E9f9cwxgjwWD1FMZJA&s=03)

https://epochai.substack.com/p/after-the-chatgpt-moment-measuring

Check in with Divya / CIP to see whether their global pulse surveys have questions relevant to the cruxes. Perhaps we can draw from this / make suggestions for it. They’re running “global pulse” surveys to understand what people want from the future, but also to understand how much AI has diffused into people’s lives – every two months, started in March. Questions around trust, diffusion, how much are you relying on AI for medical or emotional advice, are you using it in the workplace, etc. https://globaldialogues.ai/cadence/march-2025. “In three years, what questions will we wish we had been tracking?” Maybe it could be interesting to co-author something at some point.
Cheryl Wu (@cherylwoooo) posted at 6:25 PM on Sun, Jun 01, 2025: “Are we at the cusp of recursive self-improvement to ASI? This tends to be the core force behind short timelines such as AI-2027. We set up an economic model of AI research to understand whether this story is plausible. (1/6)” (https://x.com/cherylwoooo/status/1929348520370417704?t=f9E9Yty2m27EQ-Z_NbbJ3Q&s=03)

From Does AI Progress Have a Speed Limit?, a measurement:
- Ajeya: I’m kind of interested in getting a sneak peek at the future by creating an agent that can do some task, but too slowly and expensively to be commercially viable. I’m curious if your view would change if a small engineering team could create an agent with the reliability needed for something like shopping or planning a wedding, but it’s not commercially viable because it’s expensive and takes too long on individual actions, needing to triple-check everything.
- Arvind: That would be super convincing. I don’t think cost barriers will remain significant for long.

Another:
- Ajeya: Here’s one proposal for a concrete measurement – we probably wouldn’t actually get this, but let’s say we magically had deep transparency into AI companies and how they’re using their systems internally. We’re observing their internal uplift RCTs on productivity improvements for research engineers, sales reps, everyone. We’re seeing logs and surveys about how AI systems are being used. And we start seeing AI systems rapidly being given deference in really broad domains, reaching team lead level, handling procurement decisions, moving around significant money. If we had that crystal ball into the AI companies and saw this level of adoption, would that change your view on how suddenly the impacts might hit the rest of the world?

Another:
- Ajeya: …do you have particular experiments that would be informative about whether transfer can go pretty far, or whether you can avoid extensive real-world learning?
- Arvind: The most convincing set of experiments would involve developing any real-world capability purely (or mostly) in a lab – whether self-driving, or wedding planning, or drafting an effective legal complaint by talking to the client.

From Deric Cheng (Convergence Analysis / Windfall Trust), on early indicators: he’s friends with the Metaculus folks, who are working on indicators for AI diffusion and disempowerment. Don’t share: the Metaculus Diffusion Index – they’ll publish in a few weeks.

https://metr.substack.com/p/2025-07-14-how-does-time-horizon-vary-across-domains

When monitoring progress in capabilities, we need to watch for the possibility that capabilities are advancing on some fronts while remaining stalled on some critical attribute, such as reliability, hallucinations, or adversarial robustness.

Nick Allardice has a prior that labor market disruption is not going to be meaningfully different from other times in history… but he’s highly uncertain. Evidence that might push him to believe in meaningfully different levels and pace of labor market disruption would [?]. The burden of proof is on this time being meaningfully different from the past. If AI gets good at something, we’ll focus on something else. He hasn’t seen enough evidence to shift his priors. Leading indicators currently reinforce his prior: unemployment is low. It’s harder to get a job as a junior developer, but not impossible, and that mostly seems to be due to other factors. Even if capabilities advance, diffusion challenges will leave room for human workers. Our institutions aren’t going to turn everything over to AI.
Offcuts
- Qualitative rubrics [MPH is a good indicator because every mile takes about the same number of hours; “cities passed per hour” would break down in the Great Plains; MPH breaks down when you enter a city center.]
- If we only measure high-level, downstream attributes such as practical utility, we won’t have any way of anticipating these twists and turns. As I noted in the introduction to the cruxes writeup, by the time Facebook started to show up as a major contributor to overall Internet usage, it was already well along its journey to global dominance.

[Child Page: Seeing Past AGI; AI 2027 vs. AINT]

Summary: Engage with Ryan Greenblatt and others on what I’m now calling “Seeing Past AGI”. Ryan keeps pointing out that there are important disagreements which won’t manifest until after AGI is reached, and which may be hard to shed light on until that point (which may be too late to be of much use). Working with Ryan and the other usual suspects, I’d be interested in digging into this: try to clearly characterize the dramatic things Ryan expects to see happen post-AGI, and identify precursors which Ryan would agree ought to be visible pre-AGI. This might turn out to fit neatly into the existing AI Watersheds framework, or might turn into a bit of a separate sub-project. (Ryan’s takes are collected below.)

I’d like to push on this and try to come up with ways to forecast past the AGI event horizon – I have a conviction that we can do better than just throwing up our collective hands. My plan for a next step is to engage with, perhaps, Ryan and Sayash to really dig into their models of what happens post-AGI.

[Abi: During our call, let’s chat about the different definitions of AGI that Sayash and Ryan have. I wonder whether getting very specific on this question will point to two very different visions here, which are leading to some of the divergence. + how we handle this in our approach.]

(Related: per https://blog.ai-futures.org/p/ai-as-profoundly-abnormal-technology, offer to mediate a conversation between Sayash + Arvind and the AI 2027 crew.)

If/when we pursue this:
- Think about Ryan’s comment about takeoff in the phase 2 doc. Come up with a plan for drilling in on takeoff models and early indicators.
- Work with Sayash and Ryan to drill in on the disagreements post-AGI, and then look for early indicators.

https://x.com/sayashk/status/1964016339690909847

[Ryan, under “measuring freedom of action”]: It seems really hard to use indicators to distinguish between my perspective and one that predicts way less freedom of action at the point of ~full automation of AI R&D. Maybe the crux is mostly capabilities, but then it comes back to crux 1.
Ryan: as we close in on the relevant milestones, we won’t resolve the first two bullets above [I think this meant the first two cruxes]. The difficult questions are not when X gets automated; it’s what happens afterwards. He can’t imagine any measurement that would disambiguate AINT for him. A lot of these metrics do shed light on whether / when we’ll get AGI, or automation of AI R&D. Delaying timelines by decades is a big deal, but doesn’t have a decisive effect on what I ultimately expect to happen – e.g., it’s like a factor of 3 on various things, not a factor of 10-100.

More random takes from Ryan:
- I can imagine updating towards much longer timelines (though this is limited by some exogenous rate of large breakthroughs).
- I can imagine updating away from a software-only singularity based on detailed empirical evidence about returns to compute vs. labor within AI companies (especially if this evidence is coming in as we’re automating). Though idk how big this update would be.
- I can imagine updating towards moving through the human range slower than I currently expect, via a variety of mechanisms.
- I have a hard time imagining updating toward anything like “full cheap automation of cognitive labor would increase GDP growth by <10%”.
- I have a hard time imagining updating against large impacts of ASI in advance. After this happens, I’d update, but that is too late.

AI Scenarios Network – AI Watersheds Brainstorm
- Prioritize putting together a list of measurements, so I can ask for additional suggestions + then do a round of voting.
- Turn phase 2 notes into a rough draft.
- Early writeup
- Group brainstorm
- Steve’s Notes from the Phase 2 Brainstorming Doc
- Slide deck

Intro Material for Phase 2
- The importance of measurements that can be collected over a long period of time (won’t saturate), and ideally can be measured historically as well.
- [More ideas from the panel will belong here… using fine-grained data sets, collecting data from inside labs, using LLMs to analyze qualitative data, setting up controlled access to sensitive data sets, etc. Maybe these detailed “how to measure things” ideas would belong in a separate section later on, and up here we’d just talk about basic philosophical ideas like “focus on real-world impact”.]
https://ai-frontiers.org/articles/the-hidden-ai-frontier talks about how important developments may be hidden inside the frontier labs (and the dangers that poses).

General Material for Phase 2
- File for AI Watersheds, e.g. “Concretely, one reviewer proposed tracking deployments of AI agents that (i) are general-purpose systems, (ii) operate with minimal supervision, and (iii) handle tasks with a high cost of errors.”
- gavin leech (@g_leech_) posted at 4:00 AM on Thu, Nov 13, 2025: “Glad somebody did this (expert interviews on why LLMs are not currently AGI, and why they could be) feat: @random_walker, @DKokotajlo, @ben_j_todd, @daniel_d_kang, @rohinmshah https://t.co/MuHjAoXvAw” (https://x.com/g_leech_/status/1988939922842218558?t=OSiH9KzQDuAzCWo776O6Vg&s=03)
- @AI Security Institute: “📈 Today, we’re releasing our first Frontier AI Trends Report: evaluation results on 30+ frontier models from the past two years, showing rapid progress in chemistry and biology, cyber capabilities, autonomy, and more. ▶️ Read now: https://t.co/afJoJy0pYl” (https://x.com/i/status/2001579052830953668)
- https://x.com/sayashk/status/1963343022252315112
- https://epochai.substack.com/p/the-changing-drivers-of-llm-adoption
- Follow up on the “AI Watersheds Phase 1 writeup” with Jonas Sandbrink; lots of good ideas in his email, and we should discuss further.
- https://x.com/EpochAIResearch/status/1996248575400132794
- Herbie Bradley (@herbiebradley) posted at 5:13 AM on Thu, Nov 20, 2025: “Looks like a very promising benchmark” (https://x.com/herbiebradley/status/1991495140633141550?t=B2ssdv1dhSdNGdMxxOJ6TA&s=03)
- Might incorporate: https://arxiv.org/pdf/2510.07575v1

[Abi] Geopolitics + AGI team at RAND – takeaways:
- This team also sees their goal as getting decision-makers (not just gov) to think more about TAI.
- On bottlenecks to econ data: their econ team said that a lack of granular data constrains econ research. Example: O*NET’s list of tasks is great, but it needs more granularity and more frequent updates – O*NET surveys are only sent sporadically, to ~5 people per firm. Example: the Census collects longitudinal employer/employee data (“LEHD”), but only 25 states opt in; also, researchers need special status to use it. RAND has access but notes that it’s not even that good, because occupation data isn’t linked to worker-firm datasets. This might be relevant to Watersheds!
- TO DOs: I will add a chat with RAND’s AGI NatSec team to the Director of Events’ onboarding doc. If desired, in January I can set up a chat with someone from RAND’s econ team to talk to Taren or Steve about how to unlock task-level or occupation-level data.

- https://epochai.substack.com/p/the-software-intelligence-explosion
- Are these trends bearing out? https://x.com/Hangsiin/status/1950645770346283083?t=2LsNFOIyCZy2Ar22eUR-iA&s=03
- Are these cognitive limitations easing? https://leap.forecastingresearch.org/reports/wave2

Notes from Jonas: he sees the key disagreements as the feasibility of ASI, and the speed of diffusion (and whether AGI will increase or decrease barriers to adoption). [Feasibility of ASI → potential bend in the curve – is this on our list?] If you extrapolate the METR curve, you’re probably not looking at AGI in 2028. The only way to get there soon is through some sort of speedup. So if you want to evaluate whether AGI is coming within a few years, you should be looking for signs of speedup. If you expect AGI in more like 2035, you should be looking at broader progress. How will we measure AI capabilities for multi-day tasks?
From https://digitaleconomy.stanford.edu/news/ai-and-labor-markets-what-we-know-and-dont-know/:
- One way to bolster evidence is to collect better data on when individual firms adopt AI (see more in point 4), to track employment changes before and after at the firm level, hopefully improving upon the measures in Humlum and Vestergaard (2025), Hosseini and Lichtinger (2025), and other work. Even better would be to find some kind of experiment in firm-level AI adoption. An example would be an A/B test at an AI company that randomly offered discounts on subscriptions to different firms. Ideally the experiment would have been run starting in the early days of AI, and run for months, if not years. It would be great to get actual large-scale data from AI labs on usage by occupation, perhaps via survey, rather than relying on predictions based on conversations.
- More research should be done on other labor markets. Three promising avenues are to use Revelio or ADP in other countries, if feasible; to use other private payroll data from other countries; or to use government administrative data to track employment changes. Some infrastructure likely needs to be built out to measure AI exposure for local occupations. A particular area of focus should be countries with high levels of employment in exposed jobs, such as call center operations. Further modeling can also help with predicting how impacts may vary across different institutional contexts.
- Ideally we would have some sort of continuous index of AI adoption, with differences in “how much” firms or workers have adopted AI. One option is to measure token counts, as suggested by Seed AI. Business spend data seems promising as well. Another option is the number of unique users or the number of conversations. We should encourage AI companies to share data on this to the extent feasible. Business surveys should also explore alternative questions and test how sensitive reported adoption rates are to the specific wording. (A toy version of a token-count index is sketched below.)
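A toy sketch of what a token-count adoption index could look like, assuming firms (or an AI provider on their behalf) reported monthly token consumption and headcount; the firms and numbers are entirely hypothetical:

```python
# Sketch: a continuous AI-adoption index from reported token counts,
# normalized per employee and rebased to 100 in the first period.
# Firms and figures are hypothetical.
monthly_reports = {
    # firm: [(month, tokens_consumed, headcount), ...]
    "acme_corp": [("2025-01", 4.0e9, 2000), ("2025-06", 9.5e9, 2100)],
    "globex":    [("2025-01", 1.2e9, 800),  ("2025-06", 5.0e9, 780)],
}

def adoption_index(reports):
    # Tokens per employee, averaged across firms, rebased so the first month = 100.
    months = sorted({m for firm in reports.values() for m, _, _ in firm})
    per_month = []
    for month in months:
        rates = [t / h for firm in reports.values() for m, t, h in firm if m == month]
        per_month.append(sum(rates) / len(rates))
    base = per_month[0]
    return dict(zip(months, (100 * v / base for v in per_month)))

print(adoption_index(monthly_reports))  # e.g. {'2025-01': 100.0, '2025-06': ~312}
```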
Not related to cybersecurity, but we did a deep dive on agent benchmarks. Many of them are broken and measure AI agent performance poorly:
- Twitter/X: https://x.com/daniel_d_kang/status/1942641179461648629
- LinkedIn: https://www.linkedin.com/posts/daniel-kang-1223b343_ai-agent-benchmarks-are-broken-activity-7348406954253312000-B_Yq/
- Substack: https://ddkang.substack.com/p/ai-agent-benchmarks-are-broken

Remotelabor.ai, to track the new Remote Labor Index, measuring what percentage of remote work AI can automate. Currently the top score is 2.5%, so “not much” – but that’s very different from 0%.

Diffusion: https://www.convergenceanalysis.org/fellowships/spar-economics/decoding-ai-diffusion-mapping-the-path-of-transformative-ai-across-industries

Ethan Mollick (@emollick) posted at 5:09 PM on Thu, Sep 25, 2025: “After reading it, this does seem like a big deal. Industry experts outlined important, real-world, hard tasks for AI to do. Other experts were asked to do the tasks themselves & yet others graded human & AI output. Models approached parity with humans & AI is getting better fast. https://t.co/z666YcNyH6” (https://x.com/emollick/status/1971366497244348625?t=suZJLFoe4S9wZoSROiNeKw&s=03)

OpenAI (@OpenAI) posted at 9:24 AM on Thu, Sep 25, 2025: “Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. https://t.co/uKPPDldVNS” (https://x.com/OpenAI/status/1971249374077518226?t=aXFDp9V1lvUVuMBJOotUFA&s=03)

Lawrence H. Summers (@LHSummers) posted at 9:36 AM on Thu, Sep 25, 2025: “A research team at @OpenAI, where I am proud to be a board member, released an important new paper today. This paper looks at what might be thought of as task specific Turing Tests and shows that AI systems, even with limited guidance, perform many tasks -- such as planning” (https://x.com/LHSummers/status/1971252567981146347?t=dnQgGIFT7yFz-Ex3KHVhpA&s=03)

Sayash Kapoor (@sayashk) posted at 1:54 PM on Wed, Oct 15, 2025: “📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 https://t.co/TvSxUsptdW” (https://x.com/sayashk/status/1978565190057869344?t=sQeizZZnd9uOH-Chjv8e5A&s=03)

Dan Hendrycks (@DanHendrycks) posted at 7:20 AM on Thu, Oct 16, 2025: “Our definition of AGI is an AI that can match or exceed the cognitive versatility and proficiency of a well-educated adult. To measure this, we assess the multiple dimensions of intelligence derived from the most empirically validated model of human intelligence (CHC theory). https://t.co/e1wEkzmHwb” (https://x.com/DanHendrycks/status/1978828383581561009?t=b6nlhd1Uh6Xj57adoDnzEQ&s=03)

Ethan Mollick (@emollick) posted at 10:24 AM on Thu, Oct 16, 2025: “A lot I like & some I don’t in this paper: Like: Clear definition of AGI, diverse authors, shows jaggedness, tracking metrics over time (huge leap from GPT-4 to GPT-5). Dislike: AGI defined as replicating a model of human cognition, benchmarks are scattershot, narrow view of AI https://t.co/T3XOu2PVl8” (https://x.com/emollick/status/1978874737892667718?t=hWtw4waaXLy-Xa4djFU-iQ&s=03)

Sayash Kapoor (@sayashk) posted at 10:12 AM on Fri, Sep 12, 2025: “Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables,” (https://x.com/sayashk/status/1966550402129592738?t=SX4UR2z0FabBX_mgLc98dw&s=03)

The Point Magazine (@the_point_mag) posted at 6:05 AM on Thu, Oct 16, 2025: “New online, @saffronhuang on what it means to measure intelligence—in large language models and in us: https://t.co/AKZdHiyzGE” (https://x.com/the_point_mag/status/1978809403609382977?t=pzJVLmXj6ZGb3k88X648qQ&s=03)

At the CAIS event on Oct. 2, someone (Dan?) mentioned that they’d be posting an AI Automation Index in a few weeks.

Alex Tamkin (Anthropic Economic Index coauthor), on data sources: model usage data (can’t be longitudinal because they don’t keep logs; also confounded by changing models), government (states might be a good source if we can’t get federal help), downstream apps, Stripe. He’d love to see interviews, e.g. of hiring managers.
Bharat: someone in his group wanted to know the capital vs. labor contribution to AI R&D; this would be helpful in calibrating his model – it’s the missing variable.

Miles Brundage: How confident is he in short timelines? Pretty confident. He’s typical of people who have spent multiple years at a frontier AI company and lived through / closely watched / participated in multiple scaleups, going from “signs of life” to the maturity stage for image generation, codegen, video gen, writing, math. We’re still early on RL, and even pretraining – for instance, we’ve barely scratched the surface on video data (YouTube). On fuzzy vs. tidy problems: he views these as differences of degree, not kind. E.g. there’s a lot of positive spillover from math RL to code, and from code RL to writing or policy research. There’s very little data relating to the kind of papers he writes in OpenAI’s RL, but chain of thought induces useful skills (such as breaking problems down into parts, checking your work)… and that makes it useful for working on his papers? It just takes intellectual labor to turn a non-verifiable task into something you can test and verify.

[Josh 6/9/25] My colleague Alexa (cc’d) put together some initial ideas for forecasting questions that could help to further specify and concretize some of the cruxes described in your post. She gives more context on the approach in the summary of her document. Would you be interested in incorporating a revised version of any of these questions into your work, or possibly trying to get forecasts on them as part of your research? We’d also be happy to help with collecting forecasts if you’d find that valuable.

Dean W. Ball (@deanwball) posted at 9:09 AM on Sun, Sep 14, 2025: “I think Demis is fundamentally correct here. The current systems are extremely impressive, and will get much more so soon, but it’s clear there are fundamental breakthroughs still needed. As I have written before, I expect us to get ‘superintelligence’ (AI systems that can, say,” (https://x.com/deanwball/status/1967259417029837122?t=slJA0l1BoEy3WQdOJzlUGQ&s=03)

Dean W. Ball (@deanwball) posted at 5:09 AM on Tue, Sep 16, 2025: “If this mirrors anything like the experience of other frontier lab employees (and anecdotally it does), it would suggest that Dario’s much-mocked prediction about ‘AI writing 90% of the code’ was indeed correct, at least for those among whom AI diffusion is happening quickest.” (https://x.com/deanwball/status/1967923900685386222?t=FztCAYh5PbN51JLbzAcGUA&s=03)

Steven Adler @ Progress Conference 2025:
- Talked about looking at the Upwork task mix, prices, etc. as a signal.
- Talked about building evals around more open-ended real work tasks.

1a3orn (@1a3orn) posted at 4:20 PM on Sat, Oct 25, 2025: “data from OpenAI / Anthropic that I wish I had, but do not: 1. What percent of Transformer improvements in OAI / Anthropic are original to the company, and what percent come from outside? 2. What ‘Constitutional principles’ does Anthropic currently use for alignment?” (https://x.com/1a3orn/status/1982225866470899728?t=BxaB8hw-2u-HkAGBlH93bg&s=03)

https://thezvi.substack.com/p/asking-some-of-the-right-questions

Could talk to Gabe Weil, who mentioned that Basil Halperin argues that AGI should raise interest rates.

https://newsletter.forethought.org/p/how-quick-and-big-would-a-software
https://tecunningham.github.io/posts/2025-09-19-transformative-AI-notes.html

Incorporate ideas from my chat with Jaime:
- Will investor confidence continue to support scaling of compute / training budgets?
- Would love to have more visibility into the revenue chain, how solid the demand is, and where the room for short- and long-term growth comes from. We talked about my question of how revenues flow through the AI value chain, and how much of this is speculative vs. committed users who are experiencing value. Jaime noted that Anthropic is very dependent on coding tools.
- Investors are ready to fund roughly 3 years of burn, so perhaps 10x ARR (at current growth rates). To keep up this rate of growth, it’s necessary to keep expanding into new markets.
- What’s the penetration rate of coding tools? Jaime is surprised how low it is… anecdotally, 6 months ago, talking to random developers in Spain and Mexico, no one was really using these tools professionally.
- Areas where he expects to see impact soon: accounting, customer service, legal work, finance, assistants / operations, market research analysts. Seems like there’s plenty of room here for a couple more years of revenue growth at the current rate.
- What would be useful for looking more than a couple of years ahead? He likes to take an outside view and look at what other markets are exposed. A colleague went through the O*NET (?) database and made a list of exposed occupations. They could do this at larger scale. What future capabilities will be needed? One can guess – e.g. much better computer use.
- His big disagreement with AI 2027 is around returns to intelligence and returns to parallelization of research. He doesn’t foresee nearly the same degree of benefit from running lots of small experiments in parallel. One thing that has shaken his beliefs on returns to intelligence is the insane amount that companies are willing to pay their top researchers. It’s difficult to interpret exactly what that means, but it might suggest high returns to intelligence.

https://blog.cip.org/p/notes-on-building-collective-intelligence
Pass @ kitchen sink: https://www.notion.so/Todo-5cbf5bb74635457381c2f814628c73f9

Notes from Jason Clinton’s talk at The Curve 2025 (John Hart may have more); could incorporate these into metrics of internal usage at the labs, and how this differs from other orgs:
- Anthropic has automated level 1 SOC analysts: reviewing alerts to decide if they require action. They are 2-3 months from automating tier 2: deciding which alerts are too noisy. Blocked on visual reasoning???
- Human code review is 25% effective at catching security bugs. AI could be better. The AI reviewer writes a repro before reporting the issue to a person.
- An AI is greenlighting 60% of design docs as being low risk, with no need for human security review.
- John’s notes:
  - 90% of code at Anthropic is written by Claude; they have eliminated all junior dev roles from their open jobs list.
  - “Vibe hacking”: 3 weeks ago a “low-skill Russian hacker” used Claude to hack people – https://www.bbc.com/news/articles/crr24eqnnq9o
  - DARPA just concluded an “AI Cyber Challenge” – https://aicyberchallenge.com
  - They have a specialized Claude agent that focuses on flaky tests. If it gets stuck / can’t un-flake a test, it will “reach out” to a “Claude SRE” agent to see if it’s infra-related.
  - Likewise, they have a specialized Claude agent that just does security review of design proposals; this has offloaded some routine work from their principals. It compares design docs against all known MITRE attacks.
  - Claude does all Tier 1 SOC analyst work; a human only gets in the loop when something is raised to the Tier 2 level (arbitrating noisy alerts, for example).
  - “Literally everyone is working on memory.”
  - Multiple startups will be offering “virtual employees” (presumably with long-term context memory) starting in April or May of next year.
  - Responsible disclosure timelines are woefully out of date in the AI era.

Per discussion with Nick Allardice, impact will play out very differently (and more slowly) in the global south.

Research priorities
I’m excited about this. I think our “neutral/Switzerland” angle can help here. On the collective-action side, we could potentially frame it as being in the labs’ interest – a way to get better forecasts – and maybe pair with some of the platforms where leading labs already participate, i.e. the Coalition for Secure AI, maybe GPAI. Propose some specific initiatives. Emphasize projects that require collective action, such as:
- A large effort to collect a valuable data set which would be useful for multiple research projects.
- Collecting data that requires cooperation from AI labs or other private sources, because the data is sensitive and/or requires effort for the private actor to supply. Collective action may be needed to pressure the labs into cooperating, and/or to create a high-trust context in which appropriate safeguards can be provided (controlled access to data).