David Langer at Lionheart suggested that they might be in a position to help us measure some things.

https://x.com/StringChaos/status/1928476388274716707
https://x.com/Simeon_Cps/status/1900218546904293863

Relevant indicators hinted at here: https://www.theinformation.com/articles/openai-says-business-will-burn-115-billion-2029

Example of finding useful data sets: https://mikelovesrobots.substack.com/p/wheres-the-shovelware-why-ai-coding

Plan

For this sort of project, serious work requires multiple rounds of deep thought. That can't happen in a compressed period of time, so doing it collaboratively requires an extended multi-turn conversation. I have not had success in getting most people to engage this way in a group email or Slack. The only approach I've found that works is 1:1 engagement with each participant, over email and occasional calls, with a lot of nagging. This is annoying but does seem to work.

I've witnessed one counterexample (Helen's CSET workshop) and heard about at least one other (something Sayash worked on); both were collaborative projects centered on a multi-hour (or multi-day) synchronous group discussion. The CSET workshop did generate some new ideas, but a lot of what it produced was aggregation of people's existing thoughts rather than new thinking.

Maybe we should just embrace this: perhaps our niche is projects that require lots of multi-turn 1:1 interaction.

Phase 2 of this specific project is more about aggregating ideas than deep analysis, so getting a group of people together on a Zoom to brainstorm could work; brainstorming and ideation was, I think, the focus of the two success stories above. We could try to get a bunch of people on a call without being too fussy about exactly which people (though it would be really good to have someone from METR, someone from Epoch, and perhaps also the Forecasting Research Institute). This could be a meeting to put a cap on phase 1 (without necessarily locking it down completely; I'll continue to get input from a few people I'm only connecting with now) and to do the initial brainstorming for phase 2. I can seed the conversation with a few ideas shared in advance.

Note that our measurable indicators don't need to directly answer a crux; we're just looking for data that will be helpful in future discussions of those cruxes.

→ share thoughts with Taren, Ajeya, Helen, ?

Ask Helen how far she's gotten in turning the discussion into a paper, and whether she wishes she could do more rounds with at least some folks.

Ajeya: Might be productive to do this in two rounds. Do the fuzzy thing, then get someone who's interested in doing the legwork to take the first pass at coming up with concrete experiments – Ajeya might help with this – and take those back to the participants to rate. That is:
- Group lists the fuzzy questions.
- Individual – Ajeya or Ryan Greenblatt – proposes a list of concrete experiments.
- Go back to the group to rate the proposals.
Realtime, concentrated bursts of time are so much more productive that it's better to ask someone to come to an event that sounds cool & fun and spend the first hour reading the doc. Or you can get people to pre-read more reliably if they've agreed to come to an event; that's what she did for the loss-of-control workshop. And keep the writeup short.
Siméon (@Simeon_Cps) posted at 5:31 PM on Fri, Aug 15, 2025:
"I've entertained that theory for a few years and have been confused since then why people expected confidently so much GDP growth. Basically prices of goods should crash so fast that the question of 'how do you count inflation' will become the first order parameter of whether and" [truncated]
https://x.com/Simeon_Cps/status/1956514153528742168?t=qJkEFlWoQAD6z6g4HMhNAw&s=03

Notes with Taren, 8/8/25
- Post four brainstorming time slots and let people sign up; probably don't want >4 people/session (though we could do breakout rooms).
- Taren, and probably many people, will do better in a brainstorming session with people with different expertise.
- Could let each group decide which two cruxes they'll talk about; if we wind up with a gap, do something about it at the end. Or maybe assign cruxes to time slots.
- Doing an in-person session could be fun; might be more trouble than it's worth, but might not be much trouble at Constellation – ask them to help organize & recruit people, e.g. two 4-person groups.
- Try to do one in the south bay? Could include Sam.
- Ask Ajeya how she'd like to participate.
- Talk to Helen Toner about how to make sure what I'm doing is complementary.

Other people / groups to include:
- DC – government does a lot of data gathering – propose content for legislation – Abi could help? Taren could do something in DC.
- Oliver Stephenson (FAS)
- Elham, or the guy who worked for her?
- FAI
- IFP
- CHT (CBT?)
- FAS?
- Some Horizon fellows who are placed to draft a bill?

Plan a third stage where we produce a piece of draft legislation?
- Discuss with Abi
- Discuss with Victoria next Friday
- Taren to discuss with some people in DC
- Should discuss with Helen Toner

Next steps:
- First step should be some Zooms; invite everyone. We won't exhaust potential participants from Constellation.
- Taren to talk to Abi about whether to make this the theme for the dinner they're organizing; if not, Taren will convene some other small meeting while she's in DC.
- Pre-brainstorm: me, Taren, maybe Ajeya, maybe Abi? Chris Painter? Ryan Greenblatt? Josh Rosenberg? Someone with economics expertise (a grad student from David Autor's lab)?
- More focus on identifying kinds of measurements (quantitative, qualitative, horizontal, vertical, etc.) to seed later conversations (+ as pre-read).

Participants in Stage 2:
- Jaime Sevilla (Epoch)

Content Notes

See Untitled. Review the Cruxes list in the early writeup.

Incorporate this idea from my discussion with Nick Allardice: our economy and decision-making processes are so fragmented and messy that trying to answer my crux questions at a societal level is unhelpfully general. More tractable & beneficial: pick some sectors, industries, and types of problems, and come up with ways of measuring change in those, as leading indicators for what might happen in other sectors and industries. E.g. software engineering may be one of the more tractable problem spaces for AI; he'd be very interested in tracking diffusion here: hiring practices, how much autonomy is being granted, what level of productivity it is unlocking. If CS is a fast example, find a few slow examples, and get super specific & granular about measuring diffusion. Get an idea of the uneven distribution.

Carey: I don't know where this fits, but I think the question of "what are the most common failure modes that prevent current models from excelling at practical tasks?" would be a relevant crux or root cause for your cruxes.

Anna Makanju: It's hard to break down someone's usage of chatbots into productivity versus other uses. In the last year it's flipped from predominantly productivity usage to companionship. If you measure usage, you need to try to disentangle the nature of that usage – apply a classifier to their chat history, or focus on measuring enterprise accounts. Could work with universities and government agencies, who will have enterprise accounts and might be more willing to share data for a study. Also see notes from my 8/11 conversation with Anna.
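As a concrete illustration of the "apply a classifier to their chat history" idea – a minimal sketch of my own, not something Anna specified, using scikit-learn and made-up example transcripts; a real study would want far more labels (or an LLM-based labeler) plus data-sharing agreements:

```python
# Sketch: classify chat transcripts as "productivity" vs "companionship" so
# aggregate usage metrics can be disentangled. The labeled transcripts below
# are hypothetical placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transcripts = [
    "Can you refactor this function and add unit tests?",
    "Draft a status update for my manager about the Q3 launch.",
    "I had a rough day and just want to talk to someone.",
    "Do you ever think about me when I'm not here?",
]
labels = ["productivity", "productivity", "companionship", "companionship"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(transcripts, labels)

# Applied to an (anonymized) usage log, this yields the share of each usage
# type -- the quantity we'd actually want to track over time.
new_chats = ["Help me debug this SQL query", "I miss talking to you"]
print(list(zip(new_chats, clf.predict(new_chats))))
```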
Why Focus on Early Indicators?

If you want to make predictions about a future that is similar to the present, you might be able to simply extrapolate from past values of the variable you need to predict. For instance, Moore's Law was an observation about trends in transistor counts, and for many decades it provided excellent forecasts of future transistor counts [FOOTNOTE: Though this may be a story about self-fulfilling prophecies as much as about the tendency of important variables to follow predictable trends.].

If you need to make predictions about a future that looks quite different from the present, you can't get by with simple extrapolation. You need a model of how the future is going to unfold, and you need data to calibrate your model. For instance, if you want to predict the potential of fusion power, you can't extrapolate the graph of historical electricity generation; for fusion, that graph is flatlined at 0. But if you understand the path that current fusion efforts are following, you can extrapolate metrics like "triple product" and "Q" [FOOTNOTE: I got help from Claude Opus 4 on this; the answer matches my vague recollection well enough that I'm not bothering to fact-check it: The most critical metrics for tracking progress toward practical fusion power are the fusion energy gain factor (Q), which measures the ratio of fusion power output to heating power input and must exceed 10-20 for commercial viability; the triple product (density × temperature × confinement time), which needs to reach approximately 10²¹ keV·s/m³ to achieve ignition conditions; and the reactor's availability factor or duty cycle, measuring the percentage of time the reactor can operate continuously, as commercial plants will need to run reliably for months at a time rather than just achieving brief fusion pulses.] to get an idea of how close we are to a functioning generator.

Verify that an early application of the steam engine was to pump water out of coal mines. Make a reference to this being a sort of recursive self-improvement. Observe that if you had wished to measure the uptake of steam engines for pumping water out of coal mines, you could have looked at:
- inputs to the process, such as the amount of coal being consumed in steam engines;
- outputs, such as the amount of water being pumped;
- or impact, such as lowering the water level within the mines.
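To make the extrapolation idea concrete, here is a minimal sketch of fitting an exponential trend to a leading indicator and estimating when it crosses a target threshold. The data points are made-up placeholders purely for illustration, not real measurements:

```python
# Fit an exponential trend to an indicator (a stand-in for the fusion triple
# product) and estimate when it crosses a target threshold. All values below
# are hypothetical.
import numpy as np

years = np.array([1990, 2000, 2010, 2020])
triple_product = np.array([1e18, 1e19, 5e19, 2e20])  # hypothetical, keV·s/m³
ignition_threshold = 1e21  # approximate ignition condition cited above

# Fit log10(indicator) = a * year + b, i.e. a straight line on a log scale.
a, b = np.polyfit(years, np.log10(triple_product), 1)

# Solve a * year + b = log10(threshold) for the crossing year.
crossing_year = (np.log10(ignition_threshold) - b) / a
print(f"Trend crosses ignition threshold around {crossing_year:.0f}")
```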
Cruxes to track:
- When will "superhuman coders" and "superhuman AI researchers", as defined in AI 2027, emerge?
- How is the task horizon identified in Measuring AI Ability to Complete Long Tasks progressing? What are we learning about time horizons for higher reliability levels (I presume reliability much higher than 80% will be necessary)? (See the sketch after this list.)
- How fundamental is the gap between performance on benchmarks and real-world tasks? Is it growing or shrinking? [QUESTION: Does this cover the "capability-reliability gap" described in AI as Normal Technology, or do we need to expand the description?] [This might better belong under the question regarding advances in domains other than coding and AI research.]
- Are any skills under coding or AI research emerging as long poles (more difficult to automate), and if so, are there feasible ways of compensating (e.g. by relying more on other skills)?
- What does the plot of AI research effort vs. superhuman performance look like? How does this vary according to the nature of the cognitive task – in particular, for tasks such as "research taste" that are critical to accelerating AI R&D?
- Basically redundant with previous cruxes, but perhaps worth listing as something that could be independently measured: across the broad range of squishy things that people do every day, how rapidly will the set of tasks for which AI provides significant uplift grow, and what are the contours of that set (what separates tasks experiencing uplift from tasks that are not)?
- Can AIs think their way around the compute bottleneck? If the R&D labor input races ahead of compute and data, how much progress in capabilities will that yield? To what extent does this depend on the {quantity, speed, quality/intelligence} of the AI workers? Does this apply to all compute-intensive aspects of AI R&D?
- Is Ege's forecast for NVIDIA revenue bearing out? Is his model for relating NVIDIA revenue to real-world impact of AI valid? Could rapid algorithmic improvements (driven by a software explosion) decouple impact from NVIDIA revenue? Could adoption lags result in revenue lagging capabilities?
- Possibly other questions as to whether various current trends continue – I'm not sure whether there are any other cruxes lurking here. For instance, is general progress in LLM capabilities speeding up or slowing down? Are breakthroughs emerging that accelerate the curve? Is RL for reasoning tasks hitting a ceiling? Are any of Thane's bear-case predictions bearing out? Etc.
- As the automation frontier advances into increasingly long-horizon, messy, judgement-laden tasks, will the speed and cost advantages of AI (vs. humans) erode, to the point where AI isn't significantly faster or cheaper than humans for advanced tasks (and a few years of optimization doesn't fix the problem)?
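A sketch of how the time-horizon-at-reliability question could be measured. METR's approach, as I understand it, fits a logistic curve of success probability against log task length; the horizon at reliability r is the task length where predicted success equals r. The code below uses synthetic data purely to show the mechanics; real inputs would be per-task (human-minutes, success) pairs from an agent evaluation:

```python
# Fit a logistic curve of success vs. log2(task length), then read off the
# horizon at any reliability level (50%, 80%, 95%, ...). Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
task_minutes = rng.uniform(1, 480, size=300)  # task lengths in human-minutes
log_len = np.log2(task_minutes).reshape(-1, 1)
# Synthetic agent whose success odds fall off with log task length.
p_true = 1 / (1 + np.exp(-(4.0 - 1.0 * log_len.ravel())))
success = rng.random(300) < p_true

model = LogisticRegression().fit(log_len, success)
b0, b1 = model.intercept_[0], model.coef_[0][0]

def horizon(reliability):
    """Task length (minutes) at which predicted success equals `reliability`."""
    logit = np.log(reliability / (1 - reliability))
    return 2 ** ((logit - b0) / b1)

for r in (0.5, 0.8, 0.95):
    print(f"{r:.0%} horizon: {horizon(r):.0f} min")
```

Note how quickly the horizon shrinks as the required reliability rises; that falloff is exactly what we'd want to track if much-higher-than-80% reliability turns out to be necessary.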
Resources from the CSET conference:
- Helen's working doc
- My slides
- Ryan's slides and notes

It's important to detect when the most senior skills start to become automated. This could indicate a tipping point, both for progress at the big labs and for the ability of a rogue actor without access to senior talent to achieve a breakout. Perhaps we can look at the percentage of impactful ideas that come from unassisted AI, or at the ratio of major, paradigm-changing ideas to other inputs.

Look for additional domains in which to measure sample-efficient learning. In other domains, look at the ratio of spending on real-world data collection versus in silico data generation.

ARC Prize (@arcprize) posted at 10:21 AM on Tue, Sep 09, 2025:
"ARC Prize Foundation @ MIT. We're hosting an evening with top researchers to explore measuring sample-efficient learning in humans and machines. Join us to hear from Francois Chollet along with a world class panel: Josh Tenenbaum, Samuel Gershman, Laura Schulz, Jacob Andreas https://t.co/dq7NJyXkNk"
https://x.com/arcprize/status/1965465501079142814?t=L7y6E9f9cwxgjwWD1FMZJA&s=03

https://epochai.substack.com/p/after-the-chatgpt-moment-measuring

Check in with Divya / CIP to see whether their global pulse surveys have questions relevant to the cruxes. Perhaps we can draw from this / make suggestions for it. They're running "global pulse" surveys, every two months since March, to understand what people want from the future but also how much AI has diffused into people's lives. Questions around trust, diffusion, how much are you relying on AI for medical or emotional advice, are you using it in the workplace, etc. https://globaldialogues.ai/cadence/march-2025. "In three years, what questions will we wish we had been tracking?" Maybe it could be interesting to co-author something at some point.

Cheryl Wu (@cherylwoooo) posted at 6:25 PM on Sun, Jun 01, 2025:
"Are we at the cusp of recursive self-improvement to ASI? This tends to be the core force behind short timelines such as AI-2027. We set up an economic model of AI research to understand whether this story is plausible. (1/6)"
https://x.com/cherylwoooo/status/1929348520370417704?t=f9E9Yty2m27EQ-Z_NbbJ3Q&s=03

From Does AI Progress Have a Speed Limit?, a measurement:

Ajeya: I'm kind of interested in getting a sneak peek at the future by creating an agent that can do some task, but too slowly and expensively to be commercially viable. I'm curious if your view would change if a small engineering team could create an agent with the reliability needed for something like shopping or planning a wedding, but it's not commercially viable because it's expensive and takes too long on individual actions, needing to triple-check everything.

Arvind: That would be super convincing. I don't think cost barriers will remain significant for long.

Another:

Ajeya: Here's one proposal for a concrete measurement – we probably wouldn't actually get this, but let's say we magically had deep transparency into AI companies and how they're using their systems internally. We're observing their internal uplift RCTs on productivity improvements for research engineers, sales reps, everyone. We're seeing logs and surveys about how AI systems are being used. And we start seeing AI systems rapidly being given deference in really broad domains, reaching team lead level, handling procurement decisions, moving around significant money. If we had that crystal ball into the AI companies and saw this level of adoption, would that change your view on how suddenly the impacts might hit the rest of the world?

Another:

Ajeya: …do you have particular experiments that would be informative about whether transfer can go pretty far, or whether you can avoid extensive real-world learning?

Arvind: The most convincing set of experiments would involve developing any real-world capability purely (or mostly) in a lab – whether self-driving or wedding planning or drafting an effective legal complaint by talking to the client.
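For concreteness, a minimal sketch (with entirely made-up numbers) of the analysis behind the internal uplift RCTs Ajeya imagines having visibility into: compare a productivity metric across randomly assigned AI-assisted and control groups.

```python
# Hypothetical uplift RCT: tasks completed per week by each engineer in
# randomized control (no AI assistance) and treatment (AI-assisted) arms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, size=40)
treatment = rng.normal(13.0, 2.5, size=40)

uplift = treatment.mean() / control.mean() - 1
t, p = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
print(f"Estimated uplift: {uplift:.1%} (p = {p:.3g})")
```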
From Deric Cheng (Convergence Analysis / Windfall Trust): early indicators – he's friends with the Metaculus folks, who are working on indicators for AI diffusion and disempowerment. Don't share: Metaculus Diffusion Index – they'll publish in a few weeks.

https://metr.substack.com/p/2025-07-14-how-does-time-horizon-vary-across-domains

When monitoring progress in capabilities, we need to watch for the possibility that capabilities are advancing on some fronts while remaining stalled on some critical attribute such as reliability, hallucinations, or adversarial robustness.

Nick Allardice has a prior that labor market disruption is not going to be meaningfully different from other times in history… but he's highly uncertain. Evidence that might push him toward believing this time will involve meaningfully different levels and pace of labor market disruption: [?]. The burden of proof is on this time being meaningfully different from the past. If AI gets good at something, we'll focus on something else. He hasn't seen enough evidence to shift his priors. Leading indicators currently reinforce his prior: unemployment is low. It's harder to get a job as a junior developer, but not impossible, and that mostly seems to be due to other factors. Even if capabilities advance, diffusion challenges will leave room for human workers. Our institutions aren't going to turn everything over to AI.

Offcuts

Qualitative rubrics [MPH is a good indicator because every mile takes about the same amount of time; cities passed per hour would break down in the Great Plains; MPH breaks down when you enter a city center]. If we only measure high-level, downstream attributes such as practical utility, we won't have any way of anticipating these twists and turns. As I noted in the introduction to the cruxes writeup, by the time Facebook started to show up as a major contributor to overall Internet usage, it was already well along its journey to global dominance.