Brainstorming doc

Why Do Measurements?

Thoughts on the purpose of forecasting AI's timeline:
- General trend visibility for several years (important for lots of policy + other questions)
- At least several years' notice of transformational change
- Estimate likely steepness & extent of transformational change
- Ask panel for feedback on this
- Josh: interested in the connection between current policy questions and our questions here. Would be great to get more specific / explicit about the connections with policy questions.

General thoughts on what to measure

- Practical utility is a function of both AI model capability and complementary app innovation (coding models vs. Cursor).

Approaches to measurement

- Josh (+'d by Helen, Nikola, Sayash): noting a cross-cutting theme: more high-quality surveys of companies generally seem valuable. A lot of ideas involve sensitive data, so they would require controlled access (and/or governmental involvement?).
- Using AIs to qualitatively grade lots of screen recordings and other messy high-volume data.
- Find ways of calibrating / correlating benchmark scores with measurements we care about more directly (METR time horizons being an example in this direction).
- Generally use lots of different measurements and try to map them to common units for cross-checking.
- Surveys / prediction markets; focus real-world measurements (e.g. uplift studies) on areas of disagreement. Or get both sides of a dispute to agree on a cheaper proxy measurement.
- Case studies:
  - How firms in various sectors are using AI; scale of tasks assigned to AI / nature + granularity of interaction with humans; degree of trust / freedom of action given to AI.
  - Deep study of revenue / spending / profit across the AI value chain (and by domain). Also: number of "complementary innovation" (Sayash's term) AI-powered apps, adoption rate, # of customers, surveys of customers to see how they are improving their workflows, etc.
  - Helen: something about revenue from AI more broadly vs. revenue of frontier model providers. What can we say about open models eating into OpenAI / Google / Anthropic revenue?
- Broad metrics (a la Epoch), e.g. electricity used by AI, global private investment.
- Keep pushing benchmarks toward greater realism of both input (task specification + context) and output (not just "passes tests", but "PR is mergeable" / other measures of quality). Nikola: need more realistic benchmarks; source PRs (ideally from inside AI labs) and test AIs on them.
- Analysis of usage at the model level (Anthropic economic AI index).
- Logs of agent usage, i.e. usage of Model Context Protocol servers, pulls. [Sayash]
- Analysis of AI logs from AIs being asked to solve various tasks (we're doing this with 10B tokens of HAL agent log data, but could imagine increasing it many fold).
- Social media is subject to transparency measures; AI companies currently are not. It'd be great to have controlled researcher access to e.g. ChatGPT transcripts.
- Lots of general data sets for public analysis.
- AIs really like to email Eliezer; that's an indicator of various types of strange AI behavior. Ajeya: there was an NYTimes article about a journalist getting similar emails. Could there be a "weird AI shit" incident tracker that would categorize reports + report trends?
  - Josh: MIT AI incident repository? Populated from public reports – links to existing reports.
  - Ajeya: hard to make sense of it. Needs a larger team sifting the signal from the noise + writing blog posts. Doesn't track veracity.
  - The AI Risk Repository / incident databases are relevant here: https://airisk.mit.edu/, also https://incidentdatabase.ai/. Ajeya: too hard to make sense of current databases; need someone to do better synthesis.
- Make an open-source AI company where the company itself is open source. Have it try to use AI as much as possible, make all its logs about everything (Slack etc.) public, and study that.
- Sayash: there's been a lot of work on the science of AI evaluations, but we don't know the shape of the DAG, i.e. how AI inputs translate into outputs. We could do more to distinguish between different DAGs – which is a core source of disagreement; for instance, Sayash has a very different DAG in mind. Inputs might include things like data and compute; output is measured in the real world. Econometrics.
- Josh: Meta-measurements:
  - Big increase in expected OpenAI revenue, according to a prediction market, based on a reported algorithmic breakthrough. Pushback from Ryan + Ajeya: limitations of prediction markets mean it'll be tough for them to pick this up reliably.
  - Surveys at frontier AI companies.
  - Surveys of the broader workforce.
  - Data center construction spend → currently dependent on media reporting; is there some way to get better / more consistent data on this?
  - Trying to unpack inputs/outputs per Sayash's point about DAGs / understanding how improvements in one element flow through to improvements in other elements.
  - Lots of (different types of) surveys.
- Helen: important to track military applications (broadly – decision support, targeting, etc.). Important to track freedom of action here.

Measuring Utility

- Metrics about fluid intelligence, messy tasks.
- Benchmarks for domains vs. skills:
  - Domains: cyber / law / professional labor markets.
  - Skills: context awareness, reliability, long-context reasoning, sample efficiency.
- Need deep info for each sector before assessing results.
  - E.g. persuasion surveys overestimate results because the opinion change dissipates 6 weeks later.
- I like Steve's note that AI can continually be used cheaply.
- Uplift RCTs in real-world settings, measuring real-world outcomes (ideally: profit!), e.g. if you get an actual large company to do a randomized staged rollout of enterprise LLMs, that's great.
  - Ajeya: three-armed uplift trials: AI / human / cyborg.
- Josh:
  - Persuasion: cost for AI to swing a vote, relative to other leading interventions (Josh: lots of polysci studies here).
  - AI-bio: % of amateurs that can synthesize flu.
- Look for leapfrog companies that get on board with a new technology ahead of the incumbents; that could highlight utility ahead of adoption. E.g. fintech companies using AI in ways that let them overtake traditional banks, or healthcare startups using AI in ways that bypass doctors' offices.
- Power user case studies – screen recordings + deep interviews of people who (think they) are getting enormous uplift.
  - Organization version of this: find those YC startups that supposedly do 95% of everything with AI, and study them ("power organizational users").
- Field trials (a la AI Village) and realistic benchmarks.
- Do a time horizon analysis (a la METR), but on real-world usage? Also, qualitative analysis of the size + difficulty + nature of tasks successfully delegated to AI (and dig into ranges of "successfully").
- Building on Josh's work test idea: could a METR-like org collect work test setups from a range of places (under NDA, so as not to break the work tests) and centralize the work of eliciting good performance, figuring out measurement, reporting results, etc.?
- Ryan: I think uplift will be messy (because of humans being messy and the possibility of phase changes in how AI should be used productively), so I'm more into end-to-end automation of actual tasks. Also, we ultimately care most about full automation regimes and predicting when this will happen. Uplift studies correspondingly seem somewhat worse than looking at full automation of tasks of varying difficulty/size (and benchmark transfer to actual things people are doing in their jobs, so we don't need to constantly run these tests).
- Maybe you can get AI companies to run semi-informal experiments with uplift internally and publish results: many companies have effectively committed to doing this eventually due to their safety policies. I think Anthropic might be open to this, but there are various sources of trickiness here.
- Early reports of AI doing remarkable things (e.g. major scientific discovery / insight; solving a Millennium Prize Problem in mathematics).
- pass@any for existing benchmarks can predict pass@1 in the future.
- Helen: where is the ceiling on various capabilities (headroom)? This seems under-investigated.
  - Sayash: are most tasks like chess, or like writing? (Ryan: unclear whether writing is the right contrast here.) Are there less controversial examples than writing? Taking out the trash: saturates.
  - Helen: the writing example highlights that we're way under-theorized on this. Quality of parenting is another example. "Persuasion" is poorly specified.
  - Abi: a scientific breakthrough requires an insight that goes against dogma. Could be considered an aspect of very high-quality writing.
  - Ryan: I care most about questions like "will energy production 100x within 5 years of full AI R&D automation, if people want this to happen?" Seems like this has many possible routes in terms of capabilities headroom.
  - Sayash: three broad categories: computational limits (chess), intrinsic limits / saturation (writing?), and knowledge limits (need new breakthroughs to exceed past performance). Better AI can help with computational limits, but not intrinsic limits.
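The pass@any → pass@1 bullet above is usually operationalized with the unbiased pass@k estimator from the HumanEval paper (pass@k = 1 − C(n−c, k)/C(n, k), given c correct out of n samples per task); pass@any is roughly pass@k for large k. A minimal sketch – the per-task sample counts here are made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples passes),
    given c correct out of n total samples for a task."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some k-subset must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results: (n samples drawn, c of them passed).
tasks = [(10, 1), (10, 4), (10, 0)]

pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
pass_at_10 = sum(pass_at_k(n, c, 10) for n, c in tasks) / len(tasks)
print(pass_at_1, pass_at_10)  # ~0.167 vs ~0.667
```

The gap between the two numbers is the headroom that elicitation or further training might close, which is the sense in which today's pass@any can foreshadow tomorrow's pass@1.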
- Work tests for orgs in the space: GiveWell lit review test; Open Philanthropy work tests; high-quality summaries of a conversation (AI is still failing our first-round work test at FRI about this).
- Abi: "I liked how [this] list broke down sectors as each having their separate indicators. Very societal frictions approach."

Measuring Adoption / Impact

- YouTube resulted in some homeowners not calling plumbers for small jobs. What are the AI equivalents, and how should they be measured?
- Compare metrics across AI-feasible vs. AI-infeasible sectors: employment (incl. hiring patterns + plans), productivity, profit, qualitative measures of AI usage, …
- Revenue of AI service providers.
- Studies of AI impact across various fields:
  - What percent of sales of the U.S. pharmaceutical industry is generated by AI-discovered drugs and products derived from these?
  - What percent of publications in the fields of Chemistry, Physics, Materials Science, and Medicine will be "AI-engaged" (as measured in this study or a replication study) in 2030?
  - How many hours per week on average will K-12 students in G7 countries use AI-powered tutoring or teaching tools, as reported by their school systems or education ministries?
- A September 2024 study by the Federal Reserve Bank of St. Louis, based on the Real-Time Population Survey (N=3,216), estimated that 0.5%–3.5% of all U.S. work hours were assisted by generative AI.

Measuring Diffusion

- Ajeya: I want to know about adoption within AI companies, so e.g. surveys about how much compute they're running for inference, freedom-of-action-related surveys like the ones suggested below, internal estimates of how much AIs speed them up, etc. (Also, I don't think I literally care only about AI companies; I will probably also care about USG.)

Measuring Freedom of Action

- Surveys of workers: how often do you have to approve an AI decision; how much thought do you put into each approval (e.g. level of review / pass rate of AI pull requests)?
- Monitoring progress on agent infrastructure → if AI agents are operating online, are there meaningful constraints on them? Does e.g. Cloudflare have any meaningful controls, or is it all a chaotic mess?
- Something military or military-adjacent: reports of smaller militaries using autonomous / semi-autonomous systems?
- We can also make a big list of discrete flags and ask people about them, e.g. "Is your AI allowed to spend money on business expenses? Up to how much?", "Is your AI allowed to search the internet freely in the course of completing a task?", or "Are AIs at your company allowed to talk to employees other than the human that started the task, in the course of completing their task? What about external people?"
- White-hat reports of prompt injections paid out by bug bounties at the top (100) websites across domains over time (as a proxy for how many applications allow AI systems to take actions that can exfiltrate user data).
- Black-hat exploits of AIs (only possible if AIs are in a position to take important actions without adequate review?).

Curve-Bending Mechanisms

Note that the impact of AI can be difficult to predict. E.g. the prediction of AI flooding the zone with misinformation in the 2024 election didn't pan out. Conversely, sycophancy → mental health impact was an event that wasn't widely predicted.

- Intelligence explosion (software and/or hardware). Measure both the initial speedup, and whether r > 1.
  - Surveys of AI companies on internal AI use (bunch of ideas for granular questions, e.g. subjective sense of speedup, what tasks AIs are used for now, what tasks they aren't used for, hypothetical questions).
  - Internal measures of the absolute rate of algorithmic progress at AI companies (e.g. how much compute does it take to get GPT-4-level performance); watch for that trend accelerating. (Source)
  - Ryan: Get people who have done the takeoff modeling to look at the data on algorithmic progress (both historical data from humans and data under an automation regime). Unclear if this data will resolve disagreements, but seems like it can help. (Tom, Daniel, maybe some people from Epoch.)
- Algorithmic breakthrough / new approach (e.g. neuralese, in-context learning, long-term memory, recurrence, brain emulation) – either a known or an unknown unknown.
  - Study new model releases to look for signatures of known ideas, or look for an increase in papers on some approach.
  - Discontinuous jumps across a range of difficult benchmarks.
  - Could ask companies key questions, e.g. "Are you still using English CoT?", and make them report.
  - Survey expectations at frontier labs (also covers intelligence explosion).
  - Analysis attributing performance gains to different sources (e.g. model size scale-up, data quality improvement, RL, etc.).
  - Ryan: really big breakthroughs are like AlexNet, GPT-1 / scale-up, scaling RL (o1, o3). Easy to notice something is turning into a big deal, hard to tell where on the Richter scale it will land. Could track the current 5 most promising potential trend-breakers.
  - Ajeya: track the breakdown of capabilities improvements over time: how much is from scaling pretraining, scaling RL posttraining, etc.
- Threshold effects / phase changes (e.g. the moment when people are no longer needed; maybe time horizon scaling suddenly becomes much easier).
  - Phase changes in behavior of AI Village, or other observations of AI capability / usage in real or realistic contexts.
  - Phase change in qualitative analysis of uplift trials.
- Some resource (data, compute, electricity, intelligence headroom) is exhausted. Note: this may lead to workarounds rather than an end to progress, e.g. pretraining data gets harder to scale → scale RL instead. Hard to predict the effect.
- Low-hanging fruit is exhausted; training for long-horizon tasks is expensive.
- Some important missing capability doesn't yield to scaling + progress (e.g. sample-efficient learning, "judgement", long-time-horizon skills).
- Economics of AI not working out → slowdown in investment.
- AI winter – we run out of ideas for improving AI. Josh: FRI expert panel. Could ask them "What's the probability of an AI winter in the next 5 years?"
- External event (market downturn; war in Taiwan; public backlash / disaster → regulation; focused national effort / Manhattan Project). Track public sentiment and other political indicators.
- Applications of models are a lagging indicator relative to model capabilities. The first sign of a slowdown would be the time horizon graph slowing down.

Other

- Nikola: if timelines are long enough to enable human emulations (Ems), that would make a difference, because we'd have been able to explore more implications. More generally: does other transformative tech come first? Possible techs: Ems, nanotech (unlikely), genetic engineering (seems like this could radically transform society in ~30 years if heavily invested in now, but in practice might not happen because of e.g. legal/regulatory blockers).
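The "measure both initial speedup, and whether r > 1" bullet under Curve-Bending Mechanisms can be made concrete with a toy model. This is an illustrative sketch, not the precise definition of r used in any particular takeoff model: assume capability A feeds back into research speed as dA/dt = A**r, so r > 1 means the feedback outruns rising difficulty (doubling times shrink toward a finite-time singularity) while r < 1 means each doubling takes longer than the last.

```python
def time_to_reach(a: float, r: float, a0: float = 1.0) -> float:
    """Closed-form solution of dA/dt = A**r (toy 'research speed scales
    with capability' model): time for capability to grow from a0 to a.
    Assumes r != 1."""
    return (a ** (1 - r) - a0 ** (1 - r)) / (1 - r)

def doubling_gaps(r: float, n: int = 5) -> list[float]:
    """Time between successive capability doublings under exponent r."""
    times = [time_to_reach(2.0 ** k, r) for k in range(n + 1)]
    return [b - a for a, b in zip(times, times[1:])]

print(doubling_gaps(1.2))  # r > 1: gaps shrink -> finite-time blowup
print(doubling_gaps(0.8))  # r < 1: gaps grow -> progress decelerates
```

The measurement implication is that a short series of "time between effective-compute doublings" observations is, in principle, enough to tell which regime you are in, which is why tracking the absolute rate of algorithmic progress (and whether it is accelerating) matters.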