Daniel Kang’s Post

AI agents are increasingly used in production, but how can we know which agents to use and what they can do? Frontier labs, researchers, and practitioners are increasingly turning to AI agent benchmarks to answer this question. Unfortunately, AI agent benchmarks are broken!

Consider WebArena, a benchmark used by OpenAI and others to evaluate AI agents on interactions with websites. In a task to calculate the duration of a route, an agent answered “45 + 8 minutes” and was marked correct by WebArena, although the correct answer is “63 minutes.”

In our new research, we break down the failure modes in current AI agent benchmarks and introduce a checklist that minimizes the gameability of AI agent benchmarks and ensures they measure what they claim to measure.

Read about our work here:
- Substack: https://lnkd.in/eA8BwtAc
- Paper: https://lnkd.in/e6i5vsyb
- Website: https://lnkd.in/eX6JsgZd
- GitHub: https://lnkd.in/etwf8epA

This work is joint w/ Yuxuan Zhu, Yada Pruksachatkun, and other folks from Stanford, Berkeley, Yale, Princeton, MIT, Transluce, ML Commons, Amazon, and UK AISI.
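
For readers curious how a grading slip like the WebArena example can happen mechanically, here is a minimal illustrative sketch in Python. It is not WebArena's actual evaluator; the function names, constants, and matching rules are assumptions chosen to contrast an over-permissive answer check with a stricter one.

```python
import re

REFERENCE_MINUTES = "63"  # ground-truth duration for the route-planning task

def lenient_check(answer: str) -> bool:
    # Over-permissive grader: accepts any response that mentions a number and
    # a time unit, without verifying the stated duration. This kind of
    # leniency lets "45 + 8 minutes" pass even though it never states 63.
    return bool(re.search(r"\d+", answer)) and "minute" in answer.lower()

def strict_check(answer: str) -> bool:
    # Stricter grader: requires the response to state exactly one number,
    # and that number must equal the reference duration.
    nums = re.findall(r"\d+", answer)
    return len(nums) == 1 and nums[0] == REFERENCE_MINUTES

print(lenient_check("45 + 8 minutes"))  # True  -- wrong answer accepted
print(strict_check("45 + 8 minutes"))   # False -- wrong answer rejected
print(strict_check("63 minutes"))       # True  -- correct answer accepted
```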

Spot on—most agent benchmarks reward shortcut hacks, not real reasoning. Raises a question: are we measuring capability or just prompt gymnastics? Time to rethink how we evaluate agents under real-world noise and failure.

It’s interesting how benchmarks can be gamed in ways we might not expect. Given all these issues, what do you think is the most practical step for teams currently relying on benchmarks? Looking forward to seeing the impact of your checklist on improving evaluations!

Dario Amodei has publicly expressed skepticism about the usefulness of existing benchmarks for evaluating advanced AI systems. This breakdown clearly shows why those benchmarks are fragile.

Lies have a new name: it’s “broken.”
