Tweet by Sayash Kapoor:

📣 New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. 

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9… pic.twitter.com/TvSxUsptdW