Tweet by Sayash Kapoor:

Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy.

I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth.

Peter’s thread is a simple example of the type of analysis this enables,… https://t.co/7DLMsFRHXA