Tweet by Sayash Kapoor:

Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL). 

We evaluated 9 models (including GPT-5 and Sonnet 4)… pic.twitter.com/jwS2iFG27E