Tweet by Epoch AI:

Most AI benchmarks share a common flaw: they saturate too quickly to study long-run trends.

Our solution: “stitch” many benchmarks together. This lets us compare models across a wide range of capabilities on a single unified scale.

Here’s how this works. 🧵 pic.twitter.com/d6Gvr6Ip1B