Ethan Mollick (@emollick) posted at 10:16 PM on Wed, Dec 04, 2024: "It is worth noting that this is an increasingly common message from insiders at the big AI labs. It isn't unanimous, and you absolutely don't have to believe them, but I hear the same confidence privately as they are broadcasting publicly." (Still not sure how you plan for this.) https://x.com/emollick/status/1864509024214905060?t=ZVSDTAuR3RAtnjMFEDvyhQ&s=03

From a recent Dwarkesh podcast, I think https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken:

- Most intelligence is pattern matching, and you can cover a lot of ground with associative hierarchical retrieval. (Intelligence is compression.) Maybe the tricky bit is in learning which associations to follow. Tie this into the common refrain that for long-task agents we "just" need better reliability; where does that reliability come from, and why should it be just one thing?
- The key bottleneck in advancing AI is not coming up with ideas, nor implementing them. It's intuitively selecting which ideas to try, and then evaluating imperfect data to judge what did or did not work and why. Bound by compute to run experiments, and by taste to select them.
- As architectures improve, the model becomes a better and better map of the training data, so it's all about the data: either we hit a ceiling set by the data, or we need to make our own data / transcend the data. This is how Olympiad geometry problems were solved.
- Larger models seem to be more sample efficient.
- Superposition suggests that current models may be dramatically under-parameterized for the amount of data we're training them on.

From https://dblalock.substack.com/p/2024-4-7-arxiv-roundup-dbrx-backlog (file for a later post):

People sometimes have this impression that big LLMs are like this space race where we stand around at whiteboards having breakthroughs and the company with the most talent or insights gets the best model. The reality is much less sexy than that.
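The "intelligence is compression" framing above has a concrete numeric face: under Shannon coding, a model that predicts data better encodes it in fewer bits (code length = -log2 p per symbol). A minimal sketch, with a made-up sequence and made-up model probabilities chosen purely for illustration:

```python
# Toy illustration of "intelligence is compression": a predictive model
# that has learned the data's pattern assigns it a much shorter code.
# The sequence and the two "models" below are invented for illustration.
import math

data = "ababababab"

def code_length_bits(seq, prob_of):
    """Total bits to encode seq under a predictive model prob_of(symbol)."""
    return sum(-math.log2(prob_of(ch)) for ch in seq)

# Model 1: knows nothing, spreads probability uniformly over 26 letters.
uniform = lambda ch: 1 / 26
# Model 2: has learned the data is mostly 'a'/'b' (0.45 each, rest split 0.1).
pattern = lambda ch: 0.45 if ch in "ab" else 0.1 / 24

print(code_length_bits(data, uniform))  # ~47.0 bits
print(code_length_bits(data, pattern))  # ~11.5 bits
```

The better predictor compresses the same string to roughly a quarter of the bits, which is the sense in which prediction quality and compression are the same quantity.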
We’re basically all getting the same power law scaling behavior, and it’s just a matter of:

- Parameter count
- Training data quantity / epoch count
- Training data quality
- Whether it’s an MoE model or not
- Hyperparameter tuning

The real innovation is in dataset construction and the systems you build to scale efficiently. It also takes immense skill to debug the training, with all sorts of subtleties arising from all over the stack. E.g., a lot of weird errors turn out to be symptoms of expert load imbalance and the resulting variations in memory consumption across devices and time steps. Another fun lesson was that, if your job restarts at exactly the same rate across too many nodes, you can DDOS your object store, and not only that, but do so in such a way that the vendor’s client library silently eats the error. I could list war stories like this for an hour (and I have).

Another non-obvious point about building huge LLMs is that there are a few open design/hparam choices we all pay attention to when another LLM shop releases details about their model. This is because we all assume they ablated the choice, so more labs making a given decision is evidence that it’s a good idea. E.g., seeing that MegaScale used parallel attention increased our suspicion that we could get away with it too (although so far we’ve always found it’s too large a quality hit).

It’s also an interesting datapoint around AI progress. DBRX would have been the world’s best LLM 15 months ago, and it could be much better if we had just chosen to throw more money at it. Since OpenAI had GPT-3 before we even existed as a company, this indicates that the gap between the leaders and others is narrowing (as measured in time, not necessarily model capabilities). My read is that everyone is hitting the same power law scaling, and there was just a big ramp-up time for people to build good training and serving infra initially.
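The "same power law" claim is testable in miniature: scaling laws are usually written as loss = a * compute^(-b) + irreducible, which is linear in log-log space once the irreducible term is subtracted. A sketch with entirely synthetic numbers (a, b, and the compute values are invented, not from any real run), showing how the exponent is recovered by regression:

```python
# Illustrative sketch of power-law scaling: generate synthetic
# (compute, loss) points from L(C) = a * C^(-b) + c, then recover the
# exponent b by linear regression in log-log space. All constants here
# are made up for illustration.
import numpy as np

def power_law(c, a, b, irreducible):
    """Scaling-law form: loss = a * compute^(-b) + irreducible."""
    return a * c ** (-b) + irreducible

# Hypothetical measurements from a sweep of small training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
true_a, true_b, true_c = 50.0, 0.05, 1.8
loss = power_law(compute, true_a, true_b, true_c)

# log(loss - c) = log(a) - b * log(C), so a degree-1 fit gives -b as slope
# (assuming the irreducible term c is known or estimated separately).
x = np.log(compute)
y = np.log(loss - true_c)
b_fit, log_a_fit = np.polyfit(x, y, 1)
print(f"fitted exponent: {-b_fit:.3f}")  # recovers b = 0.05
```

The point of the quoted passage is that everyone's curve has this shape; the labs differ mainly in where they sit on it (the bulleted knobs above) rather than in its form.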
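The synchronized-restart war story has a standard mitigation worth noting down: randomize each node's retry delay ("full jitter" exponential backoff) so thousands of jobs stop hammering the object store in lockstep. A minimal sketch; the function name and constants are illustrative, not from any real system:

```python
# Sketch of the usual fix for the synchronized-restart failure mode:
# each node draws an independent, jittered delay before retrying, which
# decorrelates the restart wave. Constants are illustrative.
import random

def restart_delay(attempt, base=1.0, cap=300.0):
    """'Full jitter' backoff: uniform in [0, min(cap, base * 2^attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Five nodes on their 4th retry no longer fire at the same instant.
delays = [restart_delay(attempt=4) for _ in range(5)]
```

Without the jitter, every node computes the identical deterministic delay and the whole fleet retries simultaneously, which is exactly the DDOS-your-own-object-store scenario described above.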
That’s not to say that no meaningful innovations will happen; just that they’re extremely rare (e.g., MoE) or incremental (e.g., the never-ending grind of data quality).

From https://thezvi.substack.com/p/ai-58-stargate-agi, a recent example of eking out extra performance: a new paper suggests using evolutionary methods to combine different LLMs into a mixture of experts. As Jack Clark notes, there is likely a large capabilities overhang available in techniques like this. It is obviously a good idea if you want to scale up effectiveness in exchange for higher inference costs. It will obviously work once we figure out how to do it well, allowing you to improve performance in areas of interest while minimizing degradation elsewhere, and getting ‘best of both worlds’ performance on a large scale.
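The evolutionary-combination idea can be sketched in toy form: treat the mixing coefficient between two checkpoints as a genome and evolve it against a fitness function. Everything below (the "models" as tiny weight vectors, the target, the selection scheme) is a stand-in to show the mechanism, not the referenced paper's actual method:

```python
# Toy evolutionary model merge: evolve the interpolation coefficient
# between two "parent" models so the merged model matches a target
# behavior that neither parent has alone. All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

model_a = np.array([1.0, 0.0, 0.5])   # hypothetical parent weights
model_b = np.array([0.0, 1.0, 0.5])
target = np.array([0.7, 0.3, 0.5])    # desired merged behavior

def fitness(alpha):
    """Higher is better: negative distance of the merged model to target."""
    merged = alpha * model_a + (1 - alpha) * model_b
    return -np.linalg.norm(merged - target)

# Simple evolutionary loop: keep the best 4 of 16, mutate copies of them.
population = rng.uniform(0, 1, size=16)
for _ in range(30):
    scores = np.array([fitness(a) for a in population])
    parents = population[np.argsort(scores)[-4:]]
    population = np.clip(
        np.repeat(parents, 4) + rng.normal(0, 0.05, 16), 0, 1
    )

best = max(population, key=fitness)  # converges toward alpha ≈ 0.7
```

The real versions search over per-layer or per-expert mixing weights rather than a single scalar, which is where the "improve areas of interest while minimizing degradation elsewhere" payoff comes from.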