Architecture Importance

Debate over whether transformer architecture components are essential or merely convenient tradeoffs, and whether removing specific tricks would significantly impact performance

While some argue that neural-network architecture is merely a set of tradeoffs for coping with limited data and compute, others contend that specific structural "tricks" are fundamental to learning and cannot easily be discarded. Commenters attribute the success of current models to biases that arise from the interaction of architecture and optimizer, and to neural scaling laws induced by the multiscale nature of the data; one notes that an architecture with catastrophic forgetting built in cannot be scaled at all. There is also interest in revisiting older memory-centric models such as DNCs to see whether modern attention mechanisms can improve their read/write gating. Overall, most commenters here treat architecture as a critical factor in whether a model converges under gradient descent and scales with data, not as a temporary convenience.

5 comments tagged with this topic

In my opinion, current research should focus on revisiting older concepts to figure out whether they can be applied to transformers. Transformers are superior "database" encodings, as the hype about LLMs points out, but there have been promising ML models that focused on memory components for their niche use cases. Those could be promising concepts if we could make them work with attention matrices and/or use the frequency projection idea on their neuron weights. The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims about the DNC's memory-related parts. Back then, the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now. [1] A fairly good implementation I found: https://github.com/joergfranke/ADNC
No, it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If this were true, then models with "infinitely many" parameters would be amazing. What about just training a gigantic two-layer network? There is a huge amount of work that goes into engineering training procedures that work well. The actual reason is the complex biases that arise from the interaction of network architectures and optimizers, and that persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
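For readers unfamiliar with the term, a "neural scaling law" is usually written as a power law in model size (or, analogously, dataset size). The form below is a common textbook parameterization, not something stated in the comment; the symbols $L_\infty$ (irreducible loss), $N_c$ (a fitted constant), and $\alpha$ (the scaling exponent) are notation assumed here for illustration.

$$
L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha}, \qquad \alpha > 0
$$

Under this reading, the commenter's claim is that the exponent $\alpha$ is not a property of parameter count alone but depends on the data and on the architecture-optimizer pair.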
> but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

I feel like you are downplaying the importance of architecture. I never read The Bitter Lesson, but I have always heard of it more as a comment on embedding knowledge into models instead of making them just scale with data. We know algorithmic improvement is very important for scaling NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith... ). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs; some architectures are simply worse in all respects. What I agree with is just that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing with the bitter lesson makes less sense.
> I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?
I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically, the current architecture is the one that converges best when trained by gradient descent. Potentially, a different form might be possible, and even beneficial, once the core learning task is completed. Also, the requirements of iterated and continual learning might lead to a completely different approach.
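The "remove a trick and see what happens" question raised in the thread can be made concrete as an ablation harness. The following is a minimal sketch, not anything proposed by the commenters: it assumes a single-head self-attention block in NumPy with random, untrained weights, and the flag names (`use_residual`, `use_layernorm`, `use_scaling`) and helpers (`init_params`, `run_stack`) are hypothetical. It only exercises the forward pass; answering the question properly would require training each variant and comparing convergence.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_block(x, params, use_residual=True, use_layernorm=True,
                    use_scaling=True):
    """Single-head self-attention with switchable 'tricks' (hypothetical flags)."""
    wq, wk, wv, wo = params
    h = layer_norm(x) if use_layernorm else x      # pre-norm, if enabled
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T
    if use_scaling:
        scores = scores / np.sqrt(q.shape[-1])     # 1/sqrt(d) scaling
    out = softmax(scores) @ v @ wo
    return x + out if use_residual else out        # skip connection, if enabled

def init_params(d_model, seed=0):
    # Random (untrained) projection matrices, for illustration only.
    rng = np.random.default_rng(seed)
    return tuple(rng.normal(0.0, d_model ** -0.5, size=(d_model, d_model))
                 for _ in range(4))

def run_stack(x, depth, **flags):
    # Apply `depth` blocks, each with its own weights, and return the result.
    for layer in range(depth):
        x = attention_block(x, init_params(x.shape[-1], seed=layer), **flags)
    return x

x0 = np.random.default_rng(42).normal(size=(8, 16))  # 8 tokens, width 16
full = run_stack(x0, depth=6)
no_residual = run_stack(x0, depth=6, use_residual=False)
no_layernorm = run_stack(x0, depth=6, use_layernorm=False)
```

Each flag isolates one design choice, so variants can be compared pairwise against `full`. In a real study the same switches would be wired into a trainable model and the comparison made on loss curves, not on raw activations.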