Transformer Architecture Origins

Historical context about attention mechanisms developing from RNN limitations and linguistic insights about parallel hierarchical sentence structure

The Transformer’s development was primarily catalyzed by the limitations of sequential RNNs, which were poorly suited for the massive parallel processing power of modern GPUs. This shift was fueled by the linguistic insight that language is inherently hierarchical rather than just linear, allowing researchers to replace slow recurrence with "attention" mechanisms that mirror the parallel structure of sentence parse trees. While some argue that this architecture could have theoretically emerged decades earlier if not for historical lulls in AI funding, others believe the future of the field lies in revisiting older, memory-centric models to see if their unique gating mechanisms can be improved by modern attention matrices.

View on HN · Topics

In my opinion current research should focus on revisiting older concepts to figure out if they can be applied to transformers.

Transformers are superior "database" encodings, as the hype about LLMs points out, but there have been promising ML models that focused on memory for their niche use cases. These could be promising concepts if we could make them work with attention matrices and/or apply the frequency projection idea to their neuron weights.

The way RNNs evolved to LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims in the memory-related parts of the DNC. Back then the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now.

[1] a fairly good implementation I found: https://github.com/joergfranke/ADNC
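
For readers who have not looked at the memory side of these models, here is a minimal sketch of content-based reading as used in NTM/DNC-style memories: the read weights are a softmax over cosine similarities between a key and each memory row, which is structurally close to an attention lookup. This is a simplified illustration, not DeepMind's implementation and not the ADNC code linked above; the function name `content_read`, the `beta` sharpening parameter name, and the toy memory are assumptions made for the example.

```python
import numpy as np

def content_read(memory, key, beta=1.0):
    """Read from memory by cosine similarity to key (hypothetical helper).

    memory: (slots, width) array, one row per memory slot
    key:    (width,) lookup vector
    beta:   sharpening strength; larger values focus on the best match
    """
    eps = 1e-8  # avoids division by zero for all-zero rows
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    logits = beta * sim
    logits = logits - logits.max()      # numerical stability for the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()   # read weights sum to 1 over slots
    return weights @ memory, weights    # weighted sum of memory rows

# Toy memory with three slots; the key is closest to the first slot.
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
key = np.array([0.9, 0.1, 0.0])
read_vec, read_w = content_read(memory, key, beta=10.0)
```

Replacing the cosine similarity with a scaled dot product between learned query and key projections would turn this read into the attention form used in Transformers, which is one way to interpret the "better gates" idea in the comment.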

View on HN · Topics

Without fast parallel hardware there would have been neither the incentive to design the Transformer nor much benefit even if someone had come up with the design all the same!

The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPTT, backpropagation through time). They wanted a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and that (since AlexNet) was being used to good effect for other types of model.

As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uszkoreit, who realized that language, while superficially appearing sequential (hence a good match for RNNs), was in fact really parallel + hierarchical. This can be seen in linguists' sentence parse trees, where different branches of the tree reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchical parse tree. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, with the parallel processing being the whole point since this could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.

Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!

The other point is that while the Transformer is a very powerful, general purpose and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power to scale it up to massive size existed, then it would likely not have appeared so promising/interesting.

The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".

The time was right for the Transformer to appear when it did: it was designed to take advantage of recent GPU advances and built on top of this new attention architecture, and with the compute power and dataset size now available it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
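
To make the sequential-versus-parallel contrast in this comment concrete, here is a minimal NumPy sketch, assuming a plain tanh RNN and a single unmasked attention head. The function names, weight shapes, and random inputs are illustrative assumptions, not code from the Transformer paper. The RNN has to walk the sequence one step at a time because each hidden state depends on the previous one, while self-attention processes every position with a handful of matrix products.

```python
import numpy as np

def rnn_forward(x, W_x, W_h):
    """Plain tanh RNN: step t cannot start until step t-1 has finished."""
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(x.shape[0]):            # inherently sequential loop over time
        h = np.tanh(x[t] @ W_x + h @ W_h)  # depends on the previous hidden state
        states.append(h)
    return np.stack(states)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, no masking."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # all position pairs at once
    scores = scores - scores.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
W_x, W_h, W_q, W_k, W_v = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(5))

rnn_out = rnn_forward(x, W_x, W_h)
attn_out, attn_w = self_attention(x, W_q, W_k, W_v)
```

Both functions return one vector per position, but only the attention version exposes the whole sequence to the hardware as large matrix multiplications, which is the kind of workload GPUs accelerate well.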

View on HN · Topics

the concept of a transformer could have been used on much slower hardware much earlier.

It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.
