Historical Development Timeline

Discussion of the field's evolution from AlexNet in 2012 through transformers in 2017, including the role of ImageNet, GPU hardware improvements, and the transition from RNNs to attention mechanisms

The modern AI revolution was sparked by AlexNet’s 2012 victory, which proved that massive datasets like ImageNet paired with GPU-accelerated networks could decisively outperform traditional hand-crafted algorithms. This inflection point validated the "bitter lesson" that raw computational scale often triumphs over intricate model design, eventually forcing a transition from sequential RNNs to parallelizable Transformers built specifically to exploit modern hardware. Although historical skepticism and anemic hardware delayed this evolution for decades, many contributors argue that the "Attention Is All You Need" era was a necessary response to the hierarchical nature of language and the demand for better scalability. Looking forward, current discourse suggests the next breakthrough may involve revisiting discarded memory-focused concepts from the field's past to overcome the limitations of current attention mechanisms.

In my opinion current research should focus on revisiting older concepts to figure out if they can be applied to transformers.

Transformers are superior "database" encodings, as the hype about LLMs points out, but there have been promising ML models that focused on memory for their niche use cases. Those could be promising concepts if we could make them work with attention matrices and/or apply the frequency projection idea to their neuron weights.

The way RNNs evolved into LSTMs, GRUs, and eventually DNCs [1] was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims for the memory-related parts of the DNC. Back then the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now.

[1] a fairly good implementation I found: https://github.com/joergfranke/ADNC

Here's where I'm missing understanding: for decades the idea of neural networks had existed with minimal attention. Then in 2017 "Attention Is All You Need" was released, and since then there has been an exponential explosion in deep learning. I understand that deep learning is accelerated by GPUs, but the concept of a transformer could have been used on much slower hardware much earlier.

The inflection point was 2012, when AlexNet [0], a deep convolutional neural net, achieved a step-change improvement in the ImageNet classification competition.

After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.

The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beat better models that don’t scale.

Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs tens of millions of neurons to do what a mouse does.

Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)

On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tend to imply a complex environment, and complexity by its nature requires a large number of parameters to model.

[0] https://en.wikipedia.org/wiki/AlexNet

[1] https://en.wikipedia.org/wiki/Bitter_lesson

Thanks for posting a thorough and accurate summary of the historical picture. I think it is important to know the past trajectory to extrapolate to the future correctly.

For a bit more context: before 2012, most approaches were based on hand-crafted features + SVMs, which achieved state-of-the-art performance on academic competitions such as Pascal VOC, and neural nets were not competitive on the surface. Around 2010, Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate nearly in half in 2012, leading major labs to switch to deeper neural nets. The success seems to be a combination of a large enough dataset + GPUs to make training time reasonable. The architecture is a scaled-up version of Yann LeCun's ConvNets, which ties into the bitter lesson that scaling is more important than complexity.

Sigmoid-type activation functions were popular, probably for the bounded activity and some measure of analogy to biological neuron responses. They work, but the gradient feedback scales poorly outside their most dynamic span.

My understanding of the development is that persistent layer-wise pretraining with RBMs or autoencoders created an initialization state from which the optimization could cope even with more layers. Once it was proven that this could work, analysis of why led to changes such as new initialization heuristics, rectified linear activations, and eventually normalizations, so that the pretraining was usually not needed any more.

One finding was that supervised training with the old arrangement often does work on its own, if you let it run much longer than people could reasonably afford to wait, purely on speculation and contrary to what they observed in CPU computations in the 1980s through the 2000s. It has to work its way to a reasonably optimizable state through a chain of poorly scaled gradients first, though.
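
To make the gradient-scaling point above concrete, here is a minimal sketch in plain Python. It is a toy, not anyone's actual training setup: it assumes every layer sees the same pre-activation value and that all weights are 1, so the backpropagated signal is just the local derivative raised to the depth. The helper name `chained_gradient` is made up for this illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; peaks at 0.25 when x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activation, depth):
    """Hypothetical helper: product of `depth` identical local derivatives.

    Assumes unit weights and the same pre-activation at every layer,
    which is a simplification made only for illustration.
    """
    return local_grad(pre_activation) ** depth


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        best_case = chained_gradient(sigmoid_grad, 0.0, depth)  # most dynamic point
        saturated = chained_gradient(sigmoid_grad, 5.0, depth)  # outside dynamic span
        relu = chained_gradient(relu_grad, 5.0, depth)
        print(depth, best_case, saturated, relu)
```

Even at the sigmoid's most responsive point the signal shrinks by a factor of four per layer under these assumptions, and it shrinks much faster once units saturate, while the rectified linear unit passes the signal through unchanged for positive inputs.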

A much earlier major win for deep learning was AlexNet for image recognition in 2012. It dominated the competition, and within a couple of years deep learning was effectively the only way to do image tasks. I think it was Jeremy Howard who wrote a paper around 2017 wondering when we’d get a transfer learning approach that worked as well for NLP as convnets did for images. The attention paper that year didn’t immediately dominate. The hardware wasn’t good enough and there wasn’t a consensus that scale would solve everything. It took like five more years before GPT-3 took off and started this current wave.

I also think you might be discounting exactly how much compute is used to train these monsters. A single 1 GHz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs, training GPT-3-size models takes a couple of months. Given the anemic RAM on GPUs a decade ago (I think we had K80 GPUs with 12 GB vs. roughly 80 to 141 GB on H100/H200 today), it was actually completely impossible to train a large transformer model prior to the early 2020s.

I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
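
The single-processor estimate above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on two assumptions that are not in the comment: a total training budget of about 3.14e23 floating-point operations (a commonly cited estimate for the largest GPT-3 model) and a 1 GHz processor that completes exactly one floating-point operation per cycle. Both numbers and the function name `training_years` are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Assumption: commonly cited estimate of total training compute for GPT-3 (175B).
ASSUMED_TRAINING_FLOPS = 3.14e23

# Assumption: a 1 GHz core retiring one floating-point operation per cycle.
ASSUMED_CPU_FLOPS_PER_SECOND = 1e9


def training_years(total_flops, flops_per_second):
    """Wall-clock years needed to execute `total_flops` at a sustained rate."""
    return total_flops / flops_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = training_years(ASSUMED_TRAINING_FLOPS, ASSUMED_CPU_FLOPS_PER_SECOND)
    print(f"{years:.3g} years on a single core")
```

Under these assumptions the answer comes out on the order of 10^7 years. A core that sustains less than one operation per cycle, which is plausible once memory traffic is counted, pushes the figure toward the one quoted in the comment, so the two are consistent to within the precision of this kind of estimate.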

As others pointed out, the explosion of interest started with the deep convolutional networks that were applied to image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as uninteresting. When I took a course on them around the year 2000, that was the attitude most people took. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with the subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple of years earlier - I began to see more talks at workshops about NNs that weren’t being dismissed anymore, probably starting around 2008/09.

This video gives a great overview of the history of the acceleration:

https://youtu.be/glWvwvhZkQ8?si=-HGtfd_KHYfatEQ

Although it’s focused on Ilya, some great history is covered.

Without fast parallel hardware there would have been neither the incentive to design the Transformer, nor much benefit even if someone had come up with the design all the same!

The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPTT), and who wanted to come up with a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was being used to good effect for other types of model.

As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uszkoreit, who realized that language, while superficially appearing sequential (hence a good match for RNNs), was in fact really parallel + hierarchical, as can be seen in linguists' sentence parse trees, where different branches of the tree reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchical parse tree. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, with the parallel processing being the whole point since this could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.

Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!

The other point is that while the Transformer is a very powerful, general-purpose, and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power to scale it up to massive size existed, then it would likely not have appeared so promising/interesting.

The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".

The time was right for the Transformer to appear when it did: it was designed to take advantage of recent GPU advances and built on top of the new attention mechanism, and with the compute power and dataset size now available it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
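
The contrast drawn above between recurrence and attention can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the architecture from the paper: queries, keys, and values are the raw inputs (no learned projections, no multiple heads), and the recurrent cell uses two made-up scalar weights. The point it illustrates is the data dependency: each attention output depends only on the inputs, so positions can be computed in any order or all at once, whereas each recurrent state needs the previous one.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(i, queries, keys, values):
    """Scaled dot-product attention output for position i."""
    d = len(queries[i])
    scores = [
        sum(q * k for q, k in zip(queries[i], key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    width = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]


def self_attention(xs, order=None):
    """Toy self-attention: inputs double as queries, keys, and values.

    `order` is any visiting order over positions; the result does not
    depend on it, which is what makes the computation parallelizable.
    """
    positions = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for i in positions:
        out[i] = attend(i, xs, xs, xs)
    return out


def rnn_states(xs, w_in=0.5, w_rec=0.9):
    """Toy recurrent cell with assumed scalar weights.

    State t cannot be computed before state t - 1, so the loop is
    inherently sequential.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    forward = self_attention(tokens)
    backward = self_attention(tokens, order=reversed(range(len(tokens))))
    print(forward == backward)  # attention outputs are order-independent
    print(rnn_states(tokens)[-1])
```

On a GPU the loop over positions in `self_attention` collapses into a few matrix multiplications, while the loop in `rnn_states` stays a loop over time steps during training.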

The modern neural net revival got kicked off long before 2017.

AlexNet in 2012 was only 5 years earlier.

If you are in the radiology field it started “exploding” much earlier, with CNNs.

> the concept of a transformer could have been used on much slower hardware much earlier.

It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.

Haven't read the page but a promising-looking search result is here: https://seantrott.substack.com/p/perceptrons-xor-and-the-fir...

I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.

I suspect all Minsky did was reinforce what many people were already thinking. I experimented with neural nets in the late 80s and they seemed super interesting, but also very limited. My sense at the time was that the general thinking was, they might be useful if you could approach the number of neurons and connections in the human brain, but that seemed like a very far off, effectively impossible goal at the time.
