Pre-GPU Neural Network History

How neural networks were dismissed before 2012 due to training difficulties, with kernel methods and SVMs being preferred for their tractability

Before the 2012 breakthrough, neural networks were often sidelined in favor of kernel methods and SVMs, which offered easily obtainable global minima and strong performance on the smaller, hand-labeled datasets of the era. Many researchers found early networks "fiddly" and slow to train, a sentiment reinforced by long-standing skepticism following Marvin Minsky's influential arguments about the limitations of simple perceptrons. The field later found that hurdles like vanishing gradients could be eased by moving from sigmoid activations to rectified linear units, better initialization heuristics, and normalization. Ultimately, the shift to deep learning was driven by the realization that large datasets like ImageNet paired with GPU acceleration could outperform hand-crafted features, in line with the "bitter lesson" that scale matters more than complexity, turning what were once considered "toy" models into the dominant force in AI.

View on HN · Topics

Thanks for posting a thorough and accurate summary of the historical picture. I think it is important to know the past trajectory in order to extrapolate to the future correctly.

For a bit more context: Before 2012, most approaches were based on hand-crafted features + SVMs, which achieved state-of-the-art performance on academic competitions such as Pascal VOC, and neural nets were not competitive on the surface. Around 2010, Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate by nearly half in 2012, leading major labs to switch to deeper neural nets. The success seems to be a combination of a large enough dataset + GPUs to make training time reasonable. The architecture is a scaled-up version of Yann LeCun's ConvNets, tying in to the bitter lesson that scaling is more important than complexity.

View on HN · Topics

> The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

I'd thought it was some issue with training where older math didn't play nice with having too many layers.

View on HN · Topics

Sigmoid-type activation functions were popular, probably for their bounded output and some measure of analogy to biological neuron responses. They work, but the gradient they pass back becomes badly scaled (close to zero) outside the input range where the function is most responsive.

My understanding of the development is that persistent layer-wise pretraining with RBMs or autoencoders created an initialization state where the optimization could cope even with more layers. Then, once it was proven that it could work, analysis of why led to some changes such as new initialization heuristics, rectified linear activation, eventually normalizations ... so that the pretraining was usually not needed any more.

One finding was that supervised training with the old arrangement often does work on its own, if you let it run much longer than people could reasonably afford to wait around for just on speculation. This is contrary to what was observed in CPU computations in the 80s--00s. It first has to work its way to a reasonably optimizable state through a chain of poorly scaled gradients, though.
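To make the "poorly scaled gradients" point concrete, here is a minimal sketch in plain Python. It is a simplification, not how any real network was analyzed: it ignores the weights entirely and just multiplies the same local activation derivative across a hypothetical stack of 10 layers. The function names and the depth are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights are ignored)."""
    return local_grad ** depth


DEPTH = 10  # illustrative depth, not taken from the thread

# Best case for the sigmoid: every unit sits exactly at its most responsive point.
best_case_sigmoid = chained_gradient(sigmoid_grad(0.0), DEPTH)

# Saturated case: every unit sits well outside the responsive range.
saturated_sigmoid = chained_gradient(sigmoid_grad(5.0), DEPTH)

# An active ReLU passes the gradient through unchanged.
active_relu = chained_gradient(relu_grad(5.0), DEPTH)

if __name__ == "__main__":
    print(f"sigmoid, best case : {best_case_sigmoid:.3e}")
    print(f"sigmoid, saturated : {saturated_sigmoid:.3e}")
    print(f"relu, active       : {active_relu:.3e}")
```

Even in the sigmoid's best case the activation term alone shrinks the signal by a factor of 0.25 per layer, so early layers learn very slowly. Real networks also multiply by the weights at each layer, which can make things better or worse, so treat this as intuition only.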

View on HN · Topics

> NNs were largely dismissed

I agree with your larger point, but "dismissed" is rather too strong. They were considered fiddly to train, prone to local minima, and slow to train, with no clear guidelines about what the number of hidden layers and number of nodes ought to be. But for homework (toy) exercises they were still ok.

In comparison, kernel methods gave a better experience overall for large but not super large data sets. Most models had an easily obtainable global minimum. Fewer moving parts and very good performance.

It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple -- (i) they cannot take advantage of more data after a point and start twiddling the tenth decimal place of some parameters and (ii) they are expensive to train for very large data sets. So a bit of a double whammy. Well, there was a third: no hardware acceleration that can compare with GPUs.
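A rough sketch of why point (ii) bites: an exact kernel machine typically works with the full Gram matrix, which has one entry per pair of training points. The RBF kernel, the `gamma` value, and the 8-bytes-per-entry figure below are illustrative assumptions, and approximate methods that avoid the full matrix do exist.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points: Sequence[Sequence[float]], gamma: float = 1.0) -> List[List[float]]:
    """Dense n x n matrix of pairwise kernel values."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


def gram_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Memory needed to hold a dense n x n Gram matrix of float64 values."""
    return n_samples * n_samples * bytes_per_entry


if __name__ == "__main__":
    # Tiny example: three 2-D points give a 3 x 3 matrix.
    small = gram_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
    print(len(small), "x", len(small[0]))

    # Storage grows quadratically with the number of training points.
    for n in (10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}: {gram_matrix_bytes(n) / 1e9:,.1f} GB")
```

At ten thousand points the matrix is under a gigabyte; at a million points it is about 8 terabytes before any solver has run, which is the scaling wall described above.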

Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user-friendly way to increase their modeling capacity. We had a few ways of doing just that, but they weren't great. We need a breakthrough to scale them to GPT-sized data sets.

In a way, DNNs are "design your own kernels using data", whereas kernels came in any color you liked provided it was black (yes, there were many types, but it was still a fairly limited catalogue). The killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines.
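For readers unfamiliar with what "composing kernels" looks like, here is a small sketch of the two textbook closure rules: the sum and the product of two valid kernels are again valid kernels. The helper names (`add`, `mul`, `rbf`) and the particular combination are hypothetical, and this does not address the comment's actual complaint, which is that such hand-built combinations did not yield efficiently trainable machines with much more capacity.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear(x: Vector, y: Vector) -> float:
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def rbf(gamma: float) -> Kernel:
    """Build an RBF kernel with a fixed width parameter."""
    def kernel(x: Vector, y: Vector) -> float:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-gamma * squared_distance)
    return kernel


def add(k1: Kernel, k2: Kernel) -> Kernel:
    """Sum of two kernels (still a valid kernel)."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1: Kernel, k2: Kernel) -> Kernel:
    """Pointwise product of two kernels (still a valid kernel)."""
    return lambda x, y: k1(x, y) * k2(x, y)


# A hand-picked combination: the structure is fixed by the designer, not learned.
composite: Kernel = add(linear, mul(rbf(0.5), linear))

if __name__ == "__main__":
    print(composite((1.0, 0.0), (1.0, 0.0)))
    print(composite((1.0, 2.0), (3.0, 4.0)))
```

The contrast with a deep network is that here a person chooses the combination up front, while a network in effect learns its feature map from the data.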

View on HN · Topics

> the concept of a transformer could have been used on much slower hardware much earlier.

It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.

View on HN · Topics

> Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting

What result are you referring to?

View on HN · Topics

Haven't read the page but a promising-looking search result is here: https://seantrott.substack.com/p/perceptrons-xor-and-the-fir...

I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.

View on HN · Topics

I suspect all Minsky did was reinforce what many people were already thinking. I experimented with neural nets in the late 80s and they seemed super interesting, but also very limited. My sense at the time was that the general thinking was, they might be useful if you could approach the number of neurons and connections in the human brain, but that seemed like a very far off, effectively impossible goal at the time.
