How neural networks were dismissed before 2012 due to training difficulties, with kernel methods and SVMs being preferred for their tractability
Before the 2012 breakthrough, neural networks were often sidelined in favor of kernel methods and SVMs, whose convex training objectives guaranteed a global minimum and which performed better on the smaller, hand-labeled datasets of the era. Many researchers found early networks "fiddly" and computationally prohibitive, a sentiment reinforced by lingering skepticism from Marvin Minsky and Seymour Papert’s influential proofs of the limitations of single-layer perceptrons. However, the field eventually found that technical hurdles like vanishing gradients could be overcome by replacing saturating sigmoid activations with alternatives such as ReLU, adopting more efficient architectures, and leveraging the "bitter lesson" of scale. Ultimately, the shift to deep learning was fueled by the realization that massive datasets like ImageNet, paired with GPU acceleration, let learned features outperform hand-crafted ones, turning what were once considered "toy" models into the dominant force in AI.
8 comments tagged with this topic