Debate over why gradient descent works effectively for neural networks despite billions of local minima, including arguments about high-dimensional spaces making local minima statistically rare
Gradient descent can seem as intuitive as walking downhill, yet a neural network's loss surface may contain more local minima than there are atoms in the universe, which makes its success a real theoretical puzzle. Some commenters argue that such traps are rare in high-dimensional spaces: progress halts only if every gradient component reaches zero simultaneously, so a single non-zero component leaves a path of escape. Furthermore, the noise in stochastic descent makes it behave like a "biased random walk" that can leap over shallow wells and becomes trapped only by substantial, high-quality minima that dominate their neighborhoods. Despite counter-arguments based on correlations between parameters, this mechanism remains remarkably effective compared to alternative optimization methods such as simulated annealing.
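The rarity argument above can be illustrated with a toy experiment. This is a minimal sketch, not a model of any real network: it assumes the curvature (Hessian) at a critical point looks like a random symmetric matrix with independent Gaussian entries, and it counts how often every direction curves upward, which is what a true local minimum requires. The function name `fraction_minima` and all parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def fraction_minima(dim, trials=2000, seed=0):
    """Estimate how often a random symmetric matrix is positive definite.

    Assumption (not from the discussion): the Hessian at a critical point
    is modelled as a symmetric matrix with independent Gaussian entries.
    """
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2.0  # symmetrize so eigenvalues are real
        # A local minimum needs every eigenvalue (curvature) to be positive.
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials


if __name__ == "__main__":
    for dim in (1, 2, 4, 8):
        print(dim, fraction_minima(dim))
```

Under this assumption the fraction of critical points that are minima shrinks quickly as the dimension grows, while the rest have at least one downhill direction. The independence built into the random matrix is arguably what the parameter-correlation counter-argument disputes, so the sketch shows the shape of the argument rather than settling it.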
10 comments tagged with this topic