Summarizer

Gradient Descent Mystery

Debate over why gradient descent works effectively for neural networks despite billions of local minima, including arguments about high-dimensional spaces making local minima statistically rare

← Back to There Will Be a Scientific Theory of Deep Learning

While the success of gradient descent may seem as intuitive as walking downhill, the sheer volume of local minima in neural networks—potentially exceeding the number of atoms in the universe—presents a significant theoretical puzzle. Some argue that these traps are actually rare in high-dimensional spaces because every single gradient component would have to hit zero simultaneously to halt progress, allowing a single non-zero path to facilitate an escape. Furthermore, the noisy, stochastic nature of the descent acts as a "biased random walk" that can leap over shallow wells, only becoming trapped by substantial, high-quality minima that dominate their neighborhoods. Despite counter-arguments regarding parameter correlation, this mechanism remains remarkably effective compared to alternative optimization methods like simulated annealing.

10 comments tagged with this topic

View on HN · Topics
https://en.wikipedia.org/wiki/Universal_approximation_theore... the better question is why does gradient descent work for them
View on HN · Topics
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
View on HN · Topics
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there's a lot of things one can do to help ensure it works.
View on HN · Topics
A funny thing is, in very high-dimensional space, like millions and billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this, to be stuck in a local minima in 2D, you only need 2 gradient components to be zero, in higher dimension, you'd need every single one of them, millions up millions of them, to be all zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random, but rather noisy, the chance that you somehow walk into a local minima is very very low, unless that is a "really good" local minima, in a sense that it dominates all other local minimas in its neighborhood.
View on HN · Topics
>you'd need every single one of them, millions up millions of them, to be all zero If they were all correlated with each other that does not seem far fetched.
View on HN · Topics
Not a mathematician so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive amount of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need to have a “well” at least be as large as your step size to reasonably capture your descent. E.g. you could land perfectly on a local minima but you won’t stay the unless your step size was minute or the minima was quite substantial.
View on HN · Topics
I believe what was meant was that assuming local minima of a sufficient size to capture your probe, given a sufficiently high density of those, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.
View on HN · Topics
Wait a min. Does this paper says we don't know how back-propogation works?
View on HN · Topics
No
View on HN · Topics
Sane & interesting enough to have been disproven, by Boaz Barak iirc. Maybe not surprising since simulated annealing never achieved the results of gradient descent + backprop.