High-Dimensional Optimization

Explanation of why getting stuck in local minima is unlikely in million-parameter spaces, since only one non-zero gradient component is needed to escape

In high-dimensional optimization, local minima are often considered statistically improbable because an escape path exists as long as even one out of millions of gradient components remains non-zero. While stochastic gradient descent acts as a noisy walk that effectively bypasses most shallow traps, some argue that this advantage could be neutralized if gradient components are highly correlated or if the density of minima is sufficiently high. Ultimately, the success of modern deep learning may rely less on finding a single perfect point and more on the inherent redundancy of the solution space and the evolution of sophisticated normalization techniques.

View on HN · Topics

A funny thing is, in very high-dimensional space, like millions and billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this, to be stuck in a local minima in 2D, you only need 2 gradient components to be zero, in higher dimension, you'd need every single one of them, millions up millions of them, to be all zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random, but rather noisy, the chance that you somehow walk into a local minima is very very low, unless that is a "really good" local minima, in a sense that it dominates all other local minimas in its neighborhood.

View on HN · Topics

>you'd need every single one of them, millions up millions of them, to be all zero

If they were all correlated with each other that does not seem far fetched.

View on HN · Topics

I believe what was meant was that assuming local minima of a sufficient size to capture your probe, given a sufficiently high density of those, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.

View on HN · Topics

Deep-learning hinges on highly redundant solution space (highly redundant weights), along with normalized weights (optimization methodology is commoditized). The original neural network work had no such concepts.

Summarizer