Explanation of why getting stuck in local minima is unlikely in million-parameter spaces, since only one non-zero gradient component is needed to escape
← Back to There Will Be a Scientific Theory of Deep Learning
In high-dimensional optimization, local minima are often considered statistically improbable because an escape path exists as long as even one out of millions of gradient components remains non-zero. While stochastic gradient descent acts as a noisy walk that effectively bypasses most shallow traps, some argue that this advantage could be neutralized if gradient components are highly correlated or if the density of minima is sufficiently high. Ultimately, the success of modern deep learning may rely less on finding a single perfect point and more on the inherent redundancy of the solution space and the evolution of sophisticated normalization techniques.
4 comments tagged with this topic