Summarizer

Universal Approximation Limitations

Discussion of why the universal approximation theorem is necessary but not sufficient to explain neural network performance, noting that SVMs and other models share this property, making it insufficient to distinguish neural network superiority


While the universal approximation theorem establishes a necessary foundation for neural networks, it fails to explain their practical superiority because many other models, such as SVMs and gradient boosting, share the same theoretical capability. Commenters argue that the true "secret sauce" lies elsewhere, suggesting that implicit regularization, complex biases within optimization, and massive parameter scaling are the real drivers of modern performance. This debate highlights a significant gap between academic theories regarding scaling laws and a more pragmatic view that attributes success to sheer computational power and empirical refinement. Ultimately, universal approximation is seen as a mere baseline for computability that offers little insight into why specific architectures excel where traditional models falter.
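For reference, one common single-hidden-layer form of the theorem under discussion can be written as follows. This is a sketch of the standard statement, not something quoted from the thread; the symbols $K$, $f$, $\sigma$, $N$, $a_i$, $w_i$, and $b_i$ are introduced here for illustration.

```latex
% Assumptions: K \subset \mathbb{R}^n is compact, f : K \to \mathbb{R} is continuous,
% and \sigma : \mathbb{R} \to \mathbb{R} is continuous and not a polynomial.
\forall \varepsilon > 0 \;\; \exists N \in \mathbb{N},\;
a_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^n \;\; \text{such that} \quad
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
```

The statement is purely about existence: it puts no bound on $N$ and says nothing about whether training will find the weights, which is the gap the comments below focus on.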

10 comments tagged with this topic

View on HN · Topics
As someone who works in the area, this provides a decent summary of the most popular research items. The most useful and impressive part is the set of open problems at the end, which just about covers all of the main research directions in the field. The skepticism I'm seeing in the comments really highlights how little of this work is trickling down to the public, which is very sad to see. While the theory can so far offer few mathematical mechanisms for inferring optimal network design (mostly because just trying stuff empirically is often faster than going through the theory, so it is more common to infer things retroactively), the question "why do neural networks work better than other models?" is getting pretty close to a solid answer. Problem is, that was never the question people seem to have really been interested in, so the field now has to figure out what questions to ask next.
View on HN · Topics
Do neural networks work better than other models? They can definitely model a wider class of problems than traditional ML models (images being the canonical example). However, I thought that where a like-for-like comparison was possible, they tend to do worse than gradient boosting.
View on HN · Topics
"why do neural networks work better than other models?" That sounds really interesting - any references (for a non-specialist)?
View on HN · Topics
https://en.wikipedia.org/wiki/Universal_approximation_theore... The better question is why gradient descent works for them.
View on HN · Topics
The properties that the universal approximation theorem proves are not unique to neural networks. Any models using an infinite-dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient boosted decision trees, etc., have the same property (though proven via a different theorem, of course). So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
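To make the point above concrete, here is a minimal sketch of a non-neural model approximating a smooth function: RBF kernel ridge regression fitted to sin(x) on [0, 2π]. It uses only the Python standard library. Every function name, the kernel width `gamma`, and the `ridge` term are illustrative assumptions rather than anything taken from the thread, and the example only illustrates approximation on one toy target; it proves nothing general.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars; gamma is an assumed width."""
    return math.exp(-gamma * (a - b) ** 2)

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_kernel_ridge(xs, ys, gamma=1.0, ridge=1e-6):
    """Return dual coefficients alpha solving (K + ridge * I) alpha = y."""
    gram = [
        [rbf_kernel(a, b, gamma) + (ridge if i == j else 0.0)
         for j, b in enumerate(xs)]
        for i, a in enumerate(xs)
    ]
    return solve_linear_system(gram, ys)

def predict(xs_train, alphas, x, gamma=1.0):
    """Evaluate the fitted function: sum_i alpha_i * k(x_i, x)."""
    return sum(a * rbf_kernel(xt, x, gamma) for a, xt in zip(alphas, xs_train))

def max_abs_error(n_train, gamma=1.0):
    """Worst-case error against sin(x) on a dense grid over [0, 2*pi]."""
    xs = [2 * math.pi * i / (n_train - 1) for i in range(n_train)]
    ys = [math.sin(x) for x in xs]
    alphas = fit_kernel_ridge(xs, ys, gamma)
    grid = [2 * math.pi * i / 199 for i in range(200)]
    return max(abs(predict(xs, alphas, x, gamma) - math.sin(x)) for x in grid)

if __name__ == "__main__":
    for n in (5, 25):
        print(n, "training points -> max error", max_abs_error(n))
```

Running the script prints the worst-case error for 5 and for 25 training points. The error shrinking as basis functions are added is the same kind of approximation guarantee that the theorem gives for neural networks, which is why the guarantee alone cannot separate the two model classes.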
View on HN · Topics
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
View on HN · Topics
Universal approximation is like saying that a problem is computable. Sure, that gives some relief, but it says nothing in practice, unlike, for example, which side of the P/NP divide the problem is on.
View on HN · Topics
> why do neural networks work better than other models
The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters.
View on HN · Topics
> The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and works just as well for neural networks. The key thing here is approximation.
View on HN · Topics
That isn't what they are saying at all, lol.