- tags
- Machine learning

## Two-layers neural network

Mathematically, a simple two-layers neural network with relu non-linearities can be written like below. For an input vector \(x \in \mathbb{R}^D\), \(\mathbf{a} = (a_1, \cdots, a_N)\in \mathbb{R}^M\) are the *output weights*, \(\mathbf{b} = (b_1, \cdots, b_N)\in \mathbb{R}^D\) are the *input weights*

\[ h(x) = \frac{1}{m} \sum_{i=1}^m a_i \max\{ b_i^\top x,0\}, \]

## Universal approximation theorem

Cybenko showed in 1989 that a neural network of arbitrary width with sigmoid activation function could approximate any continuous function (Cybenko 1989).

Barron added rates of convergence by enforcing smoothness condition on the target function (Barron 1993).

## Training a neural network is hard in general

The established way of training a neural network is to use the backpropagation algorithm and gradient descent methods.

“Training a 3-node neural network is NP-complete” (Blum, Rivest 1992). However practice has shown we can actually efficiently train them. But there are still major unanswered question that Leo Breiman was already raising in 1995 (Breiman 2018):

- Why don ’t heavily parameterized neural networks overfit the data?
- What is the effective number of parameters?
- Why doesn’t backpropagation head for a poor local minima?

## Bibliography

- G. Cybenko. . "Approximation by Superpositions of a Sigmoidal Function".
*Mathematics of Control, Signals, and Systems*2 (4):303–14. DOI. - A. R. Barron. . "Universal Approximation Bounds for Superpositions of a Sigmoidal Function".
*IEEE Transactions on Information Theory*39 (3):930–45. DOI. - Avrim L. Blum, Ronald L. Rivest. . "Training a 3-node Neural Network Is Np-complete".
*Neural Networks*5 (1):117–27. DOI. - Leo Breiman. . "Reflections After Refereeing Papers for NIPS". In
*The Mathematics of Generalization*, 1st ed., 11–15. CRC Press. DOI.