Gradient descent for wide two-layer neural networks – I : Global convergence

tags: Neural networks, Optimization
authors: Francis Bach, Lénaïc Chizat
source: Francis Bach’s blog

In the rest, we use the mathematical definition of a neural network from Neural networks.

Two layer neural network

Even simple neural network models are very difficult to analyze. This is primarily due to two difficulties:

Non-linearity: the problem is typically non-convex, which in general is a bad thing in optimization.
Overparametrization: there are often a lot of parameters, sometimes many more parameters than observations.

Results presented here are actually taking advantage of overparametrization, with $m\rightarrow \infty$ and two key properties of the problem.

Separability: The problem can be decomposed in a sum of terms independently parametrized in $\omega_i = (a_i, b_i)$, with $h = \frac{1}{m} \sum_{i=1}^m \Phi(\omega_i)$ where $\Phi : \mathbb{R}^p \rightarrow \mathcal{F}$ and $\mathcal{F}$ is a space of functions. Here, $p = d+1$ and \[ \Phi(w)(x) = a (b^\top x)_+. \] This part is only true for two-layer neural networks however.
Homogeneity: ReLU is positively homogeneous and $\Phi$ is 2-homogeneous, meaning that for $\omega = (a, b)$, $\Phi(\lambda\omega) = \lambda^2 \Phi(\omega)$.

Infinitely wide neural network

The goal is to minimize a functional $R$ w.r.t function $h$

\[R(h) = \mathbb{E}_{p(x, y)} \left[ \ell(y, h(x))\right]\]

with $\ell(y, h(x))$ the loss incurred by predicting $h(x)$ when the true label was $y$. $R$ is assumed to be convex in its second argument, like least-square or logistic losses.

We re-use the two layer neural network formulation above to obtain

\[ G(W) = G(w_1,\dots,w_m) = R \Big( \frac{1}{m} \sum_{i=1}^m \Phi(w_i) \Big), \]

which can be seen as the minimization of

\[ F(\mu) = R \Big( \int_{\mathbb{R}^p} \Phi(w) d\mu(w) \Big),\]

with respect to a probability measure which is a sum of Dirac measures at $w_i$, $\mu = \frac{1}{m} \sum_{i = 1}^m \delta(w_i)$.

This makes the minimization problem much nicer since the set of measures is convex and $h = \int_{\mathbb{R}^p} \Phi(w) d\mu(w)$ is linear in $\mu$. However, because the set of measures is infinite dimensional, the choice of new neurons to minimize $F$ is very difficult (NP-hard for a threshold activation, polynomial but with very high complexity for ReLU). In practise, SGD is used.

Gradient flow

The gradient flow is defined by

\[ \dot{W} = - m \nabla G(W) \]

On a metric space $\mathcal{X}$, a gradient flow can be defined for a function $f$ and seen a the limit of taking infinitesimal steps of length $\gamma$, defining a sequence $(x_k)$

\[ x_{k+1} \in \arg\min_{x\mathcal{x}} f(x) + \dfrac{1}{2\gamma} d(x, x_k)^2 \]

The smaller the step, the closer this sequence is to the actual gradient flow.

Euler discretization

In $\mathbb{R}^d$ and a continuously differentiable $f$, $x_{k + 1} = x_k - \gamma f’(x_k) + o(\gamma)$, we simply obtain Euler steps.

Vector space gradient flows on probability measures

Probability measures are a convex subset of measures with finite total variation, which is equal to the $ℓ_1$-norm between densities when the two probability measures have densities with respect to the same base measure. It is a normed vector space for which we could derive our first type of gradient flow, which can be seen as a continuous version of Frank-Wolfe algorithm, where atoms are added one by one, until convergence.

However, the flow of measures cannot be approximated by the set of neurons of the network.

Wasserstein gradient flow on probability measures

The Wasserstein distance is directly related to Optimal transport. For two empirical measures with the same number of points, it is the minimal sum of squared distances between pairs of point over all possible permutations.

The Wasserstein gradient flow is written

\[ \dot{w}_i = \ – \nabla \Phi(w_i) \nabla R\Big(\int_{\mathbb{R}^p} \Phi d\mu \Big), \]

Global convergence

Under two main assumptions (plus additional technical assumptions), if the Wasserstein gradient flow converges to a measure, it has to be the global minimum of the function $F$ defined above.

The two assumptions are:

Homogeneity of $\Phi: \mathbb{R}^p \rightarrow \mathcal{F}$
$w_i$s are uniformly distributed on the sphere

The authors show on simple examples that for data generated with a neural network with $m_0 = 5$ neurons. The result above suggest that with large $m$ a neural network should converge to the original neurons. Surprisingly, relatively small $m$s are already very good at doing that.