Gradient descent

tags
Optimization, Algorithm
resources
Slides by Christian S. Perone

Fixed learning rate

The simplest way to apply the gradient descent algorithm on a function \(g\) convex and $L-$smooth on \(\mathbb{R}^d\) is to use the parameter update:

\[ \theta_t = \theta_{t-1} - \gamma g'(\theta_{t-1}) \]

This is based on the standard first-order approximation of the function \(g\). It can be very sensitive to the learning rate and suffer from pathological curvature.

Momentum

Momentum has been introduced in the gradient descent algorithm to dampen oscillations in the pathological cases and accelerating descent. Trajectories are less sensitive to changes in the loss landscape.

\[ \begin{align} V_{t + 1} = \beta V_t + g'(\theta_{t})\newline \theta_{t + 1} = \theta_t - \gamma V_{t + 1} \end{align}\]

Another common way to apply the gradient descent algorithm with non fixed learning rate is to use line search at each step. This corresponds to searching for the best \(\gamma\) to use in \(\theta_{} + \gamma \Delta\theta\) (in gradient descent, \(\Delta\theta = - g'(\theta)\)).

For exact line search, we take \(\gamma = \arg\min_{s\geq 0} g(\theta + s \Delta \theta)\). This method is used when the cost of minimizing the function above is low compared to computing the search direction \(\Delta\theta\).

This is an algorithm for inexact line search. It depends on two parameters \(0 < \alpha < 0.5\) and \(0 < \beta < 1\). It is composed of one step which is run until the following condition is verified

\[ g(\theta + t \Delta \theta)> g(\theta) + \alpha t \nabla g(\theta)^T \Delta\theta \]

At each step, \(t \leftarrow \beta t\), \(t\) is initialized to 1.


← Back to Notes