- tags
- Optimization, Algorithm
- resources
- Slides by Christian S. Perone
Fixed learning rate
The simplest way to apply the gradient descent algorithm on a function \(g\) convex and $L-$smooth on \(\mathbb{R}^d\) is to use the parameter update:
\[ \theta_t = \theta_{t-1} - \gamma g'(\theta_{t-1}) \]
This is based on the standard first-order approximation of the function \(g\). It can be very sensitive to the learning rate and suffer from pathological curvature.
Momentum
Momentum has been introduced in the gradient descent algorithm to dampen oscillations in the pathological cases and accelerating descent. Trajectories are less sensitive to changes in the loss landscape.
\[ \begin{align} V_{t + 1} = \beta V_t + g'(\theta_{t})\newline \theta_{t + 1} = \theta_t - \gamma V_{t + 1} \end{align}\]
Line search
Another common way to apply the gradient descent algorithm with non fixed learning rate is to use line search at each step. This corresponds to searching for the best \(\gamma\) to use in \(\theta_{} + \gamma \Delta\theta\) (in gradient descent, \(\Delta\theta = - g'(\theta)\)).
Exact line search
For exact line search, we take \(\gamma = \arg\min_{s\geq 0} g(\theta + s \Delta \theta)\). This method is used when the cost of minimizing the function above is low compared to computing the search direction \(\Delta\theta\).
Backtracking line search
This is an algorithm for inexact line search. It depends on two parameters \(0 < \alpha < 0.5\) and \(0 < \beta < 1\). It is composed of one step which is run until the following condition is verified
\[ g(\theta + t \Delta \theta)> g(\theta) + \alpha t \nabla g(\theta)^T \Delta\theta \]
At each step, \(t \leftarrow \beta t\), \(t\) is initialized to 1.
SGD and global minima
Bibliography
- Shengchao Liu, Dimitris S. Papailiopoulos, Dimitris Achlioptas. . "Bad Global Minima Exist and SGD Can Reach Them". In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neurips 2020, December 6-12, 2020, Virtual, edited by Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. https://proceedings.neurips.cc/paper/2020/hash/618491e20a9b686b79e158c293ab4f91-Abstract.html.