Frank-Wolfe algorithm

tags
Optimization, Algorithm
resources
Fabian Pedregosa’s series on FW

Definition

The algorithm was originally published in (Frank & Wolfe 1956); (Jaggi 2013) gives a more recent overview.

For a differentiable function $$f$$ with $$L$$-Lipschitz gradient, defined on a convex and compact domain $$\mathcal{C}$$, we want to solve the optimization problem:

$\min_{\boldsymbol{x} \in \mathcal{C}} f(\boldsymbol{x})$

The algorithm starts with an initial guess $$\boldsymbol{x}_0$$ and constructs a sequence of values $$\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots$$ which converges to the solution. We can write the algorithm as follows:

\begin{align} & \textbf{Input}: \text{initial guess $\boldsymbol{x}_0$, tolerance $\delta > 0$} \nonumber \\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad \boldsymbol{s}_t \in \argmax_{\boldsymbol{s} \in \mathcal{C}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle \\ &\quad \boldsymbol{d}_t = \boldsymbol{s}_t - \boldsymbol{x}_t\\ &\quad g_t = \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{d}_t \rangle\\ &\quad \textbf{If } g_t < \delta: \\ &\quad\quad \text{// exit if gap is below tolerance }\nonumber\\ &\quad\quad\textbf{return } \boldsymbol{x}_t\\ &\quad {\textbf{Variant 1}}: \text{set step size as} \nonumber\\ &\quad\qquad\gamma_t = \min \left\{\frac{g_t}{L\|\boldsymbol{d}_t\|^2}, 1 \right\}\\ &\quad \textbf{Variant 2}: \text{set step size by line search} \nonumber\\ &\quad\qquad \gamma_t = \argmin_{\gamma \in [0, 1]} f(\boldsymbol{x}_t + \gamma \boldsymbol{d}_t)\\ &\quad \textbf{Variant 3}: \text{set step size by schedule} \nonumber\\ &\quad\qquad \gamma_t = \frac{2}{t+2}\\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \gamma_t \boldsymbol{d}_t~.\\ &\textbf{end For}\\ & \textbf{return } \boldsymbol{x}_t \end{align}
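As an illustrative sketch (not part of the original note), the loop above can be written in Python for a quadratic objective over the $\ell_1$ ball, a constraint set whose linear minimization step has a closed form. The function name and problem setup are my own choices:

```python
import numpy as np

def fw_quadratic_l1(A, b, radius=1.0, tol=1e-6, max_iter=1000):
    """Frank-Wolfe for min 0.5 x^T A x - b^T x over the l1 ball of given radius.

    Uses the step-size schedule gamma_t = 2 / (t + 2) (Variant 3 above).
    """
    n = b.shape[0]
    x = np.zeros(n)  # initial guess, inside the l1 ball
    for t in range(max_iter):
        grad = A @ x - b
        # Linear minimization oracle over the l1 ball: the minimizer of
        # <grad, s> is a signed vertex of the ball (one nonzero coordinate).
        i = np.argmax(np.abs(grad))
        s = np.zeros(n)
        s[i] = -radius * np.sign(grad[i])
        d = s - x
        g = -grad @ d  # Frank-Wolfe gap
        if g < tol:    # exit if gap is below tolerance
            break
        gamma = 2.0 / (t + 2.0)
        x = x + gamma * d
    return x
```

Note that every iterate is a convex combination of vertices of the $\ell_1$ ball, so no projection is ever needed, which is the main selling point of the method.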

The update step can be re-written as $$\boldsymbol{x}_{t+1} = (1 - \gamma_t)\,\boldsymbol{x}_t + \gamma_t \boldsymbol{s}_t$$: the algorithm moves towards the vertex $$\boldsymbol{s}_t$$, and since $$\gamma_t \in [0, 1]$$ the new iterate is a convex combination of $$\boldsymbol{x}_t$$ and $$\boldsymbol{s}_t$$, so it stays inside $$\mathcal{C}$$ without any projection step.

The vector $$\boldsymbol{s}_t$$ is the result of another optimization problem, sometimes called the linear minimization oracle (LMO): it returns a point of the constraint set $$\mathcal{C}$$ that correlates maximally with the negative gradient. There are many ways to choose the step size for the FW algorithm, three of which are shown above.
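As another example (an illustrative case I am adding, not from the original note), when $$\mathcal{C}$$ is the probability simplex the oracle has a closed form: the answer is the standard basis vector at the coordinate where the gradient is smallest.

```python
import numpy as np

def lmo_simplex(grad):
    """Linear minimization oracle over the probability simplex.

    Solves argmin_{s in simplex} <grad, s>, equivalently
    argmax <-grad, s>: a linear function over the simplex is
    minimized at a vertex, i.e. the basis vector e_i at the
    coordinate i where the gradient is smallest.
    """
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s
```

This is why FW is attractive for such sets: the oracle costs a single pass over the gradient, whereas a Euclidean projection onto the simplex requires a more involved routine.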

Bibliography

1. Frank, Marguerite, and Philip Wolfe. 1956. "An Algorithm for Quadratic Programming". Naval Research Logistics Quarterly 3 (1–2): 95–110. DOI.
2. Jaggi, Martin. 2013. "Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization". In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, 28:427–35. JMLR Workshop and Conference Proceedings. JMLR.org. http://proceedings.mlr.press/v28/jaggi13.html.