Supervised learning

tags: Machine learning

Data

Input/output example pairs: \[ \{(x_i, y_i)\}_{i\leq n} \sim_{iid} \mathbb{P}, \quad \mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \text{ unknown} \]

Mapping

We search for a mapping \(f: \mathcal{X} \rightarrow \mathcal{Y}\). It is also common to parameterize this mapping with a parameter \(\theta \in \mathbb{R}^d\) and write \(h: \mathcal{X} \times \mathbb{R}^d \rightarrow \mathcal{Y}\).

The prediction \(\hat{y}\) is written

\[ \hat{y} = f(x) = h(x, \theta) \]

Objective

The goal is to find the above mapping such as to minimize an objective. For example, the objective can be the expectation quadratic loss below: \[ R(f) = \mathbb{E}\left\{ \frac{1}{2}(y_{new} - f(x_{new}))^2 \right\}, \quad (y_{new}, x_{new}) \sim \mathbb{P} \]

In general we write

\[ R(f) = \mathbb{E}_\mathbb{P} \left[ \ell(y, f(x)) \right]\]

Usually, because this objective cannot be minimized directly — we don’t have access to infinitely many new samples — minimization on the set of available examples is done instead: this is Empirical risk minimization (ERM).

Empirical risk minimization

For the above \(R\) with least-square loss, ERM is written \[ \text{minimize}\quad \hat{R}_n(\theta) := \frac{1}{2n}\sum_{i=1}^n(y_i - f(x_i;\theta))^2, \] \[ \text{subj. to } f(\cdot;\theta) : \mathbb{R}^d \rightarrow \mathbb{R}, \quad \theta \in \Theta \subseteq \mathbb{R}^p \]

This minimization problem can be solved using Optimization methods. Gradient descent is one of them. Its stochastic variant is written as follows for the problem above.

\[ \theta^{k+1} = \theta^k - \frac{\epsilon}{2} \nabla_\theta(y_{I(k)} - f(x_{I(k)}; \theta^k))^2 \] \[ I(k) \sim \text{Unif}([n]) \]

Expected risk minimization

This other quantity measures the generalization performance of our model , the expectation below is taken with respect to unseen data and thus cannot be computed. This is the setting of a single pass optimization or online learning, and gives the best generalization guarantees.

\[ R(f) = \mathbb{E}_\mathbb{P} \left[ \ell(y, f(x)) \right]\]

Usual losses

In general, the loss as the form \(\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}\) Several losses are usually used in the two most common supervised learning settings:

Regression

The quadratic loss is most often used: \[ \frac{1}{2} (y - f(x))^2 \]

Classification

With classification, the prediction \(y\) is in \(\{ -1, 1 \}\), and we usually take \(\hat{y} = \text{sign}(f(x))\)

An “ideal” loss would be \(\ell(y\cdot f(x)) = 1_{y\cdot f(x) < 0}\). Usually, this is substituted for convex losses:

Hinge loss \(\max(0, - y \cdot f(x) + 1)\)
Square loss \(\frac{1}{2}(1- y \cdot f(x))^2\)
Logistic loss \(\log ( 1 + \exp (- y \cdot f(x)))\)

Learning

A learning algorithm in the supervised setting can be reduced down to a mapping between the initial set of input/output pairs and a prediction function:

\[ L: (\mathcal{X} \times \mathcal{Y})^n \rightarrow (f: \mathcal{X} \rightarrow \mathcal{Y}) \]

A computing analogy for Supervised learning

If we look at it from a computing point of view (instead of the mathematical/functional point of view), the problem of supervised learning is similar to trying to construct a (computer + program) pair from some data such that the program run on that computer with some part of our data produces the right other part of that data.