# Supervised learning

tags
Machine learning

## Data

Input/output example pairs: $\{(x_i, y_i)\}_{i\leq n} \sim_{iid} \mathbb{P}, \quad \mathbb{P} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \text{ unknown}$

## Mapping

We search for a mapping $$f: \mathcal{X} \rightarrow \mathcal{Y}$$. It is also common to parameterize this mapping with a parameter $$\theta \in \mathbb{R}^d$$ and write $$h: \mathcal{X} \times \mathbb{R}^d \rightarrow \mathcal{Y}$$.

The prediction $$\hat{y}$$ is written

$\hat{y} = f(x) = h(x, \theta)$

## Objective

The goal is to find the above mapping such as to minimize an objective. For example, the objective can be the expectation quadratic loss below: $R(f) = \mathbb{E}\left\{ \frac{1}{2}(y_{new} - f(x_{new}))^2 \right\}, \quad (y_{new}, x_{new}) \sim \mathbb{P}$

In general we write

$R(f) = \mathbb{E}_\mathbb{P} \left[ \ell(y, f(x)) \right]$

Usually, because this objective cannot be minimized directly — we don’t have access to infinitely many new samples — minimization on the set of available examples is done instead: this is Empirical risk minimization (ERM).

### Empirical risk minimization

For the above $$R$$ with least-square loss, ERM is written $\text{minimize}\quad \hat{R}_n(\theta) := \frac{1}{2n}\sum_{i=1}^n(y_i - f(x_i;\theta))^2,$ $\text{subj. to } f(\cdot;\theta) : \mathbb{R}^d \rightarrow \mathbb{R}, \quad \theta \in \Theta \subseteq \mathbb{R}^p$

This minimization problem can be solved using Optimization methods. Gradient descent is one of them. Its stochastic variant is written as follows for the problem above.

$\theta^{k+1} = \theta^k - \frac{\epsilon}{2} \nabla_\theta(y_{I(k)} - f(x_{I(k)}; \theta^k))^2$ $I(k) \sim \text{Unif}([n])$

### Expected risk minimization

This other quantity measures the generalization performance of our model , the expectation below is taken with respect to unseen data and thus cannot be computed. This is the setting of a single pass optimization or online learning, and gives the best generalization guarantees.

$R(f) = \mathbb{E}_\mathbb{P} \left[ \ell(y, f(x)) \right]$

### Usual losses

In general, the loss as the form $$\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$$ Several losses are usually used in the two most common supervised learning settings:

#### Regression

The quadratic loss is most often used: $\frac{1}{2} (y - f(x))^2$

#### Classification

With classification, the prediction $$y$$ is in $$\{ -1, 1 \}$$, and we usually take $$\hat{y} = \text{sign}(f(x))$$

An “ideal” loss would be $$\ell(y\cdot f(x)) = 1_{y\cdot f(x) < 0}$$. Usually, this is substituted for convex losses:

• Hinge loss $$\max(0, - y \cdot f(x) + 1)$$
• Square loss $$\frac{1}{2}(1- y \cdot f(x))^2$$
• Logistic loss $$\log ( 1 + \exp (- y \cdot f(x)))$$

## Learning

A learning algorithm in the supervised setting can be reduced down to a mapping between the initial set of input/output pairs and a prediction function:

$L: (\mathcal{X} \times \mathcal{Y})^n \rightarrow (f: \mathcal{X} \rightarrow \mathcal{Y})$

## A computing analogy for Supervised learning

If we look at it from a computing point of view (instead of the mathematical/functional point of view), the problem of supervised learning is similar to trying to construct a (computer + program) pair from some data such that the program run on that computer with some part of our data produces the right other part of that data.

## Links to this note

Last changed | authored by