CREATING A CLASSIFIER
OBJECTIVE
Design a function that takes an image as input and outputs a label.
A POSSIBLE TECHNIQUE: IMAGE FLATTENING
The idea is to scan the image pixels linearly, flattening them into a vector $x$:
$x = \text{flatten}(\text{image})$
so the classifier becomes:
$f(x) = w \cdot x$
where $w$ is a weight vector of the same size as the input.
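A minimal NumPy sketch of this idea (sizes and names are illustrative, not from the source):

```python
import numpy as np

# Assume a 32x32 grayscale image (illustrative size).
image = np.random.rand(32, 32)

# Flatten the 2D pixel grid into a 1D vector x of size 32*32 = 1024.
x = image.flatten()

# w is a weight vector of the same size as the input.
w = np.random.rand(x.size)

# The classifier output is the dot product between w and x: a single scalar.
output = np.dot(w, x)
```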
LIMITS OF IMAGE FLATTENING
This solution is not practical because the label is a categorical value, not a numerical one: closer output values do not imply that the images belong to similar classes.
A better choice is to output a score for each label, so
$\text{scores} = W x$
where $W$ is a matrix of size $(\text{number of labels}) \times (\text{input size})$ and $\text{scores}$ is a vector with one entry per label.
This type of linear classifier can be realized with the template matching approach, where the templates are the rows of the matrix $W$.
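A small sketch of the score-based classifier, with assumed sizes (10 labels, 32x32 grayscale inputs):

```python
import numpy as np

num_labels = 10          # illustrative number of classes
input_size = 32 * 32     # flattened image size

# W has one row per label: each row acts as a template for its class.
W = np.random.rand(num_labels, input_size)
x = np.random.rand(input_size)

# scores has one entry per label; each entry is the dot product between
# a template (a row of W) and the image, i.e. a similarity measure.
scores = W @ x
predicted_label = np.argmax(scores)
```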
LOSS FUNCTION FOR A LINEAR CLASSIFIER
A common approach is to translate the scores into probabilities with the softmax function:
$p_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$
The true label can be represented as a one-hot encoded row of scores, such as $y = (0, 0, 1, 0)$.
The loss function must decrease as the probability the model assigns to the true label becomes higher. So in the case of a linear classifier the function can be defined as
$\ell = -\log(p_{\text{true label}})$
and the per-sample loss is given by:
$\ell_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$
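A sketch of the softmax and the per-sample loss in NumPy (the scores and label are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability;
    # the resulting probabilities are unchanged.
    exp_s = np.exp(scores - np.max(scores))
    return exp_s / exp_s.sum()

scores = np.array([2.0, 1.0, 0.1])   # scores for 3 labels
probs = softmax(scores)              # probabilities summing to 1

true_label = 0                       # one-hot target would be (1, 0, 0)
loss = -np.log(probs[true_label])    # shrinks toward 0 as probs[true_label] -> 1
```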
MINIMIZING THE LOSS FUNCTION
The loss function can be seen as a multivariate function whose variables are the parameters of the model, so the problem becomes an optimization problem of the type
$W^* = \arg\min_W L(W)$
The most common approach to this problem is to compute the gradient of the loss function, $\nabla_W L$,
and follow its opposite (descending) direction (GRADIENT DESCENT)
GRADIENT DESCENT
- Randomly initialize $W$
- For each epoch:
  - Forward pass: classify all the training data to get the predictions and the loss
  - Backward pass: compute the gradient $\nabla_W L$
  - Update the parameters: $W \leftarrow W - \eta \nabla_W L$, where $\eta$ is the learning rate hyperparameter
The learning rate can influence the convergence speed of the training procedure: too small a value makes convergence slow, while too large a value can make the loss oscillate or diverge (see the sketch below).
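A minimal sketch of this loop; a toy quadratic loss stands in for the real forward and backward passes so the snippet is self-contained:

```python
import numpy as np

def loss_and_grad(W):
    # Toy loss with a known minimum at W = 0 (stand-in for the real model).
    return 0.5 * np.sum(W ** 2), W

lr = 0.1                            # learning rate hyperparameter
W = np.random.randn(10, 5)          # random initialization
for epoch in range(100):
    loss, grad = loss_and_grad(W)   # forward pass (loss) + backward pass (gradient)
    W = W - lr * grad               # update in the direction opposite to the gradient
```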
GRADIENT DESCENT LIMITS
Computing the gradient of the loss function on all the training data is computationally infeasible: since the loss function is the mean of the per-sample losses, the gradient must be computed on ALL of the samples.
STOCHASTIC GRADIENT DESCENT
Instead of computing the gradient on the global loss function, the parameter update is done for each sample. This method is more computationally efficient, but it is not robust to noise, since a single sample gives only a rough estimate of the gradient.
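As an illustration (not the source's code), per-sample updates on a toy squared-error objective:

```python
import numpy as np

X = np.random.randn(1000, 10)   # toy dataset: 1000 samples, 10 features
y = X @ np.ones(10)             # toy targets
w = np.zeros(10)
lr = 0.01

for i in np.random.permutation(len(X)):   # one parameter update per sample
    grad = (X[i] @ w - y[i]) * X[i]       # gradient of the squared error on sample i
    w -= lr * grad                        # cheap but noisy update
```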
BEST COMPROMISE
A good compromise is to use mini-batches of data to compute the gradient: choosing a batch size $B$ ($B$ is also a hyperparameter), the number of updates in each epoch (i.e. the number of batches) over $N$ samples can be computed as
$\text{updates per epoch} = \lceil N / B \rceil$
In this case larger batches approximate the gradient better, at the cost of higher memory occupation.
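A sketch of mini-batch iteration with assumed values $N = 50000$ and $B = 64$ (note that $B = 1$ recovers pure stochastic gradient descent, while $B = N$ recovers full gradient descent):

```python
import math
import numpy as np

N = 50_000   # number of training samples (illustrative)
B = 64       # batch size, also a hyperparameter

updates_per_epoch = math.ceil(N / B)   # number of batches: 782 here

# One epoch: shuffle the data, then slice it into batches of size B.
indices = np.random.permutation(N)
for start in range(0, N, B):
    batch = indices[start:start + B]   # the gradient is computed on these samples only
```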
IMPROVING THE APPROXIMATION
To further improve the approximation, a momentum term can be added to the update phase:
$v \leftarrow \mu v - \eta \nabla_W L, \qquad W \leftarrow W + v$
With this term each update becomes an average of the previous ones, smoothing the gradient estimates.
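A sketch of the momentum update, reusing the toy loss from the gradient descent sketch ($\mu$, the momentum coefficient, is an assumed symbol):

```python
import numpy as np

def loss_and_grad(W):
    return 0.5 * np.sum(W ** 2), W   # toy stand-in for the real loss

lr = 0.1                    # learning rate
mu = 0.9                    # momentum coefficient
W = np.random.randn(10, 5)
v = np.zeros_like(W)        # "velocity": accumulated past updates

for step in range(100):
    _, grad = loss_and_grad(W)
    v = mu * v - lr * grad  # exponential moving average of the updates
    W = W + v               # smoother trajectory than plain (S)GD
```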
LIMITS OF A LINEAR CLASSIFIER
For many applications, capturing all the variability with one template is impossible; something more meaningful than raw pixels is needed. Pixels must be transformed into some form of feature.