# Neural Network Training

This post is part of the series Neural Networks.

In this series of posts, I have already shown the basic idea behind computational neural networks as a machine learning technique and the implementation of the forward propagation algorithm. In this post, we will now look into a method for fitting such a network to a set of training data.

## Cost Function

The direct outputs from a neural network are, as discussed in previous posts, the activations of the output layer. Each entry of the output activations corresponds to a specific classification predicted by the network, with values ranging from 0 to 1. The closer an entry is to 1, the more strongly the network favours that classification for the inputs provided.

In the ‘ideal’ case, the output activations for a network with 4 outputs, given input data belonging to class 3, would be:

$\mathbf{a}^{(L)} = \left[ \begin{matrix}0\\0\\1\\0\end{matrix} \right]$

Say that, in reality, the network predicted output activations of:

$\mathbf{a}^{(L)} = \left[ \begin{matrix}0.12\\0.02\\0.87\\0.26\end{matrix} \right]$

This would give the correct prediction, but as the entry for output 3 becomes smaller and the others become larger, the prediction is less and less ‘certain’. The closer the output activation is to the ideal case above, the better our confidence in its prediction, and so we would want the trained network to minimise the difference between the ideal output activations, which we will denote from here on as $$\mathbf{y}$$, and the actual ones $$\mathbf{a}^{(L)}$$.

A popular choice of cost function for such a situation would be the mean squared error:

$\left\| \, \mathbf{y} - \mathbf{a}^{(L)} \right\|^2 = \sum_j \left( y_j - a_j^{(L)} \right)^2$

However, for neural networks a better cost function to use for training turns out to be the cross entropy cost function:

$- \left[ \mathbf{y} \cdot \ln \mathbf{a}^{(L)} + (1 - \mathbf{y}) \cdot \ln (1 - \mathbf{a}^{(L)}) \right] \\ = - \sum_j \left[ y_j \ln a_j^{(L)} + (1 - y_j) \ln (1 - a_j^{(L)}) \right]$
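As a quick numerical illustration (a standalone check, not code from the project repository), evaluating this cost for the example output activations above against the ‘ideal’ outputs gives:

```python
import numpy as np

y = np.array([0, 0, 1, 0])               # 'ideal' output activations
a = np.array([0.12, 0.02, 0.87, 0.26])   # actual output activations

# Cross entropy cost for this single example
cost = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
print(cost)  # roughly 0.59; a more 'certain' prediction gives a smaller cost
```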

The reasons why this cost function is better are beyond the scope of this post series; for further reading on the topic, I refer you to Michael Nielsen’s excellent guide on neural networks.

As we usually wish to evaluate the cost of the network given a set of training examples, this can be extended to:

$- \frac{1}{n} \sum_{i=1}^n \left[ \mathbf{y}_i \cdot \ln \mathbf{a}_i^{(L)} + (1 - \mathbf{y}_i) \cdot \ln (1 - \mathbf{a}_i^{(L)}) \right]$

where we average the cost over all provided training examples.

### Regularisation

To prevent overfitting, an additional regularisation term is added to the cost function:

$\frac{\lambda}{2n} \sum_{l=1}^L \left\| \, \mathbf{w}^{(l)} \right\|^2$

where the coefficient $$\lambda > 0$$ is the regularisation parameter. Simply speaking, this term in the cost function penalises larger weights on the neurons, effectively preventing the network from forming more complex hypotheses than are appropriate. Choosing $$\lambda$$ requires some empirical study of the specific application, and will be discussed in a later post of the series.

### Implementation

Calculation of the above cost function is implemented by the cost method of the NeuralNetwork class in my project repository. See below how this method propagates the activations through the network, calculates the cross entropy cost as above and then adds on the regularisation penalty:
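The repository code is not reproduced here, but a minimal sketch of what such a method might look like is given below. The attribute and method names it relies on (layers, feed_forward, a lambda_ argument for the regularisation parameter, and a column-per-example data layout) are my assumptions rather than the exact repository interface:

```python
import numpy as np

def cost(self, inputs, y, lambda_=0.0):
    """Cross entropy cost of the network over a set of n training examples."""
    n = inputs.shape[1]  # assumed layout: one column per training example

    # Propagate the input activations forward through each layer
    a = inputs
    for layer in self.layers:
        a = layer.feed_forward(a)

    # Cross entropy cost, averaged over the n training examples
    cross_entropy = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n

    # Regularisation penalty, summed over all layers
    penalty = lambda_ * sum(layer.regularisation_penalty()
                            for layer in self.layers) / (2 * n)

    return cross_entropy + penalty
```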

The regularisation cost is calculated using the regularisation_penalty method of the NeuralNetworkLayer class, which simply returns the summed squared weights for that layer:
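A sketch of that method, assuming the layer stores its weight matrix as self.weights:

```python
import numpy as np

def regularisation_penalty(self):
    """Sum of the squared weights for this layer, used in the L2 regularisation term."""
    return np.sum(self.weights ** 2)
```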

## Cost Function Gradient

We have code to compute the cost function for a particular neural network, but how do you go about finding the configuration of the network that minimises this cost function, and thus has the best predictive power?

This is an optimisation problem. There are a number of ways to approach it, such as a parameter sweep, in which many networks with different parameters (weights and biases) are tried to see which works best. However, for a well behaved cost function with a large number of parameters, a gradient descent method works well.

Explaining the detail behind gradient descent is beyond the scope of this series of posts, but in essence it is like finding the lowest point in a bowl. Treating the height of a point in the bowl as the cost, we can find the bottom by repeatedly moving in the direction of steepest descent. Proceeding step-wise in this fashion, one arrives at the minimum cost/height.

The gradient of a cost function $$C(\mathbf{x})$$ can be calculated by a finite differencing approach:

$\frac{\partial C}{\partial x_i} \approx \frac{C(x_i + \Delta x) - C(x_i)}{\Delta x}$

However, for a model with as many parameters as a neural network, estimating the partial derivatives for all parameters in this way will be very slow, since every parameter requires at least one additional evaluation of the full cost function.
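That said, finite differencing remains useful as a sanity check on an analytically computed gradient. A rough, generic sketch of such a check (the function and argument names here are purely illustrative and not from the project repository):

```python
import numpy as np

def numerical_gradient(cost_fn, params, delta=1e-5):
    """Approximate the gradient of cost_fn at params by forward differencing.

    Every parameter needs its own extra evaluation of the full cost, which is
    exactly why this approach is too slow for training a large network.
    """
    grad = np.empty_like(params)
    base_cost = cost_fn(params)
    for i in range(params.size):
        perturbed = params.copy()
        perturbed[i] += delta
        grad[i] = (cost_fn(perturbed) - base_cost) / delta
    return grad
```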

### Backpropagation Algorithm

Instead, we use a neural network-specific method for calculating the gradient. In the backpropagation algorithm, we first compute the errors on all neurons of the network, then use further relations to calculate the gradient of the cost function with respect to each weight and bias.

This is done with the following relations. Firstly, the error of a neuron is defined as:

$\delta_j^{(l)} \equiv \frac{\partial C}{\partial z_j^{(l)}}$

It can be shown that for the cross entropy cost function, the error in the output layer $$l=L$$ can be simply calculated with:

$\delta_j^{(L)} = a_j^{(L)} - y_j$

or in vector notation:

$\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y} \label{deltalast}\tag{1}$

These errors can be propagated back through the network using the relation:

$\boldsymbol{\delta}^{(l)} = \left( \left( \mathbf{w}^{(l+1)} \right)^T \boldsymbol{\delta}^{(l+1)} \right) \odot \sigma'\left( \mathbf{z}^{(l)} \right) \label{backprop}\tag{2}$

where $$\odot$$ is the Hadamard, or element-wise, product and $$\sigma'$$ is the first derivative of the sigmoid function.

The gradient of the cost function with respect to a single weight is then given by:

$\frac{\partial C}{\partial w_{jk}^{(l)}} = a_k^{(l-1)} \delta_j^{(l)} + \frac{\lambda}{n} w_{jk}^{(l)} \label{gradweight}\tag{3}$

and with respect to a single bias is:

$\frac{\partial C}{\partial b_j^{(l)}} = \delta_j^{(l)} \label{gradbias}\tag{4}$

For additional insights into the backpropagation algorithm and information on the derivation of the above equations, please see the backpropagation chapter of Michael Nielsen’s online book on neural networks.

### Implementation

Equations $$\ref{deltalast}$$ to $$\ref{gradbias}$$ above are implemented in my neural network project once again through interaction of the NeuralNetwork and NeuralNetworkLayer classes. The gradient is computed within the cost method of the NeuralNetwork class when the gradient optional argument is set to True:
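Extending the earlier cost sketch, the gradient branch might look something like the following. As before, the helper names (feed_forward, backward, gradient) and exact signatures are assumptions on my part rather than the repository code verbatim:

```python
import numpy as np

def cost(self, inputs, y, lambda_=0.0, gradient=False):
    """Cost of the network and, optionally, its gradient over n training examples."""
    n = inputs.shape[1]

    # Forward pass and cost, as in the earlier sketch, except that every
    # layer's activations are kept for use in the backward pass
    activations = [inputs]
    for layer in self.layers:
        activations.append(layer.feed_forward(activations[-1]))
    a = activations[-1]
    c = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n
    c += lambda_ * sum(layer.regularisation_penalty() for layer in self.layers) / (2 * n)

    if not gradient:
        return c

    # Errors in the output layer (equation 1)
    delta = a - y

    # Move backwards through the layers, storing each layer's gradients as we go
    layer_gradients = []
    for i in reversed(range(len(self.layers))):
        layer = self.layers[i]
        layer_gradients.append(layer.gradient(delta, activations[i], lambda_, n))
        if i > 0:
            # Propagate the errors back to the previous layer (equation 2)
            delta = layer.backward(delta, activations[i])

    # Assemble the gradients for the whole network, first layer first
    grad = np.concatenate(layer_gradients[::-1])
    return c, grad
```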

Here, the line from the sketch above:
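```python
delta = a - y
```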

corresponds to equation $$\ref{deltalast}$$, computing the errors in the output layer. The errors are then propagated backwards inside the loop using the backward method of the NeuralNetworkLayer class, which implements equation $$\ref{backprop}$$:
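A sketch of what this might look like, noting that for the sigmoid activation $$\sigma'(\mathbf{z}^{(l)})$$ can be computed directly from that layer's activations as $$\mathbf{a}^{(l)} \odot (1 - \mathbf{a}^{(l)})$$ (the method and argument names are again my assumptions):

```python
import numpy as np

def backward(self, delta, previous_activations):
    """Propagate the errors delta back to the previous layer (equation 2)."""
    # For the sigmoid, sigma'(z) = a * (1 - a), evaluated for the previous layer
    sigma_prime = previous_activations * (1 - previous_activations)
    return (self.weights.T @ delta) * sigma_prime
```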

Equations $$\ref{gradweight}$$ and $$\ref{gradbias}$$ are implemented by the gradient method of NeuralNetworkLayer:
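Sketched along the same lines (the division by $$n$$ comes from averaging the cost over the training examples, and the flattening order is an assumption on my part):

```python
import numpy as np

def gradient(self, delta, previous_activations, lambda_, n):
    """Gradients of the cost with respect to this layer's weights and biases
    (equations 3 and 4), averaged over the n training examples."""
    weight_grad = (delta @ previous_activations.T) / n + (lambda_ / n) * self.weights
    bias_grad = np.sum(delta, axis=1, keepdims=True) / n
    # Flatten into one vector, in the same order as the layer's parameters are stored
    return np.concatenate([weight_grad.ravel(), bias_grad.ravel()])
```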

Note that this method returns the gradients in the same order as the parameters for that layer are set. The cost method of NeuralNetwork computes and stores the gradients for each layer as it moves backwards through the network, finally assembling the gradients for the whole network and returning them along with the cost, as in the closing lines of the sketch above.

## Training

With the functionality in place to compute the cost and its gradient, it is possible to begin training the network on some data! In this project, I used the ready-made optimisation algorithms from SciPy. The training method of the NeuralNetwork class combines these algorithms with the cost and gradient functionality to train a network to some provided data:
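The repository method itself is not shown here, but a simplified sketch of the idea using scipy.optimize.minimize might look as follows. The method name train, the get_parameters/set_parameters helpers, and the assumption of zero-indexed integer class labels are illustrative choices of mine rather than the project’s actual interface:

```python
import numpy as np
from scipy.optimize import minimize

def train(self, inputs, labels, lambda_=0.0):
    """Fit the network parameters to labelled training data using SciPy's optimisers."""
    n_outputs = self.layers[-1].weights.shape[0]

    # Map each (zero-indexed) integer class label to its 'ideal' output activation vector
    y = np.zeros((n_outputs, labels.size))
    y[labels, np.arange(labels.size)] = 1

    # Wrap the cost method so that it accepts a flat parameter vector and returns
    # the cost together with its gradient, as minimize expects when jac=True
    def wrapped_cost(params):
        self.set_parameters(params)          # hypothetical helper to unpack parameters
        return self.cost(inputs, y, lambda_, gradient=True)

    result = minimize(wrapped_cost, self.get_parameters(),  # hypothetical helper
                      jac=True, method='L-BFGS-B')
    self.set_parameters(result.x)
    return result
```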

Provided with the input features/activations and output labels for a set of training examples, this method maps the labels to output activations, then wraps the network’s cost function before passing it to the minimize optimiser interface. SciPy then calls the wrapped cost function as needed to find the optimal parameter set.

### Improving Training Speed

In order to take full advantage of NumPy’s efficient C backend, all significant number crunching in this implementation is done using NumPy array operations. Under the hood, NumPy’s linear algebra routines are backed by a BLAS library, which performs these calculations quickly.

To further improve performance when training on larger data sets, I recommend ensuring that your NumPy installation is built against a BLAS library that can execute in parallel, taking advantage of modern multi-core systems. This is particularly effective on my workstation, which has 12 physical CPUs.
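A quick way to check which BLAS implementation your NumPy build is linked against is NumPy’s own configuration report; with OpenBLAS or MKL, the number of threads used can then typically be controlled through environment variables such as OPENBLAS_NUM_THREADS or MKL_NUM_THREADS:

```python
import numpy as np

# Report the BLAS/LAPACK libraries this NumPy build is linked against
np.show_config()
```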