Tuesday, July 22, 2014

A Gentle Introduction to Backpropagation

Why is this blog being written?

[Update: A more detailed version of this article with coding outline is available. Please read the note at the end of this article.]

Neural networks have always fascinated me ever since I became aware of them in the 1990s. They are often represented with a hypnotizing array of connections. In the last decade, deep neural networks have dominated pattern recognition, often replacing other algorithms for applications like computer vision and voice recognition. At least in specialized tasks, they indeed come close to mimicking the miraculous feats of cognition our brains are capable of.

While neural networks are capable of such feats, the very discovery of a method of programming such a computational device seems to me to be a miraculous feat of cognition worthy of celebration. My purpose in writing this blog is to share my perspective on an amazing algorithm made widely known by a 1986 publication in Nature.
I will assume that the reader has some understanding of the meaning and purpose of the various elements of a neural network such as the one shown in Figure 1. I have provided a little bit of background below. Other than that, this blog should be an easy read for those with some familiarity of basic calculus.

Figure 1: An example of a multi-layered neural network that can be used to associate an input consisting of 10 numbers with one of 4 decisions or predictions.

What is so difficult about designing a neural network?

To appreciate the difficulty involved in designing a neural network, consider this: The neural network shown in Figure 1 can be used to associate an input consisting of 10 numbers with one of 4 decisions or predictions. For example, the neural network shown may be used by a bank to determine if credit should be extended to a customer. In this case, the 10 input numbers represent various parameters relevant to an individual's financial responsibility such as balance in savings accounts, outstanding loan amounts, number of years of employment, and so on. The neural network takes in these 10 numbers, performs calculations and produces an output consisting of 4 numbers. Depending on the position at which the maximum of these 4 numbers appears in the output, the prediction could be one of the following:
  1. Excellent creditworthiness with high spending limit
  2. Average creditworthiness with moderate spending limit
  3. Low creditworthiness with low spending limit
  4. High default risk.
Based on this prediction, the bank would take the appropriate decision.

A neural network such as the one shown in Figure 1 can perform this miraculous feat of cognition only if it is specifically trained to do so. For the network to work correctly, the weight at each of its (10$\times $4)+(4$\times $6)+(6$\times $8)+(8$\times $4) =144 connections have to be carefully chosen such that the network classifies every input drawn from a known training set with a high degree of accuracy. This is a classic application of the supervised learning paradigm in machine learning.

There are, of course, no formulas in existence to directly set the values of the 144 weights. The only recourse is to start with some initial values for the 144 weights, check how good the resulting neural network is, and repeatedly refine the weights to progressively make the network more and more accurate. So, what is needed is a method of refining the weights.

If one were to think of the accuracy as some grand function of the weights, it makes sense to refine each weight by changing it by an amount proportional to the partial derivative of that grand function with respect to that weight. Why bring in partial derivatives? Because they precisely predict how the accuracy responds to small changes in the weights. In fact, at every iteration, performing a refinement guided by the partial derivatives results in a more advantageous gain in accuracy compared to any other method of refinement. The method of steepest descent does exactly what is suggested here – with the minor, but entirely equivalent, goal of seeking to progressively decrease the error, rather than increase the accuracy.

To keep things straight, let me list the concepts I have described thus far:
  1. The weights of the neural network must be set such that the error on a known training set is minimized. Ideally, the network must yield zero error on a large and representative training data set.
  2. For any given setting of weights, we can attempt to refine them by changing each weight by an amount proportional to the partial derivative of the error with respect to that weight. The partial derivatives themselves change after any such refinement, and must be recomputed.
  3. We can keep on refining the weights in this manner, and stop refining when the error is zero. In real life, we call it done when the error is low enough. Or refuses to fall anymore. Or we run out of time after a few million rounds of refinements.
Seems simple, right? Yes. As long as there is a simple way of calculating the partial derivative of error with respect to every weight. In principle, the partial derivatives can be calculated by systematically perturbing the weights and measuring the resulting changes in error. A word to the wise, don't try this at home. Or even on your neighborhood supercomputer.


In fact, this error minimization problem that must be solved to train a neural network eluded a practical solution for decades till D. E. Rumelhart, C. E. Hinton, and R. J. Williams (drawing inspiration from other researchers) demonstrated a technique, which they called backpropagation, and made it widely known (Nature 323, 533-536, 9 October 1986). It is essentially by building upon their method that today others have ventured to program neural networks with 60 million weights, with astounding results.

According to Bernard Widrow, now Professor Emeritus at Stanford University and one of the pioneers of neural networks "The basic concepts of backpropagation are easily grasped. Unfortunately, these simple ideas are often obscured by relatively intricate notation, so formal derivations of the backpropagation rule are often tedious." This is indeed unfortunate because the backpropagation rule is one of the most elegant applications of calculus that I have known. 

Easy as 1-2-3
Once you appreciate the fact that, in order to train a neural network, you need to somehow calculate the partial derivatives of the error with respect to weights, backpropagation can be easily and qualitatively derived by reducing it to three core concepts. It also helps immensely to keep the notation intuitive and easy to connect to the concept being symbolized.
Figure 2: The derivation of the backpropagation algorithm is simplified by adding an extra computational block to calculate the error and also by boxing parts of the network.

1. Boxing

Since training the neural network is all about minimizing the training error, the first step in the derivation involves tacking on an extra computational block to calculate the error between the actual output $\left\{o_1,o_2,o_3,o_4\right\}$ and a known target $\{t_1,t_2,t_3,t_4\}$. This is shown as a triangular block in Figure 2. For now, let us think of the output and the target as known and fixed entities. Although we need not concern ourselves with the exact formula to compute the error, I offer the familiar sum-of-squares error as an example

Next, we choose one of the layers (say Layer 3) and enclose that layer and all following layers (including the error calculating block) in a box, as shown in gray in Figure 2. Keep in mind that this is just one of several nested boxes we can construct in order to compartmentalize the network. Let us resolve not to worry about anything going on inside the box but simply think of the relationship between the input to the box and the output (i.e the error) coming out of the box. We will call this box the current box, and call the input to this box the current input$\{c_1,c_2,c_3,c_4,c_5,c_6\}$. It is important to recognize that the functional relationship between the current input to the box and the output emanating from the box is completely defined and can be computed. Let us denote this function as $E(c_1,c_2,c_3,c_4,c_5,c_6)$.
Figure 3: The box contains parts of the neural network hidden from view. It allows us to think about broad relationships among different parts of the network.

2. Sensitivity

As our journey through backpropagation continues, I gently request you to assume that the vector of partial derivatives $\frac{\partial E}{\partial c_1},\frac{\partial E}{\partial c_2}, \ldots
$ of the function $E(c_1,c_2,c_3,c_4,c_5,c_6)$ is known. This might seem like a stretch. After all, we have set out to find a method to compute some other (equally unrealized) partial derivatives. But I assure you it will all work out in the end. To emphasize the crucial nature of this simple concept, it has been given a name: Sensitivity. Let us denote the sensitivity of our current box as $\{\text{$\delta $c}_1,\text{$\delta $c}_2,\text{$\delta $c}_3,\text{$\delta
   $c}_4,\text{$\delta $c}_5,\text{$\delta $c}_6\}$. With the help of Figure 3 to hold these concepts in our mind, we can concretely think about how the output of the current box responds to a small perturbation applied to any of its current inputs. For example, if the fourth component of the current input changes by the small amount $\Delta c_4,$ we can expect the error at the output to change by $\Delta c_4 \delta c_4.$ Further, in addition to the hypothetical change in component 4, if there is a simultaneous change of $\Delta c_6$ in component 6, we can expect the error at the output to change by an additional amount, making the total change $\Delta c_4 \delta c_4 + \Delta c_6 \delta c_6.$ The effect of small simultaneous changes in the current input components simply add up at the output.

Knowing the sensitivity of the current box, what can we say about the sensitivity of the preceding box? Keep in mind that the preceding box encloses Layer 2 and all following layers, including the error calculating block. For our specific example, let us call the input to this box the preceding input, $\{p_1,p_2,p_3,p_4.\}$ It follows quite logically that the sensitivity of the preceding box (which we will naturally denote as $\{\text{$\delta $p}_1,\text{$\delta $p}_2,\text{$\delta $p}_3,\text{$\delta
   $p}_4\}$) must be related to the sensitivity of the current box and the extra neural network elements making up the difference between the two nested boxes. The extra elements are the very vital nonlinear activation function units, summing junctions and weights.

Figure 4(a) shows the current box and the extra elements that must be added to construct the preceding box. For clarity, all the elements not relevant to the calculation of the first component of sensitivity ($\delta p_1$) have been grayed out. Look closely at Figures 4(a) and 4(b) to understand how the sensitivity of the preceding box can easily be derived from first principles. Specifically, Figure 4(b) provides insight into how $\delta p_1(=\frac{\partial e}{\partial p_1})$ can be computed  by allowing the input component $p_1$ to change by a small quantity $\Delta p_1$ and following the resulting changes in the network. Notes: (i) The notation $\mathcal{A}'\left(p_1\right)$ has been used for the derivative of the activation function evaluated at $p_1.$ (ii) For clarity, not all changes in signals have been explicitly labeled. Those that are not labeled can easily be determined since they all follow an obvious pattern.
Figure 4(a) shows the current box and the extra elements that must be added to construct the preceding box. Figure 4(b) provides insight into how $\delta p_1(=\frac{\partial e}{\partial p_1})$ can be computed  by allowing the input component $p_1$ to change by a small quantity $\Delta p_1$ and following the resulting changes in the network.
This is indeed a deep result. (And one which we have managed to arrive at without recourse to tedious calculus and numbered equations.) By repeated application, this result allows us to work backwards and calculate the sensitivity of the error to changes in the input at every layer.

The algorithm gets the name backpropagation because the sensitivities are propagated backwards as they are calculated in sequence. The textbook formula to express the sensitivity of the preceding layer in terms of the sensitivity of the current layer is easily seen to be
\[\delta p_i = \mathcal{A}'(p_i) \sum _{j} w_{i j}\delta c_j \]

A starting point is all we need to completely calculate all the sensitivity terms throughout the neural network. To do this, we consider the error computing block itself as the first box. For this box, the input is $\left\{o_1,o_2,o_3,o_4\right\},$ and the output is $e$ as given in the sum-of-squares error formula we have seen before. Simple calculus gives us the components of the sensitivity of the error computing block
2 \left(o_1-t_1\right),
2 \left(o_2-t_2\right),
2 \left(o_3-t_3\right),
2 \left(o_4-t_4\right)

3. Weight updates

At this point, the last section writes itself. Following the same strategy outlined in the previous figure, look at Figure 5(a) and 5(b) to intuitively understand how the error changes in response to a small change in one of the weights, say $w_{1 1}$. Once again in these figures, details of connections not immediately relevant to this calculation have been grayed out. The much sought after partial derivative of error with respect to the specific weight $w_{1 1}$ is easily seen to be $\mathcal{A}\left(p_1\right) \delta c_1$. The textbook formula to compute the partial derivative of the error with respect to any weight is easily seen to be
\[\frac{\partial e}{\partial w_{i j}} =\mathcal{A}\left(p_i\right) \delta c_j \]
Figure 5: Another set of figures designed to provide insight into the error changes in response to a small change in one of the weights, $w_{1 1}$. Using this intuition, the partial derivative $\frac{\partial e}{\partial w_{1 1}}$ can be computed as $\mathcal{A}\left(p_1\right) \delta c_1$.
Now that we have a formula to calculate the partial derivative of error with respect to every weight in the network, we can proceed to iteratively refine the weights and minimize the error using the method of steepest descent.

In the most popular version of backpropagation, called stochastic backpropagation, the weights are initially set to small random values and the training set is randomly polled to pick out a single input-target pair. The input is passed through the network to compute internal signals (like $\mathcal{A}\left(p_1\right)$ and $\mathcal{A}'\left(p_1\right)$ shown in Figures 4 and 5) and the output vector. Once this is done, all the information needed to initiate backpropagation becomes available. The partial derivatives of error with respect to the weights can be computed, and the weights can be refined with intent to reduce the error. The process is iterated using another randomly chosen input-target pair.

The miraculous feat of cognition

I am in awe of the miraculous feat of cognition that lead early neural network researchers to arrive at the backpropagation algorithm. They clearly had the ability to see patterns and make elegant groupings which ultimately made it possible to train huge networks. Their work not only resulted in the neural network applications we use today, but have also inspired a host of other related algorithms which depend on error minimization.
Although this algorithm has been presented here as a single established method, it should be regarded as a framework. In my experience, appreciating how an algorithm is derived leads to insight which makes it possible to explore beneficial variations. The process of designing robust and efficient scientific algorithms frequently leads us to regard established frameworks as substrates upon which to build new and better algorithms.

Note: A PDF version of this article with high quality graphics and pseudocode ("Training Wheels for Training Neural Networks") for implementing the algorithm is available. Please visit www.numericinsight.com and find Downloads on the home page.