2806 Neural Computation, Lecture 4 (2005), Ari Visa
Multilayer Neural Networks

Agenda: some historical notes; multilayer neural networks; backpropagation; error surfaces and feature mapping; speed-ups; conclusions.

Some historical notes
Rosenblatt's perceptron (1958): a single neuron with adjustable synaptic weights and bias. Perceptron convergence theorem (Rosenblatt, 1962): if the patterns used to train the perceptron are drawn from two linearly separable classes, the algorithm converges and positions the decision surface as a hyperplane between the two classes. The limitations of networks implementing linear discriminants were well known in the 1950s and 1960s.

A single neuron -> the class of solutions that can be obtained is not general enough -> multilayer perceptron. Widrow & Hoff (1960): the least-mean-square (LMS) algorithm = delta rule; three-layer networks designed by hand (fixed input-to-hidden weights, trained hidden-to-output weights). The development of backpropagation drew on Kalman filtering (Kalman 1960; Bryson, Denham, Dreyfus 1963, 1969).

Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974. Parker 1982, 1985. Rumelhart, Hinton, Williams: Learning representations by back-propagating errors, Nature 323, pp. 533-536, 1986.

Multilayer Neural Networks
Input layer, hidden layer, output layer, bias unit. Each neuron computes
net_j = w_j^T x
y_j = f(net_j), with e.g. f(net) = sgn(net)
z_k = f(net_k)
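A minimal sketch of this feedforward computation with sign units. The weights below are hand-set, not learned; this particular choice anticipates the XOR construction discussed next, and the exact values are one illustrative option.

```python
def sgn(net):
    """Threshold nonlinearity f(net) = sgn(net), returning +1 or -1."""
    return 1.0 if net > 0 else -1.0

def xor_net(x1, x2):
    """Two-layer sign-unit network computing XOR on +-1 inputs."""
    y1 = sgn(x1 + x2 + 0.5)   # fires unless both inputs are -1: realizes OR
    y2 = sgn(x1 + x2 - 1.5)   # fires only when both inputs are +1: realizes AND
    # Output combines the hidden codes as (x1 OR x2) AND NOT (x1 AND x2);
    # the output weights (1, -1) and bias -0.5 are an illustrative choice.
    return sgn(y1 - y2 - 0.5)
```

Running `xor_net` on all four +-1 input pairs reproduces the XOR truth table.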

The XOR problem: (x1 OR x2) AND NOT (x1 AND x2). With inputs and targets coded as +-1:
y1: x1 + x2 + 0.5 > 0 -> y1 = +1, otherwise y1 = -1
y2: x1 + x2 - 1.5 > 0 -> y2 = +1, otherwise y2 = -1

Any desired continuous function can be implemented by a three-layer network, given a sufficient number of hidden units, proper nonlinearities, and weights (Kolmogorov). Feedforward operation (Demo chapter 11).

Nonlinear multilayer networks: how to set the weights of a three-layer neural

network based on training patterns and the desired output?

Backpropagation Algorithm
The LMS algorithm exists for linear systems. Training error:
J(w) = (1/2) Σ_k (t_k - z_k)²
Learning rule: Δw = -η ∂J/∂w. Hidden-to-output:
Δw_kj = η δ_k y_j, with δ_k = (t_k - z_k) f'(net_k)
Input-to-hidden:
Δw_ji = η δ_j x_i, with δ_j = f'(net_j) Σ_k w_kj δ_k
Note that the weights w_ij are initialized with random values. Demo chapter 11.
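A pure-Python sketch of stochastic backpropagation for a 2-2-1 tanh network on the XOR data. The network size, learning rate, epoch count, and initialization range are illustrative choices, not values from the lecture.

```python
import math, random

def dtanh(y):
    """Derivative of tanh expressed through its output y = tanh(net)."""
    return 1.0 - y * y

def train_xor(epochs=3000, eta=0.1, seed=0):
    """Stochastic backprop for a 2-2-1 network; returns (initial, final) training error."""
    rnd = random.Random(seed)
    # w_h[j] = [w_j1, w_j2, bias]; w_o = [w_1, w_2, bias]
    w_h = [[rnd.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    w_o = [rnd.uniform(-0.5, 0.5) for _ in range(3)]
    data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

    def forward(x):
        y = [math.tanh(wj[0] * x[0] + wj[1] * x[1] + wj[2]) for wj in w_h]
        z = math.tanh(w_o[0] * y[0] + w_o[1] * y[1] + w_o[2])
        return y, z

    def total_error():
        return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in data)

    err_start = total_error()
    for _ in range(epochs):
        x, t = data[rnd.randrange(4)]               # stochastic: pick a random pattern
        y, z = forward(x)
        delta_o = (t - z) * dtanh(z)                # output sensitivity delta_k
        delta_h = [dtanh(y[j]) * w_o[j] * delta_o   # hidden sensitivities delta_j
                   for j in range(2)]
        for j in range(2):                          # hidden-to-output: eta * delta_k * y_j
            w_o[j] += eta * delta_o * y[j]
        w_o[2] += eta * delta_o
        for j in range(2):                          # input-to-hidden: eta * delta_j * x_i
            w_h[j][0] += eta * delta_h[j] * x[0]
            w_h[j][1] += eta * delta_h[j] * x[1]
            w_h[j][2] += eta * delta_h[j]
    return err_start, total_error()
```

The hidden sensitivities are computed before the hidden-to-output weights are updated, matching the order of the learning rules above.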

Compare with the LMS algorithm. 1) Method of Steepest Descent: the direction of steepest descent is opposite to the gradient vector g = ∇E(w):
w(n+1) = w(n) - η g(n)
where η is the step size or learning-rate parameter.
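A one-dimensional sketch of this update; the cost function, learning rate, and step count in the example are illustrative.

```python
def steepest_descent(w, grad, eta=0.1, steps=50):
    """Iterate w(n+1) = w(n) - eta * g(n) for a scalar weight."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Example: E(w) = w^2 has gradient g(w) = 2w, so each step multiplies w by
# (1 - 2*eta) and the iterate shrinks toward the minimum at 0.
w_final = steepest_descent(1.0, lambda w: 2.0 * w)
```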

Backpropagation Algorithm
Training set = a set of patterns with known labels.
Stochastic training = patterns are chosen randomly from the training set.
Batch training = all patterns are presented to the network before learning takes place.
On-line protocol = each pattern is presented once and only once; no memory is used.
Epoch = a single presentation of all patterns in the training set. The number of epochs is an indication of the relative amount of learning.
Stopping criterion: split the data into a learning set, a validation set, and a test set; monitor the learning curve, the average error per pattern; cross-validation.

Error Surfaces and Feature Mapping
Note that error backpropagation is based on gradient descent in a criterion function J(w) represented over the space of weights.

The total training error is minimized; it usually decreases monotonically, even though this is not the case for the error on each individual pattern.

Hidden-to-output weights act like a linear discriminant; input-to-hidden weights act like matched filters.

Practical Techniques for Improving Backpropagation

How to improve convergence, performance, and results?
Neuron: choose a sigmoid function centered at zero and antisymmetric.
Scaling input variables: the input patterns should be shifted so that the average of each feature over the training set is zero, and the full data set should be scaled to have the same variance in each feature component. Note that this standardization can only be done for stochastic and batch learning!
When the training set is small, one can generate surrogate training patterns. In the absence of problem-specific information, the surrogate patterns should be made by adding d-dimensional Gaussian noise to true training points; the category label should be left unchanged. If we know about the source of variation among patterns, we can manufacture training data accordingly. The number of hidden units should be less than the total number of training points n, say roughly n/10.
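A pure-Python sketch of both recipes, standardization and Gaussian-noise surrogates. The noise scale and number of surrogates per point are illustrative parameters.

```python
import random

def standardize(data):
    """Shift each feature to zero mean and scale to unit variance over the training set."""
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    stds = []
    for i in range(d):
        var = sum((row[i] - means[i]) ** 2 for row in data) / n
        stds.append(var ** 0.5 or 1.0)   # guard against constant features
    return [[(row[i] - means[i]) / stds[i] for i in range(d)] for row in data]

def surrogates(points, labels, per_point=5, sigma=0.1, seed=0):
    """Surrogate patterns: d-dimensional Gaussian noise around true points."""
    rnd = random.Random(seed)
    new_x, new_y = [], []
    for x, lab in zip(points, labels):
        for _ in range(per_point):
            new_x.append([xi + rnd.gauss(0.0, sigma) for xi in x])
            new_y.append(lab)            # category label left unchanged
    return new_x, new_y
```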

We cannot initialize the weights to 0. Initializing weights for uniform learning -> choose weights randomly from a single distribution. Input-to-hidden weights: -1/√d < w_ji < +1/√d, where d is the number of input units. Hidden-to-output weights: -1/√n_H < w_kj < +1/√n_H, where n_H is the number of hidden units.
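A sketch of this initialization; the layer sizes in the example are illustrative.

```python
import random

def init_weights(d, n_h, n_out, seed=None):
    """Uniform initialization in +-1/sqrt(fan-in) for each layer."""
    rnd = random.Random(seed)
    a = 1.0 / d ** 0.5     # input-to-hidden range: -1/sqrt(d) .. +1/sqrt(d)
    b = 1.0 / n_h ** 0.5   # hidden-to-output range: -1/sqrt(n_H) .. +1/sqrt(n_H)
    w_ji = [[rnd.uniform(-a, a) for _ in range(d)] for _ in range(n_h)]
    w_kj = [[rnd.uniform(-b, b) for _ in range(n_h)] for _ in range(n_out)]
    return w_ji, w_kj
```

Drawing every weight of a layer from the same distribution is what gives each unit the same expected learning speed.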

Learning rates (Demo chapter 9): the optimal rate. Momentum allows the network to learn more quickly when plateaus in the error surface exist (Demo chapter 12); a typical momentum value is α ≈ 0.9.
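One common formulation of the momentum update blends the current backprop step with the previous weight change, Δw(m) = (1 - α) Δw_bp(m) + α Δw(m-1); this particular blend is an assumption, one convention among several.

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """One momentum update: dw(m) = (1 - alpha) * (-eta * g(m)) + alpha * dw(m-1)."""
    dw = [(1.0 - alpha) * (-eta * g) + alpha * p for g, p in zip(grad, prev_dw)]
    return [wi + d for wi, d in zip(w, dw)], dw
```

On a plateau the gradient term is tiny, but the α Δw(m-1) term keeps the weights moving in the previously successful direction.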

Weight decay, to avoid overfitting: start with a network with too many weights and decay all the weights during training,
w_new = w_old (1 - ε), where 0 < ε < 1.
The weights that are not needed for reducing the error function become smaller and smaller -> effectively eliminated.
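As a sketch, the decay is applied once per training step after the backprop update; the value of ε here is illustrative.

```python
def decay_weights(weights, eps=0.01):
    """Weight decay: w_new = w_old * (1 - eps)."""
    return [w * (1.0 - eps) for w in weights]
```

A weight that the error gradient never reinforces shrinks geometrically, like (1 - ε)^m after m steps.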

If we have insufficient training data for the desired classification accuracy, learning with hints adds output units that address an ancillary problem, one different from but related to the specific classification problem at hand.

Stopped training. Number of hidden layers: typically 2-3. Criterion function: cross-entropy, Minkowski error.

Second-Order Methods to speed up learning: Newton's method (demo chapter 9), Quickprop, and conjugate gradient descent, which requires batch

training (demo chapter 9).

2) Newton's method: the idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n), using a second-order Taylor series expansion of the cost function around that point.

ΔE(w(n)) ≈ g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)
g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (second derivative), H = ∇²E(w).

H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w. Differentiating ΔE(w(n)) ≈ g^T(n)Δw(n) + (1/2)Δw^T(n)H(n)Δw(n) with respect to Δw, the change ΔE(n) is minimized when
g(n) + H(n) Δw(n) = 0 -> Δw(n) = -H⁻¹(n) g(n)
w(n+1) = w(n) + Δw(n) = w(n) - H⁻¹(n) g(n)
where H⁻¹(n) is the inverse of the Hessian of E(w).
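A two-dimensional sketch of the Newton update; the quadratic cost used in the example is an illustrative choice.

```python
def newton_step(w, grad, hess):
    """One Newton update w(n+1) = w(n) - H^-1(n) g(n) for a 2-D cost."""
    g1, g2 = grad
    (a, b), (c, d) = hess
    det = a * d - b * c             # H must be nonsingular (ideally positive definite)
    dw1 = (d * g1 - b * g2) / det   # solve H * dw = g via the 2x2 inverse
    dw2 = (a * g2 - c * g1) / det
    return (w[0] - dw1, w[1] - dw2)

# For the quadratic cost E(w) = (1/2)(w1^2 + 10 w2^2), the gradient at (3, 1)
# is (3, 10) and H = [[1, 0], [0, 10]]; one Newton step from any starting
# point lands exactly on the minimum at (0, 0) -- no zigzagging.
w_min = newton_step((3.0, 1.0), (3.0, 10.0), ((1.0, 0.0), (0.0, 10.0)))
```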

Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior. Newton's method requires that the Hessian H(n) be a positive definite matrix for all n!

3) Gauss-Newton Method: it is applicable to a cost function that is expressed as

the sum of error squares:
E(w) = (1/2) Σ_{i=1..n} e²(i)
Note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n. The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing
e'(i, w) = e(i) + [∂e(i)/∂w]^T |_{w=w(n)} (w - w(n)), i = 1, 2, ..., n
In matrix form, e'(n, w) = e(n) + J(n)(w - w(n)), where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]^T and J(n) is the n-by-m Jacobian matrix of e(n) (the Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n)). Then
w(n+1) = arg min_w (1/2) ||e'(n, w)||²
(1/2) ||e'(n, w)||² = (1/2) ||e(n)||² + e^T(n) J(n)(w - w(n)) + (1/2)(w - w(n))^T J^T(n) J(n)(w - w(n))
Differentiating this expression with respect to w and setting the result equal to zero:
J^T(n) e(n) + J^T(n) J(n)(w - w(n)) = 0
w(n+1) = w(n) - (J^T(n) J(n))⁻¹ J^T(n) e(n)
The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n). For the Gauss-Newton iteration to be computable, the matrix product J^T(n)J(n) must be nonsingular. J^T(n)J(n) is always nonnegative definite; to ensure that it is nonsingular, the Jacobian J(n) must have full column rank. -> Add the diagonal matrix δI to the matrix J^T(n)J(n), where the parameter δ is a small positive constant. Then J^T(n)J(n) + δI is positive definite for all n, and the Gauss-Newton method is implemented in the following form:
w(n+1) = w(n) - (J^T(n) J(n) + δI)⁻¹ J^T(n) e(n)
This is the solution to the modified cost function:
E(w) = (1/2) { δ ||w - w(0)||² + Σ_{i=1..n} e²(i) }
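A two-parameter sketch of the regularized Gauss-Newton update, fitting a line t = w1 x + w2 by defining e(i) as the model-minus-target residual; the data points and the value of δ are illustrative.

```python
def gauss_newton_step(w, e, J, delta=1e-3):
    """One Gauss-Newton update with the delta*I safeguard, for m = 2 parameters:
       w(n+1) = w(n) - (J^T J + delta*I)^-1 J^T e(n)."""
    # Accumulate A = J^T J + delta*I (2x2) and b = J^T e (2-vector).
    a11 = sum(r[0] * r[0] for r in J) + delta
    a12 = sum(r[0] * r[1] for r in J)
    a22 = sum(r[1] * r[1] for r in J) + delta
    b1 = sum(r[0] * ei for r, ei in zip(J, e))
    b2 = sum(r[1] * ei for r, ei in zip(J, e))
    det = a11 * a22 - a12 * a12
    dw1 = (a22 * b1 - a12 * b2) / det   # solve A * dw = b via the 2x2 inverse
    dw2 = (a11 * b2 - a12 * b1) / det
    return (w[0] - dw1, w[1] - dw2)

# Fit t = 2x + 1 starting from w = (0, 0): the residual is
# e(i) = w1*x_i + w2 - t_i, so each Jacobian row is (x_i, 1).  Because e is
# linear in w, a single step essentially solves the least-squares problem.
xs, ts = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
w = (0.0, 0.0)
e = [w[0] * x + w[1] - t for x, t in zip(xs, ts)]
J = [(x, 1.0) for x in xs]
w = gauss_newton_step(w, e, J)
```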

where w(0) is the initial value of w.

Regularization, Complexity Adjustment and Pruning
If we have too many weights and train too long -> overfitting. Wald statistic: we can estimate the importance of a parameter in a model. Optimal Brain Damage and Optimal Brain Surgeon: eliminate the parameter having the least importance.

Summary
Multilayer nonlinear neural networks trained by gradient-descent methods such as backpropagation perform a maximum-likelihood estimation of the weight values in the model defined by the network topology. The nonlinearity f(net) at the hidden units allows the networks to form an arbitrary decision boundary, so long as there are sufficiently many hidden units.