Neural Networks
Geoff Hulten

The Human Brain (According to a computer scientist)
- Network of ~100 billion neurons
- Each with ~1,000-10,000 connections
- Neurons send electro-chemical signals
- Activation time ~10 ms
- ~100-neuron chain in 1 second
- Image from Wikipedia

Artificial Neural Network
- Grossly simplified approximation of how the brain works
- Artificial neuron (sigmoid unit)
- Features used as input to an initial set of artificial neurons
- Output of artificial neurons used as input to others
- Output of the network used as prediction
- Mid-2010s image processing: ~50-100 layers, ~10-60 million artificial neurons
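The artificial neuron (sigmoid unit) named above can be sketched as a weighted sum of its inputs plus a bias, passed through the sigmoid function. This is a minimal illustration; the function names and the sample values are assumptions chosen for the example, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_unit(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus a bias, then sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)

# Illustrative values only (hypothetical, chosen so the weighted sum is 0).
activation = sigmoid_unit(inputs=[1.0, 0.5], weights=[-1.0, 1.0], bias=0.5)
print(activation)  # 0.5, because sigmoid(0) = 0.5
```

In a network, the outputs of units like this one become the inputs of the units in the next layer.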

For scale: 50 million artificial neurons into 100 billion biological neurons is 0.05%.

Example Neural Network
- Fully connected network
- Single hidden layer
- Input: 576 pixels (normalized)
- Each of the 4 hidden nodes: 1 connection per pixel + bias
- 2,313 weights to learn
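The 2,313 figure can be checked by counting one weight per incoming connection plus one bias for every node. The sketch below assumes the layer sizes 576, 4, and 1 (pixels, hidden nodes, output node); `count_weights` is a hypothetical helper name.

```python
def count_weights(layer_sizes):
    """Count weights in a fully connected network with one bias per node.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

hidden_weights = count_weights([576, 4])   # 4 nodes x (576 pixels + 1 bias)
output_weights = count_weights([4, 1])     # 1 node x (4 hidden outputs + 1 bias)
total_weights = count_weights([576, 4, 1])
print(hidden_weights, output_weights, total_weights)  # 2308 5 2313
```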

Example network, weights by layer
- Input layer
- Hidden layer: 2,308 weights
- Output layer: 5 weights
- Output: P(y = 1)

Decision Boundary for Neural Networks
- Panels: Concept; Linear Model; 20 Hidden Nodes
- Linear Model: neural network with single node in output layer (no hidden layer)
- Non-linear decision boundary
  - Enabled by non-linear (sigmoid) activation
  - Complexity via network structure

Decision boundary panels (1 Layer Neural Network): 10 Hidden Nodes; 6 Hidden Nodes; 4 Hidden Nodes. Annotation: Underfitting.

Example of Predicting with Neural Network
- Sigmoid function
- Layers: input layer, hidden layer, output layer
- Values shown on the figure: 0.5, -1.0, 1.0, 0.0, 1.0, 1.5, 0.5, 1.0, 0.25, 1.0, 1.0, 1.0, 0.5, -1.0
- Hidden layer activations: ~0.5 and ~0.75
- Output: P(y = 1) = 0.82

Example for Blink Task
- Very limited feature engineering on input: scale, normalize
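The "Example of Predicting with Neural Network" slide can be traced in code. The figure layout did not survive, so the assignment of its numbers to inputs, biases, and weights below is an assumption: it is one assignment that reproduces the weighted sums 0.0, 1.0, and roughly 1.5 and the activations ~0.5, ~0.75, and 0.82 listed on the slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(bias, weights, inputs):
    """Sigmoid unit: bias plus weighted sum of inputs, passed through sigmoid."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))

# ASSUMPTION: reconstructed mapping of the slide's numbers (not confirmed by the source).
x = [1.0, 0.5]                              # two input features
h1 = unit(0.5, [-1.0, 1.0], x)              # weighted sum 0.0 -> activation 0.5
h2 = unit(1.0, [0.5, -1.0], x)              # weighted sum 1.0 -> activation about 0.73
y_hat = unit(0.25, [1.0, 1.0], [h1, h2])    # weighted sum about 1.5

print(round(h1, 2), round(h2, 2), round(y_hat, 2))
```

Without intermediate rounding the second hidden activation is about 0.73 and the output about 0.81; feeding the slide's rounded ~0.75 into the output node gives a weighted sum of exactly 1.5 and the slide's 0.82.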

Example for Blink Task (continued)
- Hidden nodes learn useful features so you don't have to
- Trick is figuring out how many neurons to use and how to organize them
- Figure: input image (normalized); weights from hidden node 1; weights from hidden node 2; annotations "Positive weight?" and "Negative weight?"
- Output P(y = 1): logistic regression with responses as input

Multi-Layer Neural Networks

- Fully connected network
- Two hidden layers
- 2,333 weights to learn
- Filters on filters, for example maybe:
  - Layer 1 learns eye shapes
  - Layer 2 learns combinations
- Input: 576 pixels (normalized)
- First hidden layer: 1 connection per pixel + bias for each of 4 nodes (2,308 weights)
- Second hidden layer: 20 weights
- Output layer: 5 weights, output P(y = 1)
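The two-hidden-layer network above can be sketched as a list of layers, where each layer's activations feed the next. This is an illustrative sketch: the layer sizes 576, 4, 4, 1 are inferred from the weight counts, while the random weights and the random stand-in image are assumptions made only so the code runs.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_network(layer_sizes, rng):
    """Build layers of nodes; each node is a weight list [bias, w_1, ..., w_n]."""
    return [
        [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

def forward(network, inputs):
    """Propagate inputs through every layer and return the last layer's activations."""
    activations = list(inputs)
    for layer in network:
        activations = [
            sigmoid(node[0] + sum(w * a for w, a in zip(node[1:], activations)))
            for node in layer
        ]
    return activations

rng = random.Random(0)
network = make_network([576, 4, 4, 1], rng)
n_weights = sum(len(node) for layer in network for node in layer)
pixels = [rng.random() for _ in range(576)]   # stand-in for a normalized image
prediction = forward(network, pixels)[0]

print(n_weights)  # 2333 = 2308 + 20 + 5
```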

Decision Boundary for Multi-Layer Neural Networks
- Concept
- 1 Hidden Layer Neural Network panels: 20, 10, 6, and 4 hidden nodes
- 2 Hidden Layer Neural Network panels: 20, 10, 6, and 4 nodes per layer

Decision boundary panels (continued): Linear Model; Best Fit; Did not converge
- Much more powerful
  - Difficult to converge
  - Easy to overfit
  - Later lecture: how to adapt

Output layer with multiple outputs
- Single network (training run), multiple tasks
- The output is a vector, not a single value
- Hidden nodes learn generally useful filters
- Input: 576 pixels (normalized)

Figure labels: hidden layer, output layer

Neural Network Architectures/Concepts
- Fully connected layers
- Recurrent networks (LSTM & attention)
- Convolutional layers
- Embeddings
- MaxPooling
- Residual networks
- Activation (ReLU)
- Batch normalization
- Softmax
- Dropout
Will explore in more detail later

Loss For Neural Networks
- Mean Squared Error (MSE): in book
  - Example: $\frac{1}{2}\left((.5 - 1)^2 + (.1 - 0)^2\right) = .13$ (the slide shows .135)
- Cross Entropy (BCE): use for assignment
- Table values: .5, .1, 1, 0, .1, .95, 1, 1

Optimizing Neural Nets: Back Propagation
- Gradient descent over the entire network's weight vector
- Easy to adapt to different network architectures
- Converges to local minimum (usually won't find global minimum)
- Training can be very slow!
  - For this week's assignment... sorry
  - For next week we'll use public neural network software
- In general very well suited to run on GPU
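The two losses named on the "Loss For Neural Networks" slide can be sketched as follows. The MSE form (half the sum of squared errors) follows the surviving "1/2 (...)" fragment of the slide's example. The slide's cross entropy formula did not survive, so `bce` uses the standard mean binary cross entropy definition as an assumption.

```python
import math

def mse(predictions, labels):
    """Half the sum of squared errors, matching the 1/2 factor in the slide's example."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predictions, labels))

def bce(predictions, labels):
    """Mean binary cross entropy (standard definition, assumed here).

    Predictions must lie strictly between 0 and 1.
    """
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(predictions, labels)
    )
    return -total / len(labels)

# Predictions .5 and .1 against labels 1 and 0, as in the slide's MSE example.
print(round(mse([0.5, 0.1], [1, 0]), 3))  # 0.13
print(round(bce([0.5, 0.1], [1, 0]), 3))
```

Cross entropy penalizes confident wrong predictions much more heavily than MSE does, which is one common reason to prefer it for classification.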

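Conceptual Backprop with MSE
1. Forward Propagation: figure out how much error the network makes on the sample: $\text{Error} = y - \hat{y}$
2. Back Propagation: figure out how much each part contributes to the error. With MSE:
   - Output node: $\delta_o = \hat{y}(1 - \hat{y})(y - \hat{y})$
   - Hidden node: $\delta_h = o_h(1 - o_h)\,w_h\,\delta_o$
3. Update Weights: step each weight to reduce the error it is contributing to: $w \leftarrow w + \eta\,\delta\,x$
Here $\hat{y}$ is the network output, $y$ the label, $o_h$ the output of hidden node $h$, $w_h$ the weight from $h$ to the output node, $\eta$ the step size, and $x$ the input that the weight multiplies.
Figure values: h1 ~0.5, h2 ~0.75, output ~0.82

Backprop Example
1. Forward Propagation: h1 ~0.5, h2 ~0.75, output ~0.82, Error = ~0.18
2. Back Propagation:
   - $\delta_o = \hat{y}(1 - \hat{y})(y - \hat{y}) = (0.82)(0.18)(0.18) \approx 0.027$
   - $\delta_{h2} = o_{h2}(1 - o_{h2})\,w_{h2}\,\delta_o = (0.75)(0.25)(1.0)(0.027) \approx .005$
3. Update Weights with $\eta = 0.1$: each weight into h2 changes by $\eta\,\delta_{h2}\,x$, which is 0.0005 for the bias (x = 1), 0.0005 for the input 1.0, and 0.00025 for the input 0.5
Same network and values as the prediction example.

Backprop Algorithm
- Initialize all weights to small random number (-0.05 to 0.05)
- While not time to stop, repeatedly loop over training data:
  - Input a single training sample to network and calculate the output for every neuron
  - Back propagate the errors from the output to every neuron: $\delta_h = o_h(1 - o_h)\sum_k w_{h,k}\,\delta_k$, where $o_h(1 - o_h)$ is this node's effect on error and the sum is the downstream error
  - Update every weight in the network
- Stopping criteria:
  - # of epochs (passes through data)
  - Training set loss stops going down
  - Accuracy on validation data

Backprop with Hidden Layer (or multiple outputs)
1. Forward Propagation
2. Back Propagation
   - Output node: $\delta_o = \hat{y}(1 - \hat{y})(y - \hat{y})$
   - Hidden node: $\delta_h = o_h(1 - o_h)\sum_k w_{h,k}\,\delta_k$, summing over the nodes $k$ that $h$ feeds into
3. Update Weights
Figure: two hidden layers with nodes h1,1 and h1,2 feeding h2,1 and h2,2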
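The three steps of the backprop example (forward propagation, back propagation, update weights) can be traced on a small network with two inputs, two hidden sigmoid nodes, and one output. The inputs, weights, and label below are an assumption: they are one assignment of the figure's numbers that reproduces the activations and deltas the slides list. The deltas use the MSE-with-sigmoid form, output(1 - output)(label - output) for the output node, and activation(1 - activation) times the downstream weight and delta for a hidden node.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h1, w_h2, w_out):
    """Each weight list is [bias, w_1, w_2]. Returns (h1, h2, y_hat)."""
    h1 = sigmoid(w_h1[0] + w_h1[1] * x[0] + w_h1[2] * x[1])
    h2 = sigmoid(w_h2[0] + w_h2[1] * x[0] + w_h2[2] * x[1])
    y_hat = sigmoid(w_out[0] + w_out[1] * h1 + w_out[2] * h2)
    return h1, h2, y_hat

def step(weights, delta, inputs, eta):
    """w <- w + eta * delta * input, with a constant input of 1 for the bias."""
    return [w + eta * delta * v for w, v in zip(weights, [1.0] + list(inputs))]

# ASSUMPTION: reconstructed values, not confirmed by the source.
x, y, eta = [1.0, 0.5], 1.0, 0.1
w_h1, w_h2, w_out = [0.5, -1.0, 1.0], [1.0, 0.5, -1.0], [0.25, 1.0, 1.0]

# 1. Forward propagation
h1, h2, y_hat = forward(x, w_h1, w_h2, w_out)

# 2. Back propagation
delta_out = y_hat * (1 - y_hat) * (y - y_hat)
delta_h1 = h1 * (1 - h1) * w_out[1] * delta_out
delta_h2 = h2 * (1 - h2) * w_out[2] * delta_out

# 3. Update weights
new_w_out = step(w_out, delta_out, [h1, h2], eta)
new_w_h1 = step(w_h1, delta_h1, x, eta)
new_w_h2 = step(w_h2, delta_h2, x, eta)

_, _, new_y_hat = forward(x, new_w_h1, new_w_h2, new_w_out)
print(round(delta_out, 3), round(delta_h2, 3), new_y_hat > y_hat)
```

Without intermediate rounding the deltas come out near 0.028 and 0.0055, close to the slide's rounded 0.027 and .005. After one update the output moves slightly toward the label.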

Stochastic Gradient Descent
- Gradient descent: calculate gradient on all samples, then step
- Stochastic gradient descent: calculate gradient on some samples, then step
- Per-sample gradient
- Stochastic can make progress faster (large training set)
- Stochastic takes a less direct path to convergence
- Batch size: N instead of 1
- Figure panels: Gradient Descent; Stochastic Gradient Descent

Local Optimum and Momentum
- Figure: local optimum on a loss curve
- Why is this okay? In practice: neural networks overfit
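The difference between gradient descent and stochastic gradient descent comes down to how many samples contribute to each step. The sketch below trains a single sigmoid unit with the MSE delta rule and a configurable batch size. The OR dataset, the step size, and the epoch count are illustrative assumptions, and `train` and `predict` are hypothetical helper names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """weights is [bias, w_1, ..., w_n]."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def train(samples, batch_size, epochs=3000, eta=0.5, seed=0):
    """Batch size 1 is per-sample SGD; batch size len(samples) is gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    data = list(samples)
    for _ in range(epochs):                       # stopping criterion: # of epochs
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gradient_step = [0.0] * len(weights)
            for x, y in batch:
                y_hat = predict(weights, x)
                delta = y_hat * (1.0 - y_hat) * (y - y_hat)
                for i, v in enumerate([1.0] + list(x)):
                    gradient_step[i] += delta * v
            # Step once per batch, averaging the per-sample contributions.
            weights = [
                w + eta * g / len(batch) for w, g in zip(weights, gradient_step)
            ]
    return weights

OR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
sgd_weights = train(OR_DATA, batch_size=1)
gd_weights = train(OR_DATA, batch_size=len(OR_DATA))
print([round(predict(sgd_weights, x), 2) for x, _ in OR_DATA])
```

With batch size 1 the weights change after every sample, so each pass over the data takes four steps here instead of one. On a large training set this is why the stochastic version can make progress faster, at the cost of a noisier path.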

Local Optimum and Momentum (continued)
- Momentum
  - Power through local optimums
  - Converge faster (?)
- Figure axis: parameters

Dead Neurons & Vanishing Gradients
- Neurons can die
- Large weights (positive or negative) cause gradients to vanish
- Test: assert if this condition occurs
- What causes this:
  - Poor initialization of weights
  - Optimization that gets out of hand
  - Input variables unnormalized

What should you do with Neural Networks?
- As a model (similar to others we've learned)
  - Fully connected networks
  - Few hidden layers (1, 2, 3)
  - A few dozen nodes per hidden layer
- Leveraging recent breakthroughs
  - Understand standard architectures
  - Get some GPU acceleration
  - Get lots of data
- Do some feature engineering
  - Normalization
- Tune parameters
  - # layers
  - # nodes per layer
- Be careful of overfitting
- Simplify if not converging
- Craft a network architecture
- More on this next class

Summary of Artificial Neural Networks
- Model that very crudely approximates the way human brains work
- Neural networks learn features (which we might have hand-crafted without them)
- Each artificial neuron is a linear model, with non-linear activation function
- Many options for network architectures
- Neural networks are very expressive, can learn complex concepts (and overfit)
- Backpropagation is a flexible algorithm to learn neural networks
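One point from the "Dead Neurons & Vanishing Gradients" slide can be made concrete. A sigmoid unit's contribution to the gradient is scaled by output times (1 - output), which collapses toward zero when large weights push the weighted sum far from zero. The sketch below shows the effect and one possible form of the suggested assertion. The threshold and the `is_saturated` helper are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_gradient(z):
    """Derivative of sigmoid at z: output * (1 - output). Peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def is_saturated(z, threshold=1e-4):
    """Hypothetical check: flag a unit whose gradient factor is nearly zero."""
    return sigmoid_gradient(z) < threshold

for z in (0.0, 5.0, 20.0, -20.0):
    print(z, sigmoid_gradient(z), is_saturated(z))

# A training loop could assert on this condition to catch dying neurons early.
assert not is_saturated(0.0)
```

Normalized inputs and small initial weights keep the weighted sums near zero, where the gradient factor is largest.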