# SRNs and some other learning models - Stanford University

## Learning in Recurrent Networks

Psychology 209, February 25 & 27, 2013

## Outline

- Back propagation through time
- Alternatives that can teach networks to settle to fixed points
- Learning conditional distributions
- An application: collaboration of hippocampus & cortex in learning new associations

## Back Propagation Through Time

Error at each unit is the sum of the injected error (arrows) and the back-propagated error; this sum is scaled by the derivative of the activation function to calculate the deltas.

## Continuous back prop through time as implemented in rbp

- Time is viewed as consisting of intervals from 0 to nintervals (tmax).
- Inputs are typically clamped from t = 0 for 2-3 intervals.

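Activation equation (for $t = \Delta t : \Delta t : t_{max}$):

$$net_i(t) = \Delta t \Big( \sum_j a_j(t-\Delta t)\, w_{ij} + b_i \Big) + (1 - \Delta t)\, net_i(t-\Delta t)$$

Calculation of deltas (for $t = t_{max} : -\Delta t : \Delta t$):

$$\delta_j(t) = \Delta t \Big( f'(net_j(t))\, \frac{\partial E}{\partial a_j(t)} \Big) + (1 - \Delta t)\, \delta_j(t+\Delta t)$$

where $\delta_j(t_{max}+\Delta t) = 0$ for all $j$ and

$$\frac{\partial E}{\partial a_j(t)} = \sum_k w_{kj}\, \delta_k(t+\Delta t) + \big( t_j(t) - a_j(t) \big)$$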

Targets are usually provided over the last 2-3 intervals. Then change weights using:

$$\frac{\partial E}{\partial w_{ij}} = \sum_{t=1:\Delta t:t_{max}} a_j(t-1)\, \delta_i(t)$$

Include momentum and weight decay if desired. Use cross-entropy (CE) instead of E if desired:

$$CE = -\sum_i \big[ t_i \log(a_i) + (1 - t_i) \log(1 - a_i) \big]$$
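The activation, delta, and weight-change equations on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the rbp implementation: the function names, the network size, the learning rate, and the use of an additive external input `ext` (soft clamping) in place of hard-clamped input units are all assumptions made for the example. Time is indexed by tick `k`, so `act[k - 1]` plays the role of a(t - Δt).

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, b, ext, dt):
    """Run the activation equation for every tick.

    W[i, j] is the weight to unit i from unit j. ext[k] is an external
    input added to the net input at tick k (assumed soft clamping).
    Row 0 of the returned arrays holds the state at t = 0.
    """
    n_ticks, n_units = ext.shape[0] - 1, ext.shape[1]
    net = np.zeros((n_ticks + 1, n_units))
    act = logistic(net)  # every unit starts at activation 0.5
    for k in range(1, n_ticks + 1):
        net[k] = dt * (W @ act[k - 1] + b + ext[k]) + (1 - dt) * net[k - 1]
        act[k] = logistic(net[k])
    return net, act

def backward(W, act, targets, mask, dt):
    """Run the delta equation backward in time, then sum the weight changes."""
    n_ticks = act.shape[0] - 1
    delta = np.zeros((n_ticks + 2, act.shape[1]))  # delta at tmax + dt stays 0
    for k in range(n_ticks, 0, -1):
        injected = mask[k] * (targets[k] - act[k])   # (t - a) where a target exists
        dE_da = W.T @ delta[k + 1] + injected        # back-propagated + injected error
        fprime = act[k] * (1.0 - act[k])             # derivative of the logistic
        delta[k] = dt * fprime * dE_da + (1 - dt) * delta[k + 1]
    dW = sum(np.outer(delta[k], act[k - 1]) for k in range(1, n_ticks + 1))
    db = delta[1:n_ticks + 1].sum(axis=0)
    return dW, db

def sum_squared_error(act, targets, mask):
    return 0.5 * np.sum(mask * (targets - act) ** 2)

# Hypothetical toy problem: unit 0 receives input for the first 3 ticks,
# unit 3 should be on over the last 3 ticks.
rng = np.random.default_rng(0)
n_units, n_ticks, dt = 4, 12, 0.25
W = rng.normal(scale=0.5, size=(n_units, n_units))
b = np.zeros(n_units)
ext = np.zeros((n_ticks + 1, n_units))
ext[1:4, 0] = 2.0
targets = np.zeros((n_ticks + 1, n_units))
targets[-3:, 3] = 1.0
mask = np.zeros_like(targets)
mask[-3:, 3] = 1.0

losses = []
for epoch in range(200):
    net, act = forward(W, b, ext, dt)
    losses.append(sum_squared_error(act, targets, mask))
    dW, db = backward(W, act, targets, mask, dt)
    W += 0.5 * dW   # with the (t - a) sign convention, adding dW reduces the error
    b += 0.5 * db
```

Because the injected error is written as (t - a), the accumulated sum points in the direction that lowers the summed squared error, so the sketch adds it to the weights rather than subtracting it. Momentum and weight decay are left out to keep the example short.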

## Recurrent Network Used in Rogers et al. Semantic Network Simulation

[Figure]

## Plusses and Minuses of BPTT

- Can learn arbitrary trajectories through state space (figure eights, etc.).
- Works very reliably in training networks to settle to desired target states.
- Biologically implausible.
- Gradient gets very thin over many time steps.

## Several Variants and Alternative Algorithms

(All relevant to networks that settle to a fixed point.)

- Almeida/Pineda algorithm. Discussed in the Williams and Zipser reading, along with many other variants of back prop through time.
- Recirculation and GeneRec. Discussed in the O'Reilly reading.
- Contrastive Hebbian Learning. Discussed in the Movellan and McClelland reading.

## Almeida-Pineda Algorithm

(Notation from O'Reilly, 1996)

Update net inputs ($h$) until they stop changing according to ($\sigma(\cdot)$ = logistic function):

[equation missing from source]

Then update deltas ($y$) until they stop changing according to:

[equation missing from source]

$J$ represents the external error to the unit, if any. Adjust weights using the delta rule.

Assuming symmetric connections:

[equation missing from source]

Only activation is propagated. The time difference of activation reflects the error signal. Maybe this is more biologically plausible than explicit backprop of error?

## Generalized Recirculation

O'Reilly, 1996

- Minus phase: Present input, feed activation forward, compute output, let it feed back, and let the network settle.
- Plus phase: Then clamp both input and output units into the desired state, and let the network settle again.*

[Figure labels: $t_k$; $h_j$, $y_j$; $s_i$]

*Equations neglect the component to the net input at the hidden layer from the input layer.

## A Problem for Backprop and Approximations to It

The average of two solutions may not be a solution.

## Network Must Be Stochastic

- Boltzmann Machine: $P(a = 1) = \mathrm{logistic}(net/T)$
- Continuous Diffusion Network: [equation missing from source] ($g = 1/T$; $Z_i(t)$ is a sample of Gaussian noise)

## Contrastive Hebbian Learning Rule

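1. Present input only (minus phase).
   - Settle to equilibrium (change still occurs, but the distribution stops changing).
   - Do this several times to sample the distribution of states at equilibrium.
   - Collect co-products $a_i^- a_j^-$; avg = $\langle a_i^- a_j^- \rangle$.
2. Present input and targets (plus phase).
   - Collect co-products $a_i^+ a_j^+$; avg = $\langle a_i^+ a_j^+ \rangle$.
3. Change weights according to:

$$\Delta w_{ij} = \varepsilon \big( \langle a_i^+ a_j^+ \rangle - \langle a_i^- a_j^- \rangle \big)$$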
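The minus-phase and plus-phase procedure can be sketched for a small Boltzmann machine with binary units, where each free unit turns on with probability logistic(net/T). This is an illustrative sketch under several assumptions: there are no bias terms, the weight matrix is symmetric with a zero diagonal, statistics are averaged over successive sweeps of a single chain after a burn-in rather than over separate settlings, and the names `sample_coproducts`, `chl_step`, `burn_in`, `n_samples`, and `lrate` are hypothetical.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_coproducts(W, state, clamped, rng, T=1.0, burn_in=20, n_samples=100):
    """Update the free units stochastically and average the co-products a_i * a_j."""
    state = state.copy()
    free = np.flatnonzero(~clamped)
    coprod = np.zeros_like(W)
    for sweep in range(burn_in + n_samples):
        for i in rng.permutation(free):
            net = W[i] @ state                      # W has a zero diagonal
            state[i] = float(rng.random() < logistic(net / T))
        if sweep >= burn_in:                        # skip sweeps before equilibrium
            coprod += np.outer(state, state)
    return coprod / n_samples

def chl_step(W, inputs, targets, in_idx, out_idx, rng, lrate=0.1):
    """One contrastive Hebbian update for a single input/target pattern."""
    n_units = W.shape[0]
    state = rng.integers(0, 2, size=n_units).astype(float)
    clamped = np.zeros(n_units, dtype=bool)

    # Minus phase: only the input units are clamped.
    state[in_idx] = inputs
    clamped[in_idx] = True
    minus = sample_coproducts(W, state, clamped, rng)

    # Plus phase: input and output units are both clamped.
    state[out_idx] = targets
    clamped[out_idx] = True
    plus = sample_coproducts(W, state, clamped, rng)

    dW = lrate * (plus - minus)
    np.fill_diagonal(dW, 0.0)                       # no self-connections
    return W + dW

# Hypothetical layout: units 0-1 are inputs, 2-3 hidden, 4 output.
rng = np.random.default_rng(0)
W = np.zeros((5, 5))
W_new = chl_step(W, inputs=[1.0, 0.0], targets=[1.0],
                 in_idx=[0, 1], out_idx=[4], rng=rng)
```

Because the co-product matrices are symmetric, the update keeps the weights symmetric. The connection between the active input unit and the output unit can only grow in this example, since its plus-phase co-product is always 1 while its minus-phase average cannot exceed 1.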

The contrastive Hebbian learning rule minimizes the sum, over different input patterns $I$, of the contrastive divergence or Information Gain between the probability distributions over states $s$ of the output units for the desired (plus) and obtained (minus) phases, conditional on the input $I$:

$$TIG = \sum_I \int_s p^+(s \mid I)\, \ln \frac{p^+(s \mid I)}{p^-(s \mid I)}\, ds$$

In a continuous diffusion network, probability flows over time until it reaches an equilibrium distribution.
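The information-gain measure can be made concrete for a finite set of output states, where the integral over $s$ becomes a sum. The sketch below is only an illustration: the function names and the dictionary layout (input pattern mapped to a probability vector) are assumptions for the example, and it expects the obtained distribution to be nonzero wherever the desired one is.

```python
import numpy as np

def information_gain(p_plus, p_minus):
    """Sum over states of p_plus * ln(p_plus / p_minus) for one input pattern."""
    p_plus = np.asarray(p_plus, dtype=float)
    p_minus = np.asarray(p_minus, dtype=float)
    support = p_plus > 0          # states with p_plus = 0 contribute nothing
    return float(np.sum(p_plus[support] * np.log(p_plus[support] / p_minus[support])))

def total_information_gain(desired, obtained):
    """Sum the information gain over all input patterns (the dict keys)."""
    return sum(information_gain(desired[I], obtained[I]) for I in desired)

# Hypothetical example: two input patterns, two output states each.
desired = {"I1": [0.5, 0.5], "I2": [1.0, 0.0]}
obtained = {"I1": [0.25, 0.75], "I2": [1.0, 0.0]}
tig = total_information_gain(desired, obtained)
```

The measure is zero when the obtained distribution matches the desired one for every input pattern, and it grows as the two diverge.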

## Patterns and Distributions

[Figure panels: Desired Distribution; Obtained Results]

## Problems and Solutions

- Stochastic neural networks are VERY slow to train because you need to settle (which takes many time steps) many times in each of the plus and minus phases to collect adequate statistics.
- Perhaps RBMs and Deep Networks can help here?

## Collaboration of Hippocampus and Neocortex

- The effects of prior association strength on memory in both amnesic and control subjects are consistent with the idea that hippocampus and neocortex work synergistically rather than simply providing two different sources of correct performance.
- Even a damaged hippocampus can be helpful when the prior association is very strong.

## Performance of Control and Amnesic Patients in Learning Word Pairs with Prior Associations

[Figure: Cutting (1978), Expt. 1. Y-axis: Percent Correct. X-axis: Category (Ease of Association): Very Easy, Easy, Fairly Easy, Hard, Very Hard. Series: Control (Expt), Amnesic (Expt), Base rates. Example pairs: man:woman, hungry:thin, city:ostrich.]

## Kwok & McClelland Model

- The model includes a slow-learning cortical system representing the content of an association and the context.
- Hidden units in neocortex mediate associative learning.
- The cortical network is pre-trained with several cue-relation-response triples for each of 20 different cues.
- When tested with just the cue as probe, it tends to produce different targets with different probabilities:
  - Dog (chews) bone (~.30)
  - Dog (chases) cat (~.05)
- Then the network is shown cue-response-context triples. The hippocampus learns fast and the cortex learns (very) slowly.

Hippocampal and cortical networks work together at recall, so that even weak hippocampal learning can increase the probability of settling to a very strong pre-existing association.

[Figure: model architecture. Labels: Hippocampus, Context, Neo-Cortex, Relation, Cue, Response.]

## Data with Simulation Results From K&M Model

[Figure: Cutting (1978), Expt. 1. Y-axis: Percent Correct. X-axis: Category (Ease of Association): Very Easy, Easy, Fairly Easy, Hard, Very Hard. Series: Control (Model), Amnesic (Model), Control (Expt), Amnesic (Expt). Value labels on the chart: 84, 70, 68, 9, 0, 0, 0.]
