CS623: Introduction to Computing with Neural Nets (Lecture 4)
Pushpak Bhattacharyya, Computer Science and Engineering Department, IIT Bombay

Weights in a feedforward NN
w_mn is the weight of the connection from the nth neuron to the mth neuron. The E vs W surface is a complex surface in the space defined by the weights w_ij.

-∂E/∂w_mn gives the direction in which a movement of the operating point in the w_mn co-ordinate space will result in maximum decrease in error:

    Δw_mn ∝ -∂E/∂w_mn

Sigmoid neurons
Gradient descent needs a derivative computation -- not possible in the perceptron due to the discontinuous step function used! Sigmoid neurons, with easy-to-compute derivatives, are used instead.

y → 1 as x → ∞; y → 0 as x → -∞. Computing power comes from the non-linearity of the sigmoid function.

Derivative of the sigmoid function

    y = 1 / (1 + e^(-x))

    dy/dx = -(1 + e^(-x))^(-2) · (-e^(-x))
          = e^(-x) / (1 + e^(-x))^2
          = [1 / (1 + e^(-x))] · [1 - 1 / (1 + e^(-x))]
          = y(1 - y)
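The identity dy/dx = y(1 - y) can be checked numerically against a finite-difference approximation; a minimal sketch in Python (the function names are illustrative):

```python
import math

def sigmoid(x):
    # y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # closed form from the derivation above: dy/dx = y(1 - y)
    y = sigmoid(x)
    return y * (1.0 - y)

def numeric_deriv(f, x, h=1e-6):
    # central finite difference, used only to check the closed form
    return (f(x + h) - f(x - h)) / (2.0 * h)

# the two derivatives should agree at any point
for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    assert abs(sigmoid_deriv(x) - numeric_deriv(sigmoid, x)) < 1e-6
```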

Training algorithm
Initialize weights to random values. For input x = &lt;x_0, x_1, ..., x_n&gt;, modify weights as follows (target output = t, observed output = o):

    E = (1/2)(t - o)^2
    Δw_i = -η ∂E/∂w_i

Iterate until E < threshold.

Calculation of Δw_i

    ∂E/∂w_i = (∂E/∂net) · (∂net/∂w_i), where net = Σ_{i=0..n} w_i x_i
            = (∂E/∂o) · (∂o/∂net) · (∂net/∂w_i)
            = -(t - o) · o(1 - o) · x_i

    Δw_i = -η ∂E/∂w_i   (η = learning constant, 0 ≤ η ≤ 1)

so

    Δw_i = η(t - o) o(1 - o) x_i

Observations
Does the training technique support our intuition? The larger x_i is, the larger Δw_i is: the error burden is borne by the weight values corresponding to large input values.
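The update rule can be exercised end-to-end on a small linearly separable problem; a hedched sketch (the AND data, learning rate, epoch count and initialization are illustrative choices, not from the lecture):

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train(samples, eta=0.5, epochs=5000, seed=0):
    """Single sigmoid neuron trained with the update
    dw_i = eta * (t - o) * o * (1 - o) * x_i.
    Each input is augmented with x_0 = -1 (the threshold input)."""
    rng = random.Random(seed)
    n = len(samples[0][0]) + 1
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    for _ in range(epochs):
        for x, t in samples:
            xa = [-1.0] + list(x)          # augmented input vector
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xa)))
            for i in range(n):
                w[i] += eta * (t - o) * o * (1 - o) * xa[i]
    return w

# AND is linearly separable, so a single sigmoid neuron suffices
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(AND)
```

After training, the neuron's output is above 0.5 only for (1,1), matching the AND truth table.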

Observations (contd.)
Δw_i is proportional to the departure from the target. There is saturation behaviour when o is 0 or 1. If o < t then Δw_i > 0, and if o > t then Δw_i < 0, which is consistent with Hebb's law.

Hebb's law
[Figure: neuron n_j connected to neuron n_i by weight w_ji]

If n_j and n_i are both in the excitatory state (+1), then the change in weight must be such that it enhances the excitation, and the change is proportional to both levels of excitation:

    Δw_ji ∝ e(n_j) · e(n_i)

If n_i and n_j are in a mutual state of inhibition (one is +1 and the other is -1), then the change in weight is such that the inhibition is enhanced (the change in weight is negative).
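Hebb's law reduces to a one-line update; a minimal sketch (the learning rate and function name are illustrative assumptions):

```python
def hebb_update(e_nj, e_ni, eta=0.1):
    # Hebb's law: the change in w_ji is proportional to the product of
    # the excitation levels of the two neurons it connects
    return eta * e_nj * e_ni

# both excitatory (+1, +1): excitation is enhanced (positive weight change)
# mutual inhibition (+1, -1): inhibition is enhanced (negative weight change)
```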

Saturation behaviour
The algorithm is iterative and incremental. If the weight values or the number of inputs is very large, net becomes large and the output falls in the saturation region, where the weight values hardly change.

If Sigmoid Neurons Are Used, Do We Need MLP?
Does the sigmoid have the power of separating non-linearly separable data? Can the sigmoid solve the X-OR problem

(X-OR is non-linearly separable data)?

[Figure: sigmoid output O = 1 / (1 + e^(-net)) against net, with an upper threshold y_u and a lower threshold y_l on the output]

O = 1 if O > y_u; O = 0 if O < y_l. Typically y_l << 0.5 and y_u >> 0.5.

Inequalities

    O = 1 / (1 + e^(-net))

[Figure: neuron with inputs x_1, x_2 (weights w_1, w_2) and threshold input x_0 = -1 (weight w_0)]

&lt;0, 0&gt;: O = 0, i.e. O < y_l:

    1 / (1 + e^(-w_1 x_1 - w_2 x_2 + w_0)) < y_l, i.e. 1 / (1 + e^(w_0)) < y_l    (1)

&lt;0, 1&gt;: O = 1, i.e. O > y_u:

    1 / (1 + e^(-w_2 + w_0)) > y_u    (2)

&lt;1, 0&gt;: O = 1, i.e.

    1 / (1 + e^(-w_1 + w_0)) > y_u    (3)

&lt;1, 1&gt;: O = 0, i.e.

    1 / (1 + e^(-w_1 - w_2 + w_0)) < y_l    (4)

Rearranging, (1) gives 1 / (1 + e^(w_0)) < y_l, i.e. 1 + e^(w_0) > 1/y_l, i.e.

    w_0 > ln((1 - y_l) / y_l)    (5)

(2) gives 1 / (1 + e^(-w_2 + w_0)) > y_u, i.e. 1 + e^(-w_2 + w_0) < 1/y_u, i.e. e^(-w_2 + w_0) < (1 - y_u)/y_u, i.e. -w_2 + w_0 < ln((1 - y_u)/y_u), i.e.

    w_2 - w_0 > ln(y_u / (1 - y_u))    (6)

(3) gives

    w_1 - w_0 > ln(y_u / (1 - y_u))    (7)

(4) gives

    -w_1 - w_2 + w_0 > ln((1 - y_l)/y_l)    (8)

Adding (5) + (6) + (7) + (8) gives

    0 > 2 ln((1 - y_l)/y_l) + 2 ln(y_u/(1 - y_u))
    i.e. 0 > ln[((1 - y_l)/y_l) · (y_u/(1 - y_u))]

    i.e. ((1 - y_l)/y_l) · (y_u/(1 - y_u)) < 1    (i)

But:
    y_u >> 0.5, so y_u/(1 - y_u) > 1    (ii)
    y_l << 0.5, so (1 - y_l)/y_l > 1    (iii)

From (i), (ii) and (iii): contradiction. Hence a single sigmoid neuron cannot compute X-OR.
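The contradiction can also be confirmed numerically: for any thresholds with y_l < 0.5 < y_u, the sum of the right-hand sides of (5)-(8) is strictly positive, so the four inequalities cannot hold simultaneously. A small sketch:

```python
import math

def ln_ratio_low(y_l):
    # right-hand side of (5) and (8): ln((1 - y_l) / y_l)
    return math.log((1 - y_l) / y_l)

def ln_ratio_high(y_u):
    # right-hand side of (6) and (7): ln(y_u / (1 - y_u))
    return math.log(y_u / (1 - y_u))

# Summing (5)+(6)+(7)+(8) requires 0 > 2*ln_ratio_low + 2*ln_ratio_high,
# but with y_l < 0.5 < y_u both logarithms are positive, so the sum is
# strictly positive for every admissible threshold pair: contradiction.
for y_l in [0.01, 0.1, 0.3, 0.49]:
    for y_u in [0.51, 0.7, 0.9, 0.99]:
        assert 2 * ln_ratio_low(y_l) + 2 * ln_ratio_high(y_u) > 0
```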

Exercise
Use the fact that any non-linearly separable function has a positive linear combination to study whether the sigmoid can compute any non-linearly separable function.

Non-linearity is the source of power

    y = m_1(h_1·w_1 + h_2·w_2) + c_1
    h_1 = m_2(w_3·x_1 + w_4·x_2) + c_2

    h_2 = m_3(w_5·x_1 + w_6·x_2) + c_3

[Figure: two-layer network with inputs x_1, x_2, hidden units h_1, h_2 (weights w_3..w_6) and output y (weights w_1, w_2)]

Substituting h_1 and h_2:

    y = k_1 x_1 + k_2 x_2 + c

Thus a multilayer network of linear neurons can be collapsed into an equivalent 2-layer network without the hidden layer.
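The collapse can be verified numerically by computing k_1, k_2 and c from the substitution and comparing the two forms at random points (the parameter names follow the equations above; the random values are illustrative):

```python
import random

def two_layer(x1, x2, p):
    # linear hidden units h1, h2 feeding a linear output y
    h1 = p['m2'] * (p['w3'] * x1 + p['w4'] * x2) + p['c2']
    h2 = p['m3'] * (p['w5'] * x1 + p['w6'] * x2) + p['c3']
    return p['m1'] * (h1 * p['w1'] + h2 * p['w2']) + p['c1']

def collapsed(x1, x2, p):
    # substituting h1 and h2 gives y = k1*x1 + k2*x2 + c
    k1 = p['m1'] * (p['w1'] * p['m2'] * p['w3'] + p['w2'] * p['m3'] * p['w5'])
    k2 = p['m1'] * (p['w1'] * p['m2'] * p['w4'] + p['w2'] * p['m3'] * p['w6'])
    c  = p['m1'] * (p['w1'] * p['c2'] + p['w2'] * p['c3']) + p['c1']
    return k1 * x1 + k2 * x2 + c

rng = random.Random(42)
p = {k: rng.uniform(-2, 2) for k in
     ['m1', 'm2', 'm3', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', 'c1', 'c2', 'c3']}
for _ in range(100):
    x1, x2 = rng.uniform(-5, 5), rng.uniform(-5, 5)
    assert abs(two_layer(x1, x2, p) - collapsed(x1, x2, p)) < 1e-9
```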

Can a linear neuron compute X-OR?

[Figure: linear response y = mx + c with output thresholds y_L and y_U]

y > y_U is regarded as y = 1; y < y_L is regarded as y = 0; and y_U > y_L.

Linear Neuron & X-OR
We want

    y = w_1 x_1 + w_2 x_2 + c

[Figure: linear neuron with inputs x_1, x_2 (weights w_1, w_2) and output y]

Linear Neuron 1/4
For (1,1) and (0,0): y < y_L
For (0,1) and (1,0): y > y_U, with y_U > y_L. Can such (w_1, w_2, c) be found?

Linear Neuron 2/4
(0,0): y = w_1·0 + w_2·0 + c = c, and y < y_L gives
    c < y_L    (1)
(1,0): y = w_1·1 + w_2·0 + c, and y > y_U gives
    w_1 + c > y_U    (2)

Linear Neuron 3/4
(0,1): y = w_1·0 + w_2·1 + c, and y > y_U gives
    w_2 + c > y_U    (3)
(1,1): y = w_1 + w_2 + c, and y < y_L gives
    w_1 + w_2 + c < y_L    (4)
with
    y_U > y_L    (5)

Linear Neuron 4/4
    c < y_L    (1)
    w_1 + c > y_U    (2)
    w_2 + c > y_U    (3)
    w_1 + w_2 + c < y_L    (4)
    y_U > y_L    (5)
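The five constraints can be checked mechanically; a brute-force sketch (the threshold values and search ranges are illustrative) never finds a satisfying (w_1, w_2, c):

```python
import random

def consistent(w1, w2, c, y_l, y_u):
    # the constraints a linear neuron would need to compute X-OR
    return (c < y_l and                 # (1): (0,0) -> 0
            w1 + c > y_u and            # (2): (1,0) -> 1
            w2 + c > y_u and            # (3): (0,1) -> 1
            w1 + w2 + c < y_l and       # (4): (1,1) -> 0
            y_u > y_l)                  # (5)

# Adding (2)+(3): w1 + w2 + 2c > 2*y_u; adding (1)+(4): w1 + w2 + 2c < 2*y_l.
# With y_u > y_l both cannot hold, so no parameter choice ever succeeds.
rng = random.Random(0)
for _ in range(10000):
    w1, w2, c = (rng.uniform(-10, 10) for _ in range(3))
    assert not consistent(w1, w2, c, 0.2, 0.8)
```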

Inconsistent! Adding (2) and (3) gives w_1 + w_2 + 2c > 2y_U, while adding (1) and (4) gives w_1 + w_2 + 2c < 2y_L; since y_U > y_L, the constraints cannot all hold.

Observations
A linear neuron cannot compute X-OR. A multilayer network with linear-characteristic neurons is collapsible to a single linear neuron; therefore the addition of layers does not contribute to computing power. Neurons in a feedforward network must be non-linear. Threshold elements will do iff we can linearize a non-linearly separable function.