Logistic Regression
Rong Jin

Logistic Regression Model
In the Gaussian generative model, the log-odds is linear in x:

  \log \frac{p(y=1)\,p(\vec{x}|y=1)}{p(y=-1)\,p(\vec{x}|y=-1)} = \sum_{i=1}^{m} \frac{\mu_{+,i}-\mu_{-,i}}{\sigma_i^2}\, x_i + c

Generalize this ratio directly to a linear model:

  \log \frac{p(y=1|\vec{x})}{p(y=-1|\vec{x})} = \vec{x}\cdot\vec{w} + c

Parameters: w and c
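The linearity of the Gaussian log-odds can be checked numerically. The sketch below uses hypothetical 1-D parameters (means, shared variance, and class priors are all made up for illustration) and verifies that the log-odds equals a line with slope (μ₊ − μ₋)/σ².

```python
import math

# Hypothetical 1-D class-conditional Gaussians with a shared variance.
mu_pos, mu_neg, sigma = 2.0, -1.0, 1.5
prior_pos, prior_neg = 0.6, 0.4

def gaussian(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_odds(x):
    """log [ p(y=1) p(x|y=1) ] / [ p(y=-1) p(x|y=-1) ]."""
    return math.log(prior_pos * gaussian(x, mu_pos, sigma)) - \
           math.log(prior_neg * gaussian(x, mu_neg, sigma))

slope = (mu_pos - mu_neg) / sigma ** 2   # the w of the induced linear model
intercept = log_odds(0.0)                # the c of the induced linear model
```

Evaluating `log_odds` at any x reproduces `slope * x + intercept`, which is exactly why the generalization to an arbitrary linear model is natural.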

Logistic Regression Model
The log-ratio of the positive class to the negative class:

  \log \frac{p(y=1|\vec{x})}{p(y=-1|\vec{x})} = \vec{x}\cdot\vec{w} + c
  \;\Rightarrow\; \frac{p(y=1|\vec{x})}{p(y=-1|\vec{x})} = \exp(\vec{x}\cdot\vec{w} + c)

Combined with p(y=1|\vec{x}) + p(y=-1|\vec{x}) = 1, this results in

  p(y=1|\vec{x}) = \frac{1}{1+\exp(-\vec{x}\cdot\vec{w} - c)}, \qquad
  p(y=-1|\vec{x}) = \frac{1}{1+\exp(\vec{x}\cdot\vec{w} + c)}

or, compactly for both classes,

  p(y|\vec{x}) = \frac{1}{1+\exp\!\left(-y(\vec{x}\cdot\vec{w} + c)\right)}

Logistic Regression Model
Assume the inputs and outputs are related through this log-linear function:

  p(y|\vec{x};\theta) = \frac{1}{1+\exp\!\left(-y(\vec{x}\cdot\vec{w} + c)\right)}, \qquad
  \theta = \{w_1, w_2, \ldots, w_d, c\}

Estimate the weights by the MLE approach:

  (\vec{w}^*, c^*) = \arg\max_{\vec{w},c}\, l(D_{train})
                   = \arg\max_{\vec{w},c} \sum_{i=1}^{n} \log p(y_i|\vec{x}_i;\theta)
                   = \arg\max_{\vec{w},c} \sum_{i=1}^{n} \log \frac{1}{1+\exp(-y_i(\vec{x}_i\cdot\vec{w}+c))}
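The compact form covers both classes with a single formula. A minimal sketch (the weights and input below are hypothetical):

```python
import math

def p_y_given_x(y, x, w, c):
    """p(y|x) = 1 / (1 + exp(-y (x.w + c))) for y in {+1, -1}."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + c
    return 1.0 / (1.0 + math.exp(-y * score))

# The two class probabilities sum to one for any input.
w, c = [0.5, -1.2], 0.3      # hypothetical parameters
x = [2.0, 1.0]
total = p_y_given_x(+1, x, w, c) + p_y_given_x(-1, x, w, c)
```

Flipping the sign of y flips the sign inside the exponential, which is exactly how one sigmoid serves both classes.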

Example 1: Heart Disease
[Figure: histogram of the number of people with and without heart disease in each age group]
Age groups: 1: 25-29, 2: 30-34, 3: 35-39, 4: 40-44, 5: 45-49, 6: 50-54, 7: 55-59, 8: 60-64
Input feature x: the age group id
Output y: having heart disease or not
  +1: having heart disease
  -1: no heart disease

Example 1: Heart Disease
Logistic regression model with parameters \theta = \{w, c\}:

  p(y|x) = \frac{1}{1+\exp(-y(xw + c))}

Learning w and c: the MLE approach. With n_i(+) and n_i(-) the number of people in age group i with and without heart disease,

  l(D_{train}) = \sum_{i=1}^{8} \left[ n_i(+) \log p(+|i) + n_i(-) \log p(-|i) \right]
              = \sum_{i=1}^{8} \left[ n_i(+) \log \frac{1}{1+\exp(-iw-c)} + n_i(-) \log \frac{1}{1+\exp(iw+c)} \right]

Numerical optimization gives w = 0.58, c = -3.34.

Example 1: Heart Disease

  p(+|x;\theta) = \frac{1}{1+\exp(-xw-c)}; \qquad p(-|x;\theta) = \frac{1}{1+\exp(xw+c)}

With w = 0.58 and c = -3.34, an older person is more likely to have heart disease.
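The numerical optimization can be sketched with plain gradient ascent on the count-based log-likelihood. The per-group counts below are hypothetical (the slide's histogram values are not given), so the fitted w and c will differ from the slide's 0.58 / -3.34; the qualitative outcome — a positive w and a decision boundary inside the age range — is the point.

```python
import math

# Hypothetical per-group counts: n_pos[i] / n_neg[i] = number of people in
# age group i+1 with / without heart disease.
n_pos = [1, 1, 2, 3, 5, 6, 7, 8]
n_neg = [9, 9, 8, 7, 6, 4, 3, 2]

def log_likelihood(w, c):
    """l(D) = sum_i [ n_i(+) log p(+|i) + n_i(-) log p(-|i) ]."""
    total = 0.0
    for i in range(8):
        g = i + 1  # the age-group id is the input feature
        total -= n_pos[i] * math.log1p(math.exp(-(g * w + c)))
        total -= n_neg[i] * math.log1p(math.exp(g * w + c))
    return total

# Plain gradient ascent; the learning rate is chosen by hand for stability.
w = c = 0.0
for _ in range(20000):
    grad_w = grad_c = 0.0
    for i in range(8):
        g = i + 1
        p_pos = 1.0 / (1.0 + math.exp(-(g * w + c)))  # current p(+|g)
        grad_w += (n_pos[i] * (1.0 - p_pos) - n_neg[i] * p_pos) * g
        grad_c += n_pos[i] * (1.0 - p_pos) - n_neg[i] * p_pos
    w += 0.001 * grad_w
    c += 0.001 * grad_c

boundary = -c / w  # the age group where p(+|x) = p(-|x)
```

The fitted w comes out positive for any data where disease frequency rises with age, matching the slide's interpretation.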

  xw + c < 0: p(+|x) < p(-|x)
  xw + c > 0: p(+|x) > p(-|x)
  xw + c = 0: the decision boundary

The decision boundary is at x* = 5.78, i.e., roughly 53 years old.
[Figure: histogram with the decision boundary marked between age groups 5 and 6]

Naive Bayes Solution
[Figure: Gaussian curves fit to the two class histograms]
The fitting is inaccurate because the input distribution is not Gaussian. Still, the naive Bayes decision boundary i* = 5.59 is close to the estimate obtained by logistic regression: even though naive Bayes does not fit the input patterns well, it still works fine for the decision boundary.

Problems with Using Histogram Data?
[Figure: histogram of heart disease vs. no heart disease by age group]
[Figure: uneven sampling for different ages — the number of people surveyed differs substantially across age groups]

Solution
Reweight the groups so that unevenly sampled ages contribute equally. The count-based objective

  l(D_{train}) = \sum_{i=1}^{\#group} \left[ n_i(+) \log p(+|i) + n_i(-) \log p(-|i) \right]
              = \sum_{i=1}^{\#group} \left[ n_i(+) \log \frac{1}{1+\exp(-iw-c)} + n_i(-) \log \frac{1}{1+\exp(iw+c)} \right]

is replaced by a percentage-based one. Define

  \hat{p}_i(+): the percentage of people in the i-th age group that have heart disease
  \hat{p}_i(-): the percentage of people in the i-th age group that do not have heart disease

and maximize

  l(D_{train}) = \sum_{i=1}^{\#group} \left[ \hat{p}_i(+) \log p(+|i) + \hat{p}_i(-) \log p(-|i) \right]
              = \sum_{i=1}^{\#group} \left[ \hat{p}_i(+) \log \frac{1}{1+\exp(-iw-c)} + \hat{p}_i(-) \log \frac{1}{1+\exp(iw+c)} \right]

This gives w = 0.63, c = -3.56, and a decision boundary i* = 5.65 < 5.78.
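The reweighted objective can be sketched as below (the counts are hypothetical). Because each group is normalized to within-group percentages, scaling any group's sample size leaves the objective unchanged — exactly the correction needed for uneven sampling.

```python
import math

def weighted_log_likelihood(counts_pos, counts_neg, w, c):
    """Percentage-weighted objective: each age group contributes its
    within-group class percentages rather than raw counts, so heavily
    sampled groups do not dominate the fit."""
    total = 0.0
    for g, (np_, nn_) in enumerate(zip(counts_pos, counts_neg), start=1):
        n = np_ + nn_
        p_hat_pos, p_hat_neg = np_ / n, nn_ / n  # \hat{p}_g(+), \hat{p}_g(-)
        total -= p_hat_pos * math.log1p(math.exp(-(g * w + c)))
        total -= p_hat_neg * math.log1p(math.exp(g * w + c))
    return total
```

Doubling every count in one group would change the count-based likelihood but leaves this one untouched.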

Example: Text Classification
Learn to classify text into predefined categories.
Input x: a document, represented by a vector of words
  Example: {(president, 10), (bush, 2), (election, 5), ...}
Output y: whether the document is political or not
  +1 for a political document, -1 for a non-political document

Example 2: Text Classification
Training data: N(+) positive documents d_1^+, \ldots, d_{N(+)}^+ and N(-) negative documents d_1^-, \ldots, d_{N(-)}^-, where each document is represented by its term counts:

  d_i = \langle (word_1, t_{i,1}), (word_2, t_{i,2}), \ldots, (word_n, t_{i,n}) \rangle

Logistic regression model: every term t_j is assigned a weight w_j,

  p(y|d;\theta) = \frac{1}{1+\exp\!\left(-y\left(\sum_j w_j t_j + c\right)\right)}, \qquad
  \theta = \{w_1, w_2, \ldots, w_n, c\}

Learning the parameters: the MLE approach,

  l(D_{train}) = \sum_{i=1}^{N(+)} \log p(+|d_i) + \sum_{i=1}^{N(-)} \log p(-|d_i)
              = \sum_{i=1}^{N(+)} \log \frac{1}{1+\exp\!\left(-\sum_j w_j t_{i,j} - c\right)} + \sum_{i=1}^{N(-)} \log \frac{1}{1+\exp\!\left(\sum_j w_j t_{i,j} + c\right)}

Numerical solutions are needed.
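A minimal sketch of scoring a document under this model. The term weights, threshold, and documents below are all hypothetical; the point is that the score is a weighted sum of term counts pushed through the sigmoid.

```python
import math

# Hypothetical learned weights: each term carries one weight.
weights = {"president": 0.8, "election": 1.1, "recipe": -1.5}
c = -0.5  # hypothetical threshold

def p_political(doc_counts):
    """p(y=+1|d) = 1 / (1 + exp(-(sum_j w_j t_j + c)))."""
    score = c + sum(weights.get(term, 0.0) * count   # unseen terms get weight 0
                    for term, count in doc_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

doc = {"president": 10, "bush": 2, "election": 5}
```

A document rich in positively weighted terms scores above 0.5; one dominated by negatively weighted terms scores below it.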

Example 2: Text Classification
Weight w_i
  w_i > 0: term t_i is positive evidence
  w_i < 0: term t_i is negative evidence
  w_i = 0: term t_i is irrelevant to the category of the document
  The larger |w_i| is, the more important term t_i is in determining the category of the document.
Threshold c
  \sum_i w_i t_i + c > 0: more likely to be a political document
  \sum_i w_i t_i + c < 0: more likely to be a non-political document
  \sum_i w_i t_i + c = 0: the decision boundary

Example 2: Text Classification
Dataset: Reuters-21578
Classification accuracy:
  Naive Bayes: 77%
  Logistic regression: 88%

Why does logistic regression work better for text classification?
  Decision boundary: logistic regression learns the optimal linear decision boundary directly, while the generative model's weights ~ log p(w|+) - log p(w|-) are sub-optimal.
  Independence assumption: naive Bayes assumes that each word is generated independently; logistic regression is able to take the correlation of words into account.

Discriminative Model
Logistic regression is a discriminative model: it models the conditional probability p(y|x), i.e., the decision boundary.
The Gaussian generative model instead models p(x|y), i.e., the input patterns of the different classes.

Comparison
Generative model
  Models p(x|y): the input patterns
  Usually converges fast; cheap computation; robust to noisy data
  But usually performs worse
Discriminative model
  Models p(y|x) directly: the decision boundary
  Usually good performance
  But slow convergence; expensive computation; sensitive to noisy data

A Few Words about Optimization

  (\vec{w}^*, c^*) = \arg\max_{\vec{w},c} \sum_{i=1}^{n} \log \frac{1}{1+\exp(-y_i(\vec{x}_i\cdot\vec{w}+c))}

The objective function is concave (equivalently, the negative log-likelihood is convex), but the solution could be non-unique.

Problems with Logistic Regression?

  p(+|\vec{x};\theta) = \frac{1}{1+\exp\!\left(-(c + x_1 w_1 + x_2 w_2 + \ldots + x_m w_m)\right)}, \qquad
  \theta = \{w_1, w_2, \ldots, w_m, c\}

What about words that appear in only one class?

Overfitting Problem with Logistic Regression
Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight:

  l(D_{train}) = \sum_{i=1}^{N(+)} \log p(+|d_i) + \sum_{i=1}^{N(-)} \log p(-|d_i)
              = \log p(+|d) + \sum_{d_i \neq d} \log p(+|d_i) + \sum_{i=1}^{N(-)} \log p(-|d_i)
              = \log p(+|d) + l^+ + l^-

Only the first term depends on w. Consider the derivative of l(D_train) with respect to w:

  \frac{\partial l(D_{train})}{\partial w} = \frac{\partial \log p(+|d)}{\partial w} + \frac{\partial l^+}{\partial w} + \frac{\partial l^-}{\partial w}
                                          = \frac{1}{1+\exp(c + \vec{x}\cdot\vec{w})} + 0 + 0 > 0

The derivative is always positive, so w will be infinite!

Example of Overfitting for LogRes
[Figure: the classification accuracy on test data decreases as training iterations proceed]

Solution: Regularization
Regularized log-likelihood:

  l_{reg}(D_{train}) = l(D_{train}) - s\|\vec{w}\|_2^2
                    = \sum_{i=1}^{N(+)} \log p(+|d_i) + \sum_{i=1}^{N(-)} \log p(-|d_i) - s\sum_{i=1}^{m} w_i^2

s\|\vec{w}\|^2 is called the regularizer: it favors small weights and prevents the weights from becoming too large.

The Rare Word Problem
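The effect of the regularizer can be seen on a one-document toy problem (the data and s below are hypothetical): the plain likelihood keeps improving as the rare word's weight grows, while the regularized objective does not.

```python
import math

def log_likelihood(docs, w, c):
    """l(D) = sum_i log p(y_i|x_i) under the logistic model."""
    total = 0.0
    for x, y in docs:
        score = sum(wj * xj for wj, xj in zip(w, x)) + c
        total -= math.log1p(math.exp(-y * score))
    return total

def regularized(docs, w, c, s):
    """l_reg(D) = l(D) - s * ||w||^2: the regularizer penalizes large weights."""
    return log_likelihood(docs, w, c) - s * sum(wj * wj for wj in w)

# A single positive document containing one rare word (feature 0 with count 1).
docs = [([1.0], +1)]
```

Pushing the weight from 1 to 10 raises `log_likelihood` (toward its supremum of 0) but lowers `regularized`, so the regularized maximizer stays finite.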

Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight:

  l(D_{train}) = \log p(+|d) + l^+ + l^-
  l_{reg}(D_{train}) = \log p(+|d) + l^+ + l^- - s\sum_{i=1}^{m} w_i^2

Consider the derivatives with respect to w:

  \frac{\partial l(D_{train})}{\partial w} = \frac{1}{1+\exp(c + \vec{x}\cdot\vec{w})} + 0 + 0 > 0

  \frac{\partial l_{reg}(D_{train})}{\partial w} = \frac{1}{1+\exp(c + \vec{x}\cdot\vec{w})} + 0 + 0 - 2sw

When w is small, the derivative is still positive; but it becomes negative when w is large, so the weight can no longer grow without bound.

Regularized Logistic Regression
[Figure: test accuracy over iterations, with and without regularization]
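This sign flip is easy to check numerically. The sketch below uses hypothetical values of c and s and assumes the rare word appears with count 1, so the x·w contribution reduces to w itself.

```python
import math

def d_l_reg_dw(w, c=0.0, s=0.1):
    """Derivative of the regularized objective w.r.t. the rare word's weight:
    1/(1 + exp(c + w)) - 2*s*w (the word appears once, in one positive doc)."""
    return 1.0 / (1.0 + math.exp(c + w)) - 2.0 * s * w
```

With s = 0 the derivative is positive for every w, so w diverges; with s > 0 it is positive at w = 0 and negative once w is large, so the maximizer is finite.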

Interpretation of the Regularizer

  l_{reg}(D_{train}) = l(D_{train}) - s\|\vec{w}\|_2^2

There are many interpretations of the regularizer:
  Bayesian statistics: a model prior
  Statistical learning: minimizing the generalization error
  Robust optimization: a min-max solution

Regularizer: Robust Optimization
  Assume each data point is unknown-but-bounded in a sphere of radius s centered at x_i.
  Find the classifier w that is able to classify the unknown-but-bounded data points with high classification confidence.

Sparse Solution
What does the solution of regularized logistic regression look like? A sparse solution: most weights are small and close to zero.

Why Do We Need a Sparse Solution?
Two types of solutions:
  1. Many non-zero weights, but many of them are small
  2. Only a small number of non-zero weights, and many of them are large
Occam's Razor: the simpler, the better.
  A simple model that fits the data is unlikely to do so by coincidence; a complicated model that fits the data might fit by coincidence.
  A smaller number of non-zero weights means less evidence to consider, i.e., a simpler model, so solution 2 is preferred.

Occam's Razor
[Figure: sample data points to be fit]

Occam's Razor: Power = 1
[Figure: linear fit y = a_1 x]

Occam's Razor: Power = 3
[Figure: cubic fit y = a_1 x + a_2 x^2 + a_3 x^3]

Occam's Razor: Power = 10
[Figure: degree-10 fit y = a_1 x + a_2 x^2 + \ldots + a_{10} x^{10}]

Finding Optimal Solutions

  l_{reg}(D_{train}) = l(D_{train}) - s\|\vec{w}\|_2^2
                    = \sum_{i=1}^{N} \log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2
                    = \sum_{i=1}^{N} \log \frac{1}{1+\exp(-y_i(\vec{x}_i\cdot\vec{w}+c))} - s\sum_{i=1}^{m} w_i^2

The objective function is concave, so there is no local maximum and many standard optimization algorithms work.

Gradient Ascent
Maximize the log-likelihood by iteratively adjusting the parameters in small increments; in each iteration, adjust w and c in the direction that increases the log-likelihood (toward the gradient):

  \vec{w} \leftarrow \vec{w} + \eta\, \frac{\partial}{\partial \vec{w}}\left[\sum_{i=1}^{N} \log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right]
        = \vec{w} + \eta \left[\sum_{i=1}^{N} \vec{x}_i y_i \left(1 - p(y_i|\vec{x}_i)\right) - 2s\vec{w}\right]

  c \leftarrow c + \eta\, \frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right]
    = c + \eta \sum_{i=1}^{N} y_i \left(1 - p(y_i|\vec{x}_i)\right)

where \eta is the learning rate. The (1 - p(y_i|\vec{x}_i)) factors are the prediction errors; the -2s\vec{w} term prevents the weights from becoming too large.

Graphical Illustration
[Figure: gradient ascent steps on the log-likelihood surface, no-regularization case]
[Figure: test accuracy over iterations, with and without regularization]

When Should We Stop?
The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:

  \frac{\partial}{\partial \vec{w}}\left[\sum_{i=1}^{N} \log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} \vec{x}_i y_i \left(1 - p(y_i|\vec{x}_i)\right) - 2s\vec{w} = 0

  \frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} y_i \left(1 - p(y_i|\vec{x}_i)\right) = 0
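The update rules and the stopping condition fit together in a short loop. The sketch below is for 1-D inputs with hypothetical data, learning rate, and regularization strength; it stops once both gradient components are (numerically) zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, s=0.1, lr=0.05, tol=1e-6, max_iter=20000):
    """Gradient ascent on l_reg = sum_i log p(y_i|x_i) - s*w^2 (1-D inputs).
    Stops when there is no incentive to move: both gradients near zero."""
    w = c = 0.0
    for _ in range(max_iter):
        grad_w = -2.0 * s * w                      # regularizer term
        grad_c = 0.0
        for x, y in zip(xs, ys):
            err = 1.0 - sigmoid(y * (x * w + c))   # prediction error
            grad_w += x * y * err
            grad_c += y * err
        if abs(grad_w) < tol and abs(grad_c) < tol:
            break                                  # stationary point reached
        w += lr * grad_w                           # lr plays the role of eta
        c += lr * grad_c
    return w, c

# Hypothetical 1-D data: negatives at small x, positives at large x.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [-1, -1, -1, +1, +1, +1]
w_fit, c_fit = fit_logistic(xs, ys)
```

For this symmetric toy data the learned boundary -c/w lands midway between the two clusters, and the loop exits on the gradient test rather than the iteration cap.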