# CSC2515 Fall 2008: Introduction to Machine Learning

Lecture 11a: Boosting and Naive Bayes

All lecture slides will be available as .ppt, .ps, & .htm at www.cs.toronto.edu/~hinton

Many of the figures are provided by Chris Bishop from his textbook: Pattern Recognition and Machine Learning

## A commonsense way to use limited computational resources

First train a model on all of the data. Let's assume it gets the great majority of the cases right. Then train another model on all the cases the first model got wrong, plus an equal number that it got right. This focuses the resources on modelling the hard cases. Train a third model focusing on cases that either or both of the previous models got wrong. Then use a simple committee of the three models.

This is quite effective for learning to recognize handwritten digits, but it is also very heuristic. Can we give it a theoretical foundation?

## Making weak learners stronger

Suppose you have a weak learning module (a "base classifier") that can always get $0.5 + \epsilon$ of the cases correct when given a two-way classification task. That seems like a weak assumption, but beware!

Can you apply this learning module many times to get a strong learner that can get close to zero error rate on the training data? Theorists showed how to do this, and it actually led to an effective new learning procedure (Freund & Schapire, 1996).

## Boosting (AdaBoost)

First train the base classifier on all the training data with equal importance weights on each case. Then re-weight the training data to emphasize the hard cases and train a second model. (How do we re-weight the data? Maybe some theory can help.) Keep training new models on re-weighted data. Finally, use a weighted committee of all the models for the test data. (How do we weight the models in the committee? Maybe some theory can help.)

## How to train each classifier

The input is $\mathbf{x}$, the output is $y(\mathbf{x}) \in \{-1, +1\}$, and the target is $t \in \{-1, +1\}$. The weight on case $n$ for classifier $m$ is $w_n^m$.

The cost function for classifier $m$ is the weighted number of errors:

$$J_m = \sum_{n=1}^{N} w_n^m \,[\,y_m(\mathbf{x}_n) \neq t_n\,]$$

where the bracket is 1 if case $n$ is an error and 0 if it is correct.

## How to weight each training case for classifier m

Let the weighted error rate of classifier $m$ be

$$\epsilon_m = \frac{J_m}{\sum_n w_n^m}$$

and define the "quality" of the classifier as

$$\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}$$

This is zero if the classifier has a weighted error rate of 0.5 and infinity if the classifier is perfect. The weights for the next round are then

$$w_n^{m+1} = w_n^m \exp\!\big(\alpha_m \,[\,y_m(\mathbf{x}_n) \neq t_n\,]\big)$$
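The reweighting rule above can be sketched numerically. The weights, and which cases classifier m gets wrong, are invented for illustration:

```python
import numpy as np

# Toy setup: 5 training cases, all starting with equal weight 1/5.
# Suppose classifier m gets cases 0 and 3 wrong (illustrative numbers,
# not from any real dataset).
w = np.full(5, 0.2)
errors = np.array([1, 0, 0, 1, 0])       # [y_m(x_n) != t_n]

J_m = np.sum(w * errors)                 # weighted number of errors
eps_m = J_m / np.sum(w)                  # weighted error rate = 0.4
alpha_m = np.log((1 - eps_m) / eps_m)    # "quality" of the classifier

# Re-weight: wrong cases are multiplied by exp(alpha_m) = (1-eps)/eps,
# right cases are left unchanged.
w_next = w * np.exp(alpha_m * errors)

print(eps_m)
print(w_next)
```

After the update, each wrong case carries 1.5 times the weight of a right case, so the next classifier is pushed toward fixing them.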

## How to make predictions using a committee of classifiers

Weight the binary prediction of each classifier by the quality of that classifier:

$$Y_M(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{m=1}^{M} \alpha_m \, y_m(\mathbf{x}) \right)$$

This completely ignores the confidence of each classifier, which is very un-Bayesian and probably inefficient, but it may make it easier to prove things.

## An alternative derivation of AdaBoost

Just write down the right cost function and let each classifier minimize it greedily (Friedman et al., 2000). The exponential loss for classifier $m$ is

$$E = \sum_{n=1}^{N} \exp\!\big(-t_n \, f_m(\mathbf{x}_n)\big), \qquad f_m(\mathbf{x}) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l \, y_l(\mathbf{x})$$

where $f_m(\mathbf{x})$ is the real-valued prediction made by the committee of models up to and including $m$.
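The committee prediction and its exponential loss can be sketched for a toy committee. The predictions and qualities below are invented for illustration:

```python
import numpy as np

# Toy committee of M = 3 classifiers on 4 cases (illustrative values).
# Rows: classifiers l; columns: cases n; entries are predictions y_l(x_n).
y = np.array([[ 1, -1,  1,  1],
              [ 1,  1, -1,  1],
              [ 1,  1,  1, -1]])
t = np.array([1, 1, 1, 1])            # targets t_n
alpha = np.array([0.9, 0.6, 0.4])     # qualities alpha_l

f = 0.5 * (alpha @ y)                 # real-valued output f_M(x_n)
Y = np.sign(alpha @ y)                # hard committee prediction Y_M(x_n)
E = np.sum(np.exp(-t * f))            # exponential loss of the committee

print(Y)                              # [1. 1. 1. 1.]
print(E)
```

Note that even when every hard prediction $Y_M$ is correct, the exponential loss keeps rewarding larger margins $t_n f_M(\mathbf{x}_n)$, which is what drives the greedy derivation that follows.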

## Learning classifier m using exponential loss

Since $f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \tfrac{1}{2}\,\alpha_m\, y_m(\mathbf{x})$, where $f_{m-1}(\mathbf{x}) = \tfrac{1}{2}\sum_{l=1}^{m-1} \alpha_l\, y_l(\mathbf{x})$ is already fixed by the earlier rounds, the loss separates into a per-case weight times a term involving only the new classifier:

$$E = \sum_{n=1}^{N} \exp\!\big(-t_n f_{m-1}(\mathbf{x}_n)\big)\, \exp\!\big(-\tfrac{1}{2}\, t_n\, \alpha_m\, y_m(\mathbf{x}_n)\big)$$

## Re-writing the part of the exponential loss that is relevant when fitting classifier m

Treating $w_n^m = \exp(-t_n f_{m-1}(\mathbf{x}_n))$ as the weight on case $n$, the part of the loss that depends on classifier $m$ is

$$E_{\text{relevant}} = \sum_{n=1}^{N} w_n^m \, \exp\!\big(-\tfrac{1}{2}\, t_n\, \alpha_m\, y_m(\mathbf{x}_n)\big)$$

$$E_{\text{relevant}} \;=\; e^{-\alpha_m/2} \sum_{n \,\in\, \text{right}} w_n^m \;+\; e^{\alpha_m/2} \sum_{n \,\in\, \text{wrong}} w_n^m$$

$$=\; \underbrace{\big(e^{\alpha_m/2} - e^{-\alpha_m/2}\big)}_{\text{multiplicative constant}} \; \underbrace{\sum_{n} w_n^m \,[\,y_m(\mathbf{x}_n) \neq t_n\,]}_{\text{wrong cases}} \;+\; \underbrace{e^{-\alpha_m/2} \sum_{n} w_n^m}_{\text{unmodifiable}}$$

The last term does not depend on $y_m$ at all, and the first factor is just a positive multiplicative constant, so the best classifier $m$ is the one that minimizes the weighted error $\sum_n w_n^m [y_m(\mathbf{x}_n) \neq t_n]$, which is exactly the cost function $J_m$ used before. Setting $\partial E_{\text{relevant}} / \partial \alpha_m = 0$ then gives $e^{\alpha_m} = \big(\sum_{\text{right}} w_n^m\big) \big/ \big(\sum_{\text{wrong}} w_n^m\big)$, i.e. $\alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m}$, which recovers the quality used earlier.
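Putting the pieces together, the whole procedure can be sketched end-to-end. The code below is a minimal AdaBoost with threshold "decision stump" base classifiers on an invented 1-D problem; the stump fitter, the data, and the number of rounds are illustrative choices, not part of the lecture:

```python
import numpy as np

def fit_stump(x, t, w):
    """Pick the threshold/sign stump with the lowest weighted error rate."""
    best = None
    for thresh in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x >= thresh, sign, -sign)
            eps = np.sum(w * (pred != t)) / np.sum(w)
            if best is None or eps < best[0]:
                best = (eps, thresh, sign)
    return best

def adaboost(x, t, M):
    w = np.full(len(x), 1.0 / len(x))       # equal importance weights
    stumps, alphas = [], []
    for _ in range(M):
        eps, thresh, sign = fit_stump(x, t, w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # avoid log(0)
        alpha = np.log((1 - eps) / eps)           # quality of this stump
        pred = np.where(x >= thresh, sign, -sign)
        w = w * np.exp(alpha * (pred != t))       # emphasize the hard cases
        stumps.append((thresh, sign))
        alphas.append(alpha)
    return stumps, alphas

def committee_predict(x, stumps, alphas):
    votes = sum(a * np.where(x >= th, s, -s)
                for (th, s), a in zip(stumps, alphas))
    return np.sign(votes)

# Toy 1-D data that no single stump classifies perfectly.
x = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
t = np.array([1, 1, -1, -1, 1, 1, -1, -1])

stumps, alphas = adaboost(x, t, M=3)
Y = committee_predict(x, stumps, alphas)
err = float(np.mean(Y != t))
print(err)   # training error of the weighted committee
```

Each individual stump gets at least a quarter of these cases wrong, but the weighted committee of three stumps should classify the toy set perfectly, illustrating how a weak learner is boosted into a strong one.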

## An impressive example of boosting

Viola and Jones created a very fast face detector that can be scanned across a large image to find the faces. The base classifier just compares the total intensity in two rectangular pieces of the image. There is a neat trick for computing the total intensity in a rectangle in a few operations, so it's easy to evaluate a huge number of base classifiers, and they are very fast at run time. Classifiers are added greedily based on their quality on the weighted training cases.
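The "neat trick" for rectangle sums is the integral image (summed-area table): after one pass over the image, the total intensity of any rectangle costs just four lookups. A minimal sketch, using a random array in place of a real image:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero-padded first row/column, so
    ii[r, c] = sum of img[:r, :c] and rectangle sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Total intensity of img[r0:r1, c0:c1] in four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# Check against a brute-force sum on a small random "image".
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(6, 8))
ii = integral_image(img)
print(rect_sum(ii, 1, 2, 4, 7) == img[1:4, 2:7].sum())   # True

# A Viola-Jones-style base feature: the difference in total intensity
# between two adjacent rectangles (here the top and bottom halves).
feature = rect_sum(ii, 0, 0, 3, 8) - rect_sum(ii, 3, 0, 6, 8)
```

Because every rectangle feature costs a constant number of operations regardless of its size, the detector can afford to evaluate a huge pool of candidate base classifiers at every image location.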