# Boosting (CSC2515 Lecture 11a)

CSC2515 Fall 2008, Introduction to Machine Learning, Lecture 11a: Boosting and Naive Bayes. All lecture slides will be available as .ppt, .ps, and .htm at www.cs.toronto.edu/~hinton. Many of the figures are provided by Chris Bishop from his textbook *Pattern Recognition and Machine Learning*.

## A commonsense way to use limited computational resources

- First train a model on all of the data. Let's assume it gets the great majority of the cases right.
- Then train another model on all the cases the first model got wrong, plus an equal number that it got right. This focuses the resources on modelling the hard cases.
- Train a third model, focusing on cases that either or both of the previous models got wrong.
- Then use a simple committee of the three models.

This is quite effective for learning to recognize handwritten digits, but it is also very heuristic. Can we give it a theoretical foundation?

## Making weak learners stronger

Suppose you have a weak learning module (a "base classifier") that can always get $0.5 + \epsilon$ of the cases correct when given a two-way classification task. That seems like a weak assumption, but beware!

Can you apply this learning module many times to get a strong learner that can get close to zero error rate on the training data? Theorists showed how to do this, and it actually led to an effective new learning procedure (Freund & Schapire, 1996).

## Boosting (AdaBoost)

- First train the base classifier on all the training data with equal importance weights on each case.
- Then re-weight the training data to emphasize the hard cases and train a second model. How do we re-weight the data? Maybe some theory can help.
- Keep training new models on re-weighted data.
- Finally, use a weighted committee of all the models for the test data. How do we weight the models in the committee? Maybe some theory can help.

## How to train each classifier

The input is $\mathbf{x}$, the output is $y(\mathbf{x}) \in \{-1, 1\}$, and the target is $t \in \{-1, 1\}$. The weight on case $n$ for classifier $m$ is $w_n^m$.
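As a concrete illustration of such a base classifier, the sketch below trains a decision stump (a threshold on one feature) that respects per-case importance weights. This is not from the lecture: the stump, numpy usage, and function names are assumptions chosen to make the setup tangible.

```python
import numpy as np

def train_weighted_stump(X, t, w):
    """Pick the feature, threshold, and sign that minimize the weighted
    error sum_n w_n * [y(x_n) != t_n].
    X: (N, D) inputs; t: (N,) targets in {-1, +1}; w: (N,) importance weights."""
    best_err, best_params = np.inf, None
    for d in range(X.shape[1]):                 # candidate feature
        for thresh in np.unique(X[:, d]):       # candidate threshold
            for sign in (1, -1):                # orientation of the stump
                y = sign * np.where(X[:, d] >= thresh, 1, -1)
                err = np.sum(w * (y != t))      # weighted error count
                if err < best_err:
                    best_err, best_params = err, (d, thresh, sign)
    return best_params, best_err
```

With equal weights this is just ordinary training; re-weighted cases will steer later stumps toward the hard examples.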

The cost function for classifier $m$ counts the weighted errors:

$$J_m = \sum_{n=1}^{N} w_n^m \, [\, y_m(\mathbf{x}_n) \neq t_n \,]$$

where the indicator $[\, y_m(\mathbf{x}_n) \neq t_n \,]$ is 1 if there is an error and 0 if the case is correct.

## How to weight each training case for classifier $m$

Let the weighted error rate of the classifier be

$$\epsilon_m = \frac{J_m}{\sum_n w_n^m}$$

and let

$$\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}$$

This is the quality of the classifier. It is zero if the classifier has a weighted error rate of 0.5, and infinity if the classifier is perfect. The weights for the next classifier are then

$$w_n^{m+1} = w_n^m \exp\left( \alpha_m \, [\, y_m(\mathbf{x}_n) \neq t_n \,] \right)$$

## How to make predictions using a committee of classifiers

Weight the binary prediction of each classifier by the quality of that classifier.
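The training loop just described can be sketched as follows. Decision stumps are an assumed choice of base classifier (a compact stump learner is inlined to keep the sketch self-contained); the function names are illustrative, not from the lecture.

```python
import numpy as np

def fit_stump(X, t, w):
    # Compact weak learner: threshold a single feature, minimizing weighted error.
    best_err, best = np.inf, None
    for d in range(X.shape[1]):
        for thresh in np.unique(X[:, d]):
            for sign in (1, -1):
                y = sign * np.where(X[:, d] >= thresh, 1, -1)
                err = np.sum(w * (y != t))
                if err < best_err:
                    best_err, best = err, (d, thresh, sign)
    return best

def stump_predict(params, X):
    d, thresh, sign = params
    return sign * np.where(X[:, d] >= thresh, 1, -1)

def adaboost(X, t, M):
    N = len(t)
    w = np.ones(N) / N                      # start with equal importance weights
    committee = []
    for _ in range(M):
        params = fit_stump(X, t, w)
        miss = stump_predict(params, X) != t
        eps = np.sum(w * miss) / np.sum(w)  # weighted error rate
        if eps >= 0.5:                      # no better than chance: stop
            break
        if eps == 0:                        # perfect classifier: alpha would be infinite
            committee.append((1.0, params))
            break
        alpha = np.log((1 - eps) / eps)     # quality of the classifier
        committee.append((alpha, params))
        w = w * np.exp(alpha * miss)        # up-weight only the mistakes
        w = w / np.sum(w)
    return committee

def committee_predict(committee, X):
    # Weighted vote: each classifier's binary prediction scaled by its quality.
    votes = sum(alpha * stump_predict(p, X) for alpha, p in committee)
    return np.sign(votes)
```

A single stump cannot fit a target like $[-1, 1, 1, -1]$ on a line, but a small weighted committee of stumps can, which is exactly the point of boosting.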

The committee's prediction is

$$Y_M(\mathbf{x}) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m \, y_m(\mathbf{x}) \right)$$

This completely ignores the confidence of each classifier, which is very un-Bayesian and probably inefficient, but it may make it easier to prove things.

## An alternative derivation of AdaBoost

Just write down the right cost function and let each classifier minimize it greedily (Friedman et al., 2000). The exponential loss for classifier $m$ is

$$E = \sum_{n=1}^{N} \exp\left( -t_n f_m(\mathbf{x}_n) \right)$$

where

$$f_m(\mathbf{x}) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l \, y_l(\mathbf{x})$$

is the real-valued prediction made by the committee of models up to $m$.

## Learning classifier $m$ using the exponential loss

Hold the earlier classifiers fixed and split off the term contributed by classifier $m$:

$$E = \sum_{n=1}^{N} \exp\left( -t_n f_{m-1}(\mathbf{x}_n) - \tfrac{1}{2} t_n \alpha_m y_m(\mathbf{x}_n) \right) = \sum_{n=1}^{N} w_n^m \exp\left( -\tfrac{1}{2} t_n \alpha_m y_m(\mathbf{x}_n) \right)$$

where $w_n^m = \exp(-t_n f_{m-1}(\mathbf{x}_n))$ absorbs everything that classifier $m$ cannot change.

## Re-writing the part of the exponential loss that is relevant when fitting classifier $m$

$$E_{\mathrm{relevant}} = \sum_{n=1}^{N} w_n^m \exp\left( -\tfrac{\alpha_m}{2} \, t_n y_m(\mathbf{x}_n) \right) = e^{-\alpha_m/2} \!\!\sum_{n\ \mathrm{right}}\!\! w_n^m \;+\; e^{\alpha_m/2} \!\!\sum_{n\ \mathrm{wrong}}\!\! w_n^m$$

$$= \left( e^{\alpha_m/2} - e^{-\alpha_m/2} \right) \sum_{n\ \mathrm{wrong}} w_n^m \;+\; e^{-\alpha_m/2} \sum_{n} w_n^m$$

The last term is unmodifiable by classifier $m$, and the bracketed factor is a multiplicative constant, so the best choice of classifier $m$ simply minimizes the total weight $\sum_{\mathrm{wrong}} w_n^m$ on the cases it gets wrong.
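Minimizing this same expression with respect to $\alpha_m$ is what produces the quality formula $\alpha_m = \ln\big((1-\epsilon_m)/\epsilon_m\big)$, and that can be checked numerically. The weight split below ($W_{\mathrm{right}}, W_{\mathrm{wrong}}$) is made up for illustration; a grid search over $\alpha$ should land on the analytic answer.

```python
import numpy as np

# Hypothetical weighted-error split for classifier m (illustrative values).
W_right, W_wrong = 0.8, 0.2

def exp_loss(alpha):
    # Relevant part of the exponential loss:
    # E = e^{-alpha/2} * W_right + e^{alpha/2} * W_wrong
    return np.exp(-alpha / 2) * W_right + np.exp(alpha / 2) * W_wrong

alphas = np.linspace(0.01, 5.0, 100000)          # fine grid over alpha
numeric = alphas[np.argmin(exp_loss(alphas))]    # grid-search minimizer
eps = W_wrong / (W_right + W_wrong)              # weighted error rate
analytic = np.log((1 - eps) / eps)               # AdaBoost's quality formula
assert abs(numeric - analytic) < 1e-3
```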

## An impressive example of boosting

Viola and Jones created a very fast face detector that can be scanned across a large image to find the faces. The base classifier just compares the total intensity in two rectangular pieces of the image. There is a neat trick for computing the total intensity in a rectangle in a few operations, so it is easy to evaluate a huge number of base classifiers, and they are very fast at runtime. Classifiers are added greedily based on their quality on the weighted training cases.
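The neat trick is the integral image (summed-area table): precompute cumulative sums once, and any rectangle's total intensity then costs just four lookups. A minimal sketch, assuming numpy (function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[r, c] = img[:r, :c].sum()."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Total intensity of img[r0:r1, c0:c1] using only four lookups,
    by inclusion-exclusion on the corner values."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

A Viola-Jones base classifier then compares `rect_sum` over two adjacent rectangles, which is why evaluating huge numbers of such features is cheap.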
