NAIVE BAYES
David Kauchak, CS159 Spring 2019

Admin
Assignment 7 out soon (due next Friday at 5pm)
Quiz #3 next Monday - text similarity -> this week (though, light on ML)

Final project
1. Your project should relate to something involving NLP
2. Your project must include a solid experimental evaluation
3. Your project should be done in a pair or a group of three. If you'd like to do it solo or in a group of four, please come talk to me.

Final project
Read the final project handout ASAP!
Start forming groups and thinking about what you want to do

Final project ideas
- text classification: pick a task; evaluate different machine learning methods; implement a machine learning method; analyze different feature categories
- n-gram language modeling: implement and compare other smoothing techniques; implement alternative models

- parsing: lexicalized PCFG (with smoothing); n-best list generation; parse output reranking; implement another parsing approach and compare; parsing non-traditional domains (e.g. twitter)
- EM: try and implement IBM model 2; word-level translation models

Final project application areas
spelling correction, part of speech tagger, text chunker, dialogue generation, pronoun resolution, compare word similarity measures (more than the ones we looked at), word sense disambiguation, machine translation, information retrieval, information extraction, question answering, summarization, speech recognition

Probabilistic Modeling
Model the data with a probabilistic model: from the training data, train a probabilistic model. Specifically, learn p(features, label). p(features, label) tells us how likely these features and this example are.

Basic steps for probabilistic modeling
Step 1: pick a model

Step 2: figure out how to estimate the probabilities for the model
Step 3 (optional): deal with overfitting
Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting?

Naive Bayes assumption

p(features, label) = p(y) ∏_{j=1}^m p(x_j | y, x_1, …, x_{j-1})

p(x_j | y, x_1, x_2, …, x_{j-1}) = p(x_j | y)

What does this assume? It assumes feature j is independent of the other features, given the label.

Naive Bayes model

p(features, label) = p(y) ∏_{j=1}^m p(x_j | y, x_1, …, x_{j-1})
                   = p(y) ∏_{j=1}^m p(x_j | y)    (naive Bayes assumption)

p(x_j | y) is the probability of a particular feature value given the label. How do we model this?
- binary features (e.g., "banana" occurs in the text)
- discrete features (e.g., "banana" occurs x_j times)
- real-valued features (e.g., the text contains x_j proportion of verbs)

p(x | y): binary features (aka Bernoulli Naive Bayes)

p(x_j | y) = θ_j if x_j = 1, and 1 - θ_j otherwise: a biased coin toss!

Basic steps for probabilistic modeling

Step 1: pick a model
Step 2: figure out how to estimate the probabilities for the model
Step 3 (optional): deal with overfitting
Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting?

Obtaining probabilities
From the training data, train a probabilistic model: p(y) and p(x_1 | y), p(x_2 | y), …, p(x_m | y) (m = number of features).

MLE estimation for Bernoulli NB
The model p(y) ∏_{j=1}^m p(x_j | y) is trained from the training data. What are the MLE estimates for these?

Maximum likelihood estimates

p(y) = count(y) / n    (number of examples with the label / total number of examples)

p(x_j | y) = count(x_j, y) / count(y)    (number of examples with the label that have the feature / number of examples with the label)

What does training a NB model then involve? How difficult is this to calculate?

Text classification
Unigram features: w_j, whether or not word w_j occurs in the text.

p(y) = count(y) / n
p(w_j | y) = count(w_j, y) / count(y)

What are these counts for text classification with unigram features?
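Concretely, these MLE estimates reduce training to counting. A minimal sketch in Python (the function name and the tiny corpus are made up for illustration):

```python
from collections import Counter, defaultdict

def mle_estimates(examples):
    """MLE for Bernoulli NB from (words, label) pairs:
    p(y) = count(y) / n,  p(w | y) = count(w, y) / count(y)."""
    n = len(examples)
    label_counts = Counter(label for _, label in examples)
    word_label_counts = defaultdict(Counter)
    for words, label in examples:
        for w in set(words):              # binary feature: occurs or not
            word_label_counts[label][w] += 1
    p_y = {y: c / n for y, c in label_counts.items()}
    p_w = {y: {w: c / label_counts[y] for w, c in wc.items()}
           for y, wc in word_label_counts.items()}
    return p_y, p_w

# made-up toy corpus
data = [({"yellow", "curved"}, "banana"),
        ({"yellow", "round"}, "banana"),
        ({"red", "round"}, "apple")]
p_y, p_w = mle_estimates(data)
print(p_y["banana"])            # 2/3
print(p_w["banana"]["yellow"])  # 2/2 = 1.0
```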

p(y) = count(y) / n    (number of texts with the label / total number of texts)

p(w_j | y) = count(w_j, y) / count(y)    (number of texts with the label that contain word w_j / number of texts with the label)

Naive Bayes classification
yellow, curved, no leaf, 6oz, banana -> NB model: p(features, label) = p(y) ∏_{j=1}^m p(x_j | y) = 0.004
Given an unlabeled example (yellow, curved, no leaf, 6oz), predict the label. How do we use a probabilistic model for classification/prediction?

NB classification
probabilistic model p(features, label):
yellow, curved, no leaf, 6oz, banana: p(y=1) ∏_{j=1}^m p(x_j | y=1)
yellow, curved, no leaf, 6oz, apple: p(y=2) ∏_{j=1}^m p(x_j | y=2)
Pick the largest:

label = argmax_{y in labels} p(y) ∏_{j=1}^m p(x_j | y)

Notice that each label has its own separate set of parameters, i.e. p(x_j | y).

Bernoulli NB for text classification
probabilistic model p(features, label), with the text represented as a binary vector (1, 1, 1, 0, 0, 1, 0, 0, …):
compare p(y=1) ∏_{j=1}^m p(w_j | y=1) and p(y=2) ∏_{j=1}^m p(w_j | y=2); pick the largest.
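The pick-largest rule can be sketched as follows (the function name and hand-set parameters are hypothetical; note that every vocabulary word contributes a factor, whether or not it occurs in the text):

```python
def bernoulli_nb_classify(text_words, vocab, p_y, p_w):
    """label = argmax_y p(y) * product over the WHOLE vocabulary:
    p(w|y) if w occurs in the text, 1 - p(w|y) if it does not."""
    best_label, best_score = None, -1.0
    for y, prior in p_y.items():
        score = prior
        for w in vocab:
            theta = p_w[y].get(w, 0.0)
            score *= theta if w in text_words else (1.0 - theta)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# hypothetical, hand-set parameters for two labels
p_y = {"pos": 0.5, "neg": 0.5}
p_w = {"pos": {"good": 0.9, "bad": 0.1},
       "neg": {"good": 0.1, "bad": 0.9}}
print(bernoulli_nb_classify({"good"}, {"good", "bad"}, p_y, p_w))  # pos
```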

How good is this model for text classification?

Bernoulli NB for text classification
[figure: a text represented over the vocabulary words w1 … w8 as the binary vector (1, 1, 1, 0, 0, 1, 0, 0, …)]
Compare p(y=1) ∏_{j=1}^m p(w_j | y=1) and p(y=2) ∏_{j=1}^m p(w_j | y=2); pick the largest. For text classification, what is this computation? Does it make sense?
Each word that occurs contributes p(w_j | y). Each word that does NOT occur contributes 1 - p(w_j | y).

Generative Story
To classify with a model, we're given an example and we obtain the probability. We can also ask how a given model would generate an example. This is the generative story for a model. Looking at the generative story can help understand the model. We can also use generative stories to help develop a model.

Bernoulli NB generative story

p(y) ∏_{j=1}^m p(x_j | y)

What is the generative story for the NB model?
1. Pick a label according to p(y): roll a biased, num_labels-sided die.
2. For each feature, flip a biased coin: if heads, include the feature; if tails, don't include the feature.

What does this mean for text classification, assuming unigram features?
1. Pick a label according to p(y): roll a biased, num_labels-sided die.
2. For each word in your vocabulary, flip a biased coin: if heads, include the word in the text; if tails, don't include the word.

Bernoulli NB

p(features, label) = p(y) ∏_{j=1}^m p(x_j | y)
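The generative story itself can be sketched as code: roll a biased die for the label, then flip one biased coin per vocabulary word (the parameters below are made up):

```python
import random

def generate_bernoulli(p_y, p_w, rng):
    """Bernoulli NB generative story: roll a biased num_labels-sided die
    for the label, then flip one biased coin per vocabulary word."""
    labels = list(p_y)
    y = rng.choices(labels, weights=[p_y[l] for l in labels])[0]
    text = {w for w, theta in p_w[y].items() if rng.random() < theta}
    return y, text

# hypothetical parameters
p_y = {"sports": 0.7, "politics": 0.3}
p_w = {"sports": {"ball": 0.9, "vote": 0.05},
       "politics": {"ball": 0.05, "vote": 0.9}}
label, text = generate_bernoulli(p_y, p_w, random.Random(0))
```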

Pros/cons?

Bernoulli NB
Pros: easy to implement; fast! can be done on large data sets.
Cons: the naive Bayes assumption is generally not true; performance isn't as good as other models; for text classification (and other sparse feature domains), the p(x_i = 0 | y) terms can be problematic.

Another generative story
Randomly draw words from a bag of words until the document length is reached.
[figure: a bag containing the word tokens w1, w3, w2, w1, w1, w1, w3]

Draw words from a fixed distribution
Selected: w1. Sampling with replacement: put a copy of w1 back.
Selected: w1 w1. Put a copy of w1 back.
Selected: w1 w1 w2. Sampling with replacement: put a copy of w2 back.

Is this a NB model, i.e. does it assume each individual word occurrence is independent?
Yes! It doesn't matter what words were drawn previously; there is still the same probability of getting any particular word.

Does this model handle multiple word occurrences?
Yes: the same word can be selected more than once (Selected: w1 w1 w2).
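Drawing from the bag can be sketched as sampling with replacement (the bag proportions below mirror the figure's token counts; `draw_words` is an illustrative name):

```python
import random

def draw_words(word_probs, length, rng):
    """Sampling with replacement: every draw uses the same fixed
    distribution, so earlier draws never change later probabilities."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=length)

# the bag from the figure: w1 appears 4 times, w3 twice, w2 once
bag = {"w1": 4/7, "w2": 1/7, "w3": 2/7}
print(draw_words(bag, 5, random.Random(0)))
```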

NB generative story
Bernoulli NB:
1. Pick a label according to p(y): roll a biased, num_labels-sided die.
2. For each word in your vocabulary, flip a biased coin: if heads, include the word in the text; if tails, don't include the word.
Multinomial NB:
1. Pick a label according to p(y): roll a biased, num_labels-sided die.
2. Keep drawing words from p(words | y) until the text length has been reached.

Probabilities
Bernoulli NB gives p(y) ∏_{j=1}^m p(x_j | y) for the binary vector (1, 1, 1, 0, 0, 1, 0, 0, …). For multinomial NB, which keeps drawing words from p(words | y) until the document length has been reached, what probability do we assign to the count vector (4, 1, 2, 0, 0, 7, 0, 0, …)?

A digression: rolling dice
What's the probability of getting a 3 for a single roll of this die? 1/6.
What is the probability distribution over possible single rolls? 1/6 for each of the faces 1, 2, 3, 4, 5, 6.

A digression: rolling dice
What if I told you 1 was twice as likely as the others? Then: 2/7 for face 1 and 1/7 for each of faces 2 through 6.

What if I rolled 400 times and got the following numbers? 1: 100, 2: 50, 3: 50, 4: 100, 5: 50, 6: 50. Estimates: 1/4, 1/8, 1/8, 1/4, 1/8, 1/8 for faces 1 through 6.

With those probabilities:
- What is the probability of rolling a 1 and a 5 (in any order)? (1/4 * 1/8) * 2 = 1/16: the probability of those two rolls, times the number of ways that can happen (1,5 and 5,1).
- Two 1s and a 5 (in any order)? ((1/4)^2 * 1/8) * 3 = 3/128
- Five 1s and two 5s (in any order)? ((1/4)^5 * (1/8)^2) * 21 = 21/65,536, roughly 0.0003
General formula?
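The three computations above all follow the same pattern: the number of orderings times the probability of one particular ordering. A sketch that reproduces the arithmetic (`prob_of_counts` is an illustrative name):

```python
from math import factorial

def prob_of_counts(counts, probs):
    """P = n!/(x1!…xm!) * prod q_j^{x_j}: the number of orderings with
    these counts, times the probability of any one such ordering."""
    n = sum(counts)
    ways = factorial(n)
    for x in counts:
        ways //= factorial(x)
    p = 1.0
    for x, q in zip(counts, probs):
        p *= q ** x
    return ways * p

die = [1/4, 1/8, 1/8, 1/4, 1/8, 1/8]            # faces 1..6, as estimated above
print(prob_of_counts([1, 0, 0, 0, 1, 0], die))  # a 1 and a 5: 1/16 = 0.0625
print(prob_of_counts([2, 0, 0, 0, 1, 0], die))  # two 1s and a 5: 3/128
print(prob_of_counts([5, 0, 0, 0, 2, 0], die))  # five 1s and two 5s: 21/65536
```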

Multinomial distribution
Multinomial distribution: independent draws over m possible categories. If we have frequency counts x_1, x_2, …, x_m over the categories, the probability is:

p(x_1, x_2, …, x_m | θ_1, θ_2, …, θ_m) = ( n! / ∏_{j=1}^m x_j! ) ∏_{j=1}^m θ_j^{x_j}

The first factor is the number of different ways to get those counts; the second is the probability of one particular arrangement with those counts.

What are the θ_j? Are there any constraints on the values that they can take?
θ_j is the probability of rolling j, with θ_j >= 0 and ∑_{j=1}^m θ_j = 1.

Back to words
Why the digression? Drawing words from a bag is the same as rolling a die! The number of sides is the number of words in the vocabulary, so:

p(features, label) = p(y) ( n! / ∏_{j=1}^m x_j! ) ∏_{j=1}^m (θ_j^y)^{x_j}

where the θ_j^y are the word probabilities for class y.

Basic steps for probabilistic modeling
Model each class as a multinomial:

p(features, label) = p(y) ( n! / ∏_{j=1}^m x_j! ) ∏_{j=1}^m (θ_j^y)^{x_j}

Step 2: figure out how to estimate the probabilities for the model. How do we train the model, i.e. estimate the θ_j for each class?

A digression: rolling dice
What if I rolled 400 times and got the following numbers? 1: 100, 2: 50, 3: 50, 4: 100, 5: 50, 6: 50. Estimates: 1/4, 1/8, 1/8, 1/4, 1/8, 1/8.

Training a multinomial
Same idea, but with one "die" per label (label1, label2, …). For each label y, count how often each word occurs in that label's documents (e.g., w1: 100 times, w2: 50 times, w3: 10 times):

θ_j = count(w_j, y) / ∑_{k=1}^m count(w_k, y)
    = number of times word w_j occurs in label-y docs / total number of words in label-y docs
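Estimating the θs is again just counting, now over word occurrences. A sketch with made-up documents for one label:

```python
from collections import Counter

def train_multinomial(docs):
    """theta_j = count(w_j, y) / sum_k count(w_k, y) over one label's docs:
    word frequency out of all word occurrences in those docs."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

label1_docs = [["w1", "w1", "w2"], ["w1", "w3"]]   # made-up docs for one label
theta = train_multinomial(label1_docs)
print(theta["w1"])   # 3/5 = 0.6
```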

Classifying with a multinomial
The text is represented as word counts over the vocabulary w1 … w8, e.g. (10, 2, 6, 0, 0, 1, 0, 0, …). Compare

p(y=1) ( n! / ∏_{j=1}^m x_j! ) ∏_{j=1}^m (θ_j^1)^{x_j}  vs.  p(y=2) ( n! / ∏_{j=1}^m x_j! ) ∏_{j=1}^m (θ_j^2)^{x_j}

and pick the largest. Any way I can make this simpler?

n! / ∏_{j=1}^m x_j! is a constant (it doesn't depend on the label), so we can drop it and just compare

p(y=1) ∏_{j=1}^m (θ_j^1)^{x_j}  vs.  p(y=2) ∏_{j=1}^m (θ_j^2)^{x_j}

and pick the largest.

Multinomial finalized
Training:

- Calculate p(label)
- For each label, calculate the θs: θ_j = count(w_j, y) / ∑_{k=1}^m count(w_k, y)
Classification:
- Get the word counts
- For each label you had in training, calculate p(y) ∏_{j=1}^m θ_j^{x_j} and pick the largest

Multinomial vs. Bernoulli?
The multinomial model handles word frequency and, given enough data, tends to perform better.
http://www.cs.cmu.edu/~knigam/papers/multinomial-

Maximum likelihood estimation

Intuitive: it sets the probabilities so as to maximize the probability of the training data. Problems? Overfitting! The amount of data is particularly problematic for rare events. Is our training data representative?

Basic steps for probabilistic modeling
Step 1: pick a model
Step 2: figure out how to estimate the probabilities for the model
Step 3 (optional): deal with overfitting
Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting?

Unseen events
Training data: positive "banana": 2; negative "banana": 0.
With θ_j = count(w_j, y) / ∑_{k=1}^m count(w_k, y), what will θ_banana be for the negative class? 0! Is this a problem?
p(I ate a bad banana, negative) = ?

Unseen events
Training data: positive "banana": 2; negative "banana": 0.
p(I ate a bad banana, negative) = 0. In fact, p(…, negative) = 0 for any text containing "banana". Solution?

Add lambda smoothing
Instead of θ_j = count(w_j, y) / ∑_{k=1}^m count(w_k, y), use:

θ_j = (count(w_j, y) + λ) / (λm + ∑_{k=1}^m count(w_k, y))

For each label, pretend like we've seen each feature/word occur in λ additional examples.

A different problem: positive "banana": 0; negative "banana": 0. How is this problem different? Compare p(I ate a bad banana, positive) with p(I ate a bad, positive), and p(I ate a bad banana, negative) with p(I ate a bad, negative): here "banana" is out of vocabulary. There are many ways to solve this; for our implementation, we'll just ignore out-of-vocabulary words.
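Putting the pieces together, a minimal multinomial NB with add-lambda smoothing that ignores out-of-vocabulary words at classification time (all names and the toy data are made up; logs replace the raw product to avoid underflow):

```python
import math
from collections import Counter, defaultdict

def train(examples, lam=1.0):
    """Add-lambda smoothed multinomial NB:
    theta_{y,j} = (count(w_j, y) + lam) / (lam*m + sum_k count(w_k, y))."""
    n = len(examples)
    label_counts = Counter(y for _, y in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, y in examples:
        word_counts[y].update(words)
        vocab.update(words)
    m = len(vocab)
    priors = {y: c / n for y, c in label_counts.items()}
    thetas = {}
    for y in label_counts:
        total = sum(word_counts[y].values())
        thetas[y] = {w: (word_counts[y][w] + lam) / (lam * m + total)
                     for w in vocab}
    return priors, thetas, vocab

def classify(words, priors, thetas, vocab):
    """argmax_y log p(y) + sum_j x_j * log theta_{y,j}; out-of-vocabulary
    words are simply ignored."""
    scores = {y: math.log(priors[y]) +
                 sum(math.log(thetas[y][w]) for w in words if w in vocab)
              for y in priors}
    return max(scores, key=scores.get)

train_data = [(["banana", "yellow"], "positive"),
              (["apple", "red"], "negative")]
priors, thetas, vocab = train(train_data)
print(classify(["banana", "banana", "kiwi"], priors, thetas, vocab))  # positive
```

Note that smoothing guarantees every theta is strictly positive, so a single unseen word can no longer zero out a whole label.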