Corpora and Statistical Methods Lecture 12


Corpora and Statistical Methods
Albert Gatt

Part 1: Semantic similarity and the vector space model

Synonymy
Different phonological/orthographic words with highly related meanings: sofa / couch; boy / lad. Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence salva veritate (preserving truth). Is this ever the case? Can we replace one word with another and keep the meaning of the sentence identical?

The importance of text genre & register
With near-synonyms, there are often register-governed conditions of use, e.g. naive vs. gullible vs. ingenuous:
You're so bloody gullible [...]
[...] outside on the pavement trying to entice gullible idiots in [...]
You're so ingenuous. You tackle things the wrong way.
The commentator's ingenuous query could just as well have been prompted [...]
However, it is ingenuous to suppose that peace process [...]
(source: BNC)

Synonymy vs. similarity
The contextual theory of synonymy is based on the work of Wittgenstein (1953) and Firth (1957): "You shall know a word by the company it keeps" (Firth 1957). Under this view, perfect synonyms might not exist, but words can be judged as highly similar if people put them into the same linguistic contexts and judge the change to be slight.

Synonymy vs. similarity: an example (Miller & Charles 1991)

Weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words. E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts; it is much more likely that snake/serpent will occur in similar contexts than snake/toad. NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment
Subjects were given sentences with missing words and asked to place words they felt were acceptable in each context. Method to compare words A and B: find sentences containing A; find sentences containing B; delete A and B from the sentences and shuffle them; ask people to choose which sentences to place A and B in. Results: people tend to put similar words in the same contexts, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity
Similar is a much broader concept than synonymous:
Contextually related, though differing in meaning: man / woman; boy / girl; master / pupil.
Contextually related, but with opposite meanings: big / small; clever / stupid.

Uses of similarity
Assumption: semantically similar words behave in similar ways.
Information retrieval: query expansion with related terms.
K nearest neighbours, e.g.: given a set of elements, each assigned to some topic, classify an unknown word w by topic, by finding the topic that is most prevalent among w's semantic neighbours.

Common approaches
Vector-space approaches: represent word w as a vector containing the words (or other features) in the context of w; compare the vectors of w1 and w2; various vector-distance measures are available.

Information-theoretic measures: w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2.

Part 2: Vector-space models

Basic data structure
A matrix M, where M_ij = the number of times w_i co-occurs with w_j (in some window). We can also have a document * word matrix. We can treat matrix cells as boolean: if M_ij > 0, then w_i co-occurs with w_j, else it does not.
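A minimal sketch (in Python) of this data structure: a word-by-word co-occurrence matrix built from a toy corpus with a symmetric window of +/- 2 tokens. The corpus and window size are invented for illustration.

from collections import defaultdict

corpus = "the snake hissed at the toad while the serpent hissed back".split()
window = 2

# M[wi][wj] = number of times wj occurs within the window around wi
M = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            M[w][corpus[j]] += 1

print(dict(M["hissed"]))         # raw co-occurrence counts for 'hissed'
print(M["snake"]["hissed"] > 0)  # boolean view: True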

Distance measures
Many measures take a set-theoretic perspective. Vectors can be binary (indicating co-occurrence or not) or real-valued (indicating frequency, or probability); similarity is a function of what two vectors have in common.

Classic similarity/distance measures: each has a boolean-vector (set) version and a real-valued-vector version.

Dice coefficient (boolean / set version):
$\text{dice}(V, W) = \frac{2|V \cap W|}{|V| + |W|}$
Real-valued version:
$\text{dice}(\vec{v}, \vec{w}) = \frac{2\sum_{i=1}^{n}\min(v_i, w_i)}{\sum_{i=1}^{n}v_i + \sum_{i=1}^{n}w_i}$

Jaccard coefficient (boolean / set version):
$\text{jaccard}(V, W) = \frac{|V \cap W|}{|V \cup W|}$
Real-valued version:
$\text{jaccard}(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n}\min(v_i, w_i)}{\sum_{i=1}^{n}\max(v_i, w_i)}$

Dice vs. Jaccard
Dice (car, truck) on the boolean matrix: (2 * 2)/(4 + 2) = 0.66
Jaccard on the boolean matrix: 2/4 = 0.5
Dice is more generous; Jaccard penalises lack of overlap more.
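A sketch of both coefficients on real-valued vectors, using the min/max formulations above; on 0/1 vectors they reduce to the set-based versions, and the car/truck vectors below are chosen to reproduce the boolean example.

def dice(v, w):
    return 2 * sum(min(vi, wi) for vi, wi in zip(v, w)) / (sum(v) + sum(w))

def jaccard(v, w):
    return (sum(min(vi, wi) for vi, wi in zip(v, w))
            / sum(max(vi, wi) for vi, wi in zip(v, w)))

car = [1, 1, 1, 1]    # co-occurs with 4 context words
truck = [1, 1, 0, 0]  # shares 2 of them
print(dice(car, truck))     # (2 * 2) / (4 + 2) = 0.66...
print(jaccard(car, truck))  # 2 / 4 = 0.5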

Cosine similarity
The cosine of the angle between the two vectors:
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\sqrt{\sum_{i=1}^{n} w_i^2}}$
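A minimal implementation of the formula above, reusing the illustrative vectors from the Dice/Jaccard sketch.

from math import sqrt

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (sqrt(sum(vi ** 2 for vi in v)) * sqrt(sum(wi ** 2 for wi in w)))

print(cosine([1, 1, 1, 1], [1, 1, 0, 0]))  # 0.707...: the vectors are 45 degrees apart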

Part 3: Probabilistic approaches

Turning counts into probabilities
P(spacewalking|cosmonaut) = 0.5
P(red|car) = 0.25
NB: this transforms each row of the matrix into a probability distribution corresponding to a word.

Probabilistic measures of distance
KL-divergence: treat w1 as an approximation of w2:
$D(v \| w) = \sum_{x} P(x|v) \log \frac{P(x|v)}{P(x|w)}$
Problems:

asymmetric: D(p||q) ≠ D(q||p), so not so useful for word-word similarity;
if the denominator P(x|w) = 0, then D(v||w) is undefined.

Probabilistic measures of distance
Information radius (aka Jensen-Shannon divergence) compares the total divergence between p and q to the average of p and q, and it is symmetric!

$\text{IRad}(v, w) = D\left(v \,\middle\|\, \frac{v+w}{2}\right) + D\left(w \,\middle\|\, \frac{v+w}{2}\right)$
Dagan et al. (1997) showed this measure to be superior to KL-divergence when applied to word sense disambiguation.
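A sketch of both divergences over discrete distributions (lists summing to 1); the distributions are invented for illustration. Note that D(q||p) is undefined below because p assigns zero probability to an outcome q allows, while IRad is always defined and symmetric.

from math import log2

def kl(p, q):
    # D(p || q); terms with p(x) = 0 contribute nothing
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, avg) + kl(q, avg)

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl(p, q))                  # defined; kl(q, p) would divide by zero
print(irad(p, q) == irad(q, p))  # True: symmetric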

Some characteristics of vector-space measures
1. Very simple conceptually;
2. Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.;
3. Vectors can be arbitrarily large, representing wide context windows;
4. Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998)
Intuition: the similarity of any two things (words, documents, people, plants) is a function of the information gained by having a joint description of a and b in terms of what they have in common, compared to describing a and b separately. E.g. do we gain more by a joint description of:

apple and chair (both THINGS), or apple and banana (both FRUIT: more specific)?

Lin's definition, cont'd
Essentially, we compare the information content of the joint description to the information content of the separate descriptions:
$\text{sim}(a, b) = \frac{\text{information content of joint description}}{\text{information content of separate descriptions}}$
NB: essentially mutual information!

An application to corpora
From a corpus-based point of view, what do words have in common? Context, obviously. How do we define context? Just bag-of-words (typical of vector-space models), or something more grammatically sophisticated.

Kilgarriff's (2003) application
Defines the notion of context, following Lin: define F(w) as the set of grammatical contexts in which w occurs. A context is a triple <rel, w, w'>, where rel is a grammatical relation, w is the word of interest, and w' is the other word in rel.

Grammatical relations can be obtained using a dependency parser.

[Figure: grammatical co-occurrence matrix for cell, with example triples. Source: Jurafsky & Martin (2009), after Lin (1998)]

Observe that each triple f consists of the relation r, the second word in the relation w', and the word of interest w. We can now compute the level of association between the word w and each of its triples f:
$I(w, f) = \log_2 \frac{P(w, f)}{P(w) \, P(r|w) \, P(w'|w)}$
This is an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information.

Calculating similarity
Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of: the triples they have in common, and the triples that are unique to each:
$\text{sim}_{Lin}(w_1, w_2) = \frac{2 \times I(F(w_1) \cap F(w_2))}{I(F(w_1)) + I(F(w_2))}$
I.e.: the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has (a minimal code sketch follows, after the sample results).

Sample results: master & pupil
Common: Subject-of: read, sit, know; Modifier: good, form; Possession: interest.
Master only: Subject-of: ask; Modifier: past (cf. past master).
Pupil only: Subject-of: make, find; PP_at-p: school.

Concrete implementation
The online SketchEngine gives the grammatical relations of words, plus a thesaurus which rates words by similarity to a head word. This is based on the Lin (1998) model.

Limitations (or characteristics)

Only applicable as a measure of similarity between words of the same category; it makes no sense to compare the grammatical relations of words of different categories. Does not distinguish between near-synonyms and similar words: student ~ pupil, but also master ~ pupil. MI is sensitive to low frequency: a relation which occurs only once in the corpus can come out as highly significant.
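The promised minimal sketch of Lin's measure, under the simplifying assumption that the association scores I(w, f) for each word's triples have already been computed. The triples and scores below are invented for illustration (loosely echoing the master/pupil results above), not corpus-derived; each shared triple contributes its score from both words, matching the 2 x I(F(w1) intersect F(w2)) numerator.

def lin_sim(F1, F2):
    # F1, F2: dicts mapping triples (rel, other_word) -> association score I(w, f)
    common = set(F1) & set(F2)
    shared = sum(F1[f] + F2[f] for f in common)
    total = sum(F1.values()) + sum(F2.values())
    return shared / total

master = {("subj-of", "read"): 1.2, ("subj-of", "sit"): 0.9, ("subj-of", "ask"): 1.1,
          ("mod", "good"): 0.7, ("mod", "past"): 1.5}
pupil = {("subj-of", "read"): 1.0, ("subj-of", "sit"): 0.8, ("subj-of", "make"): 0.6,
         ("mod", "good"): 0.7}
print(lin_sim(master, pupil))  # ~0.62: higher when more (and stronger) triples are shared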

Part 4: Applications of vector-space models to information retrieval

Information retrieval
Basic problem definition: store a (very large) collection of documents (document = newspaper articles, encyclopedia entries, Medline abstracts, HTML pages, ...). Given a user query (some set of words), retrieve the documents that are most relevant to that query. Most IR systems take a bag-of-words approach: document = the words it contains.

No syntactic information or higher-order semantic information is used.

[Figure: IR architecture]

Basic representation
The same as for semantic similarity, except that we use a document by term (= word) matrix. A document d_j is represented as a vector whose cells contain term weights:
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

Example document representation: fried eggplant recipe

Place the flour, egg, and bread crumbs each in 3 small bowls. Add the 1/2 teaspoon of salt to the egg and whisk to combine. Season the bread crumbs with the tablespoon of Essence and stir with a fork or your hands to thoroughly combine. Dredge each piece of eggplant in the flour, coating thoroughly and then shaking to remove any excess flour. Coat each piece with the egg, then dredge in the bread crumb mixture, pressing to make the bread crumbs adhere. Transfer the eggplant pieces to a rack or to paper towels to let them dry slightly before frying. In a deep, heavy skillet heat 1/2-inch of vegetable oil to 375 degrees F. Fry the eggplant pieces, in batches if necessary, for about 1 minute on each side, or until golden brown. Transfer with tongs to paper towels to drain. Sprinkle lightly with salt before serving. Serve with marinara sauce, or as desired.

Document representation (the term weights are just simple document frequencies, for now):

        flour  egg  breadcrumb  eggplant
  d_j     3     3       4          3
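A sketch of how such term frequencies could be computed. The tokenisation is deliberately crude (lowercasing, collapsing "bread crumb(s)" into one term, stripping a final 's'), and the snippet is a shortened stand-in for the full recipe above.

import re

def term_freqs(text, vocab):
    text = text.lower().replace("bread crumb", "breadcrumb")
    tokens = [t.rstrip("s") for t in re.findall(r"[a-z]+", text)]
    return {term: tokens.count(term) for term in vocab}

snippet = ("Dredge each piece of eggplant in the flour, then coat with the egg "
           "and dredge in the bread crumb mixture; dust off excess flour.")
print(term_freqs(snippet, ["flour", "egg", "breadcrumb", "eggplant"]))
# {'flour': 2, 'egg': 1, 'breadcrumb': 1, 'eggplant': 1}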

Example query representation
Using the same document representation (term weights are simple document frequencies, for now), suppose the user types: egg and breadcrumb. The query could be represented as:
$\vec{q} = (0, 1, 1, 0)$

More generally
Let d1 be the eggplant recipe, and d2 be a fried chicken recipe:

        flour  egg  breadcrumb  eggplant  chicken
  d1      3     3       4          3         0
  d2      2     3       3          0         2

User query: $\vec{q} = (0, 1, 1, 0, 0)$
Note: intuitively, this query should match both docs (both contain egg and breadcrumb). Which doc would the query fried chicken match?

Query processing
We can use the same model as we used for computing word similarity to compute the degree of match between a query and a document, e.g. compute the cosine similarity between the query and the document vector. Documents can then be ranked by their similarity to the query.
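A sketch of this ranking step over the running example; the cosine function repeats the earlier one, and the vocabulary order is flour, egg, breadcrumb, eggplant, chicken.

from math import sqrt

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = sqrt(sum(vi ** 2 for vi in v)) * sqrt(sum(wi ** 2 for wi in w))
    return dot / norm if norm else 0.0

d1 = [3, 3, 4, 3, 0]  # eggplant recipe
d2 = [2, 3, 3, 0, 2]  # fried chicken recipe
q = [0, 1, 1, 0, 0]   # "egg and breadcrumb"

print(cosine(q, d1), cosine(q, d2))  # ~0.75 and ~0.83: both docs match
print(cosine([0, 0, 0, 0, 1], d1), cosine([0, 0, 0, 0, 1], d2))
# "chicken": 0.0 vs ~0.39, so only d2 matches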

Term weighting
So far, the intuition has been that frequent terms in a document capture the basic meaning of the document. Another intuition: terms that crop up in only a few documents are more discriminative.

        flour  egg  breadcrumb  eggplant  chicken
  d1      3     3       4          3         0
  d2      2     3       3          0         2

Inverse document frequency (IDF)
A way of giving a higher weight to more discriminative words:
$\text{idf}_i = \log \frac{N}{n_i}$
where N = the number of docs in the collection, and n_i = the number of documents containing term i. We combine IDF with TF (the term frequency):
$w_{i,j} = tf_{i,j} \times idf_i$

TF/IDF
tf:
        flour  egg  breadcrumb  eggplant  chicken
  d1      3     3       4          3         0
  d2      2     3       3          0         2

idf:
        flour  egg  breadcrumb  eggplant  chicken
          0     0       0         0.30      0.30

TF-IDF weighting
Properties: weights terms higher if they are frequent in a document AND rare in the document collection as a whole.
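A quick check of the idf table above; N = 2 and base-10 logs are assumed here, since log10(2/1) ~ 0.30 reproduces the table's values.

from math import log10

docs = {"d1": {"flour", "egg", "breadcrumb", "eggplant"},
        "d2": {"flour", "egg", "breadcrumb", "chicken"}}
N = len(docs)

idf = {t: log10(N / sum(t in d for d in docs.values()))
       for t in set.union(*docs.values())}
print(idf)  # flour, egg, breadcrumb: 0.0; eggplant, chicken: ~0.30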

Modified similarity for query/document retrieval:
$\text{sim}(q, d) = \frac{\sum_{w \in q, d} tf_{w,q} \cdot tf_{w,d} \cdot (idf_w)^2}{\sqrt{\sum_{q_i \in q} (tf_{q_i,q} \cdot idf_{q_i})^2} \cdot \sqrt{\sum_{d_i \in d} (tf_{d_i,d} \cdot idf_{d_i})^2}}$
We only take into account the words actually in the query.
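A sketch of this weighted similarity over the running example (vocabulary order: flour, egg, breadcrumb, eggplant, chicken). Note that in this tiny two-document collection, terms occurring in both documents get idf = 0 and contribute nothing; that is an artefact of N = 2, not of the formula.

from math import sqrt

vocab = ["flour", "egg", "breadcrumb", "eggplant", "chicken"]
idf = [0.0, 0.0, 0.0, 0.301, 0.301]
d1 = [3, 3, 4, 3, 0]
d2 = [2, 3, 3, 0, 2]

def sim(q, d):
    num = sum(qt * dt * i ** 2 for qt, dt, i in zip(q, d, idf) if qt > 0)
    q_norm = sqrt(sum((qt * i) ** 2 for qt, i in zip(q, idf)))
    d_norm = sqrt(sum((dt * i) ** 2 for dt, i in zip(d, idf)))
    return num / (q_norm * d_norm) if q_norm and d_norm else 0.0

q = [0, 0, 0, 0, 1]  # "chicken"
print(sim(q, d1), sim(q, d2))  # 0.0 for the eggplant recipe; d2 wins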

Part 5: Evaluation of IR

Evaluation of IR systems
As with most NLP systems, we require some function that our system should maximise. A lot of NLP evaluation relies on precision and recall.

Basic rationale
For a given classification problem, we have a gold standard against which to compare our system's results. Compared to the target gold standard, results fall into: false positives (fp), false negatives (fn), true positives (tp), and true negatives (tn). Performance is typically measured in terms of

precision and recall.

Precision
Definition: the proportion of items that are correctly classified, i.e. the proportion of true positives out of all of the system's classifications:
$\text{precision } (P) = \frac{tp}{tp + fp}$

Recall
Definition: the proportion of the actual target (gold standard) items that our system classifies correctly:
$\text{recall } (R) = \frac{tp}{tp + fn}$
where tp + fn is the total number of items that should be correctly classified, including those the system doesn't get.

Combining precision and recall
We typically use the F-measure as a global estimate of performance against the gold standard.

We need some factor α (alpha) to weight precision and recall; α = 0.5 gives them equal weighting:
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}}$
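The definitions above in code; the counts are invented for illustration, and alpha = 0.5 reduces F to the harmonic mean of P and R.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    return 1 / (alpha / p + (1 - alpha) / r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(p, r, f_measure(p, r))  # 0.8, 0.666..., ~0.727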

Precision/Recall in IR
We assume that the results returned by the IR system can be divided into relevant docs (tp) and non-relevant docs (fp). Precision = the fraction of returned docs that are relevant; recall = the fraction of all relevant docs that are returned. Problem: IR systems tend to rank documents by relevance.

Method 1: interpolated precision and recall
We can split documents into equivalence classes (those at a given rank) and compute P and R at each rank.

[Figure: ranked retrieval results, from J&M (2009), p. 807, based on a collection of 9 docs.]
Recall increases when relevant items are encountered. Precision is very variable!

Method 1: interpolated precision and recall, cont'd
[Figure: plot of maximum precision at different recall intervals.]

Method 2: Mean average precision
Here, we just compute the average precision at or above a given rank:
$\text{MAP} = \frac{1}{|R_r|} \sum_{d \in R_r} \text{Precision}_r(d)$
where R_r = the set of relevant docs at or above rank r, and Precision_r(d) = the precision at the rank at which document d was found. NB: this metric favours systems that return relevant documents at high ranks. Is this justified?
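A sketch of average precision for a single ranked result list; the relevance judgements are invented. MAP proper would average this value over a set of queries.

def average_precision(ranked_relevance):
    # ranked_relevance: booleans in rank order, True = relevant document
    precisions, hits = [], 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([True, False, True, True, False]))
# relevant at ranks 1, 3, 4: (1/1 + 2/3 + 3/4) / 3 ~ 0.806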

Part 6: Improving IR

Improving IR: simple techniques
A lot of queries will contain morphologically inflected words (beans, etc.) and function words (fried and breadcrumb). Performance usually improves if we perform some kind of stemming and use a stop list.

Function words aren't very informative (cf. Zipf's law), while stemming allows us to identify query-doc relatedness in a more general fashion. A minimal sketch of both techniques follows.
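The sketch uses a hand-rolled stop list and a very crude suffix-stripping stemmer; a real system would use something like the Porter stemmer. Both the stop list and the suffix rules are invented for illustration.

STOP = {"the", "and", "of", "a", "an", "to", "in", "with"}

def stem(token):
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(query):
    return [stem(t) for t in query.lower().split() if t not in STOP]

print(preprocess("fried beans and breadcrumbs"))  # ['fri', 'bean', 'breadcrumb']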

Using semantic information
Homonymy and polysemy can reduce precision: a query containing bank will match docs containing both senses of the word, but the user will generally only be interested in one of the senses. Perhaps word sense disambiguation (WSD) can help improve IR? More on WSD next week.

Query expansion
One way to try to improve IR is to expand a query by finding words similar to the words in the query (using a thesaurus), e.g. q = (Steve Jobs); q_exp = (Steve Jobs, Mac). But this will depend on the quality of our thesaurus. E.g. is it useful to expand the query dog with the related words cat, animal and horse? (These are actual examples from the BNC SketchEngine thesaurus.) Or is it more useful to restrict our thesaurus to only synonyms (dog, canine, etc.)?
