Contrastive Estimation: Training Log-Linear Models on ...

Contrastive Estimation: Training Log-Linear Models on ...

Contrastive Estimation: (Efficiently) Training LogLinear Models (of Sequences) on Unlabeled Data Noah A. Smith and Jason Eisner Department of Computer Science / Center for Language and Speech Processing Johns Hopkins University {nasmith,jason}@cs.jhu.edu ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Nutshell Version unannotated text tractable training contrastive estimation with lattice neighborhoods Experiments on unlabeled data: max ent features sequence models POS tagging: 46% error rate reduction (relative to EM) Max ent features make it possible to survive damage to tag dictionary ACL 2005 N. A. Smith and J. Eisner Contrastiveparsing: Estimation Dependency

21% Red leaves dont hide blue jays. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Maximum Likelihood Estimation (Supervised) p JJ NNS MD VB JJ NNS y red leaves dont hide blue jays x

p ? ? * * * ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Maximum Likelihood Estimation (Unsupervised) p ? ? ? ? ? ? red leaves dont hide blue

jays This is what EM does. p ? ? * * * ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation x Focusing Probability Mass numerator denominator ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Focusing Probability Mass numerator denominator ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Conditional Estimation (Supervised) p p

JJ NNS MD VB JJ NNS y red leaves dont hide blue jays x ? ? ? ? ?

? red leaves dont hide blue jays (x) * A different denominator! ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Objective Functions Objective Optimizati on Algorithm MLE Count & Normalize* tags & words

* * MLE with hidden variables EM* words * * Conditional Likelihood Iterative Scaling tags & words (words) * tags & words hypothesize d tags & words Perceptron Backprop Denominat Numerator or

*For generative models. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Objective Functions Objective MLE MLE with Contrastive hidden Estimation variables Optimizati on Algorithm Count & generic Normalize* numerical solvers EM* Conditional Likelihood (in this talk, LMVM Iterative L-BFGS) Scaling Perceptron Backprop Denominat Numerator

or observed tags data& * * words (in this talk, raw words * * ? word sequence, tagsover & sum (words) * words all possible hypothesize taggings) tags & d tags & words words *For generative models. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation This talk is about denominators ... in the unsupervised case.

A good denominator can improve accuracy and tractability. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Language Learning (Syntax) red leaves EM dont hide blue jays Why didnt he say, birds fly or dancing granola or the wash dishes or any other sequence of words? ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Language Learning (Syntax) red leaves dont

hide blue jays Why did he pick that sequence for those words? Why not say leaves red ... or ... hide dont ... or ... ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation The Job of Syntax Explain why each word is necessary. DEL1WORD neighborhood red dont hide blue jays leaves dont hide blue jays red leaves hide blue jays red leavesdont hide blue jays red leaves dont hide blue

red leaves dont blue jays red leaves dont hide jays ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation The Job of Syntax Explain the (local) order of the words. TRANS1 neighborhood red dont leaves hide blue jays leaves red dont hide blue jays red leavesdont hide blue jays red leaves hide dont blue jays red leaves dont hide jays blue red leaves dont blue hide jays ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation p ? ? ? ? ? ? red leaves dont

hide blue jays ? ? ? red leaves dont p ? ? ? hide blue jays ? ? ? ? ? ? leaves

red dont hide blue jays ? ? ? ? ? ? red dont leaves hide blue jays ? ?

? ? ? ? red leaves hide dont blue jays ? ? ? ? ? ? red leaves dont

blue hide jays ? ? ? ? ? ? red leaves dont hide jays blue sentences in TRANS1 neighborhoo d ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation p

p ? ? ? ? ? ? red leaves dont hide blue jays red leaves dont le a do n

ve s t re d hide hi de bl ue le a do n ve dont s t hide (with any tagging) ACL 2005 N. A. Smith and J. Eisner

blue jays ja ys hi d e blue bl ue jays sentences in TRANS1 neighborhoo Contrastive Estimation d The New Modeling Imperative A good sentence hints that a set of bad ones is nearby. numerator denominator (neighborhood) Make the good

sentence at the ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation expense likely, This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Log-Linear Models score of x, y partition function Computing is undesirable! Conditional Estimation (Supervised) 1 sentence Contrastive Estimation (Unsupervised) a few

sentences Sums over all possible taggings of all possible sentences! ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation A Big Picture: Sequence Model Estimation unannotated data tractable sums generative, EM: p(x) log-linear, EM: p(x) log-linear, MLE: p(x, y) generative, MLE: p(x, y) log-linear, conditional estimation: p(y | x) log-linear, CE with lattice neighborhoods overlapping

features ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Contrastive Neighborhoods Guide the learner toward models that do what syntax is supposed to do. Lattice representation efficient algorithms. There is an art to choosing neighborhoo d functions. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Neighborhoods neighborhood size lattice arcs DEL1WORD n+1 O(n) delete up to 1 word TRANS1 n DELORTRANS1 O(n)

DEL1SUBSEQUENCE * (EM) perturbations O(n) transpose any bigram O(n) DEL1WORD TRANS1 O(n2 delete any contiguous O(n2) subsequence ) - replace each word with anything ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation The Merialdo (1994) Task Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), A form of learn to tag. supervision. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation

Trigram Tagging Model JJ NNS MD VB JJ NNS red leaves dont hide blue jays feature set: tag trigrams tag/word pairs from a POS dictionary ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation CRF supervised 99.5

HMM 97.2 LENGTH 79.3 TRANS1 79.0 DELORTRANS1 78.8 10 data 70.0 DA Smith & Eisner (2004) 66.6 EM Merialdo (1994) EM 62.1 60.4 DEL1WORD 58.7 DEL1SUBSEQUENCE random loglinear EM

35.1 96K words 30.0 40.0 50.0 60.0 70.0 80.0 90.0 full POS dictionary tagging accuracy (ambiguous words) uninformative initializer best of 8 smoothing conditions ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation 100.0 Dictionary includes ... What if we damage the POS dictionary? all words words from 1st half of corpus words with count 2 words with count 3 Dictionary excludes OOV words, which can get any tag.

ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation 90.4 Dictionary excludes OOV words, which can get any tag. 60.5 56.6 65.0 all words words from 1st half of corpus words with count 2 words with count 3 69.5 66.5 70.0 70.9 75.0 84.4 80.5 80.0

Dictionary includes ... 78.3 75.2 72.3 84.8 81.3 77.2 85.0 60.0 51.0 55.0 96K words 17 coarse POS tags uninformative initializer do m ra n EM N G TH LE EL O RT

RA NS 1 50.0 D tagging accuracy (all words) 90.0 90.1 95.0 ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation Trigram Tagging Model + Spelling JJ NNS MD VB JJ NNS red leaves dont

hide blue jays feature set: tag trigrams tag/word pairs from a POS dictionary 1- to 3-character suffixes, contains hyphen, digit ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation 50.0 ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation + LEN sp G T el H lin g D EL + OR sp T R el AN lin S 1 g 55.0 51.0 60.5 56.6 60.0

do m 65.0 ra n EM 70.9 73.8 73.6 69.5 66.5 70.0 83.2 84.4 80.5 78.3 75.2 72.3 75.0 N G TH 80.0

84.8 81.3 77.2 85.0 LE EL O RT RA NS 1 91.9 91.1 90.8 90.3 89.8 90.4 90.1 90.0 D tagging accuracy (all words) 95.0 Spelling features aided recovery,

but only with a smart neighborhoo d. The model need not be finite-state. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation 35.2 42.1 23.6 33.8 48.7 50.0 40.0 37.4 30.0 20.0 EM 10.0 LENGTH 0.0 TRANS1 See our paper at the IJCAI 2005

Grammatical Inference workshop. ACL 2005 attachment accuracy Klein & Manning (2004) Unsupervised Dependency Parsing initializer cleve r unin form ative N. A. Smith and J. Eisner Contrastive Estimation To Sum Up ... ontrastive Estimation means denominator for tractability or for accuracy picking your own (or, as in our case, for both). w we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners

s a particularly good fit for log-linear models: with max ent features unsupervised sequence models all in time for ACL 2006. ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation ACL 2005 N. A. Smith and J. Eisner Contrastive Estimation

Recently Viewed Presentations

  • Analogies - Mrs. Miranda's Classroom Website

    Analogies - Mrs. Miranda's Classroom Website

    Part to whole - the second word (solider) is part of the first word (platoon) Varying degree - the first word (hot) is the extreme of the second word (scalding) Antonym Object to function - gills are used to breathe...
  • Présentation PowerPoint

    Présentation PowerPoint

    Bibliographie (suite) PANDOLFI, Mariella, 2002, « Moral Entrepreneur, souveraineté mouvantes et barbelés. Le bio-politique dans les Balkans postcommunistes », Anthropologie et Sociétés, 26 : 29-51. PARK, Jungwee (juillet 2008).
  • Canadian Biomes - Mr. Morrison Socials Studies 10

    Canadian Biomes - Mr. Morrison Socials Studies 10

    Canadian Biomes AKA: Our Home & Native Land What is a "biome"? A biome is a collection of ecosystems that are similar, or related to one another, based on the types of plants they support.
  • Medical Applications of Bioinformatics

    Medical Applications of Bioinformatics

    Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Overview This lecture will summarize a huge amount of bioinformatics material that is usually presented as a full 12 week course. Data management and analysis of...
  • Ancient Egypt - Edl

    Ancient Egypt - Edl

    Ancient Egypt The Gift of the Nile Ancient Egyptian Time An Explanation BC - Means "Before Christ" AD - Means "Anno Domini" (The Year of Our Lord) Both of these terms were adopted during the early formation of the Roman/Christian...
  • Welcome to Hornsey Library Library Provides : ed

    Welcome to Hornsey Library Library Provides : ed

    Welcome to Hornsey Library A COMPLETE LEARNING EXPERIENCE Printing Photocopy scanning (TOP UP Rs 150) Computers Newspaper Printed resources Library Provides : Magazines Wi-Fi access E-resources Numbers of books allowed per student Undergraduate: 2 books Postgraduate: 3 books Students who...
  • Rue the ROOs

    Rue the ROOs

    July 2006: Doha Round Talks Suspended . June 2007: Potsdam Meeting of G-4 (US, EU, Brazil, India) failed. ... What happens without Doha? Tariffs may rise because bound tariffs won't fall. Bound tariffs are almost twice as high as applied...
  • "A new year is unfoldinglike a blossom with

    "A new year is unfoldinglike a blossom with

    "A new year is unfolding—like a blossom with petals curled tightly concealing the beauty within. Lord, let this year be filled with the things that are truly good—with the comfort of warmth in our relationships, with the strength to help...