Learning Accurate, Compact, and Interpretable Tree Annotation
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
- Parent annotation [Johnson 98]
- Head lexicalization [Collins 99, Charniak 00]
- Automatic clustering?

Previous Work: Manual Annotation

[Klein & Manning 03]: manually split categories:
- NP: subject vs. object
- DT: determiners vs. demonstratives
- IN: sentential vs. prepositional
Advantages: fairly compact grammar; linguistic motivations.
Disadvantages: performance leveled out.

Model                    F1
Naive Treebank Grammar   72.6
Klein & Manning 03       86.3

Previous Work: Automatic Annotation Induction [Matsuzaki et al. 05, Prescher 05]
Label all nodes with latent variables; use the same number k of subcategories for all categories.
Advantages: automatically learned.

Disadvantages: the grammar gets too large; most categories are oversplit while others are undersplit.

Model                 F1
Klein & Manning 03    86.3
Matsuzaki et al. 05   86.7

Previous Work is Complementary

Manual Annotation               Automatic Annotation
Allocates splits where needed   Splits uniformly
Very tedious                    Automatically learned
Compact grammar                 Large grammar
Misses features                 Captures many features

This work combines the advantages of both: automatically learned splits, allocated where needed.
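To make "grammar gets too large" concrete: annotating every node with one of k latent subcategories turns each binary treebank rule into k^3 refined rules. A toy sketch (function name and string format are my own, purely illustrative):

```python
# Illustrative only: shows why uniform splitting blows up the grammar.
# One binary treebank rule A -> B C becomes k^3 latent-annotated rules.
from itertools import product

def refine_rule(parent, left, right, k=2):
    """Expand A -> B C into all (A-i, B-j, C-l) subcategory combinations."""
    return [(f"{parent}-{i}", f"{left}-{j}", f"{right}-{l}")
            for i, j, l in product(range(k), repeat=3)]

print(len(refine_rule("NP", "DT", "NN", k=8)))  # 512 refined rules from one rule
```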

Learning Latent Annotations
EM algorithm:
- Brackets are known
- Base categories are known
- Only induce subcategories
Just like Forward-Backward for HMMs: a forward (inside) and backward (outside) pass over the tree.
[Figure: parse tree with latent subcategory variables X1..X7 over the sentence "He was right."]
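As a concrete illustration, here is a minimal sketch of the inside (bottom-up) pass under assumptions of my own, not the authors' implementation: with brackets and base categories fixed, the E-step only sums over subcategory assignments, exactly like the forward pass of Forward-Backward. `Tree`, `rule_prob`, and `emit_prob` are hypothetical names; binary trees are assumed for simplicity.

```python
import numpy as np

class Tree:
    def __init__(self, label, children=(), word=None):
        self.label = label        # base treebank category, e.g. "NP"
        self.children = children  # (left, right) for internal nodes
        self.word = word          # terminal word if this is a leaf

def inside(node, k, rule_prob, emit_prob):
    """Length-k vector of P(subtree below node | node has subcategory i)."""
    if node.word is not None:     # leaf: tag-subcategory emission scores
        return np.array([emit_prob(node.label, i, node.word) for i in range(k)])
    left, right = node.children
    li = inside(left, k, rule_prob, emit_prob)
    ri = inside(right, k, rule_prob, emit_prob)
    score = np.zeros(k)
    for i in range(k):            # parent subcategory
        for j in range(k):        # left-child subcategory
            for l in range(k):    # right-child subcategory
                score[i] += rule_prob(node.label, i, left.label, j,
                                      right.label, l) * li[j] * ri[l]
    return score
```

The outside (top-down) pass is symmetric; multiplying inside by outside scores and normalizing by the tree likelihood gives the expected rule counts for the M-step, just as in Forward-Backward for HMMs.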

Overview
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for k = 1, 2, 4, 8, 16 subcategories per category; accuracy climbs with k, up to the limit of computational resources.]
This work:
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing

Refinement of the DT tag
[Figure: the DT tag refined into subcategories DT-1 through DT-4, learned hierarchically by repeated binary splitting.]

Hierarchical Estimation Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols, comparing flat training to hierarchical training.]

Model                  F1
Baseline               87.3
Hierarchical Training  88.4
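One way to picture a single hierarchical refinement step (a sketch under my own assumptions, not the released implementation): each current subcategory is split in two by duplicating its parameters, with a little noise to break the symmetry between the copies before EM retrains. The noise level is illustrative.

```python
import numpy as np

def split_rule_table(table, noise=0.01, seed=0):
    """table: rule probabilities of shape (k_parent, k_left, k_right)."""
    rng = np.random.default_rng(seed)
    for axis in range(table.ndim):
        table = np.repeat(table, 2, axis=axis)   # each subcategory -> two copies
    table = table * (1.0 + noise * rng.uniform(-1, 1, size=table.shape))
    # Simplified renormalization over child subcategories; a real grammar
    # normalizes over all rules sharing the same parent subcategory.
    return table / table.sum(axis=(1, 2), keepdims=True)
```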

Refinement of the , tag
Splitting all categories the same amount is wasteful: the , tag needs hardly any splits, while the DT tag revisited above may even be oversplit.

Adaptive Splitting
Want to split complex categories more.
Idea: split everything, then roll back the splits which were least useful.
Evaluate the loss in likelihood from removing each split:
    loss(split) = (data likelihood with split reversed) / (data likelihood with split)
No loss in accuracy is observed when 50% of the splits are reversed.
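A sketch of how the loss above could be scored from quantities EM already maintains; this is my own reconstruction with hypothetical names, not the paper's exact formula. At each node, the inside scores of the two sibling subcategories are averaged by their relative frequencies, the outside scores pool, and the node's contribution to the tree likelihood is recomputed.

```python
def merge_ratio(in1, in2, out1, out2, p1, p2, tree_likelihood):
    """p1, p2: relative frequencies of the two subcategories (p1 + p2 = 1)."""
    merged_in = p1 * in1 + p2 * in2      # average the inside scores
    merged_out = out1 + out2             # outside mass simply pools
    old = in1 * out1 + in2 * out2        # this node's current contribution
    # Ratio close to 1 => the split was barely useful and can be rolled back.
    return (tree_likelihood - old + merged_in * merged_out) / tree_likelihood
```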

Adaptive Splitting Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, and hierarchical training with 50% merging.]

Model             F1
Previous          88.4
With 50% Merging  89.5

Number of Phrasal Subcategories
[Chart: number of learned subcategories per phrasal category; NP, VP, and PP receive the most splits, while categories such as NAC, X, ROOT, and LST are barely split.]

Number of Lexical Subcategories
[Chart: number of learned subcategories per part-of-speech tag; open classes such as NNP, JJ, NNS, and NN receive the most splits, while tags such as POS, TO, and , are barely split.]

Smoothing
Heavy splitting can lead to overfitting.
Idea: smoothing allows us to pool statistics across the subcategories of each original category (linear smoothing).
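A minimal sketch of what linear smoothing can look like here, assuming rule probabilities are stored one row per subcategory; the interpolation weight alpha is illustrative, not the paper's tuned value.

```python
import numpy as np

def smooth(rule_probs, alpha=0.01):
    """rule_probs: shape (k, n_rules); one row per subcategory.

    Shrink each subcategory's distribution toward the mean over all
    subcategories of the same original category, pooling statistics.
    """
    mean = rule_probs.mean(axis=0, keepdims=True)
    return (1 - alpha) * rule_probs + alpha * mean
```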

Result Overview
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, 50% merging, and 50% merging with smoothing.]

Model           F1
Previous        89.5
With Smoothing  90.7

Final Results

Parser                  F1 (≤40 words)   F1 (all words)
Klein & Manning 03      86.3             85.7
Matsuzaki et al. 05     86.7             86.1
Collins 99              88.6             88.2
Charniak & Johnson 05   90.1             89.6
This Work               90.2             89.7

Linguistic Candy

Proper nouns (NNP):
NNP-14  Oct.     Nov.       Sept.
NNP-12  John     Robert     James
NNP-2   J.       E.         L.
NNP-1   Bush     Noriega    Peters
NNP-15  New      San        Wall
NNP-3   York     Francisco  Street

Personal pronouns (PRP):
PRP-0   It       He       I
PRP-1   it       he       they
PRP-2   it       them     him

Relative adverbs (RBR):
RBR-0   further  lower    higher
RBR-1   more     less     More
RBR-2   earlier  Earlier  later

Cardinal numbers (CD):
CD-7    one      two      Three
CD-4    1989     1990     1988
CD-11   million  billion  trillion
CD-0    1        50       100
CD-3    1        30       31
CD-9    78       58       34

Conclusions
New ideas: hierarchical training, adaptive splitting, parameter smoothing.
State-of-the-art parsing performance: F1 improves from 63.4 (the X-Bar initializer) to 90.2.
Linguistically interesting grammars to sift through.

Thank You!
[email protected]

Other Things We Tried
- X-Bar vs. structurally annotated grammar: the X-Bar grammar starts at lower performance, but provides more flexibility.
- Better smoothing: tried different (hierarchical) smoothing methods; all worked about the same.
- (Linguistically) constraining rewrite possibilities between subcategories: hurts performance. EM automatically learns that most subcategory combinations are meaningless: 90% of the possible rewrites have probability 0.