Information Extraction, Data Mining & Joint Inference Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Khashayar Rohanimanesh, Ben Wellner, Karl Schultz, Michael Hay, Michael Wick, David Mimno. My Research Building models that mine actionable knowledge from unstructured text. From Text to Actionable Knowledge Spider Filter Data Mining
IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Database Document collection Actionable knowledge Prediction Outlier detection Decision support A Natural Language Processing Pipeline Pragmatics Anaphora Resolution Errors cascade &
accumulate Semantic Role Labeling Entity Recognition Parsing Chunking POS tagging Unified Natural Language Processing Pragmatics Anaphora Resolution Unified, joint inference. Semantic Role Labeling Entity Recognition Parsing Chunking POS tagging Solution: Uncertainty Info Spider Filter
Data Mining IE Segment Classify Associate Cluster Discover patterns - entity types - links / relations - events Database Document collection Actionable knowledge Emerging Patterns Prediction Outlier detection Decision support
Solution: Unified Model Spider Filter Data Mining IE Segment Classify Associate Cluster Probabilistic Model Discover patterns - entity types - links / relations - events Discriminatively-trained undirected graphical models Document
collection Conditional Random Fields [Lafferty, McCallum, Pereira] Conditional PRMs [Koller], [Jensen], [Getoor], [Domingos] Complex Inference and Learning Just what we researchers like to sink our teeth into! Actionable knowledge Prediction Outlier detection Decision support Scientific Questions What model structures will capture salient dependencies? Will joint inference actually improve accuracy? How to do inference in these large graphical models? How to do parameter estimation efficiently in these models, which are built from multiple large components? How to do structure discovery in these models?
Scientific Questions What model structures will capture salient dependencies? Will joint inference actually improve accuracy? How to do inference in these large graphical models? How to do parameter estimation efficiently in these models, which are built from multiple large components? How to do structure discovery in these models? Methods of Inference Exact Exhaustively explore all interpretations Graphical model has low tree-width Variational Represent distribution in simpler model that is close Monte-Carlo Randomly (but cleverly) sample to explore interpretations Outline Examples of IE and Data Mining. Motivate Joint Inference Brief introduction to Conditional Random Fields
Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution Joint Segmentation and Co-ref (Graph Partitioning) (Sparse BP) Probability + First-order Logic, Co-ref on Entities Semi-supervised Learning Demo: Rexa, a Web portal for researchers (MCMC) Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, Graphical model Finite state model ...
States S_{t-1}, S_t, S_{t+1}, ... generate observations O_{t-1}, O_t, O_{t+1}, ...:

P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

State sequence, observation sequence, transitions. Generates: o1 o2 o3 o4 o5 o6 o7 o8
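The HMM factorization above can be sketched in a few lines. This is a minimal sketch with toy (hypothetical) transition and emission tables and a hypothetical "START" state, not parameters from the talk:

```python
# Joint probability of an HMM: P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t).
# Toy transition/emission tables over two states, "name" and "bg" (background).
trans = {("START", "name"): 0.3, ("START", "bg"): 0.7,
         ("name", "name"): 0.4, ("name", "bg"): 0.6,
         ("bg", "name"): 0.2, ("bg", "bg"): 0.8}
emit = {("name", "Yoav"): 0.6, ("name", "spoke"): 0.1,
        ("bg", "Yoav"): 0.1, ("bg", "spoke"): 0.4}

def joint_prob(states, obs):
    """Multiply one transition and one emission term per time step."""
    p, prev = 1.0, "START"
    for s, o in zip(states, obs):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p
```

For example, `joint_prob(["name", "bg"], ["Yoav", "spoke"])` multiplies 0.3 * 0.6 * 0.6 * 0.4, one transition and one emission per step.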
IE with Hidden Markov Models Given a sequence of observations: "Yesterday Yoav Freund spoke this example sentence." and a trained HMM with states: person name, location name, background. Find the most likely state sequence (Viterbi): Yesterday [Yoav Freund] spoke this example sentence. Any words said to be generated by the designated person-name state, extract as a person name: Person name: Yoav Freund

We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates". Problems with Richer Representation and a Joint Model
These arbitrary features are not independent. Multiple levels of granularity (chars, words, phrases). Multiple dependent modalities (words, formatting, layout). Past & future. Two choices: (1) Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data! (2) Ignore the dependencies. This causes over-counting of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o): Can examine features, but not responsible for generating them. Don't have to explicitly model their dependencies. Don't waste modeling effort trying to generate what we are given at test time anyway. From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]  Let o = o_1, o_2, ..., o_n and s = s_1, s_2, ..., s_n.

Joint:  P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

Conditional:  P(s \mid o) = \frac{1}{P(o)} \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)
            = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t)

where  \Phi_o(t) = \exp\Big( \sum_k \lambda_k f_k(s_t, o_t) \Big)

(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the log-likelihood L.

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001] Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence).

Finite state model / graphical model: output sequence (FSM states) y_{t-1}=OTHER, y_t=PERSON, y_{t+1}=OTHER, y_{t+2}=ORG, y_{t+3}=TITLE; input sequence (observations) x_{t-1}..x_{t+3} = "said Jones a Microsoft VP".

p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t)

where  \Phi(y_t, y_{t-1}, x, t) = \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \Big)

Wide-spread interest, positive experimental results in many applications.
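The linear-chain CRF above can be sketched concretely. A minimal sketch: the label set, feature templates, and weights are toy assumptions, and Z_x is computed by brute-force enumeration, which only works for tiny examples (real CRFs use forward-backward):

```python
import itertools
import math

# Linear-chain CRF: p(y|x) = (1/Z_x) prod_t Phi(y_t, y_{t-1}, x, t),
# with Phi = exp(sum_k lambda_k f_k(y_t, y_{t-1}, x, t)).
labels = ["OTHER", "PERSON", "ORG"]

def features(y_t, y_prev, x, t):
    # A few indicator features of current label, previous label, and input.
    return {
        ("cap", y_t): 1.0 if x[t][0].isupper() else 0.0,
        ("word", x[t], y_t): 1.0,
        ("trans", y_prev, y_t): 1.0,
    }

weights = {("cap", "PERSON"): 1.5,
           ("word", "said", "OTHER"): 2.0,
           ("trans", "PERSON", "OTHER"): 0.5}   # unlisted features get weight 0

def score(y, x):
    s, prev = 0.0, "START"
    for t in range(len(x)):
        for k, v in features(y[t], prev, x, t).items():
            s += weights.get(k, 0.0) * v
        prev = y[t]
    return s

def prob(y, x):
    Z = sum(math.exp(score(list(yy), x))
            for yy in itertools.product(labels, repeat=len(x)))
    return math.exp(score(y, x)) / Z

x = ["Jones", "said"]
p = prob(["PERSON", "OTHER"], x)   # most features fire for this labeling
```

Because the features can inspect the entire input x at every position, arbitrary overlapping features cost nothing extra, which is exactly the point of the conditional model.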
Noun phrase, Named entity [HLT03], [CoNLL03] Protein structure prediction [ICML04] IE from Bioinformatics text [Bioinformatics 04], Asian word segmentation [COLING04], [ACL04] IE from Research papers [HLT04] Object classification in images [CVPR 04]

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95

Year | Milk Cows 1/ (1,000 Head) | Milk per Cow (Pounds) | Milkfat per Cow (Pounds) | Pct. Fat in All Milk | Total Milk (Million Pounds) | Total Milkfat (Million Pounds)
1993 | 9,589 | 15,704 | 575 | 3.66 | 150,582 | 5,514.4
1994 | 9,500 | 16,175 | 592 | 3.66 | 153,664 | 5,623.7
1995 | 9,461 | 16,451 | 602 | 3.66 | 155,644 | 5,694.3

1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR] 100+ documents from www.fedstats.gov. [Figure: the report above with each line labeled by the CRF.]

Labels (12 in all): Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, ...

Features: Percentage of digit chars; Percentage of alpha chars; Indented; Contains 5+ consecutive spaces; Whitespace in this line aligns with prev.; ...; Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Line labels, percent correct: HMM 65%, Stateless MaxEnt 85%, CRF 95%.
Table segments, F1: HMM 64%, CRF 92%.

IE from Research Papers [McCallum et al 99] Field-level F1:
Hidden Markov Models (HMMs) 75.6 [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs) 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs) 93.9 [Peng, McCallum, 2004] (40% error reduction)

Outline The Need for IE and Data Mining. Motivate Joint Inference Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration (MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers (MCMC)
1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]. Layers: Named-entity tag / Noun-phrase boundaries / Part-of-speech / English words. But errors cascade--must be perfect at every stage to do well. Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. Inference: Loopy Belief Propagation.

Outline The Need for IE and Data Mining. Motivate Joint Inference
Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration (MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers (MCMC)

2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. "Senator Joe Green said today ... Green ran for ..." In a linear chain, the dependency among similar, distant mentions is ignored; a skip edge ties the two "Green" mentions together. 14% reduction in error on most repeated field in email seminar announcements. Inference: Tree reparameterized BP [Wainwright et al, 2002]. See also [Finkel, et al, 2005].

Outline The Need for IE and Data Mining. Motivate Joint Inference Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration
(MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers (MCMC)

3. Joint co-reference among all pairs: Affinity Matrix CRF. Entity resolution / object correspondence (mentions: Mr. Hill, Dana Hill, Amy Hall, Dana). ~25% reduction in error on co-reference of proper nouns in newswire. Inference:
Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002] [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: news article, with named-entity "mentions" tagged: Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...

Output: number of entities, N = 3. #1 Secretary of State Colin Powell, he, Mr. Powell, Powell. #2 Condoleezza Rice, she, Rice. #3 President Bush, Bush.

Inside the Traditional Solution: Pair-wise Affinity Metric. Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . ." -- coreferent Y/N?

Feature (fires? / weight):
Two words in common (N, 29)
One word in common (Y, 13)
"Normalized" mentions are string identical (Y, 39)
Capitalized word in common (Y, 17)
> 50% character tri-gram overlap (Y, 19)
< 25% character tri-gram overlap (N, -34)
In same sentence (Y, 9)
Within two sentences (Y, 8)
Further than 3 sentences apart (N, -1)
"Hobbs Distance" < 3 (Y, 11)
Number of entities in between two mentions = 0 (N, 12)
Number of entities in between two mentions > 4 (N, -3)
Font matches (Y, 1)
Default (Y, -19)

OVERALL SCORE = 98 > threshold = 0

Entity Resolution mention Mr. Hill mention Dana Hill mention Amy Hall Dana she mention mention
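The pair-wise affinity metric above is just a weighted sum of fired features compared to a threshold. A small sketch, reusing the slide's illustrative weights (the feature names are shorthand I introduce here):

```python
# Pair-wise affinity: sum the learned weights of the features that fire for a
# mention pair and compare to a threshold. Weights echo the slide's numbers.
weights = {
    "two_words_in_common": 29,
    "one_word_in_common": 13,
    "normalized_string_identical": 39,
    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50": 19,
    "trigram_overlap_lt_25": -34,
    "same_sentence": 9,
    "within_two_sentences": 8,
    "more_than_3_sentences_apart": -1,
    "hobbs_distance_lt_3": 11,
    "entities_between_eq_0": 12,
    "entities_between_gt_4": -3,
    "font_matches": 1,
    "default": -19,
}

def affinity(fired, threshold=0):
    """Return (score, coreferent?) for the set of features that fired."""
    score = sum(weights[f] for f in fired)
    return score, score > threshold

# The features marked Y above for the "Mr Powell" / "Powell" pair:
fired = ["one_word_in_common", "normalized_string_identical",
         "capitalized_word_in_common", "trigram_overlap_gt_50",
         "same_sentence", "within_two_sentences", "hobbs_distance_lt_3",
         "font_matches", "default"]
score, coref = affinity(fired)
```

Summing the Y-marked weights (13 + 39 + 17 + 19 + 9 + 8 + 11 + 1 - 19) reproduces the slide's overall score of 98, above the threshold of 0, so the pair is judged coreferent.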
Entity Resolution: cluster the mentions {Mr. Hill, Dana Hill, Amy Hall, Dana, she} into entities -- independent pairwise affinity with connected components.

The Problem: Affinity measures are noisy and imperfect. Pair-wise merging decisions are being made independently from each other. They should be made jointly.

CRF for Co-reference [McCallum & Wellner, 2003, ICML]
Make pair-wise merging decisions jointly by: - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles.

P(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)

[Figure: the five mentions with signed edge weights, e.g. +(23) between Mr. Hill and Dana Hill, -(-55), -(-44), +(11), +(17), -(-22), -(-23), +(10), -(-9), +(4); two consistent configurations score 218 and 210, while an inconsistent/poor one scores -12.]

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]; Correlational Clustering [Bansal & Blum] [Demaine]

log P(y \mid x) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j\ \text{within partitions}} w_{ij} \;-\; \sum_{i,j\ \text{across partitions}} w_{ij}
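The partitioning objective can be sketched directly. This toy uses brute-force search in place of the partitioning algorithms cited above, and edge weights echoing the slide's illustrative affinities for four of the mentions (the shorthand names are mine):

```python
import itertools

# Score a partition: (sum of edge weights within partitions) minus
# (sum of edge weights across partitions).
edges = {("MrHill", "DanaHill"): 23, ("MrHill", "AmyHall"): -55,
         ("MrHill", "she"): -44, ("DanaHill", "AmyHall"): 11,
         ("DanaHill", "she"): 17, ("AmyHall", "she"): -22}

def partition_score(partition):
    cluster_of = {m: i for i, cluster in enumerate(partition) for m in cluster}
    return sum(w if cluster_of[a] == cluster_of[b] else -w
               for (a, b), w in edges.items())

def best_partition(mentions):
    """Enumerate all cluster labelings (fine for a handful of mentions)."""
    best = None
    for labels in itertools.product(range(len(mentions)), repeat=len(mentions)):
        clusters = {}
        for m, l in zip(mentions, labels):
            clusters.setdefault(l, []).append(m)
        candidate = list(clusters.values())
        if best is None or partition_score(candidate) > partition_score(best):
            best = candidate
    return best
```

With these weights, the best partition groups Mr. Hill with Dana Hill and leaves the low-affinity mentions apart, illustrating how one joint objective replaces independent pairwise merge decisions.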
Pairwise Affinity is not Enough. [Figure: the same mentions, now with several pronoun mentions "she"; pairwise C/N decisions alone cannot capture cluster-level compatibility.]

Pairwise Comparisons Not Enough. Examples: Are all the mentions in a cluster pronouns? Entities have multiple attributes (name, email, institution, location); need to measure compatibility among them. Having 2 given names is common, but not 4 (e.g. Howard M. Dean / Martin, Dean / Howard Martin). Need to measure the size of the clusters of mentions. Is there a pair of last-name strings in the cluster that differ in more than 5 characters? We need to ask questions about a set of mentions. We want first-order logic!

Outline The Need for IE and Data Mining. Motivate Joint Inference
Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration (MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers (MCMC)

Partition Affinity CRF: Ask arbitrary questions about all entities in a partition with first-order logic... bringing together LOGIC and PROBABILITY. This space complexity is common in probabilistic first-order logic models.

Markov Logic: First-Order Logic as a Template to Define CRF Parameters [Richardson & Domingos 2005] [Paskin & Russell 2002] [Taskar et al 2003]. Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.

How can we perform inference and learning in models that cannot be grounded?

Inference in Weighted First-Order Logic -- SAT Solvers: Weighted SAT solvers [Kautz et al 1997] require complete grounding of the network. LazySAT [Singla & Domingos 2006] saves memory by only storing clauses that may become unsatisfied, but initialization still requires time O(n^r) to visit all ground clauses.

Inference in Weighted First-Order Logic -- MCMC: Gibbs Sampling: difficult to move between high-probability configurations by changing single variables (although, consider MC-SAT [Poon & Domingos 06]). An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]. 2 parts: proposal distribution, acceptance distribution.
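The two parts fit together as below. A minimal sketch: the proposer and scorer passed in are caller-supplied toys, and for a log-linear model the likelihood ratio p(y')/p(y) reduces to exp(score(y') - score(y)), so the partition function Z_X never needs to be computed:

```python
import math
import random

# Metropolis-Hastings over configurations: propose a jump, then accept it
# with probability min(1, likelihood ratio * proposal-bias correction).
def accept_probability(score_new, score_old, q_forward=1.0, q_backward=1.0):
    # q_backward/q_forward corrects for any bias in the proposal distribution.
    ratio = math.exp(score_new - score_old) * (q_backward / q_forward)
    return min(1.0, ratio)

def metropolis_hastings(init, propose, score, steps=1000, seed=0):
    rng = random.Random(seed)
    y = init
    for _ in range(steps):
        y_new = propose(y, rng)
        if rng.random() < accept_probability(score(y_new), score(y)):
            y = y_new
    return y
```

Because only score *differences* appear in the acceptance test, arbitrarily "smart" jump functions can be plugged in without ever grounding or normalizing the full model.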
Can be extended to partial configurations: only instantiate relevant variables. Successfully used in BLOG models [Milch et al 2005]. Key advantage: can design arbitrary smart jumps. Don't represent all alternatives... just one at a time.

Proposal Distribution: a stochastic jump from the current configuration to a new one.

Metropolis-Hastings Jump acceptance probability: p(y')/p(y) : likelihood ratio -- as a ratio of P(Y|X), the partition function Z_X cancels! q(y|y') : proposal distribution probability of proposing the move y -> y'; this ratio makes up for any biases in the proposal distribution.

Learning the Likelihood Ratio -- Parameter Estimation in Large State Spaces. Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3,... | x1, x2, x3,...), which in turn requires expectations of marginals, P(y1 | x1, x2, x3,...). But getting marginal distributions by sampling can be inefficient due to the large sample space. Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output. But even finding the model's predicted best output is expensive. We propose: Sample Rank [Culotta, Wick, Hall, McCallum, HLT 2007]. Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1,... | ...) > P(y1=0, y2=0, y3=1,... | ...)

Ranking vs Classification Training Instead of training
[Powell, Mr. Powell, he] --> YES [Powell, Mr. Powell, she] --> NO ...Rather... [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]. In general, the higher-ranked example may contain errors: [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Ranking Intermediate Solutions. Example: 2. Model = -23, Truth = -0.2. 3. Model = 10, Truth = -0.1. 4. Model = -10, Truth = -0.1. 5. Model = 3, Truth = 0.3. UPDATE. Like Perceptron: proof of convergence under Marginal Separability. More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!

Sample Rank Algorithm
1. Proposer: y_{t+1} = Propose(x, y_t)
2. Performance Metric: F(y)
3. Inputs: input sequence x and an initial (random) configuration y_0
4. Initialization: set the parameter vector Theta = 0
5. Output: parameters Theta
6. Score function: Score_x(y) = Theta . Phi(x, y)
7. For t = 1, ..., T:
   - Generate a training instance y_{t+1} = Propose(x, y_t)
   - Let y+ and y- be the best and worst configurations among y_t and y_{t+1} according to the performance metric F.
   - If Score_x(y+) < Score_x(y-): Theta = Theta + Phi(x, y+) - Phi(x, y-)
8. end for
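The algorithm above can be sketched in a few lines. The feature function, proposer, and truth metric in the toy usage are hypothetical, and this sketch treats score ties as mis-rankings so that updates also fire from the all-zero initialization:

```python
import random

# Sample Rank: as the proposer walks through configurations, make a
# perceptron-style update whenever the model fails to rank the truth-preferred
# neighbor above the other.
def sample_rank(x, y0, propose, features, truth_metric, theta, iters=100,
                seed=0):
    rng = random.Random(seed)
    y = y0
    def score(f):
        return sum(theta.get(k, 0.0) * v for k, v in f.items())
    for _ in range(iters):
        y_new = propose(x, y, rng)
        # Order the neighboring pair by the true performance metric F.
        if truth_metric(y_new) >= truth_metric(y):
            y_plus, y_minus = y_new, y
        else:
            y_plus, y_minus = y, y_new
        f_plus, f_minus = features(x, y_plus), features(x, y_minus)
        if truth_metric(y_plus) > truth_metric(y_minus) and \
           score(f_plus) <= score(f_minus):
            for k in set(f_plus) | set(f_minus):   # theta += phi(y+) - phi(y-)
                theta[k] = theta.get(k, 0.0) + f_plus.get(k, 0.0) - f_minus.get(k, 0.0)
        y = y_new
    return theta

# Toy usage: binary labelings; the truth metric counts positions that match a
# hypothetical target labeling, and the proposer flips one random position.
def _flip(x, y, rng):
    i = rng.randrange(len(y))
    return y[:i] + [1 - y[i]] + y[i + 1:]

target = [1, 0, 1]
theta = sample_rank(None, [0, 0, 0], _flip,
                    lambda x, y: {("pos", i, v): 1.0 for i, v in enumerate(y)},
                    lambda y: sum(a == b for a, b in zip(y, target)),
                    theta={}, iters=200)
learned_score = lambda y: sum(theta.get(("pos", i, v), 0.0) for i, v in enumerate(y))
```

Note that only relative scores of neighboring configurations are ever compared, so (as with Metropolis-Hastings) the normalizer never has to be computed.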
Weighted Logics Techniques Overview. Metropolis-Hastings (MH) for inference: freely bake in domain knowledge about fruitful jumps; MH safely takes care of its biases. Avoid the memory and time consumption of massive deterministic constraint factors: build jump functions that simply avoid illegal states. Sample Rank: don't train by likelihood of the completely correct solution... train to properly rank intermediate configurations; the partition function (normalizer) cancels! ...plus other efficiencies.

Partition Affinity CRF Experiments [Culotta, Wick, Hall, McCallum, 2007]
B-Cubed F1 Score on ACE 2004 Noun Coreference (better training across, better representation down):

Representation     | Likelihood-based Training | Rank-based Training
Partition Affinity | 69.2                      | 79.3 (new state-of-the-art)
Pairwise Affinity  | 62.4                      | 72.5

To our knowledge, best previously reported results: 1997 65%, 2002 67%, 2005 68%.

Outline The Need for IE and Data Mining. Motivate Joint Inference Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation)
Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration (MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers (MCMC)

Information Integration. [Figure: two databases with different schemas. Schema A has fields First Name, Last Name, Phone, with records such as "J. Smith / 222-444-1337", "John D. Smith / 444 1337", "J Smiht / 432-111-5555"; Schema B has fields Contact, Name, with records such as "John Smith / U.S. / 222-444-1337", "John Smith (1) / 4321115555". Schema matching aligns the fields; coreference clusters the records (John #1, John #2); the Normalized DB lists Entity# / Name / Phone: 523 John Smith 222-444-1337, 524 John D. Smith 432-111-5555.]

Information Integration Steps: 1. Schema Matching (First Name, Last Name <-> Name; Phone <-> Contact). 2. Coreference. 3. Canonicalization.
Canonicalization: J. Smith, J. Smith, John Smith -> John Smith; A. Jones, Amanda -> Amanda Jones.

Problems with a Pipeline: 1. Data integration tasks are highly correlated. 2. Errors can propagate.

Schema Matching First: 1. Schema Matching (First Name, Last Name <-> Name; Phone <-> Contact) provides evidence for 2. Coreference. NEW FEATURES: 1. String Identical: F.Name+L.Name == Name. 2. Same Area Code: 3-gram in Phone/Contact.

Coreference First: 1. Coreference provides evidence for 2. Schema Matching. NEW FEATURES: 1. Field values similar across coref'd records. 2. Phone+Contact has the same value for the J. Smith mentions.

Hazards of a Pipeline: [Figure: 1. Schema Matching must align Table A (Name, Corporation: "Amanda Jones / J. Smith & Sons", "J. Smith / IBM") with Table B (Full Name, Company Name: "Amanda Jones / Smith & Sons", "John Smith / IBM"). 2. Coreferent? ERRORS PROPAGATE.]

Canonicalization. Coref Entity 87: {John Smith, J. Smith, J. Smith, J. Smiht, J.S Mith, Jonh smith, John}
Canonicalization: John Smith. Typically occurs AFTER coreference. Desiderata: Complete: contains all information (e.g. first + last). Error-free: no typos (e.g. avoid "Smiht"). Central: represents all mentions (not "Mith"). Access to such features would be very helpful to coref.

Coreference and Canonicalization with Schema Matching -- the joint model:

P(Y \mid X) = \frac{1}{Z_X} \prod_{y_i \in Y} \phi_w(y_i, x_i) \prod_{y_i, y_j \in Y} \phi_b(y_{ij}, x_{ij}), \qquad \phi(y_i, x_i) = \exp\Big( \sum_k \lambda_k f_k(y_i, x_i) \Big)

[Factor graph: x_i is a set of attributes (e.g. {phone, contact, telephone}) or a set of mentions (e.g. {J. Smith, John, John Smith}); y_ij is a binary variable indicating a match between x_i and x_j, with a factor f_ij between them; f_i is a factor over cluster x_i, and y_i is a binary variable indicating a cluster-wide match. Entity/attribute factors omitted for clarity.]

Dataset: Faculty and alumni listings from university websites, plus an IE system. 9 different schemas. ~1400 mentions, 294 coreferent.

Example Schemas: [Figure: three of the nine schemas -- DEX IE, Northwestern Fac, UPenn Fac -- with fields such as First Name, Middle Name, Last Name, Name, Title, PhD Alma Mater, Research Interests, Job+Department, Department, Office Address, Company Name, Home Phone, Office Phone, Fax Number, E-mail.]
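Scoring a candidate configuration under a model of this shape can be sketched as below: one log-linear factor per cluster and one per pair of clusters. The feature functions and weights are hypothetical stand-ins for the paper's feature set, and Z_X is omitted since MH and Sample Rank only need score differences:

```python
# Unnormalized log-score of a clustering: cluster factors phi_w(y_i, x_i)
# plus pairwise factors phi_b(y_ij, x_ij), each log-linear in theta.
def log_score(clusters, cluster_feats, pair_feats, theta):
    s = 0.0
    for c in clusters:                      # cluster factors
        for k, v in cluster_feats(c).items():
            s += theta.get(k, 0.0) * v
    for i in range(len(clusters)):          # pairwise factors
        for j in range(i + 1, len(clusters)):
            for k, v in pair_feats(clusters[i], clusters[j]).items():
                s += theta.get(k, 0.0) * v
    return s

# Toy features: reward clusters whose mentions all share a token; penalize
# shared tokens across different clusters.
def _tokens(mention):
    return set(mention.replace(".", "").split())

def cluster_feats(c):
    common = set.intersection(*[_tokens(m) for m in c])
    return {"all_share_token": 1.0} if len(c) >= 2 and common else {}

def pair_feats(c1, c2):
    t1 = set().union(*[_tokens(m) for m in c1])
    t2 = set().union(*[_tokens(m) for m in c2])
    return {"share_token_across": 1.0} if t1 & t2 else {}

theta = {"all_share_token": 2.0, "share_token_across": -1.0}
merged = [["J. Smith", "John Smith"], ["Amanda Jones"]]
split = [["J. Smith"], ["John Smith"], ["Amanda Jones"]]
```

Under these toy weights the merged clustering outscores the split one, illustrating how first-order questions about whole clusters ("do all mentions share a token?") become ordinary factors.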
Coreference Features First order quantifications/aggregations over: Cosine distance (unweighted) Cosine distance (TFIDF) Cosine distance between mapped fields Substring match between mapped fields All of the above comparisons, but use canonical records Schema Matching Features First order quantifications/aggregations over: String identical Sub string matches TFIDF weighted cosine distance
All of the above, computed between coreferent mentions only.

Systems: ISO: each task in isolation. CASC: Coref -> Schema matching. CASC: Schema matching -> Coref. JOINT: Coref + Schema matching (our new work). Each system is evaluated with and without joint canonicalization.

Coreference Results
           |     No Canon      |       Canon
           | F1   Prec  Recall | F1   Prec  Recall
Pair ISO   | 72.7  88.9  61.5  | 75.0  88.9  64.9
Pair CASC  | 64.0  66.7  61.5  | 65.7  66.7  64.9
Pair JOINT | 76.5  89.7  66.7  | 78.8  89.7  70.3
MUC ISO    | 78.3  90.0  69.2  | 80.6  90.0  73.0
MUC CASC   | 65.8  67.6  64.1  | 67.6  67.6  67.6
MUC JOINT  | 81.7  90.6  74.4  | 84.1  90.6  74.4

Note: cascade does worse than ISO.

Schema Matching Results
           |     No Canon      |       Canon
           | F1   Prec  Recall | F1   Prec  Recall
Pair ISO   | 50.9  40.9  67.5  | 69.2  81.8  60.0
Pair CASC  | 50.9  40.9  67.5  | 69.2  81.8  60.0
Pair JOINT | 68.9  100   52.5  | 69.6  100   53.3
MUC ISO    | 50.9  40.9  67.5  | 69.2  81.8  60.0
MUC CASC   | 52.3  41.8  70.0  | 74.1  83.3  66.7
MUC JOINT  | 71.0  100   55.0  | 75.0  100   60.0

Note: cascade not as harmful here.

Outline The Need for IE and Data Mining. Motivate Joint Inference Brief introduction to Conditional Random Fields Joint inference: Information Extraction Examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Probability + First-order Logic, Co-ref on Entities Joint Information Integration (MCMC + Sample Rank) Demo: Rexa, a Web portal for researchers
(MCMC) Data Mining Research Literature Better understand structure of our own research area. Structure helps us learn a new field. Aid collaboration. Map how ideas travel through social networks of researchers. Aids for hiring and finding reviewers! Measure impact of papers or people. Our Data: Over 1.6 million research papers, gathered as part of the Rexa.info portal. Cross-linked references / citations.

Previous Systems: Cites -- Research Paper. More Entities and Relations: Expertise, Cites, Research Paper, Grant, Venue, Person, University, Groups
Topical Transfer: citation counts from one topic to another. Map producers and consumers.

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]: Topical Citation Counts, Topical Impact Factors, Topical Longevity, Topical Precedence, Topical Diversity, Topical Transfer.

Topical Transfer: transfer from Digital Libraries to other topics.

Other topic     | Cits | Paper Title
Web Pages       | 31   | Trawling the Web for Emerging CyberCommunities, Kumar, Raghavan, ... 1999.
Computer Vision | 14   | On being Undigital with digital cameras: extending the dynamic...
Video           | 12   | Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs          | 12   | Trawling the Web for Emerging CyberCommunities
Web Pages       | 11   | WebBase: a repository of Web pages

Topical Diversity: papers that had the most influence across many other fields... Entropy of the topic distribution among papers that cite this paper (this topic). High Diversity vs. Low Diversity.

Summary Joint inference needed for avoiding cascading errors in information extraction and data mining. Most fundamental problem in NLP, data mining, ... Can be performed in CRFs
Cascaded sequences (Factorial CRFs) Distant correlations (Skip-chain CRFs) Co-reference (Affinity-matrix CRFs) Logic + Probability (efficient by MCMC + Sample Rank) Information Integration Rexa: New research paper search engine, mining the interactions in our community. Outline Model / Feature Engineering Brief review of IE w/ Conditional Random Fields Flexibility to use non-independent features Inference Entity Resolution with Probability + First-order Logic Resolution + Canonicalization + Schema Mapping Inference by Metropolis-Hastings
Parameter Estimation Semi-supervised Learning with Label Regularization ...with Feature Labeling Generalized Expectation criteria