Interconnecting lexicographic resources. In search for a model

Dan Cristea
"Alexandru Ioan Cuza" University of Iași
Institute of Computer Science of the Romanian Academy
COST-ENeL, Bled, 29-30 September 2014

Topics
  • Why would one want to connect linguistic resources?
  • Parameterising the needs
  • Standardisation helps interconnection
  • A few well-known resources
  • How would this work?
  • Final remarks

Why would one want to connect linguistic resources?
Use case 1: 100 Romanian dictionaries aligned
  • CLRE: Essential Romanian Lexicographical Corpus
  • 100 dictionaries aligned at entry level and, partially, at sense level (2010-2013, at the "A. Philippide" Institute of the Romanian Academy, in Iași)
  • list of dictionaries at: http://85.122.23.90/resurse/Lista-dictionarelor.doc
  • written in 3 types of alphabets: Cyrillic, transition and Latin
  • large diversity of formatting styles

The CLRE project (slides dated Bucharest, 14-15 December 2012)
[Figure: ISER]

[Figure: Petri]
[Figure: DN II]
[Figure: Blesc u]
[Figure: Lexicon Militar]
[Figure: Dicționar de informatică]

Processing in CLRE (slides dated Iași, 25-26 September 2013)
  • Scanning
  • OCR (ABBYY FineReader 9)
  • Parsing entries => XML
  • Manual verification
  • Indexing and alignment
[Figure: CLRE manual verification]

Use case 2: align WN with an explanatory dictionary
  • a WN synset, pos (def, ex, w1_s1 ... wk_sk ... wn_sn), against an explanatory dictionary entry: wk, pos, ...
  • the synsets of a word wk, pos: (def1, ex1, wk_s1 ...) ... (defk, exk, wk_sk ...) ... (defm, exm, wk_sm ...), against the explanatory dictionary entry of the word wk, pos: wk, pos, ...
  • a WN synset, pos (def, ex, w1_s1 ... wk_sk ... wn_sn), against the explanatory dictionary entries of all its words: w1, pos, ...; wk, pos, ...; wn, pos, ...

Use case 3: the TOT (tip-of-the-tongue) problem or the forgotten word
[Figure legend: A: entire lexicon; B: reduced search space; C: categorial tree; D: chosen word]
Post-processing: 1. ambiguity detection via WN; 2. disambiguation via clustering

[Figure by Michael Zock: from the evoked term (coffee) to the target word (mocha)]

Step 1 (system builder): create and/or use an associative network (E.A.T., collocations derived from corpora) over the entire lexicon (A ... Z), here a hypothetical lexicon containing 60,000 words.
Step 1 (user): provide input, say, coffee. Given some input, the system displays all directly associated words, i.e. direct neighbours (graph), ordered by some criterion or not.
Pre-processing: 1. ambiguity detection via WN; 2. interactive disambiguation: coffee: beverage or color?

Associated terms to the input coffee (beverage), forming the reduced search space B (word, count, weight):
  TEA 39 0.39; CUP 7 0.07; BLACK 5 0.05; BREAK 4 0.04; ESPRESSO 4 0.04; POT 3 0.03;
  CREAM 2 0.02; HOUSE 2 0.02; MILK 2 0.02; CAPPUCINO 2 0.02; STRONG 2 0.02; SUGAR 2 0.02; TIME 2 0.02;
  BAR 1 0.01; BEAN 1 0.01; BEVERAGE 1 0.01; BISCUITS 1 0.01; BITTER 1 0.01; DARK 1 0.01; DESERT 1 0.01;
  DRINK 1 0.01; FRENCH 1 0.01; GROUND 1 0.01; INSTANT 1 0.01; MACHINE 1 0.01; MOCHA 1 0.01; MORNING 1 0.01;
  MUD 1 0.01; NEGRO 1 0.01; SMELL 1 0.01; TABLE 1 0.01

Step 2 (system builder): clustering + labeling, i.e. potential categories (nodes) for the words displayed in the search space (B)
  • beverage, food, color, ...
  • used_for, used_with
  • quality, origin, place
  obtained 1. via computation, 2. via a resource, 3. via a combination of resources (WordNet, Roget, Named Entities, ...)
Categorial tree (C): FOOD, TASTE (set of words), DRINK (espresso, cappucino, mocha), COLOR, COOKY (set of words)
Step 2 (user): navigation + choice
  1. navigate in the tree and determine whether it contains the target or a more or less related word
  2. decide on the next action: stop here, or continue
Target word (D)

The tree is designed for navigational purposes (reduction of the search space). The leaves contain potential target words and the nodes the names of their categories, allowing the user to look only under the relevant part of the tree. Since words are grouped in named clusters, the user does not have to go through the whole list of words anymore. Rather, he navigates in a tree (top to bottom, left to right), choosing first the category and then its members, to check whether any of them corresponds to the desired target word.

The first thought: standardisation
Lexical Markup Framework (LMF)
What is it?
  • a common model for the creation and use of lexical resources
With what goal?

  • to manage the exchange of data between and among these resources
  • to enable the merging of a large number of individual electronic resources to form extensive global electronic resources

Near-standard: Text Encoding Initiative (TEI)
What is it?
  • an inventory of the features most often deployed for computer-based text processing
  • recommendations about suitable ways of representing these features
With what goal?
  • to facilitate processing by computer programs
  • to facilitate the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software

Standardisation: Text Encoding Initiative (TEI)
Example of a dictionary entry serialisation (from the TEI Guidelines)
Source entry (CED): disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving.
TEI serialisation (markup not preserved; text content of the elements, in order):
  disproof
  dIs"pru:f
  n
  facts that disprove something.
  the act of disproving.

LMF and TEI: content and discontent
  • The TEI format may be used as an interchange format, permitting sharing of resources even when their local encoding schemes differ.
  • Both LMF and TEI model lexical material at a deep level of representational detail.
  • TEI intention: guidance for individual or local practice in text creation and data capture; support of data interchange; support of application-independent local processing
  • This opens good possibilities of querying.
  • But how would the interconnection function?

Parameterising the needs
  • If I want to connect two resources, simply merge the contents
  • Then be able to interrogate the merged resource by taking advantage of the peculiarities of each resource
  • Able to represent variations in word forms, alternate orthography, diachronic morphology
  • Easy navigation by applying various filtering criteria

  • Very often lexicographic data is hierarchical: for instance, a sense of a dictionary entry contains a definition and examples, but also sub-senses
  • Organise even recursive searches, e.g. "give me the definition neighbouring sphere of depth 2 of the word captain": take all senses of the entry captain and form the list of words in the corresponding definitions, then for each of them take all their senses and collect again the words in their definitions

The idea
Representing lexical information as feature structures centred on word lemmas
Source entry (CED): disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving.
[lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED]]
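To make the step from a TEI entry to a lemma-centred feature structure concrete, here is a minimal Python sketch. It rests on assumptions: the XML element names (entry, form, orth, pron, gramGrp, pos, sense, def) are a reconstruction in the style of TEI dictionary markup, not copied from the slides, and the function name tei_to_fs and the choice of nested dictionaries (with repeated sense features held in a list) are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed TEI-style encoding of the CED entry; element names are a
# reconstruction for illustration, not taken from the slides.
TEI_ENTRY = """
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1"><def>facts that disprove something</def></sense>
  <sense n="2"><def>the act of disproving</def></sense>
</entry>
"""


def tei_to_fs(xml_text, res):
    """Convert one TEI-style <entry> into a lemma-centred feature structure.

    `res` names the resource the entry comes from (e.g. "CED").
    """
    entry = ET.fromstring(xml_text)
    senses = [
        {"n": int(s.get("n")), "def": s.findtext("def")}
        for s in entry.findall("sense")
    ]
    return {
        "lemma": entry.findtext("form/orth"),
        "entry": {
            "pron": entry.findtext("form/pron"),
            "pos": entry.findtext("gramGrp/pos"),
            "sense": senses,  # repeated 'sense' features are kept as a list
            "res": res,
        },
    }


fs = tei_to_fs(TEI_ENTRY, res="CED")
print(fs)
```

The resulting dictionary mirrors the bracketed notation: one lemma feature and one entry feature whose value is itself a feature structure.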

The CED (Cambridge English Dictionary) entry as an attribute-value matrix:
  lemma = disproof
  entry = [ pron  = dIs"pru:f
            pos   = n
            sense = [ n = 1, def = facts that disprove something ]
            sense = [ n = 2, def = the act of disproving ]
            res   = CED ]
[Figure: graph representation of the CED entry; arcs lemma and entry at the top, then pron, pos, sense, sense and res under entry, and n and def under each sense]

The MWCD (Merriam-Webster's Collegiate Dictionary) entry as an attribute-value matrix:
  lemma = disproof
  entry = [ pron  = dIs"pru:f
            pos   = n
            sense = [ n = 1, def = the action of disproving ]
            sense = [ n = 2, def = evidence that disproves ]
            res   = MWCD ]
[Figure: graph representation of the MWCD entry, with the same arcs]
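The recursive search mentioned earlier (the definition neighbouring sphere of depth 2 of captain) can be sketched over such feature structures. This is only an illustration under stated assumptions: the toy lexicon and its definitions are invented, the function names are hypothetical, the sphere is taken to be the union of the words collected at every depth, and words are neither lemmatised nor filtered for stop words.

```python
import re

# Invented toy lexicon: lemma -> feature structure (definitions are made up)
LEXICON = {
    "captain": {"lemma": "captain", "entry": {"pos": "n", "sense": [
        {"n": 1, "def": "officer commanding a ship"},
        {"n": 2, "def": "leader of a team"}]}},
    "officer": {"lemma": "officer", "entry": {"pos": "n", "sense": [
        {"n": 1, "def": "person holding command"}]}},
    "ship": {"lemma": "ship", "entry": {"pos": "n", "sense": [
        {"n": 1, "def": "large vessel"}]}},
    "leader": {"lemma": "leader", "entry": {"pos": "n", "sense": [
        {"n": 1, "def": "person who guides a team"}]}},
    "team": {"lemma": "team", "entry": {"pos": "n", "sense": [
        {"n": 1, "def": "group of players"}]}},
}


def definition_words(lemma, lexicon):
    """Collect the words occurring in the definitions of all senses of a lemma."""
    fs = lexicon.get(lemma)
    if fs is None:  # word without an entry: contributes nothing
        return set()
    words = set()
    for sense in fs["entry"]["sense"]:
        words.update(re.findall(r"[a-z]+", sense["def"].lower()))
    return words


def definition_sphere(lemma, lexicon, depth):
    """Union of the definition words reached in at most `depth` steps."""
    sphere = set()
    frontier = {lemma}
    for _ in range(depth):
        reached = set()
        for word in frontier:
            reached |= definition_words(word, lexicon)
        reached -= sphere      # expand each word only once
        sphere |= reached
        frontier = reached
    return sphere


sphere = definition_sphere("captain", LEXICON, 2)
print(sorted(sphere))
```

A real implementation would lemmatise the definition words before looking them up, since inflected forms such as commanding have no entry of their own.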

How could lexical entries be merged?
Entries of the same word from different dictionaries
[Figure: the CED graph and the MWCD graph of disproof, side by side]

Merging lexical entries
  • Distinct parts
    [Figure: the two graphs with the parts that differ (the sense and res arcs) marked; lemma, pron and pos coincide]
  • Merged graph
    [Figure: a single graph for disproof; lemma, pron and pos appear once, and two new nodes X under entry carry the sense and res arcs of MWCD and of CED respectively]
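The merge pictured in these figures can be sketched as follows, assuming feature structures are held as nested Python dictionaries. The function name merge_entries and the policy it applies (features with identical values are kept once; everything else is grouped under one new node X per source) are one reading of the figures, not a definitive algorithm.

```python
CED = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n",
    "sense": [{"n": 1, "def": "facts that disprove something"},
              {"n": 2, "def": "the act of disproving"}],
    "res": "CED"}}

MWCD = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n",
    "sense": [{"n": 1, "def": "the action of disproving"},
              {"n": 2, "def": "evidence that disproves"}],
    "res": "MWCD"}}


def merge_entries(fs_a, fs_b):
    """Merge two feature structures describing the same lemma.

    Features with equal values in both entries are kept once; the remaining
    features of each entry are grouped under a new node "X", one per source.
    """
    if fs_a["lemma"] != fs_b["lemma"]:
        raise ValueError("only entries of the same lemma can be merged")
    entry_a, entry_b = fs_a["entry"], fs_b["entry"]
    shared = {k: v for k, v in entry_a.items()
              if k in entry_b and entry_b[k] == v}
    distinct = [{k: v for k, v in e.items() if k not in shared}
                for e in (entry_a, entry_b)]
    return {"lemma": fs_a["lemma"], "entry": {**shared, "X": distinct}}


merged = merge_entries(CED, MWCD)
print(merged)
```

Keeping res inside each X node preserves the provenance of every sense, which is what later allows a query to exploit the peculiarities of each resource.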

Representation of the merged feature structure
  lemma = disproof
  entry = [ pron = dIs"pru:f
            pos  = n
            X    = [ sense = [ n = 1, def = facts that disprove something ]
                     sense = [ n = 2, def = the act of disproving ]
                     res   = CED ]
            X    = [ sense = [ n = 1, def = the action of disproving ]
                     sense = [ n = 2, def = evidence that disproves ]
                     res   = MWCD ] ]

[Figure: the WordNet search for disproof]

Feature structures representation for the WN synsets of disproof
  lemma   = disproof
  synsets = [ pos    = n
              synset = [ lex   = (*, falsification, refutation)
                         gloss = any evidence that helps to establish the falsity of something ]
              synset = [ lex   = (falsification, falsifying, *, refutation, refutal)
                         gloss = the act of determining that something is false ] ]

[Figure: the WordNet search for discount]

Representing WordNet synsets
  lemma   = discount
  synsets = [ pos    = n
              synset = [ lex   = (*, price reduction, deduction)
                         gloss = the act of reducing the selling price of merchandise ]
              synset = [ lex   = (discount rate, *, bank discount)
                         gloss = interest on an annual basis deducted in advance on a loan ] ]
  synsets = [ pos    = v
              synset = [ lex   = (dismiss, disregard, brush aside, brush off, *, push aside, ignore)
                         gloss = bar from attention or consideration
                         ex    = "She dismissed his advances" ] ]

How could dictionary entries be merged with WN synsets?
[Figure: the CED graph of disproof next to the WN graph of disproof]

Merging a dictionary entry with a WN entry
[Figure: the two graphs joined at their common lemma node, which now carries both the entry arc (CED) and the synsets arc (WN)]

Going one step further

  • Feature structures are hierarchical data
  • Codd: hierarchical data can be represented as relational tables
Codd, E.F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377-387.

Representing feature structures as relational tables
[Figure, after http://en.wikipedia.org/wiki/Relational_database: the feature structure of a word w unfolded step by step into four linked tables]
  WORD  (id, lemma, entry)
  ENTRY (entry, pron, pos, sense, res)
  SENSE (sense, n, def, cit)
  CIT   (cit, orth, auth, yr)

Relational operators
  • Projection: π_{a1,...,an}(R) => a relation containing only the values of the attributes a1, ..., an from the relation R
  • Selection: σ_φ(R), with φ a logical condition => only the tuples verifying the condition are retained from the relation (or the set) R
  • Join: R ⋈ S => the set of all combinations of tuples in R and S that are equal on their common attributes
  • Union: R ∪ S => a table representing the union of the two relations

Interrogating a dictionary
Citations before 1850 of the entry symphony.
[Figure: the feature structure of w and the tables WORD, ENTRY, SENSE and CIT]
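A minimal sketch of this query in Python with SQLite, under several assumptions: the column layout of the four tables follows the figure as far as it can be read, all rows are invented toy data (including the citations, authors and years), and SQL's NATURAL JOIN stands in for the join on common attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word  (id INTEGER, lemma TEXT, entry INTEGER);
CREATE TABLE entry (entry INTEGER, pron TEXT, pos TEXT, sense INTEGER, res TEXT);
CREATE TABLE sense (sense INTEGER, n INTEGER, def TEXT, cit INTEGER);
CREATE TABLE cit   (cit INTEGER, orth TEXT, auth TEXT, yr INTEGER);

-- invented toy rows; one SENSE row per citation of that sense
INSERT INTO word  VALUES (1, 'symphony', 1), (2, 'captain', 2);
INSERT INTO entry VALUES (1, NULL, 'n', 1, 'DICT'), (2, NULL, 'n', 2, 'DICT');
INSERT INTO sense VALUES (1, 1, 'harmony of sounds', 1),
                         (1, 1, 'harmony of sounds', 2),
                         (2, 1, 'leader of a team', 3);
INSERT INTO cit   VALUES (1, 'a symphony of voices', 'Author A', 1790),
                         (2, 'the symphony began',   'Author B', 1920),
                         (3, 'the captain spoke',    'Author C', 1800);
""")

# Join the four tables on their common attributes (entry, sense, cit),
# select on lemma and year, then project on the citation text.
rows = conn.execute(
    """
    SELECT orth
    FROM word NATURAL JOIN entry NATURAL JOIN sense NATURAL JOIN cit
    WHERE lemma = ? AND yr < ?
    ORDER BY yr
    """,
    ("symphony", 1850),
).fetchall()
print(rows)
```

Only the 1790 citation of symphony survives: the 1920 citation fails the year condition and the captain citation fails the lemma condition.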

As a relational expression:
  π_orth(σ_{lemma=symphony ∧ yr<1850}(WORD ⋈ ENTRY ⋈ SENSE ⋈ CIT))

Interrogating a combination between a dictionary and a wordnet
All synonyms of nouns belonging to citations dated before 1850, sorted lexicographically.
[Figure: the dictionary tables WORD, ENTRY, SENSE and CIT, extended with the wordnet tables SYNS (synsets, pos, synset) and SYN (lex, gloss)]
  1. The citations of the title word w:
       π_orth(σ_{lemma=w ∧ yr<1850}(WORD ⋈ ENTRY ⋈ SENSE ⋈ CIT))
  2. Unify and lemmatise the words belonging to the citations:
       lem(U(π_orth(σ_{lemma=w ∧ yr<1850}(WORD ⋈ ENTRY ⋈ SENSE ⋈ CIT))))
  3. Select the noun synsets of these lemmas and project on their members:
       π_lex(σ_{pos=n ∧ lemma ∈ lem(U(π_orth(σ_{lemma=w ∧ yr<1850}(WORD ⋈ ENTRY ⋈ SENSE ⋈ CIT))))}(WORD ⋈ SYNS ⋈ SYN))
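The three steps can be traced with a small Python sketch in which relations are lists of dictionaries and the relational operators are written out by hand. Everything here is illustrative: the tables hold invented toy rows, the column names of SYNS and SYN are guessed from the figure, lem is a stand-in that only lowercases a token, and "*" marks the position of the lemma inside a synset, as in the WordNet slides.

```python
def project(rel, attrs):
    """Projection: keep only the given attributes, without duplicate tuples."""
    out = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out


def select(rel, cond):
    """Selection: keep the tuples that satisfy the condition."""
    return [t for t in rel if cond(t)]


def join(r, s):
    """Natural join: combine tuples that agree on all common attributes."""
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in t.keys() & u.keys())]


def union(r, s):
    """Union of two relations over the same attributes."""
    return r + [t for t in s if t not in r]


def lem(token):
    """Stand-in lemmatiser (assumption): lowercases the token only."""
    return token.lower()


# Invented toy data
WORD = [{"id": 1, "lemma": "symphony", "entry": 1, "synsets": None},
        {"id": 2, "lemma": "discount", "entry": None, "synsets": 10},
        {"id": 3, "lemma": "disproof", "entry": None, "synsets": 11}]
ENTRY = [{"entry": 1, "sense": 1, "res": "DICT"}]
SENSE = [{"sense": 1, "n": 1, "cit": 1}, {"sense": 1, "n": 1, "cit": 2}]
CIT = [{"cit": 1, "orth": "a discount on every symphony", "yr": 1820},
       {"cit": 2, "orth": "no disproof of the symphony", "yr": 1900}]
SYNS = [{"synsets": 10, "pos": "n", "synset": 100},
        {"synsets": 10, "pos": "v", "synset": 101},
        {"synsets": 11, "pos": "n", "synset": 102}]
SYN = [{"synset": 100, "lex": ("*", "price reduction", "deduction")},
       {"synset": 101, "lex": ("dismiss", "disregard", "*", "ignore")},
       {"synset": 102, "lex": ("*", "falsification", "refutation")}]


def synonyms_of_cited_nouns(w):
    # Step 1: citations of w dated before 1850
    cits = project(
        select(join(join(join(WORD, ENTRY), SENSE), CIT),
               lambda t: t["lemma"] == w and t["yr"] < 1850),
        ["orth"])
    # Step 2: unify and lemmatise the words of these citations
    lemmas = {lem(tok) for c in cits for tok in c["orth"].split()}
    # Step 3: noun synsets of those lemmas, projected on their members
    syn = project(
        select(join(join(WORD, SYNS), SYN),
               lambda t: t["pos"] == "n" and t["lemma"] in lemmas),
        ["lex"])
    return sorted({m for row in syn for m in row["lex"] if m != "*"})


result = synonyms_of_cited_nouns("symphony")
print(result)
```

With this toy data only the 1820 citation qualifies, so the noun synset of discount is the only one reached; the verb synset of discount and the synset of disproof are filtered out.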

Conclusions
  • I did not propose any model; I simply made some observations (nothing is really new)
  • Linking lexicographic resources:
      one resource => TEI representation => feature structures => hierarchical graphs => relational tables
      more resources => unification of tables
      use queries and relational operators for interrogation

Discussion
  • Only a sketch: a lot of details should still be filled in
  • The good news: XML structures (the native language of TEI) accept direct representations as database records: XSLT => opening direct access to a complex querying language: XQuery => mimicking the relational operators and adding more facilities
  • More good news: representing variable-depth structures as recursive hierarchies (Kamfonas):
      Fixed depth dimensions are simpler to implement, maintain and query.
      Hierarchies that have variable depth or an uncertain number of levels can often benefit if implemented as recursive hierarchies.
      http://www.kamfonas.com/id3.html
  • Even more good news (hopes): interrogations can be formulated in natural language => an interpreter translates them into the query language of a DBMS; as such, a handy tool for the benefit of lexicographers

Acknowledgements
Work partially supported by the project "The Computational Representative Corpus of Contemporary Romanian Language", a project of the Romanian Academy, and partially by the COST-ENeL project.
I thank Isabelle Tamba and Mădălin Pătrașcu for the slides describing the CLRE project.

Thank you!
