Library, Archive and Museum Collaboration

Extracting names and resolving identities in unstructured text
Carol Jean Godby
Research Scientist
OCLC Research
Leveraging Names with Linked Data

Three problems in automated name extraction
  Recognize
    Distinguish names from non-names.
    Assign the name to a broadly recognized category.
  Cluster
    Associate variants of the same name.
  Assign an identity or the name's real-world referent
    Select the canonical form of a name.

An example
The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding no persuasive evidence to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was probably assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

[ORG The Justice Department] has officially ended its inquiry into the assassinations of [PER John F. Kennedy] and [PER Martin Luther King Jr.], finding no persuasive evidence to support conspiracy theories, according to department documents. [ORG The House Assassinations Committee] concluded in 1978 that [PER Kennedy] was probably assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the [ORG Warren Commission]'s belief that [PER Lee Harvey Oswald] acted alone in [LOC Dallas] on Nov. 22, 1963.
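
The bracket notation used in the example can be read back into name lists with a few lines of code. The following is a minimal sketch, assuming flat tags of the form [LABEL name] with no nesting; the function name names_by_category and the regular expression are illustrative and are not part of the tagger described in these slides.

```python
import re
from collections import defaultdict

# Matches one flat tag such as "[PER John F. Kennedy]".
# Assumption: tags are never nested, as in the slide examples.
TAG = re.compile(r"\[(ORG|PER|LOC|MISC) ([^\[\]]+)\]")


def names_by_category(tagged_text):
    """Group the tagged names by category, keeping first-seen order."""
    names = defaultdict(list)
    for label, name in TAG.findall(tagged_text):
        if name not in names[label]:
            names[label].append(name)
    return dict(names)


SAMPLE = (
    "[ORG The House Assassinations Committee] concluded in 1978 that "
    "[PER Kennedy] was probably assassinated as the result of a conspiracy "
    "involving a second gunman, a finding that broke from the "
    "[ORG Warren Commission]'s belief that [PER Lee Harvey Oswald] acted "
    "alone in [LOC Dallas] on Nov. 22, 1963."
)

for label, names in names_by_category(SAMPLE).items():
    print(label, names)
```

A list built this way covers only the Recognize step. Clustering variants such as Kennedy and John F. Kennedy, and assigning an identity, still have to be done separately.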

Types of text
  Unstructured text
    Hi Larry, Here is my section of the draft. I'm still plugging away, so look for another version sometime later today or tomorrow.
  Semi-structured text
    To: Larry
    From: Jean
    Hi Larry, Here is my section of the draft. I'm still plugging away, so look for another version sometime later today or tomorrow.
  Structured text
  Bag of words
    resources desk wife the beyond report siemens about dropped is deck November building called American buy children companies food could
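
For readers unfamiliar with the term, a bag of words keeps the words of a text and their counts but discards their order. The following is a minimal sketch using the sample message above; the tokenization rule (lowercase runs of letters and apostrophes) is an assumption made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order is deliberately thrown away."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


EMAIL = (
    "Hi Larry, Here is my section of the draft. I'm still plugging away, "
    "so look for another version sometime later today or tomorrow."
)

bag = bag_of_words(EMAIL)
print(bag.most_common(5))
```

Because the order is lost, contextual cues around a name such as Larry (capitalization, the preceding "Hi") are no longer available.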

Project Goals
  Lower the barrier of access to high-end named entity recognition (NER) tools.
  Build bridges to identity resolution research.
  Create tools for open use.
  Demonstrate use of the tools in digital library applications.
  Make recommendations for future collaboration between pure and applied research.

Uses for automatically extracted names in library applications
  Index e-resources and make the results available to a browse or search function in a user interface.
  Assemble e-resources about a particular named entity from a database search.
  Catalog e-resources with authoritative forms of names.
  Use names harvested from unstructured text to:
    Create name lists or gazetteers.
    Populate future versions of authority files.
  Create dedicated services that:
    Anonymize names.
    Create robust links between structured and unstructured texts.

The Named Entity Recognizer
Who's Who in Your Digital Collection? Developing a Tool for Name Disambiguation and Identity Resolution
DHCS 2009

(Figure legend: Facility, State or province, Organization, Person, Natural feature)

How the UIUC NER tagger works
  Identifies the four categories of the standard CoNLL scheme
    [ORG] Any temporary or permanent collection of people, such as Google, Ohio Division of Natural Resources, Democratic Party Meetup.
    [PER] Personal names. Includes fictional names and supernatural beings.
    [LOC] Any physical or human-built landmark: Kentucky, Empire State Building, Gulf of Mexico.
    [MISC] A catchall. World War I, Kleenex, Abstract Expressionism, and Jewish are all [MISC] names.
  Does not assign internal structure
    New York Times: not [ORG [LOC New York] Times]
  Recognizes names using perceptrons
    A machine-learning algorithm that makes minimal assumptions about category definitions, but recognizes ...

An EAD record
Papers of Gennaro M. Tisi, noted clinical and research specialist in the area of pulmonary medicine and a founding member of the School of Medicine, University of California, San Diego. Author of over 100 original articles, chapters, and abstracts, Tisi's research interests included the staging of lung cancer, medical-pulmonary education, pulmonary physiology and mechanics, and clinical research in pulmonary disease. Arranged into six series, the collection contains research notes, correspondence, manuscripts, administrative memos, committee agendas and minutes, and photographs documenting Tisi's professional life from 1964 to his death in 1988.
Gennaro Michael Tisi (September 26, 1935-February 18, 1988), was a pulmonary specialist, both as a clinician and teacher. He earned a B.S. in chemistry, biology, and philosophy from Fordham University in 1956 and a M.D. from Georgetown University Medical School in 1960. He was a founding member of UCSD's medical ...

Tagging results
Papers of [PER Gennaro M. Tisi], noted clinical and research specialist in the area of pulmonary medicine and a founding member of the School of [MISC Medicine], [ORG University of California], [LOC San Diego]. Author of over 100 original articles, chapters, and abstracts, [PER Tisi]'s research interests included the staging of lung cancer, medical-pulmonary education, pulmonary physiology and mechanics, and clinical research in pulmonary disease. Arranged into six series, the collection contains research notes, correspondence, manuscripts, administrative memos, committee agendas and minutes, and photographs documenting [PER Tisi]'s professional life from 1964 to his death in 1988.
[PER Gennaro Michael Tisi] (September 26, 1935-February 18, 1988), was a pulmonary specialist, both as a clinician and teacher. He earned a [LOC B.S.] in chemistry, biology, and philosophy from [ORG Fordham University] in 1956 and a M.D. from [ORG Georgetown University Medical School] in 1960. He was a founding member of [ORG UCSD]'s medical school, where he worked from 1968 until his death in 1988 of a cerebral ...
Errors flagged on the slide: segmentation error, category error.

Results on government documents

46 | 2009-2010 [ORG Illinois] [MISC Blue Book]
96th [ORG General Assembly]
Office of the [MISC Senate President]
The [MISC Senate President] is the presiding officer of the state [ORG Senate], elected by and among members of the [ORG Senate] to serve a two-year term. The [MISC Illinois Constitution], statutes and rules define the functions and responsibilities of the office. The [MISC President] appoints [ORG Senate] members to standing committees and permanent and interim study commissions, designating one member as [MISC chair]. The [MISC President] also appoints the [MISC Majority Leader] and [MISC Assistant Majority Leaders], who serve as officers of the [ORG Senate]. Passed by the [ORG Senate] are in accordance with [ORG Senate] rules.

46 | 2009-2010 [ORG Illinois] Blue Book
96th General Assembly
Office of the [ORG Senate] President
The [ORG Senate] President is the presiding officer of the state [ORG Senate], elected by and among members of the [ORG Senate] to serve a two-year term. The [ORG Illinois Constitution], statutes and rules define the functions and responsibilities of the office. The President appoints [ORG Senate] members to standing committees and permanent and interim study commissions, designating one member as chair. The President also appoints the Majority Leader and Assistant Majority Leaders, who serve as officers of the [ORG Senate]. Passed by the [ORG Senate] are in accordance with [ORG Senate] rules.

Errors flagged on the slide: segmentation error, category error, missed name, gold error.

Some genres in government documents
  Legislation
  Forms and Instructions
  Requirements, Codes, Regulations, and Laws
  Children's Material
  Directories
  Website Locator and Navigation Webpages
  Social Media and Interactive Communication
  Facilities
  State Academic Institutions
  Oversight Reports
  Special Topical Reports
  Budgetary Material
  Audits
  Legal Proceedings
  Contractual Material

Scoring
Gold text
[ORG The Justice Department] has officially ended its inquiry into the assassinations of [PER John F. Kennedy] and [PER Martin Luther King Jr.], finding no persuasive evidence to support conspiracy theories, according to department documents. [ORG The House Assassinations Committee] concluded in 1978 that [PER Kennedy] was probably assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the [ORG Warren Commission]'s belief that [PER Lee Harvey Oswald] acted alone in [LOC Dallas] on Nov. 22, 1963.

NER-tagged text
The [MISC Justice Department] has officially ended its inquiry into the assassinations of [PER John F]. [PER Kennedy] and Martin Luther King Jr., finding no persuasive evidence to support conspiracy theories, according to department documents. [ORG The House Assassinations Committee] concluded in 1978 that [PER Kennedy] was probably assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the [ORG Warren Commission's] belief that [PER Lee Harvey Oswald] acted alone in [LOC Dallas] on Nov. 22, 1963.

Errors flagged on the slide: wrong label, segmentation error, missed name.
F-Measure: Precision/Recall
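
The F-measure named above combines precision (the share of tagged names that are correct) and recall (the share of gold names that were found) as F1 = 2PR / (P + R). The following is a minimal sketch of how the comparison could be computed, using an abridged version of the two texts. It assumes that a tagged name counts as correct only when its boundaries and label both match the gold text exactly; the slides do not state the matching criterion that was actually used, and the function names parse and precision_recall_f1 are illustrative.

```python
import re

# One flat tag such as "[PER Lee Harvey Oswald]"; nesting is not handled.
TAG = re.compile(r"\[(ORG|PER|LOC|MISC) ([^\[\]]+)\]")


def parse(tagged):
    """Strip the tags; return the plain text and a set of (start, end, label)."""
    parts, entities = [], set()
    pos = 0      # read position in the tagged string
    offset = 0   # length of the plain text built so far
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        parts.append(before)
        offset += len(before)
        label, name = m.group(1), m.group(2)
        entities.add((offset, offset + len(name), label))
        parts.append(name)
        offset += len(name)
        pos = m.end()
    parts.append(tagged[pos:])
    return "".join(parts), entities


def precision_recall_f1(gold, predicted):
    """Exact-match scoring: boundaries and label must both agree."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Abridged versions of the gold and NER-tagged texts on the slide.
GOLD = (
    "[ORG The Justice Department] has officially ended its inquiry into the "
    "assassinations of [PER John F. Kennedy] and [PER Martin Luther King Jr.], "
    "finding no persuasive evidence. [PER Lee Harvey Oswald] acted alone in "
    "[LOC Dallas]."
)
TAGGED = (
    "The [MISC Justice Department] has officially ended its inquiry into the "
    "assassinations of [PER John F]. [PER Kennedy] and Martin Luther King Jr., "
    "finding no persuasive evidence. [PER Lee Harvey Oswald] acted alone in "
    "[LOC Dallas]."
)

_, gold_entities = parse(GOLD)
_, tagged_entities = parse(TAGGED)
p, r, f1 = precision_recall_f1(gold_entities, tagged_entities)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.40 recall=0.40 F1=0.40
```

Under exact matching, the wrong label, the split of John F. Kennedy into two names, and the missed Martin Luther King Jr. each cost a point, so only two of the five gold names are credited. A more lenient criterion, such as partial overlap, would give higher scores.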

Some outcomes
  F-scores ranked by tag type: PER > LOC > ORG > MISC
  [PER] and [LOC] are the most robust categories across different document collections.
  [MISC] and [ORG] are highly dependent on the corpus and subject domain.
  A model trained on a corpus for one purpose cannot be reused on a different corpus without a degradation in performance.

Issues with tagging
  Tag definitions
    Only four categories are defined: [MISC], [ORG], [PER], [LOC]
    [MISC] is a grab bag.
    [ORG]
      Doesn't have a librarian's definition.
      Has no predictable structure.
  Names with internal structure
    Advisory Committee on Appellate Rules of the Judicial Conference of the United States
    [ORG Advisory Committee] on [MISC Appellate Rules] of the [ORG Judicial Conference] of the [LOC United States]
    Trustees of Wheaton Seminary
    Fred Steiner Papers
    L. Tom Perry Special Collections

What type of name is:
  Prayer Service
  Swearing-in
  Barbecue
  Illinois Constitution
  University Archives
  Reference Desk

Conceptual issues with named entity recognition
  Ambiguous elements
    [PER H.N. Abrams] or [ORG H.N. Abrams]?
    [PER Currier] & [PER Ives] or [ORG Currier & Ives]?
    [MISC White House] or [LOC White House] or [ORG White House]?
  Conjunction reduction
    Translated by Jacques and Jean Duvernet.
    [PER Jacques] and [PER Jean Duvernet]
    [PER Jacques Duvernet] and [PER Jean Duvernet]
  Anaphora
    Mr. Duvernet, Duvernet, he, the translator
  Naming vs. describing
    [ORG American Museum of Natural History], [ORG Field Museum]
    [ORG Natural History] museum, [ORG Chicago] museum
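
The conjunction reduction case can be made concrete with a small heuristic. The following is a minimal sketch, not a method taken from the slides: it assumes that when the first conjunct is a single word and the second has several, the last word of the second conjunct is a surname shared by both. The function name expand_conjoined_names is hypothetical.

```python
def expand_conjoined_names(phrase):
    """Expand "Jacques and Jean Duvernet" into two full personal names.

    Heuristic (an assumption): a one-word left conjunct borrows the last
    word of the right conjunct as its surname.
    """
    left, sep, right = phrase.partition(" and ")
    if not sep:
        return [phrase]          # no conjunction to expand
    left, right = left.strip(), right.strip()
    right_tokens = right.split()
    if len(left.split()) == 1 and len(right_tokens) >= 2:
        surname = right_tokens[-1]
        return [f"{left} {surname}", right]
    return [left, right]


print(expand_conjoined_names("Jacques and Jean Duvernet"))
# prints: ['Jacques Duvernet', 'Jean Duvernet']
```

The heuristic picks the second reading on the slide and will be wrong whenever the first conjunct is a complete name in its own right, which is why the slide treats the case as a conceptual issue and not a solved one.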

In sum
  Named entity tagging is a complex psycholinguistic task that challenges even mature, sophisticated readers.
  The tagging task can only be approximated with a model that recognizes just three broadly-defined categories, plus a fourth category with limited utility, none of which can be assigned any internal structure.
  LIS researchers who wish to apply this technology must:
    Define tasks that can be carried out successfully with the ...

Training is error-prone and time-consuming.
  The need to train is a potential deal-killer for the adoption of named-entity recognition software.
  Training requires:
    Criteria for applying the markup that can be articulated and consistently applied to the data;
    Markup that falls within the scope of the tagging scheme produced by the NER tagger;
    Patterns that cannot be easily discovered by simpler means, such as regular-expression matching;
    A corpus that is large enough to change the behavior of the NER tagger.

Some recommendations
  For NER clients
    Take advantage of the most successful and mature categories for personal names and locations.
    Work with semi-structured or edited text.
    Build out named entity recognition modules with other sophisticated tools that classify text and do localized special processing.
  For NER tool developers
    Use the perceptron model to define placeholder categories that can be trained on the unique name types in a collection.
    Develop more detailed models for the most mature categories.

Next steps
  Grant responsibilities
    Complete formal experiments on library data.
    Finish final report, which is due on June 30.

  OCLC work
    Outline steps required to move beyond interesting examples to mature research prototypes.
    Publish our study of named entity tagging on library data.
    Engage with:
      researchers in the machine learning community to improve precision and recall of named entity recognition tools.
      practitioners in the library community to apply and evaluate this technology.

For more information
  References
    The Cognitive Computation Group at the University of Illinois
    Functional genre in Illinois State Government digital documents
    Name this! Automating metadata extraction through a named entity recognition tool. Poster for the 2009 NDIIPP Partners Meeting.
    Who's who in your digital collection: Developing a tool for name disambiguation and identity resolution. To appear in the Chicago Colloquium for Digital Humanities and Computer Science Journal.

Questions?

Next up
  Lunch and then
  1:00 Framing Libraries and the Environment
  Lorcan Dempsey, OCLC Research
  Buckingham
