Wikitology: Wikipedia as an Ontology

Creating and Exploiting a Web of Semantic Data
Tim Finin, University of Maryland, Baltimore County
Joint work with Zareen Syed (UMBC) and colleagues at the Johns Hopkins University Human Language Technology Center of Excellence
ICAART 2010, 24 January 2010
http://ebiquity.umbc.edu/resource/html/id/288/

Overview
  Conclusion
  Introduction
  A Web of linked data
  Wikitology
  Applications
  Conclusion

Conclusion
  The Web has made people smarter and more capable, providing easy access to the world's knowledge and services
  Software agents need better access to a Web of data and knowledge to enhance their intelligence
  Some key technologies are ready to exploit: Semantic Web, linked data, RDF search engines, DBpedia, Wikitology, information extraction, etc.

The Age of Big Data
  Massive amounts of data are available today on the Web, both for people and agents
  This is what's driving Google, Bing, Yahoo
  Human language advances are also driven by the availability of unstructured data, text and speech
  Large amounts of structured and semi-structured data are also coming online, including RDF
  We can exploit this data to enhance our intelligent agents and services

Twenty years ago
  Tim Berners-Lee's 1989 WWW proposal described a web of relationships among named objects, unifying many information management tasks
  http://www.w3.org/History/1989/proposal.html
  Capsule history
    Guha's MCF (~94)
    XML+MCF=>RDF (~96)
    RDF+OO=>RDFS (~99)
    RDFS+KR=>DAML+OIL (00)
    W3C's SW activity (01)
    W3C's OWL (03)
    SPARQL, RDFa (08)

Ten years ago
  The W3C began developing standards to support the Semantic Web
  The vision, technology and use cases are still evolving
  Moving from a Web of documents to a Web of data

Today's LOD Cloud

  ~5B integrated facts published on the Web as RDF Linked Open Data from ~100 datasets
  Arcs represent joins across datasets
  Available to download or query via public SPARQL servers
  Updated and improved periodically

From a Web of documents

To a Web of (Linked) Data

Wikipedia, DBpedia and linked data
  Wikipedia as a source of knowledge
    Wikis have turned out to be great ways to collaborate on building up knowledge resources
  Wikipedia as an ontology
    Every Wikipedia page is a concept or object
  Wikipedia as RDF data
    Map this ontology into RDF
  DBpedia as the lynchpin for Linked Data
    Exploit its breadth of coverage to integrate things

Wikipedia is the new Cyc
  There's a history of using encyclopedias to develop KBs
  Cyc's original goal (c. 1984) was to encode the knowledge in a desktop encyclopedia
  And use it as an integrating ontology
  Wikipedia is comparable to Cyc's original desktop encyclopedia
  But it's machine accessible and malleable
  And available (mostly) in RDF!

DBpedia: Wikipedia in RDF
  A community effort to extract structured information from Wikipedia and publish it as RDF on the Web
  Effort started in 2006 with EU funding
  Data and software open sourced
  DBpedia doesn't extract information from Wikipedia's text (yet), but from its structured information, e.g., infoboxes, links, categories, redirects, etc.

DBpedia's ontologies
  DBpedia's representation makes the schema explicit and accessible
  But initially inherited most of the problems in the underlying implicit schema
  Integration with the Yago ontology added richness
  Since version 3.2 (11/08) DBpedia began developing an explicit OWL ontology and mapping it to the native Wikipedia terms

  DBpedia ontology class   Instances
  Place                    248,000
  Person                   214,000
  Work                     193,000
  Species                   90,000
  Org.                      76,000
  Building                  23,000

  e.g., the Person class has 56 properties

http://lookup.dbpedia.org/

Query with SPARQL
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX dbp: <http://dbpedia.org/resource/>
  PREFIX dbpo: <http://dbpedia.org/ontology/>
  SELECT DISTINCT ?Property ?Place
  WHERE {
    dbp:Barack_Obama ?Property ?Place .
    ?Place rdf:type dbpo:Place .
  }
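To make the two triple patterns in this query concrete, here is a minimal Python sketch that evaluates the same join over a small in-memory list of triples. The triples are toy data made up for illustration (not actual DBpedia output), and `places_linked_to` is a hypothetical helper name; a real query would be sent to a SPARQL endpoint instead.

```python
# Toy (subject, predicate, object) triples. Illustrative data only,
# not retrieved from DBpedia.
TRIPLES = [
    ("dbp:Barack_Obama", "dbpo:birthPlace", "dbp:Honolulu"),
    ("dbp:Barack_Obama", "dbpo:residence", "dbp:Washington,_D.C."),
    ("dbp:Barack_Obama", "dbpo:spouse", "dbp:Michelle_Obama"),
    ("dbp:Honolulu", "rdf:type", "dbpo:Place"),
    ("dbp:Washington,_D.C.", "rdf:type", "dbpo:Place"),
    ("dbp:Michelle_Obama", "rdf:type", "dbpo:Person"),
]


def places_linked_to(subject, triples):
    """Return sorted distinct (property, place) pairs for `subject`.

    Mirrors the query's two patterns:
      subject ?Property ?Place .
      ?Place rdf:type dbpo:Place .
    """
    # Second pattern: everything typed as a Place.
    places = {s for (s, p, o) in triples
              if p == "rdf:type" and o == "dbpo:Place"}
    # First pattern, joined on ?Place; the set gives DISTINCT semantics.
    return sorted({(p, o) for (s, p, o) in triples
                   if s == subject and o in places})


if __name__ == "__main__":
    for prop, place in places_linked_to("dbp:Barack_Obama", TRIPLES):
        print(prop, place)
```

The spouse triple is filtered out because its object is not typed as a Place, which is exactly what the second pattern in the query enforces.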

The query above asks: what are Barack Obama's properties with values that are places?

DBpedia is the LOD lynchpin
  Wikipedia, via DBpedia, fills a role first envisioned by Cyc in 1985: an encyclopedic KB forming the substrate of our common knowledge

Consider Baltimore, MD

Links between RDF datasets
  We find assertions equating DBpedia's Baltimore object with those in other LOD datasets

  dbpedia:Baltimore%2C_Maryland
      owl:sameAs census:us/md/counties/baltimore/baltimore ;
      owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA ;
      owl:sameAs freebase:guid.9202a8c04000641f8000004921a ;
      owl:sameAs geonames:4347778/ .

  Since owl:sameAs is defined as an equivalence relation, the mapping works both ways
  Mappings are done by custom programs, machine learning, and manual techniques
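Because owl:sameAs is an equivalence relation, a consumer of these links can merge identifiers into equivalence classes and look up co-referent names starting from any dataset. The following is a minimal union-find sketch in Python using the Baltimore identifiers from the sameAs assertions; `SameAsIndex`, its methods, and `build_baltimore_index` are hypothetical names introduced for illustration, not part of any LOD toolkit.

```python
class SameAsIndex:
    """Groups identifiers connected by owl:sameAs into equivalence classes."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        # Register unseen identifiers as their own class.
        self._parent.setdefault(x, x)
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[x] != root:
            nxt = self._parent[x]
            self._parent[x] = root
            x = nxt
        return root

    def add_same_as(self, a, b):
        """Record `a owl:sameAs b` (symmetric and transitive by construction)."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[ra] = rb

    def equivalents(self, x):
        """Return all identifiers in the same class as `x`, sorted."""
        root = self._find(x)
        return sorted(y for y in self._parent if self._find(y) == root)


def build_baltimore_index():
    index = SameAsIndex()
    subject = "dbpedia:Baltimore%2C_Maryland"
    for other in (
        "census:us/md/counties/baltimore/baltimore",
        "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA",
        "freebase:guid.9202a8c04000641f8000004921a",
        "geonames:4347778/",
    ):
        index.add_same_as(subject, other)
    return index


if __name__ == "__main__":
    # Start from the GeoNames identifier and recover all the others.
    for name in build_baltimore_index().equivalents("geonames:4347778/"):
        print(name)
```

Starting from the GeoNames identifier returns the DBpedia, census, Cyc, and Freebase identifiers as well, which is the "works both ways" property in practice.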

Wikitology 3.0 (2009)
  [Figure: Wikitology 3.0 architecture. Labels: Application Specific Algorithms; Wikitology Code; IR collection (Articles); RDF reasoner; Relational Database; Triple Store; Category Links Graph; Infobox Graph; Page Link Graph; DBpedia; Freebase; Linked Semantic Web data & ontologies]

Wikitology
  We've explored a complementary approach to derive an ontology from Wikipedia: Wikitology
  Wikitology use cases:
    Identifying user context in a collaboration system from documents viewed (2006)
    Improve IR accuracy by adding Wikitology tags to documents (2007)
    ACE 2008: cross-document coreference resolution for named entities in text (2008)
    TAC 2009: knowledge base population from text (2009)

ACE 2008: Cross-Document Coreference Resolution
  Determine when two documents mention the same entity
    Are two documents that talk about George Bush talking about the same George Bush?
    Is a document mentioning Mahmoud Abbas referring to the same person as one mentioning Muhammed Abbas? What about Abu Abbas? Abu Mazen?
  Drawing appropriate inferences from multiple documents demands cross-document coreference resolution

ACE 2008: Wikitology tagging
  NIST ACE 2008: cluster named entity mentions in 20K English and Arabic documents
  We produced an entity document for mentions with name, nominal and pronominal mentions, type and subtype, and nearby words
  Tagged these with Wikitology, producing vectors to compute features measuring entity pair similarity
  One of many features for an SVM classifier
    William Wallace (living British Lord)
    William Wallace (of Braveheart fame)
    Abu Abbas aka Muhammad Zaydan aka Muhammad Abbas

Wikitology Entity Document & Tags
  Wikitology entity document

    ABC19980430.1830.0091.LDC2000T44-E2
    Name: Webb Hubbell
    Type & subtype: PER, Individual
    Mention heads:
      NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
      PRO: "he" "him" "his"
    Words surrounding mentions:
      abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years

  Wikitology article tag vector
    Webster_Hubbell                               1.000
    Hubbell_Trading_Post_National_Historic_Site   0.379
    United_States_v._Hubbell                      0.377
    Hubbell_Center                                0.226
    Whitewater_controversy                        0.222

  Wikitology category tag vector
    Clinton_administration_controversies   0.204
    American_political_scandals            0.204
    Living_people                          0.201
    1949_births                            0.167
    People_from_Arkansas                   0.167
    Arkansas_politicians                   0.167
    American_tax_evaders                   0.167
    Arkansas_lawyers                       0.167

Top Ten Features (by F1)

  Prec.   Recall  F1      Feature Description
  90.8%   76.6%   83.1%   some NAM mention has an exact match
  92.9%   71.6%   80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
  95.1%   65.0%   77.2%   the/a longest NAM mention is an exact match
  86.9%   66.2%   75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
  86.1%   65.4%   74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
  64.8%   82.9%   72.8%   Dice score of character bigrams from the 'longest' NAM string
  95.9%   56.2%   70.9%   all NAM mentions have an exact match in the other pair
  85.3%   52.5%   65.0%   Similarity based on a match of entities' top Wikitology article tag
  85.3%   52.3%   64.8%   Similarity based on a match of entities' top Wikitology article tag
  85.7%   32.9%   47.5%   Pair has a known alias
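Several of the features in this table are standard set and vector similarities. As a rough sketch (not the ACE 2008 system's code), the Dice score over NAM string sets, the Dice score over character bigrams, and the cosine similarity of sparse tag vectors could be computed as below. The function names are assumptions, the second entity in the example is hypothetical, and the tag weights for the first entity are taken from the article tag vector shown for Webb Hubbell.

```python
import math


def dice(a, b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) of two collections, as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def char_bigrams(s):
    """Set of lowercase character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {tag: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # NAM strings of two entity documents (second one is hypothetical).
    nam_a = {"Hubbell", "Webb Hubbell"}
    nam_b = {"Webb Hubbell", "Webster Hubbell"}
    print("Dice over NAM strings:", dice(nam_a, nam_b))

    # Dice over character bigrams of the longest NAM string of each entity.
    print("Dice over bigrams:",
          dice(char_bigrams("Webb Hubbell"), char_bigrams("Webster Hubbell")))

    # Article tag vectors; tags_b is a hypothetical second entity.
    tags_a = {"Webster_Hubbell": 1.000, "Whitewater_controversy": 0.222}
    tags_b = {"Webster_Hubbell": 0.9, "United_States_v._Hubbell": 0.4}
    print("Cosine of tag vectors:", cosine(tags_a, tags_b))
```

Each score would then be one feature value for an entity pair, fed to the classifier alongside the exact-match and alias features.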

The Wikitology-based features were very useful

Wikipedia's Social Network
  Wikipedia has an implicit social network that can help disambiguate PER mentions (ORGs & GPEs too)
  We extracted 875K people from Freebase, 616K of which were linked to Wikipedia pages, 431K of which are in one of 4.8M person-person article links
  Consider a document that mentions two people: George Bush and Mr. Quayle
  There are six George Bushes in Wikipedia and nine male Quayles

Which Bush & which Quayle?
  Six George Bushes
  Nine male Quayles

Use Jaccard coefficient metric
  Let S_i = {two-hop neighbors of person i}
  C_ij = |intersection(S_i, S_j)| / |union(S_i, S_j)|
  C_ij > 0 for six of the 56 possible pairs
    0.43  George_H._W._Bush -- Dan_Quayle
    0.24  George_W._Bush -- Dan_Quayle
    0.18  George_Bush_(biblical_scholar) -- Dan_Quayle
    0.02  George_Bush_(biblical_scholar) -- James_C._Quayle
    0.02  George_H._W._Bush -- Anthony_Quayle
    0.01  George_H._W._Bush -- James_C._Quayle
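The Jaccard scoring of candidate pairs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the link graph is a toy undirected adjacency map with made-up edges, "two-hop neighbors" is taken to mean everything reachable within two links (excluding the person itself), and the function names are hypothetical.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable from `node` in one or two links, excluding `node`."""
    one_hop = set(graph.get(node, ()))
    reached = set(one_hop)
    for neighbor in one_hop:
        reached.update(graph.get(neighbor, ()))
    reached.discard(node)
    return reached


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(graph, candidates_a, candidates_b):
    """Score every candidate pair; keep pairs with C_ij > 0, best first."""
    scored = [
        (jaccard(two_hop_neighbors(graph, a), two_hop_neighbors(graph, b)), a, b)
        for a in candidates_a
        for b in candidates_b
    ]
    return sorted((t for t in scored if t[0] > 0), reverse=True)


# Toy person-person link graph; the edges are invented for illustration.
LINKS = {
    "George_H._W._Bush": {"Dan_Quayle", "Ronald_Reagan"},
    "Dan_Quayle": {"George_H._W._Bush", "Indiana"},
    "Anthony_Quayle": {"Laurence_Olivier"},
    "Ronald_Reagan": {"George_H._W._Bush"},
    "Indiana": {"Dan_Quayle"},
    "Laurence_Olivier": {"Anthony_Quayle"},
}

if __name__ == "__main__":
    pairs = rank_pairs(LINKS,
                       ["George_H._W._Bush"],
                       ["Dan_Quayle", "Anthony_Quayle"])
    for score, bush, quayle in pairs:
        print(f"{score:.2f} {bush} -- {quayle}")
```

In this toy graph only the pairing with Dan_Quayle shares any two-hop neighbors, so it is the only pair returned; on the real link graph the same ranking produces the scored list shown on the slide.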

Knowledge Base Population track
  Part of the 2009 NIST Text Analysis Conference (TAC)
  Add facts to a reference KB from a collection of 1.3M English newswire documents
  Given initial KB of facts from Wikipedia infoboxes: 200k people, 200k GPEs, 60k orgs, 300+k misc/non-entities
  Two fundamental tasks:
    Entity Linking: grounding entity mentions in documents to KB entries (or NIL if not in KB)
    Slot Filling: learning additional attributes about target entities

Sample KB Entry: Michael Phelps

  Michael Fred Phelps
  The Baltimore Bullet
  Butterfly, Individual Medley, Freestyle, Backstroke
  Club Wolverine, University of Michigan
  June 30, 1985 (1985-06-30) (age 23)
  6 ft 4 in (1.93 m)
  200 pounds (91 kg)

  Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in 1975...

  Michael Phelps

  John Williams       author         1922-1994
  J. Lloyd Williams   botanist       1854-1945
  John Williams       politician     1955-
  John J. Williams    US Senator     1904-1988
  John Williams       Archbishop     1582-1650
  John Williams       composer       1932-
  Jonathan Williams   poet           1929-
  Michael Phelps      swimmer        1985-
  Michael Phelps      biophysicist   1939-

  Debbie Phelps, the mother of swimming star Michael Phelps, who won a record eight gold medals in Beijing, is the author of a new memoir, ...

  Michael Phelps is the scientist most often identified as the inventor of PET, a technique that permits the imaging of biological processes in the organ systems of living individuals. Phelps has ...

  Identify matching entry, or determine that entity is missing from KB

Slot Filling Task
  Target: EPA + context document
  Generic Entity Classes: Person, Organization, GPE
  Missing information to mine from text:
    Date formed: 12/2/1970
    Website: http://www.epa.gov/
    Headquarters: Washington, DC
    Nicknames: EPA, USEPA
    Type: federal agency
    Address: 1200 Pennsylvania Avenue NW
  Optional: Link some learned values within the KB:
    Headquarters: Washington, DC (kbid: 735)

KB Entity Attributes
  Person

    alternate names; age; birth: date, place; death: date, place, cause; national origin; residences; spouse; children; parents; siblings; other family; schools attended; job title; employee-of; member-of; religion; criminal charges
  Organization
    alternate names; political/religious affiliation; top members/employees; number of employees; members; member of; subsidiaries; parents; founded by; founded; dissolved; headquarters; shareholders; website
  Geo-Political Entity
    alternate names; capital; subsidiary orgs; top employees; political parties; established; population; currency

HLTCOE* Entity Linking: Approach
  * Human Language Technology Center of Excellence
  Two-phased approach
    1. Candidate Set Identification
    2. Candidate Ranking
  Candidate Set Identification
    Small set of easy-to-compute features
    Speed linear in size of KB (~700K entities)
    Constant-time possible, though recall could fall
  Candidate Ranking

    Supervised machine learning (SVM)
    Goal is to rank candidates
    Many, many features
    Experimental development with 100s of tests on held-out data

Phase 1: Candidate Identification
  Triage features:
    String comparison
      Exact/fuzzy string match, acronym match
    Known aliases
      Wikipedia redirects provide a rich set of alternate names
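The triage features listed for Phase 1 (exact or fuzzy string match, acronym match, known aliases) can be illustrated with a short Python sketch. This is not the HLTCOE system: the fuzzy matcher is `difflib.SequenceMatcher` from the standard library swapped in as a stand-in, the 0.8 threshold is an arbitrary assumption, the alias table is a hypothetical stand-in for Wikipedia redirects, and all function names are invented for the example.

```python
from difflib import SequenceMatcher


def acronym(name):
    """Initials of the capitalized words: 'Cult of the Dead Cow' -> 'CDC'."""
    return "".join(w[0] for w in name.split() if w[0].isupper())


def is_candidate(query, kb_name, aliases, fuzzy_threshold=0.8):
    """Cheap triage test: does `kb_name` plausibly match `query`?"""
    q, n = query.lower(), kb_name.lower()
    if q == n:                                        # exact match
        return True
    if query.isupper() and query == acronym(kb_name):  # acronym match
        return True
    if q in (a.lower() for a in aliases.get(kb_name, ())):  # known alias
        return True
    # Fuzzy match (assumed stand-in for the system's fuzzy comparison).
    return SequenceMatcher(None, q, n).ratio() >= fuzzy_threshold


def candidates(query, kb_names, aliases):
    """Linear scan over the KB, keeping entries that pass any triage test."""
    return [n for n in kb_names if is_candidate(query, n, aliases)]


# Small illustrative KB; the alias entry is a hypothetical redirect.
KB = [
    "Control Data Corporation",
    "Cult of the Dead Cow",
    "Cedar City Regional Airport",
    "Highveld Lions cricket team",
]
ALIASES = {"Cedar City Regional Airport": ["CDC"]}

if __name__ == "__main__":
    print(candidates("CDC", KB, ALIASES))
    print(candidates("Control Data Corp", KB, ALIASES))
```

The single pass over the KB corresponds to the "speed linear in size of KB" remark; indexing names, acronyms, and aliases in hash tables would be one way to approach the constant-time variant, at the cost of losing fuzzy matches.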

  Phase 1 statistics:
    98.6% recall (vs. 98.8% on dev. data)
    Median = 15 candidates; Mean = 76; Max = 2772
    10% of queries <= 4 candidates; 10% > 100 candidates
    Four orders of magnitude reduction in number of entities considered

Candidate Phase Failures
  Iron Lady
    EL 1687: refers to Yulia Tymoshenko (prime minister)
    EL 1694: refers to Biljana Plavsic (war criminal)
  PCC
    EL 2885: Cuban Communist Party (in Spanish: Partido Comunista de Cuba)
  Queen City
    EL 2973: Manchester, NH (active nickname)
    EL 2974: Seattle, WA (former nickname)
  The Lions
    EL 3402: Highveld Lions (South African professional cricket team), in KB as: Highveld_Lions_cricket_team

Phase 2: Candidate Ranking
  Supervised machine learning
    SVMrank (Joachims)
    Trained on 1615 examples
    About 200 atomic features, most binary
    Cost function: number of swaps to elevate correct candidate to top of ranked list

    None of the above (NIL) is an acceptable choice

  Query = CDC
    1. California Dept. of Corrections
    2. US Center for Disease Control
    3. Cedar City Regional Airport (IATA code)
    4. Communicable Disease Centre (Singapore)
    5. Congress for Democratic Change (Liberian political party)
    6. Cult of the Dead Cow (Hacker organization)
    7. Control Data Corporation
    8. NIL (Absence from KB)
    9. Consumers for Dental Choice (non-profit)
    10. Cheerdance Competition (Philippine organization)
  Context snippets:
    According to the CDC the prevalence of H1N1 influenza in California prisons has...
    William C. Norris, 95, founder of the mainframe computer firm CDC., died Aug. 21 in a nursing home ...

Results: top five systems

  Team            All      in KB    NIL      Affiliation
  Siel_093        0.8217   0.7654   0.8641   Int. Inst. of IT, Hyderabad IN
  QUANTA1         0.8033   0.7725   0.8264   Tsinghua University
  hltcoe1         0.7984   0.7063   0.8677
  Stanford_UBC2   0.7884   0.7588   0.8107
  NLPR_KBP1       0.7672   0.6925   0.8232   Institute for PR, China
  NIL Baseline    0.5710   0.0000   1.0000

  Micro-averaged accuracy
  Of the 13 entrants, the HLTCOE system placed third, but the differences between 2, 3 and 4 are not significant
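The three columns in the results table can be reproduced from per-query decisions. A minimal Python sketch, assuming micro-averaging means each query counts equally, that a NIL answer is represented as `None`, and that "in KB" and "NIL" are accuracies restricted to queries whose gold answer is, respectively, a KB entry or NIL; `micro_accuracy` and the toy labels are hypothetical.

```python
NIL = None  # assumed representation of "entity is absent from the KB"


def micro_accuracy(gold, predicted):
    """Return (all, in_kb, nil) accuracy, averaging over individual queries."""

    def accuracy(pairs):
        pairs = list(pairs)
        if not pairs:
            return 0.0
        return sum(g == p for g, p in pairs) / len(pairs)

    pairs = list(zip(gold, predicted))
    return (
        accuracy(pairs),
        accuracy((g, p) for g, p in pairs if g is not NIL),
        accuracy((g, p) for g, p in pairs if g is NIL),
    )


if __name__ == "__main__":
    # Toy gold labels: two queries resolve to KB ids, three are NIL.
    gold = ["kb:1", "kb:2", NIL, NIL, NIL]

    # A baseline that always answers NIL.
    baseline = [NIL] * len(gold)
    print(micro_accuracy(gold, baseline))

    # A hypothetical system that gets one KB query and two NIL queries right.
    system = ["kb:1", "kb:9", NIL, NIL, "kb:3"]
    print(micro_accuracy(gold, system))
```

An always-NIL baseline scores 0 on in-KB queries and 1 on NIL queries, so its overall accuracy equals the fraction of NIL queries, which matches the shape of the NIL Baseline row in the table.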

KBP Conclusions
  Significant reductions in number of KB nodes examined are possible with minimal loss of recall
  Supervised machine learning with a variety of features over query/KB node pairs is effective
  More features is better; Wikitology features were largely redundant with KB
  Optimal feature set selection varies with likelihood that query targets are in KB

Conclusions
  The Web has made people smarter and more capable, providing easy access to the world's knowledge and services
  Software agents need better access to a Web of data and knowledge to enhance their intelligence
  Some key technologies are ready to exploit: Semantic Web, linked data, RDF search engines, DBpedia, Wikitology, information extraction, etc.

Conclusion
  Hybrid systems like Wikitology combining IR, RDF, and custom graph algorithms are promising
  The linked open data (LOD) collection is a good source of background knowledge, useful in many tasks, e.g., extracting information from text
  The techniques can support distributed LOD collections for your domain: bioinformatics, finance, eco-informatics, etc.

http://ebiquity.umbc.edu/
