Keyword-based Search and Exploration on Databases

Yi Chen (Arizona State University, USA)
Wei Wang (University of New South Wales, Australia)
Ziyang Liu (Arizona State University, USA)

Traditional Access Methods for Databases
Relational/XML databases are structured or semi-structured, with rich metadata.
They are typically accessed by structured query languages: SQL/XQuery.
Advantages: high-quality results.
Disadvantages:
  Query languages: long learning curves.
  Schemas: complex, evolving, or even unavailable.

  select p.title
  from conference c, paper p, author a1, author a2, write w1, write w2
  where c.cid = p.cid AND p.pid = w1.pid AND p.pid = w2.pid
    AND w1.aid = a1.aid AND w2.aid = a2.aid
    AND a1.name = 'John' AND a2.name = 'John' AND c.name = 'SIGMOD'
  Small user population.
The usability of a database is as important as its capability [Jagadish, SIGMOD 07].
ICDE 2011 Tutorial

Popular Access Methods for Text
Text documents have little structure.
They are typically accessed by keyword-based unstructured queries.
Advantages: large user population.
Disadvantages: limited search quality, due to the lack of structure in both data and queries.

Grand Challenge:

Supporting Keyword Search on Databases
Can we support keyword-based search and exploration on databases and achieve the best of both worlds?
  Opportunities
  Challenges
  State of the art
  Future directions

Opportunities /1
Easy to use, thus a large user population.
Shares the same advantage as keyword search on text documents.

Opportunities /2
High-quality search results.
Exploits the merits of querying structured data by leveraging structural information.
Query: John, cloud
Text document: "John is a computer scientist. ... One of John's colleagues, Mary, recently published a paper about cloud computing."
Structured document (XML): scientist (name: John; publications: paper titled "XML") and scientist (name: Mary; publications: paper titled "cloud").
In the structured document, such a result will have a low rank.

Opportunities /3
Enabling interesting/unexpected discoveries.
Relevant data pieces that are scattered but collectively relevant to the query should be automatically assembled in the results.
A unique opportunity for searching DBs:
  Text search restricts a result to a document.
  DB querying requires users to specify relationships between data pieces.

University(uid, uname): (12, UC Berkeley)
Student(sid, sname, uid): (6055, Margo Seltzer, 12)
Project(pid, pname): (5, Berkeley DB)
Participation(pid, sid): (5, 6055)

Q: Seltzer, Berkeley
Is Seltzer a student at UC Berkeley?
  Expected: the Student - University connection.
  Surprise: the Student - Participation - Project connection (Berkeley DB).

Keyword Search on DB: Summary of Opportunities
Increasing the DB usability and hence the user population.
Increasing the coverage and quality of keyword search.

Keyword Search on DB: Challenges
Keyword queries are ambiguous or exploratory
  Structural ambiguity
  Keyword ambiguity
Result analysis difficulty
Evaluation difficulty
Efficiency

Challenge: Structural Ambiguity (I)
No structure is specified in keyword queries.
e.g. an SQL query: find titles of SIGMOD papers by John
  select p.title
  from author a, write w, paper p, conference c
  where a.aid = w.aid AND w.pid = p.pid AND p.cid = c.cid
    AND a.name = 'John' AND c.name = 'SIGMOD'
  The select clause carries the return info (projection); the where clause carries the predicates (selection, joins).
keyword query: John, SIGMOD --- no structure

Structured data: how to generate structured queries from keyword queries?
Infer keyword connection, e.g. John, SIGMOD:
  Find John and his papers published in SIGMOD?
  Find John and his role in a SIGMOD conference?
  Find John and the workshops he organized that are associated with SIGMOD?

Challenge: Structural Ambiguity (II)
Infer return information.
  e.g. Query: John, SIGMOD. Assume the user wants to find John and his SIGMOD papers.
  What should be returned? Paper title, abstract, author, conference year, location?
Infer structures from existing structured query templates (query forms).
  Suppose there are query forms designed for popular/allowed queries.
  Form fields (each with Op and Expr inputs): Author Name, Conf Name, Person Name, Journal Name, Journal Year, Workshop Name
    select * from author a, write w, paper p, conference c
    where a.aid = w.aid AND w.pid = p.pid AND p.cid = c.cid AND a.name = $1 AND c.name = $2
  Which forms can be used to resolve keyword query ambiguity?
Semi-structured data: the absence of a schema may prevent generating structured queries.
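
To make the role of a query form concrete, here is a minimal sketch that binds the keywords John and SIGMOD to the two parameters of the author-conference template. It uses Python's built-in sqlite3 module. The toy rows, the `FORM_SQL` constant and the `fill_form` helper are illustrative assumptions rather than part of the tutorial; `$1` and `$2` are written as SQLite `?` placeholders, and only the paper title is projected.

```python
import sqlite3

# Hypothetical toy instance of the author/write/paper/conference schema.
SCHEMA = """
CREATE TABLE author(aid INTEGER, name TEXT);
CREATE TABLE conference(cid INTEGER, name TEXT);
CREATE TABLE paper(pid INTEGER, title TEXT, cid INTEGER);
CREATE TABLE "write"(aid INTEGER, pid INTEGER);
INSERT INTO author VALUES (1, 'John'), (2, 'Mary');
INSERT INTO conference VALUES (10, 'SIGMOD'), (11, 'VLDB');
INSERT INTO paper VALUES (100, 'Keyword search', 10), (101, 'Cloud storage', 11);
INSERT INTO "write" VALUES (1, 100), (1, 101), (2, 101);
"""

# The query form; $1 and $2 become positional parameters.
FORM_SQL = """
SELECT p.title
FROM author a, "write" w, paper p, conference c
WHERE a.aid = w.aid AND w.pid = p.pid AND p.cid = c.cid
  AND a.name = ? AND c.name = ?
"""

def fill_form(conn, author_name, conf_name):
    """Bind two keywords to the form's predicates and run the query."""
    rows = conn.execute(FORM_SQL, (author_name, conf_name)).fetchall()
    return [title for (title,) in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
titles = fill_form(conn, "John", "SIGMOD")
print(titles)
```

Once a form is chosen, the keyword query no longer needs any structure inference: the joins are fixed by the form, and the keywords only fill the predicate slots.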

Challenge: Keyword Ambiguity
A user may not know which keywords to use for their search needs.
Syntactically misspelled/unfinished words -> query cleaning / auto-completion
  e.g. datbase -> database; conf
Under-specified words -> query refinement
  Polysemy: e.g. Java
  Too general: e.g. database query --- thousands of papers
Over-specified words -> query rewriting
  Synonyms: e.g. IBM -> Lenovo
  Too specific: e.g. Honda Civic car in 2006 with price $2-2.2k
Non-quantitative queries -> query rewriting
  e.g. small laptop vs. laptop with weight < 5 lb

Challenge: Efficiency
Complexity of data and its schema
  Millions of nodes/tuples
  Cyclic / complex schema
Inherent complexity of the problem
  NP-hard sub-problems
  Large search space
Working with potentially complex scoring functions
  Optimize for top-k answers

Challenge: Result Analysis /1
How to find relevant individual results?

How to rank results based on relevance?
[Figure: two XML result trees. High rank: a scientist named John whose paper is titled "cloud". Low rank: scientist John with a paper titled "XML" and scientist Mary with a paper titled "Cloud".]
However, ranking functions are never perfect.

How to help users judge result relevance without reading (big) results?

Challenge: Result Analysis /2
In an exploratory search, there are many relevant results.
What insights can be obtained by analyzing multiple results?
  How to classify and cluster results?
  How to help users compare multiple results?
  E.g. Query: ICDE conferences

Feature type | ICDE 2000         | ICDE 2010
conf: year   | 2000              | 2010
paper: title | OLAP, data mining | clouds, scalability, search

Challenge: Result Analysis /3
Aggregate multiple results.
Find tuples with the same interesting attributes that cover all keywords.
Query: Motorcycle, Pool, American Food
Aggregated groups: (December, Texas) and (*, Michigan)

Month | State | City    | Event                    | Description
Dec   | TX    | Houston | US Open Pool             | Best of 19, ranking
Dec   | TX    | Dallas  | Cowboys dream run        | Motorcycle, beer
Dec   | TX    | Austin  | SPAM Museum party        | Classical American food
Oct   | MI    | Detroit | Motorcycle Rallies       | Tournament, round robin
Oct   | MI    | Flint   | Michigan Pool Exhibition | Non-ranking, 2 days
Sep   | MI    | Lansing | American Food            | The best food from

Roadmap
Related tutorials: SIGMOD09 by Chen, Wang, Liu, Lin; VLDB09 by Chaudhuri, Das
Slide annotations: "Covered by this tutorial only." and "Focus on work after 2009."
Motivation
Structural ambiguity
  structure inference
  return information inference
  leverage query forms
Keyword ambiguity
  query cleaning and auto-completion
  query refinement
  query rewriting
Evaluation
Query processing
Result analysis
  ranking
  snippet
  comparison
  clustering
  correlation

Roadmap
Motivation
Structural ambiguity
  Node connection inference
  Return information inference
  Leverage query forms
Keyword ambiguity
Evaluation
Query processing
Result analysis
Future directions

Problem Description
Data: relational databases (graph) or XML databases (tree)
Input: query Q = <k1, k2, ..., kl>
Output: a collection of nodes collectively relevant to Q; the structure connecting them can be
  1. Predefined
  2. Searched based on schema graph
  3. Searched based on data graph

Option 1: Pre-defined Structure
Ancestor of modern KWS:
  RDBMS: SELECT * FROM Movie WHERE contains(plot, 'meaning of life')
  Content-and-Structure Query (CAS)

    //movie[year=1999][plot ~ "meaning of life"]
Early KWS: proximity search
  Find movies NEAR "meaning of life"
Q: Can we remove the burden off the ...

Option 1: Pre-defined Structure
QUnit [Nandi & Jagadish, CIDR 09]
A basic, independent semantic unit of information in the DB, usually defined by domain experts.
  e.g., define a QUnit as director(name, DOB) + all movies(title, year) he/she directed
[Figure: QUnit instance D_101: Director (name: Woody Allen, DOB: 1935-12-01, B_Loc) with Movie (title, year) entries Match Point, Melinda and Melinda, Anything Else.]
Q: Can we remove the burden off the domain experts?

Option 2: Search Candidate Structures on the Schema Graph
E.g., XML: all the label paths
  /imdb/movie
  /imdb/movie/year
  /imdb/movie/name
  /imdb/director
Q: Shining 1980
[Figure: XML tree rooted at imdb with TV nodes (Simpsons, Friends), movie nodes (name: shining, year: 1980, plot; name: scoop, year: 2006, plot) and a director node (name: W Allen, DOB: 1935-12-1).]

Candidate Networks
E.g., RDBMS: all the valid candidate networks (CN)
Schema graph: A - W - P
Q: Widom XML

ID | CN                      | Interpretation
1  | A^Q                     | an author
2  | P^Q                     | a paper
3  | A^Q - W - P^Q           | an author wrote a paper
4  | A^Q - W - P^Q - W - A^Q | two authors wrote a single paper
5  | P^Q - W - A^Q - W - P^Q | an author wrote two papers

Option 3: Search Candidate Structures on the Data Graph
Data modeled as a graph G.
Each ki in Q matches a set of nodes in G.
Find small structures in G that connect keyword instances:
  Graph: Group Steiner Tree (GST)
    Approximate Group Steiner Tree
    Distinct root semantics
  Tree: LCA
  Subgraph-based

    Community (Distinct core semantics)
    EASE (r-Radius Steiner subgraph)

Results as Trees
Group Steiner Tree [Li et al, WWW01]
  The smallest tree that connects an instance of each keyword
  top-1 GST = top-1 ST
  NP-hard; tractable for fixed l
[Figure: weighted example graph over nodes a, b, c, d, e with keyword matches k1, k2, k3, illustrating GST and ST with candidate trees a(c, d): 13 and a(b(c, d)).]

Other Candidate Structures
Distinct root semantics [Kacholia et al, VLDB05] [He et al, SIGMOD 07]
  Find trees rooted at r
  $cost(T_r) = \sum_i cost(r, match_i)$
Distinct core semantics [Qin et al, ICDE09]
  Certain subgraphs induced by a distinct combination of keyword matches

r-Radius Steiner graph [Li et al, SIGMOD08]
  Subgraph of radius r that matches each ki in Q, with fewer unnecessary nodes

Candidate Structures for XML
Any subtree that contains all keywords: subtrees rooted at LCA (lowest common ancestor) nodes
  $|LCA(S_1, S_2, \ldots, S_n)| = \min(N, \prod_i |S_i|)$
Many are still irrelevant or redundant: further pruning is needed.
[Figure: XML tree for Q = {Keyword, Mark}: conf (name: SIGMOD, year: 2007) with a paper whose title contains "keyword" and whose authors are Mark and Chen.]

SLCA [Xu et al, SIGMOD 05]
Min redundancy: do not allow ancestor-descendant relationships among SLCA results.
[Figure: the conf tree for Q = {Keyword, Mark}, with a paper (title: keyword; authors: Mark, Chen) and a second paper (title: RDF; authors: Mark, Zhang).]

Other ?LCAs
ELCA [Guo et al, SIGMOD 03]
Interconnection Semantics [Cohen et al. VLDB 03]
Many more ?LCAs

Search the Best Structure
Given Q:
  Many structures (based on schema)
  For each structure, many results
We want to select good structures

  Ranking structures
  Ranking results
Select the best interpretation
  Can be thought of as bias or priors
How? Ask the user? Encode domain knowledge? Exploit data statistics!
  1. XML
  2. Graph

XML
E.g., what's the most likely interpretation? Why?
All the label paths:
  /imdb/movie
  /imdb/movie/year
  /imdb/movie/plot
  /imdb/director
Q: Shining 1980
[Figure: XML tree rooted at imdb with TV nodes (Simpsons, Friends), movie nodes (name: shining, year: 1980, plot; name: scoop, year: 2006, plot) and a director node (name: W Allen, DOB: 1935-12-1).]

XReal [Bao et al, ICDE 09] /1
Infer the best structured query for the information need.
  Q = Widom XML -> /conf/paper[author ~ Widom][title ~ XML]
Find the best return node type (search-for node type) with the highest score:
  $C_{for}(T, Q) = \log\bigl(1 + \prod_{w \in Q} tf(T, w)\bigr) \cdot r^{depth(T)}$
  The product ensures T has the potential to match all query keywords.
  /conf/paper: 1.9
  /journal/paper: 1.2
  /phdthesis/paper: 0

XReal [Bao et al, ICDE 09] /2
Score each instance of type T by scoring each node:
  Leaf node: based on the content
  Internal node: aggregates the scores of child nodes
XBridge [Li et al, EDBT 10] builds a structure + value sketch to estimate the most promising return type.
  See later part of the tutorial.

Entire Structure
Two candidate structures under /conf/paper:
  /conf/paper[title ~ XML][editor ~ Widom]
  /conf/paper[title ~ XML][author ~ Widom]

Need to score the entire structure (query template):
  /conf/paper[title ~ ?][editor ~ ?]
  /conf/paper[title ~ ?][author ~ ?]
[Figure: conf tree whose paper nodes have title, author and editor children, with values XML, Mark, Widom, Whang.]

Related Entity Types [Jayapandian & Jagadish, VLDB 08]
Background: automatically design forms for a relational/XML database instance.
Relatedness of E1 - E2 = [ P(E1 -> E2) + P(E2 -> E1) ] / 2
P(E1 -> E2) = generalized participation ratio of E1 into E2, i.e., the fraction of E1 instances that are connected to some instance in E2.
[Figure: Author - Paper - Editor instance graph.]
  P(A -> P) = 5/6, P(P -> A) = 1
  P(E -> P) = 1, P(P -> E) = 0.5
What about (E1, E2, E3)?
  [Slide residue, kept in source order: (1/3!) * ...; P(A -> P -> E), P(E -> P -> A); 4/6; P(A -> P) * P(P -> E); P(E -> P) * P(P -> A) != 1 * 0.5]

NTC [Termehchy & Winslett, CIKM 09]
Specifically designed to capture correlation, i.e., how closely the entities are related.
An unweighted schema graph is only a crude approximation.
Manually assigning weights is viable but costly (e.g., Précis [Koutrika et al, ICDE06]).

Ideas:
  1 / degree(v) [Bhalotia et al, ICDE 02]?
  1-1, 1-n, total participation [Jayapandian & Jagadish, VLDB 08]?

NTC [Termehchy & Winslett, CIKM 09]
Idea: total correlation measures the amount of cohesion/relatedness.
  $I(P) = \sum_i H(P_i) - H(P_1, P_2, \ldots, P_n)$
  $I(P) \approx 0$: statistically completely unrelated, i.e., knowing the value of one variable does not provide any clue as to the values of the other variables.
[Figure: Author - Paper - Editor instance graph (authors A1 to A6, papers P1 to P4) annotated with probabilities in multiples of 1/6.]
  H(A) = 2.25
  H(P) = 1.92
  H(A, P) = 2.58
  I(A, P) = 2.25 + 1.92 - 2.58 = 1.59
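
The entropy arithmetic above can be checked with a short sketch. The edge list below is a hypothetical reconstruction, chosen only so that six equally likely author-paper edges reproduce the marginals behind H(A) = 2.25 and H(P) = 1.92; the function names are likewise assumptions.

```python
from collections import Counter
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def total_correlation(edges):
    """I(A, P) = H(A) + H(P) - H(A, P) for equally likely (author, paper) edges."""
    n = len(edges)
    joint = [count / n for count in Counter(edges).values()]
    authors = [count / n for count in Counter(a for a, _ in edges).values()]
    papers = [count / n for count in Counter(p for _, p in edges).values()]
    return entropy(authors) + entropy(papers) - entropy(joint)

# Hypothetical edge list: author marginals (2,1,1,1,1)/6, paper marginals (1,2,2,1)/6.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A3", "P2"),
         ("A4", "P3"), ("A5", "P3"), ("A6", "P4")]

i_ap = total_correlation(EDGES)
print(round(i_ap, 3))
```

Under these assumptions the unrounded value is about 1.585; the 1.59 on the slide results from adding the entropies after rounding each to two decimals.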

NTC [Termehchy & Winslett, CIKM 09]
Idea: total correlation measures the amount of cohesion/relatedness.
  $I(P) = \sum_i H(P_i) - H(P_1, P_2, \ldots, P_n)$
[Figure: Paper - Editor instance graph with editors E1, E2 and papers P1 to P4; the marginal probabilities are 1/2 or 0.]
  H(E) = 1.0
  H(P) = 1.0
  H(E, P) = 1.0
  I(E, P) = 1.0 + 1.0 - 1.0 = 1.0
Normalized total correlation:
  $I^*(P) = f(n) \cdot I(P) / H(P_1, P_2, \ldots, P_n)$, with $f(n) = n^2 / (n-1)^2$
Rank answers based on the $I^*(P)$ of their structure, i.e., independent of Q.

Relational Data Graph
E.g., RDBMS: all the valid candidate networks (CN)


Method                                 | Idea
SUITS [Zhou et al, 07]                 | Heuristic ranking or ask users
IQP [Demidova et al, TKDE 11]          | Auto score keyword binding + heuristic score structure
Probabilistic [Petkova et al, ECIR 09] | Auto score keyword binding + structure scoring

Schema graph: A - W - P
Q: Widom XML

ID | CN                      | Interpretation
3  | A^Q - W - P^Q           | an author wrote a paper
4  | A^Q - W - P^Q - W - A^Q | two authors wrote a single paper
5  | P^Q - W - A^Q - W - P^Q |

SUITS [Zhou et al, 2007]
Rank candidate structured queries by heuristics:
  1. The (normalized) (expected) results should be small
  2. Keywords should cover a majority part of the value of a binding attribute
  3. Most query keywords should be matched
GUI to help the user interactively select the right structural query.
Also c.f., ExQueX [Kimelfeld et al, SIGMOD 09]

  Interactively formulate query via reduced trees and filters

IQP [Demidova et al, TKDE11]
Structural query = keyword bindings + query template
  Query template: Author - Write - Paper
  Keyword binding 1 (A1) and keyword binding 2 (A2) for Q = Widom XML
  $\Pr[A, T \mid Q] \propto \Pr[A \mid T] \cdot \Pr[T] = \prod_i \Pr[A_i \mid T] \cdot \Pr[T]$
  Slide annotations: probability of keyword bindings; estimated from query log
Q: What if there is no query log?

Probabilistic Scoring [Petkova et al, ECIR 09] /1
List and score all possible bindings of (content/structural) keywords
  Pr(path[~w]) = Pr[~w | path] = p_LM[w | doc(path)]
Generate high-probability combinations from them
Reduce each combination into a valid XPath query by applying operators and updating the probabilities
  1. Aggregation: //a[~x] + //a[~y] -> //a[~ x y], Pr = Pr(A) * Pr(B)
  2. Specialization: //a[~x] -> //b//a[~x], Pr = Pr[//a is a descendant of //b] * Pr(A)

Probabilistic Scoring [Petkova et al, ECIR 09] /2
Reduce each combination into a valid XPath query by applying operators and updating the probabilities
  3. Nesting: //a + //b[~y] -> //a//b[~y], //a[//b[~y]], Prs = IG(A) * Pr[A] * Pr[B], IG(B) * Pr[A] * Pr[B]
Keep the top-k valid queries (via A* search)

Summary
Traditional methods: list and explore all possibilities
New trend: focus on the most promising one
  Exploit data statistics!
Alternatives
  Methods based on ranking/scoring data subgraphs (i.e., result instances)

Roadmap
Motivation
Structural ambiguity
  Node connection inference
  Return information inference
  Leverage query forms
Keyword ambiguity
Evaluation
Query processing
Result analysis
Future directions

Identifying Return Nodes [Liu and Chen SIGMOD 07]

Similar to SQL/XQuery, query keywords can specify
  predicates (e.g. selections and joins)
  return nodes (e.g. projections)
    Q1: John, institution
Return nodes may also be implicit
    Q2: John, Univ of Toronto -> return node = author
  Implicit return nodes: entities involved in results
XSeek infers return nodes by analyzing
  Patterns of query keyword matches: predicates, explicit return nodes
  Data semantics: entity, attributes

Fine Grained Return Nodes Using Constraints [Koutrika et al. 06]
E.g. Q3: John, SIGMOD
  Multiple entities with many attributes are involved. Which attributes should be returned?
Returned attributes are determined based on two user/admin-specified constraints:
  Maximum number of attributes in a result
  Minimum weight of paths in the result schema
[Figure: weighted schema graph over person (pname), review and conference (name, year, sponsor), with edge weights including 0.8, 0.9 and 0.5.]
If minimum weight = 0.4 and table person is returned, then attribute sponsor will not be returned, since the path person -> review -> conference -> sponsor has a weight of 0.8 * 0.9 * 0.5 = 0.36.

Roadmap
Motivation
Structural ambiguity
  Node connection inference
  Return information inference
  Leverage query forms
Keyword ambiguity
Evaluation
Query processing
Result analysis
Future directions

Combining Query Forms and Keyword Search [Chu et al. SIGMOD 09]
Inferring structures for keyword queries is challenging.
Suppose we have a set of query forms: can we leverage them to obtain the structure of a keyword query accurately?
What is a query form?
  An incomplete SQL query (with joins)
  Selections to be completed by users
Example: which author publishes which paper
  Form fields (each with Op and Expr inputs): Author Name, Paper Title
  SELECT * FROM author A, paper P, write W
  WHERE W.aid = A.id AND W.pid = P.id AND A.name op expr AND P.title op expr

Challenges and Problem Definition
Challenges:
  How to obtain query forms?
  How many query forms should be generated?
    Fewer forms: only a limited set of queries can be posed.
    More forms: which one is relevant?

Problem definition
  OFFLINE
    Input: database schema
    Output: a set of forms
    Goal: cover a majority of potential queries
  ONLINE
    Input: keyword query
    Output: a ranked list of relevant forms, to be filled in by the user

Offline: Generating Forms
Step 1: Select a subset of skeleton templates, i.e., SQL with only table names and join conditions.
Step 2: Add predicate attributes to each skeleton template to get query forms; leave operator and expression unfilled.
  SELECT * FROM author A, paper P, write W
  WHERE W.aid = A.id AND W.pid = P.id AND A.name op expr AND P.title op expr
  semantics: which person writes which paper
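
The two offline steps can be sketched as a small string-building routine. This is only an illustration of how a skeleton template and predicate attributes combine into a form; `make_form` and its parameters are hypothetical names, and the selection of which skeletons and attributes to use is not modeled here.

```python
def make_form(tables, joins, predicate_attrs):
    """Build a query form: a skeleton template plus unfilled predicates."""
    from_clause = ", ".join(f"{name} {alias}" for name, alias in tables)
    # Step 1: the skeleton template has only table names and join conditions.
    conditions = list(joins)
    # Step 2: add predicate attributes; operator and expression stay unfilled.
    conditions += [f"{attr} op expr" for attr in predicate_attrs]
    return f"SELECT * FROM {from_clause} WHERE " + " AND ".join(conditions)

form = make_form(
    tables=[("author", "A"), ("paper", "P"), ("write", "W")],
    joins=["W.aid = A.id", "W.pid = P.id"],
    predicate_attrs=["A.name", "P.title"],
)
print(form)
```

With these inputs the sketch reproduces the author-paper form shown above.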

Online: Selecting Relevant Forms
Generate all queries by replacing some keywords with schema terms (i.e. table names).
Then evaluate all queries on forms using AND semantics, and return the union.
  e.g., John, XML will generate 3 other queries:
    Author, XML
    John, paper
    Author, paper

Online: Form Ranking and Grouping
Forms are ranked based on typical IR ranking metrics for documents (Lucene Index).

Since many forms are similar, similar forms are grouped.
Two-level form grouping:
  First, group forms with the same skeleton templates.
    e.g., group 1: author-paper; group 2: co-author, etc.
  Second, further split each group based on query classes (SELECT, AGGR, GROUP, UNION-INTERSECT).
    e.g., group 1.1: author-paper-AVG; group 1.2: author-paper-INTERSECT, etc.

Generating Query Forms [Jayapandian and Jagadish PVLDB08]
Motivation:
  How to generate good forms, i.e. forms that cover many queries?
  What if the query log is unavailable?
  How to generate expressive forms, i.e. beyond joins and selections?
Problem definition:
  Input: database, schema/ER diagram
  Output: query forms that maximally cover queries with size constraints
Challenge:
  How to select entities in the schema to compose a query form?
  How to select attributes?
  How to determine input (predicates) and output (return nodes)?

Queriability of an Entity Type
Intuition: if an entity node is likely to be visited through data browsing/navigation, then it is likely to appear in a query.
  Queriability is estimated by accessibility in navigation.
Adapt the PageRank model for data navigation:
  PageRank measures the accessibility of a data node (i.e. a page).
    A node spreads its score to its outlinks equally.
  Here we need to measure the score of an entity type.
    The spread weight from n to its outlink m is defined as (# of connections (n -> m)) / (# of instances of m), normalized by the weights of all outlinks of n.
    e.g. suppose inproceedings -> authors and articles -> authors: if on average an author writes more conference papers than articles, then inproceedings has a higher weight for score spread to author (than article).

Queriability of Related Entity Types
Intuition: related entities may be asked about together.
The queriability of two related entities depends on:
  Their respective queriabilities
  The fraction of one entity's instances that are connected to the other entity's instances, and vice versa.

  e.g., if paper is always connected with author but not necessarily editor, then queriability(paper, author) > queriability(paper, editor)

Queriability of Attributes
Intuition: frequently appearing attributes of an entity are important.
The queriability of an attribute depends on its number of (non-null) occurrences in the data with respect to its parent entity instances.
  e.g., if every paper has a title, but not all papers have indexterm, then queriability(title) > queriability(indexterm).

Operator-Specific Queriability of Attributes
Expressive forms with many operators.
Operator-specific queriability of an attribute: how likely the attribute will be used for this operator.

  Highly selective attributes -> Selection
    Intuition: they are effective in identifying entity instances, e.g., author name
  Text field attributes -> Projection
    Intuition: they are informative to the users, e.g., paper abstract
  Single-valued and mandatory attributes -> Order By
    e.g., paper year
  Repeatable and numeric attributes -> Aggregation
    e.g., person age
Selected entity, related entities, their attributes with suitable operators -> query forms

QUnit [Nandi & Jagadish, CIDR 09]
Define a basic, independent semantic unit of information in the DB as a QUnit.
  Similar to forms as structural templates.
Materialize QUnit instances in the data.
Use keyword queries to retrieve relevant instances.
Compared with query forms:
  QUnit has a simpler interface.
  Query forms allow users to specify the binding of keywords and attribute names.

Roadmap
Motivation
Structural ambiguity
Keyword ambiguity
  Query cleaning and auto-completion
  Query refinement
  Query rewriting

Evaluation
Query processing
Result analysis
Future directions

Spelling Correction
Noisy Channel Model: intended query (C) -> noisy channel -> observed query (Q)
  Q = ipd, with variants(k1): C1 = ipad, C2 = ipod
  $\Pr[C \mid Q] = \frac{\Pr[Q \mid C] \Pr[C]}{\Pr[Q]} \propto \Pr[Q \mid C] \Pr[C]$
  Pr[Q | C]: error model; Pr[C]: query generation (prior)

Keyword Query Cleaning [Pu & Yu, VLDB 08]
Hypotheses = Cartesian product of variants(ki)

ki   | Confusion set (ki)
Appl | {Appl, Apple}
ipd  | {ipd, ipad, ipod}
nan  | {nan, nano}
att  | {att, at&t}

2*3*2 hypotheses: {Appl ipd nan, Apple ipad nano, Apple ipod nano, ...}
Error model: $\Pr[Q \mid C] \propto \frac{1}{z} \exp(-\alpha \cdot ed(Q, C))$
Prior: $\Pr[C] \propto \frac{1}{y} \cdot IRScore_{DB}(C) \cdot Boost(|C|)$
  Boost: prevents fragmentation
  IRScore_DB(C) = 0 due to DB normalization: what if at&t is in another table?

Segmentation
Both Q and Ci consist of multiple segments (each backed up by tuples in the DB).
  Q  = { Appl ipd } { att }
  C1 = { Apple ipad } { at&t }
  Segment probabilities: Pr1, Pr2
Maximize Pr1 * Pr2
  Why not Pr1 * Pr2 * Pr3?
How to obtain the segmentation?
  Efficient computation using (bottom-up) dynamic programming
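
As a rough illustration of the hypothesis space and the noisy-channel scoring, the sketch below enumerates the Cartesian product of the first three confusion sets and ranks the hypotheses. It simplifies the method considerably: tokens are scored independently instead of by segments, the `PRIOR` values are made-up stand-ins for the DB-backed prior, and `ALPHA` is an assumed decay rate for the error model.

```python
from itertools import product
from math import exp

def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]

CONFUSION = {"Appl": ["Appl", "Apple"],
             "ipd": ["ipd", "ipad", "ipod"],
             "nan": ["nan", "nano"]}
# Hypothetical unigram priors standing in for Pr[C]; not from the tutorial.
PRIOR = {"Appl": 0.01, "Apple": 0.5, "ipd": 0.01, "ipad": 0.3,
         "ipod": 0.4, "nan": 0.05, "nano": 0.3}
ALPHA = 1.0  # assumed decay rate of the error model

def score(query, candidate):
    """Unnormalized Pr[Q | C] * Pr[C], treating tokens independently."""
    s = 1.0
    for q, c in zip(query, candidate):
        s *= exp(-ALPHA * edit_distance(q, c)) * PRIOR[c]
    return s

query = ["Appl", "ipd", "nan"]
hypotheses = list(product(*(CONFUSION[k] for k in query)))
best = max(hypotheses, key=lambda c: score(query, c))
print(len(hypotheses), best)
```

With these made-up priors the top hypothesis is Apple ipod nano; a different prior would change the winner, which is exactly the part the DB-backed prior is meant to supply.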

XClean [Lu et al, ICDE 11] /1
Noisy Channel Model for XML data T:
  $\Pr[C \mid Q] \propto \Pr[Q \mid C] \Pr[C]$ becomes $\Pr[C \mid Q, T] \propto \Pr[Q \mid C, T] \cdot \Pr[C \mid T]$
  Pr[Q | C, T]: error model; Pr[C | T]: query generation model
Error model: $\Pr[Q \mid C, T] = \Pr[Q \mid C]$
Query generation model: $\Pr(C \mid T) = \sum_{r \in entities} \Pr(C \mid r) \Pr(r \mid T)$
  Pr(C | r): language model; Pr(r | T): prior

XClean [Lu et al, ICDE 11] /2
Advantages:
  Guarantees the cleaned query has non-empty results
  Not biased towards rare tokens

Query  | adventurecome ravel diiry
XClean | adventuresome travel diary
Google | adventure come travel diary
[PY08] | adventuresome rvel dairy

Auto-completion
Auto-completion in search engines
  traditionally, prefix matching
  now, allowing errors in the prefix
  c.f., Auto-completion allowing errors [Chaudhuri & Kaushik, SIGMOD 09]
Auto-completion for relational keyword search
  TASTIER [Li et al, SIGMOD 09]: 2 kinds of prefix matching semantics

Q = {srivasta, sig}
Treat each keyword as a prefix
  E.g., matches papers by srivastava published in sigmod
Idea
  Index every token in a trie: each prefix corresponds to a range of tokens
  Candidates = tokens for the smallest prefix
  Use the ranges of the remaining keywords (prefixes) to filter the candidates
    With the help of a δ-step forward index
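
The filtering idea can be sketched in a few lines. The node ids and keyword ids below follow the tutorial's running example for Q = {srivasta, sig}; the `VOCAB` list, the function names and the use of a sorted list with bisect in place of a trie are assumptions made for brevity.

```python
from bisect import bisect_left

# Hypothetical sorted vocabulary; a token's position plays the role of its keyword id.
VOCAB = ["database", "sigact", "sigir", "sigmod", "sigweb", "srivastava", "xml"]

def prefix_range(prefix):
    """Inclusive range [lo, hi] of vocabulary positions whose token starts with prefix."""
    lo = bisect_left(VOCAB, prefix)
    hi = bisect_left(VOCAB, prefix + "\uffff") - 1
    return lo, hi

def filter_candidates(candidates, forward_index, key_range):
    """Keep nodes that reach some keyword id inside key_range within delta steps."""
    lo, hi = key_range
    return {n for n in candidates
            if any(lo <= k <= hi for k in forward_index.get(n, ()))}

# Running example: I(srivasta) = {11, 12, 78}, Range(sig) = [k23, k27].
FORWARD_INDEX = {11: [2, 14, 22, 31], 12: [5, 25, 75], 78: [101, 237]}
survivors = filter_candidates({11, 12, 78}, FORWARD_INDEX, (23, 27))
print(prefix_range("sig"), survivors)
```

Because every prefix maps to a contiguous id range, the membership test against the forward index is a simple interval check rather than a string comparison.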

Q = {srivasta, sig}
[Figure: trie over the tokens; the prefix sig covers tokens from sigact to sigweb (keyword ids k23 to k27), and the prefix srivasta covers ids k73 and k74, with inverted lists {11, 12} and {78}.]
Candidates = I(srivasta) = {11, 12, 78}
Range(sig) = [k23, k27]

Node | Keywords reachable within δ steps
11   | k2, k14, k22, k31
12   | k5, k25, k75
78   | k101, k237

After pruning, Candidates = {12}: grow a Steiner tree around it.

Roadmap

Motivation
Structural ambiguity
Keyword ambiguity
  Query cleaning and auto-completion
  Query refinement
  Query rewriting
Evaluation
Query processing
Result analysis
Future directions

Query Refinement: Motivation and Solutions
Motivation:
  Sometimes lots of results may be returned.
  With an imperfect ranking function, finding relevant results is overwhelming to users.

Question: how to refine a query by summarizing the results of the original query?
Current approaches:
  Identify important terms in results
  Cluster results
  Classify results by categories: Faceted Search

Data Clouds [Koutrika et al. EDBT 09]
Goal: find and suggest important terms from query results as expanded queries.
Input: database, admin-specified entities and attributes, query
  Attributes of an entity may appear in different tables, e.g., the attributes of a paper may include the information of its authors.
Output: top-K ranked terms in the results, where each result is an entity and its attributes.
E.g., query = XML
  Each result is a paper with attributes title, abstract, year, author name, etc.
  Top terms returned: keyword, XPath, IBM, etc.
  Gives users insight about papers about XML.

Ranking Terms in Results
Popularity based: $Score(t) = \sum_E tf(t, E)$ over all results E.
  However, it may select very general terms, e.g., data.
Relevance based: $Score(t) = \sum_E tf(t, E) \cdot idf(t)$ over all results E.
Result weighted: $Score(t) = \sum_E tf(t, E) \cdot idf(t) \cdot score(E)$ over all results E.
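
The three scoring variants differ only in which factors multiply the term frequency, which the following sketch makes explicit. The result lists, idf values and result scores are made up for illustration, and `rank_terms` is a hypothetical helper, not an API from the Data Clouds work.

```python
from collections import Counter

def rank_terms(results, idf, mode="popularity"):
    """Score terms over results given as (tokens, score) pairs."""
    scores = Counter()
    for tokens, result_score in results:
        for term, tf in Counter(tokens).items():
            weight = tf                                 # popularity based
            if mode in ("relevance", "weighted"):
                weight *= idf.get(term, 0.0)            # relevance based
            if mode == "weighted":
                weight *= result_score                  # result weighted
            scores[term] += weight
    return scores.most_common()

# Hypothetical results for the query "XML" with made-up idf values and scores.
RESULTS = [(["data", "xpath", "data"], 0.9), (["data", "ibm"], 0.5)]
IDF = {"data": 0.1, "xpath": 2.0, "ibm": 1.5}

print(rank_terms(RESULTS, IDF, "popularity")[0])
print(rank_terms(RESULTS, IDF, "weighted")[0])
```

In this toy input the popularity-based ranking puts the general term data first, while the result-weighted ranking prefers xpath, mirroring the concern about very general terms.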

How to rank results, i.e., how to compute score(E)?
  Traditional TF*IDF does not take into account the attribute weights.
    e.g., course title is more important than course description.
  Improved TF: weighted sum of the TFs of the attributes.

Frequent Co-occurring Terms [Tao et al. EDBT 09]
Can we avoid generating all results first?
Input: query
Output: top-k ranked non-keyword terms in the results.
Capable of computing top-k terms efficiently without even generating results.

Terms in results are ranked by frequency.
  Tradeoff of quality and efficiency.

Query Refinement: Motivation and Solutions
Motivation:
  Sometimes lots of results may be returned.
  With an imperfect ranking function, finding relevant results is overwhelming to users.
Question: how to refine a query by summarizing the results of the original query?
Current approaches:
  Identify important terms in results
  Cluster results
  Classify results by categories: Faceted Search

Summarizing Results for Ambiguous Queries

Query words may be polysemous.
It is desirable to refine an ambiguous query by its distinct meanings.
[Figure: query suggestions for Java; all suggested queries are about the Java programming language.]

Motivation Contd.
Goal: the set of expanded queries should provide a categorization of the original query results.
Ideally: Result(Qi) = Ci
[Figure: results of the query Java in three clusters. c1, Java language: "Java software platform", "Java applet", "OO language", "developed at Sun". c2, Java island: "there are three languages", "is an island of Indonesia", "has four provinces". c3, Java band: "Java band formed in Paris", "active from 1972 to 1983". The region Result(Q1) is marked.]
Q1 does not retrieve all results in C1, and retrieves results in C2.
How to measure the quality of expanded queries?

Query Expansion Using Clusters
Input: clustered query results
Output: one expanded query for each cluster, such that each expanded query
  maximally retrieves the results in its cluster (recall)
  minimally retrieves the results not in its cluster (precision)
Hence each query should aim at maximizing the F-measure.
This problem is APX-hard.
Efficient heuristic algorithms have been developed.

Query Refinement: Motivation and Solutions
Motivation:
  Sometimes lots of results may be returned.
  With an imperfect ranking function, finding relevant results is overwhelming to users.
Question: how to refine a query by summarizing the results of the original query?
Current approaches:
  Identify important terms in results
  Cluster results
  Classify results by categories: Faceted Search

Faceted Search [Chakrabarti et al. 04]
Allows the user to explore the classification of results.
  Facets: attribute names
  Facet conditions: attribute values

By selecting a facet condition, a refined query is generated.
Challenges:
  How to determine the nodes (facet conditions)?
  How to build the navigation tree?

How to Determine Nodes: Facet Conditions
Categorical attributes:
  A value -> a facet condition
  Ordered based on how many queries hit each value.
Numerical attributes:
  A value partition -> a facet condition
  Partition is based on historical queries: if many queries have predicates that start or end at x, it is good to partition at x.

How to Construct Navigation Tree
Input: query results, query log.
Output: a navigational tree, one facet at each level, minimizing the user's expected navigation cost for finding the relevant results.
Challenges:
  How to define the cost model?
  How to estimate the likelihood of user actions?

User Actions
proc(N): explore the current node N
  showRes(N): show all tuples that satisfy N
  expand(N): show the child facet of N
    readNext(N): read all values of the child facet of N
  ignore(N)
[Figure: navigation example: showRes lists apt1, apt2, apt3; expand shows the facet neighborhood: Redmond, Bellevue, and the facet price: 200-225K, 225-250K, 250-300K.]
  $cost(showRes(N))$ = number of tuples that satisfy N
  $cost(expand(N)) = cost(readNext(N)) + \sum_{\text{each child } N' \text{ of } N} p(proc(N')) \cdot cost(proc(N'))$

Navigation Cost Model
  $EstimatedCost(N) = p(proc(N)) \cdot cost(proc(N))$
  $cost(proc(N)) = p(showRes(N)) \cdot cost(showRes(N)) + p(expand(N)) \cdot cost(expand(N))$

How to estimate the involved probabilities? ICDE 2011Tutorial 88 Estimating Probabilities /1 p(expand(N)): high if many historical queries involve the child facet of N p (expand ( N )) number of queries that involve the child facet of N total number of historical queries p(showRes (N)): 1 p(expand(N)) ICDE 2011 Tutorial 89 Estimating Probabilities/2 p(proc(N)): User will process N if and only if user processes and chooses to expand Ns parent facet, and thinks N is relevant. P(N is relevant) = the percentage of queries in query log that has a selection condition overlapping N. ICDE 2011 Tutorial 90
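The cost model and the probability estimates above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [Chakrabarti et al. 04]: the `FacetNode` structure, the representation of the query log as sets of attribute names, the unit cost per child label read, and the use of `p_relevant` as the probability of processing a child once its parent has been expanded are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FacetNode:
    """A node (facet condition) in a navigation tree; a hypothetical structure."""
    label: str
    num_tuples: int                    # tuples that satisfy this node
    p_relevant: float = 1.0            # share of logged queries whose condition overlaps the node
    child_facet: Optional[str] = None  # attribute that splits this node, if any
    children: List["FacetNode"] = field(default_factory=list)


def p_expand(node: FacetNode, query_log: List[Set[str]]) -> float:
    """Fraction of historical queries that involve the node's child facet."""
    if node.child_facet is None or not query_log:
        return 0.0
    hits = sum(1 for attrs in query_log if node.child_facet in attrs)
    return hits / len(query_log)


def cost_proc(node: FacetNode, query_log: List[Set[str]]) -> float:
    """Expected cost of exploring `node` under the cost model sketched above."""
    pe = p_expand(node, query_log)
    ps = 1.0 - pe                           # p(showRes(N)) = 1 - p(expand(N))
    cost_show = float(node.num_tuples)      # cost(showRes(N)) = number of tuples
    cost_read = float(len(node.children))   # assumption: reading one child label costs 1
    cost_expand = cost_read + sum(
        child.p_relevant * cost_proc(child, query_log) for child in node.children
    )
    return ps * cost_show + pe * cost_expand


# Toy data: each logged query is the set of attributes it constrains.
log = [{"price"}, {"price", "neighborhood"}, {"neighborhood"}, {"price"}]
root = FacetNode("all apartments", 60, child_facet="price", children=[
    FacetNode("price: 200-225K", 10, p_relevant=0.5),
    FacetNode("price: 225-250K", 20, p_relevant=0.25),
    FacetNode("price: 250-300K", 30, p_relevant=0.25),
])
print(p_expand(root, log), cost_proc(root, log))
```

Leaves have no child facet, so their expected cost reduces to the number of tuples they hold; the root's cost mixes showing all 60 tuples with expanding into the three price ranges.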

Algorithm
Enumerating all possible navigation trees to find the one with minimal cost is prohibitively expensive.
Greedy approach:
  Build the tree top-down.
  At each level, a candidate attribute is an attribute that does not appear in previous levels.
  Choose the candidate attribute with the smallest navigation cost.

Facetor [Kashyap et al. 2010]
Input: query results, user input on facet interestingness.
Output: a navigation tree, with a set of facet conditions (possibly from multiple facets) at each level, minimizing the navigation cost.
User actions: EXPAND, SHOWRESULT, SHOWMORE.
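The greedy top-down construction from the Algorithm slide can be sketched as follows. This is only an illustration: `greedy_facet_order` and the toy cost function are hypothetical, and the real navigation cost would come from a cost model estimated over the query results and the query log rather than from the fixed statistics assumed here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_facet_order(
    attributes: Sequence[str],
    level_cost: Callable[[List[str], str], float],
) -> List[str]:
    """Pick one facet per level, top-down, always taking the cheapest candidate."""
    chosen: List[str] = []
    remaining = list(attributes)
    while remaining:
        # Candidates are exactly the attributes not used at previous levels.
        best = min(remaining, key=lambda attr: level_cost(chosen, attr))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Assumed toy statistics: (number of facet conditions, share of logged queries using the facet).
stats: Dict[str, Tuple[int, float]] = {
    "neighborhood": (4, 0.8),
    "price": (6, 0.6),
    "bedrooms": (5, 0.1),
}


def toy_cost(chosen: List[str], attr: str) -> float:
    """Hypothetical stand-in for the navigation cost of placing `attr` at the next level."""
    num_values, popularity = stats[attr]
    return num_values / popularity


order = greedy_facet_order(list(stats), toy_cost)
print(order)
```

The toy cost ignores the facets already chosen; a realistic cost function would use `chosen` to restrict the result set before estimating the cost of the next level.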

Facetor [Kashyap et al. 2010] /2
Different ways to infer probabilities:
  p(showRes): depends on the size of the results and the value spread.
  p(expand): depends on the interestingness of the facet and the popularity of the facet condition.
  p(showMore): applies if a facet is interesting and no facet condition is selected.
Different cost models.

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity (Query cleaning and auto-completion, Query refinement, Query rewriting); Evaluation; Query processing; Result analysis; Future directions

Effective Keyword-Predicate Mapping [Xin et al. VLDB 10]
Keyword queries
  are non-quantitative
  may contain synonyms
E.g., "small IBM laptop".
Handling such queries directly may result in low precision and low recall.

ID | Product Name | BrandName | Screen Size | Description
1 | ThinkPad T60 | Lenovo | 14 | The IBM laptop...small business
2 | ThinkPad X40 | Lenovo | 12 | This notebook...

Problem Definition
Input: keyword query Q, an entity table E.
Output: a CNF (Conjunctive Normal Form) SQL query T(Q) for the keyword query Q.
E.g.,
  Input: Q = small IBM laptop
  Output: T(Q) = SELECT * FROM Table WHERE BrandName = 'Lenovo' AND ProductDescription LIKE '%laptop%' ORDER BY ScreenSize ASC

Key Idea
To understand a query keyword, compare two queries that differ on this keyword, and analyze the differences in the attribute value distributions of their results.
E.g., to understand the keyword "IBM", we can compare the results of
  q1: "IBM laptop"
  q2: "laptop"

Differential Query Pair (DQP)

For reliability and efficiency, to interpret a keyword k the approach uses all query pairs in the query log that differ by k.
DQP with respect to k: a foreground query Qf and a background query Qb, with Qf = Qb ∪ {k}.

Analyzing Differences of Results of DQP
To analyze the differences between the results of Qf and Qb on each attribute value, use well-known correlation metrics on distributions:
  Categorical values: KL-divergence
  Numerical values: Earth Mover's Distance
E.g., consider the attribute value Brand: Lenovo.
  Qf = [IBM laptop] returns 50 results, 30 of which have Brand: Lenovo.
  Qb = [laptop] returns 500 results, only 50 of which have Brand: Lenovo.
The difference on Brand: Lenovo is significant, thus reflecting the meaning of "IBM".
For keywords mapped to numerical predicates, use ORDER BY clauses; e.g., "small" can be mapped to ORDER BY size ASC.
Compute the average score over all DQPs for each keyword k.
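As a worked illustration of the Brand: Lenovo example, the sketch below computes the KL-divergence between the foreground distribution (the query with "IBM": 50 results, 30 Lenovo) and the background distribution (the query without it: 500 results, 50 Lenovo). Collapsing the Brand attribute into "Lenovo" versus "other" is an assumption made to keep the example small, and the helper names are hypothetical; the exact scoring used by [Xin et al. VLDB 10] may differ.

```python
import math
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; assumes q[v] > 0 wherever p[v] > 0."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)


def brand_distribution(total: int, lenovo: int) -> Dict[str, float]:
    """Two-bucket distribution over Brand: 'Lenovo' versus everything else."""
    return {"Lenovo": lenovo / total, "other": (total - lenovo) / total}


foreground = brand_distribution(total=50, lenovo=30)    # results of [IBM laptop]
background = brand_distribution(total=500, lenovo=50)   # results of [laptop]

score = kl_divergence(foreground, background)
print(round(score, 3))
```

A score near zero would mean that adding the keyword leaves the Brand distribution unchanged; the clearly positive value here is what suggests mapping "IBM" to a predicate on Brand. Averaging such scores over all DQPs for the keyword gives the final estimate.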

Query Translation
Step 1: compute the best mapping for each keyword k in the query log.
Step 2: compute the best segmentation of the query, using linear-time dynamic programming.
Suppose we consider 1-grams and 2-grams. To compute the best segmentation of t1, ..., tn-2, tn-1, tn, there are two options, each computed recursively:
  (t1, ..., tn-2, tn-1), {tn}
  (t1, ..., tn-2), {tn-1, tn}

Query Rewriting Using Click Logs [Cheng et al. ICDE 10]
Motivation: the availability of query logs can be used to assess ground truth.
Problem definition
  Input: query Q, query log, click log.
  Output: the set of synonyms, hypernyms and hyponyms for Q.
  E.g., "Indiana Jones IV" vs. "Indian Jones 4".
Key idea: find historical queries whose ground truth significantly overlaps the top-k results of Q, and use them as suggested queries.

Query Rewriting Using Data Only [Nambiar and Kambhampati ICDE 06]
Motivation: a user who searches for low-price used "Honda Civic" cars might also be interested in "Toyota Corolla" cars.
How to find that "Honda Civic" and "Toyota Corolla" cars are similar using data only?
Key idea
  Find the sets of tuples on "Honda" and "Toyota", respectively.
  Measure the similarity between these two sets.

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis; Future directions

INEX - INitiative for the Evaluation of XML Retrieval
Benchmarks for DB: TPC; for IR: TREC.

A large-scale campaign for the evaluation of XML retrieval systems.
Participating groups submit benchmark queries and provide ground truths.
Assessors highlight relevant data fragments as ground-truth results.
http://inex.is.informatik.uni-duisburg.de/

INEX
Data sets: IEEE, Wikipedia, IMDB, etc.
Measure: assume the user stops reading when there are too many consecutive non-relevant result fragments (tolerance).
Score of a single result: precision, recall, F-measure. Let D be the part of the result read by the user, $P_1$ the part of D outside the ground truth, $P_2$ the part of D inside the ground truth, and $P_3$ the part of the ground truth outside D.
  Precision: % of relevant characters in the result: $\mathrm{precision}(D) = \frac{P_2}{P_1 + P_2}$
  Recall: % of relevant characters retrieved: $\mathrm{recall}(D) = \frac{P_2}{P_2 + P_3}$
  F-measure: harmonic mean of precision and recall: $S(D) = \frac{2 \cdot \mathrm{precision}(D) \cdot \mathrm{recall}(D)}{\mathrm{precision}(D) + \mathrm{recall}(D)}$

INEX
Measure: score of a ranked list of results: average generalized precision (AgP).
Generalized precision (gP) at rank k: the average score of the first k results returned: $gP(k) = \frac{1}{k} \sum_{i=1}^{k} S(D_i)$
Average gP (AgP): the average of gP over all values of k.

Axiomatic Framework for Evaluation
Formalize broad intuitions as a collection of simple axioms and evaluate strategies based on the axioms.
It has been successful in many areas, e.g., mathematical economics, clustering, location theory, collaborative filtering, etc.
Compared with benchmark evaluation:
Cost-effective

General, independent of any query or data set

Axioms [Liu et al. VLDB 08]
Axioms for XML keyword search have been proposed for identifying relevant keyword matches.
Challenge: it is hard or impossible to describe desirable results for any query on any data.
Proposal: some abnormal behaviors can be identified when examining the results of two similar queries, or of one query on two similar documents, produced by the same search engine.
Assuming AND semantics.
Four axioms:
  Data Monotonicity
  Query Monotonicity
  Data Consistency
  Query Consistency

Violation of Query Consistency
Q1: paper, Mark
Q2: SIGMOD, paper, Mark
[Figure: XML tree rooted at conf (name: SIGMOD, year: 2007) with paper and demo children, each with title, author and keyword descendants (values include Mark, Yang, XML, Liu, Top-k, Chen, Soliman); one subtree is highlighted.]
An XML keyword search engine that considers this subtree irrelevant for Q1 but relevant for Q2 violates query consistency.
Query Consistency: the new result subtree contains the new query keyword.
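The query consistency check can be illustrated with a small sketch. It is a deliberate simplification and not the formal axiom of [Liu et al. VLDB 08]: results are modeled as identifiers of whole subtrees, each with a flat set of terms, and the function name and data layout are hypothetical. It assumes Q2 is Q1 plus one or more added keywords.

```python
from typing import Dict, List, Set


def query_consistency_violations(
    q1: Set[str],
    q2: Set[str],
    results_q1: Set[str],
    results_q2: Set[str],
    subtree_terms: Dict[str, Set[str]],
) -> List[str]:
    """Ids of subtrees that are new under q2 but contain none of the added keywords."""
    added = q2 - q1                        # keywords present only in the longer query
    new_results = results_q2 - results_q1  # subtrees returned for q2 but not for q1
    return sorted(r for r in new_results if not (subtree_terms[r] & added))


# Toy data loosely following the example: a paper subtree that mentions Mark but not SIGMOD.
terms = {
    "paper_1": {"paper", "title", "xml", "author", "name", "mark", "yang"},
    "conf_1": {"conf", "name", "sigmod", "year", "2007", "paper", "mark"},
}
q1 = {"paper", "mark"}
q2 = q1 | {"sigmod"}

# An engine that returns paper_1 only after "sigmod" is added is inconsistent.
bad = query_consistency_violations(q1, q2, set(), {"paper_1"}, terms)
# Returning the conf subtree, which does contain "sigmod", is fine.
ok = query_consistency_violations(q1, q2, set(), {"conf_1"}, terms)
print(bad, ok)
```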

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis; Future directions

Efficiency in Query Processing
Query processing is another challenging issue for keyword search systems:
  1. Inherent complexity
  2. Large search space
  3. Work with scoring functions
Performance improving ideas
Query processing methods for XML KWS

1. Inherent Complexity
RDBMS / Graph
  Computing GST-1: NP-complete, and NP-hard to find a $(1+\varepsilon)$-approximation for any fixed $\varepsilon > 0$.
XML / Tree
  Number of *LCA nodes = $O(\min(N, \prod_i n_i))$

Specialized Algorithms
Top-1 Group Steiner Tree
  Dynamic programming for top-1 (group) Steiner tree [Ding et al, ICDE07]
  MIP [Talukdar et al, VLDB08]: uses Mixed Linear Programming to find the min Steiner tree (rooted at a node r)
Approximate methods
  STAR [Kasneci et al, ICDE 09]
    4(log n + 1) approximation
    Empirically outperforms other methods

Specialized Algorithms
Approximate methods
  BANKS I [Bhalotia et al, ICDE02]
    Equi-distance expansion from each keyword instance.
    A candidate solution is found when a node is found to be reachable from all query keyword sources.
    Buffer enough candidate solutions to output the top-k.
  BANKS II [Kacholia et al, VLDB05]
    Uses bi-directional search and an activation spreading mechanism.
  BANKS III [Dalvi et al, VLDB08]
    Handles graphs in external memory.

2. Large Search Space
Typically thousands of CNs.
SG: Author, Write, Paper, Cite: 0.2M CNs, >0.5M joins.

ID: CN
1: P^Q
2: C^Q
3: P^Q - C^Q
4: C^Q - P^Q - C^Q
5: C^Q - U - C^Q
6: C^Q - P - C^Q
7: C^Q - U - C^Q - P^Q

Solutions
  Efficient generation of CNs
    Breadth-first enumeration on the schema graph [Hristidis et al, VLDB 02] [Hristidis et al, VLDB 03]
    Duplicate-free CN generation [Markowetz et al, SIGMOD 07] [Luo 2009]
  Other means (e.g., combined with forms, pruning CNs with indexes, top-k processing); these will be discussed later.

3. Work with Scoring Functions
Top-k query processing: Discover 2 [Hristidis et al, VLDB 03]
  Naive: retrieve the top-k results from all CNs.
  Sparse: retrieve the top-k results from each CN in turn. Stop ASAP.
  Single Pipeline: perform a slice of the CN each time. Stop ASAP.
  Global Pipeline
These require a monotonic scoring function.

ID: CN
1: P^Q - W - A^Q
2: P^Q - W - A^Q - W - P^Q

Result (CN1): Score
P1-W1-A2: 3.0
P2-W5-A3: 2.3
...

Result (CN2): Score
P2-W2-A1-W3-P7: 1.0
P2-W9-A5-W6-P8: 0.6
...

Working with Non-monotonic Scoring Function: SPARK [Luo et al, SIGMOD 07]
Why a non-monotonic function?
[Figure: example results for CN1, P1^{k1} - W - A1^{k1} and P2^{k1} - W - A3^{k2}, with their scores, where Score(P1) > Score(P2).]
Solution
  Sort Pi and Aj in a salient order.
  $\mathrm{watf}(tuple) = \frac{\sum_{w} tf(w, tuple) \cdot idf(w)}{\sum_{w} idf(w)}$ works for SPARK's scoring function.
  Skyline sweeping algorithm
  Block pipeline algorithm

Efficiency in Query Processing
Query processing is another challenging issue for keyword search systems:
  1. Inherent complexity
  2. Large search space
  3. Work with scoring functions
Performance improving ideas
Query processing methods for XML KWS

Performance Improvement Ideas
Keyword search + form search [Baid et al, ICDE 10]
  Idea: leave hard queries to users.
Build specialized indexes
  Idea: precompute reachability information for pruning.
Leverage RDBMS [Qin et al, SIGMOD 09]
  Idea: utilize semi-join, join, and set operations.
Explore parallelism / share computation
  Idea: exploit the fact that many CNs overlap substantially with each other.

Selecting Relevant Query Forms [Chu et al. SIGMOD 09]
Idea
  Easy queries: run keyword search for a preset amount of time.
  Hard queries: summarize the rest of the unexplored and incompletely explored search space with forms.

Specialized Indexes for KWS
Graph reachability indexes (over the entire graph or over a local neighborhood)
  Proximity search [Goldman et al, VLDB98]
  Special reachability indexes
    BLINKS [He et al, SIGMOD 07]
    Reachability indexes [Markowetz et al, ICDE 09]
    TASTIER [Li et al, SIGMOD 09]
  Leveraging RDBMS [Qin et al, SIGMOD09]
Indexes for trees
  Dewey, JDewey [Chen & Papakonstantinou, ICDE 10]

Proximity Search [Goldman et al, VLDB98]
Index node-to-node min distance.
$O(|V|^2)$ space is impractical.
Select hub nodes ($H_i$), ideally balanced separators.
$d^*(u, v)$ records the min distance between u and v without crossing any $H_i$.
Using the hub index:
$d(x, y) = \min\big( d^*(x, y),\ d^*(x, A) + d_H(A, B) + d^*(B, y) \big), \quad \forall A, B \in H$

BLINKS [He et al, SIGMOD 07]
SLINKS [He et al, SIGMOD 07] indexes node-to-keyword distances.
  Thus O(K*|V|) space, about O(|V|^2) in practice.
  Then apply Fagin's TA algorithm.

r: d1, d2
ri: 5, 6
rj: 3, 9

BLINKS
  Partition the graph into blocks; portal nodes are shared by blocks.
  Build intra-block, inter-block, and keyword-to-block indexes.

D-Reachability Indexes [Markowetz et al, ICDE 09]
Precompute various reachability information with a size/range threshold (D) to cap their index sizes.
Prune partial solutions

  Node -> Set(Term) (N2T)
  (Node, Relation) -> Set(Term) (N2R)
  (Node, Relation) -> Set(Node) (N2N)
  (Relation1, Term, Relation2) -> Set(Term) (R2R)
Prune CNs
For comparison:
  Proximity Search: Node -> (Hub, dist)
  SLINKS: Node -> (Keyword, dist)

TASTIER [Li et al, SIGMOD 09]
Precompute various reachability information with a size/range threshold to cap their index sizes.
Prune partial solutions:
  Node -> Set(Term) (N2T)
  (Node, dist) -> Set(Term) (δ-Step Forward Index)
Also employs trie-based indexes to
  support prefix-match semantics
  support query auto-completion (via a 2-tier trie)

Leveraging RDBMS [Qin et al, SIGMOD09]
Goal: perform all the operations via SQL: semi-join, join, union, set difference.
Steiner tree semantics: semi-joins.
Distinct core semantics:
  Pairs(n1, n2, dist), dist ≤ Dmax
  S = Pairs_k1(x, a, i) ⋈_x Pairs_k2(x, b, j)
  Ans = S GROUP BY (a, b)
[Figure: a center node x connected to keyword nodes a and b.]

Leveraging RDBMS [Qin et al, SIGMOD09]
How to compute Pairs(n1, n2, dist) within the RDBMS?
  Pairs_S(s, x, i) ⋈ R -> Pairs_R(r, x, i+1)
  Pairs_T(t, y, i) ⋈ R -> Pairs_R(r, y, i+1)
  Mindist over Pairs_R(r, x, 0) ∪ Pairs_R(r, x, 1) ∪ ... ∪ Pairs_R(r, x, Dmax)
More efficient alternatives are also proposed.
The semi-join idea can be used to further prune the core nodes, center nodes, and path nodes.
[Figure: relations T, R and S, with tuples s and x in S and r in R.]

Other Kinds of Index
EASE [Li et al, SIGMOD 08]
  (Term1, Term2) -> (maximal r-Radius Graph, sim)

Summary
Index: Mapping
Proximity Search: Node -> (Hub, dist)
SLINKS: Node -> (Keyword, dist)
N2T: Node -> (Keyword, Y/N) | D
N2R: (Node, R) -> (Keyword, Y/N) | D
N2N: (Node, R) -> (Node, Y/N) | D
R2R: (R1, Keyword, R2) -> (Keyword, Y/N) | D
[Qin et al, SIGMOD09]: Node -> (Node, dist) | Dmax
EASE: (K1, K2) -> (maximal r-SG, sim) | r

Multi-query Optimization
Issue: a keyword query generates too many SQL queries.
Solution 1: guess the most likely SQL/CN.
Solution 2: parallelize the computation [Qin et al, VLDB 10].
Solution 3: share computation
  Operator Mesh [Markowetz et al, SIGMOD 07]
  SPARK2 [Luo et al, TKDE]

Parallel Query Processing [Qin et al, VLDB 10]
Many CNs share common sub-expressions.
Capture such sharing in a shared execution graph.
Each node is annotated with its estimated cost.

ID: CN
1: P^Q
2: C^Q
3: P^Q - C^Q
4: C^Q - P^Q - C^Q
5: C^Q - U - C^Q
6: C^Q - P - C^Q
7: C^Q - U - C^Q - P^Q

[Figure: shared execution graph over CNs 1-7, built from the tuple sets C^Q, P^Q, U and P.]

Parallel Query Processing [Qin et al, VLDB 10]
CN Partitioning
Assign the largest job to the core with the lightest load.
[Figure: the seven example CNs (IDs 1-7) and their shared execution graph, with jobs assigned to cores 1, 2 and 3.]

Parallel Query Processing [Qin et al, VLDB 10]
Sharing-aware CN Partitioning
Assign the largest job to the core that has the lightest resulting load.
Update the cost of the rest of the jobs.
[Figure: the shared execution graph with jobs assigned to cores 1, 2 and 3.]

Parallel Query Processing [Qin et al, VLDB 10]
Operator-level Partitioning
Consider each level.
Perform cost (re-)estimation.
Allocate operators to cores.
Also has data-level parallelism for extremely skewed scenarios.
[Figure: operators over C^Q, P^Q, U and P allocated as jobs to cores 1, 2 and 3.]

Operator Mesh [Markowetz et al, SIGMOD 07]
Background
  Keyword search over relational data streams.
  No CNs can be pruned!
Leaves of the mesh: |SR| * 2^k source nodes.
CNs are generated in a canonical form in a depth-first manner.
Cluster these CNs to build the mesh.
The actual mesh is even more complicated:
  Need to have buffers associated with each node.
  Need to store the timestamp of the last sleep.

SPARK2 [Luo et al, TKDE]
Capture CN dependency (and sharing) via the partition graph.
Features
  Only CNs are allowed as nodes, so there are no open-ended joins.
  Models all the ways a CN can be obtained by joining two other CNs (and possibly some free tuplesets), which allows pruning if one sub-CN produces an empty result.
[Figure: partition graph over the seven example CNs (IDs 1-7).]

Efficiency in Query Processing
Query processing is another challenging issue for keyword search systems:
  1. Inherent complexity
  2. Large search space
  3. Work with scoring functions
Performance improving ideas
Query processing methods for XML KWS

XML KWS Query Processing
SLCA

  XKSearch [Xu & Papakonstantinou, SIGMOD 05]
  Multiway SLCA [Sun et al, WWW 07]
ELCA
  XRank [Guo et al, SIGMOD 03]
  Index Stack [Xu & Papakonstantinou, EDBT 08]
  JDewey Join [Chen & Papakonstantinou, ICDE 10]: also supports SLCA and top-k keyword search

XKSearch [Xu & Papakonstantinou, SIGMOD 05]
Indexed-Lookup-Eager (ILE), for when ki is selective: O(k * d * |Smin| * log(|Smax|))
  $\mathrm{SLCA}(v, S) = \mathrm{desc}\big(\mathrm{LCA}(v, lm_S(v)),\ \mathrm{LCA}(v, rm_S(v))\big)$
  $\mathrm{SLCA}(v, S) = \mathrm{LCA}(v, \mathrm{closest}_S(v))$
[Figure: nodes z, y, x, w in document order, with v between its left match lm_S(v) and right match rm_S(v).]
Q: x ∈ SLCA?
A: No. But we can decide whether the previous candidate SLCA node (w) ∈ SLCA or not.

Multiway SLCA [Sun et al, WWW 07]
Basic & Incremental Multiway SLCA: O(k * d * |Smin| * log(|Smax|))
Q: Who will be the anchor node next?
  1) skip_after(Si, anchor)
  2) skip_out_of(z)
[Figure: nodes z, y, x, w and the anchor node.]

Index Stack [Xu & Papakonstantinou, EDBT 08]
Idea:
  $\mathrm{ELCA}(S_1, S_2, \ldots, S_k) \subseteq \mathrm{ELCA\_candidates}(S_1, S_2, \ldots, S_k)$
  $\mathrm{ELCA\_candidates}(S_1, S_2, \ldots, S_k) = \bigcup_{v \in S_1} \mathrm{SLCA}(\{v\}, S_2, \ldots, S_k)$
  O(k * d * log(|Smax|)), where d is the depth of the XML data tree.
A sophisticated stack-based algorithm finds the true ELCA nodes from ELCA_candidates.
Overall complexity: O(k * d * |Smin| * log(|Smax|))
  DIL [Guo et al, SIGMOD 03]: O(k * d * |Smax|)
  RDIL [Guo et al, SIGMOD 03]: O(k^2 * d * p * |Smax| log(|Smax|) + k^2 * d + |Smax|^2)

Computing ELCA
JDewey Join [Chen & Papakonstantinou, ICDE 10]
Compute ELCA bottom-up.
[Figure: JDewey-labeled tree and per-level label lists; example label 1.1.2.2.]

Summary
Query processing for KWS is a challenging task.
Avenues explored:
  Alternative result definitions
  Better exact and approximate algorithms
  Top-k optimization
  Indexing (pre-computation, skipping)
  Sharing/parallelizing computation

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis (Ranking, Snippet, Comparison, Clustering, Correlation, Summarization); Future directions

Result Ranking /1
Types of ranking factors
Term Frequency (TF), Inverse Document Frequency (IDF)
  TF: the importance of a term in a document.
  IDF: the general importance of a term.
  Adaptation: a document -> a node (in a graph or tree) or a result.
Vector Space Model
  Represents queries and results using vectors.
  Each component is a term; the value is its weight (e.g., TFIDF).
  Score of a result: the similarity between the query vector and the result vector.

Result Ranking /2
Proximity-based ranking
  Proximity of keyword matches in a document can boost its ranking.
  Adaptation: weighted tree/graph size, total distance from root to each leaf, etc.
Authority-based ranking
  PageRank: nodes linked by many other important nodes are important.
  Adaptation:
    Authority may flow in both directions of an edge.
    Different types of edges in the data (e.g., entity-entity edge, entity-attribute edge) may be treated differently.

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis (Ranking, Snippet, Comparison, Clustering, Correlation, Summarization); Future directions

Result Snippets
Although ranking has been developed, no ranking scheme can be perfect in all cases.
Web search engines provide snippets.
Structured search results have a tree/graph structure, and traditional techniques do not apply.

Result Snippets on XML [Huang et al. SIGMOD 08]
Q: ICDE
[Figure: XML result tree rooted at conf (name: ICDE, year: 2010) with paper children (titles containing data, query; author country: USA).]
Input: keyword query, a query result.
Output: a self-contained, informative and concise snippet.
Snippet components:
  Keywords
  Key of result
  Entities in result
  Dominant features
The problem is proved NP-hard.
Heuristic algorithms were proposed.

Result Differentiation [Liu et al. VLDB 09]
Techniques like snippets and ranking help the user find relevant results.
50% of keyword searches are information exploration queries, which inherently have multiple relevant results [Broder, SIGIR 02].
  Web search: 50% navigation, 50% information exploration.
Users intend to investigate and compare multiple relevant results.
How to help the user compare relevant results?

Result Differentiation
Query: ICDE

[Figure: two results for the query "ICDE", both rooted at conf: ICDE 2000 and ICDE 2010, each with paper children (titles containing data, information, query; author country USA, affiliation Waterloo).]
Snippets are not designed to compare results:
  - both results have many papers about "data" and "query".
  - both results have many papers from authors from USA.

Result Differentiation
Query: ICDE

Feature Type | Result 1 | Result 2
conf: year | 2000 | 2010
paper: title | OLAP, data mining | cloud, scalability, search

Bank websites usually allow users to compare selected credit cards; however, only with a pre-defined feature set.
How to automatically generate good comparison tables efficiently?

Desiderata of Selected Feature Set
Concise: user-specified upper bound.
Good summary: features that do not summarize the results show useless and misleading differences.

Feature Type | Result 1 | Result 2
paper: title | network | query

(This conference has only a few "network" papers.)
Feature sets should maximize the Degree of Differentiation (DoD).

Feature Type | Result 1 | Result 2
conf: year | 2000 | 2010
paper: title | OLAP, data mining | cloud, scalability, search

DoD = 2

Result Differentiation Problem
Input: a set of results.
Output: selected features of the results, maximizing the differences.
The problem of generating the optimal comparison table is NP-hard.
  Weak local optimality: cannot improve by replacing one feature in one result.
  Strong local optimality: cannot improve by replacing any number of features in one result.
Efficient algorithms were developed to achieve these.

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis:

Ranking; Snippet; Comparison; Clustering; Correlation; Summarization
Future directions

Result Clustering
Results of a query may have several types.
Clustering these results helps the user quickly see all result types.
Related to GROUP BY in SQL; however, in keyword search,
  the user may not be able to specify the GROUP BY attributes;
  different results may have completely different attributes.

XBridge [Li et al. EDBT 10]
To help the user see result types, XBridge groups results based on the context of result roots.
E.g., for the query "keyword query processing", different types of papers can be distinguished by the path from the data root to the result root:
  bib / conference / paper
  bib / journal / paper
  bib / workshop / paper
Input: query results.
Output: ranked result clusters.

Ranking of Clusters
Ranking score of a cluster:
  Score(G, Q) = total score of the top-R results in G, where
  R = min(avg, |G|), and
  avg = the average number of results in all clusters.
This formula avoids giving too much benefit to large clusters.

Scoring Individual Results /1
Not all matches are equal in terms of content.
  TF(x) = 1
  Inverse element frequency: ief(x) = N / (number of nodes containing the token x)
  Weight(ni contains x) = log(ief(x))
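The cluster ranking score and the ief weight above can be sketched as follows. The function names, the toy scores, and the choice to round a fractional average down are assumptions made for this illustration; they are not taken from the XBridge implementation.

```python
import math
from typing import Dict, List, Sequence


def ief_weight(total_nodes: int, nodes_with_token: int) -> float:
    """Weight of a token match: log(ief(x)) with ief(x) = N / #nodes containing x."""
    return math.log(total_nodes / nodes_with_token)


def cluster_score(result_scores: Sequence[float], cluster_sizes: Sequence[int]) -> float:
    """Total score of the top-R results, with R = min(avg cluster size, |G|)."""
    avg = sum(cluster_sizes) / len(cluster_sizes)
    r = min(int(avg), len(result_scores))  # assumption: a fractional average is rounded down
    return sum(sorted(result_scores, reverse=True)[:r])


# Toy clusters keyed by result type; the values are per-result scores.
clusters: Dict[str, List[float]] = {
    "conference": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    "journal": [1.0, 0.2],
    "workshop": [0.5],
}
sizes = [len(scores) for scores in clusters.values()]
ranking = sorted(clusters, key=lambda c: cluster_score(clusters[c], sizes), reverse=True)
print(ranking)
```

With an average cluster size of 3, only the three best conference results count, so the large cluster cannot win simply by accumulating many low-scoring results.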

[Figure: result tree whose nodes match "keyword", "query" and "processing".]

Scoring Individual Results /2
Not all matches are equal in terms of structure.
  Result proximity is measured by the sum of the paths from the result root to each keyword node.
  The length of a path longer than the average XML depth is discounted, to avoid penalizing long paths too much.
[Figure: result tree with keyword nodes "query", "processing" and "keyword"; dist = 3.]

Scoring Individual Results /3
Favor tightly-coupled results.
  When calculating dist(), discount the shared path segments.
[Figure: a loosely coupled result next to a tightly coupled result.]
  - Computing the rank using actual results is expensive.
  - An efficient algorithm was proposed that utilizes offline-computed data statistics.

Describable Result Clustering [Liu and Chen, TODS 10] -- Query Ambiguity
Goal
  Query aware: each cluster corresponds to one possible semantics of the query.
  Describable: each cluster has a describable semantics.
Semantic interpretations of ambiguous queries are inferred from the different roles of query keywords (predicates, return nodes) in different results.
Q: "auction, seller, buyer, Tom"
  closed auction (seller: Bob, buyer: Mary, auctioneer: Tom, price: 149.24): find the seller and buyer of auctions whose auctioneer is Tom.
  closed auction (seller: Frank, buyer: Tom, auctioneer: Louis, price: 750.30): find the seller of auctions whose buyer is Tom.
  open auction (seller: Tom, buyer: Peter, auctioneer: Mark, price: 350.00): find the buyer of auctions whose seller is Tom.
Therefore, it first clusters the results according to the roles of keywords.

Describable Result Clustering [Liu and Chen, TODS 10] -- Controlling Granularity
How to further split the clusters if the user wants finer granularity?
Keywords in results in the same cluster have the same role, but they may still have different contexts (i.e., ancestor nodes).
Further cluster the results based on the context of query keywords, subject to the number of clusters and the balance of clusters.
[Figure: for the query "auction, seller, buyer, Tom", an open auction and a closed auction, each with seller, buyer, auctioneer and price children.]
This problem is NP-hard. It is solved by dynamic programming algorithms.

Roadmap
Motivation; Structural ambiguity; Keyword ambiguity; Evaluation; Query processing; Result analysis: Ranking; Snippet; Comparison;

Clustering; Correlation; Summarization
Future directions

Table Analysis [Zhou et al. EDBT 09]
In some application scenarios, a user may be interested in a group of tuples jointly matching a set of query keywords.
  E.g., which conferences have keyword search, cloud computing and data privacy papers?
  When and where can I go to experience pool, motorcycle and American food together?
Given a keyword query with a set of specified attributes,
  cluster tuples based on (subsets of) the specified attributes so that each cluster has all keywords covered;
  output results by clusters, along with the shared specified attribute values.

Table Analysis [Zhou et al. EDBT 09]
Input:
  Keywords: pool, motorcycle, American food
  Interesting attributes specified by the user: month, state
Goal: cluster tuples so that each cluster has the same value of month and/or state and contains the query keywords.
Output: (December, Texas); (*, Michigan)

Month | State | City | Event | Description
Dec | TX | Houston | US Open Pool | Best of 19, ranking
Dec | TX | Dallas | Cowboys dream run | Motorcycle, beer
Dec | TX | Austin | SPAM Museum party | Classical American food
Oct | MI | Detroit | Motorcycle Rallies | Tournament, round robin
Oct | MI | Flint | Michigan Pool Exhibition | Non-ranking, 2 days

Keyword Search in Text Cube [Ding et al. 10] -- Motivation
Shopping scenario: besides individual products, a user may be interested in the common features of the products for a query.
E.g., query "powerful laptop":

Brand | Model | CPU | OS | Description
Acer | AOA110 | 1.6GHz | Win 7 | lightweight powerful
Acer | AOA110 | 1.7GHz | Win 7 | powerful processor
ASUS | EEE PC | 1.7GHz | Win Vista | large disk

Desirable output:
  {Brand:Acer, Model:AOA110, CPU:*, OS:*} (first two laptops)
  {Brand:*, Model:*, CPU:1.7GHz, OS:*} (last two laptops)

Keyword Search in Text Cube: Problem Definition
Text cube: an extension of the data cube to include unstructured data.
  Each row of the DB is a set of attributes plus a text document.
  Each cell of a text cube is a set of aggregated documents based on certain attributes and values.
Keyword search on text cube problem:
  Input: DB, keyword query, minimum support.
  Output: top-k cells satisfying the minimum support, ranked by the average relevance of the documents satisfying the cell.
  Support of a cell: the number of documents that satisfy the cell.
  {Brand:Acer, Model:AOA110, CPU:*, OS:*} (first two laptops): SUPPORT = 2

Other Types of KWS Systems
Distributed databases, e.g., Kite [Sayyadian et al, ICDE 07], database selection [Yu et al. SIGMOD 07] [Vu et al, SIGMOD 08]
Cloud, e.g., key-value stores [Termehchy & Winslett, WWW 10]
Data streams, e.g., [Markowetz et al, SIGMOD 07]
Spatial DB, e.g., [Zhang et al, ICDE 09]
Workflow, e.g., [Liu et al. PVLDB 10]
Probabilistic DB, e.g., [Li et al, ICDE 11]
RDF, e.g., [Tran et al. ICDE 09]
Personalized keyword query, e.g., [Stefanidis et al, EDBT 10]

Future Research: Efficiency
Observations
  Efficiency is critical; however, it is very costly to process keyword search on graphs:
    results are dynamically generated;
    many problems are NP-hard.
Questions
  Cloud computing for keyword search on graphs?
  Utilizing materialized views / caches?
  Adaptive query processing?

Future Research: Searching Extracted Structured Data
Observations
  The majority of data on the Web is still unstructured.
  Structured data has many advantages in automatic processing.
  There are ongoing efforts in information extraction.
Question: searching extracted structured data
  Handling uncertainty in data?
  Handling noise in data?

Future Research: Combining Web and Structured Search
Observations
  Web search engines have a lot of data and user logs, which provide opportunities for good search quality.
Question: how to leverage Web search engines for improving search quality?
  Resolving keyword ambiguity
  Inferring search intentions
  Ranking results

Future Research: Searching Heterogeneous Data
Observations
  Vast amounts of structured, semi-structured and unstructured data co-exist.
Question: searching heterogeneous data
  Identify potential relationships across different types of data?
  Build an effective and efficient system?

Thank You!

References
Baid, A., Rae, I., Doan, A., and Naughton, J. F. (2010). Toward industrial-strength keyword search systems over relational data. In ICDE 2010, pages 717-720.

Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance oriented ranking. In ICDE, pages 517-528. Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching and Browsing in Databases using BANKS. In ICDE, pages 431-440. Chakrabarti, K., Chaudhuri, S., and Hwang, S.-W. (2004). Automatic Categorization of Query Results. In SIGMOD, pages 755-766 Chaudhuri, S. and Das, G. (2009). Keyword querying and Ranking in Databases. PVLDB 2(2): 16581659. Chaudhuri, S. and Kaushik, R. (2009). Extending autocompletion to tolerate errors. In SIGMOD, pages 707-718. Chen, L. J. and Papakonstantinou, Y. (2010). Supporting top-K keyword search in XML databases. In ICDE, pages 689-700. ICDE 2011 Tutorial 174 References /2

Chen, Y., Wang, W., Liu, Z., and Lin, X. (2009). Keyword search on structured and semi-structured data. In SIGMOD, pages 1005-1010. Cheng, T., Lauw, H. W., and Paparizos, S. (2010). Fuzzy matching of Web queries to structured data. In ICDE, pages 713-716. Chu, E., Baid, A., Chai, X., Doan, A., and Naughton, J. F. (2009). Combining keyword search and forms for ad hoc querying of databases. In SIGMOD, pages 349-360. Cohen, S., Mamou, J., Kanza, Y., and Sagiv, Y. (2003). XSEarch: A semantic search engine for XML. In VLDB, pages 45-56. Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data graphs. PVLDB, 1(1):1189-1204. Demidova, E., Zhou, X., and Nejdl, W. (2011). A Probabilistic Scheme for Keyword-Based Incremental Query Construction. TKDE, 2011. Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pages 836-845. Ding, B., Zhao, B., Lin, C. X., Han, J., and Zhai, C. (2010). TopCells: Keyword-based search of top-k aggregated documents in text cube. In ICDE, pages 381-384. ICDE 2011 Tutorial 175 References /3

Goldman, R., Shivakumar, N., Venkatasubramanian, S., and Garcia-Molina, H. (1998). Proximity search in databases. In VLDB, pages 26-37.
Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search over XML documents. In SIGMOD.
He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 305-316.
Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In VLDB.
Hristidis, V., Papakonstantinou, Y., and Balmin, A. (2003). Keyword proximity search on XML graphs. In ICDE, pages 367-378.
Huang, Y., Liu, Z., and Chen, Y. (2008). Query Biased Snippet Generation in XML Search. In SIGMOD.
Jayapandian, M. and Jagadish, H. V. (2008). Automated creation of a forms-based database query interface. PVLDB, 1(1):695-709.
Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.

References /4

Kashyap, A., Hristidis, V., and Petropoulos, M. (2010). FACeTOR: cost-driven exploration of faceted query results. In CIKM, pages 719-728.
Kasneci, G., Ramanath, M., Sozio, M., Suchanek, F. M., and Weikum, G. (2009). STAR: Steiner-Tree Approximation in Relationship Graphs. In ICDE, pages 868-879.
Kimelfeld, B., Sagiv, Y., and Weber, G. (2009). ExQueX: exploring and querying XML documents. In SIGMOD, pages 1103-1106.
Koutrika, G., Simitsis, A., and Ioannidis, Y. E. (2006). Précis: The Essence of a Query Answer. In ICDE, pages 69-78.
Koutrika, G., Zadeh, Z. M., and Garcia-Molina, H. (2009). Data Clouds: Summarizing Keyword Search Results over Structured Data. In EDBT.
Li, G., Ji, S., Li, C., and Feng, J. (2009). Efficient type-ahead search on relational data: a TASTIER approach. In SIGMOD, pages 695-706.
Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD.
Li, J., Liu, C., Zhou, R., and Wang, W. (2010). Suggestion of promising result types for XML keyword search. In EDBT, pages 561-572.

References /5
Li, J., Liu, C., Zhou, R., and Wang, W. (2011). Top-k Keyword Search over Probabilistic XML Data. In ICDE.
Li, W.-S., Candan, K. S., Vu, Q., and Agrawal, D. (2001). Retrieving and organizing web pages by "information unit". In WWW, pages 230-244.
Liu, Z. and Chen, Y. (2007). Identifying meaningful return information for XML keyword search. In SIGMOD, pages 329-340.
Liu, Z. and Chen, Y. (2008). Reasoning and identifying relevant matches for XML keyword search. PVLDB, 1(1):921-932.
Liu, Z. and Chen, Y. (2010). Return specification inference and result clustering for keyword search on XML. TODS, 35(2).
Liu, Z., Shao, Q., and Chen, Y. (2010). Searching Workflows with Hierarchical Views. PVLDB, 3(1):918-927.
Liu, Z., Sun, P., and Chen, Y. (2009). Structured Search Result Differentiation. PVLDB, 2(1):313-324.
Lu, Y., Wang, W., Li, J., and Liu, C. (2011). XClean: Providing Valid Spelling Suggestions for XML Keyword Queries. In ICDE.
Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational databases. In SIGMOD, pages 115-126.

References /6
Luo, Y., Wang, W., Lin, X., Zhou, X., Wang, J., and Li, K. (2011). SPARK2: Top-k Keyword Query in Relational Databases. TKDE.
Markowetz, A., Yang, Y., and Papadias, D. (2007). Keyword search on relational data streams. In SIGMOD, pages 605-616.
Markowetz, A., Yang, Y., and Papadias, D. (2009). Reachability Indexes for Relational Keyword Search. In ICDE, pages 1163-1166.
Nambiar, U. and Kambhampati, S. (2006). Answering Imprecise Queries over Autonomous Web Databases. In ICDE, page 45.
Nandi, A. and Jagadish, H. V. (2009). Qunits: queried units in database search. In CIDR.
Petkova, D., Croft, W. B., and Diao, Y. (2009). Refining Keyword Queries for XML Retrieval by Combining Content and Structure. In ECIR, pages 662-669.
Pu, K. Q. and Yu, X. (2008). Keyword query cleaning. PVLDB, 1(1):909-920.
Qin, L., Yu, J. X., and Chang, L. (2009). Keyword search in databases: the power of RDBMS. In SIGMOD, pages 681-694.
Qin, L., Yu, J. X., and Chang, L. (2010). Ten Thousand SQLs: Parallel Keyword Queries Computing. PVLDB, 3(1):58-69.

References /7
Qin, L., Yu, J. X., Chang, L., and Tao, Y. (2009). Querying Communities in Relational Databases. In ICDE, pages 724-735.
Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across heterogeneous relational databases. In ICDE, pages 346-355.
Stefanidis, K., Drosou, M., and Pitoura, E. (2010). PerK: personalized keyword search in relational databases through preferences. In EDBT, pages 585-596.
Sun, C., Chan, C.-Y., and Goenka, A. (2007). Multiway SLCA-based keyword search in XML data. In WWW.
Talukdar, P. P., Jacob, M., Mehmood, M. S., Crammer, K., Ives, Z. G., Pereira, F., and Guha, S. (2008). Learning to create data-integrating queries. PVLDB, 1(1):785-796.
Tao, Y. and Yu, J. X. (2009). Finding Frequent Co-occurring Terms in Relational Keyword Search. In EDBT.
Termehchy, A. and Winslett, M. (2009). Effective, design-independent XML keyword search. In CIKM, pages 107-116.
Termehchy, A. and Winslett, M. (2010). Keyword search over key-value stores. In WWW, pages 1193-1194.

References /8
Tran, T., Wang, H., Rudolph, S., and Cimiano, P. (2009). Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. In ICDE, pages 405-416.
Xin, D., He, Y., and Ganti, V. (2010). Keyword++: A Framework to Improve Keyword Search Over Entity Databases. PVLDB, 3(1):711-722.
Xu, Y. and Papakonstantinou, Y. (2005). Efficient keyword search for smallest LCAs in XML databases. In SIGMOD.
Xu, Y. and Papakonstantinou, Y. (2008). Efficient LCA based keyword search in XML data. In EDBT, pages 535-546.
Yu, B., Li, G., Sollins, K., and Tung, A. K. H. (2007). Effective Keyword-based Selection of Relational Databases. In SIGMOD.
Zhang, D., Chee, Y. M., Mondal, A., Tung, A. K. H., and Kitsuregawa, M. (2009). Keyword Search in Spatial Databases: Towards Searching by Document. In ICDE, pages 688-699.
Zhou, B. and Pei, J. (2009). Answering aggregate keyword queries on relational databases using minimal group-bys. In EDBT, pages 108-119.
Zhou, X., Zenz, G., Demidova, E., and Nejdl, W. (2007). SUITS: Constructing structured data from keywords. Technical report, L3S Research Center.
