Applying logic to practice in computer science Ron Fagin IBM ResearchAlmaden 1 First Case Study: Garlic (1996) 2 Garlic Laura Haas Mr. Mr. Database Database Theoretician,

Theoretician, weve weve got got aa problem problem with with Garlic, Garlic, our our multimedia multimedia database database system! system! What was the problem? Garlic Example databases: . . .

The answers to queries in DB/2 are sets The answers to queries in QBIC are sorted lists How do you combine the results? 3 Example Searching a CD database for Artist = Beatles yields a set, via, say DB/2 Musicbrainz has 12 million recordings in its DB 4 Example AlbumColor = Red yields a sorted list, via, say QBIC .697 .683

.670 .659 .629 Redness 5 Example How do we make sense of (Artist = Beatles) (AlbumColor = Red) ? (AlbumColor = Red) ? Here it is probably a list of albums by the Beatles, sorted by how red they are What about (Artist = Beatles) (AlbumColor = Red) ? (AlbumColor = Red) ?

And what about (Color = Red) (AlbumColor = Red) ? (Shape = Round) ? 6 What Was My Solution? These werent just sorted lists: they were scored lists Can view sets as scored lists (scores 0 or 1) This reminded me of fuzzy logic In fuzzy logic, conjunction ( (AlbumColor = Red) ?) is min, and disjunction ( (AlbumColor = Red) ?) is max

7 Use Use fuzzy fuzzy logic logic II like like your your solution. solution. But But we we also also need need an an efficient efficient algorithm algorithm that that can

can find find the the top top kk results results while while minimizing minimizing database database accesses accesses II have have an an algorithm algorithm that that finds finds the

the top top kk with with only only n n database database accesses accesses Laura Haas Ron Fagin Good, Good, that that beats beats linear! linear! But But we we database

database people people are are spoiled, spoiled, and and are are used used to to only only log log nn accesses. accesses. Be Be smarter smarter and and get get me me aa log log nn algorithm

algorithm II proved proved that that you you cant cant do do better better than than n n 8 Time for the Accesses Say n = 12,000,000 CDs Assume 1000 accesses per second

n accesses (nave algorithm) would take 3 hours n accesses would take 3 seconds 9 Generalizing the Algorithm The algorithm works for arbitrary monotone scoring functions increasing the scores of arguments cannot decrease the overall score 10 Influence Algorithm implemented in Garlic Influenced other IBM products, including Watson Bundled Search system InfoSphere Federation Server

WebSphere Commerce Paper introducing my algorithm (now called Fagins Algorithm) has around 900 citations (Google Scholar) 11 The Threshold Algorithm Amnon Lotem Moni Naor Ron Fagin In 2001, we found the Threshold Algorithm 12 The Problem

There are m attributes Each object in a database has a score xi for attribute i The objects are given in m sorted lists, one list per attribute Goal: Find the top k objects according to a monotone scoring function, while minimizing access to the lists Can think of the attributes as voters, and the objects as candidates, where each voter assigns a score to each candidate 13 Multimedia Example REDNESS 177: 0.993 139: 0.991 702: 0.982 ... 235: 0.325

... ROUNDNESS 235: 0.999 666: 0.996 820: 0.992 ... 177: 0.406 ... 14 Scoring Functions Let f be the scoring function Popular choices for f: min (used in fuzzy logic) average Let x1,, xm be the scores of object R under the

m attributes Then f(x1,, xm) is the overall score of object R Sometimes write f(R) to mean f(x1,, xm) A scoring function f is monotone if whenever xi yi for every i, then f(x1,, xm) f(y1,, ym) 15 Modes of Access Sorted (or sequential) access Can obtain the next object and its score for attribute i Random access Can obtain the score of object R for attribute i Wish to minimize total number of accesses 16

Algorithms Want an algorithm for finding the top k objects Nave algorithm retrieves every score of every object Too expensive 17 Threshold Algorithm Do sorted access in parallel to each of the m scored lists. As each object R is seen under sorted access: Do random access to retrieve all of its scores x1,, xm Compute its overall score f(x1,, xm) If this is one of the top k answers so far, remember it For each list i, let ti be the score of the last object seen under sorted access Define the threshold value T to be f(t1,, tm). When k objects have been seen whose overall score is at least T,

stop Return the top k answers 18 Threshold Algorithm: Example (using min) REDNESS 177: 0.993 ROUNDNESS 235: 0.999 Scoring function is min 19 Threshold Algorithm: Example (using min) REDNESS 177: 0.993

ROUNDNESS 235: 0.999 ... 177: 0.406 ... 20 Threshold Algorithm: Example (using min) REDNESS 177: 0.993 ROUNDNESS 235: 0.999 ... 235: 0.325 ...

... 177: 0.406 ... 21 Threshold Algorithm: Example (using min) REDNESS 177: 0.993 ROUNDNESS 235: 0.999 ... 235: 0.325 ... ...

177: 0.406 ... Overall score for 177: min(0.993, 0.406) = .406 Overall score for 235: min(0.325, 0.999) = .325 22 Threshold Algorithm: Example (using min) .993 REDNESS 177: 0.993 ROUNDNESS 235: 0.999 ... 235: 0.325

... ... 177: 0.406 ... Overall score for 177: min(0.993, 0.406) = .406 Overall score for 235: min(0.325, 0.999) = .325 Threshold value: min( 0.993 , 0.999 ) = .993 23 Threshold Algorithm: Example (using min) REDNESS 177: 0.993 139: 0.991 ROUNDNESS 235: 0.999 666: 0.996

... 235: 0.325 ... ... 177: 0.406 ... 24 Threshold Algorithm: Example (using min) .991 REDNESS 177: 0.993 139: 0.991

ROUNDNESS 235: 0.999 666: 0.996 ... 235: 0.325 ... ... 177: 0.406 ... Threshold value: min( 0.991 , 0.996 ) = .991 25 Threshold Algorithm: Example (using min) REDNESS 177: 0.993 139: 0.991

702: 0.982 ... 235: 0.325 ... ROUNDNESS 235: 0.999 666: 0.996 820: 0.992 ... 177: 0.406 ... 26 Threshold Algorithm: Example (using min) .982

REDNESS 177: 0.993 139: 0.991 702: 0.982 ... 235: 0.325 ... ROUNDNESS 235: 0.999 666: 0.996 820: 0.992 ... 177: 0.406 ... Threshold value: min( 0.982 , 0.992 ) = .982 27

Correctness of the Halting Rule Suppose the current top k objects have scores at least T (the current threshold). Assume (by way of contradiction): R unseen; S in current top k ; f(R)>f(S) R has scores x1,, xm xi ti for every i (as R has not been seen) f(R) = f(x1,, xm) f(t1,, tm) = T f(S) contradiction! 28 Instance Optimality A = class of algorithms, D = class of legal inputs. For AA and DD have cost(A,D) 0. An algorithm AA is instance optimal over A and D if there are constants c1 and c2 s.t. for every AA and D D

cost(A,D) c1 cost(A,D) c2. c1 is called the optimality ratio 29 Instance Optimality of TA Intuition about why TA is instance optimal: Cannot stop any sooner, since the next object to be explored might have the threshold value. But, life is a bit more delicate... 30 Wild Guesses Wild guesses: random access for a field i of object R that has not been sequentially accessed before Neither FA nor TA use wild guesses Subsystem might not allow wild guesses

31 Instance Optimality of TA Theorem: For each monotone f let A be the class of algorithms that correctly find top k answers, with scoring function f, for every database. Do not make wild guesses. D be the class of all databases. Then TA is instance optimal over A and D. Optimality ratio is m+m(m-1) cR/cS - best possible! 32 Our Our threshold threshold algorithm algorithm isis an

an even even better better algorithm algorithm (optimal (optimal in in aa stronger stronger sense) sense) Amnon Lotem Moni Naor Ron Fagin But But Ron, Ron, you you told

told me me that that your your algorithm algorithm isis optimal!? optimal!? Laura Haas Well, Well, Laura, Laura, there there isis optimal, optimal, and and then then there there isis optimal optimal

33 Influence We submitted the paper to PODS 01 I was worried that the Threshold Algorithm was so simple that the paper would be rejected So I called it a remarkably simple algorithm The paper won the PODS Best Paper Award! The paper was very influential Over 1900 citations (Google Scholar) PODS Test of Time Award in 2011 IEEE Technical Achievement Award in 2011

Gdel Prize in 2014 Gems of PODS 2016 34 Applications of TA

relational databases multimedia databases music databases semistructured databases text databases uncertain databases probabilistic databases graph databases spatial databases spatio-temporal databases web-accessible databases XML data web text data semantic web high-dimensional datasets information retrieval

fuzzy data sets data streams search auctions wireless sensor networks distributed sensor networks distributed networks social-tagging networks document tagging systems peer-to-peer systems recommender systems personal information management systems group recommendation systems document annotation

35 Morals How did theory help? Resolving Laura Haass dilemma Knowledge of the literature (fuzzy logic)

Abstraction (using scoring functions) Devising optimal algorithms and proving optimality Figure out the real problem For example, there are scores, not just sorted lists Dont stop at original problem Example: doing a weighted version (with Ed Wimmers) Led to a successful and influential body of work 36 Measures of Success Making our products better An ultimate measure of success for practitioners Creating a new subfield An ultimate measure of success for theoreticians

A paper that arose by resolving a practical problem won the Gdel Prize! 37 Second Case Study: Clio (2003) 38 Clio Clio deals with data exchange, where we convert data from one format to another When Laura Haas started Clio, I followed her I attended Clio meetings for a year Phokion Kolaitis

Renee Miller Lucian Popa Ron Fagin Lets Lets start start from from scratch scratch and and lay lay the the foundations foundations for for data data exchange! exchange!

39 Data Exchange Translate data from source format to target format S Source Schema I T Target Schema J 40

Data Exchange Data exchange is an old, but recurrent, database problem Phil Bernstein2003 Data exchange is the oldest database problem EXPRESS: IBM San Jose Research Lab1977 Transforms data between hierarchical databases Data exchange underlies: Data warehousing, ETL (Extract-Transform-Load), 41 Example Source EMP MGR

Target EMP DEPT DEPT MGR Relationship between source and target (the schema mapping) specified by tuplegenerating dependencies (tgds) Originally used to help specify normal forms for relational databases EM(e,m) d (ED(e,d) (AlbumColor = Red) ? DM(d,m)) 42 Example Source

EMP Fagin MGR Haas Clarkson Haas Haas Welser Target EMP DEPT

DEPT MGR 43 Example 3 Possible Solutions Source Target EMP Fagin MGR Haas EMP Fagin

DEPT Haas DEPT Haas MGR Haas Clarkson Haas Clarkson Haas Welser

Welser Haas Welser Haas Welser EMP Fagin DEPT d1 DEPT d1

MGR Haas Clarkson d1 d2 Welser Haas d2 EMP Fagin

DEPT d1 DEPT d1 MGR Haas Clarkson d2 d2 Haas Haas

d3 d3 Welser 44 Which Solution Should We Produce? We define a universal solution to be one as general as possible Third solution is universal 45 Target Constraints Might have target constraints specified by equality-generating dependencies (egds), like DM(d,m) (AlbumColor = Red) ? DM(d',m)) (d = d')

If this egd is a target constraint, then second solution is universal 46 How Do We Obtain a Universal Solution? There is a well-known mechanical procedure called the chase, originally used as a tool in database design We use the chase to generate the target from the source efficiently Example: EM(e,m) d (ED(e,d) (AlbumColor = Red) ? DM(d,m)) From EM(Fagin, Haas), create ED(Fagin, d) and DM(d, Haas), where d is a newly introduced labeled null The egds tell when to equate labeled nulls 47

Composing Schema Mappings S12 Schema S1 S23 Schema S2 Schema S3 S13 With Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan, we studied composition of schema mappings Composition can take us out of first-order logic! We found the right language for composition (secondorder tgds) We gave an algorithm for composition

48 Second-order tgds An example of an SO tgd: f em (EM(e,m) (ED(e, f (m)) (AlbumColor = Red) ? DM(f (m),m)) An SO tgd is a formula of the form f ( x1 (1 1) (AlbumColor = Red) ? (AlbumColor = Red) ? xk (k k) ) where (a) each i is a conjunction of atomic formulas not involving function symbols, and possibly equalities of terms, and (b) each i is a conjunction of atomic formulas, not including equality. 49 Second-order tgds (cont.) Recall the existential second-order logic = NP Therefore, we might suspect that the following

theorem holds: Theorem: There is a second-order tgd such that deciding if (I,J) is an NP-complete problem is an NP-complete problem Proof: Let be f (E(x,y) D(f(x),f(y))). Let D = {(r,g), (r,b), (g,b), (g,r), (b,r), (b,g)}. Then says that E is 3-colorable. 50 Second-order tgds (cont.) SO tgds are the right language for composing FO tgds, in the following sense: 1. The composition of any number of FO tgds gives an SO tgd. 2. Every SO tgd is the result of composing some finite number of FO tgds. 3. In fact, in joint work with Marcelo Arenas and Alan Nash, we showed that surprisingly, every SO tgd is the result of composing just two FO

tgds. 51 Measures of Success Used in DB2 Control Center, Rational Data Architect, and Content Manager Using universal solutions Using our algorithm to produce a universal solution, and our algorithm to compose schema mappings Our initial paper won the International Conference on Database Theory Test of Time Award in 2013. With over 1100 citations, our paper was the 2nd most highly cited paper of the decade in the journal TCS Our paper on composition won the PODS Test of Time Award in 2014, and our follow-up paper on composition won the ICDT 2010 Best Paper Award This work created a new subfield

Special sessions on data exchange in major db conferences 52 Morals How did theory help? Established principles rather than ad hoc approaches Yielded algorithms for converting data, and for composing schema mappings Theorists need a partner to keep us honest Never too late to lay the foundations for an area, even for existing systems Can cause essential changes and improvement Again, dont stop at original problem 53 Conclusions

54 Conclusions (for System Builders) Consult with theoreticians Explaining the problem is useful by itself Principled approaches can improve your product Better or new algorithms can differentiate your product Algorithm analysis can provide performance expectations and provide product guarantees Abstractions can expand the function of your product 55 Conclusions (for Theoreticians)

Involvement with system builders can help your theory! Novel questions will be asked New models and new, interesting areas of study will arise Implementation can reveal weaknesses in the theory Theory will be relevant Practical impact! 56