# Data Integration Essentials: The Entity Matching and ...

Overlap Set Similarity Join with Theoretical Guarantee Dong Deng, Yufei Tao, Guoliang Li 1 Overlap Set Similarity Join Examples Find all the user pairs sharing at least c friends Find all the word pairs co-occurrent in at least c documents Find all the item pairs co-purchased in at least c transactions Challenge 2 ! for = 1 million, = 1 trillion!! Many Applications Data Mining and Data Management Data Integration Data Cleaning Frequent Pattern Mining Recommendation Machine Learning Large Entries Retrieval on Matrix Productions

Non-negative Matrix Factorization Singular Value Decomposition Word Embedding Scene Reconstruction 4 Overlap Set Similarity Join Input: (i) a collection of sets; (ii) a constant threshold c Output: all set pairs with overlap size no smaller than c Bag-of-Words: {9th, Street, WI} Keywords: {4b2b, house, garage} any data type that can be abstracted as sets Contribution: first sub-quadratic algorithm whenever possible

R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R3 = {e2, e4, e5, e6} R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R = {e , e , e , e , e threshold c=2 e ,e } Input 5 13 2 7 8 9 10 |R1R2| c

|R3R4| c |R4R5| c (R1 , R2 ) (R3 , R4 ) (R4 , R5 ) , e11, e12, 14 Output Size-Aware Algorithm threshold c = 2 Small R1 = {e1, e2, e3} R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R2 = {e1, e3, e4,

e 7} R3 = {e2, e4, e5, e6} R3 = {e2, e4, e5, e 6} size e1e3 e1e4 e1e7 e3e4 e3e7 e42e74 e2e5 e2e6 e4e5 e4e6 e5e6 (R1 , R2) Output all set pairs sharing a common subset of size c x Large e1e2 e1e3 e2e3 (R4 , R3) (R4 , R5)

#< R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R5 = {e2, e7, e8, e9, e10, e11, e12, e13, e14} where n is the total size of all sets Hash Table R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R3 = {e2, e4, e5, e6} R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R5 = {e2, e7, e8, e9, e10, e11, e12, e13, e14} Build a hash table for each large set, probe it with every other set to get all the results 6 Size Boundary Selection small sets

1 goes up smoothly first, and then rapidly time cost c: threshold x: size boundary 1 large sets goes down rapidly first, and then smoothly size boundary x Increase x little by little: Benefit: the decrease of the time spend on large sets Cost: the increase of the time spend on small sets Skip Unique c-subsets (i.e., subset of size c)

Observation 1: Unique c-subsets cannot generate any result and we can skip them 8 Skip Redundant c-subsets Observation 2: Redundant c-subsets only generate duplicate results and we can skip them 9 Experiments C++, Ubuntu Server, Single Thread, CPU E5-2620 2.10 GHz 10 2000 Elapsed Time(s) Elapsed Time(s) Evaluating the Size Boundary Selection c=4

c=6 c=8 c=10 c=12 1500 1000 500 0 1 10 100 Size Boundary x first goes down rapidly and then goes down smoothly 1000 c=4 c=6

c=8 c=10 c=12 800 600 400 200 0 1 10 100 Size Boundary x first goes up smoothly and then goes up rapidly 11 Evaluating the Size Boundary Selection the boundaries we selected has roughly the

same performance than that of the optimal one # of enumed c-subsets Evaluating the Heap-based Methods 10 14 1012 10 10 10 naive heapskip heapdedup blockskip blockdedup 8 106

4 6 8 10 12 Threshold c reduced the # of enumerated c-subsets by up to 4 orders 13 Comparing with Existing Methods - Scalability Practical Performance Elapsed Time (s) 30000 20000 Our algorithm

Our algorithm Existing methods: DBLP (c=6) scancount divide adaptjoin all-pairs sizeaware 6.6 hrs x 9.82 times 4.5 hrs x 13.27 times 2.3 hrs 10000 0 Time Complexity:

1.3 hrs 9 mins 1m 1.5m 2m 2.5m Dataset Sizes x 9.15 times x 11.68 times x 5.34 times 3m when dataset size increases by 3 times n: input size k: output size Comparing with Existing Methods - Thresholds 6000 scancount divide adaptjoin all-pairs

sizeaware 3000 Elapsed Time (s) Elapsed Time (s) 4000 2000 1000 0 4 6 8 10 12 Threshold c

DBLP (1 million sets) improves by up to 1 order scancount divide adaptjoin all-pairs sizeaware 5000 4000 3000 2000 1000 0 2 3 4 5 Threshold c

6 Address (1 million sets) even if all sets have the same size 15 Conclusion Overlap set similarity join The first sub-quadratic algorithm whenever that is possible Practical size boundary selection method Heap-based methods to skip many unnecessary c-subsets Support both self-join and two-relation-join 16 THANKS ! Q&A

## Recently Viewed Presentations

• Supergroup; Multidisciplinary Agricultural and Biological Sciences (all) Life Sciences. Arts and Humanities (all) Social Sciences. Biochemistry, Genetics and Molecular Biology (all) Life Sciences. Business, Management and Accounting (all) Social Sciences. Chemical Engineering (all) Physical ...
• CATATAN. Saatinilandscape media sudahbergesersecara dramatis denganmunculnya media-media baru, terutama yang berbasisteknologitinggi (Internet dan mobile phone),
• Calendar and program(and overrides) How to change the content. Course content is the responsibility of the department, however the calendar process involves the department, EUGO, the library, the registrar`s office and the Senate.
• CHAPTER 2 TYPES OF EVIDENCE ... Coral Eugene Watts Read Case Study 2.2 - Ronald Cotton What type of evidence did the courts have? Mark the evidence as testimonial or physical. Was the police sketch accurate? Conviction - correct/incorrect? Why?
• Research made Easy . with ONE Access & NC Wise Owl. How to find credible resources from the Charlotte-Mecklenburg Public Library & NC Wise Owl.
• Sundiata Feature Menu Introducing the Epic Literary Focus: Epic Reading Skills: Making Inferences from Details * * * * * * * * * * * * * * * * * Sundiata An Epic of Old Mali Dogon village...
• Used by DOD users & DOD Contractors. Used by Public/Non-DOD users. DLA Employees use the DLA EBSInternal Portal. Provides external partners with a single point of access to DLA business applications
• This video is available on the Bright Futures Web site and here: ... families, but we recognize that you will want to tailor them to suit the needs and circumstances of the children and families in your clinical setting. For...