Overlap Set Similarity Join with Theoretical Guarantee Dong Deng, Yufei Tao, Guoliang Li 1 Overlap Set Similarity Join Examples Find all the user pairs sharing at least c friends Find all the word pairs co-occurrent in at least c documents Find all the item pairs co-purchased in at least c transactions Challenge 2 ! for = 1 million, = 1 trillion!! Many Applications Data Mining and Data Management Data Integration Data Cleaning Frequent Pattern Mining Recommendation Machine Learning Large Entries Retrieval on Matrix Productions
Non-negative Matrix Factorization Singular Value Decomposition Word Embedding Scene Reconstruction 4 Overlap Set Similarity Join Input: (i) a collection of sets; (ii) a constant threshold c Output: all set pairs with overlap size no smaller than c Bag-of-Words: {9th, Street, WI} Keywords: {4b2b, house, garage} any data type that can be abstracted as sets Contribution: first sub-quadratic algorithm whenever possible
R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R3 = {e2, e4, e5, e6} R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R = {e , e , e , e , e threshold c=2 e ,e } Input 5 13 2 7 8 9 10 |R1R2| c
|R3R4| c |R4R5| c (R1 , R2 ) (R3 , R4 ) (R4 , R5 ) , e11, e12, 14 Output Size-Aware Algorithm threshold c = 2 Small R1 = {e1, e2, e3} R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R2 = {e1, e3, e4,
e 7} R3 = {e2, e4, e5, e6} R3 = {e2, e4, e5, e 6} size e1e3 e1e4 e1e7 e3e4 e3e7 e42e74 e2e5 e2e6 e4e5 e4e6 e5e6 (R1 , R2) Output all set pairs sharing a common subset of size c x Large e1e2 e1e3 e2e3 (R4 , R3) (R4 , R5)
#< R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R5 = {e2, e7, e8, e9, e10, e11, e12, e13, e14} where n is the total size of all sets Hash Table R1 = {e1, e2, e3} R2 = {e1, e3, e4, e7} R3 = {e2, e4, e5, e6} R4 = {e2, e4, e5, e6, e8, e9, e10, e11} R5 = {e2, e7, e8, e9, e10, e11, e12, e13, e14} Build a hash table for each large set, probe it with every other set to get all the results 6 Size Boundary Selection small sets
1 goes up smoothly first, and then rapidly time cost c: threshold x: size boundary 1 large sets goes down rapidly first, and then smoothly size boundary x Increase x little by little: Benefit: the decrease of the time spend on large sets Cost: the increase of the time spend on small sets Skip Unique c-subsets (i.e., subset of size c)
Observation 1: Unique c-subsets cannot generate any result and we can skip them 8 Skip Redundant c-subsets Observation 2: Redundant c-subsets only generate duplicate results and we can skip them 9 Experiments C++, Ubuntu Server, Single Thread, CPU E5-2620 2.10 GHz 10 2000 Elapsed Time(s) Elapsed Time(s) Evaluating the Size Boundary Selection c=4
c=6 c=8 c=10 c=12 1500 1000 500 0 1 10 100 Size Boundary x first goes down rapidly and then goes down smoothly 1000 c=4 c=6
c=8 c=10 c=12 800 600 400 200 0 1 10 100 Size Boundary x first goes up smoothly and then goes up rapidly 11 Evaluating the Size Boundary Selection the boundaries we selected has roughly the
same performance than that of the optimal one # of enumed c-subsets Evaluating the Heap-based Methods 10 14 1012 10 10 10 naive heapskip heapdedup blockskip blockdedup 8 106
4 6 8 10 12 Threshold c reduced the # of enumerated c-subsets by up to 4 orders 13 Comparing with Existing Methods - Scalability Practical Performance Elapsed Time (s) 30000 20000 Our algorithm