SQL Text Mining - WordPress.com

SQL Text Mining - WordPress.com

SQL Text Mining Vik Singh w/ Jim Gray, Mark Manasse (BARC eScience) MSR SVC Mountain View 8/23/2006 Road Map Motivations Methodology & Algorithms Experiments & Analysis Application Future Work 1 Motivations (1): SDSS SkyServer Web and SQL Traffic 1.E+05

W eb SQL Expon. (W eb) 1.E+04 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 M onth SQL Traffic (Rows/Month) 1.E+10 1.E+09 Rows Schema redesign, caching, query recommendation, popular templates, segment users

1.E+06 H its Multi-TB Astronomy Archive 5 Yrs, 178M Hits, 77M Views, 1M Unique IP's 20M SQL Queries (9.8M Unique) SQL access to telescope data Wish to categorize users SQL queries 1.E+07 1.E+08 1.E+07 1.E+06 1 3

5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Month 2 (2): Not much prior work Could not find research showing how to characterize SQL But many sites and databases maintain query logs Fortunately there is related work NLP, IR, Machine Learning 3 Methodology Use unsupervised learning (K-Means) Cluster centers give us query templates Cluster sizes tell us popularity Do term analysis over these segments More interesting than total aggregate term stats

Can isolate types of users (Bots v. Mortals) 4 K-Means Algorithm def KMeans(k, KMeans docs): clusters = InitialClusterCenters(k, InitialClusterCenters docs) while true: change = AssignToClusters(docs, clusters) AssignToClusters if change: clusters = RecomputeClusterCenters(k, RecomputeClusterCenters docs) else: else break return docs, clusters 5 3 Key Factors to Clustering 1. Distance function 2. Choosing K

Well undershoot K Then break these clusters into tighter ones 3. Choosing the initial centroids Not covered in this talk, but we have some cool tricks For now assume traditional approaches (BuckShot, Random) 6 1st Step: Distance Function How syntactically similar are two queries? Not necessarily functionally similar SQL term order Column Fields and WHERE conditionals can be reordered & represent the same query Solution: Compare token combinations N-Grams (or Shingles)

7 We also need to clean the SQL Templatize SQL queries Remove unnecessary, uncommon features 24 REGEX cleaners and substitutions Ex. Substitute STRING, NUM, COMPARE Goal here is to MAXIMIZE similarity 8 Before & After SELECT s.ra, s. s.dec FROM #upload u, SpecObj s WHERE u.up_plate=s.plate and u.up_mjd=s.mjd and u.up_fiber=s.fiberid select p.objID, rc.name as rc3_name, s.name as stetson_name, stetson_name p.ra, p.dec, ph.name as type, p.u, p.g, p.r, p.i, p.z, o.distance from (((PhotoPrimary p inner join PhotoType ph ((( on p.type = ph.value) left join RC3 rc on p.objid = rc.objid) left join Stetson s on p.objid = s.objid), dbo.fGetNearbyObjEq(180,-0.5,3) o where o.objid = p.objid and p.type = ph.value order by o.distance select ra dec from temp specobj where up_plate compare plate logic up_mjd

compare mjd logic up_fiber compare fiberid select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value left join rc3 on objid compare objid left join stetson on objid compare objid fgetnearbyobjeq where objid compare objid logic type compare value orderby distance 9 Feature Vector What sized N-Grams should we use? Unigrams & Bigrams most common Any more than Tri-Grams usually results in worse clusters But this assumes unstructured text We have a highly constrained language And we want to capture Joins select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value (Need at least size 8-grams here) At the same time we want good results too consistent with the literature So bias smaller grams since they more likely to

occur 10 Feature Strategy Use Em All Generate all 1 8 sized N-grams for a query Sort the tokens within an N-gram Why? Increases similarity matches, decreases # N-grams (better for memory) SQL is highly constrained - unlikely the terms reordered represent a different style query Larger N-gram matches are unlikely, should be rewarded similarity if they are within the same N-distance neighborhood of terms Jaccards Similarity Measure |Intersection(Q1_n, Q2_n)| / |Union(Q1_n, Q2_n)| Intersection Union Compute the Jaccard for each N-gram set separately, then take a weighted Fibonacci mean favoring smaller grams Since there can be duplicate terms, we append each N-gram with a rolling index

Multi-Set Jaccard 11 Ex. Distance between 2 Queries Query A: select objid ra dec r z from galaxy specobj B: select objid ra dec from galaxy fgetnearbyobjeq where objid compare objid For simplicity, just up to size 3 Not Shown Sort tokens with N-Gram ra_dec_from => dec_from_ra Index repeated N-grams within same set size Ex. objid_1, objid_2

Unigram s select objid ra dec r z from galaxy specobj Bigrams Trigrams select_objid objid_ra ra_dec dec_r r_z z_from from_galaxy galaxy_speco bj

select_objid_ra objid_ra_dec ra_dec_r dec_r_z r_z_from z_from_galaxy B select objid ra dec rrom galaxy fgetnearb yobjeq where objid compare objid select_objid objid_ra ra_dec

dec_from from_galaxy fgetnearbyob jeq_where where_objid objid_compar e compare_obji d select_objid_r objid_ra_dec ra_dec_from dec_from_galaxy from_galaxy_fgetne arbyobjeq galaxy_fgetnearbyo bjeq_where fgetnearbyobjeq_wh ere_obji where_objid_compar e objid_compare_objid Jaccard

6/14 = 0.43 4/13 = 0.31 2/12 = 0.17 A Distance = (3*(0.43)+2*(0.31)+1*(0.17)) / 6 = 0.35 12 But .. this still wont scale for 20M Were producing 1 8 grams each Producing close to 1000+ grams per query This is like full-scale document clustering! 8 Jaccards over strings in each comparison step Piping results in and out of SQL O(N^2) clustering algorithm

Only have 3 months & a single machine 13 But there are only 80K templates! First, we only need to cluster the distinct queries (9.8M) 9.8M Second, we found MANY queries reduce down to a common template after cleaning Number of unique query templates ~ 80K Over 99.5% reduction! Filter out the queries which result in errors Brings it down to ~77K Clustering these ~77K templates is equivalent to clustering all 20M! We maintain the query to template mappings and bookkeep template counts Just factor the counts at the end to scale the solution for 20M 14 Lets Cluster IronPython + SQL Server 2005 Didn't use Data Mining package in SQL Server

Its Clustering optimized for Euclidean distances K = 24 (based on previous work at JHU) Then search clusters for tighter groups Within 70% similarity and of size >= 10 Total clusters: 194 Computation Time: ~22 Hrs 15 Iterations 15 Ex. A Popular Cluster Found Query Template Count select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num3100261 1 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 30 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic z compare num logic num

1 select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 17 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num select cast objid ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 6 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 6 93 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 4 select cast objid type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 71 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num

10 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 9 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 3 20 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic i compare num logic num select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 1 select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 2 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num

101 select cast objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 1 select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic g compare num logic num 3 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 9 select top num objid run rerun camcol field obj type ra dec u g r i z colc rowc from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic type compare num 5 Represents 18% of all the SQL queries (3,100,655 hits) Sum of Square Errors / Cluster Size = 3.58006287097031E-06 0.061 variance omitting top template from cluster

Tiny variance among queries 16 And since we collect N-grams We can find the common phrases within a cluster Ex. Top 15 Most Popular Eight-Grams 17 Two Example Applications 1. Bot Detection 2. Query Recommendation 18 App 1: Bot Detection Belief: Little or no variance and large cluster sizes correspond to Bot agents Can isolate bot from human SQL traffic Useful since we dont have the user-agent

strings in the SQL logs 19 App 1: Bot Detection Significant # of EXACT query template matches Top 10 Query Templates Query Template Count select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261 select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 2732991 select count * from photoprimary where htmid compare num logic htmid compare num 2188913 select rowc_g colc_g from photoprimary where objid compare num 900055 select rowc_r colc_r from photoprimary where objid compare num

814884 select objid distance ra dec fphototypen type psfmag_u psfmagerr_u psfmag_g psfmagerr_g psfmag_r psfmagerr_r psfmag_i psfmagerr_i psfmag_z psfmagerr_z dered_u dered_g dered_r d 805539 select objid type flags ra dec r petromag_r isoa_r isob_r isophi_r isophierr_r distance from galaxy fgetnearbyobjeq where objid compare objid 570028 select top num weblink cast objid weblink cast objid weblink run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compa 448722 select top num ra dec u g r i z petror50_r petror90_r expphi_r expab_r nchild flags lnlexp_r lnldev_r from fgetnearbyobjeq galaxy where objid compare objid logic petromag_z compare num lo 307124 select extinction_u extinction_g extinction_r extinction_i extinction_z from photoobjall where objid compare num 262222 20 130.167.109.1 130.167.109.1 130.167.109.1

130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1 130.167.109.1

130.167.130.1 130.167.130.1 130.167.130.1 130.167.130.1 130.167.130.1 Check the logs for the top template BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3

BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR3 BESTDR2 BESTDR2 BESTDR2 BESTDR2 BESTDR2 Sample of SQL log records matching the template Even have the same # of whitespaces & case Just diff numbers being passed into the TVFs Steady rate: rate 2x hour for weeks at a time Product usage pattern (All DR3, then all DR2) All from the same user IP (NASA)

2nd highest # queries on our top orgs list Smells like a bot IP Organization 130.167.109.1 National Aeronautics and Space Administration DNS State Country RA.STSCI.EDU AL US WebHits SqlHits PageViews 6948077 3114078 4662910 21 App 2: Query Recommendation Since we have a distance function we can even return similar queries New users, students, visitors wishing to query the sky

But theres a learning curve SQL (writing 3-way spatial query joins not easy) Schemas Optimizing & Debugging Because queries are quite repetitive, why not suggest known correct ones to the user Spelling Correction! 22 App 2: Example Bad User Query: SELECT TOP 10 ph.ra,ph.dec,str(ph.g - ph.r,11 ?) as color,ISNULL(s.bestObjId, 0) as bestObjId, 'ugri' FROM #x x, #upload up, BESTDR2..PhotoObjAll as ph LEFT OUTER JOIN ? SpecObjAll s ON ph.objID = s.bestObjID WHERE (ph.type=3 OR ?) AND up.up_id = x.up_id ? x.objID=p ?.objID ORDER BY x.up_id => Its Cleaned representation: select top num ra dec str isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic logic up_id compare up_id objid compare objid orderby up_id 23

an App 2: Similar Query Results Template Ex. Query Similarity select top num ra dec str g r isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic type compare num logic up_id compare up_id logic objid compare objid orderby up_id SELECT TOP 50 p.ra,p.dec,str(p.g - p.r,11,8 ,8) as grModelColor,ISNULL(s.bestObjID,0) as bestObjID, 'ugri' as filter FROM #x x, #upload u, BESTDR2..PhotoObjAll as p LEFT OUTER JOIN BESTDR2.. ..SpecObjAll s ON p.objID = s.bestObjID WHERE ( p.type = 3 OR p.type = 6)

6 AND u.up_id = x.up_id AND x.objID=p.objID ORDER BY x.up_id 74% select top num ra dec ra dec colc rowc isnull z from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic type compare num logic up_id compare up_id logic objid compare objid orderby up_id SELECT TOP 50 p.ra,p.dec,p.ra,p. [dec],p.colc,p.rowc,ISNULL(s.z,0) as z, 'g' as filter FROM #x x, #upload u, BESTDR2..PhotoObjAll as p LEFT OUTER JOIN BESTDR2..SpecObjAll s ON p.objID = s.bestObjID WHERE ( p.type = 3 OR p.type = 6) AND u.up_id = x.up_id AND x.objID=p.objID ORDER BY x.up_id 66% select top num ra dec run rerun SELECT TOP 50

camcol field obj isnull ra num p.ra,p.dec,p.run,p.rerun,p.camCol,p.field,p.obj,ISNULL( isnull dec from temp temp s.ra,0) as ra,ISNULL(s.[dec],0) as [dec], 'ugriz' as filter photoobjall left outer join FROM #x x, #upload u, BESTDR2..PhotoObjAll as p specobjall on objid compare LEFT OUTER JOIN BESTDR2..SpecObjAll s ON p.objID = bestobjid where type compare s.bestObjID WHERE ( p.type = 3 OR p.type = 6) AND num logic type compare num u.up_id = x.up_id AND x.objID=p.objID ORDER BY logic up_id compare up_id logic x.up_id objid compare objid orderby up_id computed quickly by comparing only against queries within the same length 60% neighborho 24

Conclusion Not discussed Optimizations, use of SQL Server, how we broke down clusters (Finesse Algorithm), initializing centers, defining a center (Clustroid), theoretical properties, weighting scheme Future Work More cleaning, using SQL parse trees, other features representations, better n-grams (wrapping edges), optimizations (min-hashing) Findings We found that queries in cleaned representation fall into a small group of clusters (20M => 77K) Queries follow very repetitive template patterns, enabling us to effectively find bots, query outliers, and do query recommendation 25 References 1. SkyServer Site Logs: http://skyserver.sdss.org/log/

2. Sloan Digital Sky Survey SkyServer projects website: http://skyserver.sdss.org/ 3. G. Abdulla, Analysis of SDSS SQL server log files, UCRL-MI-215756-DRAFT. Lawrence Livermore National Laboratory, 2005 4. T. Malik, R. Burns, A. Chaudhary. Bypass Caching: Making Scientific Databases Good Network Citizens. In Proceedings of the 21st International Conference on Data Engineering, 2005. 5. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig: Syntactic Clustering of the Web. Computer Networks 29(8-13): 1157-1166 (1997) 6. Fast and effective text mining using linear-time document clustering B Larsen, C Aone - Proceedings of the fifth ACM SIGKDD international conference , 1999 - portal.acm.org 7. Web mining research: a survey R Kosala, H Blockeel - ACM SIGKDD Explorations Newsletter, 2000 - portal.acm.org 8. Mehran Sahami, Timothy D. Heilman: A web-based kernel function for measuring the similarity of short text snippets. WWW 2006: 377-386 Plan to publish our web and SQL traffic research in a MSR TR as well as make the database, docs, slides & code available at http://skyserver.wordpress.com this week 26 QA 27

Gives us another optimization Current Design Inefficient Many Jaccards over very large sets of strings Lets use hashing to speed up Jaccard and enable us to store more features in memory Hash(Template) => [Hashes(1-Grams), Hashes(2Grams) ] We can use Pythons 32-bit String hash function But what about collisions? Not many unique N-grams SQL & Schema vocabulary small What about hashing query templates? Birthday Problem We can expect collisions with probability greater than a 50% after hashing: + sqrt(1/4 + 2*2^32 * ln(2)) = ~77K Clustering Primarily two flavors of clustering 1. Expectation-Maximization (EM) 2. K-Means

Dont want EM Model-based clustering for unobservable data K-Means is actually just a special case of EM Clusters modelled by spherical Gaussian distributions Each data item is assigned to one cluster The mixture weights are equal Now how do we compare two sets of N-grams? -- TODO Jaccards Measure Cardinality of the intersection of the two sets divided by the cardinality of the union of the two sets Traditional, established metric for similarity

Captures Euclidean and Cosine properties Since sets remove duplicates, we index ngrams with numbers to denote repetition Multi-Set Jaccards Measure (3): How well can we classify SQL? Motivation for converting unstructured to structured text Can optimize a common layer for SQL style representation Reusable ML development framework Although translating to SQL is not trivial Has nice properties Highly constrained, clean vocabulary, relational, propositional, parseable But we have 8 sets of n-grams -TODO We compute the Jaccard separately for each size n-gram Not a good idea to mix sizes (lose independence) Wish to prioritize smaller grams for better results and to maximize similarity

Since the Jaccard of smaller grams will result in a higher similarity value than larger grams Prioritize the first 3 grams Reverse Fibonacci sequence enables us to grow the weights proportionally and exponentially 78% goes to first 3 grams Comforting to know were taking into account larger grams (22% sounds reasonable to us) Then take the weighted mean of the Jaccards scaled by their respective Fibonacci number This is arbitrary, but we liked the results

Compared to exponential (2 based) and just plain ol unigrams with no scaling Gives us something in between these two Finally, an example: Distance Between 2 SQL Queries -- TODO [include two queries, their n-grams, jaccard of each, multiplied by the fibonacci element, then the mean] For now, say n-grams from 1-3 (so it can fit on a slide) 2nd Step: Choosing Initial Centers First, whats a center? Common clustering done over numerical vectors Euclidean distance and midpoint (Centroid) But were dealing with sets of n-grams No clear average midpoint formula So instead we use Clustroids

SQL queries represent centers Since we cant do average midpoint, we find the query minimizes square error with everybody else within the cluster Initialization Hill-Climbing No really good method that escapes local minima Simulated Annealing Genetic K-Means Difficult to apply to our features and distance metric Buckshot (1992) [include cite]) Hierarchical Agglomerative Clustering (HAC)

Dendrogram Graph Start w/ SQRT(N) samples, Greedily combine clusters (All-Pairs) until left with K Clustroids O(N^2 N*K) But this just does not scale for 20M -- TODO How many Jaccards for just BuckShot [include] Plus how many Jaccards for KMeans (probably more, stops when the clusters are stables, which is unknown number of steps) One can set a max number of iterations So how about sample more Just recomputing clustroids is n^2 Need to do all-pairs to find center with min square error Sample sqrt(n) to make it linear Can we initialize centers faster? -TODO BuckShot is slow Can we do this step faster and produce better clusters? Idea: Cuckoo Clustering

Inspired by Cuckoo Hashing and Genetic KMeans [include cites] Side Excursion: Cuckoo Clustering TODO (Should be extra slides) [describe algorithm] K buckets Elect better leaders Spill over to next best bucket [Analyze run time] [include pseudo code] Cuckoo Clustering TODO (extra slides) Nests (capacity 2) K=4 Sample size = 8 Clustering 116,425 SQL Queries

MSEWC Random Buckshot Cuckoo 0.341520927107638 0.331783982119379 0.329480122484919 SQL Queries chosen uniformly at random K = 56 Max K-Means Iterations = 20 MSEWC = Mean Square Error Within Cluster Plus -- TODO Each query template and the number of times it occurred The most popular ones should represent initial centers And, cleaned queries grouped by the same length should be assigned to the

same cluster initially Get one iteration of AssignClusters for free Now we can do KMeans Bootstrap Example -- TODO [include example, before and after group bys] Experiment -- TODO Setup SQL Server 2005 IronPython <== I love this language Did not use Data Mining package in SQL Server 'Microsoft Clustering' designed around and optimized for Euclidean vectors Not easy to write a SQL Server C++ plug-in for my style feature sets + Jaccard + sampling logic Computation Time: [?] KMeans Iterations: [?]

Some Optimizations TODO (extra slides) Pruning when we find similarity = 1 Hashing Speeds up Jaccard and reduces memory Hash(cleanedQuery) => List of Hash(N-gram)s Use Pythons 32-bit string hash function Collisions? Not many unique N-grams Number of SQL keywords + SDSS schema fields very small Birthday Problem: We can expect collisions with probability greater than a after hashing: + sqrt(1/4 + 2*2^32 * ln(2)) = ~77k queries We have ~77k unique cleaned queries Good enough Clusters Discovered TODO

(include after 80k clustered) We can do better: Finesse -TODO Idea: KMeans does the hard work of grouping similar queries together Intuition: We can find 'tighter' query groups within a cluster fast We choose a threshold (say within 85% similarity) Value of K not really important now [include algorithm] Finesse Algorithm -- TODO [include algorithm] Going backwards from 1.0 to threshold in decremental step values key to finding optimal clusters and for speed A Popular Cluster Found Query Template Count select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num3100261 1

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 30 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic z compare num logic num 1 select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 17 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num select cast objid ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 6 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 6 93 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 4

select cast objid type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 71 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 10 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 9 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 3 20 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic i compare num logic num select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1 select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num

1 select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 2 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num 101 select cast objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 1 select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic g compare num logic num 3 select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 9 select top num objid run rerun camcol field obj type ra dec u g r i z colc rowc from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic type compare num 5

Represents 18% of all the SQL queries (3,100,655 hits) Sum of Square Errors / Cluster Size = 3.58006287097031E-06 Meaning tiny variation within cluster And since we collect N-grams We can find the common phrases Ex. Top 15 Most Popular Eight-Grams Phrase dec err_g err_u g i r u z dec err_u g i r ra u z dec g i r ra type u z err_g err_i err_r err_u err_z from i z err_g err_i err_r err_u err_z i r z err_g err_i err_r err_u g i r z err_g err_r err_u g i r u z camcol dec field obj ra rerun type u camcol dec field obj ra rerun run type camcol dec field g obj ra type u dec field g obj r ra type u dec g i obj r ra type u err_i err_z from fgetobjfromrect photoprimary where objid compare err_z from fgetobjfromrect photoprimary where objid compare objid

from fgetobjfromrect photoprimary where objid compare objid logic % Share of all 8-grams 0.037037037 0.037037037 0.037037037 0.037037037 0.037037037 0.037037037 0.037037037 0.035273369 0.035273369 0.035273369 0.035273369 0.035273369 0.02292769 0.02292769 0.02292769 Bot Detection Belief: Little or no variance and large cluster sizes correspond to bot agents Can isolate bot from mortal SQL traffic

No user-agent strings in the SQL logs Bot Detection Significant # of EXACT query template matches Top 10 Query Templates Query Template Count select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261 select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 2732991 select count * from photoprimary where htmid compare num logic htmid compare num 2188913 select rowc_g colc_g from photoprimary where objid compare num 900055 select rowc_r colc_r from photoprimary where objid compare num

814884 select objid distance ra dec fphototypen type psfmag_u psfmagerr_u psfmag_g psfmagerr_g psfmag_r psfmagerr_r psfmag_i psfmagerr_i psfmag_z psfmagerr_z dered_u dered_g dered_r d 805539 select objid type flags ra dec r petromag_r isoa_r isob_r isophi_r isophierr_r distance from galaxy fgetnearbyobjeq where objid compare objid 570028 select top num weblink cast objid weblink cast objid weblink run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compa 448722 select top num ra dec u g r i z petror50_r petror90_r expphi_r expab_r nchild flags lnlexp_r lnldev_r from fgetnearbyobjeq galaxy where objid compare objid logic petromag_z compare num lo 307124 select extinction_u extinction_g extinction_r extinction_i extinction_z from photoobjall where objid compare num 262222 App: Query Recommendation Since we have a distance function we can even find similar queries New users, students, visitors wishing to query the sky But theres a learning curve SQL (writing 3-way spatial query joins may not be intuitive)

Schemas Optimizing & Debugging And since we a found queries to be quite repetitive Why not return the most similar (and correct) ones as suggestions to the user? Improving user experience and reach Example (TODO Better Query this isnt a common one) Find Top 3 Most Similar Queries to: SELECT top 100 p.psfMag_g - s.Mag_g FROM photoprimary p JOIN specObj s on p.specObjID=s.specObjID where s.specClass=1 Its Cleaned representation: select top num psfmag_g mag_g from photoprimary join specobj on specobjid compare specobjid where specclass compare num Similar Query Results Template Ex. Query

Similarity select top num psfmag_g mag_g from photoprimary join specobj on specobjid compare specobjid where specclass compare num SELECT top 100 p.psfMag_g - s.Mag_g FROM photoprimary p JOIN specObj s on p.spec ObjID=s.specObjID where s.specClass=1 select top num * from specobj join specline on specobjid compare specobjid where specclass compare num logic lineid compare num select top 10 * from specObj s join specLine l on s.specObjID = l.specObjID where specClass = 3 and l.lineID=6565 41% select count * from galaxy join specobj on specobjid compare specobjid where zconf compare num

logic specclass compare num SELECT COUNT(*) FROM BESTDR5..Galaxy as G JOIN BESTDR5..SpecObj as GS on G.specObjID = GS.specObjID WHERE GS.zconf > .9 AND GS.specClass =2 35% 100%

Recently Viewed Presentations

  • Understanding the CMMI Validation Process Area

    Understanding the CMMI Validation Process Area

    Understanding the CMMI® Validation Process Area Gary Natwick & Jim Cocci Harris Corporation Melbourne, Florida Application Domain CMMI®-SE/SW (Staged Representation) So… what is validation? Don't we just worry about that when the system is delivered? Isn't that the customer's problem?...
  • Ch 9 The Autonomic Nervous System SLOs  Review

    Ch 9 The Autonomic Nervous System SLOs Review

    Toprol-XL, metoprolol succinate, is a beta1-selective (cardioselective) adrenoceptor blocking agent, for oral administration, available as extended release tablets. Drugs that mimic adrenergic responses. Agonists: Promote the process stimulated by the NT.
  • Growth Regulators - henry.k12.ky.us

    Growth Regulators - henry.k12.ky.us

    Cytokinins Cell division (used in tissue culture) Cell differentiation (used in tissue culture for plant organ formation) Formation of callus tissue Delay aging process in plants Produced in roots Transported through xylem Still researched Cytokinins vs. Auxins In stems -...
  • Chapter 2

    Chapter 2

    Chapter 2 Preparing and Serving Safe Food
  • PHOTOGRAPHY 101 UNDERSTAND THE ELEMENTS OF ART: line

    PHOTOGRAPHY 101 UNDERSTAND THE ELEMENTS OF ART: line

    Pacing Guide: S1 Unit 7 - Elements of Digital Photographic Design. 7A: Teach students to identify, record and use art elements. Students to explore, analyze, and talk about what they see in the physical world and how to describe their...
  • The Bible and Its Influence

    The Bible and Its Influence

    Biblical Allusions. Macbeth, William Shakespeare. Absalom, Absalom, William Faulkner. John Donne "Love Song of J. Alfred Prufrock," T. S. Eliot. 60% of allusions on Advanced Placement Exams are biblical allusions. See list of most common biblical allusions.
  • Introduction to Climate - University of San Diego

    Introduction to Climate - University of San Diego

    ENVI 485 02/20/07 STEAMS AND FLOODING (cont.) MASS WASTING San Diego River 1852 - Since San Diego Bay was a deeper harbor, and the San Diego River carried heavy silt deposits, it was decided to deflect the San Diego River...
  • Isaiah&#x27;S Day of The Lord

    Isaiah'S Day of The Lord

    Isa. 13:9-13, "Behold, the day of the LORD cometh, cruel both with wrath and fierce anger, to lay the land desolate: and he shall destroy the sinners thereof out of it. For the stars of heaven and the constellations thereof...