How to Work with Large Datasets to Build Predictive Models

Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning

Agenda

1. How to Work with Large Datasets
   - Sample Dataset: NYC Taxi
   - HDInsight (Hadoop on Azure)
   - IPython notebook and HDInsight
2. Building Predictive Models
   - Azure ML Studio
   - Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight

Sample Data: NYC Taxi

- One-year log of NYC taxi rides
- 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
- Trip data (driver id, times, locations) and fare data (fare, tip, tolls)
- Rest of tutorial: data wrangling and tip prediction
- Tools: AzCopy, HDInsight, IPython, Azure ML Studio

HDInsight: Hadoop on Azure

- 100% Apache Hadoop as an Azure service
- Can deploy on Windows or Linux
- Provides MapReduce capability over big data in Azure blobs
- Head node: job and cluster monitoring
- Hive: SQL-like queries as an alternative to writing code

    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;

IPython Notebook

- Web-based Python REPL environment
- Combines authoring, execution, visualization
- Can author and execute HDInsight Hive queries
- Sample query (Python code snippet):

    import json
    import urllib2  # Python 2 standard library, as used on the slide

    # Methods of a Hive helper class; the class definition is not shown on the slide.
    def submit_hive_query(self):
        # POST the Hive parameters to the cluster URL and parse the JSON reply.
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']  # job id returned by the service

    def query(self, queryString):
        self.submit_hive_query()

- Example query string: SELECT * FROM sample_table LIMIT 10;

What is Azure ML Studio?

- Fully managed cloud service
- Browser-based authoring of dataflow
- Best-in-class machine learning algorithms
- Support for R/Python/SQL
- Collaborative data science
- Quickly deploy models as web services/REST APIs
- Publish to a gallery for collaboration with community

Learning with Counts
a.k.a. Dracula (Distributed Robust Algorithm for CoUnt-based LeArning)

Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research

Large-Scale Learning in Multi-Entity Domains

Example record:
  Query       = powder skis
  QCategories = {skiing, outdoor gear}
  Userid      = 0xb49129827048dd9b
  IP          = 131.107.65.14
  adid        = 1010054353
  adText      = K2 ski sale!
  adURL       = www.k2.com/sale

[Figure annotations: attribute cardinalities on the order of 10^7 to 10^10+]

- Information retrieval
- Advertising, recommending, search: item, page/query, user
- Transaction classification
- Payment fraud: transaction, product, user
- Email spam: message, sender, recipient
- Intrusion detection: session, system, user
- IoT: device, location

Large-Scale Learning in Multi-Entity Domains

Example record:
  query        powder skis
  qCategories  {skiing, outdoor gear}
  adid         1010054353
  adText       Fall ski sale!
  adURL        www.k2.com/sale
  userid       0xb49129827048dd9b
  IP           131.107.65.14

Problem: representing high-cardinality attributes as features
- Scalable: to billions of attribute values
- Efficient: predictions/sec
- Flexible: for a variety of downstream learners
- Adaptive: to distribution change

Standard approaches: binary features, hashing, projections
What everyone uses in industry: learning with counts
This talk: formalization and generalization

Learning with Counts

[Figure: a per-label count table for the IP attribute, with a REST row for values that have no row of their own:

  IP               per-label counts
  173.194.33.9     46964      993424
  87.250.251.11    31         843
  131.107.65.14    12         430
  REST             745623     13964931

Counts are looked up in the same way for 131.107.65.14, k2.com, powder skis, and the pair (powder skis, k2.com).]

Features are transforms of conditional statistics (per-label counts):

  [N+, N-, log(N+) - log(N-), IsBackoff]

- log(N+) - log(N-) = log(N+/N-): the log-odds / naive Bayes estimate
- N+, N-: indicators of confidence of the naive estimate
- IsBackoff (a.k.a. IsFromRest): indicator of back-off vs. real count
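
The feature vector [N+, N-, log(N+) - log(N-), IsBackoff] from the Learning with Counts slide can be sketched in a few lines of Python. This is a minimal illustration, not the Dracula implementation: the function name `count_features`, the table layout (value -> (N+, N-)), and the add-one smoothing inside the logarithms are assumptions. The counts are the ones shown for the IP attribute in the slide figure; reading the two columns as (N+, N-) is also an assumption.

```python
import math

# Hypothetical per-label count table for one attribute: value -> (n_pos, n_neg).
# Numbers come from the IP table on the slide; reading them as (N+, N-) is an assumption.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
# Pooled counts for every value that has no row of its own (the REST row).
IP_REST = (745623, 13964931)


def count_features(value, table, rest):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    # Add-one smoothing (an assumption, not from the slides) keeps log() finite at zero counts.
    log_odds = math.log(n_pos + 1) - math.log(n_neg + 1)
    return [n_pos, n_neg, log_odds, int(is_backoff)]


seen = count_features("131.107.65.14", IP_COUNTS, IP_REST)
unseen = count_features("10.0.0.1", IP_COUNTS, IP_REST)
print(seen)
print(unseen)
```

A value with its own row returns its real counts with IsBackoff = 0, while an unseen value falls back to the REST counts with IsBackoff = 1, so a downstream learner can tell the two cases apart.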

Features are transforms of conditional counts. They are:
- Scalable: head in memory + tail in backoff; or: count-min sketch
- Efficient: low cost, low dimensionality
- Flexible: low dimensionality works well with non-linear learners
- Adaptive: new values easily added, back-off for infrequent values, temporal counts

Learning with Counts: aggregation

- Aggregate counts for different bin functions
- Standard MapReduce
- Bin function: any projection
- Backoff options: tail bin, hashing, hierarchical (shrinkage)

[Figure: count tables built for several bin functions: IP, query, (Query, AdId), and IP[2] (the first two octets, e.g. 173.194.*.*). Each table holds two per-label counts per value plus a REST row for back-off, e.g. query = facebook: 281912 and 7957321. Counts are accumulated over a counting period that ends at Tnow.]
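
The aggregation step can be sketched on a single machine as a plain counting loop; on a cluster, the same per-key sums are what a MapReduce job would compute. Everything named here is hypothetical: the toy `records`, the `BIN_FUNCTIONS` dictionary, the `min_count` threshold, and the choice of a tail bin (rather than hashing or shrinkage) as the back-off.

```python
from collections import defaultdict

# Toy labeled records (hypothetical data): attributes plus a binary label.
records = [
    {"ip": "173.194.33.9", "query": "facebook", "adid": "ad1", "label": 1},
    {"ip": "173.194.33.9", "query": "facebook", "adid": "ad2", "label": 0},
    {"ip": "173.194.40.2", "query": "facebook", "adid": "ad1", "label": 0},
    {"ip": "87.250.251.11", "query": "dozen roses", "adid": "ad3", "label": 0},
]

# Bin functions: each projects a record onto the key of one count table.
BIN_FUNCTIONS = {
    "ip": lambda r: r["ip"],
    "query": lambda r: r["query"],
    "query_adid": lambda r: (r["query"], r["adid"]),
    "ip2": lambda r: ".".join(r["ip"].split(".")[:2]) + ".*.*",  # IP[2] prefix
}


def aggregate(records, bin_functions):
    """Build one table per bin function: key -> [n_pos, n_neg]."""
    tables = {name: defaultdict(lambda: [0, 0]) for name in bin_functions}
    for rec in records:
        column = 0 if rec["label"] == 1 else 1
        for name, bin_fn in bin_functions.items():
            tables[name][bin_fn(rec)][column] += 1
    return tables


def fold_tail(table, min_count):
    """Tail-bin back-off: pool keys seen fewer than min_count times into REST."""
    head, rest = {}, [0, 0]
    for key, (n_pos, n_neg) in table.items():
        if n_pos + n_neg >= min_count:
            head[key] = (n_pos, n_neg)
        else:
            rest[0] += n_pos
            rest[1] += n_neg
    return head, tuple(rest)


tables = aggregate(records, BIN_FUNCTIONS)
ip_head, ip_rest = fold_tail(tables["ip"], min_count=2)
print(dict(tables["ip2"]))
print(ip_head, ip_rest)
```

Adding a new table only requires adding another entry to `BIN_FUNCTIONS`, which is one way to read "bin function: any projection".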

Learning with Counts: combiner training

- Train non-linear model on count-based features
    Counts, transforms, lookup properties
- Additional features can be injected

[Figure: for each count table (IP, query, (Query, AdId)), the looked-up counts are turned into features (log-transformed counts and IsBackoff). These aggregated features are combined with the original numeric features to train the predictor. Timeline: a counting period, followed by predictor training, ending at Tnow.]
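
Combiner training can be sketched as follows: count features are looked up from a table built on an earlier counting period, concatenated with an original numeric feature, and passed to a learner. The slides do not say which learner is used, so the one-split decision stump below is only a tiny stand-in for a non-linear model. The training rows, the numeric feature, and all function names are hypothetical; the counts are the ones shown for the query table in the slide figure, read as (N+, N-) by assumption.

```python
import math

# Hypothetical count table built on an earlier "counting" period: query -> (n_pos, n_neg).
QUERY_COUNTS = {"facebook": (281912, 7957321), "dozen roses": (32791, 640964)}
QUERY_REST = (6321789, 43477252)

# Hypothetical later "training" period: (query, original numeric feature, label).
train_rows = [
    ("facebook", 0.2, 0),
    ("dozen roses", 0.9, 1),
    ("facebook", 0.4, 0),
    ("powder skis", 0.7, 1),  # not in the table, so it backs off to REST
]


def featurize(query, numeric):
    """Aggregated count features followed by the original numeric feature."""
    is_backoff = query not in QUERY_COUNTS
    n_pos, n_neg = QUERY_REST if is_backoff else QUERY_COUNTS[query]
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [math.log(n_pos), math.log(n_neg), log_odds, float(is_backoff), numeric]


def train_stump(X, y):
    """Fit a one-split decision stump by exhaustive search over features and thresholds."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for left_label in (0, 1):
                preds = [left_label if row[j] <= threshold else 1 - left_label for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, left_label)
    return best[1:]


def predict_stump(stump, row):
    """Apply the learned split to one feature row."""
    j, threshold, left_label = stump
    return left_label if row[j] <= threshold else 1 - left_label


X = [featurize(q, v) for q, v, _ in train_rows]
y = [label for _, _, label in train_rows]
stump = train_stump(X, y)
predictions = [predict_stump(stump, row) for row in X]
print(stump, predictions)
```

Because the learner only sees a handful of dense numeric columns per attribute, it can be swapped for any other model without touching the count tables.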

Prediction with counts

- Counts are updated continuously
- Combiner re-training infrequent

[Figure: the same lookup pipeline at prediction time, with count tables for IP, query, and (URL, Country) (rows such as url1, US; url2, CA; url3, FR; REST). Aggregated features and original numeric features feed the trained combiner. Timeline: counting continues up to Tnow, while the combiner was trained earlier, at Ttrain.]
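
The split between continuously updated counts and an infrequently re-trained combiner can be sketched as two small functions: one that looks up features at prediction time, and one that folds a newly observed labeled event into the table. The table contents, the add-one smoothing, and the function names are assumptions for illustration; the features returned by `lookup_features` would be passed to whatever combiner was trained at Ttrain.

```python
import math
from collections import defaultdict

# Hypothetical live count table: (url, country) -> [n_pos, n_neg]; values are illustrative.
counts = defaultdict(lambda: [0, 0])
counts[("url1", "US")] = [54546, 978964]
REST = (4419312, 52754683)  # back-off row for keys without counts of their own


def lookup_features(key):
    """Count features for one key at prediction time, with REST back-off."""
    is_backoff = key not in counts
    n_pos, n_neg = REST if is_backoff else counts[key]
    # Add-one smoothing (an assumption) keeps log() finite for freshly added keys.
    return [math.log(n_pos + 1), math.log(n_neg + 1), int(is_backoff)]


def update_counts(key, label):
    """Fold one newly observed labeled event into the table; the combiner is untouched."""
    counts[key][0 if label == 1 else 1] += 1


before = lookup_features(("url3", "FR"))  # unseen key: backs off to REST
update_counts(("url3", "FR"), 1)
update_counts(("url3", "FR"), 0)
after = lookup_features(("url3", "FR"))   # now served from its own counts
print(before, after)
```

A new value starts on the REST row and moves to its own counts as soon as events for it arrive, without re-training the combiner.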

What is great about learning with counts?

- State-of-the-art accuracy
- Good fit for map-reduce
- Modular (vs. monolithic)
    Learner can be tuned/monitored/replaced in isolation
- Monitorable, debuggable (this is HUGE in practice!)
    Temporal changes easy to monitor
    Easy emergency recovery (remove bot attacks, etc.)
    Decomposable predictions
    Error debugging (which feature can we blame)

Learning with Counts: in Azure ML

Putting it all together

- HDInsight: large data storage and map-reduce processing
- Azure ML: cloud ML and analytics, accessible anywhere
- Learning with Counts: intuitive, flexible large-scale ML solution

Thanks for your time

Useful Links:
- http://azure.microsoft.com/ml- Sign up for your free Azure ML Trial
- http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
- Need Azure ML for teaching in the classroom? Contact the speakers
- Other questions? Contact the speakers

Speakers:
- Misha Bilenko: [email protected]
- Girish Nathan: [email protected]