Introduction to Microsoft Azure Machine Learning

How to Work with Large Datasets to Build Predictive Models
Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning

Agenda
1. How to Work with Large Datasets
   - Sample dataset: NYC Taxi
   - HDInsight (Hadoop on Azure)
   - IPython notebook and HDInsight
2. Building Predictive Models
   - Azure ML Studio
   - Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight

Sample Data: NYC Taxi
- One-year log of NYC taxi rides
- 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
- Trip data (driver id, times, locations) and fare data (fare, tip, tolls)
- Rest of tutorial: data wrangling and tip prediction
- Tools: AzCopy, HDInsight, IPython, Azure ML Studio

HDInsight: Hadoop on Azure
- 100% Apache Hadoop as an Azure service
- Can deploy on Windows or Linux
- Provides MapReduce capability over big data in Azure blobs
- Head node: job and cluster monitoring
- Hive: SQL-like queries as an alternative to writing code

    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;

IPython Notebook
- Web-based Python REPL environment
- Combines authoring, execution, visualization
- Can author and execute HDInsight Hive queries
- Sample query (Python 2 code snippet; both functions are methods of a helper class):

    import json
    import urllib2

    def submit_hive_query(self):
        # Passing a data argument makes urlopen send a POST to self.url
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        # Keep the job id returned in the JSON response
        self.hiveJobID = data['id']

    def query(self, queryString):
        self.submit_hive_query()

- Example query string: SELECT * FROM sample_table LIMIT 10;

What is Azure ML Studio
- Fully managed cloud service
- Browser-based authoring of dataflows
- Best-in-class machine learning algorithms
- Support for R/Python/SQL
- Collaborative data science
- Quickly deploy models as web services/REST APIs
- Publish to a gallery for collaboration with the community

Learning with Counts
a.k.a. Dracula (Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research

Large-scale learning in multi-entity domains
Example record:
    Query       = powder skis
    QCategories = {skiing, outdoor gear}
    Userid      = 0xb49129827048dd9b
    IP          = 131.107.65.14
    adid        = 1010054353
    adText      = K2 ski sale!
    adURL       = www.k2.com/sale
[Figure residue: attribute cardinalities, on the order of 10^7 to 10^10+ distinct values.]

Application domains:
- Information retrieval
  - Advertising, recommending, search: item, page/query, user
- Transaction classification
  - Payment fraud: transaction, product, user
  - Email spam: message, sender, recipient
  - Intrusion detection: session, system, user
  - IoT: device, location

Large-scale learning in multi-entity domains
Example record:
    query       powder skis
    qCategories {skiing, outdoor gear}
    adid        1010054353
    adText      Fall ski sale!
    adURL       www.k2.com/sale
    userid      0xb49129827048dd9b
    IP          131.107.65.14

Problem: representing high-cardinality attributes as features
- Scalable: to billions of attribute values
- Efficient: [rate lost in extraction] predictions/sec
- Flexible: for a variety of downstream learners
- Adaptive: to distribution change

Standard approaches: binary features, hashing, projections
What everyone uses in industry: learning with counts
This talk: formalization and generalization

Learning with Counts
[Figure: count tables keyed by attribute value, e.g. an IP table with rows for 173.194.33.9, 87.250.251.11, 131.107.65.14 and a REST row for all remaining values; conditional statistics for k2.com, powder skis, and the pair (powder skis, k2.com).]

Features are transforms of conditional statistics (per-label counts):
    [N+, N-, log(N+) - log(N-), IsBackoff]
- log(N+) - log(N-) = log(N+/N-): log-odds / Naive Bayes estimate
- N+, N-: indicators of confidence of the naive estimate
- IsBackoff (IsFromRest): indicator of back-off vs. real count
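The feature vector [N+, N-, log(N+) - log(N-), IsBackoff] can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the `REST` key, the `min_total` threshold for folding rare values into the back-off row, and the additive `smoothing` that keeps the logarithms finite are all assumptions made for illustration.

```python
import math
from collections import defaultdict

REST = "__REST__"  # hypothetical key for the shared back-off row


def build_count_table(rows, min_total=2):
    """Count positive/negative labels per attribute value.

    rows: iterable of (value, label) pairs with label in {0, 1}.
    Values seen fewer than min_total times are folded into the REST row
    (the threshold is an assumption, not taken from the slides).
    """
    raw = defaultdict(lambda: [0, 0])  # value -> [N+, N-]
    for value, label in rows:
        raw[value][0 if label == 1 else 1] += 1

    table = {REST: [0, 0]}
    for value, (n_pos, n_neg) in raw.items():
        if n_pos + n_neg >= min_total:
            table[value] = [n_pos, n_neg]
        else:
            table[REST][0] += n_pos
            table[REST][1] += n_neg
    return table


def count_features(table, value, smoothing=1.0):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one value."""
    is_backoff = value not in table
    n_pos, n_neg = table[REST] if is_backoff else table[value]
    # Additive smoothing is an assumption; it avoids log(0) for empty counts.
    log_odds = math.log(n_pos + smoothing) - math.log(n_neg + smoothing)
    return [n_pos, n_neg, log_odds, 1.0 if is_backoff else 0.0]


if __name__ == "__main__":
    rows = [("173.194.33.9", 1), ("173.194.33.9", 0), ("173.194.33.9", 0),
            ("87.250.251.11", 0), ("131.107.65.14", 1)]
    table = build_count_table(rows)
    print(count_features(table, "173.194.33.9"))  # value with its own row
    print(count_features(table, "10.0.0.1"))      # unseen value, uses REST
```

Each attribute contributes only four numbers regardless of how many distinct values it has, which is the low dimensionality the slides refer to.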

Features are transforms of conditional counts:
    [N+, N-, log(N+) - log(N-), IsBackoff]
- Scalable: head in memory + tail in backoff; or count-min sketch
- Efficient: low cost, low dimensionality
- Flexible: low dimensionality works well with non-linear learners
- Adaptive: new values easily added, back-off for infrequent values, temporal counts

Learning with Counts: aggregation
[Figure: count tables keyed by IP (173.194.33.9, 87.250.251.11, 131.253.13.32, REST), by query (facebook, dozen roses, REST), by the pair Query x AdId ((facebook, ad1), (facebook, ad2), (dozen roses, ad3), REST), and by IP[2].]
- Aggregate ... for different ... (expressions lost in extraction)
- Standard MapReduce
- Bin function: any projection
- Backoff options: tail bin, hashing, hierarchical (shrinkage)
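The aggregation step (per-label counts for several bin functions, computed MapReduce-style, with a tail bin as backoff) might look roughly like the following single-process Python sketch. The record field names, the `BIN_FUNCTIONS` table, the two-octet reading of IP[2], and the `min_total` threshold are assumptions for illustration; on a real cluster the map and reduce phases would run as a Hadoop or Hive job instead.

```python
from collections import Counter

REST = "REST"


def ip_prefix(record):
    """Projection bin: first two octets of the IP, e.g. '173.194.*.*'."""
    a, b = record["ip"].split(".")[:2]
    return f"{a}.{b}.*.*"


# Hypothetical bin functions; each maps a record to a key in its own table.
BIN_FUNCTIONS = {
    "ip": lambda r: r["ip"],
    "ip[2]": ip_prefix,
    "query": lambda r: r["query"],
    "query,adid": lambda r: (r["query"], r["adid"]),
}


def map_phase(records):
    """Emit ((table, bin, label), 1) for every record and bin function."""
    for record in records:
        for name, bin_fn in BIN_FUNCTIONS.items():
            yield (name, bin_fn(record), record["label"]), 1


def reduce_phase(pairs):
    """Sum the emitted ones per key, as a reducer would."""
    totals = Counter()
    for key, one in pairs:
        totals[key] += one
    return totals


def apply_tail_bin(totals, min_total=2):
    """Fold bins with fewer than min_total observations into REST."""
    per_bin = Counter()
    for (name, bin_key, _label), n in totals.items():
        per_bin[(name, bin_key)] += n
    folded = Counter()
    for (name, bin_key, label), n in totals.items():
        keep = per_bin[(name, bin_key)] >= min_total
        folded[(name, bin_key if keep else REST, label)] += n
    return folded


if __name__ == "__main__":
    records = [
        {"ip": "173.194.33.9", "query": "facebook", "adid": "ad1", "label": 1},
        {"ip": "173.194.33.9", "query": "facebook", "adid": "ad2", "label": 0},
        {"ip": "173.194.10.1", "query": "dozen roses", "adid": "ad3", "label": 0},
    ]
    for key, n in sorted(apply_tail_bin(reduce_phase(map_phase(records))).items(), key=str):
        print(key, n)
```

Because every bin function writes to its own table, adding a new projection only adds another entry to the map phase; the reducer stays unchanged.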

[Figure residue: the IP[2] table is keyed by two-octet prefixes (173.194.*.*, 87.250.*.*, 131.253.*.*, REST); the time axis shows counting up to Tnow.]

Learning with Counts: combiner training
- Train non-linear model on count-based features
  - Counts, transforms, lookup properties
  - Additional features can be injected
[Figure: the same count tables (IP, query, Query x AdId) feeding a feature vector of counts, log transforms (ln), and IsBackoff indicators.]
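As a rough illustration of what the combiner is trained on, the sketch below assembles one input row by concatenating the count-based features of each attribute with injected numeric features. The count tables, their numbers, the field names, and the +1 smoothing are all hypothetical; the slides do not specify the learner, so the rows are simply what any non-linear model would consume.

```python
import math

REST = "REST"

# Hypothetical count tables: attribute -> {value: (N+, N-)}.
COUNT_TABLES = {
    "ip": {"173.194.33.9": (40, 960), REST: (5, 95)},
    "query": {"facebook": (300, 700), REST: (10, 90)},
}


def lookup(table, value):
    """Return (N+, N-, is_backoff), falling back to the REST row."""
    if value in table:
        n_pos, n_neg = table[value]
        return n_pos, n_neg, 0.0
    n_pos, n_neg = table[REST]
    return n_pos, n_neg, 1.0


def combiner_row(record, numeric_fields=()):
    """Concatenate count features of every attribute with numeric features."""
    row = []
    for attribute, table in COUNT_TABLES.items():
        n_pos, n_neg, is_backoff = lookup(table, record[attribute])
        # +1 smoothing is an assumption that keeps the logs finite.
        log_odds = math.log(n_pos + 1) - math.log(n_neg + 1)
        row += [n_pos, n_neg, log_odds, is_backoff]
    # Injected features: original numeric columns appended unchanged.
    row += [float(record[name]) for name in numeric_fields]
    return row


if __name__ == "__main__":
    record = {"ip": "173.194.33.9", "query": "powder skis", "hour": 14}
    print(combiner_row(record, numeric_fields=("hour",)))
```

Keeping the count lookup separate from the learner is what lets the combiner be tuned or replaced without touching the count tables.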

[Figure labels: Counting; Aggregated features; Original numeric features; Train predictor; time axis ending at Tnow.]

Prediction with counts
- Counts are updated continuously
- Combiner re-training infrequent
[Figure: count tables keyed by IP, by query, and by the pair URL x Country ((url1, US), (url2, CA), (url3, FR), REST).]

[Figure labels: feature vector of IsBackoff indicators and log (ln) transforms; Aggregated features; Original numeric features; Counting; time axis with Ttrain and Tnow.]

What is great about learning with counts?
- State-of-the-art accuracy
- Good fit for MapReduce
- Modular (vs. monolithic)
  - Learner can be tuned/monitored/replaced in isolation
- Monitorable, debuggable (this is HUGE in practice!)
  - Temporal changes easy to monitor
  - Easy emergency recovery (remove bot attacks, etc.)
  - Decomposable predictions
  - Error debugging (which feature can we blame)

Learning with Counts: in Azure ML

Putting it all together
- HDInsight: large data storage and MapReduce processing
- Azure ML: cloud ML and analytics, accessible anywhere
- Learning with Counts: intuitive, flexible large-scale ML solution

Thanks for your time
Useful links:
- http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
- http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
- Need Azure ML for teaching in the classroom? Contact the speakers
- Other questions? Contact the speakers
Speakers: Misha Bilenko ([email protected]), Girish Nathan ([email protected])
