DBI-B334: Data Management in Microsoft HDInsight: How to Move ...

DBI-B334 Data Management in Microsoft HDInsight: How to Move and Store Your Data
Saptak Sen, Azure Data Platform, @saptak

Agenda
  • What is HDInsight
      Hadoop, OSS and HDInsight
      HDInsight architecture
  • Working with data in HDInsight
      Where and how to store data for easy big data processing
  • Consuming result sets from HDInsight queries/jobs
      How to move result sets into familiar tools/solutions (Excel, RDBMS, etc.)
  • Questions

What is HDInsight?

Hadoop Distributed Architecture
  • MapReduce layer: job tracker, task trackers
  • HDFS layer: name node, data nodes

MapReduce: Move Code to the Data
  FIRST, STORE THE DATA
  [Diagram: files stored across four servers]

So How Does It Work?
  SECOND, TAKE THE PROCESSING TO THE DATA
  [Diagram: the runtime sends the code to each of the four servers]
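The two steps above (store the data first, then take the processing to it) can be sketched in plain JavaScript. This is only an illustrative simulation, not how Hadoop schedules work: the `servers` array, `mapLine`, `runMapOnServer`, `groupByKey`, and `wordCount` are hypothetical names invented for this sketch, and the sample lines are made-up data.

```javascript
// Step 1 (assumed setup): each "server" already holds its own chunk of the file.
const servers = [
  { name: "server1", lines: ["the quick brown fox"] },
  { name: "server2", lines: ["jumps over the lazy dog"] },
  { name: "server3", lines: ["the dog barks"] },
];

// The code that is moved to the data: emits [word, 1] pairs for one line.
function mapLine(line) {
  return line
    .split(/[^a-zA-Z]+/)
    .filter((word) => word !== "")
    .map((word) => [word.toLowerCase(), 1]);
}

// Step 2: run the shipped function against the lines held by one server.
function runMapOnServer(server, mapFn) {
  return server.lines.flatMap(mapFn);
}

// Shuffle: group the intermediate pairs by key.
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: sum the grouped values for each word.
function wordCount(cluster) {
  const intermediate = cluster.flatMap((server) => runMapOnServer(server, mapLine));
  const result = {};
  for (const [word, ones] of groupByKey(intermediate)) {
    result[word] = ones.reduce((sum, n) => sum + n, 0);
  }
  return result;
}

const totals = wordCount(servers);
console.log(totals);
```

Only the small `mapLine` function travels to each server in this sketch; the lines themselves never leave the server object that holds them until the much smaller intermediate pairs are grouped.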

// Map Reduce function in JavaScript
var map = function (key, value, context) {
    // Split the input line on anything that is not a letter
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Add up the counts emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

Windows Azure HDInsight Service
  • Job submission (Hive query, etc.)
  • Gateway (REST APIs)
  • Hadoop
      Query & Metadata: Hive
      Data Movement: Sqoop
      Pig
      Map Reduce
      Workflow: Oozie
      HCatalog
      Monitoring: Ambari
  • Hadoop Filesystem Interface
      HDFS
      Windows Azure Blob Storage

Windows Azure HDInsight Service
  • Data upload/download
  • Job submission (Hive query, etc.)
  • Cluster Dashboard UI
  • Gateway (REST APIs)
  • Hadoop Cluster
      Head Node
      Compute Node, Compute Node, Compute Node, Compute Node
  • Windows Azure Blob Storage

Working With Data in HDInsight

DEMO: Creating a Hadoop cluster, exploring the filesystem

Storing Data for Use with HDInsight Service
  • WHERE: All persistent data is stored in Windows Azure Blob Storage.
      It provides sharable, persistent, highly scalable storage with geo DR.
      HDInsight has been optimized for fast access from its compute nodes to blob storage in the same Azure region (east, west, etc.).
  • WHAT: The file format used in blob storage is up to you, but a format with an existing serializer/deserializer (aka SerDe) is often a good choice (e.g. comma-delimited, Avro, JSON).
  • WHY: By separating HDInsight compute nodes from persistent storage you can:
      Pay only for what you need: drop your HDInsight cluster whenever you don't have work to do.
      Let multiple clusters access the same data while isolating the compute resources by org/job/team/etc.
  • HOW: All data access in Hadoop goes through a pluggable file system interface.
      In on-prem Hadoop installations, this interface is implemented by the Hadoop Distributed File System (HDFS).
      In Azure, HDInsight clusters use this mechanism to be wired to blob storage accounts by default.
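The HOW point can be illustrated with a small sketch of a pluggable file system interface, where the caller reads and writes through one entry point and the URI scheme selects the backing store. Everything here is hypothetical and for illustration only: `FileSystemRegistry` and `InMemoryFileSystem` are invented names, both back ends are in-memory stand-ins, and the example URIs are made up. It is not Hadoop's actual FileSystem API.

```javascript
// Hypothetical registry: maps a URI scheme to a file system implementation.
class FileSystemRegistry {
  constructor() {
    this.implementations = new Map();
  }

  register(scheme, fileSystem) {
    this.implementations.set(scheme, fileSystem);
  }

  // Pick the implementation from the scheme in front of "://".
  resolve(uri) {
    const separator = uri.indexOf("://");
    if (separator < 0) throw new Error("URI has no scheme: " + uri);
    const scheme = uri.slice(0, separator);
    const fileSystem = this.implementations.get(scheme);
    if (!fileSystem) throw new Error("No file system registered for scheme: " + scheme);
    return fileSystem;
  }

  read(uri) {
    return this.resolve(uri).read(uri);
  }

  write(uri, data) {
    this.resolve(uri).write(uri, data);
  }
}

// In-memory stand-in used for both back ends; each exposes the same read/write interface.
class InMemoryFileSystem {
  constructor(label) {
    this.label = label;
    this.files = new Map();
  }

  write(uri, data) {
    this.files.set(uri, data);
  }

  read(uri) {
    if (!this.files.has(uri)) throw new Error(this.label + ": not found: " + uri);
    return this.files.get(uri);
  }
}

const registry = new FileSystemRegistry();
const hdfsLike = new InMemoryFileSystem("hdfs-like");
const blobLike = new InMemoryFileSystem("blob-like");
registry.register("hdfs", hdfsLike); // on-prem style back end
registry.register("asv", blobLike);  // blob-storage style back end

// The calling code is identical for both stores; only the URI differs.
registry.write("asv://example/data/input.txt", "hello");
registry.write("hdfs://example/tmp/scratch.txt", "scratch");
console.log(registry.read("asv://example/data/input.txt"));
```

Because job code only ever talks to the registry in this sketch, swapping the store behind a scheme does not require changing the job, which is the property the slide relies on when it wires clusters to blob storage.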

Using Blob Storage From HDInsight
  • An HDInsight cluster is bound to one default blob storage account and container at cluster create time.
  • Using the default container requires no special addressing (/ == root folder, etc.).
  • To access additional blob storage accounts or containers:
      asv[s]://@.blob.core.windows.net/
  • Storage accounts other than the default need to be registered in siteconfig.xml:
      property name:  fs.azure.account.key.accountname
      property value: enterthekeyvaluehere

Uploading Data to Blob Storage
  • For prototyping / samples: #put
  • For production data, interact directly with blob storage APIs:
      AzCopy command line
      CopyBlob REST API
      Third-party upload/download tools

AzCopy Example
  Command line:
      AzCopy c:\blobs https://.blob.core.windows.net/mycontainer/ /destkey: /S
  File system:
      C:\blobs\a.txt
      C:\blobs\b.txt
      C:\blobs\dir1\c.txt
      C:\blobs\dir1\dir2\d.txt
  Blob storage:
      Container     Blob Name
      mycontainer   a.txt
      mycontainer   b.txt
      mycontainer   dir1\c.txt
      mycontainer   dir1\dir2\d.txt
  HDInsight will treat dir1\dir2\d.txt as a file in a two-level directory structure.

DEMO: Copy blob, query with Hive

Consuming Result Sets

Consuming HDInsight Result Sets
  Target Destination | Tool / Library | Requires Active HDInsight Cluster
  SQL Server, Azure SQL DB | Sqoop (Hadoop ecosystem project) | Yes
  Excel | Codename Data Explorer | No
  Another blob storage account | Azure Blob Storage REST APIs (Copy Blob, etc.) | No
  SQL Server Analysis Services | Hive ODBC Driver | Yes
  Existing BI apps | Hive ODBC Driver (assumes the app supports ODBC connections to data sources) | Yes

DEMO: Consume result sets, SQL DB
DEMO: Consume result sets, Excel & Data Explorer
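Returning to the AzCopy example in this section: local files copied recursively end up as flat blob names, and the separators inside those names are what HDInsight reads as directories. A minimal sketch of that mapping follows, using the paths from the slide. The helpers `toBlobName` and `directoryDepth` are hypothetical names for illustration; this is not AzCopy's implementation, only the naming rule the slide's table implies.

```javascript
// Hypothetical helper: derive the blob name for a file copied recursively
// from localRoot, keeping the backslash separators shown on the slide.
function toBlobName(localRoot, localPath) {
  const root = localRoot.replace(/\\+$/, "") + "\\";
  // Windows paths are case-insensitive, so compare in lower case.
  if (!localPath.toLowerCase().startsWith(root.toLowerCase())) {
    throw new Error("Path is outside the source folder: " + localPath);
  }
  return localPath.slice(root.length);
}

// Hypothetical helper: number of directory levels implied by the blob name.
function directoryDepth(blobName) {
  return blobName.split(/[\\/]/).length - 1;
}

// File system listing from the slide.
const files = [
  "C:\\blobs\\a.txt",
  "C:\\blobs\\b.txt",
  "C:\\blobs\\dir1\\c.txt",
  "C:\\blobs\\dir1\\dir2\\d.txt",
];

const blobs = files.map((path) => {
  const name = toBlobName("c:\\blobs", path);
  return { container: "mycontainer", name: name, depth: directoryDepth(name) };
});

for (const blob of blobs) {
  console.log(blob.container, blob.name, "directory levels:", blob.depth);
}
```

The container itself stays flat in this model; the directory levels exist only in the blob names, which is why the last file counts as sitting two levels deep.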

Summary
  • HDInsight is an enterprise-grade, Hadoop-based big data storage and processing platform.
  • Azure Blob Storage + HDInsight == simple big data storage and processing in the cloud, available to try today.
  • Consuming results from HDInsight in familiar tools and apps (Excel, etc.) is simple with Data Explorer, Azure Blob APIs, Sqoop, ODBC, etc.

Questions?

Related content
  • FDN01 Big Data. Small Data. All Data
  • DBI-B304 Large Scale Data Warehousing and Big Data
  • DBI-B325 Do you have Big Data? Most Likely!
  • DBI-B336 Big Data Analytics with Microsoft Excel 2013
  • DBI-B339 Predictive Analytics with Microsoft Big Data
  • DBI-B313 Polybase: Hadoop integration in SQL Server

Track Resources
  Download Data Explorer, Windows Azure, Download Geoflow, SQL Server Website, Microsoft Virtual Academy, Hands-On Labs, Get Certified!, @sqlserver

Resources
  • Learning: Sessions on Demand, http://channel9.msdn.com/Events/TechEd
  • TechNet: Resources for IT Professionals, http://microsoft.com/technet
  • Microsoft Certification & Training Resources, www.microsoft.com/learning
  • MSDN: Resources for Developers, http://microsoft.com/msdn

© 2013 Microsoft Corporation. All rights reserved.
