Agenda - School of Computing

Windows Azure Internals: Opportunities and Challenges of a Cloud Operating System
Brad Calder, Corporate Vice President, Windows Azure, Microsoft

Agenda
  • Promise of the Cloud
  • What a Cloud Provides
  • Opportunities and Challenges
    • Cloud App Modeling
    • Cloud Fabric
    • Cloud Storage

Promise of the Cloud

The Cloud Vision
  • On-demand resources
  • Elastically scale out and in
  • Available anywhere at anytime
  • Unlock insights from any data
  • ONE consistent platform
  • Focus on application logic
  • Seamless experience across cloud and devices
  [Diagram labels: Cloud, Devices, On-Premises]

Master Chief meets Windows Azure

Halo before the Cloud
  • Building a service! All I wanted is to build/run a service
  • Find hosting location: How much space do I need? How do I grow? Redundancy? Security? Local support? Local regulations? Taxes? ...
  • Hardware (buy servers): Which type? Where from? How many? What kind of support plan? Spare parts? Replacements? How do I add capacity to a running service? Network gear? Storage?
  • Software: Which OS? Security patches? Deploying and upgrading software? Patching firmware? Load balancing? Storage?
  • Support: Support for all of the above?
  • How much should I invest?
  [Diagram labels: Update Clients, Cheat & Ban, A/B Testing, Multiplayer Lobby, Stats & Presence]

Halo 4 on Windows Azure
  • Built over 40 applications that leverage the Orleans runtime
  • Allowed Halo to focus on their application logic instead of infrastructure
  [Diagram labels: Title File, Challenges, Video Ingestion, XBOX Live Proxy, Personalize, Profile, Admin, Stats, Register Client, QoS, Cheat & Ban, UGC, Emblem, Lobby, Windows Azure, Search, Presence, Content Management System, BI]

Game Traffic
  • Launch predictions are often wrong
    • Not enough capacity leads to bad user experience and potentially outages
    • Too much capacity can waste a significant amount of money
  • Cloud elasticity is key
    • For cost and user experience
    • Able to scale out and in to tightly ride the demand curve
  • Traffic can be spiky
  [Chart: game traffic over time; x-axis: Time in Days]

Provisioning Resources before the Cloud
  • Problem: significant wasted costs vs. outage risk and bad user experience
  [Charts: "Over Provisioning" and "Under Provisioning (catching up with demand)"; axes: Resource vs. Time; legend: Demand, Provision, Overprovisioned, Underprovisioned]

Elasticity Provisioning in the Cloud
  • Cloud provides on-demand, scale out and in, compute, storage and network resources
  • Benefit: reduced costs and improved user experience
  • How does the Cloud support this? Scale
  [Charts: "Self-Provisioning" and "Cloud Provisioning"; axes: Resource vs. Time; legend: Demand, Provision, Overprovisioned, Underprovisioned]

Windows Azure's Scale
  • Over 250,000 external customers
  • Adding 1,000+ new customers a day

  • Capacity demand doubling every 9 months
  • Microsoft services on Azure: SkyDrive
  [Diagram label: Windows Azure Cloud]

What a Cloud Provides

Windows Azure's Global Datacenters
  [Diagram labels: Datacenter, Security, Power, Redundancy]

What a Cloud Provides Under the Covers
  • App business logic
  • Service glue
    • Overprovision for blended peak traffic
    • Add compute/storage capacity on the fly
    • OS patches and deploying/upgrading app
    • Metering and billing infrastructure
    • Service monitoring and alerting infrastructure
    • Reliable/secure computation and storage
    • Respond to hardware failures
    • Buy and provision hardware
    • Datacenter (power, cooling, Internet)

Building Blocks Provided by Windows Azure to Make it Easier to Build Applications
  • App services (building modern apps that connect services with devices): cloud services, caching, identity, service bus, media, mobile services, web sites, BizTalk Services, hpc, analytics
  • Data services (managing data): SQL database, HDInsight, Table, Blob storage
  • Infrastructure services (IT infrastructure): Virtual machines, Virtual network, VPN, Traffic manager, CDN

Cloud App Modeling

Cloud App Model
  • Application modeling and composition
  [Diagram: a Cloud Application modeled on top of the building blocks. App services: compute services, caching, identity, service bus, media, mobile services, web sites, BizTalk Services, hpc, analytics. Data services: SQL database, HDInsight, Table, Blob storage. Infrastructure services: Virtual machines, Virtual network, VPN, Traffic manager, CDN]

Cloud Application Model Concepts
  • Resources: identify building blocks used in the service
  • Apps: service code to be run on VMs
  • Deployment
    • Choose number of Fault Domains (FD)
      • Unit of failure based on data center topology, e.g. top-of-rack switch on a rack of machines
      • Spread VMs out across FDs to avoid single points of physical failure
    • Choose number of Upgrade Domains (UD)
      • Percentage of your app you will take offline for an upgrade at a time
  • Configuration
    • Specify number of instances
    • Set the desired configurations for resources
    • Allows dynamic changes to configuration
  [Diagram: a Cloud Application composed of web sites, media, compute services, SQL database, virtual machines, blob storage and virtual network; labels: Fault Domain, Upgrade Domain]

Cloud Application Model Concepts (2)
  • Contracts + topology across components
    • Enforce specified contracts and control access across components
    • Provides resource discoverability and change notification
  • Integrated identity/auth across components
    • Access control across component endpoints
    • Role based access control
    • Allows management of quotas, monitoring, alerts
  • Dynamic scaling
    • Scale in/out: vary number of VM instances
  [Diagram: the same Cloud Application, with the number of virtual machines varying]

Windows Azure App Model
  • A Windows Azure application consists of
    • A Model with Definition information and Configuration information
    • At least one role

  • A role is the scaling boundary within an app
  • Roles are like DLLs in your cloud application
    • Collection of code that runs in its own virtual machine with an entry point that WA knows how to invoke
  • Virtual machine is scale unit
    • Role code runs in a virtual machine
    • Role scales by varying the number of virtual machines running that role code
  • Dependencies captured in Model
    • Dependency across roles and resources
    • Connections and contracts among roles and resources

An Example: Multi-Tier Cloud App
  • Example: Photo Processing Service with 2 roles
  • Network load balancer, Virtual IP
  • Front End stateless Web Role: take requests from users
  • Middle-tier Worker Role: process the order
  • Backend storage: Azure Storage, SQL Azure
  • Dynamic scaling: # of role instances by scaling # of VMs
  [Diagram: HTTP/HTTPS -> Load Balancer -> Front-End instances -> Middle-Tier instances -> Windows Azure Storage, SQL Azure; the whole is labelled Cloud Application]

App Model Example
  • Role (VM): scaling boundary
    • Code package to run on a VM
    • Definition: name, type, VM size, endpoints, etc.
    • Configuration: instance, UD, FD, auto scaling, etc.
  • Connections and contracts
  [Diagram: HTTP/HTTPS -> Load Balancer -> Front-End instances -> Middle-Tier instances -> Windows Azure Storage, SQL Azure]

  App Model             Role: Front-End     Role: Middle-Tier
  Code package          FE Code Package     MT Code Package
  Definition
    Type                Web                 Worker
    VM Size             Medium              Large
    Endpoints           External-1          Internal-1
  Configuration
    Instances           3                   5
    Update Domains      3                   4
    Fault Domains       3                   3
    Auto Scaling        Rules               Rules

  • Network Binding: Middle-Tier.Internal-1
  • DBConnection: [photo]
  • Resource: SQLAzure, DBConnectionString: [@photo]

Cloud Fabric

The Fabric Controller (FC)
  • Fabric Controller translates the Cloud Application Model into a running service
  • Keeps the service running
  • Provides upgrade and management capabilities and more
  • The kernel of the cloud operating system
    • Programs, manages and owns all of the datacenter hardware
    • Manages Windows Azure provided building block services
    • Manages all customer applications
  • Inputs:
    • Description of the hardware and network resources it will control
    • App model and binaries for cloud applications
  [Diagram labels: Windows Azure Fabric Controller, Fabric Agent, VM, VM, WS, Hypervisor, Hardware control, Load balancers, Switches, Software control, Highly-available Fabric Controller VM, Cloud App Model]

Deployment Steps by FC
  • Process app model files
    • Determine resource requirements
    • Create role images
  • Allocate compute and network resources
    • Across separate fault and upgrade domains
  • Prepare servers assigned to run the roles
    • Place role images on servers
    • Create virtual machines
    • Start virtual machines and roles
  • Configure networking
    • Dynamic IP addresses (DIPs) assigned to VMs
    • Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs
    • Program load balancers to allow traffic to external endpoints
    • Configure packet filter for VM to VM traffic within application
  [Diagram labels: Allocation across fault and update domains, Load-balancers. The App Model Example diagram and role table are repeated here, with the resource labelled SQLAzureDB.]

FC Deploying an App
  • Front-End Role (Web Role): Count: 3, Fault Domains: 3, Upgrade Domains: 3, Size: Medium
  • Middle-Tier Role (Worker Role): Count: 5, Fault Domains: 3, Upgrade Domains: 4, Size: Large
  [Diagram: www.mycloudapp.net reaches the role instances through the Load Balancer; addresses shown: 10.100.0.36, 10.100.0.113, 10.100.0.122; legend: Upgrade domain]
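As a rough illustration of what the "FC Deploying an App" figure implies, the sketch below spreads each role's instances across its fault domains and upgrade domains. The `Role` class, the `place` function and the round-robin policy are assumptions made for illustration; the deck does not describe the Fabric Controller's actual allocation algorithm. Only the role names, sizes, counts and domain numbers come from the slides.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """Hypothetical container for the per-role settings shown on the slides."""
    name: str
    vm_size: str
    instances: int
    fault_domains: int
    upgrade_domains: int


def place(role: Role) -> list[tuple[int, int, int]]:
    """Return (instance, fault_domain, upgrade_domain) for each instance.

    Assumed policy: plain round-robin, so instance i lands in
    FD i % fault_domains and UD i % upgrade_domains. A real allocator
    would also have to check free cores on each server.
    """
    return [
        (i, i % role.fault_domains, i % role.upgrade_domains)
        for i in range(role.instances)
    ]


# Values taken from the Front-End and Middle-Tier roles in the deck.
front_end = Role("Front-End", "Medium", instances=3, fault_domains=3, upgrade_domains=3)
middle_tier = Role("Middle-Tier", "Large", instances=5, fault_domains=3, upgrade_domains=4)

for role in (front_end, middle_tier):
    placement = place(role)
    per_fd = Counter(fd for _, fd, _ in placement)
    per_ud = Counter(ud for _, _, ud in placement)
    print(role.name, placement)
    print("  instances per fault domain:  ", dict(per_fd))
    print("  instances per upgrade domain:", dict(per_ud))
```

Under this assumed policy, losing any single fault domain removes at most two of the five Middle-Tier instances, and upgrading one upgrade domain at a time takes at most two of them offline.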

[Figure legend: filled cores, empty cores, compute server, fault domain]

FC Automated Management
  • Windows Azure FC monitors the health of roles
    • FC Agent on the server detects if a role dies
    • Restart the role to bring it back to a healthy state
  • If a failed server or FD can't be recovered, FC starts new role instances on available VMs
    • A suitable replacement location is found based on FD and UD requirements
    • Existing role instances are notified of the ...

App Resource Allocation Goals
  • FC primary goal: allocate app roles to available resources while satisfying all hard constraints
    • HW requirements based on size of VM chosen: CPU, memory, storage, network
    • Fault domains, update domains
  • FC secondary goal: satisfy soft constraints
    • Try not to fragment servers, e.g., so that large VMs can't fit on them

Fabric Scheduling Opportunities
  • FC scheduling across all apps is a complex scheduling problem: trying to minimize costs while meeting all customer app constraints
  • Opportunities for improvements and additional features
    • Advanced rules for specifying when to scale out/in
      • Some resources need to be scaled together, and in what ratios
    • Allow scaling up and down in terms of VM size, to automatically figure out the size of VM to use
      • Currently the app model is specific about the resources needed for each role's VM: CPU, memory, network, storage, etc.
      • But customers don't have a good understanding of workload behavior
      • Allow for better managing of resources to reduce app costs
    • Deadlines
    • Gang scheduling
    • and more

Cloud App Modeling Opportunities
  • How to express advanced scheduling features (autoscaling, deadlines, gang scheduling, etc.)
  • Current systems allow developers to define environments in which applications live
  • Need to continue to abstract away infrastructure and focus on application logic
    • Allow devs to focus on their specific problem domain and less on how to configure, deploy, and manage their service
  • Richer runtimes and programming languages
    • See Orleans in ACM Symposium on Cloud Computing 2011 by Microsoft Research

Cloud Storage

Data Storage Options on Windows Azure
  • Platform as a Service (managed services)
    • SQL Database (relational)
    • Table Storage (NoSQL key/attribute store)
    • Blob Storage (unstructured files)
  • Infrastructure as a Service (virtual machines)
    • SQL Server, MySQL, PostgreSQL, RavenDB, MongoDB, CouchDB, neo4j, Redis, Riak, etc.

Storage topics
  • Understanding and Optimizing Costs
    • Need to continually optimize costs at scale
  • Location Durability
  • Durability vs Performance vs Consistency

Understanding and Optimizing Hosting Cost
  • COGS: data center, power, cooling, operations, reserving/occupying space, etc.
  • Continuous hardware design
    • New hardware design (SKU) at least every year (hardware lasts for 3-4 years)
    • Track and take advantage of new technology
  • Reducing WIP (Work in Progress)
    • Time from order arriving on dock to the time it is fully used
    • Time to Build, Time to Live, Time to Fill
    • Need to incrementally and efficiently add capacity

Multi-tenancy
  • Blend different workloads and customers to reduce COGS
    • Keeps overprovisioning overheads low due to economies of scale
    • Fully utilize resources by blending different workloads (e.g., disk GBs vs IOs)
  • Customers need consistent performance
    • Deal with spikes and varying workloads, deal with background jobs, and seamlessly load balance hot spots away
    • Appropriately throttle and provide isolation among customers

Reduce Costs using Erasure Coding
  • At exabytes+ the savings are significant

  Scheme                               3 Replica    Standard EC    LRC
  Storage overhead                     3x           1.5x           1.29x
  Reduction vs. column to the left                  50%            14%

  Erasure Coding in Windows Azure Storage, USENIX Annual Technical Conference, June 2012
  https://www.usenix.org/conference/usenixfederatedconferencesweek/erasure-

Location Durability
  • How far apart should your data be replicated?
  • Some data is fine to be kept within a single region (replicas are kept within a mile or so of each other)
  • From a 2011 Netflix presentation (http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-Cassandra)
  • Whereas other customers require replicas to be kept

    100s of miles apart from each other for DR (disaster recovery)
    • Ability to recover from major disasters, including natural and man-made disasters

Windows Azure Storage: Two Types of Durability Offered
  • Local Redundant Storage
    • 3 copies (or EC'd) within region
    • Commit quickly within region
  • Geo Redundant Storage
    • 6 copies (or EC'd) across 2 regions 100s of miles apart
    • Commit quickly within primary region
    • Async geo-replication to secondary region
    • Allow customers read access to secondary region
  [Diagram: Local Redundant Storage with 3 replicas within a region; Geo Redundant Storage with async geo-replication from N. Central Region to S. Central Region]

Decisions about State during App Design
  • Trade off durability vs performance vs consistency
  • What state to keep within a single region only?
    • Data that can be regenerated, intermediate data, logs, ...
    • Benefit is lower costs and higher bandwidth for processing the data
  • Then, for state that needs to be geo-redundant for higher durability:
    • What state to commit quickly in the primary region and then asynchronously to a secondary region?
      • Data that needs consistent low latencies
      • Large data updates (need flexibility when consuming cross-regional bandwidth)
    • What state must be committed across multiple regions before the update is deemed successful?
      • Credentials, critical service metadata, ...

Coordinating State Across Components
  • Many applications use several data services (e.g., Blobs, NoSQL Tables, SQL, etc.)
  • Challenges
    • Coordinated consistent view of the data across data services
    • Point-in-time recovery
    • Reasoning about a consistent view at massive scale and across geo redundancy

Summary
  • Promise of the Cloud
    • Cloud abstracts away infrastructure to allow developers to focus on application logic
    • Cloud provides building block services to ease and speed app development
    • Cloud provides elasticity to reduce costs and improve user experience
  • Cloud is in its infancy
    • Cloud demand is more than doubling each year
    • Just starting to scratch the surface of its potential
  • Many areas ripe for research
    • Cloud Application Modeling
    • Fabric Scheduling of Cloud Applications
    • Continually Optimizing Costs
    • Location Durability
    • and many more
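On the "Continually Optimizing Costs" item: the storage overheads quoted on the erasure coding slide (3x, 1.5x, 1.29x) and the 50% and 14% figures can be checked with a little arithmetic. The sketch below assumes overhead = (data fragments + parity fragments) / data fragments. The fragment counts used (6 data + 3 parity for standard erasure coding, 14 data + 2 local + 2 global parities for LRC) are assumptions chosen because they reproduce the numbers on the slide; the slides themselves do not state them.

```python
def overhead(data_fragments: int, parity_fragments: int) -> float:
    """Stored bytes per byte of user data for an erasure-coded layout."""
    return (data_fragments + parity_fragments) / data_fragments


def saving(before: float, after: float) -> float:
    """Fractional reduction in stored bytes when moving from `before` to `after`."""
    return (before - after) / before


three_replica = 3.0            # three full copies, as on the slide
standard_ec = overhead(6, 3)   # assumed: 6 data + 3 parity fragments
lrc = overhead(14, 2 + 2)      # assumed: 14 data + 2 local + 2 global parities

print(f"3 replica:   {three_replica:.2f}x")
print(f"standard EC: {standard_ec:.2f}x, "
      f"saves {saving(three_replica, standard_ec):.0%} vs 3 replica")
print(f"LRC:         {lrc:.2f}x, "
      f"saves {saving(standard_ec, lrc):.0%} vs standard EC")
```

Under these assumed layouts, each exabyte of user data needs roughly 0.21 EB less raw capacity with LRC than with standard erasure coding, which is the kind of saving the slide calls significant at exabyte scale.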

More Information on Windows Azure
  • http://www.windowsazure.com/
  • Free month of Windows Azure: http://www.windowsazure.com/en-us/pricing/free-trial/

Windows Azure Publications
  • Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011
    http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
  • Erasure Coding in Windows Azure Storage, USENIX Annual Technical Conference, June 2012
    https://www.usenix.org/conference/usenixfederatedconferencesweek/erasure-coding-windowsazure-storage

We are hiring full-time and interns: [email protected]
