VMware presentation - IPDPS

HPC Cloud Bad; HPC in the Cloud Good
Josh Simons, Office of the CTO, VMware, Inc.
IPDPS 2013, Cambridge, Massachusetts

Post-Beowulf Status Quo
- Enterprise IT vs. HPC IT

Closer to True Scale (NASA)

Converging Landscape
- Convergence is driven by concerns that Enterprise IT and HPC IT increasingly share, e.g.:
  - Scale-out management
  - Power & cooling costs
  - Dynamic resource management
  - Desire for high utilization
  - Parallelization for multicore
  - Big Data analytics
  - Application resiliency
  - Low-latency interconnects
  - Cloud computing

Agenda
- HPC and public cloud: limitations of the current approach
- Cloud HPC performance: throughput, Big Data / Hadoop, MPI / RDMA
- HPC in the cloud: a more promising model

Server Virtualization
[diagram: application / operating system / hardware stack, shown without and with a virtualization layer]
- Hardware virtualization presents a complete x86 platform to the virtual machine
- Allows multiple applications to run in isolation within virtual machines on the same physical machine
- The virtualization layer provides direct access to hardware resources, giving much greater performance than software emulation

HPC Performance in the Cloud
Magellan final report: http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Magellan_final_report.pdf

Biosequence Analysis: BLAST
C. Macdonell and P. Lu, "Pragmatics of Virtual Machines for High-Performance Computing: A Quantitative Study of Basic Overheads," in Proc. of the High Performance Computing & Simulation Conf., 2007.

Biosequence Analysis: HMMer

Molecular Dynamics: GROMACS

EDA Workload Example
[diagram: the same EDA applications running natively (OS on hardware) and in VMs (OS on virtualization layer on hardware)]
- Results ranged from 6% slower to 2% faster in the virtual configuration

Memory Virtualization

HPL (GFLOPS, higher is better):

                Native     Virtual, EPT on    Virtual, EPT off
  4K pages      37.04      36.04 (97.3%)      36.22 (97.8%)
  2MB pages     37.74      38.24 (100.1%)     38.42 (100.2%)

RandomAccess (higher is better):

                Native     Virtual, EPT on    Virtual, EPT off
  4K pages      0.01842    0.0156 (84.8%)     0.0181 (98.3%)
  2MB pages     0.03956    0.0380 (96.2%)     0.0390 (98.6%)

EPT = Intel Extended Page Tables (hardware page table virtualization; the AMD equivalent is RVI)
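The 2 MB rows above correspond to the guest backing its memory with large pages. As a hedged illustration (not from the slides), the sketch below shows one way a Linux guest application can request 2 MB pages explicitly; it assumes the guest kernel has hugepages reserved (e.g. via /proc/sys/vm/nr_hugepages).

    /* Hypothetical sketch: request 2 MB pages from a Linux guest.
     * Assumes hugepages have been reserved, e.g.:
     *   echo 128 > /proc/sys/vm/nr_hugepages                       */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL << 21;           /* 64 huge pages of 2 MB = 128 MB */

        /* MAP_HUGETLB backs the mapping with 2 MB pages, cutting TLB
         * pressure in the same way the 2 MB rows of the tables above do. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* usually means no hugepages reserved */
            return 1;
        }

        memset(buf, 0, len);               /* touch the memory */
        printf("mapped %zu MB backed by 2 MB pages\n", len >> 20);
        munmap(buf, len);
        return 0;
    }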

vNUMA
[diagram: a wide application in a VM spanning two sockets, with the ESXi hypervisor exposing the underlying per-socket memory as a virtual NUMA topology]

vNUMA Performance Study
"Performance Evaluation of HPC Benchmarks on VMware's ESX Server," Ali, Q., Kiriansky, V., Simons, J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011.
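As a hedged illustration of what vNUMA exposes (not from the slides or the cited paper), the sketch below queries the virtual NUMA topology from inside a Linux guest using libnuma; the node counts and sizes it reports are whatever the hypervisor presents.

    /* Hypothetical sketch: inspect the (virtual) NUMA topology a guest sees.
     * Assumes a Linux guest with libnuma; build with: cc numa_probe.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            printf("no NUMA support visible in this guest\n");
            return 0;
        }

        int nodes = numa_num_configured_nodes();
        printf("guest sees %d NUMA node(s)\n", nodes);

        /* With vNUMA, a wide VM sees more than one node, so NUMA-aware
         * placement (e.g. numa_alloc_onnode) works as it would on bare metal. */
        for (int node = 0; node < nodes; node++) {
            long long free_bytes = 0;
            long long size = numa_node_size64(node, &free_bytes);
            printf("  node %d: %lld MB total, %lld MB free\n",
                   node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }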

Compute: GPGPU Experiment
- General-purpose (GP) computation with GPUs: CUDA benchmarks run via VM DirectPath I/O
- Small kernels: DSP, financial, bioinformatics, fluid dynamics, image processing
- RHEL 6; NVIDIA (Quadro 4000) and AMD GPUs
- Generally 98%+ of native performance (worst case was 85%)
- Currently looking at larger-scale financial and bioinformatics applications

MapReduce Architecture
[diagram: map tasks read input splits from HDFS, reduce tasks write results back to HDFS]

vHadoop Approaches
- Why virtualize Hadoop?
  - Simplified Hadoop cluster configuration and provisioning
  - Support Hadoop usage in existing virtualized datacenters
  - Support multi-tenant environments
- Project Serengeti
[diagram: deployment options per node, from a single combined VM running MapReduce and HDFS to separate compute-node and data-node VMs]

vHadoop Benchmarking
- Collaboration with AMAX; seven-node Hadoop cluster (AMAX ClusterMax)
- Standard tests: Pi, DFSIO, TeraGen / TeraSort
- Configurations: native, one VM per host, two VMs per host
- Details:
  - Two-socket Intel X5650, 96 GB, Mellanox 10 GbE, 12x 7200 rpm SATA
  - RHEL 6.1, 6- or 12-vCPU VMs, vmxnet3
  - Cloudera CDH3u0, replication = 2, max 40 map and 10 reduce tasks per host
  - Each physical host treated as a rack in Hadoop's topology description
  - ESXi 5.0 with a development Mellanox driver; disks passed to VMs via raw device mapping (RDM)

Benchmarks
- Pi: direct-exec Monte Carlo estimation of pi; # map tasks = # logical processors; 1.68 T samples; pi estimated as ~4*R/(R+G) (~ 22/7 on the slide). A sketch of this estimate follows the list.
- TestDFSIO: streaming write and read of 1 TB; more tasks than processors
- TeraSort: three phases (TeraGen, TeraSort, TeraValidate); 10 B or 35 B records of 100 bytes each (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O
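To make the Pi benchmark's estimate concrete, here is a hedged, serial stand-in for what the map tasks collectively compute; the sample count, seed, and structure are illustrative and not taken from the benchmark's source.

    /* Hypothetical sketch of the Monte Carlo estimate behind the Pi benchmark:
     * sample points in the unit square, count hits R inside the quarter circle,
     * and report pi ~= 4*R/(R+G). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long samples = 10 * 1000 * 1000;  /* far fewer than the 1.68 T on the slide */
        long r = 0;                             /* R: hits inside the quarter circle */

        srand(42);
        for (long i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                r++;
        }

        /* G = samples - R, so 4*R/(R+G) reduces to 4*R/samples */
        printf("pi ~= %.6f\n", 4.0 * r / samples);
        return 0;
    }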

[chart: ratio to native (lower is better) for the 1-VM and 2-VM configurations across Pi, TestDFSIO write, TestDFSIO read, TeraGen, TeraSort, and TeraValidate at 1 TB and 3.5 TB]

"A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5"
http://www.vmware.com/files/pdf/VMW-Hadoop-Performance-vSphere5.pdf

Kernel Bypass Model
[diagram: a conventional application uses sockets through the guest TCP/IP stack, driver, and vmkernel to reach the hardware, while an RDMA application in userspace accesses the RDMA hardware directly, bypassing both kernels]
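To make the kernel-bypass path concrete, the sketch below shows a hedged, minimal libibverbs setup sequence of the kind an RDMA application performs in userspace; it is illustrative only (not VMware code) and assumes an RDMA-capable HCA is visible to the OS, whether natively, via SR-IOV, or via VM DirectPath I/O as discussed later.

    /* Hypothetical sketch of userspace RDMA setup with libibverbs,
     * illustrating kernel bypass: once a buffer is registered, the HCA
     * moves data to and from it without kernel involvement.
     * Build with: cc rdma_sketch.c -libverbs */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);  /* user-level HCA handle */
        if (!ctx) {
            fprintf(stderr, "failed to open %s\n", ibv_get_device_name(devs[0]));
            return 1;
        }
        struct ibv_pd *pd = ibv_alloc_pd(ctx);               /* protection domain */

        /* Register a buffer: the HCA can now DMA into it directly (zero copy). */
        size_t len = 1 << 20;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!pd || !mr) {
            fprintf(stderr, "PD allocation or memory registration failed\n");
            return 1;
        }

        /* Completion queue polled from userspace: no syscall on the data path. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        printf("%s ready: lkey=0x%x rkey=0x%x\n",
               ibv_get_device_name(devs[0]), mr->lkey, mr->rkey);

        /* A full application would create a queue pair, exchange QPNs and rkeys
         * out of band, post work requests with ibv_post_send(), and reap
         * completions with ibv_poll_cq(). */
        ibv_destroy_cq(cq);
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }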

Virtual Infrastructure RDMA
- Distributed services within the platform, e.g.:
  - vMotion (live migration)
  - Inter-VM state mirroring for fault tolerance
  - Virtually shared, DAS-based storage fabric
- All would benefit from decreased latency, increased bandwidth, and CPU offload

vMotion/RDMA Performance
- Network throughput during vMotion: 14.18 Gbps with RDMA vs. 10.84 Gbps with TCP/IP
- Total vMotion time: 45.31 s with RDMA vs. 70.63 s with TCP/IP (36% faster)
- Pre-copy bandwidth: 432,757.73 pages/sec with RDMA vs. 330,813.66 pages/sec with TCP/IP (30% higher)
- Destination CPU utilization (% of a core used by vMotion): 92% lower with RDMA
- Source CPU utilization (% of a core used by vMotion): 84% lower with RDMA

Guest OS RDMA
- RDMA access from within a virtual machine
- Scale-out middleware and applications are increasingly important in the Enterprise: memcached, redis, Cassandra, MongoDB, GemFire Data Fabric, Oracle RAC, IBM pureScale
- Big Data is an important emerging workload: Hadoop, Hive, Pig, etc.
- And, increasingly, HPC

SR-IOV Virtual Function VM DirectPath I/O
- Single-Root I/O Virtualization (SR-IOV): a PCI-SIG standard
- A physical (IB/RoCE/iWARP) HCA can be shared between VMs or used by the ESXi hypervisor
- Virtual Functions (VFs) are directly assigned to VMs; each guest runs an OFED stack over an RDMA HCA VF driver
- The Physical Function (PF) is controlled by the hypervisor; device DMA goes through the I/O MMU
- Still VM DirectPath, which is incompatible with several important virtualization features

Paravirtual RDMA HCA (vRDMA)
- New paravirtualized device exposed to the virtual machine; implements the Verbs interface
- Guest runs an OFED stack over a vRDMA HCA device driver
- Device emulated in the ESXi hypervisor: translates Verbs from the guest into Verbs issued to the ESXi OFED stack
- Guest physical memory regions are mapped to ESXi and passed down to the physical RDMA HCA
- Zero-copy DMA directly from/to guest physical memory; completions and interrupts are proxied by the device emulation
- "Holy Grail" of RDMA options for vSphere VMs

InfiniBand Bandwidth with VM DirectPath I/O
[chart: bandwidth (MB/s, axis up to 3500) vs. message size from 2 bytes to 8 MB, for Send and RDMA Read, native vs. ESXi]

RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5, April 2011
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012

Latency with VM DirectPath I/O (RDMA Read, Polling)
Half round-trip latency (microseconds) for small messages:

  MsgSize (bytes)   Native   ESXi ExpA
  2                 2.28     2.98
  4                 2.28     2.98
  8                 2.28     2.98
  16                2.27     2.96
  32                2.28     2.98
  64                2.28     2.97
  128               2.32     3.02
  256               2.5      3.19

[chart: half round-trip latency vs. message size from 2 bytes to 8 MB, native vs. ESXi ExpA, log scale]

Latency with VM DirectPath I/O (Send/Receive, Polling)
Half round-trip latency (microseconds) for small messages:

  MsgSize (bytes)   Native   ESXi ExpA
  2                 1.35     1.75
  4                 1.35     1.75
  8                 1.38     1.78
  16                1.37     2.05
  32                1.38     2.35
  64                1.39     2.9
  128               1.5      4.13
  256               2.3      2.31

[chart: half round-trip latency vs. message size from 2 bytes to 8 MB, native vs. ESXi ExpA, log scale]

Intel 2009 Experiments
- Hardware: eight two-socket 2.93 GHz X5570 (Nehalem-EP) nodes, 24 GB each; dual-ported Mellanox DDR InfiniBand adaptor; Mellanox 36-port switch
- Software: vSphere 4.0 (current version is 5.1); Platform Open Cluster Stack (OCS) 5 (native and guest); Intel compilers 11.1; HPCC 1.3.1; STAR-CD V4.10.008_x86

HPCC Virtual to Native Run-time Ratios (Lower is Better)
[chart: run-time ratios for the HPCC benchmarks at 2n16p, 4n32p, and 8n64p]
Data courtesy of Marco Righini, Intel Italy

Point-to-point Message Size Distribution: STAR-CD
Source: http://www.hpcadvisorycouncil.com/pdf/CD_adapco_applications.pdf

Collective Message Size Distribution: STAR-CD
Source: http://www.hpcadvisorycouncil.com/pdf/CD_adapco_applications.pdf

STAR-CD Virtual to Native Run-time Ratios (Lower is Better)
[chart: STAR-CD A-Class model on 8n32p; Physical = 1.00, with the ESX4 1-socket and 2-socket configurations at 1.15 and 1.19]

Data courtesy of Marco Righini, Intel Italy

Software Defined Networking (SDN) Enables Network Virtualization
- Telephony analogy: a fixed phone number (650.555.1212) once identified a physical location; wireless telephony decoupled the identifier from the location
- Networking: an IP address (e.g. 192.168.10.1) has traditionally implied a location; VXLAN decouples the identifier from the location

Data Center Networks: Traffic Trends
[diagram: traffic shifting from north/south (to and from the WAN/Internet) toward east/west within the data center]

Data Center Networks: the Trend to Fabrics
[diagram: data center network topologies, each connected to the WAN/Internet, evolving toward a flatter fabric]

Network Virtualization and RDMA
- SDN: decouples the logical network from physical hardware; encapsulates Ethernet in IP, adding layers (a header-layout sketch follows below); flexibility and agility are the primary goals
- RDMA: directly accesses physical hardware; maps hardware directly into userspace, removing layers; performance is the primary goal
- Is there any hope of combining the two? A converged datacenter supporting both SDN management and decoupling along with RDMA
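As a hedged illustration of the "more layers" point (not from the slides), the sketch below lays out the 8-byte VXLAN header from RFC 7348 that, together with an outer Ethernet/IP/UDP header, wraps each guest frame; the VNI value used is hypothetical.

    /* Hypothetical sketch of the extra layer VXLAN adds (RFC 7348): the
     * original guest Ethernet frame is carried behind an outer
     * Ethernet/IP/UDP header plus this 8-byte VXLAN header. */
    #include <stdint.h>
    #include <stdio.h>

    struct vxlan_hdr {
        uint8_t flags;          /* 0x08 => the VNI field is valid */
        uint8_t reserved1[3];
        uint8_t vni[3];         /* 24-bit VXLAN Network Identifier: the logical "identifier" */
        uint8_t reserved2;      /* the outer IP header supplies the physical "location" */
    };

    int main(void)
    {
        uint32_t vni = 5001;    /* hypothetical tenant segment ID */
        struct vxlan_hdr h = { .flags = 0x08 };

        h.vni[0] = (vni >> 16) & 0xff;
        h.vni[1] = (vni >> 8) & 0xff;
        h.vni[2] = vni & 0xff;

        printf("VXLAN header is %zu bytes; VNI %u rides over UDP port 4789\n",
               sizeof h, vni);
        return 0;
    }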

Secure Private Cloud for HPC
[diagram: research groups 1..m, users, and IT reach research clusters 1..n through VMware vCloud Director (user portals, catalogs, security) and the VMware vCloud API (programmatic control and integrations), with VMware vShield, per-cluster VMware vCenter Server and vSphere instances, and connections to public clouds]

Massive Consolidation

Run Any Software Stacks
- Support groups with disparate software requirements, including root access
[diagram: App A on OS A and App B on OS B, each in its own VM on the virtualization layer, across multiple hosts]

virtualization layer layer virtualization virtualization layer layer hardware hardware hardware hardware hardware hardware Separate workloads Secure multi-tenancy Fault isolation and sometimes performance 42 App A App B OS A OS B

virtualization virtualization layer layer virtualization virtualization layer layer virtualization virtualization layer layer hardware hardware hardware hardware hardware hardware Live Virtual Machine Migration (vMotion) 43 Use Resources More Efficiently Avoid killing or pausing jobs App C Increase

Use Resources More Efficiently
- Avoid killing or pausing jobs
- Increase overall throughput
[diagram: VMs for apps A, B, and C consolidated onto shared hosts instead of leaving hosts idle]

Workload Agility
[diagram: applications in VMs rearranged across hosts via the virtualization layer]

Multi-tenancy with Resource Guarantees
- Define policies to manage resource sharing between groups
[diagram: VMs for different groups (apps A, B, C) placed across shared hosts according to policy]

Protect Applications from Hardware Failures
- Reactive fault tolerance: fail and recover
[diagram: App A's VM recovers on another host after a hardware failure]

Protect Applications from Hardware Failures
- Proactive fault tolerance: move and continue
[diagram: MPI ranks (MPI-0, MPI-1, MPI-2), each in its own VM, migrate away from a failing host and keep running]

Unification of IT Infrastructure

HPC in the (Mainstream) Cloud
[diagram: workload mix in the mainstream cloud: throughput workloads alongside MPI / RDMA workloads]

Summary
- HPC performance in the cloud:
  - Throughput applications perform very well in virtual environments
  - MPI / RDMA applications will see small to very significant slowdowns in virtual environments, depending on scale and message-traffic characteristics
- Enterprise and HPC IT requirements are converging, though less so for HEC (e.g. Exascale)
- Vendor and community investment in Enterprise solutions eclipses that made in HPC, due to market-size differences
- The HPC community can benefit significantly from adopting Enterprise-capable IT solutions, and from working to influence Enterprise solutions to more fully address HPC requirements
- Private and community cloud deployments provide significantly more value than cloud bursting from physical infrastructure to public cloud
