Scaling Out on Wall Street

High Productivity Computing: Taking HPC Mainstream
Lee Grant
Technical Solutions Professional, High Performance Computing
[email protected]

Challenge: High Productivity Computing
"Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users. A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems."
— 2004 High-End Computing Revitalization Task Force, Office of Science and Technology Policy, Executive Office of the President

The Data Pipeline

- Data Gathering: raw data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
- Discovery and Browsing: raw data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
- Science Exploration: science variables and data summaries for early science exploration and hypothesis testing. Similar to discovery and browsing, but with science variables computed via gap filling, unit conversions, or simple equations.
- Domain-specific analyses: science variables combined with models, other specialized code, or statistics for deep science understanding.
- Scientific Output: scientific results via packages such as MATLAB or R; special rendering packages such as ArcGIS; paper preparation.

The Free Lunch Is Over for Traditional Software

[Chart: operations per second for serial code. A hypothetical 24 GHz single core versus what actually shipped: 12 GHz x 1 core, 6 GHz x 1 core, 3 GHz x 1, 2, 4 and 8 cores. The gap is the additional operations per second available only if code can take advantage of concurrency.]

No free lunch for traditional software: without highly concurrent software, it won't get any faster!

Microsoft's Vision for HPC

Provide the platform, tools and broad ecosystem to reduce the complexity of HPC by making parallelism more accessible, to address future computational needs.

- Reduced Complexity: ease deployment for larger-scale clusters; simplify management for clusters of all scales; integrate with existing infrastructure.
- Mainstream HPC: address the needs of traditional supercomputing; address emerging cross-industry computation trends; enable non-technical users to harness the power of HPC.
- Developer Ecosystem: increase the number of parallel applications and codes; offer a choice of parallel development tools, languages and libraries; drive a larger universe of developers and ISVs.

Microsoft HPC++ Solution

- Application Benefits: the most productive distributed application development environment
- Cluster Benefits: a complete HPC cluster platform integrated with the enterprise infrastructure
- System Benefits: a cost-effective, reliable and high-performance server operating system

Windows HPC Server 2008
- Integrated security via Active Directory
- Support for batch, interactive and service-oriented applications
- High-availability scheduling
- Interoperability via OGF's HPC Basic Profile
- Rapid large-scale deployment and a built-in diagnostics suite
- Integrated monitoring, management and reporting
- Familiar UI and rich scripting interface

Storage
- Access to SQL, Windows and UNIX file servers
- Key parallel file server vendor support (GPFS, Lustre, Panasas)
- In-memory caching options

MPI
- MS-MPI stack based on the MPICH2 reference implementation
- Performance improvements for RDMA networking and multi-core shared memory
- MS-MPI integrated with Windows Event Tracing

Systems Management
- List or heat-map view of the cluster at a glance
- Group compute nodes based on hardware, software and custom attributes; act on groupings
- Receive alerts for failures
- Track long-running operations and access operation history
- Pivoting enables correlating nodes and jobs together

Job Scheduling
- Integrated job scheduling
- Service-oriented HPC applications
- Expanded support for job templates
- Improved interoperability with mixed IT infrastructure

Node/Socket/Core Allocation

Windows HPC Server can help your application make the best use of multi-core systems.

[Diagram: two multi-core nodes, each with multiple sockets of multiple cores, showing how the scheduler places jobs by node, socket or core:
    J1: /numsockets:3 /exclusive:false
    J2: /numnodes:1
    J3: /numcores:4 /exclusive:false]

Job submission: 3 methods — command line, programmatic, web interface.

Command line:

    job submit /headnode:Clus1 /numprocessors:124 /nodegroup:Matlab
    job submit /corespernode:8 /numnodes:24
    job submit /failontaskfailure:true /requestednodes:N1,N2,N3,N4
    job submit /numprocessors:256 mpiexec \\share\mpiapp.exe

(Complete PowerShell system-management commands are available as well.)

Programmatic: support for C++ and .NET languages.

Web interface: Open Grid Forum HPC Basic Profile.

C# example — a parametric sweep job:

    using Microsoft.Hpc.Scheduler;

    class Program
    {
        static void Main()
        {
            IScheduler store = new Scheduler();
            store.Connect("localhost");

            ISchedulerJob job = store.CreateJob();
            job.AutoCalculateMax = true;
            job.AutoCalculateMin = true;

            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "ping -n *";
            task.IsParametric = true;
            task.StartValue = 1;
            task.EndValue = 10000;
            task.IncrementValue = 1;
            task.MinimumNumberOfCores = 1;
            task.MaximumNumberOfCores = 1;

            job.AddTask(task);
            store.SubmitJob(job, @"hpc\user", "[email protected]");
        }
    }
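The parametric sweep in the C# job fans one task template out over an index range (StartValue to EndValue by IncrementValue), with the `*` placeholder replaced by each index. As a rough analogy outside the cluster API, here is a minimal Python sketch; `run_sweep` and the squaring task are illustrative stand-ins, not part of the HPC Server API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sweep(task, start, end, increment, max_workers=4):
    """Expand one task template over a parameter range, the way an
    IsParametric task expands into many scheduled task instances."""
    values = range(start, end + 1, increment)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves sweep order even though tasks run concurrently
        return list(pool.map(task, values))

# Each "task instance" receives its sweep index, like the * placeholder
# in the task command line above.
results = run_sweep(lambda i: i * i, start=1, end=10, increment=1)
```

The scheduler does the same expansion on the cluster, with each instance placed on whatever core satisfies the task's minimum/maximum core constraints.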

Scheduling MPI jobs:

    job submit /numprocessors:7800 mpiexec hostname

Start time: 1 second; completion time: 27 seconds.

NetworkDirect

A new RDMA networking interface built for speed and stability:
- 2 usec latency, 2 GB/sec bandwidth on ConnectX
- OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
- Verbs-based design for a close fit with native, high-performance networking interfaces
- Kernel bypass: equal to hardware-optimized stacks for MPI micro-benchmarks

[Diagram: an MPI app on MS-MPI can reach the wire three ways — TCP/Ethernet through Windows Sockets (Winsock + WSD) and the kernel-mode NDIS/mini-port driver stack, Winsock Direct through a hardware provider, or NetworkDirect through a user-mode access layer that bypasses the kernel. A socket-based app uses the TCP path. Components are color-coded as app (ISV), CCP, OS, or IHV. Each path ends at the networking hardware.]

TOP500 results:
- Spring 2008, NCSA, #23: 9,472 cores, 68.5 TF, 77.7% efficiency
- Spring 2008, Umeå, #40: 5,376 cores, 46 TF, 85.5%
- Spring 2008, Aachen, #100: 2,096 cores, 18.8 TF, 76.5%
- Fall 2007, Microsoft, #116: 2,048 cores, 11.8 TF, 77.1%

- Spring 2007, Microsoft, #106: 2,048 cores, 9 TF, 58.8% (Windows Compute Cluster 2003)
- Spring 2006, NCSA, #130: 896 cores, 4.1 TF

A 30% efficiency improvement from Windows Compute Cluster Server 2003 to Windows HPC Server 2008 (November 2008 Top500).

Customers

"It is important that our IT environment is easy to use and support. Windows HPC is improving our performance and manageability." — Dr. J.S. Hurley, Senior Manager, Head Distributed Computing, Networked Systems Technology, The Boeing Company

"Ferrari is always looking for the most advanced technological solutions and, of course, the same applies for software and engineering. To achieve industry-leading power-to-weight ratios, reduction in gear-change times, and revolutionary aerodynamics, we can rely on Windows HPC Server 2008. It provides a fast, familiar, high-performance computing platform for our users, engineers and administrators." — Antonio Calabrese, Responsabile Sistemi Informativi (Head of Information Systems), Ferrari

"Our goal is to broaden HPC availability to a wider audience than just power users. We believe that Windows HPC will make HPC accessible to more people, including engineers, scientists, financial analysts, and others, which will help us design and test products faster and reduce costs." — Kevin Wilson, HPC Architect, Procter & Gamble

"We are very excited about utilizing the Cray CX1 to support our research activities," said Rico Magsipoc, Chief Technology Officer for the Laboratory of Neuro Imaging. "The work that we do in brain research is computationally intensive but will ultimately have a huge impact on our understanding of the relationship between brain structure and function, in both health and disease. Having the power of a Cray supercomputer that is simple and compact is very attractive and necessary, considering the physical constraints we face in our data centers today."

Porting Unix Applications

Windows Subsystem for UNIX-based Applications:
- Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
- Visual Studio extensions for debugging POSIX applications
- Support for 32- and 64-bit applications

Recent port of the WRF weather model:
- 350K lines of Fortran 90 and C, using MPI and OpenMP
- Traditionally developed for Unix HPC systems
- Two dynamical cores, full range of physics options

Porting experience:
- Fewer than 750 lines of code changed in makefiles/scripts
- Level of effort similar to a port to any new version of UNIX
- Performance on par with the Linux systems

India Interoperability Lab, MTC Bangalore:
- Industry solutions for interop, jointly with partners
- HPC utility computing architecture
- Open-source applications on HPC Server 2008 (NAMD, DL_POLY, GROMACS)

High Productivity Modeling

Languages/Runtimes:
- C++, C#, VB
- F#, Python, Ruby, JScript
- Fortran (Intel, PGI)
- OpenMP, MPI
- .NET Framework: LINQ (language-integrated query), Dynamic Language Runtime, Fx/JIT/GC improvements, native support for web services

Team Development:
- Team portal: version control, scheduled builds, bug tracking
- Test and stress generation
- Code analysis, code coverage
- Performance analysis

IDE:
- Rapid application development
- Parallel debugging
- Multiprocessor builds
- Workflow design

Microsoft Parallel Computing Technologies

[Quadrant chart spanning task concurrency vs. data parallelism, and local vs. distributed/cloud computing. Technologies: IFx/CCR, TPL/PPL, PLINQ, OpenMP, CDS, WCF, WF, Cluster-TPL, Cluster-PLINQ, MPI/MPI.NET, Cluster SOA. Example workloads: robotics-based manufacturing, Maestro assembly line, Silverlight Olympics viewer, automotive control system, internet-based photo services, ultrasound imaging equipment, media encode/decode, image processing/enhancement, data visualization, enterprise search, OLTP, collaboration, animation/CGI rendering, weather forecasting, seismic monitoring, oil exploration.]

Cluster SOA

[Diagram: head nodes host WCF brokers supporting SOA functionality; each compute node performs user-defined function (UDF) tasks as called from the WCF broker.]

SOA broker performance: low latency, high throughput.

[Charts: round-trip latency (ms, up to ~1.6) and messages/sec at 25 ms compute time (up to ~6,000) versus number of clients (up to 200), for 0K, 1K, 4K and 16K ping-pong message sizes, comparing WSD, IPoIB and GigE.]

MPI.NET

Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!). Natural expression of MPI in C#:

    if (world.Rank == 0)
        world.Send("Hello, World!", 1, 0);
    else
    {
        string msg = world.Receive<string>(0, 0);
    }

    string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);

    double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0)
                / totalDartsThrown;

Negligible overhead (relative to C) over TCP. Allinea DDT Visual Studio debugger add-in.
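The Reduce call in the MPI.NET snippet is the combining step of a Monte Carlo estimate of π: each rank throws darts at the unit square and counts hits inside the quarter-circle. A small self-contained Python sketch of the same computation, with the cross-rank MPI reduction replaced by a plain fold over per-"rank" counts (function and variable names here are illustrative):

```python
import random
from functools import reduce

def throw_darts(n, seed):
    """Count darts that land inside the unit quarter-circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

# Stand-in for comm.Reduce: each "rank" contributes its local count,
# combined with the same (x, y) -> x + y operator as the C# lambda.
darts_per_rank = 100_000
per_rank = [throw_darts(darts_per_rank, seed) for seed in range(4)]
darts_in_circle = reduce(lambda x, y: x + y, per_rank)
pi_estimate = 4.0 * darts_in_circle / (4 * darts_per_rank)
```

With 400,000 darts the estimate lands within a few thousandths of π; on a real cluster, `comm.Reduce` performs the same sum across ranks in logarithmic depth.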

NetPIPE performance:

[Chart: throughput (Mbps, log scale) versus message size (1 byte to 10 MB) for C (native), C# (primitive) and C# (serialized).]

Parallel Extensions to .NET

Declarative data parallelism (PLINQ):

    var q = from n in names.AsParallel()
            where n.Name == queryInfo.Name &&
                  n.State == queryInfo.State &&
                  n.Year >= yearStart && n.Year <= yearEnd
            orderby n.Year ascending
            select n;

Imperative data and task parallelism (TPL):

    Parallel.For(0, n, i => {
        result[i] = compute(i);
    });

Plus data structures and coordination constructs.

Example: Tree Walk

Sequential:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        ProcessNode(tree.Left, action);
        ProcessNode(tree.Right, action);
        action(tree.Data);
    }

Thread pool:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;

        // Flatten the tree into a queue of data items.
        Stack<Tree<T>> nodes = new Stack<Tree<T>>();
        Queue<T> data = new Queue<T>();
        nodes.Push(tree);
        while (nodes.Count > 0)
        {
            Tree<T> node = nodes.Pop();
            data.Enqueue(node.Data);
            if (node.Left != null) nodes.Push(node.Left);
            if (node.Right != null) nodes.Push(node.Right);
        }

        // Drain the queue with one worker per processor.
        using (ManualResetEvent mre = new ManualResetEvent(false))
        {
            int waitCount = Environment.ProcessorCount;
            WaitCallback wc = delegate
            {
                bool gotItem;
                do
                {
                    T item = default(T);
                    lock (data)
                    {
                        if (data.Count > 0)
                        {
                            item = data.Dequeue();
                            gotItem = true;
                        }
                        else gotItem = false;
                    }
                    if (gotItem) action(item);
                } while (gotItem);
                if (Interlocked.Decrement(ref waitCount) == 0)
                    mre.Set();
            };
            for (int i = 0; i < Environment.ProcessorCount - 1; i++)
                ThreadPool.QueueUserWorkItem(wc);
            wc(null);
            mre.WaitOne();
        }
    }

Parallel Extensions (with Task):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Task t = Task.Create(delegate { ProcessNode(tree.Left, action); });
        ProcessNode(tree.Right, action);
        action(tree.Data);
        t.Wait();
    }
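The Task-based tree walk — fork the left subtree, walk the right subtree inline, then wait — translates almost line for line into other task runtimes. A Python sketch using concurrent.futures; the Tree class is a minimal stand-in for the deck's Tree<T>. One caveat, noted in the code: unlike TPL, ThreadPoolExecutor has no work-stealing or inline-on-wait, so this pattern only suits shallow trees:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tree:
    data: int
    left: "Optional[Tree]" = None
    right: "Optional[Tree]" = None

def process_node(tree, action, pool):
    """Mirror of the Task-based version: fork the left subtree,
    walk the right subtree inline, then wait on the forked task."""
    if tree is None:
        return
    t = pool.submit(process_node, tree.left, action, pool)
    process_node(tree.right, action, pool)
    action(tree.data)
    # Like t.Wait(); but ThreadPoolExecutor workers blocked here cannot
    # steal the pending subtask, so deep trees can exhaust the pool.
    t.result()

seen = []
root = Tree(1, Tree(2), Tree(3, Tree(4)))
with ThreadPoolExecutor() as pool:
    process_node(root, seen.append, pool)
```

The left/right subtrees complete in nondeterministic order, but every node's data is visited exactly once — the same guarantee the Task version gives.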

Parallel Extensions (with Parallel):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Parallel.Do(
            () => ProcessNode(tree.Left, action),
            () => ProcessNode(tree.Right, action),
            () => action(tree.Data));
    }

Parallel Extensions (with PLINQ):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        tree.AsParallel().ForAll(action);
    }

F# is... a functional, object-oriented, imperative and explorative programming language for .NET: succinct, strongly typed, efficient, interoperable, scalable, with libraries.

Interactive F# Shell:

    C:\fsharpv2> bin\fsi
    MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
    F# Version, compiling for .NET Framework Version v2.0.50727

    NOTE: See 'fsi --help' for flags
    NOTE: Commands:
    NOTE:   #r <dll>;;        reference (dynamically load) the given DLL
    NOTE:   #I <path>;;       add the given search path for referenced DLLs
    NOTE:   #use <file>;;     accept input from the given file
    NOTE:   #load <file>...;; load the given file(s) as a compilation unit
    NOTE:   #time;;           toggle timing on/off
    NOTE:   #types;;          toggle display of types on/off
    NOTE:   #quit;;           exit
    NOTE: Visit the F# website
    NOTE: Bug reports to [email protected]
    NOTE: Enjoy!

    > let rec f x = (if x < 2 then x else f (x-1) + f (x-2));;
    val f : int -> int

    > f 6;;
    val it : int = 8

Example: Taming Asynchronous I/O

Processing 200 images in parallel with callback-based asynchronous I/O:

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;

        // ProcessImage has a simple O(N) loop, and you can vary the number
        // of times you repeat that loop to make the application more
        // CPU-bound or more I/O-bound.
        public static int processImageRepeats = 20;

        // Threads must decrement NumImagesToFinish, and protect their
        // access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();

            // Now write out the image. Using asynchronous I/O here appears
            // not to be best practice: it ends up swamping the threadpool,
            // because the threadpool threads are blocked on I/O requests
            // that were just queued to the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();

            // This application model uses too much memory; releasing memory
            // as soon as possible is a good idea, especially global state.
            state.pixels = null;
            fs = null;

            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images...  ");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback = new
                AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels, readImageCallback,
                    state);
            }

            // Determine whether all images are done being processed.
            // If not, block until all are finished.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    " Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms",
                (t1 - t0));
        }
    }

Equivalent F# code (same perf) — open the file synchronously, read asynchronously, write asynchronously:

    let ProcessImageAsync (i) =

        async { let  inStream  = File.OpenRead(sprintf "source%d.jpg" i)
                // read from the file asynchronously
                let! pixels    = inStream.ReadAsync(numPixels)
                let  pixels'   = TransformImage(pixels, i)
                let  outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // write the result asynchronously
                do!  outStream.WriteAsync(pixels')
                do   Console.WriteLine "done!" }

    let ProcessImagesAsync () =
        // generate the tasks and queue them in parallel
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])

The Coming of Accelerators
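Accelerator libraries ship data-parallel primitives such as FFT, scan and reduction. As a flavor of what a "reduction" primitive computes, here is a serial Python sketch of the tree-shaped combine pattern that a GPU parallelizes across threads — illustrative only, not any vendor's API:

```python
def tree_reduce(values, combine):
    """Pairwise (tree-shaped) reduction: the access pattern a GPU
    reduction kernel parallelizes, shown serially here."""
    while len(values) > 1:
        # Each pass combines neighbours; on a GPU, every pair in a pass
        # is handled by a separate thread, so a pass costs O(1) depth.
        paired = [combine(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element carries over unchanged
            paired.append(values[-1])
        values = paired
    return values[0]

total = tree_reduce(list(range(16)), lambda a, b: a + b)   # sum -> 120
peak  = tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max)         # max -> 9
```

The tree shape is what turns an O(N) fold into O(log N) parallel passes, which is why the same primitive appears in every accelerator stack listed below.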

Current Offerings

    Vendor      Programming model   Libraries                   Interface        Targets
    Microsoft   Accelerator         D3DX, DaVinci, FFT, Scan    Compute Shader   Any processor
    AMD         Brook+              ACML-GPU                    CAL              AMD CPU or GPU
    nVidia      RapidMind           cuFFT, cuBLAS, cuPP         CUDA             nVidia GPU
    Intel       Ct                  MKL++                       LRB Native       Intel CPU, Larrabee
    Apple       Grand Central       CoreImage, CoreAnimation    OpenCL           Any processor

DirectX11 Compute Shader

A new processing model for GPUs:
- Integrated with Direct3D
- Supports more general constructs, enabling more general data structures and algorithms
- Image/post-processing: image reduction, histogram, convolution, FFT
- Video transcode, super-resolution, etc.
- Effect physics: particles, smoke, water, cloth, etc.
- Ray-tracing, radiosity, etc.
- Gameplay physics, AI

FFT Performance Example

Complex 1024x1024 2-D FFT:

    Software           42 ms     6 GFlops
    Direct3D9          15 ms    17 GFlops   (3x)
    CUFFT               8 ms    32 GFlops   (5x)
    Prototype DX11      6 ms    42 GFlops   (6x)
    Latest chips        3 ms   100 GFlops

Shared register space and random-access writes enable ~2x speedups.

IMSL .NET Numerical Library

Linear algebra; eigensystems; interpolation and approximation; quadrature; differential equations; transforms; nonlinear equations; optimization; basic statistics; nonparametric tests; goodness of fit; regression; variances, covariances and correlations; multivariate analysis; analysis of variance; time series and forecasting; distribution functions; random number generation.

Research

- Integrate: data acquisition from source systems and integration; data transformation and synthesis
- Analyze: data enrichment with business logic, hierarchical views; data discovery via data mining
- Report: data presentation and distribution; data access for the masses

Data Browsing with Excel

[Charts: annual, monthly and weekly means. Courtesy Catherine van Ingen, MSR.]

Datamining with Excel

Integrated algorithms: text mining, neural nets, Naïve Bayes, time series, sequence clustering, decision trees, association rules.

Workflow Design for SharePoint

Microsoft HPC++ Labs: Academic Computational Finance Service

Taking HPC Mainstream

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
