Unified Parallel C (UPC)
Costin Iancu
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick
http://upc.lbl.gov
Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C. Iancu

Context
Most parallel programs are written using either:
- Message passing with a SPMD model (MPI): usually for scientific applications with C++/Fortran; scales easily.
- Shared memory with threads in OpenMP, Threads+C/C++/Fortran, or Java: usually for non-scientific applications; easier to program, but less scalable performance.
Partitioned Global Address Space (PGAS) languages take the best of both:
- SPMD parallelism like MPI (performance)
- Local/global distinction, i.e., layout matters (performance)

- Global address space like threads (programmability)
3 current languages: UPC (C), CAF (Fortran), and Titanium (Java)
3 new languages: Chapel, Fortress, X10

Partitioned Global Address Space
[Figure: a global address space spanning Thread 0 .. Thread n, holding the shared array elements X[0], X[1], ..., X[P] and a per-thread ptr:, split between a shared region and a per-thread private region]

- Shared memory is logically partitioned by processors.
- Remote memory may stay remote: no automatic caching implied.
- One-sided communication: reads/writes of shared variables, both individual and bulk memory copies.
- Some models have a separate private memory area.
- Models differ in the generality of distributed arrays and how they are constructed.

Partitioned Global Address Space Languages
- Explicitly parallel programming model with SPMD parallelism: fixed at program start-up, typically 1 thread per processor
- Global address space model of memory

  - Allows the programmer to directly represent distributed data structures
  - Address space is logically partitioned: local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions: data layout and communication
- Performance transparency and tunability are goals
- Initial implementation can use fine-grained shared memory

Current Implementations
A successful language/library must run everywhere.
- UPC: commercial compilers from Cray, SGI, HP, IBM; open source compilers from LBNL/UCB (source-to-source) and Intrepid (gcc)
- CAF: commercial compiler from Cray; open source compiler from Rice (source-to-source)
- Titanium: open source compiler from UCB (source-to-source)

Common tools
- Open64: open source research compiler infrastructure
- ARMCI, GASNet for distributed memory implementations
- Pthreads, System V shared memory

Talk Overview
UPC Language Design
- Data distribution (layout, memory management)
- Work distribution (data parallelism)
- Communication (implicit, explicit, collective operations)
- Synchronization (memory model, locks)
Programming in UPC
- Performance (one-sided communication)

- Application examples: FFT, LU
- Productivity (compiler support)
- Performance tuning and modeling

UPC Overview and Design
Unified Parallel C (UPC) is:
- An explicit parallel extension of ANSI C, with common and familiar syntax and semantics for parallel C and simple extensions to ANSI C
- A partitioned global address space (PGAS) language
- Based on ideas in Split-C, AC, and PCP
Similar to the C language philosophy: programmers are clever and careful, and may need to get close to the hardware to get performance, but can get in trouble.
SPMD execution model: THREADS threads, each identified by MYTHREAD; static vs. dynamic thread count.
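To make the SPMD model concrete, here is a minimal sketch (not from the original slides): every thread executes the same main, THREADS gives the thread count, and MYTHREAD identifies the executing thread.

    #include <upc.h>      /* UPC: THREADS, MYTHREAD, upc_barrier */
    #include <stdio.h>

    int main(void) {
        /* Every one of the THREADS threads runs this same code (SPMD). */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;      /* global synchronization before exit */
        return 0;
    }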

Data Distribution
[Figure: the global address space across Thread 0 .. Thread n; the shared region holds ours:, X[0], X[1], ..., X[P], and a per-thread ptr:, while each thread's private region holds its own mine:]

Distinction between memory spaces through extensions of the type system (the shared qualifier):
    shared int ours;
    shared int X[THREADS];
    shared int *ptr;
    int mine;
Data in the shared address space:
- Static: scalars (with affinity to thread 0), distributed arrays
- Dynamic: dynamic memory management (upc_alloc, upc_global_alloc, upc_all_alloc)
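A minimal sketch (not from the slides) of the dynamic allocation routines just listed; N is a hypothetical problem size.

    #include <upc.h>

    int main(void) {
        enum { N = 1024 };

        /* Collective allocation: every thread calls it and all get the same pointer.
           upc_all_alloc(THREADS, N*sizeof(double)) lays out THREADS blocks of
           N doubles each, one block per thread. */
        shared double *v = (shared double *) upc_all_alloc(THREADS, N * sizeof(double));

        /* Non-collective: the calling thread allocates space with affinity to itself. */
        shared double *mine_only = (shared double *) upc_alloc(N * sizeof(double));

        upc_barrier;
        if (MYTHREAD == 0) upc_free(v);   /* upc_free may be called by any single thread */
        upc_free(mine_only);              /* each thread frees its own allocation */
        return 0;
    }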

Data Layout
Data layout is controlled through extensions of the type system (layout specifiers):
- [0] or [] (indefinite layout, all on one thread): shared [] int *p;
- Empty (cyclic layout): shared int array[THREADS*M];
- [*] (blocked layout): shared [*] int array[THREADS*M];
- [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (block-cyclic): shared [B] int array[THREADS*M];
Element array[i] has affinity with thread (i / B) % THREADS.
Layout determines the pointer arithmetic rules.
Introspection: upc_threadof, upc_phaseof, upc_blocksizeof.
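A short sketch (assumed declarations, not from the slides) illustrating the layout specifiers and the affinity formula above; it assumes a static THREADS compilation environment.

    #include <upc.h>
    #include <stdio.h>

    #define M 4
    #define B 2

    shared int cyclic[THREADS*M];          /* empty specifier: cyclic, element i on thread i % THREADS */
    shared [*] int blocked[THREADS*M];     /* blocked: M consecutive elements per thread */
    shared [B] int blockcyc[THREADS*M];    /* block-cyclic: element i on thread (i/B) % THREADS */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            for (i = 0; i < THREADS*M; i++)
                /* upc_threadof reports the owning thread; it should match (i/B) % THREADS */
                printf("blockcyc[%d] lives on thread %d\n", i, (int) upc_threadof(&blockcyc[i]));
        }
        return 0;
    }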

UPC Pointers: Implementation
In UPC, pointers to shared objects have three fields:
- thread number
- local address of block
- phase (the position within the block)
Example: the Cray T3E implementation packs them into one 64-bit word: phase in bits 63-49, thread in bits 48-38, virtual address in bits 37-0.
Pointer arithmetic can be expensive in UPC.
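As a sketch of those three fields (again assumed code, not from the slides), the standard introspection functions expose them for any pointer-to-shared:

    #include <upc.h>
    #include <stdio.h>

    #define B 4
    shared [B] int a[THREADS*B];

    int main(void) {
        if (MYTHREAD == 0) {
            shared [B] int *p = &a[0];
            int i;
            for (i = 0; i < 2*B; i++, p++)
                /* thread, phase, and local address field of the pointer-to-shared;
                   the phase wraps from B-1 back to 0 as p crosses a block boundary */
                printf("a[%d]: thread=%d phase=%d addrfield=%lu\n",
                       i, (int) upc_threadof(p), (int) upc_phaseof(p),
                       (unsigned long) upc_addrfield(p));
        }
        return 0;
    }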

UPC Pointers
Where does the pointer point, and where does the pointer reside?

                            Where does it point?
      Where does it reside?   Local       Shared
      Private                 PP (p1)     PS (p2)
      Shared                  SP (p3)     SS (p4)

    int *p1;                /* private pointer to local memory */
    shared int *p2;         /* private pointer to shared space */
    int *shared p3;         /* shared pointer to local memory  */
    shared int *shared p4;  /* shared pointer to shared space  */

[Figure: the same global address space picture, with p3: and p4: replicated in the shared region of every thread, and p1: and p2: in each thread's private region]

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.

Common Uses for UPC Pointer Types
- int *p1; These pointers are fast (just like C pointers). Use them to access local data in parts of the code performing local work. A pointer-to-shared is often cast to one of these to get faster access to shared data that is local.
- shared int *p2; Use to refer to remote data. Larger and slower due to the test-for-local plus possible communication.
- int *shared p3; Not recommended.
- shared int *shared p4; Use to build shared linked structures, e.g., a linked list.
typedef is the UPC programmer's best friend.
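A small sketch (assumed code, not from the slides) of the cast-to-local idiom mentioned above: when a shared array element has affinity with the calling thread, a plain C pointer can reach it cheaply (see the usage rules below).

    #include <upc.h>

    #define N 1024
    shared [N] double grid[THREADS][N];    /* one row of N doubles per thread */

    void scale_my_row(double factor) {
        /* Legal cast: grid[MYTHREAD][0] has affinity with the calling thread. */
        double *mine = (double *) &grid[MYTHREAD][0];
        int i;
        for (i = 0; i < N; i++)
            mine[i] *= factor;             /* fast local accesses, no runtime affinity test */
    }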

UPC Pointers: Usage Rules
- Pointer arithmetic supports blocked and non-blocked array distributions.
- Casting of shared to private pointers is allowed, but not vice versa!
- When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost.
- Casting shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast.

Work Distribution: upc_forall()
Owner-computes rule: loop over all iterations, work on those owned by you. UPC adds a special type of loop:
    upc_forall(init; test; step; affinity)
        statement;
The programmer indicates that the iterations are independent; behavior is undefined if there are dependencies across threads.
The affinity expression indicates which iterations run on each thread. It may

have one of two types:
- Integer: the iteration runs on the thread for which affinity % THREADS == MYTHREAD
- Pointer: the iteration runs on the thread for which upc_threadof(affinity) == MYTHREAD
With an integer affinity expression i, the loop is syntactic sugar for:
    for (i = MYTHREAD; i < N; i += THREADS)
        statement;
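A hedged sketch (assumed code) of both affinity forms on a simple vector update:

    #include <upc.h>

    #define N 1000
    shared double v[N*THREADS];   /* default (cyclic) layout: v[i] on thread i % THREADS */

    void scale_all(double factor) {
        int i;
        /* Integer affinity: iteration i executes on thread i % THREADS */
        upc_forall (i = 0; i < N*THREADS; i++; i)
            v[i] *= factor;

        /* Pointer affinity: iteration i executes on the thread that owns v[i];
           this stays correct even if the layout specifier of v is later changed */
        upc_forall (i = 0; i < N*THREADS; i++; &v[i])
            v[i] += 1.0;
    }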

Collective operations: the interface has synchronization modes (??).
- Avoid over-synchronizing (a barrier before and after is the simplest semantics, but may be unnecessary).
- Data being collected may be read/written by any thread simultaneously.

Data Communication
- The UPC Language Specification v1.2 does not contain non-blocking communication primitives; extensions for non-blocking communication are available in the BUPC implementation.
- UPC v1.2 does not have higher-level communication primitives for point-to-point communication; see the BUPC extensions for scatter, gather, VIS.
- Should non-blocking communication be a first-class language citizen?
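To illustrate the synchronization modes mentioned above, a sketch (assumed code) using the standard UPC 1.2 collectives library; the flags argument selects how much synchronization the collective performs on entry and exit.

    #include <upc.h>
    #include <upc_collective.h>

    #define NELEMS 256
    shared [NELEMS] int src[NELEMS];              /* single source block, on thread 0 */
    shared [NELEMS] int dst[NELEMS*THREADS];      /* one block of NELEMS ints per thread */

    void broadcast_example(void) {
        /* Strictest (simplest) mode: behaves like a barrier on entry and on exit. */
        upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        upc_barrier;   /* with NOSYNC flags the caller provides its own synchronization */

        /* Cheaper mode: the collective may touch src/dst as soon as any thread enters,
           so the caller must guarantee nobody else is reading or writing them. */
        upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                          UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    }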

Synchronization
Point-to-point synchronization: locks
- opaque type: upc_lock_t*
- dynamically managed: upc_all_lock_alloc, upc_global_lock_alloc
Global synchronization: barriers
- Barriers (unaligned): upc_barrier
- Split-phase barriers:
    upc_notify;   // this thread is ready for the barrier
    /* do computation unrelated to the barrier */
    upc_wait;     // wait for the others to be ready
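A sketch (assumed code) combining the two mechanisms: a lock protecting a shared counter, and a split-phase barrier hiding independent local work.

    #include <upc.h>

    shared int hits;                       /* shared counter, affinity to thread 0 */
    upc_lock_t *lock;                      /* the same lock object is visible to all threads */

    int main(void) {
        lock = upc_all_lock_alloc();       /* collective: every thread gets the same lock */

        /* point-to-point synchronization: serialize updates to the shared counter */
        upc_lock(lock);
        hits += 1;
        upc_unlock(lock);

        /* split-phase barrier: signal readiness, overlap unrelated local work, then wait */
        upc_notify;
        /* ... computation unrelated to the barrier ... */
        upc_wait;

        if (MYTHREAD == 0) upc_lock_free(lock);
        return 0;
    }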

Memory Consistency in UPC
The consistency model defines the order in which one thread may see another thread's accesses to memory. If you write a program with unsynchronized accesses, what happens? Does this work?
    // Thread 1                 // Thread 2
    data = ...;                 while (!flag) { };
    flag = 1;                   ... = data;   // use the data
UPC has two types of accesses:
- Strict: will always appear in order
- Relaxed: may appear out of order to other threads
Consistency is associated either with a program scope (file, statement):
    { #pragma upc strict
      flag = 1; }
or with a type:
    shared strict int flag;

Sample UPC Code: Matrix Multiplication in UPC
Given two integer matrices A(NxP) and B(PxM), we want to compute C = A x B. Entries c_ij in C are computed by the formula:
    c_ij = Σ_{l=1..P} a_il * b_lj
Serial C code:
    #define N 4
    #define P 4
    #define M 4
    int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
    int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

    void main (void) {
       int i, j, l;
       for (i = 0; i < N; i++)
          for (j = 0; j < M; j++) {
             c[i][j] = 0;
             for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b[l][j];
          }
    }

Data decomposition:
- A (N x P) is decomposed row-wise into blocks of size (N*P)/THREADS: thread 0 owns elements 0 .. (N*P/THREADS)-1, thread 1 owns (N*P/THREADS) .. (2*N*P/THREADS)-1, ..., thread THREADS-1 owns ((THREADS-1)*N*P)/THREADS .. (N*P)-1.
- B (P x M) is decomposed column-wise into M/THREADS blocks: thread 0 owns columns 0 .. (M/THREADS)-1, ..., thread THREADS-1 owns columns ((THREADS-1)*M)/THREADS .. M-1.

Note: N and M are assumed to be multiples of THREADS.

UPC Matrix Multiplication Code
    #include <upc_relaxed.h>
    #define N 4
    #define P 4
    #define M 4
    // data distribution: a and c are blocked shared matrices
    shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
    shared [M/THREADS] int b[P][M] = {0,1,0,1,...,0,1};

    void main (void) {
       int i, j, l;   // private variables

       upc_forall(i = 0; i < N; i++; &c[i][0]) {
          for (j = 0; j < M; j++) {
             c[i][j] = 0;
             for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b[l][j];
          }
       }
    }

Variant with explicit bulk communication (each thread first copies all of b into a private buffer):
    shared [N*P/THREADS] int a[N][P], c[N][M];   // a and c are blocked shared matrices
    shared [M/THREADS] int b[P][M];
    int b_local[P][M];

    void main (void) {
       int i, j, l;                              // private variables
       // explicit bulk communication
       upc_memget(b_local, b, P*M*sizeof(int));
       // work distribution (c aligned with a??)
       upc_forall(i = 0; i < N; i++; &a[i][0]) {

          for (j = 0; j < M; j++) {
             c[i][j] = 0;
             for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b_local[l][j];
          }
       }
    }

Use models when they work (??); search when they don't (programmer or compiler/runtime).

Best Available Communication Mechanism
- Performance is determined by overhead, latency and bandwidth.
- Data transfer (one-sided communication) is often faster than (two-sided) message passing.
- Semantics limit performance: in-order message delivery, message and tag matching, the need to acquire information from the remote host processor, synchronization (message receipt) tied to data transfer.

One-Sided vs. Two-Sided: Theory
[Figure: a two-sided message carries a message id plus the data payload; a one-sided put message carries the destination address plus the data payload, and can be deposited by the network interface directly into host memory]

- A two-sided message needs to be matched with a receive to identify the memory address where the data should be put. Matching can be offloaded to the network interface in networks like Quadrics, but the match tables must be downloaded to the interface from the host.
- A one-sided put/get message can be handled directly by a network interface with RDMA support, avoiding interrupting the CPU or staging data through the CPU (preposts).

GASNet: Portability and High-Performance

[Figure: 8-byte roundtrip latency in usec (down is good) for MPI ping-pong vs. GASNet put+sync on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3; GASNet is lower on every platform]
GASNet is better for overhead and latency across machines. (Joint work with the UPC Group; GASNet design by Dan Bonachea.)

GASNet: Portability and High-Performance
[Figure: flood bandwidth for 4 KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on the same seven platforms]
GASNet excels at mid-range sizes, which is important for overlap.
[Figure: flood bandwidth for 2 MB messages as a percent of hardware peak (up is good; bandwidth in MB/s), MPI vs. GASNet, on the same seven platforms]
GASNet is at least as high as (comparable to) MPI for large messages.

One-Sided vs. Two-Sided: Practice
[Figure: flood bandwidth in MB/s (up is good) and GASNet/MPI speedup vs. message size from 16 bytes to 256 KB, for non-blocking GASNet put and MPI, on the NERSC Jacquard machine with Opteron processors]

InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5. The half power point (N_1/2) differs by one order of magnitude. This is not a criticism of the implementation! (Yelick, Hargrove, Bonachea)

Overlap: Hide Communication by Overlapping
A programming model that decouples data transfer and synchronization (init, sync). BUPC has several extensions (programmer-controlled): explicit handle based, region based, implicit handle based.
Examples:

- 3D FFT (programmer)
- split-phase optimizations (compiler)
- automatic overlap (runtime)

Performing a 3D FFT
NX x NY x NZ elements spread across P processors. We will use a 1-dimensional layout in the Z dimension: each processor gets NZ/P planes of NX x NY elements each.
[Figure: example with P = 4; the NX x NY x NZ cube is partitioned into slabs of thickness NZ/P assigned to p0, p1, p2, p3]
(Bell, Nishtala, Bonachea, Yelick)

Performing a 3D FFT (part 2)
Perform an FFT in all three dimensions. With the 1D layout, 2 out of the 3 dimensions are local while the last (Z) dimension is distributed.
- Step 1: FFTs on the columns (all elements local)
- Step 2: FFTs on the rows (all elements local)
- Step 3: FFTs in the Z dimension (requires communication)
(Bell, Nishtala, Bonachea, Yelick)

Performing the 3D FFT (part 3)
Steps 1 and 2 can be performed since all the data is available without communication. Then perform a global transpose of the cube, which allows step 3 to continue.
[Figure: the cube before and after the transpose]

(Bell, Nishtala, Bonachea, Yelick)

Communication Strategies for 3D FFT
Three approaches:
- Chunk (chunk = all rows with the same destination): wait for the 2nd-dimension FFTs to finish; minimizes the number of messages.
- Slab (slab = all rows in a single plane with the same destination): wait for a chunk of rows destined for one processor to finish; overlaps with computation.
- Pencil (pencil = 1 row): send each row as it completes; maximizes overlap and matches the natural layout.

(Bell, Nishtala, Bonachea, Yelick)

NAS FT Variants Performance Summary
[Figure: MFlops per thread for the best NAS Fortran/MPI, best MPI (always slabs), and best UPC (always pencils) versions of the NAS FT benchmark; the chunk variant is NAS FT with FFTW; an annotation marks .5 Tflops. Platforms and processor counts: Myrinet 64, Infiniband 256, Elan3 256, Elan3 512, Elan4 256, Elan4 512]
- Slab is always best for MPI; the small-message cost is too high.
- Pencil is always best for UPC; more overlap.

Bisection Bandwidth Limits
Full bisection bandwidth is (too) expensive. During an all-to-all communication phase, the effective (per-thread) bandwidth is a fractional share, significantly lower than the link bandwidth. Use smaller messages mixed with computation to avoid swamping the network.
(Bell, Nishtala, Bonachea, Yelick)

Compiler Optimizations
A naive scheme (a blocking call for each load/store) is not good enough.
- PRE on shared expressions: reduce the amount of unnecessary communication; apply also to UPC shared pointer arithmetic.
- Split-phase communication: hide communication latency through overlapping.
- Message coalescing: reduce the number of messages to save start-up overhead and achieve better bandwidth.
(Chen, Iancu, Yelick)

Benchmarks
- Gups: random access (read/modify/write) to a distributed array
- Mcop: parallel dynamic programming algorithm
- Sobel: image filter
- Psearch: dynamic load balancing / work stealing
- Barnes-Hut: shared-memory style code from SPLASH2
- NAS FT/IS: bulk communication

Performance Improvements
[Figure: percent improvement over the unoptimized version for each benchmark] (Chen, Iancu, Yelick)

Data-Driven Execution
Many algorithms require synchronization with a remote processor. Mechanisms (BUPC extensions):
- Signaling store: raise a semaphore upon transfer
- Remote enqueue: put a task in a remote queue
- Remote execution: floating functions (X10 activities)
Many algorithms have irregular data dependencies (LU). Mechanism (BUPC extensions): cooperative multithreading.

Matrix Factorization
[Figure: the matrix blocks are 2D block-cyclic distributed; the diagram marks the completed part of U, the completed part of L, the panel being factored, blocks A(i,j), A(i,k), A(j,i), A(j,k), and the trailing matrix to be updated; panel factorizations involve communication for pivoting]
(Husbands, Yelick)
Matrix-matrix multiplication is used here; it can be coalesced.

Three Strategies for LU Factorization
- Organize in bulk-synchronous phases (ScaLAPACK): factor a block column, then perform updates. Relatively easy to understand/debug, but extra synchronization.
- Overlapping phases (HPL): the work associated with one block-column factorization can be overlapped; a parameter determines how many (temporary space is needed accordingly).
- Event-driven multithreaded (UPC Linpack): each thread runs an event-handler loop. Tasks: factorization (with pivoting), update trailing matrix, update upper matrix. Tasks may suspend (voluntarily) to wait for data, synchronization, etc. Data is moved with remote gets (synchronization built in).

Threads must gang together for factorizations. Scheduling priorities are key to performance and deadlock avoidance. (Husbands, Yelick)

UPC-HP Linpack Performance
[Figure: GFlop/s for UPC vs. MPI/HPL on the Cray X1 (X1/64, X1/128), an Opteron cluster (Opt/64), and an Altix (Alt/32), and UPC vs. ScaLAPACK on 2x4 and 4x4 processor grids]
- Comparable to HPL (numbers from the HPCC database).
- Faster than ScaLAPACK due to less synchronization.
- Large-scale runs of the UPC code on Itanium/Quadrics (Thunder): 2.2 TFlops on 512 processors and 4.4 TFlops on 1024 processors.
(Husbands, Yelick)

Performance Tuning (Iancu, Strohmaier)
Efficient use of one-sided communication:
- Implementations need to be efficient and have scalable performance.
- Application-level use of non-blocking communication benefits from new design techniques: finer-grained decompositions and overlap.
- Overlap exercises the system in unexpected ways.
- Prototyping implementations for large-scale systems is a hard problem: networks behave non-linearly, and communication scheduling is NP-hard.
- We need a methodology for fast prototyping: understand the interaction between network and CPU at large scale.

Performance Tuning
Performance is determined by overhead, latency and bandwidth, computational characteristics, and communication topology. It's all relative: performance characteristics are determined

by system load.
Basic principles:
- Minimize communication overhead.
- Avoid congestion: control the injection rate (end-point), avoid hotspots (end-point, network routes).
We have to use models. What kind of questions can a model answer?

Example: Vector-Add
Blocking, bulk version:
    shared double *rdata;
    double *ldata, *buf;
    upc_memget(buf, rdata, N * sizeof(double));
    for (i = 0; i < N; i++)
        ldata[i] += buf[i];

Strip-mined, non-blocking version (pseudocode; block size B, h[i] is the communication handle for block i):
    for (i = 0; i < N/B; i++)
        h[i] = nonblocking_get(&buf[i*B], &rdata[i*B], B);
    for (i = 0; i < N/B; i++) {
        sync(h[i]);
        for (j = 0; j < B; j++)
            ldata[i*B + j] += buf[i*B + j];
    }
The schedule issues all the gets up front and then interleaves completion with computation:
    get(B1) get(B2) ... get(BN); sync(B1) compute(B1); sync(B2) compute(B2); ... sync(BN) compute(BN)
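For reference, a concrete sketch (assumed code) of the same strip-mined pattern using the Berkeley UPC non-blocking memcpy extensions (bupc_memget_async / bupc_waitsync; these names come from the BUPC extensions, not the UPC 1.2 spec, and the header name is an assumption):

    #include <upc.h>
    #include <bupc_extensions.h>    /* Berkeley UPC extensions (assumed header name) */

    #define NB 8                    /* number of blocks (hypothetical) */
    #define B  1024                 /* doubles per block (hypothetical) */

    /* rdata uses the indefinite layout, so the remote data is contiguous on one thread */
    void vector_add(double *ldata, double *buf, shared [] double *rdata) {
        bupc_handle_t h[NB];
        int i, j;

        /* issue all non-blocking gets up front */
        for (i = 0; i < NB; i++)
            h[i] = bupc_memget_async(buf + i*B, rdata + i*B, B * sizeof(double));

        /* complete one block at a time, overlapping the remaining transfers
           with computation on the blocks that have already arrived */
        for (i = 0; i < NB; i++) {
            bupc_waitsync(h[i]);
            for (j = 0; j < B; j++)
                ldata[i*B + j] += buf[i*B + j];
        }
    }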

Prototyping
Usual approach: use a time-accurate performance model (applications, automatically tuned collectives).
- Models (LogP, ...) don't capture important behavior (parallelism, congestion, resource constraints, non-linear behavior).
- Exhaustive search of the optimization space is validated only at low concurrency (~tens of processors); it might break at high concurrency and for torus networks.
Our approach:
- Use a performance model for the ideal implementation.
- Understand hardware resource constraints and the variation of performance parameters (understand trends, not absolute values).
- Derive implementation constraints that satisfy both the optimal implementation and the hardware constraints.
- Force the implementation parameters to converge towards the optimal.

Performance
Network: bandwidth and overhead.
- Overhead is determined by

message size, communication schedule, and hardware flow of control.
- Bandwidth is determined by message size, communication schedule, and fairness of allocation.
Application: communication pattern and schedule (congestion), computation. (Iancu, Strohmaier)

Validation
- Understand network behavior in the presence of non-blocking communication (InfiniBand, Elan).
- Develop a performance model for scenarios widely encountered in applications (p2p, scatter, gather, all-to-all) and a variety of aggressive optimization techniques (strip mining, pipelining, communication schedule skewing).
- Use both micro-benchmarks and application kernels to validate the approach.
(Iancu, Strohmaier)

Findings
- We can choose optimal values for implementation parameters.
- A time-accurate model for an implementation is hard to develop and inaccurate at high concurrency.
- The methodology does not require exhaustive search of the optimization space (only p2p and the qualitative behavior of gather).
- In practice one can produce templatized implementations of an algorithm and use our approach to determine optimal values: code generation (UPC), automatic tuning of collective operations, application development.
- We need to further understand the mathematical and statistical properties.

End

Avoid Synchronization: Data-Driven Execution
Many algorithms require synchronization with a remote processor. What is the right mechanism in a PGAS model for doing this? Is it still one-sided?

Part 3: Event-Driven Execution Models
Mechanisms for Event-Driven Execution
A put operation performs a send-side notification (needed for memory consistency model ordering), but we also need to signal the remote side on completion. Strategies (a sketch of the put + strict flag variant appears after the results below):
- Have the remote side do a get (works in some algorithms).
- Put + strict flag write: do a put, wait for completion, then do another (strict) put.
- Pipelined put + put-flag: works only on ordered networks.
- Signaling put: add a new store operation that embeds

a signal (a 2nd remote address) into a single message.

Mechanisms for Event-Driven Execution: Signaling Store
[Figure: time in usec vs. payload size in bytes for memput with signal, upc memput + strict flag, pipelined memput (unsafe), and MPI send/recv n-byte ping-pong; preliminary results on Opteron + InfiniBand]
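A sketch (assumed code, standard UPC only) of the "put + strict flag write" strategy from the list above: a relaxed bulk put followed by a strict write to a flag that the consumer spins on; the strict access guarantees the data is visible before the flag is.

    #include <upc_relaxed.h>
    #include <string.h>

    #define N 1024
    shared [N] double box[THREADS][N];       /* one landing buffer per thread */
    shared strict int ready[THREADS];        /* one flag per thread, strict accesses */

    /* producer: deliver N doubles to thread 'dst', then raise its flag */
    void send_to(int dst, const double *src) {
        upc_memput(&box[dst][0], src, N * sizeof(double));  /* relaxed bulk put */
        ready[dst] = 1;                                     /* strict write: ordered after the put */
    }

    /* consumer: spin on the strict flag, then use the data locally */
    void receive(double *out) {
        while (!ready[MYTHREAD])
            ;                                               /* strict read on each iteration */
        memcpy(out, (double *) &box[MYTHREAD][0], N * sizeof(double));
        ready[MYTHREAD] = 0;                                /* reset for the next round */
    }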
