Evaluating the impact of high-bandwidth memory on MPI ...

Locality-Aware PMI Usage for Efficient MPI Startup

Kenneth Raffenetti*, Neelima Bayyapu*, Dimitry Durnov#, Masamichi Takagi~, Pavan Balaji*

* Mathematics and Computer Science Division, Argonne National Laboratory
# Intel Corporation
~ RIKEN Center for Computational Science

Agenda
  • Introduction
  • Background
      • Process Manager
      • Process Management Interface
      • Motivation
  • Related Work
  • Methodology
      • Simple Address Exchange
      • Shared-Memory Optimization for Address Exchange
      • MPI Collective Optimization for Address Exchange
  • Evaluation Results and Analysis
  • Conclusions

Kenneth Raffenetti, ICCC 2018, Chengdu, 12/08/2018

Introduction
  • MPI is the de facto standard programming model for distributed-memory systems.
  • The application must initialize MPI, and this initialization needs to be more efficient.
  • Initialization tasks include:
      • Gathering information about the parallel job
      • Setting up internal library state
      • Preparing resources
  • For performance, external information is exchanged up front during MPI_Init rather than during subsequent communication calls.
  • Processes use the Process Management Interface (PMI) for fabric address exchange.
  • PMI usage needs to be improved to reduce the initialization time of large-scale jobs.

Background
  • Process Manager
      • Handles the start and stop of processes
      • Acts as a central coordination point for parallel processes
  • Process Management Interface
      • First introduced in MPICH (an MPI implementation)
      • Decouples the process management functionality from the underlying process
      • Provides a key-value store
  • Motivation
      • Increasing node and core counts
      • Need for quick and efficient coordination across ranks

[Figure: Rank #0 through Rank #3, on Node #0 and Node #1, access the PMI key-value store with Put and Get.]

Related Work
  • Defining an API standard
      • PMI features and capabilities in MPICH [1]
  • Optimizations to the PMI functionality
      • PMI data exchange over the HPC fabric [5]
      • Nonblocking APIs [6]
      • PMI proxy communication over shared memory [7]
  • Scalability extensions for PMI
      • PMI for Exascale [4]

Methodology (1/3): Simple Address Exchange

  • Each process writes its address, or business card, into the KVS.
  • It retrieves the business cards of all other processes after a barrier.
  • This is an O(P^2) algorithm.
  • At scale, the cost is noticeable.

    /* All ranks perform the following */
    PMI_KVS_Put(rank, myaddr);
    PMI_KVS_Barrier();
    for (i = 0; i < size; i++)
        PMI_KVS_Get(i, &addrs[i]);

[Diagram: PMI key-value store]

[Figure: Simple Address Exchange Performance. y-axis: seconds (0 to 250); x-axis: nodes (1 to 256, ppn=64).]

Methodology (2/3): Shared-Memory Optimization for Address Exchange
  • Redundant work on each node is removed by using shared memory.
  • Within MPI, processes learn about on-node and off-node processes.
  • The overheads of shared-memory communication and the size of the address data need to be addressed.
  • The global maximum across the nodes is used for the address data.
  • The amount of data fetched is reduced from O(P^2) to O(N*P), for P processes on N nodes.

[Diagram: PMI key-value store]

Methodology (3/3): MPI Collective Optimization for Address Exchange
  • Passing the address data through the PMI database needs to be optimized.
  • Locality-aware MPI collective communication (MPI_Allgather) is used.
  • Node roots are used to reduce the amount of traffic.
  • Peers communicate directly.
  • The cost is reduced significantly.

[Diagram: PMI key-value store]

Evaluation Results and Analysis (1/4): Experimental Setup
  • Theta supercomputer
      • 11.69-petaflop system
      • Based on Intel Xeon Phi 7230 processors coupled with a Cray Aries interconnect in a Dragonfly topology
      • 4,392 nodes, each with 64 cores
  • Bebop supercomputer
      • 1024 nodes
      • 64 cores (Intel Knights Landing) per compute node, with Intel Omni-Path fabric
  • Experiments
      • Performance evaluation of address exchange on Bebop
      • Performance evaluation of address exchange on Theta
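
As a rough illustration of the costs quoted on the Methodology (1/3) and (2/3) slides, the following C sketch counts key-value store reads for the simple exchange and for a shared-memory variant. It does not call a real PMI library: kvs_t, kvs_put, kvs_get, and kvs_create are hypothetical in-memory stand-ins, business cards are plain integers, and the way the ranks on a node divide the reads among themselves is an assumption, since the slides do not say how that work is split.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical in-memory stand-in for the PMI key-value store (not a real
 * PMI API). Keys are ranks; values are integer "business cards". */
typedef struct {
    int *cards;   /* cards[r] holds the card published by rank r */
    long gets;    /* number of reads served so far */
} kvs_t;

static void kvs_put(kvs_t *kvs, int rank, int card)
{
    kvs->cards[rank] = card;
}

static int kvs_get(kvs_t *kvs, int rank)
{
    kvs->gets++;
    return kvs->cards[rank];
}

/* Put phase: every rank publishes a made-up card (1000 + rank). */
static kvs_t kvs_create(int nprocs)
{
    kvs_t kvs = { malloc((size_t)nprocs * sizeof(int)), 0 };
    assert(kvs.cards != NULL);
    for (int r = 0; r < nprocs; r++)
        kvs_put(&kvs, r, 1000 + r);
    return kvs;
}

/* Simple exchange: after the (implicit) barrier, each of the P ranks reads
 * all P cards itself. Returns the total number of KVS reads, P * P. */
long simple_exchange_gets(int nodes, int ppn)
{
    int nprocs = nodes * ppn;
    kvs_t kvs = kvs_create(nprocs);
    int *addrs = malloc((size_t)nprocs * sizeof(int));
    assert(addrs != NULL);

    for (int rank = 0; rank < nprocs; rank++) {
        for (int i = 0; i < nprocs; i++)
            addrs[i] = kvs_get(&kvs, i);
        for (int i = 0; i < nprocs; i++)
            assert(addrs[i] == 1000 + i);   /* this rank's table is complete */
    }

    long gets = kvs.gets;
    free(addrs);
    free(kvs.cards);
    return gets;
}

/* Shared-memory variant: the ranks on a node fill ONE table per node.
 * Assumption: local rank l fetches entries l, l + ppn, l + 2*ppn, ...
 * Returns the total number of KVS reads, N * P. */
long shm_exchange_gets(int nodes, int ppn)
{
    int nprocs = nodes * ppn;
    kvs_t kvs = kvs_create(nprocs);
    /* Stands in for one node's shared-memory segment. */
    int *shared = malloc((size_t)nprocs * sizeof(int));
    assert(shared != NULL);

    for (int node = 0; node < nodes; node++) {
        for (int i = 0; i < nprocs; i++)
            shared[i] = -1;                 /* fresh segment for this node */
        for (int local = 0; local < ppn; local++)
            for (int i = local; i < nprocs; i += ppn)
                shared[i] = kvs_get(&kvs, i);
        for (int i = 0; i < nprocs; i++)
            assert(shared[i] == 1000 + i);  /* the node's table is complete */
    }

    long gets = kvs.gets;
    free(shared);
    free(kvs.cards);
    return gets;
}
```

For example, simple_exchange_gets(4, 8) counts 32 * 32 = 1024 reads, while shm_exchange_gets(4, 8) counts 4 * 32 = 128, a reduction by the factor ppn. For the largest Bebop configuration on the slides (256 nodes, ppn=64, so P = 16,384), the same formulas give 268,435,456 versus 4,194,304 reads.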

Evaluation Results and Analysis (2/4): Address Exchange by Node Count on Bebop
  • The original address exchange takes on the order of minutes at scale.
  • At 256 nodes (ppn=64), the node-roots method takes less than 2 seconds.

[Figure: seconds (axis 0 to 10) against nodes (ppn=64). Legend: allgather, bc exchange, bc max, shm setup. Data labels: 13.279052, 26.4526612, 52.157150 2, 105.328247, 211.2465936.]

Evaluation Results and Analysis (3/4): Address Exchange by Process Count on Bebop
  • At ppn <= 2, shared memory adds a little overhead.
  • At ppn > 2, the proposed optimizations outperform the traditional method.

[Figure: seconds (axis 0 to 10) against processes per node (256 nodes). Legend: allgather, bc exchange, bc max, shm. Data labels: 12.452322, 50.9093674, 211.2465963.]

Evaluation Results and Analysis (4/4): Address Exchange by Node Count on Theta

[Figure: seconds (axis 0 to 25) against nodes. Legend: shm setup phase, max phase, PMI bc exchange phase, allgather phase. Data labels: 102.9865258, 421.847700 2.]

Conclusions
  • Efficient startup becomes important as HPC systems grow.
  • This work looked at the most expensive part of MPI initialization: address exchange using the Process Management Interface (PMI).
  • Address exchange performance is improved with locality information.
      • Using shared memory eliminates redundant work.
      • Using MPI collective communication enables the high-speed fabric.

References

1. P. Balaji, D. Buntinas, D. Goodell, W. D. Gropp, J. Krishna, E. L. Lusk, and R. Thakur, PMI: A scalable parallel process-management interface for extreme-scale systems, in 17th EuroMPI Conference, Lecture Notes in Computer Science, Springer, 11/2009.
2. MPICH, https://www.mpich.org/, 2018.
3. Top500, https://www.top500.org/, 2018.
4. R. H. Castain, D. Solt, J. Hursey, and A. Bouteiller, PMIx: Process management for exascale environments, in Proceedings of the 24th European MPI Users Group Meeting, ser. EuroMPI 17, 2017, pp. 14:1-14:10.
5. S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and D. Panda, PMI extensions for scalable MPI startup, in 21st European MPI Users Group Meeting, EuroMPI/ASIA 14, Kyoto, Japan, September 09-12, 2014.
6. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. Panda, Nonblocking PMI extensions for fast MPI startup, in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2015.
7. S. Chakraborty, H. Subramoni, J. L. Perkins, and D. K. Panda, SHMEMPMI: Shared memory based PMI for improved performance and scalability, in 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 60-69, 2016.

Thank you!
Please email any questions to the authors.
