Carbon Management Technologies for Sustainable Coal Utilization

Accelerating MFIX-DEM code on the Intel Xeon Phi

Dr. Handan Liu, Research Scientist, Virginia Tech
Dr. Danesh Tafti, W.S. Cross Professor, Virginia Tech
High Performance Computational Fluid-Thermal Sciences & Engineering Lab

NETL 2014 Workshop on Multiphase Flow Science
August 5-6, 2014, Morgantown, WV

Motivation

The motivation for this research is to accelerate the performance of the NETL code MFIX for multiphase flows. Different parallelization strategies are being considered on modern computer architectures that can lead to large performance gains in MFIX and allow calculations on more physically realistic systems.

Objectives

  • Different parallelization strategies (MPI, OpenMP and hybrid MPI+OpenMP) for the MFIX code (TFM and DEM) were completed in previous work by the Virginia Tech group.
  • Enhance parallelization flexibility on the Intel Xeon Phi architecture, namely Intel MIC.
  • Extend the OpenMP instrumentation to Intel MIC.

Executive Summary

  • Investigate different models of computing on the Intel MIC architecture.
  • Choose the offloading paradigm in MFIX to enhance performance on Intel MIC.
  • Incorporate explicit offloading directives in the MFIX code in order to port the DEM solver to Intel MIC.
  • OpenMP performance of MFIX-DEM is about 5 times faster on the CPU with offloading to the MIC than on the CPU only.
  • Validation studies ascertain that the offloading code gives the same results as OpenMP on the CPU only.

Advantages of Intel MIC

  • Intel's MIC is based on x86 technology
      - x86 cores with caches and cache coherency
      - SIMD instruction set
  • Programming for MIC is similar to programming for CPUs
      - Familiar languages: C/C++ and Fortran
      - Familiar parallel programming models: OpenMP and MPI
      - MPI on the host and on the coprocessor
      - Any code can run on MIC, not just kernels
  • Optimizing for MIC is similar to optimizing for CPUs

Programming Models

Four programming models on the Intel MIC architecture:

  • CPU only: run only on the CPU, as on a traditional HPC cluster.
  • MIC only: run natively on the MIC card.
  • MPI on CPU and MIC: treat the MIC (mostly) like another host.
  • Offload: OpenMP on CPU, offload to MIC.

TACC Stampede: each node is outfitted with two 8-core Intel Xeon E5 processors (CPU) and one 61-core Intel Xeon Phi coprocessor (MIC). The compiler is Intel composer_xe_2013.2.146.

Offloading: MIC as co-processor

  • OpenMP on CPU, offload to MIC.
  • Targeted offload through OpenMP extensions.
  • Offload model: the data to be exchanged between the host and the MIC consists of scalars, arrays, and Fortran derived types.
  • Offloading directives are used to specify that a code block can be offloaded.
  • A program running on the host offloads work by directing the MIC to execute a specified block of code.

Offloading Method for MFIX-DEM

Migrating the DEM solver to the Intel MIC architecture:

  • Identify the major (time-consuming) routines of the DEM solver according to our previous work.
  • Offload most subroutines of the DEM solver to the MIC, except those for:
      - message passing, which communicates and exchanges the ghost cells and ghost particles between multiple nodes; and
      - I/O.
  • Identify the different transfer types of global variables between CPU and MIC.

Challenges

  • The most important consideration in offloading is whether it yields any performance gain.
  • Reduce unnecessary data transfer between CPU and MIC for a large code like MFIX.
  • Identifying which variables are only transferred in, only transferred out, or both is a challenge.
  • It is important to identify the different types of global variables between CPU and MIC when offloading the MFIX-DEM solver.
  • When and where the variables are offloaded is also a challenge.
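The classification challenge can be illustrated with a small sketch. It is not taken from MFIX: the function `transfer_direction` and the read/write flags assigned to each variable are hypothetical, chosen only to show how knowing where a variable is produced and consumed determines whether it must be copied in, out, both ways, or not at all. The four labels mirror the in, out, inout and nocopy clauses of the Intel offload directive.

```python
def transfer_direction(read_on_mic, written_on_mic, host_has_input, host_needs_result):
    """Return the copy direction for one variable of an offloaded region.

    read_on_mic:       the offloaded code reads the variable
    written_on_mic:    the offloaded code writes the variable
    host_has_input:    the values read on the MIC were produced on the CPU
    host_needs_result: the CPU uses the values written on the MIC
    """
    copy_in = read_on_mic and host_has_input
    copy_out = written_on_mic and host_needs_result
    if copy_in and copy_out:
        return "inout"
    if copy_in:
        return "in"
    if copy_out:
        return "out"
    return "nocopy"


# Assumed usage pattern for three variables named elsewhere in this talk.
variables = {
    # positions: ghost particles exchanged on the CPU, values updated on the MIC
    "des_pos_new": transfer_direction(True, True, True, True),
    # gas velocity: computed by the CFD solver on the CPU, only read by DEM
    "u_g": transfer_direction(True, False, True, False),
    # contact force: produced and consumed entirely inside DEM on the MIC
    "fc": transfer_direction(True, True, False, False),
}

for name, direction in variables.items():
    print(f"{name:12s} -> {direction}")
```

In a real port the flags would come from reading each routine, which is the time-consuming part for a code the size of MFIX.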

MFIX-DEM Flow Chart

  • Initialize and transfer data to the MIC, including the necessary parameters, constants and global arrays for the DEM calculation.
  • Transfer data between CPU and MIC:
      1) global variables transferred at each solid time step;
      2) global variables transferred before/after the DEM calculation;
      3) global variables transferred into the MIC and back to the CPU in different subroutines;
      4) global variables not transferred to the CPU, only calculated in DEM on the MIC.

Handan Liu, Danesh Tafti and Tingwen Li. Hybrid Parallelism in MFIX CFD-DEM using OpenMP. Powder Technology, 259 (2014) 22-29.

Global Variables for Offloading (1)

  • In MFIX-DEM, our analysis identified four types of global variables for offloading.
  • Global variables that are used in offloaded subroutines must be given the offload attribute directives where they are defined, in functions or modules.
  • The code snippet in the module DISCRETELEMENT gives the offload attributes to the global variables that need to be offloaded in the subroutine CFNEWVALUES.f.

Global Variables for Offloading (2)

Four types of global variables for offloading MFIX-DEM:

  • Global variables that need to be transferred into the MIC and back to the CPU at each solid time step, in and out in different subroutines; e.g. particle positions: des_pos_new.
  • The code snippet in the routine particles_in_cell.f shows that the do loop is offloaded to the MIC for calculation, and that the global variable des_pos_new is transferred from CPU to MIC after ghost particles are exchanged by invoking des_par_exchange on the CPU.
  • des_pos_new is then transferred back to the CPU after its values are updated in the routine cfnewvalues.f on the MIC at each solid time step.
  • When compiling, the snapshot shows the bytes of data transferred between CPU and MIC.

Global Variables for Offloading (3)

Other types of global variables in the DEM solver:

  • Global variables that need to be transferred to the MIC before the DEM calculation, such as the gas velocity (u_g, v_g, w_g) and some fluid data used for the DEM calculation.
  • Global variables that need to be transferred to the MIC just once, before the first pass of DEM, such as the particle properties defined in the input file.
  • Global variables in the DEM calculation that do not need to be transferred between CPU and MIC, such as the total contact force and torque (fc, tow).

Initializing routines for offloading:

  • Other subroutines outside the DEM solver are also involved in offloading the DEM calculation.

MFIX-DEM Flow Chart

  • Necessary offloading routines, modules and functions, including the CFD solver and the DEM solver.
  • Offload for CFD solver.
  • Offload for DEM solver: DES_DRAG_GP.

Handan Liu, Danesh Tafti and Tingwen Li. Hybrid Parallelism in MFIX CFD-DEM using OpenMP. Powder Technology, 259 (2014) 22-29.
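To make the transfer types concrete, the following sketch counts how often each type would cross the CPU-MIC link during a run. It is an illustration only: the loop structure is simplified, and `count_transfers`, its class names, and the step counts in the usage line are hypothetical, not taken from MFIX.

```python
def count_transfers(n_fluid_steps, n_solid_steps_per_fluid_step):
    """Count CPU<->MIC copies per variable class for a simplified CFD-DEM loop."""
    counts = {"once": 0, "per_dem_call": 0, "per_solid_step": 0, "never": 0}

    # e.g. particle properties from the input file, sent before the first DEM pass
    counts["once"] += 1

    for _ in range(n_fluid_steps):
        # e.g. u_g, v_g, w_g: sent to the MIC before each DEM calculation
        counts["per_dem_call"] += 1
        for _ in range(n_solid_steps_per_fluid_step):
            # e.g. des_pos_new: copied in after the ghost exchange on the CPU,
            # copied out after the update on the MIC (two copies per solid step)
            counts["per_solid_step"] += 2
            # fc, tow stay on the MIC, so counts["never"] is left at zero
    return counts


print(count_transfers(n_fluid_steps=10, n_solid_steps_per_fluid_step=6))
```

In this simplified count the per-solid-step class dominates, which is consistent with the emphasis on reducing unnecessary data transfer between CPU and MIC.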

Scaling Analysis of 80,000 particles

The 3D fluidized bed with 80,000 particles and 18,000 cells was used to compare OpenMP performance on the CPU with OpenMP performance on the CPU with offloading to the MIC, on a single node of TACC Stampede.

Table 1: OpenMP performance on CPU only with 1, 2, 4 and 8 cores compared with OpenMP on CPU and offloaded on MIC

Number of cores on CPU   CPU only: total        CPU+MIC (60 cores): total   Speed factor
(single node)            wall clock time (s)    wall clock time (s)         (CPU only / CPU+MIC)
1                        2091.8                 581.1                       3.60
2                        1136.0                 352.8                       3.22
4                         579.1                 152.2                       3.80
8                         321.2                 103.1                       3.12

Larger System OpenMP Offloading

  • The performance of OpenMP offloading was evaluated on the CPU and MIC on a single node of TACC Stampede, compared with the performance of OpenMP on the CPU only.
  • Total particles: 1.28 million.
  • Total physical simulation time is 0.5 seconds.
  • Compiler: Intel composer_xe_2013.2.146.

Parameter                                 Value
Total particles                           1.28 million
Diameter                                  4 mm
Density                                   2700 kg/m^3
Coef. of restitution (particle, wall)     0.95, 0.95
Friction coefficient (particle, wall)     0.3, 0.03
Spring constant (particle, wall)          2400, 2400 N/m
Dimension                                 64 x 100 x 64 cm
Total cells                               409,600
Grid size                                 64 x 100 x 64
Superficial velocity                      2.0 m/s
Time step (fluid, solid)                  5.0e-5 s, 8.6e-6 s
Number of processors                      1, 2, 4, 8, 16 cores on CPU with 60 cores on MIC
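The speed factor reported in Table 1 is the CPU-only wall clock time divided by the CPU+MIC wall clock time. The short sketch below recomputes it from the tabulated times and also derives the CPU-only OpenMP speedup relative to one core. That second quantity is not reported in the talk and is shown only as a derived illustration; the variable and function names are illustrative.

```python
# Wall clock times (s) from Table 1: cores -> (CPU only, CPU+MIC with 60 cores)
table1 = {
    1: (2091.8, 581.1),
    2: (1136.0, 352.8),
    4: (579.1, 152.2),
    8: (321.2, 103.1),
}


def speed_factor(cpu_only, cpu_mic):
    """Speed factor as defined in the table: CPU only / CPU+MIC."""
    return cpu_only / cpu_mic


t1_cpu = table1[1][0]  # single-core CPU-only time, used as the baseline
for cores, (cpu_only, cpu_mic) in table1.items():
    factor = speed_factor(cpu_only, cpu_mic)
    openmp_speedup = t1_cpu / cpu_only  # derived here, not in the original table
    print(f"{cores} core(s): speed factor {factor:.2f}, "
          f"CPU-only speedup vs 1 core {openmp_speedup:.2f}")
```

The same ratio applies to any pair of CPU-only and CPU+MIC wall clock times measured on the same case.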

Scaling Analysis of 1.28 million particles

Table 2: OpenMP performance on 1, 2, 4, 8 and 16 cores of the CPU only compared with OpenMP offloading on the corresponding cores on the CPU and 60 cores on the MIC

Number of cores on CPU (single node)            1         2         4         8         16
CPU only: total wall clock time (s)             11732.6   6521.0    3817.4    1600.3    947.3
CPU+MIC (60 cores): total wall clock time (s)   2523.1    1488.8    842.7     322.0     230.5
Speed factor (CPU only / CPU+MIC)               4.65      4.38      4.53      4.97      4.11

Validation

The 3D fluidized bed with 80,000 particles of 4 mm diameter was simulated for this validation. The simulation was carried out for a total of 5 seconds. The time-averaged profiles were obtained from 2.0-5.0 seconds.

Comparison of time-averaged profiles of void fraction and gas velocity for the OpenMP and offloading implementations on CPU and CPU+MIC on a single node. The validation shows that the parallel simulation does not alter the accuracy of the solution.

Summary

  • Ported MFIX to the Intel MIC architecture.
  • Chose the offload paradigm in MFIX:
      - OpenMP on CPU and offloading to MIC on a single node (done);
      - hybrid MPI+OpenMP on CPU and offloading to MIC on multiple nodes (ongoing).
  • Incorporated offload directives in MFIX-DEM subroutines.
  • Evaluated OpenMP performance of MFIX-DEM on CPU versus CPU with offloading to MIC on a single node.
  • Accelerated the MFIX-DEM code about 5 times on the Intel MIC architecture compared to the Intel CPU only (speed factors of 4.11 to 4.97 in Table 2).
  • Validated the offloading directives.
  • The offloading code for the DEM solver will be tuned further to gain more performance in the next work.

Future Work

FY14
  • Continue to refine the code to further improve MFIX-DEM performance on Intel MIC.
  • Investigate the performance of MFIX-TFM on Intel MIC.

FW
  • Evaluate OpenMP 4.x/OpenACC directives on heterogeneous CPU/GPU systems.
