Evaluating Communication Costs for Distributed Sparse Tensor Factorization
Evaluating Communication Costs for Distributed Sparse Tensor Factorization on Multi-GPU Systems Thomas B. Rolinger, Tyler A. Simon, Christopher D. Krieger SuperComputing 2017 Outline 1. 2. 3. 4. 5. Background Motivation & Approach Experiments & Results Discussion Conclusions & Future Work 1.) Background: Tensors
Tensors: Multidimensional arrays Typically very large and sparse Ex: NELL-1 has 143 million NNZ and density of 10-13 Tensor Factorization: Higher-order extension of matrix singular value decomposition (SVD) CP-ALS: Alternating Least Squares Critical routine: Matricized tensor times Khatri-Rao product (MTTKRP) 1.) Background: Tensors (cont.) MTTKRP X(1) : Matricized Tensor (sparse) (CB) : Khatri-Rao Product (dense) 1.) Background: ReFacTo Distributed Heterogeneous CP-ALS Extension/refactorization of existing CP-ALS algorithm
DFacTo Utilizes one GPU on each node for sparse matrix times vector (SpMV) 1.) Background: ReFacTo Communication Communication required after each MTTKRP Each MPI rank assigned a contiguous slice of the tensor For each MPI rank, MTTKRP computes some number of rows of the factor matrix (lines 2, 5, 8) Each MPI rank sends those updated rows to all other ranks (lines 3, 6, 9) After communication, all MPI ranks have the same fully updated factor matrix Collective Irregular msg sizes blocking 2.) Motivation: Communication vs. Computation ReFacTo (FDR Infiniband All-GPU 32 nodes)
routines Communication cost dominating factor 2.) Approach: GPU-to-GPU Communication Perform all computation on GPUs Then host only performs communication Communication only involves data residing on the GPUs Take advantage of specialized GPU communication software and hardware Software: CUDA aware MPI & NCCL Hardware: NVLink 2.) Approach: CUDA-aware MPI Send GPU buffers via MPI w/o explicit calls to cudaMemcpy Inter- and intra-node GPU communication GPUDirect Peer-to-Peer (P2P) GPUDirect RDMA (GDR)
No modifications required to MPI calls Supported libraries OpenMPI, MVAPICH, etc. 2.) Approach: NCCL NCCL: NVIDIA Collective Communications Library Automatic topology detection Does not rely on availability of P2P access like CUDAaware MPI Better utilization of NVLink topology Multi-ring topology to provide maximum bandwidth NCCL 2.0: inter-node communication support Lacks vector routines (i.e. Allgatherv) Recreate with series of bcasts 2.) Approach: End Result ReFacTo ported onto dense multi-GPU systems (DGX-1) A single node with 8 GPUs
Same code also runs on traditional clusters Many nodes with 1 GPU per node Can use CUDA-aware MPI or NCCL for communication on either system 3.) Experiment: Goal and Setup Goal: Evaluate communication performance for ReFacTo on 2 different systems using different communication libraries Systems: DGX-1: 8x P100 connected via NVLink Cluster: 32 nodes connected by FDR Infiniband with one K40m per node Communication Libraries: OpenMPI 2.1.1 with and without CUDA support GDR not enabled due to hardware limitations on cluster NCCL 2.0.4 Also does not utilize GDR on cluster for same reasons
3.) Experiment: Performance Metric Metric: Total communication time Time require to get updated data from each GPU to all other GPUs MPI: DtoH + MPI_Allgatherv + HtoD CUDA-MPI: MPI_Allgatherv NCCL: ncclBcast calls 3.) Experiment: Cluster K40m K40m .......
20 38.86 11.90 DELICIOUS Cluster : NCCL generally the fastest NCCL slower than MPI approaches on small tensors On large tensors, NCCL up to 2x faster than MPI and 1.5x faster than MPI-CUDA 3.) Results: DGX-1 seconds DGX-1: 2 GPUs 80 70
DELICIOUS 5.47 11.13 FLICKR 14.26 6.36 NELL-1 DGX-1: MPI-CUDA 3.2x faster on average than MPI NCCL 5.3x faster on average than MPI and 1.65x faster than MPI-CUDA NCCL up to 10.9x faster than MPI, leading to a 64% reduction in overall CP-ALS runtime
4.) Discussion: Overall In general, DGX-1 provides better overall performance when compared to cluster for all communication libraries Exception: non-CUDA MPI is faster on cluster for large tensors when using 8 GPUs when compared to DGX-1 Considerations for CP-ALS on multi-GPU systems Tensor properties GPU topology 4.) Discussion: Tensor Properties Size of factor matrices determined by length of the tensors modes and CPD rank Higher rank CPD gives a more fine-grained factorization but results in larger factor matrices Larger factor matrices larger message sizes
NCCL performs increasingly better than MPI on tensors with large factor matrices MPI: optimized for latency, small messages and scales out to thousands of compute nodes NCCL: optimized for bandwidth, large messages and targets dense multi-GPU systems (i.e. DGX-1) MPI-CUDA: approaches similar to NCCL, hence more comparable performance 4.) Discussion: GPU Topology Single node/many GPUs vs. Many nodes/single GPU DGX-1: dense multi-GPU system, optimized for collective communication (hybrid mesh, NVLink) Cluster: single GPU per node, only path between any 2 GPUs is through host+Infiniband On the cluster, NCCL and MPI/MPI-CUDA use the same communication paths NCCL only has advantage on large tensors due to bandwidth optimizations
4.) Discussion: GPU Topology (cont.) On DGX-1, NCCL has advantage due to NVLink and its topology MPI/MPI-CUDA must rely on PCIe topology Poor performance when using all 8 GPUs on DGX1 since 2 GPUs share a single PCIe switch Less impact on cluster since each GPU has its own PCIe switch 5.) Conclusions Distributed multi-GPU tensor factorization Run on variety of multi-GPU architectures Performance study CP-ALS communication performance on DGX-1 and traditional cluster using NCCL, MPI and CUDA-aware MPI Communication time reduced by as much as 10.9x when using NCCL on DGX-1, leading to 64% reduction in CP-ALS runtime Considerations Tensor properties GPU topology
5.) Future Work Evaluate GPUDirect RDMA (GDR) for both NCCL and CUDA-aware MPI Run on clusters that have multiple GPUs per node and different topologies Expand study to other irregular applications References Jukka Antikainen, Jiri Havel, Radovan Josth, Adam Herout, Pavel Zemcik, and Markku Hauta-Kasari. 2011. Nonnegative Tensor Factorization Accelerated Using GPGPU. IEEE Trans. Parallel Distrib. Syst. 22, 7 (July 2011), 11351141. https://doi.org/10.1109/TPDS.2010.194 M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. 2012. Efficient and scalable computations with sparse tensors. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on. 16. https://doi.org/10.1109/ HPEC.2012.6408676 Joon Hee Choi and S. Vishwanathan. 2014. DFacTo: Distributed Factorization of Tensors. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 12961304. http://papers.nips.cc/paper/5395-dfacto-distributedfactorizationof-tensors.pdf Sylvain Jeaugey. 2017. NCCL 2.0. http://on-demand.gputechconf .com/gtc/2017/ presentation/s7155-jeaugey-nccl.pdf. (2017). Accessed: 2017-08-18. Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3, 455500. https://doi.org/10.1137/07070111X J. Li, Y. Ma, C. Yan, and R. Vuduc. 2016. Optimizing Sparse Tensor Times Matrix on Multi-core and Many-Core Architectures. In 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3). 2633. https://doi.org/10.1109/IA3.2016.010 Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. 2017. A Unified Optimization Approach for Sparse Tensor Operations on GPUs. In Proceedings of IEEE Cluster 2017, Hawaii, USA, September 5th - 8th, 2017.
Yongchao Liu and Bertil Schmidt. 2017. LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows. Journal of Signal Processing Systems (10 Jan 2017). https://doi.org/10.1007/s11265-016-1216-4 Thomas B. Rolinger, Tyler A. Simon, and Christopher D. Krieger. 2016. Performance evaluation of parallel sparse tensor decomposition implementations. In Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 5457. Thomas B. Rolinger, Tyler A. Simon, and Christopher D. Krieger. 2017. Performance Challenges for Heterogeneous Distributed Tensor Decompositions. In 2017 IEEE High Performance Extreme Computing Conference, HPEC 2017,Waltham, MA, USA, September 12-14, 2017. Thomas B. Rolinger, Tyler A. Simon, and Christopher D. Krieger. 2017. Performance Considerations for Scalable Parallel Tensor Decomposition. J. Parallel and Distrib. Comput. (2017). Shaden Smith and George Karypis. 2015. Tensor-Matrix Products with a Compressed Sparse Tensor. Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (2015). Shaden Smith and George Karypis. 2016. A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization. 30th IEEE International Parallel & Distributed Processing Symposium (2016). Shaden Smith, Niranjay Ravindran, Nicholas D Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel sparse tensor-matrix multiplication. 29 th IEEE International Parallel & Distributed Processing Symposium (2015). Benyou Zou, Cuiping Li, Liwen Tan, and Hong Chen. 2015. GPUTENSOR. Inf. Sci. 299, C (April 2015), 159177. https://doi.org/10.1016/j.ins.2014.12.004 Questions Contact: [email protected] seconds
CELLAR 1.78 5.06 DELICIOUS 5.47 FLICKR 14.26 6.36 NELL-1 Back up Slides System: Cluster
System: 32 node cluster with (per node): Memory: 512GB DDR4 CPU: Two 10-core E5-2650v3 Xeon Haswell 2.3GHz 25MB shared last level cache AVX 2.0 GPU: NVIDIA Tesla K40m 12GB global memory 48KB shared memory per SM 288 GB/sec max BW 4.29 TFLOPS (single-precision) Applications and Build Details
32-bit integers and single-precision floats Compiler: nvcc/g++ MPI: OpenMPI 2.1.1 NCCL: 2.0.4 CUDA: 8.0 Matricizing a Tensor Kronecker and Khatri-Rao Prodcuts Kronecker Product Khatri-Rao Product Background: ReFacTo Communication Communication required after each MTTKRP Each MPI rank assigned a contiguous slice of the tensor For each MPI rank, MTTKRP computes some number of rows of the factor matrix
(lines 2, 5, 8) Each MPI rank sends those updated rows to all other ranks (lines 3, 6, 9) After communication, all MPI ranks have the same fully updated factor matrix Approach: Communication Code MPI (original) ReFacTo) CUDA-aware MPI NCCL
Macbeth Openings Journal or Notebook topics for brief writings. Meant to get you thinking & thinking about the play. Mac QuickWrite #1 What if you, like Banquo & Macbeth, received the chance to discover your future?
"Love is when you tell a guy you like his shirt, then he wears it everyday." (Noelle-age 7) ... President Thomas S. Monson said: ... True love is that love which comes into your heart and motivates your life when...
CLOSE READING FINDING THE ... Concrete? ANALYZING DICTION Which words are… formal? informal? colloquial? slang? figurative? figures of speech? metaphors? ANALYZING SYNTAX Is it the usual subject-verb-object order, or is it inverted? Inverted syntax "Usual order, the words are not...
Ladok ansvarar för Field Trial 2016. Sverige kommer att ha fungerande NCP (national contactpoint, nationell knutpunkt). Vi kommer kunna leverera data till andra länder (inresande). Ingen SMP (student mobility plugin , som används av utresande) kommer att utvecklas till gamla...
Network Components and Security Measures for Businesses. By. Adam Hess. Topics to be covered: Basics of a Network. Modems, Routers, Firewalls, Switches, Cabling. Virtual Private Networking (VPN) Vulnerabilities with Networks. Businesses. Schools. ... Network Components and Security Measures for ...
Key equation: hypsometric equationz 2-z 1. is thickness. Thickness is proportional to the mean virtual temperature of the layer between two pressure surfaces. Thus, the difference in the geopotential height between two pressure levels is related to the intervening temperatures.