Rethinking Transport Layer Design for Distributed Machine Learning
Jiacheng Xia, Gaoxiong Zeng, Junxue Zhang, Weiyan Wang, Wei Bai, Junchen Jiang, Kai Chen
APNet '19, Beijing, China (August 17, 2019)

Growth of Machine Learning
Applications of AI are growing rapidly, and many of them leverage machine learning.
Our work: running distributed machine learning over a reliable data transfer protocol does NOT lead to optimal performance!
ML as Iterative Approximation
Many ML applications iteratively learn a mathematical model that describes the data.
Training is typically cast as minimizing an objective function, e.g. via Stochastic Gradient Descent (SGD); see the update rule below.
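As a refresher, the standard minibatch SGD update (textbook form, written in our notation; the slides do not spell it out):

```latex
% One SGD iteration: step the parameters w against the stochastic gradient
% of the loss \ell, estimated on a sampled minibatch B_t, with rate \eta.
w_{t+1} = w_t - \eta \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \ell(w_t; x_i)
```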
Distributed Machine Learning (DML)
After each iteration, workers exchange their parameter updates.
DML often uses synchronous training for best performance, so the slowest worker determines the speed.
[Figure: parameter servers, workers, and data shards.]

Packet Losses in DML
Many flows start simultaneously, so losses (even TCP timeouts) are likely.
The flows are small, lasting only a few RTTs, so RTO >> FCT without a timeout.
Under synchronous training, the tail FCT determines the job speed; see the sketch below.
[Figure: servers (S) and workers (W) exchanging updates.]
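To make the synchronization barrier concrete, here is a minimal sketch of one synchronous parameter-server step (illustrative Python with our own naming, not the authors' code); the blocking aggregation is why one timeout-delayed worker stalls the whole iteration:

```python
import numpy as np

def synchronous_ps_step(params, worker_grads, lr=0.01):
    """One synchronous parameter-server iteration.

    worker_grads: one gradient vector per worker. The server must wait
    for ALL of them before updating, so the slowest flow (e.g. one hit
    by an RTO) sets the pace of the entire iteration.
    """
    avg_grad = np.mean(worker_grads, axis=0)  # aggregate all workers' updates
    return params - lr * avg_grad             # new params, pulled back by workers
```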
Faster Computations
With growing hardware speed, computation gets faster, so timeouts have a larger relative effect.

Model                  Iteration time (no timeouts)   Slowdown w/ timeouts (RTO = 5 ms)
MLP                    7 ms                           2.4x
Matrix Factorization   6 ms                           2.7x
ResNet-18 (CNN)        25 ms                          1.4x
LSTM (RNN)             82 ms                          1.12x
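A back-of-envelope check (our arithmetic, not from the talk): even a single 5 ms RTO on the critical path of a 7 ms MLP iteration costs

```latex
\frac{t_{\text{iter}} + \text{RTO}}{t_{\text{iter}}}
  = \frac{7\,\text{ms} + 5\,\text{ms}}{7\,\text{ms}} \approx 1.7\times
```

so the measured 2.4x suggests iterations often suffer more than one RTO's worth of stalling. For the 82 ms LSTM the same 5 ms stall is proportionally far smaller, consistent with the 1.12x figure.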
High Cost of Loss Recovery
Loss recovery is expensive. E.g. TCP timeouts: with fast computation, iterations that hit a timeout take more than 2x longer to complete.
[Figure: per-iteration timeline of worker push/pull, split into network and compute time, for TCP with and without a timeout; with a timeout the completion time is >2x longer.]
Handling Packet Drops: Necessary?
Timeouts act as a backup mechanism to recover from packet drops. But is recovering every packet drop necessary for DML?
NO. DML is inherently iterative approximation, so it only requires approximately correct results.
DML algorithms (e.g. SGD) perform greedy optimization and can recover from slightly incorrect intermediate results.

ML is Bounded-Loss Tolerant
[Figure: for MLP, convergence rounds and normalized JCT vs. random data loss probability. With small loss probabilities the model converges in the same number of rounds with reduced JCT; with moderate loss it needs more rounds but still finishes with reduced JCT; with high loss it does not converge.]
Methodology: emulate parameter loss locally, and compute communication time with NS-3 simulations; a sketch of the emulation follows.
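A minimal sketch of how such an emulation can be set up (a toy least-squares model and our own drop model; the talk does not specify its exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)  # ground-truth linear model

def lossy_sgd(drop_prob, lr=0.05, iters=500, batch=32):
    """SGD on least squares, randomly dropping a fraction of each
    parameter update to emulate lost (and never retransmitted) packets."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i = rng.integers(0, len(X), size=batch)
        grad = X[i].T @ (X[i] @ w - y[i]) / batch
        delivered = rng.random(w.shape) >= drop_prob  # surviving "packets"
        w -= lr * grad * delivered
    return np.mean((X @ w - y) ** 2)  # final training error

print(lossy_sgd(0.0), lossy_sgd(0.2))  # modest drop rates barely hurt convergence
```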
ML View of Bounded Loss Tolerance
SGD starts each new estimate from the previous iteration's result, so it can recover from incorrect intermediate results.
With bounded loss, SGD still converges to the same point.
[Figure: lossless SGD and lossy SGD trajectories converging to the same optimum.]

Existing Solutions are Insufficient
[Figure: model loss vs. normalized JCT (MLP) for TCP with 100% of the data, TCP with 80% of the data (reduced communication), UDP (an unreliable protocol), and the simplified protocol; lower-left is better.]
A simplified protocol, explained in the following slides, has the potential to significantly outperform these alternatives.
Packet Drops on Different Schemes
Packet drops occur under different parameter synchronization schemes: Parameter Server (PS) and Ring AllReduce (RING); see the ring all-reduce sketch below.
[Figure: normalized JCT for PS-TCP, RING-TCP, and the PS-Simplified protocol across the RNN, MLP, MF, and CNN models.]
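For reference, a self-contained sketch of the RING scheme's communication pattern (illustrative ring all-reduce in Python, not the authors' implementation):

```python
import numpy as np

def ring_allreduce(grads):
    """Ring all-reduce over one gradient vector per worker: each vector is
    split into N chunks that circulate around the ring in 2(N-1) steps
    (reduce-scatter, then all-gather), so every worker ends with the full
    sum. The 'network' here is plain list indexing."""
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]
    # Reduce-scatter: after n-1 steps, worker w holds the complete sum
    # of chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [chunks[w][(w - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[w][(w - 1 - step) % n] += sends[(w - 1) % n]
    # All-gather: circulate the completed chunks so everyone has all of them.
    for step in range(n - 1):
        sends = [chunks[w][(w + 1 - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[w][(w - step) % n] = sends[(w - 1) % n]
    return [np.concatenate(c) for c in chunks]

sums = ring_allreduce([np.ones(8), 2 * np.ones(8), 3 * np.ones(8)])
assert all(np.allclose(s, 6.0) for s in sums)  # 1 + 2 + 3 on every worker
```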
A Simplified Protocol
Minimizes the time for the receiver to collect a predefined threshold of packets.
Keeps TCP-like congestion control logic.
Receivers notify the application layer once the predefined threshold of data has been received.
Preliminary results in NS-3 simulations.
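A minimal sketch of the receiver-side notification logic (our own naming and a hypothetical 90% threshold; congestion control and the sender side are omitted):

```python
class ThresholdReceiver:
    """Delivers a parameter block to the application once a predefined
    fraction of its packets has arrived, instead of blocking on (and
    retransmitting) every last packet as TCP would."""

    def __init__(self, total_pkts, threshold=0.9):
        self.needed = int(total_pkts * threshold)  # e.g. 90% of the block
        self.received = set()
        self.delivered = False

    def on_packet(self, seq, payload, notify_app):
        self.received.add(seq)  # duplicate arrivals are harmless
        if not self.delivered and len(self.received) >= self.needed:
            self.delivered = True
            notify_app(self.received)  # proceed; remaining drops are ignored
```

Packets that never arrive are simply treated as the bounded loss that training tolerates, which is what removes retransmission timeouts from the critical path.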
Results: Simplified Protocol
[Simulation] 1.1-2.1x speedup on both the PS and RING schemes.
[Figure: normalized JCT of TCP vs. the simplified protocol across the RNN, MLP, MF, and CNN models, for both schemes.]
Reduced Tail FCT
The FCT reduction comes from reduced tail FCTs: a bounded-loss tolerant protocol benefits DML by ignoring some packet drops.

Future Work
We have seen that leveraging bounded loss tolerance has huge potential to speed up DML.
Next steps: a concrete testbed implementation of bounded-loss tolerant protocols, and a software prototype on top of such a protocol.
Summary
Running DML applications over reliable data transfer is not necessarily the only way.
DML applications are bounded-loss tolerant, due to their stochastic (iterative approximation) nature.
Ignoring some packet drops significantly reduces job completion time without affecting model performance.

Thanks! Q&A