Rethinking Transport Layer Design for Distributed Machine Learning

Jiacheng Xia¹, Gaoxiong Zeng¹, Junxue Zhang¹,², Weiyan Wang¹, Wei Bai³, Junchen Jiang⁴, Kai Chen¹,⁵
APNet '19, Beijing, China, 8/17/19

Growth of Machine Learning
  • AI applications are growing, and many of them leverage machine learning.
  • Our work: running distributed machine learning over a reliable data transfer protocol does NOT lead to optimal performance!

ML as Iterative Approximation
  • Many ML applications iteratively learn a mathematical model to describe data.
  • Learning is represented as minimizing an objective function.
  • E.g. Stochastic Gradient Descent (SGD).

Distributed Machine Learning (DML)
  • After each iteration, workers exchange their parameter updates.
  • Synchronous training is often used for best performance, so the slowest worker determines the speed.
  • [Figure: parameter servers, workers, and data shards.]

Packet Losses in DML
  • Multiple flows run simultaneously, so losses (even TCP timeouts) are likely.
  • Flows are small and last only a few RTTs, so RTO >> FCT without timeout.
  • Training is synchronous, so the tail FCT determines job speed.
  • [Figure: parameter servers (S) and workers (W).]

Faster Computations
  • As hardware gets faster, computations get shorter and timeouts have a larger effect.

Model                  Iteration time (no timeouts)   Slowdown w/ timeouts (RTO = 5ms)
MLP                    7ms                            2.4
Matrix Factorization   6ms                            2.7
ResNet-18 (CNN)        25ms                           1.4
LSTM (RNN)             82ms                           1.12
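The slowdown column is consistent with a simple back-of-the-envelope model, sketched below under an assumption the slide does not state: each iteration suffers two timeouts (for instance one while pushing and one while pulling), each costing one RTO. The function name and the two-timeout assumption are illustrative only.

```python
# Back-of-the-envelope model of the slowdown column.
# ASSUMPTION (not stated in the slides): every iteration incurs exactly
# two timeouts, each of which adds one RTO to the iteration time.

RTO_MS = 5.0  # RTO value given on the slide

# Iteration times without timeouts, in milliseconds (from the table).
ITERATION_MS = {
    "MLP": 7.0,
    "Matrix Factorization": 6.0,
    "ResNet-18 (CNN)": 25.0,
    "LSTM (RNN)": 82.0,
}


def slowdown(iteration_ms, rto_ms=RTO_MS, timeouts_per_iteration=2):
    """Ratio of iteration time with timeouts to iteration time without."""
    return (iteration_ms + timeouts_per_iteration * rto_ms) / iteration_ms


if __name__ == "__main__":
    for model, t in ITERATION_MS.items():
        print(f"{model:22s} {t:5.0f} ms  slowdown {slowdown(t):.2f}x")
```

Under this assumption the ratios round to the values in the table, and they show why a fixed RTO hurts most when the iteration itself is short.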

High Cost of Loss Recovery
  • Loss recovery is costly, e.g. TCP timeouts.
  • With fast computation, completion time is >2x longer with timeouts.
  • [Figure: network and compute timelines (worker pull, worker push) for TCP without timeout and TCP with timeout; the latter shows >2x completion time.]

Handling Packet Drops: Necessary?
  • Timeouts are a backup mechanism for recovering packet drops.
  • Is it necessary to handle every packet drop for DML? NO.
  • DML is inherently iterative approximation, so it only requires approximately correct results.
  • DML algorithms (e.g. SGD) are greedy optimizations and can recover from slightly incorrect results.

ML is Bounded-Loss Tolerant
  • [Figures: convergence rounds and normalized JCT vs. random data loss probability (MLP).]
  • Annotated regions: same rounds, reduced JCT; more rounds, reduced JCT; do not converge.
  • Parameter loss is emulated locally; communication time is computed with NS-3 simulations.

ML View of Bounded Loss Tolerance
  • SGD starts each new estimation from the results of the previous iteration.
  • It can recover from incorrect results.

  • With bounded loss, SGD still converges to the same point.
  • [Figure: lossless SGD vs. lossy SGD.]

Existing Solutions are Insufficient
  • Unreliable protocol?
  • Reduced communications?
  • [Figures: normalized JCT per model (RNN, MLP, MF, CNN) and model loss vs. normalized JCT (MLP); legend entries: TCP, UDP, TCP-100% data, TCP-80% data, simplified protocol.]
  • A simplified protocol, explained in the following slides, has the potential to significantly outperform these settings.

Packet Drops on Different Parameter Synchronization Schemes
  • Parameter Server (PS)
  • Ring AllReduce (RING)
  • [Figure: "Packet Drops on Different Schemes", normalized JCT per model (RNN, MLP, MF, CNN); legend entries: PS-TCP, RING-TCP, PS-Simplified protocol.]
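The bounded-loss tolerance argument made earlier (SGD restarts from the previous iterate, so it can absorb some lost updates) can be tried out with a small local emulation. The sketch below is a toy stand-in, not the authors' experiment: it assumes a noiseless linear-regression model, mini-batch SGD, and an independent drop probability per gradient entry; all names and parameter values are hypothetical.

```python
import random


def make_data(rng, w_true, n):
    """Noiseless linear data y = w_true . x with features in [-1, 1]."""
    xs = [[rng.uniform(-1.0, 1.0) for _ in w_true] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_true, row)) for row in xs]
    return xs, ys


def sgd(xs, ys, dim, loss_prob, steps=2000, batch=8, lr=0.1, seed=1):
    """Mini-batch SGD in which each gradient entry is dropped with
    probability loss_prob, emulating lost parameter updates."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(steps):
        idx = [rng.randrange(len(xs)) for _ in range(batch)]
        grad = [0.0] * dim
        for i in idx:
            err = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            for j in range(dim):
                grad[j] += err * xs[i][j] / batch
        for j in range(dim):
            if rng.random() >= loss_prob:  # this entry "arrived"
                w[j] -= lr * grad[j]
            # otherwise the update for entry j is lost and simply skipped
    return w


def max_error(w, w_true):
    """Largest absolute deviation from the true weights."""
    return max(abs(a - b) for a, b in zip(w, w_true))


if __name__ == "__main__":
    W_TRUE = [1.0, -2.0, 0.5, 3.0]  # hypothetical target weights
    xs, ys = make_data(random.Random(0), W_TRUE, n=200)
    for p in (0.0, 0.2):
        w = sgd(xs, ys, len(W_TRUE), loss_prob=p)
        print(f"loss_prob={p}: max error {max_error(w, W_TRUE):.2e}")
```

In this toy setting both runs end up at essentially the same weights; a skipped entry only delays progress on that coordinate, because the next iteration starts from whatever state the previous one left behind.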

A Simplified Protocol
  • Minimizes the time for the receiver to receive a predefined threshold of packets.
  • Uses TCP-like congestion control logic.
  • Receivers notify the application layer once they have received the predefined threshold of data.
  • Preliminary results are from NS-3 simulations.

Results: Simplified Protocol
  • [Simulation] 1.1-2.1x speedup on both the PS and RING schemes.
  • [Figures: normalized JCT of TCP and the simplified protocol for the RNN, MLP, MF, and CNN models, two panels.]
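The receiver-side rule from the "A Simplified Protocol" slide (notify the application once a predefined threshold of data has arrived) can be sketched as below. This is a minimal illustration, not the authors' implementation: the class name, the callback, and the 0.8 threshold are hypothetical, and congestion control is left out entirely.

```python
class ThresholdReceiver:
    """Hypothetical receiver that reports a flow as done once a fixed
    fraction of its packets has arrived (not the authors' implementation)."""

    def __init__(self, total_packets, threshold=0.9, on_ready=None):
        if not 0.0 < threshold <= 1.0:
            raise ValueError("threshold must be in (0, 1]")
        self.total_packets = total_packets
        self.threshold = threshold
        self.on_ready = on_ready  # called once with the missing sequence numbers
        self.received = set()
        self.ready = False

    def on_packet(self, seq):
        """Record packet `seq`; notify the application at most once."""
        if 0 <= seq < self.total_packets:
            self.received.add(seq)  # duplicates are absorbed by the set
        enough = len(self.received) >= self.threshold * self.total_packets
        if enough and not self.ready:
            self.ready = True
            if self.on_ready is not None:
                self.on_ready(self.missing())
        return self.ready

    def missing(self):
        """Sequence numbers that have not arrived (treated as lost)."""
        return [s for s in range(self.total_packets) if s not in self.received]


if __name__ == "__main__":
    notified = []
    rx = ThresholdReceiver(total_packets=10, threshold=0.8,
                           on_ready=notified.append)
    for seq in [0, 1, 2, 4, 5, 7, 8, 9]:  # packets 3 and 6 are dropped
        rx.on_packet(seq)
    print(rx.ready, notified)
```

The point of the design is that the flow completes without waiting for retransmissions of the last few packets, which is where timeouts would otherwise inflate the tail FCT.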

Reduced Tail FCT
  • The FCT reduction results from reduced tail FCTs.
  • A bounded-loss tolerant protocol benefits DML by ignoring some packet drops.

Future Work
  • We have seen that leveraging bounded-loss tolerance has huge potential to speed up DML.
  • A concrete testbed implementation of bounded-loss tolerant protocols.
  • A software prototype on top of this protocol.

Summary
  • Running DML applications over reliable data transfer is not necessarily the only way.
  • DML applications are bounded-loss tolerant because of their stochastic (iterative approximation) nature.
  • Ignoring some packet drops significantly reduces job completion time without affecting performance.

Thanks! Q&A
