Everything You Still Don't Know About Synchronization and How It Scales
Anshu Raina, Suhas Pai, Mohit Verma, Vikas Goel, Yuvraj Patel
02/12/2020

Motivation
Why do we need synchronization?
- Many-core architectures are now a common phenomenon, and such systems are challenging to scale.
- Synchronization is crucial for ensuring coordination and correctness, but it is also a hindrance to scalability.

- Synchronization is ensured using primitives that rely on hardware artifacts, and the gory details of the hardware are sometimes not known.
- It is hard to predict whether an application will scale with respect to a specific synchronization scheme.

What are we trying to study?
Synchronization under scaling:
- How do the various hardware artifacts scale?

- How do the higher-level synchronization primitives scale?
- Does the hardware architecture impact performance?
- What overheads pop up while scaling?

How do you synchronize?
- Basic hardware artifacts: CAS, TAS, FAI, and other atomic instructions.
- Higher-level primitives, each serving a different purpose: mutexes, semaphores, spin locks, condition variables, barriers.
- A structure shared by all threads using a primitive is updated atomically with the hardware artifacts above.

Synchronization Primitives 101
Basic hardware artifacts (x86):
- CAS: uses lock cmpxchg
- TAS: uses xchg
- FAI: uses lock xadd
Higher-level synchronization primitives:
- Mutex: ensures mutual exclusion; ownership is crucial (lock/unlock).
- Semaphore: a signaling mechanism; ownership is not important (wait/post, also called signal).
- Spinlock: a locking mechanism, generally used for small critical sections.
- Futex: used for performance; syscalls to acquire a lock are avoided and made only under contention.

Experiments: Parameters
- Different configurations: intra-socket, inter-socket, hyperthreading.
- Thread scaling: 1, 2, 4, 8, 14, 28, 56 (28 and 56 threads not run for intra-socket; 56 not run for intra-socket and inter-socket).
- Varying critical section (CS) size. Pseudo-code for the CS:
  FOR (0 .. LOOP_COUNT) { count := count + 1; }  (count is volatile)
  Experiments done for LOOP_COUNT = 100, 1000, and 10000.
- Layered study:
  - Basic hardware artifacts: CAS, FAI, TAS.
  - Higher-level synchronization primitives (musl library): mutex, semaphore, spinlocks.

Platform/Architecture
Intel Xeon E5 v3 (Haswell-EP):
- 2 sockets, 14 active physical cores per socket (possibly using an 18-core die), hyperthreaded with 2 threads per core.
- 8 cores, 8 L3 slices, and 1 memory controller are connected to one bi-directional ring; the rest are connected to another bi-directional ring. The ring topology is hidden from the OS in the default configuration.
- Cluster-on-Die (COD) splits the processor into 2 clusters, giving a topology with 4 NUMA nodes (but we see only 2 NUMA nodes; enabling COD also does not show 4 NUMA nodes).
Cache coherence mechanisms:
- MESIF implementation, realized by Caching Agents (CAs) within each L3 slice and Home Agents (HAs) within the memory controller.
- Modes: Source mode (enabled by default), Home mode, Cluster-on-Die mode.

How do atomic instructions scale?
- How do atomic instructions scale with varying contention?
- Does the placement of threads affect scaling? Single socket or two sockets, with or without hyperthreading.
- How do different atomic instructions vary in latency? Locks are implemented using these atomics; spin locks, mutexes, and semaphores all use CAS.
- Does the coherence state of the cache line affect the latencies of operations?

[Figure: atomics latency trends with increasing threads]
[Figure: effect of thread placement on latencies]
[Figure: effect of thread placement on a single CAS]

Insights

- Latencies of all instructions increase linearly with increasing contention.
- Threads placed on hyperthreaded cores provide improved performance; the effects of hyperthreading are more pronounced when threads are on different sockets.
- CAS latency can be very large (2x or more!) if threads are placed across sockets. This is significant because CAS is widely used to implement locks (spin locks, mutexes) - more in subsequent slides!

Spinlocks, Mutexes, and Binary Semaphores
- What should I use if my critical section is small?
- Does the number of threads in my application matter?
- Does thread placement matter?
- What are the worst and best performance I can get?

Binary Semaphore/Mutex Behavior as Critical Section Size Changes
Spinlocks are usually used when the critical section is small. What about binary semaphores and mutexes? See for yourself!

[Figures: timelines for the NHT2s_100, NHT2s_1000, and NHT2s_10000 configurations]

General behavior as critical section size changes
We looked at what happens when 14 threads contend at once:
- When the CS is large, every thread makes the futex syscall once. How are the threads woken up? FCFS!
- When the CS is small, there is no contention: the CS is over before the other threads are even scheduled!
- When the CS size is intermediate, some threads make the syscall more than once. Since the CS is not big enough, some threads were not yet scheduled and start contending with the thread that was just woken up. Being woken up FCFS does not imply entering the CS FCFS.

Spinlock scaling as the number of threads varies
Spinlocks are mostly used with small critical sections. How does their performance vary with the number of threads?

[Figure: max spinlock latency, critical section of 100 loop iterations, series NHT1s_100, NHT2s_100, HT_100, over 1-56 threads]
[Figure: max spinlock latency, critical section of 1000 loop iterations, series NHT1s_1000, NHT2s_1000, HT_1000, over 1-56 threads]

- Spinlocks do not scale well, even if the CS is small: they are actually worse than mutexes and binary semaphores.
- Why? Mutexes and semaphores back off - more later!

How does the mutex/semaphore scale with the number of threads?

[Figures: mutex/semaphore latency as the number of threads varies]

Why don't they scale well? What's going on inside?

Mutex lock:
1. Try CAS to get the lock.
2. Fail? Try CAS again to get the lock.
3. Fail? Spin for some time if there are other waiters too!
4. Try CAS on the lock again.
5. Fail? Register yourself as a waiter and syscall to futex.

Semaphore wait:
1. Check the semaphore to see if you can enter the critical section.
2. Fail? Try CAS in a loop until you successfully update the semaphore.
3. Fail? Spin for some time if there are other waiters too!
4. Try CAS to update the semaphore again.
5. Fail? Register yourself as a waiter, CAS to update the semaphore, and syscall to futex.

Why don't they scale well? What's going on inside?

[Figures: mutex lock overhead breakup for 14 threads in the intra-socket HT, intra-socket no-HT, and inter-socket no-HT configurations; per-thread latency (ns) split into 1st CAS, 2nd CAS, spin time, while loop trying trylock, and trylock (3rd CAS)]

How does the behavior change as thread placement varies? (For 14 threads)

Mutex: % of threads completing at each stage
Config     | 1st CAS | 2nd CAS | 3rd CAS | While-loop trylock | Syscall | Spin, didn't complete | No. of syscalls
NHT1s_1000 | 21.43   | 0       | 0       | 0                  | 78.57   | 1                     | 7-1/4-2
NHT2s_1000 | 14.29   | 0       | 0       | 7.14               | 78.57   | 0                     | 4-1/6-2/1-3
HT_1000    | 21.43   | 0       | 0       | 14.29              | 64.29   | 3                     | 6-1/3-2
NHT1s_100  | 85.71   | 0       | 0       | 14.29              | 0       | 2                     | 0
NHT2s_100  | 42.86   | 0       | 14.29   | 42.86              | 0       | 2                     | 0
HT_100     | 78.57   | 7.14    | 0       | 14.29              | 0       | 0                     | 0

- Semaphores show the same behavior.
- It might make sense not to spin in the mutex/semaphore when the critical section is large: most threads block in the syscall even after spinning.

Spinlock: spin count variation with thread placement
Config | CS size | Max spin count
NHT1s  | 1000    | 3719
NHT2s  | 1000    | 4275
HT1s   | 1000    | 3242
NHT1s  | 100     | 50
NHT2s  | 100     | 151
HT     | 100     | 28

[Figures: semaphore wait latency for critical section size 1000, as the number of threads varies]

Variation of max and min overhead as the number of threads varies
- For 14 threads, semaphore wait latency across sockets is worse.
- For smaller numbers of threads, inter-socket wait latency can be smaller: the timeline shows that threads across sockets are scheduled late, resulting in less contention.
- If threads are on a hyperthreaded core, wait latency can be smaller than that of threads on non-hyperthreaded cores.
- The behavior of the mutex is similar to that of the semaphore.

How do mutexes, spinlocks, and binary semaphores compare with each other?
- Mutexes and binary semaphores have similar latency (critical section of 100 loop iterations).
- Do not use spin locks if there are lots of threads, even for a small CS (critical section of 100 loop iterations).

What about semaphore post & mutex unlock?
- Post/unlock latency increases linearly with scale.

Other Observations
- Locks are granted to threads in a clustered pattern: in the inter-socket experiments, the lock is acquired by threads belonging to the same socket, 3-4 threads in one go.

Conclusion
- Synchronization is hard.
- The basic hardware artifacts are closely tied to the software synchronization primitives.
- Inter-socket performance is usually worse than same-socket and hyperthreading.

- To get the best performance from software, you should know everything about the architecture.
- But if you have a Haswell-EP machine, use our slides!

Backup

Variation of max and min overheads as thread placement varies
- The inter-socket worst-case semaphore wait latency is worse.

How does the behavior change as thread placement varies?

Mutex: % of threads completing at each stage
Config     | 1st CAS | 2nd CAS | 3rd CAS | While trylock | Syscall | Spin, didn't complete | No. of syscalls
NHT1s_1000 | 21.43   | 0       | 0       | 0             | 78.57   | 1                     | 7-1/4-2
NHT2s_1000 | 14.29   | 0       | 0       | 7.14          | 78.57   | 0                     | 4-1/6-2/1-3
HT_1000    | 21.43   | 0       | 0       | 14.29         | 64.29   | 3                     | 6-1/3-2
NHT1s_100  | 85.71   | 0       | 0       | 14.29         | 0       | 2                     | 0
NHT2s_100  | 42.86   | 0       | 14.29   | 42.86         | 0       | 2                     | 0
HT_100     | 78.57   | 7.14    | 0       | 14.29         | 0       | 0                     | 0

Semaphore: % of threads completing at each stage
Config     | 1st CAS | Try   | Spin, didn't complete | No. of syscalls
NHT1s_1000 | 21.43   | 78.57 | 1                     | 8-1/3-2
NHT2s_1000 | 14.29   | 85.71 | 0                     | 9-1/3-2
HT_1000    | 14.29   | 85.71 | 0                     | 9-1/2-2
NHT1s_100  | 85.71   | 14.29 | 0                     | 0
NHT2s_100  | 42.86   | 57.14 | 0                     | 0
HT_100     | 85.71   | 14.29 | 0                     | 0

Spinlock: max spin count per configuration
Config     | Max spin count
NHT1s_1000 | 3719
NHT2s_1000 | 4275
HT_1000    | 3242
NHT1s_100  | 50
NHT2s_100  | 151
HT_100     | 28

Does it make sense to spin in the mutex/semaphore?
