Distributed FSM Modeling and Verification Using Maude

MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms
Apr 9, 2012
Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*, Marco Caccamo+, Lui Sha+
+University of Illinois at Urbana-Champaign
*University of Waterloo

Real-Time Applications
Resource-intensive real-time applications
  Multimedia processing (*), real-time data analytics (**), object tracking
Requirements
  More performance at less cost: Commercial Off-The-Shelf (COTS)
  Performance guarantee (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010
(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012

Modern System-on-Chip (SoC)
More cores
  Freescale P4080 has 8 cores
More sharing
  Shared memory hierarchy (LLC, MC, DRAM)
  Shared I/O channels
More performance, less cost. But what about isolation?

Problem: Shared Memory Hierarchy
[Diagram: Parts 1-4 on Core1-Core4, sharing the last level cache, the memory controller, and DRAM]
Shared Last Level Cache (LLC): space contention
Memory Controller (MC): access contention
DRAM
Shared hardware resources; the OS has little control

Memory Performance Isolation
[Diagram: Parts 1-4 on Core1-Core4, four LLC blocks, one memory controller, DRAM]
Q. How to guarantee worst-case performance?
  Need to guarantee memory performance

Inter-Core Memory Interference
[Figure: slowdown ratio due to interference; runtime slowdown (1.0 to 2.2) of the foreground benchmark (X-axis) with 470.lbm in the background, on two cores with L2 caches and shared memory (Intel Core2). X-axis labels as extracted: (1.6GB/s) 437.leslie3d (1.5GB/s) 462.libquantum (1.5GB/s) 410.bwaves (1.4GB/s) 471.omnetpp]
Significant slowdown (up to 2x on 2 cores)
Slowdown is not proportional to memory bandwidth usage

Background: DRAM Chip
[Diagram: DRAM chip with Banks 1-4, Rows 1-5 per bank, and a row buffer. READ (Bank 1, Row 3, Col 7): activate, read/write Col 7 in the row buffer, precharge]
State-dependent access latency
  Row miss: 19 cycles, row hit: 9 cycles (*)
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

Background: Memory Controller (MC)
[Figure: Bruce Jacob et al., Memory Systems: Cache, DRAM, Disk, Fig. 13.1]
Request queue
  Buffers read/write requests from CPU cores
  Unpredictable queuing delay due to reordering

Background: MC Queue Re-ordering
Initial queue (2 row switches)
  Core1: READ Row 1, Col 1
  Core2: READ Row 2, Col 1
  Core1: READ Row 1, Col 2
Reordered queue (1 row switch)
  Core1: READ Row 1, Col 1
  Core1: READ Row 1, Col 2
  Core2: READ Row 2, Col 1
Improves row hit ratio and throughput
Unpredictable queuing delay

Challenges for Real-Time Systems
Memory controller (MC) level queuing delay
  Main source of interference
  Unpredictable (re-ordering)
DRAM state-dependent latency
  DRAM row open/close state
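The queue re-ordering example and the row-hit/row-miss latencies from the background slides can be combined in a small model. The sketch below is illustrative only: the row-hit-first policy, the single-bank assumption, and the names `Request`, `reorder_row_hit_first`, `row_switches`, and `completion_cycles` are assumptions introduced here, not the behavior of any particular memory controller.

```python
from typing import List, NamedTuple, Optional

ROW_HIT_CYCLES = 9    # row hit latency quoted on the DRAM Chip slide
ROW_MISS_CYCLES = 19  # row miss latency quoted on the DRAM Chip slide


class Request(NamedTuple):
    core: int
    row: int
    col: int


def reorder_row_hit_first(queue: List[Request]) -> List[Request]:
    """Assumed policy: serve the oldest request that hits the open row,
    otherwise the oldest pending request."""
    pending = list(queue)
    ordered: List[Request] = []
    open_row: Optional[int] = None
    while pending:
        hit = next((r for r in pending if r.row == open_row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        ordered.append(req)
        open_row = req.row
    return ordered


def row_switches(queue: List[Request]) -> int:
    """Number of times consecutive requests target different rows."""
    return sum(1 for prev, cur in zip(queue, queue[1:]) if prev.row != cur.row)


def completion_cycles(queue: List[Request]) -> List[int]:
    """Cycle at which each request completes, assuming a single bank
    that starts with no open row."""
    open_row: Optional[int] = None
    now = 0
    done = []
    for req in queue:
        now += ROW_HIT_CYCLES if req.row == open_row else ROW_MISS_CYCLES
        open_row = req.row
        done.append(now)
    return done


initial = [Request(1, 1, 1), Request(2, 2, 1), Request(1, 1, 2)]
reordered = reorder_row_hit_first(initial)
```

Under these assumptions the reordered queue finishes all three requests in 47 cycles instead of 57, but Core2's request completes at cycle 47 rather than 38. That is the trade-off the slides describe: better throughput, less predictable per-core delay.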

State of the Art
[Diagram: Core 1-4, the memory controller replaced by a predictable memory controller, DRAM]
Predictable DRAM controller h/w: [Akesson07] [Paolieri09] [Reineke11]
  Not available in COTS

Our Approach
OS-level approach
  Works on commodity h/w
[Diagram: Core 1-4, an OS control mechanism, memory controller, DRAM]
Guarantees performance of each core
Maximizes memory performance if possible

MemGuard
[Diagram: MemGuard inside the operating system, with a reclaim manager and one B/W regulator per core (0.9GB/s, 0.1GB/s, 0.1GB/s, 0.1GB/s); each of Core1-Core4 has a PMC; below them the multicore processor, memory controller, and DRAM DIMM]
Memory bandwidth reservation and reclaiming

Memory Bandwidth Reservation
Idea
  The OS monitors and enforces each core's memory bandwidth usage
[Figure: budget (0 to 2) and core activity (computation, memory fetch) over time (1ms, 2ms), with "Enqueue tasks" and "Dequeue tasks" events marked]
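The reservation idea, a per-core budget that is monitored and enforced every period, can be sketched as a small model. This is a minimal sketch under stated assumptions, not the MemGuard kernel module: the class `Regulator`, its methods, and the helper `simulate` are hypothetical names, budgets are counted in memory requests, and throttling a core stands in for dequeuing its tasks.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Regulator:
    """Hypothetical per-core budget regulator (illustrative model only)."""
    budget_per_period: int   # memory requests allowed in each period
    remaining: int = 0
    throttled: bool = True   # nothing may run before the first period starts

    def start_period(self) -> None:
        # Replenish the budget; the core's tasks are enqueued again.
        self.remaining = self.budget_per_period
        self.throttled = self.remaining <= 0

    def on_memory_request(self) -> bool:
        """Account for one request. Returns False if the core is throttled."""
        if self.throttled:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            # Budget used up: the core's tasks are dequeued until the next period.
            self.throttled = True
        return True


def simulate(reg: Regulator, demand_per_period: Sequence[int]) -> List[int]:
    """Requests actually served in each period for a given demand trace."""
    served = []
    for demand in demand_per_period:
        reg.start_period()
        served.append(sum(reg.on_memory_request() for _ in range(demand)))
    return served


served = simulate(Regulator(budget_per_period=2), [5, 1, 0])
```

With a budget of 2 per period, a core that wants 5, 1, and 0 requests in three consecutive periods is served 2, 1, and 0: demand above the budget is cut off, and unused budget is simply lost.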

Memory Bandwidth Reservation
Key Insight
  B/W regulators control memory request rates
  If the (cores') request rate is at most the (DRAM) service rate, the (memory controller) queuing delay is minimal
System-wide reservation rule: reserve up to the guaranteed bandwidth rmin
  $\sum_{i=1}^{m} B_i \le r_{min}$
  m: number of cores
  B_i: Core i's b/w reservation

Guaranteed Bandwidth: rmin
Worst-case DRAM performance (service rate)
  All memory requests go to the same bank (no bank-level parallelism) and cause row misses
Example (PC6400-DDR2*)
  Peak B/W: 6.4GB/s
    64 bytes I/O = 10ns; command latency is hidden by interleaving
  Calculated guaranteed B/W: 1.3GB/s
    PRE + ACT + RD + I/O (8x8 bytes) = 47.5ns
  Measured guaranteed B/W: 1.2GB/s
Performance isolation
  Sum of memory b/w reservations <= guaranteed b/w
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

Memory Access Pattern
[Figure: two plots of memory requests over time (ms)]
Memory access patterns vary over time
Static reservation is inefficient
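The bandwidth figures in the PC6400-DDR2 example and the system-wide reservation rule can be checked with a few lines of arithmetic. The function names below are introduced here for illustration, and reservations are expressed in MB/s so that the comparison uses integers; the 0.9/0.1/0.1/0.1 GB/s split is the one shown in the MemGuard diagram.

```python
from typing import Sequence

CACHE_LINE_BYTES = 64  # one memory request transfers 8 x 8 bytes


def bandwidth_gb_per_s(bytes_per_access: float, ns_per_access: float) -> float:
    # bytes per nanosecond is numerically equal to GB/s (10**9 bytes per second)
    return bytes_per_access / ns_per_access


def reservation_is_guaranteed(reservations_mb_s: Sequence[int], r_min_mb_s: int) -> bool:
    """System-wide rule: the sum of per-core reservations must not exceed r_min."""
    return sum(reservations_mb_s) <= r_min_mb_s


# Peak: one 64-byte transfer every 10 ns.
peak = bandwidth_gb_per_s(CACHE_LINE_BYTES, 10.0)

# Worst case: every access pays PRE + ACT + RD + I/O = 47.5 ns.
r_min_calculated = bandwidth_gb_per_s(CACHE_LINE_BYTES, 47.5)

# Measured guaranteed bandwidth from the slides: 1.2 GB/s = 1200 MB/s.
R_MIN_MEASURED_MB_S = 1200
ok = reservation_is_guaranteed([900, 100, 100, 100], R_MIN_MEASURED_MB_S)
too_much = reservation_is_guaranteed([1000, 300], R_MIN_MEASURED_MB_S)
```

Here `peak` evaluates to 6.4 and `r_min_calculated` to about 1.35, which the slides quote as 1.3GB/s. The four-core split sums to exactly 1200 MB/s and passes the rule, while 1000 + 300 MB/s does not.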

Memory Bandwidth Reclaiming
Key objective
  Redistribute excess bandwidth to demanding cores
  Improve memory b/w utilization
Predictive bandwidth donation and reclaiming
  Donate unneeded budget predictively
  Reclaim on an on-demand basis

Reclaim Example
Time 0: Initial budget is 3 for both cores
Time 3, 4: Decrement budgets
Time 10 (period 1): Predictive donation (total 4)
Time 12, 15: Core 1 reclaims
Time 16: Core 0 reclaims
Time 17: Core 1 has no remaining budget; dequeue tasks
Time 20 (period 2): Core 0 donates 2; Core 1 does not donate

Best-effort Bandwidth Sharing
Key objective
  Utilize best-effort bandwidth whenever possible
Best-effort bandwidth
  Available after all cores use their budgets (i.e., after the guaranteed bandwidth is delivered) and before the next period begins
Sharing policy
  Maximize throughput
  Broadcast to all cores to continue executing

Evaluation Platform
[Diagram: Intel Core2Quad with Core 0-3, each with I and D caches, two L2 caches, memory controller, DRAM]
Intel Core2Quad 8400, 4MB L2 cache, PC6400 DDR2 DRAM
  Prefetchers were turned off for evaluation
  Also ported to the PowerPC-based P4080 (8 cores) and the ARM-based Exynos4412 (4 cores)
Modified Linux kernel 3.6.0 + MemGuard kernel module
  https://github.com/heechul/memguard/wiki/MemGuard
Used all 29 benchmarks from SPEC2006 and synthetic benchmarks

Evaluation Results
[Diagram: 462.libquantum (foreground) on C0 and memory hogs (background) on C2, each with an L2 cache, sharing memory on an Intel Core2]
[Figure: normalized IPC (0.0 to 1.2) of the foreground (462.libquantum)]
53% performance reduction

Evaluation Results
[Diagram: 462.libquantum (foreground) on C0 at 1GB/s and memory hogs (background) on C2 at 0.2GB/s, each with an L2 cache, sharing memory on an Intel Core2]
[Figure: normalized IPC (0.0 to 1.2) of the foreground (462.libquantum), with the guaranteed performance marked]
Reservation provides performance isolation

Evaluation Results
[Diagram: 462.libquantum (foreground) on C0 at 1GB/s and memory hogs (background) on C2 at 0.2GB/s, each with an L2 cache, sharing memory on an Intel Core2]
[Figure: normalized IPC (0.0 to 1.2) of the foreground (462.libquantum), with the guaranteed performance marked]
Reclaiming and sharing maximize performance

SPEC2006 Results
[Figure: normalized IPC (0.0 to 3.0) with the foreground (X-axis) @1.0GB/s, normalized to guaranteed performance; X-axis: SPEC2006 benchmarks]
Guarantees (soft) performance of the foreground (X-axis) w.r.t. the 1.0GB/s memory b/w reservation
[Diagram: X-axis benchmark (foreground) on C0 at 1GB/s and 470.lbm (background) on C2 at 0.2GB/s, each with an L2 cache, sharing memory on an Intel Core2]

SPEC2006 Results
[Figure: normalized IPC (0.0 to 9.0) with the foreground (X-axis) @1.0GB/s and the background (470.lbm) @0.2GB/s, normalized to guaranteed performance; X-axis: SPEC2006 benchmarks]
Improves overall throughput
  Background: 368%, foreground (X-axis): 6%
[Diagram: X-axis benchmark (foreground) on C0 at 1GB/s and 470.lbm (background) on C2 at 0.2GB/s, each with an L2 cache, sharing memory on an Intel Core2]

Conclusion
Inter-core memory interference
  Big challenge for multi-core based real-time systems
  Sources: queuing delay in the MC, state-dependent latency in DRAM
MemGuard
  OS mechanism providing efficient per-core memory performance isolation on COTS H/W
  Memory bandwidth reservation and reclaiming support
  https://github.com/heechul/memguard/wiki/MemGuard

Thank you.

Effect of Reclaim
IPC improvement of background ([email protected]/s) is 3.8x
IPC reduction of foreground ([email protected]/s) is 3%

Reclaim Underrun Error

Effect of Spare Sharing
IPC of background ([email protected]/s) improves 40%
IPC of foreground ([email protected]/s) also improves 9%

Isolation and Throughput
Effect of rmin
4-core configuration

Isolation Effect of Reservation
[Figure labels: Isolation; Solo [email protected]/s]
Core 2: 0.2 to 2.0 GB/s for lbm
Core 0: 1.0 GB/s for X-axis
Sum of b/w reservations < rmin (1.2GB/s): isolation
  1.0GB/s (X-axis) + 0.2GB/s (lbm) = rmin

Effect of MemGuard
Soft real-time application on each core
Provides differentiated memory bandwidth
  Weight for each core = 1:2:4:8 for the guaranteed b/w; spare bandwidth sharing is enabled

Hard/Soft Reservation on MemGuard
Hard reservation (w/o reclaiming)
  Can guarantee memory bandwidth B_i regardless of other cores at each period
  Wasted if not used
Soft reservation (w/ reclaiming)
  Misprediction can cause a missed b/w guarantee at each period
  Error rate is small: less than 5%
Selectively applicable on a per-core basis
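The difference between hard and soft reservation can be illustrated with a one-period model of predictive donation and on-demand reclaiming. This is a sketch under stated assumptions, not MemGuard's implementation: the function `run_period` is hypothetical, the predicted usage is an input (the predictor is not modeled), budgets are counted in memory requests, and a core reclaims one request's worth of budget at a time. It does not reproduce the timeline of the Reclaim Example slide.

```python
from typing import List, Sequence, Tuple


def run_period(
    budgets: Sequence[int],
    predicted_use: Sequence[int],
    requests: Sequence[int],
) -> Tuple[List[int], List[bool]]:
    """Simulate one regulation period.

    budgets:       reserved budget per core, in memory requests
    predicted_use: predicted demand per core (assumed input)
    requests:      core ids in the order their memory requests arrive
    Returns (requests served per core, throttled flag per core).
    """
    # Each core keeps only what it is predicted to need...
    local = [min(b, p) for b, p in zip(budgets, predicted_use)]
    # ...and donates the rest of its budget to a shared pool.
    pool = sum(b - keep for b, keep in zip(budgets, local))
    served = [0] * len(budgets)
    throttled = [False] * len(budgets)

    for core in requests:
        if throttled[core]:
            continue
        if local[core] == 0:
            if pool == 0:
                # No local budget and nothing to reclaim: dequeue the core's tasks.
                throttled[core] = True
                continue
            # Reclaim one unit of donated budget on demand.
            pool -= 1
            local[core] += 1
        local[core] -= 1
        served[core] += 1
    return served, throttled


budgets = [3, 3]
arrivals = [1, 1, 1, 1, 1, 0, 0, 0]  # core 1 issues 5 requests, then core 0 issues 3

# Hard reservation: no donation, so predicted use equals the full budget.
hard = run_period(budgets, predicted_use=budgets, requests=arrivals)
# Soft reservation: both cores are predicted to need only 1 request.
soft = run_period(budgets, predicted_use=[1, 1], requests=arrivals)
```

In the hard case each core is served exactly its budget of 3 and core 1 is throttled on its fourth request. In the soft case core 1 reclaims the whole pool and is served 5, while core 0, whose demand was underpredicted, is served only 1 of its reserved 3. That is the kind of missed guarantee the slide attributes to misprediction.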
