Understanding Latency Variation in Modern DRAM Chips ...


Understanding and Improving Latency of DRAM-Based Memory Systems
Thesis Oral
Kevin Chang
Committee: Prof. Onur Mutlu (Chair), Prof. James Hoe, Prof. Kayvon Fatahalian, Prof. Stephen Keckler (NVIDIA, UT Austin), Prof. Moinuddin Qureshi (Georgia Tech.)

The March For Moore
Processor: 4K transistors (Intel 8080, 1974) to 4B transistors.
Main memory, or DRAM (Dynamic Random Access Memory): 1Kb (Intel 1103, 1970) to 8Gb.

PROBLEM
DRAM latency has been relatively stagnant.

Main Memory Latency Lags Behind
[Figure: DRAM improvement (log scale), 1999 to 2017. Capacity: 128x. Bandwidth: 20x. Latency: 1.3x.]
Memory latency remains almost constant.

DRAM Latency Is Critical for Performance
In-memory databases [Mao+, EuroSys12; Clapp+ (Intel), IISWC15]
Graph/tree processing [Xu+, IISWC12; Umuroglu+, FPL15]
In-memory data analytics [Clapp+ (Intel), IISWC15; Awan+, BDCloud15]
Datacenter workloads [Kanev+ (Google), ISCA15]
Long memory latency is a performance bottleneck.

Goal
Improve latency of DRAM (main memory).

Different DRAM Latency Problems
1. Slow bulk data movement between two memory locations
2. Refresh delays memory accesses
3. High standard latency to mitigate cell variation
4. Voltage affects latency

Thesis Statement
Memory latency can be significantly reduced with a multitude of low-cost architectural techniques that aim to reduce different causes of long latency.

Contributions
Low-Cost Architectural Features in DRAM
Understanding and overcoming the latency limitation in DRAM
1. Slow bulk data movement between two memory locations: Low-Cost Inter-Linked Subarrays (LISA) [HPCA16]
2. Refresh delays memory accesses: Mitigating Refresh Latency by Parallelizing Accesses with Refreshes (DSARP) [HPCA14]
3. High standard latency to mitigate cell variation: Understanding and Exploiting Latency Variation in DRAM (FLY-DRAM) [SIGMETRICS16]
4. Voltage affects latency: Understanding and Exploiting Latency-Voltage Trade-Off (Voltron) [SIGMETRICS17]

DRAM Background
What's inside a DRAM chip?
How to access DRAM?
How long does accessing data take?

High-Level DRAM Organization
[Figure: a DRAM channel connected to a DIMM (dual in-line memory module), which holds DRAM chips.]
[Figure: inside a DRAM chip. A bank contains a row decoder and subarrays of 512 rows x 8Kb. DRAM cells connect to wordlines and bitlines. Each subarray has a row buffer built from sense amplifiers (S) and precharge units (P). A 64b internal data bus connects the subarrays to the I/O.]

Reading Data From DRAM
1. ACTIVATE: Store the row into the row buffer
2. READ: Select the target column and drive to CPU (to bank I/O)
3. PRECHARGE: Reset the bitlines for a new ACTIVATE

DRAM Access Latency
1. Activation latency: tRCD (13ns / 50 cycles)
2. Precharge latency: tRP (13ns / 50 cycles)
[Figure: command and data timeline. ACTIVATE, then READ returning a 64B cache line, then PRECHARGE before the next ACT.]
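The three-command access sequence and its latencies can be sketched as a toy bank model. This is a minimal illustration, not the deck's simulator: tRCD and tRP use the 13ns figures from the slide, while `T_COL_NS` (the column-read latency) and the `Bank` class are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

T_RCD_NS = 13.0  # activation latency from the slide
T_RP_NS = 13.0   # precharge latency from the slide
T_COL_NS = 13.0  # assumed column-read latency; not given on the slide


@dataclass
class Bank:
    """Toy model of one DRAM bank with a single row buffer."""
    open_row: Optional[int] = None  # None means the bank is precharged

    def read(self, row: int) -> Tuple[List[str], float]:
        """Return the commands issued and their summed latency in ns."""
        commands: List[str] = []
        latency = 0.0
        if self.open_row is not None and self.open_row != row:
            # Row conflict: reset the bitlines before a new ACTIVATE.
            commands.append("PRECHARGE")
            latency += T_RP_NS
            self.open_row = None
        if self.open_row is None:
            # Store the target row into the row buffer.
            commands.append("ACTIVATE")
            latency += T_RCD_NS
            self.open_row = row
        # Select the target column and drive it to the CPU.
        commands.append("READ")
        latency += T_COL_NS
        return commands, latency


if __name__ == "__main__":
    bank = Bank()
    for row in (5, 5, 7):
        print(row, *bank.read(row))
```

Under these assumptions, a read to an already open row issues only READ, while a read to a different row pays for PRECHARGE and ACTIVATE first, which is why tRP and tRCD matter for every row conflict.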


Problem: Inefficient Bulk Data Movement
Bulk data movement is a key operation in many applications.
memmove & memcpy: 5% of cycles in Google's datacenter [Kanev+, ISCA15]
[Figure: CPU cores, LLC, and memory controller connected to memory over a 64-bit channel; data travels from src to dst through the CPU.]
Long latency and high energy.

Moving Data Inside DRAM?
[Figure: a DRAM chip with several banks. Each bank holds Subarray 1 through Subarray N, each 512 rows x 8Kb, connected by a 64b internal data bus.]
Low connectivity in DRAM is the fundamental bottleneck for bulk data movement.
Goal: Provide a new substrate to enable wide connectivity between subarrays.

Our proposal: Low-Cost Inter-Linked SubArrays (LISA)

Observations
1. Bitlines serve as a bus that is as wide as a row.
2. Bitlines between subarrays are close but disconnected.

Low-Cost Interlinked Subarrays (LISA)
Interconnect bitlines of adjacent subarrays in a bank using isolation transistors (links).
[Figure: 8kb-wide links between adjacent subarrays, next to the 64b internal data bus.]

Low-Cost Interlinked Subarrays (LISA)
Row Buffer Movement (RBM): Move a row of data in an activated row buffer to a precharged one.
[Figure: with the links turned ON, charge sharing moves the row between row buffers.]
4KB data in 8ns: 500 GB/s, 26x the bandwidth of a DDR4-2400 channel.
0.8% DRAM chip area overhead.
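The RBM bandwidth figures can be checked with a short calculation. The sketch below assumes binary kilobytes for the 4KB row and a 64-bit (8-byte) DDR4-2400 channel at 2400 million transfers per second; the function names are introduced here for illustration.

```python
RBM_BYTES = 4 * 1024             # 4KB moved per RBM (slide figure, binary KB assumed)
RBM_TIME_S = 8e-9                # 8ns per RBM (slide figure)
DDR4_2400_TRANSFERS_PER_S = 2400e6  # DDR4-2400 transfer rate
CHANNEL_BYTES = 8                # 64-bit channel width


def rbm_bandwidth_gb_per_s() -> float:
    """Bandwidth of one row buffer movement, in GB/s."""
    return RBM_BYTES / RBM_TIME_S / 1e9


def ddr4_2400_bandwidth_gb_per_s() -> float:
    """Peak bandwidth of one DDR4-2400 channel, in GB/s."""
    return DDR4_2400_TRANSFERS_PER_S * CHANNEL_BYTES / 1e9


def rbm_speedup_over_channel() -> float:
    """How many times wider RBM is than the off-chip channel."""
    return rbm_bandwidth_gb_per_s() / ddr4_2400_bandwidth_gb_per_s()


if __name__ == "__main__":
    print(round(rbm_bandwidth_gb_per_s()), "GB/s via RBM")
    print(round(ddr4_2400_bandwidth_gb_per_s(), 1), "GB/s per channel")
    print(round(rbm_speedup_over_channel(), 1), "x")
```

With binary kilobytes this gives about 512 GB/s against 19.2 GB/s for the channel; with decimal kilobytes it gives exactly 500 GB/s. Either way the ratio lands between 26x and 27x, consistent with the rounded figures on the slide.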

Three New Applications of LISA to Reduce Latency
1. Fast bulk data copy

1. Rapid Inter-Subarray Copying (RISC)
Goal: Efficiently copy a row across subarrays.
Key idea: Use RBM to form a new command sequence:
1. Activate src row (in Subarray 1)
2. RBM from SA1 to SA2
3. Activate dst row in Subarray 2 (write row buffer into dst row)
Reduces row-copy latency by 9x and DRAM energy by 48x.

Methodology
Cycle-level simulator: Ramulator [Kim+, CAL15]
Four out-of-order cores
Two DDR3-1600 channels
Benchmarks: TPC, STREAM, SPEC2006, DynoGraph, random, bootup, forkbench, shell script

RISC Outperforms Prior Work
[Figure: normalized speedup and normalized DRAM energy over 50 workloads for RowClone [Seshadri+, MICRO13] and RISC. Speedup: RISC 66%, RowClone -24%. DRAM energy: RISC -55%, RowClone -5%.]
Rapid Inter-Subarray Copying (RISC) using LISA improves system performance.
RowClone limits bank-level parallelism.

Three New Applications of LISA to Reduce Latency
1. Fast bulk data copy
2. In-DRAM caching

2. Variable Latency DRAM (VILLA)
Goal: Reduce access latency with low area overhead.
Motivation: Trade-off between area and latency.
Long bitline: high latency.
Short bitline: lower resistance and capacitance, so low latency, but high area overhead.

2. Variable Latency DRAM (VILLA)
Key idea: Heterogeneous DRAM design by adding a few fast subarrays as a low-cost cache in each bank.
Benefits: Reduce access latency for frequently-accessed data.
[Figure: a slow subarray with 512 rows next to a fast subarray with 32 rows.]
Challenge: How to move data efficiently from slow to fast subarrays?
LISA: Cache rows rapidly from slow to fast subarrays.
Reduces hot data access latency by 2.2x at only 1.6% area overhead.

VILLA Improves System Performance by Caching Hot Data
[Figure: normalized speedup of VILLA over the baseline and VILLA cache hit rate (%) across 50 workloads. Speedup: avg 5%, max 16%.]
LISA enables an effective in-DRAM caching scheme.

Three New Applications of LISA to Reduce Latency
1. Fast bulk data copy
2. In-DRAM caching
3. Fast precharge

3. Linked Precharge (LIP)
Problem: The precharge time is limited by the strength of one precharge unit.
Linked Precharge (LIP): LISA precharges a subarray using multiple precharge units.
[Figure: conventional DRAM precharges the activated row's subarray with its own precharge units; LISA DRAM turns the links on so that neighboring precharge units join in (linked precharging).]
Reduces precharge latency by 2.6x.

LIP Improves System Performance by Accelerating Precharge
[Figure: normalized speedup of LIP across 50 workloads. Avg: 8%. Max: 13%.]
LISA reduces precharge latency.

Latency Reduction of LISA
Latency of operations:
4KB data copying (x64 READ, x64 WRITE): 9x
ACTIVATE: VILLA 1.7x
PRECHARGE: VILLA 1.5x, LIP 2.6x
LISA is a versatile substrate that enables many new techniques.
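The RISC command sequence (activate the source row, move the row buffer with RBM, activate the destination row) can be illustrated with a toy functional model. This is a sketch under simplifying assumptions: the `Subarray` class, `rbm`, and `risc_copy` are hypothetical names, timing is ignored, and activating a row while the row buffer already holds data is modeled as the buffer overwriting that row.

```python
from typing import List, Optional


class Subarray:
    """Toy subarray: a few rows of bits plus one row buffer."""

    def __init__(self, num_rows: int, row_bits: int) -> None:
        self.rows: List[List[int]] = [[0] * row_bits for _ in range(num_rows)]
        self.row_buffer: Optional[List[int]] = None  # None means precharged

    def activate(self, row: int) -> None:
        if self.row_buffer is None:
            # Normal activation: latch the row into the row buffer.
            self.row_buffer = list(self.rows[row])
        else:
            # Buffer already holds data: it is written into the row.
            self.rows[row] = list(self.row_buffer)

    def precharge(self) -> None:
        self.row_buffer = None


def rbm(src: Subarray, dst: Subarray) -> None:
    """Row Buffer Movement: activated buffer to a precharged buffer."""
    if src.row_buffer is None or dst.row_buffer is not None:
        raise ValueError("RBM needs an activated source and a precharged destination")
    dst.row_buffer = list(src.row_buffer)


def risc_copy(src: Subarray, src_row: int, dst: Subarray, dst_row: int) -> None:
    src.activate(src_row)   # 1. Activate src row
    rbm(src, dst)           # 2. RBM from src subarray to dst subarray
    dst.activate(dst_row)   # 3. Activate dst row (buffer written into it)
    src.precharge()
    dst.precharge()


if __name__ == "__main__":
    a, b = Subarray(4, 8), Subarray(4, 8)
    a.rows[1] = [1, 0, 1, 1, 0, 0, 1, 0]
    risc_copy(a, 1, b, 3)
    print(b.rows[3])
```

The point of the model is that the data never crosses the narrow internal data bus: the whole row moves between row buffers in one step.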

What Does DRAM Latency Mean to You?
DRAM latency: Delay as specified in DRAM standards.
Memory controllers use these standardized latencies to access DRAM.
"The purpose of this Standard is to define the minimum set of requirements for JEDEC compliant SDRAM devices" (p.1), JEDEC DDRx standard
Key question: How does reducing latency affect DRAM accesses?

Goals
1. Understand and characterize reduced-latency behavior in modern DRAM chips.
2. Develop a mechanism that exploits our observation to improve DRAM latency.

Experimental Setup
Custom FPGA-based infrastructure.
Existing systems: Commands are generated and controlled by HW.
[Figure: a PC connected over PCIe to an FPGA, which drives a DDR3 DIMM.]

Experiments
Swept each timing parameter to read data, with a time step of 2.5ns (FPGA cycle time).
Check the correctness of data read back from DRAM: any errors (bit flips)?
Tested 240 DDR3 DRAM chips from three vendors: 30 DIMMs, capacity 1GB.

Experimental Results: Activation Latency

Variation in Activation Errors
Results from 7500 rounds over 240 chips.
[Figure: errors per 8KB row against activation latency/tRCD (ns), in steps of the 2.5ns step size below the 13.1ns standard. Regions are labeled "No errors", "Very few errors (<10 bits)", "Many errors", and "Rife with errors"; cells show different characteristics.]
Modern DRAM chips exhibit significant variation in activation latency.
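The timing sweep described under Experiments can be sketched against a simulated row. Everything here other than the 2.5ns step is an assumption: `SimulatedRow` is a hypothetical stand-in for the FPGA-attached DIMM, and the per-cell minimum tRCD values are made up to produce a plausible error curve.

```python
import random
from typing import Dict, List

STEP_NS = 2.5  # FPGA cycle time from the slides


class SimulatedRow:
    """Hypothetical stand-in for one DRAM row behind the FPGA."""

    def __init__(self, n_bits: int, seed: int) -> None:
        rng = random.Random(seed)
        self.data: List[int] = [rng.getrandbits(1) for _ in range(n_bits)]
        # Assumed per-cell minimum tRCD (ns) at which a read is still correct.
        self.min_trcd: List[float] = [
            rng.choice([5.0, 7.5, 10.0]) for _ in range(n_bits)
        ]

    def read(self, trcd_ns: float) -> List[int]:
        # A cell read faster than its minimum tRCD returns a flipped bit.
        return [
            bit if trcd_ns >= limit else bit ^ 1
            for bit, limit in zip(self.data, self.min_trcd)
        ]


def sweep_trcd(row: SimulatedRow, start_ns: float, stop_ns: float) -> Dict[float, int]:
    """Lower tRCD in STEP_NS steps and count bit flips at each setting."""
    errors: Dict[float, int] = {}
    trcd = start_ns
    while trcd >= stop_ns:
        read_back = row.read(trcd)
        errors[trcd] = sum(got != want for got, want in zip(read_back, row.data))
        trcd -= STEP_NS
    return errors


if __name__ == "__main__":
    for trcd, count in sweep_trcd(SimulatedRow(64, seed=1), 12.5, 2.5).items():
        print(f"tRCD={trcd}ns: {count} bit flips")
```

The sweep starts at 12.5ns rather than the 13.1ns standard only so that every setting is a multiple of the step; a real test would also repeat each setting over many rounds and data patterns.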

DRAM Latency Variation
Imperfect manufacturing process leads to latency variation in timing parameters.
[Figure: DRAM A, DRAM B, and DRAM C, each containing slow cells, placed on a DRAM latency axis from low to high.]

Experimental Results: Precharge Latency

Variation in Precharge Errors
Results from 4000 rounds over 240 chips.
[Figure: errors against precharge latency/tRP (ns), in steps of the step size below the 13.1ns standard. Regions are labeled "No errors", "Very few errors", "Many errors", and "Rife with errors"; annotations mark 8KB (one row) and 100 rows.]
Modern DRAM chips exhibit significant variation in precharge latency.

Spatial Locality of Slow Cells
[Figure: error maps for one DIMM at tRCD=7.5ns and one DIMM at tRP=7.5ns.]
Slow cells are concentrated at certain regions.

Mechanism: Flexible-Latency (FLY) DRAM

Mechanism to Reduce DRAM Latency
Observation: DRAM timing errors (slow DRAM cells) are concentrated on certain regions.
Flexible-LatencY (FLY) DRAM: A memory controller design that reduces latency.
Key idea:
1) Divide memory into regions of different latencies.
2) Memory controller: Use lower latency for regions without slow cells; higher latency for other regions.

Benefits of FLY-DRAM
[Figure: normalized performance over 40 workloads, with the fraction of fast cells for each configuration. Baseline (DDR3): 0% fast cells. FLY-DRAM (DIMM 1): 83% fast cells, 17.6% improvement. FLY-DRAM (DIMM 2): 99% fast cells, 19.5% improvement. Upper bound: 100% fast cells, 19.7% improvement.]
FLY-DRAM improves performance by exploiting latency variation in DRAM.

Latency Reduction of FLY-DRAM
Latency of operations:
4KB data copying (x64 READ, x64 WRITE): 9x
ACTIVATE: VILLA 1.7x, FLY 1.7x
PRECHARGE: VILLA 1.5x, LIP 2.6x, FLY 1.7x
Experimental demonstration of latency variation enables techniques to reduce latency.
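The FLY-DRAM key idea (profile regions, then pick a timing per region) can be sketched as follows. This is an illustrative model, not the controller from the thesis: the function names, the row-granularity regions, and the choice of 7.5ns as the reduced timing are assumptions; only the 13.1ns standard value comes from the slides.

```python
from typing import List, Set

STANDARD_NS = 13.1  # standard tRCD/tRP from the slides
REDUCED_NS = 7.5    # assumed reduced timing for regions without slow cells


def profile_slow_regions(errors_per_row: List[int], rows_per_region: int) -> Set[int]:
    """Mark a region slow if any of its rows showed errors at REDUCED_NS."""
    slow: Set[int] = set()
    for row, errors in enumerate(errors_per_row):
        if errors > 0:
            slow.add(row // rows_per_region)
    return slow


def timing_for_row(row: int, slow_regions: Set[int], rows_per_region: int) -> float:
    """Lower latency for regions without slow cells, standard otherwise."""
    if row // rows_per_region in slow_regions:
        return STANDARD_NS
    return REDUCED_NS


if __name__ == "__main__":
    # Hypothetical profiling result: only row 5 produced bit flips.
    observed = [0, 0, 0, 0, 0, 3, 0, 0]
    slow = profile_slow_regions(observed, rows_per_region=4)
    for r in (2, 5, 7):
        print(r, timing_for_row(r, slow, rows_per_region=4))
```

Because slow cells cluster spatially, a coarse region map like this can keep most rows on the reduced timing while a single slow cell only penalizes its own region.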

Motivation
DRAM voltage is an important factor that affects latency, power, and reliability.
Goal: Understand the relationship between latency and DRAM voltage and exploit this trade-off.

Methodology
FPGA platform with a DIMM and a voltage controller.
Tested 124 DDR3L DRAM chips (31 DIMMs).

Key Result: Voltage vs. Latency
[Figure: circuit-level SPICE simulation of latency against voltage, with the potential latency range marked.]
Trade-off between access latency and voltage.

Goal and Key Observation
Goal: Exploit the trade-off between voltage and latency to reduce energy consumption.
Approach: Reduce voltage, accepting performance loss due to increased latency.
Energy: Function of time (performance) and power (voltage).
Observation: An application's performance loss due to higher latency has a strong linear relationship with its memory intensity.

Mechanism: Voltron
Build a performance (linear) model to predict performance loss based on the selected voltage value.
Use the model to select a minimum voltage that satisfies a performance loss target specified by the user.
Results: Reduces system energy by 7.3% with a small performance loss of 1.8%.

Reducing Latency by Exploiting Voltage-Latency Trade-Off
Voltron exploits the latency-voltage trade-off to improve energy efficiency.
Another perspective: Increase voltage to reduce latency.
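Voltron's selection step can be sketched as a lookup over per-voltage linear models. The sketch below is a guess at the shape of such a mechanism, not the published model: the candidate voltages, the `(intercept, slope)` coefficients, the use of MPKI as the memory-intensity metric, and both function names are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical linear models, one per candidate voltage (V):
# predicted loss (%) = intercept + slope * memory intensity (MPKI).
# 1.35 V is the nominal DDR3L supply; the coefficients are made up.
MODELS: Dict[float, Tuple[float, float]] = {
    1.35: (0.0, 0.00),
    1.30: (0.2, 0.02),
    1.25: (0.5, 0.05),
    1.20: (1.0, 0.10),
    1.15: (2.0, 0.20),
}


def predict_loss(voltage: float, mpki: float) -> float:
    """Predicted performance loss (%) at this voltage for this workload."""
    intercept, slope = MODELS[voltage]
    return intercept + slope * mpki


def select_voltage(mpki: float, target_loss: float) -> float:
    """Pick the minimum voltage whose predicted loss meets the target."""
    feasible = [v for v in MODELS if predict_loss(v, mpki) <= target_loss]
    # Fall back to the nominal (highest) voltage if nothing qualifies.
    return min(feasible) if feasible else max(MODELS)


if __name__ == "__main__":
    for mpki in (0.0, 10.0, 100.0):
        print(mpki, select_voltage(mpki, target_loss=2.5))
```

Memory-intensive workloads stay near the nominal voltage under this scheme, while compute-bound ones can run at a lower voltage within the same loss target.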

Summary of DSARP
Problem: Refreshing DRAM blocks memory accesses, which prolongs the latency of memory requests.
Goal: Reduce refresh-induced latency on demand requests.
Key observation: Some subarrays and I/O remain completely idle during refresh.
Dynamic Subarray Access-Refresh Parallelization (DSARP): DRAM modification to enable idle DRAM subarrays to serve accesses during refresh.
0.7% DRAM area overhead.

Prior Work on Low-Latency DRAM
Uniform short-bitline DRAM: FCRAM, RLDRAM. Large area overhead (30% - 80%).
Heterogeneous bitline design:
TL-DRAM: Intra-subarray [Lee+, HPCA13]. Requires two fast rows to cache one slow row.
CHARM: Inter-bank [Son+, ISCA13]. High movement cost between slow and fast banks.
SRAM cache in DRAM [Hidaka+, IEEE Micro90]: Large area overhead (38% for 64KB) and complex control.
Our work: Low cost. Detailed experimental understanding via characterization of commodity chips.

CONCLUSION

Conclusion
Memory latency has remained mostly constant over the past decade, making it a system performance bottleneck for modern applications.
Simple and low-cost architectural mechanisms:
New DRAM substrate for fast inter-subarray data movement
Refresh architecture to mitigate refresh interference
Understanding latency behavior in commodity DRAM. Experimental characterization of:
1) Latency variation inside DRAM
2) Relationship between latency and DRAM voltage

Thesis Statement
Memory latency can be significantly reduced with a multitude of low-cost architectural techniques that aim to reduce different causes of long latency.

Future Research Direction
Latency characterization and optimization for other memory technologies: eDRAM; non-volatile memory (PCM, STT-RAM, etc.)
Understanding other aspects of DRAM: variation in power/energy consumption; security/reliability

Other Areas Investigated
Energy Efficient Networks-On-Chip [NOCS12, SBACPAD12, SBACPAD14]
Memory Schedulers for Heterogeneous Systems [ISCA12, TACO16]
Low-latency DRAM Architecture [HPCA15]
DRAM Testing Platform [HPCA17]

Acknowledgements
Onur Mutlu
James Hoe, Kayvon Fatahalian, Moinuddin Qureshi, and Steve Keckler
Safari group: Rachata Ausavarungnirun, Amirali Boroumand, Chris Fallin, Saugata Ghose, Hasan Hassan, Kevin Hsieh, Ben Jaiyen, Abhijith Kashyap, Samira Khan, Yoongu Kim, Donghyuk Lee, Yang Li, Jamie Liu, Yixin Luo, Justin Meza, Gennady Pekhimenko, Vivek Seshadri, Lavanya Subramanian, Nandita Vijaykumar, Hanbin Yoon, Hongyi Xin
Georgia Tech. collaborators: Prashant Nair, Jaewoong Sim
CALCM group
Friends
Family: parents, sister, and girlfriend

Intern mentors and industry collaborators
Sponsors: Intel and SRC for my fellowship; NSF and DOE; Facebook, Google, Intel, NVIDIA, VMware, Samsung

Thesis Related Publications

Improving DRAM Performance by Parallelizing Refreshes with Accesses
Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu
HPCA 2014

Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM
Kevin Chang, Prashant J. Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K. Qureshi, Onur Mutlu
HPCA 2016

Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization
Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh, Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, Onur Mutlu
SIGMETRICS 2016

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms
Kevin Chang, Abdullah Giray Yaglikci, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, Onur Mutlu
SIGMETRICS 2017
