18-740: Computer Architecture
Recitation 4: Rethinking Memory System Design

Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015
September 22, 2015

Agenda
- Review Assignments for Next Week
- Rethinking Memory System Design (Continued)
  - With a lot of discussion, hopefully

Review Assignments for Next Week

Required Reviews

- Due Tuesday Sep 29 @ 3pm
- Enter your reviews on the review website
- Please discuss ideas and thoughts on Piazza

Review Paper 1 (Required)
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,
"Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems"
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)

Related paper:
Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu,
"HAT: Heterogeneous Adaptive Throttling for On-Chip Networks"
Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012.

Review Paper 2 (Required)
Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel L. Loh, Chita R. Das, Mahmut T. Kandemir, and Onur Mutlu,
"Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance"
Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), San Francisco, CA, USA, October 2015.

Related paper:

Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt,
"Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow"
In Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture (MICRO-40), pp. 407-418, Chicago, IL, December 1-5, 2007. Slides. Pre-print.

Review Paper 3 (Required)
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu,
"Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture"
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

Related paper:
Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu,
"Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case"

Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), February 2015.

Review Paper 4 (Optional)
Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu,
"A Large-Scale Study of Flash Memory Errors in the Field"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Portland, OR, June 2015.

Related paper:
Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu,
"Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field"
Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.

Project Proposal
- Due Friday September 25, 2015
- Make sure your project is vetted by me before you write your proposal

Still Consider: Another Possible Project
- GPU Warp Scheduling Championship
- http://adwaitjog.github.io/gpu_scheduling.html

Rethinking Memory System Design

Some Promising Directions
- New memory architectures
  - Enabling emerging NVM technologies
  - Rethinking DRAM and flash memory
  - A lot of hope in fixing DRAM
- Hybrid memory systems
  - Single-level memory and storage
  - A lot of hope in hybrid memory systems and single-level stores
- System-level memory/storage QoS
  - A lot of hope in designing a predictable system

Rethinking DRAM
- In-Memory Computation
- Refresh
- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

Two Key Questions in 3D-Stacked PIM
- What is the minimal processing-in-memory support we can provide?
  - without changing the system significantly
  - while achieving significant benefits of processing in 3D-stacked memory
- How can we accelerate important applications if we use 3D-stacked memory as a coarse-grained accelerator?
  - what is the architecture and programming model?
  - what are the mechanisms for acceleration?

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (Ahn et al., ISCA 2015)

Large-Scale Graph Processing
- Large graphs are everywhere: 36 Million Wikipedia Pages, 1.4 Billion Facebook Users, 300 Million Twitter Users, 30 Billion Instagram Photos

- Scalable large-scale graph processing is challenging
  [Chart: speedup of 128 cores over 32 cores is only about 1.42x]

Key Bottlenecks in Graph Processing

    for (v: graph.vertices) {
      for (w: v.successors) {
        w.next_rank += weight * v.rank;
      }
    }

1. Frequent random memory accesses (v, w, w.rank, w.next_rank, w.edges)
2. Little amount of computation (weight * v.rank)
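To make the two bottlenecks concrete, here is a small self-contained C++ sketch of the same kernel; the Vertex layout and names are illustrative assumptions, not code from the paper. Each w.next_rank update touches a successor chosen by graph structure, so consecutive accesses land in unrelated DRAM rows, while the work per access is a single multiply-add.

    // Illustrative sketch of the PageRank-style kernel above (assumed data layout).
    #include <cstdint>
    #include <vector>

    struct Vertex {
        double rank = 1.0;
        double next_rank = 0.0;
        std::vector<uint32_t> successors;  // out-edges
    };

    // One iteration of the update loop on the slide.
    void pagerank_step(std::vector<Vertex>& graph, double weight) {
        for (Vertex& v : graph) {
            double contrib = weight * v.rank;       // little computation per access
            for (uint32_t w : v.successors) {
                graph[w].next_rank += contrib;      // frequent random memory access
            }
        }
    }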

Challenges in Scalable Graph Processing
- Challenge 1: How to provide high memory bandwidth to computation units in a practical way?
  - Processing-in-memory based on 3D-stacked DRAM
- Challenge 2: How to design computation units that efficiently exploit large memory bandwidth?
  - Specialized in-order cores called Tesseract cores
  - Latency-tolerant programming model
  - Graph-processing-specific prefetching schemes

Tesseract System for Graph Processing
[Diagram, shown three times with different parts highlighted: a host processor attached through a memory-mapped accelerator interface (noncacheable, physically addressed) to 3D-stacked memory; each vault contains an in-order core, LP, PF buffer, MTP, message queue, DRAM controller, and NI, connected by a crossbar network]
- Communications via remote function calls
- Prefetching

Evaluated Systems
- DDR3-OoO (with FDP): 8 OoO cores at 4 GHz (x4); 102.4 GB/s memory bandwidth
- HMC-OoO (with FDP): 8 OoO cores at 4 GHz (x4); 640 GB/s
- HMC-MC: 128 in-order cores at 2 GHz (x4); 640 GB/s
- Tesseract (32-entry MQ, 4KB PF buffer): 32 Tesseract cores per cube; 8 TB/s

Workloads
- Five graph processing algorithms: average teenage follower, conductance, PageRank, single-source shortest path, vertex cover
- Three real-world large graphs: ljournal-2008 (social network), enwiki-2003 (Wikipedia), indochina-0024 (web graph)
  - 4~7M vertices, 79~194M edges

Tesseract Graph Processing Performance
[Chart: speedup over DDR3-OoO (= 1): HMC-OoO 1.56, HMC-MC 1.25, Tesseract 9.02, Tesseract-LP 11.63, Tesseract-LP-MTP 13.78]

Memory Bandwidth Consumption
[Chart: memory bandwidth in TB/s: DDR3-OoO 0.08, HMC-OoO 0.19, HMC-MC 0.24, Tesseract 1.29, Tesseract-LP 2.24, Tesseract-LP-MTP 2.92]

Effect of Bandwidth & Programming Model
[Chart: isolating the effect of memory bandwidth (HMC-MC bandwidth, 640 GB/s, vs. Tesseract bandwidth, 8 TB/s) and of the programming model (no prefetching); speedups of 2.3, 3.02, and 6.5]

Memory Energy Consumption (Normalized)
[Chart: energy of memory layers, logic layers, and cores; Tesseract with prefetching reduces memory energy by 87% relative to HMC-OoO]

Tesseract Summary
- How can we accelerate large-scale graph processing using 3D-stacked memory as a coarse-grained accelerator?
- Tesseract: 3D-stacked PIM accelerator for graph processing
  - Many in-order cores in a memory chip

  - New message passing mechanism for latency hiding
  - New hardware prefetchers for graph processing
  - Programming interface that exploits our hardware design
- Promising results on five graph processing workloads
  - ~14x performance improvement & 87% energy reduction
  - Scalable: memory-capacity-proportional performance
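As a rough illustration only of what a latency-hiding message passing interface in this style might look like (the names pim_async_call and pim_barrier and their signatures are assumptions, not the actual Tesseract interface from Ahn et al.): an update is shipped to the core that owns the target vertex and buffered in its message queue, and the sender continues without waiting; a barrier drains the queues before the next iteration.

    // Hypothetical, simplified single-node model of a non-blocking remote function call.
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    using VertexId = uint32_t;

    struct TesseractCoreSketch {
        // Each in-order core owns the vertices mapped to its local memory vault.
        std::vector<double> next_rank;
        std::queue<std::function<void()>> message_queue;  // buffered remote calls

        // Non-blocking: enqueue work to run where the data lives, return immediately
        // so the caller keeps issuing requests (latency hiding).
        void pim_async_call(VertexId w, double contrib) {
            message_queue.push([this, w, contrib] { next_rank[w] += contrib; });
        }

        // Barrier-style drain: all buffered remote calls are applied before the
        // next iteration of the graph algorithm begins.
        void pim_barrier() {
            while (!message_queue.empty()) {
                message_queue.front()();
                message_queue.pop();
            }
        }
    };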

Two Approaches to In-Memory Processing
1. Minimally change DRAM to enable simple yet powerful computation primitives
   - RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data (Seshadri et al., MICRO 2013)
   - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
2. Exploit the control logic in 3D-stacked memory to enable more comprehensive computation near memory
   - PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture (Ahn et al., ISCA 2015)
   - A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (Ahn et al., ISCA 2015)

In-Memory Computation: Summary
- It is time to enable mechanisms for performing computation where it makes sense
  - Push from memory technology
  - Pull from systems and applications
- Multiple approaches for in-memory computation can be successful

  - Minimally changing DRAM to enable a bulk computation model
  - Exploiting the control logic in 3D-stacked memory
- Approaches require cross-layer cooperation and research
  - Architecture, systems, compilers, programming models, algorithms, ...

Rethinking DRAM
- In-Memory Computation
- Refresh

- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

DRAM Refresh
- DRAM capacitor charge leaks over time
- The memory controller needs to refresh each row periodically to restore charge
  - Activate each row every N ms
  - Typical N = 64 ms
- Downsides of refresh
  -- Energy consumption: each refresh consumes energy
  -- Performance degradation: DRAM rank/bank unavailable while refreshed
  -- QoS/predictability impact: (long) pause times during refresh
  -- Refresh rate limits DRAM capacity scaling

Refresh Overhead: Performance
[Chart: refresh causes roughly 8% performance loss in today's devices, projected to grow to roughly 46% at future densities]
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
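A back-of-envelope sketch of why this overhead grows with density: a rank is unavailable for roughly tRFC out of every tREFI. The tRFC values below are illustrative DDR3-era assumptions, not numbers from the RAIDR paper.

    // Refresh overhead sketch with assumed timing parameters.
    #include <cstdio>

    int main() {
        const double tREFI_ns = 7800.0;  // one auto-refresh command every 7.8 us
        struct { const char* density; double tRFC_ns; } parts[] = {
            {"4 Gb", 260.0},
            {"8 Gb", 350.0},
            {"32 Gb (projected)", 1000.0},
        };
        for (auto& p : parts) {
            double busy = p.tRFC_ns / tREFI_ns;  // fraction of time spent refreshing
            std::printf("%s: %.1f%% of time unavailable due to refresh\n",
                        p.density, busy * 100.0);
        }
        return 0;
    }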

Refresh Overhead: Energy
[Chart: refresh consumes roughly 15% of DRAM energy in today's devices, projected to grow to roughly 47% at future densities]
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.

Retention Time Profile of DRAM
[Figure: retention time profile of DRAM cells]

RAIDR: Eliminating Unnecessary Refreshes
- Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+ EDL'09] [Liu+ ISCA'13]
- Key idea: Refresh rows containing weak cells more frequently, other rows less frequently
  1. Profiling: profile retention time of all rows
  2. Binning: store rows into bins by retention time in the memory controller
     - Efficient storage with Bloom filters (only 1.25KB for 32GB memory)
  3. Refreshing: memory controller refreshes rows in different bins at different rates
- Results: 8-core, 32GB, SPEC, TPC-C, TPC-H
  - 74.6% refresh reduction @ 1.25KB storage
  - ~16%/20% DRAM dynamic/idle power reduction
  - ~9% performance improvement
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
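A minimal sketch of the binning step, assuming a simple Bloom filter in the memory controller; the hash function, filter size, and the two refresh rates are illustrative, not RAIDR's actual configuration. False positives are safe: a row can only be refreshed more often than needed, never less.

    // Bloom-filter binning sketch (assumed parameters).
    #include <bitset>
    #include <cstdint>

    class BloomFilter {
        std::bitset<8192> bits;  // 1 KB of controller storage in this sketch
        static uint64_t hash(uint64_t x, uint64_t seed) {
            x ^= seed; x *= 0x9E3779B97F4A7C15ULL; x ^= x >> 29;
            return x;
        }
    public:
        void insert(uint64_t row) {
            for (uint64_t s = 1; s <= 3; ++s) bits.set(hash(row, s) % bits.size());
        }
        bool maybe_contains(uint64_t row) const {
            for (uint64_t s = 1; s <= 3; ++s)
                if (!bits.test(hash(row, s) % bits.size())) return false;
            return true;  // possibly a false positive: extra refreshes, never data loss
        }
    };

    // Weak rows found during profiling are inserted into the filter; at refresh
    // time, filter hits are refreshed at the fast rate, everything else slower.
    int refresh_interval_ms(const BloomFilter& weak_rows, uint64_t row) {
        return weak_rows.maybe_contains(row) ? 64 : 256;  // illustrative rates
    }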

Going Forward (for DRAM and Flash)
- How to find out weak memory cells/rows
  - Liu+, "An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms," ISCA 2013.
  - Khan+, "The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study," SIGMETRICS 2014.
- Low-cost system-level tolerance of memory errors
  - Luo+, "Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost," DSN 2014.
  - Cai+, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory," Intel Technology Journal 2013.
  - Cai+, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories," SIGMETRICS 2014.
- Tolerating cell-to-cell interference at the system level
  - Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
  - Cai+, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation," ICCD 2013.

Experimental DRAM Testing Infrastructure
- An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)
- The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)

- Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)
- Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)
- AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)

Experimental Infrastructure (DRAM)
[Photos: FPGA-based DRAM testing infrastructure with a temperature controller, heater, FPGAs, and host PC]
Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

More Information [ISCA'13, SIGMETRICS'14]

Online Profiling of DRAM in the Field
1. Initially protect DRAM with ECC
2. Periodically test parts of DRAM

3. Adjust refresh rate and reduce ECC
- Optimize DRAM and mitigate errors online without disturbing the system and applications
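A sketch of how the three steps above might be driven in software; all names and the testing policy are illustrative assumptions, not a specific published mechanism. The system starts conservative, tests a region, and relaxes its refresh rate only where the test passes.

    // Online profiling loop sketch (hypothetical interfaces).
    #include <cstdint>
    #include <vector>

    enum class RefreshRate { Fast, Relaxed };

    struct DramRegion {
        uint64_t start_row = 0, num_rows = 0;
        RefreshRate rate = RefreshRate::Fast;   // step 1: begin conservatively under ECC
    };

    // Placeholder for a real test that writes/reads patterns at a longer refresh
    // interval and consults ECC error logs for this region.
    bool region_passes_retention_test(const DramRegion&) { return true; }

    void online_profile_pass(std::vector<DramRegion>& regions) {
        for (DramRegion& r : regions) {             // step 2: periodically test parts of DRAM
            r.rate = region_passes_retention_test(r)
                         ? RefreshRate::Relaxed     // step 3: fewer refreshes where safe
                         : RefreshRate::Fast;
        }
    }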

Rethinking DRAM
- In-Memory Computation
- Refresh
- Reliability
- Latency
- Bandwidth
- Energy
- Memory Compression

DRAM Latency-Capacity Trend
[Chart: from 2000 to 2011, DRAM capacity (Gb) grew 16x while latency (tRC, ns) improved by only about 20%]
- DRAM latency continues to be a critical bottleneck, especially for response time-sensitive workloads

What Causes the Long Latency?
[Diagram: a DRAM chip consists of a cell array organized into subarrays plus I/O, accessed over a channel]
- DRAM latency = subarray latency + I/O latency
- Subarray latency is dominant

Why is the Subarray So Slow?
[Diagram: each cell is a capacitor with an access transistor on a bitline and wordline; the row decoder drives wordlines and sense amplifiers sit at the ends of bitlines]

- Bitline: 512 cells per subarray, shared by one large sense amplifier
- Long bitline
  - Amortizes sense amplifier cost (small area)
  - Large bitline capacitance (high latency & power)

Trade-Off: Area (Die Size) vs. Latency
- Long bitline: cheaper (smaller area), slower
- Short bitline: faster, larger area
[Chart: normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline; commodity DRAM uses long bitlines, fancy DRAM uses short bitlines, and "GOAL" marks small area with low latency]

Approximating the Best of Both Worlds
- Long bitline: small area, high latency
- Short bitline: large area, low latency

- Our proposal: need isolation, so add isolation transistors; the near (short bitline) segment is fast

Approximating the Best of Both Worlds: Tiered-Latency DRAM
- Long bitline: small area, high latency
- Short bitline: large area, low latency
- Tiered-Latency DRAM: small area (using a long bitline) and low latency

Commodity DRAM vs. TL-DRAM [HPCA 2013]
[Charts: DRAM latency (tRC) and DRAM power relative to commodity DRAM (100%): near segment latency 56%, far segment latency +23% (52.5ns); near segment power 51%, far segment power +49%]

- DRAM area overhead ~3%: mainly due to the isolation transistors

Trade-Off: Area (Die Area) vs. Latency
[Chart: normalized DRAM area vs. latency (ns) for 32 to 512 cells/bitline; TL-DRAM's near segment sits close to the GOAL of small area and low latency, while the far segment keeps commodity-like area]

Leveraging Tiered-Latency DRAM
- TL-DRAM is a substrate that can be leveraged by the hardware and/or software
- Many potential uses
  1. Use near segment as hardware-managed inclusive cache to far segment
  2. Use near segment as hardware-managed exclusive cache to far segment
  3. Profile-based page mapping by operating system (see the sketch below)
  4. Simply replace DRAM with TL-DRAM
Lee+, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
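As an illustration of use 3 above, a simple profile-based mapping policy might place the hottest pages in the small near segment and everything else in the far segment. The hotness metric and capacity split are assumptions for illustration, not the mechanism evaluated in the paper.

    // Profile-based page mapping sketch for TL-DRAM (assumed hotness counters).
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    enum class Segment { Near, Far };

    struct PageInfo {
        uint64_t page_frame = 0;
        uint64_t access_count = 0;   // from hardware counters or sampling (assumed)
        Segment placement = Segment::Far;
    };

    void map_pages_by_profile(std::vector<PageInfo>& pages, size_t near_capacity_pages) {
        // Sort hottest-first, then fill the small near segment with the hottest pages.
        std::sort(pages.begin(), pages.end(),
                  [](const PageInfo& a, const PageInfo& b) {
                      return a.access_count > b.access_count;
                  });
        for (size_t i = 0; i < pages.size(); ++i) {
            pages[i].placement = (i < near_capacity_pages) ? Segment::Near : Segment::Far;
        }
    }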

Performance & Power Consumption
[Charts: normalized performance improves by 12.4%, 11.5%, and 10.7%, and normalized power drops by 23%, 24%, and 26%, for 1 (1-ch), 2 (2-ch), and 4 (4-ch) core-count (channel) configurations]
- Using near segment as a cache improves performance and reduces power consumption
Lee+, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.

What Else Causes the Long DRAM Latency?
- Conservative timing margins!
- DRAM timing parameters are set to cover the worst case

  - Worst-case temperatures: 85 degrees vs. the common case, to enable a wide range of operating conditions
  - Worst-case devices: DRAM cell with smallest charge across any acceptable device, to tolerate process variation at acceptable yield
- This leads to large timing margins for the common case

Adaptive-Latency DRAM [HPCA 2015]
- Idea: Optimize DRAM timing for the common case
  - Current temperature
  - Current DRAM module
- Why would this reduce latency?
  - A DRAM cell can store much more charge in the common case (low temperature, strong cell) than in the worst case
  - More charge in a DRAM cell means faster sensing, charge restoration, and precharging
  - Faster access (read, write, refresh, ...)

Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.

AL-DRAM
- Key idea: Optimize DRAM timing parameters online
- Two components
  - DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  - System monitors DRAM temperature & uses appropriate DRAM timing parameters
Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.
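A minimal sketch of the mechanism above, assuming a hypothetical per-DIMM table of pre-validated timing sets indexed by temperature; the timing values and temperature thresholds are illustrative assumptions, not vendor data.

    // AL-DRAM-style timing selection sketch (assumed table and values).
    #include <cstdint>

    struct DramTimings {
        uint32_t tRCD_cycles, tRAS_cycles, tRP_cycles, tWR_cycles;
    };

    // Each entry is a set of timings assumed reliable up to the given temperature.
    struct TimingEntry { int max_temp_c; DramTimings timings; };

    static const TimingEntry kTimingTable[] = {
        {55, { 9, 28,  9, 10}},   // common case: cooler DIMM, tighter timings
        {85, {11, 33, 11, 12}},   // worst case: standard (conservative) timings
    };

    // The system monitors the DIMM temperature and picks the matching set.
    DramTimings select_timings(int current_temp_c) {
        for (const TimingEntry& e : kTimingTable) {
            if (current_temp_c <= e.max_temp_c) return e.timings;
        }
        return kTimingTable[1].timings;  // fall back to the conservative set
    }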

Latency Reduction Summary of 115 DIMMs
- Latency reduction for read & write (55°C)
  - Read latency: 32.7%
  - Write latency: 55.1%
- Latency reduction for each timing parameter (55°C)
  - Sensing: 17.3%
  - Restore: 37.3% (read), 54.8% (write)
  - Precharge: 35.2%
Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.

AL-DRAM: Real System Evaluation
- System
  - CPU: AMD 4386 (8 cores, 3.1GHz, 8MB LLC)
  - DRAM: 4GByte DDR3-1600 (800MHz clock)
  - OS: Linux
  - Storage: 128GByte SSD
- Workload: 35 applications from SPEC, STREAM, Parsec, Memcached, Apache, GUPS

AL-DRAM: Single-Core Evaluation
[Chart: performance improvement per benchmark (soplex, mcf, milc, libq, lbm, gems, copy, s.cluster, gups) and on average: 6.7% for intensive workloads, 1.4% for non-intensive workloads, 5.0% across all 35 workloads]

- AL-DRAM improves performance on a real system

AL-DRAM: Multi-Core Evaluation
[Chart: performance improvement per benchmark and on average: 14.0% for intensive workloads, 2.9% for non-intensive workloads, 10.4% across all 35 workloads]
- AL-DRAM provides higher performance for multi-programmed & multi-threaded workloads

Rethinking DRAM
- In-Memory Computation
- Refresh
- Reliability

- Latency
- Bandwidth
- Energy
- Memory Compression

Agenda
- Major Trends Affecting Main Memory
- The Memory Scaling Problem and Solution Directions
  - New Memory Architectures
  - Enabling Emerging Technologies
- How Can We Do Better?
- Summary

Solution 2: Emerging Memory Technologies
- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
- Example: Phase Change Memory

  - Data stored by changing phase of material
  - Data read by detecting material's resistance
  - Expected to scale to 9nm (2022 [ITRS])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
  - Expected to be denser than DRAM: can store multiple bits/cell
- But, emerging technologies have (many) shortcomings
  - Can they be enabled to replace/augment/surpass DRAM?

Charge vs. Resistive Memories
- Charge Memory (e.g., DRAM, Flash)
  - Write data by capturing charge Q
  - Read data by detecting voltage V
- Resistive Memory (e.g., PCM, STT-MRAM, memristors)
  - Write data by pulsing current dQ/dt
  - Read data by detecting resistance R

Limits of Charge Memory
- Difficult charge placement and control
  - Flash: floating gate charge
  - DRAM: capacitor charge, transistor leakage
- Reliable sensing becomes difficult as charge storage unit size reduces

Promising Resistive Memory Technologies
- PCM
  - Inject current to change material phase
  - Resistance determined by phase
- STT-MRAM
  - Inject current to change magnet polarity
  - Resistance determined by polarity
- Memristors/RRAM/ReRAM
  - Inject current to change atomic structure
  - Resistance determined by atom distance

What is Phase Change Memory?
- Phase change material (chalcogenide glass) exists in two states:
  - Amorphous: low optical reflexivity and high electrical resistivity
  - Crystalline: high optical reflexivity and low electrical resistivity
- PCM is resistive memory: high resistance (0), low resistance (1)
- PCM cell can be switched between states reliably and quickly

How Does PCM Work?
- Write: change phase via current injection

  - SET: sustained current to heat cell above Tcryst
  - RESET: cell heated above Tmelt and quenched
- Read: detect phase (amorphous/crystalline) via material resistance
[Diagram: a cell is a memory element plus an access device; a large current RESETs it, a small sustained current SETs it]
  - SET (crystalline): low resistance, 10^3-10^4 ohm
  - RESET (amorphous): high resistance, 10^6-10^7 ohm
Photo courtesy: Bipin Rajendran, IBM. Slide courtesy: Moinuddin Qureshi, IBM.
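A tiny sketch of a resistive read, using the resistance ranges on the slide and an assumed sensing threshold between them; the threshold value is illustrative only.

    // Single-level PCM read sketch: compare sensed resistance against a threshold
    // between the SET range (~1e3-1e4 ohm) and the RESET range (~1e6-1e7 ohm).
    #include <cstdio>

    int read_pcm_bit(double resistance_ohms) {
        const double kThresholdOhms = 1.0e5;   // assumed: between the two ranges
        // Convention from the slide: low resistance (SET) = 1, high resistance (RESET) = 0.
        return resistance_ohms < kThresholdOhms ? 1 : 0;
    }

    int main() {
        std::printf("SET cell (5e3 ohm)   -> %d\n", read_pcm_bit(5.0e3));   // prints 1
        std::printf("RESET cell (5e6 ohm) -> %d\n", read_pcm_bit(5.0e6));   // prints 0
        return 0;
    }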

Opportunity: PCM Advantages
- Scales better than DRAM, Flash
  - Requires current pulses, which scale linearly with feature size
  - Expected to scale to 9nm (2022 [ITRS])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
- Can be denser than DRAM
  - Can store multiple bits per cell due to large resistance range
  - Prototypes with 2 bits/cell in ISSCC'08, 4 bits/cell by 2012
- Non-volatile
  - Retains data for >10 years at 85°C

Phase Change Memory Properties
- Surveyed prototypes from 2003-2008 (ITRS, IEDM, VLSI, ISSCC)
- Derived PCM parameters for F=90nm
- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

Phase Change Memory Properties: Latency
- Latency comparable to, but slower than DRAM

- Read latency
  - 50ns: 4x DRAM, 10^-3x NAND Flash
- Write latency
  - 150ns: 12x DRAM
- Write bandwidth
  - 5-10 MB/s: 0.1x DRAM, 1x NAND Flash

Phase Change Memory Properties
- Dynamic energy
  - 40 uA read, 150 uA write
  - 2-43x DRAM, 1x NAND Flash
- Endurance
  - Writes induce phase change at 650°C
  - Contacts degrade from thermal expansion/contraction
  - 10^8 writes per cell
  - 10^-8x DRAM, 10^3x NAND Flash
- Cell size
  - 9-12F^2 using BJT, single-level cells
  - 1.5x DRAM, 2-3x NAND (will scale with feature size, MLC)

Phase Change Memory: Pros and Cons

- Pros over DRAM
  - Better technology scaling (capacity and cost)
  - Non-volatility
  - Low idle power (no refresh)
- Cons
  - Higher latencies: ~4-15x DRAM (especially write)
  - Higher active energy: ~2-50x DRAM (especially write)
  - Lower endurance (a cell dies after ~10^8 writes)
  - Reliability issues (resistance drift)
- Challenges in enabling PCM as DRAM replacement/helper:
  - Mitigate PCM shortcomings

PCM-based Main Memory: Research Challenges
- Where to place PCM in the memory hierarchy?
  - Hybrid OS-controlled PCM-DRAM
  - Hybrid OS-controlled PCM and hardware-controlled DRAM
  - Pure PCM main memory
- How to mitigate shortcomings of PCM?
- How to minimize amount of DRAM in the system?

- How to take advantage of (byte-addressable and fast) non-volatile main memory?
- Can we design techniques that are agnostic to the specific NVM technology?

PCM-based Main Memory (I)
- How should PCM-based (main) memory be organized?
- Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09]: How to partition/migrate data between PCM and DRAM

Hybrid Memory Systems: Challenges

- Partitioning: data allocation/movement (energy, performance, lifetime)
  - Should DRAM be a cache or main memory, or configurable? What fraction? How many controllers?
  - Who manages allocation/movement? What are good control algorithms?
  - How do we prevent degradation of service due to wearout?
- Design of cache hierarchy, memory controllers, OS
  - Mitigate PCM shortcomings, exploit PCM advantages
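One possible answer to "who manages allocation/movement and with what algorithm" is sketched below: keep hot and, especially, write-heavy pages in DRAM and cold pages in PCM. The counters and thresholds are illustrative assumptions, not a policy from the papers cited in this section.

    // Hybrid DRAM+PCM placement policy sketch (assumed counters and thresholds).
    #include <cstdint>

    enum class Placement { DRAM, PCM };

    struct HybridPageState {
        uint32_t recent_accesses = 0;   // decayed periodically (assumed counter)
        uint32_t recent_writes = 0;
        Placement where = Placement::PCM;
    };

    Placement choose_placement(const HybridPageState& p) {
        const uint32_t kHotAccesses = 64;   // illustrative thresholds
        const uint32_t kHotWrites   = 16;   // writes matter more: PCM wears out
        if (p.recent_writes >= kHotWrites || p.recent_accesses >= kHotAccesses)
            return Placement::DRAM;         // migrate to DRAM at the next epoch
        return Placement::PCM;              // cold data stays in large, cheap PCM
    }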

PCM-based Main Memory (II)
- How should PCM-based (main) memory be organized?
- Pure PCM main memory [Lee et al., ISCA'09, Top Picks'10]: How to redesign entire hierarchy (and cores) to overcome PCM shortcomings

An Initial Study: Replace DRAM with PCM
- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
  - Surveyed prototypes from 2003-2008 (e.g., IEDM, VLSI, ISSCC)
  - Derived average PCM parameters for F=90nm

Results: Naive Replacement of DRAM with PCM
- Replace DRAM with PCM in a 4-core, 4MB L2 system
- PCM organized the same as DRAM: row buffers, banks, peripherals
- 1.6x delay, 2.2x energy, 500-hour average lifetime
- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

Results: Architected PCM as Main Memory
- 1.2x delay, 1.0x energy, 5.6-year average lifetime
- Scaling improves energy, endurance, density
- Caveat 1: Worst-case lifetime is much shorter (no guarantees)
- Caveat 2: Intensive applications see large performance and energy hits
- Caveat 3: Optimistic PCM parameters?

Solution 3: Hybrid Memory Systems
[Diagram: CPU with a DRAM controller and a PCM controller; DRAM is fast and durable but small, leaky, volatile, and high-cost; Phase Change Memory (or Tech. X) is large, non-volatile, and low-cost but slow, wears out, and has high active energy]
- Hardware/software manage data allocation and movement to achieve the best of multiple technologies
- Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Computer Architecture Letters, 2012.
- Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

18-740: Computer Architecture
Recitation 4: Rethinking Memory System Design

Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015
September 22, 2015
