Memory technology - University of California, Berkeley


Lecture 17: Memory Systems
Prepared by: Professor David A. Patterson, Computer Science 152/252
Edited and presented by: Prof. Kurt Keutzer, Computer Science 252, Spring 2000

The Big Picture: Where Are We Now?
The five classic components of a computer: Processor (Control + Datapath), Memory, Input, Output
Today's topics: SRAM memory technology, DRAM memory technology, memory organization Lec17.2

Technology Trends
            Capacity        Speed (latency)
  Logic:    2x in 3 years   2x in 3 years
  DRAM:     4x in 3 years   2x in 10 years
  Disk:     4x in 3 years   2x in 10 years

DRAM generations (1000:1 capacity growth, 2:1 cycle-time improvement):
  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns
Lec17.3

Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)

[Figure: performance (log scale, 1 to 1000) vs. time, 1980-2000. CPU performance ("Moore's Law") grows 60%/yr (2x per 1.5 yr); DRAM grows 9%/yr (2x per 10 yrs); the processor-memory performance gap grows about 50% per year.] Lec17.4

Today's Situation: Microprocessor
Rely on caches to bridge the microprocessor-DRAM performance gap.
Time of a full cache miss, in instructions executed:
  1st Alpha (7000):   340 ns / 5.0 ns = 68 clks x 2  = 136 instructions
  2nd Alpha (8400):   266 ns / 3.3 ns = 80 clks x 4  = 320 instructions
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6 = 648 instructions
  (1/2X latency x 3X clock rate x 3X instr/clock => ~5X) Lec17.5

Impact on Performance
Suppose a processor executes at:
  Clock Rate = 200 MHz (5 ns per cycle), ideal CPI = 1.1
  50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty:
  CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles) + 0.30 (data mops/inst) x 0.10 (misses/data mop) x 50 (cycles/miss)
      = 1.1 cycles + 1.5 cycles = 2.6
58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Pie chart: ideal CPI 1.1 (35%), data miss 1.5 (49%), inst miss 0.5 (16%)] Lec17.6
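A quick check of the arithmetic above (a minimal sketch; the inputs are the slide's own assumptions):

```python
# Stall cycles from the slide's assumptions: 30% of instructions are
# loads/stores, 10% of those miss, and each miss costs 50 cycles.
ideal_cpi = 1.1
data_refs_per_inst = 0.30
miss_rate = 0.10
miss_penalty = 50  # cycles

stall_cpi = data_refs_per_inst * miss_rate * miss_penalty  # 1.5 cycles/inst
cpi = ideal_cpi + stall_cpi                                # 2.6
print(f"CPI = {cpi:.1f}, stalled = {stall_cpi / cpi:.0%}")  # CPI = 2.6, stalled = 58%
```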

The Goal: Illusion of Large, Fast, Cheap Memory
Fact: large memories are slow; fast memories are small.
How do we create a memory that is large, cheap, and fast (most of the time)?
  Hierarchy
  Parallelism Lec17.7

An Expanded View of the Memory System
[Figure: the processor (control + datapath) backed by a chain of memories; the memory nearest the processor is the fastest, smallest, and most expensive, and each level farther away is slower, bigger, and cheaper.] Lec17.8

Why hierarchy works
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address, over the address space 0 to 2^n - 1.] Lec17.9

Memory Hierarchy: How Does it Work?
Temporal Locality (locality in time): keep the most recently accessed data items closer to the processor.
Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels.
[Figure: block Blk X sits in the upper-level memory next to the processor; block Blk Y is fetched from the lower-level memory.] Lec17.10

Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
  Hit Rate: the fraction of memory accesses found in the upper level
  Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
Miss: data must be retrieved from a block in the lower level (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty Lec17.11
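These definitions combine into the average memory access time formula used later in the lecture (Lec17.48); a minimal sketch with illustrative numbers (the 1 ns hit time, 5% miss rate, and 100 ns penalty are assumptions, not the slide's):

```python
# AMAT = Hit Time + Miss Rate x Miss Penalty.
# Illustrative numbers: a 1 ns upper level hitting 95% of the time,
# backed by a 100 ns lower level.
hit_time = 1.0        # ns
miss_rate = 0.05
miss_penalty = 100.0  # ns

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat:.1f} ns")  # 6.0 ns -- why Hit Time << Miss Penalty matters
```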

Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the cheapest technology.
  Provide access at the speed offered by the fastest technology.
[Figure: Processor (registers, on-chip cache, datapath/control) -> Second-Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Tape).
  Speed (ns): 1s, 10s, 100s, 10,000,000s (10s of ms), 10,000,000,000s (10s of sec).
  Size (bytes): 100s, Ks, Ms, Gs, Ts.] Lec17.12

How is the hierarchy managed?
  Registers <-> memory: by the compiler (programmer?)
  Cache <-> memory: by the hardware
  Memory <-> disks: by the hardware and the operating system (virtual memory), and by the programmer (files) Lec17.13

Memory Hierarchy Technology
Random access: "random" is good - the access time is the same for all locations
  DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be refreshed regularly
  SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content lasts "forever" (until power is lost)
Not-so-random access technology: access time varies from location to location and from time to time (examples: disk, CDROM)
Sequential access technology: access time linear in location (e.g., tape)
The next two lectures concentrate on random access technology: main memory (DRAMs) and caches (SRAMs) Lec17.14

Main Memory Background
Performance of main memory:
  Latency: determines the cache miss penalty
    - Access time: time between when a request is issued and when the word arrives
    - Cycle time: minimum time between requests
  Bandwidth: determines the I/O & large-block (L2) miss penalty
Main memory is DRAM: Dynamic Random Access Memory
  Dynamic: needs to be refreshed periodically (every 8 ms)
  Addresses divided into 2 halves (memory as a 2D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
  No refresh (6 transistors/bit vs. 1 transistor)
  Size: DRAM/SRAM ~ 4-8; cost & cycle time: SRAM/DRAM ~ 8-16 Lec17.15

Random Access Memory (RAM) Technology
Why do computer designers need to know about RAM technology?
  Processor performance is usually limited by memory bandwidth
  As IC densities increase, lots of memory will fit on the processor chip -

Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer
What makes RAM different from a bunch of flip-flops? Density: RAM is much denser Lec17.16

Static RAM Cell
[Figure: 6-transistor SRAM cell - cross-coupled inverters storing 0/1, connected through pass transistors, gated by the word line (row select), to the bit and bit_b lines. The pull-up transistors can be replaced with pull-ups to save area.]
Write:
  1. Drive the bit lines (bit = 1, bit_b = 0)
  2. Select the row
Read:
  1. Precharge bit and bit_b to Vdd or Vdd/2 => make sure they are equal!
  2. Select the row
  3. The cell pulls one bit line low
  4. The sense amp on the column detects the difference between bit and bit_b Lec17.17

Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 SRAM array. The address decoder (A0-A3) drives word lines Word 0 through Word 15; each of the 4 columns has a write driver & precharger at the top (fed by Din 3-0, gated by WrEn and Precharge) and a sense amp at the bottom (producing Dout 3-0); an SRAM cell sits at every word-line/bit-line intersection.]
Q: Which is longer: the word line or the bit line? Lec17.18

Logic Diagram of a Typical SRAM
[Figure: a 2^N-word x M-bit SRAM with N address lines (A), write enable (WE_L), output enable (OE_L), and an M-bit bidirectional data bus (D).]

Write Enable is usually active low (WE_L)
Din and Dout are combined (D) to save pins:
  A new control signal, output enable (OE_L), is needed
  WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  WE_L deasserted (high), OE_L asserted (low): D is the data output pin
  Both WE_L and OE_L asserted: the result is unknown - don't do that!
Although you could change the VHDL to do whatever you desire, you must do the best with what you've got (vs. what you need) Lec17.19

Typical SRAM Timing
[Timing diagram for the 2^N-word x M-bit SRAM. Write: the write address is held on A and Data In on D while WE_L pulses low; data must meet the write setup time before, and the write hold time after, the WE_L edge. Read: with OE_L asserted, each read address on A produces Data Out on D after the read access time; D is high-Z (or junk) otherwise.] Lec17.20

Problems with SRAM
[Figure: the 6-T cell with Select = 1 - pull-ups P1 (off) and P2 (on), pull-downs N1 (on) and N2 (off), and the bit lines marked bit = 1, bit_b = 0.]
Six transistors use up a lot of area.
Consider when a zero is stored in the cell:
  Transistor N1 will try to pull bit to 0

  Transistor P2 will try to pull bit_b to 1
But the bit lines are precharged high: are P1 and P2 really necessary? Lec17.21

1-Transistor Memory Cell (DRAM)
[Figure: one access transistor, gated by row select, connects a storage capacitor to the bit line.]
Write:
  1. Drive the bit line
  2. Select the row
Read:
  1. Precharge the bit line to Vdd
  2. Select the row
  3. The cell and bit line share charge - only a very small voltage change on the bit line
  4. Sense (fancy sense amp) - can detect changes of ~1 million electrons
  5. Write: restore the value
Refresh:
  1. Just do a dummy read of every cell Lec17.22

Classical DRAM Organization (square)

[Figure: a square RAM cell array. A row decoder driven by the row address selects a word (row) line; each intersection with the bit (data) lines is a 1-T DRAM cell; a column selector & I/O circuits, driven by the column address, pick the data bit.]
The row and column address together select 1 bit at a time Lec17.23

DRAM logical organization (4 Mbit)
[Figure: an 11-bit address (A0-A10) feeds a 2,048 x 2,048 memory array; a word line selects a row of storage cells, and the sense amps & I/O plus column decoder deliver the bit on D/Q.]
Square root of the bits per RAS/CAS Lec17.24

DRAM physical organization (4 Mbit)

[Figure: the 4 Mbit part is split into blocks 0-3, each with a 9:512 block row decoder and I/O on both sides; the row and column addresses select within a block, and 8 I/Os feed D and Q.] Lec17.25

Memory Systems
[Figure: the processor sends an n-bit address to a DRAM controller, which issues an n/2-bit multiplexed address (with a memory timing controller) to 2^n x 1 DRAM chips; data returns w bits wide through bus drivers.]

Tc = Tcycle + Tcontroller + Tdriver Lec17.26

Logic Diagram of a Typical DRAM
[Figure: a 256K x 8 DRAM with 9 multiplexed address pins (A), control inputs RAS_L, CAS_L, WE_L, OE_L, and an 8-bit bidirectional data bus D.]
Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
Din and Dout are combined (D):
  WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  WE_L deasserted (high), OE_L asserted (low): D is the data output pin
Row and column addresses share the same pins (A):
  RAS_L goes low: pins A are latched in as the row address
  CAS_L goes low: pins A are latched in as the column address
  RAS/CAS are edge-sensitive Lec17.27

Key DRAM Timing Parameters
tRAC: minimum time from RAS falling to valid data output. Quoted as the speed of a DRAM; a fast 4 Mbit DRAM has tRAC = 60 ns
tRC: minimum time from the start of one row access to the start of the next. tRC = 110 ns for a 4 Mbit DRAM with tRAC = 60 ns
tCAC: minimum time from CAS falling to valid data output. 15 ns for a 4 Mbit DRAM with tRAC = 60 ns
tPC: minimum time from the start of one column access to the start of the next. 35 ns for a 4 Mbit DRAM with tRAC = 60 ns Lec17.28

DRAM Performance
A 60 ns (tRAC) DRAM can:
  perform a row access only every 110 ns (tRC)
  perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
    - In practice, external address delays and bus turnaround make it 40 to 50 ns
These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead:
  Drive parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins...
  180 ns to 250 ns latency from processor to memory is good for a 60 ns (tRAC) DRAM Lec17.29
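A quick back-of-the-envelope from these parameters (a sketch using the slide's 4 Mbit figures; the x8 width is an assumption for the bandwidth lines):

```python
# Peak access rates implied by the 4 Mbit DRAM timing parameters above.
t_rc = 110e-9  # s, row cycle time
t_pc = 35e-9   # s, column (page) cycle time

row_rate = 1 / t_rc  # ~9.1M row accesses/s
col_rate = 1 / t_pc  # ~28.6M column accesses/s within an open row

width_bytes = 1      # assume a x8 part: 1 byte per access
print(f"random-row bandwidth: {row_rate * width_bytes / 1e6:.1f} MB/s")
print(f"same-row bandwidth:   {col_rate * width_bytes / 1e6:.1f} MB/s")
```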

DRAM Write Timing
Every DRAM access begins with the assertion of RAS_L; there are 2 ways to write: early or late vs. CAS.
[Timing diagram for the 256K x 8 DRAM: within one DRAM write cycle, the row address is latched on the falling edge of RAS_L and the column address on the falling edge of CAS_L; OE_L stays deasserted. Early write cycle: WE_L is asserted before CAS_L, and Data In must be valid around the CAS_L edge. Late write cycle: WE_L is asserted after CAS_L, and Data In must be valid around the WE_L edge; the write access time is measured to that edge.] Lec17.30

DRAM Read Timing
Every DRAM access begins with the assertion of RAS_L; there are 2 ways to read: early or late vs. CAS.

[Timing diagram for the 256K x 8 DRAM: within one DRAM read cycle, the row address is latched on the falling edge of RAS_L and the column address on the falling edge of CAS_L, with WE_L deasserted. Early read cycle: OE_L is asserted before CAS_L, and Data Out becomes valid after the read access time. Late read cycle: OE_L is asserted after CAS_L, and Data Out appears after an output-enable delay; D is high-Z the rest of the time.] Lec17.31

Main Memory Performance
Three organizations:
  Simple: CPU, cache, bus, and memory all the same width (32 bits)
  Wide: CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits)
  Interleaved: CPU, cache, and bus 1 word; memory N modules (4 modules); the example is word interleaved Lec17.32

Main Memory Performance
[Timing diagram: access time vs. cycle time.]
DRAM (read/write) cycle time >> DRAM (read/write) access time (often 2:1 - why?)
DRAM (read/write) cycle time: how frequently can you initiate an access?
  Analogy: a little kid can only ask his father for money on Saturday
DRAM (read/write) access time: how quickly will you get what you want once you initiate an access?
  Analogy: as soon as he asks, his father gives him the money
DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday? Lec17.33

Increasing Bandwidth - Interleaving
[Figure: without interleaving, the CPU waits for D1 to come back from memory before starting the access for D2. With 4-way interleaving, accesses to banks 0, 1, 2, and 3 start back-to-back, and bank 0 is ready again just as its next turn arrives.] Lec17.34

Main Memory Performance
Timing model: 1 cycle to send the address,

4 for access time, 10 for cycle time, 1 to send data; the cache block is 4 words.
  Simple M.P.      = 4 x (1 + 10 + 1) = 48
  Wide M.P.        = 1 + 10 + 1 = 12
  Interleaved M.P. = 1 + 10 + 1 + 3 = 15
Word-interleaved bank mapping:
  Bank 0: addresses 0, 4, 8, 12
  Bank 1: addresses 1, 5, 9, 13
  Bank 2: addresses 2, 6, 10, 14
  Bank 3: addresses 3, 7, 11, 15
Lec17.35
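The three miss penalties follow directly from the timing model (a minimal sketch of the slide's arithmetic):

```python
# Miss penalties from the slide's timing model: 1 cycle to send the
# address, 10-cycle DRAM cycle time, 1 cycle to send data, 4-word block.
addr, cycle, xfer, words = 1, 10, 1, 4

simple      = words * (addr + cycle + xfer)      # 48: one word at a time
wide        = addr + cycle + xfer                # 12: whole block at once
interleaved = addr + cycle + xfer + (words - 1)  # 15: accesses overlap; the
                                                 # last 3 words arrive 1 cycle apart
print(simple, wide, interleaved)                 # 48 12 15
```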

Independent Memory Banks
How many banks? number of banks >= number of clocks to access a word in a bank
  (for sequential accesses; otherwise you return to the original bank before it has the next word ready)
Increasing DRAM density => fewer chips => harder to have many banks
  Growth of bits/chip of DRAM: 50%-60%/yr
  Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ~ growth of MB/$ of DRAM (25%-30%/yr) Lec17.36

Fewer DRAMs/System over Time
Minimum PC memory size (from Pete MacWilliams, Intel):
[Table: minimum PC memory size (4 MB in '86 through 256 MB in '02) vs. DRAM generation ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb); the number of DRAMs per system shrinks along each row as density grows. Memory per DRAM grows @ 60%/year, while memory per system grows @ only 25%-30%/year.] Lec17.37

Page Mode DRAM: Motivation
Regular DRAM organization: N rows x N columns x M bits
  Read & write M bits at a time
  Each M-bit access requires a RAS/CAS cycle
Fast Page Mode DRAM: an N x M register saves a row

[Timing diagram: with a regular DRAM, the 1st and 2nd M-bit accesses each require a full RAS_L/CAS_L cycle (row address, then column address).] Lec17.38

Fast Page Mode Operation
Fast Page Mode DRAM: an N x M SRAM saves a row
After a row is read into the N x M SRAM register:
  Only CAS is needed to access other M-bit blocks on that row
  RAS_L remains asserted while CAS_L is toggled
[Timing diagram: RAS_L falls once with the row address; CAS_L then toggles through successive column addresses, delivering the 1st, 2nd, 3rd, and 4th M-bit accesses from the same row.] Lec17.39
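Using the 4 Mbit timing parameters from Lec17.28, page mode pays off quickly (a sketch; the 4-access same-row pattern and the simple "one row cycle plus column cycles" model are illustrative assumptions):

```python
# Four reads that fall in the same row, with and without fast page mode,
# using tRC = 110 ns and tPC = 35 ns from Lec17.28.
t_rc, t_pc, n = 110, 35, 4  # ns, ns, number of accesses

regular   = n * t_rc               # 440 ns: a full RAS/CAS cycle every time
page_mode = t_rc + (n - 1) * t_pc  # 215 ns: open the row once, then 3 column cycles
print(regular, page_mode)          # 440 215
```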

DRAM v. Desktop Microprocessors Cultures

                      DRAM                              Desktop Microprocessor
  Standards           pinout, package, refresh rate,    binary compatibility,
                      capacity, ...                     IEEE 754, I/O bus
  Sources             multiple                          single
  Figures of merit    1) capacity, 1a) $/bit,           1) SPEC speed,
                      2) BW, 3) latency                 2) cost
  Improve rate/year   1) 60%, 1a) 25%, 2) 20%, 3) 7%    1) 60%, 2) little change
Lec17.40

DRAM Design Goals
Reduce cell size 2.5x, increase die size 1.5x
Sell 10% of a single DRAM generation: 6.25 billion DRAMs sold in 1996
3 phases: engineering samples, first customer ship (FCS), mass production
  The fastest to FCS and mass production wins share
Die size, testing time, and yield => profit
  Yield >> 60% (redundant rows/columns to repair flaws) Lec17.41

DRAM History
DRAMs: capacity +60%/yr, cost -30%/yr
  2.5X cells/area, 1.5X die size in 3 years
A '97 DRAM fab line costs $1B to $2B
  DRAM only: density, leakage v. speed
Rely on an increasing number of computers & memory per computer (60% market)
  The SIMM or DIMM is the replaceable unit => computers can use any generation of DRAM
Commodity, second-source industry => high volume, low profit, conservative
  Little organizational innovation in 20 years: page mode, EDO, synchronous DRAM
Order of importance: 1) cost/bit, 1a) capacity
  RAMBUS: 10X BW, +30% cost => little impact Lec17.42

Today's Situation: DRAM
Commodity, second-source industry: high volume, low profit, conservative
Little organizational innovation (vs. processors) in 20 years: page mode, EDO, synchronous DRAM
The DRAM industry is at a crossroads:
  Fewer DRAMs per computer over time
    - Growth of bits/chip of DRAM: 50%-60%/yr
    - Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ~ growth of MB/$ of DRAM (25%-30%/yr)
  Starting to question buying larger DRAMs? Lec17.43

Today's Situation: DRAM
[Bar chart: DRAM revenue per quarter, 1Q94 through 1Q97 (in $ millions), rising from about $7B to a peak near $16B before falling back.]
Intel: 30%/year growth since 1987; 1/3 of income is profit Lec17.44

Summary:
Two different types of locality:
  Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the cheapest technology.
  Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time. Lec17.45

Summary: the Processor-Memory Performance Gap "Tax"

  Processor         % Area (cost)   % Transistors (power)
  Alpha 21164       37%             77%
  StrongArm SA110   61%             94%
  Pentium Pro       64%             88%
  (Pentium Pro: 2 dies per package - Proc/I$/D$ + L2$)

Caches have no inherent value; they only try to close the performance gap. Lec17.46

Recall: Levels of the Memory Hierarchy
[Figure: from the processor down - registers (instr. operands), cache (blocks), memory (pages), disk (files), tape; upper levels are faster, lower levels are larger.] Lec17.47

Cache performance equations:
  Time = IC x CT x (ideal CPI + memory stalls/inst)
  memory stalls/instruction = average accesses/inst x Miss Rate x Miss Penalty
    = (average IFETCH/inst x MissRate_Inst x MissPenalty_Inst) + (average Data/inst x MissRate_Data x MissPenalty_Data)
  (Assumes that the ideal CPI includes hit times.)
  Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty) Lec17.48

Impact on Performance
Suppose a processor executes at:
  Clock Rate = 200 MHz (5 ns per cycle), ideal CPI = 1.1
  50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty, and that 1% of instructions get the same miss penalty:
  CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/inst)
        + [0.30 (data mops/inst) x 0.10 (misses/data mop) x 50 (cycles/miss)]
        + [1 (inst mop/inst) x 0.01 (misses/inst mop) x 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/inst = 3.1
[Pie chart: ideal CPI 1.1 (35%), data miss 1.5 (49%), inst miss 0.5 (16%)]
About 65% of the time (2.0 of 3.1 cycles) the processor is stalled waiting for memory! Lec17.49
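The same arithmetic with instruction misses included (a minimal check of the slide's numbers):

```python
# CPI with both data and instruction miss stalls (50-cycle penalty each).
ideal_cpi, penalty = 1.1, 50
data_stalls = 0.30 * 0.10 * penalty  # 1.5 cycles/inst
inst_stalls = 1.00 * 0.01 * penalty  # 0.5 cycles/inst

cpi = ideal_cpi + data_stalls + inst_stalls  # 3.1
stalled = (data_stalls + inst_stalls) / cpi
print(f"CPI = {cpi:.1f}, stalled = {stalled:.0%}")  # CPI = 3.1, stalled = 65%
```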

The Art of Memory System Design
[Figure: a workload or benchmark program drives the processor's reference stream (op: i-fetch, read, write) into the memory system ($ backed by MEM).]
Optimize the memory system organization to minimize the average memory access time for typical workloads Lec17.50

Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N-byte cache with 2^M-byte blocks:
  The uppermost (32 - N) bits are always the Cache Tag
  The lowest M bits are the Byte Select (block size = 2^M)
  The bits in between form the Cache Index
Example breakdown of a block address (bits 31..0): Cache Tag = bits 31-10 (e.g., 0x50), Cache Index = bits 9-5 (e.g., 0x01), Byte Select = bits 4-0 (e.g., 0x00)
[Figure: the cache index selects one of 32 entries (0-31); each entry holds a valid bit, a cache tag (e.g., 0x50, stored as part of the cache state), and 32 bytes of cache data (bytes 0-31, 32-63, ..., 992-1023).] Lec17.51
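The field extraction is just shifting and masking; a minimal sketch for this 1 KB, 32 B-block geometry (the example address is made up to match the slide's field values):

```python
# Split a 32-bit address for a 1 KB direct mapped cache with 32 B blocks:
# N = 10 (cache bytes), M = 5 (block bytes) -> 5 index bits, 22 tag bits.
N, M = 10, 5

def split_address(addr: int):
    byte_select = addr & ((1 << M) - 1)         # bits 4..0
    index = (addr >> M) & ((1 << (N - M)) - 1)  # bits 9..5
    tag = addr >> N                             # bits 31..10
    return tag, index, byte_select

# Hypothetical address chosen to reproduce the slide's example fields:
tag, index, byte_sel = split_address(0x00014020)
print(hex(tag), hex(index), hex(byte_sel))  # 0x50 0x1 0x0
```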

Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
  A larger block size means a larger miss penalty - it takes longer to fill the block
  If the block size is too big relative to the cache size, the miss rate will go up - too few cache blocks
In general: Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: as block size grows, miss rate first falls (exploiting spatial locality) and then rises (too few blocks compromises temporal locality); miss penalty grows steadily; average access time bottoms out before the increased miss penalty & miss rate drive it back up.] Lec17.52

Extreme Example: single line
[Figure: a cache with a single entry - one valid bit, one cache tag, and 4 bytes of data (bytes 0-3).] Cache Size = 4 bytes, Block Size = 4 bytes

Only ONE entry in the cache!
If an item is accessed, it is likely to be accessed again soon:
  But it is unlikely that it will be accessed again immediately!
  The next access will likely be a miss again:
    - We continually load data into the cache but discard (force out) it before it is reused
    - A cache designer's worst nightmare: the Ping-Pong Effect
Conflict Misses are misses caused by different memory locations mapping to the same cache index:
  - Solution 1: make the cache size bigger
  - Solution 2: provide multiple entries for the same cache index Lec17.53

Another Extreme Example: Fully Associative
Fully associative cache: forget about the cache index
  Compare the cache tags of ALL cache entries in parallel
  Example: with 32 B blocks, we need N 27-bit comparators
By definition: Conflict Misses = 0 for a fully associative cache
[Figure: the address splits into a 27-bit cache tag (bits 31-5, e.g., 0x01) and a byte select (bits 4-0); every entry's tag is compared in parallel (X) with the incoming tag, and each entry holds a valid bit plus data bytes 0-31.] Lec17.54

Set Associative Cache
N-way set associative: N entries for each cache index
  N direct mapped caches operating in parallel
Example: two-way set associative cache
  The cache index selects a set from the cache
  The two tags in the set are compared to the input in parallel
  Data is selected based on the tag comparison result
[Figure: two ways, each with valid bits, cache tags, and cache data; the cache index selects cache block 0 in each way; two comparators (Adr Tag vs. stored tags) drive Sel1/Sel0 into a mux that picks the cache block, and an OR of the compares yields Hit.] Lec17.55

Disadvantage of Set Associative Cache
N-way set associative cache versus direct mapped cache:
  N comparators vs. 1
  Extra MUX delay for the data
  Data comes AFTER the hit/miss decision and set selection
In a direct mapped cache, the cache block is available BEFORE hit/miss:
  It is possible to assume a hit and continue; recover later on a miss.

[Figure: the same two-way datapath as the previous slide - the data mux sits after the tag compares, which is where the extra delay comes from.] Lec17.56

A Summary on Sources of Cache Misses
Compulsory (cold start or process migration; first reference): the first access to a block
  A cold fact of life: not a whole lot you can do about it
  Note: if you are going to run billions of instructions, compulsory misses are insignificant
Conflict (collision): multiple memory locations mapped to the same cache location
  Solution 1: increase cache size
  Solution 2: increase associativity
Capacity: the cache cannot contain all the blocks accessed by the program
  Solution: increase cache size
Coherence (invalidation): another process (e.g., I/O) updates memory Lec17.57

Source of Cache Misses Quiz
Assume constant cost.

For each organization - Direct Mapped, N-way Set Associative, Fully Associative - rate:
  Cache Size: Small, Medium, Big?
  Compulsory Miss / Conflict Miss / Capacity Miss / Coherence Miss
  (Choices: Zero, Low, Medium, High, Same) Lec17.58

Sources of Cache Misses: Answer

                    Direct Mapped   N-way Set Associative   Fully Associative
  Cache Size        Big             Medium                  Small
  Compulsory Miss   Same            Same                    Same
  Conflict Miss     High            Medium                  Zero
  Capacity Miss     Low             Medium                  High
  Coherence Miss    Same            Same                    Same

Note: if you are going to run billions of instructions, compulsory misses are insignificant. Lec17.59

Four Questions for Caches and Memory Hierarchy

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy) Lec17.60

Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache (S.A. mapping = block number modulo number of sets):
  Fully associative: block 12 can go anywhere
  Direct mapped: block 12 can go only into block frame 4 (12 mod 8)
  2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: block frames 0-7 shown three ways - unconstrained (fully associative), one fixed frame (direct mapped), and grouped into sets 0-3 (2-way set associative) - with block-frame addresses 0-31 in memory below.] Lec17.61
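The three placement rules as a sketch (the cache geometry is the slide's):

```python
# Where can memory block 12 go in an 8-block-frame cache?
block, frames, n_way = 12, 8, 2

direct = block % frames                 # only frame 4
sets = frames // n_way                  # 4 sets of 2 frames each
set_frames = [f for f in range(frames)
              if f // n_way == block % sets]  # frames [0, 1] (set 0)
fully = list(range(frames))             # any frame

print(direct, set_frames, fully)        # 4 [0, 1] [0, 1, 2, 3, 4, 5, 6, 7]
```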

Q2: How is a block found if it is in the upper level?
Block Address = | Tag | Index | Block offset |
Found by direct indexing (using the index and block offset), tag compares, or a combination
Increasing associativity shrinks the index and expands the tag Lec17.62

Q3: Which block should be replaced on a miss?
Easy for direct mapped - there is only one candidate
Set associative or fully associative: Random, or LRU (Least Recently Used)

Miss rates by replacement policy and associativity:
             2-way           4-way           8-way
  Size       LRU    Random   LRU    Random   LRU    Random
  16 KB      5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
  64 KB      1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
  256 KB     1.15%  1.17%    1.13%  1.13%    1.12%  1.12%
Lec17.63
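A minimal sketch of LRU bookkeeping for one set (illustrative Python; real hardware keeps a few LRU bits per set rather than an ordered list):

```python
# LRU replacement within one 4-way set: on a hit, move the tag to the
# most-recently-used end; on a miss, evict the least-recently-used tag.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.tags = []            # tags[0] is LRU, tags[-1] is MRU

    def access(self, tag):
        if tag in self.tags:      # hit: refresh recency
            self.tags.remove(tag)
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)      # miss with a full set: evict the LRU block
        self.tags.append(tag)
        return False

s = LRUSet()
print([s.access(t) for t in (1, 2, 3, 4, 1, 5, 2)])
# [False, False, False, False, True, False, False] -- 5 evicted 2, so 2 misses again
```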

Q4: What happens on a write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?).
Pros and cons of each?
  WT: read misses cannot result in writes
  WB: repeated writes to a block cost only one write to memory
WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory Lec17.64

Write Buffer for Write Through
[Figure: Processor -> Cache -> DRAM, with a write buffer between cache and DRAM.]
A write buffer is needed between the cache and memory:
  Processor: writes data into the cache and the write buffer
  Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
  Typical number of entries: 4
  Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
The memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle => write buffer saturation Lec17.65

Write Buffer Saturation
[Figure: Processor -> Cache -> write buffer -> DRAM.]
Store frequency (w.r.t. time) > 1 / DRAM write cycle:
  If this condition exists for a long period of time (CPU cycle time too fast and/or too many store instructions in a row):
    - The store buffer will overflow no matter how big you make it
    - (This happens when the CPU cycle time is <= the DRAM write cycle time)
Solutions for write buffer saturation:
  Use a write-back cache
  Install a second-level (L2) cache (does this always work?)
  [Figure: Processor -> Cache -> L2 Cache -> write buffer -> DRAM.] Lec17.66
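A behavioral sketch of the write-through path with a bounded FIFO write buffer (illustrative Python, assuming the slide's 4 entries; a real controller drains the buffer concurrently with processor execution):

```python
from collections import deque

# Write-through with a 4-entry FIFO write buffer: every store updates the
# cache AND queues the word for memory; the processor "stalls" (drains one
# entry here) only when the buffer is full.
ENTRIES = 4
cache, buffer, memory = {}, deque(), {}

def drain_one():                  # memory-controller side
    addr, data = buffer.popleft()
    memory[addr] = data

def store(addr, data):            # processor side
    cache[addr] = data            # write through: cache always updated
    if len(buffer) == ENTRIES:    # buffer full: must wait for one drain
        drain_one()
    buffer.append((addr, data))

for a in range(6):
    store(a, a * 10)
print(len(buffer), memory)        # 4 pending stores; 2 already drained
```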

Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x00 causes a miss. Do we read in the rest of the block?
  Yes: Write Allocate
  No: Write Not Allocate
[Figure: the 1 KB direct mapped cache from Lec17.51 again, showing the write address split into cache tag (bits 31-10), cache index (0x00), and byte select (0x00), with valid bits, tags, and 32-byte blocks 0-31.] Lec17.67

Impact of Memory Hierarchy on Algorithms
Today CPU time is a function of (ops, cache misses), not just f(ops): what does this mean to compilers, data structures, and algorithms?
"The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997, 370-379.
  Quicksort: the fastest comparison-based sorting algorithm when all keys fit in memory
  Radix sort: also called "linear-time" sort because, for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys

For an AlphaStation 250: 32-byte blocks, direct mapped 2 MB L2 cache, 8-byte keys, set sizes from 4,000 to 4,000,000 keys Lec17.68

Quicksort vs. Radix, varying the number of keys: Instructions
[Chart: instructions/key vs. set size in keys, for radix sort and quicksort.] Lec17.69

Quicksort vs. Radix, varying the number of keys: Instructions & Time
[Chart: time/key diverges from instructions/key as the set grows - radix sort's time climbs even though its instruction count stays flat.] Lec17.70

Quicksort vs. Radix, varying the number of keys: Cache misses
[Chart: cache misses/key vs. set size - radix sort suffers far more misses than quicksort at large set sizes.]
What is the proper approach to fast algorithms? Lec17.71

How Do You Design a Cache?
Set of operations that must be supported:
  read: Data <= Mem[Physical Address]
  write: Mem[Physical Address] <= Data
[Figure: memory as a black box - Physical Address and Read/Write in, Data out; inside it has tag-data storage, muxes, comparators, ...]
Determine the internal register transfers
Design the datapath
Design the cache controller
[Figure: the cache datapath (Address, Data In, Data Out, R/W, Active) exchanges control points and signals (including "wait") with the cache controller.] Lec17.72

Impact on Cycle Time
Cache hit time:
  directly tied to clock rate
  increases with cache size
  increases with associativity
[Figure: pipeline sketch - PC feeds the I-Cache (an I-cache miss invalidates IR); IR/IRex/IRm/IRwb flow down the pipeline; the D-Cache sits in the memory stage and raises a Miss signal.]

Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty
Compute Time = IC x CT x (ideal CPI + memory stalls)
Example: direct mapping allows the miss signal to arrive after the data Lec17.73

What happens on a cache miss?
For an in-order pipeline, 2 options:
Option 1: freeze the pipeline in the Mem stage (popular early on: SPARC, R4000)
  IF ID EX Mem stall stall ... stall Mem Wr
     IF ID EX  stall stall ... stall stall Ex Wr

Option 2: use full/empty bits in the registers + an MSHR queue
  MSHR = Miss Status/Handler Register (Kroft): each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
    - Per cache line: keep info about the memory address.
    - For each word: the register (if any) that is waiting for the result; used to merge multiple requests to one memory line.
  A new load creates an MSHR entry and sets the destination register to Empty; the load is released from the pipeline.
  An attempt to use the register before the result returns causes the instruction to block in the decode stage.
  Limited out-of-order execution with respect to loads; popular with in-order superscalar architectures.
  Out-of-order pipelines already have this functionality built in (load queues, etc.). Lec17.74

Improving Cache Performance: 3 general options
  Time = IC x CT x (ideal CPI + memory stalls/instruction)
  memory stalls/instruction = average memory accesses/inst x AMAT
    = (average IFETCH/inst x AMAT_Inst) + (average DMEM/inst x AMAT_Data)
  AMAT = Hit Time + (Miss Rate x Miss Penalty)
The options:
  1. Reduce the miss rate,
  2. Reduce the miss penalty, or
  3. Reduce the time to hit in the cache. Lec17.75

3Cs Absolute Miss Rate (SPEC92)

[Chart: miss rate per type (0 to 0.14) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity; the conflict component shrinks with associativity, capacity dominates, and the compulsory component is vanishingly small.] Lec17.76

2:1 Cache Rule
Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2
[Chart: the same miss-rate-per-type breakdown vs. cache size (1-128 KB), illustrating the rule.] Lec17.77

3Cs Relative Miss Rate
[Chart: miss rate per type, normalized to 100%, vs. cache size (1-128 KB); the conflict share falls with associativity, the capacity share grows, and compulsory stays tiny.]
Flaws: for fixed block size. Good: insight => invention. Lec17.78

1. Reduce Misses via Larger Block Size
[Chart: miss rate (0%-25%) vs. block size (16-256 bytes) for cache sizes 1K, 4K, 16K, 64K, 256K; bigger blocks help until the block gets too large relative to the cache.] Lec17.79

2. Reduce Misses via Higher Associativity
2:1 Cache Rule: miss rate of a direct mapped cache of size N = miss rate of a 2-way cache of size N/2

Beware: execution time is the only final measure!
  Will clock cycle time increase?
  Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2% Lec17.80

Example: Avg. Memory Access Time vs. Miss Rate
Assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped.

  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

(Red in the original marks A.M.A.T. values not improved by more associativity.) Lec17.81
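The table comes from AMAT = hit time x CCT + miss rate x miss penalty; a sketch of the computation (the CCT factors are the slide's, but the miss rates and 25-cycle penalty below are illustrative placeholders, not the data behind the table):

```python
# AMAT = hit time x CCT + miss rate x miss penalty (clock cycles).
CCT = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}

def amat(ways, miss_rate, hit_time=1.0, miss_penalty=25):
    return hit_time * CCT[ways] + miss_rate * miss_penalty

# A hypothetical 5% miss rate direct mapped vs. 4% for 2-way:
print(f"{amat(1, 0.05):.2f} vs. {amat(2, 0.04):.2f}")  # 2.25 vs. 2.10
```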

3. Reducing Misses via a Victim Cache
How to combine the fast hit time of direct mapped and still avoid conflict misses?
Add a small buffer (the victim cache) to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct mapped data cache
Used in Alpha, HP machines
[Figure: a direct mapped cache (TAGS/DATA) backed by a fully associative victim cache - four entries, each a tag-and-comparator plus one cache line of data - on the path to the next lower level in the hierarchy.] Lec17.82

4. Reducing Misses via Pseudo-Associativity
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
  [Timeline: Hit Time < Pseudo Hit Time < Miss Penalty]
Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
  Better for caches not tied directly to the processor (L2)
  Used in the MIPS R10000 L2 cache; similar in UltraSPARC Lec17.83

5. Reducing Misses by Hardware Prefetching
E.g., instruction prefetching:
  The Alpha 21064 fetches 2 blocks on a miss
  The extra block is placed in a stream buffer
  On a miss, check the stream buffer
Works with data blocks too:
  Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
  Palacharla & Kessler [1994]: for scientific programs, 8

  streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty Lec17.84

6. Reducing Misses by Software Prefetching Data
Data prefetch:
  Load data into a register (HP PA-RISC loads)
  Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  Special prefetching instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time:
  Is the cost of prefetch issues < the savings in reduced misses?
  Higher superscalar width reduces the difficulty of issue bandwidth Lec17.85

7. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4-byte blocks, in software
Instructions:
  Reorder procedures in memory so as to reduce conflict misses
  Profiling to look at conflicts (using tools they developed)
Data (a code sketch follows the list):
  Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays

  Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows Lec17.86
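A sketch of loop interchange on a row-major 2D array (illustrative Python; compilers apply this to C/Fortran arrays, but the access-order idea is the same):

```python
# Loop interchange: walk rows in the inner loop so consecutive accesses
# fall in the same cache block.
N = 1024
x = [[1.0] * N for _ in range(N)]

# Before: column-by-column -- adjacent iterations touch different rows,
# so nearly every access lands in a different cache block.
for j in range(N):
    for i in range(N):
        x[i][j] = 2.0 * x[i][j]

# After interchange: row-by-row -- adjacent iterations touch adjacent
# words, exploiting spatial locality.
for i in range(N):
    for j in range(N):
        x[i][j] = 2.0 * x[i][j]
```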

Summary #1 / 3:
The Principle of Locality: programs are likely to access a relatively small portion of the address space at any instant of time.
  - Temporal Locality: locality in time
  - Spatial Locality: locality in space
Major categories of cache misses:
  Compulsory misses: sad facts of life. Example: cold-start misses.
  Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  Capacity misses: increase cache size.
  Coherence misses: invalidation caused by external processors or I/O.
The cache design space: total size, block size, associativity; replacement policy; write-hit policy (write-through, write-back); write-miss policy Lec17.87

Summary #2 / 3: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation
The optimal choice is a compromise:
  depends on access characteristics (workload; use: I-cache, D-cache, TLB)
  depends on technology / cost
Simplicity often wins

[Figure: cache design space - each axis runs from "Less" (Factor A) to "More" (Factor B), with "Good" and "Bad" regions marking the compromise.] Lec17.88

Summary #3 / 3: Cache Miss Optimization
Lots of techniques people use to improve the miss rate of caches:

  Technique                              MR   MP   HT   Complexity
  Larger Block Size                      +    -         0
  Higher Associativity                   +         -    1
  Victim Caches                          +              2
  Pseudo-Associative Caches              +              2
  HW Prefetching of Instr/Data           +              2
  Compiler-Controlled Prefetching        +              3
  Compiler Techniques to Reduce Misses   +              0
Lec17.89
