CS 152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture

Lecture 18 Cache Coherence

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last Time in Lecture 17

- RISC-V Standard Vectors
- Note: slides from last year are available on the website to help with Lab 4

Bus Management

[Figure: a bus controller, Master 0, Master 1, Slave 0, and Slave 1 connected by Request, Grant, Clock/Control, Address, and Data wires]

- A bus is a collection of shared wires
  - Newer busses use point-to-point links
- Only one master can initiate a transaction by driving the wires at any one time
- Multiple slaves can observe and conditionally respond to the transaction on the wires
  - slaves decode the address on the bus to see if they should respond (memory is the most common slave)
  - some masters can also act as slaves
- Masters arbitrate for access with requests to the bus controller
  - Some busses only allow one master (in which case, it is also the controller)

Shared-Memory Multiprocessor

[Figure: CPU1, CPU2, and CPU3, each with a snoopy cache, share a memory bus with main memory (DRAM), DMA for disk, DMA for network, and bus control]

- Use a snoopy mechanism to keep all processors' view of memory coherent

Snoopy Cache, Goodman 1983

- Idea: Have the cache watch (or snoop upon) other memory transactions, and then do the right thing
- Snoopy cache tags are dual-ported

[Figure: cache containing tags and state plus data (lines), with processor-side A, R/W, and D signals. Labels: "Used to drive Memory Bus when Cache is Bus Master" and "Snoopy read port attached to Memory Bus" (A, R/W)]

Snoopy Cache-Coherence Protocols

- Write miss: the address is invalidated in all other caches before the write is performed
- Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
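
As a rough illustration of the two rules above, the following Python sketch models a few write-back caches snooping a shared bus. The class names (`Bus`, `SnoopyCache`), the per-line `dirty` flag, and the one-word lines are assumptions made for this example; it is not the hardware design from the lecture, and it ignores arbitration, timing, and evictions.

```python
class Bus:
    """Hypothetical shared bus: main memory plus the attached snoopy caches."""

    def __init__(self):
        self.memory = {}   # address -> value
        self.caches = []   # every cache that snoops this bus


class SnoopyCache:
    """Write-back cache with one-word lines, stored as address -> [value, dirty]."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def _others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self, addr):
        if addr not in self.lines:                      # read miss
            for other in self._others():
                line = other.lines.get(addr)
                if line is not None and line[1]:        # dirty copy found
                    self.bus.memory[addr] = line[0]     # write back before memory is read
                    line[1] = False
            self.lines[addr] = [self.bus.memory.get(addr, 0), False]
        return self.lines[addr][0]

    def write(self, addr, value):
        line = self.lines.get(addr)
        if line is None or not line[1]:                 # other caches may hold copies
            for other in self._others():
                stale = other.lines.pop(addr, None)     # invalidate before the write
                if stale is not None and stale[1]:
                    self.bus.memory[addr] = stale[0]    # keep memory current on the way out
        self.lines[addr] = [value, True]


bus = Bus()
c1, c2 = SnoopyCache(bus), SnoopyCache(bus)
bus.memory[0xA] = 100
c1.write(0xA, 200)       # c1 now holds the only (dirty) copy
print(c2.read(0xA))      # c1's dirty copy is written back before c2 reads memory
```

In this sketch a write to a line that is already dirty locally skips the invalidation loop, since no other cache can hold a copy at that point.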

Cache State-Transition Diagram: The MSI Protocol

Each cache line has state bits alongside its address tag:

- M: Modified
- S: Shared
- I: Invalid

[Figure: MSI state diagram, cache state in processor P1]

- I -> M: Write miss (P1 gets line from memory)
- I -> S: Read miss (P1 gets line from memory)
- S -> S: Read by any processor
- S -> M: P1 intent to write
- S -> I: Other processor intent to write
- M -> M: P1 reads or writes
- M -> S: Other processor reads (P1 writes back)
- M -> I: Other processor intent to write (P1 writes back)

Two-Processor Example (Reading and writing the same cache line)

Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes

[Figure: the MSI state diagram drawn once for P1 and once for P2, with each transition labeled from that cache's point of view, e.g. "P2 reads, P1 writes back" in P1's diagram and "P1 reads, P2 writes back" in P2's diagram]

Observation

[Figure: MSI state diagram for P1, repeated]

- If a line is in the M state then no other cache can have a copy of the line!
- Memory stays coherent, multiple differing copies cannot exist

MESI: An Enhanced MSI protocol

- Increased performance for private data

Each cache line has an address tag and state bits:

- M: Modified Exclusive
- E: Exclusive but unmodified
- S: Shared
- I: Invalid

[Figure: MESI state diagram, cache state in processor P1]

- I -> M: Write miss
- I -> S: Read miss, shared
- I -> E: Read miss, not shared
- E -> E: P1 read
- E -> M: P1 write
- E -> S: Other processor reads
- E -> I: Other processor intent to write
- S -> S: Read by any processor
- S -> M: P1 intent to write
- S -> I: Other processor intent to write
- M -> M: P1 write or read
- M -> S: Other processor reads, P1 writes back
- M -> I: Other processor intent to write, P1 writes back

Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with its own L1 $, L2 $, and snooper]

- Processors often have two-level caches
  - small L1, large L2 (usually both on chip now)
- Inclusion property: entries in L1 must be in L2
  - Miss in L2 => Not present in L1
  - Only if invalidation hits in L2 => probe and invalidate in L1
- Snooping on L2 does not affect CPU-L1 bandwidth

Intervention

[Figure: CPU-1 with cache-1 holding A = 200 and CPU-2 with cache-2 on the CPU-Memory bus; memory holds A = 100 (stale data)]

- When a read miss for A occurs in cache-2, a read request for A is placed on the bus
- Cache-1 needs to supply the data and change its state to shared
- The memory may respond to the request also! Does memory know it has stale data?
- Cache-1 needs to intervene through the memory controller to supply correct data to cache-2

False Sharing

[Figure: cache line layout: state | line addr | data0 | data1 | ... | dataN]

- A cache line contains more than one word
- Cache coherence is done at the line level and not the word level
- Suppose M1 writes word i and M2 writes word k, with i ≠ k, but both words have the same line address. What can happen?

Performance of Symmetric Multiprocessors (SMPs)

Cache performance is a combination of:

- Uniprocessor cache miss traffic
- Traffic caused by communication
  - Results in invalidations and subsequent cache misses
- Coherence misses
  - Sometimes called a communication miss
  - 4th C of cache misses, along with Compulsory, Capacity, and Conflict

Coherency Misses

- True sharing misses arise from the communication of data through the cache coherence mechanism
  - Invalidates due to 1st write to a shared line
  - Reads by another CPU of a modified line in a different cache
  - Miss would still occur if line size were 1 word
- False sharing misses arise when a line is invalidated because some word in the line, other than the one being read, is written into
  - Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
  - Line is shared, but no word in the line is actually shared
  - Miss would not occur if line size were 1 word

Example: True v. False Sharing v. Hit?

Assume x1 and x2 are in the same cache line. P1 and P2 have both read x1 and x2 before.

Time  P1        P2        True, False, Hit? Why?
1     Write x1            True miss; invalidate x1 in P2
2               Read x2   False miss; x1 irrelevant to P2
3     Write x1            False miss; x1 irrelevant to P2
4               Write x2  True miss; x2 not writeable
5     Read x2             True miss; invalidate x2 in P1

MP Performance

- 4-Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine
- Uniprocessor miss categories: Instruction, Capacity/Conflict, Compulsory
- True sharing and false sharing unchanged going from 1 MiB to 8 MiB (L3 cache)

[Figure: memory cycles per instruction]

[Figure: memory cycles per instruction vs. processor count]

CS152 Administrivia

- Midterm 2 in class Wednesday April 17
  - covers lectures 10-17, plus associated problem sets, labs, and readings

CS252 Administrivia

- Monday April 15th Project Checkpoint, 11am-noon, 405 Soda
  - Prepare a 10-minute presentation on current status

Scaling Snoopy/Broadcast Coherence

- When any processor gets a miss, it must probe every other cache
- Scaling up to more processors is limited by:
  - Communication bandwidth over the bus
  - Snoop bandwidth into the tags
- Can improve bandwidth by using multiple interleaved buses with interleaved tag banks
  - E.g., two bits of the address pick which of four buses and four tag banks to use (e.g., bits 7:6 of the address pick the bus/tag bank, bits 5:0 pick the byte in a 64-byte line)
- Buses don't scale to a large number of connections, so a point-to-point network can be used for a larger number of nodes, but it is then limited by tag bandwidth when broadcasting snoop requests
- Insight: Most snoops fail to find a match!

Scalable Approach: Directories

- Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with the directory and copies is through network transactions
- Many alternatives for organizing directory information
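
To make the contrast concrete, here is a small Python sketch comparing which nodes must be probed on a miss under broadcast snooping and under a directory. The function names and the dictionary-of-sets directory are assumptions chosen for illustration; a real directory is held in hardware next to memory.

```python
def broadcast_targets(num_nodes, requester):
    """Snoopy/broadcast: every other cache is probed on every miss."""
    return [n for n in range(num_nodes) if n != requester]


def directory_targets(directory, line_addr, requester):
    """Directory: only nodes recorded as holding a copy of the line are contacted.

    `directory` is a hypothetical mapping from line address to the set of
    node ids that currently cache that line.
    """
    sharers = directory.get(line_addr, set())
    return sorted(n for n in sharers if n != requester)


if __name__ == "__main__":
    directory = {0x40: {2, 5}}                       # line 0x40 is cached at nodes 2 and 5
    print(broadcast_targets(8, requester=0))         # all seven other nodes
    print(directory_targets(directory, 0x40, 0))     # only the two recorded sharers
    print(directory_targets(directory, 0x80, 0))     # uncached line: nobody to contact
```

With eight nodes and two sharers, the broadcast scheme spends five of its seven probes on caches that cannot match, which is the waste the directory avoids.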

Directory Cache Protocol

[Figure: six CPUs, each with a cache, connected through an interconnection network to four directory controllers, each with a DRAM bank]

- Each line in a cache has a state field plus tag: Stat. | Tag | Data
- Each line in memory has a state field plus a bit-vector directory with one bit per processor: Stat. | Directory | Data
- Assumptions: Reliable network, FIFO message delivery between any given source-destination pair
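
A per-line directory entry of this shape might be sketched in Python as follows. The class name, the method names, and the placeholder state string are assumptions for illustration only; the lecture specifies just a state field and one sharer bit per processor.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry for one memory line."""

    state: str = "uncached"   # placeholder encoding, not the lecture's state names
    sharers: int = 0          # bit i set => processor i holds a copy

    def add_sharer(self, proc):
        self.sharers |= 1 << proc

    def remove_sharer(self, proc):
        self.sharers &= ~(1 << proc)

    def has_sharer(self, proc):
        return bool((self.sharers >> proc) & 1)

    def sharer_list(self):
        """Decode the bit vector into a list of processor ids."""
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]


entry = DirectoryEntry()
entry.add_sharer(0)
entry.add_sharer(3)
print(entry.sharer_list())   # processors whose bits are set
```

A plain integer keeps the cost at one bit per processor per memory line, which is also why full bit vectors become expensive as the processor count grows.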

Cache States

For each cache line, there are 4 possible states:

- C-invalid (= Nothing): The accessed data is not resident in the cache.
- C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
- C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
- C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).

Home directory states

For each memory line, there are 4 possible states:

- R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ∅), the memory line is not cached by any site.
- W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
- TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
- TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.

Read miss, to uncached or shared line

[Figure: CPU and cache connected through the interconnection network to a directory controller and DRAM bank, with the numbered steps below]

1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data and return load data to CPU.

Write miss, to read shared line

[Figure: requesting CPU and cache plus multiple sharers (CPU and cache pairs), connected through the interconnection network to a directory controller and DRAM bank, with the numbered steps below]

1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.

Concurrency Management

- Protocol would be easy to design if there were only one transaction in flight across the entire system
- But we want greater throughput and don't want to have to coordinate across the entire system
- Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to the same cache line!

Acknowledgements

This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:

- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)