Lectures for 2nd Edition - Oregon State University

Chapter Three
© 2004 Morgan Kaufmann Publishers

Numbers
- Bits are just bits (no inherent meaning); conventions define the relationship between bits and numbers
- Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...
  decimal: 0 ... 2^n - 1
- Of course it gets more complicated:
  numbers are finite (overflow)
  fractions and real numbers
  negative numbers (e.g., there is no MIPS subi instruction; addi can add a negative immediate)
- How do we represent negative numbers? That is, which bit patterns represent which numbers?

Possible Representations

           Sign Magnitude   One's Complement   Two's Complement
  000 =        +0                +0                 +0
  001 =        +1                +1                 +1
  010 =        +2                +2                 +2
  011 =        +3                +3                 +3
  100 =        -0                -3                 -4
  101 =        -1                -2                 -3
  110 =        -2                -1                 -2
  111 =        -3                -0                 -1

- Issues: balance, number of zeros, ease of operations
- Which one is best? Why?

MIPS
- 32-bit signed numbers (two's complement):
  0000 0000 0000 0000 0000 0000 0000 0000two =  0ten
  0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
  0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
  ...
  0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
  0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten   (maxint)
  1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten   (minint)
  1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
  ...
  1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
  1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
  1111 1111 1111 1111 1111 1111 1111 1111two = -1ten

Two's Complement Operations
- Negating a two's complement number: invert all bits and add 1

- Remember: negate and invert are quite different!
- Converting an n-bit number into a number with more than n bits:
  MIPS 16-bit immediates are converted to 32 bits for arithmetic
  copy the most significant bit (the sign bit) into the new bits
      0010 -> 0000 0010
      1010 -> 1111 1010
  "sign extension" (lbu vs. lb)

Addition & Subtraction
- Just like in grade school (carry/borrow 1s):
      0111        0111        0110
    + 0110      - 0110      - 0101
- Two's complement operations are easy: do subtraction by adding the negative
      0111
    + 1010
- Overflow (result too large for the finite computer word):
  e.g., adding two n-bit numbers does not always yield an n-bit number
      0111
    + 0001
      1000      <- overflow
- Note that the term "overflow" is somewhat misleading: it does not mean a carry "overflowed"

Detecting Overflow
- No overflow when adding a positive and a negative number
- No overflow when subtracting numbers with the same sign
- Overflow occurs when the value affects the sign:
  adding two positives yields a negative
  or adding two negatives yields a positive
  or subtracting a negative from a positive gives a negative
  or subtracting a positive from a negative gives a positive

Effects of Overflow
- An exception (interrupt) occurs
  control jumps to a predefined address for the exception
  the interrupted address is saved for possible resumption
- Details depend on the software system / language
  example: flight control vs. homework assignment
- Don't always want to detect overflow
  new MIPS instructions: addu, addiu, subu
  note: addiu still sign-extends!
  note: sltu, sltiu for unsigned comparisons

Multiplication
- More complicated than addition
  accomplished via shifting and addition
- More time and more area
- Let's look at 3 versions based on a grade-school algorithm:
      0010   (multiplicand)
    x 1011   (multiplier)
- Negative numbers: convert, multiply, then fix the sign
  there are better techniques; we won't look at them

Multiplication: Implementation
[Figure: first hardware version - 64-bit Multiplicand register (shifts left), 32-bit Multiplier register (shifts right), 64-bit ALU, 64-bit Product register, control test.]
Algorithm, repeated 32 times:
  1. Test Multiplier0 (the low bit of the Multiplier register)
     1a. If Multiplier0 = 1, add the multiplicand to the product and place the result in the Product register
  2. Shift the Multiplicand register left 1 bit
  3. Shift the Multiplier register right 1 bit
After the 32nd repetition: done.

Final Version
[Figure: refined hardware - 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register that shifts right; the multiplier starts in the right half of the Product register.]
Algorithm, repeated 32 times:
  1. Test Product0 (the low bit of the Product register)
     1a. If Product0 = 1, add the multiplicand to the left half of the product and place the result in the left half
  3. Shift the Product register right 1 bit
("What goes here?" - which step of the earlier version is no longer needed?)

Floating Point (a brief look)
- We need a way to represent
  numbers with fractions, e.g., 3.1416
  very small numbers, e.g., .000000001
  very large numbers, e.g., 3.15576 x 10^9

Representation
- sign, exponent, significand:  (-1)^sign x significand x 2^exponent
- more bits for the significand gives more accuracy
- more bits for the exponent increases range
- IEEE 754 floating point standard:
  single precision: 8-bit exponent, 23-bit significand
  double precision: 11-bit exponent, 52-bit significand

IEEE 754 floating-point standard
- Leading 1 bit of significand is implicit
- Exponent is biased to make sorting easier
  all 0s is the smallest exponent, all 1s is the largest
  bias of 127 for single precision, 1023 for double precision
- Summary:  (-1)^sign x (1 + significand) x 2^(exponent - bias)

Floating Point Addition
[Figure: floating-point addition hardware - a small ALU computes the exponent difference, a shifter aligns the smaller significand, the big ALU adds the significands, then normalization and rounding hardware finish the result.]
Algorithm:
  1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent matches the larger exponent.
  2. Add the significands.
  3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent. Check for overflow or underflow (exception if so).
  4. Round the significand to the appropriate number of bits. If the result is no longer normalized, go back to step 3.

Floating Point Complexities

- Operations are somewhat more complicated (see text)
- In addition to overflow we can have underflow
- Accuracy can be a big problem
  IEEE 754 keeps two extra bits, guard and round
  four rounding modes
  positive divided by zero yields infinity
  zero divided by zero yields "not a number"
  other complexities
- Implementing the standard can be tricky
- Not using the standard can be even worse
  see text for a description of the 80x86 and the Pentium bug!

Chapter Three Summary
- Computer arithmetic is constrained by limited precision
- Bit patterns have no inherent meaning, but standards do exist
  two's complement
  IEEE 754 floating point
- Computer instructions determine the meaning of the bit patterns
- Performance and accuracy are important, so there are many complexities in real machines
- Algorithm choice is important and may lead to hardware optimizations for both space and time (e.g., multiplication)
- You may want to look back (Section 3.10 is great reading!)

Let's Build a Processor
- Almost ready to move into Chapter 5 and start building a processor
- First, let's review Boolean logic and build the ALU we'll need
  (material from Appendix B)
[Figure: 32-bit ALU block - 32-bit inputs a and b, an operation select input, and a 32-bit result output.]

Review: Boolean Algebra & Gates
- Problem: consider a logic function with three inputs: A, B, and C.
  Output D is true if at least one input is true
  Output E is true if exactly two inputs are true
  Output F is true only if all three inputs are true
- Show the truth table for these three functions.
- Show the Boolean equations for these three functions.
- Show an implementation consisting of inverters, AND, and OR gates.

An ALU (arithmetic logic unit)
- Let's build an ALU to support the andi and ori instructions
  we'll just build a 1-bit ALU and use 32 of them
[Figure: 1-bit ALU block - inputs a and b, an operation select, output result.]
- Possible implementation (sum-of-products)

Review: The Multiplexor
- Selects one of the inputs to be the output, based on a control input S
[Figure: 2-input mux - data inputs A and B, select S, output C.]
- Note: we call this a 2-input mux even though it has 3 inputs!
- Let's build our ALU using a MUX.

Different Implementations
- Not easy to decide the "best" way to build something
  don't want too many inputs to a single gate
  don't want to have to go through too many gates
  for our purposes, ease of comprehension is important
- Let's look at a 1-bit ALU for addition:
[Figure: 1-bit adder - inputs a, b, CarryIn; outputs Sum, CarryOut.]
      cout = a b + a cin + b cin
      sum  = a xor b xor cin
- How could we build a 1-bit ALU for add, and, and or?
- How could we build a 32-bit ALU?

Lecture 5

Building a 32-bit ALU
[Figure: 32 one-bit ALUs (ALU0 .. ALU31) chained together - each takes a_i, b_i, a CarryIn, and the shared Operation select, and produces Result_i plus a CarryOut that feeds the next ALU's CarryIn (a ripple-carry organization).]

What about subtraction (a - b)?
- Two's complement approach: just negate b and add.
- How do we negate? A very clever solution:
[Figure: 1-bit ALU with a Binvert control - a mux selects b or its inverse as the second ALU input. Setting Binvert = 1 and CarryIn = 1 computes a + ~b + 1 = a - b.]

Adding a NOR function
- Can also choose to invert a. How do we get "a NOR b"?
[Figure: 1-bit ALU with both Ainvert and Binvert controls - inverting both inputs and selecting AND gives ~a & ~b = a NOR b (De Morgan).]

Tailoring the ALU to the MIPS
- Need to support the set-on-less-than instruction (slt)
  remember: slt is an arithmetic instruction
  produces a 1 if rs < rt and 0 otherwise
  use subtraction: (a - b) < 0 implies a < b
- Need to support test for equality (beq $t5, $t6, $t7)
  use subtraction: (a - b) = 0 implies a = b

Supporting slt
- Can we figure out the idea?
[Figure: the 1-bit ALU gains a "Less" input (mux position 3); the most-significant-bit ALU additionally has a "Set" output and overflow detection. The sign of a - b, produced by the top bit, feeds back to bit 0's Less input; all other bits' Less inputs are 0.]

Supporting slt (32-bit view)
[Figure: the 32-bit ALU with Ainvert, Binvert, and Operation controls. ALU0's Less input takes the Set output of ALU31; ALU1 .. ALU31 have Less tied to 0; ALU31 also produces Set and Overflow.]

Test for equality
- Notice the control lines (Ainvert, Bnegate, and the 2-bit Operation):
      0000 = and
      0001 = or
      0010 = add
      0110 = subtract
      0111 = slt
      1100 = NOR
- Note: Zero is a 1 when the result is zero!
[Figure: the 32-bit ALU with all 32 result bits NORed together to produce the Zero output.]

Conclusion
- We can build an ALU to support the MIPS instruction set
  key idea: use a multiplexor to select the output we want
  we can efficiently perform subtraction using two's complement
  we can replicate a 1-bit ALU to produce a 32-bit ALU
- Important points about hardware
  all of the gates are always working
  the speed of a gate is affected by the number of inputs to the gate
  the speed of a circuit is affected by the number of gates in series (on the "critical path", or the deepest level of logic)
- Our primary focus is comprehension; however,
  clever changes to organization can improve performance (similar to using better algorithms in software)
  we saw this in multiplication; let's look at addition now

ALU Summary
- We can build an ALU to support MIPS addition
- Our focus is on comprehension, not performance
- Real processors use more sophisticated techniques for arithmetic
- Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware!

DONE

Chapter Five

The Processor: Datapath & Control
- We're ready to look at an implementation of the MIPS
- Simplified to contain only:
  memory-reference instructions: lw, sw
  arithmetic-logical instructions: add, sub, and, or, slt
  control flow instructions: beq, j
- Generic implementation:
  use the program counter (PC) to supply the instruction address
  get the instruction from memory
  read registers
  use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers
  Why? memory-reference? arithmetic? control flow?

More Implementation Details
[Figure: abstract/simplified view - the PC addresses the instruction memory; register numbers from the instruction select registers; the ALU operates on register data and supplies an address to the data memory; an adder increments the PC by 4.]
- Two types of functional units:
  elements that operate on data values (combinational)
  elements that contain state (sequential)

Five Execution Steps
- Instruction Fetch
- Instruction Decode and Register Fetch
- Execution, Memory Address Computation, or Branch Completion
- Memory Access or R-type instruction completion
- Write-back step
- INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

State Elements
- Unclocked vs. clocked
- Clocks are used in synchronous logic
  when should an element that contains state be updated?
[Figure: clock waveform showing the clock period (cycle time), rising edge, and falling edge.]

An unclocked state element
- The set-reset (SR) latch: output depends on present inputs and also on past inputs
[Figure: cross-coupled gates with inputs R and S and outputs Q and its complement.]

Latches and Flip-flops
- Output is equal to the stored value inside the element
  (don't need to ask for permission to look at the value)
- Change of state (value) is based on the clock
- Latches: output changes whenever the inputs change and the clock is asserted
- Flip-flop: state changes only on a clock edge (edge-triggered methodology)
- "Logically true" could mean electrically low
- A clocking methodology defines when signals can be read and written
  we wouldn't want to read a signal at the same time it was being written

D-latch
- Two inputs:
  the data value to be stored (D)
  the clock signal (C), indicating when to read and store D
- Two outputs:
  the value of the internal state (Q) and its complement
[Figure: gate-level D-latch with its timing diagram - Q follows D while C is asserted.]

D flip-flop
- Output changes only on the clock edge
[Figure: two D-latches back to back (master and slave) on opposite clock phases; timing diagram showing Q updating only at the clock edge.]

Our Implementation
- An edge-triggered methodology
- Typical execution:
  read contents of some state elements,
  send values through some combinational logic,
  write results to one or more state elements
[Figure: state element 1 -> combinational logic -> state element 2, all within one clock cycle.]

Register File
- Built using D flip-flops
[Figure: read ports - two read-register-number inputs drive multiplexors that select among registers 0 .. n-1, producing Read data 1 and Read data 2.]
- Do you understand? What is the Mux above?

Abstraction
- Make sure you understand the abstractions!
- Sometimes it is easy to think you do, when you don't
[Figure: a 32-bit 2-to-1 mux drawn two ways - as a single block selecting between 32-bit buses A and B to produce C, and expanded as 32 one-bit muxes (A31/B31 -> C31 down to A0/B0 -> C0) sharing the same Select line.]

Register File (write port)
- Note: we still use the real clock to determine when to write
[Figure: write port - an n-to-2^n decoder on the write-register number, gated by the Write signal, drives the C input of exactly one register's flip-flops; Register data feeds every register's D input.]

Simple Implementation
- Include the functional units we need for each instruction
[Figure: building blocks -
  a. instruction memory (instruction address in, instruction out); program counter; adder
  b. register file (two 5-bit read-register numbers, a 5-bit write-register number, write data, two read-data outputs, RegWrite control) and the ALU (4-bit ALU operation input, Zero and ALU result outputs)
  c. data memory (address, write data, read data, MemRead and MemWrite controls) and a 16-to-32-bit sign-extension unit.]
- Why do we need this stuff?

Building the Datapath
- Use multiplexors to stitch them together
[Figure: the single-cycle datapath - the PC addresses instruction memory; the register file reads two registers; ALUSrc selects read data 2 or the sign-extended immediate; the ALU (4-bit ALU operation) computes; the data memory is addressed by the ALU result; MemtoReg selects the ALU result or the memory read data for write-back; Shift-left-2 plus an adder compute the branch target, selected by PCSrc. Controls shown: RegWrite, ALUSrc, MemWrite, MemRead, MemtoReg, PCSrc.]

DONE

Control

- Selecting the operations to perform (ALU, read/write, etc.)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
- Example: add $8, $17, $18
  Instruction format:
      000000   10001   10010   01000   00000   100000
      op       rs      rt      rd      shamt   funct
- The ALU's operation is based on instruction type and function code

Control
- e.g., what should the ALU do with this instruction?
- Example: lw $1, 100($2)
      35       2       1       100
      op       rs      rt      16-bit offset
- ALU control input:
      0000   AND
      0001   OR
      0010   add
      0110   subtract
      0111   set-on-less-than
      1100   NOR
- Why is the code for subtract 0110 and not 0011?

Control
- Must describe hardware to compute the 4-bit ALU control input, given:
  the instruction type, encoded as a 2-bit ALUOp computed from the opcode:
      00 = lw, sw
      01 = beq
      10 = arithmetic
  and the function code, for arithmetic instructions
- Describe it using a truth table (can turn into gates)

[Figure: the complete single-cycle datapath with control - the main Control unit decodes Instruction[31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; a separate ALU control block combines ALUOp with Instruction[5-0] (the funct field) to drive the ALU; RegDst selects Instruction[20-16] or Instruction[15-11] as the write register; Branch ANDed with Zero drives PCSrc.]

Main control truth table:

  Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
  R-format       1       0        0         1         0        0        0       1       0
  lw             0       1        1         1         1        0        0       0       0
  sw             X       1        X         0         0        1        0       0       0
  beq            X       0        X         0         0        0        1       0       1

Control
- Simple combinational logic (truth tables)
[Figure: gate-level sketch of the ALU control block (inputs ALUOp0, ALUOp1, and F5-F0; outputs Operation2-0) and of the main control (inputs Op5-Op0; one AND gate per instruction class - R-format, lw, sw, beq - driving the control outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0).]

Our Simple Control Structure
- All of the logic is combinational
- We wait for everything to settle down, and the right thing to be done
  the ALU might not produce the right answer right away
  we use write signals along with the clock to determine when to write
- Cycle time is determined by the length of the longest path
[Figure: state element 1 -> combinational logic -> state element 2, one clock cycle.]
- We are ignoring some details like setup and hold times

Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (200ps), ALU and adders (100ps), register file access (50ps) PCSrc M u x Add Add 4 ALU result Shift left Morgan Kaufmann Publishers2 PC Read address Instruction Instruction memory Read register Morgan Kaufmann Publishers1 ALUSrc Read data Morgan Kaufmann Publishers1 ALU Morgan Kaufmann Publishersoperation MemWrite Read register Morgan Kaufmann Publishers2

Registers Read Write data Morgan Kaufmann Publishers2 register MemtoReg Zero M u x Write data ALU ALU result Address Write data RegWrite 16 4 Sign extend 32 Read data M u x Data memory

MemRead 55 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Where we are headed Single Cycle Problems: what if we had a more complicated instruction like floating point? wasteful of area One Solution: use a smaller cycle time have different instructions take different numbers of cycles a multicycle datapath: PC Address Instruction register Instruction or Morgan Kaufmann Publishersdata Memory Data Memory data register Data A Register Morgan Kaufmann Publishers#

Registers Register Morgan Kaufmann Publishers# ALU ALUOut B Register Morgan Kaufmann Publishers# 56 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multicycle Approach We will be reusing functional units ALU used to compute address and to increment PC Memory used for instruction and data Our control signals will not be determined directly by instruction e.g., what should the ALU do for a subtract instruction? Well use a finite state machine for control 57 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multicycle Approach Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional internal registers PC

0 M u x 1 Address Memory MemData Write data Instruction [2016] Instruction [150] Instruction register Instruction [150] Memory data register 0 M u x 1 Read register Morgan Kaufmann Publishers1 Instruction [2521] 0 M Instruction u x [1511] 1

0 M u x 1 16 Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Write Read register data Morgan Kaufmann Publishers2 A ALUOut 0 B Write data Sign extend Zero ALU ALU result 4 1M u 2 x 3 32

Shift left Morgan Kaufmann Publishers2 58 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Instructions from ISA perspective Consider each instruction from perspective of ISA. Example: The add instruction changes a register. Register specified by bits 15:11 of instruction. Instruction specified by the PC. New value is the sum (op) of two registers. Registers specified by bits 25:21 and 20:16 of the instruction Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming) 59 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Breaking down an instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] Reg[Memory[PC][20:16]] Could break down to: IR <= Memory[PC] A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= A op B Reg[IR[20:16]] <= ALUOut

We forgot an important part of the definition of arithmetic!

  PC <= PC + 4

Idea behind the multicycle approach

- We define each instruction from the ISA perspective (do this!)
- Break it down into steps following our rule that data flows through at most one major functional unit (i.e., balance work across steps)
- Introduce new registers as needed (e.g., A, B, ALUOut, MDR)
- Finally, try to pack as much work into each step as possible (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify the solution)

Result: our book's multicycle implementation!

Five Execution Steps

1. Instruction fetch
2. Instruction decode and register fetch
3. Execution, memory address computation, or branch completion
4. Memory access or R-type instruction completion
5. Write-back step

INSTRUCTIONS TAKE FROM 3 TO 5 CYCLES!

Done for Real

Step 1: Instruction Fetch

Use the PC to get the instruction and put it in the instruction register.
Increment the PC by 4 and put the result back in the PC.

Can be described succinctly using RTL ("register-transfer language"):

  IR <= Memory[PC];
  PC <= PC + 4;

Can we figure out the values of the control signals?
What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch

- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch

RTL:

  A <= Reg[IR[25:21]];
  B <= Reg[IR[20:16]];
  ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic).

Step 3 (instruction dependent)

The ALU performs one of three functions, based on the instruction type:

- Memory reference:  ALUOut <= A + sign-extend(IR[15:0]);
- R-type:            ALUOut <= A op B;
- Branch:

                     if (A == B) PC <= ALUOut;

Step 4 (R-type or memory access)

Loads and stores access memory:

  MDR <= Memory[ALUOut];
or
  Memory[ALUOut] <= B;

R-type instructions finish:

  Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle, on the edge.

Step 5: Write-back

  Reg[IR[20:16]] <= MDR;

Which instruction needs this?

Summary

[Figure: summary table of the steps taken for each instruction class]

Simple Questions

How many cycles will it take to execute this code?

  Label: lw  $t2, 0($t3)
         lw  $t3, 4($t3)
         beq $t2, $t3, Label    # assume not taken
         add $t5, $t2, $t3
         sw  $t5, 8($t3)
         ...

What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?

[Figure: complete multicycle datapath with control: the control unit, driven by the opcode, produces PCSource, PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, ALUOp, ALUSrcA, ALUSrcB, RegWrite, and RegDst]
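A back-of-the-envelope check for these questions, assuming the cycle counts from the five steps above (lw: 5, sw: 4, R-type: 4, beq: 3) and the not-taken branch. This is an illustrative sketch; the instruction names are just labels.

```python
# Cycles per instruction class in the multicycle implementation.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

program = ["lw", "lw", "beq", "add", "sw"]   # the code sequence above
total = sum(CYCLES[i] for i in program)      # 5 + 5 + 3 + 4 + 4 = 21

def locate(cycle):
    """Which instruction owns a given absolute cycle, and which of its steps?"""
    for instr in program:
        if cycle <= CYCLES[instr]:
            return instr, cycle              # (instruction, step within it)
        cycle -= CYCLES[instr]

print(total)        # 21 cycles in all
print(locate(8))    # ('lw', 3): step 3 of the second lw (address computation)
print(locate(16))   # ('add', 3): the actual addition happens in cycle 16
```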

Review: finite state machines

Finite state machines:
- a set of states
- a next-state function (determined by the current state and the input)
- an output function (determined by the current state and possibly the input)

[Figure: FSM block diagram: inputs and the current state feed the next-state function, whose result is clocked into the state register; the output function produces the outputs]

We'll use a Moore machine (output based only on the current state).

Review: finite state machines

Example (B.37): A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light moves from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
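One possible answer, sketched as a Moore machine in Python (the state names are invented here): four states suffice, because the middle light is lit both on the way right and on the way left, and the output depends only on the current state.

```python
# Moore machine for the electronic eye: output is a function of state only.
# States: L (left lit), MR (middle, moving right), R (right lit),
#         ML (middle, moving left). No inputs, so next state is a table.
NEXT = {"L": "MR", "MR": "R", "R": "ML", "ML": "L"}
OUTPUT = {"L": (1, 0, 0), "MR": (0, 1, 0), "R": (0, 0, 1), "ML": (0, 1, 0)}

state = "L"
trace = []
for _ in range(6):                  # six clock ticks
    trace.append(OUTPUT[state])     # lights as (Left, Middle, Right)
    state = NEXT[state]

# the lit light sweeps left -> middle -> right -> middle -> left -> ...
print(trace)
```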

Implementing the Control

The value of the control signals depends upon:
- what instruction is being executed
- which step is being performed

Use the information we've accumulated to specify a finite state machine:
- specify the finite state machine graphically, or
- use microprogramming

The implementation can be derived from the specification.

Graphical Specification of FSM

Note: a signal is a don't-care if not mentioned, asserted if named alone, and otherwise set to the exact value given.

  State 0 (instruction fetch):          MemRead; ALUSrcA = 0; IorD = 0; IRWrite; ALUSrcB = 01; ALUOp = 00; PCWrite; PCSource = 00
  State 1 (decode / register fetch):    ALUSrcA = 0; ALUSrcB = 11; ALUOp = 00
  State 2 (memory address computation): ALUSrcA = 1; ALUSrcB = 10; ALUOp = 00
  State 3 (memory access, lw):          MemRead; IorD = 1
  State 4 (memory read completion):     RegDst = 0; RegWrite; MemtoReg = 1
  State 5 (memory access, sw):          MemWrite; IorD = 1
  State 6 (execution, R-type):          ALUSrcA = 1; ALUSrcB = 00; ALUOp = 10
  State 7 (R-type completion):          RegDst = 1; RegWrite; MemtoReg = 0
  State 8 (branch completion):          ALUSrcA = 1; ALUSrcB = 00; ALUOp = 01; PCWriteCond; PCSource = 01
  State 9 (jump completion):            PCWrite; PCSource = 10

How many state bits will we need?
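The graphical specification above can be captured directly as a table plus a next-state function. A hedged sketch, encoding only the signal assertions listed above and the transitions implied by the instruction classes (opcodes written symbolically, which is an assumption of this sketch):

```python
# Asserted or valued control signals per state; don't-cares are omitted.
SIGNALS = {
    0: {"MemRead": 1, "ALUSrcA": 0, "IorD": 0, "IRWrite": 1,
        "ALUSrcB": 0b01, "ALUOp": 0b00, "PCWrite": 1, "PCSource": 0b00},
    1: {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
    2: {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},
    3: {"MemRead": 1, "IorD": 1},
    4: {"RegDst": 0, "RegWrite": 1, "MemtoReg": 1},
    5: {"MemWrite": 1, "IorD": 1},
    6: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},
    7: {"RegDst": 1, "RegWrite": 1, "MemtoReg": 0},
    8: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,
        "PCWriteCond": 1, "PCSource": 0b01},
    9: {"PCWrite": 1, "PCSource": 0b10},
}

def next_state(state, opcode):
    """Next-state function for the FSM above; opcode given symbolically."""
    if state == 0:
        return 1                                        # fetch -> decode
    if state == 1:                                      # decode: branch on opcode
        return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[opcode]
    if state == 2:
        return 3 if opcode == "lw" else 5               # address -> read or write
    if state == 3:
        return 4                                        # lw: access -> write-back
    if state == 6:
        return 7                                        # R-type: execute -> done
    return 0                                            # states 4, 5, 7, 8, 9 -> fetch
```

Walking a lw through the table (states 0, 1, 2, 3, 4) visits five states, matching its five cycles; ten states also answer the question above: 4 state bits are enough.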

Finite State Machine for Control

Implementation:

[Figure: control logic block: inputs are the opcode bits (Op5-Op0) from the instruction register and the current-state bits (S3-S0); outputs are the datapath control signals and the next-state bits (NS3-NS0), which are latched into the state register]

PLA Implementation

If I picked a horizontal or vertical line, could you explain it?

[Figure: PLA realizing the control function: the AND plane combines Op5-Op0 and S3-S0 into product terms; the OR plane drives PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1-0, ALUOp1-0, ALUSrcB1-0, ALUSrcA, RegWrite, RegDst, and NS3-NS0]

ROM Implementation

ROM = "read-only memory"
- the values of the memory locations are fixed ahead of time

A ROM can be used to implement a truth table
- if the address is m bits, we can address 2^m entries in the ROM
- our outputs are the bits of data that the address points to

  address (m = 3)   data (n = 4)
  000               0011
  001               1100
  010               1100
  011               1000
  100               0000
  101               0001
  110               0110
  111               0111

m is the "height" and n is the "width"
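A ROM behaves exactly like an indexed array, so the truth table above can be sketched in a couple of lines (using the entries from the table):

```python
# The 8-entry, 4-bit-wide ROM from the truth table above,
# indexed by its 3-bit address.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]

def rom_read(address):
    """m = 3 address bits select one of 2^m entries; n = 4 data bits out."""
    return ROM[address & 0b111]

print(format(rom_read(0b001), "04b"))   # -> 1100
```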

ROM Implementation

How many inputs are there?
- 6 bits for the opcode + 4 bits for the state = 10 address lines
  (i.e., 2^10 = 1024 different addresses)

How many outputs are there?
- 16 datapath-control outputs + 4 state bits = 20 outputs

The ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
- rather wasteful, since for lots of the entries the outputs are the same
- i.e., the opcode is often ignored

ROM vs. PLA

Break up the table into two parts:
- the 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
- the 10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
- total: 4.3K bits of ROM

A PLA is much smaller:
- it can share product terms
- it only needs entries that produce an active output
- it can take don't-cares into account
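The ROM sizes quoted above check out with a line of arithmetic each:

```python
# Single ROM: 10 address bits (6 opcode + 4 state), 20 output bits.
single_rom_bits = (2 ** 10) * 20            # 20480, the "20K bits" above

# Split ROMs: the 16 datapath outputs depend only on the 4 state bits;
# only the 4 next-state bits need all 10 inputs.
split_rom_bits = (2 ** 4) * 16 + (2 ** 10) * 4
print(single_rom_bits, split_rom_bits)      # 20480 and 4352 (about 4.3K)
```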

PLA size is (#inputs x #product-terms) + (#outputs x #product-terms)
- for this example: (10 x 17) + (20 x 17) = 510 PLA cells
- a PLA cell is usually about the size of a ROM cell (slightly bigger)

Another Implementation Style

Complex instructions: the "next state" is often the current state + 1

[Figure: control unit where a PLA or ROM produces the datapath control signals plus an AddrCtl field; address-select logic chooses the next state from an incrementer on the state register, two dispatch ROMs indexed by the opcode field, or state 0]

Details

Dispatch ROM 1:
  Op      Opcode name  Value
  000000  R-format     0110
  000010  jmp          1001
  000100  beq          1000
  100011  lw           0010
  101011  sw           0010

Dispatch ROM 2:
  Op      Opcode name  Value
  100011  lw           0011
  101011  sw           0101

[Figure: the address-select logic: a 4-way mux controlled by AddrCtl picks the next state from the incremented state (3), dispatch ROM 1 (1), dispatch ROM 2 (2), or 0 (0)]

  State number  Address-control action       Value of AddrCtl
  0             Use incremented state        3
  1             Use dispatch ROM 1           1
  2             Use dispatch ROM 2           2
  3             Use incremented state        3
  4             Replace state number by 0    0
  5             Replace state number by 0    0
  6             Use incremented state        3
  7             Replace state number by 0    0
  8             Replace state number by 0    0
  9             Replace state number by 0    0
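The address-select logic above reduces to one mux plus the two small dispatch ROMs. A sketch using the tables from this slide (opcodes written symbolically, an assumption of the sketch):

```python
# Dispatch ROM contents from the tables above (values as state numbers).
DISPATCH1 = {"rtype": 6, "j": 9, "beq": 8, "lw": 2, "sw": 2}
DISPATCH2 = {"lw": 3, "sw": 5}
ADDRCTL = [3, 1, 2, 3, 0, 0, 3, 0, 0, 0]    # AddrCtl value per state 0..9

def microsequencer_next(state, opcode):
    """The 4-way mux of the address-select logic."""
    ctl = ADDRCTL[state]
    if ctl == 3:
        return state + 1             # use incremented state
    if ctl == 1:
        return DISPATCH1[opcode]     # dispatch ROM 1
    if ctl == 2:
        return DISPATCH2[opcode]     # dispatch ROM 2
    return 0                         # replace state number by 0
```

A sw walks 0 -> 1 -> 2 -> 5 -> 0, and an R-type walks 0 -> 1 -> 6 -> 7 -> 0, matching the FSM.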

Microprogramming

[Figure: microprogrammed control unit: the datapath control signals and AddrCtl come from a microcode memory; a microprogram counter, an adder, and the address-select logic (driven by the instruction register's opcode field) pick the next microinstruction]

What are the microinstructions?

Microprogramming

A specification methodology
- appropriate if there are hundreds of opcodes, modes, cycles, etc.
- signals specified symbolically using microinstructions

  Label     ALU control  SRC1  SRC2     Register control  Memory     PC write control  Sequencing
  Fetch     Add          PC    4                          Read PC    ALU               Seq
            Add          PC    Extshft  Read                                           Dispatch 1
  Mem1      Add          A     Extend                                                  Dispatch 2
  LW2                                                     Read ALU                     Seq
                                        Write MDR                                      Fetch
  SW2                                                     Write ALU                    Fetch
  Rformat1  Func code    A     B                                                       Seq
                                        Write ALU                                      Fetch
  BEQ1      Subt         A     B                                     ALUOut-cond       Fetch
  JUMP1                                                              Jump address      Fetch

Will two implementations of the same architecture have the same microcode?
What would a microassembler do?

Microinstruction format

  Field name        Value         Signals active
  ALU control       Add           ALUOp = 00
                    Subt          ALUOp = 01
                    Func code     ALUOp = 10
  SRC1              PC            ALUSrcA = 0
                    A             ALUSrcA = 1
  SRC2              B             ALUSrcB = 00
                    4             ALUSrcB = 01
                    Extend        ALUSrcB = 10
                    Extshft       ALUSrcB = 11
  Register control  Read          (rs and rt are read into A and B every cycle)
                    Write ALU     RegWrite, RegDst = 1, MemtoReg = 0
                    Write MDR     RegWrite, RegDst = 0, MemtoReg = 1
  Memory            Read PC       MemRead, IorD = 0
                    Read ALU      MemRead, IorD = 1
                    Write ALU     MemWrite, IorD = 1
  PC write control  ALU           PCSource = 00, PCWrite
                    ALUOut-cond   PCSource = 01, PCWriteCond
                    Jump address  PCSource = 10, PCWrite
  Sequencing        Seq           AddrCtl = 11
                    Fetch         AddrCtl = 00
                    Dispatch 1    AddrCtl = 01
                    Dispatch 2    AddrCtl = 10
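A microassembler's core job is exactly this table lookup: expand each symbolic field of a microinstruction into the control signals listed above. An illustrative sketch (only a few table rows shown; field values are strings):

```python
# (field, symbolic value) -> control-signal settings, per the format table.
FIELD_SIGNALS = {
    ("ALU control", "Add"):       {"ALUOp": 0b00},
    ("ALU control", "Subt"):      {"ALUOp": 0b01},
    ("ALU control", "Func code"): {"ALUOp": 0b10},
    ("SRC1", "PC"):               {"ALUSrcA": 0},
    ("SRC1", "A"):                {"ALUSrcA": 1},
    ("SRC2", "4"):                {"ALUSrcB": 0b01},
    ("SRC2", "Extshft"):          {"ALUSrcB": 0b11},
    ("Memory", "Read PC"):        {"MemRead": 1, "IorD": 0},
    ("PC write control", "ALU"):  {"PCWrite": 1, "PCSource": 0b00},
    ("Sequencing", "Seq"):        {"AddrCtl": 0b11},
}

def assemble(microinstruction):
    """Merge the signal settings implied by each (field, value) pair."""
    signals = {}
    for field, value in microinstruction.items():
        signals.update(FIELD_SIGNALS[(field, value)])
    return signals

# The Fetch microinstruction from the microprogram above:
fetch = {"ALU control": "Add", "SRC1": "PC", "SRC2": "4",
         "Memory": "Read PC", "PC write control": "ALU",
         "Sequencing": "Seq"}
print(assemble(fetch))
```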

No encoding:
- 1 bit for each datapath operation
- faster, but requires more memory (logic)
- used for the VAX 780: an astonishing 400K of memory!

Lots of encoding:
- send the microinstructions through logic to get the control signals
- uses less memory, but slower

Historical context of CISC:
- too much logic to put on a single chip along with everything else
- use a ROM (or even RAM) to hold the microcode
- it's easy to add new instructions

Microcode: Trade-offs

The distinction between specification and implementation is sometimes blurred.

Specification advantages:
- easy to design and write
- the architecture and the microcode can be designed in parallel

Implementation (off-chip ROM) advantages:
- easy to change, since the values are in memory
- can emulate other architectures
- can make use of internal registers

Implementation disadvantages: SLOWER now that
- control is implemented on the same chip as the processor
- ROM is no longer faster than RAM
- there is no need to go back and make changes

Historical Perspective

In the '60s and '70s microprogramming was very important for implementing machines
- it led to more sophisticated ISAs, and to the VAX

In the '80s RISC processors based on pipelining became popular
- pipelining the microinstructions is also possible!

Implementations of the IA-32 architecture since the 486 use:
- hardwired control for the simpler instructions (few cycles, FSM control implemented using a PLA or random logic)
- microcoded control for the more complex instructions (large numbers of cycles, central control store)

The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store.

Pentium 4

Pipelining is important (the last IA-32 processor without it was the 80386, in 1985)

[Figure: Pentium 4 die photo labeling the control blocks, I/O interface, instruction cache and data cache (Chapter 7), enhanced floating-point and multimedia unit, integer datapath, advanced pipelining with hyperthreading support (Chapter 6), and the secondary cache and memory interface]

Pipelining is used for the simple instructions favored by compilers.
Simply put, a high-performance implementation needs to ensure that the simple instructions execute quickly and that the burden of the complexities of the instruction set falls on the complex, less frequently used instructions.

Pentium 4

Somewhere in all that control we must handle the complex instructions!

- The processor executes simple microinstructions, 70 bits wide (hardwired)
- There are 120 control lines for the integer datapath (400 for floating point)
- If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
- It's complicated!

Chapter 5 Summary

If we understand the instructions, we can build a simple processor!
- if instructions take different amounts of time, multicycle is better

Datapath implemented using:
- combinational logic for arithmetic
- state-holding elements to remember bits

Control implemented using:
- combinational logic for the single-cycle implementation
- a finite state machine for the multicycle implementation

Chapter Six

Pipelining

Improve performance by increasing instruction throughput.

[Figure: sequential vs. pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Sequentially, each instruction (instruction fetch, register read, ALU, data access, register write) takes 800 ps and a new one starts every 800 ps; pipelined, a new instruction starts every 200 ps. Note: the timing assumptions changed for this example.]

Ideal speedup is the number of stages in the pipeline. Do we achieve this?

Pipelining

What makes it easy?
- all instructions are the same length
- there are just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues.
We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.

Basic Idea

The datapath is divided into five stages:
  IF:  instruction fetch
  ID:  instruction decode / register file read
  EX:  execute / address calculation
  MEM: memory access
  WB:  write back

[Figure: single-cycle datapath marked off into the five pipeline stages]
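The 800 ps vs. 200 ps numbers from the figure above also show why the ideal speedup (the stage count) is only approached, never reached. A quick sketch under the example's timing assumptions:

```python
def total_time_ps(n, pipelined):
    """Total time for n instructions under the example's timing."""
    if pipelined:
        return (n - 1 + 5) * 200   # 5 stages, 200 ps clock: fill, then 1/cycle
    return n * 800                 # one 800 ps instruction at a time

three_seq = total_time_ps(3, False)    # 2400 ps for the three loads
three_pipe = total_time_ps(3, True)    # 1400 ps pipelined

# For large n the speedup approaches 800/200 = 4x, short of the ideal 5x,
# because the 200 ps clock is set by the slowest stage and the stages
# are unbalanced (they sum to 800 ps, not 5 x 200 ps).
speedup = total_time_ps(10**6, False) / total_time_ps(10**6, True)
```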

What do we need to add to actually split the datapath into stages?

Pipelined Datapath

[Figure: the datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages]

Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?

Corrected Datapath

[Figure: corrected pipelined datapath: the write-register number travels with the instruction through the ID/EX, EX/MEM, and MEM/WB registers back to the register file's write port]

Graphically Representing Pipelines

[Figure: multiple-clock-cycle pipeline diagram for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0): each instruction passes through IM, Reg, ALU, DM, and Reg, offset by one clock cycle, over cycles CC1-CC7]

Can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?

Use this representation to help understand datapaths.

Pipeline Control

[Figure: pipelined datapath annotated with its control signals: PCSrc, Branch, RegWrite, ALUSrc, Zero, MemWrite, MemRead, MemtoReg, ALUOp, RegDst, and the ALU control fed by instruction bits 15-0 (sign-extended), 20-16, and 15-11]

Pipeline Control

We have 5 stages. What needs to be controlled in each stage?
- instruction fetch and PC increment
- instruction decode / register fetch
- execution
- memory stage
- write back

How would control be handled in an automobile plant?
- a fancy control center telling everyone what to do?
- should we use a finite state machine?

Pipeline Control

Pass the control signals along, just like the data:

              Execution/address calc.      Memory-access              Write-back
              stage control lines          stage control lines        stage control lines
  Instruction RegDst ALUOp1 ALUOp0 ALUSrc  Branch MemRead MemWrite    RegWrite MemtoReg
  R-format      1      1      0      0       0      0       0           1        0
  lw            0      0      0      1       0      1       0           1        1
  sw            X      0      0      1       0      0       1           0        X
  beq           X      0      1      0       1      0       0           0        X

[Figure: the EX, M, and WB control fields generated during decode are carried along in the ID/EX, EX/MEM, and MEM/WB pipeline registers]
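The table above can be written down directly, and "passing control along" is then just copying the not-yet-used groups one pipeline register to the right each cycle. A hedged sketch (None stands for X):

```python
CONTROL = {
    # EX = (RegDst, ALUOp1, ALUOp0, ALUSrc); MEM = (Branch, MemRead, MemWrite);
    # WB = (RegWrite, MemtoReg). None stands for X (don't care).
    "R-format": {"EX": (1, 1, 0, 0),    "MEM": (0, 0, 0), "WB": (1, 0)},
    "lw":       {"EX": (0, 0, 0, 1),    "MEM": (0, 1, 0), "WB": (1, 1)},
    "sw":       {"EX": (None, 0, 0, 1), "MEM": (0, 0, 1), "WB": (0, None)},
    "beq":      {"EX": (None, 0, 1, 0), "MEM": (1, 0, 0), "WB": (0, None)},
}

# A lw decoded in ID: ID/EX holds all three groups; on each clock edge the
# groups not yet consumed move one pipeline register to the right.
id_ex = dict(CONTROL["lw"])                       # cycle n:   EX uses the EX bits
ex_mem = {k: id_ex[k] for k in ("MEM", "WB")}     # cycle n+1: MEM uses the MEM bits
mem_wb = {"WB": ex_mem["WB"]}                     # cycle n+2: WB uses the WB bits
```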

Datapath with Control

[Figure: the pipelined datapath with its control unit: the control values for the EX, M, and WB stages are computed in ID and travel through the pipeline registers with the instruction]

Dependencies

Problem with starting the next instruction before the first one is finished:
- dependences that go backward in time are data hazards

[Figure: pipeline diagram for
    sub $2,  $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
  The value of register $2 is 10 through CC4, is written (10/20) in CC5, and is 20 from CC6 on; the and and or instructions read $2 before the sub has written it, so their dependences go backward in time]

Software Solution

Have the compiler guarantee no hazards. Where do we insert the nops?

  sub $2,  $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Problem: this really slows us down!

Forwarding

Use temporary results; don't wait for them to be written
- register file forwarding handles a read and a write to the same register in one cycle
- ALU forwarding
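The compiler's nop-insertion pass above can be sketched as a short scan. This is a simplified model, assuming the register file writes in the first half of a cycle and reads in the second, so a result is safely readable three instructions after it is produced (registers are plain strings here; load-use timing is ignored):

```python
def insert_nops(program):
    """program: list of (dest, src1, src2) register names; None if unused.
    Without forwarding, a register written by an instruction can be read
    three instructions later. Insert nops to enforce that distance."""
    out = []
    for dest, s1, s2 in program:
        needed = 0
        for back in (1, 2):                  # distance-3 readers are safe
            if len(out) >= back:
                prev_dest = out[-back][0]
                if prev_dest is not None and prev_dest in (s1, s2):
                    needed = max(needed, 3 - back)
        out.extend([(None, None, None)] * needed)   # nop: no dest, no sources
        out.append((dest, s1, s2))
    return out

code = [("$2", "$1", "$3"),     # sub $2,  $1, $3
        ("$12", "$2", "$5"),    # and $12, $2, $5
        ("$13", "$6", "$2"),    # or  $13, $6, $2
        ("$14", "$2", "$2"),    # add $14, $2, $2
        (None, "$15", "$2")]    # sw  $15, 100($2) (reads $15 and $2)

padded = insert_nops(code)
# two nops land between the sub and the and; the or, add, and sw are then
# far enough from the sub, so the program grows from 5 to 7 instructions.
# With forwarding from EX/MEM and MEM/WB, none of these nops are needed.
```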

CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 10 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers10/20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann 
Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Program execution order (in Morgan Kaufmann Publishersinstructions) sub \$2, Morgan Kaufmann Publishers\$1, Morgan Kaufmann Publishers\$3 and Morgan Kaufmann Publishers\$12, \$2, Morgan Kaufmann 
Publishers\$5 or Morgan Kaufmann Publishers\$13, Morgan Kaufmann Publishers\$6, \$2 add Morgan Kaufmann Publishers\$14,\$2 , \$2 sw Morgan Kaufmann Publishers\$15, Morgan Kaufmann Publishers100(\$2) what if this \$2 was \$13? IM Reg IM DM

Reg IM Reg DM Reg IM Reg DM Reg IM Reg DM Reg Reg DM Reg 105 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Forwarding The main idea (some details not shown) ID/EX

EX/MEM MEM/WB M u x ForwardA Registers ALU M u x Data memory M u x ForwardB Rs Rt Rt Rd EX/MEM.RegisterRd M u x Forwarding unit MEM/WB.RegisterRd 106
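The forwarding unit's selection logic for the ALU's first operand can be sketched as follows. This is a simplified illustrative model (the function and dictionary field names are ours, chosen to mirror the pipeline-register notation on the slide):

```python
# Sketch of the ForwardA selection logic described on the slide.
# Field names follow the pipeline-register notation (EX/MEM.RegisterRd etc.).

def forward_a(id_ex_rs, ex_mem, mem_wb):
    """Return which value the ALU's first input should use:
    'EX/MEM' (forward the just-computed ALU result), 'MEM/WB'
    (forward an older result), or 'REG' (no hazard, use the register file)."""
    # EX hazard: the instruction one ahead writes the register we read.
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 \
            and ex_mem["RegisterRd"] == id_ex_rs:
        return "EX/MEM"
    # MEM hazard: an older instruction writes it (and no newer one does).
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 \
            and mem_wb["RegisterRd"] == id_ex_rs:
        return "MEM/WB"
    return "REG"

# sub $2,$1,$3 followed by and $12,$2,$5: $2 is forwarded from EX/MEM.
print(forward_a(2, {"RegWrite": True, "RegisterRd": 2},
                   {"RegWrite": False, "RegisterRd": 0}))  # EX/MEM
```

The `!= 0` tests matter because register $0 is hardwired to zero in MIPS and must never be forwarded.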

Can't Always Forward (slide 107)
- A load word can still cause a hazard: an instruction tries to read a register in the cycle right after a load writes that same register.
[Figure: clock-by-clock diagram for lw $2,20($1); and $4,$2,$5; or $8,$2,$6; add $9,$4,$2; slt $1,$6,$7. The lw's data is not available from DM until after the and needs it in EX, so forwarding alone cannot fix this.]
- Thus, we need a hazard detection unit to stall the pipeline after the load instruction.
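The load-use test the hazard detection unit performs in the ID stage can be sketched as one boolean condition (the field names mirror the pipeline registers; the function itself is our illustration):

```python
# Sketch of the load-use hazard test performed in the ID stage.

def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Stall if the instruction in EX is a load (MemRead asserted) whose
    destination register is a source of the instruction now in ID."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID -> stall one cycle
print(must_stall(True, 2, 2, 5))   # True
# sub (not a load) ahead -> no stall; forwarding handles that case
print(must_stall(False, 2, 2, 5))  # False
```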

Stalling (slide 108)
- We can stall the pipeline by keeping an instruction in the same stage:
[Figure: clock-by-clock diagram (CC 1 through CC 10) for lw $2,20($1); and $4,$2,$5; or $8,$2,$6; add $9,$4,$2. The and is held in ID for one cycle while a bubble, which becomes a nop, is inserted into EX; the later instructions slip one cycle.]

Hazard Detection Unit (slide 109)
- Stall by letting an instruction that won't write anything go forward:
[Figure: datapath with a hazard detection unit in ID. It examines ID/EX.MemRead and compares ID/EX.RegisterRt against IF/ID.RegisterRs and IF/ID.RegisterRt; on a hazard it selects 0 for the control signals, and PC and IF/ID are held. The forwarding unit from the previous slide is also shown.]

Branch Hazards (slide 110)
- When we decide to branch, other instructions are already in the pipeline!
[Figure: clock-by-clock diagram for 40 beq $1,$3,28; 44 and $12,$2,$5; 48 or $13,$6,$2; 52 add $14,$2,$2; 72 lw $4,50($7). Three instructions after the beq are fetched before the branch resolves.]
- We are predicting "branch not taken"; we need to add hardware for flushing instructions if we are wrong.

Flushing Instructions (slide 111)
[Figure: datapath with an IF.Flush signal. The branch comparator (=) and the branch target adder are in ID; the hazard detection unit, forwarding unit, and a zero mux on the control signals turn the instruction in IF into a nop when the branch is taken.]
- Note: we've also moved the branch decision to the ID stage.

Branches (slide 112)
- If the branch is taken, we have a penalty of one cycle
- For our simple design, this is reasonable
- With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
- Solution: dynamic branch prediction
[Figure: a 2-bit prediction scheme with four states: two "predict taken" states and two "predict not taken" states. A Taken outcome moves toward the taken side and a Not taken outcome toward the not-taken side, so the prediction flips only after two consecutive mispredictions.]

Branch Prediction (slide 113)
- Sophisticated techniques:
  - A branch target buffer to help us look up the destination
  - Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction depends on what happened in previous branches)
  - Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  - A branch delay slot, which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
- Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- Modern processors predict correctly 95% of the time!
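The 2-bit prediction scheme can be modeled as a saturating counter. The class below is our own illustrative sketch, not code from the book; counter values 0 and 1 predict not taken, 2 and 3 predict taken:

```python
# Illustrative model of a 2-bit saturating-counter branch predictor.

class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state          # 0 = strongly not taken ... 3 = strongly taken

    def predict(self):
        return self.state >= 2      # True means "predict taken"

    def update(self, taken):
        # Saturate at 0 and 3: one surprise moves a strong state to a
        # weak one; only two consecutive surprises flip the prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor(state=3)          # strongly taken, e.g. a loop branch
outcomes = [True, True, False, True]  # two iterations, loop exit, re-entry
predictions = []
for t in outcomes:
    predictions.append(p.predict())
    p.update(t)
print(predictions)  # [True, True, True, True]: one loop exit doesn't flip it
```

This is exactly why the 2-bit scheme beats a 1-bit one on loops: the single not-taken outcome at loop exit costs one misprediction instead of two.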

Improving Performance (slide 114)
- Try to avoid stalls! E.g., reorder these instructions:
    lw  $t0, 0($t1)
    lw  $t2, 4($t1)
    sw  $t2, 0($t1)
    sw  $t0, 4($t1)
- Dynamic pipeline scheduling
  - Hardware chooses which instructions to execute next
  - Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  - Speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)
- Trying to exploit instruction-level parallelism

Advanced Pipelining (slide 115)
- Increase the depth of the pipeline
- Start more than one instruction each cycle (multiple issue)
- Loop unrolling to expose more ILP (better scheduling)
- Superscalar processors
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different pipes)
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
- This class has given you the background you need to learn more!

Chapter 6 Summary (slide 116)
- Pipelining does not improve latency, but does improve throughput
[Figure: design-space chart plotting instructions per clock (IPC = 1/CPI), from 1 to several, against use of latency in instructions, from slower to faster: single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10).]

Chapter Seven (slide 117)

Memories: Review (slide 118)
- SRAM: value is stored on a pair of inverting gates
  - very fast, but takes up more space than DRAM (4 to 6 transistors)
- DRAM: value is stored as a charge on a capacitor (must be refreshed)
  - very small, but slower than SRAM (by a factor of 5 to 10)
[Figure: an SRAM cell (cross-coupled inverters on bit lines A and B, gated by the word line) and a DRAM cell (word line, pass transistor, capacitor, bit line).]

Exploiting Memory Hierarchy (slide 119)
- Users want large and fast memories! (2004 figures:)
  - SRAM access times are 0.5-5 ns, at a cost of $4000 to $10,000 per GB
  - DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB
  - Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
- Try to give it to them anyway: build a memory hierarchy
[Figure: memory hierarchy pyramid. The CPU sits above Level 1, Level 2, ..., Level n; distance from the CPU in access time grows downward, as does the size of the memory at each level.]

Locality (slide 120)
- A principle that makes having a memory hierarchy a good idea
- If an item is referenced:
  - temporal locality: it will tend to be referenced again soon
  - spatial locality: nearby items will tend to be referenced soon
- Why does code have locality?
- Our initial focus: two levels (upper, lower)
  - block: minimum unit of data
  - hit: data requested is in the upper level
  - miss: data requested is not in the upper level
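Why does code have locality? A tiny illustration (the loop and addresses below are our own example, not from the slides): the data trace of a loop summing an array is a run of adjacent words, and the loop's few instruction addresses repeat every iteration.

```python
# Address trace of: for i in range(n): s += a[i]
# Spatial locality shows up in the data addresses; temporal locality
# in the repeated instruction fetches of the loop body.

def trace_sum_loop(base=0x1000, n=8, word=4):
    """Data addresses touched by a simple array-summing loop."""
    return [base + i * word for i in range(n)]

addrs = trace_sum_loop()
# Spatial locality: each reference is the word right after the last one.
gaps = [b - a for a, b in zip(addrs, addrs[1:])]
print(gaps)   # [4, 4, 4, 4, 4, 4, 4]
# Temporal locality: the same handful of loop-body instruction addresses
# are fetched n times in a row.
```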

Cache (slide 121)
- Two issues:
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
- Our first example:
  - block size is one word of data
  - "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be
  - i.e., lots of items at the lower level share locations in the upper level

Direct Mapped Cache (slide 122)
- Mapping: address is modulo the number of blocks in the cache
[Figure: an eight-block cache (indices 000 through 111); memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache index 001.]

Direct Mapped Cache (slide 123)
- For MIPS:
[Figure: 1024-entry direct-mapped cache with one-word blocks. A 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset; each entry holds a valid bit, a 20-bit tag, and 32 bits of data, and a tag comparator produces Hit.]
- What kind of locality are we taking advantage of?

Direct Mapped Cache (slide 124)
- Taking advantage of spatial locality:
[Figure: 256-entry direct-mapped cache with sixteen-word blocks. The address splits into an 18-bit tag (bits 31-14), an 8-bit index (bits 13-6), a 4-bit block offset (bits 5-2), and a 2-bit byte offset; each 512-bit data entry feeds a 16-to-1 mux selected by the block offset.]

Hits vs. Misses (slide 125)
- Read hits: this is what we want!
- Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart
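How a direct-mapped cache splits an address and decides hit or miss can be sketched in a few lines. This mirrors the slide-123 layout (2-bit byte offset, 10-bit index, 20-bit tag); the Python class itself is our own illustration:

```python
# Minimal direct-mapped cache model: one-word blocks, 1024 entries.

class DirectMappedCache:
    def __init__(self, num_blocks=1024):
        self.num_blocks = num_blocks
        self.index_bits = num_blocks.bit_length() - 1   # 10 for 1024
        self.valid = [False] * num_blocks
        self.tags = [0] * num_blocks

    def split(self, addr):
        word = addr >> 2                     # drop the 2-bit byte offset
        index = word % self.num_blocks       # address modulo #blocks
        tag = word >> self.index_bits        # remaining upper bits
        return index, tag

    def access(self, addr):
        index, tag = self.split(addr)
        if self.valid[index] and self.tags[index] == tag:
            return "hit"
        self.valid[index] = True             # on a miss, fetch the block
        self.tags[index] = tag
        return "miss"

c = DirectMappedCache()
print(c.access(0x1000))   # miss (cold)
print(c.access(0x1000))   # hit  (temporal locality)
print(c.access(0x2000))   # miss (same index, different tag: a conflict)
```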

- Write hits:
  - can replace the data in cache and memory (write-through)
  - or write the data only into the cache and write the block back to memory later (write-back)
- Write misses:
  - read the entire block into the cache, then write the word

Hardware Issues (slide 126)
- Make reading multiple words easier by using banks of memory
[Figure: three memory organizations: (a) one-word-wide memory, (b) wide memory with a multiplexor between the cache and CPU, and (c) interleaved memory with four banks (bank 0 through bank 3) on one bus.]
- It can get a lot more complicated...

Performance (slide 127)
- Increasing the block size tends to decrease the miss rate:
[Figure: miss rate (0% to 40%) versus block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls with block size until the blocks become large relative to the cache.]
- Use split caches because there is more spatial locality in code:

  Program  Block size (words)  Instruction miss rate  Data miss rate  Effective combined miss rate
  gcc      1                   6.1%                   2.1%            5.4%
  gcc      4                   2.0%                   1.7%            1.9%
  spice    1                   1.2%                   1.3%            1.2%
  spice    4                   0.3%                   0.6%            0.4%

Performance (slide 128)
- Simplified model:
    execution time = (execution cycles + stall cycles) × cycle time
    stall cycles = # of instructions × miss ratio × miss penalty
- Two ways of improving performance:
  - decreasing the miss ratio
  - decreasing the miss penalty
- What happens if we increase block size?
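The simplified model can be worked through numerically. The figures below are our own, chosen for illustration:

```python
# Worked example of the slide's simplified performance model.

def execution_time(instr, base_cpi, miss_ratio, miss_penalty, cycle_time_ns):
    exec_cycles = instr * base_cpi
    stall_cycles = instr * miss_ratio * miss_penalty   # the slide's formula
    return (exec_cycles + stall_cycles) * cycle_time_ns

# 1M instructions, CPI 1.0, 5% of instructions miss,
# 100-cycle miss penalty, 1 ns clock:
t = execution_time(1_000_000, 1.0, 0.05, 100, 1.0)
print(t)   # 6000000.0 ns: stall cycles (5M) dominate execution cycles (1M)
```

Even a 5% miss ratio makes the machine spend five sixths of its time stalled, which is why both levers (miss ratio and miss penalty) matter.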

Decreasing Miss Ratio with Associativity (slide 129)
[Figure: four organizations of an eight-block cache: one-way set associative (direct mapped, blocks 0-7), two-way set associative (sets 0-3), four-way set associative (sets 0-1), and eight-way set associative (fully associative); each way holds a tag and data.]
- Compared to direct mapped, give a series of references that:
  - results in a lower miss ratio using a 2-way set associative cache
  - results in a higher miss ratio using a 2-way set associative cache
  assuming we use the least recently used (LRU) replacement strategy

An Implementation (slide 130)
[Figure: a four-way set-associative cache with 256 sets. The address splits into a 22-bit tag and an 8-bit index; four valid/tag/data ways are checked in parallel, and four comparators plus a 4-to-1 multiplexor produce Hit and Data.]
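One illustrative answer to the slide-129 exercise (the traces are our own choice), checked with a tiny simulator over eight one-word blocks with LRU replacement:

```python
# Tiny set-associative cache simulator: block numbers in, miss count out.

def misses(trace, ways, num_blocks=8):
    sets = num_blocks // ways
    cache = [[] for _ in range(sets)]       # each set: tags in LRU order
    count = 0
    for block in trace:
        s, tag = block % sets, block // sets
        if tag in cache[s]:
            cache[s].remove(tag)            # hit: move to MRU position
        else:
            count += 1
            if len(cache[s]) == ways:       # set full: evict the LRU tag
                cache[s].pop(0)
        cache[s].append(tag)
    return count

a = [0, 8, 0, 8, 0, 8]    # blocks 0 and 8 conflict when direct mapped
print(misses(a, 1))       # 6: every access misses in the direct-mapped cache
print(misses(a, 2))       # 2: both blocks fit in one 2-way set
b = [0, 4, 8, 0, 4, 8]    # three blocks that all land in set 0 of the 2-way
print(misses(b, 1))       # 5: only block 4 ever hits when direct mapped
print(misses(b, 2))       # 6: the 2-way set thrashes under LRU
```

Trace `a` shows associativity helping; trace `b` shows it hurting, because three blocks cycling through a two-way set under LRU evict each other every time.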

Performance (slide 131)
[Figure: miss rate (0% to 15%) versus associativity (one-way to eight-way) for cache sizes from 1 KB to 128 KB. Associativity helps the small caches most.]

Decreasing Miss Penalty with Multilevel Caches (slide 132)
- Add a second-level cache:
  - often the primary cache is on the same chip as the processor
  - use SRAMs to add another cache above primary memory (DRAM)
  - the miss penalty goes down if the data is in the 2nd-level cache
- Example: CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access
  - adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%
- Using multilevel caches:
  - try to optimize the hit time on the 1st-level cache
  - try to optimize the miss rate on the 2nd-level cache
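The two-level example on the slide can be worked through as follows (the arithmetic is ours, using the slide's figures):

```python
# Effect of a second-level cache, using the slide's numbers:
# 5 GHz clock, base CPI 1.0, 5% L1 miss rate, 100 ns DRAM, 5 ns L2.

clock_ghz = 5
dram_penalty = 100 * clock_ghz   # 100 ns at 5 cycles per ns = 500 cycles
l2_penalty = 5 * clock_ghz       # 5 ns L2 access = 25 cycles

cpi_no_l2 = 1.0 + 0.05 * dram_penalty                       # every L1 miss goes to DRAM
cpi_l2 = 1.0 + 0.05 * l2_penalty + 0.005 * dram_penalty     # only 0.5% reach DRAM
print(cpi_no_l2)            # 26.0
print(cpi_l2)               # 4.75
print(cpi_no_l2 / cpi_l2)   # roughly 5.5x faster with the L2 cache
```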

Cache Complexities (slide 133)
- It is not always easy to understand the implications of caches:
[Figure: two plots of Radix sort vs. Quicksort against problem size (4 to 4096 K items to sort). The theoretical behavior has Radix sort winning at large sizes, but the observed behavior shows Quicksort staying ahead.]

Cache Complexities (slide 134)
- Here is why:
[Figure: cache misses per item versus problem size (4 to 4096 K items to sort). Radix sort's misses grow sharply with size while Quicksort's stay low.]

- Memory system performance is often the critical factor
  - multilevel caches and pipelined processors make it harder to predict outcomes
  - compiler optimizations to increase locality sometimes hurt ILP
- Difficult to predict the best algorithm: need experimental data

Virtual Memory (slide 135)
- Main memory can act as a cache for the secondary storage (disk)
[Figure: virtual addresses map through address translation to physical addresses in main memory or to disk addresses.]
- Advantages:
  - illusion of having more physical memory
  - program relocation
  - protection

Pages: Virtual Memory Blocks (slide 136)
- Page faults: the data is not in memory, retrieve it from disk
  - huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  - reducing page faults is important (LRU is worth the price)
  - can handle the faults in software instead of hardware
  - using write-through is too expensive, so we use write-back
[Figure: a 32-bit virtual address splits into a virtual page number (bits 31-12) and a 12-bit page offset; translation replaces the virtual page number with a physical page number (bits 29-12) to form the physical address.]

Page Tables (slide 137)
[Figure: a page table indexed by virtual page number. Each entry has a valid bit and holds either a physical page number (valid = 1), pointing into physical memory, or a disk address (valid = 0), pointing into disk storage.]
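The translation the slides describe can be sketched directly: with 4 KB pages the low 12 bits pass through unchanged, and the virtual page number indexes the page table. The table contents below are made up for illustration:

```python
# Page-table address translation sketch: 4 KB pages.

PAGE_OFFSET_BITS = 12

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)  # unchanged low bits
    valid, ppn = page_table[vpn]
    if not valid:
        # Valid bit 0: page fault; the OS must bring the page in from disk.
        raise RuntimeError("page fault")
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x4: (True, 0x2A), 0x5: (False, None)}
print(hex(translate(0x4ABC, page_table)))   # 0x2aabc
```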

Page Tables (slide 138)
[Figure: hardware translation. A page table register points to the page table; the 20-bit virtual page number (bits 31-12) indexes it; each entry holds a valid bit and a physical page number, which is concatenated with the 12-bit page offset to form the physical address. If the valid bit is 0, the page is not present in memory.]

Making Address Translation Fast (slide 139)
- A cache for address translations: the translation-lookaside buffer (TLB)
[Figure: the TLB holds valid, dirty, and reference bits plus a tag and a physical page address for recently used translations; on a TLB miss, the full page table (physical page number or disk address per entry) is consulted, backed by physical memory and disk storage.]
- Typical values: 16-512 entries, miss rate 0.01%-1%, miss penalty 10-100 cycles

TLBs and Caches (slide 140)
[Figure: flowchart of a memory access. TLB access first; a TLB miss raises an exception, a TLB hit yields the physical address. Reads then try the cache, stalling while the block is read on a cache miss, and deliver the data to the CPU on a hit. Writes first check the write access bit (raising a write protection exception if it is off), then write the data into the cache, update the dirty bit, and put the data and address into the write buffer.]

TLBs and Caches (slide 141)
[Figure: combined TLB and cache datapath. The 20-bit virtual page number is looked up associatively in the TLB (valid, dirty, tag, physical page number), producing TLB hit; the resulting physical address, viewed as a physical tag, 8-bit cache index, block offset, and byte offset, then accesses the cache, producing Cache hit and 32 bits of data.]
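A TLB sitting in front of the page table can be sketched as a small fully associative cache of translations. The structure and numbers below are our own illustration:

```python
# TLB sketch: a few recent vpn -> ppn translations, consulted before
# the (slow) page-table walk.

class TLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}              # vpn -> ppn, insertion-ordered

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            return self.entries[vpn], True              # TLB hit
        ppn = page_table[vpn]                           # slow path
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict oldest entry
        self.entries[vpn] = ppn
        return ppn, False                               # miss, now cached

page_table = {7: 0x42}
tlb = TLB()
ppn, hit = tlb.lookup(7, page_table)
print(hit)   # False: the first access walks the page table
ppn, hit = tlb.lookup(7, page_table)
print(hit)   # True: the translation is reused from the TLB
```

A real TLB also caches the valid, dirty, and reference bits shown on the slide; they are omitted here to keep the sketch short.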

Data Tag Cache = Cache Morgan Kaufmann Publishershit 32 Data 141 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Modern Systems 142 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Modern Systems Things are getting complicated! 143 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Some Issues Processor speeds continue to increase very fast much faster than either DRAM or disk access times 100,000 10,000 1,000 Performance

[Figure: processor vs. memory performance over time — CPU performance climbs steeply while memory performance improves slowly, opening a widening gap.]
- Design challenge: dealing with this growing disparity
  - Prefetching?
  - Third-level caches and more?
  - Memory design?

Chapters 8 & 9 (partial coverage)

Interfacing Processors and Peripherals
- I/O design is affected by many factors (expandability, resilience)
- Performance depends on:
  - access latency
  - throughput
  - the connection between devices and the system
  - the memory hierarchy
  - the operating system
- A variety of different users (e.g., banks, supercomputers, engineers)

[Figure: a typical I/O organization — the processor (with interrupts) and its cache sit on a memory-I/O bus along with main memory and I/O controllers for two disks, graphics output, and a network.]

I/O
- Important but neglected
  - "The difficulties in assessing and designing I/O systems have often relegated I/O to second-class status"
  - "Courses in every aspect of computing, from programming to computer architecture, often ignore I/O or give it scanty coverage"
  - "Textbooks leave the subject to near the end, making it easier for students and instructors to skip it!"

- GUILTY!
  - we won't be looking at I/O in much detail
  - be sure to read Chapter 8 in its entirety
  - you should probably take a networking class!

I/O Devices
- Very diverse devices
  - behavior (i.e., input vs. output)
  - partner (who is at the other end?)
  - data rate

I/O Example: Disk Drives
[Figure: a disk drive — several platters, each divided into tracks, and each track divided into sectors.]
- To access data:
  - seek: position the head over the proper track (3 to 14 ms avg.)
  - rotational latency: wait for the desired sector (on average half a rotation, i.e., 0.5 / rotation rate)
  - transfer: grab the data (one or more sectors), 30 to 80 MB/sec
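The three components above add up to the average access time. A minimal sketch of the arithmetic — the 6 ms seek, 10,000 RPM spindle, 50 MB/sec transfer rate, and 512-byte sector are illustrative values, not figures from the slides:

```python
# Estimate average disk access time = seek + rotational latency + transfer.
# Hypothetical parameters: 6 ms average seek, 10,000 RPM spindle,
# 50 MB/sec sustained transfer rate, one 512-byte sector.

def disk_access_time_ms(seek_ms, rpm, transfer_mb_per_s, sector_bytes=512):
    # Average rotational latency: half a rotation, converted to ms.
    rotational_ms = 0.5 * (60.0 / rpm) * 1000.0
    # Time to move one sector across the interface, in ms.
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1000.0
    return seek_ms + rotational_ms + transfer_ms

# Seek (6 ms) and rotation (3 ms) dwarf the ~0.01 ms transfer.
print(round(disk_access_time_ms(6.0, 10000, 50.0), 2))  # 9.01
```

Note how completely the mechanical delays dominate: once the head is on the right sector, moving the data itself is essentially free.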

I/O Example: Buses
- Shared communication link (one or more wires)
- Difficult design:
  - may be a bottleneck
  - length of the bus
  - number of devices
  - tradeoffs (buffering for higher bandwidth increases latency)
  - support for many different devices
  - cost
- Types of buses:
  - processor-memory (short, high speed, custom design)
  - backplane (high speed, often standardized, e.g., PCI)
  - I/O (lengthy, different devices, e.g., USB, FireWire)
- Synchronous vs. asynchronous:
  - synchronous: use a clock and a synchronous protocol — fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
  - asynchronous: don't use a clock; use handshaking instead

I/O Bus Standards
- Today we have two dominant bus standards:
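A bus's peak rate overstates what transfers actually achieve, because each transaction spends cycles on arbitration and addressing before any data moves. A minimal sketch with made-up numbers (the 100 MHz clock, 4-byte width, and 5-cycle overhead are illustrative assumptions):

```python
# Effective bandwidth of a synchronous bus doing block transfers.
# Hypothetical bus: 100 MHz clock, 4 bytes per data cycle (400 MB/sec
# peak), plus fixed overhead cycles per transaction for arbitration
# and address transmission.

def effective_bandwidth_mb_s(clock_mhz, bytes_per_cycle, block_bytes, overhead_cycles):
    data_cycles = block_bytes // bytes_per_cycle
    total_cycles = overhead_cycles + data_cycles
    seconds = total_cycles / (clock_mhz * 1e6)
    return block_bytes / seconds / 1e6

# Larger blocks amortize the per-transaction overhead.
for block_bytes in (4, 16, 64):
    print(block_bytes, round(effective_bandwidth_mb_s(100, 4, block_bytes, 5), 1))
```

With 4-byte transfers the bus delivers only a fraction of its peak; 64-byte blocks recover most of it — one reason buses favor block transfers.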

Other important issues

- Bus arbitration:
  - daisy-chain arbitration (not very fair)
  - centralized arbitration (requires an arbiter), e.g., PCI
  - collision detection, e.g., Ethernet
- Operating system:
  - polling
  - interrupts
  - direct memory access (DMA)
- Performance analysis techniques:
  - queuing theory
  - simulation
  - analysis, i.e., find the weakest link (see "I/O System Design")
- Many new developments

Pentium 4 I/O Options

[Figure: Pentium 4 I/O options — the processor's system bus (800 MHz, 6.4 GB/sec) connects to the 82875P memory controller hub (north bridge), which links two channels of DDR 400 main-memory DIMMs (3.2 GB/sec each), an AGP 8X graphics connection (2.1 GB/sec), CSA 1 Gbit Ethernet (0.266 GB/sec), and a 266 MB/sec hub link to the I/O controller hub (south bridge), which in turn provides Serial ATA disk ports (150 MB/sec), Parallel ATA disk ports (100 MB/sec), AC/97 stereo surround-sound audio (1 MB/sec), and USB 2.0 ports (60 MB/sec).]

[Figure, continued: the 82801EB south bridge also connects CD/DVD and tape drives (20 MB/sec), 10/100 Mbit Ethernet, and a PCI bus (132 MB/sec).]

Fallacies and Pitfalls
- Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail.
- Fallacy: magnetic disk storage is on its last legs and will be replaced.
- Fallacy: a 100 MB/sec bus can transfer 100 MB/sec.
- Pitfall: moving functions from the CPU to the I/O processor, expecting to improve performance without analysis.
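The disk MTTF fallacy yields to simple arithmetic: 1,200,000 hours sounds enormous, but across a large population of disks, failures become routine. A sketch (the 1,000-disk installation size is an illustrative assumption):

```python
# Why a 1,200,000-hour rated MTTF does not mean disks never fail:
# expected failures scale with the number of disks in service.

HOURS_PER_YEAR = 8760

def expected_failures_per_year(num_disks, mttf_hours):
    # Approximate the annual failure rate per disk as hours-per-year / MTTF.
    return num_disks * HOURS_PER_YEAR / mttf_hours

# One disk: about 0.0073 failures per year (a ~137-year mean lifetime,
# far longer than any disk's useful life, so the rating misleads).
print(expected_failures_per_year(1, 1_200_000))
# A hypothetical 1,000-disk installation: about 7.3 failures per year.
print(expected_failures_per_year(1000, 1_200_000))
```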

Multiprocessors

- Idea: create powerful computers by connecting many smaller ones
  - good news: works for timesharing (better than a supercomputer)
  - bad news: it's really hard to write good concurrent programs
  - many commercial failures
[Figure: two organizations — processors, each with a private cache, sharing a single bus with one memory and I/O; and processors, each with a private cache and memory, connected by a network.]
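One reason good concurrent programs are hard to write: unsynchronized updates to shared data race with each other. A minimal sketch in Python (the counter, thread count, and iteration count are illustrative):

```python
# A shared counter updated by several threads: without mutual exclusion,
# the read-modify-write in `counter += 1` can interleave and lose updates;
# holding a lock around it makes the final count deterministic.
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:  # remove this lock and some increments may be lost
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000: every increment survives
```

The lock is exactly the kind of synchronization primitive the next slide lists; message-passing machines get the same effect by building coordination into send/receive.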

Questions
- How do parallel processors share data?
  - single address space (SMP vs. NUMA)
  - message passing
- How do parallel processors coordinate?
  - synchronization (locks, semaphores)
  - built into send / receive primitives
  - operating system protocols
- How are they implemented?
  - connected by a single bus
  - connected by a network

Supercomputers
- Plot of top 500 supercomputer sites over a decade:

[Figure: composition of the TOP500 list, 1993-2000 — uniprocessors and SIMD machines decline, shared-memory multiprocessors (SMPs) and massively parallel processors (MPPs) dominate the mid-1990s, and clusters (networks of workstations and networks of SMPs) grow rapidly toward the end of the decade.]

Using multiple processors: an old idea
- Some SIMD designs:
  - Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972, despite completion of only a quarter of the machine. It took three more years before it was operational!
- "For better or worse, computer architects are not easily discouraged"
- Lots of interesting designs and ideas, lots of failures, few successes

Topologies

[Figure: network topologies connecting processors P0-P7 — a 2-D grid (mesh) of 16 nodes; an n-cube of 8 nodes (8 = 2^3, so n = 3); a crossbar; and an Omega network.]
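In an n-cube, each node carries an n-bit binary label and links connect labels that differ in exactly one bit, so a node's neighbors can be enumerated by flipping each bit in turn. A minimal sketch:

```python
# Neighbors in an n-cube topology: label each of the 2**n nodes with an
# n-bit number; two nodes are linked iff their labels differ in one bit.

def hypercube_neighbors(node, n):
    # Flip each of the n bits in turn to enumerate the n neighbors.
    return [node ^ (1 << bit) for bit in range(n)]

# The 8-node cube from the figure (8 = 2**3, so n = 3):
print(hypercube_neighbors(0, 3))  # node 000 links to 001, 010, 100
print(hypercube_neighbors(5, 3))  # node 101 links to 100, 111, 001
```

This is why an n-cube needs only n links per node yet reaches any of the 2^n nodes in at most n hops.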

Clusters
- Constructed from whole computers
- Independent, scalable networks
- Strengths:
  - many applications amenable to loosely coupled machines
  - exploit local area networks
  - cost-effective / easy to expand
- Weaknesses:
  - administration costs not necessarily lower
  - connected using the I/O bus
- Highly available due to separation of memories
- In theory, we should be able to do better

Google
- Serves an average of 1,000 queries per second
- Google uses 6,000 processors and 12,000 disks
- Two sites in Silicon Valley, two in Virginia
- Each site connected to the internet using OC48 (2488 Mbit/sec)
- Reliability:
  - on an average day, 20 machines need to be rebooted (software errors)
  - 2% of the machines are replaced each year
- In some sense, simple ideas well executed. Better (and cheaper) than other approaches involving increased complexity
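The reliability figures above imply surprisingly frequent per-machine events once you do the arithmetic:

```python
# Back-of-the-envelope arithmetic for a cluster with 6,000 machines,
# 20 reboots per day, and 2% of machines replaced per year.

machines = 6000
reboots_per_day = 20

# How often any single machine gets rebooted, on average.
days_between_reboots = machines / reboots_per_day

# How many machines are swapped out over a year.
replacements_per_year = 0.02 * machines

print(days_between_reboots)   # each machine reboots roughly every 300 days
print(replacements_per_year)  # about 120 machines replaced per year
```

At this scale, failure is a constant background process, so the software must tolerate it routinely rather than treat it as exceptional.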

Concluding Remarks
- Evolution vs. Revolution
  - "More often the expense of innovation comes from being too disruptive to computer users"
  - "Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers."
