Instruction Flow Techniques ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison

High-IPC Processor (pipeline diagram): FETCH (I-cache, Branch Predictor, Instruction Buffer) drives instruction flow; DECODE; EXECUTE (Integer, Floating-point, Media, Memory units) with memory data flow and register data flow; COMMIT (Reorder Buffer (ROB), Store Queue, D-cache) Mikko Lipasti-University of Wisconsin 2

Instruction Flow Techniques Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch

Mikko Lipasti-University of Wisconsin 3 Goal and Impediments Goal of Instruction Flow Supply processor with maximum number of useful instructions every clock cycle Impediments Branches and jumps Finite I-Cache Capacity Bandwidth restrictions Mikko Lipasti-University of Wisconsin 4 Branch Types and Implementation 1. Types of Branches A. Conditional or Unconditional B. Save PC? C. How is target computed? Single target (immediate, PC+immediate) Multiple targets (register) 2. Branch Architectures A. Condition code or condition registers B. Register

Mikko Lipasti-University of Wisconsin 5 What's So Bad About Branches? Effects of Branches Fragmentation of I-Cache lines Need to determine branch direction Need to determine branch target Use up execution resources Pipeline drain/fill Mikko Lipasti-University of Wisconsin 6 What's So Bad About Branches? Problem: Fetch stalls until direction is determined Solutions: Minimize delay Move instructions determining branch condition away from branch (CC architecture)

Make use of delay Non-speculative: Fill delay slots with useful safe instructions Execute both paths (eager execution) Speculative: Predict branch direction Mikko Lipasti-University of Wisconsin 7 What's So Bad About Branches? Problem: Fetch stalls until branch target is determined Solutions: Minimize delay Generate branch target early Make use of delay: Predict branch target Single target Multiple targets Mikko Lipasti-University of Wisconsin 8 Control Dependences

BB 1 (main):  addi r2, r0, A
              addi r3, r0, B
              addi r4, r0, C
              addi r5, r0, N
              add  r10, r0, r0
              bge  r10, r5, end
BB 2 (loop):  lw   r20, 0(r2)
              lw   r21, 0(r3)
              bge  r20, r21, T1
              sw   r21, 0(r4)
              b    T2
BB 3 (T1):    sw   r20, 0(r4)
BB 4 (T2):    addi r10, r10, 1
              addi r2, r2, 4
              addi r3, r3, 4
              addi r4, r4, 4
              blt  r10, r5, loop
BB 5 (end):

Control Flow Graph (figure: nodes BB 1 through BB 5 with edges for the possible transfers among them) Shows possible paths of control flow through basic blocks Mikko Lipasti-University of Wisconsin 9 Control Dependences (same code example and control flow graph as above) Control Dependence Node B is CD on Node A if A determines whether B executes If path 1 from A to exit includes B, and path 2 does not, then B is control-dependent on A Mikko Lipasti-University of Wisconsin 10 Program Control Flow Implicit Sequential Control Flow

Static Program Representation Control Flow Graph (CFG) Nodes = basic blocks Edges = Control flow transfers Physical Program Layout Mapping of CFG to linear program memory Implied sequential control flow Dynamic Program Execution Traversal of the CFG nodes and edges (e.g. loops) Traversal dictated by branch conditions Dynamic Control Flow Deviates from sequential control flow Disrupts sequential fetching Can stall IF stage and reduce I-fetch bandwidth Program Control Flow Dynamic traversal of static CFG Mapping CFG to linear memory Limits on Instruction Level Parallelism (ILP), reported IPC limits by study:
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)
Mikko Lipasti-University of Wisconsin 13 Riseman and Foster's Study (1970: Flynn, 1972: Riseman/Foster) 7 benchmark programs on CDC-3600 Assume infinite machines

Infinite memory and instruction stack Infinite register file Infinite functional units True dependencies only at dataflow limit If bounded to single basic block, speedup is 1.72 (Flynn's bottleneck) If one can bypass n branches (hypothetically), then:
Branches bypassed 0: speedup 1.72
Branches bypassed 1: speedup 2.72
Branches bypassed 2: speedup 3.62
Branches bypassed 8: speedup 7.21
Branches bypassed 32: speedup 14.8
Branches bypassed 128: speedup 24.4
All branches bypassed: speedup 51.2
Mikko Lipasti-University of Wisconsin 14 Branch Prediction (1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor) Riseman & Foster showed potential But no idea how to reap benefit 1979: Jim Smith patents branch prediction at Control Data Predict current branch based on past history Today: virtually all processors use branch prediction

Mikko Lipasti-University of Wisconsin 15 Disruption of Sequential Control Flow Fetch Instruction/Decode Buffer Decode Dispatch Buffer Dispatch Reservation Stations Issue Branch Execute Finish Complete Reorder/ Completion Buffer Store Buffer Retire Mikko Lipasti-University of Wisconsin 16 Branch Prediction Target address generation Target Speculation Access register: PC, General purpose register, Link register Perform calculation:

+/- offset, autoincrement, autodecrement Condition resolution Condition speculation Access register: Condition code register, General purpose register Perform calculation: Comparison of data register(s) Mikko Lipasti-University of Wisconsin 17 Target Address Generation Fetch Reg. ind. with offset PCrel. Reg. ind. Decode Buffer Decode Dispatch Buffer Dispatch Reservation Stations Issue

Branch Execute Finish Completion Buffer Complete Store Buffer Retire Mikko Lipasti-University of Wisconsin 18 Condition Resolution Fetch Decode Buffer GP reg. value comp. Decode CC reg. Dispatch Buffer Dispatch Reservation Stations Issue Branch Execute

Finish Completion Buffer Complete Store Buffer Retire Mikko Lipasti-University of Wisconsin 19 Branch Instruction Speculation to I-cache Prediction FA-mux Spec. target PC(seq.) Branch Spec. cond. Predictor (using a BTB) BTB update (target addr. and history) PC(seq.) = FA (fetch address) Fetch Decode Buffer Decode Dispatch Buffer Dispatch Reservation

Stations Issue Branch Execute Finish Completion Buffer Mikko Lipasti-University of Wisconsin 20 Branch/Jump Target Prediction (example BTB entry: branch instruction address 0x0348, branch history 0101 (NTNT), most recent branch target address for prediction 0x0612) Branch Target Buffer: small cache in fetch stage Previously executed branches, address, taken history, target(s) Fetch stage compares current FA against BTB If match, use prediction If predict taken, use BTB target

When branch executes, BTB is updated Optimization: Size of BTB: increases hit rate Prediction algorithm: increase accuracy of prediction Mikko Lipasti-University of Wisconsin 21 Branch Prediction: Condition Speculation
1. Biased Not Taken: Hardware prediction Does not affect ISA Not effective for loops
2. Software Prediction: Extra bit in each branch instruction Set to 0 for not taken Set to 1 for taken Bit set by compiler or user; can use profiling Static prediction, same behavior every time
3. Prediction based on branch offset: Positive offset: predict not taken Negative offset: predict taken
4. Prediction based on dynamic history
Mikko Lipasti-University of Wisconsin 22 UCB Study [Lee and Smith, 1984] Benchmarks used 26 programs (IBM 370, DEC PDP-11, CDC 6400) 6 workloads (4 IBM, 1 DEC, 1 CDC) Used trace-driven simulation Branch types

Unconditional: always taken or always not taken Subroutine call: always taken Loop control: usually taken Decision: either way, if-then-else Computed goto: always taken, with changing target Supervisor call: always taken Execute: always taken (IBM 370)
Workloads: IBM1: compiler, IBM2: cobol (business app), IBM3: scientific, IBM4: supervisor (OS)
Fraction taken (T) vs. not taken (NT):
IBM1: T 0.640, NT 0.360
IBM2: T 0.657, NT 0.343
IBM3: T 0.704, NT 0.296
IBM4: T 0.540, NT 0.460
DEC: T 0.738, NT 0.262
CDC: T 0.778, NT 0.222
Avg: T 0.676, NT 0.324
Mikko Lipasti-University of Wisconsin 23 Branch Prediction Function Prediction function F(X1, X2, ...) X1 = opcode type X2 = history
Prediction effectiveness (percent correct) based on opcode only, or on history of depth 0 through 5:
IBM1: opcode only 66; history 0: 64, 1: 92, 2: 93, 3: 94, 4: 95, 5: 95
IBM2: opcode only 69; history 0: 64, 1: 95, 2: 97, 3: 97, 4: 97, 5: 97
IBM3: opcode only 71; history 0: 70, 1: 87, 2: 91, 3: 91, 4: 92, 5: 92
IBM4: opcode only 55; history 0: 54, 1: 80, 2: 83, 3: 84, 4: 84, 5: 84
DEC: opcode only 80; history 0: 74, 1: 97, 2: 98, 3: 98, 4: 98, 5: 98
CDC: opcode only 78; history 0: 78, 1: 82, 2: 91, 3: 94, 4: 95, 5: 96
Mikko Lipasti-University of Wisconsin 24

Example Prediction Algorithm (figure: a table indexed by branch instruction address holds the branch target address for prediction and a finite state machine over the recent taken/not-taken outcomes) Hardware table remembers last 2 branch outcomes History of past several branches encoded by FSM Current state used to generate prediction Results (percent correct): IBM1 93, IBM2 97, IBM3 91, IBM4 83, DEC 98, CDC 91 Mikko Lipasti-University of Wisconsin 25 IBM Study [Nair, 1992] Branch processing on the IBM RS/6000 Separate branch functional unit Overlap of branch instructions with other instructions Zero cycle branches

Two causes for branch stalls Unresolved conditions Branches downstream too close to unresolved branches Investigated optimal FSM design for branch prediction Mikko Lipasti-University of Wisconsin 26 Exhaustive Search for Optimal 2-bit Predictor There are 2^20 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Mikko Lipasti-University of Wisconsin 27 Branch Prediction Implementation (PPC 604) to I-cache Prediction

FA-mux Spec. target FA Branch Predictor FA (fetch address) Fetch Decode Buffer Decode Branch Predictor Update Dispatch Buffer Dispatch BRN SFX SFX CFX Reservation Stations LS FPU Issue

Branch Execute Finish Completion Buffer Mikko Lipasti-University of Wisconsin 28 BTAC and BHT Design (PPC 604) FAR FA-mux +4 FA FA Branch History Table (BHT) I-cache FA Branch Target Address Cache (BTAC) BTAC update BHT prediction

Decode Buffer BHT update Decode Dispatch Buffer BTAC prediction Dispatch BTAC: - 64 entries - fully associative - hit => predict taken BRN SFX SFX CFX Reservation Stations LS FPU Issue Branch BHT: - 512 entries

Execute - direct mapped - 2-bit saturating counter history based prediction - overrides BTAC prediction Finish Mikko Lipasti-University of Wisconsin Completion Buffer 29 Branch Speculation (figure: tree of speculated taken/not-taken paths, with TAG 1, TAG 2, and TAG 3 identifying the outstanding predicted branches) Start new correct path Must remember the alternate (non-predicted) path Eliminate incorrect path Must ensure that the mis-speculated instructions produce no side effects Mikko Lipasti-University of Wisconsin 30 Mis-speculation Recovery Start new correct path 1. Update PC with computed branch target (if predicted NT) 2. Update PC with sequential instruction address (if predicted T) 3. Can begin speculation again at next branch Eliminate incorrect path 1. Use tag(s) to deallocate ROB entries occupied by speculative instructions 2. Invalidate all instructions in the decode and dispatch

buffers, as well as those in reservation stations Mikko Lipasti-University of Wisconsin 31 Tracking Instructions Assign branch tags Allocated in circular order Instruction carries this tag throughout processor Often: track instruction groups Instructions managed in groups, max. one branch per group ROB structured as groups Leads to some inefficiency Simpler tracking of speculative instructions Mikko Lipasti-University of Wisconsin 32 BTAC and BHT Design (PPC 604) Fairly simple, 5-stage machine from 1994 Many sources for PC redirect

Lots of complexity Mikko Lipasti-University of Wisconsin 33 Instruction Flow Techniques Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch Mikko Lipasti-University of Wisconsin 34 Static Branch Prediction Single-direction Always not-taken: Intel i486 Backwards Taken/Forward Not Taken Loop-closing branches Used as backup in Pentium Pro, II, III, 4
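A minimal C sketch of the backward-taken/forward-not-taken (BTFN) rule mentioned above; the function name and its use of the target address are illustrative, not from the slides.

#include <stdbool.h>
#include <stdint.h>

/* Backward Taken / Forward Not Taken: loop-closing branches typically jump
 * backward, so a negative displacement is predicted taken, while a forward
 * (positive-displacement) branch, such as an error check, is predicted not taken. */
static bool btfn_predict_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;
}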

Heuristic-based: void * p = malloc (numBytes); if (p == NULL) errorHandlingFunction( ); Mikko Lipasti-University of Wisconsin 35 Static Branch Prediction Heuristic Name Description Loop Branch If the branch target is back to the head of a loop, predict taken. Pointer If a branch compares a pointer with NULL, or if two pointers are compared, predict in the direction that corresponds to the pointer being not NULL, or the two pointers not being equal. Opcode If a branch is testing that an integer is less than zero, less than or equal to zero, or equal to a constant, predict in the direction that corresponds to the test evaluating to false. Guard If the operand of the branch instruction is a register that gets used before being redefined in the successor block, predict that the branch goes to the successor block. Loop Exit

If a branch occurs inside a loop, and neither of the targets is the loop head, then predict that the branch does not go to the successor that is the loop exit. Loop Header Predict that the successor block of a branch that is a loop header or a loop pre-header is taken. Call If a successor block contains a subroutine call, predict that the branch goes to that successor block. Store If a successor block contains a store instruction, predict that the branch does not go to that successor block. Return If a successor block contains a return from subroutine instruction, predict that the branch does not go to that successor block. Heuristic-based: Ball/Larus Thomas Ball and James R. Larus. Branch Prediction for Free. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 300-313, May 1993. Mikko Lipasti-University of Wisconsin 36 Static Branch Prediction Profile-based

1. Instrument program binary 2. Run with representative (?) input set 3. Recompile program a. Annotate branches with hint bits, or b. Restructure code to match predict not-taken Best performance: 75-80% accuracy Mikko Lipasti-University of Wisconsin 37 Dynamic Branch Prediction Main advantages: Learn branch behavior autonomously No compiler analysis, heuristics, or profiling Adapt to changing branch behavior Program phase changes branch behavior First proposed in 1979-1980 US Patent #4,370,711, Branch predictor using random access memory, James E. Smith Continually refined since then Mikko Lipasti-University of Wisconsin 38 Smith Predictor Hardware

(figure: the low-order m bits of the branch address index a table of 2^m k-bit saturating counters; each counter is incremented or decremented by the branch outcome, and its most significant bit provides the branch prediction) Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981. Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc. Mikko Lipasti-University of Wisconsin 39 Two-level Branch Prediction (figure: a pattern history table (PHT) of counters indexed by PC bits combined with the branch history register; example PC = 01011010010101)
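A minimal C sketch of the Smith-style table of saturating counters just described; the 12-bit index width and 2-bit counter width are illustrative assumptions rather than values from the slide.

#include <stdbool.h>
#include <stdint.h>

#define PRED_BITS  12                  /* assumed: 2^12-entry counter table */
#define PRED_SIZE  (1u << PRED_BITS)
#define CTR_MAX    3                   /* 2-bit saturating counter */

static uint8_t counters[PRED_SIZE];    /* zero-initialized: strongly not taken */

/* Low-order branch-address bits index the table; the counter's MSB is the prediction. */
static bool smith_predict(uint64_t pc)
{
    return counters[pc & (PRED_SIZE - 1)] >= 2;
}

/* Increment on taken, decrement on not taken, saturating at 0 and CTR_MAX. */
static void smith_update(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[pc & (PRED_SIZE - 1)];
    if (taken  && *c < CTR_MAX) (*c)++;
    if (!taken && *c > 0)       (*c)--;
}

With 2-bit counters the ">= 2" test is exactly the most-significant-bit read the slide describes.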

In the two-level example, BHR = 0110 and the selected PHT entry supplies the branch prediction. BHR adds global branch history Provides more context Can differentiate multiple instances of the same static branch Can correlate behavior across multiple static branches Mikko Lipasti-University of Wisconsin 40 Two-level Prediction: Local History (figure: low-order PC bits select an entry in a branch history table (BHT); example PC = 01011010010101)

The BHT entry (e.g. 0101110) holds that branch's own recent outcomes and indexes the PHT, which supplies the branch prediction. Detailed local history can be useful Mikko Lipasti-University of Wisconsin 41 Local History Predictor Example Loop closing branches Loop closing branch's history 11101110111011101110 (figure: PHT entries 0000 through 1111; the 0111 entry trains toward not taken) Must identify last instance Local history dedicates PHT entry to each instance 0111 entry predicts not taken Mikko Lipasti-University of Wisconsin 42 Two-level Taxonomy (1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor, 1991: Two-level prediction) Based on indices for branch history and pattern history BHR: {G,P,S}: {Global, Per-address, Set} PHT: {g,p,s}: {Global, Per-address, Set} 9 combinations: GAg, GAp, GAs, PAg, PAp, PAs, SAg, SAp and SAs Tse-Yu Yeh and Yale N. Patt. Two-Level Adaptive Branch Prediction. International Symposium on Microarchitecture, pages 51-61, November 1991. Mikko Lipasti-University of Wisconsin 43 Index Sharing in Two-level Predictors
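A minimal C sketch of the local-history idea from the Local History Predictor Example above: a per-branch history register selects the pattern history counter, so the recurring pattern before a loop exit gets its own counter. The table sizes and 4-bit history length are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS  10                   /* assumed: 2^10 per-branch history entries */
#define BHT_SIZE  (1u << BHT_BITS)
#define LH_BITS   4                    /* assumed: 4 bits of local history */
#define PHT_SIZE  (1u << LH_BITS)

static uint8_t bht[BHT_SIZE];          /* per-branch local-history shift registers */
static uint8_t pht[PHT_SIZE];          /* 2-bit counters, one per local pattern */

static bool local_predict(uint64_t pc)
{
    uint8_t hist = bht[pc & (BHT_SIZE - 1)] & (PHT_SIZE - 1);
    return pht[hist] >= 2;
}

static void local_update(uint64_t pc, bool taken)
{
    uint8_t *h = &bht[pc & (BHT_SIZE - 1)];
    uint8_t *c = &pht[*h & (PHT_SIZE - 1)];       /* counter that made the prediction */
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    *h = (uint8_t)(((*h << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1));
}

With this bit ordering, the counter reached through the 0111 pattern is the one that learns the loop-exit (not-taken) outcome, matching the example above.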

(1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor, 1991: Two-level prediction, 1993: gshare, tournament) (figure, GAp: a branch with BHR 1101 and PC 0110 and another with BHR 1001 and PC 1010 end up sharing PHT entry 1001; figure, gshare: XORing history with the PC gives 1101 XOR 0110 = 1011 and 1001 XOR 1010 = 0011, two distinct PHT entries) Use XOR function to achieve better utilization of PHT Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993. Used in e.g. IBM Power 4, Alpha 21264 Mikko Lipasti-University of Wisconsin 44 Sources of Mispredictions Lack of history (training time) Randomized behavior Usually due to randomized input data (benchmarks) Surprisingly few branches depend on input data values BHR capacity Correlate to branch that already shifted out E.g. loop count > BHR width PHT capacity
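A minimal C sketch of the gshare indexing described above: the global history register is XORed with low-order PC bits to index a single table of 2-bit counters. The 14-bit history/index width is an illustrative assumption.

#include <stdbool.h>
#include <stdint.h>

#define GH_BITS   14                   /* assumed history length and index width */
#define PHT_SIZE  (1u << GH_BITS)

static uint16_t ghr;                   /* global history register, newest outcome in the LSB */
static uint8_t  pht[PHT_SIZE];         /* 2-bit saturating counters */

static uint32_t gshare_index(uint64_t pc)
{
    /* XOR the global history with low-order PC bits (gshare-style index sharing). */
    return ((uint32_t)pc ^ ghr) & (PHT_SIZE - 1);
}

static bool gshare_predict(uint64_t pc)
{
    return pht[gshare_index(pc)] >= 2;
}

static void gshare_update(uint64_t pc, bool taken)
{
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1));
}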

Aliasing/interference Positive Negative Mikko Lipasti-University of Wisconsin 45 Reducing Interference Compulsory aliasing (cold miss) Not important (less than 1%) Only remedy is to set appropriate initial value Also: beware indexing schemes with high training cost (e.g. very long branch history) Capacity aliasing (capacity miss) Increase PHT size Conflict aliasing (conflict miss) Change indexing scheme or partition PHT in a clever fashion Mikko Lipasti-University of Wisconsin 46 Bi-Mode Predictor Branch Address

PHT0 PHT1 choice predictor Global BHR XOR PHT partitioned into T/NT halves Selector chooses source Final Prediction Reduces negative interference, since most entries in PHT 0 tend towards NT, and most entries in PHT1 tend towards T Used by ARM Cortex-A15 Mikko Lipasti-University of Wisconsin 47 gskewed Predictor PHT0 Branch Address PHT1 PHT2

f0 Global BHR f1 f2 Majority Final Prediction Multiple PHT banks indexed by different hash functions Conflicting branch pair unlikely to conflict in more than one PHT Majority vote determines prediction Used in Alpha EV8 (ultimately cancelled) P. Michaud, A. Seznec, and R. Uhlig. Trading Conflict and Capacity Aliasing in Conditional Branch Predictors. ISCA-24, June 1997 Mikko Lipasti-University of Wisconsin 48 Agree Predictor biasing bits Branch Address Global BHR XOR

PHT 1 0 Prediction 1 = agree with bias bit 0 = disagree Same principle as bi-mode PHT records whether branch bias matches outcome Exploits 70-80% static predictability Used in HP PA-8700 E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. ISCA-24, June 1997. Mikko Lipasti-University of Wisconsin 49 YAGS Predictor 1970: Flynn 1972: Riseman/Foster Branch Address choice
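A minimal C sketch of the agree predictor described above: the PHT, indexed gshare-style, learns agreement with a per-branch biasing bit rather than the raw direction. The table sizes and the set-bias-on-first-outcome policy are illustrative assumptions (a real design may store the bias in the BTB or set it statically).

#include <stdbool.h>
#include <stdint.h>

#define IDX_BITS  12
#define TBL_SIZE  (1u << IDX_BITS)

static uint16_t ghr;                         /* global history register */
static uint8_t  agree_pht[TBL_SIZE];         /* 2-bit counters: >= 2 means "agree with bias" */
static bool     bias_bit[TBL_SIZE];          /* per-branch biasing bits */
static bool     bias_set[TBL_SIZE];          /* has the bias been initialized? */

static uint32_t idx_of(uint64_t pc)  { return ((uint32_t)pc ^ ghr) & (TBL_SIZE - 1); }
static uint32_t bias_of(uint64_t pc) { return (uint32_t)pc & (TBL_SIZE - 1); }

static bool agree_predict(uint64_t pc)
{
    bool agrees = agree_pht[idx_of(pc)] >= 2;
    /* Follow the bias when the PHT says "agree", invert it when it says "disagree". */
    return agrees ? bias_bit[bias_of(pc)] : !bias_bit[bias_of(pc)];
}

static void agree_update(uint64_t pc, bool taken)
{
    uint32_t b = bias_of(pc);
    if (!bias_set[b]) { bias_bit[b] = taken; bias_set[b] = true; }  /* assumed policy */
    uint8_t *c = &agree_pht[idx_of(pc)];
    bool agreed = (taken == bias_bit[b]);
    if (agreed  && *c < 3) (*c)++;
    if (!agreed && *c > 0) (*c)--;
    ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (TBL_SIZE - 1));
}

Because most branches match their bias most of the time, the counters train toward "agree", which is what reduces negative interference between aliased branches.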

(figure: a choice PHT plus a T-cache and an NT-cache holding partial tags and 2-bit counters (2bC), indexed using the branch address and the global BHR (XOR); on a T/NT-cache hit the cached entry supplies the Final Prediction) 1998: Cache exceptions Based on bi-mode T/NT PHTs cache only the exceptions A. N. Eden and T. N. Mudge. The YAGS Branch Prediction Scheme. MICRO, Dec 1998. Mikko Lipasti-University of Wisconsin 50 Branch Confidence Estimation (1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor, 1991: Two-level prediction, 1993: gshare, tournament, 1996: Confidence estimation, 1998: Cache exceptions) (figure: the branch address and the global BHR are XORed to index a Table of CIRs, and a Reduction Function over the selected CIR produces the confidence estimate)

Confidence Prediction Limit speculation (energy), reverse predictions, guide fetch for multithreaded processors, choose best prediction Q Jacobson, E Rotenberg, and JE Smith. Assigning Confidence to Conditional Branch Predictions. MICRO, December 1996. Mikko Lipasti-University of Wisconsin 51 Dynamic History Length 1970: Flynn 1972: Riseman/Foster 1979: Smith Predictor 1991: Two-level prediction 1993: gshare, tournament 1996: Confidence estimation 1996: Vary history length 1998: Cache exceptions Branch history length: Some prefer short history (less training time) Some require longer history (complex behavior) Vary history length

Choose through profile/compile-time hints Or learn dynamically References Maria-Dana Tarlescu, Kevin B. Theobald, and Guang R. Gao. Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy. ICCD, October 1996. Toni Juan, Sanji Sanjeevan, and Juan J. Navarro. Dynamic History-Length Fitting: A Third Level of Adaptivity for Branch Prediction. ISCA, June 1998. Jared Stark, Marius Evers, and Yale N. Patt. Variable Path Branch Prediction. ACM SIGPLAN Notices, 33(11):170-179, 1998 Mikko Lipasti-University of Wisconsin 52 Loop Count Predictors BHR entry: a History/Count (H/C) field plus a 1-bit mode bit (H = history mode, C = count mode) To predict last loop iterations NT branch: Must have length(BHR) > loop count Not feasible for large loop counts

Instead, BHR has mode bit Once history == 11111 or 00000 switch to count mode Now nth entry in PHT trains to NT and predicts nth iteration as last one Now length(BHR) > log2(loop count) is sufficient Used in Intel Pentium M/Core Duo/Core 2 Duo Mikko Lipasti-University of Wisconsin 53 Path History (example control flow: block A: if (y == 0) goto C, history = T; block B: if (y == 5) goto C, history = T; block C: if (y < 12) goto D, history = TT; block D: if (y % 2) goto E) Path ACD: Branch Address = X, Branch History = TT, Branch Outcome = Not Taken. Path BCD: Branch Address = X, Branch History = TT, Branch Outcome = Taken. Sometimes T/NT history is not enough Path history (PC values) can help Mikko Lipasti-University of Wisconsin 54 Understanding Advanced Predictors Four types of history Local (bimodal) history (Smith predictor) Table of counters summarizes local history Simple, but only effective for biased branches

Local outcome history (correlate with self) Shift register of individual branch outcomes Separate counter for each outcome history (M-F vs Sat/Sun) Global outcome history (correlate with others) Shift register of recent branch outcomes Separate counter for each outcome history Path history (overcomes CFG convergence aliasing) Shift register of recent (partial) block addresses Can differentiate similar global outcome histories Can combine histories in many ways Mikko Lipasti-University of Wisconsin 55 Understanding Advanced Predictors History length Short history: lower training cost Long history: captures macro-level behavior Variable history length predictors Really long history (long loops) Loop count predictors Fourier transform into frequency domain Kampe et al., The FAB Predictor, HPCA 2002 Limited capacity & interference

Constructive vs. destructive Bi-mode, gskewed, agree, YAGS Sec. 9.3.2 provides good overview Mikko Lipasti-University of Wisconsin 56 Perceptron Branch Prediction [Jimenez, Lin HPCA 2001] (1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor, 1991: Two-level prediction, 1993: gshare, tournament, 1996: Confidence estimation, 1996: Vary history length, 1998: Cache exceptions, 2001: Neural predictor) Perceptron output: y = w0 + sum over i = 1..n of xi * wi Basis in AI concept [1962] Computes boolean result based on multiple weighted inputs xi from branch history (+1 for T, -1 for NT) wi incremented whenever branch outcome matches xi Finds correlation between current branch and any subset of prior branches Adapted for branch prediction Mikko Lipasti-University of Wisconsin 57 Perceptrons - Implementation Complex dot product must be computed for every prediction Too slow Arithmetic tricks, pipelining: Daniel A. Jimenez and Calvin Lin. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, 20(4):369-397, November 2002. Analog circuit implementation also possible Amant, Jimenez, Burger, MICRO 2008 Key insights: Not all branches in history are important, weights learn this

Long histories are useful Mikko Lipasti-University of Wisconsin 58 Practical Neural Predictors (figure: the PC and BHR are combined into a feature that indexes a table of precomputed responses, which is compared against a threshold t) Approximate dot product function with precomputed responses Update (inc/dec) response based on outcomes
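A minimal C sketch of the perceptron prediction and training rule from the slides above (y = w0 + sum of xi*wi, with xi = +1 for taken and -1 for not taken); the history length, number of perceptrons, weight width, and training threshold are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST_LEN    32                 /* assumed global history length */
#define NUM_PERCEP  512                /* assumed number of perceptrons */
#define THETA       ((int)(1.93 * HIST_LEN + 14))   /* assumed training threshold */

static int8_t weights[NUM_PERCEP][HIST_LEN + 1];    /* w0 is the bias weight */
static int8_t ghist[HIST_LEN];                      /* +1 = taken, -1 = not taken, 0 = unknown */

static int perceptron_output(uint64_t pc)
{
    const int8_t *w = weights[pc % NUM_PERCEP];
    int y = w[0];                                   /* y = w0 + sum(xi * wi) */
    for (int i = 1; i <= HIST_LEN; i++)
        y += w[i] * ghist[i - 1];
    return y;
}

static bool perceptron_predict(uint64_t pc)
{
    return perceptron_output(pc) >= 0;              /* nonnegative output: predict taken */
}

static void saturating_add(int8_t *w, int d)
{
    if ((d > 0 && *w < 127) || (d < 0 && *w > -127))
        *w = (int8_t)(*w + d);
}

static void perceptron_update(uint64_t pc, bool taken)
{
    int y = perceptron_output(pc);
    int t = taken ? 1 : -1;
    int8_t *w = weights[pc % NUM_PERCEP];
    /* Train on a misprediction or when the output magnitude is below the threshold. */
    if ((y >= 0) != taken || abs(y) <= THETA) {
        saturating_add(&w[0], t);
        for (int i = 1; i <= HIST_LEN; i++) {
            int8_t x = ghist[i - 1];
            if (x != 0)                                  /* skip history not yet recorded */
                saturating_add(&w[i], x == t ? 1 : -1);  /* strengthen wi when outcome matches xi */
        }
    }
    /* Shift the new outcome into the global history. */
    for (int i = HIST_LEN - 1; i > 0; i--)
        ghist[i] = ghist[i - 1];
    ghist[0] = (int8_t)t;
}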

response response response response response response >t Many possible features (local, global, path, ) Responses updated based on neuron-like model Threshold tuned and/or updated Recent designs from AMD, Samsung claim neural predictor This slide is my best guess as to what that means Some hints: Multiperspective Perceptron Predictor, Daniel Jimenez, CPB-5, ISCA 2016. Mikko Lipasti-University of Wisconsin 60 Overriding Predictors Different types of history E.g. Bimodal, Local, Global (BLG)

Different history lengths How to choose? Metapredictor/selector? Expensive, slow to train Tag match with most sophisticated predictor entry Parallel tag check with B, L, G, long-history G Choose most sophisticated prediction Fancy predictors only updated when simple ones fail Mikko Lipasti-University of Wisconsin 61 Prediction by Partial Matching 1970: Flynn 1972: Riseman/Foster 1979: Smith Predictor 1991: Two-level prediction 1993: gshare, tournament 1996: Confidence estimation 1996: Vary history length 1998: Cache exceptions 2001: Neural predictor 2004: PPM [P. Michaud, CBP-1 2004, JILP 2005] Elegant approach for choosing from several predictors, based on PPM data compression Partial tags like YAGS, varying history lengths Mikko Lipasti-University of Wisconsin 62

Current State of the Art (1970: Flynn, 1972: Riseman/Foster, 1979: Smith Predictor, 1991: Two-level prediction, 1993: gshare, tournament, 1996: Confidence estimation, 1996: Vary history length, 1998: Cache exceptions, 2001: Neural predictor, 2004: PPM, 2006: TAGE) Key concepts Different history types (B,L,G) Geometric series history lengths Some branches prefer short, others long Use geometric series [Seznec, CBP-1, OGEHL] Cache only exceptions (YAGS/PPM) Confidence estimation [Jacobson et al, MICRO 1996] Tagged Geometric History Length (TAGE) A. Seznec, P. Michaud, A case for (partially) tagged Geometric History Length Branch Prediction, Journal of Instruction Level Parallelism, Feb. 2006 Mikko Lipasti-University of Wisconsin 63

TAGE Predictor Multiple tagged tables, use different global history lengths Set of history lengths forms a geometric series {0, 2, 4, 8, 16, 32, 64, 128, 256, ..., 2048} most of the storage for short history !! Mikko Lipasti-University of Wisconsin 64

Tagged Geometric History Length (TAGE) (figure: a base predictor indexed by the PC alone, plus tagged tables indexed by hashes of the PC with history segments h[0 - L1], h[0 - L2], h[0 - L3]; each tagged entry holds a prediction (pred) and a partial tag, a tag compare signals hit or miss, and the longest-history hit is chosen) Longest matching table provides the prediction, subject to branch confidence Mikko Lipasti-University of Wisconsin 65

TAGE Tweaks to basic concept still win CBP-6 1st place: TAGE-SC-L 2nd place: Perceptron+TAGE hybrid State of the art, but Rumors of real implementation Very energy-intensive (parallel lookups)
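A condensed C sketch of the TAGE lookup described above: tagged tables indexed with hashes of the PC and geometrically increasing history lengths are searched from longest to shortest history, and the first tag match provides the prediction, falling back to a base bimodal table. The hash functions, table sizes, and entry format here are simplified illustrative assumptions, not the full TAGE algorithm.

#include <stdbool.h>
#include <stdint.h>

#define NTABLES  4
#define TBITS    10
#define TSIZE    (1u << TBITS)

/* Assumed geometric series of history lengths. */
static const int hist_len[NTABLES] = { 8, 16, 32, 64 };

typedef struct { bool valid; uint16_t tag; int8_t ctr; } tage_entry;  /* ctr >= 0 means taken */

static tage_entry tables[NTABLES][TSIZE];
static uint8_t    base_bimodal[TSIZE];   /* 2-bit counters: the base predictor */
static uint64_t   ghr;                   /* global history register, newest outcome in the LSB */

/* Toy hash: fold the low 'len' history bits and the PC into 'out_bits' bits. */
static uint32_t fold(uint64_t pc, int len, int out_bits)
{
    uint64_t h = (len >= 64) ? ghr : (ghr & ((1ull << len) - 1));
    uint64_t x = pc ^ (pc >> out_bits);
    while (h) { x ^= h; h >>= out_bits; }
    return (uint32_t)(x & ((1u << out_bits) - 1));
}

static bool tage_predict(uint64_t pc)
{
    /* Search the tagged tables from the longest history down; first tag hit wins. */
    for (int t = NTABLES - 1; t >= 0; t--) {
        uint32_t idx = fold(pc, hist_len[t], TBITS);
        uint16_t tag = (uint16_t)fold(pc ^ 0x9e3779b9u, hist_len[t], TBITS);
        if (tables[t][idx].valid && tables[t][idx].tag == tag)
            return tables[t][idx].ctr >= 0;
    }
    /* No tag match: fall back to the base (bimodal) predictor. */
    return base_bimodal[pc & (TSIZE - 1)] >= 2;
}

Only the lookup is shown; real TAGE also updates the matching counter, allocates a longer-history entry on a misprediction, and tracks per-entry usefulness, which is part of the complexity the slides mention.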

Complex update rules Real opportunity exists for improvement 2016: Still TAGE vs Neural Mikko Lipasti-University of Wisconsin 66 1970: Flynn 1972: Riseman/Foster TAGE vs. Neural Neural: ARM, AMD, Samsung TAGE: Intel, ??? 1991: Two-level prediction Similarity 1993: gshare, tournament 1979: Smith Predictor 1996: Confidence estimation 1996: Vary history length 1998: Cache exceptions Many sources or features Key difference: how to combine them 2001: Neural predictor 2004: PPM 2006: TAGE TAGE: Override via partial match

Neural: integrate + threshold Every CBP is a cage match Seznec vs. Jimenez 2016: Still TAGE vs Neural Mikko Lipasti-University of Wisconsin 67 Instruction Flow Techniques Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch Mikko Lipasti-University of Wisconsin 68 Branch Target Prediction Branch Address Branch Target Buffer

(figure: the branch address indexes the Branch Target Buffer, whose entries each hold a partial tag and a target; tag compares determine a BTB hit, the not-taken target is the branch address plus the size of the instruction, and the Branch Direction Predictor selects between the taken target from the BTB and the not-taken target to form the predicted Branch Target) Partial tags sufficient in BTB Mikko Lipasti-University of Wisconsin 69 Return Address Stack (figure (a): the branch address indexes the BTB, which supplies the target prediction; figure (b): the BTB also records whether the instruction is a return, and if so the Return Address Stack supplies the target prediction instead)
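A minimal C sketch of a direct-mapped BTB with partial tags, in the spirit of the branch target prediction slide above; the entry count, partial-tag width, and update-on-taken policy are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BTB_BITS  9                    /* assumed: 512-entry, direct-mapped BTB */
#define BTB_SIZE  (1u << BTB_BITS)
#define TAG_BITS  8                    /* partial tag: a few PC bits are sufficient */

typedef struct {
    uint8_t  tag;       /* partial tag taken from higher PC bits */
    uint64_t target;    /* predicted taken-target address */
    bool     valid;
} btb_entry;

static btb_entry btb[BTB_SIZE];

/* Look up the fetch address; on a hit with a taken direction prediction,
 * redirect fetch to the stored target, otherwise fall through to pc + inst_size. */
static uint64_t btb_next_fetch(uint64_t pc, unsigned inst_size, bool predict_taken)
{
    uint32_t idx = (uint32_t)(pc >> 2) & (BTB_SIZE - 1);
    uint8_t  tag = (uint8_t)((pc >> (2 + BTB_BITS)) & ((1u << TAG_BITS) - 1));
    if (btb[idx].valid && btb[idx].tag == tag && predict_taken)
        return btb[idx].target;
    return pc + inst_size;             /* not taken or BTB miss: sequential fetch */
}

/* When a taken branch resolves, install or refresh its entry. */
static void btb_update(uint64_t pc, uint64_t target)
{
    uint32_t idx = (uint32_t)(pc >> 2) & (BTB_SIZE - 1);
    btb[idx].tag    = (uint8_t)((pc >> (2 + BTB_BITS)) & ((1u << TAG_BITS) - 1));
    btb[idx].target = target;
    btb[idx].valid  = true;
}

For returns, a real front end would override the BTB target with the return address stack, as in part (b) of the figure above.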

Speculative update causes headaches On each predicted branch, checkpoint head/tail Further, checkpoint stack contents since speculative pop/push sequence is destructive Conditional call/return causes more headaches Mikko Lipasti-University of Wisconsin 70 Indirect Branches Tagged target cache Chang et al., Target Prediction for Indirect Jumps, ISCA 1997 Mikko Lipasti-University of Wisconsin 71 Indirect Branches ITTAGE proposed in same 2006 paper as TAGE A. Seznec, P. Michaud, A case for (partially) tagged Geometric History Length Branch Prediction, Journal of Instruction Level Parallelism, Feb. 2006 Mikko Lipasti-University of Wisconsin 72 Indirect Branches CBP-3 had an indirect prediction track #1: A. Seznec, A 64-Kbytes ITTAGE indirect branch predictor, MPPKI 34.1

#2: Y. Ishii, T. Sawada, K. Kuroyanagi, M. Inaba, K. Hiraki, Bimode Cascading: Adaptive Rehashing for ITTAGE Indirect Branch Predictor, MPPKI 37.0 #3: N. Bhansali, C. Panirwala, H. Zhou, Exploring Correlation for Indirect Branch Prediction, MPPKI 51.6 #4: Daniel A. Jimenez, SNIP: Scaled Neural Indirect Predictor, MPPKI 52.9 Mikko Lipasti-University of Wisconsin 73 High-Bandwidth Fetch: Collapsing Buffer (figure: the branch address indexes an interleaved BTB that supplies two cache line addresses; two cache banks deliver instruction blocks A B C D and E F G H; valid instruction bits, an interchange switch, and a collapsing circuit pass A B C E G on to the decode stage) Fetch from two cache blocks, rotate, collapse past taken branches Thomas M. Conte, Kishore N. Menezes, Patrick M. Mills and Burzin A. Patel. Optimization of Instruction Fetch Mechanisms for High Issue Rates. International Symposium on Computer Architecture, June 1995. Mikko Lipasti-University of Wisconsin 74 High-Bandwidth Fetch: Trace Cache (figure (a): instruction cache lines holding blocks A B, C D, E F G, and H I J; figure (b): a trace cache line holding the dynamic sequence A B C D E F G H I J) Fold out taken branches by tracing instructions as they commit into a fill buffer Eric Rotenberg, S. Bennett, and James E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. MICRO, December 1996. Mikko Lipasti-University of Wisconsin 75 Intel Pentium 4 Trace Cache Front-End BTB Instruction TLB and Prefetcher Level-Two Unified Data and Instruction Cache Instruction Decode Trace Cache BTB

Trace Cache Instruction Fetch Queue To renamer, execute, etc. No first-level instruction cache: trace cache only Trace cache BTB identifies next trace Miss leads to fetch from level two cache Trace cache instructions are decoded (uops) Cache capacity 12k uops Overwhelmed for database applications Serial decoder becomes performance bottleneck Mikko Lipasti-University of Wisconsin 76 High-Bandwidth Fetch: Loop Buffers Loop Body Fetch/Decode/Dispatch Buffer bc History: AMD29K Branch Target Cache

Don't cache the target address; cache 4 instructions from the target itself Avoid accessing I$ for first fetch group following a taken branch If loop body is <= 4 instructions, effectively a loop cache Room for 32/64 branch targets Also common in DSP designs, under s/w control (e.g. Lucent) Introduced in Intel Merom (Core 2 Duo) Fetch buffer detects short backward branches, inhibits refetch from I$ Intel Nehalem (Core i7) Moved loop buffer after decoders: contains uops Intel Sandy Bridge General-purpose uop cache (not just loops) 1.5K capacity Mikko Lipasti-University of Wisconsin 77 High Frequency: Next-line Prediction (figure: cache lines A B C D, E F G H, and I J K L, each stored with its tag and a next-line prediction; in cycle 1 and cycle 2 the embedded next-line prediction selects the following line, the tag check for cycle 1's lookup and for cycle 2's lookup completes a cycle later, and a next-line misprediction redirects the target prediction) Embed next fetch address in instruction cache Enables high-frequency back-to-back fetch

Brad Calder and Dirk Grunwald. Next Cache Line and Set Prediction. International Symposium on Computer Architecture, pages 287-296, June 1995. Mikko Lipasti-University of Wisconsin 78 High Frequency: Overriding Predictors (figure: a small, fast predictor in stage 1, the instruction cache in stage 2, and the fetch queue in stage 3, with a slow overriding predictor operating alongside; cycle by cycle, the fast predictions for A, B, and C overlap with the slower overriding predictions and the fetch and queueing of A and B) If slow predict agrees with fast predict, do nothing If predictions do not match, flush A, B, and C, and restart fetch at new predicted target Simple, fast predictor turns around every cycle Smarter, slower predictor can override Widely used: PowerPC 604, 620, Alpha 21264 Mikko Lipasti-University of Wisconsin 79 Instruction Flow Summary
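A small C sketch of the overriding arrangement described above: the fast prediction steers fetch immediately, and when the slower predictor's result arrives the front end flushes and redirects only if the two disagree. The structure and function names are illustrative.

#include <stdint.h>

typedef struct {
    uint64_t fetch_pc;        /* block fetched under the fast prediction */
    uint64_t fast_next_pc;    /* fast predictor's next fetch address */
} pending_pred;

/* Called when the slow, more accurate prediction for 'p' becomes available.
 * Returns the address fetch should continue from. */
static uint64_t resolve_override(pending_pred p, uint64_t slow_next_pc,
                                 void (*flush_younger_than)(uint64_t pc))
{
    if (slow_next_pc == p.fast_next_pc)
        return p.fast_next_pc;            /* agreement: do nothing */
    /* Disagreement: flush the speculatively fetched younger blocks and
     * restart fetch at the overriding predictor's target. */
    flush_younger_than(p.fetch_pc);
    return slow_next_pc;
}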

Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch Mikko Lipasti-University of Wisconsin 80
