Computer Architecture: A Quantitative Approach, Sixth Edition
Chapter 7: Domain-Specific Architectures

Introduction

Moore's Law enabled:

- Deep memory hierarchy
- Wide SIMD units
- Deep pipelines
- Branch prediction
- Out-of-order execution
- Speculative prefetching
- Multithreading
- Multiprocessing

Objective: extract performance from software that is oblivious to architecture.

Going further requires a factor-of-100 improvement in the number of operations per instruction.

This requires domain-specific architectures (DSAs):
- For ASICs, the NRE (non-recurring engineering) cost cannot be amortized over large volumes
- FPGAs are less efficient than ASICs

Guidelines for DSAs

1. Use dedicated memories to minimize data movement
2. Invest resources into more arithmetic units or bigger memories
3. Use the easiest form of parallelism that matches the domain
4. Reduce data size and type to the simplest needed for the domain
5. Use a domain-specific programming language

Example: Deep Neural Networks

- Inspired by the neurons of the brain
- Each neuron computes a non-linear activation function of the weighted sum of its input values
- Neurons are arranged in layers
- Most practitioners will choose an existing design:
  - Topology
  - Data type
- Training (learning):

  - Calculate weights using the backpropagation algorithm
  - Supervised learning: stochastic gradient descent
- Inference: use the neural network for classification

Example: Deep Neural Networks - Multi-Layer Perceptrons

Parameters:

- Dim[i]: number of neurons in layer i
- Dim[i-1]: dimension of the input vector
- Number of weights: Dim[i-1] x Dim[i]
- Operations: 2 x Dim[i-1] x Dim[i]
- Operations/weight: 2
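These per-layer counts can be checked with a few lines of arithmetic. Below is a small illustrative sketch; the function name and example layer sizes are my own, not from the text:

```python
# Sketch: weight and operation counts for one fully connected (MLP) layer.
# dim_prev = Dim[i-1] (input vector length), dim = Dim[i] (number of neurons).
def mlp_layer_counts(dim_prev, dim):
    weights = dim_prev * dim          # one weight per input/neuron pair
    ops = 2 * dim_prev * dim          # one multiply + one add per weight
    return weights, ops, ops / weights

# Example: a 4096 -> 2048 layer gives 2 operations per weight.
print(mlp_layer_counts(4096, 2048))   # (8388608, 16777216, 2.0)
```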

Example: Deep Neural Networks - Convolutional Neural Networks

- Used for computer vision
- Each layer raises the level of abstraction:
  - First layer recognizes horizontal and vertical lines
  - Second layer recognizes corners
  - Third layer recognizes shapes
  - Fourth layer recognizes features, such as the ears of a dog
  - Higher layers recognize different breeds of dogs

Parameters:

- DimFM[i-1]: dimension of the (square) input feature map
- DimFM[i]: dimension of the (square) output feature map
- DimSten[i]: dimension of the (square) stencil
- NumFM[i-1]: number of input feature maps
- NumFM[i]: number of output feature maps
- Number of neurons: NumFM[i] x DimFM[i]^2
- Number of weights per output feature map: NumFM[i-1] x DimSten[i]^2
- Total number of weights per layer: NumFM[i] x number of weights per output feature map
- Number of operations per output feature map: 2 x DimFM[i]^2 x number of weights per output feature map
- Total number of operations per layer: NumFM[i] x number of operations per output feature map
  = 2 x DimFM[i]^2 x NumFM[i] x number of weights per output feature map
  = 2 x DimFM[i]^2 x total number of weights per layer
- Operations/weight: 2 x DimFM[i]^2
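As an illustration of these formulas, here is a sketch that computes the per-layer counts; the function name and the example layer shape are assumptions for demonstration only:

```python
# Sketch: per-layer counts for a convolutional layer, following the slide's formulas.
# All feature maps and stencils are assumed square.
def cnn_layer_counts(num_fm_prev, num_fm, dim_fm, dim_sten):
    neurons = num_fm * dim_fm ** 2
    weights_per_out_fm = num_fm_prev * dim_sten ** 2
    total_weights = num_fm * weights_per_out_fm
    ops_per_out_fm = 2 * dim_fm ** 2 * weights_per_out_fm
    total_ops = num_fm * ops_per_out_fm        # = 2 * dim_fm**2 * total_weights
    return total_weights, total_ops, total_ops / total_weights

# Example: 64 -> 128 feature maps, 56x56 output, 3x3 stencil:
# operations/weight = 2 * 56^2 = 6272, far more weight reuse than an MLP layer.
print(cnn_layer_counts(64, 128, 56, 3))
```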

Example: Deep Neural Networks - Recurrent Neural Networks

- Used for speech recognition and language translation
- Long short-term memory (LSTM) network

Parameters:
- Number of weights per cell: 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim^2
- Number of operations for the 5 vector-matrix multiplies per cell: 2 x number of weights per cell

  = 24 x Dim^2
- Number of operations for the 3 element-wise multiplies and 1 addition (vectors are all the size of the output): 4 x Dim
- Total number of operations per cell (5 vector-matrix multiplies and the 4 element-wise operations): 24 x Dim^2 + 4 x Dim
- Operations/weight: ~2
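A corresponding sketch for the LSTM cell counts; the function name and the example Dim are illustrative, not from the text:

```python
# Sketch: per-cell counts for an LSTM cell, following the slide's formulas
# (5 vector-matrix multiplies plus 4 element-wise vector operations).
def lstm_cell_counts(dim):
    weights = 12 * dim ** 2          # 3 x (3 Dim x Dim) + 2 (Dim x Dim) + 1 (Dim x Dim)
    matmul_ops = 2 * weights         # = 24 * Dim^2
    elementwise_ops = 4 * dim        # 3 multiplies + 1 add on Dim-length vectors
    total_ops = matmul_ops + elementwise_ops
    return weights, total_ops, total_ops / weights

# Example: Dim = 1024 gives roughly 2 operations per weight,
# so LSTMs, like MLPs, offer little weight reuse without batching.
print(lstm_cell_counts(1024))
```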

Example: Deep Neural Networks - Batches and Quantization

- Batches:
  - Reuse weights once fetched from memory across multiple inputs
  - Increases operational intensity
- Quantization:
  - Use 8- or 16-bit fixed point
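To see why batching raises operational intensity, here is a rough back-of-the-envelope sketch; the layer size, batch sizes, and the simplification of counting only weight bytes are assumptions:

```python
# Sketch: how batching raises operational intensity for a fully connected layer.
# Operational intensity here = operations per weight byte fetched from memory
# (a simplification that ignores activation traffic).
def operational_intensity(dim_prev, dim, batch, bytes_per_weight=1):
    weights = dim_prev * dim
    ops = 2 * weights * batch            # fetched weights are reused across the batch
    return ops / (weights * bytes_per_weight)

# Example (8-bit weights): batch of 1 -> 2 ops/byte, batch of 128 -> 256 ops/byte.
print(operational_intensity(4096, 2048, 1), operational_intensity(4096, 2048, 128))
```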

Summary: DNNs need the following kernels:
- Matrix-vector multiply
- Matrix-matrix multiply
- Stencil
- ReLU
- Sigmoid
- Hyperbolic tangent
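For concreteness, a minimal NumPy sketch of the three non-linear kernels named above; this is illustrative only, since production kernels would be fused with the matrix operations and quantized:

```python
import numpy as np

# Sketch: the three non-linear activation kernels listed above.
def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x))
```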

Tensor Processing Unit

- Google's DNN ASIC
- 256 x 256 8-bit matrix multiply unit
- Large software-managed scratchpad
- Coprocessor on the PCIe bus

TPU ISA

- Read_Host_Memory: reads memory from the CPU host memory into the unified buffer
- Read_Weights: reads weights from the Weight Memory into the Weight FIFO as input to the Matrix Unit
- MatrixMatrixMultiply/Convolve: performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise matrix multiply, an element-wise vector multiply, or a convolution from the Unified Buffer into the accumulators; takes a variable-sized B*256 input, multiplies it by a 256x256 constant input, and produces a B*256 output, taking B pipelined cycles to complete
- Activate: computes the activation function
- Write_Host_Memory: writes data from the unified buffer into host memory
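A functional sketch of the arithmetic MatrixMatrixMultiply performs, not the TPU's actual programming interface: a variable-sized B x 256 input is multiplied by the 256 x 256 constant weight input, accumulating into wider accumulators.

```python
import numpy as np

# Functional sketch (not the real TPU interface): the arithmetic that
# MatrixMatrixMultiply performs. B is the batch size; weights are the
# 256x256 constant input held in the matrix unit.
def matrix_multiply(inputs_u8, weights_s8):
    B = inputs_u8.shape[0]
    assert inputs_u8.shape == (B, 256) and weights_s8.shape == (256, 256)
    # 8-bit multiplies accumulate into wider (32-bit) accumulators.
    return inputs_u8.astype(np.int32) @ weights_s8.astype(np.int32)

acc = matrix_multiply(np.ones((4, 256), dtype=np.uint8),
                      np.ones((256, 256), dtype=np.int8))
print(acc.shape)   # (4, 256); the hardware takes B pipelined cycles
```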

Improving the TPU [figure slide]

The TPU and the Guidelines

- Use dedicated memories: 24 MiB dedicated unified buffer, 4 MiB accumulator buffers
- Invest resources in arithmetic units and dedicated memories: 60% of the memory and 250X the arithmetic units of a server-class CPU
- Use the easiest form of parallelism that matches the domain: exploits 2D SIMD parallelism
- Reduce the data size and type needed for the domain: primarily uses 8-bit integers
- Use a domain-specific programming language: uses TensorFlow

Microsoft Catapult

- Needed to be general purpose and power efficient
- Uses an FPGA PCIe board with a dedicated 20 Gbps network in a 6 x 8 torus
- Each of the 48 servers in half a rack has a Catapult board
- Limited to 25 watts
- 32 MiB flash memory
- Two banks of DDR3-1600 (11 GB/s) and 8 GiB of DRAM
- FPGA (unconfigured) has 3926 18-bit ALUs and 5 MiB of on-chip memory

- Programmed in Verilog RTL
- Shell (reusable infrastructure logic) is 23% of the FPGA

Microsoft Catapult: CNN

- CNN accelerator, mapped across multiple FPGAs

Microsoft Catapult: Search Ranking

- Feature extraction (1 FPGA):
  - Extracts 4500 features for every document-query pair, e.g., the frequency with which the query appears in the page
  - Uses a systolic array of FSMs
- Free-form expressions (2 FPGAs):
  - Calculates feature combinations
- Machine-learned scoring (1 FPGA for compression, 3 FPGAs calculate score):
  - Uses the results of the previous two stages to calculate a floating-point score
- One FPGA allocated as a hot-spare

Free-form expression evaluation:

- 60-core processor
- Pipelined cores
- Each core supports four threads that can hide each other's latency
- Threads are statically prioritized according to thread latency

Version 2 of Catapult

- Placed the FPGA between the CPU and NIC
- Increased network from 10 Gb/s to 40 Gb/s
- Also performs network acceleration
- Shell now consumes 44% of the FPGA
- FPGA now performs only feature extraction

Microsoft Catapult and the Guidelines

- Use dedicated memories: 5 MiB dedicated memory
- Invest resources in arithmetic units and dedicated memories: 3926 ALUs
- Use the easiest form of parallelism that matches the domain: 2D SIMD for CNN, MISD parallelism for search scoring
- Reduce the data size and type needed for the domain: uses a mixture of 8-bit integers and 64-bit floating-point
- Use a domain-specific programming language: uses Verilog RTL; Microsoft did not follow this guideline

Intel Crest

- DNN training
- 16-bit fixed point
- Operates on blocks of 32x32 matrices
- SRAM + HBM2
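To illustrate what operating on 32x32 matrix blocks means, here is a generic blocked matrix-multiply sketch; it is not Crest's actual datapath or data type (float32 is used for clarity rather than 16-bit fixed point):

```python
import numpy as np

# Sketch: a generic blocked matrix multiply with 32x32 tiles, illustrating
# computation on 32x32 blocks (not Crest's actual microarchitecture).
def blocked_matmul(A, B, tile=32):
    n = A.shape[0]                        # assume square matrices, n divisible by tile
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
print(np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3))
```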

Pixel Visual Core

- Image Processing Unit (IPU)
- Performs stencil operations
- Descended from the Image Signal Processor (ISP)
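A minimal sketch of a 2D stencil operation of the kind the IPU accelerates, assuming a 3x3 blur kernel; real Pixel Visual Core code would be written in Halide, not Python:

```python
import numpy as np

# Sketch: a 3x3 stencil (blur) over a 2D image, the style of computation the
# Pixel Visual Core's 2D SIMD array is built to run (illustrative only).
def stencil_3x3(image, kernel):
    H, W = image.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    for y in range(H - 2):
        for x in range(W - 2):
            out[y, x] = np.sum(image[y:y+3, x:x+3] * kernel)
    return out

blur = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)
img = np.random.rand(8, 8).astype(np.float32)
print(stencil_3x3(img, blur).shape)   # (6, 6)
```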

- Software written in Halide, a domain-specific language
- Optimized for energy
- Compiled to a virtual ISA (vISA)
- vISA is lowered to the physical ISA (pISA) using application-specific parameters
- pISA is VLIW
- Power budget is 6 to 8 W for bursts of 10-20 seconds,

  dropping to tens of milliwatts when not in use
- An 8-bit DRAM access costs energy equivalent to 12,500 8-bit integer operations or 7 to 100 8-bit SRAM accesses
- IEEE 754 floating-point operations cost 22X to 150X as much as 8-bit integer operations
- Optimized for 2D access:
  - 2D SIMD unit
  - On-chip SRAM structured using a square geometry
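A back-of-the-envelope energy sketch using the ratios above; the absolute energy unit and the specific values chosen inside the SRAM and floating-point ranges are assumptions:

```python
# Back-of-the-envelope sketch using the relative energy costs above.
# Only the ratios come from the slide; the 8-bit-op energy is an arbitrary unit.
E_INT8_OP = 1.0
E_SRAM_ACCESS = 50.0        # slide: 7 to 100 int8 ops per 8-bit SRAM access (midpoint assumed)
E_DRAM_ACCESS = 12_500.0    # slide: one 8-bit DRAM access ~ 12,500 int8 ops
E_FP_OP = 100.0             # slide: IEEE 754 ops cost 22X to 150X an int8 op (value assumed)

def kernel_energy(int8_ops, sram_accesses, dram_accesses, fp_ops=0):
    return (int8_ops * E_INT8_OP + sram_accesses * E_SRAM_ACCESS +
            dram_accesses * E_DRAM_ACCESS + fp_ops * E_FP_OP)

# A kernel that touches DRAM even once per output is dominated by memory energy:
print(kernel_energy(int8_ops=1_000, sram_accesses=100, dram_accesses=1))
```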

Pixel Visual Core and the Guidelines

- Use dedicated memories: 128 + 64 MiB dedicated memory per core
- Invest resources in arithmetic units and dedicated memories: 16x16 2D array of processing elements per core and a 2D shifting network per core
- Use the easiest form of parallelism that matches the domain: 2D SIMD and VLIW
- Reduce the data size and type needed for the domain: uses a mixture of 8-bit and 16-bit integers

- Use a domain-specific programming language: Halide for image processing and TensorFlow for CNNs

Fallacies and Pitfalls

- It costs $100 million to design a custom chip
- Performance counters added as an afterthought
- Architects are tackling the right DNN tasks
- For DNN hardware, inferences per second (IPS) is a fair summary performance metric
- Being ignorant of architecture history when designing a DSA
