Performance Considerations for Packet ... - The Fast Data Project

Performance Considerations for Packet Processing on Intel Architecture
Patrick Lu, Intel DCG/NPG/ASE
Acknowledgement: Roman Dementiev, John DiGiglio, Andi Kleen, Maciek Konstantynowicz, Sergio Gonzalez Monroy, Shrikant Shah, George Tkachuk, Vish Viswanathan, Jeremy Williamson

Agenda
  Uncore Performance Monitoring with Hardware Counters
    Measure PCIe Bandwidth
    Measure Memory Bandwidth
  Core Performance Monitoring
    IPC revisit
    Intel Processor Trace for Performance Optimization

Terminology sync
  Core = execution units (ALU) + L1 cache + L2 cache
  Uncore = everything not core: LLC, memory controller, IIO controller, QPI controller, and more!

VPP and DPDK packet processing using Direct Data I/O
Minimal memory traffic per packet.
  (1) Core writes Rx descriptor in preparation for receiving a packet.
  (2) NIC reads Rx descriptor to get ctrl flags and buffer address.
  (3) NIC writes the packet.
  (4) NIC writes Rx descriptor.
  (5) Core reads Rx descriptor (polling or irq or coalesced irq).
  (6) Core reads packet header to determine action.
  (7) Core performs action on packet header.
  (8) Core writes packet header (MAC swap, TTL, tunnel, foobar..).
  (9) Core reads Tx descriptor.
  (10) Core writes Tx descriptor and writes Tx tail pointer.
  (11) NIC reads Tx descriptor.
  (12) NIC reads the packet.
  (13) NIC writes Tx descriptor.

[Figure: CPU socket containing the CPU cores, the LLC (holding the Rx descriptor, the Tx descriptor, and the packet), and the memory controller, with memory channels to DDR SDRAM. Arrows numbered 1 to 13 mark the steps above.]
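The thirteen steps above can be tallied by actor and operation. What follows is a minimal Python sketch, not part of the original slides: the tuple encoding, the `STEPS` name, and both helper functions are illustrative assumptions, and the Tx tail pointer write in step 10 is folded into the Tx descriptor write for simplicity.

```python
from collections import Counter

# (actor, operation, target) for steps 1-13, transcribed from the step list.
# This encoding is an assumption made for illustration only.
STEPS = [
    ("core", "write", "rx_descriptor"),     # 1
    ("nic", "read", "rx_descriptor"),       # 2
    ("nic", "write", "packet"),             # 3
    ("nic", "write", "rx_descriptor"),      # 4
    ("core", "read", "rx_descriptor"),      # 5
    ("core", "read", "packet_header"),      # 6
    ("core", "compute", "packet_header"),   # 7: no memory access modeled
    ("core", "write", "packet_header"),     # 8
    ("core", "read", "tx_descriptor"),      # 9
    ("core", "write", "tx_descriptor"),     # 10: tail pointer write folded in
    ("nic", "read", "tx_descriptor"),       # 11
    ("nic", "read", "packet"),              # 12
    ("nic", "write", "tx_descriptor"),      # 13
]


def tally(steps):
    """Count reads and writes per actor; compute-only steps are skipped."""
    return Counter((actor, op) for actor, op, _ in steps if op != "compute")


def nic_descriptor_accesses(steps):
    """Count NIC accesses that touch a descriptor rather than packet data."""
    return sum(
        1
        for actor, op, target in steps
        if actor == "nic" and target.endswith("descriptor")
    )


counts = tally(STEPS)
for key in sorted(counts):
    print(key, counts[key])
print("NIC descriptor accesses:", nic_descriptor_accesses(STEPS))
```

Under this encoding the core and the NIC each perform three reads and three writes per packet, and four of the six NIC accesses touch descriptors rather than packet data.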

Figure labels: PCIe, NICs. Legend: core operations, NIC packet operations, NIC descriptor operations.

Traditional performance analysis is core centric: most software thread work happens in the CPU core and local cache, with smart algorithms and predictive prefetching (shifted forward in time).
We need new methodologies to analyze I/O-centric workloads.

Uncore Side Performance Considerations
  PCIe bandwidth
  Memory bandwidth
  QPI bandwidth (Backup)

Open Sourced Uncore Performance Tool Options
  Processor Counter Monitoring (PCM), on GitHub*
    real-time display of metrics and bandwidth
    user-space tools
    cross-platform (Linux*, Windows*, Mac OS X*)
  Linux* perf + pmu-tools
    in the Linux* kernel
    standalone Linux* perf utility + wrapper

Measure PCIe Bandwidth
Processor Counter Monitoring (PCM): ./pcm-pcie.x -e

  Traffic                  Direction
  Network Rx (ItoM/RFO)    Inbound PCIe write
  Network Tx (PCIeRdCur)   Inbound PCIe read
  MMIO Read (PRd)          Outbound CPU read
  MMIO Write (WiL)         Outbound CPU write

Tuning notes:
  Optimize batch size
  Match workload I/O throughput (1)
  Avoid DDIO miss
  Avoid MMIO read
  Reduce RFO

(1) Results are reported in number of cache lines (64 bytes).
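Because the PCIe results are reported in 64-byte cache lines, comparing them with the workload's I/O throughput takes a unit conversion. The Python sketch below is one way to do it and is not taken from the slides: the function names are hypothetical, the count is assumed to have been collected over a known interval, and the per-packet estimate covers packet data only (descriptor traffic is ignored).

```python
CACHE_LINE_BYTES = 64  # from the slide footnote: results are in 64-byte cache lines


def cache_lines_to_gbps(cache_lines, interval_s=1.0):
    """Convert a cache-line count observed over interval_s seconds to Gbit/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return cache_lines * CACHE_LINE_BYTES * 8 / interval_s / 1e9


def expected_cache_lines_per_s(packets_per_s, packet_bytes):
    """Cache lines per second needed to move packet data of the given size.

    A packet that is not a multiple of 64 bytes still occupies whole cache
    lines, so the per-packet line count is rounded up.
    """
    lines_per_packet = -(-packet_bytes // CACHE_LINE_BYTES)  # ceiling division
    return packets_per_s * lines_per_packet


# Hypothetical inputs, chosen only to exercise the arithmetic.
print(cache_lines_to_gbps(19_531_250))            # one-second sample
print(expected_cache_lines_per_s(1_000_000, 65))  # 65-byte packets need 2 lines
```

If the measured count is far above the expected one, the difference points at traffic other than packet data, such as descriptor accesses or RFOs.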

Measure Memory Bandwidth
Processor Counter Monitoring (PCM): ./pcm-memory.x

  No memory traffic: best.
  Memory traffic present: find the reason.
    Reason                      Response
    1) Wrong NUMA allocation    Fix it in the code!
    2) DDIO miss                Receive packets ASAP; descriptor ring sizes need tuning
    3) CPU data structure       May be okay, but check for latency!

Uncore Performance Takeaway
  Easy
    Calibrate I/O bandwidth with expected performance
    Check for unexpected cross-NUMA traffic
    Avoid MMIO read
    Batch MMIO write
  Advanced
    Optimize software to make use of DDIO

Core Side Performance Consideration
  IPC
  Intel Processor Trace

Which Workload Has Best Performance?
           IPC
  VPP A    2.63
  VPP B    2.12

Workload Performance Reveal
A: un-optimized VPP IPSec. B: Intel AES-NI accelerated VPP IPSec.

           IPC     Instructions/packet   Cycles/packet
  VPP A    2.63    15733                 5982
  VPP B    2.12    1865                  880
  8x reduction in instructions, 6x reduction in cycles

Performance Is Determined by Two Factors
  Instructions per cycle (IPC)
  Mystery: instructions per packet
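The VPP A and VPP B figures above can be cross-checked: IPC should equal instructions per packet divided by cycles per packet, and the reduction factors follow from the per-packet columns. The Python sketch below does only that arithmetic. The dictionary layout and function names are assumptions for illustration, and the numbers are copied from the slide.

```python
# Figures copied from the "Workload Performance Reveal" slide.
WORKLOADS = {
    "VPP A": {"ipc": 2.63, "instr_per_pkt": 15733, "cycles_per_pkt": 5982},
    "VPP B": {"ipc": 2.12, "instr_per_pkt": 1865, "cycles_per_pkt": 880},
}


def derived_ipc(workload):
    """IPC recomputed as (instructions/packet) / (cycles/packet)."""
    return workload["instr_per_pkt"] / workload["cycles_per_pkt"]


def reduction(before, after, key):
    """How many times smaller `after` is than `before` for the given metric."""
    return before[key] / after[key]


a, b = WORKLOADS["VPP A"], WORKLOADS["VPP B"]
instr_reduction = reduction(a, b, "instr_per_pkt")
cycle_reduction = reduction(a, b, "cycles_per_pkt")

for name, w in WORKLOADS.items():
    print(f"{name}: reported IPC {w['ipc']}, derived IPC {derived_ipc(w):.2f}")
print(f"instructions/packet reduction: {instr_reduction:.1f}x")
print(f"cycles/packet reduction: {cycle_reduction:.1f}x")
```

The derived IPC matches the reported values to two decimals, and the reductions come out at roughly 8.4x and 6.8x, which the slide states as 8x and 6x. Workload B has the lower IPC yet needs far fewer cycles per packet, which is why IPC alone is a poor ranking metric here.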

  Metric                   Source      Value
  Instructions/sec (M)     Measured    5229
  Cycles/sec (M)           Measured    2103
  IPC                      Derived     2.49
  CPU freq. (MHz)          Measured    2100
  Packets/sec (M)          Measured    9.7
  Instructions/packet      Derived     539
  Cycles/packet            Derived     217

  (Instructions/pkt) / (Cycles/pkt) = IPC

Unless IPC is really poor, instructions/packet or cycles/packet is a more realistic core metric.

Intel Processor Trace (Intel PT) Components
The Intel PT packet log, binaries, and software runtime data are used to reconstruct the precise execution flow.

[Figure: Intel PT components. Labels: Intel CPU 0..n; Intel PT HW; Intel PT packet log (per logical processor); Ring0 agent (configure and enable Intel PT); Intel PT software decoder; Intel PT enabled tools; binary image files; runtime data.]

Runtime data (from the OS, VMM, BIOS, driver, ...) includes:
  Map linear address to image files
  Map CR3 value to application
  Log module load/unload and JIT info

Intel Processor Trace can reveal the exact execution path.

Linux* perf and Intel Processor Trace
  Provides a basic framework for trace collection, mining, and visualization
  Kernel driver in 4.1, user tools in 4.2/4.3
  Supports full trace and snapshot modes
  Display tools support histograms, event lists, flame graphs, etc.
Linux perf can capture the finest detail with Intel PT. Post-processing is required to reduce the trace data.

Life Cycle of a Packet using Intel PT
VPP L2 bridge domain

[Figure: instructions per packet (y-axis, 0 to 100) across the Rx, Processing, and Tx phases for a batch of 256 packets. Rx: 256 packets = 32-packet rx burst * 8 loops (i40e_recv_scattered_pkts_vec). Tx: 256 packets = 16-packet tx burst * 16 loops (i40e_xmit_pkts_vec). Annotations: "Dense packet processing functions" and "Not the most optimal driver path".]

Core Performance Takeaway
1. Use perfmon hardware to baseline workload performance.
   PCM, Linux* perf (pmu-tools/{toplev.py, ucevent.py}, intel_pt)
2. Reduce instruction counts per packet.
   Create a more direct VPP path to process packets
   Consider hardware offload
3. Reduce cycles per packet.
   Follow the uncore performance takeaway to reduce latency
   Measure CPU L1 cache miss latency and look for opportunities to prefetch (not covered)

References
  Xeon E5-v4 Uncore Performance Monitor Unit (PMU) guide: https://software.intel.com/en-us/blogs/2014/07/11/documentation-for-uncore-performance-monitoring-units
  Intel 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
  Processor Counter Monitoring (PCM): https://github.com/opcm/pcm
  pmu-tools: https://github.com/andikleen/pmu-tools
  Intel Processor Trace Linux* perf support: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt

Backup

Measure QPI (Cross Sockets) Bandwidth
Processor Counter Monitoring (PCM): ./pcm.x -nc
  Any QPI traffic will inevitably hurt latency, if not throughput.
  QPI traffic is determined by the placement of CPU, memory, and I/O.
  If it is unavoidable, watch the link utilization.

TMAM
Top-Down Analysis Methodology (TMAM)

Linux perf + pmu-tools
Perfmon device abstraction layers reside in /sys/devices.

Measure memory bandwidth with Linux perf:
  ./ucevent.py --scale MEGA iMC.MEM_BW_READS iMC.MEM_BW_WRITES

Measure PCIe bandwidth with Linux perf (events pending upstream):
  # Event not upstreamed yet.
  ./ucevent.py --scale MEGA CBO.CPU_OUTBOUND_MMIO_READ \
      CBO.CPU_OUTBOUND_MMIO_WRITE \
      CBO.PCIE_INBOUND_READ_HIT_DDIO \
      CBO.PCIE_INBOUND_READ_MISS_DDIO \
      CBO.PCIE_INBOUND_WRITE_HIT_DDIO \
      CBO.PCIE_INBOUND_WRITE_MISS_DDIO

Measure QPI bandwidth with Linux perf:
  ./ucevent.py --scale MEGA QPI_LL.QPI_DATA_BW QPI_LL.QPI_LINK_BW
