On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers

Per Stenström, Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
in collaboration with Sun Microsystems and Ericsson Research
Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/~pers

Motivation
Dominating multiprocessor server applications, 1995 (source: Dataquest):
[Pie chart: Database 32%, Other 32%, Scientific and engineering 16%; File server, Media and e-mail, and Print server share the remaining 9%, 6%, and 5%]
- Database applications dominate (32%)
- Yet, the major research focus is on scientific/engineering applications (16%)

Project Objective
Design principles for high-performance memory systems for emerging applications.
Systems considered:
- high-performance compute nodes
- SMP and DSM systems built out of them
Applications considered:
- decision support and on-line transaction processing
- emerging applications:
  - computer graphics
  - video/sound coding/decoding
  - handwriting recognition
  - ...

Outline
- Experimental platform
- Memory system issues studied:
  - working set size in DSS workloads
  - prefetch approaches for pointer-intensive workloads (such as in OLTP)
  - coherence issues in OLTP workloads
- Concluding remarks

Experimental Platform
[Diagram: simulation platform, with an application and the Linux operating system running on a modeled Sparc V8 CPU with cache ($) and memory (M), plus devices (Ethernet, SCSI, TTY, interrupt)]
- Single and multiprocessor system models
- The platform enables:
  - analysis of commercial workloads
  - analysis of OS effects
  - tracking architectural events to OS or application level
Decision-Support Systems (DSS)
Task: compile a list of matching entries in several database relations.
[Diagram: query plan as a chain of levels, with a scan at level 1 feeding joins, each join paired with a scan, up to level i]
Question: will moderately sized caches suffice for huge databases?

Our Findings
[Graph: cache miss rate versus cache size, dropping in steps at the MWS, DWS1, DWS2, ..., DWSi footprint sizes]
- MWS: footprint of instructions and private data needed to access a single tuple
  - typically small (< 1 Mbyte) and not affected by database size
- DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
  - typically a small impact (~0.1%) on the overall miss rate

Methodological Approach
Challenges:
- not feasible to simulate huge databases
- source code is needed: we used PostgreSQL and MySQL
Approach: an analytical model using
- parameters that describe the query
- parameters measured on downscaled query executions
- system parameters

Footprints and Reuse Characteristics in DSS
Footprints per tuple access:
[Diagram: the scan/join chain annotated per level; the level-1 scan touches MWS and DWS1, level 2 touches MWS and DWS2, ..., level i touches MWS and DWSi]
- MWS (instructions, private data, and metadata): can be measured on a downscaled simulation
- DWS (all tuples accessed at lower levels): can be computed from the query composition and the probability of a match

Analytical Model: an Overview
Goal: predict the miss rate versus cache size for fully associative caches with an LRU replacement policy on single-processor systems.
- Number of cold misses: (size of footprint) / (block size)
  - |MWS| is measured
  - |DWSi| is computed from parameters describing the query (size of relations, probability of matching a search criterion, index versus sequential scan, etc.)
- Number of capacity misses for a tuple access at level i, for cache size C:
  - CM0 * (1 - (C - C0) / (|MWS| - C0))   if C0 < C < |MWS|
  - (size of tuple) / (block size)        if |MWS| <= C < |MWS| + |DWSi|
  where C0 is the smallest cache size modeled and CM0 is the capacity miss count at C0
- Number of accesses per tuple: measured
- Total number of misses and accesses: computed

Model Validation
Goals:
- prediction accuracy for queries with different compositions: Q3, Q6, and Q10 from TPC-D
- prediction accuracy when scaling up the database: parameters measured at 5 Mbytes used to predict 200-Mbyte databases
- robustness across database engines: PostgreSQL and MySQL
[Graph: measured versus predicted miss ratio (0-10%) as a function of cache size (Kbytes) for Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins]

Model Predictions: Miss Rates for Huge Databases
[Graph: miss ratio (0-2.5%) versus cache size (Kbytes) for Q3 on a 10-Terabyte database, split into instruction, private-data, metadata, and database-data components]
- The instruction, private-data, and metadata components decay rapidly (gone by 128 Kbytes)
- The database-data component is small
- What's in the tail?
Cache Issues for Linked Data Structures
- Traversals of lists may exhibit poor temporal locality
- They result in chains of data-dependent loads, called pointer-chasing
- Pointer-chasing shows up in many interesting applications:
  - 35% of the misses in OLTP (TPC-B)
  - 32% of the misses in an expert system
  - 21% of the misses in Raytrace

SW Prefetch Techniques to Attack Pointer-Chasing
- Greedy prefetching (G)
  - falls short when the computation per node is smaller than the miss latency
- Jump-pointer prefetching (J)
  - falls short when the list is short or the traversal is not known a priori
- Prefetch arrays (P.S/P.H)
  - a generalization of G and J that addresses the above shortcomings
  - trades memory space and bandwidth for more latency tolerance
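The three techniques can be contrasted on a list traversal. The code below is an illustrative sketch, not the paper's implementation: GCC's __builtin_prefetch stands in for a prefetch instruction, and the lookahead depth of 3 is an arbitrary choice.

```c
#include <stddef.h>

#define PDIST 3 /* lookahead depth: arbitrary, tuned to latency/work ratio */

struct node {
    int data;
    struct node *next;
    struct node *jump;          /* jump pointer: PDIST nodes ahead */
    struct node *parray[PDIST]; /* prefetch array: the next PDIST nodes */
};

/* Greedy prefetching: prefetch only the immediate successor; this hides
 * the miss latency only if the work per node exceeds it. */
long traverse_greedy(struct node *n) {
    long sum = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->next);
        sum += n->data; /* stands in for the per-node work */
    }
    return sum;
}

/* Jump-pointer prefetching: prefetch PDIST nodes ahead, but the first
 * PDIST nodes are never covered, which hurts on short lists. */
long traverse_jump(struct node *n) {
    long sum = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->jump);
        sum += n->data;
    }
    return sum;
}

/* Prefetch arrays: on entry, prefetch the first PDIST nodes at once,
 * then keep prefetching ahead via jump pointers. This covers short
 * lists and little work per node, at the cost of extra pointer storage
 * and memory bandwidth. */
long traverse_parray(struct node *n) {
    long sum = 0;
    if (n)
        for (int i = 0; i < PDIST; i++)
            __builtin_prefetch(n->parray[i]);
    for (; n; n = n->next) {
        __builtin_prefetch(n->jump);
        sum += n->data;
    }
    return sum;
}
```

All three traversals compute the same result; they differ only in which cache blocks are in flight when each node is reached.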
Results: Hash Tables and Lists in Olden
[Bar chart: normalized execution time, split into busy time and memory stall time, for MST and HEALTH under baseline (B), greedy (G), jump-pointer (J), and prefetch-array (P.S, P.H) schemes; the prefetch-array schemes give the largest reductions]
Prefetch arrays do better because:
- MST has short lists and little computation per node
- they prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching

Results: Tree Traversals in OLTP and Olden
[Bar chart: normalized execution time, split into busy time and memory stall time, for DB.tree and Tree.add under B, G, J, P.S, and P.H]
Hardware-based prefetch arrays do better because:
- the traversal path is not known a priori in DB.tree (depth-first search)
- data for the first nodes is prefetched in Tree.add

Other Results in Brief
- Impact of longer memory latencies:
  - robust for lists
  - for trees, prefetch arrays may cause severe cache pollution
- Impact of memory bandwidth:
  - performance improvements are sustained at the bandwidths of typical high-end servers (2.4 Gbytes/s)
  - prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
- Node insertion and deletion with jump pointers and prefetch arrays:
  - incur instruction overhead for maintaining the pointers (-)
  - however, insertion/deletion is itself sped up by prefetching (+)

Coherence Issues in OLTP
[Diagram: a DSM system built from SMP nodes; each SMP node holds several processors (P) with private caches ($) and memory modules (M), and the nodes are connected into a distributed shared memory system]
Ownership overhead: invalidations cause write stall and inval. traffic Ownership Overhead in OLTP Simulation setup: CC-NUMA with 4 nodes MySQL, TPC-B, 600 MB database Kernel DBMS Library Total Load44% Store Stored- 32% Between Loaded- 24% Between 31% 27% 40% 61% 66%
41% 8% 7% 19% 40% of all ownership transactions stem from load/store sequences Techniques to Attack Ownership Overhead Dynamic detection of migratory sharing detects two load/store sequences by different processors only a sub-set of all load/store sequences (~40% in OLTP) Static detection of load/store sequences compiler algorithms that tags a load followed by a store and brings exclusive block in cache poses problems in TPC-B New Protocol Extension
Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store (LS).
[Bar chart: normalized execution time, split into busy time, read stall, and write stall, for the baseline, migratory-detection (Mig), and LS protocols]
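A directory-side sketch of this criterion in C (the data structures and function names are ours, for illustration; a real directory controller would fold this into its coherence state machine): once a load miss from processor i is followed by a global store from the same i, the block is tagged, and later load misses on it can return the block in exclusive state, eliminating the separate upgrade transaction.

```c
#include <stdbool.h>

enum ls_tag { UNTAGGED, LOADSTORE };

/* Per-block directory state for the sketch. */
struct dir_entry {
    enum ls_tag tag;
    int last_load_miss_cpu; /* -1 if no load miss is pending */
};

/* Called on a load miss from cpu: remember the requester; if the block
 * is already tagged Load/Store, grant an exclusive copy at once so the
 * expected store needs no invalidation round. Returns true when the
 * reply should carry exclusive ownership. */
bool load_miss(struct dir_entry *e, int cpu) {
    e->last_load_miss_cpu = cpu;
    return e->tag == LOADSTORE;
}

/* Called on a global store (ownership request) from cpu: if the same
 * cpu just load-missed on this block, a load/store sequence has been
 * detected and the block is tagged for future accesses. */
void global_store(struct dir_entry *e, int cpu) {
    if (e->last_load_miss_cpu == cpu)
        e->tag = LOADSTORE;
}
```

Unlike migratory-sharing detection, which needs load/store sequences from two different processors before it reacts, this criterion fires after a single load miss/store pair, which is why it covers more of the ownership transactions in OLTP.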
Concluding Remarks
- The focus on DSS and OLTP has revealed challenges not exposed by traditional applications:
  - pointer-chasing
  - load/store optimizations
- Application scaling is not fully understood
- Our work on combining simulation with analytical modeling shows some promise