Flat Datacenter Storage
Microsoft Research, Redmond
Ed Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, Yutaka Suzue

Writing: fine-grained write striping and statistical multiplexing give high disk utilization, good performance, and disk efficiency.
Reading: high utilization (for tasks with balanced CPU/IO); easy to write software; dynamic work allocation, so no stragglers; easy to adjust the ratio of CPU to disk resources.

Metadata management; physical data transport.

FDS in 90 Seconds
FDS is simple, scalable blob storage; it logically separates compute and storage without the usual performance penalty.
Distributed metadata management, no centralized components on common-case paths.
Built on a CLOS network with distributed scheduling.
High read/write performance demonstrated (2 GB/s, single-replicated, from one process).
Fast failure recovery (0.6 TB in 33.7 s with 1,000 disks).
High application performance: web index serving; stock cointegration; set the 2012 world record for disk-to-disk sorting.

A blob (e.g., GUID 0x5f37...59df) is an array of 8 MB tracts: Tract 0, Tract 1, Tract 2, ..., Tract n.

// Create a blob with the specified GUID.
CreateBlob(GUID, &blobHandle, doneCallbackFunction);
// ...
// Write 8 MB from buf to tract 0 of the blob.
blobHandle->WriteTract(0, buf, doneCallbackFunction);
// Read tract 2 of the blob into buf.
blobHandle->ReadTract(2, buf, doneCallbackFunction);
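These calls are asynchronous: the library invokes the callback when the tract operation completes. As a rough usage sketch only (not from the slides; the handle type, callback signature, and busy-wait below are stand-ins), a reader can keep every tract of a blob in flight at once and let the striping across tractservers deliver aggregate disk bandwidth:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real FDS client handle; ReadTract here is a stub that
// completes immediately so the sketch compiles on its own.
struct BlobHandle {
    void ReadTract(uint64_t /*tractNum*/, void* /*buf*/, void (*done)(void*)) {
        done(nullptr);
    }
};

static constexpr size_t kTractSize = 8 * 1024 * 1024;   // tracts are 8 MB

std::atomic<uint64_t> g_tractsRemaining{0};
void OnTractDone(void*) { g_tractsRemaining.fetch_sub(1); }

// Issue reads for tracts [0, numTracts) all at once, then wait for completion.
void ReadWholeBlob(BlobHandle* blob, uint64_t numTracts,
                   std::vector<std::vector<uint8_t>>& buffers) {
    buffers.assign(numTracts, std::vector<uint8_t>(kTractSize));
    g_tractsRemaining.store(numTracts);
    for (uint64_t t = 0; t < numTracts; ++t)
        blob->ReadTract(t, buffers[t].data(), OnTractDone);
    while (g_tractsRemaining.load() != 0) { /* spin; a real client would block */ }
}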

Architecture (figure): clients connect over the network to a metadata server and to tractservers.

Metadata design options:
GFS, Hadoop: centralized metadata server

- On the critical path of reads/writes
- Large (coarsely striped) writes
+ Complete state visibility
+ Full control over data placement
+ One-hop access to data
+ Fast reaction to failures
DHTs:
+ No central bottlenecks
+ Highly scalable
- Multiple hops to find data
- Slower failure recovery
FDS sits between the two.

(Figure: the metadata server hands clients the tract locator table; clients then locate tracts themselves.)

Tract Locator Table (example figure):
Locator | Disk 1 | Disk 2 | Disk 3
0       | A      | B      | C
1       | A      | D      | F
...     | ...    | ...    | ...
The table could be filled by an oracle, by consistent hashing, or pseudo-randomly.

The table has O(n) or O(n^2) rows (1,526 in the example); each row lists tractserver addresses. Readers use one; writers use all.
Locator = (Hash(Blob_GUID) + Tract_Num) MOD Table_Size
Tract -1 = special metadata tract
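To make the lookup concrete, here is a minimal sketch of that computation; the hash function, three-way replication, and string disk names are assumptions for illustration, not the production code:

#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One row of the tract locator table: the tractservers responsible for that
// locator. Readers contact any one of them; writers must contact all of them.
struct TableRow {
    std::array<std::string, 3> tractservers;   // e.g., {"A", "B", "C"}
};

// Locator = (Hash(Blob_GUID) + Tract_Num) MOD Table_Size. Consecutive tracts
// of one blob land in consecutive rows, spreading the blob across all disks.
size_t Locator(const std::string& blobGuid, int64_t tractNum,
               const std::vector<TableRow>& table) {
    const int64_t n = static_cast<int64_t>(table.size());
    const int64_t h = static_cast<int64_t>(std::hash<std::string>{}(blobGuid) % n);
    // tractNum may be -1 (the special metadata tract), so reduce it carefully.
    return static_cast<size_t>(((h + tractNum) % n + n) % n);
}

const TableRow& RowFor(const std::string& blobGuid, int64_t tractNum,
                       const std::vector<TableRow>& table) {
    return table[Locator(blobGuid, tractNum, table)];
}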

(Animation) Writers grow a blob by atomically extending it and then writing the newly allocated tracts: for example, extend by 10 tracts and write tracts 10-19; extend by 4 tracts and write tracts 20-23; extend by 7 tracts and write tracts 54-60; extend by 5 tracts and write tracts 61-65.
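The pattern the animation suggests: a writer reserves a range of tract numbers with an atomic extend, then writes only that range, so concurrent appenders never collide. A small sketch under that reading; ExtendBlob is a hypothetical name, and the stubs below exist only so the example runs on its own:

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for the client API. In real FDS the extension is an
// atomic metadata operation; here a counter just hands out growing ranges.
static uint64_t g_nextTract = 10;
uint64_t ExtendBlob(uint64_t numTracts) {
    const uint64_t first = g_nextTract;    // first tract number now owned by the caller
    g_nextTract += numTracts;
    return first;
}
void WriteTract(uint64_t tractNum, const void* /*buf*/) {
    std::printf("writing tract %llu\n", static_cast<unsigned long long>(tractNum));
}

// Append: reserve a disjoint tract range, then write into exactly that range.
void AppendTracts(const std::vector<std::vector<uint8_t>>& bufs) {
    const uint64_t first = ExtendBlob(bufs.size());
    for (size_t i = 0; i < bufs.size(); ++i)
        WriteTract(first + i, bufs[i].data());
}

int main() {
    AppendTracts(std::vector<std::vector<uint8_t>>(10, std::vector<uint8_t>(8)));  // tracts 10-19
    AppendTracts(std::vector<std::vector<uint8_t>>(4, std::vector<uint8_t>(8)));   // tracts 20-23
}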

Bandwidth is (was?) scarce in datacenters due to oversubscription. (Figure: racks of CPUs behind top-of-rack switches; the network core is 10x-20x oversubscribed.)

CLOS networks [Al-Fares 08, Greenberg 09] provide full bisection bandwidth at datacenter scales. (Figure: CPU racks, top-of-rack switches, 4x-25x oversubscription.)

Disks: 1 Gbps of bandwidth each. FDS provisions the network sufficiently for every disk: 1G of network bandwidth per disk.
~1,500 disks spread across ~250 servers, dual 10G NICs in most servers, and a 2-layer Monsoon network:
o Based on the Blade G8264 router (64x10G ports)
o 14x TORs, 8x spines
o 4x TOR-to-spine connections per pair
o 448x10G ports total (4.5 terabits), full bisection

No Silver Bullet
X Full bisection bandwidth is only stochastic
o Long flows are bad for load-balancing
o FDS generates a large number of short flows going to diverse destinations
X Congestion isn't eliminated; it's been pushed to the edges
o TCP bandwidth allocation performs poorly with short, fat flows: incast
o FDS creates circuits using RTS/CTS
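The slides do not show the circuit-scheduling code; as a rough sketch of the idea only (receiver-granted transfers so that many short, fat flows do not converge on one NIC at once), a receiver might queue RTS messages and grant a bounded number of CTS tokens. All names and the concurrency limit below are assumptions:

#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

// Sketch of receiver-side RTS/CTS flow control. Senders announce a large
// transfer with an RTS; the receiver grants CTS to a bounded number of
// senders at a time, so bursts of transfers do not cause incast collapse.
class RtsCtsReceiver {
public:
    explicit RtsCtsReceiver(size_t maxConcurrent) : maxConcurrent_(maxConcurrent) {}

    // Called when an RTS arrives. Returns true if the sender may start now
    // (CTS granted immediately); otherwise the sender is queued.
    bool OnRts(uint64_t senderId) {
        if (active_.size() < maxConcurrent_) {
            active_.insert(senderId);
            return true;                        // send CTS to senderId
        }
        waiting_.push_back(senderId);
        return false;
    }

    // Called when a granted transfer finishes; hands the slot to the next
    // queued sender, keeping the receiver's link busy but uncongested.
    void OnTransferDone(uint64_t senderId) {
        active_.erase(senderId);
        if (!waiting_.empty()) {
            active_.insert(waiting_.front());   // send CTS to this sender
            waiting_.pop_front();
        }
    }

private:
    size_t maxConcurrent_;
    std::unordered_set<uint64_t> active_;
    std::deque<uint64_t> waiting_;
};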

Performance, single-replicated tractservers, 10G clients (figure): read 950 MB/s/client, write 1,150 MB/s/client.
Performance, triple-replicated tractservers, 10G clients (figure).

X Hot spare: no dedicated spares needed. More disks mean faster recovery.

Tract Locator Table during recovery (figure; three animation frames):
Locator | Disk 1 | Disk 2 | Disk 3
1       | A      | B      | C
2       | A      | C      | Z
3       | A      | D      | H
4       | A      | E      | M
5       | A      | F      | C
6       | A      | G      | P
...
648     | Z      | W      | H
649     | Z      | X      | L
650     | Z      | Y      | C
When disk A fails, every row that contains A is assigned a replacement disk (M, S, R, D, ... in the animation), and the row's surviving disks supply the data.
All disk pairs appear in the table, so n disks each recover 1/nth of the lost data in parallel.
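A minimal sketch of that recovery-planning step (my reading of the table-driven scheme, not the actual recovery code): scan the locator table for rows containing the failed disk, pick a replacement for each such row, and copy from one of the row's surviving replicas. Because those rows involve nearly every disk, the copies proceed in parallel across the cluster.

#include <array>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

struct TableRow { std::array<std::string, 3> disks; };   // one locator entry

struct RecoveryTask {
    size_t      row;      // locator row that lost a replica
    std::string source;   // a surviving disk in that row to copy from
    std::string target;   // the replacement disk chosen for that row
};

// Each task has a different source/target pair, so n disks each recover
// roughly 1/n of the lost data in parallel. (A real planner would also avoid
// choosing a target that already appears in the row; healthyDisks must be
// non-empty.)
std::vector<RecoveryTask> PlanRecovery(const std::vector<TableRow>& table,
                                       const std::string& failedDisk,
                                       const std::vector<std::string>& healthyDisks) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, healthyDisks.size() - 1);
    std::vector<RecoveryTask> tasks;
    for (size_t r = 0; r < table.size(); ++r) {
        const auto& d = table[r].disks;
        for (size_t i = 0; i < d.size(); ++i) {
            if (d[i] != failedDisk) continue;
            const std::string& src = d[(i + 1) % d.size()];   // a surviving replica
            tasks.push_back({r, src, healthyDisks[pick(rng)]});
        }
    }
    return tasks;
}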

Failure Recovery Results
Disks in Cluster | Disks Failed | Data Recovered | Time
100              | 1            | 47 GB          | 19.2 ± 0.7 s
1,000            | 1            | 47 GB          | 3.3 ± 0.6 s
1,000            | 1            | 92 GB          | 6.2 ± 0.4 s
1,000            | 7            | 655 GB         | 33.7 ± 1.5 s
We recover at about 40 MB/s/disk, plus detection time.
A 1 TB failure in a 3,000-disk cluster: ~17 s.
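A back-of-envelope check of the ~17 s estimate, under my assumption (consistent with the table above) that the 40 MB/s/disk figure counts both reading a surviving replica and writing the new copy:

\[
t \approx \frac{2 \times 1\,\mathrm{TB} / 3000}{40\,\mathrm{MB/s}}
  \approx \frac{667\,\mathrm{MB}}{40\,\mathrm{MB/s}}
  \approx 17\,\mathrm{s}\ \ (\text{plus detection time}).
\]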

Minute Sort
Jim Gray's benchmark: how much data can you sort in 60 seconds?

o Has real-world applicability: sort, arbitrary join, group by

System             | Computers | Disks | Sort Size | Time | Disk Throughput
MSR FDS 2012       | 256       | 1,033 | 1,470 GB  | 59 s | 46 MB/s
Yahoo! Hadoop 2009 | 1,408     | 5,632 | 500 GB    | 59 s | 3 MB/s
A 15x efficiency improvement!

Previous no-holds-barred record: UCSD's TritonSort (1,353 GB); FDS: 1,470 GB
o Their purpose-built stack beat us on efficiency, however
Sort was just an app; FDS was not "enlightened" for sorting
o Sent the data over the network thrice (read, bucket, write)

Dynamic Work Allocation (figure)

Conclusions
Agility and conceptual simplicity of a global store, without the usual performance penalty. Remote storage is as fast (throughput-wise) as local.
Build high-performance, high-utilization clusters:
o Buy as many disks as you need for aggregate IOPS
o Provision enough network bandwidth based on the computation-to-I/O ratio of expected applications
o Apps can use I/O and compute in whatever ratio they need
o By investing about 30% more for the network, nearly all the hardware can be kept busy
Potentially enables new applications.
Thank you!

FDS Sort vs. TritonSort

System          | Computers | Disks | Sort Size | Time   | Disk Throughput
FDS 2012        | 256       | 1,033 | 1,470 GB  | 59.4 s | 47.9 MB/s
TritonSort 2011 | 66        | 1,056 | 1,353 GB  | 59.2 s | 43.3 MB/s

Disk-wise: FDS is more efficient (~10%).
Computer-wise: FDS is less efficient, but
o Some is genuine inefficiency: sending data three times
o Some is because FDS used a scrapheap of old computers: only 7 disks per machine, and we couldn't run the tractserver and client on the same machine
Design differences:
o General-purpose remote store vs. purpose-built sort application
o Could scale 10x with no changes vs. one big switch at the top
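The disk-throughput column in the table above is consistent with counting each byte as read once and written once during the sort (an assumption on my part, but it reproduces both figures):

\[
\text{FDS: } \frac{2 \times 1470\,\mathrm{GB}}{1033 \times 59.4\,\mathrm{s}} \approx 47.9\,\mathrm{MB/s},
\qquad
\text{TritonSort: } \frac{2 \times 1353\,\mathrm{GB}}{1056 \times 59.2\,\mathrm{s}} \approx 43.3\,\mathrm{MB/s}.
\]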

Hadoop on a 10G CLOS network?
Congestion isn't eliminated; it's been pushed to the edges
o TCP bandwidth allocation performs poorly with short, fat flows: incast
o FDS creates circuits using RTS/CTS
Full bisection bandwidth is only stochastic. Software written to assume bandwidth is scarce won't try to use the network. We want to exploit all disks equally.

Stock Market Analysis
Analyzes stock market data from BATStrading.com. 23 seconds to:
o Read 2.5 GB of compressed data from a blob
o Decompress to 13 GB and do computation
o Write correlated data back to blobs
Original zlib compression was thrown out: too slow!
o FDS delivered 8 MB/70 ms/NIC, but each tract took 218 ms to decompress (10 NICs, 16 cores)

o Switched to XPress, which can decompress a tract in 62 ms
FDS turned this from an I/O-bound into a compute-bound application.

FDS Recovery Speed: Triple-Replicated, Single Disk Failure
2012 Result: 1,000 disks, 92 GB per disk, recovered in 6.2 +/- 0.4 sec
2010 Estimate: 2,500-3,000 disks, 1 TB per disk, should recover in 30 sec
2010 Experiment: 98 disks, 25 GB per disk, recovered in 20 sec

Why is fast failure recovery important?
Increased data durability

o Too many failures within a recovery window = data loss
o Reduces the window from hours to seconds
Decreased CapEx + OpEx
o CapEx: no need for hot spares; all disks do work
o OpEx: don't replace disks; wait for an upgrade
Simplicity
o Block writes until recovery completes
o Avoid corner cases

FDS Cluster 1
14 machines (16 cores each), 8 disks per machine, ~10 1G NICs per machine
4x LB4G switches (40x1G + 4x10G each)
1x LB6M switch (24x10G)

Made possible through the generous support of the eXtreme Computing Group (XCG).

Cluster 2 Network Topology (figure)

Distributing 8 MB tracts to disks uniformly at random: how many tracts is a disk likely to get?
60 GB over 56 disks: μ = 134, σ = 11.5; likely range 110-159; the busiest disk is likely 18.7% above average.
500 GB over 1,033 disks: μ = 60, σ = 7.8; likely range 38-86; the busiest disk is likely 42.1% above average.
Solution (simplified): change the locator to (Hash(Blob_GUID) + Tract_Number) MOD TableSize.
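The μ and σ figures above match a simple binomial model (my assumption): if $T$ tracts are placed independently and uniformly over $D$ disks, one disk's tract count has

\[
\mu = \frac{T}{D}, \qquad
\sigma = \sqrt{T \cdot \frac{1}{D}\left(1 - \frac{1}{D}\right)},
\]

so 60 GB (7,500 tracts of 8 MB) over 56 disks gives $\mu \approx 134$, $\sigma \approx 11.5$, and 500 GB (62,500 tracts) over 1,033 disks gives $\mu \approx 60$, $\sigma \approx 7.8$.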
