Chapter 6 Storage and Other I/O Topics

Introduction
§6.1 Introduction
• I/O devices can be characterized by
    • Behaviour: input, output, storage
    • Partner: human or machine
    • Data rate: bytes/sec, transfers/sec
• I/O bus connections

I/O System Characteristics
• Dependability is important
    • Particularly for storage devices
• Performance measures
    • Latency (response time)
    • Throughput (bandwidth)
• Desktops & embedded systems
    • Mainly interested in response time & diversity of devices
• Servers
    • Mainly interested in throughput & expandability of devices

Dependability
§6.2 Dependability, Reliability, and Availability
• Fault: failure of a component
    • May or may not lead to system failure
[Figure: two states, "Service accomplishment" (service delivered as specified) and "Service interruption" (deviation from specified service), linked by "Failure" and "Restoration" transitions]

Dependability Measures
• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures
    • MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Improving Availability
    • Increase MTTF: fault avoidance, fault tolerance, fault forecasting
    • Reduce MTTR: improved tools and processes for diagnosis and repair

Disk Storage
§6.3 Disk Storage
• Nonvolatile, rotating magnetic storage

Disk Sectors and Access
• Each sector records
    • Sector ID
    • Data (512 bytes, 4096 bytes proposed)
    • Error correcting code (ECC)
        • Used to hide defects and recording errors
    • Synchronization fields and gaps
• Access to a sector involves
    • Queuing delay if other accesses are pending
    • Seek: move the heads
    • Rotational latency
    • Data transfer
    • Controller overhead

Disk Access Example
• Given
    • 512B sector, 15,000rpm, 4ms average
      seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
• Average read time
    • 4ms seek time
      + ½ rotation / (15,000/60 rotations per second) = 2ms rotational latency
      + 512B / 100MB/s = 0.005ms transfer time
      + 0.2ms controller delay
      = 6.2ms
• If actual average seek time is 1ms
    • Average read time = 3.2ms

Disk Performance Issues
• Manufacturers quote average seek time
    • Based on all possible seeks
    • Locality and OS scheduling lead to smaller actual average seek times
• Smart disk controllers allocate physical sectors on disk
    • Present logical sector interface to host
    • SCSI, ATA, SATA
• Disk drives include caches
    • Prefetch sectors in anticipation of access
    • Avoid seek and rotational delay

Flash Storage
§6.4 Flash Storage
• Nonvolatile semiconductor storage
    • 100× to 1000× faster than disk
    • Smaller, lower power, more robust
    • But more $/GB (between disk and DRAM)

Flash Types
• NOR flash: bit cell like a NOR gate
    • Random read/write access
    • Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
    • Denser (bits/area), but block-at-a-time access
    • Cheaper per GB
    • Used for USB keys, media storage, …
• Flash bits wear out after 1000s of accesses
    • Not suitable for direct RAM or disk replacement
    • Wear leveling: remap data to less used blocks

Interconnecting Components
§6.5 Connecting Processors, Memory, and I/O Devices
• Need interconnections between
    • CPU, memory, I/O controllers
• Bus: shared communication channel
    • Parallel set of wires for data and synchronization of data transfer
    • Can become a bottleneck
• Performance limited by physical factors
    • Wire length, number of connections
• More recent alternative: high-speed serial connections with switches
    • Like networks

Bus Types
• Processor-Memory buses
    • Short, high speed
    • Design
      is matched to memory organization
• I/O buses
    • Longer, allowing multiple connections
    • Specified by standards for interoperability
    • Connect to processor-memory bus through a bridge

Bus Signals and Synchronization
• Data lines
    • Carry address and data
    • Multiplexed or separate
• Control lines
    • Indicate data type, synchronize transactions
• Synchronous
    • Uses a bus clock
• Asynchronous
    • Uses request/acknowledge control lines for handshaking

I/O Bus Examples
                     Firewire     USB 2.0            PCI Express       Serial ATA  Serial Attached SCSI
Intended use         External     External           Internal          Internal    External
Devices per channel  63           127                1                 1           4
Data width           4            2                  2/lane            4           4
Peak bandwidth       50MB/s or    0.2MB/s, 1.5MB/s,  250MB/s/lane;     300MB/s     300MB/s
                     100MB/s      or 60MB/s          1×, 2×, 4×, 8×,
                                                     16×, 32×
Hot pluggable        Yes          Yes                Depends           Yes         Yes
Max length           4.5m         5m                 0.5m              1m          8m
Standard             IEEE 1394    USB Implementers   PCI-SIG           SATA-IO     INCITS TC T10
                                  Forum

Typical x86 PC I/O System

I/O Management
§6.6 Interfacing I/O Devices…
• I/O is mediated by the OS
    • Multiple programs share I/O resources
        • Need protection and scheduling
    • I/O causes asynchronous interrupts
        • Same mechanism as exceptions
    • I/O programming is fiddly
        • OS provides abstractions to programs

I/O Commands
• I/O devices are managed by I/O controller hardware
    • Transfers data to/from device
    • Synchronizes operations with software
• Command registers
    • Cause device to do something
• Status registers
    • Indicate what the device is doing and occurrence of errors
• Data registers
    • Write: transfer data to a device
    • Read: transfer data from a device

I/O Register Mapping
• Memory mapped I/O
    • Registers are addressed in same space as memory
    • Address decoder distinguishes between them
    • OS uses address translation
      mechanism to make them only accessible to kernel
• I/O instructions
    • Separate instructions to access I/O registers
    • Can only be executed in kernel mode
    • Example: x86

Polling
• Periodically check I/O status register
    • If device ready, do operation
    • If error, take action
• Common in small or low-performance real-time embedded systems
    • Predictable timing
    • Low hardware cost
• In other systems, wastes CPU time

Interrupts
• When a device is ready or an error occurs
    • Controller interrupts CPU
• Interrupt is like an exception
    • But not synchronized to instruction execution
    • Can invoke handler between instructions
    • Cause information often identifies the interrupting device
• Priority interrupts
    • Devices needing more urgent attention get higher priority
    • Can interrupt handler for a lower priority interrupt

I/O Data Transfer
• Polling and interrupt-driven I/O
    • CPU transfers data between memory and I/O data registers
    • Time consuming for high-speed devices
• Direct memory access (DMA)
    • OS provides starting address in memory
    • I/O controller transfers to/from memory autonomously
    • Controller interrupts on completion or error

DMA/Cache Interaction
• If DMA writes to a memory block that is cached
    • Cached copy becomes stale
• If write-back cache has dirty block, and DMA reads memory block
    • Reads stale data
• Need to ensure cache coherence
    • Flush blocks from cache if they will be used for DMA
    • Or use non-cacheable memory locations for I/O

DMA/VM Interaction
• OS uses virtual addresses for memory
    • DMA blocks may not be contiguous in physical memory
• Should DMA use virtual addresses?
    • Would require controller to do translation
• If DMA uses physical addresses
    • May need to break transfers into page-sized chunks
    • Or chain multiple transfers
    • Or allocate contiguous physical pages for DMA

Measuring I/O Performance
§6.7 I/O Performance Measures: …
• I/O performance depends on
    • Hardware: CPU, memory, controllers, buses
    • Software: operating system, database management system, application
    • Workload: request rates and patterns
• I/O system design can trade off response time against throughput
    • Measurements of throughput often done with constrained response time

Transaction Processing Benchmarks
• Transactions
    • Small data accesses to a DBMS
    • Interested in I/O rate, not data rate
• Measure throughput
    • Subject to response time limits and failure handling
    • ACID (Atomicity, Consistency, Isolation, Durability)
    • Overall cost per transaction
• Transaction Processing Council (TPC) benchmarks (www.tpc.org)
    • TPC-APP: B2B application server and web services
    • TPC-C: on-line order entry environment
    • TPC-E: on-line transaction processing for brokerage firm
    • TPC-H: decision support (business-oriented ad-hoc queries)

File System & Web Benchmarks
• SPEC System File System (SFS)
    • Synthetic workload for NFS server, based on monitoring real systems
    • Results
        • Throughput (operations/sec)
        • Response time (average ms/operation)
• SPEC Web Server benchmark
    • Measures simultaneous user sessions, subject to required throughput/session
    • Three workloads: Banking, Ecommerce, and Support

I/O vs.
CPU Performance
§6.9 Parallelism and I/O: RAID
• Amdahl’s Law
    • Don’t neglect I/O performance as parallelism increases compute performance
• Example
    • Benchmark takes 90s CPU time, 10s I/O time
    • Double the number of CPUs every 2 years
        • I/O unchanged

Year  CPU time  I/O time  Elapsed time  % I/O time
now   90s       10s       100s          10%
+2    45s       10s       55s           18%
+4    23s       10s       33s           31%
+6    11s       10s       21s           47%

RAID
• Redundant Array of Inexpensive (Independent) Disks
    • Use multiple smaller disks (cf. one large disk)
    • Parallelism improves performance
    • Plus extra disk(s) for redundant data storage
• Provides fault tolerant storage system
    • Especially if failed disks can be “hot swapped”
• RAID 0
    • No redundancy (“AID”?)
        • Just stripe data over multiple disks
    • But it does improve performance

RAID 1 & 2
• RAID 1: Mirroring
    • N + N disks, replicate data
        • Write data to both data disk and mirror disk
        • On disk failure, read from mirror
• RAID 2: Error correcting code (ECC)
    • N + E disks (e.g., 10 + 4)
    • Split data at bit level across N disks
    • Generate E-bit ECC
    • Too complex, not used in practice

RAID 3: Bit-Interleaved Parity
• N + 1 disks
    • Data striped across N disks at byte level
    • Redundant disk stores parity
    • Read access
        • Read all disks
    • Write access
        • Generate new parity and update all disks
    • On failure
        • Use parity to reconstruct missing data
• Not widely used

RAID 4: Block-Interleaved Parity
• N + 1 disks
    • Data striped across N disks at block level
    • Redundant disk stores parity for a group of blocks
    • Read access
        • Read only the disk holding the required block
    • Write access
        • Just read disk containing modified block, and parity disk
        • Calculate new parity, update data disk and parity disk
    • On failure
        • Use parity to reconstruct missing data
• Not widely used

RAID 3 vs
RAID 4

RAID 5: Distributed Parity
• N + 1 disks
    • Like RAID 4, but parity blocks distributed across disks
        • Avoids parity disk being a bottleneck
• Widely used

RAID 6: P + Q Redundancy
• N + 2 disks
    • Like RAID 5, but two lots of parity
    • Greater fault tolerance through more redundancy
• Multiple RAID
    • More advanced systems give similar fault tolerance with better performance

RAID Summary
• RAID can improve performance and availability
    • High availability requires hot swapping
• Assumes independent disk failures
    • Too bad if the building burns down!
• See “Hard Disk Performance, Quality and Reliability”
    • http://www.pcguide.com/ref/hdd/perf/index.htm

I/O System Design
§6.8 Designing an I/O System
• Satisfying latency requirements
    • For time-critical operations
    • If system is unloaded
        • Add up latency of components
• Maximizing throughput
    • Find “weakest link” (lowest-bandwidth component)
    • Configure to operate at its maximum bandwidth
    • Balance remaining components in the system
• If system is loaded, simple analysis is insufficient
    • Need to use queuing models or simulation

Server Computers
§6.10 Real Stuff: Sun Fire x4150 Server
• Applications are increasingly run on servers
    • Web search, office apps, virtual worlds, …
• Requires large data center servers
    • Multiple processors, network connections, massive storage
    • Space and power constraints
• Server equipment built for 19” racks
    • Multiples of 1.75” (1U) high

Rack-Mounted Servers
[Figure: Sun Fire x4150 1U server]

Sun Fire x4150 1U server
[Figure labels: 4 cores each; 16 x 4GB = 64GB DRAM]

I/O System Design Example
• Given a Sun Fire x4150 system with
    • Workload: 64KB disk reads
        • Each I/O op requires 200,000 user-code instructions and 100,000 OS instructions
    • Each CPU: 10^9 instructions/sec
    • FSB: 10.6 GB/sec peak
    • DRAM DDR2 667MHz: 5.336 GB/sec
    • PCI-E 8× bus: 8 × 250MB/sec = 2GB/sec
    • Disks: 15,000 rpm, 2.9ms avg. seek time, 112MB/sec transfer rate
• What I/O rate can be sustained?
    • For random reads, and for sequential reads

Design Example (cont)
• I/O rate for CPUs
    • Per core: 10^9/(100,000 + 200,000) = 3,333 ops/sec
    • 8 cores: 26,667 ops/sec
• Random reads, I/O rate for disks
    • Assume actual seek time is average/4
    • Time/op = seek + latency + transfer = 2.9ms/4 + 4ms/2 + 64KB/(112MB/s) = 3.3ms
    • 303 ops/sec per disk, 2424 ops/sec for 8 disks
• Sequential reads
    • 112MB/s / 64KB = 1750 ops/sec per disk
    • 14,000 ops/sec for 8 disks

Design Example (cont)
• PCI-E I/O rate
    • 2GB/sec / 64KB = 31,250 ops/sec
• DRAM I/O rate
    • 5.336 GB/sec / 64KB = 83,375 ops/sec
• FSB I/O rate
    • Assume we can sustain half the peak rate
    • 5.3 GB/sec / 64KB = 81,540 ops/sec per FSB
    • 163,080 ops/sec for 2 FSBs
• Weakest link: disks
    • 2424 ops/sec random, 14,000 ops/sec sequential
    • Other components have ample headroom to accommodate these rates

Fallacy: Disk Dependability
§6.12 Fallacies and Pitfalls
• If a disk manufacturer quotes MTTF as 1,200,000hr (140yr)
    • A disk will work that long
• Wrong: this is the mean time to failure
    • What is the distribution of failures?
    • What if you have 1000 disks
        • How many will fail per year?

Annual Failure Rate (AFR) = (1000 disks × 8760 hrs/disk) / (1,200,000 hrs/failure) = 7.3 failures/year = 0.73% of the 1000 disks

Fallacies
• Disk failure rates are as specified
    • Studies of failure rates in the field
        • Schroeder and Gibson: 2% to 4% observed vs. 0.6% to 0.8% specified
        • Pinheiro, et al.: 1.7% (first year) to 8.6% (third year) observed vs. 1.5% specified
    • Why?
• A 1GB/s interconnect transfers 1GB in one sec
    • But what’s a GB?
    • For bandwidth, use 1GB = 10^9 B
    • For storage, use 1GB = 2^30 B = 1.074×10^9 B
    • So a 1GB/sec link moves only 0.93GB (in storage units) in one second
        • About 7% error

Pitfall: Offloading to I/O Processors
• Overhead of managing I/O processor request may dominate
    • Quicker to do small operation on the CPU
    • But I/O architecture may prevent that
• I/O processor may be slower
    • Since it’s supposed to be simpler
• Making it faster makes it into a major system component
    • Might need its own coprocessors!

Pitfall: Backing Up to Tape
• Magnetic tape used to have advantages
    • Removable, high capacity
• Advantages eroded by disk technology developments
• Makes better sense to replicate data
    • E.g., RAID, remote mirroring

Fallacy: Disk Scheduling
• Best to let the OS schedule disk accesses
    • But modern drives deal with logical block addresses
        • Map to physical track, cylinder, sector locations
        • Also, blocks are cached by the drive
    • OS is unaware of physical locations
        • Reordering can reduce performance
        • Depending on placement and caching

Pitfall: Peak Performance
• Peak I/O rates are nearly impossible to achieve
    • Usually, some other system component limits performance
    • E.g., transfers to memory over a bus
        • Collision with DRAM refresh
        • Arbitration contention with other bus masters
    • E.g., PCI bus: peak bandwidth ~133 MB/sec
        • In practice, max 80MB/sec sustainable

Concluding Remarks
§6.13 Concluding Remarks
• I/O performance measures
    • Throughput, response time
    • Dependability and cost also important
• Buses used to connect CPU, memory, I/O controllers
    • Polling, interrupts, DMA
• I/O benchmarks
    • TPC, SPECSFS, SPECWeb
• RAID
    • Improves performance and dependability
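
Worked calculations (sketch)

The arithmetic used throughout the chapter (availability, the disk access example, the annual failure rate, and the "weakest link" analysis) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical and not from the slides, 1MB is taken as 10^6 bytes, and the component rates passed to the weakest-link helper are the figures quoted in the design example, not recomputed values.

```python
"""Back-of-the-envelope I/O calculations; an illustrative sketch only."""

HOURS_PER_YEAR = 8760


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def disk_read_time_ms(sector_bytes, rpm, seek_ms, bytes_per_sec, controller_ms):
    """Average read time on an idle disk (no queuing delay).

    Sum of seek time, half a rotation, transfer time, and controller overhead.
    """
    rotations_per_sec = rpm / 60.0
    rotational_ms = 0.5 / rotations_per_sec * 1000.0
    transfer_ms = sector_bytes / bytes_per_sec * 1000.0
    return seek_ms + rotational_ms + transfer_ms + controller_ms


def annual_failure_rate(mttf_hours):
    """Expected fraction of a large disk population failing per year."""
    return HOURS_PER_YEAR / mttf_hours


def weakest_link(ops_per_sec_by_component):
    """Return (name, rate) of the component with the lowest sustainable rate."""
    name = min(ops_per_sec_by_component, key=ops_per_sec_by_component.get)
    return name, ops_per_sec_by_component[name]


if __name__ == "__main__":
    # Disk access example: 512B sector, 15,000rpm, 100MB/s, 0.2ms controller.
    for seek_ms in (4.0, 1.0):
        t = disk_read_time_ms(512, 15_000, seek_ms, 100e6, 0.2)
        print(f"seek {seek_ms} ms: average read time {t:.1f} ms")

    # Fallacy: a quoted MTTF of 1,200,000 hours across a population of disks.
    print(f"AFR: {annual_failure_rate(1_200_000):.2%}")

    # Design example: rates (ops/sec) as quoted on the slides, random reads.
    rates = {
        "CPUs": 26_667,
        "disks": 2_424,
        "PCI-E": 31_250,
        "DRAM": 83_375,
        "FSBs": 163_080,
    }
    print("weakest link:", weakest_link(rates))
```

Keeping each formula in its own small function makes it easy to vary one parameter, such as the actual average seek time, and see how the bottleneck shifts.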