PSaAP II 2004-03-19 1/52

TCP and beyond: Protocols for reliable transfer in high-bandwidth long-distance networks
Petr Holub
hopet@ics.muni.cz

Overview
· Traditional TCP and its problems
  · multiple stream TCP
· Improvements to TCP
  · Scalable TCP
  · HS TCP
· TCP improvements based on additional information provided by the network
  · QuickStart
  · E-TCP
· Non-TCP approaches

Reliable transfer protocols
· ensuring reliability of transfer
  · retransmission of lost data
· overload prevention
  · for both network and receiver
· behavior assessment
  · aggressiveness - utilization of available bandwidth
  · responsiveness - loss recovery capability
  · fairness - obtaining a fair share of bandwidth for multiple network participants
· the problem we have: fat long pipes

Traditional TCP
· flow control vs. congestion control
[Figure: sender, network, and receiver; flow control (rwnd) vs. congestion control (cwnd)]

Traditional TCP (3)
· Flow control
  · deterministic, precise (rwnd)
· Congestion control
  · rough estimate
· owin = min(cwnd, rwnd)
· bw = owin * 8 * packet_size / rtt

Traditional TCP (4)
· Congestion control
  · traditionally based on AIMD (additive increase, multiplicative decrease)
  · cwnd += 1 MSS per successful RTT when above ssthresh
  · cwnd *= 0.5 on each loss event
· Reno: fast retransmit (loss detected by receiving 3 duplicate ACKs) and fast recovery (skipping slowstart)
[Figure: Tahoe vs. Reno]

Traditional TCP (5)
· TCP Vegas
  · tries to avoid congestion by monitoring RTT
  · if RTT increases (suggesting that congestion is imminent), it decreases cwnd linearly
  · measures available bandwidth based on inter-packet spacing

Traditional TCP (6)
· Reaction to loss (what gets retransmitted)
  · TCP Tahoe (whole current window (owin))
  · TCP Reno (one segment in "Fast Retransmit" mode)
  · TCP NewReno (more segments in "Fast Retransmit" mode)
  · TCP SACK (lost packets only)
· Issue of large enough cwnd for fast long-distance networks

TCP Response function
· relates throughput bw (or window size owin) to the steady-state packet loss rate p
  · owin ~ 1.2/sqrt(p)
  · bw = 8*MSS*owin/RTT
  · bw = (8*MSS/RTT)*1.2/sqrt(p)
· Traditional TCP responsiveness
  · assume that packet loss is experienced when cwnd = bw*RTT
  · recovery time = bw * RTT^2 / (2 * MSS)   (bw and MSS in consistent units)

Traditional TCP responsiveness
[Figure: TCP responsiveness; recovery time (s) vs. RTT (ms) for C = 622 Mbit/s, C = 2.5 Gbit/s, and C = 10 Gbit/s]

TCP - fairness

TCP - fairness (2)

Some remarks on fairness
· assessment of fairness
  · for streams with varying RTT
  · for streams with different MTU

Going multi-stream
· improves performance when a single packet loss occurs
· multiple packet loss can influence all streams
· when multiple
· many real applications in use: bbftp, GridFTP, Internet Backplane Protocol

Going multi-stream (2)
· disadvantages of multiple streams
  · more complicated (usually requires several threads)
  · startup and shutdown times are not improved substantially
  · may result in synchronous overloading of router buffers

Possible implementation improvements
· Cooperation with hardware
  · TCP Checksum Offloading (both Rx and Tx)
· Zero copy
  · networking usually involves several copies: from userland process to kernel and from kernel to NIC, and vice versa when receiving data
  · page flipping (moving pages from user to kernel space and vice versa)
  · implementations for Linux, FreeBSD, and Solaris

Web100
· TCP instrumentation for Linux kernel and userland interfaces
· interface for monitoring kernel parameters connected with the networking stack and overall performance
· a number of tunable parameters
· advanced auto-tuning support

Web100 (2)
· Kernel instrumentation set (TCP-KIS)
  · instrumentation inside kernel
  · monitoring various kernel structures relevant to TCP
  · currently more than 125 "instruments"
  · exposing information via /proc
· Web100 library provides access to variables/instruments
· Userland utilities (both command-line and GUI)

Beyond the Traditional TCP

GridDT
· correction of ssthresh
· modification of congestion control
  · cwnd += a for each RTT without packet loss
  · cwnd = b * cwnd on each loss event
· faster slowstart
· modification of sender's stack only

GridDT fairness

GridDT example
[Figure: testbed topology. Sunnyvale, Starlight (CHI), and CERN (GVA); Host #1 and Host #2 attached via 1 GE; GbE switch; routers (R); POS 2.5 Gb/s, POS 10 Gb/s, and 10GE links; bottleneck marked]

TCP Reno performance (see slide #8):
· First stream GVA <-> Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
· Second stream GVA <-> CHI: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
· Link utilization: 71.6%

GridDT tuning in order to improve fairness between two TCP streams with different RTT:
· First stream GVA <-> Sunnyvale: RTT = 181 ms, additive increment A = 7; average throughput = 330 Mb/s
· Second stream GVA <-> CHI: RTT = 117 ms, additive increment B = 3; average throughput = 388 Mb/s
· Link utilization: 71.8%

[Figure: Throughput of two streams with different RTT sharing a 1 Gbps bottleneck; throughput (Mbps) vs. time (s); series: A=7, RTT=181 ms and B=3, RTT=117 ms, each with its average over the life of the connection]

Scalable TCP
· Modification of congestion control mechanism
  · cwnd += 0.01 * cwnd ... for each RTT without loss
  · cwnd = 0.875 * cwnd ... on each loss event
  · with a constant number of RTTs between losses, independent of bandwidth
· no longer AIMD => MIMD
· switches to AIMD for smaller window sizes and when more losses occur

Scalable TCP (2)

Scalable TCP - fairness

Scalable TCP - response curve

Scalable TCP
· http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
· Tom Kelly, Scalable TCP: Improving Performance in Highspeed Wide Area Networks. Submitted for publication, December 2002. http://www-lce.eng.cam.ac.uk/~ctk21/papers/scalable_improve_hswan.pdf

HSTCP
· Emulates behavior of standard TCP for congested (= high packet loss rate) and low-bandwidth networks
· Modification of congestion control mechanism
  · cwnd += a(cwnd) ... for each RTT without loss
  · cwnd = (1-b(cwnd)) * cwnd ... on each loss event
· RFC 3649

HSTCP (2)
· suggested parameterization
  · b(w) ~ -0.4 * (log(w) - 3.64) / 7.69 + 0.5
  · a(w) ~ (2*w^2*b(w))/((2-b(w))*w^1.2*12.8)
· possible Linear High Speed equivalent to Scalable TCP
· comparison with multiple streams
  · N(W) ~ 0.23 * W^0.4

cwnd: Traditional TCP vs. HSTCP

HSTCP (3)
· neither Scalable TCP nor HSTCP handles the slowstart phase in any advanced way...
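
The update rules on the Scalable TCP and HSTCP slides can be compared with a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: cwnd is counted in segments, updates happen once per RTT, all function names are hypothetical, and the switch to standard TCP behavior at or below 38 segments (Low_Window) is taken from RFC 3649 rather than from the slides. The constants 3.64 and 7.69 in b(w) correspond to ln(38) and ln(83000) - ln(38).

```python
import math

LOW_WINDOW = 38  # assumption from RFC 3649: at or below this, HSTCP acts like standard TCP


def hstcp_b(w):
    """Decrease factor b(w) from the suggested parameterization (w in segments)."""
    return -0.4 * (math.log(w) - 3.64) / 7.69 + 0.5


def hstcp_a(w):
    """Additive increase a(w), in segments per loss-free RTT."""
    b = hstcp_b(w)
    return (2 * w**2 * b) / ((2 - b) * w**1.2 * 12.8)


def equivalent_streams(w):
    """N(W): number of parallel standard TCP streams a window W roughly matches."""
    return 0.23 * w**0.4


def next_cwnd(cwnd, loss, scheme):
    """One per-RTT window update for 'reno', 'scalable', or 'hstcp'."""
    if scheme == "reno" or (scheme == "hstcp" and cwnd <= LOW_WINDOW):
        return cwnd * 0.5 if loss else cwnd + 1.0
    if scheme == "scalable":
        return cwnd * 0.875 if loss else cwnd * 1.01
    if scheme == "hstcp":
        return cwnd * (1 - hstcp_b(cwnd)) if loss else cwnd + hstcp_a(cwnd)
    raise ValueError(f"unknown scheme: {scheme}")


def rtts_to_recover(w, scheme):
    """Loss-free RTTs needed to regrow the window to w after a single loss."""
    cwnd = next_cwnd(w, True, scheme)
    rtts = 0
    while cwnd < w:
        cwnd = next_cwnd(cwnd, False, scheme)
        rtts += 1
    return rtts


if __name__ == "__main__":
    for w in (1000, 10000, 83000):
        row = {s: rtts_to_recover(w, s) for s in ("reno", "scalable", "hstcp")}
        print(w, row, "N(W) ~ %.1f" % equivalent_streams(w))
```

In this model the recovery time of standard TCP grows linearly with the window, Scalable TCP needs the same number of RTTs at every window size, and HSTCP falls in between.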

HSTCP (4)

Quickstart or Limited Slowstart
· strong suspicion that there is no reasonable way to improve the slowstart phase without interaction with the network
· proposes a 4-byte option for the IP header comprising two fields: QS TTL and Initial rate

Quickstart (2)
· A sender willing to use QS sets QS TTL to a random value and Initial rate to the desired value, and sends a SYN packet.
· All routers on the way to the receiver that understand and approve QS decrement QS TTL by 1 and decrease Initial rate if needed.

Quickstart (3)
· The receiver sends feedback in the SYN/ACK packet, so the sender knows whether all the routers on the way participated and has an RTT measurement.
· The sender sets an adequate initial congestion window and then uses AIMD as usual.
· Requires changing IP layer :-((

E-TCP
· Explicit Congestion Notification (ECN)
  · the bit is set by routers before reaching line/buffer saturation
  · ECN flag must be reflected by the receiver
  · TCP is expected to react to it in the same way as to congestion (packet loss)
· E-TCP
  · suggests reflecting the ECN flag just once
  · freeze cwnd when an ECN-marked ACK arrives

E-TCP (2)
· requires modification of Active Queue Management to allow small losses to reintroduce multiplicative decrease for fair temporal behavior
· problems
  · with ECN: most router admins don't bother to configure it
  · with AQM: the same as above
  · change ECN behavior on receivers

Other protocols
· FAST - Fast AQM Scalable TCP
  · uses end-to-end delay, ECN, and loss for congestion avoidance/detection
  · http://netlab.caltech.edu/FAST/
  · http://netlab.caltech.edu/pub/papers/FAST-infocom2004.pdf

Non-TCP based protocols

tsunami
· TCP connection for out-of-band control channel
  · transfer parameters negotiation, retransmission requests, end of transmission negotiation
  · NACKs instead of ACKs
· UDP data channel
  · exponential increase, exponential back-off
· highly tunable: speedup/slowdown factors, error threshold, maximum retransmission queue, retransmission request interval
· http://www.anml.iu.edu/anmlresearch.html

XCP
· per packet feedback from routers
[Figure: congestion header carrying Round Trip Time, Congestion Window, and Feedback; Feedback = +0.1 packet]

XCP (2)
[Figure: congestion header with Round Trip Time and Congestion Window; Feedback = +0.1 packet and Feedback = -0.3 packet]

XCP (3)
· Congestion Window = Congestion Window + Feedback

Other approaches
· SCTP
  · multistreaming, multi-homed transport
  · http://www.sctp.org/
· DCCP
  · unreliable protocol with congestion control mechanisms
  · http://www.ietf.org/html.charters/dccp-charter.html
  · http://www.icir.org/kohler/dcp/
· STP
  · simple protocol easily implemented in hw; no sophisticated congestion control etc.
  · http://lwn.net/2001/features/OLS/pdf/pdf/stlinux.pdf
· Reliable UDP
  · ensures reliable and in-order delivery
  · why??? Cisco folks needed some job perhaps...
  · http://www.watersprings.org/pub/id/draft-ietf-sigtran-reliable-udp-00.txt
· XTP and a bunch of other guys...

Concluding remarks
· Current status
  · multi-stream TCP in heavy use in Grid computing
  · when it comes to improving TCP, {HS,Scalable}TCP et al. provide the least dangerous way to go
  · use of aggressive protocols in private virtual circuits based e.g. on lambdas (CA*net4 already supports establishing these at user request!)

Concluding remarks (2)
· Interactions with link layer
  · wireless with varying delay and throughput
  · optical burst switching
· Flow-specific state in routers
  · flow-specific marking or dropping
  · may also help finish short, high-bw transmissions
  · scalability & cost :-(

Other References
· RFC 3426 General Architectural and Policy Considerations

Thank you for your attention!