Category: System Operations
2012-06-19 13:44:01
In
Chapter 2 we described the concept of the path MTU. It is the minimum
MTU on any network that is currently in the path between two hosts. Path
MTU discovery entails setting the "don't fragment" (DF) bit in the IP
header to discover if any router on the current path needs to fragment
IP datagrams that we send. In Chapter 11 we showed the ICMP unreachable
error returned by a router that is asked to forward an IP datagram with
the DF bit set when the MTU is less than the datagram size.
TCP's
path MTU discovery operates as follows. When a connection is
established, TCP uses the minimum of the MTU of the outgoing interface
and the MSS announced by the other end as the starting segment size.
Path MTU discovery does not allow TCP to exceed the MSS announced by the
other end. If the other end does not specify an MSS, it defaults to
536. It is also possible for an implementation to save path MTU
information on a per-route basis.
Once the initial segment size is
chosen, all IP datagrams sent by TCP on that connection have the DF bit
set. If an intermediate router needs to fragment a datagram that has
the DF bit set, it discards the datagram and generates the ICMP "can't
fragment" error.
If this ICMP error is received, TCP decreases
the segment size and retransmits. If the router generated the newer form
of this ICMP error, the segment size can be set to the next-hop MTU
minus the sizes of the IP and TCP headers. If the older ICMP error is
returned, the probable value of the next smallest MTU must be tried.
When a retransmission caused by this ICMP error occurs, the congestion
window should not change, but slow start should be initiated.
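As a rough sketch of the size adjustment just described, assuming 20-byte
IP and TCP headers with no options (the function and variable names are
made up for illustration):

    #include <stdint.h>

    #define IP_HDR_LEN   20    /* IPv4 header without options */
    #define TCP_HDR_LEN  20    /* TCP header without options  */

    /* New segment size after an ICMP "can't fragment" error that
     * carries the next-hop MTU (the newer, RFC 1191 form).  The
     * result is still bounded by the MSS announced by the other end. */
    uint16_t new_segment_size(uint16_t icmp_nextmtu)
    {
        return icmp_nextmtu - IP_HDR_LEN - TCP_HDR_LEN;
    }

If only the older form of the ICMP error is returned, there is no
next-hop MTU to work from, and an implementation typically steps down
through a table of likely MTU values, such as the plateau table in
RFC 1191.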
Since
routes can change dynamically, when some time has passed since the last
decrease of the path MTU, a larger value (up to the minimum of the MSS
announced by the other end and the outgoing interface MTU) can be tried.
RFC 1191 recommends this time interval be about 10 minutes.
Given
the normal default MSS of 536 for nonlocal destinations, path MTU
discovery avoids fragmentation across intermediate links with an MTU of
less than 576 (which is rare). It can also avoid fragmentation on local
destinations when an intermediate link (e.g., an Ethernet) has a smaller
MTU than the end-point networks (e.g., a token ring).
But for
path MTU discovery to be more useful, and take advantage of wide area
networks with MTUs greater than 576, implementations must stop using a
default MSS of 536 bytes for nonlocal destinations. A better choice for
the MSS is the MTU of the outgoing interface (minus the size of the IP
and TCP headers, of course).
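For illustration, on Linux and most BSD-derived systems the MTU of the
outgoing interface can be read with the SIOCGIFMTU ioctl; the following
sketch derives an MSS from it (the interface name passed in and the
fixed 20-byte header sizes are assumptions):

    #include <string.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Derive an MSS from the MTU of a given outgoing interface. */
    int mss_from_interface(const char *ifname)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        return ifr.ifr_mtu - 20 - 20;   /* minus IP and TCP headers */
    }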
Window Scale Option
The
window scale option increases the definition of the TCP window from 16
to 32 bits. Instead of changing the TCP header to accommodate the larger
window, the header still holds a 16-bit value, and an option is defined
that applies a scaling operation to the 16-bit value. TCP then
maintains the "real" window size internally as a 32-bit value.
This
option can only appear in a SYN segment; therefore the scale factor is
fixed in each direction when the connection is established. To enable
window scaling, both ends must send the option in their SYN segments.
The end doing the active open sends the option in its SYN, but the end
doing the passive open can send the option only if the received SYN
specifies the option. The scale factor can be different in each
direction.
If the end doing the active open sends a nonzero scale
factor, but doesn't receive a window scale option from the other end,
it sets its send and receive shift count to 0. This lets newer systems
interoperate with older systems that don't understand the new option.
Assume
we are using the window scale option, with a shift count of S for
sending and a shift count of R for receiving. Then every 16-bit
advertised window that we receive from the other end is left shifted by R
bits to obtain the real advertised window size. Every time we send a
window advertisement to the other end, we take our real 32-bit window
size and right shift it S bits, placing the resulting 16-bit value in
the TCP header.
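In code, the scaling just described is nothing more than a pair of
shifts; a minimal sketch using the S and R names from the text:

    #include <stdint.h>

    /* R is our receive shift count, applied to window advertisements
     * we receive; S is our send shift count, applied to window
     * advertisements we send. */
    uint32_t window_received(uint16_t advertised_window, int R)
    {
        return (uint32_t)advertised_window << R;   /* real offered window */
    }

    uint16_t window_to_advertise(uint32_t real_window, int S)
    {
        return (uint16_t)(real_window >> S);   /* 16-bit header value */
    }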
The shift count is automatically chosen by TCP,
based on the size of the receive buffer. The size of this buffer is set
by the system, but the capability is normally provided for the
application to change it.
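With the sockets API, for example, the application normally influences
the shift count indirectly by enlarging the receive buffer before the
connection is established; a sketch (the 256-Kbyte figure is only an
example, not a recommendation):

    #include <sys/socket.h>

    /* Must be called before connect() or listen(); a receive buffer
     * larger than 65535 bytes is what leads TCP to choose a nonzero
     * shift count for the connection. */
    int enlarge_receive_buffer(int sockfd)
    {
        int bufsize = 256 * 1024;
        return setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF,
                          &bufsize, sizeof(bufsize));
    }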
Timestamp Option
The
timestamp option lets the sender place a timestamp value in every
segment. The receiver reflects this value in the acknowledgment,
allowing the sender to calculate an RTT for each received ACK. (We must
say "each received ACK" and not "each segment" since TCP normally
acknowledges multiple segments per ACK.) We said that many current
implementations only measure one RTT per window, which is OK for windows
containing eight segments. Larger window sizes, however, require better
RTT calculations.
The sender places a 32-bit value in the first
field, and the receiver echoes this back in the reply field. TCP headers
containing this option will increase from the
normal 20 bytes to 32 bytes.
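The on-the-wire layout, taken from RFC 1323 (kind 8, length 10), can be
pictured as the following C structure; the structure name is only for
illustration:

    #include <stdint.h>

    /* Timestamp option: kind 8, length 10, then two 32-bit fields.
     * With the 2 bytes of padding that normally precede it, the
     * option adds 12 bytes, growing the header from 20 to 32 bytes. */
    struct tcp_timestamp_option {
        uint8_t  kind;       /* 8 */
        uint8_t  length;     /* 10 */
        uint32_t tsval;      /* timestamp value            */
        uint32_t tsecr;      /* timestamp echo reply field */
    } __attribute__((packed));

When an ACK arrives, the sender obtains an RTT measurement for that ACK
simply by subtracting the echoed tsecr value from its current timestamp
clock.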
The
timestamp is a monotonically increasing value. Since the receiver
echoes what it receives, the receiver doesn't care what the timestamp
units are. This option does not require any form of clock
synchronization between the two hosts. RFC 1323 recommends that the
timestamp value increment by one between 1 ms and 1 second.
The
specification of this option during connection establishment is handled
the same way as the window scale option. The end doing the active open
specifies the option in its SYN. Only if it receives the option in the
SYN from the other end can the option be sent in future segments.
We've
seen that a receiving TCP does not have to acknowledge every data
segment that it receives. Many implementations send an ACK for every
other data segment. If the receiver sends an ACK that acknowledges two
received data segments, which received timestamp is sent back in the
timestamp echo reply field?
To minimize the amount of state
maintained by either end, only a single timestamp value is kept per
connection. The algorithm to choose when to update this value is simple;
a short code sketch of it follows the two cases listed below.
1. TCP keeps track of the timestamp value to send in the next ACK (a variable named tsrecent) and the acknowledgment sequence number from the last ACK that was sent (a variable named lastack). This sequence number is the next sequence number the receiver is expecting.
2.
When a segment arrives, if the segment contains the byte numbered
lastack, then the timestamp value from the segment is saved in tsrecent.
3.
Whenever a timestamp option is sent, tsrecent is sent as the timestamp
echo reply field and the sequence number field is saved in lastack.
This algorithm handles the following two cases:
1.
If ACKs are delayed by the receiver, the timestamp value returned as
the echo value will correspond to the earliest segment being
acknowledged.
2. If a received segment is in-window but
out-of-sequence, implying that a previous segment has been lost, when
that missing segment is received, its timestamp will be echoed, not the
timestamp from the out-of-sequence segment.
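A minimal sketch of this bookkeeping, using the tsrecent and lastack
names from the text (real code must compare sequence numbers with
modular arithmetic, which is omitted here):

    #include <stdint.h>

    /* Per-connection state, with the variable names used in the text. */
    struct ts_state {
        uint32_t tsrecent;   /* timestamp to send in the next ACK          */
        uint32_t lastack;    /* ACK sequence number from the last ACK sent */
    };

    /* Step 2: when a segment arrives, save its timestamp only if the
     * segment contains the byte numbered lastack. */
    void ts_segment_arrived(struct ts_state *ts, uint32_t seq,
                            uint32_t len, uint32_t tsval)
    {
        if (seq <= ts->lastack && ts->lastack < seq + len)
            ts->tsrecent = tsval;
    }

    /* Step 3: whenever a timestamp option is sent, echo tsrecent and
     * remember the acknowledgment sequence number just sent. */
    uint32_t ts_option_to_send(struct ts_state *ts, uint32_t ack_field)
    {
        ts->lastack = ack_field;
        return ts->tsrecent;   /* goes in the timestamp echo reply field */
    }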
T/TCP: A TCP Extension for Transactions
TCP
provides a virtual-circuit transport service. There are three distinct
phases in the life of a connection: establishment, data transfer, and
termination. Applications such as remote login and file transfer are
well suited to a virtual-circuit service.
Other applications,
however, are designed to use a transaction service. A transaction is a
client request followed by a server response with the following
characteristics:
1. The overhead of connection establishment and connection termination should be avoided. When possible, send one request packet and receive one reply packet.
2. The latency
should be reduced to RTT plus SPT, where RTT is the round-trip time and
SPT is the server processing time to handle the request.
3.
The server should detect duplicate requests and not replay the
transaction when a duplicate request arrives. (Avoiding the replay means
the server does not process the request again. The server sends back
the saved reply corresponding to that request.)
One application
that we've already seen that uses this type of service is the Domain
Name System (Chapter 14), although the DNS is not concerned with the
server replaying duplicate requests. Today the choice an application
designer has is TCP or UDP. TCP provides too many features for
transactions, and UDP doesn't provide enough. Usually the application is
built using UDP (to avoid the overhead of TCP connections) but many of
the desirable features (dynamic timeout and retransmission, congestion
avoidance, etc.) are placed into the application, where they're
reinvented over and over again.
A better solution is to
provide a transport layer that provides efficient handling of
transactions. The transaction protocol we describe in this section is
called T/TCP. Our description is from its definition, RFC 1379 [Braden
1992b] and [Braden 1992c].
Most TCPs require 7 segments to open
and close a connection. Three more segments are then added: one with the
request, another with the reply and an ACK of the request, and a third
with the ACK of the reply. If additional control bits are added onto the
segments (that is, the first segment contains a SYN, the client request,
and a FIN), the client still sees a minimal overhead of twice the RTT
plus SPT. (Sending a SYN along with data and a FIN is legal; whether
current TCPs handle it correctly is another question.)
Another problem with TCP is the TIME_WAIT state and its required 2MSL wait. This limits the transaction rate between two hosts to about 268 per second.
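(The figure follows from dividing the available port numbers by the 2MSL
wait: assuming an MSL of 2 minutes, as in Berkeley-derived
implementations, 2MSL is 240 seconds, and the roughly 64,512 usable
port numbers (65,536 minus the 1,024 reserved ports) divided by 240
seconds gives about 268 connections, and hence transactions, per second.)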
The
two modifications required for TCP to handle transactions are to avoid
the three-way handshake and shorten the TIME_WAIT state. T/TCP avoids
the three-way handshake by using an accelerated open:
1. It assigns a 32-bit connection count (CC) value to connections it opens, either actively or passively. A host's CC value is assigned from a global counter that gets incremented by 1 each time it's used.
2. Every segment between two hosts using T/TCP includes a new TCP option named CC. This option has a length of 6 bytes and contains the sender's 32-bit CC value for the connection.
3. A host maintains a per-host cache of the last CC value received in an acceptable SYN segment from that host.
4.
When a CC option is received on an initial SYN, the receiver compares
the value with the cached value for the sender. If the received CC is
greater than the cached CC, the SYN is new and any data in the segment
is passed to the receiving application (the server). The connection is
called half-synchronized. If the received CC is not greater than the
cached CC, or if the receiving host doesn't have a cached CC for this
client, the normal TCP three-way handshake is performed (a sketch of
this test follows the list).
5. The SYN, ACK segment in response to an initial SYN echoes the received CC value in another new option named CCECHO.
6.
The CC value in a non-SYN segment detects and rejects any duplicate
segments from previous incarnations of the same connection.
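The comparison in step 4 is essentially a sequence-space test on the two
CC values; a sketch with made-up names:

    #include <stdbool.h>
    #include <stdint.h>

    /* The server's test when an initial SYN arrives carrying a CC
     * option.  CC values, like sequence numbers, are compared with
     * modular arithmetic so the test survives wraparound.  A true
     * result means the SYN is new (half-synchronized connection);
     * false means the normal three-way handshake is performed. */
    bool syn_is_new(bool have_cached_cc, uint32_t cc_cached,
                    uint32_t cc_received)
    {
        if (!have_cached_cc)
            return false;   /* no cached CC for this client host */
        return (int32_t)(cc_received - cc_cached) > 0;
    }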
The
accelerated open avoids the need for a three-way handshake unless either
the client or server has crashed and rebooted. The cost is that the
server must remember the last CC received from each client.
The
TIME_WAIT state is shortened by calculating the TIME_WAIT delay
dynamically, based on the measured RTT between the two hosts. The
TIME_WAIT delay is set to 8 times RTO, the retransmission timeout value.
Using these features the minimal transaction sequence is an exchange of three segments:
1. Client to server, caused by an active open: client-SYN, client-data (the request), client-FIN, and client-CC.
When
the server TCP with the passive open receives this segment, if the
client-CC is greater than the cached CC for this client host, the
client-data is passed to the server application, which processes the
request.
2. Server to client: server-SYN, server-data (reply),
server-FIN, ACK of client-FIN, server-CC, and CCECHO of client-CC. Since
TCP acknowledgments are cumulative, this ACK of the client FIN
acknowledges the client's SYN, data, and FIN. When the client TCP
receives this segment it passes the reply to the client application.
3. Client to server: ACK of server-FIN, which acknowledges the server's SYN, data, and FIN.
The time the client sees from sending its request to receiving the reply is RTT plus SPT.
There are many fine points to the implementation of this TCP option that are covered in the references. We summarize them here:
1.
The server's SYN, ACK (the second segment) should be delayed, to allow
the reply to piggyback with it. (Normally the ACK of a SYN is not
delayed.) It can't delay too long, or the client will time out and
retransmit.
2. The request can require multiple segments, but the
server must handle their possible out-of-order arrival. (Normally when
data arrives before the SYN, the data is discarded and a reset is
generated. With T/TCP this out-of-order data should be queued instead.)
3.
The API must allow the server process to send data and close the
connection in a single operation to allow the FIN in the second segment
to piggyback with the reply. (Normally the application would write the
reply, causing a data segment to be sent, and then close the connection,
causing the FIN to be sent.)
4. The client is sending data in the first segment before receiving an MSS announcement from the server. To avoid restricting the client to an MSS of 536, the MSS for a given host should be cached along with its CC value.
5. The client is
also sending data to the server without receiving a window advertisement
from the server. T/TCP suggests a default window of 4096 bytes and also
caching the congestion threshold for the server.
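For completeness, this is roughly how a client drove the minimal
three-segment exchange with the historical BSD sockets extension for
T/TCP: a single sendto() carrying the MSG_EOF flag took the place of
connect(), write(), and shutdown(), producing the one segment with SYN,
data, and FIN. T/TCP itself was later removed from the BSDs, so treat
this as a sketch of a historical API rather than something expected to
build on current systems:

    #include <stddef.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    /* Historical T/TCP client (BSD extension, since removed): one
     * sendto() with MSG_EOF performs the implied connect, sends the
     * request, and closes our half of the connection in one call. */
    int ttcp_transaction(const struct sockaddr_in *server,
                         const void *request, size_t reqlen,
                         void *reply, size_t replylen)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        ssize_t n;

        if (fd < 0)
            return -1;
        if (sendto(fd, request, reqlen, MSG_EOF,
                   (const struct sockaddr *)server, sizeof(*server)) < 0) {
            close(fd);
            return -1;
        }
        n = read(fd, reply, replylen);   /* the server's reply (segment 2) */
        close(fd);
        return (int)n;
    }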