Category: System Operations
2012-06-18 14:14:23
TCP
is a connection-oriented protocol. Before either end can send data to
the other, a connection must be established between them. This
establishment of a connection between the two ends differs from a
connectionless protocol such as UDP. With UDP one end just sends a
datagram to the other end, without any preliminary handshaking.
Connection Establishment Protocol
To establish a TCP connection:
1. The requesting end (normally called the client) sends a SYN segment
specifying the port number of the server that the client wants to
connect to, and the client's initial sequence number (ISN, for example
1415531521). This is segment 1.
2. The server responds with
its own SYN segment containing the server's initial sequence number
(segment 2). The server also acknowledges the client's SYN by ACKing the
client's ISN plus one. A SYN consumes one sequence number.
3. The client must acknowledge this SYN from the server by ACKing the server's ISN plus one (segment 3).
These three segments complete the connection establishment. This is often called the three-way handshake.
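From the application's point of view the whole three-way handshake is driven by a single connect() call on the client, while the server has done a passive open; the two TCPs exchange the three segments themselves. Below is a minimal sketch of the client side using the Berkeley sockets API; the server address 192.0.2.10 and port 8888 are placeholders, not values from the text.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);      /* a TCP socket */

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(8888);                  /* server port (placeholder) */
        inet_pton(AF_INET, "192.0.2.10", &srv.sin_addr);

        /* Active open: the client TCP sends the SYN (segment 1), receives the
         * server's SYN with its ACK (segment 2), and sends the final ACK
         * (segment 3) before connect() returns successfully. */
        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect");
            return 1;
        }

        close(fd);
        return 0;
    }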
The
side that sends the first SYN is said to perform an active open. The
other side, which receives this SYN and sends the next SYN, performs a
passive open. When each end sends its SYN to establish the connection,
it chooses an initial sequence number for that connection. The ISN
should change over time, so that each connection has a different ISN.
RFC 793 [Postel 1981c] specifies that the ISN should be viewed as a
32-bit counter that increments by one every 4 microseconds. The purpose
of these sequence numbers is to prevent packets that get delayed in the
network from being delivered later and then misinterpreted as part of
an existing connection.
Connection Termination Protocol
While
it takes three segments to establish a connection, it takes four to
terminate a connection. This is caused by TCP's half-close. Since a TCP
connection is full-duplex (that is, data can be flowing in each
direction independently of the other direction), each direction must be
shut down independently. The rule is that either end can send a FIN when
it is done sending data. When a TCP receives a FIN, it must notify the
application that the other end has terminated that direction of data
flow. The sending of a FIN is normally the result of the application
issuing a close.
The receipt of a FIN only means there will be no
more data flowing in that direction. A TCP can still send data after
receiving a FIN. While it's possible for an application to take
advantage of this half-close, in practice few TCP applications use it.
We
say that the end that first issues the close (e.g., sends the first
FIN) performs the active close and the other end (that receives this
FIN) performs the passive close. Normally one end does the active close
and the other does the passive close.
Segment 4 initiates the termination of the connection and is sent when the client closes its connection.
When
the server receives the FIN it sends back an ACK of the received
sequence number plus one (segment 5). A FIN consumes a sequence number,
just like a SYN. At this point the server's TCP also delivers an
end-of-file to the application. The server then closes its connection,
causing its TCP to send a FIN (segment 6), which the client TCP must ACK
by incrementing the received sequence number by one (segment 7).
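In the sockets API the end-of-file delivered by the server's TCP shows up as a read() that returns 0, and the server's own close() is what sends its FIN (segment 6). A minimal sketch of the passive-close side, assuming connfd is an already accepted connection and <unistd.h> is included:

    char buf[4096];
    ssize_t n;

    /* read() returns 0 once the peer's FIN has arrived and all queued
     * data has been consumed: that is the end-of-file notification. */
    while ((n = read(connfd, buf, sizeof(buf))) > 0) {
        /* process n bytes of data from the client */
    }

    if (n == 0)
        close(connfd);   /* our close() sends our FIN (segment 6) */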
Connections
are normally initiated by the client, with the first SYN going from the
client to the server. Either end can actively close the connection
(i.e., send the first FIN). Often, however, it is the client that
determines when the connection should be terminated, since client
processes are often driven by an interactive user, who enters something
like "quit" to terminate.
Timeout of Connection Establishment
There are several instances in which the connection cannot be established; in one example the server host is down.
When no response arrives, the client's TCP retransmits the SYN at increasing intervals to try to establish the connection, eventually giving up (typically after about 75 seconds in Berkeley-derived implementations).
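To the application this failure surfaces as connect() returning -1 once the SYN retransmissions are exhausted. A small sketch, assuming fd and srv are set up as in the earlier client example and <errno.h> and <stdio.h> are included:

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        if (errno == ETIMEDOUT)
            fprintf(stderr, "no response to the SYNs; gave up\n");
        else if (errno == ECONNREFUSED)
            fprintf(stderr, "host is up but sent a reset (no listener)\n");
        else
            perror("connect");
    }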
Maximum Segment Size
The
maximum segment size (MSS) is the largest "chunk" of data that TCP will
send to the other end. When a connection is established, each end has
the option of announcing the MSS it expects to receive. (An MSS option
can only appear in a SYN segment.) If one end does not receive an MSS
option from the other end, a default of 536 bytes is assumed. (This
default allows a 20-byte IP header and a 20-byte TCP header to fit
into a 576-byte IP datagram.)
In general, the larger the MSS the
better, until fragmentation occurs. (This may not always be true.) A
larger segment size allows more data to be sent in each segment,
amortizing the cost of the IP and TCP headers. When TCP sends a SYN
segment, either because a local application wants to initiate a
connection, or when a connection request is received from another host,
it can send an MSS value up to the outgoing interface's MTU, minus the
size of the fixed TCP and IP headers. For an Ethernet this implies an
MSS of up to 1460 bytes. Using IEEE 802.3 encapsulation, the MSS could
go up to 1452 bytes.
The MSS lets a host limit the size of
datagrams that the other end sends it. When combined with the fact that a
host can also limit the size of the datagrams that it sends, this lets a
host avoid fragmentation when the host is connected to a network with a
small MTU.
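On Berkeley-derived and Linux stacks the MSS in effect for a connected socket can be read back with the TCP_MAXSEG socket option; on a local Ethernet it is typically 1460 (a 1500-byte MTU minus the 20-byte IP and 20-byte TCP headers). A sketch, assuming fd is a connected TCP socket and <netinet/tcp.h> and <stdio.h> are included:

    int mss;
    socklen_t len = sizeof(mss);

    /* MSS negotiated for this connection, e.g. 1460 on an Ethernet */
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("MSS = %d bytes\n", mss);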
TCP Half-Close
TCP
provides the ability for one end of a connection to terminate its
output, while still receiving data from the other end. This is called a
half-close. Few applications take advantage of this capability, as we
mentioned earlier.
To use this feature the programming
interface must provide a way for the application to say "I am done
sending data, so send an end-of-file (FIN) to the other end, but I still
want to receive data from the other end, until it sends me an
end-of-file (FIN)."
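With the sockets API that request is shutdown() with the SHUT_WR argument: it sends our FIN but leaves the receiving half of the connection open. A sketch, assuming fd is a connected TCP socket:

    /* Half-close: send our FIN, but keep reading until the peer's FIN. */
    shutdown(fd, SHUT_WR);

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* consume the data the other end is still allowed to send */
    }

    close(fd);   /* n == 0: the peer has sent its FIN; release the descriptor */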
2MSL Wait State
The
TIME_WAIT state is also called the 2MSL wait state. Every
implementation must choose a value for the maximum segment lifetime
(MSL). It is the maximum amount of time any segment can exist in the
network before being discarded. We know this time limit is bounded,
since TCP segments are transmitted as IP datagrams, and the IP datagram
has the TTL field that limits its lifetime.
Given the MSL value
for an implementation, the rule is: when TCP performs an active close,
and sends the final ACK, that connection must stay in the TIME_WAIT
state for twice the MSL. This lets TCP resend the final ACK in case this
ACK is lost (in which case the other end will time out and retransmit
its final FIN).
Another effect of this 2MSL wait is that while
the TCP connection is in the 2MSL wait, the socket pair defining that
connection (client IP address, client port number, server IP address,
and server port number) cannot be reused. That connection can only be
reused when the 2MSL wait is over.
Unfortunately most
implementations (i.e., the Berkeley-derived ones) impose a more
stringent constraint. By default a local port number cannot be reused
while that port number is the local port number of a socket pair that is
in the 2MSL wait.
Any delayed segments that arrive for a
connection while it is in the 2MSL wait are discarded. Since the
connection defined by the socket pair in the 2MSL wait cannot be reused
during this time period, when we do establish a valid connection we know
that delayed segments from an earlier incarnation of this connection
cannot be misinterpreted as being part of the new connection. (A
connection is defined by a socket pair. New instances of a connection
are called incarnations of that connection.)
It is normally the
client that does the active close and enters the TIME_WAIT state. The
server usually does the passive close, and does not go through the
TIME_WAIT state. The implication is that if we terminate a client, and
restart the same client immediately, that new client cannot reuse the
same local port number. This isn't a problem, since clients normally use
ephemeral ports, and don't care what the local ephemeral port number
is.
With servers, however, this changes, since servers use
well-known ports. If we terminate a server that has a connection
established, and immediately try to restart the server, the server
cannot assign its well-known port number to its end point, since that
port number is part of a connection that is in a 2MSL wait. It may take
from 1 to 4 minutes before the server can be restarted.
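Berkeley-derived implementations provide the SO_REUSEADDR socket option to relax this restriction: if the restarted server sets it before calling bind(), it can claim its well-known port even while an old connection using that port is still in the 2MSL wait. A sketch; the port 8888 is a placeholder and the usual socket headers are assumed:

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    /* Let bind() succeed even though the port is part of a socket
     * pair still in the TIME_WAIT (2MSL) state. */
    int on = 1;
    setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8888);        /* the server's well-known port */

    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));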
Quiet Time Concept
The 2MSL wait protects against delayed segments from an earlier incarnation of a connection being interpreted as part of a new connection that uses the same local and foreign IP addresses and port numbers. But this works only if a host with connections in the 2MSL wait does not crash.
What if a host with ports in the 2MSL wait
crashes, reboots within MSL seconds, and immediately establishes new
connections using the same local and foreign IP addresses and port
numbers corresponding to the local ports that were in the 2MSL wait
before the crash? In this scenario, delayed segments from the
connections that existed before the crash can be misinterpreted as
belonging to the new connections created after the reboot. This can
happen regardless of how the initial sequence number is chosen after the
reboot.
To protect against this scenario, RFC 793 states that TCP
should not create any connections for MSL seconds after rebooting. This
is called the quiet time.
FIN_WAIT_2 State
In the FIN_WAIT_2 state we have sent our FIN and the other end has acknowledged it. Unless we have done a half-close, we are waiting for the application on the other end to recognize that it has received an end-of-file notification and close its end of the connection, which sends us a FIN. Only when the process at the other end does this close will our end move from the FIN_WAIT_2 to the TIME_WAIT state.
This
means our end of the connection can remain in this state forever. The
other end is still in the CLOSE_WAIT state, and can remain there
forever, until the application decides to issue its close.
Reset Segments
We've
mentioned a bit in the TCP header named RST for "reset." In general, a
reset is sent by TCP whenever a segment arrives that doesn't appear
correct for the referenced connection. (We use the term "referenced
connection" to mean the connection specified by the destination IP
address and port number, and the source IP address and port number.
This is what RFC 793 calls a socket.)
A common case for
generating a reset is when a connection request arrives and no process
is listening on the destination port. With UDP, an ICMP port
unreachable is generated when a datagram arrives for a destination port
that is not in use; TCP uses a reset instead.
The normal way to
terminate a connection is for one side to send a FIN. This is sometimes
called an orderly release since the FIN is sent after all previously
queued data has been sent, and there is normally no loss of data. But
it's also possible to abort a connection by sending a reset instead of a
FIN. This is sometimes called an abortive release.
Aborting a
connection provides two features to the application: (1) any queued
data is thrown away and the reset is sent immediately, and (2) the
receiver of the RST can tell that the other end did an abort instead of a
normal close. The API being used by the application must provide a way
to generate the abort instead of a normal close.
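With the sockets API the usual way to ask for an abortive release is the SO_LINGER option with a linger time of zero: on Berkeley-derived and Linux stacks the following close() then discards any queued data and sends an RST instead of a FIN. A sketch, assuming fd is a connected TCP socket:

    /* l_onoff = 1 with l_linger = 0 turns close() into an abort:
     * queued data is discarded and an RST is sent instead of a FIN. */
    struct linger lng = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));
    close(fd);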
Detecting Half-Open Connections
A
TCP connection is said to be half-open if one end has closed or aborted
the connection without the knowledge of the other end. This can happen
any time one of the two hosts crashes. As long as there is no attempt to
transfer data across a half-open connection, the end that's still up
won't detect that the other end has crashed.
Another common cause
of a half-open connection is when a client host is powered off, instead
of terminating the client application and then shutting down the client
host. This happens when PCs are being used to run Telnet clients, for
example, and the users power off the PC at the end of the day. If there
was no data transfer going on when the PC was powered off, the server
will never know that the client disappeared. When the user comes in
the next morning, powers on the PC, and starts a new Telnet client, a
new instance of the Telnet server is started on the server host. This can
lead to many half-open TCP connections on the server host.
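A server that wants to notice such half-open connections can enable SO_KEEPALIVE on each accepted socket; TCP then probes an otherwise idle connection (after roughly two hours by default on most implementations) and drops it when the peer never answers. A short sketch, assuming connfd is the accepted connection:

    /* Probe idle connections so a crashed or powered-off peer is
     * eventually detected and the half-open connection is dropped. */
    int on = 1;
    setsockopt(connfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));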
Simultaneous Open
It
is possible, although improbable, for two applications to both perform
an active open to each other at the same time. Each end must transmit a
SYN, and the SYNs must pass each other on the network. It also requires
each end to have a local port number that is well known to the other
end. This is called a simultaneous open.
For example, one
application on host A could have a local port of 7777 and perform an
active open to port 8888 on host B. The application on host B would have
a local port of 8888 and perform an active open to port 7777 on host A.
TCP
was purposely designed to handle simultaneous opens and the rule is
that only one connection results from this, not two connections. (Other
protocol suites, notably the OSI transport layer, create two connections
in this scenario, not one.)
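To attempt a simultaneous open, each application must bind() its agreed-upon local port before calling connect(), so the peer knows where to aim its SYN. A sketch of host A's side using the ports from the example above (7777 locally, 8888 on host B); host B's address 192.0.2.11 is a placeholder, and host B would run the mirror image of this code:

    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family      = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port        = htons(7777);            /* fixed local port known to host B */
    bind(fd, (struct sockaddr *)&local, sizeof(local));

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(8888);                  /* host B's port */
    inet_pton(AF_INET, "192.0.2.11", &peer.sin_addr);

    /* If host B connects to us at the same instant, the two SYNs cross
     * in the network and a single connection results. */
    connect(fd, (struct sockaddr *)&peer, sizeof(peer));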
Simultaneous Close
We
said earlier that one side (often, but not always, the client) performs
the active close, causing the first FIN to be sent. It's also possible
for both sides to perform an active close, and the TCP protocol allows
for this simultaneous close.
With a simultaneous close the same number of segments are exchanged as in the normal close.
TCP Options
The
TCP header can contain options. The only options defined in the
original TCP specification are the end of option list, no operation, and
the maximum segment size option.
Newer RFCs, specifically RFC
1323 [Jacobson, Braden, and Borman 1992], define additional TCP options,
most of which are found only in the latest implementations.
TCP Server Design
Most
TCP servers are concurrent. When a new connection request arrives at a
server, the server accepts the connection and invokes a new process to
handle the new client. Depending on the operating system, various
techniques are used to invoke the new server. Under Unix the common
technique is to create a new process using the fork function.
Lightweight processes (threads) can also be used, if supported.
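A minimal sketch of that pattern with the sockets API: the parent blocks in accept(), and each completed connection is handed to a child created with fork(). The port is a placeholder, the backlog of 5 matches the discussion below, and error handling is omitted for brevity:

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8888);             /* placeholder port */

    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 5);                            /* backlog of 5; see the queue rules below */

    for (;;) {
        /* Take the next completed connection off TCP's queue. */
        int connfd = accept(listenfd, NULL, NULL);

        if (fork() == 0) {          /* child: handle this one client */
            close(listenfd);
            /* ... read from and write to connfd ... */
            close(connfd);
            _exit(0);
        }
        close(connfd);              /* parent: loop back to accept() the next client */
    }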
Incoming Connection Request Queue
A
concurrent server invokes a new process to handle each client, so the
listening server should always be ready to handle the next incoming
connection request. That's the underlying reason for using concurrent
servers. But there is still a chance that multiple connection requests
arrive while the listening server is creating a new process, or while
the operating system is busy running other higher priority processes.
How does TCP handle these incoming connection requests while the
listening application is busy? In Berkeley-derived implementations the
following rules apply.
1. Each listening end point has a fixed
length queue of connections that have been accepted by TCP (i.e., the
three-way handshake is complete), but not yet accepted by the
application. Be careful to differentiate between TCP accepting a
connection and placing it on this queue, and the application taking the
accepted connection off this queue.
2. The application
specifies a limit to this queue, commonly called the backlog. This
backlog must be between 0 and 5, inclusive. (Most applications specify
the maximum value of 5.)
3. When a connection request arrives
(i.e., the SYN segment), an algorithm is applied by TCP to the current
number of connections already queued for this listening end point, to
see whether to accept the connection or not. We would expect the backlog
value specified by the application to be the maximum number of queued
connections allowed for this end point, but it's not that simple.
Keep
in mind that this backlog value specifies only the maximum number of
queued connections for one listening end point, all of which have
already been accepted by TCP and are waiting to be accepted by the
application. This backlog has no effect whatsoever on the maximum number
of established connections allowed by the system, or on the number of
clients that a concurrent server can handle concurrently.
4. If
there is room on this listening end point's queue for this new
connection, the TCP module ACKs the SYN and completes the connection.
The server application with the listening end point won't see this new
connection until the third segment of the three-way handshake is
received. Also, the client may think the server is ready to receive data
when the client's active open completes successfully, before the server
application has been notified of the new connection. (If this happens,
the server's TCP just queues the incoming data.)
5. If there
is not room on the queue for the new connection, TCP just ignores the
received SYN. Nothing is sent back (i.e., no RST segment). If the
listening server doesn't get around to accepting some of the already
accepted connections that have filled its queue to the limit, the
client's active open will eventually time out.
TCP
ignores the incoming SYN when the queue is full, and doesn't respond
with an RST, because this is a soft error, not a hard error. Normally
the queue is full because the application or the operating system is
busy, preventing the application from servicing incoming connections.
This condition could change in a short while. But if the server's TCP
responded with a reset, the client's active open would abort (which is
what happens when the server isn't started at all). By ignoring the SYN,
the server forces the client TCP to retransmit the SYN later, hoping
that the queue will then have room for the new connection.