Category: System Operations
2012-06-19 12:49:40
The keepalive option can cause an otherwise good connection between
two processes to be terminated because of a temporary loss of
connectivity in the network joining the two end systems. For example, if
the keepalive probes are sent during the time that an intermediate
router has crashed and is rebooting, TCP will think that the client's
host has crashed, which is not what has happened.
A common example showing the need for the keepalive feature nowadays is
when personal computer users use TCP/IP to log in to a host using Telnet.
If they just power off the computer at the end of the day, without
logging off, they leave a half-open connection. We showed how sending
data across a half-open connection caused a reset to be returned, but
that was from the client end, where the client was sending the data. If
the client disappears, leaving the half-open connection on the server's
end, and the server is waiting for some data from the client, the server
will wait forever. The keepalive feature is intended to detect these
half-open connections from the server side.
Description
In this description we'll call the end that enables the keepalive option
the server, and the other end the client. There is nothing to stop a
client from setting this option, but normally it's set by servers. It can
also be set by both ends of a connection, if it's important for each end
to know if the other end disappears.
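As a minimal sketch of how an application might turn the option on, the
Python snippet below sets SO_KEEPALIVE on a socket. The helper name
enable_keepalive is hypothetical, and its default values simply mirror the
timings this text describes (2 hours idle, probes 75 seconds apart, 10
probes). The per-socket timer options TCP_KEEPIDLE, TCP_KEEPINTVL, and
TCP_KEEPCNT are platform-specific (available on Linux, for instance), so
the sketch applies only those the running platform exposes.

```python
import socket


def enable_keepalive(sock, idle_s=7200, interval_s=75, probes=10):
    """Enable TCP keepalive on sock (hypothetical helper, not a library API)."""
    # The portable part: ask TCP to probe the peer on an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timers are not available everywhere, so set only
    # the options that this platform's socket module actually defines.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # gap between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    # A nonzero value means the option is on.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Either end of a connection can call such a helper, which matches the point
above that the option is usually set by servers but may be set by both ends.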
If there is no activity on a given connection for 2 hours (the usual
default idle time), the server sends a probe segment to the client. At
that moment the client host must be in one of four states.
1. The client host is still up and running and reachable from the
server. The client's TCP responds normally and the server knows that the
other end is still up. The server's TCP will reset the keepalive timer
for 2 hours in the future. If there is application traffic across the
connection before the next 2-hour timer expires, the timer is reset for 2
hours in the future, following the exchange of data.
2. The client's host has crashed and is either down or in the process of
rebooting. In either case, its TCP is not responding. The server will
not receive a response to its probe and it times out after 75 seconds.
The server sends a total of 10 of these probes, 75 seconds apart, and if
it doesn't receive a response, the server considers the client's host
as down and terminates the connection.
3. The client's host has crashed and rebooted. Here the server will receive a response to its keepalive probe, but the response will be a reset, causing the server to terminate the connection.
4. The client's host is up and running, but unreachable from the
server. This is the same as scenario 2, because TCP can't distinguish
between the two. All it can tell is that no replies are received to its
probes.
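The timings in scenarios 2 and 4 imply how long a dead peer can go
unnoticed. The sketch below works through that arithmetic, assuming the
figures given above (2 hours of idle time, then 10 probes that each wait a
full 75 seconds). The function name is illustrative only, and real
implementations may differ slightly in how they count the final timeout.

```python
def keepalive_detection_seconds(idle_s=2 * 60 * 60, probes=10, interval_s=75):
    """Seconds from the last activity until TCP gives up on a silent peer."""
    # Idle period before the first probe, plus one full interval per probe.
    return idle_s + probes * interval_s


total = keepalive_detection_seconds()
hours, remainder = divmod(total, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{total} s = {hours} h {minutes} min {seconds} s")
# prints: 7950 s = 2 h 12 min 30 s
```

Under these assumptions a server blocked in a read learns of the failure
roughly 2 hours and 12.5 minutes after the last exchange of data.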
The server does not have to worry about the client's host being shut
down and then rebooted. (This refers to an operator shutdown, instead of
the host crashing.) When the system is shut down by an operator, all
application processes are terminated (including the client process),
which causes the client's TCP to send a FIN on the connection. Receiving
the FIN would cause the server's TCP to report an end-of-file to the
server process, allowing the server to detect this scenario.
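To illustrate the end-of-file case, here is a small Python sketch in which
a read loop stops when recv returns an empty byte string, which is how the
socket API reports that the peer has closed its end. It is only a
stand-in: socket.socketpair() creates a connected local pair rather than a
real TCP connection across a network, and the function name
read_until_eof is made up for this example.

```python
import socket


def read_until_eof(conn, bufsize=4096):
    """Collect data from conn until the peer closes its end."""
    chunks = []
    while True:
        data = conn.recv(bufsize)
        if not data:
            # Empty result: the peer closed cleanly (a FIN on a TCP socket).
            break
        chunks.append(data)
    return b"".join(chunks)


# Stand-in for a client and server; not a real network connection.
client, server = socket.socketpair()
client.sendall(b"last words")
client.close()  # orderly close, like a process ended by a shutdown
print(read_until_eof(server))
server.close()
```

No keepalive probe is involved here: the orderly close alone is enough for
the reading side to notice that the other end has gone away.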
In the first scenario the server application has no idea that the
keepalive probes are taking place. Everything is handled
at the TCP layer. It's transparent to the application until one of
scenarios 2, 3, or 4 occurs. In these three scenarios an error is
returned to the server application by its TCP. (Normally the server has
issued a read from the network, waiting for data from the client. If the
keepalive feature returns an error, it is returned to the server as the
return value from the read.) In scenario 2 the error is something like
"connection timed out," and in scenario 3 we expect "connection reset by
peer." The fourth scenario may look like the connection timed out, or
may cause another error to be returned, depending on whether an ICMP
error related to the connection is received.
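The mapping from scenarios to errors can be sketched as a small lookup.
The text above gives only the error messages; the errno constants used
here (ETIMEDOUT, ECONNRESET, EHOSTUNREACH, ENETUNREACH) are the values
that typically carry those messages on POSIX systems and are an assumption
of this example, as is the helper name explain_read_error.

```python
import errno

# Assumed errno values for the errors described in the text.
SCENARIO_BY_ERRNO = {
    errno.ETIMEDOUT: "scenario 2 or 4: no reply to any keepalive probe",
    errno.ECONNRESET: "scenario 3: peer rebooted and replied with a reset",
    errno.EHOSTUNREACH: "scenario 4: ICMP error reported host unreachable",
    errno.ENETUNREACH: "scenario 4: ICMP error reported network unreachable",
}


def explain_read_error(exc):
    """Describe which keepalive scenario an OSError from a read may indicate."""
    return SCENARIO_BY_ERRNO.get(exc.errno, "unrelated to keepalive")


if __name__ == "__main__":
    for code in (errno.ETIMEDOUT, errno.ECONNRESET, errno.EHOSTUNREACH):
        print(explain_read_error(OSError(code, "simulated read failure")))
```

A server would typically call something like this from the except branch
around its blocking read. The same errors can also arise without keepalive,
for example while retransmitting data, so the result is a hint and not a
diagnosis.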