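From TCP/IP Illustrated (1) we know that TCP data flows come in two forms. One is interactive, such as the traffic produced by typing commands in rlogin.
The other is bulk, where the sender keeps the window full, for example downloads from a central server (non-P2P architecture). This post introduces a third kind, the thin stream. Many applications today, online games for instance, are driven by user behavior, so their traffic is time-dependent: while the user is playing there is a sudden burst of data, then after a pause (during which the user may be doing something else) there is another burst.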
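Andreas Petlund studied many applications of this time-dependent, interactive kind and found that in the current Linux implementation the fast retransmit mechanism adds considerable latency to thin streams. Why? Fast retransmit is triggered only after 3 duplicate ACKs have been received. Two things matter here. First, in some network environments (China's, for example) packets often arrive at the receiver out of order, which is why a single duplicate ACK is not treated as proof of loss. Second, the receiver emits a duplicate ACK only when it receives another segment after the gap.
Suppose one packet is dropped in the middle of the stream. The sender receives one duplicate ACK (an ACK that does not acknowledge the highest sequence number sent), has no new data to send, and still has 3 packets in the network. With nothing new to send, it can only wait for those 3 packets to arrive and for their duplicate ACKs to come back one by one; only when the third arrives does it fast-retransmit the lost packet. Otherwise it has to wait for the RTO to expire before resending the lost packet.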
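To see why a thin stream rarely reaches the threshold, here is a minimal sketch in C. It assumes an idealized receiver that sends exactly one duplicate ACK for every segment arriving after the hole, with no reordering and no further loss. The names max_dupacks, fast_retransmit_possible and DUPACK_THRESHOLD are made up for this illustration and do not come from the kernel.

```c
/* Classic fast retransmit threshold: 3 duplicate ACKs. */
#define DUPACK_THRESHOLD 3

/*
 * Hypothetical helper: upper bound on the duplicate ACKs the sender can
 * see when `in_flight` segments are outstanding and the segment at
 * position `lost_index` (0-based) is dropped. Only segments sent after
 * the lost one can trigger a duplicate ACK.
 */
static int max_dupacks(int in_flight, int lost_index)
{
    if (in_flight <= 0 || lost_index < 0 || lost_index >= in_flight)
        return 0;
    return in_flight - lost_index - 1;
}

/* Non-zero if enough duplicate ACKs can arrive to start fast retransmit. */
static int fast_retransmit_possible(int in_flight, int lost_index)
{
    return max_dupacks(in_flight, lost_index) >= DUPACK_THRESHOLD;
}
```

With 3 segments in flight and the first one lost, at most 2 duplicate ACKs come back, so under these assumptions the sender falls through to the RTO. With 4 in flight the threshold can just be reached.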
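Andreas Petlund ran many experiments and modified both the fast retransmit mechanism in Linux and the way the RTO is recomputed after a retransmission timeout. The new behavior is off by default, so it does not interfere with the existing fast retransmit algorithm. It can be enabled per socket through setsockopt or system-wide through sysctl. The algorithm has two parts, described below: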
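Enabling the feature per socket could look like the sketch below. It assumes a Linux system; the option names TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK come from linux/tcp.h, and the fallback values 16 and 17 are defined only in case the libc headers do not provide them. The helper name enable_thin_stream is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Values from linux/tcp.h, defined here only if the libc headers lack them. */
#ifndef TCP_THIN_LINEAR_TIMEOUTS
#define TCP_THIN_LINEAR_TIMEOUTS 16
#endif
#ifndef TCP_THIN_DUPACK
#define TCP_THIN_DUPACK 17
#endif

/*
 * Hypothetical helper: turn on both thin-stream options for one TCP socket.
 * Returns 0 on success, -1 if either setsockopt call fails (errno is set).
 */
static int enable_thin_stream(int fd)
{
    int on = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_LINEAR_TIMEOUTS,
                   &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK,
                   &on, sizeof(on)) < 0)
        return -1;
    return 0;
}
```

A caller would typically invoke enable_thin_stream(fd) right after socket() or accept(). The system-wide switches are the sysctls net.ipv4.tcp_thin_linear_timeouts and net.ipv4.tcp_thin_dupack, on kernels that provide them.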
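Part 1: tcp_thin_dupack
tcp_time_to_recover decides whether fast retransmit should start. In the classic case it returns 1 once tcp_dupack_heuristics(tp) exceeds tp->reordering, which is 3 by default.
The core code is the thin-stream check at the end of the function below.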
static int tcp_time_to_recover(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    __u32 packets_out;

    /* Do not perform any recovery during F-RTO algorithm */
    if (tp->frto_counter)
        return 0;

    /* Trick#1: The loss is proven. */
    if (tp->lost_out)
        return 1;

    /* Not-A-Trick#2 : Classic rule... */
    /* Checks whether the duplicate ACK count exceeds tp->reordering,
     * which is 3 by default.
     */
    if (tcp_dupack_heuristics(tp) > tp->reordering)
        return 1;

    /* Trick#3 : when we use RFC2988 timer restart, fast
     * retransmit can be triggered by timeout of queue head.
     */
    if (tcp_is_fack(tp) && tcp_head_timedout(sk))
        return 1;

    /* Trick#4: It is still not OK... But will it be useful to delay
     * recovery more?
     */
    packets_out = tp->packets_out;
    if (packets_out <= tp->reordering &&
        tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
        !tcp_may_send_now(sk)) {
        /* We have nothing to send. This connection is limited
         * either by receiver window or by application.
         */
        return 1;
    }

    /* If a thin stream is detected, retransmit after first
     * received dupack. Employ only if SACK is supported in order
     * to avoid possible corner-case series of spurious retransmissions
     * Use only if there are no unsent data.
     */
    /* Notes on the condition below:
     * - tp->thin_dupack is 1 if the application set the TCP_THIN_DUPACK
     *   option with setsockopt, otherwise 0.
     * - sysctl_tcp_thin_dupack is the /proc parameter with the same
     *   meaning; 1 enables the thin-stream optimization.
     * - tcp_stream_is_thin(tp) reports whether the flow is thin: fewer
     *   than 4 packets in flight and not in the initial slow start.
     * - tcp_dupack_heuristics(tp) > 1 means the first duplicate ACK
     *   has been received.
     * - tcp_is_sack(tp): SACK is enabled on this connection.
     * - !tcp_send_head(sk): there is no new data waiting to be sent.
     */
    if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
        tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
        tcp_is_sack(tp) && !tcp_send_head(sk))
        return 1;

    return 0;
}
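The thin-dupack condition can be tried outside the kernel with a small user-space model. This is only a sketch: struct flow_state and all function names are hypothetical stand-ins for the kernel fields and helpers, and the duplicate ACK heuristic is modelled as sacked_out + 1, which is an assumption about the non-FACK case rather than something stated in the listing above.

```c
#include <stdbool.h>

/* Simplified, hypothetical stand-in for the fields the kernel check reads. */
struct flow_state {
    unsigned int packets_out;   /* segments currently in flight */
    unsigned int sacked_out;    /* segments reported via SACK or dupacks */
    bool in_initial_slowstart;  /* still in the first slow start */
    bool sack_enabled;          /* SACK negotiated on the connection */
    bool has_unsent_data;       /* new data is queued for sending */
    bool thin_dupack_enabled;   /* socket option or sysctl is on */
};

/* Thin means: fewer than 4 packets in flight and past initial slow start. */
static bool stream_is_thin(const struct flow_state *f)
{
    return f->packets_out < 4 && !f->in_initial_slowstart;
}

/* Assumption: model the dupack heuristic as sacked_out + 1. */
static unsigned int dupack_heuristic(const struct flow_state *f)
{
    return f->sacked_out + 1;
}

/* True when the model would retransmit after the first duplicate ACK. */
static bool thin_dupack_triggers(const struct flow_state *f)
{
    return f->thin_dupack_enabled &&
           stream_is_thin(f) &&
           dupack_heuristic(f) > 1 &&
           f->sack_enabled &&
           !f->has_unsent_data;
}
```

In this model a flow with 3 packets in flight, one duplicate ACK, SACK enabled and nothing left to send triggers the early retransmit, while the same flow with 10 packets in flight does not.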
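Part 2: tcp_thin_linear_timeouts
tcp_retransmit_timer is the handler that runs when the RTO expires. For a thin stream it does not double the RTO on each timeout; it recomputes the RTO without backoff instead.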
void tcp_retransmit_timer(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (!tp->packets_out)
        goto out;

    WARN_ON(tcp_write_queue_empty(sk));

    if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
        !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
        /* Receiver dastardly shrinks window. Our retransmits
         * become zero probes, but we should not timeout this
         * connection. If the socket is an orphan, time it out,
         * we cannot allow such beasts to hang infinitely.
         */
#ifdef TCP_DEBUG
        struct inet_sock *inet = inet_sk(sk);
        if (sk->sk_family == AF_INET) {
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                   &inet->inet_daddr, ntohs(inet->inet_dport),
                   inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
        else if (sk->sk_family == AF_INET6) {
            struct ipv6_pinfo *np = inet6_sk(sk);
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                   &np->daddr, ntohs(inet->inet_dport),
                   inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#endif
#endif
        if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
            tcp_write_err(sk);
            goto out;
        }
        tcp_enter_loss(sk, 0);
        tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
        __sk_dst_reset(sk);
        goto out_reset_timer;
    }

    if (tcp_write_timeout(sk))
        goto out;

    if (icsk->icsk_retransmits == 0) {
        int mib_idx;

        if (icsk->icsk_ca_state == TCP_CA_Recovery) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL;
            else
                mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL;
        } else if (icsk->icsk_ca_state == TCP_CA_Loss) {
            mib_idx = LINUX_MIB_TCPLOSSFAILURES;
        } else if ((icsk->icsk_ca_state == TCP_CA_Disorder) ||
               tp->sacked_out) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKFAILURES;
            else
                mib_idx = LINUX_MIB_TCPRENOFAILURES;
        } else {
            mib_idx = LINUX_MIB_TCPTIMEOUTS;
        }
        NET_INC_STATS_BH(sock_net(sk), mib_idx);
    }

    if (tcp_use_frto(sk)) {
        tcp_enter_frto(sk);
    } else {
        tcp_enter_loss(sk, 0);
    }

    if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
        /* Retransmission failed because of local congestion,
         * do not backoff.
         */
        if (!icsk->icsk_retransmits)
            icsk->icsk_retransmits = 1;
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                      min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
                      TCP_RTO_MAX);
        goto out;
    }

    /* Increase the timeout each time we retransmit. Note that
     * we do not increase the rtt estimate. rto is initialized
     * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
     * that doubling rto each time is the least we can get away with.
     * In KA9Q, Karn uses this for the first few times, and then
     * goes to quadratic. netBSD doubles, but only goes up to *64,
     * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
     * defined in the protocol as the maximum possible RTT. I guess
     * we'll have to use something other than TCP to talk to the
     * University of Mars.
     *
     * PAWS allows us longer timeouts and large windows, so once
     * implemented ftp to mars will work nicely. We will have to fix
     * the 120 second clamps
     */
    icsk->icsk_backoff++;
    icsk->icsk_retransmits++;

out_reset_timer:
    /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
     * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
     * might be increased if the stream oscillates between thin and thick,
     * thus the old value might already be too high compared to the value
     * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
     * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
     * exponential backoff behaviour to avoid continue hammering
     * linear-timeout retransmissions into a black hole
     */
    /* Notes on the condition below:
     * - tp->thin_lto is set by the TCP_THIN_LINEAR_TIMEOUTS socket option.
     * - sysctl_tcp_thin_linear_timeouts is the /proc parameter with the
     *   same effect as thin_lto.
     * - TCP_THIN_LINEAR_RETRIES is a constant equal to 6, so linear
     *   timeouts apply only to the first 6 retransmissions.
     * - I was not sure at first why the limit is there. The kernel
     *   comment above gives the reason: after that many retries the
     *   code returns to exponential backoff so it does not keep
     *   hammering retransmissions into a black hole.
     */
    if (sk->sk_state == TCP_ESTABLISHED &&
        (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
        tcp_stream_is_thin(tp) &&
        icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
        icsk->icsk_backoff = 0;
        icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
    } else {
        /* Use normal (exponential) backoff */
        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
    }
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
    if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
        __sk_dst_reset(sk);

out:;
}
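The difference between linear and exponential timeouts is easier to see in isolation. The sketch below models only the RTO update at the end of the function; it is not kernel code. next_rto_ms, min_u and the millisecond units are hypothetical, RTO_MAX_MS assumes the 120 second clamp mentioned in the kernel comment, and THIN_LINEAR_RETRIES uses the value 6 given above.

```c
#include <stdbool.h>

#define RTO_MAX_MS          120000u  /* assumed cap: the 120 s clamp */
#define THIN_LINEAR_RETRIES 6u       /* value given for TCP_THIN_LINEAR_RETRIES */

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical model of the timer update.
 *   base_rto_ms     RTO derived from the RTT estimate, without backoff
 *   cur_rto_ms      RTO currently in use
 *   retransmits     retransmission count for the head segment
 *   thin_lto_active option enabled, stream thin, connection established
 */
static unsigned int next_rto_ms(unsigned int base_rto_ms,
                                unsigned int cur_rto_ms,
                                unsigned int retransmits,
                                bool thin_lto_active)
{
    /* Linear timeouts: fall back to the un-backed-off RTO. */
    if (thin_lto_active && retransmits <= THIN_LINEAR_RETRIES)
        return min_u(base_rto_ms, RTO_MAX_MS);

    /* Normal exponential backoff: double, then clamp. */
    return min_u(cur_rto_ms * 2u, RTO_MAX_MS);
}
```

With a base RTO of 200 ms, a thin stream keeps retrying every 200 ms for the first 6 retransmissions in this model, whereas a normal stream would wait 400 ms, 800 ms, 1600 ms and so on. From the 7th retransmission the thin stream doubles as well.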
Todo:
1. How can we tell whether a given workload is a good fit for this algorithm?
References:
1. Kernel implementation:
tcp_thin_linear_timeouts:
tcp_thin_dupack:
2. Paper: