TCP thin stream

Category: LINUX

2014-10-10 13:50:09

Original post: TCP thin stream, by asweisun_shan

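From TCP/IP Illustrated, Volume 1, we know that TCP data flows come in two forms. One is interactive, such as the traffic produced by typing commands into rlogin. The other is bulk, where the sender always fills the window, as in a server-centric (non-P2P) download service. This post introduces a third kind, the thin stream. Many applications today, online games for example, depend on user behavior, so their traffic is time-dependent: while the user is playing there is a sudden burst of data, then after a pause (during which the user may be doing something else) there is another burst.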

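Andreas Petlund studied many applications of this time-dependent, interactive kind and found that, in the Linux implementation of the time, the fast retransmit mechanism adds significant latency to thin streams. Why? Fast retransmit is triggered only once 3 duplicate ACKs have been received. Two things matter here. First, on networks in China, packets easily arrive at the receiver out of order. Second, the receiver sends a duplicate ACK only after it has received new data. Suppose a packet is dropped in the middle of the stream. The sender receives one duplicate ACK (an ACK that does not acknowledge the highest sequence number sent), has no new data to send, and still has 3 packets in the network. With nothing new to send, it can only wait for those 3 packets to be received and for their duplicate ACKs to come back one by one; only after the third does it fast-retransmit the lost packet. Otherwise it simply waits for the RTO to expire and retransmits the lost packet then.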
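To make the dependence on in-flight data concrete, here is a toy model in C. It is only a sketch under simplifying assumptions: no new data is sent after the loss, every segment after the lost one arrives, each of them produces exactly one duplicate ACK, and no ACK is lost. The function names are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define DUPACK_THRESHOLD 3  /* classic fast-retransmit threshold */

/* Duplicate ACKs the sender can expect when `in_flight` segments were sent
 * and the segment at zero-based position `lost_index` was dropped.
 * Only segments sent after the lost one can trigger a duplicate ACK. */
static int expected_dupacks(int in_flight, int lost_index)
{
    assert(in_flight > 0 && lost_index >= 0 && lost_index < in_flight);
    return in_flight - lost_index - 1;
}

/* True if enough duplicate ACKs can arrive to trigger fast retransmit. */
static bool fast_retransmit_possible(int in_flight, int lost_index)
{
    return expected_dupacks(in_flight, lost_index) >= DUPACK_THRESHOLD;
}
```

Under this model, a stream that keeps fewer than 4 segments in flight can never collect 3 duplicate ACKs, so every loss ends up waiting for the RTO.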

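Andreas Petlund ran many experiments and modified both the fast retransmit mechanism in Linux and the way the RTO is recomputed after an RTO timeout. The new behavior is disabled by default, so it does not interfere with the existing fast retransmit algorithm. It can be enabled per socket through setsockopt, or system-wide through sysctl. The algorithm is described below. It has two parts: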
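As an illustration of the per-socket route, the following C sketch turns on both options for a TCP socket. The helper name `enable_thin_stream` is hypothetical. The fallback values 16 and 17 are the ones used in the Linux `<linux/tcp.h>` header, included here only in case an older libc header lacks the definitions. Whether the options have any effect depends on the kernel version, so treat this as a starting point rather than a definitive recipe.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; values as defined in Linux's <linux/tcp.h>. */
#ifndef TCP_THIN_LINEAR_TIMEOUTS
#define TCP_THIN_LINEAR_TIMEOUTS 16
#endif
#ifndef TCP_THIN_DUPACK
#define TCP_THIN_DUPACK 17
#endif

/* Hypothetical helper: enable both thin-stream options on a TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
static int enable_thin_stream(int fd)
{
    int on = 1;

    /* Retransmit after the first duplicate ACK on a thin stream. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK, &on, sizeof(on)) < 0)
        return -1;

    /* Use linear instead of exponential RTO backoff on a thin stream. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_LINEAR_TIMEOUTS, &on, sizeof(on)) < 0)
        return -1;

    return 0;
}
```

A caller would invoke `enable_thin_stream(fd)` on a descriptor returned by `socket(AF_INET, SOCK_STREAM, 0)`. The system-wide switches are the proc parameters `tcp_thin_dupack` and `tcp_thin_linear_timeouts` under `/proc/sys/net/ipv4/` on kernels that ship them.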

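Part 1: tcp_thin_dupack

tcp_time_to_recover decides whether to start fast retransmit. In the classic case it does so once 3 duplicate ACKs have been received.

The key code is the final if statement of the function below (highlighted in red in the original post).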

static int tcp_time_to_recover(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    __u32 packets_out;

    /* Do not perform any recovery during F-RTO algorithm */
    if (tp->frto_counter)
        return 0;

    /* Trick#1: The loss is proven. */
    if (tp->lost_out)
        return 1;

    /* Not-A-Trick#2 : Classic rule... */
    /* Checks whether the duplicate-ACK count has reached the threshold;
     * tp->reordering is 3 by default.
     */
    if (tcp_dupack_heuristics(tp) > tp->reordering)
        return 1;

    /* Trick#3 : when we use RFC2988 timer restart, fast
     * retransmit can be triggered by timeout of queue head.
     */
    if (tcp_is_fack(tp) && tcp_head_timedout(sk))
        return 1;

    /* Trick#4: It is still not OK... But will it be useful to delay
     * recovery more?
     */
    packets_out = tp->packets_out;
    if (packets_out <= tp->reordering &&
        tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
        !tcp_may_send_now(sk)) {
        /* We have nothing to send. This connection is limited
         * either by receiver window or by application.
         */
        return 1;
    }

    /* If a thin stream is detected, retransmit after first
     * received dupack. Employ only if SACK is supported in order
     * to avoid possible corner-case series of spurious retransmissions
     * Use only if there are no unsent data.
     */
    /* Notes on the condition below:
     * - tp->thin_dupack is 1 if the application set the TCP_THIN_DUPACK
     *   option through setsockopt, otherwise 0.
     * - sysctl_tcp_thin_dupack is the proc parameter with the same meaning;
     *   1 enables the thin-stream optimization.
     * - tcp_stream_is_thin(tp) checks whether the flow is thin: fewer than
     *   4 packets are in the network and the connection is no longer in
     *   its initial slow start.
     * - tcp_dupack_heuristics(tp) > 1 means the first duplicate ACK has
     *   been received.
     * - tcp_is_sack(tp): SACK is enabled on this connection.
     * - !tcp_send_head(sk): there is no new data waiting to be sent.
     */
    if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
        tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
        tcp_is_sack(tp) && !tcp_send_head(sk))
        return 1;

    return 0;
}

Part 2: tcp_thin_linear_timeouts

tcp_retransmit_timer is the handler that runs when the RTO expires. For a thin stream, the new RTO is not set to twice the previous value.

void tcp_retransmit_timer(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (!tp->packets_out)
        goto out;

    WARN_ON(tcp_write_queue_empty(sk));

    if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
        !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
        /* Receiver dastardly shrinks window. Our retransmits
         * become zero probes, but we should not timeout this
         * connection. If the socket is an orphan, time it out,
         * we cannot allow such beasts to hang infinitely.
         */
#ifdef TCP_DEBUG
        struct inet_sock *inet = inet_sk(sk);
        if (sk->sk_family == AF_INET) {
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                           &inet->inet_daddr, ntohs(inet->inet_dport),
                           inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
        else if (sk->sk_family == AF_INET6) {
            struct ipv6_pinfo *np = inet6_sk(sk);
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                           &np->daddr, ntohs(inet->inet_dport),
                           inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#endif
#endif
        if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
            tcp_write_err(sk);
            goto out;
        }
        tcp_enter_loss(sk, 0);
        tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
        __sk_dst_reset(sk);
        goto out_reset_timer;
    }

    if (tcp_write_timeout(sk))
        goto out;

    if (icsk->icsk_retransmits == 0) {
        int mib_idx;

        if (icsk->icsk_ca_state == TCP_CA_Recovery) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL;
            else
                mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL;
        } else if (icsk->icsk_ca_state == TCP_CA_Loss) {
            mib_idx = LINUX_MIB_TCPLOSSFAILURES;
        } else if ((icsk->icsk_ca_state == TCP_CA_Disorder) ||
                   tp->sacked_out) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKFAILURES;
            else
                mib_idx = LINUX_MIB_TCPRENOFAILURES;
        } else {
            mib_idx = LINUX_MIB_TCPTIMEOUTS;
        }
        NET_INC_STATS_BH(sock_net(sk), mib_idx);
    }

    if (tcp_use_frto(sk)) {
        tcp_enter_frto(sk);
    } else {
        tcp_enter_loss(sk, 0);
    }

    if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
        /* Retransmission failed because of local congestion,
         * do not backoff.
         */
        if (!icsk->icsk_retransmits)
            icsk->icsk_retransmits = 1;
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                  min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
                                  TCP_RTO_MAX);
        goto out;
    }

    /* Increase the timeout each time we retransmit. Note that
     * we do not increase the rtt estimate. rto is initialized
     * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
     * that doubling rto each time is the least we can get away with.
     * In KA9Q, Karn uses this for the first few times, and then
     * goes to quadratic. netBSD doubles, but only goes up to *64,
     * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
     * defined in the protocol as the maximum possible RTT. I guess
     * we'll have to use something other than TCP to talk to the
     * University of Mars.
     *
     * PAWS allows us longer timeouts and large windows, so once
     * implemented ftp to mars will work nicely. We will have to fix
     * the 120 second clamps
     */
    icsk->icsk_backoff++;
    icsk->icsk_retransmits++;

out_reset_timer:
    /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
     * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
     * might be increased if the stream oscillates between thin and thick,
     * thus the old value might already be too high compared to the value
     * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
     * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
     * exponential backoff behaviour to avoid continue hammering
     * linear-timeout retransmissions into a black hole
     */
    /* Notes on the condition below:
     * - tp->thin_lto is set by the TCP_THIN_LINEAR_TIMEOUTS socket option.
     * - sysctl_tcp_thin_linear_timeouts is the proc parameter with the
     *   same effect as thin_lto.
     * - TCP_THIN_LINEAR_RETRIES is a constant equal to 6.
     * - I was not sure why the limit applies only to the first 6 RTOs;
     *   the kernel comment above says it is to avoid hammering
     *   retransmissions into a black hole.
     */
    if (sk->sk_state == TCP_ESTABLISHED &&
        (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
        tcp_stream_is_thin(tp) &&
        icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
        icsk->icsk_backoff = 0;
        icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
    } else {
        /* Use normal (exponential) backoff */
        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
    }
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
    if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
        __sk_dst_reset(sk);

out:;
}
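The effect of the linear-timeout branch is easiest to see by stepping through successive timeouts. The C sketch below models only the RTO update at the end of the function. `struct rto_model` and `on_rto_timeout` are hypothetical names, times are in milliseconds rather than jiffies, the RTT-derived base RTO is assumed constant, and the stream is assumed to stay thin and established throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RTO_MAX_MS          120000u  /* mirrors TCP_RTO_MAX (120 s) */
#define THIN_LINEAR_RETRIES 6u       /* mirrors TCP_THIN_LINEAR_RETRIES */

/* Hypothetical model of the timer state touched by the RTO handler. */
struct rto_model {
    uint32_t rto_ms;       /* current timeout, like icsk_rto */
    uint32_t base_rto_ms;  /* RTO without backoff, assumed constant here */
    uint32_t retransmits;  /* like icsk_retransmits */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* One RTO expiry: count the retransmission, then pick the next timeout. */
static uint32_t on_rto_timeout(struct rto_model *m, bool thin_linear)
{
    assert(m != NULL);

    m->retransmits++;
    if (thin_linear && m->retransmits <= THIN_LINEAR_RETRIES)
        m->rto_ms = min_u32(m->base_rto_ms, RTO_MAX_MS);  /* linear */
    else
        m->rto_ms = min_u32(m->rto_ms << 1, RTO_MAX_MS);  /* exponential */
    return m->rto_ms;
}
```

Starting from a 200 ms RTO, the thin-stream path in this model keeps the timeout at 200 ms for the first 6 retransmissions and only then starts doubling, whereas the normal path doubles from the first timeout.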

Todo:

1. How can one tell whether a given application is a good fit for this algorithm?

References:

1. Kernel implementation:

tcp_thin_linear_timeouts:

tcp_thin_dupack:

2. Paper:


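renyuan000 2015-06-07 19:20:56

"Only for the first 6 RTOs?" That's right. It seems the stack cannot keep retrying like this forever, so once there have been more than 6 retransmissions it switches to exponential backoff.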
