Category: LINUX

2012-01-10 23:02:15


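From TCP/IP Illustrated, Volume 1, we know that TCP data flows come in two forms. One is interactive, for example the traffic produced by typing commands over rlogin. The other is bulk, where the sender always fills the window, as in a server-centric (non-P2P) download service. This post introduces a third kind, called a thin stream. Many applications today, online games for example, are driven by user behavior, so their traffic is time-dependent: a sudden burst of data appears while the user is playing, then nothing for a while (the user may be doing something else), and then another sudden burst.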

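Andreas Petlund studied many applications of this time-dependent, interactive kind and found that, in the current Linux implementation, the fast retransmit mechanism adds a large delay to thin streams. Why? Fast retransmit is triggered only once 3 duplicate ACKs have been received. First, under network conditions such as those in China, packets often reach the receiver out of order. Second, the receiver sends a duplicate ACK only when further data arrives, so every duplicate ACK implies that another packet got through. Now suppose one packet in the stream is dropped. The sender receives a single duplicate ACK (one that does not acknowledge the highest sequence number sent), has no new data to send, and still has 3 packets in the network. All it can do is wait for those packets to be received and for their duplicate ACKs to come back one by one; only after the third one arrives does it fast-retransmit the lost packet. Otherwise it simply waits for the RTO to expire and then resends the lost packet. A thin stream rarely has enough packets in flight to produce 3 duplicate ACKs, so it usually ends up on the slow RTO path.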
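The arithmetic behind this can be shown with a minimal sketch in C. It is a toy model, not kernel code: it assumes the oldest of the packets in flight is the one that was lost, that every later packet arrives, and that each arrival produces exactly one duplicate ACK. The function names are hypothetical.

```c
#include <stdbool.h>

/* Classic fast-retransmit threshold: three duplicate ACKs. */
#define DUPACK_THRESHOLD 3u

/* Toy model: the oldest of `in_flight` packets is lost and every later
 * packet arrives, each one eliciting one duplicate ACK. */
unsigned dupacks_after_loss(unsigned in_flight)
{
    return in_flight > 0 ? in_flight - 1 : 0;
}

/* True if the sender can reach the threshold without sending new data. */
bool fast_retransmit_possible(unsigned in_flight)
{
    return dupacks_after_loss(in_flight) >= DUPACK_THRESHOLD;
}
```

Under these assumptions a sender needs at least 4 packets in flight before a single loss can trigger fast retransmit; with fewer, only the RTO can recover the loss unless the application sends more data.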

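Andreas Petlund ran many experiments and modified both the fast retransmit mechanism in Linux and the way the RTO is recalculated after a timeout. The new behavior is disabled by default, so it does not interfere with the existing fast retransmit algorithm. It can be enabled per socket with setsockopt or system-wide with sysctl. The algorithm has two parts, described below: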
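As a rough illustration of the setsockopt route, the sketch below turns on both options for one TCP socket. The helper name `enable_thin_stream` is hypothetical, and the fallback values 16 and 17 are an assumption taken from the Linux `<linux/tcp.h>` header for the case where the libc headers do not define the constants. Whether the options have any effect depends on the running kernel.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers (assumed values from <linux/tcp.h>). */
#ifndef TCP_THIN_LINEAR_TIMEOUTS
#define TCP_THIN_LINEAR_TIMEOUTS 16
#endif
#ifndef TCP_THIN_DUPACK
#define TCP_THIN_DUPACK 17
#endif

/* Hypothetical helper: enable both thin-stream options on one TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_thin_stream(int fd)
{
    int on = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_LINEAR_TIMEOUTS,
                   &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK,
                   &on, sizeof(on)) < 0)
        return -1;
    return 0;
}
```

For the system-wide route, kernels that carry this feature expose the matching switches under /proc/sys/net/ipv4/ as tcp_thin_linear_timeouts and tcp_thin_dupack; writing 1 to either one enables it for all TCP connections.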

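Part 1: tcp_thin_dupack

tcp_time_to_recover decides whether to start fast retransmit. Under the classic rule it does so once 3 duplicate ACKs have been received.

The key code is the thin-stream check at the end of the function (highlighted in red in the original post).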

static int tcp_time_to_recover(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    __u32 packets_out;

    /* Do not perform any recovery during F-RTO algorithm */
    if (tp->frto_counter)
        return 0;

    /* Trick#1: The loss is proven. */
    if (tp->lost_out)
        return 1;

    /* Not-A-Trick#2 : Classic rule... */
    /* This checks whether enough duplicate ACKs have arrived.
     * tp->reordering is 3 by default.
     */
    if (tcp_dupack_heuristics(tp) > tp->reordering)
        return 1;

    /* Trick#3 : when we use RFC2988 timer restart, fast
     * retransmit can be triggered by timeout of queue head.
     */
    if (tcp_is_fack(tp) && tcp_head_timedout(sk))
        return 1;

    /* Trick#4: It is still not OK... But will it be useful to delay
     * recovery more?
     */
    packets_out = tp->packets_out;
    if (packets_out <= tp->reordering &&
        tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
        !tcp_may_send_now(sk)) {
        /* We have nothing to send. This connection is limited
         * either by receiver window or by application.
         */
        return 1;
    }

    /* If a thin stream is detected, retransmit after first
     * received dupack. Employ only if SACK is supported in order
     * to avoid possible corner-case series of spurious retransmissions
     * Use only if there are no unsent data.
     */
    /* Notes on the condition below:
     *  - tp->thin_dupack is 1 if the application set the TCP_THIN_DUPACK
     *    option via setsockopt, otherwise 0.
     *  - sysctl_tcp_thin_dupack is the proc parameter with the same
     *    meaning; 1 enables the thin-stream optimization.
     *  - tcp_stream_is_thin(tp) decides whether the flow is thin: fewer
     *    than 4 packets in flight and not in the initial slow start.
     *  - tcp_dupack_heuristics(tp) > 1: the first duplicate ACK has arrived.
     *  - tcp_is_sack(tp): SACK is enabled on this connection.
     *  - !tcp_send_head(sk): there is no new data waiting to be sent.
     */
    if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
        tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
        tcp_is_sack(tp) && !tcp_send_head(sk))
        return 1;

    return 0;
}
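The thin-stream condition at the end of tcp_time_to_recover can be tried outside the kernel with a small model. This is only a sketch: `struct thin_model` and both function names are hypothetical stand-ins, and the dupack heuristic is approximated as sacked_out + 1, which is an assumption about the non-FACK case rather than something stated in the post.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fields the check reads. */
struct thin_model {
    unsigned packets_out;       /* packets currently in flight */
    unsigned sacked_out;        /* packets SACKed by the receiver */
    bool in_initial_slowstart;  /* still in the initial slow start? */
    bool sack_enabled;          /* SACK negotiated on this connection */
    bool has_unsent_data;       /* new data is waiting to be sent */
    bool thin_dupack;           /* socket option or sysctl is on */
};

/* Thin means fewer than 4 packets in flight, past initial slow start. */
bool model_stream_is_thin(const struct thin_model *m)
{
    return m->packets_out < 4 && !m->in_initial_slowstart;
}

/* Mirrors the last condition of tcp_time_to_recover(). */
bool model_thin_dupack_fires(const struct thin_model *m)
{
    /* Assumption: heuristic approximated as sacked_out + 1. */
    unsigned dupack_heuristic = m->sacked_out + 1;

    return m->thin_dupack && model_stream_is_thin(m) &&
           dupack_heuristic > 1 && m->sack_enabled &&
           !m->has_unsent_data;
}
```

With 3 packets in flight, one SACKed packet, SACK enabled and nothing left to send, the model reports that a retransmission would start after the first duplicate ACK; raising packets_out to 4 or queuing unsent data turns the shortcut off again.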


Part 2: tcp_thin_linear_timeouts

tcp_retransmit_timer is the handler that runs when the RTO expires. If the stream is thin, the new RTO is not set to twice the previous value.

void tcp_retransmit_timer(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (!tp->packets_out)
        goto out;

    WARN_ON(tcp_write_queue_empty(sk));

    if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
        !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
        /* Receiver dastardly shrinks window. Our retransmits
         * become zero probes, but we should not timeout this
         * connection. If the socket is an orphan, time it out,
         * we cannot allow such beasts to hang infinitely.
         */
#ifdef TCP_DEBUG
        struct inet_sock *inet = inet_sk(sk);
        if (sk->sk_family == AF_INET) {
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                           &inet->inet_daddr, ntohs(inet->inet_dport),
                           inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
        else if (sk->sk_family == AF_INET6) {
            struct ipv6_pinfo *np = inet6_sk(sk);
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
                           &np->daddr, ntohs(inet->inet_dport),
                           inet->inet_num, tp->snd_una, tp->snd_nxt);
        }
#endif
#endif
        if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
            tcp_write_err(sk);
            goto out;
        }
        tcp_enter_loss(sk, 0);
        tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
        __sk_dst_reset(sk);
        goto out_reset_timer;
    }

    if (tcp_write_timeout(sk))
        goto out;

    if (icsk->icsk_retransmits == 0) {
        int mib_idx;

        if (icsk->icsk_ca_state == TCP_CA_Recovery) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL;
            else
                mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL;
        } else if (icsk->icsk_ca_state == TCP_CA_Loss) {
            mib_idx = LINUX_MIB_TCPLOSSFAILURES;
        } else if ((icsk->icsk_ca_state == TCP_CA_Disorder) ||
                   tp->sacked_out) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKFAILURES;
            else
                mib_idx = LINUX_MIB_TCPRENOFAILURES;
        } else {
            mib_idx = LINUX_MIB_TCPTIMEOUTS;
        }
        NET_INC_STATS_BH(sock_net(sk), mib_idx);
    }

    if (tcp_use_frto(sk)) {
        tcp_enter_frto(sk);
    } else {
        tcp_enter_loss(sk, 0);
    }

    if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
        /* Retransmission failed because of local congestion,
         * do not backoff.
         */
        if (!icsk->icsk_retransmits)
            icsk->icsk_retransmits = 1;
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                  min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
                                  TCP_RTO_MAX);
        goto out;
    }

    /* Increase the timeout each time we retransmit. Note that
     * we do not increase the rtt estimate. rto is initialized
     * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
     * that doubling rto each time is the least we can get away with.
     * In KA9Q, Karn uses this for the first few times, and then
     * goes to quadratic. netBSD doubles, but only goes up to *64,
     * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
     * defined in the protocol as the maximum possible RTT. I guess
     * we'll have to use something other than TCP to talk to the
     * University of Mars.
     *
     * PAWS allows us longer timeouts and large windows, so once
     * implemented ftp to mars will work nicely. We will have to fix
     * the 120 second clamps
     */
    icsk->icsk_backoff++;
    icsk->icsk_retransmits++;

out_reset_timer:
    /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
     * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
     * might be increased if the stream oscillates between thin and thick,
     * thus the old value might already be too high compared to the value
     * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
     * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
     * exponential backoff behaviour to avoid continue hammering
     * linear-timeout retransmissions into a black hole
     */
    /* Notes on the condition below:
     *  - tp->thin_lto is set by the TCP_THIN_LINEAR_TIMEOUTS socket option.
     *  - sysctl_tcp_thin_linear_timeouts is the proc parameter with the
     *    same effect as thin_lto.
     *  - TCP_THIN_LINEAR_RETRIES is a constant equal to 6, so linear
     *    timeouts apply only to the first 6 RTOs. I was not sure at first
     *    why this limit exists; the comment above gives the reason, which
     *    is to stop linear-timeout retransmissions from hammering a black
     *    hole indefinitely.
     */
    if (sk->sk_state == TCP_ESTABLISHED &&
        (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
        tcp_stream_is_thin(tp) &&
        icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
        icsk->icsk_backoff = 0;
        icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
    } else {
        /* Use normal (exponential) backoff */
        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
    }
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
    if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
        __sk_dst_reset(sk);

out:;
}
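To see what linear timeouts change in practice, the sketch below replays the RTO update at out_reset_timer in user space. It is a simplified model under stated assumptions: the stream stays thin and established for every timeout, base_rto_ms stands in for the value __tcp_set_rto(tp) would return, times are in milliseconds, and the 120 s cap stands in for TCP_RTO_MAX. All names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_RTO_MAX_MS     120000u  /* stand-in for TCP_RTO_MAX */
#define MODEL_LINEAR_RETRIES 6u       /* TCP_THIN_LINEAR_RETRIES */

/* RTO to arm after the n-th consecutive timeout (n starts at 1). */
unsigned model_next_rto(unsigned cur_rto_ms, unsigned base_rto_ms,
                        unsigned nth_timeout, bool thin_lto)
{
    unsigned next;

    if (thin_lto && nth_timeout <= MODEL_LINEAR_RETRIES)
        next = base_rto_ms;       /* linear: reset to the base RTO */
    else
        next = cur_rto_ms * 2;    /* normal exponential backoff */

    return next < MODEL_RTO_MAX_MS ? next : MODEL_RTO_MAX_MS;
}

/* RTO armed after `timeouts` consecutive timeouts, starting from base. */
unsigned model_rto_after(unsigned base_rto_ms, unsigned timeouts,
                         bool thin_lto)
{
    unsigned rto = base_rto_ms;

    for (unsigned n = 1; n <= timeouts; n++)
        rto = model_next_rto(rto, base_rto_ms, n, thin_lto);
    return rto;
}
```

In this model, with a base RTO of 200 ms, a normal stream waits 1600 ms after its third consecutive timeout, while a thin stream with linear timeouts still waits 200 ms after its sixth and only starts doubling from the seventh.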


Todo:

1. How can we tell whether a given workload is a good fit for this algorithm?


References:

1. Kernel implementation:

tcp_thin_linear_timeouts:

tcp_thin_dupack:

2. Paper:
