Background
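Network programmers are usually familiar with the maximum transmission unit (MTU). The payload a link can carry in one frame is limited; on Ethernet, for example, the MTU is usually set to 1500 bytes, which is the maximum length of an IP packet. TCP, the most widely used transport protocol, runs on top of IP. TCP is stream-based: it splits its data into chunks, wraps each chunk in a TCP segment, and hands the segment to IP for encapsulation and transmission. For efficiency, TCP should pick a chunk size such that each segment fits into a single IP packet without exceeding the MTU. If a segment is too large, the IP layer has to split it into several smaller packets that are reassembled at the receiver, which is inefficient. To achieve this, TCP uses an option carried in the handshake: client and server each announce a maximum segment size (MSS), so that neither side needs IP fragmentation during the transfer.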
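This negotiation normally works without problems. Things changed, however, once carriers introduced PPPoE for network access.
PPPoE carries PPP over Ethernet. It inserts a PPPoE header between IP and Ethernet (a 6-byte PPPoE header plus a 2-byte PPP header, 8 bytes in total), which in effect reduces the link MTU, to 1492 bytes on standard Ethernet. The problem is that PPPoE is invisible to the TCP/IP layers: when TCP negotiates the MSS, the MTU it sees is still the Ethernet MTU, without the PPPoE header length subtracted.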
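The arithmetic behind this mismatch can be sketched in a few lines of C. This is only an illustration: the constant and function names are made up for this example, and it assumes IPv4 and TCP headers without options (20 bytes each).

```c
/* Sizes used in the text; the names are hypothetical, chosen for this sketch. */
enum {
    ETHERNET_MTU   = 1500, /* usual Ethernet MTU */
    PPPOE_OVERHEAD = 8,    /* 6-byte PPPoE header + 2-byte PPP header */
    IP_HDR_LEN     = 20,   /* IPv4 header without options (assumption) */
    TCP_HDR_LEN    = 20    /* TCP header without options (assumption) */
};

/* Largest TCP payload that fits in one unfragmented IPv4 packet of size mtu. */
static int mss_for_mtu(int mtu)
{
    return mtu - IP_HDR_LEN - TCP_HDR_LEN;
}

/* IP MTU that remains once the PPPoE and PPP headers share the Ethernet payload. */
static int pppoe_mtu(int eth_mtu)
{
    return eth_mtu - PPPOE_OVERHEAD;
}

/* Nonzero if a full-sized segment built for mss no longer fits on the PPPoE link. */
static int overflows_pppoe_link(int mss, int eth_mtu)
{
    return mss + IP_HDR_LEN + TCP_HDR_LEN > pppoe_mtu(eth_mtu);
}
```

With these helpers, `mss_for_mtu(ETHERNET_MTU)` is 1460, the value a host announces when it only sees Ethernet, while `mss_for_mtu(pppoe_mtu(ETHERNET_MTU))` is 1452, the largest value that actually fits through the PPPoE session.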
Consider the typical topology below, in which CLIENT reaches the Internet through PPPoE and tries to access a resource on SERVER:
(CLIENT)-------(PPPoE cli) ------- (PPPoE serv) ------(INTERNET) ------(SERVER)
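step1: CLIENT sends a TCP connection request to SERVER and announces that the largest segment it supports is 1460 bytes (MTU - TCP_HEADER_LEN - IP_HEADER_LEN);
step2: SERVER replies to CLIENT and announces that it also supports segments of up to 1460 bytes;
step3: CLIENT sends a request to read the resource;
step4-1: SERVER cuts the resource into pieces of at most 1460 bytes, wraps each in a TCP segment and then an IP packet, frames it for Ethernet, and sends it toward CLIENT.
step4-2: On the way from SERVER to CLIENT the packet passes through PPPoE serv, which has to add the PPPoE header. At that point it finds that the packet, with the PPPoE header added, exceeds the MTU.
step4-3: One of three things happens: (i) SERVER marked the packet as not fragmentable, so PPPoE serv can only drop it; (ii) PPPoE serv sees that the packet is too long and silently drops it, as if nothing had happened; (iii) PPPoE serv fragments the packet and sends the fragments to CLIENT one by one.
As the steps show, what PPPoE serv does with an oversized packet is not predictable; likewise, when CLIENT sends an oversized packet, what PPPoE cli does with it is not predictable either. Unfortunately, the vast majority of PPPoE cli implementations do not fragment packets, and PPPoE serv does not always fragment either. The typical symptom is that CLIENT can establish a connection to SERVER, but large data transfers fail.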
Solution
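Once the cause is known, the fix is not difficult. In the topology above, PPPoE cli (or PPPoE serv) watches connection setup and, when CLIENT and SERVER negotiate the MSS, steps in and corrects the value: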
(CLIENT)-------(PPPoE cli) ------- (PPPoE serv) ------(INTERNET) ------(SERVER)
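step1: CLIENT sends a TCP connection request to SERVER and announces that the largest segment it supports is 1460 bytes;
step1-1: PPPoE cli intercepts the TCP MSS option and rewrites it to 1412 bytes (a conservative value; the largest MSS that still fits a 1492-byte MTU is 1492 - 40 = 1452);
step2: SERVER acknowledges the connection request, learns that CLIENT supports at most 1412 bytes, and announces that 1412 bytes is fine for it as well;
step2-1: PPPoE cli intercepts the TCP MSS option in the reply, finds that the value is not greater than 1412, and leaves it alone;
step2-2: CLIENT learns that SERVER supports segments of at most 1412 bytes;
step3: CLIENT requests the resource;
step4: SERVER splits the resource into pieces of at most 1412 bytes, builds the segments, and sends them to CLIENT;
step5: CLIENT receives the packets and confirms them with ACKs;
......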
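Steps 1-1 and 2-1 come down to one comparison on the 4-byte MSS option. The sketch below shows just that comparison on a raw option buffer; the function name `clamp_mss_option` is made up for this illustration, and it assumes `opt` already points at a complete 4-byte MSS option (kind 2, length 4, 16-bit value in network byte order).

```c
#include <stdint.h>

/*
 * Rewrite a TCP MSS option in place if it announces more than limit.
 * opt must point at 4 readable bytes (assumption of this sketch).
 * Returns 1 if the option was rewritten, 0 if it was left untouched.
 */
static int clamp_mss_option(uint8_t *opt, uint16_t limit)
{
    uint16_t mss;

    /* Only handle a well-formed MSS option: kind 2, length 4 */
    if (opt[0] != 2 || opt[1] != 4)
        return 0;

    /* The MSS value is stored big-endian in bytes 2 and 3 */
    mss = (uint16_t)((opt[2] << 8) | opt[3]);
    if (mss <= limit)
        return 0;               /* step2-1: already small enough */

    opt[2] = (uint8_t)(limit >> 8);     /* step1-1: lower the value */
    opt[3] = (uint8_t)(limit & 0xFF);
    return 1;
}
```

Applied to CLIENT's SYN (option bytes 02 04 05 B4, that is 1460) with a limit of 1412, the option becomes 02 04 05 84; applied to SERVER's reply that already carries 1412, nothing changes. A real clamp also has to update the TCP checksum after the rewrite.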
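With the walkthrough above, the problem should be largely clear. A router that uses PPPoE for its uplink usually needs to perform TCP MSS clamping. Below is the code that adds TCP MSS clamping to the kernel pppoe module: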
static uint16_t tcp_checksum(uint8_t *piphdr, uint8_t *ptcphdr)
{
    uint32_t sum = 0;
    uint16_t count;
    uint16_t tmp;
    uint8_t *addr;
    uint8_t pseudo_header[12];
    int i;

    /* Number of bytes in TCP header and data: IP total length - IP header length */
    count = piphdr[2] * 256 + piphdr[3];
    count -= (piphdr[0] & 0x0F) * 4;

    /* Pseudo-header: IP src addr, dest addr, zero, protocol, TCP length */
    memcpy(pseudo_header, piphdr + 12, 8);
    pseudo_header[8] = 0;
    pseudo_header[9] = piphdr[9];
    pseudo_header[10] = (count >> 8) & 0xFF;
    pseudo_header[11] = (count & 0xFF);

    /* Checksum the pseudo-header; memcpy avoids unaligned 16-bit loads */
    for (i = 0; i < 12; i += 2)
    {
        memcpy(&tmp, pseudo_header + i, sizeof(tmp));
        sum += (uint32_t) tmp;
    }

    /* Checksum the TCP header and data */
    addr = ptcphdr;
    while (count > 1)
    {
        memcpy(&tmp, addr, sizeof(tmp));
        sum += (uint32_t) tmp;
        addr += sizeof(tmp);
        count -= sizeof(tmp);
    }

    /* Odd trailing byte: pad it with a zero byte so the sum is right on either byte order */
    if (count > 0)
    {
        tmp = 0;
        memcpy(&tmp, addr, 1);
        sum += (uint32_t) tmp;
    }

    /* Fold the carries back into the low 16 bits */
    while (sum >> 16)
    {
        sum = (sum & 0xffff) + (sum >> 16);
    }

    return (uint16_t) ((~sum) & 0xFFFF);
}

/**
 * Detect a TCP SYN inside a PPPoE session frame and clamp its MSS option.
 */
static void clamp_mss(struct sk_buff* skb, int clamp_mss)
{
    struct tcphdr* ptcphdr;
    struct iphdr* piphdr;
    uint8_t* pppphdr;
    struct pppoe_hdr *ppppoehdr;
    int len;
    int minlen;
    int optlen;
    uint16_t csum;
    uint16_t mss = 0;
    uint8_t* opt;
    uint8_t* mssopt = NULL;

    ppppoehdr = pppoe_hdr(skb);
    pppphdr = (uint8_t*)ppppoehdr + sizeof(struct pppoe_hdr);

    /* Check the PPP protocol type */
    if (pppphdr[0] & 0x01)
    {
        /* Compressed 8-bit protocol field: IPv4 is 0x21 */
        if (pppphdr[0] != 0x21)
        {
            return;
        }

        piphdr = (struct iphdr*)(pppphdr + 1);
        minlen = 41; /* tcp header len + ip header len + ppp header len */
    }
    else
    {
        /* 16-bit protocol field: IPv4 is 0x0021 */
        if (pppphdr[0] != 0x00 || pppphdr[1] != 0x21)
        {
            return;
        }

        piphdr = (struct iphdr*)(pppphdr + 2);
        minlen = 42;
    }

    /* Is it too short? */
    len = (int)ntohs(ppppoehdr->length);
    if (len < minlen)
    {
        return;
    }

    /* Verify once more that it's IPv4 */
    if (piphdr->version != 4)
    {
        return;
    }

    /* Is it a fragment that's not at the beginning of the packet? */
    if (ntohs(piphdr->frag_off) & 0x1FFF)
    {
        return;
    }

    /* Is it TCP? */
    if (piphdr->protocol != 0x06)
    {
        return;
    }

    /* Get start of TCP header */
    ptcphdr = (struct tcphdr*)((uint8_t*)piphdr + (piphdr->ihl) * 4);

    /* Is SYN set? */
    if (!ptcphdr->syn)
    {
        return;
    }

    /* Compute and verify TCP checksum -- do not touch a packet with a bad checksum */
    csum = tcp_checksum((uint8_t*)piphdr, (uint8_t*)ptcphdr);
    if (csum)
    {
        return;
    }

    /* Look for an existing MSS option; doff is a 4-bit field, so no byte swap is needed */
    optlen = ptcphdr->doff * 4 - 20;
    if (optlen <= 0)
    {
        return;
    }

    opt = (uint8_t*)ptcphdr + 20;

    while (optlen > 0 && !mssopt)
    {
        switch (*opt)
        {
        case 0: /* end of option list: no MSS option was found */
            return;
        case 1: /* no-operation, used for padding */
            len = 1;
            break;
        case 2: /* MSS option */
            if (optlen < 4 || opt[1] != 4)
            {
                return;
            }

            len = 4;
            mss = opt[2] * 256 + opt[3];
            mssopt = opt;
            break;
        default: /* every other option carries a length byte; reject malformed lengths */
            if (optlen < 2 || opt[1] < 2 || opt[1] > optlen)
            {
                return;
            }

            len = (int)opt[1];
            break;
        }

        optlen -= len;
        opt += len;
    }

    /* If there is no MSS option or it is already low enough, do nothing */
    if (!mss || mss <= clamp_mss)
    {
        return;
    }

    mssopt[2] = (((unsigned) clamp_mss) >> 8) & 0xFF;
    mssopt[3] = ((unsigned) clamp_mss) & 0xFF;

    /* Recompute TCP checksum */
    ptcphdr->check = 0;
    csum = tcp_checksum((uint8_t*)piphdr, (uint8_t*)ptcphdr);
    ptcphdr->check = csum;
}
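The code above recomputes the checksum over the whole segment after changing the MSS. Since only one 16-bit word changes, the checksum can also be adjusted incrementally, as described in RFC 1624 (HC' = ~(~HC + ~m + m')). This is a different technique from the one the module uses, shown here only as a sketch. The helper names `csum_update16` and `csum16` are made up for this example, and the sketch assumes the MSS value sits at an even offset from the start of the TCP header, which holds when the MSS option is the first option.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Incremental update of an Internet checksum (RFC 1624, equation 3).
 * check, old_val and new_val must all be in the same byte order.
 */
static uint16_t csum_update16(uint16_t check, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_val;
    sum += new_val;

    /* Two folds are enough for a sum of three 16-bit values */
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Reference: full ones' complement checksum over len bytes, big-endian words. */
static uint16_t csum16(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8); /* pad the odd byte with zero */

    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For a segment whose MSS word changes from 1460 to 1412, `csum_update16(old_check, 1460, 1412)` gives the same value as zeroing the checksum field and running `csum16` over the modified bytes again, without touching the rest of the packet. The pseudo-header does not change, so it plays no part in the update.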