The explanations of some of the parameters below come from two blog posts found on the web; the contents of those posts are also reproduced further down:
cat /proc/net/sockstat
sockets: used: total number of sockets in use, across all protocols.
TCP: inuse: number of TCP sockets currently in use.
TCP: orphan: number of orphan sockets, i.e. sockets whose fd no longer belongs to any process. After close() is called and the reference count drops to zero, the socket fd becomes an orphan; sockets in FIN-WAIT-1, FIN-WAIT-2 and similar states fall into this category, but sockets in TIME-WAIT (the tw count below) are not counted as orphans. Too many orphans consume too many resources (a single orphan takes about 64 KB), so as a defence against DDoS attacks, once the kernel's current orphan count exceeds tcp_max_orphans the socket is simply reset and moved to the CLOSED state.
TCP: tw: number of TCP connections in TIME-WAIT, waiting to be closed.
TCP: alloc (allocated): number of TCP sockets that have been allocated.
TCP: mem: amount of memory currently used by the TCP stack, in pages.
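As a small aside (not from the quoted posts), the counters above can also be read programmatically; here is a minimal C sketch that parses the TCP line of /proc/net/sockstat, assuming the field layout shown above:

/* Minimal sketch: read the "TCP:" line of /proc/net/sockstat and print its
 * counters (inuse / orphan / tw / alloc / mem, with mem in pages). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/sockstat", "r");
    char line[256];

    if (!f) {
        perror("fopen /proc/net/sockstat");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        long inuse, orphan, tw, alloc, mem;
        if (sscanf(line, "TCP: inuse %ld orphan %ld tw %ld alloc %ld mem %ld",
                   &inuse, &orphan, &tw, &alloc, &mem) == 5)
            /* "mem" is in pages; multiply by the page size (usually 4096) for bytes. */
            printf("inuse=%ld orphan=%ld tw=%ld alloc=%ld mem=%ld pages\n",
                   inuse, orphan, tw, alloc, mem);
    }
    fclose(f);
    return 0;
}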
cat /proc/sys/net/ipv4/tcp_rmem
The three values of this parameter are the min, default, and max receive buffer sizes. An application can change the receive buffer size dynamically (for example through socket options), but not outside the min and max given here.
For TCP, this setting overrides /proc/sys/net/core/rmem_default.
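As an illustration of changing the buffer per socket (my own sketch, not from the quoted text): setsockopt(SO_RCVBUF) requests a new size, the request is capped by net.core.rmem_max, and the kernel doubles the granted value for bookkeeping, which is what getsockopt() reports back. Note that setting SO_RCVBUF explicitly also disables receive-buffer autotuning for that socket.

/* Sketch: request a larger receive buffer on one socket and read back what
 * the kernel actually granted (capped by net.core.rmem_max and doubled by
 * the kernel to account for bookkeeping overhead). */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 256 * 1024;            /* 256 KB, an arbitrary example value */
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (fd < 0) { perror("socket"); return 1; }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt SO_RCVBUF");
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
        printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
    return 0;
}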
The tcp_rmem variable is pretty much the same as the tcp_wmem, except in one large area. It tells the kernel the TCP receive memory buffers instead of the transmit buffer which is defined in tcp_wmem. This variable takes 3 different values, just the same as the tcp_wmem variable.
The first value tells the kernel the minimum receive buffer for each TCP connection, and this buffer is always allocated to a TCP socket, even under high pressure on the system. This value is set to 4096 bytes, or 4 kilobytes, in newer kernels, but was in previous kernels set to 8192 bytes or 8 kilobytes. This should generally be a good value, and you should avoid raising this value if you are sporadically experiencing large bursts and high network loads since the system may get into even worse problems then.
The second value specified tells the kernel the default receive buffer allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default value used by other protocols. The default value here is 87380 bytes, or 85 kilobytes. This value is used together with tcp_adv_win_scale and tcp_app_win to calculate the TCP window size, which is discussed within the explanations of those variables. This value should under normal circumstances not be touched either since it may result in similar problems as with the first value in this variable.
This variable may give a tremendous increase in throughput on high-bandwidth networks, if used properly together with the tcp_mem and tcp_wmem variables. The tcp_rmem variable doesn't need much manual tuning, however, since the Linux 2.4 kernels already have very good autotuning for this aspect, but the other two may be worth looking at.
The third and last value specifies the maximum receive buffer that can be allocated for a TCP socket. This value is overridden by /proc/sys/net/core/rmem_max if the ipv4 value is larger than the core value, so you need to look at the core value before changing the ipv4 value. The default here is double the second value, i.e. 87380 * 2 = 174760 bytes (about 170 kilobytes). Generally this is a good value and should not need to be changed.
cat /proc/sys/net/ipv4/tcp_wmem
This variable takes 3 different values which hold information on how much TCP send-buffer memory space each TCP socket may use. Every TCP socket has this much buffer space to use before the buffer is filled up. Each of the three values is used under different conditions.
The first value in this variable tells the minimum TCP send buffer space available for a single TCP socket. This space is always allocated for a specific TCP socket opened by a program as soon as it is opened. This value is normally set to 4096 bytes, or 4 kilobytes.
The second value tells us the default buffer space allowed for a single TCP socket. If the buffer tries to grow larger than this, it may be hampered if the system is currently under heavy load and doesn't have a lot of memory available; the system may even have to drop packets if it is so heavily loaded that it cannot give out more memory than this limit. The default value set here is 16384 bytes, or 16 kilobytes. It is not very wise to raise this value, since the system is most probably already under heavy memory load, and raising it would only lead to more problems for the rest of the system. This value overrides the /proc/sys/net/core/wmem_default value used by other protocols, and is usually set to a lower value than the core value.
The third value tells the kernel the maximum TCP send buffer space. This defines the maximum amount of memory a single TCP socket may use. Per default this value is set to 131072, or 128 kilobytes. This should be a reasonable value for most circumstances, and you will most probably never need to change these values. However, if you ever do need to change it, you should keep in mind that the /proc/sys/net/core/wmem_max value overrides this value, and hence this value should always be smaller than that value.
This variable may give a tremendous increase in throughput on high-bandwidth networks, if used properly together with the tcp_mem and tcp_rmem variables. Of the three, tcp_wmem is the variable that may gain the most from this kind of tweaking. Do note that you will see almost no gain on networks slower than gigabit ethernet.
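As a quick check on the advice a couple of paragraphs back (that the third tcp_wmem value should stay below net.core.wmem_max), here is a small sketch of my own; the helper read_nth_long is purely illustrative:

/* Sketch: compare the max value of tcp_wmem with net.core.wmem_max,
 * following the advice above. */
#include <stdio.h>

static long read_nth_long(const char *path, int index)
{
    /* Return the index-th whitespace-separated number in a /proc file. */
    FILE *f = fopen(path, "r");
    long v = -1;

    if (!f)
        return -1;
    for (int i = 0; i <= index; i++)
        if (fscanf(f, "%ld", &v) != 1) { v = -1; break; }
    fclose(f);
    return v;
}

int main(void)
{
    long tcp_wmem_max = read_nth_long("/proc/sys/net/ipv4/tcp_wmem", 2);
    long core_wmem_max = read_nth_long("/proc/sys/net/core/wmem_max", 0);

    printf("tcp_wmem max = %ld, net.core.wmem_max = %ld\n",
           tcp_wmem_max, core_wmem_max);
    if (tcp_wmem_max > core_wmem_max)
        printf("note: tcp_wmem max exceeds net.core.wmem_max\n");
    return 0;
}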
cat /proc/sys/net/ipv4/tcp_mem
3093984 4125312 6187968
The kernel limits the amount of memory the TCP stack may use, and tcp_mem sets those limits; note that the unit here is pages. When the TCP stack uses fewer than 3093984 pages (11.8 GB), the kernel considers it below the low threshold and does not interfere with its memory use at all. When usage reaches 4125312 pages (15.7 GB), the kernel considers the TCP stack to have entered the memory pressure stage. Once usage exceeds 6187968 pages (23.6 GB), the kernel logs the error "Out of socket memory", and various restrictions that disturb normal TCP operation start to apply.
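To make the page-to-byte conversion concrete, here is a small sketch of my own that reads tcp_mem and prints each threshold in GB, using the page size reported by sysconf (usually 4096 bytes):

/* Sketch: print the three tcp_mem thresholds converted from pages to GB. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
    long low, pressure, high;
    double page = (double)sysconf(_SC_PAGESIZE);   /* usually 4096 bytes */
    double gb = 1024.0 * 1024.0 * 1024.0;

    if (!f) { perror("open tcp_mem"); return 1; }
    if (fscanf(f, "%ld %ld %ld", &low, &pressure, &high) != 3) {
        fprintf(stderr, "unexpected tcp_mem format\n");
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("low:      %ld pages = %.1f GB\n", low, low * page / gb);
    printf("pressure: %ld pages = %.1f GB\n", pressure, pressure * page / gb);
    printf("high:     %ld pages = %.1f GB\n", high, high * page / gb);
    return 0;
}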
----------------- The two blog posts from the web follow below ----------------------------------
From the kernel's point of view, this post explains under what circumstances TCP will produce the out of memory error.
http://blog.tsunanet.net/2011/03/out-of-socket-memory.html
The "Out of socket memory" error
I recently did some work on some of our frontend machines (on which we run Varnish) at StumbleUpon and decided to track down some of the errors the Linux kernel was regularly throwing in kern.log such as:
Feb 25 08:23:42 foo kernel: [3077014.450011] Out of socket memory
Before we get started, let me tell you that you should NOT listen to any blog or forum post without doing your homework, especially when the post recommends that you tune up virtually every TCP related knob in the kernel. These people don't know what they're doing and most probably don't understand much about TCP/IP. Most importantly, their voodoo won't help you fix your problem and might actually make it worse.
Dive in the Linux kernel
In order to best understand what's going on, the best thing is to go read the code of the kernel. Unfortunately, the kernel's error messages or counters are often imprecise, confusing, or even misleading. But they're important. And reading the kernel's code isn't nearly as hard as what people say.
The "Out of socket memory" error
The only match for "Out of socket memory" in the kernel's code (as of v2.6.38) is in net/ipv4/tcp_timer.c:
static int tcp_out_of_resources(struct sock *sk, int do_reset)
{
        struct tcp_sock *tp = tcp_sk(sk);
        int shift = 0;

        /* If peer does not open window for long time, or did not transmit
         * anything for long time, penalize it. */
        if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
                shift++;

        /* If some dubious ICMP arrived, penalize even more. */
        if (sk->sk_err_soft)
                shift++;

        if (tcp_too_many_orphans(sk, shift)) {
                if (net_ratelimit())
                        printk(KERN_INFO "Out of socket memory\n");
So the question is: when does tcp_too_many_orphans return true? Let's take a look in include/net/tcp.h:
static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
{
        struct percpu_counter *ocp = sk->sk_prot->orphan_count;
        int orphans = percpu_counter_read_positive(ocp);

        if (orphans << shift > sysctl_tcp_max_orphans) {
                orphans = percpu_counter_sum_positive(ocp);
                if (orphans << shift > sysctl_tcp_max_orphans)
                        return true;
        }

        if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
            atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
                return true;
        return false;
}
So there are two conditions that can trigger this "Out of socket memory" error:
There are "too many" orphan sockets (most common).
The socket already has the minimum amount of memory and we can't give it more because TCP is already using more than its limit.
In order to remedy your problem, you need to figure out which case you fall into. The vast majority of people (especially those dealing with frontend servers like Varnish) fall into case 1.
Are you running out of TCP memory?
Ruling out case 2 is easy. All you need is to see how much memory your kernel is configured to give to TCP vs how much is actually being used. If you're close to the limit (uncommon), then you're in case 2. Otherwise (most common) you're in case 1. The kernel keeps track of the memory allocated to TCP in multiples of pages, not in bytes. This is a first bit of confusion that a lot of people run into because some settings are in bytes and others are in pages (and most of the time 1 page = 4096 bytes).
Rule out case 2: find how much memory the kernel is willing to give to TCP:
$ cat /proc/sys/net/ipv4/tcp_mem
3093984 4125312 6187968
The values are in number of pages. They get automatically sized at boot time (values above are for a machine with 32GB of RAM). They mean:
When TCP uses less than 3093984 pages (11.8GB), the kernel will consider it below the "low threshold" and won't bother TCP about its memory consumption.
When TCP uses more than 4125312 pages (15.7GB), enter the "memory pressure" mode.
The maximum number of pages the kernel is willing to give to TCP is 6187968 (23.6GB). When we go above this, we'll start seeing the "Out of socket memory" error and Bad Things will happen.
Now let's find how much of that memory TCP actually uses.
$ cat /proc/net/sockstat
sockets: used 14565
TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894
UDP: inuse 11 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
The last value on the second line (mem 1894) is the number of pages allocated to TCP. In this case we can see that 1894 is way below 6187968, so there's no way we can possibly be running out of TCP memory. So in this case, the "Out of socket memory" error was caused by the number of orphan sockets.
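To automate this "rule out case 2" check, here is a small sketch of my own that compares the TCP page count from /proc/net/sockstat with the third tcp_mem threshold:

/* Sketch: rule out "case 2" by comparing the pages currently used by TCP
 * (the "mem" field of /proc/net/sockstat) with the hard limit tcp_mem[2]. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/sockstat", "r");
    char line[256];
    long used = -1, low, pressure, high;

    if (!f) { perror("sockstat"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        long a, b, c, d;
        if (sscanf(line, "TCP: inuse %ld orphan %ld tw %ld alloc %ld mem %ld",
                   &a, &b, &c, &d, &used) == 5)
            break;
    }
    fclose(f);

    f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
    if (!f || fscanf(f, "%ld %ld %ld", &low, &pressure, &high) != 3) {
        perror("tcp_mem");
        return 1;
    }
    fclose(f);

    printf("TCP uses %ld of %ld pages (%.2f%%)\n", used, high, 100.0 * used / high);
    if (used < pressure)
        printf("nowhere near the limit: if the error shows up, suspect orphans (case 1)\n");
    return 0;
}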
Do you have "too many" orphan sockets?
First of all: what's an orphan socket? It's simply a socket that isn't associated to a file descriptor. For instance, after you close() a socket, you no longer hold a file descriptor to reference it, but it still exists because the kernel has to keep it around for a bit more until TCP is done with it. Because orphan sockets aren't very useful to applications (since applications can't interact with them), the kernel is trying to limit the amount of memory consumed by orphans, and it does so by limiting the number of orphans that stick around. If you're running a frontend web server (or an HTTP load balancer), then you'll most likely have a sizeable number of orphans, and that's perfectly normal.
In order to find the limit on the number of orphan sockets, simply do:
$ cat /proc/sys/net/ipv4/tcp_max_orphans
65536
Here we see the default value, which is 64k. In order to find the number of orphan sockets in the system, look again in sockstat:
$ cat /proc/net/sockstat
sockets: used 14565
TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894
[...]
So in this case we have 21564 orphans. That doesn't seem very close to 65536... Yet, if you look once more at the code above that prints the warning, you'll see that there is this shift variable that has a value between 0 and 2, and that the check is testing if (orphans << shift > sysctl_tcp_max_orphans). What this means is that in certain cases, the kernel decides to penalize some sockets more, and it does so by multiplying the number of orphans by 2x or 4x to artificially increase the "score" of the "bad socket" to penalize. The problem is that due to the way this is implemented, you can see a worrisome "Out of socket memory" error when in fact you're still 4x below the limit and you just had a couple "bad sockets" (which happens frequently when you have an Internet facing service). So unfortunately that means that you need to tune up the maximum number of orphan sockets even if you're 2x or 4x away from the threshold. What value is reasonable for you depends on your situation at hand. Observe how the count of orphans in /proc/net/sockstat is changing when your server is at peak traffic, multiply that value by 4, round it up a bit to have a nice value, and set it. You can set it by doing an echo of the new value into /proc/sys/net/ipv4/tcp_max_orphans, and don't forget to update the value of net.ipv4.tcp_max_orphans in /etc/sysctl.conf so that your change persists across reboots.
That's all you need to get rid of these "Out of socket memory" errors, most of which are "false alarms" due to the shift variable of the implementation.
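Following the advice above, here is a small monitoring sketch of my own (not from the post) that reads the current orphan count and tcp_max_orphans and warns when the worst-case 4x penalty factor would cross the limit:

/* Sketch: warn if the current orphan count, multiplied by the worst-case
 * penalty factor of 4 (shift == 2 in the kernel code above), is approaching
 * tcp_max_orphans. */
#include <stdio.h>

static long current_orphans(void)
{
    FILE *f = fopen("/proc/net/sockstat", "r");
    char line[256];
    long orphan = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        long inuse, tw, alloc, mem;
        if (sscanf(line, "TCP: inuse %ld orphan %ld tw %ld alloc %ld mem %ld",
                   &inuse, &orphan, &tw, &alloc, &mem) == 5)
            break;
    }
    fclose(f);
    return orphan;
}

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_max_orphans", "r");
    long max_orphans = -1;
    long orphans = current_orphans();

    if (!f || fscanf(f, "%ld", &max_orphans) != 1) {
        perror("read tcp_max_orphans");
        return 1;
    }
    fclose(f);
    printf("orphans=%ld max=%ld\n", orphans, max_orphans);
    if (orphans * 4 > max_orphans)      /* worst case: orphans << 2 */
        printf("warning: inside the 4x penalty window; consider raising tcp_max_orphans\n");
    return 0;
}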
------------------------ The following blog post tests the system's memory usage with 2,000,000 long connections
http://blog.lifeibo.com/blog/2011/07/07/200-long-connection.html
For a server we usually think about the QPS it can sustain, but there is a class of applications where what matters is the number of connections it can hold rather than QPS (although QPS is of course still one of the performance points to consider). Such applications are typically message push systems, also called comet applications, for example chat rooms or instant-message push systems. I introduced comet applications in an earlier post, so I won't go into detail here. For this kind of system, many messages are only pushed to the client at the moment they are produced, so while no message is being produced the client's connection has to be held open. With a large number of clients this means holding a large number of connections, which we call long (persistent) connections.
First, let's look at the system resources such a service consumes: CPU, network, and memory. To get the best performance we first need to find the system's bottleneck. These long connections usually carry no data, so they can be treated as inactive connections. To the system, inactive connections consume neither CPU nor network resources, only memory. So our hypothesis is: as long as the system has enough memory, it can support the number of connections we want. Is that really the case? If it is, having the kernel maintain such an enormous set of data structures is itself quite a test.
To run the test we need a server and a large number of clients, i.e. a server program and a client program. My plan is simple: each client creates a connection and sends a request to the server; the server holds the connection and never sends back a response.
1. Preparing the server
Given the hypothesis above, we need a server with a lot of memory on which to deploy the nginx comet application. Here is the server I used:
Summary: Dell R710, 2 x Xeon E5520 2.27GHz, 23.5GB / 24GB 1333MHz
System: Dell PowerEdge R710 (Dell 0VWN1R)
Processors: 2 x Xeon E5520 2.27GHz 5860MHz FSB (16 cores)
Memory: 23.5GB / 24GB 1333MHz == 6 x 4GB, 12 x empty
Disk-Control: megaraid_sas0: Dell/LSILogic PERC 6/i, Package 6.2.0-0013, FW 1.22.02-0612,
Network: eth0 (bnx2):Broadcom NetXtreme II BCM5709 Gigabit Ethernet,1000Mb/s
OS: RHEL Server 5.4 (Tikanga), Linux 2.6.18-164.el5 x86_64, 64-bit
The server program is very simple: a comet module written on top of nginx. The module accepts a user's request and then holds the connection without returning a response. nginx's status module can be used directly to monitor the number of connections.
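The nginx module itself is not shown in the post. Purely to illustrate the "accept and hold" idea, here is a bare-bones epoll sketch of my own (port 8080 is an arbitrary choice, and this is not the author's module):

/* Bare-bones illustration of "accept and hold": an epoll loop that accepts
 * TCP connections on port 8080 and never writes back, so every accepted
 * connection simply stays open.  This is NOT the author's nginx comet module. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int epfd = epoll_create1(0);
    int one = 1;
    struct sockaddr_in addr;
    struct epoll_event ev, events[64];
    long held = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (listener < 0 || epfd < 0 ||
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1024) < 0) {
        perror("setup");
        return 1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = listener;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listener, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listener) {
                /* Keep the accepted fd open and never reply: the connection
                 * is now "held" until the client gives up. */
                int client = accept(listener, NULL, NULL);
                if (client >= 0)
                    printf("held connections: %ld\n", ++held);
            }
        }
    }
}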
The server also needs some system parameters adjusted, in /etc/sysctl.conf:
net.core.somaxconn = 2048
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
net.ipv4.tcp_mem = 786432 2097152 3145728
net.ipv4.tcp_max_syn_backlog = 16384
net.core.netdev_max_backlog = 20000
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_orphans = 131072
Run /sbin/sysctl -p to apply the settings.
The items we care about most here are:
net.ipv4.tcp_rmem configures the size of the read (receive) buffer; of the three values, the first is the minimum, the third is the maximum, and the middle one is the default. A program can change the read buffer size at run time, but not below the minimum or above the maximum. To keep the memory used per socket as small as possible, I set the default to 4096 here.
net.ipv4.tcp_wmem configures the size of the write (send) buffer in the same way.
The sizes of the read and write buffers directly determine how much kernel memory a socket occupies.
net.ipv4.tcp_mem configures the total memory available to TCP, in pages rather than bytes. When usage exceeds the second value, TCP enters "pressure" mode and tries to stabilize its memory use; it leaves pressure mode when usage drops below the first value. When usage exceeds the third value, TCP refuses to allocate new sockets, and dmesg fills with messages like "TCP: too many of orphaned sockets". (With 4 KB pages, the values above correspond to roughly 3 GB, 8 GB, and 12 GB.)
net.ipv4.tcp_max_orphans also needs to be set. It is the number of sockets not attached to any process that the system is willing to handle, which matters when a large number of connections must be established quickly. When the number of such sockets exceeds this value, dmesg will show "too many of orphaned sockets".
In addition, the server needs to open a very large number of file descriptors, for example 2,000,000. Setting such a high limit on the maximum number of file descriptors runs into some problems, which we discuss in detail below.
2. Preparing the clients
We need to build a huge number of clients, and on a single machine the number of local ports available for connecting to one service is limited: a port is a 16-bit integer, so there are only 0 to 65535, of which 0 to 1023 are reserved, leaving 1024 to 65534, i.e. 64511 usable ports. In other words, one machine can create only sixty-odd thousand long connections. To reach our target of two million connections we need roughly 34 client machines.
Of course, we could reach that client count with virtual IPs: each virtual IP can bind sixty-odd thousand ports, so 34 virtual IPs would do (see the sketch below). In my case I happened to get real machines from the company, so I used physical hosts.
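To illustrate the virtual-IP idea (this is not code from the post, and the addresses are placeholders), a client can bind() an explicit source IP before connect(), so each additional IP contributes its own ~64k local ports:

/* Illustration only: bind an explicit source IP (placeholder 10.0.0.2) before
 * connecting, so each configured/virtual IP gets its own local port space. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int connect_from(const char *src_ip, const char *dst_ip, int dst_port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in src, dst;

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = 0;                      /* let the kernel pick the local port */
    inet_pton(AF_INET, src_ip, &src.sin_addr);

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(dst_port);
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);

    if (fd < 0 ||
        bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
        connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect_from");
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = connect_from("10.0.0.2", "10.0.0.1", 8080);  /* placeholder addresses */
    printf("fd = %d\n", fd);
    return 0;
}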
Because the system's default setting limits automatically assigned local ports to the range 32768-61000, we need to change the client's /etc/sysctl.conf:
net.ipv4.ip_local_port_range = 1024 65535
/sbin/sysctl -p
The client program is a test tool written with libevent that keeps establishing new connections. A simplified stand-in is sketched below.
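The original libevent client is not included in the post; as a rough stand-in (my own sketch, plain blocking sockets rather than libevent, with 127.0.0.1:8080 as a placeholder server address), this opens a given number of connections, sends one request line on each, and then just holds the sockets:

/* Rough stand-in for the author's libevent client (not the original code):
 * open N connections to 127.0.0.1:8080, send one request line on each, and
 * then hold every socket open. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int target = (argc > 1) ? atoi(argv[1]) : 60000;   /* connections to open */
    const char *req = "GET /hold HTTP/1.1\r\nHost: test\r\n\r\n";
    struct sockaddr_in addr;
    int opened = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    for (int i = 0; i < target; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("socket/connect");      /* probably out of ports or fds */
            break;
        }
        send(fd, req, strlen(req), 0);
        opened++;                          /* the fd is intentionally never closed */
    }
    printf("holding %d connections\n", opened);
    pause();                               /* keep the process (and its sockets) alive */
    return 0;
}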
3. Because both the clients and the server need to create a large number of sockets, we have to raise the maximum number of file descriptors.
On the client side we need sixty-odd thousand sockets; I simply set the limit to 100,000 by adding the following to /etc/security/limits.conf:
admin soft nofile 100000
admin hard nofile 100000
On the server side we need 2,000,000 connections, so I wanted to set nofile to 2,000,000, and that is where the problems started.
When I set nofile to 2,000,000, it became impossible to log into the system at all. After a few attempts I found that the maximum I could set was 1,000,000. Reading the source revealed that before kernel 2.6.25 a macro hard-coded the maximum as 1024*1024, exactly one million, and that from 2.6.25 onward the limit can be changed through /proc/sys/fs/nr_open. So I upgraded the kernel to 2.6.32. For details on ulimit see the post: "An old topic: ulimit problems and their impact".
After upgrading the kernel, we continue tuning as follows:
sudo bash -c 'echo 2000000 > /proc/sys/fs/nr_open'
Now nofile can be set to two million:
admin soft nofile 2000000
admin hard nofile 2000000
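To verify the limit at run time (a small sketch of my own, not from the post), a process can print and, where permitted, raise its own RLIMIT_NOFILE:

/* Sketch: inspect and try to raise the per-process fd limit (RLIMIT_NOFILE).
 * The soft limit can be raised up to the hard limit from limits.conf; the
 * hard limit itself is capped by /proc/sys/fs/nr_open. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
    printf("nofile soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;             /* raise the soft limit to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    else
        printf("soft limit raised to %llu\n", (unsigned long long)rl.rlim_cur);
    return 0;
}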
4. Finally, during the test we kept adjusting the server's sysctl configuration according to the messages dmesg printed, and in the end the test reached 2,000,000 long connections.
To keep memory usage as low as possible, I changed nginx's request_pool_size from the default 4k to 1k, and also set the default values in net.ipv4.tcp_wmem and net.ipv4.tcp_rmem to 4k.
The data obtained from nginx's monitoring at two million connections: (nginx status output not reproduced here)
System memory usage at two million connections: (output not reproduced here)
From the final results, 2,000,000 long connections occupied roughly 18 GB of memory.