Linux/Nginx kernel tweaks/tunes - expert1 - ChinaUnix Blog

Linux/Nginx kernel tweaks/tunes

Category: System operations and maintenance

2015-08-25 11:56:14

SYN Flood Protection

These settings, added to /etc/sysctl.conf, make a server more resistant to SYN flood attacks. They configure the kernel to use the SYN cookies mechanism with a backlog queue of 1024 connections, and limit the SYN and SYN/ACK retries to an effective ceiling of about 45 seconds. The defaults for these settings vary with kernel version and distribution, so you may want to check them first with sysctl -a | grep syn

You can also increase the number of outstanding SYN requests that are allowed. Note: some people (including myself) have used tcp_syncookies to handle the problem of too many legitimate outstanding SYNs, but the kernel documentation advises against that:

Note, that syncookies is fallback facility. It MUST NOT be used to help highly loaded servers to stand against legal connection rate. If you see synflood warnings in your logs, but investigation shows that they occur because of overload with legal connections, you should tune another parameters until this warning disappear.

net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_syn_retries = 6
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syncookies = 1
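
To see where a ceiling like "about 45 seconds" can come from, here is a minimal sketch of TCP's exponential retransmission backoff. It assumes an initial retransmission timeout of 3 seconds that doubles after every attempt (newer kernels start at 1 second, which shortens every figure) and it ignores the kernel's upper cap on the timeout. The function name is made up for this illustration.

```python
def handshake_timeline(retries, initial_rto=3.0):
    """Return (send_times, give_up_time) in seconds for a handshake packet.

    Assumption: the timeout starts at initial_rto and doubles after each
    attempt; the kernel's cap on the timeout is ignored.
    """
    send_times = [0.0]                 # the original packet goes out at t=0
    elapsed, rto = 0.0, initial_rto
    for _ in range(retries):
        elapsed += rto                 # wait one timeout, then retransmit
        send_times.append(elapsed)
        rto *= 2                       # exponential backoff
    return send_times, elapsed + rto   # last wait before the attempt is dropped


if __name__ == "__main__":
    times, give_up = handshake_timeline(retries=3)   # tcp_synack_retries = 3
    print("SYN/ACK sent at", times, "- abandoned after", give_up, "seconds")
```

Under these assumptions, tcp_synack_retries = 3 means retransmissions at 3, 9 and 21 seconds and a half-open connection that is abandoned after 45 seconds.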

If you also want to raise the system-wide limit on open file handles, and want that change to persist across reboots, append this line to /etc/sysctl.conf:

fs.file-max = 100000

All TCP/IP tuning parameters are located under /proc/sys/net/... For example, here is a list of the most important tuning parameters, along with a short description of their meaning:

/proc/sys/net/core/rmem_max - maximum receive buffer size a socket may request, which bounds the TCP receive window for such sockets
/proc/sys/net/core/wmem_max - maximum send buffer size a socket may request, which bounds the TCP send window for such sockets
/proc/sys/net/ipv4/tcp_rmem - memory reserved for TCP receive buffers (min, default, max)
/proc/sys/net/ipv4/tcp_wmem - memory reserved for TCP send buffers (min, default, max)
/proc/sys/net/ipv4/tcp_timestamps - Timestamps (RFC 1323) add 12 bytes to the TCP header...
/proc/sys/net/ipv4/tcp_sack - TCP Selective Acknowledgements. They can reduce retransmissions, but they make servers more prone to DDoS attacks and increase CPU utilization.
/proc/sys/net/ipv4/tcp_window_scaling - support for large TCP windows (RFC 1323). Needs to be set to 1 if the maximum TCP window is over 65535 bytes.
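
Since every dotted sysctl name corresponds to a file under /proc/sys, the current values can be read directly. The following is a small sketch of that mapping. The helper names are my own, and it does not handle the corner case of a name component that itself contains a dot (for example some VLAN interface names).

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in ("net.core.rmem_max", "net.ipv4.tcp_rmem", "net.ipv4.tcp_sack"):
        print(key, "=", read_sysctl(key))
```

On a machine without /proc/sys (or without a given key) read_sysctl simply returns None instead of raising.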

The following settings may improve performance on a 10Gb network:

net.core.rmem_default = 8388608
net.core.rmem_max = 8388608
net.core.wmem_default = 8388608
net.core.wmem_max = 8388608
net.core.netdev_max_backlog = 10000
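
Buffer sizes like these are usually judged against the bandwidth-delay product, that is, the number of bytes that must be in flight to keep a link busy. Here is a short sketch of that arithmetic. The round-trip times are illustrative assumptions, not measurements from any real setup.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bits_per_s * rtt_seconds / 8


def max_rtt_for_buffer(buffer_bytes, bandwidth_bits_per_s):
    """Largest round-trip time at which buffer_bytes still fills the link."""
    return buffer_bytes * 8 / bandwidth_bits_per_s


if __name__ == "__main__":
    ten_gbit = 10e9
    for rtt_ms in (0.1, 1, 10):                       # hypothetical RTTs
        needed = bdp_bytes(ten_gbit, rtt_ms / 1000)
        print(f"RTT {rtt_ms} ms -> {needed:,.0f} bytes in flight")
    limit = max_rtt_for_buffer(8388608, ten_gbit)
    print(f"an 8388608-byte buffer fills 10 Gbit/s up to {limit * 1000:.2f} ms RTT")
```

By this measure an 8388608-byte buffer can keep a single 10 Gbit/s flow full up to a round-trip time of roughly 6.7 ms, so whether the values above are enough depends on the latency between the hosts.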

With access-time updates disabled through the noatime mount option, the time a file was last accessed is no longer updated every time you read the file. Since this information isn't generally useful and causes extra disk hits, it is typically disabled. An example /etc/fstab entry:

    /dev/rd/c0d0p3          /test                    ext4   noatime        1 2

# To serve a client request via an upstream application, NginX must open 2 TCP connections; one for the client, one for the connection to the upstream. If the server receives many connections, this can rapidly saturate the system’s available port capacity. The net.ipv4.ip_local_port_range directive increases the range to much larger than the default, so we have room for more allocated ports. If you're seeing errors in your /var/log/syslog such as: “possible SYN flooding on port 80. Sending cookies” it might mean the system can’t find an available port for the pending connection. Increasing the capacity will help alleviate this symptom.

For a web server, the destination address and the destination port are likely to be constant. If your web server is behind an L7 load-balancer, the source address will also be constant. On Linux, the client port is by default allocated from a range of about 30,000 ports (this can be changed by tuning net.ipv4.ip_local_port_range). Since a closed connection keeps its port in TIME_WAIT for about a minute, only about 30,000 connections can be established between the load-balancer and the web server every minute, or about 500 connections per second.

net.ipv4.ip_local_port_range='1024 65535'

Or use more server ports by asking the web server to listen to several additional ports (81, 82, 83, …)
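
The "about 500 connections per second" figure is simple arithmetic: the number of usable source ports divided by the time each one stays blocked. A minimal sketch follows, assuming a 60-second TIME_WAIT period and taking 32768-61000 as the older default port range (check your own with sysctl net.ipv4.ip_local_port_range).

```python
def max_new_connections_per_second(low, high, hold_seconds=60):
    """Upper bound on new connections/s to one destination (address, port).

    Each closed connection keeps its source port busy for hold_seconds
    (the TIME_WAIT period), so the size of the port range divided by
    that period bounds the sustainable connection rate.
    """
    ports = high - low + 1
    return ports / hold_seconds


if __name__ == "__main__":
    # Assumed older default range versus the widened range from this post.
    print("default range:", max_new_connections_per_second(32768, 61000))
    print("1024-65535:   ", max_new_connections_per_second(1024, 65535))
```

Widening the range to 1024-65535 roughly doubles the ceiling, which helps but does not remove the limit. Additional listening ports multiply it again because each destination port has its own set of source ports.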

# When the server has to cycle through a high volume of TCP connections, it can build up a large number of connections in TIME_WAIT state. TIME_WAIT means a connection is closed but the allocated resources are yet to be released. Setting this directive to 1 will tell the kernel to try to recycle the allocation for a new connection when safe to do so. This is cheaper than setting up a new connection from scratch.

Note: The tcp_tw_reuse setting is particularly useful in environments where numerous short connections are open and left in TIME_WAIT state, such as web servers. Reusing the sockets can be very effective in reducing server load.

net.ipv4.tcp_tw_reuse='1'   # may bring about some side effects
# The related net.ipv4.tcp_tw_recycle option is riskier: when the remote host is in fact a NAT device, the condition on timestamps will forbid all of the hosts behind the NAT device except one from connecting for one minute, because they do not share the same timestamp clock. When in doubt, it is far better to disable this option, since it leads to problems that are difficult to detect and difficult to diagnose.

On the server side, do not enable net.ipv4.tcp_tw_recycle unless you are pretty sure you will never have NAT devices in the mix. Enabling net.ipv4.tcp_tw_reuse is useless for incoming connections.

# This setting determines how long the kernel keeps an orphaned connection in the FIN_WAIT_2 state before dropping it. It is often described as the TIME_WAIT timeout, but on Linux the TIME_WAIT interval itself is fixed at 60 seconds. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. Adjust this in the presence of many connections lingering after close:

net.ipv4.tcp_fin_timeout='15'
net.core.netdev_max_backlog='4096'
net.core.rmem_max='16777216'
net.core.somaxconn='4096'
net.core.wmem_max='16777216'
net.ipv4.tcp_max_syn_backlog='20480'
net.ipv4.tcp_max_tw_buckets='400000'
net.ipv4.tcp_no_metrics_save='1'
net.ipv4.tcp_rmem='4096 87380 16777216'
net.ipv4.tcp_syn_retries='2'
net.ipv4.tcp_synack_retries='2'
net.ipv4.tcp_wmem='4096 65536 16777216'
vm.min_free_kbytes='65536'
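
The settings in this post appear in two styles: sysctl.conf-style key = value lines and shell-style key='value' assignments. As a small convenience sketch (the function name and behaviour are my own assumption, not part of any sysctl tooling), the following parses either style into a dictionary so the values can be compared or rewritten in one format.

```python
def parse_sysctl_lines(text):
    """Parse 'key = value' or key='value' lines into a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and blanks
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


if __name__ == "__main__":
    sample = """
    net.ipv4.tcp_fin_timeout='15'
    net.ipv4.tcp_rmem='4096 87380 16777216'
    net.core.somaxconn = 4096   # plain sysctl.conf style
    """
    for key, value in parse_sysctl_lines(sample).items():
        print(f"{key} = {value}")                # normalised sysctl.conf form
```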

We can add the following settings to the /etc/sysctl.conf file to tune individual parameters. To reduce the number of connections tracked in TIME_WAIT state, we can decrease the number of seconds the connection-tracking table keeps them in this state before dropping them:

# reduce TIME_WAIT from the 120s default to 30-60s

net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# reduce FIN_WAIT from the 120s default to 30-60s

net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30

As load on our web servers continually increased, we started hitting some odd limitations in our NginX cluster. I noticed connections were being throttled or dropped, and the kernel was complaining about SYN flooding with the error message I mentioned earlier. Frustratingly, I knew the servers could handle more, because the load average and CPU usage were negligible.

TCP:   47461 (estab 311, closed 47135, orphaned 4, synrecv 0, timewait 47135/0), ports 33938

47,135 connections in TIME_WAIT! Moreover, netstat indicates that they are all closed connections. This suggests the server is burning through a large portion of the available port range, which implies that it is allocating a new port for each connection it’s handling. Tweaking the networking settings helped firefight the problem a bit, but the socket range was still getting saturated.
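
To keep an eye on this number over time, the summary line can be parsed mechanically. This is a rough sketch that assumes the exact line format shown above (it resembles the TCP line of the socket summary printed by ss -s, but the layout may differ between tool versions). The function name is hypothetical.

```python
import re


def parse_tcp_summary(line):
    """Extract the counters from a 'TCP: N (estab N, closed N, ...)' line."""
    counters = {}
    total = re.match(r"\s*TCP:\s+(\d+)", line)
    if total:
        counters["total"] = int(total.group(1))
    # Each counter is a lowercase word followed by a number. In
    # 'timewait 47135/0' the number after the slash is ignored here.
    for name, value in re.findall(r"([a-z]+) (\d+)", line):
        counters[name] = int(value)
    return counters


if __name__ == "__main__":
    line = ("TCP:   47461 (estab 311, closed 47135, orphaned 4, "
            "synrecv 0, timewait 47135/0), ports 33938")
    stats = parse_tcp_summary(line)
    share = stats["timewait"] / stats["total"]
    print(stats)
    print(f"{share:.1%} of all TCP sockets are in TIME_WAIT")
```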

After some digging around, I uncovered some documentation about an upstream keepalive directive. The docs state:

Sets the maximum number of idle keepalive connections to upstream servers that are retained in the cache per one worker process

This is interesting. In theory, this will help minimise connection wastage by pumping requests down connections that have already been established and cached. Additionally, the documentation also states that the proxy_http_version directive should be set to “1.1” and the “Connection” header cleared. On further research, it’s clear this is a good idea since HTTP/1.1 optimises TCP connection usage much more efficiently than HTTP/1.0, which is the default in Nginx Proxy.

Making both of these changes, our upstream config looks more like:

upstream backend {
    server backend3:5016 max_fails=0 fail_timeout=10s;
    server backend4:5016 max_fails=0 fail_timeout=10s;
    server backend5:5016 max_fails=0 fail_timeout=10s;
    server backend6:5016 max_fails=0 fail_timeout=10s;
    keepalive 512;
}

server {
    listen 80;
    server_name />
    client_max_body_size 16M;
    keepalive_timeout 10;

    location / {
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_set_header   Connection "";
        proxy_http_version 1.1;
        proxy_pass />
    }
}

When I pushed out the new configuration to the nginx cluster, I noticed a 90% reduction in occupied sockets. Nginx is now able to use far fewer connections to send many requests.
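
The effect is easy to reproduce outside nginx. The following self-contained sketch is a toy stand-in, not the setup from this post: it starts a throwaway HTTP/1.1 server on localhost using Python's standard library, then counts how many TCP connections the server accepts when a client reuses one connection versus opening a new one per request. All names in it are made up for the illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

accepted = []  # one entry per TCP connection accepted by the toy server


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"      # allow persistent connections

    def setup(self):                   # runs once per accepted connection
        super().setup()
        accepted.append(self.client_address)

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):      # keep the output quiet
        pass


def count_connections(requests, reuse):
    """Send `requests` GETs and return how many connections were accepted."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    accepted.clear()
    conn = None
    try:
        for _ in range(requests):
            if conn is None or not reuse:
                if conn is not None:
                    conn.close()
                conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body before the next request
        if conn is not None:
            conn.close()
        return len(accepted)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("new connection per request:", count_connections(20, reuse=False))
    print("one kept-alive connection: ", count_connections(20, reuse=True))
```

The client that reuses its connection is accepted once by the server, while the other opens one connection per request and leaves a closed socket behind each time. That is the same saving the upstream keepalive directive provides between nginx and its backends.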

############## one more article #############

The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later re-attempt at connection succeeds.

And a very important note:

If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128. In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
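
In other words, the queue length you actually get is the smaller of what the application asks for and what the kernel allows. Here is a tiny sketch of that rule. The helper names are mine, and 511 is used as an example of a request above the limit (it is the backlog nginx asks for by default on Linux, as far as I know).

```python
def effective_backlog(requested, somaxconn=128):
    """Queue length the kernel grants to listen(fd, requested)."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn", default=128):
    """Read the current limit, falling back to `default` if unavailable."""
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    limit = read_somaxconn()
    for requested in (64, 511, 16384):
        print(f"listen(..., {requested}) with somaxconn={limit} "
              f"-> {effective_backlog(requested, limit)}")
```

This is why raising the backlog in the application alone changes nothing: with somaxconn at 128, a request for 511 is silently cut down to 128.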

We were dropping packets since the backlog queue was filling up. Worse, clients will wait 3 seconds before re-sending the SYN, and then 9 seconds if that SYN doesn’t get through again.

Another symptom we saw when looking at /var/log/messages was this message showing up

[84440.731929] possible SYN flooding on port 80. Sending cookies.

Were we being SYN flooded? Not an unreasonable thing to expect with the servers exposed to the internet, but it turns out this message can send you looking in the wrong direction. Couldn’t we just turn off [...]

After fixing the backlog issue, it was time to review our existing sysctl settings. We’ve had some tunings in place for a while, but it had been some time since they were reviewed to ensure they still made sense for us. There’s a lot of bad information on the web about tuning TCP settings under sysctl that people just blindly apply to their servers. Often these resources don’t bother explaining why they set a certain sysctl parameter; they just give you a file to put in place and tell you it will give you the best performance. You should be sure you fully understand any value you change under sysctl. You can seriously affect the performance of your server with the wrong values, or even by enabling certain options in the wrong environment. The TCP man page and TCP/IP Illustrated: The Implementation, Vol 2 were great resources in helping to understand these parameters.

Our current sysctl modifications as they stand today are as follows (comments included). Disclaimer: please don’t just use these settings on your servers without understanding them first.

# Max receive buffer size (8 MB)
net.core.rmem_max=8388608
# Max send buffer size (8 MB)
net.core.wmem_max=8388608

# Default receive buffer size
net.core.rmem_default=65536
# Default send buffer size
net.core.wmem_default=65536

# The first value tells the kernel the minimum receive/send buffer for each TCP connection,
# and this buffer is always allocated to a TCP socket,
# even under high pressure on the system. …
# The second value specified tells the kernel the default receive/send buffer
# allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default
# value used by other protocols. … The third and last value specified
# in this variable specifies the maximum receive/send buffer that can be allocated for a TCP socket.
# Note: The kernel will auto tune these values between the min-max range
# If for some reason you wanted to change this behavior, disable net.ipv4.tcp_moderate_rcvbuf
net.ipv4.tcp_rmem=8192 873800 8388608
net.ipv4.tcp_wmem=4096 655360 8388608

# Units are in pages (default page size is 4 KB)
# These are global variables affecting total pages for TCP
# sockets
# 8388608 pages * 4 KB = 32 GB
# The three values are: low pressure high
# When memory allocated by TCP exceeds “pressure”, the kernel will put pressure on TCP memory
# We set all these values high to basically prevent any memory pressure from ever occurring
# on our TCP sockets
net.ipv4.tcp_mem=8388608 8388608 8388608

# Increase max number of sockets allowed in TIME_WAIT
net.ipv4.tcp_max_tw_buckets=6000000

# Increase max half-open connections.
net.ipv4.tcp_max_syn_backlog=65536

# Increase max TCP orphans
# These are sockets which have been closed and no longer have a file handle attached to them
net.ipv4.tcp_max_orphans=262144

# Max listen queue backlog
# make sure to increase nginx backlog as well if changed
net.core.somaxconn = 16384

# Max number of packets that can be queued on interface input
# If kernel is receiving packets faster than can be processed
# this queue increases
net.core.netdev_max_backlog = 16384

# Only retry creating TCP connections twice
# Minimize the time it takes for a connection attempt to fail
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2

# Timeout closing of TCP connections after 7 seconds
net.ipv4.tcp_fin_timeout = 7

# Avoid falling back to slow start after a connection goes idle
# keeps our cwnd large with the keep alive connections
net.ipv4.tcp_slow_start_after_idle = 0
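
One detail in the listing above that is easy to get wrong is that net.ipv4.tcp_mem is expressed in pages, not bytes. The following sketch redoes the conversion from the comment. The helper name is my own, and the 4096-byte page size is an assumption that should be checked on the target machine.

```python
import os


def pages_to_gib(pages, page_size=4096):
    """Convert a page count (as used by net.ipv4.tcp_mem) to GiB."""
    return pages * page_size / 2**30


if __name__ == "__main__":
    # Query the real page size where the platform exposes it.
    page_size = os.sysconf("SC_PAGE_SIZE") if hasattr(os, "sysconf") else 4096
    print("page size:", page_size, "bytes")
    print("tcp_mem limit:", pages_to_gib(8388608, page_size), "GiB")
```

With 4 KB pages, 8388608 pages come to 32 GiB, which matches the comment in the listing. On a machine with less memory than that, the setting effectively means "never apply TCP memory pressure".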
