Unless marked as a repost, all posts on this blog are original. Please credit the source when reposting.
Permalink: http://blog.chinaunix.net/uid-31396856-id-5761764.html
Today a customer's database server showed a very high load average, even though CPU, I/O, and memory usage were all low.
[root@bjst-xxx ~]# uptime
13:27:17 up 767 days, 5:08, 4 users, load average: 43.73, 43.07, 42.18
Yet resource consumption on the server was low:
top - 13:27:45 up 767 days, 4:41, 4 users, load average: 43.34, 46.13, 42.68
Tasks: 845 total, 4 running, 841 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.3%us, 0.6%sy, 0.0%ni, 87.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65932232k total, 63627872k used, 2304360k free, 296648k buffers
Swap: 4194296k total, 0k used, 4194296k free, 28106116k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23532 oracle 20 0 30.2g 33m 19m R 98.7 0.1 2:16.93 oracle
25540 oracle 20 0 30.2g 31m 25m R 72.9 0.0 0:13.13 oracle
25635 oracle 20 0 30.2g 30m 25m S 64.7 0.0 0:11.65 oracle
24812 oracle 20 0 30.2g 32m 25m S 56.4 0.0 0:31.97 oracle
22906 oracle 20 0 30.2g 32m 25m S 34.6 0.0 1:03.80 oracle
22866 oracle 20 0 30.2g 32m 26m S 17.2 0.1 1:49.70 oracle
3392 root 20 0 4267m 307m 8228 S 1.0 0.5 22705:49 java
25646 root 20 0 15560 1856 940 R 1.0 0.0 0:00.19 top
Memory and swap showed no problems either:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 1 0 2276308 296648 28106280 0 0 321 10 0 0 2 0 97 0 0
4 0 0 2222988 296648 28106280 0 0 164251 40 8654 8031 8 1 89 3 0
4 1 0 2295776 296648 28106280 0 0 197915 180 8889 8702 7 0 90 3 0
7 2 0 2069308 296648 28106284 0 0 147099 137 12540 10842 13 1 83 3 0
5 3 0 2244712 296648 28106284 0 0 117434 58 11020 8286 12 1 85 3 0
2 1 0 2312408 296648 28106288 0 0 182610 97 7461 8222 4 0 93 3 0
1 2 0 2312584 296648 28106288 0 0 144962 155 8396 9580 4 0 93 3 0
2 1 0 2313364 296648 28106288 0 0 211755 133 7345 8049 4 0 94 3 0
1 2 0 2313476 296648 28106288 0 0 140179 10 6495 7552 3 0 94 3 0
2 2 0 2305284 296648 28106288 0 0 177059 118 8029 7786 5 0 92 3 0
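Strange.
Checking the AWR report next showed that the database itself was running without problems.
Searching MOS (My Oracle Support) turned up a document on High Load Average; the case it describes is somewhat similar to this one: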
At the time the load average started to climb on the example system, an nfs server had stopped responding.
The load average includes processes in a D state (uninterruptible sleep, usually IO) and if ..........there are many processes
in a D state so the load average climbs though the system effectively is idle.
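To see how many tasks sit in each state, the first letter of the STAT column can be tallied. The following is only a sketch: the function name count_states is made up for illustration, and it assumes a Linux procps-style ps whose "-o stat=" prints one state string per line.

```shell
#!/bin/sh
# Tally process states from the first letter of each STAT value on stdin.
# count_states is a hypothetical helper name, not part of any tool.
count_states() {
    awk '{ c[substr($1, 1, 1)]++ } END { for (s in c) print s, c[s] }' | sort
}

# Assumed usage on a live Linux system:
#   ps -eo stat= | count_states
```

A large D count next to a small R count would match the MOS description: the load average climbs while the CPUs stay mostly idle.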
So I checked whether any processes on the system were in the D state:
[root@bjst-xxx ~]# ps axl | awk '$10 ~ /D/'
1 0 554 1 20 0 14740 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 702 1 20 0 14740 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 3277 1 20 0 14732 908 rpc_wa D ? 0:03 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 3382 1 20 0 14748 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 4999 1 20 0 14748 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 6784 1 20 0 14748 912 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 8360 1 20 0 14756 912 rpc_wa D ? 0:05 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 11726 1 20 0 14732 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 12104 1 20 0 14740 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 12799 1 20 0 14756 912 rpc_wa D ? 0:06 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 15019 1 20 0 14724 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 15497 1 20 0 14756 912 rpc_wa D ? 0:04 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 16032 1 20 0 14740 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 16116 1 20 0 14732 904 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 17043 1 20 0 14732 904 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 17849 1 20 0 14724 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 18934 1 20 0 14724 908 rpc_wa D ? 0:02 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 20416 1 20 0 14748 908 rpc_wa D ? 0:07 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 23926 1 20 0 14724 908 rpc_wa D ? 0:03 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 24093 1 20 0 14740 908 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 24423 1 20 0 14732 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 24492 1 20 0 14732 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 24916 1 20 0 14748 904 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 27257 1 20 0 14756 916 rpc_wa D ? 0:02 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 29331 1 20 0 14740 904 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 32144 1 20 0 14756 912 rpc_wa D ? 0:05 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 32359 1 20 0 14740 908 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 35198 1 20 0 14740 912 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 37974 1 20 0 14732 904 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 41384 1 20 0 14732 908 rpc_wa D ? 0:03 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 42968 1 20 0 14740 912 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 43315 1 20 0 14748 904 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 45836 1 20 0 14756 916 rpc_wa D ? 0:06 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 47467 1 20 0 14756 912 rpc_wa D ? 0:05 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 53444 1 20 0 14740 912 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 55187 1 20 0 14748 908 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 56218 1 20 0 14748 908 rpc_wa D ? 0:02 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 59855 1 20 0 14740 912 rpc_wa D ? 0:00 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
1 0 65320 1 20 0 14740 908 rpc_wa D ? 0:01 /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
[root@bjst-xxx ~]# ps -ef|grep nmon |wc -l
41
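The ps axl listing above shows 39 nmon processes in the D state (the count of 41 from ps -ef most likely includes the grep command itself). So what is a D-state process?
D (TASK_UNINTERRUPTIBLE): uninterruptible sleep.
As with TASK_INTERRUPTIBLE, the process is asleep, but at this point it cannot be interrupted. "Uninterruptible" does not mean that the CPU ignores external hardware interrupts;
it means that the process does not respond to asynchronous signals.
In the vast majority of cases a sleeping process should be able to respond to asynchronous signals, but a process in uninterruptible sleep accepts no external signals at all.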
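When many processes are stuck in the D state, grouping them by wait channel and command name shows quickly what they are blocked on (here, the rpc_wa wait channel of nmon). A minimal sketch follows; d_state_summary is a hypothetical helper name, and the input is assumed to be three whitespace-separated columns: STAT, WCHAN, COMMAND.

```shell
#!/bin/sh
# Summarize D-state processes by wait channel and command.
# Input on stdin: "STAT WCHAN COMMAND" per line (assumed layout).
d_state_summary() {
    awk '$1 ~ /^D/ { n[$2 " " $3]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Assumed usage with a Linux procps-style ps:
#   ps -e -o stat= -o wchan= -o comm= | d_state_summary
```

Each output line is a count followed by the wait channel and command, with the largest group first.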
I then found the following cron jobs:
0 2 * * * /var/nmon/nmon -f -N -m /var/nmon/logs -s 30 -c 2880
0 1 * * * /bin/find /var/nmon/logs/ -mtime +30 -name "*.nmon" -exec rm -rf {} \;
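The first cron entry starts a new nmon collector every day at 02:00. Assuming nmon's usual meaning of -s (seconds between snapshots) and -c (number of snapshots), the intended lifetime of one run can be worked out as below; run_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Intended run time of one nmon instance: interval (s) times snapshot count.
run_seconds() {
    echo $(( $1 * $2 ))
}

secs=$(run_seconds 30 2880)
echo "$secs seconds = $(( secs / 3600 )) hours"
# prints: 86400 seconds = 24 hours
```

Each instance is therefore meant to exit after about 24 hours, roughly when the next one starts. Dozens of instances alive at once suggests that each day's run got stuck and never exited, rather than that cron launched too many.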
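After confirming that these were the cause, I killed the processes above:
[root@bjst-xxx ~]# ps -ef|grep nmon |grep -v grep|cut -c 9-15| xargs kill -9
The load on the database server eventually returned to normal: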
[root@bjst-xxx ~]# uptime
13:48:57 up 767 days, 5:08, 4 users, load average: 2.87, 3.16, 8.04
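The kill command above picks out PIDs by character position (cut -c 9-15), which depends on the exact column widths of ps -ef. A slightly more robust sketch selects by field instead. It assumes the usual ps -ef layout (UID PID PPID C STIME TTY TIME CMD, with STIME printed as a single field), and nmon_pids is a hypothetical helper name. It only prints PIDs and kills nothing.

```shell
#!/bin/sh
# Print the PIDs of nmon collectors from "ps -ef" output on stdin.
# Matching on the command path also skips the "grep nmon" line.
nmon_pids() {
    awk '$8 == "/var/nmon/nmon" { print $2 }'
}

# Assumed usage: review the list first, then pass it to kill if it looks right.
#   ps -ef | nmon_pids
#   ps -ef | nmon_pids | xargs kill -9
```

Where procps is installed, pgrep -f with the same path should give a similar list without parsing ps output by hand.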
With that, the problem was resolved.
---The end