Environment: CentOS 5.5 x86_64. This web server is used for data collection, mainly crawling and scraping data. Error messages during startup:
Call Trace:
[] out_of_memory+0x8b/0x203
[] __alloc_pages+0x27f/0x308
[] getnstimeofday+0x10/0x28
[] __do_page_cache_readahead+0xc6/0x1ab
[] filemap_nopage+0x14c/0x360
[] __handle_mm_fault+0x443/0x1348
[] hypercall_page+0x22a/0x1000
[] do_page_fault+0xf7b/0x12e0
[] monotonic_clock+0x35/0x7b
[] thread_return+0x6c/0x113
[] error_exit+0x0/0x6e
DMA per-cpu:
cpu 0 hot: high 186, batch 31 used:26
cpu 0 cold: high 62, batch 15 used:58
cpu 1 hot: high 186, batch 31 used:35
cpu 1 cold: high 62, batch 15 used:49
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 5760kB (0kB HighMem)
Active:46859 inactive:24599 dirty:0 writeback:0 unstable:0 free:1440 slab:28082 mapped-file:5 mapped-anon:76709 pagetables:390527
DMA free:5760kB min:5800kB low:7248kB high:8700kB active:187436kB inactive:98396kB present:2105344kB pages_scanned:328111276 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 16*4kB 68*8kB 2*16kB 4*32kB 2*64kB 2*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5760kB
DMA32: empty
Normal: empty
HighMem: empty
45 pagecache pages
Swap cache: add 14188563, delete 14188552, find 585539/2267924, race 4109+888
Free swap = 0kB
Total swap = 4192956kB
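The zone numbers in this OOM report can be checked by hand. Below is a minimal Python sketch that does this with the log lines pasted in as strings; the helper names are hypothetical and not part of any tool. It sums the DMA buddy free list, compares the zone's free memory against its min watermark, and converts the pagetables counter into kB. That counter is given in 4 kB pages in this report, since free:1440 pages matches 5760kB. On this log the checks give 5760 kB free against a 5800 kB minimum. Page tables alone come to about 1.5 GB of the roughly 2 GB present, and swap is exhausted as well (Free swap = 0kB).

```python
import re

# Lines copied from the OOM report above.
BUDDY_LINE = ("DMA: 16*4kB 68*8kB 2*16kB 4*32kB 2*64kB 2*128kB 0*256kB "
              "1*512kB 0*1024kB 0*2048kB 1*4096kB = 5760kB")
ZONE_LINE = ("DMA free:5760kB min:5800kB low:7248kB high:8700kB "
             "active:187436kB inactive:98396kB present:2105344kB")
PAGETABLES_PAGES = 390527  # "pagetables:390527" from the Active: line
PAGE_KB = 4  # free:1440 pages == 5760kB, so counters here are 4 kB pages

def buddy_free_kb(line):
    """Sum count * block size over the 'N*SIZEkB' entries of a buddy free list."""
    return sum(int(n) * int(size)
               for n, size in re.findall(r"(\d+)\*(\d+)kB", line))

def zone_fields_kb(line):
    """Return {'free': 5760, 'min': 5800, ...} from a zone summary line."""
    return {key: int(val) for key, val in re.findall(r"(\w+):(\d+)kB", line)}

if __name__ == "__main__":
    zone = zone_fields_kb(ZONE_LINE)
    print("buddy free list total:", buddy_free_kb(BUDDY_LINE), "kB")
    print("free below min watermark:", zone["free"] < zone["min"])
    print("pagetables:", PAGETABLES_PAGES * PAGE_KB, "kB of",
          zone["present"], "kB present")
```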
After rebooting, I checked the error log in /var/log/messages:
Dec 9 12:09:33 SN7x kernel: INFO: task httpd:31276 blocked for more than 120 seconds.
Dec 9 12:09:33 SN7x kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 12:09:33 SN7x kernel: httpd D 011d5cb2df1c1910 0 31276 6472 31284 31269 (NOTLB)
Dec 9 12:09:33 SN7x kernel: ffff88005e587d28 0000000000000286 0000000050c40dea ffff88007f67d678
Dec 9 12:09:33 SN7x kernel: 0000000000000008 ffff8800614cf820 ffffffff804f4b80 0000000000003d62
Dec 9 12:09:33 SN7x kernel: ffff8800614cfa08 ffffffff00000001
Dec 9 12:09:33 SN7x kernel: Call Trace:
Dec 9 12:09:34 SN7x kernel: [] delayacct_end+0x5d/0x86
Dec 9 12:09:34 SN7x kernel: [] __mutex_lock_slowpath+0x60/0x9b
Dec 9 12:09:35 SN7x kernel: [] .text.lock.mutex+0xf/0x14
Dec 9 12:09:35 SN7x kernel: [] remove_exclusive_swap_page+0x10b/0x116
Dec 9 12:09:35 SN7x kernel: [] generic_file_aio_write+0x4e/0xc1
Dec 9 12:09:35 SN7x kernel: [] :ext3:ext3_file_write+0x16/0x91
Dec 9 12:09:35 SN7x kernel: [] do_sync_write+0xc7/0x104
Dec 9 12:09:35 SN7x kernel: [] do_page_fault+0xfae/0x12e0
Dec 9 12:09:35 SN7x kernel: [] autoremove_wake_function+0x0/0x2e
Dec 9 12:09:35 SN7x kernel: [] vfs_write+0xce/0x174
Dec 9 12:09:35 SN7x kernel: [] sys_write+0x45/0x6e
Dec 9 12:09:36 SN7x kernel: [] tracesys+0xab/0xb6
What I found:
"httpd:31276 blocked for more than 120 seconds" means the httpd process was blocked for more than 120 s. Searching for it turned up:
This is a known bug. By default Linux uses up to 40% of the available memory for file system caching.
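How it happens: by default Linux lets up to 40% of available memory (the vm.dirty_ratio setting) hold dirty file-system cache pages that have not yet been written to disk. When that much data has to be flushed at once, the disk cannot keep up, and processes that write files (or wait on the same locks) stay blocked past the 120 s hung-task timeout. In short, the fix is to add "vm.dirty_ratio=10" to /etc/sysctl.conf, so that writing processes are forced to flush earlier and in smaller batches.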
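The Python sketch below helps put the setting in concrete terms. It reads the current value from /proc/sys/vm/dirty_ratio and MemTotal from /proc/meminfo, then estimates how much dirty data each ratio allows. Treat the result as a rough upper bound only. Depending on kernel version, the kernel applies the ratio to total memory or to "dirtyable" memory (roughly free plus reclaimable page cache) and may clamp it, so the real threshold can be lower. The function names are hypothetical.

```python
from pathlib import Path

def mem_total_kb(meminfo_text):
    """Extract MemTotal (kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def dirty_threshold_kb(mem_kb, ratio_percent):
    """Rough upper bound: ratio% of memory may be dirty before a writing
    process is made to flush dirty data itself."""
    return mem_kb * ratio_percent // 100

# Only runs on a Linux host that exposes /proc.
if __name__ == "__main__" and Path("/proc/sys/vm/dirty_ratio").exists():
    mem_kb = mem_total_kb(Path("/proc/meminfo").read_text())
    current = int(Path("/proc/sys/vm/dirty_ratio").read_text())
    for ratio in (current, 10):  # current setting vs. the proposed value
        mb = dirty_threshold_kb(mem_kb, ratio) // 1024
        print(f"vm.dirty_ratio={ratio}: roughly {mb} MB of dirty pages allowed")
```

After adding the line to /etc/sysctl.conf, running "sysctl -p" as root reloads the file without a reboot.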
Current workaround:
Change Apache's MaxClients to raise the maximum number of worker processes, or
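As a temporary measure, I changed a kernel parameter and will keep monitoring the result:
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
(As the kernel message itself says, this only disables the warning. It does not resolve the underlying blocking.)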
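A process reported this way is in uninterruptible sleep (state D, as in the "httpd D" lines of the logs), not interruptible sleep (state S). It is waiting on an event such as disk I/O or a kernel lock or semaphore (the traces show __mutex_lock_slowpath and io_schedule) and cannot be woken by signals. If it stays on that wait queue longer than hung_task_timeout_secs (120 s by default), the kernel prints this warning. Because Linux counts D-state tasks in the load average, many blocked httpd processes also drive the load very high.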
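The Python sketch below lists the processes that are currently in uninterruptible sleep. It reads /proc/<pid>/stat, the same information ps shows in its STAT column. The helper names are hypothetical, and the sketch assumes a Linux /proc layout. The command name in the stat file sits in parentheses and may contain spaces, so the sketch splits on the last ")" instead of on whitespace.

```python
import os

def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]  # first field after comm
    return comm, state

def d_state_tasks(proc="/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except OSError:  # the process exited between listdir() and open()
            continue
        if state == "D":
            found.append((int(entry), comm))
    return found

if __name__ == "__main__" and os.path.isdir("/proc"):
    tasks = d_state_tasks()
    print(len(tasks), "task(s) in D state")
    for pid, comm in tasks:
        print(pid, comm)
```

Running it a few times during a load spike shows whether the high load comes from many httpd processes stuck in D rather than from CPU work.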
Appendix: the state of my own servers:
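The load on two Apache servers in my cluster suddenly spiked, and they became unreachable on ports 80 and 22, although telnet to them could still connect. The load reported by monitoring (Nagios) was: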
[2014-05-14 16:17:07] SERVICE NOTIFICATION: nagiosadmin;web1;Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 108.49, 136.73, 141.33
[2014-05-14 16:17:07] SERVICE NOTIFICATION: telnumber;web1;Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 108.49, 136.73, 141.33
[2014-05-14 16:17:07] SERVICE NOTIFICATION: telnumber1;web1;Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 108.49, 136.73, 141.33
[2014-05-14 16:17:07] SERVICE NOTIFICATION: nagios;web1;Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 108.49, 136.73, 141.33
cat /var/log/messages
May 14 16:06:31 web1 kernel: Call Trace:
May 14 16:06:31 web1 kernel: [] io_schedule+0x73/0xc0
May 14 16:06:31 web1 kernel: [] get_request_wait+0x108/0x1d0
May 14 16:06:31 web1 kernel: [] ? autoremove_wake_function+0x0/0x40
May 14 16:06:31 web1 kernel: [] ? elv_merge+0x1cb/0x200
May 14 16:06:31 web1 kernel: [] __make_request+0x7d/0x5a0
May 14 16:06:31 web1 kernel: [] generic_make_request+0x25e/0x530
May 14 16:06:31 web1 kernel: [] submit_bio+0x8d/0x120
May 14 16:06:31 web1 kernel: [] swap_writepage+0x94/0xe0
May 14 16:06:31 web1 kernel: [] pageout.clone.1+0x12b/0x300
May 14 16:06:31 web1 kernel: [] shrink_page_list.clone.0+0x4c5/0x660
May 14 16:06:31 web1 kernel: [] ? mempool_alloc_slab+0x15/0x20
May 14 16:06:31 web1 kernel: [] shrink_inactive_list+0x31c/0x7d0
May 14 16:06:31 web1 kernel: [] ? __sg_alloc_table+0x7e/0x130
May 14 16:06:31 web1 kernel: [] ? scsi_sg_alloc+0x0/0x60
May 14 16:06:31 web1 kernel: [] ? ata_qc_issue+0x1d9/0x340
May 14 16:06:31 web1 kernel: [] ? cfq_close_cooperator+0x4d/0x1d0
May 14 16:06:31 web1 kernel: [] shrink_zone+0x38f/0x520
May 14 16:06:31 web1 kernel: [] ? ktime_get_ts+0xa9/0xe0
May 14 16:06:31 web1 kernel: [] do_try_to_free_pages+0xfe/0x520
May 14 16:06:31 web1 kernel: [] try_to_free_pages+0x9d/0x130
May 14 16:06:31 web1 kernel: [] ? isolate_pages_global+0x0/0x350
May 14 16:06:31 web1 kernel: [] __alloc_pages_nodemask+0x40d/0x940
May 14 16:06:31 web1 kernel: [] ? ktime_get_ts+0x70/0xe0
May 14 16:06:31 web1 kernel: [] alloc_pages_vma+0x9a/0x150
May 14 16:06:31 web1 kernel: [] read_swap_cache_async+0xf2/0x150
May 14 16:06:31 web1 kernel: [] ? valid_swaphandles+0x69/0x150
May 14 16:06:31 web1 kernel: [] swapin_readahead+0x87/0xc0
May 14 16:06:31 web1 kernel: [] handle_pte_fault+0x70b/0xb50
May 14 16:06:31 web1 kernel: [] ? apic_timer_interrupt+0xe/0x20
May 14 16:06:31 web1 kernel: [] handle_mm_fault+0x1e4/0x2b0
May 14 16:06:31 web1 kernel: [] ? down_read_trylock+0x1/0x30
May 14 16:06:31 web1 kernel: [] __do_page_fault+0x139/0x480
May 14 16:06:31 web1 kernel: [] ? __switch_to+0x1ac/0x320
May 14 16:06:31 web1 kernel: [] ? thread_return+0x4e/0x76e
May 14 16:06:31 web1 kernel: [] ? remove_vma+0x6e/0x90
May 14 16:06:31 web1 kernel: [] do_page_fault+0x3e/0xa0
May 14 16:06:31 web1 kernel: [] page_fault+0x25/0x30
May 14 16:06:31 web1 kernel: INFO: task httpd:496 blocked for more than 120 seconds.
May 14 16:06:31 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 16:06:31 web1 kernel: httpd D 0000000000000000 0 496 10265 0x00000000
May 14 16:06:31 web1 kernel: ffff88023360f268 0000000000000086 0000000000000000 ffff88021f6e2080
May 14 16:06:31 web1 kernel: 0000000000000001 ffff880176c8f3c0 ffff88021f6e2080 ffff880177a02e80
May 14 16:06:31 web1 kernel: ffff88021f6e2638 ffff88023360ffd8 000000000000fb88 ffff88021f6e2638
May 14 16:06:31 web1 kernel: Call Trace:
May 14 16:06:31 web1 kernel: [] io_schedule+0x73/0xc0
May 14 16:06:31 web1 kernel: [] get_request_wait+0x108/0x1d0
May 14 16:06:31 web1 kernel: [] ? autoremove_wake_function+0x0/0x40
May 14 16:06:31 web1 kernel: [] ? elv_merge+0x1cb/0x200
May 14 16:06:31 web1 kernel: [] __make_request+0x7d/0x5a0
May 14 16:06:31 web1 kernel: [] generic_make_request+0x25e/0x530
May 14 16:06:31 web1 kernel: [] submit_bio+0x8d/0x120
May 14 16:06:31 web1 kernel: [] swap_writepage+0x94/0xe0
May 14 16:06:31 web1 kernel: [] pageout.clone.1+0x12b/0x300
May 14 16:06:31 web1 kernel: [] shrink_page_list.clone.0+0x4c5/0x660
May 14 16:06:31 web1 kernel: [] ? mempool_alloc_slab+0x15/0x20
May 14 16:06:31 web1 kernel: [] shrink_inactive_list+0x31c/0x7d0
May 14 16:06:31 web1 kernel: [] ? __sg_alloc_table+0x7e/0x130
May 14 16:06:31 web1 kernel: [] ? scsi_sg_alloc+0x0/0x60
May 14 16:06:31 web1 kernel: [] ? ata_qc_issue+0x1d9/0x340
May 14 16:06:31 web1 kernel: [] ? determine_dirtyable_memory+0x1a/0x30
May 14 16:06:31 web1 kernel: [] ? get_dirty_limits+0x27/0x2f0
May 14 16:06:31 web1 kernel: [] ? cfq_close_cooperator+0x4d/0x1d0
May 14 16:06:31 web1 kernel: [] shrink_zone+0x38f/0x520
May 14 16:06:31 web1 kernel: [] ? ktime_get_ts+0xa9/0xe0
May 14 16:06:31 web1 kernel: [] do_try_to_free_pages+0xfe/0x520
May 14 16:06:31 web1 kernel: [] try_to_free_pages+0x9d/0x130
May 14 16:06:31 web1 kernel: [] ? isolate_pages_global+0x0/0x350
May 14 16:06:31 web1 kernel: [] __alloc_pages_nodemask+0x40d/0x940
May 14 16:06:31 web1 kernel: [] ? ktime_get_ts+0x70/0xe0
May 14 16:06:31 web1 kernel: [] alloc_pages_vma+0x9a/0x150
May 14 16:06:31 web1 kernel: [] read_swap_cache_async+0xf2/0x150
May 14 16:06:31 web1 kernel: [] ? valid_swaphandles+0x69/0x150
May 14 16:06:31 web1 kernel: [] swapin_readahead+0x87/0xc0
May 14 16:06:31 web1 kernel: [] handle_pte_fault+0x70b/0xb50
May 14 16:06:31 web1 kernel: [] ? thread_return+0x4e/0x76e
May 14 16:06:31 web1 kernel: [] ? dput+0x9a/0x150
May 14 16:06:31 web1 kernel: [] ? apic_timer_interrupt+0xe/0x20
May 14 16:06:31 web1 kernel: [] handle_mm_fault+0x1e4/0x2b0
May 14 16:06:31 web1 kernel: [] __do_page_fault+0x139/0x480
May 14 16:06:31 web1 kernel: [] ? __switch_to+0x1ac/0x320
May 14 16:06:31 web1 kernel: [] ? thread_return+0x4e/0x76e
May 14 16:06:31 web1 kernel: [] ? mntput_no_expire+0x30/0x110
May 14 16:06:31 web1 kernel: [] ? apic_timer_interrupt+0xe/0x20
May 14 16:06:31 web1 kernel: [] do_page_fault+0x3e/0xa0
May 14 16:06:31 web1 kernel: [] page_fault+0x25/0x30
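Both kernel logs show httpd blocked inside swap-related code, for example remove_exclusive_swap_page in the first trace and swap_writepage and swapin_readahead in the second. The Python sketch below gives a quick summary of a messages file. It counts the "blocked for more than N seconds" reports per command and collects any swap-related functions seen in the call traces. SWAP_FRAMES is an assumption picked from the traces in this post, not an exhaustive list. The script name in the usage line is hypothetical. The frame regex matches both the "[]" form shown here and the usual "[<ffffffff...>]" form.

```python
import re
import sys
from collections import Counter

# "INFO: task httpd:496 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# Call-trace frame such as "[<ffffffff...>] ? swap_writepage+0x94/0xe0"
FRAME_RE = re.compile(r"\] (?:\? )?([\w.:]+)\+0x")
SWAP_FRAMES = {"swap_writepage", "swapin_readahead",
               "read_swap_cache_async", "remove_exclusive_swap_page"}

def summarize(lines):
    """Count hung-task reports per command name and collect the
    swap-related function names that appear in any call trace."""
    hung = Counter()
    seen_swap = set()
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            hung[m.group(1)] += 1
            continue
        frame = FRAME_RE.search(line)
        if frame and frame.group(1) in SWAP_FRAMES:
            seen_swap.add(frame.group(1))
    return hung, seen_swap

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as f:
        hung, seen_swap = summarize(f)
    for comm, n in hung.most_common():
        print(f"{comm}: {n} hung-task report(s)")
    print("swap frames in traces:", ", ".join(sorted(seen_swap)) or "none")
```

Usage: python3 hung_summary.py /var/log/messages (reading that file usually requires root).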