分类: LINUX
2014-08-15 16:20:06
武汉地铁某项目的车地无线系统的中心服务器经常出现宕机的情况,程序在试验环境下连续运行1周以上正常,但是上线后严重时一天就可能宕机2、3次。怀疑是程序对某种特定格式的数据包进行解析的时候出错。
中心服务器采用是centos6.3系统,内核版本是2.6.32-279.el6.i686。为了能在宕机后能对系统进行分析,启动了kdump。
执行
crash /usr/lib/debug/lib/modules/`uname –r`/vmlinux vmcore
来启动分析
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.el6.i686/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Sat Mar 2 08:28:06 2013
UPTIME: 9 days, 19:14:07
LOAD AVERAGE: 0.45, 0.37, 0.36
TASKS: 382
NODENAME: localhost.localdomain
RELEASE: 2.6.32-279.el6.i686
VERSION: #1 SMP Fri Jun 22 10:59:55 UTC 2012
MACHINE: i686 (2400 Mhz)
MEMORY: 4 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
PID: 0
COMMAND: "swapper"
TASK: f62cfaa0 (1 of 8) [THREAD_INFO: f62e4000]
CPU: 7
STATE: TASK_RUNNING (PANIC)
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.el6.i686/vmlinux
表示分析内核的路径,这个内核是后安装上去的。
DUMPFILE: vmcore [PARTIAL DUMP]
Dump文件的路径, PARTIAL偏爱,局部。
CPUS: 8
机器有8个处理器
DATE: Sat Mar 2 08:28:06 2013
Dump发生的时间
UPTIME: 9 days, 19:14:07
已经开机的时间
LOAD AVERAGE: 0.45, 0.37, 0.36
平均系统负载
TASKS: 382
NODENAME: localhost.localdomain
RELEASE: 2.6.32-279.el6.i686
发行版的内核
VERSION: #1 SMP Fri Jun 22 10:59:55 UTC 2012
MACHINE: i686 (2400 Mhz)
MEMORY: 4 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
0002 = 10
* bit 0 == 0 means no page found, 1 means protection fault
* bit 1 == 0 means read, 1 means write
* bit 2 == 0 means kernel, 1 means user-mode
* bit 3 == 0 means data, 1 means instruction
#1:发生了一次
Oops: 0002 [#1]表示是一个write引起的no page found错误,该错误发生了一次。
PID: 0
COMMAND: "swapper"
TASK: f62cfaa0 (1 of 8) [THREAD_INFO: f62e4000]
f62e4000是thread_info的地址。
CPU: 7
第七个cpu上发生的oops。
STATE: TASK_RUNNING (PANIC)
执行bt查看调用堆栈
PID: 0 TASK: f62cfaa0 CPU: 7 COMMAND: "swapper"
#0 [f62e5b50] crash_kexec at c049b75c
#1 [f62e5ba4] oops_end at c083fe92
#2 [f62e5bb8] no_context at c0433dad
#3 [f62e5bdc] bad_area_nosemaphore at c0433fda
#4 [f62e5be4] __do_page_fault at c04344c3
#5 [f62e5c64] do_page_fault at c0841815
#6 [f62e5c7c] error_code (via page_fault) at c083f2a5
EAX: 00000292 EBX: f7f44c18 ECX: 00000000 EDX: e23c8c80 EBP: c0aa5ba0
DS: 007b ESI: f7f44c24 ES: 007b EDI: e23c8c80 GS: 00e0
CS: 0060 EIP: c078afb4 ERR: ffffffff EFLAGS: 00010046
#7 [f62e5cb0] skb_queue_tail at c078afb4
#8 [f62e5cc8] doRecv at f7f3db97 [gate]
#9 [f62e5cd4] handlerecv at f7f3cfd2 [gate]
#10 [f62e5cf0] getip at f7f3d202 [gate]
#11 [f62e5d00] nf_iterate at c07b87e3
#12 [f62e5d24] nf_hook_slow at c07b894d
#13 [f62e5d60] ip_rcv at c07c1279
#14 [f62e5d84] __netif_receive_skb at c079449e
#15 [f62e5db8] netif_receive_skb at c079602a
#16 [f62e5dcc] napi_skb_finish at c07960fa
#17 [f62e5dd4] napi_gro_receive at c0798520
#18 [f62e5de0] e1000_clean_rx_irq at f87c756b [e1000e]
#19 [f62e5e50] e1000_clean at f87c6386 [e1000e]
#20 [f62e5e70] net_rx_action at c079861b
#21 [f62e5e98] __do_softirq at c045cf28
#22 [f62e5ed8] do_softirq at c045d078
#23 [f62e5ee0] irq_exit at c045d1d0
#24 [f62e5ee4] do_IRQ at c040b10b
#25 [f62e5f0c] common_interrupt at c0409feb
EAX: 00000000 EBX: 0000007f ECX: 00000000 EDX: 0000007f EBP: 00000000
DS: 007b ESI: 12e77335 ES: 007b EDI: 00000002 GS: 00e0
CS: 0060 EIP: c064d44f ERR: ffffff55 EFLAGS: 00000202
#26 [f62e5f40] intel_idle at c064d44f
#27 [f62e5f70] cpuidle_idle_call at c0766e3e
#28 [f62e5f7c] cpu_idle at c040895e
下面的函数会在内部调用上面的函数
#7 [f62e5cb0] skb_queue_tail at c078afb4
#8 [f62e5cc8] doRecv at f7f3db97 [gate]
#9 [f62e5cd4] handlerecv at f7f3cfd2 [gate]
#10 [f62e5cf0] getip at f7f3d202 [gate]
由此可知下面的函数调用关系
getip()->handlerecv()-> doRecv()->skb_queue_tail()
BUG: unable to handle kernel NULL pointer dereference at (null)
IP:
[
*pdpt = 00000000243c3001 *pde = 000000013f51a067
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/net/eth1/statistics/collisions
Modules linked in: vfat fat usb_storage gate(U) bridge stp llc fuse autofs4 sunrpc cpufreq_ondemand acpi_cpufreq mperf ipt_REJECT ip6t_REJECT ipv6 uinput sg microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support e1000e ioatdma dca i7core_edac edac_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
Pid: 0, comm: swapper Tainted: G W --------------- 2.6.32-279.el6.i686 #1 Lenovo Lenovo WQ R510 G7/Tyan Tank GT20-B7002LNV
EIP:
0060:[
EIP is at skb_queue_tail+0x24/0x50
EAX: 00000292 EBX: f7f44c18 ECX: 00000000 EDX: e23c8c80
ESI: f7f44c24 EDI: e23c8c80 EBP: c0aa5ba0 ESP: f62e5cbc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process swapper (pid: 0, ti=f62e4000 task=f62cfaa0 task.ti=f62e4000)
Stack:
e8e8eec0 f7f44c18 00000000 f7f3db9c f2ecfa30 e8e8eec0 f7f3cfd7 e176a844
<0> c07e68d5 00000002 e8e8eec0 f2ecfa30 f7f3f440 f7f3d207 f7f3f4e8 f62e5d4c
<0> 80000000 c07b87e6 00000000 c07c0cc0 e8e8eec0 00000000 00000000 f3aac020
Call Trace:
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
Code: e9 82 39 0b 00 66 90 83 ec 0c 89 74 24 04 8d 70 0c 89 1c 24 89 c3 89 f0 89 7c 24 08 89 d7 e8 34 38 0b 00 8b 4b 04 89 1f 89 4f 04 <89> 39 83 43 08 01 89 c2 89 f0 8b 74 24 04 89 7b 04 8b 1c 24 8b
EIP:
[
CR2: 0000000000000000
BUG: unable to handle kernel NULL pointer dereference at (null)
空指针错误
EIP is at skb_queue_tail+0x24/0x50
IP:
[
错误的地点
反汇编skb_queue_tail
0xc078af90
0xc078af93
0xc078af97
0xc078af9a
0xc078af9d
0xc078af9f
0xc078afa1
0xc078afa5
0xc078afa7
0xc078afac
0xc078afaf
0xc078afb1
0xc078afb4
0xc078afb6
0xc078afba
0xc078afbc
0xc078afbe
0xc078afc2
0xc078afc5
0xc078afc8
0xc078afcc
0xc078afcf
0xc078afd4
0xc078afda
分析出错的地方
反汇编skb_queue_tail得到上面的代码,出错的地方在0x24/0x50。
0x50表示这个函数的大小是0x50个字节。0x24=36,也就是说真正的地方在
mov %edi,(%ecx)。
这句话的意思是将寄存器edi存放的地址的内容拷贝到ecx寄存器存放的地址中去。
查看寄存器状态
EAX: 00000292 EBX: f7f44c18 ECX: 00000000 EDX: e23c8c80
ESI: f7f44c24 EDI: e23c8c80 EBP: c0aa5ba0 ESP: f62e5cbc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
定位出错的地方
通过查看源码
void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
{
unsigned long flags;
spin_lock_irqsave(&list->lock, flags);
__skb_queue_tail(list, newsk);
spin_unlock_irqrestore(&list->lock, flags);
}
0xc078afa7
和
0xc078afcf
明显就是上面源码的中加锁和解锁的地方。那么在这两处之间可能就是__skb_queue_tail(list, newsk);要做的事情。
__skb_queue_tail(list, newsk)
这函数最终要执行
newsk->next = next;
newsk->prev = prev;
next->prev = prev->next = newsk;
list->qlen++;
在反汇编源码
0xc078afb6
就是将ebx存放的地址的值加上1,然后再存回去。这应该就是
list->qlen++;
也就是说
0xc078afb4
应该执行的是
next->prev = prev->next = newsk;
所以这里猜测出错的地方应该是要么是next,要么是prev为空。
执行 rd -32 0xe23c8c80
*(e23c8c80) = f7f44c18
c1497200: 00 00 08 75 c0 a8 00 0a 00 e0 81 dc f0 a9 08 00 ...u............
c1497210: 45 00 00 2c 33 33 00 00 80 11 86 32 c0 a8 00 01 E..,33.....2....
c1497220: c0 a8 00 e0 81 dc f0 a9 20 12 03 08 13 14 08 00 ........ .......
c1497230: 45 00 00 2c 00 b0 00 00 40 11 f8 99 c0 a8 00 26 E..,....@......&
c1497240: c0 a8 00 01 04 d2 07 d0 00 18 00 00 00 9b fd 00 ................
c1497250: 02 64 00 00 00 00 00 00 00 00 00 00 40 05 00 00 .d..........@...
c1497260: 00 00 00 00 ....
源ip : c0 a8 00 26(192.168.0.38)
目标ip : c0 a8 00 01(192.168.0.1)
协议类型:0x11(17) UDP
源端口 : 04 d2(1234)
目标端口 : 07 d0(2000)
自定义头部的内容
SEQ : 00 9b
Cksum : fd 00
Line : 02
Train : 64
Flag :00 00
R0 :00 00
R1 :00 00
R2 :00 00
R3 :00 00
这里有个train,它的值是0x64,也就是100。而源码中数组的长度就是100.问题找到了。