Chinaunix首页 | 论坛 | 博客
  • 博客访问: 354108
  • 博文数量: 67
  • 博客积分: 2550
  • 博客等级: 少校
  • 技术积分: 990
  • 用 户 组: 普通用户
  • 注册时间: 2008-09-17 18:24
文章分类

全部博文(67)

文章存档

2011年(6)

2010年(2)

2009年(40)

2008年(19)

我的朋友

分类: LINUX

2008-12-04 17:04:50

A Case Study: "kernel BUG at pipe.c:120!"

Upon bringing up a crash session, a great deal of information can be gained just by the invocation data. Here is what what displayed in this particular case:
...
KERNEL: vmlinux-2.4.9-e.10.13enterprise-g
DUMPFILE: vmcore-incomplete
CPUS: 2
DATE: Mon Feb 17 08:20:56 2003
UPTIME: 4 days, 20:04:41
LOAD AVERAGE: 0.95, 1.04, 1.25
TASKS: 110
NODENAME: testbox.redhat.com
RELEASE: 2.4.9-e.10.13enterprise
VERSION: #1 SMP Mon Feb 3 12:59:26 EST 2003
MACHINE: i686 (2788 Mhz)
MEMORY: 6 GB
PANIC: "kernel BUG at pipe.c:120!"
PID: 20571
COMMAND: "imp"
TASK: d1566000
CPU: 1
STATE: TASK_RUNNING (PANIC)

crash>

In this case the PANIC string "kernel BUG at pipe.c:120!" points to the exact kernel source code line at which the panic occurred.

Then, getting a backtrace of panicking task is typically the first order of the day:

crash> bt
PID: 20571 TASK: d1566000 CPU: 1 COMMAND: "imp"
#0 [d1567e44] die at c010785c
#1 [d1567e54] do_invalid_op at c0107b2c
#2 [d1567f0c] error_code (via invalid_op) at c01073de
EAX: 0000001d EBX: ed87b2e0 ECX: c02f6064 EDX: 00005fa1 EBP: 00001000
DS: 0018 ESI: f640e740 ES: 0018 EDI: 00001000
CS: 0010 EIP: c0150b6d ERR: ffffffff EFLAGS: 00010292
#3 [d1567f48] pipe_read at c0150b6d
#4 [d1567f6c] sys_read at c01468d4
#5 [d1567fc0] system_call at c01072dc
EAX: 00000003 EBX: 0000000a ECX: 40b4e05c EDX: 00002000
DS: 002b ESI: 00002000 ES: 002b EDI: 40b4e05c
SS: 002b ESP: bffe9e88 EBP: bffe9eb8
CS: 0023 EIP: 40aaa1d4 ERR: 00000003 EFLAGS: 00000286

The backtrace shows that the call to die() was generated by an invalid_op exception. The exception was caused by the BUG() call in the pipe_read() function:

        if (count && PIPE_WAITING_WRITERS(*inode) &&
!(filp->f_flags & O_NONBLOCK)) {
/*
* We know that we are going to sleep: signal
* writers synchronously that there is more
* room.
*/
wake_up_interruptible_sync(PIPE_WAIT(*inode));
if (!PIPE_EMPTY(*inode))
BUG();

goto do_more_read;
}

In the code segment above, the pipe_read() code has previously down'd the semaphore of the inode associated with the pipe, giving it exclusive access. It had read all data in the pipe, but still needed more to satisfy the count requested. Finding that there was a writer with more data -- and who was waiting on the semaphore -- it woke up the writer. However, after doing the wakeup, it did a sanity-check on the pipe contents, and found that it was no longer empty -- which is theoretically impossible since it was still holding the semaphore. It appeared that the writer process wrote to the pipe while the reader process still had exclusive access -- somehow overriding the semaphore.

Since the semaphore mechanism was seemingly not working, it was first necessary to look at the actual semaphore structure associated with the pipe's inode. This first required looking at the first argument to the pipe_read() function; the command shows that it is a struct file pointer:

crash> whatis pipe_read
ssize_t pipe_read(struct file *, char *, size_t, loff_t *);
crash>

Using the option, each frame in the backtrace is expanded to show all stack data in the frame. Looking at the expansion of the sys_read() frame, we can see that the last thing pushed on the stack before calling pipe_read() was the file pointer address of edf3f740:

...
#3 [d1567f48] pipe_read at c0150b6d
[RA: c01468d6 SP: d1567f4c FP: d1567f6c SIZE: 36]
d1567f4c: c026701c 00000078 fffffff2 00001000
d1567f5c: 00000000 edf3f740 ffffffea 00002000
d1567f6c: c01468d6
#4 [d1567f6c] sys_read at c01468d4
[RA: c01072e3 SP: d1567f70 FP: d1567fc0 SIZE: 84]
d1567f70: edf3f740 40b4f05c 00002000 edf3f760
d1567f80: c03683d0 fffffffb 00000001 c0120d3b
d1567f90: 00000046 00000046 0000000b c0350960
d1567fa0: 0000000b f639eb00 c0108e0e 00000020
d1567fb0: d1566000 00002000 40b4e05c bffe9eb8
d1567fc0: c01072e3
...

The task at hand is finding the inode containing the suspect semaphore from the file structure address. The file structure's f_dentry member points to its dentry structure, whose d_inode member in turn points to the pipe's inode. The command can be used to dump the complete contents of a data structure at a given address; by tagging the .member onto the structure name, we can print just the member desired. By following the structure chain, the inode address can be determined like so:

crash> struct file.f_dentry edf3f740
f_dentry = 0xdb0ec440,
crash> struct dentry.d_inode db0ec440
d_inode = 0xf640e740,
crash> struct inode.i_sem f640e740

i_sem = {
count = {
counter = 2
},
sleepers = 0,
wait = {
lock = {
lock = 1
},
task_list = {
next = 0xf640e7ac,
prev = 0xf640e7ac
}
}
},
crash>

The dump of the semaphore structure above showed the problem: the counter value of 2 is illegal. It should never be greater than 1; in this case a value of 2 allows two successful down operations, i.e., giving two tasks access to the pipe at the same time.

(As an aside, determining the inode address above could also be accomplished by using the context-sensitive command, which dumps the associated file, dentry and inode structure addresses for each open file descriptor of a task. The dumped file descriptor list would contain one with a reference to the file structure at edf3f740, and would also show the associated inode address of f640e740.)

Before getting a dumpfile, this same panic had occurred several times. It was erroneously presumed that the problem was in the pipe-handling code, but it was eventually determined not to be the case. By instrumenting a kernel with debug code, the starting counter value of a pipe was found to be 3. Compounding that problem was the fact that the inode slab cache is one of a few special cases that presume that the freed inode's contents are left in a legitimate state so that they do not have to be completely reinitialized with each subsequent reallocation. So when the pipe's inode was created, it received an inode with a bogus counter value.

Confirming the existence of bogus inode structures in the slab cache was a multi-stepped procedure. Using the command command to access the inode slab cache, we can get the addresses of all free and currently-allocated inodes. Since there are typically several thousand inodes, the output is extremely verbose, but here is the beginning of it:

crash> kmem -S inode_cache
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
c7666564 inode_cache 448 11563 12339 1371 4k
SLAB MEMORY TOTAL ALLOCATED FREE
d1d82000 d1d82040 9 9 0
FREE / [ALLOCATED]
[d1d82040]
[d1d82200]
[d1d823c0]
[d1d82580]
[d1d82740]
[d1d82900]
[d1d82ac0]
[d1d82c80]
[d1d82e40]
SLAB MEMORY TOTAL ALLOCATED FREE
f4e52000 f4e52040 9 7 2
FREE / [ALLOCATED]
f4e52040 (cpu 1 cache)
f4e52200 (cpu 1 cache)
[f4e523c0]
[f4e52580]
[f4e52740]
[f4e52900]
[f4e52ac0]
[f4e52c80]
[f4e52e40]
...

In the truncated output above, all of the inode address in the slab cache are dumped; the ones currently in use are surrounded by brackets, the free ones are not. So, for example, the inodes at addresses f4e52040 and f4e52200 are free; the others are not. The full output was that pulled out just the free inode addresses (i.e., output lines starting with three spaces), and . The file was modified to be a crash by making each extracted inode address to be the arguments of the command, using its short-cut method that allows the dropping of the struct command name; therefore the input file contained hundreds of crash commands of the form:

inode.i_sem f4e52040
inode.i_sem f4e52200
inode.i_sem f5cdc040
inode.i_sem f5cdc200
inode.i_sem f5cdc3c0
inode.i_sem f5cdc580
...

Note that the command would be used by default above, as documented in its help page; if the first command line argument is not a crash or gdb command, but it is the name of a known data structure, it passes the arguments to the command.

Using the capability of feeding an , in this case consisting of hundreds of short-cut commands like those above, the output was again quite verbose, consisting of structure member dumps of the form:

crash> < input.file
crash> inode.i_sem f4e52040
i_sem = {
count = {
counter = 1
},
sleepers = 0,
wait = {
lock = {
lock = 1
},
task_list = {
next = 0xf4e520ac,
prev = 0xf4e520ac
}
}
},
crash> inode.i_sem f4e52200
i_sem = {
count = {
counter = 1
},
sleepers = 0,
wait = {
lock = {
lock = 1
},
task_list = {
next = 0xf4e5226c,
prev = 0xf4e5226c
}
}
},
...

However, it was a simple matter of to grep, and looking for counter values not equal to 1:

crash> < input.file | grep counter | grep -v "= 1"
counter = 3
counter = 3
counter = 3
counter = 3
crash>

This turned out to be the smoking gun. Another round of debugging with an instrumented kernel that trapped attempts to free an inode with a semaphore counter of 3 caught the perpetrator in the act.


加载新的模块。

crash> mod -d egenera_base
crash> mod -s egenera_base egenera_base.o
 MODULE   NAME                     SIZE  OBJECT FILE
f8804000  egenera_base           448324  egenera_base.o
crash>
crash> whatis
Usage: whatis [struct | union | typedef | symbol]
Enter "help whatis" for details.
crash> bt -t
PID: 11192  TASK: e3a5a000  CPU: 0   COMMAND: "java"
      START: dump_execute at fa5740eb
  [e3a5bc64] __do_vmdump at fa5744f7
  [e3a5bc7c] do_vmdump at fa573081
  [e3a5bc8c] kdb_dump at c0183c00
  [e3a5bc98] kdb_local at c018434c
  [e3a5bcc4] kernrpc_rclan_send at f881b516
  [e3a5bce4] hash_remove at f881bea8
  [e3a5bd0c] kernrpc_unregister_send at f881a300
  [e3a5bd54] egenera_config_query_rpc at fb2ed2c1
  [e3a5bd84] do_get_write_access at f9a93e85
  [e3a5bdd4] journal_dirty_metadata at f9a945b9
  [e3a5bdf8] ext3_do_update_inode at f9aa7a64
  [e3a5be08] journal_get_write_access at f9a93ef6
  [e3a5be9c] wake_up_process at c0119ccb
  [e3a5beac] deliver_signal at c012795e
  [e3a5bebc] ignored_signal at c0127770
  [e3a5bed8] kdba_getregcontents at c023a3ca
  [e3a5bf0c] kdb_main_loop at c0184604
  [e3a5bf34] kdb at c0184cf6
  [e3a5bf70] egenera_nmi_check at c010857d
  [e3a5bf90] do_nmi at c01087b8
  [e3a5bfb8] nmi at c0107732
crash> whatis kernrpc_rclan_send
int kernrpc_rclan_send(kernrpc_transport_t *, kernrpc_transport_descriptor_t *, long unsigned int, int);

阅读(1939) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~