2005-12-22 11:43:04
Ran into a JFS bug. Fortunately, JFS on that machine is built as a loadable kernel module...
Log records:
Entries from the messages log:
Dec 18 13:34:29 stg009 kernel: assert(newval == leaf[buddy])
Dec 18 13:34:29 stg009 kernel: ------------[ cut here ]------------
Dec 18 13:34:29 stg009 kernel: kernel BUG at jfs_dmap.c:2701!
Dec 18 13:34:29 stg009 kernel: invalid operand: 0000
Dec 18 13:34:29 stg009 kernel: nfsd iptable_filter ip_tables autofs nfs lockd sunrpc ians e1000 microcode nls_iso8859-1 jfs keybdev mousedev hid input usb-ohci usbcore ext3 jbd aic79xx mpts
Dec 18 13:34:29 stg009 kernel: CPU: 0
Dec 18 13:34:29 stg009 kernel: EIP: 0060:[
Dec 18 13:34:29 stg009 kernel: EFLAGS: 00010282
Dec 18 13:34:29 stg009 kernel:
Dec 18 13:34:29 stg009 kernel: EIP is at dbJoin [jfs] 0x64 (2.4.20-31.9nks2smp)
Dec 18 13:34:29 stg009 kernel: eax: 0000001e ebx: 00000005 ecx: c03ad8a8 edx: c47a3c5c
Dec 18 13:34:29 stg009 kernel: esi: 00000054 edi: c9ac9010 ebp: 00000001 esp: c47a3cc0
Dec 18 13:34:29 stg009 kernel: ds: 0068 es: 0068 ss: 0068
Dec 18 13:34:29 stg009 kernel: Process jfsCommit (pid: 713, stackpage=c47a3000)
Dec 18 13:34:29 stg009 kernel: Stack: f89f036c f89f04b8 de8eba60 c9ac9076 00000055 c9ac9000 00000056 00000000
Dec 18 13:34:29 stg009 kernel: 00000001 f89dd403 c9ac9010 00000055 00000005 f76763b4 c9ac9010 00000001
Dec 18 13:34:29 stg009 kernel: 00000aa6 063eaaa6 00000000 00000009 063eaaa6 00000000 c9ac9000 f89dcebd
Dec 18 13:34:29 stg009 kernel: Call Trace: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
Dec 18 13:34:29 stg009 kernel: [
------------------------------------------------------------------------------------------------------------
An explanation I found:
JFS definitely ran into some corrupt metadata. It could either be
corrupt on disk, or a memory-corruption problem. I don't know if
hardware is the cause, but you say that nothing indicates that. If it
is caused by a software bug, it would be hard to track down unless it is
repeatable. The BUG() itself should probably be replaced by something
nicer, like the first two errors reported by dbFree.
At this point the superblock should be marked dirty, so fsck should
attempt to repair the damage upon reboot. I'd like to know if you
continue to see problems.
Thanks,
Shaggy
-------------------------------------------------------------------------------------------
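The JFS volume is mounted over NFS on another machine, S. The load on S kept climbing and processes in state D (uninterruptible sleep) appeared. We guessed an I/O error, and df hung every time. A senior labmate traced it to this JFS bug; still investigating.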
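To confirm the "processes stuck in state D" symptom without ps, the state letter can be read from /proc directly. Below is a minimal sketch, not something we ran at the time: proc_state and count_dstate are names made up here, and the only assumption is the usual /proc/<pid>/stat layout of "pid (comm) state ...".

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return the state letter from one /proc/<pid>/stat line, or '\0' if the
 * line is malformed.  The command name may itself contain ')' characters,
 * so search for the LAST closing parenthesis. */
static char proc_state(const char *stat_line)
{
	const char *close = strrchr(stat_line, ')');

	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}

/* Count processes currently in state D (uninterruptible sleep).
 * Returns -1 if /proc cannot be opened. */
static int count_dstate(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *ent;
	int count = 0;

	if (proc == NULL)
		return -1;
	while ((ent = readdir(proc)) != NULL) {
		char path[300], line[512];
		FILE *fp;

		if (!isdigit((unsigned char)ent->d_name[0]))
			continue;	/* not a PID directory */
		snprintf(path, sizeof path, "/proc/%s/stat", ent->d_name);
		fp = fopen(path, "r");
		if (fp == NULL)
			continue;	/* process exited in the meantime */
		if (fgets(line, sizeof line, fp) != NULL &&
		    proc_state(line) == 'D')
			count++;
		fclose(fp);
	}
	closedir(proc);
	return count;
}
```

A count that keeps growing alongside the load average would be consistent with processes piling up behind the hung filesystem, though it does not by itself say which filesystem is responsible.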
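In this situation neither init 6 nor reboot works. My guess at the reason: on halt the system has to write dirty memory back to disk, and this bug is triggered precisely by that dirty data, so the halt fails... In the end we had to do a hardware reset. :(
The relevant code: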
	/* if the leaf's new value is greater than its
	 * buddy's value, we join no more.
	 */
	if (newval > leaf[buddy])
		break;
	assert(newval == leaf[buddy]);
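To see why that assert can fire, here is a simplified model of the buddy join loop around it. This is only a sketch under my own assumptions, not the real dbJoin: join_leaf is a made-up name, the leaves are a plain array with no summary tree above them, and the inconsistent case returns -1 instead of calling BUG(), in the spirit of the "something nicer" suggested in the reply quoted above.

```c
#include <stdio.h>

#define NOFREE (-1)	/* leaf owns no free blocks (allocated, or merged into its left buddy) */

/* Hypothetical simplified model: leaf[i] holds log2 of the size of the free
 * extent starting at leaf i, or NOFREE.  Frees leaf `leafno` at size
 * `newval` and joins it with equal-sized free buddies as far as possible.
 * Returns 0 on success, -1 if the map is inconsistent. */
static int join_leaf(signed char *leaf, int nleafs, int budmin,
		     int leafno, int newval)
{
	if (newval >= budmin) {
		int budsz = 1 << (newval - budmin);	/* leaves covered by newval */

		while (budsz < nleafs) {
			int buddy = leafno ^ budsz;

			/* buddy is smaller or not free: stop joining */
			if (newval > leaf[buddy])
				break;

			/* A buddy LARGER than newval would overlap the leaf
			 * being freed, so the map must be corrupt.  This is
			 * the case the kernel assert trips on. */
			if (newval != leaf[buddy]) {
				fprintf(stderr, "corrupt map: leaf[%d]=%d, expected %d\n",
					buddy, leaf[buddy], newval);
				return -1;
			}

			/* the left buddy claims the joined extent */
			if (leafno < buddy) {
				leaf[buddy] = NOFREE;
			} else {
				leaf[leafno] = NOFREE;
				leafno = buddy;
			}
			newval += 1;
			budsz <<= 1;
		}
	}
	leaf[leafno] = newval;
	return 0;
}
```

For example, with four leaves {NOFREE, 0, 1, NOFREE} and budmin 0, freeing leaf 0 at size 0 joins twice and leaves {2, NOFREE, NOFREE, NOFREE}. With {NOFREE, 1, NOFREE, NOFREE} the buddy claims an extent that would overlap leaf 0, which is the inconsistent state the assert catches.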
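Possible fix: if JFS is loaded as a module, download the latest JFS source and rebuild the module. One possible problem: module dependencies could leave the kernel unstable.
The current workaround... reboot periodically.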