2008-03-27 11:43:04

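A few days ago a customer's S7A host crashed several times and produced a system dump file. Below is the process of using the crash command to find the cause of the crash.
# pwd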
/
# hostname
s7a01
# cd /var/adm/ras
# ls -l    (list the directory to find the dump file name)
total 395133
-rw-rw-r--   1 root     system      4226 Apr 02 2003  BosMenus.log
-rw-r--r--   1 root     system         2 Jan 07 2000  SRCSemID
-rw-------   1 root     system      8192 May 20 13:35 bootlog
-rw-r--r--   1 root     system      8388 Apr 02 2003  bosinst.data
-rw-rw-r--   1 root     system     16384 Apr 02 2003  bosinstlog
--w-------   1 root     system         2 May 16 15:47 bounds
-rw-r--r--   1 bin      bin       197206 Jan 01 1970  codepoint.cat
-rw--w--w-   1 root     system     16384 May 20 15:52 conslog
--w-------   1 root     system        21 May 16 15:47 copyfilename
-rw-r--r--   1 root     system     57078 Apr 02 2003  devinst.log
-rw-r--r--   1 root     system     83319 May 20 14:00 diag_log
-rw-------   1 root     system      8192 May 16 15:49 dumpsymplog
-rw-r--r--   1 root     system    151552 May 20 15:52 errlog
-rw-r--r--   1 root     system    151552 Apr 22 2004  errlog0422.log
-r--r--r--   1 bin      bin       103968 Jan 07 2000  errtmplt
-rw-r--r--   1 root     system      7949 Apr 02 2003  image.data
-rw-r--r--   1 root     system      8192 May 20 13:21 nimlog
-rw-rw-rw-   1 root     system   1334264 Jan 20 2000  trcfile
-rw-------   1 root     system   200136704 May 16 15:47 vmcore.0
# crash vmcore.0    (open the vmcore.0 dump file)
Using /unix as the default namelist file.
2 dump routines failed.  The following were recorded:
    0x0141cbe8 <.[kbddd_chrp:DATA]+9a8> failed with rc=14
    0x01422764 <.[msedd_chrp:DATA]+664> failed with rc=14
> stat    (show the system state at the time of the crash)
        sysname: AIX
        nodename: s7a01
        release: 3
        version: 4
        machine: 000AAD014C00
        time of crash: Tue May 16 15:05:18 TAIST 2006
        age of system: 22 hr., 51 min.
        xmalloc debug: disabled
        abend code: 300    (the error code; this value is the key piece of information)
        csa: 0x2ff3b400
        exception struct:
                dar:   0x00000000
                dsisr: 0x00000000:
                srv:   0x00000000
                dar2:  0x00000000
                dsirr: 0x00000000: (errno) "Error 0"
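
The abend code reported by `stat` is a hexadecimal value that corresponds to the PowerPC exception vector that was taken. As a minimal sketch, the lookup below covers only a few well-known codes; the table name `ABEND_CODES` and the helper `describe_abend` are illustrative and not part of any AIX tool.

```python
# Partial table of abend codes (PowerPC exception vector offsets).
# Only a handful of common values are listed; this is not exhaustive.
ABEND_CODES = {
    0x200: "machine check",
    0x300: "data storage interrupt (DSI)",
    0x400: "instruction storage interrupt (ISI)",
    0x700: "program interrupt",
}


def describe_abend(code_text: str) -> str:
    """Interpret the hexadecimal abend code string printed by `stat`."""
    code = int(code_text, 16)
    return ABEND_CODES.get(code, "unknown (not in this partial table)")


print(describe_abend("300"))
```

Code 300 therefore points at an invalid data access, which matches the DSI_PROC entry that `errpt` reports for the same timestamp.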
> trace -m
Skipping first MST

MST STACK TRACE:
0x2ff3b400 (excpt=00000004:0a000000:00000000:00000004:00000106) (intpri=11)
        IAR:      .compare_and_swap+2c (0000a4ec):     stw   r9,0x0(r4)
        LR:       .[aiopin:untie_knot]+a8 (0143d7a8)
        2ff3a2e0: .[aio.ext:qlioreq]+b0 (014376ec)
        2ff3a340: .[aio.ext:listio]+128 (01438f5c)
        2ff3b3c0: .sys_call_ret+0 (00003a6c)
        0001113a: lasttocentry+fead9 (00348001)
0452-771: Cannot read return address at address 0x01892c0b.
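
The `excpt=` tuple on the MST line carries the exception details that `stat` showed as zeros. The sketch below decodes it under two assumptions: that the fields appear in the same order as the `stat` listing (dar, dsisr, srv, dar2, dsirr), and that the DSISR bits follow the 32-bit PowerPC convention. The names `DSISR_BITS` and `decode_excpt` are hypothetical.

```python
# Selected DSISR bits from the 32-bit PowerPC architecture (assumed to apply here).
DSISR_BITS = {
    0x40000000: "translation not found (page fault)",
    0x08000000: "protection violation",
    0x02000000: "store operation",
}


def decode_excpt(excpt: str) -> dict:
    """Split an excpt tuple and name the DSISR bits that are set.

    Assumes the first field is the DAR and the second is the DSISR.
    """
    fields = [int(part, 16) for part in excpt.split(":")]
    dar, dsisr = fields[0], fields[1]
    reasons = [text for bit, text in DSISR_BITS.items() if dsisr & bit]
    return {"dar": dar, "dsisr": dsisr, "reasons": reasons}


info = decode_excpt("00000004:0a000000:00000000:00000004:00000106")
print(hex(info["dar"]), info["reasons"])
```

Read this way, the faulting access was a store to address 0x4, which agrees with the `stw r9,0x0(r4)` instruction at the IAR: r4 apparently held a small integer rather than a valid pointer.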

> le 0000a4ec
No loader entry found for module address 0x0000a4ec
No loader entry found for module named '0000a4ec'
> le 0143d7a8
LoadList entry at 0x04ea7980
  Module *start:0x00000000_0143bef0  Module filesize:0x00000000_0000228c
  Module *end:0x00000000_0143e17c
  *data:0x00000000_0143dbe8  data length:0x00000000_00000594
  Use-count:0x0001  load_count:0x0000  *file:0x00000000
  flags:0x00000262 TEXT DATAINTEXT DATA DATAEXISTS
  *exp:0x04ed8000  *lex:0x00000000  *deferred:0x00000000  expsize:0x6e6c732f
  Name: /usr/lib/drivers/aiopin 
  ndepend:0x0001  maxdepend:0x0001
  *depend[00]:0x05039280
  *le_next:  04ea7680

> le 014376ec
LoadList entry at 0x04ea7680
  Module *start:0x00000000_014348c0  Module filesize:0x00000000_00007624
  Module *end:0x00000000_0143bee4
  *data:0x00000000_0143a4c0  data length:0x00000000_00001a24
  Use-count:0x0003  load_count:0x0001  *file:0x00000000
  flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
  *exp:0x051e3000  *lex:0x00000000  *deferred:0x00000000  expsize:0x6c696263
  Name: /etc/drivers/aio.ext 
  ndepend:0x0002  maxdepend:0x0002
  *depend[00]:0x04ea7980
  *depend[01]:0x05039280
  *le_next:  04edb700

> le 01438f5c
LoadList entry at 0x04ea7680
  Module *start:0x00000000_014348c0  Module filesize:0x00000000_00007624
  Module *end:0x00000000_0143bee4
  *data:0x00000000_0143a4c0  data length:0x00000000_00001a24
  Use-count:0x0003  load_count:0x0001  *file:0x00000000
  flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
  *exp:0x051e3000  *lex:0x00000000  *deferred:0x00000000  expsize:0x6c696263
  Name: /etc/drivers/aio.ext 
  ndepend:0x0002  maxdepend:0x0002
  *depend[00]:0x04ea7980
  *depend[01]:0x05039280
  *le_next:  04edb700
The lookups show that the crash involves the module /usr/lib/drivers/aiopin.
> errpt    (show the error log entries recorded at the time of the crash)
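
Each `le` lookup above amounts to finding the load list entry whose start and end addresses enclose the address taken from the stack trace. A minimal sketch of that range check, using the start and end values printed above, might look like this; the `MODULES` list and `find_module` are illustrative, and treating the end address as exclusive is an assumption.

```python
# (name, start, end) taken from the `le` output in the transcript.
MODULES = [
    ("/usr/lib/drivers/aiopin", 0x0143BEF0, 0x0143E17C),
    ("/etc/drivers/aio.ext", 0x014348C0, 0x0143BEE4),
]


def find_module(address: int):
    """Return (module name, offset into module), or None if no entry matches."""
    for name, start, end in MODULES:
        if start <= address < end:  # end treated as exclusive (assumption)
            return name, address - start
    return None


# IAR, LR and the two saved return addresses from the stack trace.
for addr in (0x0000A4EC, 0x0143D7A8, 0x014376EC, 0x01438F5C):
    print(hex(addr), find_module(addr))
```

The IAR (0x0000a4ec) matches no entry, presumably because `compare_and_swap` lives in the base kernel rather than in a loaded extension, while the LR falls inside aiopin and the two callers fall inside aio.ext.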

LAST ERRORS READ BY ERRDEMON (MOST RECENT LAST):
    Tue May 16 15:05:18 TAIST: DSI_PROC        data storage interrupt : processor
        Resource Name: SYSVMM         
        0a000000 00000000 00000004 00000086

LAST 3 ERRORS READ BY ERRDEMON (MOST RECENT FIRST):
> od vmmerrlog 9
> proc - 0
SLT ST    PID   PPID   PGRP   UID  EUID  TCNT  NAME
  0 a       0      0      0     0     0     1  swapper
        FLAGS: swapped_in no_swap fixed_pri kproc

Links:  *child:0xe20030c0  *siblings:0x00000000  *uinfo:0x50004020(0x0038)
    *ganchor:0x00000000  *pgrpl:0x00000000  *ttyl:0x00000000
Dispatch Fields:  pevent:0x00000000  *synch:0xffffffff
    lock:0x00000000  lock_d:0x00000000
Thread Fields:  *threadlist:0xe6000000  threadcount:1
    active:1  suspended:0  local:0   terminating:0
Scheduler Fields:   fixed pri: 16  repage:0x00000000  scount:0  sched_pri:0
    *sched_next:0x00000000  *sched_back:0x00000000 cpticks:3087
    msgcnt:0    majfltsec:0
Misc:  adspace:0x0003c00f  kstackseg:0x00000000  xstat:0x0000
    *p_ipc:0x00000000  *p_dblist:0x00000000  *p_dbnext:0x00000000
Signal Information:
    pending:hi 0x00000000,lo 0x00000000
    sigcatch:hi 0x00000000,lo 0x00000000  sigignore:hi 0xffffffff,lo 0xfff7ffff
Statistics:  size:0x00000000(pages)  audit:0x00000000
    accounting page frames:0   page space blocks:0
    Number of virtual pages in use :0  

    pctcpu:0    minflt:1987    majflt:7
> thread - 0
SLT ST    TID      PID    CPUID  POLICY PRI CPU    EVENT  PROCNAME
  0 s       3        0  unbound    FIFO  10  78            swapper
        t_flags:  wakeonsig kthread

Links:  *procp:0xe2000000  *uthreadp:0x2ff3b400  *userp:0x2ff3b6e0
    *prevthread:0xe6000000  *nextthread:0xe6000000,  *stackp:0x00000000
    *wchan1(real):0x00000000  *wchan2(VMM):0x00000000 *swchan:0x00000000
    wchan1sid:0x00000000  wchan1offset:0x00000000
    pevent:0x00000000  wevent:0x00000001  *slist:0x00000000
Dispatch Fields:  *prior:0xe6000000  *next:0xe6000000
    polevel:0x0000000a  ticks:0x0c0f  *synch:0xffffffff  result:0x00000000
    *eventlst:0x00000000  *wchan(hashed):0x00000000  suspend:0x0001
    thread waiting for:  event(s)
Scheduler Fields:  cpuid:0xffffffff  scpuid:0xffffffff  pri: 16  policy:FIFO
    affinity:0x0001  affinity_ts:0x3b6e31e  cpu:0x0078  run_queue:34a900
    lpri:  0  wpri:127    time:0x00  sav_pri:0x10
Misc:  lockcount:0x00000000  ulock:0x00000000  *graphics:0x00000000
    dispct:0x00031718  fpuct:0x00000001  boosted:0x0000
    userdata:0x00000000
    fsflags: 00000000   adsp_flags: 0000
Signal Information:  cursig:0x00  *scp:0x00000000
    pending:hi 0x00000000,lo 0x00000000  sigmask:hi 0x00000000,lo 0x00000000


> q

# lslpp -w /usr/lib/drivers/aiopin    (find the fileset that owns this file)
  File                                        Fileset               Type
  ----------------------------------------------------------------------------
  /usr/lib/drivers/aiopin                     bos.rte.aio           File

 


# lslpp -ah bos.rte.aio    (the fileset is at level 4.3.3.1)
  Fileset         Level     Action       Status       Date         Time       
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  bos.rte.aio
                  4.3.3.0   COMMIT       COMPLETE     01/01/70     08:29:52   
                  4.3.3.1   COMMIT       COMPLETE     01/07/00     09:57:11   
                  4.3.3.1   APPLY        COMPLETE     01/07/00     09:55:52   

Path: /etc/objrepos
  bos.rte.aio
                  4.3.3.0   COMMIT       COMPLETE     01/01/70     08:29:52   
                  4.3.3.1   COMMIT       COMPLETE     01/07/00     09:57:11   
                  4.3.3.1   APPLY        COMPLETE     01/07/00     09:55:53   
 

So the crash is related to bos.rte.aio. A search of the IBM website turned up the following APAR:

IY05599: AIO CRASH IN COMPARE_AND_SWAP 00/01/14 PTF PECHANGE

APAR status
Closed as program error.

Error description
When the parameter passed to the compare_and_swap() expected
to be a pointer to an integer, but the code passed an integer.
I/O on this address (small integer) caused the system crashed
with DSI.
Local fix
Problem summary
***************************************************************
*USERS AFFECTED:                                              *
* All users with the following filesets at   these levels   *
*   bos.rte.aio 4.3.3.1.
***************************************************************
*PROBLEM DESCRIPTION:                                         *
*  When the parameter passed to the compare_and_swap()
*  expected to be a pointer to an integer, but the code
*  passed an integer. I/O on this address (small
*  integer) caused the system crashed with DSI.
***************************************************************
*RECOMMENDATION:                                              *
*  Apply apar IY05599
***************************************************************
Problem conclusion
Corrected the parameter passed to compare_and_swap calls.
Temporary fix
Comments
      APAR information
      APAR number IY05599
      Reported component name AIX 4.3.0
      Reported component ID 5765C3403
      Reported release 430
      Status CLOSED PER
      PE YesPE
      HIPER NoHIPER
      Submitted date 1999-11-02
      Closed date 1999-11-08
      Last modified date 2000-10-17

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:


      Fix information
      Fixed component name AIX 4.3.0
      Fixed component ID 5765C3403

      Applicable component levels
      R430 PSY U467596    UP99/12/21 I 1000


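This confirms that the machine needs the corresponding fix (APAR IY05599) applied before the crashes can be resolved for good.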
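
The final check, comparing the installed fileset level against the level the APAR lists as affected, can be sketched as follows. The helpers `parse_level` and `installed_level` are hypothetical, and the sketch assumes the highest level in the `lslpp -ah` history is the one currently installed.

```python
def parse_level(text: str) -> tuple:
    """Turn a level such as '4.3.3.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def installed_level(history_levels) -> tuple:
    """Assume the highest level seen in the lslpp history is the installed one."""
    return max(parse_level(level) for level in history_levels)


# Level named under "USERS AFFECTED" in APAR IY05599.
AFFECTED_LEVEL = parse_level("4.3.3.1")

# Levels listed for bos.rte.aio in the lslpp -ah output.
history = ["4.3.3.0", "4.3.3.1", "4.3.3.1"]

current = installed_level(history)
print("affected" if current == AFFECTED_LEVEL else "not at the affected level")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes once a level component reaches two digits.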
