2009-09-22 21:22:30

Debugging a Failed AIX System Dump

 Technote (FAQ)
 
Problem
Debugging a Failed AIX System Dump
 
Solution

This document is intended for persons involved in customer support, and for customers themselves. The examples are primarily for customers who would like to provide some additional information on a failed dump before contacting IBM support, to speed up analysis.

This document explains how to gather information about the cause of a crash before attempting to dump. It does not include a discussion of general debugging techniques.

At the end of this document is a brief description of kdb, the kernel debugger.

How can the dump fail?
Hard dump failures
Dump debugging examples
Some things you should know about KDB


How can the dump fail?

There are basically two types of AIX system dump failures. The first is a hard failure, identified by a dump completion code other than 0C0 in the panel display, or a non-zero dump status as shown by the sysdumpdev -L command. The second type is a dump that appears good (dump status 0) but that kdb, or crash in AIX prior to version 5, cannot process. This document describes how to collect information to send to IBM support to debug these problems.


Hard dump failures

The dump failures seen most often are those with a dump status of -3, shown as 0C5 in the front panel display. The other failures are mostly self-explanatory and can usually be corrected by the system administrator. The 0C5, however, means that the dump was unable to complete due to an internal failure. This is usually caused by a hardware failure or by memory corruption, such as overwriting kernel data areas.

If you get an 0C5, you'll most likely want to first identify what caused your system to dump in the first place. Note that getting an 0C5 doesn't necessarily mean the dump is useless. If, upon reboot, sysdumpdev -L shows the dump size to be non-zero, then at least something was dumped. In this case, first try to use crash or kdb on the dump. If it comes up with no errors, then the dump is very likely good enough to solve the problem. If kdb or crash is unable to process the dump, the dmp_minimal facility (AIX version 5 and above) may be useful.

If you started the dump manually, either with the sysdumpstart command, the system reset, or the dump key sequence, then of course you already know the cause of the dump. However, if the dump just happened, this means the kernel crashed, usually due to a bad memory reference or a bad instruction or trap. You'll need to use the kernel debugger in this case to see what the original cause was.

The first thing to do is enable the kernel debugger. See the relevant section in the AIX documentation, paying particular attention to the subsections "Loading and Starting the KDB Kernel Debugger in AIX 4.3.3" or "Loading and Starting the KDB Kernel Debugger in AIX 5.1 and Subsequent Releases". Also be sure to read "Using a Terminal with the KDB Kernel Debugger".

Note that you may of course use another machine emulating a tty, such as is done by the "cu" command. In fact, this method is preferred, because the script command can be used to log the cu session. This log will capture all data written to the screen, and may then be sent to IBM support for analysis.

The rest of this discussion assumes kdb has been enabled, i.e., the bosboot has been done and the system rebooted.

Finding the cause of a system crash

With kdb enabled, if the system crashes due to an exception, kdb becomes active at the point of failure. Here is an example.

Data Storage Interrupt - PROC
uiocopyin_ppc+0001C4     stbx    r7,r6,r4            r7=0000000A,r6=0,r4=0
KDB(0)> where
pvthread+001580 STACK:
[00303524]uiocopyin_ppc+0001C4 ()
[00075360]uiomove+0001B4 (??, ??, ??, ??)
[0065313C]mmrw+000138 (??, ??, ??, ??, ??)
[005F4394]rdevwrite+00017C (??, ??, ??, ??)
[006DF834]cdev_rdwr+000264 (??, ??, ??, ??, ??, ??, ??)
[006A6B0C]spec_rdwr+000098 (??, ??, ??, ??, ??, ??, ??, ??)
[00608D24]vnop_rdwr+000094 (??, ??, ??, ??, ??, ??, ??, ??)
[005E1130]rwuio+0000D0 (??, ??, ??, ??, ??)
[005E1348]rdwr+00013C (??, ??, ??, ??, ??)
[005E0A74]kwrite+0000EC (00000001, 2021B738, 00000001)
[00003A78].sys_call+000000 ()
[D01DFA68]write+000148 (??, ??, ??)
[1000B750]p_flush+000118 ()
[10004CF0]io_fclose+000054 (??)
[1000349C]io_renumber+00004C (??, ??, ??, ??)
[100031AC]io_restore+0000C0 (??, ??)
[10020254]xec_builtin+00027C (??, ??, ??, ??, ??, ??)
[100206A4]xec_switch+0003EC (??, ??, ??, ??, ??)
[100220B8]sh_exec+000300 (??, ??, ??)
[10001994]exfile+000624 ()
[10000D18]main+000A18 (??, ??)
(0)> more (^C to quit) ?
[10000188]__start+000088 ()
KDB(0)> th
                SLOT NAME     STATE    TID PRI  RQ CPUID  CL WCHAN
pvthread+001580   43*ksh      RUN   002B73 03C   0 *0000   0
NAME................ ksh
...............state :00000002  ...............wtype :00000000
...............flags :00000000  ..............flags2 :00000000
DATA.........pvprocp :E2003E00 

Note that if you are on the system, but not on the terminal session where kdb becomes active, the system appears to hang. This is because kdb must, in this case, take over the entire machine. If the terminal you're on appears to lock up, and kdb is enabled on another terminal, check to see if kdb is now active.

Note the first two lines of kdb output in the figure above:

Data Storage Interrupt - PROC
uiocopyin_ppc+0001C4     stbx    r7,r6,r4            r7=0000000A,r6=0,r4=0

The system crashed due to an attempt to store to location 0 in kernel memory. The where subcommand is then used to display the thread's call/return stack. Then the th subcommand is used to list information about the current thread, "ksh" in this case.

Note if the debugger was entered due to a failure such as is described here, then the halt_display point has already been reached. See Interesting dump routines for more information about halt_display.

After you have gathered whatever information you need to gather regarding the initial cause of the system crash, you can turn on the dump's debug feature by setting the dump_debug kernel variable to one. This causes the system dump to report progress on the kdb terminal while the dump is happening. This facility is useful if it is possible to get to the system dump. If you enable the dump_debug feature, and see no output when you attempt to take a dump, it is likely the debug code didn't even get to the dump.

Example 1

KDB(0)> mw dump_debug
dump_debug+000000:  00000000  = 1
sys_resource_ptr+000000:  E0000000  = .
KDB(0)> g
Dump started on CPU 0.
Calling cdt routine at 2F87B8, bufsize 0
dump table at 0x21BF78, name dmp_minimal, len 124, 5 entries.
data area bldtime: seg:ofst 0:21BFF8, len 48.
data area vars: seg:ofst 0:A68010, len 24.
data area mst: seg:ofst D3CD:2FF3B400, len 336.
data area stack: seg:ofst D3CD:2FF3B000, len 16384.
data area dbgtbls: seg:ofst 0:A67970, len 1024.
Calling cdt routine at 2F8128, bufsize 360
initial part of unlimited dump table at 0x30044200,     name proc, entries 3
data area 0pv at        addr E2000000 and length: 200
data area 0p at addr 1FA2200 and length: 148
data area 0U at addr 2FF3B400 and length: C4C00
data area 1pv at        addr E2000200 and length: 200
data area 1p at addr 1FA2600 and length: 148
data area 1U at addr 2FF3B400 and length: C4C00
data area 2pv at        addr E2000400 and length: 200
[snip]
dump table at 0x21BF60, name end-of-dump, len 24, 0 entries.
dmp_complete, type=3, status=0.
Dump Details are as follows:
dump status: 0
dump flags: 0x3
dump type(primary/sec): 1
dump Device name: /dev/hd6
dump major/min dev nos: 0xA0002
device status: 0
dump size: 0
dump magic - ver info: 0xDDDD0001
processor id: 0

Example 2

KDB(0)> mw dump_debug
dump_debug+000000:  00000000  = 4
dump_debug+000004:  00000000  = .
KDB(0)> 
KDB(0)> g 
$ sysdumpstart -p
Static breakpoint:
.brkpoint+000000     tweq    stkp,stkp           stkp=F00000002FF39FD0
.brkpoint+000004      blr                        <.dmp_do+000054> 
r3=0000000000000004
KDB(0)> g
Static breakpoint:
.brkpoint+000000     tweq    stkp,stkp           stkp=F100009E1460EE90
.brkpoint+000004      blr                        <.idmp_do+00005C> 
r3=0000000000000004
KDB(0)> g
Static breakpoint:
.brkpoint+000000     tweq    stkp,stkp           stkp=F100009E1460ED80
.brkpoint+000004      blr                        <.dump_op+000074> 
r3=0000000000000002
KDB(0)> g
Static breakpoint:
.brkpoint+000000     tweq    stkp,stkp           stkp=F00000002FF39EB0
.brkpoint+000004      blr                        <.sr_slih+000030> 
r3=0000000000000001
KDB(0)> g

In the first example, the first kdb subcommand shown above changed the value of dump_debug from 0 to 1. The g or go subcommand tells the debugger to continue with normal execution which, in this case, is to take a dump.

In the second example, the dump_debug flag's 3rd bit is set to enable static breakpoints at key dump routines. This tells kdb to break at the first c instruction of each of these routines. In this example, the dump is forced from the command line. In the case of a dump triggered by the reset button, kdb would hit static breakpoints at check_key() and sr_slih() prior to dmp_do().

Debug messages and breakpoints can both be enabled by setting the dump_debug flag's first and third bits, that is, by setting dump_debug to 5.

NOTE: The additional debugging feature provided through the dump_debug flag's second and third bits is only enabled starting with kernel level 5.1.0.50 for the 51 release, and starting with kernel level 5.2.0.10 for the 52 release. Further, setting the second bit is purposely left out of the examples above. This is because setting the second bit enables a debug message in dump_op(), which slows down the dump greatly because dump_op() is hit many times. That bit is intended to enable a single debug statement in each of the routines where the static breakpoints exist. As indicated previously, these routines include check_key(), sr_slih(), dmp_do() and dump_op(). The second and third bits share a common goal, which is to signal that key dump routines are hit. Since breakpoints are most important, the use of bit 2 can be bypassed for now without much impact on this new debug feature. APAR IY51358 for release 51 and APAR IY51359 for release 52 will address the extra debug messages so that the debug message for dump_op() is only output the first time that the routine is called. Until these APARs are available in the service stream and installed on a customer's system, only the first and third bits should be used. On systems where IY51358 and IY51359 are not installed, the values allowed for the dump_debug flag are 0, 1, 4 and 5.
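
The bit layout described above can be sketched as a small decoder. This is only an illustration for checking a value before typing it into kdb; the constant and function names are made up for this example and are not part of AIX.

```python
# Bit meanings are taken from the text above; the names are illustrative only.
DEBUG_MESSAGES = 0x1      # first bit: progress messages on the kdb terminal
ROUTINE_MESSAGES = 0x2    # second bit: a debug message in each key dump routine
STATIC_BREAKPOINTS = 0x4  # third bit: static breakpoints in key dump routines

# Values the text allows when IY51358 / IY51359 are not installed.
SAFE_WITHOUT_APAR = {0, 1, 4, 5}


def describe_dump_debug(value):
    """Return the list of features a dump_debug value would enable."""
    features = []
    if value & DEBUG_MESSAGES:
        features.append("debug messages")
    if value & ROUTINE_MESSAGES:
        features.append("key-routine messages")
    if value & STATIC_BREAKPOINTS:
        features.append("static breakpoints")
    return features


def is_safe_without_apar(value):
    """True if the value avoids the second bit, as the note recommends."""
    return value in SAFE_WITHOUT_APAR


if __name__ == "__main__":
    for value in (0, 1, 4, 5, 7):
        print(value, describe_dump_debug(value), is_safe_without_apar(value))
```

For instance, the value 5 decodes to debug messages plus static breakpoints, which matches the combination described earlier.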

In the output from the first example, note that the dump size of 0 does not indicate a problem with the dump; rather, it is a problem with the dump debug facility. The end-of-dump dump table was dumped, so the dump was good in this case. The dump size should be obtained from the sysdumpdev -L command after the reboot.

If the dump had failed in the middle, debug output for the last item dumped would be displayed, providing an idea where the problem is. Such a dump may indeed be usable though. If enough memory was dumped, kdb or crash will be able to process the dump, and the problem can most likely be debugged.

It may be, however, that not enough was dumped. In this case, the first dump table, dmp_minimal, is important.

Starting with AIX version 5, a small dump area, known as dmp_minimal, is dumped first. This dump table attempts to summarize the cause of a system crash. The dmp_minimal table is generally not useful in the case of a user-initiated dump. To view the dmp_minimal data, use the /usr/lib/ras/dmprtns/dmp_minimal command. Some sample output is shown below.

# sysdumpdev -L
0453-039
Device name:         /dev/hd6
Major device number: 10
Minor device number: 2
Size:                91735040 bytes
Date/Time:           Fri May 10 07:34:37 CDT 2002
Dump status:         0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.2
# /usr/lib/ras/dmprtns/dmp_minimal /var/adm/ras/vmcore.2
The kernel build time is:  Apr 22 2002 12:58:41
crash_mst:2ff3b400 used_mst:2ff3b400 LED:30000000 CPUID:0
mst at 2ff3b400
prev:0x0 kjmpbuf:0x0 stackfix:0x0
intpri:0xb backt:0x0 flags:0x0000
curid:0x1f7a excp_type:0x0 iar:0x303524 msr:0x90b2 cr:0x28424822
lr:0x75364 ctr:0x1 xer:0x0 mq:0x0 tid:0x0
fpscr:0x0 fpeu:0x1 fpinfo:0x0
except: 0x0 0xa000000 0x0 0x0 0x106 mstext: 0x0
[snip]
kernel stack:
2ff3b000:  2ff3b060 7007e310 00609364 7007e05c
2ff3b010:  2ff3b050 00705c88 02062fa4 7007e05c
2ff3b020:  00000000 00000012 7007e05c 007fffff
2ff3b030:  2ff3b090 40001db6 005e8abc 7007e2a0
2ff3b040:  00000001 0cd8b637 006a6fb4 00000008
[snip]
traceback:
 iar:00303524 unknown
  lr:00075364 uiomove+1b8
addr:00609364 vnop_getattr+1c
addr:00653140 mmrw+13c
addr:005f4398 rdevwrite+180
addr:006df838 cdev_rdwr+268
addr:006a6b10 spec_rdwr+9c
addr:00608d28 vnop_rdwr+98
addr:005e1134 rwuio+d4
addr:005e134c rdwr+140
addr:005e0a78 kwrite+f0
addr:00003a7c unknown

First, note that sysdumpdev -L shows that some data was dumped. In this case, the dump was successful. In general, if at least 32k of data was dumped, the dmp_minimal information will be usable. The information provided will, in this case, tell us there was a protection exception taken by an attempted write to location 0. The final part of the data shows us the traceback of the problem. Once again, this feature only exists on AIX versions 5 and above.
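
As a rough sketch of the check described above, the following parses captured sysdumpdev -L output and applies the 32k rule of thumb. The field labels are copied from the sample output in this document; the function names and the assumption that the labels look the same on every release are hypothetical.

```python
import re

MIN_BYTES_FOR_DMP_MINIMAL = 32 * 1024  # "at least 32k", per the text above


def parse_sysdumpdev(output):
    """Extract size, status and copy filename from sysdumpdev -L output."""
    def field(pattern, convert=str):
        match = re.search(pattern, output, re.MULTILINE)
        return convert(match.group(1)) if match else None

    return {
        "size": field(r"^Size:\s+(\d+)\s+bytes", int),
        "status": field(r"^Dump status:\s+(-?\d+)", int),
        "copy": field(r"^Dump copy filename:\s*(\S+)"),
    }


def dmp_minimal_usable(info):
    """True if enough data was dumped for dmp_minimal to be worth trying."""
    return info["size"] is not None and info["size"] >= MIN_BYTES_FOR_DMP_MINIMAL


# Sample text copied from the sysdumpdev -L output shown in this document.
SAMPLE = """\
0453-039
Device name:         /dev/hd6
Major device number: 10
Minor device number: 2
Size:                91735040 bytes
Date/Time:           Fri May 10 07:34:37 CDT 2002
Dump status:         0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.2
"""

if __name__ == "__main__":
    info = parse_sysdumpdev(SAMPLE)
    print(info, dmp_minimal_usable(info))
```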

If your system dumps, either as a result of a crash or a user-initiated dump, and the dump size is 0, then no data was dumped. Note, however, that if the date/time of the dump is correct, the code at least made it into the dump facility, and was likely unable to do I/O to the dump device for some reason. The DUMP_STATS error log entry may shed some light here.

---------------------------------------------------------------------------
LABEL:          DUMP_STATS
Date/Time:       Thu Apr 18 10:25:47 CDT
Type:            UNKN
Resource Name:   SYSDUMP
Description
SYSTEM DUMP
Detail Data
DUMP DEVICE
/dev/rmt0
DUMP SIZE
             152358912
TIME
Thu Apr 18 10:21:46 2002
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
           0
FILE NAME
PROCESSOR ID
           0

This shows the DUMP_STATS error log entry from a system dump taken to tape, /dev/rmt0. Note that this example shows a successful dump, but if the dump had failed with an 0C5 in the panel display, the DUMP STATUS would have been -3, and the ERROR CODE may have been non-zero. This code may be set by the dump device handler if an I/O error occurred writing to the dump device.

IBM support may be able to tell what the problem was by looking at the error log, which may contain entries dealing with other disk errors. IBM support may also need to set breakpoints in the dump facility; see Dump breakpoints for a discussion of where to put these breakpoints.

Here are some of the interesting points in the system dump facility where breakpoints may be set in order to debug, or provide information on a dump failure. These breakpoints are valid in AIX versions 4 and 5.1. Following the discussion of the various entry points, there is a description of how data flows through the dump. You may also want to refer to the dump examples section at the end of this document.

These entry points are either in the system dump facility, or, in the case of the initial ones, entry points called leading up to the dump.

sr_slih
r3 contains a code telling us how we got here. If you got here via the reset button, r3 should be 0. If r3 is 1, sr_slih was reached by a direct call, and if 2, sr_slih was reached from dump completion.

check_key
This has no parameters. If the system is supposed to dump, i.e., the Always Allow System Dump flag is set (sysdumpdev -K) or the system's key is in the service position, then check_key should go to the dump, dmp_do.

v_exception
This is called if the system crashes with an exception such as a bad memory reference or an instruction trap. The first parameter, r3, gives the exception code, and the second parameter in r4 shows the 32-bit virtual address of the exception. Note, however, that with the debugger enabled, you should get to the debugger with the exception, see the section on the dump flow.

p_slih
This is called by v_exception to handle some exceptions.

halt_display
This is called from v_exception or p_slih. The first parameter in r3 points to the mstsave area in use when the system crashed, i.e., the mst of the cause of the crash. Register r4 contains the front panel code, 3 hex digits shifted to the leftmost part of the word. For example, r4 is 0x30000000 if 300 is to be displayed on the panel display. halt_display is the function that calls the debugger with the exception, and then invokes the system dump when the debugger returns.

dmp_do
There are two parameters, passed in registers r3 and r4. r3 will usually contain a 3 if the system crashed, or a 4 if the dump was forced by the system reset, i.e., we came from check_key. The value of r4 should be 0, 1, or 2. If 0, the dump happened automatically as the result of a crash; 1 means we're dumping as the result of a key sequence such as ctrl-alt-numpad1; and 2 means reset was hit. dmp_do sets up for the dump, including stopping all processors other than the one taking the dump.

idmp_do
This is the routine that actually takes the dump.

dump_op
After reaching idmp_do, set a breakpoint here to see the dump operations; the operation code is in r3. You should first see r3=2, dump starts. You should then see r3=3, write dump data. If you see a value of 4, this is the DUMPEND value.

callfunc
This routine calls the dump handlers for each area of the dump. It is called at least twice for each component dump area. For unlimited dump areas, AIX v5 and above, such as those for the process and thread tables, it is called repeatedly to get more data to dump.

wr_cdt and wr_cdtu
These routines write the dump data for each dump table entry, i.e., they write the dump areas returned by the component dump handlers called from callfunc.

jwrite
This function buffers data to be written by jwrite_io. The buffer address is in r4 for the 64-bit kernel, and r4-r5 in the 32-bit kernel. Note that this may be a 64-bit real address in the 32-bit kernel. The number of bytes to dump is in r5 or r6, depending upon the kernel.

jwrite_io
This is the function that passes the data to the dump device handler. Register r4 points to the data and r5 contains the length.

ddcompress
ddcompress is called to compress the data to be written. If dump compression is on, ddcompress is called instead of jwrite, although jwrite is still called to buffer the data when it is to be written. The interesting parameters to ddcompress are similar to those for jwrite, r4 or r4-r5 contain the data's address, r4-r5 on the 32-bit kernel, and r5 or r6 contains the length.

dmp_complete
This is called when the dump is finished. Of particular interest is the value of r4, the dump return code. These are defined in /usr/include/sys/dump.h. They have the form #define DMPDO_... , and correspond to the dump return codes shown by sysdumpdev -L.
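
When reading register values at these breakpoints, a small lookup helper can save some mental arithmetic. The sketch below only encodes values stated in the descriptions above; the table and function names are invented for illustration, and the panel-code decoding assumes r4 is treated as a 32-bit word.

```python
# Register meanings as stated in the routine descriptions above.
SR_SLIH_R3 = {0: "reset button", 1: "direct call", 2: "dump completion"}
DMP_DO_R4 = {0: "automatic dump after a crash", 1: "key sequence", 2: "reset"}
DUMP_OP_R3 = {2: "dump start", 3: "write dump data", 4: "dump end (DUMPEND)"}


def led_from_r4(r4):
    """Decode the halt_display panel code from r4.

    The text says the three hex digits sit in the leftmost part of the
    word, so for a 32-bit word they occupy the top 12 bits.
    """
    return format((r4 >> 20) & 0xFFF, "03X")


def describe(table, value):
    """Look up a register value, flagging anything the text does not cover."""
    return table.get(value, "not described in this document")


if __name__ == "__main__":
    print(led_from_r4(0x30000000))    # value from the halt_display example
    print(describe(DUMP_OP_R3, 3))
    print(describe(SR_SLIH_R3, 1))
```

The same decoding applies to the LED field printed by dmp_minimal, which appears as 30000000 in the sample output earlier in this document.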

Starting with AIX 5.2 ML 01, if AIX is running in an LPAR environment and reset is sent to the partition, then the system will dump to non-removable media, regardless of the logical key position. For all cases where a system dump is triggered by issuing the system reset, the dump is normally entered by first invoking sr_slih. In the case of a system crash caused by a bad memory reference or instruction exception, v_exception is called, which then goes through p_slih to halt_display, or just to halt_display. halt_display then invokes the debugger if enabled, and then invokes the dump when the debugger returns.

Machine checks will start by calling either mc_flih, or, in version 5 and above, mc_fwnmi. They will go through halt_display though.

If the sysdumpstart command is used, dump is invoked from the dmpnow function.

The dump entry point is dmp_do. This then calls idmp_do after data areas have been set up and other processors have been stopped. It should be noted here that, if all processors can't be stopped, we continue dumping anyway after a few seconds. If you believe all processors may not be stopped, check the proc_state data area. This is a byte array, one byte per cpu. If the corresponding cpu's slot is zero upon entry to idmp_do, then that processor was not stopped. Note that the processor taking the dump is shown as having been stopped in the proc_state array, so all CPUs' bytes should be non-zero.
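
If you have copied the proc_state bytes out of kdb, the check described above amounts to looking for zero bytes. A minimal sketch follows, assuming the bytes have already been transcribed into a Python bytes object; the function name is hypothetical, and the meaning of specific non-zero values is not covered here.

```python
def unstopped_cpus(proc_state):
    """Return the CPU indices whose proc_state byte is zero (not stopped)."""
    return [cpu for cpu, state in enumerate(proc_state) if state == 0]


if __name__ == "__main__":
    # Hypothetical 4-CPU example: CPUs 1 and 3 were not stopped.
    print(unstopped_cpus(bytes([1, 0, 1, 0])))
```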

idmp_do first calls dump_op to initialize I/O handling, with r3 set to 2. It then goes into the main loop: for each area to be dumped, it calls callfunc to get the data to be dumped, and then wr_cdt to dump the data. Note that this varies for unlimited dump tables, such as proc and thread, where wr_cdtu is called instead of wr_cdt, and callfunc is called from wr_cdtu to get more data to dump. This stops when callfunc returns null. Normally, callfunc returns a pointer to a cdt structure; see /usr/include/sys/dump.h. Unlimited dump tables are supported in AIX version 5 and above.

When wr_cdt or wr_cdtu needs to do I/O, dump_op is called with a code of 3, DUMPWRITE, in r3. This function then calls either jwrite or ddcompress. If ddcompress is called to compress the data, it will also call jwrite to buffer the data.

When jwrite needs to flush the dump data buffer, it calls jwrite_io. jwrite_io then calls the dump device's I/O handler.

The dmp_complete function is called when the dump is finished.

It can happen that you get what appears to be a good dump (the return code is 0 and 0C0 appears in the panel display), but when you reboot and try to process it with kdb, or crash prior to version 5, you get something like:

/var/adm/ras/vmcore.4 mapped from @ 700000000000000 to @ 70000000567ae00
Preserving 1012186 bytes of symbol table
First symbol __mulh
Component Dump Table not found.
Kernel not included in this dump.
dump /var/adm/ras/vmcore.4 corrupted
make sure /unix refers to the running kernel

This can happen if you've recently changed the kernel, such as using the unix_kdb on version 4.3, or having booted with unix_64. The file /unix in the root directory must be the kernel you're using. It is normally linked to /usr/lib/boot/unix_xx, where xx is "up", "mp", or "64". Also, /usr/lib/boot/unix should be linked to the running kernel's file in AIX version 5.1.
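
One quick sanity check, sketched below, is to confirm that /unix and /usr/lib/boot/unix resolve to the same existing file. This does not prove that the file is the running kernel; it only catches dangling or mismatched links. The function name is made up for this example.

```python
import os


def links_agree(unix_path="/unix", boot_path="/usr/lib/boot/unix"):
    """True if both paths resolve to the same file and that file exists."""
    unix_target = os.path.realpath(unix_path)
    boot_target = os.path.realpath(boot_path)
    return unix_target == boot_target and os.path.isfile(unix_target)


if __name__ == "__main__":
    print(links_agree())
```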

It is possible, however, that the dump doesn't contain all the data necessary for a dump reader to process it.

The /usr/lib/ras/dmprtns/dmpfmt dump formatter can be used to show which components were dumped, how much was dumped, and the locations of the dump areas, i.e., it can summarize the dump. This is supported in AIX version 5 and above.

/usr/lib/ras/dmprtns/dmpfmt -s dumpfile >foo

The dmpfmt program with the -s option provides such a summary. It is best to write this to a file since there is usually a lot of information. The file shown in this case is "foo". The file will look something like:

Component dump table dmp_minimal at file offset 0x200
  type 32-bit with 5 entries.
  Data area bldtime, segment value 0x0, address 0x21bff8
  length 48, dumped 48
  file offsets:  data 0x27d, bit map 0x27c
  Data area vars, segment value 0x0, address 0xa68010
  length 24, dumped 24
  file offsets:  data 0x2ae, bit map 0x2ad
  Data area mst, segment value 0x1b23b, address 0x2ff3b400
  length 336, dumped 336
  file offsets:  data 0x2c7, bit map 0x2c6
  Data area stack, segment value 0x1b23b, address 0x2ff3b000
  length 16384, dumped 12288
  file offsets:  data 0x418, bit map 0x417
  Data area dbgtbls, segment value 0x0, address 0xa67970
  length 1024, dumped 1024
  file offsets:  data 0x3419, bit map 0x3418
  total data requested is 17816, dumped is 13720
Component dump table proc at file offset 0x3819
  type 32-bit Unlimited with 116 entries.
  Data area 0pv, segment value 0x4004, address 0xe2000000
  length 512, dumped 512
  file offsets:  data 0x386a, bit map 0x3869
[snip]
Component dump table thrd at file offset 0xf9543
  type 32-bit Unlimited with 217 entries.
  Data area 0tv, segment value 0x4004, address 0xea000000
[snip]
Component dump table bos at file offset 0x18dda3
  type 32-bit with 7 entries.
[snip]
Component dump table vmm at file offset 0x2d323f6
  type 32-bit VR with 20 entries.
  Data area hwpft, segment value 0x0, address 0x3000000(real)
[snip]
total dump data requested is 2804205734, dumped is 90572962

You should at least have the proc, thrd, bos, and vmm component dump tables. It is also possible that you have those, but one or more of them wasn't completely dumped. You can send the file "foo" into IBM support for analysis.
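
Before sending the summary file, you can scan it yourself for missing components and partially dumped areas. The following is a rough sketch that assumes the line layout shown in the sample dmpfmt -s output above; the function name is invented for this example.

```python
import re

REQUIRED_TABLES = ("proc", "thrd", "bos", "vmm")  # per the text above


def summarize_dmpfmt(text):
    """Return (tables seen, required tables missing, short data areas)."""
    tables, short = [], []
    table = area = None
    for line in text.splitlines():
        match = re.match(r"Component dump table (\S+) at file offset", line)
        if match:
            table = match.group(1)
            tables.append(table)
            continue
        match = re.match(r"\s*Data area (\S+),", line)
        if match:
            area = match.group(1)
            continue
        match = re.match(r"\s*length (\d+), dumped (\d+)", line)
        if match:
            length, dumped = int(match.group(1)), int(match.group(2))
            if dumped < length:
                short.append((table, area, length, dumped))
    missing = [name for name in REQUIRED_TABLES if name not in tables]
    return tables, missing, short


# Excerpt copied from the sample dmpfmt -s output in this document.
SAMPLE = """\
Component dump table dmp_minimal at file offset 0x200
  type 32-bit with 5 entries.
  Data area mst, segment value 0x1b23b, address 0x2ff3b400
  length 336, dumped 336
  file offsets:  data 0x2c7, bit map 0x2c6
  Data area stack, segment value 0x1b23b, address 0x2ff3b000
  length 16384, dumped 12288
  file offsets:  data 0x418, bit map 0x417
Component dump table proc at file offset 0x3819
  type 32-bit Unlimited with 116 entries.
  Data area 0pv, segment value 0x4004, address 0xe2000000
  length 512, dumped 512
"""

if __name__ == "__main__":
    print(summarize_dmpfmt(SAMPLE))
```

In the excerpt, the stack area of dmp_minimal shows fewer bytes dumped than requested, so it is reported as short.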

If you use dmpfmt -s and see a dump table at the end of the dump called dump_failures, then one or more of the component dump routines that provide data to the dump has failed. This generally happens when a dump routine attempts to access data incorrectly, or is not pinned in memory. The dump_failures dump table can report up to two dump routine failures. For each failing routine, the address of the routine is reported, and the error code of the failure. Note that when a dump routine fails, the dump is not terminated; rather the failure is reported in the table and the dump continues. It is therefore unlikely that, if you get an 0C5 in the panel display, there was a dump routine failure.


Dump debugging examples

Here are some examples of how to debug the dump in various circumstances.

  1. We couldn't enter anything on the terminals, nor could we telnet to the machine. We tried to take a dump with the ctrl-alt-numpad1 key sequence, and then with the reset button, and nothing happened. Nothing came up on the front panel display, or HSC, either.

    In this case, there is likely a hardware failure, or the kernel's data has been corrupted. You will first need to reboot the machine or partition. At this point, run diagnostics to see if it detects any hardware problems. One of the most important diagnostic functions is to analyze recent error log entries for signs of trouble.

    If this shows nothing, enable the low level debugger. Also enable MODS, the memory overlay detection system; see the AIX documentation, which describes how to enable MODS with the bosdebug command. You will probably just need the basic MODS features enabled; use bosdebug -M. Now reboot the system.

    Upon reboot, bring up the debugger from the tty with ctrl-\ (backslash) or ctrl-4. Set breakpoints at sr_slih, check_key, and dmp_do. Also, turn on dump_debug.

    kdb(0)> br .sr_slih
    kdb(0)> br .check_key
    kdb(0)> br .dmp_do
    KDB(0)> mw dump_debug
    dump_debug+000000:  00000000  = 1
    sys_resource_ptr+000000:  E0000000  = .
    KDB(0)> g
    

    Now, if the system hangs, first check to see if the debugger is already active. If it is, MODS has likely detected an error. The above-referenced section in the documentation will help you with debugging this problem. For the rest of this discussion though, we'll assume it's not active.

    Perform a system reset. If the debugger doesn't come up, there's likely some fundamental problem such as a hardware error. It is also possible some software has overwritten a kernel data area such as the TOC, the loader's table of contents. At this point you'll want to use some basic problem solving techniques, such as identifying any recent changes in environment or software. If you've added a new feature to the system that involved a kernel extension, be aware that such extensions have the ability to corrupt kernel memory. Perhaps you've attached a new hardware device involving both new hardware and a new device handler, which is a kernel extension. If possible, run without this new feature and see if the problem recurs.

    If, however, the debugger comes up when you hit the system reset, it will be at sr_slih. Use the g subcommand to continue execution. If you get to check_key, proceed, using g, and see if you get to dmp_do. If so, set a breakpoint at idmp_do, and see if you get there. In all likelihood you won't, because the front panel display value of 0C2 will have been displayed by the time you get to idmp_do. If you don't get to idmp_do, or a previous point, you have gone about as far as you can go without involving IBM support. Note, however, that having this information will greatly speed your problem's analysis.

  2. It never goes to 0C0. Upon reboot, sysdumpdev -L shows I have a dump size of 0.

    This problem is debugged much the same as for the above problem. In this case, however, you're likely getting to idmp_do. Assuming you do, watch the output from the debugger's terminal while the dump is happening, and note where it stops. This is as far as you need to go before contacting IBM support. IBM support will likely use more of the breakpoints discussed above in the section on dump routines.

  3. The dump size as shown by sysdumpdev -L is 35 MB.

    First, try kdb or crash on the dump; it might work. For now, we'll assume it didn't. If the dump was taken automatically, i.e., your system crashed, use the dmp_minimal command on the dump to format the initial dump table and get a summary of the crash. If your system dump was taken by forcing the dump, then you'll need to enable the kernel debugger, reboot, and enable the dump_debug feature. Before you reboot you may also wish to enable the MODS facility. This information will tell IBM support where to start looking.

  4. The kdb command, or crash prior to version 5, cannot process my dump. I know the dump matches the UNIX on my system.

    You can use the dmpfmt facility to get a listing of the data areas contained in the dump. Kdb requires a minimal set of dump areas to be present in the dump in order for it to correctly process the data. This set includes critical dump areas such as proc, thrd, bos, and vmm. Hence, you should at least see these components in the listing. Starting in the 5.3 release, you should also see the alloc dump areas in the listing; this only applies to the 64-bit kernel as the alloc component is only present in 64-bit dumps. For a complete analysis, this output of dmpfmt can be sent to IBM support.


Some things you should know about KDB

This document uses kdb, the kernel debugger. kdb is both the low level kernel debugger and the dump reader in AIX version 5. In version 4.3, kdb is available by booting with a kdb kernel.

The steps for enabling kdb are discussed in the subsections "Loading and Starting the KDB Kernel Debugger in AIX 4.3.3" or "Loading and Starting the KDB Kernel Debugger in AIX 5.1 and Subsequent Releases". kdb uses a tty attached to a native serial port. See "Using a Terminal with the KDB Kernel Debugger".

It is also important to link both /unix and /usr/lib/boot/unix to the kernel in use. This is so that the kdb command, or the crash command will be able to read the dumps. Other commands, such as /usr/lib/ras/check_unix, used by snap, use /unix to verify the dump and unix match. Before you create these links, make sure the file to which you are linking exists.

Example

If you've booted with unix_mp_kdb in AIX 4.3, then link /unix and /usr/lib/boot/unix to unix_mp_kdb as follows:

# ln -fs /usr/lib/boot/unix_mp_kdb /unix
# ln -fs /usr/lib/boot/unix_mp_kdb /usr/lib/boot/unix
NOTE
In AIX version 5 you don't need another UNIX file, because kdb is the only kernel debugger, so you shouldn't have to relink /unix in AIX 5.
Prior to version 5, the low level debugger, lldb, is the default debugger, see _kdb

kdb subcommands are entered like regular UNIX commands. They have the general form of:

subcmd [ switches ] [ parameters ]

Note that kdb never takes action until the enter key is pressed. Some of the more common commands are:

br address-or-symbol
Set a breakpoint. This causes execution to stop, and kdb to be entered, when the breakpoint is hit. You may then use kdb to examine registers and memory before continuing.
g
continues execution, "go". This is how you resume execution once kdb has gotten control.
d address-or-symbol
display memory.
mw address-or-symbol
modify a word of memory. This allows you to change words in memory, so please use with care! It displays the word at the address, and you may then enter a new value. It then allows you to change the next word. Enter a (.), period, to stop changing memory.
th
show the current thread area.
where
shows where we are and how we got here. This shows the call/return stack.

The output of kdb commands can be long, and often many commands are necessary to obtain the necessary information. If no log of the output is kept, it will scroll off the screen and be lost. It is therefore best to keep a log of kdb output. The easiest way to do this is to use another machine instead of a terminal. The other system must be using a tty emulator such as the cu AIX command, although note that it is not necessary to use an AIX machine at all. A PC with a terminal program should work fine. Many terminal emulators have a built-in logging facility. In AIX, use the script command before bringing up the terminal session. Thus you will have a log of your commands and their output. If you must use a regular terminal, you will need to write down the kdb output from your commands.

 
 
Historical Number
isg1pTechnote0827