Chinaunix首页 | 论坛 | 博客
  • 博客访问: 381485
  • 博文数量: 160
  • 博客积分: 10
  • 博客等级: 民兵
  • 技术积分: 250
  • 用 户 组: 普通用户
  • 注册时间: 2012-09-18 01:16
文章分类

全部博文(160)

文章存档

2016年(4)

2015年(13)

2014年(29)

2013年(114)

我的朋友

分类: LINUX

2016-05-14 16:51:01

2006年09.(很久了。)
 
结合自己的实践和网上的文章,介绍手工调试内核bug的通用方法。

1.步骤
1).Collect oops output, System.map, /proc/ksyms, vmlinux, /proc/modules
2).Use ksymoops XXXXXXXXXXXXXXXXXXXXXXXX(只对2.4内核)to interpret oops
   Instructions is /usr/src/linux/Documentation/oops-tracing.txt
   Ksymoops(8)XXXXXXXXXXXXXXX man page ()
   
2.简单分析
1)KsymoopsXXXXXXXXXXXXXXXXXXXXXX disassembles the code section
2)The EIP points to the failing instruction
3)The call trace section shows how you got there
   Caution: Noise on the stack?
3.找到出错代码
------------------------------

oops 例子
C Source Code

int test_read_proc(char *buf, char **start, off_t offset, int count, int *eof, void*data)
{ int *ptr; ptr=0; printk("%d\n",*ptr);
return 0;
}
写个内核模块程序test.c,使用上面的代码来在/proc/下创建文件。
读去文件的时候会调用这个函数,从而产生下面的oops:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
c2483069 <---EIP (Instruction Pointer or Program Counter)
*pde = 00000000
Oops: 0000
CPU:
0
EIP: 0010:[ipv6:__insmod_ipv6_O/lib/modules/2.4.10-4GB/kernel/net/ipv6
           ipv6+-472895383/96]
EFLAGS: 00010283
eax: db591f98 ebx: de2aeb60 ecx: de2aeb80 edx: c2483060
esi: 00000c00 edi: d41d0000 ebp: db591f5c esp: db591f4c
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 1986, stackpage=db591000)
Stack: c012ca65 000001f0 ffffffea 00000000 00001000 c014e878 d41d0000 db591f98
       00000000 00000c00 db591f94 00000000 de2aeb60 ffffffea 00000000 00001000
       deae6f40 00000000 00000000 00000000 c01324d6 de2aeb60 0804db50 00001000
Call Trace: [__alloc_pages+65/452] [proc_file_read+204/420] [sys_read+146/200]
[system_call+51/64]
Code:a1 00 00 00 00 50 68 10 31 48 c2 e8 67 38 c9 fd 31 c0 89 ec


使用XXXXXXXXXXXXXXXXXXXXXXXXXXXXX   ksysmoops来获取内河函数地址,(也可以读取/proc/kallsym或者/proc/ksyms文件察看export出来的函数地址)
Using defaults from ksymoops -t elf32-i386 -a i386

Code; 00000000 Before first symbol
00000000 <_EIP>
:
Code; 00000000 Before first symbol
0: a1 00 00 00 00 mov 0x0,%eax
Code; 00000004 Before first symbol
5: 50 push %eax
Code; 00000006 Before first symbol
6: 68 10 31 48 c2 push $0xc2483110
Code; 0000000a Before first symbol
b: e8 67 38 c9 fd call fdc93877
<_EIP+0xfdc93877> fdc93876
Code; 00000010 Before first symbol
10: 31 c0 xor %eax,%eax
Code; 00000012 Before first symbol
12: 89 ec mov %ebp,%esp
c2483060 test_read_proc [test]
c2483000 __insmod_test_O/home/ross/prog/test.o_M3 [test]
c2483110 __insmod_test_S.rodata_L68 [test]
c2483060 __insmod_test_S.text_L176 [test]
c2483080 foo [test]
de79c340 ip6_frag_mem [ipv6]
de783d00 addrconf_del_ifaddr [ipv6]
de78a5bc ipv6_packet_init [ipv6]
de78fd70 ipv6_sock_mc_drop [ipv6]
de781ee4 ip6_call_ra_chain [ipv6]

可以看出EIP的c2483069是在[test]模块的test_read_proc 函数中。
(EIP) - (Base addr of routine)
c2483069 - c2483060 = 9
下一步反汇编test.o, 找到偏移量为9行即位代码出错行。

找到出错代码,使用objdump:
Excerpt from "objdump -D test.o "
test.o:
file
format elf32-i386
Disassembly of section .text:
00000000 :
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 08 sub $0x8,%esp
6: 83 c4 f8 add $0xfffffff8,%esp
9: a1 00 00 00 00 mov 0x0,%eax
e: 50 push %eax
f: 68 00 00 00 00 push $0x0

C Source Code

int test_read_proc(char *buf, char **start, off_t offset, int count, int *eof, void*data)
{ int *ptr; ptr=0; printk("%d\n",*ptr);
return 0;
}
======================================================================以上不能用于2.6内核
其他:
 
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 
Dear all:
I am ruuning a kernel module..after some time the kernel hangs withthe following message:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c011f40e
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[] Tainted: P
EFLAGS: 00010086
eax: 00000000 ebx: c02a1ba4 ecx: c020e1ac edx: c02a1bac
esi: c02a1b54 edi: c02a1980 ebp: 00000001 esp: c0281f34
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0281000)
Stack: c785bbd8 c011f50d c02a1ba4 c0281f40 c0281f40 00000000 c0297560 fffffffe
00000046 c011bc02 c011bb14 00000000 00000001 c011b943 c0297560 c0295900
00000000 c0281f98 c02443c4 c0108520 c0105300 c0280000 c0280000 ffffe000
Call Trace: [] [] [] [] []
[] [] [] [] [] []

Code: 8b 10 89 5a 04 89 13 89 43 04 89 18 5b c3 8d 74 26 00 81 f9
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
<6>wlan2: SW TICK stuck? bits=0x0 EvStat=8001 IntEn=e018
wlan1: SW TICK stuck? bits=0x0 EvStat=8000 IntEn=e09f
wlan1: SW TICK stuck? bits=0x0 EvStat=8000 IntEn=e09f

----------------------------------------------------
Hi,
Seems u are trying to access NULL pointer.

It may be like storing some value into null pointer or accessing beyond allocated memory.

Check your memory code especially allocated one.
 
------------------------------------------------------

The best thing you can do is to take a look at linux-src/Documentation/oops-tracing.txt It explains how you can start debugging this.

The first things you can try is decoding the backtrace???. The system.map file??? should contain the addresses of all functions in your kernel. You should be able to produce a readable backtrace this way.

------------------------------------------------------

You should take a look at the system.map that is generated with your kernel. Thanks to this file you'll be enable to decode your oops backtrace easily.

----------------------------------------------------------
 
Make sure that ksymoops旧的 is enabled Or look backtrace values in the System.map Or get a JTAG
-------------------------------------------------------------------------------------------

Have a look at and download chapter 4 scroll down to page 94.

The easiest way would be to compile your kernel with debug info (CONFIG_DEBUG_INFO) and execute the kernel from gdb.

For an older kernel (2.4) you will need ksymoops.

------------------------------------------------------
印象中主要是指针超过了范围,或者是map了之后没有执行unmap造成的
 
---------------------------------------------------------------
Kernel Oops Howto

is still under development and chances are that you hit a bug sooner or later. When that happens you might see one of the infamous "kernel oops" messages. These messages are invaluable when trying to locate a bug, so it is worth to learn more on what an oops means and how to handle them.

Note: You can jump right to the section in case you already have a full oops message that needs to be prepared for submission as part of a bug report.

Kernel oops - what is that?

Kernel code, other than code that is running in userspace, is hard to debug. Even with paying a lot of attention on the code of a driver: sometimes bugs remain, and the system faults when the driver is executed. When this happens, it’s important to be able to collect as much information as possible to solve the problem. That is exactly the purpose of a "kernel oops".

Oops messages displays the processor status at the time of the fault, including the contents of the CPU registers and other seemingly incomprehensible information. Such a message is generated by printk statements in the fault handler (arch/*/kernel/traps.c). printk writes the output to a dedicated ring buffer in the kernel where it usually is read by klogd. Depending on the configuration klogd prints the messages to the console, writes it to a file or hands them over to syslogd. Refer to the man-pages klogd(8) and syslogd(8) to learn more.

Let’s look at one such message. Here’s what results from dereferencing a NULL pointer on a PC running Version 2.6 of the kernel. The most relevant information here is the instruction pointer (EIP), the address of the faulty instruction:

Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: d083a064 Oops: 0002 [#1] SMP CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.6) EIP is at faulty_write+0x4/0x10 [faulty] eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000 esi: cf8b2460 edi: cf8b2480 ebp: 00000005 esp: c31c5f74 ds: 007b es: 007b ss: 0068 Process bash (pid: 2086, threadinfo=c31c4000 task=cfa0a6c0) Stack: c0150558 cf8b2460 080e9408 00000005 cf8b2480 00000000 cf8b2460 cf8b2460 fffffff7 080e9408 c31c4000 c0150682 cf8b2460 080e9408 00000005 cf8b2480 00000000 00000001 00000005 c0103f8f 00000001 080e9408 00000005 00000005 Call Trace: [] vfs_write+0xb8/0x130 [] sys_write+0x42/0x70 [] syscall_call+0x7/0xb Code: 89 15 00 00 00 00 c3 90 8d 74 26 00 83 ec 0c b8 00 a6 83 d0

On kernel 2.4 the same code causes a slightly different oops message:

Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: c48370c3 *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[] EFLAGS: 00010286 eax: ffffffea ebx: c2281a20 ecx: c48370c0 edx: c2281a40 esi: 4000c000 edi: 4000c000 ebp: c38adf8c esp: c38adf8c ds: 0018 es: 0018 ss: 0018 Process ls (pid: 23171, stackpage=c38ad000) Stack: 0000010e c01356e6 c2281a20 4000c000 0000010e c2281a40 c38ac000 0000010e 4000c000 bffffc1c 00000000 00000000 c38adfc4 c010b860 00000001 4000c000 0000010e 0000010e 4000c000 bffffc1c 00000004 0000002b 0000002b 00000004 Call Trace: [] [] Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00 Preparation

Depending on your system configuration a few more or less simple steps are necessary to make sure that you can obtain oops messages. Please make sure that all requirements are met, otherwise it might happen that you miss relevant oopses.

Adjusting the console log level

As you know now, oops messages are generated by using the printk kernel function. printk allows the classification of messages according to their severity by associating different loglevels, or priorities, with the messages. There are eight possible priorities, defined in the header kernel.h>, which are usually refered to by corresponding macros. The following overview lists them in order of decreasing severity (with the numeric representation given in brackets):

Macro Name Numeric Value Description
KERN_EMERG 0 Used for emergency messages, usually those that precede a crash.
KERN_ALERT 1 A situation requiring immediate action.
KERN_CRIT 2 Critical conditions, often related to serious hardware or software failures.
KERN_ERR 3 Used to report error conditions; device drivers often use KERN_ERR to report hardware difficulties.
KERN_WARNING 4 Warnings about problematic situations that do not, in themselves, create serious problems with the system.
KERN_NOTICE 5 Situations that are normal, but still worthy of note. A number of security-related conditions are reported at this level.
KERN_INFO 6 Informational messages. Many drivers print information about the hardware they find at startup time at this level.
KERN_DEBUG 7 Used for debugging messages.

printk sends the messages into a ring buffer, where it can be read by tools such as klogd (). Based on the loglevel, the kernel additionally may print the message to the current console, be it a text-mode terminal, a serial port, or a parallel printer. If the priority is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent unless a trailing newline is provided). If both klogd and syslogd are running on the system, kernel messages are appended to /var/log/messages (or otherwise treated depending on your syslogd configuration), independent of console_loglevel. If klogd is not running, the message won’t reach user space unless you read /proc/kmsg (which is often most easily done with the dmesg command). When using klogd, you should remember that it doesn’t save consecutive identical lines; it only saves the first such line and, at a later time, the number of repetitions it received.

The variable console_loglevel is initialized to DEFAULT_CONSOLE_LOGLEVEL and can be modified through the sys_syslog system call. One way to change it is by specifying the –c switch when invoking klogd, as specified in the klogd manpage. Note that to change the current value, you must first kill klogd and then restart it with the –c option. Alternatively, you can write a program to change the console loglevel. You’ll find a version of such a program in misc-progs/setlevel.c in the source files provided on . The new level is specified as an integer value between 1 and 8, inclusive. If it is set to 1, only messages of level 0 (KERN_EMERG) reach the console; if it is set to 8, all messages, including debugging ones, are displayed.

It is also possible to read and modify the console loglevel using the text file /proc/sys/kernel/printk. The file hosts four integer values: the current loglevel, the default level for messages that lack an explicit loglevel, the minimum allowed loglevel, and the boot-time default loglevel. Writing a single value to this file changes the current loglevel to that value; thus, for example, you can cause all kernel messages to appear at the console by simply entering:

# echo 8 > /proc/sys/kernel/printk

You should use at least a value of 4, so that every message with a priority of KERN_ERR and above will be printed to the console.

Selecting the console terminal

Linux allows for some flexibility in console logging policies by letting you send messages to a specific virtual console (if your console lives on the text screen). By default, the “console” is the current virtual terminal. To select a different virtual terminal to receive messages, you can issue ioctl(TIOCLINUX) on any console device. The program setconsole makes this an easy task. You'll find the program in misc-progs/setconsole.c in the source files provided on . You should invoke it (as superuser) with a single argument specifying the number of the console that is to receive messages.

Setting up logging facilities

It's always a good idea to configure your system that important messages from your system are logged somewhere, allowing you to review them in a regular manner. This is even more important if you expect kernel oopses to occur during debugging a problem related to a driver such as . Usually the is used for this task. Virtually all distributions bring that suite with them. It comes with two daemons: klogd and syslogd.

klogd reads messages from the kernel, prints them to the console, writes them to a file and/or sends them to syslogd. More important: klogd is able to decode kernel oops messages ().

syslogd accepts messages from klogd, processes running in userspace or other syslogd instances running on remote hosts. Depending on their source and priority these messages can be written to log files, pipes, the console or sent via network to other loghosts.

A complete description of the configuration options of these two daemons would be beyond the scope of this page. Please refer to the excellent man pages klogd(8), syslogd(8) and syslog.conf(5) for detailed instructions - we will just mention some important things here.

klogd configuration hints

For automatic symbol decoding klogd needs to access the System.map file that has been created when your kernel was built. It looks for this file at /boot/System.map, /System.map and, on most systems, at /boot/System.map-<kernelversion>. If the file is not available at these locations you should specify it by using via the parameter -k ation>

The System.map does not include symbols out of loadable kernel modules, klogd must determine them by looking at the list of loaded modules. This is done once when klogd is started. If other modules are loaded at a later time, klogd isn't up-to-date and needs to be told to refresh its symbol list via klogd -i - don't forget this step!

You might find references to the parameter -p (for paranoid) which makes klogd reloading the module symbols whenever an oops occurs. Don't use this parameter. It makes klogd query the kernel for information after the problem has occurred. Information obtained after an error could be plain wrong.

syslogd configuration hints

Check your local syslogd configuration (usually /etc/syslog.conf), it should contain a line like this:

kern.* /var/log/kernel

This takes care that all messages sent by the kernel are written to the file /var/log/kernel - you can choose another location and/or filename, of course.

If your system crashes as a result of an oops and you don't find the corresponding oops message in your logfiles you should check /etc/syslogd.conf. If you see a line like:

kern.* -/var/log/kernel

you should remove the leading - (minus) sign from the name of the logfile. It tells syslogd to omit syncing the file after every logged line. While this is good for performance (especially when a lot of logging takes place), but it might happen that you loose part of the logged information if the system crashes right after a write attempt.

If you want to use a loghost in your network for logging the oops messages don't forget that in this case syslogd shouldn't try to send the messages via a -driven interface. It's unlikely that any syslog packets can be sent out over your Atheros WLAN card(s) after the driver oopsed. Use a wired network connection to the loghost instead.

xconsole configuration hints

Kernel oops messages are reported to the text consoles only. If you're working in an X11-based graphical UI (Gnome or KDE, for example) you won't see them unless you switch to a text console (by pressing CTRL + ALT + F1) before you try anything that might triggering the kernel oops. Or, as an alternative, you make use of xconsole - a small tool which shows syslog messages in a GUI window and comes with any major Linux distribution.

First create the named pipe /dev/xconsole if it doesn't already exist:

# mknod -m 644 /dev/xconsole p

Then you need to make sure that syslogd sends messages to this named pipe. The following line should be added to /etc/syslog.conf so that xconsole sees the relevant kernel messages:

kern.err |/dev/xconsole

Now start xconsole like this:

# xconsole -file /dev/xconsole

Et voila, every kernel message with a priority of KERN_ERR or higher will be shown in the xconsole window.

Last but not least

When working on a text console it might happen that part of the oops message is scrolled off the screen. As a workaround you can tell the text console to work with a higher resolution via the kernel parameter vga (for example vga=791). Refer to the files Documentation/kernel-parameters.txt and Documentation/svga.txt in the kernel source for further information.

Use the kernel module in order to test if your configuration is correct and you'll see kernel oopses in your logs or the console. After compiling the source against your running kernel (as described in the README) and loading the resulting module, you can cause a kernel oops by typing:

# cat /proc/oops

Check your configuration if you don't see the oops.

Obtaining oops messages

In a perfect world your system wouldn't be destabilized when an oops occurs. As long as you've configured your system correctly the complete oops message then will be logged into a file. But then again, in a perfect world there wouldn't be things like bugs that causes oops messages... so chances are that you won't get away with a simple cut'n'paste.

In such a case you have several other options:

  • hand copy the text from the screen and type it down when the machine has restarted
  • take a screenshot with a digital camera; this isn't as nice as having the message in a text file, but it is sufficient unless you need to decode the oops message with a tool like ksymoops ([ see below)
  • use one of the several crash dump patches that are out there, which save the oops messages (and probably other useful information) to disk, video memory, etc; looks good, but you may search for kmsgdump or oops+smram for alternatives
  • redirect console output to a serial port or a network device ()

Whatever method you choose, make sure you get the complete kernel oops message. It is preferred that you grab more lines than necessary, rather than risking to accidentally leave out important lines. Tripplecheck the result if you write the oops down manually and get rid of every typo - otherwise your cumbersome work would be useless!

Redirecting console output

In many cases it can be helpful to redirect the console output of the machine that oopses. You might be working with an embedded, headless system (where no no VGA and keyboard is available). Or you might want to redirect the console to the serial port to automatically log the oops message rather than transcribing it manually from the screen of the oopsing machine.

Serial Console

If the oops'ing host has a serial port, this can be used for logging purposes. All you need is a nullmodem cable, a second computer that preferrably runs Linux (although almost all other OSes should work, too) and a terminal program for this second computer.

Let's start with the preparation of the oops'ing host. The kernel on this box needs to have support for serial ports compiled in (read: not build as module). Support for having a console on a serial port needs to be enabled: for 2.4.x that's CONFIG_SERIAL_CONSOLE, not sure about 2.6.x.

Output of console messages is enabled by using the kernel parameter console=.... You can (and usually should) specify multiple consoles with this parameter. Each of them will be used for printing kernel messages, and only the last specified device will be used when /dev/console is opened - that might be important for some applications. You should be safe with:

console=ttyS0,38400 console=tty0

Let's have a look at the definition of the serial console (first parameter). ttyS0 is the name of the first serial port (change that to ttyS1 for the second serial port, ttyS2 for the third, and so on). 38400 is the speed (in baud) that should be used on the port, and may be set to any valid RS232 speed (up to a maximum of 115200).

Now connect the two computers with the nullmodem cable and start the terminal program on the second computer. Configure it to use the right port settings (38400 baud, 8 bits, no parity bit). That's it. If you did everything right you should see logged kernel messages on the screen of the second computer now - test that with the oopstest module as described .

In case of any trouble you should have a look at the excellent , written by Glen Turner. It covers every detail of this topic and even explains the steps that are necessary to offer a full login on the serial port. Documentation/serial-console.txt in the kernel source is another valuable resource.

Network Console

The network console has been introduced with the 2.6 series kernels, and is available as for recent 2.4 kernels. The network console (netconsole) allows logging kernel printk messages over UDP and supports debugging in cases where disk logging fails and serial consoles are not available or impractible.

Configuration and usage of netconsole is described in Documentation/networking/netconsole.txt, please refer to that file for details.

For obvious reasons you should be aware that sending the netconsole packets out via a -driven card isn't a good idea when trying to catch kernel oopses caused by . It's suggested to make use of a wired network interface for this purpose instead.

Tools for oops decoding

The main problem with users dealing with oops messages is in the little intrinsic meaning carried by hexadecimal values; to be meaningful to the programmer they need to be resolved to symbols. While kernel 2.6 already takes care of that (as long as CONFIG_KALLSYMS is enabled in the kernel configuration) additional tools are needed for this task on systems running on kernel 2.4.

Two common utilities for this purpose are klogd and ksymoops. The former tool performs symbol decoding by itself whenever it is running; the latter needs to be purposely invoked by the user. The following sections will explain how they are used. Remember that you most probably won't need them if you're running a properly configured kernel 2.6.

klogd

The klogd daemon, which is part of the , can decode oops messages before they reach the log files. In many situations, klogd can provide all the information a developer needs to track down a problem, though sometimes the developer must give it a little help.

A dump of the oops that has been shown above, as it reaches the system log, looks like this (note the decoded symbols on the EIP line and in the stack trace):

Unable to handle kernel NULL pointer dereference at virtual address \ 00000000 printing eip: c48370c3 *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[faulty:faulty_write+3/576] EFLAGS: 00010286 eax: ffffffea ebx: c2c55ae0 ecx: c48370c0 edx: c2c55b00 esi: 0804d038 edi: 0804d038 ebp: c2337f8c esp: c2337f8c ds: 0018 es: 0018 ss: 0018 Process cat (pid: 23413, stackpage=c2337000) Stack: 00000001 c01356e6 c2c55ae0 0804d038 00000001 c2c55b00 c2336000 00000001 0804d038 bffffbd4 00000000 00000000 bffffbd4 c010b860 00000001 0804d038 00000001 00000001 0804d038 bffffbd4 00000004 0000002b 0000002b 00000004 Call Trace: [sys_write+214/256] [system_call+52/56] Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00

klogd provides most of the necessary information to track down the problem. In this case we

see that the instruction pointer (EIP) was executing in the function faulty_write, so we know where to start looking. The 3/576 string tells us that the processor was at byte 3 of a function that appears to be 576 bytes long. Note that the values are decimal, not hex.

The developer must exercise some care, however, to get useful information for errors that occur within loadable modules. klogd loads all of the available symbol information when it starts, and uses those symbols thereafter. If you load a module after klogd has initialized itself (usually at system boot), klogd will not have your module's symbol information. To force klogd to go out and get that information, send the klogd process a SIGUSR1 signal after your module has been loaded (or reloaded), and before you do anything that could cause it to oops.

It is also possible to run klogd with the -p ("paranoid") option, which will cause it to reread symbol information anytime it sees an oops message. The klogd manpage recommends against this mode of operation, however, since it makes klogd query the kernel for information after the problem has occurred. Information obtained after an error could be plain wrong.

For klogd to work properly, it must have a current copy of the System.map symbol table file. Normally this file is found in /boot; if you have built and installed a kernel from a nonstandard location you may have to copy System.map into /boot, or tell klogd to look elsewhere. klogd refuses to decode symbols if the symbol table doesn't match the current kernel. If a symbol is decoded on the system log, you can be reasonably sure it is decoded correctly.

ksymoops

At times klogd may not be enough for your tracing purposes. Usually, you need to get both the hexadecimal address and the associated symbol, and you often need offsets printed as hex numbers. You may need more information than address decoding. Also, it is common for klogd to get killed during the fault. In such situations, a stronger oops analyzer may be called for; ksymoops is such a tool. In case your distribution doesn't provide this tool you can it from kernel.org.

To operate at its best, ksymoops needs a lot of information in addition to the error message; you can use command-line options to tell it where to find the various items. The program needs the following items:

A System.map file This map must correspond to the kernel that was running at the time the oops occurred. The default is /usr/src/linux/System.map. A list of modules ksymoops needs to know what modules were loaded when the oops occurred, in order to extract symbolic information from them. If you do not supply this list, ksymoops will look at /proc/modules. A list of kernel symbols defined when the oops occurred The default is to get this list from /proc/ksyms. A copy of the kernel image that was running Note that ksymoops needs a straight kernel image, not the compressed version (vmlinuz, zImage, or bzImage) that most systems boot. The default is to use no kernel image because most people don't keep it. If you have the exact image handy, you should tell the program where it is by using the -v option. The locations of the object files for any kernel modules that were loaded ksymoops will look in the standard directories for modules, but during development you will almost certainly have to tell it where your module lives using the -o option

Although ksymoops will go to files in /proc for some of its needed information, the results can be unreliable. The system, of course, will almost certainly have been rebooted between the time the oops occurs and when ksymoops is run, and the information from /proc may not match the state of affairs when the failure occurred. When possible, it is better to save copies of /proc/modules and /proc/ksyms prior to causing the oops to happen.

Please refer to the manpage (ksymoops(8)), it is a very informative document.

The last argument on the tool's command line is the location of the oops message; if it is missing, the tool will read stdin in the best Unix tradition. The message can be recovered from the system logs with luck; in the case of a very bad crash you may end up writing it down off the screen and typing it back in (unless you were using a serial console, a nice tool for kernel developers).

Note that ksymoops will be confused by an oops message that has already been processed by klogd. If you are running klogd, and your system is still running after an oops occurs, a clean oops message can often be obtained by invoking the dmesg command.

If you do not provide all of the listed information explicitly, ksymoops will issue warnings. It will also issue warnings about things like loaded modules that define no symbols. A warning-free run of ksymoops is rare.

Output from ksymoops tends to look like the following:

>>EIP; c48370c3 <[faulty]faulty_write+3/20> <===== Trace; c01356e6 Trace; c010b860 Code; c48370c3 <[faulty]faulty_write+3/20> 00000000 <_EIP>: Code; c48370c3 <[faulty]faulty_write+3/20> <===== 0: c7 05 00 00 00 movl $0x0,0x0 <===== Code; c48370c8 <[faulty]faulty_write+8/20> 5: 00 00 00 00 00 Code; c48370cd <[faulty]faulty_write+d/20> a: 31 c0 xorl %eax,%eax Code; c48370cf <[faulty]faulty_write+f/20> c: 89 ec movl %ebp,%esp Code; c48370d1 <[faulty]faulty_write+11/20> e: 5d popl %ebp Code; c48370d2 <[faulty]faulty_write+12/20> f: c3 ret Code; c48370d3 <[faulty]faulty_write+13/20> 10: 8d b6 00 00 00 leal 0x0(%esi),%esi Code; c48370d8 <[faulty]faulty_write+18/20> 15: 00

As you can see, ksymoops provides EIP and kernel stack information much like klogd does, but more precisely and in hexadecimal. You'll note that the faulty_write function is correctly reported to be 0x20 bytes long. This is because ksymoops reads the object file of your module and extracts all available information.

In this case, moreover, you also get an assembly language dump of the code where the fault occurred. This information can often be used to figure out exactly what was happening; here it's clearly an instruction that writes a 0 to address 0.

One interesting feature of ksymoops is that it is ported to nearly all the platforms where Linux runs and exploits the bfd (binary format description) library in order to support several computer architectures at the same time. This allows to trace bugs on other platforms as x86 as well (but we won't go into that here - refer to the LDD for further information).

gdb

On kernel 2.6, oopses produce the following message:

EIP is at ath_beacon_generate+0x4ad/0x58f

You can easily locate the appropriate source line by doing the following:

  • edit Makefile.inc and add "-g" to COPTS
  • recompile the driver using the same environment that produce the kernel oops (gcc, kernel)
  • run gdb like gdb ath/ath_pci.ko (gdb) info line *(ath_beacon_generate+0x4ad) Line 4465 of "/home/benoit/src/madwifi/madwifi-dfs/ath/if_ath.c" starts at address 0x5113 <ath_beacon_generate+1195> and ends at 0x511a <ath_beacon_generate+1202>.
Resources About kernel oopses and kernel debugging
  • There was a longer discussion on the Linux Kernel Mailing List (LKML) about decoding and understanding oops messages
  • Documentation/oops-tracing.txt which comes along with the source of the Linux kernel
  • Linux Device Drivers, Chapter 4: Debugging Techniques
    2nd edition (refering on kernel 2.4) in and
    3rd edition (concentrating on kernel 2.6) in english (and PDF only)
  • the chapter of the
  • the chapter of the , which explains the meaning of the System.map file and how it correlates to the oops messages
  • explains how to set up klogd and syslogd to support you in troubleshooting
  • with a draft document, explaining how to find out what part of a driver caused the oops
  • - a set of slides from a talk
  • - introduction in how to locate offending lines in order to fix bugs that cause kernel oopses
  • of how to work with ksymoops
Console output redirection
  • Documentation/networking/netconsole.txt describes how to make use of the network console; file comes along with the source of Linux kernel 2.6
  • offers patches for a backport of the network console to some recent 2.4 series kernels
  • Documentation/serial-console.txt describes the steps necessary to set up a serial console; file comes along with the source of Linux kernel (2.4 and 2.6 series)
  • tutorial at howtoforge.com
Acknowledgement

The text on this page has been compiled by gathering related documents, filtering out relevant parts and streamlining the result a bit. The biggest chunk of text has been taken from the book (2nd and 3rd edition), written by Alessandro Rubini, Jonathan Corbet, Greg Kroah-Hartman and published by . Thanks to O'Reilly and the authors for publishing the work as !

In particular, chapter 4 of that book has been a great help, and a lot of text has been used unchanged for this page.

Another thanks also goes to the authors of the documentation that has been referenced at various places above.

Attachments
  • (8.4 kB) -Kernel module source for kernel series 2.4 and 2.6; allows to check kernel oops output/logging without destabilizing the system., added by mrenzmann on 01/30/06 11:45:09.
 
 
 
 
 
 
 
阅读(32467) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~