The best thing you can do is to take a look at linux-src/Documentation/oops-tracing.txt. It explains how you can start debugging this.
The first thing you can try is decoding the backtrace. The System.map file should contain the addresses of all functions in your kernel, so you should be able to produce a readable backtrace this way.
You should take a look at the System.map that is generated with your kernel. Thanks to this file you'll be able to decode your oops backtrace easily.
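As a rough illustration of doing that lookup by hand - the address below is only a placeholder for the EIP printed in your oops, and symbols of loadable modules will not show up in System.map (see the klogd notes further down):

  $ grep c01234 /boot/System.map-$(uname -r)

The symbol with the largest address that is still below the faulting EIP is the function the oops occurred in.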
Have a look at and download chapter 4 of Linux Device Drivers; scroll down to page 94.
The easiest way would be to compile your kernel with debug info (CONFIG_DEBUG_INFO) and examine the kernel image with gdb.
For an older kernel (2.4) you will need ksymoops.
The driver is still under development, and chances are that you will hit a bug sooner or later. When that happens you might see one of the infamous "kernel oops" messages. These messages are invaluable when trying to locate a bug, so it is worth learning what an oops means and how to handle it.
Note: You can jump right to the decoding section below in case you already have a full oops message that needs to be prepared for submission as part of a bug report.
Kernel oops - what is that?

Kernel code, unlike code that runs in userspace, is hard to debug. Even when a lot of attention is paid to the code of a driver, bugs sometimes remain, and the system faults when the driver is executed. When this happens, it's important to be able to collect as much information as possible to solve the problem. That is exactly the purpose of a "kernel oops".
An oops message displays the processor status at the time of the fault, including the contents of the CPU registers and other seemingly incomprehensible information. Such a message is generated by printk statements in the fault handler (arch/*/kernel/traps.c). printk writes the output to a dedicated ring buffer in the kernel, where it is usually read by klogd. Depending on the configuration, klogd prints the messages to the console, writes them to a file, or hands them over to syslogd. Refer to the man pages klogd(8) and syslogd(8) to learn more.
Let’s look at one such message. Here’s what results from dereferencing a NULL pointer on a PC running Version 2.6 of the kernel. The most relevant information here is the instruction pointer (EIP), the address of the faulty instruction:
  Unable to handle kernel NULL pointer dereference at virtual address 00000000
   printing eip:
  d083a064
  Oops: 0002 [#1]
  SMP
  CPU:    0
  EIP:    0060:[

On kernel 2.4 the same code causes a slightly different oops message:
  Unable to handle kernel NULL pointer dereference at virtual address 00000000
   printing eip:
  c48370c3
  *pde = 00000000
  Oops: 0002
  CPU:    0
  EIP:    0010:[

Depending on your system configuration, a few more or less simple steps are necessary to make sure that you can obtain oops messages. Please make sure that all requirements are met, otherwise you might miss relevant oopses.
Adjusting the console log level

As you know now, oops messages are generated by using the printk kernel function. printk allows the classification of messages according to their severity by associating different loglevels, or priorities, with the messages. There are eight possible priorities, defined in the header <linux/kernel.h>:
Macro Name | Numeric Value | Description |
KERN_EMERG | 0 | Used for emergency messages, usually those that precede a crash. |
KERN_ALERT | 1 | A situation requiring immediate action. |
KERN_CRIT | 2 | Critical conditions, often related to serious hardware or software failures. |
KERN_ERR | 3 | Used to report error conditions; device drivers often use KERN_ERR to report hardware difficulties. |
KERN_WARNING | 4 | Warnings about problematic situations that do not, in themselves, create serious problems with the system. |
KERN_NOTICE | 5 | Situations that are normal, but still worthy of note. A number of security-related conditions are reported at this level. |
KERN_INFO | 6 | Informational messages. Many drivers print information about the hardware they find at startup time at this level. |
KERN_DEBUG | 7 | Used for debugging messages. |
printk sends the messages into a ring buffer, where they can be read by tools such as klogd. Based on the loglevel, the kernel additionally may print the message to the current console, be it a text-mode terminal, a serial port, or a parallel printer. If the priority is less than the integer variable console_loglevel, the message is delivered to the console one line at a time (nothing is sent unless a trailing newline is provided). If both klogd and syslogd are running on the system, kernel messages are appended to /var/log/messages (or otherwise treated depending on your syslogd configuration), independent of console_loglevel. If klogd is not running, the messages won't reach user space unless you read /proc/kmsg (which is often most easily done with the dmesg command). When using klogd, you should remember that it doesn't save consecutive identical lines; it only saves the first such line and, at a later time, the number of repetitions it received.
The variable console_loglevel is initialized to DEFAULT_CONSOLE_LOGLEVEL and can be modified through the sys_syslog system call. One way to change it is by specifying the -c switch when invoking klogd, as described in the klogd manpage. Note that to change the current value, you must first kill klogd and then restart it with the -c option. Alternatively, you can write a program to change the console loglevel. You'll find a version of such a program in misc-progs/setlevel.c in the example source files accompanying the book. The new level is specified as an integer value between 1 and 8, inclusive. If it is set to 1, only messages of level 0 (KERN_EMERG) reach the console; if it is set to 8, all messages, including debugging ones, are displayed.
It is also possible to read and modify the console loglevel using the text file /proc/sys/kernel/printk. The file hosts four integer values: the current loglevel, the default level for messages that lack an explicit loglevel, the minimum allowed loglevel, and the boot-time default loglevel. Writing a single value to this file changes the current loglevel to that value; thus, for example, you can cause all kernel messages to appear at the console by simply entering:
  # echo 8 > /proc/sys/kernel/printk

You should use a value of at least 4, so that every message with a priority of KERN_ERR or higher will be printed to the console.
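For a quick check from the shell you can also read the four values back and, as an alternative to writing the file directly, set the console loglevel with dmesg (the numbers shown below are only an illustration):

  # cat /proc/sys/kernel/printk
  4       4       1       7
  # dmesg -n 8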
Selecting the console terminal

Linux allows for some flexibility in console logging policies by letting you send messages to a specific virtual console (if your console lives on the text screen). By default, the "console" is the current virtual terminal. To select a different virtual terminal to receive messages, you can issue ioctl(TIOCLINUX) on any console device. The setconsole program makes this an easy task; you'll find it in misc-progs/setconsole.c in the example source files accompanying the book. You should invoke it (as superuser) with a single argument specifying the number of the console that is to receive messages.
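For example, assuming you have compiled setconsole from those example sources, sending kernel messages to virtual console 2 would look like this:

  # ./setconsole 2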
Setting up logging facilities

It's always a good idea to configure your system so that important messages are logged somewhere, allowing you to review them regularly. This is even more important if you expect kernel oopses to occur while you are debugging a driver problem. Usually the sysklogd package is used for this task; virtually all distributions ship it. It comes with two daemons: klogd and syslogd.
klogd reads messages from the kernel, prints them to the console, writes them to a file and/or sends them to syslogd. More importantly: klogd is able to decode kernel oops messages (see the klogd section below).
syslogd accepts messages from klogd, from processes running in userspace, and from other syslogd instances running on remote hosts. Depending on their source and priority, these messages can be written to log files, pipes, or the console, or sent over the network to other loghosts.
A complete description of the configuration options of these two daemons would be beyond the scope of this page. Please refer to the excellent man pages klogd(8), syslogd(8) and syslog.conf(5) for detailed instructions - we will just mention some important things here.
klogd configuration hints

For automatic symbol decoding, klogd needs to access the System.map file that was created when your kernel was built. It looks for this file at /boot/System.map, /System.map and, on most systems, at /boot/System.map-<kernelversion>. If the file is not available at these locations, you should specify its location via the -k parameter.
The System.map file does not include symbols from loadable kernel modules; klogd must determine those by looking at the list of loaded modules. This is done once when klogd is started. If other modules are loaded at a later time, klogd is no longer up to date and needs to be told to refresh its symbol list via klogd -i - don't forget this step!
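Put together, both hints might look like the two commands below: the first points klogd at a non-standard System.map (the path is only an example), the second tells an already running klogd to reload its module symbol list.

  # klogd -k /usr/src/linux/System.map
  # klogd -i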
You might find references to the parameter -p (for paranoid), which makes klogd reload the module symbols whenever an oops occurs. Don't use this parameter: it makes klogd query the kernel for information after the problem has occurred, and information obtained after an error could be plain wrong.
syslogd configuration hints

Check your local syslogd configuration (usually /etc/syslog.conf); it should contain a line like this:
  kern.*          /var/log/kernel

This ensures that all messages sent by the kernel are written to the file /var/log/kernel - you can of course choose another location and/or filename.
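After changing the configuration, syslogd has to re-read it. One common way to do that (assuming the usual pid file location) is:

  # kill -HUP $(cat /var/run/syslogd.pid)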
If your system crashes as a result of an oops and you don't find the corresponding oops message in your logfiles, you should check /etc/syslog.conf. If you see a line like:
  kern.*          -/var/log/kernel

you should remove the leading - (minus) sign from the name of the logfile. The minus sign tells syslogd to omit syncing the file after every logged line. While this is good for performance (especially when a lot of logging takes place), you might lose part of the logged information if the system crashes right after a write attempt.
If you want to use a loghost in your network for logging the oops messages, don't forget that syslogd shouldn't try to send the messages via an interface driven by the very driver you are debugging. It's unlikely that any syslog packets can be sent out over your Atheros WLAN card(s) after the driver oopsed. Use a wired network connection to the loghost instead.
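The corresponding syslog.conf entry could look like the line below; "loghost" is a placeholder for the name or IP address of your logging machine, and the @ prefix makes syslogd forward the messages via UDP:

  kern.*          @loghost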
xconsole configuration hints

Kernel oops messages are reported to the text consoles only. If you're working in an X11-based graphical UI (Gnome or KDE, for example), you won't see them unless you switch to a text console (by pressing CTRL + ALT + F1) before you try anything that might trigger the kernel oops. Alternatively, you can make use of xconsole - a small tool that shows syslog messages in a GUI window and comes with every major Linux distribution.
First create the named pipe /dev/xconsole if it doesn't already exist:
  # mknod -m 644 /dev/xconsole p

Then you need to make sure that syslogd sends messages to this named pipe. The following line should be added to /etc/syslog.conf so that xconsole sees the relevant kernel messages:
  kern.err        |/dev/xconsole

Now start xconsole like this:
  # xconsole -file /dev/xconsole

Et voilà, every kernel message with a priority of KERN_ERR or higher will be shown in the xconsole window.
Last but not least

When working on a text console it might happen that part of the oops message is scrolled off the screen. As a workaround you can tell the text console to work with a higher resolution via the kernel parameter vga (for example vga=791). Refer to the files Documentation/kernel-parameters.txt and Documentation/svga.txt in the kernel source for further information.
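With GRUB legacy, for instance, the parameter is simply appended to the kernel line of your boot entry; the kernel version and root device below are only illustrative:

  kernel /boot/vmlinuz-2.6.18 ro root=/dev/sda1 vga=791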
Use the oopstest kernel module to check whether your configuration is correct and whether kernel oopses really show up in your logs or on the console. After compiling the source against your running kernel (as described in the README) and loading the resulting module, you can cause a kernel oops by typing:
  # cat /proc/oops

Check your configuration if you don't see the oops.
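The build-and-load step might look roughly like this, assuming the module source builds to oopstest.ko against your running kernel:

  # make
  # insmod ./oopstest.ko
  # cat /proc/oops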
Obtaining oops messages

In a perfect world your system wouldn't be destabilized when an oops occurs. As long as you've configured your system correctly, the complete oops message will then be logged into a file. But then again, in a perfect world there wouldn't be things like bugs that cause oops messages... so chances are that you won't get away with a simple cut'n'paste.
In such a case you have several other options, for example redirecting the console output to a serial or network console (see below) or transcribing the message from the screen by hand.
Whatever method you choose, make sure you get the complete kernel oops message. It is better to grab more lines than necessary than to risk accidentally leaving out important lines. Triple-check the result if you write the oops down manually and get rid of every typo - otherwise your cumbersome work would be useless!
Redirecting console output

In many cases it can be helpful to redirect the console output of the machine that oopses. You might be working with an embedded, headless system (where no VGA output or keyboard is available). Or you might want to redirect the console to the serial port to automatically log the oops message rather than transcribing it manually from the screen of the oopsing machine.
Serial Console

If the oops'ing host has a serial port, it can be used for logging purposes. All you need is a nullmodem cable, a second computer that preferably runs Linux (although almost any other OS should work, too) and a terminal program for this second computer.
Let's start with the preparation of the oops'ing host. The kernel on this box needs to have support for serial ports compiled in (read: not built as a module). Support for having a console on a serial port needs to be enabled as well: on 2.4.x that's CONFIG_SERIAL_CONSOLE; on 2.6.x the corresponding option for standard UARTs is CONFIG_SERIAL_8250_CONSOLE.
Output of console messages is enabled by using the kernel parameter console=.... You can (and usually should) specify multiple consoles with this parameter. Each of them will be used for printing kernel messages, and only the last specified device will be used when /dev/console is opened - that might be important for some applications. You should be safe with:
  console=ttyS0,38400 console=tty0

Let's have a look at the definition of the serial console (first parameter). ttyS0 is the name of the first serial port (change that to ttyS1 for the second serial port, ttyS2 for the third, and so on). 38400 is the speed (in baud) that should be used on the port, and may be set to any valid RS232 speed (up to a maximum of 115200).
Now connect the two computers with the nullmodem cable and start the terminal program on the second computer. Configure it to use the right port settings (38400 baud, 8 bits, no parity bit). That's it. If you did everything right you should now see logged kernel messages on the screen of the second computer - test that with the oopstest module as described above.
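If you don't have a dedicated terminal program at hand, GNU screen attached to the serial port of the second computer works as well (assuming the cable is connected to its first serial port):

  $ screen /dev/ttyS0 38400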
In case of any trouble you should have a look at the excellent Remote Serial Console HOWTO, written by Glen Turner. It covers every detail of this topic and even explains the steps that are necessary to offer a full login on the serial port. Documentation/serial-console.txt in the kernel source is another valuable resource.
Network Console

The network console has been introduced with the 2.6 series kernels and is available as a patch for recent 2.4 kernels. The network console (netconsole) allows logging kernel printk messages over UDP and supports debugging in cases where disk logging fails and serial consoles are not available or impractical.
Configuration and usage of netconsole are described in Documentation/networking/netconsole.txt; please refer to that file for details.
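To give a rough idea, loading the module could look like the line below; all addresses, ports, the interface name and the MAC address are placeholders, and netconsole.txt remains the authoritative reference for the syntax:

  # modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55

On the receiving host a simple UDP listener such as netcat is enough (some netcat versions use nc -u -l 6666 instead):

  $ nc -u -l -p 6666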
For obvious reasons you should be aware that sending the netconsole packets out via a card driven by the very driver you are debugging isn't a good idea when trying to catch kernel oopses caused by that driver. Use a wired network interface for this purpose instead.
Tools for oops decoding

The main problem for users dealing with oops messages is the little intrinsic meaning carried by hexadecimal values; to be meaningful to the programmer they need to be resolved to symbols. While kernel 2.6 already takes care of that (as long as CONFIG_KALLSYMS is enabled in the kernel configuration), additional tools are needed for this task on systems running kernel 2.4.
Two common utilities for this purpose are klogd and ksymoops. The former tool performs symbol decoding by itself whenever it is running; the latter needs to be purposely invoked by the user. The following sections will explain how they are used. Remember that you most probably won't need them if you're running a properly configured kernel 2.6.
klogd

The klogd daemon, which is part of the sysklogd package, can decode oops messages before they reach the log files. In many situations, klogd can provide all the information a developer needs to track down a problem, though sometimes the developer must give it a little help.
A dump of the oops that has been shown above, as it reaches the system log, looks like this (note the decoded symbols on the EIP line and in the stack trace):
  Unable to handle kernel NULL pointer dereference at virtual address 00000000
   printing eip:
  c48370c3
  *pde = 00000000
  Oops: 0002
  CPU:    0
  EIP:    0010:[faulty:faulty_write+3/576]
  EFLAGS: 00010286
  eax: ffffffea   ebx: c2c55ae0   ecx: c48370c0   edx: c2c55b00
  esi: 0804d038   edi: 0804d038   ebp: c2337f8c   esp: c2337f8c
  ds: 0018   es: 0018   ss: 0018
  Process cat (pid: 23413, stackpage=c2337000)
  Stack: 00000001 c01356e6 c2c55ae0 0804d038 00000001 c2c55b00 c2336000 00000001
         0804d038 bffffbd4 00000000 00000000 bffffbd4 c010b860 00000001 0804d038
         00000001 00000001 0804d038 bffffbd4 00000004 0000002b 0000002b 00000004
  Call Trace: [sys_write+214/256] [system_call+52/56]
  Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00

klogd provides most of the necessary information to track down the problem. In this case we
see that the instruction pointer (EIP) was executing in the function faulty_write, so we know where to start looking. The 3/576 string tells us that the processor was at byte 3 of a function that appears to be 576 bytes long. Note that the values are decimal, not hex.
The developer must exercise some care, however, to get useful information for errors that occur within loadable modules. klogd loads all of the available symbol information when it starts, and uses those symbols thereafter. If you load a module after klogd has initialized itself (usually at system boot), klogd will not have your module's symbol information. To force klogd to go out and get that information, send the klogd process a SIGUSR1 signal after your module has been loaded (or reloaded), and before you do anything that could cause it to oops.
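A one-liner for sending that signal (assuming pidof is available on your system):

  # kill -USR1 $(pidof klogd)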
It is also possible to run klogd with the -p ("paranoid") option, which will cause it to reread symbol information anytime it sees an oops message. The klogd manpage recommends against this mode of operation, however, since it makes klogd query the kernel for information after the problem has occurred. Information obtained after an error could be plain wrong.
For klogd to work properly, it must have a current copy of the System.map symbol table file. Normally this file is found in /boot; if you have built and installed a kernel from a nonstandard location you may have to copy System.map into /boot, or tell klogd to look elsewhere. klogd refuses to decode symbols if the symbol table doesn't match the current kernel. If a symbol is decoded on the system log, you can be reasonably sure it is decoded correctly.
ksymoops

At times klogd may not be enough for your tracing purposes. Usually, you need to get both the hexadecimal address and the associated symbol, and you often need offsets printed as hex numbers. You may need more information than address decoding. Also, it is common for klogd to get killed during the fault. In such situations, a stronger oops analyzer may be called for; ksymoops is such a tool. In case your distribution doesn't provide this tool, you can download it from kernel.org.
To operate at its best, ksymoops needs a lot of information in addition to the error message; you can use command-line options to tell it where to find the various items. The program needs the following items:
- A System.map file. The map must correspond to the kernel that was running at the time the oops occurred; the default is /usr/src/linux/System.map.
- A list of modules. ksymoops needs to know what modules were loaded when the oops occurred, in order to extract symbolic information from them. If you do not supply this list, ksymoops will look at /proc/modules.
- A list of kernel symbols defined when the oops occurred. The default is to get this list from /proc/ksyms.
- A copy of the kernel image that was running. Note that ksymoops needs a straight kernel image, not the compressed version (vmlinuz, zImage, or bzImage) that most systems boot. The default is to use no kernel image because most people don't keep it. If you have the exact image handy, you should tell the program where it is by using the -v option.
- The locations of the object files for any kernel modules that were loaded. ksymoops will look in the standard directories for modules, but during development you will almost certainly have to tell it where your module lives using the -o option.

Although ksymoops will go to files in /proc for some of its needed information, the results can be unreliable. The system, of course, will almost certainly have been rebooted between the time the oops occurs and when ksymoops is run, and the information from /proc may not match the state of affairs when the failure occurred. When possible, it is better to save copies of /proc/modules and /proc/ksyms prior to causing the oops to happen.
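Put together, a typical invocation could look roughly like the sketch below; all paths are illustrative, oops.txt is a placeholder for a file containing the raw oops text, and the exact set of options may differ between ksymoops versions:

  # ksymoops -v /usr/src/linux/vmlinux -m /usr/src/linux/System.map \
             -l /proc/modules -k /proc/ksyms \
             -o /lib/modules/$(uname -r) < oops.txt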
Please refer to the manpage (ksymoops(8)); it is a very informative document.
The last argument on the tool's command line is the location of the oops message; if it is missing, the tool will read stdin in the best Unix tradition. The message can be recovered from the system logs with luck; in the case of a very bad crash you may end up writing it down off the screen and typing it back in (unless you were using a serial console, a nice tool for kernel developers).
Note that ksymoops will be confused by an oops message that has already been processed by klogd. If you are running klogd, and your system is still running after an oops occurs, a clean oops message can often be obtained by invoking the dmesg command.
If you do not provide all of the listed information explicitly, ksymoops will issue warnings. It will also issue warnings about things like loaded modules that define no symbols. A warning-free run of ksymoops is rare.
Output from ksymoops tends to look like the following:
  >>EIP; c48370c3 <[faulty]faulty_write+3/20>   <=====
  Trace; c01356e6

As you can see, ksymoops provides EIP and kernel stack information much like klogd does, but more precisely and in hexadecimal. You'll note that the faulty_write function is correctly reported to be 0x20 bytes long. This is because ksymoops reads the object file of your module and extracts all available information.
In this case, moreover, you also get an assembly language dump of the code where the fault occurred. This information can often be used to figure out exactly what was happening; here it's clearly an instruction that writes a 0 to address 0.
One interesting feature of ksymoops is that it is ported to nearly all the platforms where Linux runs and exploits the bfd (Binary File Descriptor) library in order to support several computer architectures at the same time. This allows tracing bugs on platforms other than x86 as well (but we won't go into that here - refer to the LDD book for further information).
gdb

On kernel 2.6, oopses produce the following message:
  EIP is at ath_beacon_generate+0x4ad/0x58f

You can easily locate the appropriate source line by doing the following:
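A minimal sketch, assuming that the object file containing ath_beacon_generate (vmlinux if the code is built in, otherwise the module's .o/.ko file) was built with debug info (CONFIG_DEBUG_INFO); the vmlinux path below is only illustrative:

  $ gdb /usr/src/linux/vmlinux
  (gdb) list *(ath_beacon_generate+0x4ad)

gdb then prints the source file and line corresponding to that offset. If you only have an absolute address instead of a symbol+offset, addr2line -e <object file> <address> serves the same purpose.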
The text on this page has been compiled by gathering related documents, filtering out relevant parts and streamlining the result a bit. The biggest chunk of text has been taken from the book Linux Device Drivers (2nd and 3rd edition), written by Alessandro Rubini, Jonathan Corbet and Greg Kroah-Hartman and published by O'Reilly. Thanks to O'Reilly and the authors for publishing the work under a free license!
In particular, chapter 4 of that book has been a great help, and a lot of text has been used unchanged for this page.
Thanks also go to the authors of the documentation referenced at various places above.