Category: Linux
2011-08-01 21:27:19
Here are some notes on how to debug Linux kernel lockups – both "hard lockups" and "soft lockups" – and other panic, BUG, and oops situations. I am not an expert in this, but I figured incomplete information was better than no information, so here we go:
One way of confirming that you are the victim of a lockup is to note that the keyboard “caps lock” light does not respond to the “caps lock” key. Similarly, the “num lock” light won’t respond to the “num lock” key. Furthermore, the machine will not respond to ctrl-alt-delete.
Some people take this symptom as their definition of a hard lockup ... but beware that there is a situation that the kernel calls a soft lockup that exhibits the same symptom.
One way a soft lockup can occur is when the machine goes into a loop with interrupts turned off. This commonly happens if a device driver uses spinlocks improperly.
To have the kernel detect soft lockups, enable soft-lockup detection in your kernel configuration, and then of course recompile your kernel, install the newly compiled kernel, and reboot.
For slightly more information, see the help text of the soft-lockup detection option, which reads:
Say Y here to enable the kernel to detect "soft lockups", which are bugs that cause the kernel to loop in kernel mode for more than 10 seconds, without giving other tasks a chance to run. When a soft-lockup is detected, the kernel will print the current stack trace (which you should report), but the system will stay locked up. This feature has negligible overhead.
In some smallish subset of cases, the stack trace will be saved in the log files, but you should not count on this.
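When the trace does make it into the log files, a quick search will find it. The following is only a sketch: it assumes the kernel tags these reports with the string "BUG: soft lockup" (check the exact wording for your kernel version), and the function name soft_lockup_reports is made up for illustration. You pass in whatever log file your distribution uses.

```shell
#!/bin/sh
# Print every line of a log file that looks like a soft-lockup report,
# prefixed with its line number so the stack trace that follows is easy to find.
# ASSUMPTION: reports contain the text "BUG: soft lockup".
soft_lockup_reports() {
    logfile=$1
    grep -n 'BUG: soft lockup' "$logfile"
}
```

For example, soft_lockup_reports /var/log/messages (the log path varies by distribution). The function returns non-zero when nothing matches, which is what you hope for.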
Far and away the best way to capture the stack trace is to set up a “serial console”. That is, you arrange for console i/o (including oops messages) to appear on a serial port.
Getting this to work requires the following steps:
    make menuconfig --> Device Drivers --> Character devices --> Serial drivers --> Console on 8250/16550 and compatible serial port

Then, in your /boot/grub/menu.lst file, add a boot option, namely

    console=ttyS0,115200

or more explicitly, you need a grub stanza something like this:

    title Linux (serial console)
    root (hd0,2)
    kernel /boot/vmlinuz-2.6.99 ro root=/dev/sda3 console=ttyS0,115200 console=tty0

Here tty0 refers to “the” PC screen (i.e. the one hooked to “the” graphics card via the VGA interface or some such). Meanwhile, ttyS0 refers to the lowest-numbered serial line. Note that ttyS0 is what Microsoft calls com1, and ttyS1 is what they call com2, et cetera; the MS numbers are systematically one unit higher.
You are not required to explicitly specify the baudrate (115200) of the serial line, but I recommend you do so. Of course you are free to use another serial line such as ttyS1 if you prefer. In any case, you must use the correct capitalization (capital S). Note that you can specify more than one console=... option, as in the example above. If you specify none, you get tty0 by default. If you specify only ttyS0, you get that instead of tty0. If you want both, you must specify both.
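To see which consoles a given command line actually asks for, something like the following may help. It is only a sketch: the function name list_consoles is invented here, and it does nothing cleverer than split the command line on whitespace and pick out the console= words. Called with no argument, it reads the running kernel's command line from /proc/cmdline.

```shell
#!/bin/sh
# Print each console= value found in a kernel command line, one per line.
# With no argument, use the running kernel's command line (/proc/cmdline).
# If no console= option is present, report the default described above (tty0).
list_consoles() {
    cmdline=${1-$(cat /proc/cmdline)}
    found=0
    for word in $cmdline; do          # deliberately unquoted: split on whitespace
        case $word in
            console=*)
                printf '%s\n' "${word#console=}"
                found=1
                ;;
        esac
    done
    if [ "$found" -eq 0 ]; then
        printf '%s\n' "tty0 (default)"
    fi
}
```

Running list_consoles after a reboot is a cheap way to confirm that the grub stanza you edited is the one that was actually booted.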
Tangential remark: Choosing to log kernel messages to the serial port is independent of choosing to permit logins on that serial port; you can choose either or both or neither. If you choose both, it allows you to administer a system that has no screen at all.
Edit /etc/inittab to tell init to spawn a getty on the chosen serial line. I recommend you leave at least one runlevel where the getty is not spawned, for convenience if you ever need to use that serial port for something else. You may also need to edit /etc/securetty if you want to permit root logins on the serial line.
If you want to interact with the grub menu via the serial line, you must reconfigure grub accordingly. See the grub info pages. (You can skip this task if you are content to let grub boot the default kernel without interaction, which is often the case. Just don’t make a mistake with your grub configuration, or you’ll be locked out until you hook up a screen.)
Then of course you must hook up a serial cable from your computer (#1) to some other computer (#2). We assume computer #2 will remain running even if/when computer #1 crashes. On computer #2, run some communication program such as Kermit to allow you to talk to the serial line, and log the traffic to a disk file.
Computer #2 doesn’t need to be a Linux box. If it is a Windows box, you can install Kermit for Windows, or just use the built-in “hyperterm” application to make the connection and log the traffic.
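If computer #2 happens to run Linux and no terminal program is handy, plain stty and cat can stand in for Kermit. This is a rough sketch, not a replacement for a real communication program: log_serial is a made-up name, the port and log path in the usage line are assumptions, and stty -F is the GNU spelling (BSD systems use -f).

```shell
#!/bin/sh
# Append everything that arrives on a serial port to a log file.
# Usage: log_serial PORT LOGFILE [SPEED]
# The port is put into raw mode at the given speed (default 115200, to match
# the console= option on computer #1) only if it is a character device.
log_serial() {
    port=$1
    logfile=$2
    speed=${3:-115200}
    if [ -c "$port" ]; then
        # ASSUMPTION: GNU stty; raw mode so nothing is translated or echoed back.
        stty -F "$port" "$speed" raw -echo || return 1
    fi
    # On a real serial port this blocks until interrupted.
    cat "$port" >> "$logfile"
}
```

A typical call might be log_serial /dev/ttyS1 /var/tmp/machine1-console.log, left running in the background. Because the log is appended to, earlier crashes are not overwritten.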
As for the cable itself, you need “null modem” functionality. This just involves crossing a couple of wires. In many cases, if the cable has female connectors on both ends, it will have this functionality built in. In particular, a so-called LapLink cable has null-modem functionality built in. Conversely, if the cable looks like an extension cord (male on one end, female on the other) it most likely does not have null-modem functionality, and you will need a separate dongle (both to perform the gender-change operation and to cross the required wires).
To test that it is working, try something like

    echo "Hi there." > /dev/console

and verify that the message is seen by computer #2.
If you have two computers, you can use each to ride herd on the other. All you need is two cables. Just use ttyS0 as the console on each one, and monitor it with ttyS1 on the other. Presumably they won’t both crash at the same time. If you have a large number of computers, you can connect them in a big daisy chain: A→B→C→D→E→A. If you have an even number of machines, you might consider connecting them in pairs, but the daisy chain is just as easy, and isn’t limited to even numbers. If machine N crashes, you can ssh to machine N+1 (via its ethernet interface) to collect the logged information; we don’t need to rely on the serial links for all of our communication.
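To keep track of who watches whom in such a daisy chain, a tiny helper can print the pairings. This is purely illustrative: ring_pairs is an invented name, and the host names are whatever you pass in. Each output line "X -> Y" means that X's console (ttyS0) is cabled to, and logged by, Y (on its ttyS1).

```shell
#!/bin/sh
# Print the monitoring pairs for a daisy chain of machines.
# Usage: ring_pairs HOST1 HOST2 [HOST3 ...]
# Each host is watched by the next one; the last is watched by the first.
ring_pairs() {
    [ "$#" -ge 2 ] || return 1     # a chain needs at least two machines
    first=$1
    prev=$1
    shift
    for host in "$@"; do
        printf '%s -> %s\n' "$prev" "$host"
        prev=$host
    done
    printf '%s -> %s\n' "$prev" "$first"
}
```

With two hosts this reduces to the mutual-monitoring pair described above, and it works equally well for odd and even numbers of machines.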
The point here is that by selecting the early printk option, you get a non-interrupt-dependent printk (not just an “early” printk). This trick is not very well documented or widely known, so be glad that somebody told you about it.
There are some mild downsides to the early printk option; see the menuconfig help text for details.
The simplest way to escape from a hard lockup and get a stack trace is by means of a watchdog timer. For info on watchdog timers, read /usr/src/linux/Documentation/watchdog/*.txt.
If you are running on a system that has an Intel 82801 “I/O Controller Hub” chip (which includes most of the reasonably modern Intel-based systems) then life is simple: you can use the TCO timer and route it to the processor’s NMI line (Non-Maskable Interrupt).
To make this happen:
    make menuconfig --> Device Drivers --> Character devices --> Watchdog Cards --> Intel i8xx TCO Timer/Watchdog --> Intel TCO Timer/Watchdog

Make it a module. Load it with modprobe iTCO-wdt.

Note that in some older kernels the option was named differently:

    make menuconfig --> Device Drivers --> Character devices --> Watchdog Cards --> Intel i8xx TCO Timer/Watchdog

The module was loaded with modprobe i8xx-tco.
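Since the module name depends on the kernel version, a script can simply try both of the names mentioned above. A minimal sketch: load_tco_watchdog and the MODPROBE override variable are hypothetical names introduced here (the override exists only so the logic can be exercised without root), and the real modprobe must be run as root.

```shell
#!/bin/sh
# Try the newer module name first, then the older one.
# Prints the name that loaded and returns 0; returns 1 if neither loads.
# ASSUMPTION: MODPROBE is a hypothetical hook; leave it unset to use modprobe.
load_tco_watchdog() {
    for name in iTCO-wdt i8xx-tco; do
        if "${MODPROBE:-modprobe}" "$name" 2>/dev/null; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}
```

If the function fails, the likely explanations are that the driver was not built as a module or that the machine lacks the 82801 chip.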
You can tickle it with a simple userspace program, or with the even simpler program mentioned in /usr/src/linux/Documentation/watchdog/watchdog.txt. That program is advertised as “Example Watchdog Driver” but it’s not a driver in the usual sense of the word; it’s really an “Example Watchdog Daemon” or something like that.
Alternatively, you can tickle it using something like echo > /dev/watchdog every so often. Use echo -n V > /dev/watchdog to make the watchdog stop watching (so you can stop tickling, without causing a reboot).
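The echo approach can be wrapped in a small loop. The sketch below is an assumption-laden illustration rather than a proper watchdog daemon: tickle_watchdog is a made-up name, the 10-second default interval is a guess that must be shorter than your watchdog's timeout, and it relies on the driver honoring the "V" as described above. It keeps the device open for the whole run, writes a newline per tick, and writes V just before closing.

```shell
#!/bin/sh
# Tickle a watchdog device a fixed number of times, then disarm it.
# Usage: tickle_watchdog [DEVICE] [INTERVAL_SECONDS] [COUNT]
tickle_watchdog() {
    dev=${1:-/dev/watchdog}
    interval=${2:-10}      # ASSUMPTION: shorter than the watchdog timeout
    count=${3:-6}
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            echo >&3               # any write counts as a tickle
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3             # ask the watchdog to stop watching
    } 3> "$dev"                    # device stays open until the braces close
}
```

If the loop is killed before it reaches the final write, the V is never sent and the watchdog will fire, which is exactly the behavior you want from a watchdog.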
If you don’t have an 82801 chip, you’ll have to buy one of the hardware cards described in the aforementioned watchdog.txt file.
If you’re still interested in the software watchdog, you can find it at:

    make menuconfig --> Device Drivers --> Character devices --> Watchdog Cards --> Software watchdog
There are at least two ways to proceed:
Of course if machine A is not hung, you have programmatic control of all the machines plugged into the power controller. This includes control of machine A itself. Beware that any command to power down machine A is irreversible, unless the same command brings the power back up later.
#include
Glenn Turner, “Remote Serial Console HOWTO”