分类: LINUX
2007-02-02 13:21:29
This tutorial was developed for the 2006 . I'm hoping it will be useful as a general resource.
There is documentation in Documentation/kprobes.txt in the kernel source.
kprobes is both useful and cool. Unfortunately most of the kernel developers I've talked to are confused both about why its useful and how to use it. I'm hoping to answer both of those here. I should add that I have no skin in the game personally, I'm not a kprobes developer. I should probably turn this into a HOWTO...
Simply, kprobes allows you to write modules that can add debug information to the kernel. It is an alternative to building custom kernels or custom modules. I think the most useful case is when you are dealing with some remote machine (your grandmother or a tester) who hits some bug that you can't figure out by looking at /var/log/messages. Build a kprobes module and have them insmod it on their kernel.
This tutorial deals with kernel kprobing. There is additional work around user-land kprobing that will not be discussed here.
There are three kinds of kprobes:
kprobes has been in mainline since 2.6.9. There are some kernel configuration flags that need to be set to use kprobes. Current enterprise kernels (such as SLES 10) have these turned on, and so does FC5. Some others (cough, Ubuntu, cough) do not. The flags that are required are:
kallsyms_lookup_name()
.
To build a module (any module) you need to have access to the kernel headers and a suitable compiler.
The simplest case, useful in 99% of cases is the jprobe case, where your function gains control on the entry to some arbitrary routine in the kernel.
In this example we will trace do_execve in the kernel. Start from the makefile and source in Documentation/kprobes.txt.
/* Trace do_execv. Taken basically from Documentation/kprobes.txt */
#include
#include
#include
#include
#include
/*
* Pre-entry point for do_execve.
*/
static int my_do_execve(char * filename,
char __user *__user *argv,
char __user *__user *envp,
struct pt_regs * regs)
{
printk("do_execve for %s from %s\n", filename, current->comm);
/* Always end with a call to jprobe_return(). */
jprobe_return();
/*NOTREACHED*/
return 0;
}
static struct jprobe my_jprobe = {
.entry = (kprobe_opcode_t *) my_do_execve
};
int init_module(void)
{
int ret;
my_jprobe.kp.addr =
(kprobe_opcode_t *) kallsyms_lookup_name("do_execve");
if (!my_jprobe.kp.addr) {
printk("Couldn't find %s to plant jprobe\n", "do_execve");
return -1;
}
if ((ret = register_jprobe(&my_jprobe)) <0) {
printk("register_jprobe failed, returned %d\n", ret);
return -1;
}
printk("Planted jprobe at %p, handler addr %p\n",
my_jprobe.kp.addr, my_jprobe.entry);
return 0;
}
void cleanup_module(void)
{
unregister_jprobe(&my_jprobe);
printk("jprobe unregistered\n");
}
MODULE_LICENSE("GPL");
Note Give YOUR routine a different name than the one you are tracing, otherwise kallsyms_lookup_name will get confused.
Note where the jprobe handler calls jprobe_return(). I mean REALLY note that (don't just return from a jprobe.)
The Makefile (also straight out of Documentation/kprobes.txt) is
# This is taken straight from Documentation/kprobes.txt
obj-m := trace-exec.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
clean:
rm -f *.mod.c *.ko *.o
Easy. insmod the module and away you go! Here's the output on my thinkad
Jul 16 19:20:46 hound kernel: [17213292.188000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.204000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.216000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.232000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.248000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.264000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.296000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.376000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.392000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.456000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.472000] do_execve for /bin/sh from wcstatusd
Hmm...what the heck is wcstatusd doing all the time...
If you try the above on powerpc and (I think) ia64 it will fail. And you will bang your head. The reason is that when you take the address of a routine you get the address in a jump table, not the actual code address.
You can fix this by something like the following:
if ((ret = register_jprobe(&my_jprobe)) <0) {
printk("register_jprobe failed, returned %d. "
"trying dereferenced address\n", ret);
my_jprobe.kp.addr = *((kprobe_opcode_t **)my_jprobe.kp.addr);
if ((ret = register_jprobe(&my_jprobe)) <0) {
printk("register_jprobe failed again, returned %d\n",
ret);
return -1;
}
}
Also beware of the 32-bit compat_* functions. In the above example, do_execve is actually never called on powerpc unless you are running 64-bit processes. compat_do_execve is called instead.
I usually do something like the following to limit the amount of data that comes out (depending on what I am tracing)
...
static atomic_t trace_limit;
...
if (atomic_inc_return(&trace_limit) > 100)
jprobe_return();
...
It can help to have a fault handler in your kprobes. Unless you plan to always write bug-less kprobes.
...
/* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
printk("fault_handler: p->addr=0x%p, trap #%d\n",
p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
...
my_jprobe.kp.fault_handler = handler_fault;
...
My Thinkpad has a Centrino processor with speedstep CPU frequency
controls. CPU frequency is controlled by the
speedstep_centrino
module. Lets trace all the calls to
centrino_target
in that module.
First, here is the routine we want to trace from
arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c
.
...
/**
* centrino_setpolicy - set a new CPUFreq policy
* @policy: new policy
* @target_freq: the target frequency
* @relation: how that frequency relates to achieved frequency (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
*
* Sets a new CPUFreq policy.
*/
static int centrino_target (struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation)
{
unsigned int newstate = 0;
unsigned int msr, oldmsr, h, cpu = policy->cpu;
struct cpufreq_freqs freqs;
cpumask_t saved_mask;
int retval;
if (centrino_model[cpu] == NULL)
return -ENODEV;
...
Some things to note. First, this routine is in a module, not in the kernel proper. Secondly, the routine is marked static. No problem. if KALLSYMS_ALL is configured, we will still get the address for it.
/* Trace do_execv. Taken basically from Documentation/kprobes.txt */
#include
#include
#include
#include
#include
static atomic_t trace_limit;
/* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
printk("fault_handler: p->addr=0x%p, trap #%d\n",
p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
/*
* Pre-entry point for centrino_target.
*/
static int my_centrino_target (struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation)
{
if (atomic_inc_return(&trace_limit) > 100)
jprobe_return();
printk("centrino_target %u relation %u\n", target_freq, relation);
dump_stack();
/* Always end with a call to jprobe_return(). */
jprobe_return();
/*NOTREACHED*/
return 0;
}
static struct jprobe my_jprobe = {
.entry = (kprobe_opcode_t *) my_centrino_target,
};
int init_module(void)
{
int ret;
my_jprobe.kp.addr =
(kprobe_opcode_t *) kallsyms_lookup_name("centrino_target");
if (!my_jprobe.kp.addr) {
printk("Couldn't find %s to plant jprobe\n", "centrino_target");
return -1;
}
my_jprobe.kp.fault_handler = handler_fault;
if ((ret = register_jprobe(&my_jprobe)) <0) {
printk("register_jprobe failed, returned %d\n", ret);
return -1;
}
printk("Planted jprobe at %p, handler addr %p\n",
my_jprobe.kp.addr, my_jprobe.entry);
return 0;
}
void cleanup_module(void)
{
unregister_jprobe(&my_jprobe);
printk("jprobe unregistered\n");
}
MODULE_LICENSE("GPL");
Note the use of dump_stack()
. The output in
/var/log/messages
Jul 16 19:43:01 hound kernel: [17179786.852000] centrino_target 1500000 relation 0
Jul 16 19:43:01 hound kernel: [17179786.852000] [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:01 hound kernel: [17179786.852000] [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:01 hound kernel: [17179786.852000] [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:01 hound kernel: [17179786.852000] [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:01 hound kernel: [17179786.852000] [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:01 hound kernel: [17179786.852000] [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:01 hound kernel: [17179786.852000] [need_resched+12/33] sysenter_past_esp+0x54/0x75
Jul 16 19:43:11 hound kernel: [17179797.328000] centrino_target 1400000 relation 0
Jul 16 19:43:11 hound kernel: [17179797.328000] [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:11 hound kernel: [17179797.328000] [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:11 hound kernel: [17179797.328000] [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:11 hound kernel: [17179797.328000] [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:11 hound kernel: [17179797.328000] [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:11 hound kernel: [17179797.328000] [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:11 hound kernel: [17179797.328000] [need_resched+12/33] sysenter_past_esp+0x54/0x75
kprobes patches a single address (the address used above in kallsyms_lookup_name().) This doesn't handle inlines. There is no automatic way to handle inlines.
The second most useful thing to do is track the return value of a
routine. kprobes uses kretprobes
to handle that. A
classic case is tracking memory allocations (kmalloc.)
kprobes is smart enough to trap on any return from a routine (it actually places a trap at the place the routine was called from rather than on every return instruction
kprobes can miss kretprobes. You can check my_kretprobe.nmissed
to see if you missed any retprobes.
/* Trace kmalloc. Taken basically from Documentation/kprobes.txt */
#include
#include
#include
#include
#include
static atomic_t trace_limit;
/*
* Pre-entry point for kmalloc_target.
*/
static void *my__kmalloc(size_t size, gfp_t flags)
{
if (atomic_inc_return(&trace_limit) > 100)
jprobe_return();
printk("kmalloc call: %u in %s\n", size, current->comm);
/* Always end with a call to jprobe_return(). */
jprobe_return();
/*NOTREACHED*/
return 0;
}
static struct jprobe my_jprobe = {
.entry = (kprobe_opcode_t *) my__kmalloc,
};
/* Return-probe handler: If the probed function fails, log the return value. */
static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
if (atomic_inc_return(&trace_limit) > 100)
return 0;
printk("kmalloc returns %lx\n", regs->eax);
return 0;
}
static struct kretprobe my_kretprobe = {
.handler = ret_handler,
/* Probe up to 20 instances concurrently. */
.maxactive = 20
};
int init_module(void)
{
int ret;
my_jprobe.kp.addr = my_kretprobe.kp.addr =
(kprobe_opcode_t *) kallsyms_lookup_name("__kmalloc");
if (!my_jprobe.kp.addr) {
printk("Couldn't find %s to plant jprobe\n", "__kmalloc");
return -1;
}
if ((ret = register_jprobe(&my_jprobe)) <0) {
printk("register_jprobe failed, returned %d\n", ret);
return -1;
}
if ((ret = register_kretprobe(&my_kretprobe)) < 0) {
printk("register_kretprobe failed, returned %d\n", ret);
unregister_jprobe(&my_jprobe);
return -1;
}
printk("Planted jprobe at %p, handler addr %p\n",
my_jprobe.kp.addr, my_jprobe.entry);
return 0;
}
void cleanup_module(void)
{
unregister_jprobe(&my_jprobe);
unregister_kretprobe(&my_kretprobe);
printk("jprobe unregistered\n");
}
MODULE_LICENSE("GPL");
Note that you just return()
from kretprobes.
Given SMP, preemption, etc. there isn't an easy way to correlate a jprobe and a kretprobe.
jprobes and kretprobes are based on raw kprobes. A raw kprobe just takes an arbitrary address and sets a breakpoint there. Frequently useful in debugging a panic (where you know the faulting address) and inassociation with a disassembled kernel (objdump -D)
kprobes provide "pre" handlers tha run before the specific instruction, and "post" handlers that run afterwards. The following is a generic kprobe that dumps state when an arbitrary address is hit.
/* generic kprobe */
#include
#include
#include
#include
#include
static unsigned long probe_addr;
module_param_named(probe_addr, probe_addr, ulong, S_IRUGO | S_IWUSR);
/*For each probe you need to allocate a kprobe structure*/
static struct kprobe kp;
static void dump_state(struct pt_regs *regs)
{
print_symbol(KERN_INFO "EIP is at %s\n", regs->eip);
printk(KERN_INFO "eax: %08lx ebx: %08lx ecx: %08lx edx: %08lx\n",
regs->eax, regs->ebx, regs->ecx, regs->edx);
printk(KERN_INFO "esi: %08lx edi: %08lx ebp: %08lx esp: %08lx\n",
regs->esi, regs->edi, regs->ebp, regs->esp);
printk(KERN_INFO "ds: %04x es: %04x\n",
regs->xds & 0xffff, regs->xes & 0xffff);
printk(KERN_INFO "Process %s (pid: %d, threadinfo=%p task=%p)",
current->comm, current->pid, current_thread_info(), current);
}
/*kprobe pre_handler: called just before the probed instruction is executed*/
int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
printk("pre_handler: p->addr=0x%p\n",
p->addr);
dump_state(regs);
dump_stack();
return 0;
}
/*kprobe post_handler: called after the probed instruction is executed*/
void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
printk("post_handler: p->addr=0x%p\n",
p->addr);
dump_state(regs);
return;
}
/* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
printk("fault_handler: p->addr=0x%p, trap #%dn",
p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
int init_module(void)
{
int ret;
if (!probe_addr) {
printk("trace-generic: provide probe_addr paramter\n");
return -1;
}
kp.pre_handler = handler_pre;
kp.post_handler = handler_post;
kp.fault_handler = handler_fault;
kp.addr = (kprobe_opcode_t*) probe_addr;
printk("setting probe at address %p\n",kp.addr);
/* register the kprobe now */
if (!kp.addr) {
printk("Couldn't find %s to plant kprobe\n", "do_fork");
return -1;
}
if ((ret = register_kprobe(&kp) < 0)) {
printk("register_kprobe failed, returned %d\n", ret);
return -1;
}
printk("kprobe registered\n");
return 0;
}
void cleanup_module(void)
{
unregister_kprobe(&kp);
printk("kprobe unregistered\n");
}
MODULE_LICENSE("GPL");
There's no reason you can't add a kprobes module to your initrds to trace things happening reasonably early during the boot process. Similarly, you can actually add a kprobe to a distro installation if the installer allows you to load extra modules before doing the install.
relayfs () provides examples of tying kprobes and relayfs together. Rather than dumping all your trace data into dmesg.