Debugging the linux kernel using ftrace-king_

Overview

ftrace is a tracing facility built into the Linux kernel. It originated in the real time enhancements efforts and is now merged into mainstream. However not all features are supported by all architectures. This overview is based around the x86 architecture as the rest will slowly catchup feature wise.

General settings

Mount the debugfs in /etc/fstab or at runtime by doing:

mount -t debugfs nodev /sys/kernel/debug

All configuration settings are accessed through:

/sys/kernel/debug/tracing

Use > to overwrite or >> to append.

available_tracers: Lists the available tracers compiled into the kernel.

current_tracer: Enables a tracer.

tracing_enabled: Enables/disables the current tracer.

tracing_on: Start/stops tracing.

per_cpu: Trace files for all existing CPUs.

tracing_cpumask: This is a mask that lets the user only trace on specified CPUS. The format is a hex string representing the CPUS.

ftrace_dump_on_oops: Ftrace dumps to the console the entire contents of its ring buffer on oops. The dump can be triggered by sysrq-z. The kernel may call ftrace_dump() in all contexts to chose a location for the dump, however the buffer may end up corrupted after the dump.

buffer_size_kb: Set the size of the circular ring buffer.

ftrace has a variety of tracers and it allows for new ones to be added. We will focus on the function trace and leave others for another post.

Function tracer

The function tracer uses the -pg option of gcc to have every non-inline kernel function call mcount(), a special assembly function. With CONFIG_DYNAMIC_FTRACE these calls are convered to NOP at boot time so the system has the small performance hit of the frame pointers added by -pg. Once the function tracer (CONFIG_FUNCTION_TRACER) or graphical function tracers (CONFIG_FUNCTION_GRAPH_TRACER) are enabled they are converted back.

trace: Holds the ouput of a trace in human readable format.

latency_trace: Same information but organized to display system latencies.

trace_pipe: Same as trace, but meant to be streamed so the data is consumed on a read.

[tracing]# echo function > current_tracer

[tracing]# cat current_tracer

function

[tracing]# cat trace | head -10

# tracer: function

#

# TASK-PID CPU# TIMESTAMP FUNCTION

# | | | | |

bash-16939 [000] 6075.461561: mutex_unlock <-tracing_set_tracer

-0 [001] 6075.461561: _spin_unlock_irqrestore <-hrtimer_get_next_event

-0 [001] 6075.461562: rcu_needs_cpu <-tick_nohz_stop_sched_tick

bash-16939 [000] 6075.461563: inotify_inode_queue_event <-vfs_write

-0 [001] 6075.461563: mwait_idle <-cpu_idle

bash-16939 [000] 6075.461563: __fsnotify_parent <-vfs_write

Displays task name, PID, CPU number, timestamp from boot, function being traced <- parent function.

[tracing]# echo function_graph > current_tracer

[tracing]# cat trace | head -20

# tracer: function_graph

#

# CPU DURATION FUNCTION CALLS

# | | | | | | |

1) 1.015 us | _spin_lock_irqsave();

1) 0.476 us | internal_add_timer();

1) 0.423 us | wake_up_idle_cpu();

1) 0.461 us | _spin_unlock_irqrestore();

1) 4.770 us | }

1) 5.725 us | }

1) 0.450 us | mutex_unlock();

1) + 24.243 us | }

1) 0.483 us | _spin_lock_irq();

1) 0.517 us | _spin_unlock_irq();

1) | prepare_to_wait() {

1) 0.468 us | _spin_lock_irqsave();

1) 0.502 us | _spin_unlock_irqrestore();

1) 2.411 us | }

1) 0.449 us | kthread_should_stop();

1) | schedule() {

Functions start with { and end with }. Leaf function which don't call other functions end with ;. The duration is the time spent in the function (with the overhead of nested functions and the graph tracer). Durations of more than 10 us are marked with +, and ! for more than 100 us.

Function filtering

available_filter_functions: This lists the functions that ftrace has processed and can trace. These are the function names that you can pass to "set_ftrace_filter" or "set_ftrace_notrace".

set_ftrace_filter: Echoing names of functions into this file limits the trace to only those functions. Valid patterns are '*match', 'match*' and '*match*'. To filter only functions taht belong to a specific module use ':mod:name' as in:

echo ':mod:tg3' > set_ftrace_filter

set_ftrace_notrace: Any function that is added here will not be traced.

set_ftrace_pid: Have the function tracer only trace a single thread. Echoing a -1 will go back to the generic tracing.

set_ftrace_filter has a method to enable/disable kernel tracing without having to recompile the kernel.

function:command[:count]

command is either traceon or traceoff, and the optional count has the command execute only a certain number of times.

The kernel comman dline options ftrace_notrace and ftrace_filter can be used to set a predefine set at boot time, as in:

ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock

ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault

To remove functions from the filter use:

echo '!*lock*' >> set_ftrace_filter

And to remove commands:

echo '!schedule:traceoff' >> set_ftrace_filter

Kernel debugging with ftrace_printk()

printk is a very intrusive way of debugging with a big overhead. ftrace_printk prints to the ftrace ring buffer so can also be used in any context (interrupt, NMI and scheduler) taking just ~0.1us. It will appear in any tracer. It was introduced in 2.6.32.

To enable ftrace debugging from kernel space use tracing_on() and tracing_off().

Application debugging using ftrace

int trace_fd = -1;

int marker_fd = -1;

int main(int argc, char *argv)

{

char *debugfs;

char path[256];

[...]

debugfs = find_debugfs();

if (debugfs) {

strcpy(path, debugfs);

strcat(path,"/tracing/tracing_on");

trace_fd = open(path, O_WRONLY);

if (trace_fd >= 0)

write(trace_fd, "1", 1);

strcpy(path, debugfs);

strcat(path,"/tracing/trace_marker");

marker_fd = open(path, O_WRONLY);

trace_marker writes to the ftrace buffer from user space and it's used to coordinate between user space and kernel space.

[tracing]# echo hello world > trace_marker

[tracing]# cat trace

# tracer: nop

#

# TASK-PID CPU# TIMESTAMP FUNCTION

# | | | | |

<...>-3718 [001] 5546.183420: 0: hello world

if (marker_fd >= 0)

write(marker_fd, "In critical area\n", 17);

if (critical_function() < 0) {

/* we failed! */

if (trace_fd >= 0)

write(trace_fd, "0", 1);

}

Stack tracer (CONFIG_STACK_TRACER)

Allows to examine the size of the kernel stack and how much stack space each function is using. It checks the stack size at every function call and records it if it exceeds the last recorded value. Note that as interrupts have their own stack it won't trace stack usage there.

stack_tracer_enabled: Enables/Disables. It has no performance cost when enabled.

stack_trace: Dumps the trace.

stack_max_size: Records the maximum stack size, can be reset with a 0.