Chinaunix首页 | 论坛 | 博客
  • 博客访问: 1665865
  • 博文数量: 607
  • 博客积分: 10031
  • 博客等级: 上将
  • 技术积分: 6633
  • 用 户 组: 普通用户
  • 注册时间: 2006-03-30 17:41
文章分类

全部博文(607)

文章存档

2011年(2)

2010年(15)

2009年(58)

2008年(172)

2007年(211)

2006年(149)

我的朋友

分类: LINUX

2006-09-20 14:47:58

Search the Catalog

Linux Device Drivers, 2nd Edition


2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95

Chapter 6
Flow of Time

Contents:



Delaying Execution



The first point we need to cover is the timer interrupt, which is the mechanism the kernel uses to keep track of time intervals. Interrupts are asynchronous events that are usually fired by external hardware; the CPU is interrupted in its current activity and executes special code (the Interrupt Service Routine, or ISR) to serve the interrupt. Interrupts and ISR implementation issues are covered in Chapter 9, "Interrupt Handling".

Every time a timer interrupt occurs, the value of the variable jiffies is incremented. jiffies is initialized to 0 when the system boots, and is thus the number of clock ticks since the computer was turned on. It is declared in as unsigned long volatile, and will possibly overflow after a long time of continuous system operation (but no platform features jiffy overflow in less than 16 months of uptime). Much effort has gone into ensuring that the kernel operates properly when jiffies overflows. Driver writers do not normally have to worry about jiffies overflows, but it is good to be aware of the possibility.

It is possible to change the value of HZ for those who want systems with a different clock interrupt frequency. Some people using Linux for hard real-time tasks have been known to raise the value of HZ to get better response times; they are willing to pay the overhead of the extra timer interrupts to achieve their goals. All in all, however, the best approach to the timer interrupt is to keep the default value for HZ, by virtue of our complete trust in the kernel developers, who have certainly chosen the best value.

The details differ from platform to platform: the register may or may not be readable from user space, it may or may not be writable, and it may be 64 or 32 bits wide -- in the latter case you must be prepared to handle overflows. Whether or not the register can be zeroed, we strongly discourage resetting it, even when hardware permits. Since you can always measure differences using unsigned variables, you can get the work done without claiming exclusive ownership of the register by modifying its current value.

These lines, for example, measure the execution of the instruction itself:

Some of the other platforms offer similar functionalities, and kernel headers offer an architecture-independent function that you can use instead of rdtsc. It is called get_cycles, and was introduced during 2.1 development. Its prototype is

We'll base the example on MIPS because most MIPS processors feature a 32-bit counter as register 9 of their internal "coprocessor 0.'' To access the register, only readable from kernel space, you can define the following macro that executes a "move from coprocessor 0'' assembly instruction:[26]

With this macro in place, the MIPS processor can execute the same code shown earlier for the x86.

What's interesting with gcc inline assembly is that allocation of general-purpose registers is left to the compiler. The macro just shown uses %0 as a placeholder for "argument 0,'' which is later specified as "any register (r) used as output (=).'' The macro also states that the output register must correspond to the C expression dest. The syntax for inline assembly is very powerful but somewhat complex, especially for architectures that have constraints on what each register can do (namely, the x86 family). The complete syntax is described in the gcc documentation, usually available in the info documentation tree.

The short C-code fragment shown in this section has been run on a K7-class x86 processor and a MIPS VR4181 (using the macro just described). The former reported a time lapse of 11 clock ticks, and the latter just 2 clock ticks. The small figure was expected, since RISC processors usually execute one instruction per clock cycle.

Kernel code can always retrieve the current time by looking at the value of jiffies. Usually, the fact that the value represents only the time since the last boot is not relevant to the driver, because its life is limited to the system uptime. Drivers can use the current value of jiffies to calculate time intervals across events (for example, to tell double clicks from single clicks in input device drivers). In short, looking at jiffies is almost always sufficient when you need to measure time intervals, and if you need very sharp measures for short time lapses, processor-specific registers come to the rescue.

It's quite unlikely that a driver will ever need to know the wall-clock time, since this knowledge is usually needed only by user programs such as cron and at. If such a capability is needed, it will be a particular case of device usage, and the driver can be correctly instructed by a user program, which can easily do the conversion from wall-clock time to the system clock. Dealing directly with wall-clock time in a driver is often a sign that policy is being implemented, and should thus be looked at closely.

If your driver really needs the current time, the do_gettimeofday function comes to the rescue. This function doesn't tell the current day of the week or anything like that; rather, it fills a struct timeval pointer -- the same as used in the gettimeofday system call -- with the usual seconds and microseconds values. The prototype for do_gettimeofday is:

The source states that do_gettimeofday has "near microsecond resolution'' for many architectures. The precision does vary from one architecture to another, however, and can be less in older kernels. The current time is also available (though with less precision) from the xtime variable (a struct timeval); however, direct use of this variable is discouraged because you can't atomically access both the timeval fields tv_sec and tv_usec unless you disable interrupts. As of the 2.2 kernel, a quick and safe way of getting the time quickly, possibly with less precision, is to call get_fast_time:

Code for reading the current time is available within the jit ("Just In Time'') module in the source files provided on the O'Reilly FTP site. jit creates a file called /proc/currentime, which returns three things in ASCII when read:

We chose to use a dynamic /proc file because it requires less module code -- it's not worth creating a whole device just to return three lines of text.

morgana% cd /proc; cat currentime currentime currentimegettime: 846157215.937221
xtime: 846157215.931188
jiffies: 1308094
gettime: 846157215.939950
xtime: 846157215.931188
jiffies: 1308094
gettime: 846157215.942465
xtime: 846157215.941188
jiffies: 1308095

Delaying Execution

Device drivers often need to delay the execution of a particular piece of code for a period of time -- usually to allow the hardware to accomplish some task. In this section we cover a number of different techniques for achieving delays. The circumstances of each situation determine which technique is best to use; we'll go over them all and point out the advantages and disadvantages of each.

If you want to delay execution by a multiple of the clock tick or you don't require strict precision (for example, if you want to delay an integer number of seconds), the easiest implementation (and the most braindead) is the following, also known as busy waiting:

The variable j in this example and the following ones is the value of jiffies at the expiration of the delay and is always calculated as just shown for busy waiting.

This loop (which can be tested by reading /proc/jitsched) still isn't optimal. The system can schedule other tasks; the current process does nothing but release the CPU, but it remains in the run queue. If it is the only runnable process, it will actually run (it calls the scheduler, which selects the same process, which calls the scheduler, which...). In other words, the load of the machine (the average number of running processes) will be at least one, and the idle task (process number 0, also called swapper for historical reasons) will never run. Though this issue may seem irrelevant, running the idle task when the computer is idle relieves the processor's workload, decreasing its temperature and increasing its lifetime, as well as the duration of the batteries if the computer happens to be your laptop. Moreover, since the process is actually executing during the delay, it will be accounted for all the time it consumes. You can see this by running time cat /proc/jitsched.

In a normal driver, execution could be resumed in either of two ways: somebody calls wake_up on the wait queue, or the timeout expires. In this particular implementation, nobody will ever call wake_up on the wait queue (after all, no other code even knows about it), so the process will always wake up when the timeout expires. That is a perfectly valid implementation, but, if there are no other events of interest to your driver, delays can be achieved in a more straightforward manner with schedule_timeout:

 
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout (jit_delay*HZ);

The previous line (for /proc/jitself) causes the process to sleep until the given time has passed. schedule_timeout, too, expects a time offset, not an absolute number of jiffies. Once again, it is worth noting that an extra time interval could pass between the expiration of the timeout and when your process is actually scheduled to execute.

Sometimes a real driver needs to calculate very short delays in order to synchronize with the hardware. In this case, using the jiffies value is definitely not the solution.

The functions are compiled inline on most supported architectures. The former uses a software loop to delay execution for the required number of microseconds, and the latter is a loop around udelay, provided for the convenience of the programmer. The udelay function is where the BogoMips value is used: its loop is based on the integer value loops_per_second, which in turn is the result of the BogoMips calculation performed at boot time.

The udelay call should be called only for short time lapses because the precision of loops_per_second is only eight bits, and noticeable errors accumulate when calculating long delays. Even though the maximum allowable delay is nearly one second (since calculations overflow for longer delays), the suggested maximum value for udelay is 1000 microseconds (one millisecond). The function mdelay helps in cases where the delay must be longer than one millisecond.

One feature many drivers need is the ability to schedule execution of some tasks at a later time without resorting to interrupts. Linux offers three different interfaces for this purpose: task queues, tasklets (as of kernel 2.3.43), and kernel timers. Task queues and tasklets provide a flexible utility for scheduling execution at a later time, with various meanings for "later''; they are most useful when writing interrupt handlers, and we'll see them again in "Tasklets and Bottom-Half Processing", in Chapter 9, "Interrupt Handling". Kernel timers are used to schedule a task to run at a specific time in the future and are dealt with in "Kernel Timers", later in this chapter.

A typical situation in which you might use task queues or tasklets is to manage hardware that cannot generate interrupts but still allows blocking read. You need to poll the device, while taking care not to burden the CPU with unnecessary operations. Waking the reading process at fixed time intervals (for example, using current->timeout) isn't a suitable approach, because each poll would require two context switches (one to run the polling code in the reading process, and one to return to a process that has real work to do), and often a suitable polling mechanism can be implemented only outside of a process's context.

A similar problem is giving timely input to a simple hardware device. For example, you might need to feed steps to a stepper motor that is directly connected to the parallel port -- the motor needs to be moved by single steps on a timely basis. In this case, the controlling process talks to your device driver to dispatch a movement, but the actual movement should be performed step by step at regular intervals after returning from write.

The preferred way to perform such floating operations quickly is to register a task for later execution. The kernel supports task queues, where tasks accumulate to be "consumed'' when the queue is run. You can declare your own task queue and trigger it at will, or you can register your tasks in predefined queues, which are run (triggered) by the kernel itself.

The "bh'' in the first comment means bottom half. A bottom half is "half of an interrupt handler''; we'll discuss this topic thoroughly when we deal with interrupts in "Tasklets and Bottom-Half Processing", in Chapter 9, "Interrupt Handling". For now, suffice it to say that a bottom half is a mechanism provided by a device driver to handle asynchronous tasks which, usually, are too large to be done while handling a hardware interrupt. This chapter should make sense without an understanding of bottom halves, but we will, by necessity, refer to them occasionally.

The most important fields in the data structure just shown are routine and data. To queue a task for later execution, you need to set both these fields before queueing the structure, while next and sync should be cleared. The sync flag in the structure is used by the kernel to prevent queueing the same task more than once, because this would corrupt the next pointer. Once the task has been queued, the structure is considered "owned'' by the kernel and shouldn't be modified until the task is run.

The other data structure involved in task queues is task_queue, which is currently just a pointer to struct tq_struct; the decision to typedef this pointer to another symbol permits the extension of task_queue in the future, should the need arise. task_queue pointers should be initialized to NULL before use.

This function is used to consume a queue of accumulated tasks. You won't need to call it yourself unless you declare and maintain your own queue.

A task queue, as we have already seen, is in practice a linked list of functions to call. When run_task_queue is asked to run a given queue, each entry in the list is executed. When you are writing functions that work with task queues, you have to keep in mind when the kernel will call run_task_queue; the exact context imposes some constraints on what you can do. You should also not make any assumptions regarding the order in which enqueued tasks are run; each of them must do its task independently of the other ones.

And when are task queues run? If you are using one of the predefined task queues discussed in the next section, the answer is "when the kernel gets around to it.'' Different queues are run at different times, but they are always run when the kernel has no other pressing work to do.

Most important, they almost certainly are not run when the process that queued the task is executing. They are, instead, run asynchronously. Until now, everything we have done in our sample drivers has run in the context of a process executing system calls. When a task queue runs, however, that process could be asleep, executing on a different processor, or could conceivably have exited altogether.

This asynchronous execution resembles what happens when a hardware interrupt happens (which is discussed in detail in Chapter 9, "Interrupt Handling"). In fact, task queues are often run as the result of a "software interrupt.'' When running in interrupt mode (or interrupt time) in this way, your code is subject to a number of constraints. We will introduce these constraints now; they will be seen again in several places in this book. Repetition is called for in this case; the rules for interrupt mode must be followed or the system will find itself in deep trouble.

A number of actions require the context of a process in order to be executed. When you are outside of process context (i.e., in interrupt mode), you must observe the following rules:

One other feature of the current implementation of task queues is that a task can requeue itself in the same queue from which it was run. For instance, a task being run from the timer tick can reschedule itself to be run on the next tick by calling queue_task to put itself on the queue again. Rescheduling is possible because the head of the queue is replaced with a NULL pointer before consuming queued tasks; as a result, a new queue is built once the old one starts executing.

The easiest way to perform deferred execution is to use the queues that are already maintained by the kernel. There are a few of these queues, but your driver can use only three of them, described in the following list. The queues are declared in , which you should include in your source.

The buffer is filled by successive runs of a task queue. Each pass through the queue appends a text string to the buffer being filled; each string reports the current time (in jiffies), the process that is current during this pass, and the return value of in_interrupt.

The code for filling the buffer is confined to the jiq_print_tq function, which executes at each run through the queue being used. The printing function is not interesting and is not worth showing here; instead, let's look at the initialization of the task to be inserted in a queue:

The scheduler queue is, in some ways, the easiest to use. Because tasks executed from this queue do not run in interrupt mode, they can do more things; in particular, they can sleep. Many parts of the kernel use this queue to accomplish a wide variety of tasks.

There are a couple of implications to the keventd implementation that are worth keeping in mind. The first is that tasks in this queue can sleep, and some kernel code takes advantage of that freedom. Well-behaved code, however, should take care to sleep only for very short periods of time, since no other tasks will be run from the scheduler queue while keventd is sleeping. It is also a good idea to keep in mind that your task shares the scheduler queue with others, which can also sleep. In normal situations, tasks placed in the scheduler queue will run very quickly (perhaps even before schedule_task returns). If some other task sleeps, though, the time that elapses before your tasks execute could be significant. Tasks that absolutely have to run within a narrow time window should use one of the other queues.

In this output, the time field is the value of jiffies when the task is run, delta is the change in jiffies since the last time the task ran, interrupt is the output of the in_interrupt function, pid is the ID of the running process, cpu is the number of the CPU being used (always 0 on uniprocessor systems), and command is the command being run by the current process.

The immediate queue is the fastest queue in the system -- it's executed soonest and is run in interrupt time. The queue is consumed either by the scheduler or as soon as one process returns from its system call. Typical output can look like this:

It's clear that the queue can't be used to delay the execution of a task -- it's an "immediate'' queue. Instead, its purpose is to execute a task as soon as possible, but at a safe time. This feature makes it a great resource for interrupt handlers, because it offers them an entry point for executing program code outside of the actual interrupt management routine. The mechanism used to receive network packets, for example, is based on a similar mechanism.

Declaring a new task queue is not difficult. A driver is free to declare a new task queue, or even several of them; tasks are queued just as we've seen with the predefined queues discussed previously.

Unlike a predefined task queue, however, a custom queue is not automatically run by the kernel. The programmer who maintains a queue must arrange for a way of running it.

 DECLARE_TASK_QUEUE(tq_custom);

 
queue_task(&custom_task, &tq_custom);

The following line will run tq_custom when it is time to execute the task-queue entries that have accumulated:

 run_task_queue(&tq_custom);

If you want to experiment with custom queues now, you need to register a function to trigger the queue in one of the predefined queues. Although this may look like a roundabout way to do things, it isn't. A custom queue can be useful whenever you need to accumulate jobs and execute them all at the same time, even if you use another queue to select that "same time.''

Shortly before the release of the 2.4 kernel, the developers added a new mechanism for the deferral of kernel tasks. This mechanism, called tasklets, is now the preferred way to accomplish bottom-half tasks; indeed, bottom halves themselves are now implemented with tasklets.

Each tasklet has associated with it a function that is called when the tasklet is to be executed. The life of some kernel developer was made easier by giving that function a single argument of type unsigned long, which makes life a little more annoying for those who would rather pass it a pointer; casting the long argument to a pointer type is a safe practice on all supported architectures and pretty common in memory management (as discussed in Chapter 13, "mmap and DMA"). The tasklet function is of type void; it returns no value.

Declares a tasklet with the given name; when the tasklet is to be executed (as described later), the given function is called with the (unsigned long) data value.

Declares a tasklet as before, but its initial state is "disabled,'' meaning that it can be scheduled but will not be executed until enabled at some future time.

This function disables the given tasklet. The tasklet may still be scheduled with tasklet_schedule, but its execution will be deferred until a time when the tasklet has been enabled again.

This function may be used on tasklets that reschedule themselves indefinitely. tasklet_kill will remove the tasklet from any queue that it is on. In order to avoid race conditions with the tasklet rescheduling itself, this function waits until the tasklet executes, then pulls it from the queue. Thus, you can be sure that tasklets will not be interrupted partway through. If, however, the tasklet is not currently running and rescheduling itself, tasklet_kill may hang. tasklet_kill may not be called in interrupt time.

The ultimate resources for time keeping in the kernel are the timers. Timers are used to schedule execution of a function (a timer handler) at a particular time in the future. They thus work differently from task queues and tasklets in that you can specify when in the future your function will be called, whereas you can't tell exactly when a queued task will be executed. On the other hand, kernel timers are similar to task queues in that a function registered in a kernel timer is executed only once -- timers aren't cyclic.

There are times when you need to execute operations detached from any process's context, like turning off the floppy motor or finishing another lengthy shutdown operation. In that case, delaying the return from close wouldn't be fair to the application program. Using a task queue would be wasteful, because a queued task must continually reregister itself until the requisite time has passed.

The timeout of a timer is a value in jiffies. Thus, timer->function will run when jiffies is equal to or greater than timer->expires. The timeout is an absolute value; it is usually generated by taking the current value of jiffies and adding the amount of the desired delay.

What can appear strange when using timers is that the timer expires at just the right time, even if the processor is executing in a system call. We suggested earlier that when a process is running in kernel space, it won't be scheduled away; the clock tick, however, is special, and it does all of its tasks independent of the current process. You can try to look at what happens when you read /proc/jitbusy in the background and /proc/jitimer in the foreground. Although the system appears to be locked solid by the busy-waiting system call, both the timer queue and the kernel timers continue running.

Thus, timers can be another source of race conditions, even on uniprocessor systems. Any data structures accessed by the timer function should be protected from concurrent access, either by being atomic types (discussed in Chapter 10, "Judicious Use of Data Types") or by using spinlocks.

current->timeout = jiffies + timeout;
interruptible_sleep_on(my_queue);

 
extern inline void schedule_timeout(int timeout)
{
current->timeout = jiffies + timeout;
current->state = TASK_INTERRUPTIBLE;
schedule();
current->timeout = 0;
}

In 2.0, there were a couple of additional functions for putting functions into task queues. queue_task_irq could be called instead of queue_task in situations in which interrupts were disabled, yielding a (very) small performance benefit. queue_task_irq_off is even faster, but does not function properly in situations in which the task is already queued or is running, and can thus only be used where those conditions are guaranteed not to occur. Neither of these two functions provided much in the way of performance benefits, and they were removed in kernel 2.1.30. Using queue_task in all cases works with all kernel versions. (It is worth noting, though, that queue_task had a return type of void in 2.2 and prior kernels.)

As has been mentioned, the 2.3 development series added the tasklet mechanism; before, only task queues were available for "immediate deferred'' execution. The bottom-half subsystem was implemented differently, though most of the changes are not visible to driver writers. We didn't emulate tasklets for older kernels in sysdep.h because they are not strictly needed for driver operation; if you want to be backward compatible you'll need to either write your own emulation or use task queues instead.

The del_timer_sync function did not exist prior to development kernel 2.4.0-test2. The usual sysdep.h header defines a minimal replacement when you build against older kernel headers. Kernel version 2.0 didn't have mod_timer, either. This gap is also filled by our compatibility header.

This chapter introduced the following symbols:

#include
HZ

The current time, as calculated at the last timer tick.

The functions return the current time; the former is very high resolution, the latter may be faster while giving coarser resolution.

Returns nonzero if the processor is currently running in interrupt mode.

The function registers a task for later execution.

Declare a tasklet structure that will call the given function (passing it the given unsigned long data) when the tasklet is executed. The second form initializes the tasklet to a disabled state, keeping it from running until it is explicitly enabled.

Causes an "infinitely rescheduling'' tasklet to cease execution. This function can block and may not be called in interrupt time.

This function is similar to del_timer, but guarantees that the function is not currently running on other CPUs.



Back to:


| | | | | |

© 2001, O'Reilly & Associates, Inc.

阅读(335) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~