Linux Device Drivers, 2nd Edition
2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 5
Enhanced Char Driver Operations
In Chapter 3, "Char Drivers", we built a complete device driver that the
user can write to and read from. But a real device usually offers more
functionality than synchronous read and
write. Now that we're equipped with debugging
tools should something go awry, we can safely go ahead and implement
new operations.
What is normally needed, in addition to reading and writing the
device, is the ability to perform various types of hardware control
via the device driver. Control operations are usually supported via
the ioctl method. The alternative is to look at
the data flow being written to the device and use special sequences as
control commands. This latter technique should be avoided because it
requires reserving some characters for controlling purposes; thus, the
data flow can't contain those characters. Moreover, this technique
turns out to be more complex to handle than
ioctl. Nonetheless, sometimes it's a useful
approach to device control and is used by ttys and other
devices. We'll describe it later in this chapter in "Device Control Without ioctl".
As we suggested in the previous chapter, the
ioctl system call offers a device-specific entry
point for the driver to handle "commands."
ioctl is device specific in that, unlike
read and other methods, it allows applications to
access features unique to the hardware being driven, such as
configuring the device and entering or exiting operating modes. These
control operations are usually not available through the read/write
file abstraction. For example, everything you write to a serial port
is used as communication data, and you cannot change the baud rate by
writing to the device. That is what ioctl is for:
controlling the I/O channel.
Another important feature of real devices (unlike
scull) is that data being read or written
is exchanged with other hardware, and some synchronization is
needed. The concepts of blocking I/O and asynchronous notification
fill the gap and are introduced in this chapter by means of a modified
scull device. The driver uses
interaction between different processes to create asynchronous events.
As with the original scull, you don't need
special hardware to test the driver's workings. We
will definitely deal with real hardware, but not
until Chapter 8, "Hardware Management".
The prototype stands out in the list of Unix system calls because of
the dots, which usually mark a function as taking a variable number of arguments.
In a real system, however, a system call can't actually have a
variable number of arguments. System calls must have a well-defined
number of arguments because user programs can access them only through
hardware "gates," as outlined in "User Space and Kernel Space" in
Chapter 2, "Building and Running Modules". Therefore, the dots in the prototype
represent not a variable number of arguments but a single optional
argument, traditionally identified as char *argp.
The dots are simply there to prevent type checking during
compilation. The actual nature of the third argument depends on the
specific control command being issued (the second argument). Some
commands take no arguments, some take an integer value, and some take
a pointer to other data. Using a pointer is the way to pass arbitrary
data to the ioctl call; the device will then be
able to exchange any amount of data with user space.
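As a user-space illustration of a command whose third argument is a pointer, the sketch below issues the standard FIONREAD request, which expects the address of an int and fills it with the number of bytes waiting to be read. The helper name demo_pending_bytes is made up for this example, and the code assumes a Linux system where the file descriptor (a pipe, for instance) supports FIONREAD.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel how many bytes are waiting on fd.
 * For FIONREAD, the third ioctl argument must be a pointer to int;
 * the kernel writes the answer through that pointer.
 */
int demo_pending_bytes(int fd)
{
    int nbytes = 0;

    if (ioctl(fd, FIONREAD, &nbytes) < 0)
        return -1;      /* errno was set by ioctl */
    return nbytes;
}
```

Calling demo_pending_bytes on the read end of a pipe right after five bytes have been written to the other end should report five pending bytes. A command that took an integer instead would receive the value itself in place of &nbytes.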
As you might imagine, most ioctl implementations
consist of a switch statement that selects the
correct behavior according to the cmd
argument. Different commands have different numeric values, which are
usually given symbolic names to simplify coding. The symbolic name is
assigned by a preprocessor definition. Custom drivers usually declare
such symbols in their header files; scull.h declares them for scull. User programs
must, of course, include that header file as well to have access to
those symbols.
The command numbers should be unique across the system in order to
prevent errors caused by issuing the right command to the wrong
device. Such a mismatch is not unlikely to happen, and a program might
find itself trying to change the baud rate of a non-serial-port input
stream, such as a FIFO or an audio device. If each
ioctl number is unique, then the application will
get an EINVAL error rather than succeeding in doing
something unintended.
To help programmers create unique ioctl command
codes, these codes have been split up into several bitfields. The
first versions of Linux used 16-bit numbers: the top eight were the
"magic" number associated with the device, and the bottom eight were
a sequential number, unique within the device. This happened because
Linus was "clueless" (his own word); a better division of bitfields
was conceived only later. Unfortunately, quite a few drivers still use
the old convention. They have to: changing the command codes would
break no end of binary programs. In our sources, however, we will use
the new command code convention exclusively.
To choose ioctl numbers for your driver according
to the new convention, you should first check
include/asm/ioctl.h and
Documentation/ioctl-number.txt. The header
defines the bitfields you will be using: type (magic number), ordinal
number, direction of transfer, and size of argument. The
ioctl-number.txt file lists the magic numbers
used throughout the kernel, so you'll be able to choose your own magic
number and avoid overlaps. The text file also lists the reasons why
the convention should be used.
-
-
-
The direction of data transfer, if the particular command involves a
data transfer. The possible values are _IOC_NONE
(no data transfer), _IOC_READ,
_IOC_WRITE, and _IOC_READ |
_IOC_WRITE (data is transferred both ways). Data transfer is
seen from the application's point of view;
_IOC_READ means reading from the device, so the driver must write to
user space. Note that the field is a bit mask, so _IOC_READ and
_IOC_WRITE can be extracted using a bitwise AND
operation.
-
The size of user data involved. The width of this field is
architecture dependent and currently ranges from 8 to 14 bits. You can
find its value for your specific architecture in the macro
_IOC_SIZEBITS. If you intend your driver to be
portable, however, you can only count on a size up to 255. It's not
mandatory that you use the size field. If you need
larger data structures, you can just ignore it. We'll see soon how
this field is used.
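The four bitfields can be inspected from user space as well, because the same encoding and decoding macros are exported through the kernel headers. The following sketch defines two made-up commands (the DEMO_ names and the demo_decode helper are assumptions for illustration, not part of scull) and splits a command number back into its fields with _IOC_TYPE, _IOC_NR, _IOC_DIR, and _IOC_SIZE.

```c
#include <linux/ioctl.h>

/* Hypothetical commands, built the same way a driver header would. */
#define DEMO_IOC_MAGIC  'k'
#define DEMO_IOCRESET   _IO(DEMO_IOC_MAGIC, 0)        /* no data transfer */
#define DEMO_IOCSVALUE  _IOW(DEMO_IOC_MAGIC, 1, int)  /* user writes an int */

struct demo_fields {
    unsigned int type;  /* magic number */
    unsigned int nr;    /* ordinal number */
    unsigned int dir;   /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int size;  /* size of the user data, in bytes */
};

/* Split a command number into the four bitfields described above. */
struct demo_fields demo_decode(unsigned int cmd)
{
    struct demo_fields f;

    f.type = _IOC_TYPE(cmd);
    f.nr   = _IOC_NR(cmd);
    f.dir  = _IOC_DIR(cmd);
    f.size = _IOC_SIZE(cmd);
    return f;
}
```

Decoding DEMO_IOCSVALUE yields the magic 'k', ordinal 1, direction _IOC_WRITE, and a size equal to sizeof(int); decoding DEMO_IOCRESET yields direction _IOC_NONE and size 0.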
Here is how some ioctl commands are defined in
scull. In particular, these commands set
and get the driver's configurable parameters.
/* Use 'k' as magic number */
#define SCULL_IOC_MAGIC  'k'

#define SCULL_IOCRESET    _IO(SCULL_IOC_MAGIC, 0)

/*
 * S means "Set" through a ptr
 * T means "Tell" directly with the argument value
 * G means "Get": reply by setting through a pointer
 * Q means "Query": response is on the return value
 * X means "eXchange": G and S atomically
 * H means "sHift": T and Q atomically
 */
#define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC,  1, scull_quantum)
#define SCULL_IOCSQSET    _IOW(SCULL_IOC_MAGIC,  2, scull_qset)
#define SCULL_IOCTQUANTUM _IO(SCULL_IOC_MAGIC,   3)
#define SCULL_IOCTQSET    _IO(SCULL_IOC_MAGIC,   4)
#define SCULL_IOCGQUANTUM _IOR(SCULL_IOC_MAGIC,  5, scull_quantum)
#define SCULL_IOCGQSET    _IOR(SCULL_IOC_MAGIC,  6, scull_qset)
#define SCULL_IOCQQUANTUM _IO(SCULL_IOC_MAGIC,   7)
#define SCULL_IOCQQSET    _IO(SCULL_IOC_MAGIC,   8)
#define SCULL_IOCXQUANTUM _IOWR(SCULL_IOC_MAGIC, 9, scull_quantum)
#define SCULL_IOCXQSET    _IOWR(SCULL_IOC_MAGIC,10, scull_qset)
#define SCULL_IOCHQUANTUM _IO(SCULL_IOC_MAGIC,  11)
#define SCULL_IOCHQSET    _IO(SCULL_IOC_MAGIC,  12)
#define SCULL_IOCHARDRESET _IO(SCULL_IOC_MAGIC, 15) /* debugging tool */

#define SCULL_IOC_MAXNR 15
The "exchange" and "shift" operations are not particularly useful
for scull. We implemented "exchange" to
show how the driver can combine separate operations into a single
atomic one, and "shift" to pair "tell" and
"query." There are times when atomic test-and-set
operations like these are needed, in particular, when applications
need to set or release locks.
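The difference between the two combined operations lies in how the values travel. The following user-space model shows only the calling conventions; the demo_ names and the initial value are arbitrary, and the model makes no attempt at atomicity, which in a real driver would depend on proper locking around the shared variable.

```c
/* Stand-in for a driver parameter; the initial value is arbitrary. */
static int demo_quantum = 4000;

/*
 * eXchange: the argument is a pointer. The new value is read through it,
 * and the previous value is handed back through the same pointer.
 */
int demo_exchange(int *argp)
{
    int old = demo_quantum;

    demo_quantum = *argp;
    *argp = old;
    return 0;
}

/*
 * sHift: the argument is the new value itself, and the previous value
 * comes back as the return value.
 */
int demo_shift(int newval)
{
    int old = demo_quantum;

    demo_quantum = newval;
    return old;
}
```

Because "shift" reports the old value as the return value, it only works when that value can never be mistaken for a negative error code.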
The value of the ioctl cmd
argument is not currently used by the kernel, and it's quite unlikely
it will be in the future. Therefore, you could, if you were feeling
lazy, avoid the complex declarations shown earlier and explicitly
declare a set of scalar numbers. On the other hand, if you did, you
wouldn't benefit from using the bitfields. The header
is an example of this
old-fashioned approach, using 16-bit scalar values to define the
ioctl commands. That source file relied on
scalar numbers because it used the technology then available, not out
of laziness. Changing it now would be a gratuitous incompatibility.
The implementation of ioctl is usually a
switch statement based on the command number. But
what should the default selection be when the
command number doesn't match a valid operation? The question is
controversial. Several kernel functions return
-EINVAL ("Invalid argument"), which makes sense
because the command argument is indeed not a valid one. The POSIX
standard, however, states that if an inappropriate
ioctl command has been issued, then
-ENOTTY should be returned. The string associated
with that value used to be "Not a typewriter" under all libraries up
to and including libc5. Only
libc6 changed the message to
"Inappropriate ioctl for device," which looks more to the
point. Because most recent Linux systems are
libc6 based, we'll stick to the standard
and return -ENOTTY. It's still pretty common,
though, to return -EINVAL in response to an invalid
ioctl command.
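The ENOTTY convention is easy to observe from user space by sending a terminal-only command to something that is not a terminal. The sketch below assumes a Linux system and uses the standard TIOCGWINSZ request (query the terminal window size); the helper name demo_winsize_errno is invented for the example.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Issue a terminal-only command on fd. Returns 0 if the command was
 * accepted, or the errno value reported by the kernel otherwise.
 */
int demo_winsize_errno(int fd)
{
    struct winsize ws;

    if (ioctl(fd, TIOCGWINSZ, &ws) == 0)
        return 0;
    return errno;
}
```

When fd refers to a pipe, which has no notion of a window size, the helper is expected to return ENOTTY rather than EINVAL.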
Though the ioctl system call is most often used
to act on devices, a few commands are recognized by the kernel. Note
that these commands, when applied to your device, are decoded
before your own file operations are called. Thus,
if you choose the same number for one of your
ioctl commands, you won't ever see any request
for that command, and the application will get something unexpected
because of the conflict between the ioctl numbers.
Commands in the last group are executed by the implementation of the
hosting filesystem (see the chattr command). Device driver writers are interested only in the first group
of commands, whose magic number is "T." Looking at the workings of
the other groups is left to the reader as an exercise;
ext2_ioctl is a most interesting function (though
easier than you may expect), because it implements the append-only
flag and the immutable flag.
-
Set the close-on-exec flag (File IOctl CLose on EXec). Setting this
flag will cause the file descriptor to be closed when the calling
process executes a new program.
-
-
Set or reset asynchronous notification for the file (as discussed in
"Asynchronous Notification" later in this chapter). Note that kernel
versions up to Linux 2.2.4 incorrectly used this command to modify the
O_SYNC flag. Since both actions can be accomplished
in other ways, nobody actually uses the FIOASYNC
command, which is reported here only for completeness.
-
The last item in the list introduced a new system call,
fcntl, which looks like
ioctl. In fact, the fcntl call is very similar to ioctl in that it gets a
command argument and an extra (optional) argument. It is kept separate
from ioctl mainly for historical reasons: when
Unix developers faced the problem of controlling I/O operations, they
decided that files and devices were different. At the time, the only
devices with ioctl implementations were ttys,
which explains why -ENOTTY is the standard reply
for an incorrect ioctl command. Things have
changed, but fcntl remains in the name of
backward compatibility.
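The overlap between the two calls can be seen with the close-on-exec flag, which can be set through ioctl and read back through fcntl. This is a minimal user-space sketch assuming a Linux system; the demo_ helper names are made up for the example.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Set close-on-exec with the FIOCLEX ioctl command (no third argument). */
int demo_cloexec_via_ioctl(int fd)
{
    return ioctl(fd, FIOCLEX);
}

/*
 * Read the same flag back through fcntl.
 * Returns 1 if set, 0 if clear, -1 on error.
 */
int demo_cloexec_is_set(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

On a freshly created pipe descriptor, demo_cloexec_is_set reports 0 before the ioctl call and 1 after it, which shows that both system calls act on the same per-descriptor state.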
Another point we need to cover before looking at the
ioctl code for the
scull driver is how to use the extra
argument. If it is an integer, it's easy: it can be used directly. If
it is a pointer, however, some care must be taken.
When a pointer is used to refer to user space, we must ensure that the
user address is valid and that the corresponding page is currently
mapped. If kernel code tries to access an out-of-range address, the
processor issues an exception. Exceptions in kernel code are turned to
oops messages by every Linux kernel up through
2.0.x; version 2.1 and later handle the problem
more gracefully. In any case, it's the driver's responsibility to make
proper checks on every user-space address it uses and to return an
error if it is invalid.
There are a couple of interesting things to note about
access_ok. First is that it does not do the
complete job of verifying memory access; it only checks to see that
the memory reference is in a region of memory that the process might
reasonably have access to. In particular,
access_ok ensures that the address does not point
to kernel-space memory. Second, most driver code need not actually
call access_ok. The memory-access routines
described later take care of that for you. We will nonetheless
demonstrate its use so that you can see how it is done, and for
backward compatibility reasons that we will get into toward the end of
the chapter.
The scull source exploits the bitfields in
the ioctl number to check the arguments before
the switch:
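A minimal user-space sketch of this kind of check (not the scull listing itself) is shown here. It assumes a hypothetical magic number and maximum ordinal, named DEMO_IOC_MAGIC and DEMO_IOC_MAXNR, and rejects any command whose type or number field does not belong to the driver before any command-specific code would run.

```c
#include <errno.h>
#include <linux/ioctl.h>

/* Hypothetical values standing in for a driver's own definitions. */
#define DEMO_IOC_MAGIC  'k'
#define DEMO_IOC_MAXNR  15

/*
 * Validate the bitfields of a command number.
 * Returns 0 if the command could belong to this driver, -ENOTTY otherwise.
 */
int demo_check_cmd(unsigned int cmd)
{
    if (_IOC_TYPE(cmd) != DEMO_IOC_MAGIC)
        return -ENOTTY;     /* wrong magic: command meant for another device */
    if (_IOC_NR(cmd) > DEMO_IOC_MAXNR)
        return -ENOTTY;     /* ordinal number out of range */
    return 0;
}
```

With a check like this in front of it, the default case of the switch statement can only be reached by an ordinal inside the valid range that the driver does not implement.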
-
-
Access to a device is controlled by the permissions on the device
file(s), and the driver is not normally involved in permissions
checking. There are situations, however, where any user is granted
read/write permission on the device, but some other operations should
be denied. For example, not all users of a tape drive should be able
to set its default block size, and the ability to work with a disk
device does not mean that the user can reformat the drive. In cases
like these, the driver must perform additional checks to be sure that
the user is capable of performing the requested operation.
Unix systems have traditionally restricted privileged operations to
the superuser account. Privilege is an all-or-nothing thing -- the
superuser can do absolutely anything, but all other users are highly
restricted. The Linux kernel as of version 2.2 provides a more
flexible system called capabilities. A
capability-based system leaves the all-or-nothing mode behind and
breaks down privileged operations into separate subgroups. In this
way, a particular user (or program) can be empowered to perform a
specific privileged operation without giving away the ability to
perform other, unrelated operations. Capabilities are still little
used in user space, but kernel code uses them almost exclusively.
The full set of capabilities can be found in
. A subset of those
capabilities that might be of interest to device driver writers
includes the following:
-
-
-
-
The ability to perform "raw" I/O operations. Examples include
accessing device ports or communicating directly with USB devices.
-
-
Before performing a privileged operation, a device driver should check
that the calling process has the appropriate capability with the
capable function (defined in
):
In the scull sample driver, any user is
allowed to query the quantum and quantum set sizes. Only privileged
users, however, may change those values, since inappropriate values
could badly affect system performance. When needed, the
scull implementation of
ioctl checks a user's privilege level as follows:
The scull implementation of
ioctl only transfers the configurable parameters
of the device and turns out to be as easy as the following:
switch(cmd) {

#ifdef SCULL_DEBUG
case SCULL_IOCHARDRESET:
    /*
     * reset the counter to 1, to allow unloading in case
     * of problems. Use 1, not 0, because the invoking
     * process has the device open.
     */
    while (MOD_IN_USE)
        MOD_DEC_USE_COUNT;
    MOD_INC_USE_COUNT;
    /* don't break: fall through and reset things */
#endif /* SCULL_DEBUG */

case SCULL_IOCRESET:
    scull_quantum = SCULL_QUANTUM;
    scull_qset = SCULL_QSET;
    break;

case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    ret = __get_user(scull_quantum, (int *)arg);
    break;

case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    scull_quantum = arg;
    break;

case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
    ret = __put_user(scull_quantum, (int *)arg);
    break;

case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
    return scull_quantum;

case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    ret = __get_user(scull_quantum, (int *)arg);
    if (ret == 0)
        ret = __put_user(tmp, (int *)arg);
    break;

case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    scull_quantum = arg;
    return tmp;

default:  /* redundant, as cmd was checked against MAXNR */
    return -ENOTTY;
}
return ret;
scull also includes six entries that act on
scull_qset. These entries are identical to the ones
for scull_quantum and are not worth showing in
print.
Sometimes controlling the device is better accomplished by writing
control sequences to the device itself. This technique is used, for
example, in the console driver, where so-called escape sequences
are used to move the cursor, change the default color, or perform other
configuration tasks. The benefit of implementing device control this
way is that the user can control the device just by writing data,
without needing to use (or sometimes write) programs built just
for configuring the device.
For example, the setterm program acts on
the console (or another terminal) configuration by printing escape
sequences. This behavior has the advantage of permitting the remote
control of devices. The controlling program can live on a different
computer than the controlled device, because a simple redirection of
the data stream does the configuration job. You're already used to
this with ttys, but the technique is more general.
The drawback of controlling by printing is that it adds policy
constraints to the device; for example, it is viable only if you are
sure that the control sequence can't appear in the data being written
to the device during normal operation. This is only partly true for
ttys. Although a text display is meant to display only ASCII
characters, sometimes control characters can slip through in the data
being written and can thus affect the console setup. This can happen,
for example, when you issue grep on a
binary file; the extracted lines can contain anything, and you often
end up with the wrong font on your
console.
Controlling by write is definitely the way to go
for those devices that don't transfer data but just respond to
commands, such as robotic devices.
For instance, a driver written for fun by one of your authors moves a
camera on two axes. In this driver, the "device" is simply a pair of
old stepper motors, which can't really be read from or written to. The
concept of "sending a data stream" to a stepper motor makes little
or no sense. In this case, the driver interprets what is being
written as ASCII commands and converts the requests to sequences of
impulses that manipulate the stepper motors. The idea is similar,
somewhat, to the AT commands you send to the modem in order to set up
communication, the main difference being that the serial port used to
communicate with the modem must transfer real data as well. The
advantage of direct device control is that you can use
cat to move the camera without writing and
compiling special code to issue the ioctl calls.
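To make the idea concrete, here is a small user-space sketch of how written text might be turned into motor requests. The command format is purely hypothetical (an axis letter followed by a signed step count, such as "X 10" or "Y -3") and is not taken from the driver mentioned above; the function name demo_parse_move is likewise an assumption.

```c
#include <stdio.h>

/*
 * Parse one NUL-terminated command line in the hypothetical format
 * "<axis> <steps>", where axis is X or Y and steps is a signed integer.
 * On success, stores the requested movement in *dx and *dy and returns 0.
 * Returns -1 if the line is not a valid command.
 */
int demo_parse_move(const char *line, int *dx, int *dy)
{
    char axis;
    int steps;

    if (sscanf(line, " %c %d", &axis, &steps) != 2)
        return -1;

    *dx = 0;
    *dy = 0;
    switch (axis) {
    case 'X': case 'x':
        *dx = steps;
        break;
    case 'Y': case 'y':
        *dy = steps;
        break;
    default:
        return -1;      /* unknown axis */
    }
    return 0;
}
```

With a format like this, a command such as echo "X 10" redirected to the device node would be enough to request a movement. A real write method would also have to copy the data from user space and terminate the string itself, since data passed to write is not NUL-terminated.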
Whenever a process must wait for an event (such as the arrival of data
or the termination of a process), it should go to sleep. Sleeping
causes the process to suspend execution, freeing the processor for
other uses. At some future time, when the event being waited for
occurs, the process will be woken up and will continue with its
job. This section discusses the 2.4 machinery for putting a process to
sleep and waking it up. Earlier versions are discussed in "Backward Compatibility" later in this chapter.
-
-
The interruptible variant works just like
sleep_on, except that the sleep can be
interrupted by a signal. This is the form that device driver writers
have been using for a long time, before
wait_event_interruptible (described later)
appeared.
-
-
Of course, sleeping is only half of the problem; something, somewhere will
have to wake the process up again. When a device driver sleeps directly,
there is usually code in another part of the driver that performs the wakeup,
once it knows that the event has occurred. Typically a driver will wake up
sleepers in its interrupt handler once new data has arrived. Other
scenarios are possible, however.
-
-
-
Normally, a wake_up call can cause an immediate
reschedule to happen, meaning that other processes might run before
wake_up returns. The "synchronous" variants
instead make any awakened processes runnable, but do not reschedule
the CPU. This is used to avoid rescheduling when the current process
is known to be going to sleep, thus forcing a reschedule anyway. Note
that awakened processes could run immediately on a different
processor, so these functions should not be expected to provide mutual
exclusion.
As an example of wait queue usage, imagine you want to put a process to
sleep when it reads your device and awaken it when someone else writes to
the device. The following code does just that:
DECLARE_WAIT_QUEUE_HEAD(wq);

ssize_t sleepy_read (struct file *filp, char *buf, size_t count,
    loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
           current->pid, current->comm);
    interruptible_sleep_on(&wq);
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char *buf, size_t count,
    loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
           current->pid, current->comm);
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid retrial */
}
The code for this device is available as sleepy in the example
programs and can be tested using cat and
input/output redirection, as usual.
An important thing to remember with wait queues is that being woken up
does not guarantee that the event you were waiting for has occurred; a
process can be woken for other reasons, mainly because it received a
signal. Any code that sleeps should do so in a loop that tests the
condition after returning from the sleep, as discussed in "A Sample Implementation: scullpipe" later in this chapter.
The previous discussion is all that most driver writers will need to
know to get their job done. Some, however, will want to dig
deeper. This section attempts to get the curious started; everybody
else can skip to the next section without missing much that is
important.
void simplified_sleep_on(wait_queue_head_t *queue)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE;

    add_wait_queue(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
One other reason for calling the scheduler explicitly, however, is to
do exclusive waits. There can be situations in
which several processes are waiting on an event; when
wake_up is called, all of those processes will
try to execute. Suppose that the event signifies the arrival of an
atomic piece of data. Only one process will be able to read that data;
all the rest will simply wake up, see that no data is available, and
go back to sleep.
For this reason, the 2.3 development series added the concept of an
exclusive sleep. If processes sleep in an
exclusive mode, they are telling the kernel to wake only one of
them. The result is improved performance in some situations.
void simplified_sleep_exclusive(wait_queue_head_t *queue)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;

    add_wait_queue_exclusive(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
You need to make reentrant any function that matches either of two
conditions. First, if it calls schedule, possibly
by calling sleep_on or
wake_up. Second, if it copies data to or from
user space, because access to user space might page-fault, and the
process will be put to sleep while the kernel deals with the missing
page. Every function that calls any such functions must be reentrant
as well. For example, if sample_read calls
sample_getdata, which in turn can block, then
sample_read must be reentrant as well as
sample_getdata, because nothing prevents another
process from calling it while it is already executing on behalf of a
process that went to sleep.
Both these statements assume that there are both input and output
buffers; in practice, almost every device driver has them. The input
buffer is required to avoid losing data that arrives when nobody is
reading. In contrast, data can't be lost on
write, because if the system call doesn't accept
data bytes, they remain in the user-space buffer. Even so, the output
buffer is almost always useful for squeezing more performance out of
the hardware.
The performance gain of implementing an output buffer in the driver
results from the reduced number of context switches and
user-level/kernel-level transitions. Without an output buffer
(assuming a slow device), only one or a few characters are accepted by
each system call, and while one process sleeps in
write, another process runs (that's one context
switch). When the first process is awakened, it resumes (another
context switch), write returns (kernel/user
transition), and the process reiterates the system call to write more
data (user/kernel transition); the call blocks, and the loop
continues. If the output buffer is big enough, the
write call succeeds on the first
attempt -- the buffered data will be pushed out to the device
later, at interrupt time -- without control needing to go back to
user space for a second or third write call. The
choice of a suitable size for the output buffer is clearly device
specific.
We didn't use an input buffer in scull,
because data is already available when read is
issued. Similarly, no output buffer was used, because data is simply
copied to the memory area associated with the device. Essentially,
the device is a buffer, so the implementation of
additional buffers would be superfluous. We'll see the use of buffers
in Chapter 9, "Interrupt Handling", in the section titled "Interrupt-Driven I/O".
Naturally, O_NONBLOCK is meaningful in the
open method also. This happens when the call can
actually block for a long time; for example, when opening a FIFO that
has no writers (yet), or accessing a disk file with a pending lock.
Usually, opening a device either succeeds or fails, without the need
to wait for external events. Sometimes, however, opening the device
requires a long initialization, and you may choose to support
O_NONBLOCK in your open method
by returning immediately with -EAGAIN ("try it
again") if the flag is set, after initiating device
initialization. The driver may also implement a blocking
open to support access policies in a way
similar to file locks. We'll see one such implementation in the
section "Blocking open as an Alternative to EBUSY" later in this chapter.
Some drivers may also implement special semantics for
O_NONBLOCK; for example, an open of a tape device
usually blocks until a tape has been inserted. If the tape drive is
opened with O_NONBLOCK, the open succeeds
immediately regardless of whether the media is present or not.
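The FIFO case mentioned above can be tried from user space. The sketch below, which assumes a POSIX system with a writable /tmp directory, creates a FIFO that has no writers and opens it for reading with O_NONBLOCK; the helper name demo_open_fifo_nonblock and the temporary path are made up for the example. Without the flag, the same open call would block until a writer appeared.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a FIFO in a fresh temporary directory and open it for reading
 * in nonblocking mode. Returns the open file descriptor, or -1 on error.
 * The FIFO and its directory are removed again before returning; the
 * descriptor stays valid until the caller closes it.
 */
int demo_open_fifo_nonblock(void)
{
    char dir[] = "/tmp/demo-fifo-XXXXXX";
    char path[64];
    int fd;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/fifo", dir);
    if (mkfifo(path, 0600) < 0) {
        rmdir(dir);
        return -1;
    }

    /* No writer exists yet, but O_NONBLOCK makes open return at once. */
    fd = open(path, O_RDONLY | O_NONBLOCK);

    unlink(path);
    rmdir(dir);
    return fd;
}
```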
The /dev/scullpipe devices (there are four of
them by default) are part of the scull module and are used to show how blocking I/O is implemented.
Within a driver, a process blocked in a read call
is awakened when data arrives; usually the hardware issues an
interrupt to signal such an event, and the driver awakens waiting
processes as part of handling the interrupt. The
scull driver works differently, so that it
can be run without requiring any particular hardware or an interrupt
handler. We chose to use another process to generate the data and wake
the reading process; similarly, reading processes are used to wake
sleeping writer processes. The resulting implementation is similar to
that of a FIFO (or named pipe) filesystem node, whence the name.
The device driver uses a device structure that embeds two wait
queues and a buffer. The size of the buffer is configurable in the
usual ways (at compile time, load time, or runtime).
typedef struct Scull_Pipe {
    wait_queue_head_t inq, outq;       /* read and write queues */
    char *buffer, *end;                /* begin of buf, end of buf */
    int buffersize;                    /* used in pointer arithmetic */
    char *rp, *wp;                     /* where to read, where to write */
    int nreaders, nwriters;            /* number of openings for r/w */
    struct fasync_struct *async_queue; /* asynchronous readers */
    struct semaphore sem;              /* mutual exclusion semaphore */
    devfs_handle_t handle;             /* only used if devfs is there */
} Scull_Pipe;
ssize_t scull_p_read (struct file *filp, char *buf, size_t count,
                loff_t *f_pos)
{
    Scull_Pipe *dev = filp->private_data;

    if (f_pos != &filp->f_pos) return -ESPIPE;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (dev->rp == dev->wp) { /* nothing to read */
        up(&dev->sem); /* release the lock */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        /* otherwise loop, but first reacquire the lock */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, dev->wp - dev->rp);
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, dev->end - dev->rp);
    if (copy_to_user(buf, dev->rp, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */
    up (&dev->sem);

    /* finally, awaken any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}
Note also, once again, the use of semaphores to protect critical
regions of the code. The scull code has to
be careful to avoid going to sleep when it holds a
semaphore -- otherwise, writers would never be able to add data, and
the whole thing would deadlock. This code uses
wait_event_interruptible to wait for data if need
be; it has to check for available data again after the wait,
though. Somebody else could grab the data between when we wake up and
when we get the semaphore back.
It's worth repeating that a process can go to sleep both when it calls
schedule, either directly or indirectly, and when
it copies data to or from user space. In the latter case the process
may sleep if the user array is not currently present in main
memory. If scull sleeps while copying data
between kernel and user space, it will sleep with the device semaphore
held. Holding the semaphore in this case is justified since it will
not deadlock the system, and since it is important that the device
memory array not change while the driver sleeps.
The implementation for write is quite similar to
that for read (and, again, its first line will be
explained later). Its only "peculiar" feature is that it never
completely fills the buffer, always leaving a hole of at least one
byte. Thus, when the buffer is empty, wp and
rp are equal; when there is data there, they are
always different.
static inline int spacefree(Scull_Pipe *dev)
{
if (dev->rp == dev->wp)
return dev->buffersize - 1;
return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}
ssize_t scull_p_write(struct file *filp, const char *buf, size_t count,
loff_t *f_pos)
{
Scull_Pipe *dev = filp->private_data;
if (f_pos != &filp->f_pos) return -ESPIPE;
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
/* Make sure there's space to write */
while (spacefree(dev) == 0) { /* full */
up(&dev->sem);
if (filp->f_flags & O_NONBLOCK)
return -EAGAIN;
PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
if (wait_event_interruptible(dev->outq, spacefree(dev) > 0))
return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
}
/* ok, space is there, accept something */
count = min(count, spacefree(dev));
if (dev->wp >= dev->rp)
count = min(count, dev->end - dev->wp); /* up to end-of-buffer */
else /* the write pointer has wrapped, fill up to rp-1 */
count = min(count, dev->rp - dev->wp - 1);
PDEBUG("Going to accept %li bytes to %p from %p\n",
(long)count, dev->wp, buf);
if (copy_from_user(dev->wp, buf, count)) {
up (&dev->sem);
return -EFAULT;
}
dev->wp += count;
if (dev->wp == dev->end)
dev->wp = dev->buffer; /* wrapped */
up(&dev->sem);
/* finally, awaken any reader */
wake_up_interruptible(&dev->inq); /* blocked in read() and select() */
/* and signal asynchronous readers, explained later in Chapter 5 */
if (dev->async_queue)
kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
return count;
}
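The pointer arithmetic in spacefree and scull_p_write is easier to check with small numbers. The following user-space sketch models the same rules with integer offsets instead of kernel pointers. The names struct ring, ring_spacefree, and ring_write_chunk are hypothetical and exist only for this illustration; they are not part of the scull sources.

```c
#include <stddef.h>

/* Hypothetical user-space model of the scullpipe buffer: rp and wp are
 * offsets into a buffer of "size" bytes rather than kernel pointers. */
struct ring {
    size_t size;   /* plays the role of dev->buffersize */
    size_t rp;     /* read offset */
    size_t wp;     /* write offset */
};

/* Same rule as spacefree(): one byte is always left unused, so that
 * rp == wp can only mean "empty". */
static size_t ring_spacefree(const struct ring *r)
{
    if (r->rp == r->wp)
        return r->size - 1;
    return (r->rp + r->size - r->wp) % r->size - 1;
}

/* Model a single write call: returns how many of "count" bytes are
 * accepted. No data is copied; only the offsets move. */
static size_t ring_write_chunk(struct ring *r, size_t count)
{
    size_t limit = ring_spacefree(r);

    if (count > limit)
        count = limit;
    if (r->wp >= r->rp)
        limit = r->size - r->wp;      /* up to end-of-buffer */
    else
        limit = r->rp - r->wp - 1;    /* wrapped: fill up to rp-1 */
    if (count > limit)
        count = limit;

    r->wp += count;
    if (r->wp == r->size)
        r->wp = 0;                    /* wrapped */
    return count;
}
```

With an 8-byte buffer, an empty ring accepts at most 7 bytes. If the reader then consumes 3 bytes (rp becomes 3), the next write accepts only 1 byte because it stops at the end of the buffer, and a further write accepts the remaining 2. A single call can therefore accept less than the free space, which is why callers must be prepared for partial writes.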
The device, as we conceived it, doesn't implement blocking
open and is simpler than a real FIFO. If you want
to look at the real thing, you can find it in
fs/pipe.c, in the kernel sources.
To test the blocking operation of the
scullpipe device, you can run some programs
on it, using input/output redirection as usual. Testing nonblocking
activity is trickier, because the conventional programs don't perform
nonblocking operations. The misc-progs source
directory contains the following simple program, called
nbtest, for testing nonblocking
operations. All it does is copy its input to its output, using
nonblocking I/O and delaying between retrials. The delay time is
passed on the command line and is one second by default.
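A program in the spirit of nbtest can be sketched as follows. This is a minimal illustration written for this discussion, not the listing from misc-progs; the helper names set_nonblocking and nb_copy are our own, and a real program would call nb_copy on its standard input and output with the delay taken from the command line.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put a descriptor into nonblocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Copy in_fd to out_fd with nonblocking I/O, sleeping "delay" seconds
 * before each retry. Returns the number of bytes copied at end-of-file,
 * or -1 on a real error. */
static long nb_copy(int in_fd, int out_fd, unsigned int delay)
{
    char buf[4096];
    long total = 0;

    if (set_nonblocking(in_fd) < 0 || set_nonblocking(out_fd) < 0)
        return -1;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        ssize_t done = 0;

        if (n == 0)
            return total;                 /* end-of-file */
        if (n < 0) {
            if (errno == EAGAIN || errno == EINTR) {
                sleep(delay);             /* no data yet: retry later */
                continue;
            }
            return -1;
        }
        while (done < n) {
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));

            if (w < 0) {
                if (errno == EAGAIN || errno == EINTR) {
                    sleep(delay);         /* output full: retry later */
                    continue;
                }
                return -1;
            }
            done += w;
        }
        total += n;
    }
}
```

The point of the sketch is the handling of EAGAIN: a nonblocking read or write that cannot proceed is not an error, only a request to try again later.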
Support for either system call, poll or select, requires a dedicated method
in the device driver. In version 2.0 of the kernel the device method was
modeled on select (and no
poll was available to user programs); from
version 2.1.23 onward both were offered, and the device method was
based on the newly introduced poll system call
because poll offered more detailed control than
select.
The driver's method will be called whenever the user-space program
performs a poll or select system call involving a file descriptor associated with the
driver. The device method is in charge of these two steps:
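- Call poll_wait on one or more wait queues that could indicate a change in the poll status.
- Return a bit mask describing the operations that could be performed immediately, without blocking.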
The second task performed by the poll method is returning the bit mask
describing which operations could be completed immediately; this is
also straightforward. For example, if the device has data available, a
read would complete without sleeping; the
poll method should indicate this state of
affairs. Several flags (defined in
<linux/poll.h>) are used to indicate the
possible operations:
- POLLIN
This bit must be set if the device can be read without blocking.
- POLLRDNORM
This bit must be set if "normal" data is available for reading. A
readable device returns (POLLIN | POLLRDNORM).
- POLLRDBAND
This bit indicates that out-of-band data is available for reading from
the device. It is currently used only in one place in the Linux kernel
(the DECnet code) and is not generally applicable to device drivers.
- POLLPRI
High-priority data (out-of-band) can be read without blocking. This
bit causes select to report that an exception
condition occurred on the file, because select reports out-of-band data as an exception condition.
- POLLHUP
When a process reading this device sees end-of-file, the driver must
set POLLHUP (hang-up). A process calling
select will be told that the device is readable,
as dictated by the select functionality.
- POLLERR
An error condition has occurred on the device. When
poll is invoked, the device is reported as both
readable and writable, since both read and
write will return an error code without blocking.
- POLLOUT
This bit is set in the return value if the device can be written to
without blocking.
- POLLWRNORM
This bit has the same meaning as POLLOUT, and
sometimes it actually is the same number. A writable device returns
(POLLOUT | POLLWRNORM).
- POLLWRBAND
Like POLLRDBAND, this bit means that data with
nonzero priority can be written to the device. Only the datagram
implementation of poll uses this bit, since a
datagram can transmit out-of-band data.
It's worth noting that POLLRDBAND and
POLLWRBAND are meaningful only with file
descriptors associated with sockets: device drivers won't normally use
these flags.
The description of poll takes up a lot of space
for something that is relatively simple to use in practice. Consider
the scullpipe implementation of the
poll method:
unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
Scull_Pipe *dev = filp->private_data;
unsigned int mask = 0;
/*
* The buffer is circular; it is considered full
* if "wp" is right behind "rp". "left" is 0 if the
* buffer is empty, and it is "1" if it is completely full.
*/
int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
poll_wait(filp, &dev->inq, wait);
poll_wait(filp, &dev->outq, wait);
if (dev->rp != dev->wp) mask |= POLLIN | POLLRDNORM; /* readable */
if (left != 1) mask |= POLLOUT | POLLWRNORM; /* writable */
return mask;
}
This code simply adds the two scullpipe wait queues to the poll_table, then sets the
appropriate mask bits depending on whether data can be read or
written.
The poll code as shown is missing end-of-file
support. The poll method should return
POLLHUP when the device is at the end of the
file. If the caller used the select system call,
the file will be reported as readable; in both cases the application
will know that it can actually issue the read without waiting forever, and the read method will
return 0 to signal end-of-file.
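The end-of-file convention can be observed from user space on an ordinary pipe. The following sketch assumes Linux behavior for pipes; poll_read_events is a hypothetical helper introduced only for this example.

```c
#include <poll.h>
#include <unistd.h>

/* Ask poll() about readability of one descriptor without waiting.
 * Returns the revents mask, or -1 if poll() itself failed. */
static int poll_read_events(int fd)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) < 0)    /* timeout 0: report and return */
        return -1;
    return pfd.revents;
}
```

On a pipe whose write end is still open and which holds no data, the mask is 0. Once every writer has closed its end, the mask includes POLLHUP, and a subsequent read returns 0 immediately instead of blocking.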
With real FIFOs, for example, the reader sees an end-of-file when all
the writers close the file, whereas in
scullpipe the reader never sees
end-of-file. The behavior is different because a FIFO is intended to
be a communication channel between two processes, while
scullpipe is a trashcan where everyone can
put data as long as there's at least one reader. Moreover, it makes no
sense to reimplement what is already available in the kernel.
Implementing end-of-file in the same way as FIFOs do would mean
checking dev->nwriters, both in
read and in poll, and
reporting end-of-file (as just described) if no process has the device
opened for writing. Unfortunately, though, if a reader opened the
scullpipe device before the writer, it
would see end-of-file without having a chance to wait for data. The
best way to fix this problem would be to implement blocking within
open; this task is left as an exercise for the
reader.
The purpose of the poll and
select calls is to determine in advance if an I/O
operation will block. In that respect, they complement
read and write. More
important, poll and select are useful because they let the application wait simultaneously for
several data streams, although we are not exploiting this feature in
the scull examples.
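Waiting on several data streams at once looks like this from the application side. The sketch below uses the standard poll system call on two descriptors; wait_first_readable is a hypothetical name, and the descriptors could be two scullpipe devices as well as pipes or sockets.

```c
#include <poll.h>

/* Wait up to timeout_ms for either descriptor to become readable.
 * Returns 0 or 1 (the position of the first ready descriptor),
 * or -1 on timeout or error. */
static int wait_first_readable(int fd_a, int fd_b, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = fd_a, .events = POLLIN },
        { .fd = fd_b, .events = POLLIN },
    };
    int i;

    if (poll(fds, 2, timeout_ms) <= 0)
        return -1;                       /* timeout (0) or error (-1) */

    for (i = 0; i < 2; i++)
        if (fds[i].revents & (POLLIN | POLLHUP))
            return i;                    /* data or end-of-file pending */
    return -1;
}
```

A single call sleeps on both streams; each driver behind the descriptors only has to report its own state through its poll method.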
-
If there is space in the output buffer, write should return without delay. It can accept less data than the call
requested, but it must accept at least one byte. In this case,
poll reports that the device is writable.
-
If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is
set, write returns immediately with a return
value of -EAGAIN (older System V Unices returned
0). In these cases poll should report that the
file is not writable. If, on the other hand, the device is not able to
accept any more data, write returns
-ENOSPC ("No space left on device"),
independently of the setting of O_NONBLOCK.
-
Never make a write call wait for data
transmission before returning, even if O_NONBLOCK
is clear. This is because many applications use
select to find out whether a
write will block. If the device is reported as
writable, the call must consistently not block. If the program using
the device wants to ensure that the data it enqueues in the output
buffer is actually transmitted, the driver must provide an
fsync method. For instance, a removable device
should have an fsync entry point.
Although these are a good set of general rules, one should also
recognize that each device is unique and that sometimes the rules must
be bent slightly. For example, record-oriented devices (such as tape
drives) cannot execute partial writes.
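The full-buffer rule is visible to applications as an EAGAIN error. As a minimal sketch, the following user-space helper fills a descriptor in nonblocking mode until the kernel refuses more data; fill_until_eagain is a hypothetical name, and the example assumes a descriptor with a bounded buffer, such as a pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write filler bytes to fd in nonblocking mode until the output buffer
 * is full. Returns the number of bytes accepted, or -1 on any error
 * other than EAGAIN. */
static long fill_until_eagain(int fd)
{
    char chunk[1024];
    long total = 0;
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    memset(chunk, 'x', sizeof(chunk));

    for (;;) {
        ssize_t n = write(fd, chunk, sizeof(chunk));

        if (n < 0)
            return (errno == EAGAIN) ? total : -1;  /* full: stop here */
        total += n;                                 /* may be a partial write */
    }
}
```

Without O_NONBLOCK the same loop would simply go to sleep in write once the buffer filled up, which is the default behavior described in the second rule.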
If an application ever needs to be assured that data has been
sent to the device, the fsync method must be
implemented. A call to fsync should return only
when the device has been completely flushed (i.e., the output buffer
is empty), even if that takes some time, regardless of whether
O_NONBLOCK is set. The datasync
argument, present only in the 2.4 kernel, is used to distinguish
between the fsync and
fdatasync system calls; as such, it is only of
interest to filesystem code and can be ignored by drivers.
The fsync method has no unusual features. The
call isn't time critical, so every device driver can implement it to
the author's taste. Most of the time, char drivers just have a
NULL pointer in their
fops. Block devices, on the other hand, always
implement the method with the general-purpose
block_fsync, which in turn flushes all the blocks
of the device, waiting for I/O to complete.
The actual implementation of the poll and
select system calls is reasonably simple, for
those who are interested in how it works. Whenever a user application
calls either function, the kernel invokes the
poll method of all files referenced by the system
call, passing the same poll_table to each of
them. The structure is, for all practical purposes, an array of
poll_table_entry structures allocated for a
specific poll or select call. Each poll_table_entry contains the
struct file pointer for the open device, a
wait_queue_head_t pointer, and a
wait_queue_t entry. When a driver calls
poll_wait, one of these entries gets filled in
with the information provided by the driver, and the wait queue entry
gets put onto the driver's queue. The pointer to
wait_queue_head_t is used to track the wait queue
where the current poll table entry is registered, in order for
free_wait to be able to dequeue the entry before
the wait queue is awakened.
If none of the drivers being polled indicates that I/O can occur
without blocking, the poll call simply sleeps
until one of the (perhaps many) wait queues it is on wakes it up.
What's interesting in the implementation of poll is that the file operation may be called with a
NULL pointer as poll_table
argument. This situation can come about for a couple of reasons. If
the application calling poll has provided a
timeout value of 0 (indicating that no wait should be done), there is
no reason to accumulate wait queues, and the system simply does not do
it. The poll_table pointer is also set to
NULL immediately after any driver being
polled indicates that I/O is possible. Since the
kernel knows at that point that no wait will occur, it does not build
up a list of wait queues.
Though the combination of blocking and nonblocking operations and the
select method are sufficient for querying the
device most of the time, some situations aren't efficiently managed by
the techniques we've seen so far.
Let's imagine, for example, a process that executes a long
computational loop at low priority, but needs to process incoming data
as soon as possible. If the input channel is the keyboard, you are
allowed to send a signal to the application (using the "INTR"
character, usually CTRL-C), but this signaling ability is part of the
tty abstraction, a software layer that isn't used for general char
devices. What we need for asynchronous notification is something
different. Furthermore, any input data should
generate an interrupt, not just CTRL-C.
User programs have to execute two steps to enable asynchronous
notification from an input file. First, they specify a process as the
"owner" of the file. When a process invokes the
F_SETOWN command using the
fcntl system call, the process ID of the owner
process is saved in filp->f_owner for later
use. This step is necessary for the kernel to know just who to notify.
In order to actually enable asynchronous notification, the user
programs must set the FASYNC flag in the device by
means of the F_SETFL fcntl command.
After these two calls have been executed, the input file can request
delivery of a SIGIO signal whenever new data
arrives. The signal is sent to the process (or process group, if the
value is negative) stored in filp->f_owner.
For example, the following lines of code in a user program enable
asynchronous notification to the current process for the
stdin input file:
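The two steps can be sketched as follows. This is an illustration under our own naming, not the asynctest source: enable_sigio, sigio_handler, and sigio_seen are hypothetical, and the function takes the descriptor as a parameter so that it can be applied to STDIN_FILENO or to any other input file.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* glibc: expose FASYNC; must precede the includes */
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; the main loop can test it and then read the data. */
static volatile sig_atomic_t sigio_seen;

static void sigio_handler(int signo)
{
    (void)signo;
    sigio_seen = 1;
}

/* Request SIGIO delivery to the calling process for input on fd.
 * Returns 0 on success, -1 on error. */
static int enable_sigio(int fd)
{
    struct sigaction sa;
    int oflags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigio_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    /* Step 1: make this process the owner of the file. */
    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;

    /* Step 2: turn on FASYNC while preserving the other flags. */
    oflags = fcntl(fd, F_GETFL);
    if (oflags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, oflags | FASYNC) < 0)
        return -1;
    return 0;
}
```

A call such as enable_sigio(STDIN_FILENO) performs both steps for standard input. Whether a signal is actually delivered afterward depends on the driver behind the descriptor implementing the fasync method.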
The program named asynctest in the sources
is a simple program that reads stdin as shown. It
can be used to test the asynchronous capabilities of
scullpipe. The program is similar to
cat, but doesn't terminate on end-of-file;
it responds only to input, not to the absence of input.
Note, however, that not all devices support asynchronous
notification, and you can choose not to offer it. Applications usually
assume that the asynchronous capability is available only for sockets
and ttys. For example, pipes and FIFOs don't support it, at least in
the current kernels. Mice offer asynchronous notification because some
programs expect a mouse to be able to send SIGIO
like a tty does.
A more relevant topic for us is how the device driver can implement
asynchronous signaling. The following list details the sequence of
operations from the kernel's point of view:
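-
When F_SETOWN is invoked, nothing happens in the driver; a value is
simply assigned to filp->f_owner.
-
When F_SETFL is executed to turn on
FASYNC, the driver's fasync method is called. This method is called whenever the value of
FASYNC is changed in
filp->f_flags, to notify the driver of the
change so it can respond properly. The flag is cleared by default when
the file is opened. We'll look at the standard implementation of the
driver method soon.
-
When data arrives, all the processes registered for asynchronous
notification must be sent a SIGIO signal.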
While implementing the first step is trivial -- there's nothing to
do on the driver's part -- the other steps involve maintaining a
dynamic data structure to keep track of the different asynchronous
readers; there might be several of these readers. This dynamic data
structure, however, doesn't depend on the particular device involved,
and the kernel offers a suitable general-purpose implementation so
that you don't have to rewrite the same code in every driver.
The general implementation offered by Linux is based on one data
structure and two functions (which are called in the second and third
steps described earlier). The header that declares related material is
<linux/fs.h> -- nothing new here -- and
the data structure is called struct
fasync_struct. As we did with wait queues, we need to insert
a pointer to the structure in the device-specific data
structure. Actually, we've already seen such a field in the section
"A Sample Implementation: scullpipe".
Here's how scullpipe implements the
fasync method:
int scull_p_fasync(fasync_file fd, struct file *filp, int mode)
{
Scull_Pipe *dev = filp->private_data;
return fasync_helper(fd, filp, mode, &dev->async_queue);
}
It's clear that all the work is performed by
fasync_helper. It wouldn't be possible, however,
to implement the functionality without a method in the driver, because
the helper function needs to access the correct pointer to
struct fasync_struct * (here
&dev->async_queue), and only the driver can
provide the information.
When data arrives, then, the following statement must be executed to
signal asynchronous readers. Since new data for the
scullpipe reader is generated by a process
issuing a write, the statement appears in the
write method of
scullpipe.
if (dev->async_queue)
kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
It might appear that we're done, but there's still one thing
missing. We must invoke our fasync method when
the file is closed to remove the file from the list of active
asynchronous readers. Although this call is required only if
filp->f_flags has FASYNC set,
calling the function anyway doesn't hurt and is the usual
implementation. The following lines, for example, are part of the
close method for
scullpipe:
/* remove this filp from the asynchronously notified filp's */
scull_p_fasync(-1, filp, 0);
The difficult part of the chapter is over; now we'll quickly detail
the llseek method, which is useful and easy to
implement.
The llseek method implements the
lseek and llseek system
calls. We have already stated that if the llseek method is missing from the device's operations, the default
implementation in the kernel performs seeks from the beginning of the
file and from the current position by modifying
filp->f_pos, the current reading/writing
position within the file. Please note that for the
lseek system call to work correctly, the
read and write methods must
cooperate by updating the offset item they receive as argument (the
argument is usually a pointer to filp->f_pos).
You may need to provide your own llseek method if
the seek operation corresponds to a physical operation on the device
or if seeking from end-of-file, which is not implemented by the
default method, makes sense. A simple example can be seen in the
scull driver:
loff_t scull_llseek(struct file *filp, loff_t off, int whence)
{
Scull_Dev *dev = filp->private_data;
loff_t newpos;
switch(whence) {
case 0: /* SEEK_SET */
newpos = off;
break;
case 1: /* SEEK_CUR */
newpos = filp->f_pos + off;
break;
case 2: /* SEEK_END */
newpos = dev->size + off;
break;
default: /* can't happen */
return -EINVAL;
}
if (newpos<0) return -EINVAL;
filp->f_pos = newpos;
return newpos;
}
The only device-specific operation here is retrieving the file length
from the device. In scull the
read and write methods
cooperate as needed, as shown in "read and write" in Chapter 3, "Char Drivers".
Although the implementation just shown makes sense for
scull, which handles a well-defined data
area, most devices offer a data flow rather than a data area (just
think about the serial ports or the keyboard), and seeking those
devices does not make sense. If this is the case, you can't just
refrain from declaring the llseek operation,
because the default method allows seeking. Instead, you should use the
following code:
loff_t scull_p_llseek(struct file *filp, loff_t off, int whence)
{
return -ESPIPE; /* unseekable */
}
This function comes from the scullpipe device, which isn't seekable; the error code is translated to
"Illegal seek," though the symbolic name means "is a pipe."
Because the position indicator is meaningless for nonseekable devices,
neither read nor write needs
to update it during data transfer.
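From user space, an unseekable device shows up as an lseek call that fails with ESPIPE. A small sketch follows; fd_is_seekable is a hypothetical helper, and a pipe is used as a stand-in for any device whose driver rejects seeking.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd accepts lseek, 0 if the kernel reports ESPIPE
 * ("Illegal seek"), and -1 on any other error. */
static int fd_is_seekable(int fd)
{
    /* SEEK_CUR with offset 0 asks for the current position without
     * moving it, so the probe has no side effects on seekable files. */
    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)
        return 1;
    return (errno == ESPIPE) ? 0 : -1;
}
```

Applications can use such a probe to decide whether to treat a descriptor as a data area or as a data flow.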
It's interesting to note that since pread and
pwrite have been added to the set of supported
system calls, the llseek device method is not the
only way a user-space program can seek a file. A proper implementation
of unseekable devices should allow normal read and write calls while preventing
pread and pwrite. This is
accomplished by the following line, the first in both the
read and write methods of
scullpipe, which we
didn't explain when introducing those methods:
if (f_pos != &filp->f_pos) return -ESPIPE;
Offering access control is sometimes vital for the reliability of a
device node. Not only should unauthorized users not be permitted to
use the device (a restriction enforced by the filesystem permission
bits), but sometimes only one authorized user should be allowed to
open the device at a time.
The problem is similar to that of using ttys. In that case, the
login process changes the ownership of the
device node whenever a user logs into the system, in order to prevent
other users from interfering with or sniffing the tty data
flow. However, it's impractical to use a privileged program to change
the ownership of a device every time it is opened, just to grant
unique access to it.
Every device shown in this section has the same behavior as the bare
scull device (that is, it implements a
persistent memory area) but differs from
scull in access control, which is
implemented in the open and
close operations.
The brute-force way to provide access control is to permit a device to
be opened by only one process at a time (single openness). This
technique is best avoided because it inhibits user ingenuity. A user
might well want to run different processes on the same device, one
reading status information while the other is writing data. In some
cases, users can get a lot done by running a few simple programs
through a shell script, as long as they can access the device
concurrently. In other words, implementing a single-open behavior
amounts to creating policy, which may get in the way of what your
users want to do.
Allowing only a single process to open a device has undesirable
properties, but it is also the easiest access control to implement for a
device driver, so it's shown here. The source code is extracted from a
device called scullsingle.
int scull_s_open(struct inode *inode, struct file *filp)
{
Scull_Dev *dev = &scull_s_device; /* device information */
int num = NUM(inode->i_rdev);
if (!filp->private_data && num > 0)
return -ENODEV; /* not devfs: allow 1 device only */
spin_lock(&scull_s_lock);
if (scull_s_count) {
spin_unlock(&scull_s_lock);
return -EBUSY; /* already open */
}
scull_s_count++;
spin_unlock(&scull_s_lock);
/* then, everything else is copied from the bare scull device */
if ( (filp->f_flags & O_ACCMODE) == O_WRONLY)
scull_trim(dev);
if (!filp->private_data)
filp->private_data = dev;
MOD_INC_USE_COUNT;
return 0; /* success */
}
The close call, on the other hand, marks the
device as no longer busy.
int scull_s_release(struct inode *inode, struct file *filp)
{
scull_s_count--; /* release the device */
MOD_DEC_USE_COUNT;
return 0;
}
Normally, we recommend that you put the open flag
scull_s_count (with the accompanying spinlock,
scull_s_lock, whose role is explained in the next
subsection) within the device structure (Scull_Dev
here) because, conceptually, it belongs to the device. The
scull driver, however, uses standalone
variables to hold the flag and the lock in order to use the same
device structure and methods as the bare
scull device and minimize code duplication.
Consider once again the test on the variable
scull_s_count just shown. Two separate actions are
taken there: (1) the value of the variable is tested, and the open is
refused if it is not 0, and (2) the variable is incremented to mark
the device as taken. On a single-processor system, these tests are
safe because no other process will be able to run between the two
actions.
As soon as you get into the SMP world, however, a problem arises. If
two processes on two processors attempt to open the device
simultaneously, it is possible that they could both test the value of
scull_s_count before either modifies it. In this
scenario you'll find that, at best, the single-open semantics of the
device is not enforced. In the worst case, unexpected concurrent
access could create data structure corruption and system crashes.
To close this race, scullsingle uses a
locking mechanism different from semaphores, called a spinlock. Spinlocks
will never put a process to sleep. Instead, if a lock is not
available, the spinlock primitives will simply retry, over and over
(i.e., "spin"), until the lock is freed. Spinlocks thus have very
little locking overhead, but they also have the potential to cause a
processor to spin for a long time if somebody hogs the lock. Another
advantage of spinlocks over semaphores is that their implementation is
empty when compiling code for a uniprocessor system (where these
SMP-specific races can't happen). Semaphores are a more general
resource that make sense on uniprocessor computers as well as SMP, so
they don't get optimized away in the uniprocessor case.
Spinlocks can be the ideal mechanism for small critical sections.
Processes should hold spinlocks for the minimum time possible, and
must never sleep while holding a lock. Thus, the main
scull driver, which exchanges data with
user space and can therefore sleep, is not suitable for a spinlock
solution. But spinlocks work nicely for controlling access to
scull_s_count (even if they still are not the
optimal solution, which we will see in Chapter 9, "Interrupt Handling").
Spinlocks can be more complicated than this, and we'll get into the
details in Chapter 9, "Interrupt Handling". But the simple case as shown here
suits our needs for now, and all of the access-control variants of
scull will use simple spinlocks in this
manner.
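The essence of the problem is that the test and the increment must happen as one indivisible step. As a user-space illustration only, the sketch below obtains that property with a C11 atomic compare-and-swap rather than with a spinlock; it is not the scullsingle code, and the names model_open_count, model_single_open, and model_single_release are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for scull_s_count: 0 means free, 1 means taken. */
static atomic_int model_open_count = 0;

/* Claim the device. The compare-and-swap tests for 0 and stores 1 as a
 * single atomic operation, so two concurrent callers cannot both win. */
static int model_single_open(void)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&model_open_count, &expected, 1))
        return -EBUSY;               /* already open */
    return 0;
}

/* Mark the device as free again. */
static void model_single_release(void)
{
    atomic_store(&model_open_count, 0);
}
```

A second open attempted before the release fails with -EBUSY, exactly as in scull_s_open, but no lock is held at any point.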
The astute reader may have noticed that whereas
scull_s_open acquires the
scull_s_lock lock prior to incrementing the
scull_s_count flag,
scull_s_release takes no such precautions. This
code is safe because no other code will change the value of
scull_s_count if it is nonzero, so there will be no
conflict with this particular assignment.
The next step beyond a single system-wide lock is to let a single user
open a device in multiple processes but allow only one user to have
the device open at a time. This solution makes it easy to test the
device, since the user can read and write from several processes at
once, but assumes that the user takes some responsibility for
maintaining the integrity of the data during multiple accesses. This
is accomplished by adding checks in the open method; such checks are performed after the
normal permission checking and can only make access more restrictive
than that specified by the owner and group permission bits. This is
the same access policy as that used for ttys, but it doesn't resort to
an external privileged program.
Those access policies are a little trickier to implement than
single-open policies. In this case, two items are needed: an open
count and the uid of the "owner" of the device. Once again, the best
place for such items is within the device structure; our example uses
global variables instead, for the reason explained earlier for
scullsingle. The name of the device is
sculluid.
The open call grants access on first open, but
remembers the owner of the device. This means that a user can open the
device multiple times, thus allowing cooperating processes to work
concurrently on the device. At the same time, no other user can open
it, thus avoiding external interference. Since this version of the
function is almost identical to the preceding one, only the relevant
part is reproduced here:
spin_lock(&scull_u_lock);
if (scull_u_count &&
(scull_u_owner != current->uid) && /* allow user */
(scull_u_owner != current->euid) && /* allow whoever did su */
!capable(CAP_DAC_OVERRIDE)) { /* still allow root */
spin_unlock(&scull_u_lock);
return -EBUSY; /* -EPERM would confuse the user */
}
if (scull_u_count == 0)
scull_u_owner = current->uid; /* grab it */
scull_u_count++;
spin_unlock(&scull_u_lock);
We chose to return -EBUSY and not
-EPERM, even though the code is performing a
permission check, in order to point a user who is denied access in the
right direction. The reaction to "Permission denied" is usually to
check the mode and owner of the /dev file, while
"Device busy" correctly suggests that the user should look for a
process already using the device.
This code also checks to see if the process attempting the open has
the ability to override file access permissions; if so, the open will
be allowed even if the opening process is not the owner of the
device. The CAP_DAC_OVERRIDE capability fits the
task well in this case.
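The decision made by the sculluid test can be isolated as a pure function, which makes the policy easy to tabulate. This is a user-space model under our own naming: model_uid_open_check is hypothetical, and the can_override parameter stands in for the result of capable(CAP_DAC_OVERRIDE).

```c
#include <errno.h>

/* Model of the sculluid open policy.
 *   count        - current open count of the device
 *   owner        - uid recorded at first open
 *   uid, euid    - real and effective uid of the opening process
 *   can_override - nonzero if the process may override file permissions
 * Returns 0 if the open is allowed, -EBUSY otherwise. */
static int model_uid_open_check(int count, unsigned int owner,
                                unsigned int uid, unsigned int euid,
                                int can_override)
{
    if (count &&                 /* somebody already has it open */
        owner != uid &&          /* ...and it is not this user */
        owner != euid &&         /* ...nor the user it runs as */
        !can_override)           /* ...and no privilege to override */
        return -EBUSY;
    return 0;
}
```

All four conditions must hold for the open to be refused: an unopened device is always granted, and on an open device any one of a matching uid, a matching euid, or the override capability is enough.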
When the device isn't accessible, returning an error is usually the
most sensible approach, but there are situations in which you'd prefer
to wait for the device.
For example, if a data communication channel is used both to transmit
reports on a timely basis (using crontab)
and for casual usage according to people's needs, it's much better for
the timely report to be slightly delayed rather than fail just because
the channel is currently busy.
This is one of the choices that the programmer must make when
designing a device driver, and the right answer depends on the
particular problem being solved.
The scullwuid device is a version of
sculluid that waits for the device on
open instead of returning
-EBUSY. It differs from
sculluid only in the following part of the
open operation:
spin_lock(&scull_w_lock);
while (scull_w_count &&
(scull_w_owner != current->uid) && /* allow user */
(scull_w_owner != current->euid) && /* allow whoever did su */
!capable(CAP_DAC_OVERRIDE)) {
spin_unlock(&scull_w_lock);
if (filp->f_flags & O_NONBLOCK) return -EAGAIN;
interruptible_sleep_on(&scull_w_wait);
if (signal_pending(current)) /* a signal arrived */
return -ERESTARTSYS; /* tell the fs layer to handle it */
/* else, loop */
spin_lock(&scull_w_lock);
}
if (scull_w_count == 0)
scull_w_owner = current->uid; /* grab it */
scull_w_count++;
spin_unlock(&scull_w_lock);
int scull_w_release(struct inode *inode, struct file *filp)
{
scull_w_count--;
if (scull_w_count == 0)
wake_up_interruptible(&scull_w_wait); /* awaken other uid's */
MOD_DEC_USE_COUNT;
return 0;
}
The problem with a blocking-open implementation is that it is really
unpleasant for the interactive user, who has to keep guessing what is
going wrong. The interactive user usually invokes precompiled commands
such as cp and
tar and can't just add
O_NONBLOCK to the open call. Someone who's making a backup using the tape drive in the next
room would prefer to get a plain "device or resource busy" message
instead of being left to guess why the hard drive is so silent today
while tar is scanning it.
This kind of problem (different, incompatible policies for the same
device) is best solved by implementing one device node for each access
policy. An example of this practice can be found in the Linux tape
driver, which provides multiple device files for the same
device. Different device files will, for example, cause the drive to
record with or without compression, or to automatically rewind the
tape when the device is closed.
Another technique to manage access control is creating different
private copies of the device depending on the process opening it.
Clearly this is possible only if the device is not bound to a hardware
object; scull is an example of such a
"software" device. The internals of /dev/tty use a similar technique in order to give each process a different
"view" of what the /dev entry point represents.
When copies of the device are created by the software driver, we call
them virtual devices -- just as virtual
consoles use a single physical tty device.
The /dev/scullpriv device node implements virtual
devices within the scull package. The
scullpriv implementation uses the minor number of
the process's controlling tty as a key to access the virtual
device. You can nonetheless easily modify the sources to use any
integer value for the key; each choice leads to a different
policy. For example, using the uid leads to a
different virtual device for each user, while using a
pid key creates a new device for each process
accessing it.
The decision to use the controlling terminal is meant to enable easy
testing of the device using input/output redirection: the device is
shared by all commands run on the same virtual terminal and is kept
separate from the one seen by commands run on another terminal.
The open method looks like the following code. It
must look for the right virtual device and possibly create one. The
final part of the function is not shown because it is copied from the
bare scull, which we've already seen.
/* The clone-specific data structure includes a key field */
struct scull_listitem {
Scull_Dev device;
int key;
struct scull_listitem *next;
};
/* The list of devices, and a lock to protect it */
struct scull_listitem *scull_c_head;
spinlock_t scull_c_lock;
/* Look for a device or create one if missing */
static Scull_Dev *scull_c_lookfor_device(int key)
{
struct scull_listitem *lptr, *prev = NULL;
for (lptr = scull_c_head; lptr && (lptr->key != key); lptr = lptr->next)
prev=lptr;
if (lptr) return &(lptr->device);
/* not found: allocate with GFP_ATOMIC because the caller holds scull_c_lock and must not sleep */
lptr = kmalloc(sizeof(struct scull_listitem), GFP_ATOMIC);
if (!lptr) return NULL;
/* initialize the device */
memset(lptr, 0, sizeof(struct scull_listitem));
lptr->key = key;
scull_trim(&(lptr->device)); /* initialize it */
sema_init(&(lptr->device.sem), 1);
/* place it in the list */
if (prev) prev->next = lptr;
else scull_c_head = lptr;
return &(lptr->device);
}
int scull_c_open(struct inode *inode, struct file *filp)
{
Scull_Dev *dev;
int key, num = NUM(inode->i_rdev);
if (!filp->private_data && num > 0)
return -ENODEV; /* not devfs: allow 1 device only */
if (!current->tty) {
PDEBUG("Process \"%s\" has no ctl tty\n",current->comm);
return -EINVAL;
}
key = MINOR(current->tty->device);
/* look for a scullc device in the list */
spin_lock(&scull_c_lock);
dev = scull_c_lookfor_device(key);
spin_unlock(&scull_c_lock);
if (!dev) return -ENOMEM;
/* then, everything else is copied from the bare scull device */
The release method does nothing special. It would
normally release the device on last close, but we chose not to
maintain an open count in order to simplify the testing of the driver.
If the device were released on last close, you wouldn't be able to
read the same data after writing to the device unless a background
process were to keep it open. The sample driver takes the easier
approach of keeping the data, so that at the next
open, you'll find it there. The devices are
released when scull_cleanup is called.
Here's the release implementation for
/dev/scullpriv, which closes the discussion of
device methods.
int scull_c_release(struct inode *inode, struct file *filp)
{
    /*
     * Nothing to do, because the device is persistent.
     * A `real' cloned device should be freed on last close
     */
    MOD_DEC_USE_COUNT;
    return 0;
}
Many parts of the device driver API covered in this chapter have
changed between the major kernel releases. For those of you needing to
make your driver work with Linux 2.0 or 2.2, here is a quick rundown
of the differences you will encounter.
Wait Queues in Linux 2.2 and 2.0
A relatively small amount of the material in this chapter changed in
the 2.3 development cycle. The one significant change is in the area
of wait queues. The 2.2 kernel had a different and simpler
implementation of wait queues, but it lacked some important features,
such as exclusive sleeps. The new implementation of wait queues was
introduced in kernel version 2.3.1.
In the 2.2 release, the type of the first argument to the
fasync method changed. In the 2.0 kernel, a
pointer to the inode structure for the device was
passed instead of the integer file descriptor.
The third argument to the fsync file_operations method (the integer
datasync value) was added in the 2.3 development
series, meaning that portable code will generally need to include a
wrapper function for older kernels. There is a trap, however, for
people trying to write portable fsync methods: at
least one distributor, which will remain nameless, patched the 2.4
fsync API into its 2.2 kernel. The kernel
developers usually (usually...) try to avoid
making API changes within a stable series, but they have little
control over what the distributors do.
Memory access was handled differently in the 2.0 kernels, because the
Linux virtual memory system was less well developed at that time.
The new system was the key
change that opened 2.1 development, and it brought significant
improvements in performance; unfortunately, it was accompanied by yet
another set of compatibility headaches for driver writers.
-
-
-
The get_user macro fetched the value at the given address and returned
it as its return value. Once again, no verification was done by the
execution of the macro.
As an example of how the older calls are used, consider
scull one more time. A version of
scull using the 2.0 API would call
verify_area in this way:
      case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
        tmp = scull_quantum;
        scull_quantum = get_user((int *)arg);
        put_user(tmp, (int *)arg);
        break;

      default: /* redundant, as cmd was checked against MAXNR */
        return -ENOTTY;
    }
    return 0;
The 2.0 kernel did not support the poll system
call; only the BSD-style select call was
available. The corresponding device driver method was thus called
select, and operated in a slightly different way,
though the actions to be performed are almost identical.
The scull driver deals with the
incompatibility by declaring a specific select
method to be used when it is compiled for version 2.0 of the kernel:
#ifdef __USE_OLD_SELECT__
int scull_p_poll(struct inode *inode, struct file *filp,
                 int mode, select_table *table)
{
    Scull_Pipe *dev = filp->private_data;

    if (mode == SEL_IN) {
        if (dev->rp != dev->wp) return 1; /* readable */
        PDEBUG("Waiting to read\n");
        select_wait(&dev->inq, table); /* wait for data */
        return 0;
    }
    if (mode == SEL_OUT) {
        /*
         * The buffer is circular; it is considered full
         * if "wp" is right behind "rp". "left" is 0 if the
         * buffer is empty, and it is "1" if it is completely full.
         */
        int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
        if (left != 1) return 1; /* writable */
        PDEBUG("Waiting to write\n");
        select_wait(&dev->outq, table); /* wait for free space */
        return 0;
    }
    return 0; /* never exception-able */
}
#else /* Use poll instead, already shown */
Prior to Linux 2.1, the llseek device method was
called lseek instead, and it received different
parameters from the current implementation. For that reason, under
Linux 2.0 you were not allowed to seek a file, or a device, past the 2
GB limit, even though the llseek system call was
already supported.
This chapter introduced the following symbols and header files.
- #include
-
This header declares all the macros used to define
ioctl commands. It is currently included by
.
-
-
-
-
Macros used to decode a command. In particular,
_IOC_DIR(nr) is an OR combination of
_IOC_READ and _IOC_WRITE.
-
-
-
-
-
-
-
Calling any of these functions puts the current process to sleep on a
queue. Usually, you'll choose the interruptible
form to implement blocking read and write.
-
-
The wait_queue_t type is used when sleeping without
calling sleep_on. Wait queue entries must be
initialized prior to use; the task argument used is
almost always current.
-
-
-
This function selects a runnable process from the run queue. The
chosen process can be current or a different
one. You won't usually call schedule directly,
because the sleep_on functions do it internally.
-
This function puts the current process into a wait queue without
scheduling immediately. It is designed to be used by the
poll method of device drivers.
-
This function is a "helper" for implementing the
fasync device method. The mode
argument is the same value that is passed to the method, while
fa points to a device-specific
fasync_struct *.
-
-
-
© 2001, O'Reilly & Associates, Inc.