Linux Device Drivers: Enhanced Char Driver Operations (pascal4123, ChinaUnix blog)

Linux Device Drivers, 2nd Edition

2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95

Chapter 5
Enhanced Char Driver Operations

In Chapter 3, "Char Drivers", we built a complete device driver that the user can write to and read from. But a real device usually offers more functionality than synchronous read and write. Now that we're equipped with debugging tools should something go awry, we can safely go ahead and implement new operations.

What is normally needed, in addition to reading and writing the device, is the ability to perform various types of hardware control via the device driver. Control operations are usually supported via the ioctl method. The alternative is to look at the data flow being written to the device and use special sequences as control commands. This latter technique should be avoided because it requires reserving some characters for controlling purposes; thus, the data flow can't contain those characters. Moreover, this technique turns out to be more complex to handle than ioctl. Nonetheless, sometimes it's a useful approach to device control and is used by tty's and other devices. We'll describe it later in this chapter in "Device Control Without ioctl".

As we suggested in the previous chapter, the ioctl system call offers a device-specific entry point for the driver to handle "commands." ioctl is device specific in that, unlike read and other methods, it allows applications to access features unique to the hardware being driven, such as configuring the device and entering or exiting operating modes. These control operations are usually not available through the read/write file abstraction. For example, everything you write to a serial port is used as communication data, and you cannot change the baud rate by writing to the device. That is what ioctl is for: controlling the I/O channel.

Another important feature of real devices (unlike scull) is that data being read or written is exchanged with other hardware, and some synchronization is needed. The concepts of blocking I/O and asynchronous notification fill the gap and are introduced in this chapter by means of a modified scull device. The driver uses interaction between different processes to create asynchronous events. As with the original scull, you don't need special hardware to test the driver's workings. We will definitely deal with real hardware, but not until Chapter 8, "Hardware Management".

The ioctl function call in user space corresponds to the prototype int ioctl(int fd, int cmd, ...). The prototype stands out in the list of Unix system calls because of the dots, which usually represent a variable number of arguments. In a real system, however, a system call can't actually have a variable number of arguments. System calls must have a well-defined number of arguments because user programs can access them only through hardware "gates," as outlined in "User Space and Kernel Space" in Chapter 2, "Building and Running Modules". Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp. The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument). Some commands take no arguments, some take an integer value, and some take a pointer to other data. Using a pointer is the way to pass arbitrary data to the ioctl call; the device will then be able to exchange any amount of data with user space.

As you might imagine, most ioctl implementations consist of a switch statement that selects the correct behavior according to the cmd argument. Different commands have different numeric values, which are usually given symbolic names to simplify coding. The symbolic name is assigned by a preprocessor definition. Custom drivers usually declare such symbols in their header files; scull.h declares them for scull. User programs must, of course, include that header file as well to have access to those symbols.

The command numbers should be unique across the system in order to prevent errors caused by issuing the right command to the wrong device. Such a mismatch is not unlikely to happen, and a program might find itself trying to change the baud rate of a non-serial-port input stream, such as a FIFO or an audio device. If each ioctl number is unique, then the application will get an EINVAL error rather than succeeding in doing something unintended.

To help programmers create unique ioctl command codes, these codes have been split up into several bitfields. The first versions of Linux used 16-bit numbers: the top eight were the "magic" number associated with the device, and the bottom eight were a sequential number, unique within the device. This happened because Linus was "clueless" (his own word); a better division of bitfields was conceived only later. Unfortunately, quite a few drivers still use the old convention. They have to: changing the command codes would break no end of binary programs. In our sources, however, we will use the new command code convention exclusively.

To choose ioctl numbers for your driver according to the new convention, you should first check include/asm/ioctl.h and Documentation/ioctl-number.txt. The header defines the bitfields you will be using: type (magic number), ordinal number, direction of transfer, and size of argument. The ioctl-number.txt file lists the magic numbers used throughout the kernel, so you'll be able to choose your own magic number and avoid overlaps. The text file also lists the reasons why the convention should be used.

The direction of data transfer, if the particular command involves a data transfer. The possible values are _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ | _IOC_WRITE (data is transferred both ways). Data transfer is seen from the application's point of view; _IOC_READ means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so _IOC_READ and _IOC_WRITE can be extracted using a bitwise AND operation.

The size of user data involved. The width of this field is architecture dependent and currently ranges from 8 to 14 bits. You can find its value for your specific architecture in the macro _IOC_SIZEBITS. If you intend your driver to be portable, however, you can only count on a size up to 255. It's not mandatory that you use the size field. If you need larger data structures, you can just ignore it. We'll see soon how this field is used.
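
The four bitfields can be inspected from user space with the decoding macros that accompany _IO, _IOR, _IOW, and _IOWR. The following is a minimal sketch, assuming a Linux system where <linux/ioctl.h> is installed; the DEMO_* command set and the helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <linux/ioctl.h>   /* _IO, _IOR, _IOW, _IOWR and the _IOC_* decoders */

/* Hypothetical command set, for illustration only. */
#define DEMO_IOC_MAGIC  'k'
#define DEMO_IOCRESET   _IO(DEMO_IOC_MAGIC, 0)
#define DEMO_IOCSVALUE  _IOW(DEMO_IOC_MAGIC, 1, int)
#define DEMO_IOCGVALUE  _IOR(DEMO_IOC_MAGIC, 2, int)
#define DEMO_IOCXVALUE  _IOWR(DEMO_IOC_MAGIC, 3, int)

/* Nonzero if the application reads from the device (driver writes to user space). */
static int demo_cmd_reads(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}

/* Nonzero if the application writes to the device (driver reads from user space). */
static int demo_cmd_writes(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}

/* Usage: every field packed by the encoding macros can be recovered. */
static void demo_decode_examples(void)
{
    assert(_IOC_TYPE(DEMO_IOCSVALUE) == DEMO_IOC_MAGIC);
    assert(_IOC_NR(DEMO_IOCXVALUE) == 3);
    assert(_IOC_SIZE(DEMO_IOCGVALUE) == (unsigned int)sizeof(int));
    assert(_IOC_DIR(DEMO_IOCRESET) == _IOC_NONE);
    assert(demo_cmd_reads(DEMO_IOCGVALUE) && !demo_cmd_writes(DEMO_IOCGVALUE));
    assert(demo_cmd_reads(DEMO_IOCXVALUE) && demo_cmd_writes(DEMO_IOCXVALUE));
}
```

Because the direction field is a bit mask, the two helpers test each bit separately instead of comparing the whole field against a single value.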

Here is how some ioctl commands are defined in scull. In particular, these commands set and get the driver's configurable parameters.
 
/* Use 'k' as magic number */
#define SCULL_IOC_MAGIC 'k'

#define SCULL_IOCRESET _IO(SCULL_IOC_MAGIC, 0)

/*
 * S means "Set" through a ptr
 * T means "Tell" directly with the argument value
 * G means "Get": reply by setting through a pointer
 * Q means "Query": response is on the return value
 * X means "eXchange": G and S atomically
 * H means "sHift": T and Q atomically
 */
#define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC, 1, scull_quantum)
#define SCULL_IOCSQSET  _IOW(SCULL_IOC_MAGIC, 2, scull_qset)
#define SCULL_IOCTQUANTUM _IO(SCULL_IOC_MAGIC,  3)
#define SCULL_IOCTQSET  _IO(SCULL_IOC_MAGIC,  4)
#define SCULL_IOCGQUANTUM _IOR(SCULL_IOC_MAGIC, 5, scull_quantum)
#define SCULL_IOCGQSET  _IOR(SCULL_IOC_MAGIC, 6, scull_qset)
#define SCULL_IOCQQUANTUM _IO(SCULL_IOC_MAGIC,  7)
#define SCULL_IOCQQSET  _IO(SCULL_IOC_MAGIC,  8)
#define SCULL_IOCXQUANTUM _IOWR(SCULL_IOC_MAGIC, 9, scull_quantum)
#define SCULL_IOCXQSET  _IOWR(SCULL_IOC_MAGIC,10, scull_qset)
#define SCULL_IOCHQUANTUM _IO(SCULL_IOC_MAGIC, 11)
#define SCULL_IOCHQSET  _IO(SCULL_IOC_MAGIC, 12)
#define SCULL_IOCHARDRESET _IO(SCULL_IOC_MAGIC, 15) /* debugging tool */

#define SCULL_IOC_MAXNR 15

The "exchange" and "shift" operations are not particularly useful for scull. We implemented "exchange" to show how the driver can combine separate operations into a single atomic one, and "shift" to pair "tell" and "query." There are times when atomic[24] test-and-set operations like these are needed, in particular, when applications need to set or release locks.

[24]A fragment of program code is said to be atomic when it will always be executed as though it were a single instruction, without the possibility of the processor being interrupted and something happening in between (such as somebody else's code running).
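
To make the idea of an atomic test-and-set concrete, here is a small user-space sketch. It uses the C11 atomic_flag type rather than an ioctl command, so it is a stand-in technique and not what scull does; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A hypothetical lock built on one atomic flag. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock; returns true if this caller acquired it. */
static bool demo_try_lock(void)
{
    /* test_and_set stores "set" and returns the previous value in one
     * indivisible step: a previous value of false means the lock was free. */
    return !atomic_flag_test_and_set(&demo_lock);
}

/* Release the lock so that a later demo_try_lock can succeed. */
static void demo_unlock(void)
{
    atomic_flag_clear(&demo_lock);
}

/* Usage: the second attempt fails until the lock is released. */
static void demo_lock_examples(void)
{
    assert(demo_try_lock());
    assert(!demo_try_lock());
    demo_unlock();
    assert(demo_try_lock());
    demo_unlock();
}
```

If the test and the set were two separate steps, two callers could both observe a free lock and both believe they own it; that is the situation an atomic operation rules out.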

The value of the ioctl cmd argument is not currently used by the kernel, and it's quite unlikely it will be in the future. Therefore, you could, if you were feeling lazy, avoid the complex declarations shown earlier and explicitly declare a set of scalar numbers. On the other hand, if you did, you wouldn't benefit from using the bitfields. The header is an example of this old-fashioned approach, using 16-bit scalar values to define the ioctl commands. That source file relied on scalar numbers because it used the technology then available, not out of laziness. Changing it now would be a gratuitous incompatibility.

The implementation of ioctl is usually a switch statement based on the command number. But what should the default selection be when the command number doesn't match a valid operation? The question is controversial. Several kernel functions return -EINVAL ("Invalid argument"), which makes sense because the command argument is indeed not a valid one. The POSIX standard, however, states that if an inappropriate ioctl command has been issued, then -ENOTTY should be returned. The string associated with that value used to be "Not a typewriter" under all libraries up to and including libc5. Only libc6 changed the message to "Inappropriate ioctl for device," which looks more to the point. Because most recent Linux systems are libc6 based, we'll stick to the standard and return -ENOTTY. It's still pretty common, though, to return -EINVAL in response to an invalid ioctl command.
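
The ENOTTY convention is easy to observe from user space. The sketch below, which assumes a Linux system, issues a terminal-only command (TIOCGWINSZ, which asks for the window size) on a pipe and reports the resulting errno; the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>   /* ioctl, TIOCGWINSZ, struct winsize */
#include <unistd.h>

/* Issue a tty command on a pipe; return the errno seen (0 on success). */
static int demo_tty_ioctl_on_pipe(void)
{
    int fds[2];
    struct winsize ws;
    int err = 0;

    if (pipe(fds) < 0)
        return errno;
    if (ioctl(fds[0], TIOCGWINSZ, &ws) < 0)
        err = errno;   /* a pipe is not a terminal, so the command is rejected */
    close(fds[0]);
    close(fds[1]);
    return err;
}

/* Usage: the command is inappropriate for a pipe. */
static void demo_enotty_example(void)
{
    assert(demo_tty_ioctl_on_pipe() == ENOTTY);
}
```

Passing the returned value to strerror gives the message text the C library associates with ENOTTY.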

Though the ioctl system call is most often used to act on devices, a few commands are recognized by the kernel. Note that these commands, when applied to your device, are decoded before your own file operations are called. Thus, if you choose the same number for one of your ioctl commands, you won't ever see any request for that command, and the application will get something unexpected because of the conflict between the ioctl numbers.

Those that can be issued on any file (regular, device, FIFO, or socket)

Commands in the last group are executed by the implementation of the hosting filesystem (see the chattr command). Device driver writers are interested only in the first group of commands, whose magic number is "T." Looking at the workings of the other groups is left to the reader as an exercise; ext2_ioctl is a most interesting function (though easier than you may expect), because it implements the append-only flag and the immutable flag.

FIOCLEX: Set the close-on-exec flag (File IOctl CLose on EXec). Setting this flag will cause the file descriptor to be closed when the calling process executes a new program.

FIOASYNC: Set or reset asynchronous notification for the file (as discussed in "Asynchronous Notification" later in this chapter). Note that kernel versions up to Linux 2.2.4 incorrectly used this command to modify the O_SYNC flag. Since both actions can be accomplished in other ways, nobody actually uses the FIOASYNC command, which is reported here only for completeness.

The last item in the list introduced a new system call, fcntl, which looks like ioctl. In fact, the fcntl call is very similar to ioctl in that it gets a command argument and an extra (optional) argument. It is kept separate from ioctl mainly for historical reasons: when Unix developers faced the problem of controlling I/O operations, they decided that files and devices were different. At the time, the only devices with ioctl implementations were ttys, which explains why -ENOTTY is the standard reply for an incorrect ioctl command. Things have changed, but fcntl remains in the name of backward compatibility.
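
The overlap between the two calls can be seen with the close-on-exec flag, which can be set through ioctl and inspected through fcntl. This is a minimal sketch assuming a Linux system where <sys/ioctl.h> exposes FIOCLEX; the demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>       /* fcntl, F_GETFD, FD_CLOEXEC */
#include <sys/ioctl.h>   /* ioctl, FIOCLEX */
#include <unistd.h>

/* Returns 1 if close-on-exec is set on fd, 0 if it is clear, -1 on error. */
static int demo_cloexec_is_set(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Set the flag with ioctl and check it with fcntl; 0 means both agreed. */
static int demo_cloexec_via_ioctl(void)
{
    int fds[2], before, after;

    if (pipe(fds) < 0)
        return -1;
    before = demo_cloexec_is_set(fds[0]);   /* a fresh pipe starts with it clear */
    if (ioctl(fds[0], FIOCLEX) < 0)
        after = -1;
    else
        after = demo_cloexec_is_set(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return (before == 0 && after == 1) ? 0 : -1;
}

/* Usage: the flag set by one call is visible through the other. */
static void demo_cloexec_example(void)
{
    assert(demo_cloexec_via_ioctl() == 0);
}
```

The same flag could have been set with fcntl(fd, F_SETFD, FD_CLOEXEC), which is the form most programs use.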

Another point we need to cover before looking at the ioctl code for the scull driver is how to use the extra argument. If it is an integer, it's easy: it can be used directly. If it is a pointer, however, some care must be taken.

When a pointer is used to refer to user space, we must ensure that the user address is valid and that the corresponding page is currently mapped. If kernel code tries to access an out-of-range address, the processor issues an exception. Exceptions in kernel code are turned to oops messages by every Linux kernel up through 2.0.x; version 2.1 and later handle the problem more gracefully. In any case, it's the driver's responsibility to make proper checks on every user-space address it uses and to return an error if it is invalid.
There are a couple of interesting things to note about access_ok. First is that it does not do the complete job of verifying memory access; it only checks to see that the memory reference is in a region of memory that the process might reasonably have access to. In particular, access_ok ensures that the address does not point to kernel-space memory. Second, most driver code need not actually call access_ok. The memory-access routines described later take care of that for you. We will nonetheless demonstrate its use so that you can see how it is done, and for backward compatibility reasons that we will get into toward the end of the chapter.

The scull source exploits the bitfields in the ioctl number to check the arguments before the switch:
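
A check of this kind can be sketched in user space as follows. This is not the scull listing: the DEMO_* constants are hypothetical, and demo_addr_ok is a made-up stand-in for the kernel's address verification that rejects only a null pointer.

```c
#include <assert.h>
#include <errno.h>
#include <linux/ioctl.h>   /* _IOC_TYPE, _IOC_NR, _IOC_DIR, _IOC_SIZE */
#include <stddef.h>

#define DEMO_IOC_MAGIC 'k'   /* hypothetical magic number */
#define DEMO_IOC_MAXNR 15    /* hypothetical highest ordinal number */

/* Hypothetical stand-in for the kernel's address check: rejects only NULL. */
static int demo_addr_ok(const void *addr, size_t size)
{
    (void)size;
    return addr != NULL;
}

/* Validate cmd and arg before any switch statement looks at them. */
static int demo_check_cmd(unsigned int cmd, unsigned long arg)
{
    /* Wrong magic number or out-of-range ordinal: not one of our commands. */
    if (_IOC_TYPE(cmd) != DEMO_IOC_MAGIC)
        return -ENOTTY;
    if (_IOC_NR(cmd) > DEMO_IOC_MAXNR)
        return -ENOTTY;
    /* Commands that transfer data carry a pointer, which must be checked. */
    if (_IOC_DIR(cmd) != _IOC_NONE &&
        !demo_addr_ok((const void *)arg, _IOC_SIZE(cmd)))
        return -EFAULT;
    return 0;
}

/* Usage: a few accepted and rejected commands. */
static void demo_check_examples(void)
{
    int value = 0;
    unsigned long ptr = (unsigned long)&value;

    assert(demo_check_cmd(_IOW(DEMO_IOC_MAGIC, 1, int), ptr) == 0);
    assert(demo_check_cmd(_IOW('x', 1, int), ptr) == -ENOTTY);          /* wrong magic */
    assert(demo_check_cmd(_IO(DEMO_IOC_MAGIC, 16), 0) == -ENOTTY);      /* ordinal too large */
    assert(demo_check_cmd(_IOR(DEMO_IOC_MAGIC, 5, int), 0) == -EFAULT); /* null pointer */
    assert(demo_check_cmd(_IO(DEMO_IOC_MAGIC, 0), 0) == 0);             /* no data transfer */
}
```

Doing these checks once, ahead of the switch, is what allows the individual cases to use the unchecked transfer routines.
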
Access to a device is controlled by the permissions on the device file(s), and the driver is not normally involved in permissions checking. There are situations, however, where any user is granted read/write permission on the device, but some other operations should be denied. For example, not all users of a tape drive should be able to set its default block size, and the ability to work with a disk device does not mean that the user can reformat the drive. In cases like these, the driver must perform additional checks to be sure that the user is capable of performing the requested operation.

Unix systems have traditionally restricted privileged operations to the superuser account. Privilege is an all-or-nothing thing -- the superuser can do absolutely anything, but all other users are highly restricted. The Linux kernel as of version 2.2 provides a more flexible system called capabilities. A capability-based system leaves the all-or-nothing mode behind and breaks down privileged operations into separate subgroups. In this way, a particular user (or program) can be empowered to perform a specific privileged operation without giving away the ability to perform other, unrelated operations. Capabilities are still little used in user space, but kernel code uses them almost exclusively.

The full set of capabilities can be found in . A subset of those capabilities that might be of interest to device driver writers includes the following:

CAP_SYS_RAWIO: The ability to perform "raw" I/O operations. Examples include accessing device ports or communicating directly with USB devices.

Before performing a privileged operation, a device driver should check that the calling process has the appropriate capability with the capable function (defined in ):

In the scull sample driver, any user is allowed to query the quantum and quantum set sizes. Only privileged users, however, may change those values, since inappropriate values could badly affect system performance. When needed, the scull implementation of ioctl checks the caller's privilege level by calling capable(CAP_SYS_ADMIN) and returning -EPERM if the check fails.

The scull implementation of ioctl only transfers the configurable parameters of the device and turns out to be as easy as the following:
 
 switch(cmd) {

#ifdef SCULL_DEBUG
   case SCULL_IOCHARDRESET:
     /*
     * reset the counter to 1, to allow unloading in case
     * of problems. Use 1, not 0, because the invoking
     * process has the device open.
     */
     while (MOD_IN_USE)
       MOD_DEC_USE_COUNT;
     MOD_INC_USE_COUNT;
     /* don't break: fall through and reset things */
#endif /* SCULL_DEBUG */

   case SCULL_IOCRESET:
    scull_quantum = SCULL_QUANTUM;
    scull_qset = SCULL_QSET;
    break;
    
   case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
    if (! capable (CAP_SYS_ADMIN))
      return -EPERM;
    ret = __get_user(scull_quantum, (int *)arg);
    break;

   case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
    if (! capable (CAP_SYS_ADMIN))
      return -EPERM;
    scull_quantum = arg;
    break;

   case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
    ret = __put_user(scull_quantum, (int *)arg);
    break;

   case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
    return scull_quantum;

   case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
    if (! capable (CAP_SYS_ADMIN))
      return -EPERM;
    tmp = scull_quantum;
    ret = __get_user(scull_quantum, (int *)arg);
    if (ret == 0)
      ret = __put_user(tmp, (int *)arg);
    break;

   case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
    if (! capable (CAP_SYS_ADMIN))
      return -EPERM;
    tmp = scull_quantum;
    scull_quantum = arg;
    return tmp;

   default: /* redundant, as cmd was checked against MAXNR */
    return -ENOTTY;
 }
 return ret;
scull also includes six entries that act on scull_qset. These entries are identical to the ones for scull_quantum and are not worth showing in print.
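
The six argument-passing conventions are easier to compare when seen from the caller's side. The following user-space model is only a sketch: demo_ioctl is a hypothetical plain function standing in for the driver method, the DEMO_* command values are arbitrary, and no privilege or address checks are modeled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command codes for the model. */
enum { DEMO_SET, DEMO_TELL, DEMO_GET, DEMO_QUERY, DEMO_EXCHANGE, DEMO_SHIFT };

static int demo_quantum = 4000;   /* the single parameter being configured */

/* arg is either a value or a pointer, depending on the command. */
static long demo_ioctl(int cmd, unsigned long arg)
{
    int tmp;

    switch (cmd) {
    case DEMO_SET:        /* Set: arg points to the new value */
        demo_quantum = *(int *)arg;
        return 0;
    case DEMO_TELL:       /* Tell: arg is the new value */
        demo_quantum = (int)arg;
        return 0;
    case DEMO_GET:        /* Get: arg points to where the result goes */
        *(int *)arg = demo_quantum;
        return 0;
    case DEMO_QUERY:      /* Query: the result is the return value */
        return demo_quantum;
    case DEMO_EXCHANGE:   /* eXchange: Set and Get through the same pointer */
        tmp = demo_quantum;
        demo_quantum = *(int *)arg;
        *(int *)arg = tmp;
        return 0;
    case DEMO_SHIFT:      /* sHift: Tell the new value, return the old one */
        tmp = demo_quantum;
        demo_quantum = (int)arg;
        return tmp;
    default:
        return -ENOTTY;
    }
}

/* Usage: one call per convention. */
static void demo_ioctl_examples(void)
{
    int q = 100;

    assert(demo_ioctl(DEMO_SET, (unsigned long)&q) == 0);
    assert(demo_ioctl(DEMO_QUERY, 0) == 100);
    assert(demo_ioctl(DEMO_TELL, 200) == 0);
    assert(demo_ioctl(DEMO_GET, (unsigned long)&q) == 0 && q == 200);
    q = 300;
    assert(demo_ioctl(DEMO_EXCHANGE, (unsigned long)&q) == 0 && q == 200);
    assert(demo_ioctl(DEMO_SHIFT, 400) == 300);
    assert(demo_ioctl(DEMO_QUERY, 0) == 400);
}
```

Returning the value directly, as Query and sHift do, only works while the value is positive, because negative return values are interpreted as errors.
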

Device Control Without ioctl

Sometimes controlling the device is better accomplished by writing control sequences to the device itself. This technique is used, for example, in the console driver, where so-called escape sequences are used to move the cursor, change the default color, or perform other configuration tasks. The benefit of implementing device control this way is that the user can control the device just by writing data, without needing to use (or sometimes write) programs built just for configuring the device.

For example, the setterm program acts on the console (or another terminal) configuration by printing escape sequences. This behavior has the advantage of permitting the remote control of devices. The controlling program can live on a different computer than the controlled device, because a simple redirection of the data stream does the configuration job. You're already used to this with ttys, but the technique is more general.

The drawback of controlling by printing is that it adds policy constraints to the device; for example, it is viable only if you are sure that the control sequence can't appear in the data being written to the device during normal operation. This is only partly true for ttys. Although a text display is meant to display only ASCII characters, sometimes control characters can slip through in the data being written and can thus affect the console setup. This can happen, for example, when you issue grep on a binary file; the extracted lines can contain anything, and you often end up with the wrong font on your console.[25]

Controlling by write is definitely the way to go for those devices that don't transfer data but just respond to commands, such as robotic devices.

For instance, a driver written for fun by one of your authors moves a camera on two axes. In this driver, the "device" is simply a pair of old stepper motors, which can't really be read from or written to. The concept of "sending a data stream" to a stepper motor makes little or no sense. In this case, the driver interprets what is being written as ASCII commands and converts the requests to sequences of impulses that manipulate the stepper motors. The idea is similar, somewhat, to the AT commands you send to the modem in order to set up communication, the main difference being that the serial port used to communicate with the modem must transfer real data as well. The advantage of direct device control is that you can use cat to move the camera without writing and compiling special code to issue the ioctl calls.
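
A write method that interprets ASCII commands might be organized along the following lines. The sketch runs in user space and invents its own command format, which is an assumption and not the format of the camera driver: each command is an axis letter (x or y), a signed step count, and a newline. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical device state: the position on each axis, in steps. */
struct demo_camera {
    long x, y;
};

/* Interpret one written buffer. Returns the bytes consumed or -EINVAL. */
static ssize_t demo_camera_write(struct demo_camera *cam,
                                 const char *buf, size_t count)
{
    size_t pos = 0;

    while (pos < count) {
        const char *nl = memchr(buf + pos, '\n', count - pos);
        char line[32];
        char *end;
        size_t len;
        long steps;

        if (nl == NULL)
            break;   /* incomplete command: leave it for the next write */
        len = (size_t)(nl - (buf + pos));
        if (len < 2 || len >= sizeof(line))
            return -EINVAL;
        memcpy(line, buf + pos, len);
        line[len] = '\0';

        /* Everything after the axis letter must be a number. */
        steps = strtol(line + 1, &end, 10);
        if (end == line + 1 || *end != '\0')
            return -EINVAL;
        if (line[0] == 'x')
            cam->x += steps;
        else if (line[0] == 'y')
            cam->y += steps;
        else
            return -EINVAL;
        pos += len + 1;   /* the command plus its newline */
    }
    return (ssize_t)pos;
}

/* Usage: two commands in one buffer, as a single write might deliver them. */
static void demo_camera_example(void)
{
    struct demo_camera cam = { 0, 0 };

    assert(demo_camera_write(&cam, "x+10\ny-3\n", 9) == 9);
    assert(cam.x == 10 && cam.y == -3);
}
```

Commands that precede a malformed one in the same buffer have already been applied when -EINVAL is returned; a real driver would have to decide whether that is acceptable.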

Whenever a process must wait for an event (such as the arrival of data or the termination of a process), it should go to sleep. Sleeping causes the process to suspend execution, freeing the processor for other uses. At some future time, when the event being waited for occurs, the process will be woken up and will continue with its job. This section discusses the 2.4 machinery for putting a process to sleep and waking it up. Earlier versions are discussed in "Backward Compatibility" later in this chapter.
The interruptible variant works just like sleep_on, except that the sleep can be interrupted by a signal. This is the form that device driver writers have been using for a long time, before wait_event_interruptible (described later) appeared.

Of course, sleeping is only half of the problem; something, somewhere will have to wake the process up again. When a device driver sleeps directly, there is usually code in another part of the driver that performs the wakeup, once it knows that the event has occurred. Typically a driver will wake up sleepers in its interrupt handler once new data has arrived. Other scenarios are possible, however.

Normally, a wake_up call can cause an immediate reschedule to happen, meaning that other processes might run before wake_up returns. The "synchronous" variants instead make any awakened processes runnable, but do not reschedule the CPU. This is used to avoid rescheduling when the current process is known to be going to sleep, thus forcing a reschedule anyway. Note that awakened processes could run immediately on a different processor, so these functions should not be expected to provide mutual exclusion.

As an example of wait queue usage, imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device. The following code does just that:
 
DECLARE_WAIT_QUEUE_HEAD(wq);

ssize_t sleepy_read (struct file *filp, char *buf, size_t count, 
   loff_t *pos)
{
  printk(KERN_DEBUG "process %i (%s) going to sleep\n",
      current->pid, current->comm);
  interruptible_sleep_on(&wq);
  printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
  return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char *buf, size_t count,
		loff_t *pos)
{
  printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
     current->pid, current->comm);
  wake_up_interruptible(&wq);
  return count; /* succeed, to avoid retrial */
}
The code for this device is available as sleepy in the example programs and can be tested using cat and input/output redirection, as usual.

An important thing to remember with wait queues is that being woken up does not guarantee that the event you were waiting for has occurred; a process can be woken for other reasons, mainly because it received a signal. Any code that sleeps should do so in a loop that tests the condition after returning from the sleep, as discussed in "A Sample Implementation: scullpipe" later in this chapter.

The previous discussion is all that most driver writers will need to know to get their job done. Some, however, will want to dig deeper. This section attempts to get the curious started; everybody else can skip to the next section without missing much that is important.
 
void simplified_sleep_on(wait_queue_head_t *queue)
{
  wait_queue_t wait;

  init_waitqueue_entry(&wait, current);
  current->state = TASK_INTERRUPTIBLE;

  add_wait_queue(queue, &wait);
  schedule();
  remove_wait_queue(queue, &wait);
}

One other reason for calling the scheduler explicitly, however, is to do exclusive waits. There can be situations in which several processes are waiting on an event; when wake_up is called, all of those processes will try to execute. Suppose that the event signifies the arrival of an atomic piece of data. Only one process will be able to read that data; all the rest will simply wake up, see that no data is available, and go back to sleep.

For this reason, the 2.3 development series added the concept of an exclusive sleep. If processes sleep in an exclusive mode, they are telling the kernel to wake only one of them. The result is improved performance in some situations.
 
void simplified_sleep_exclusive(wait_queue_head_t *queue)
{
  wait_queue_t wait;

  init_waitqueue_entry(&wait, current);
  current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;

  add_wait_queue_exclusive(queue, &wait);
  schedule();
  remove_wait_queue(queue, &wait);
}

You need to make reentrant any function that matches either of two conditions. First, if it calls schedule, possibly by calling sleep_on or wake_up. Second, if it copies data to or from user space, because access to user space might page-fault, and the process will be put to sleep while the kernel deals with the missing page. Every function that calls any such functions must be reentrant as well. For example, if sample_read calls sample_getdata, which in turn can block, then sample_read must be reentrant as well as sample_getdata, because nothing prevents another process from calling it while it is already executing on behalf of a process that went to sleep.

If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the count bytes that were requested.

Both these statements assume that there are both input and output buffers; in practice, almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading. In contrast, data can't be lost on write, because if the system call doesn't accept data bytes, they remain in the user-space buffer. Even so, the output buffer is almost always useful for squeezing more performance out of the hardware.

The performance gain of implementing an output buffer in the driver results from the reduced number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that's one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks, and the loop continues. If the output buffer is big enough, the write call succeeds on the first attempt -- the buffered data will be pushed out to the device later, at interrupt time -- without control needing to go back to user space for a second or third write call. The choice of a suitable size for the output buffer is clearly device specific.

We didn't use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer was used, because data is simply copied to the memory area associated with the device. Essentially, the device is a buffer, so the implementation of additional buffers would be superfluous. We'll see the use of buffers in Chapter 9, "Interrupt Handling", in the section titled "Interrupt-Driven I/O".

Naturally, O_NONBLOCK is meaningful in the open method also. This happens when the call can actually block for a long time; for example, when opening a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to support O_NONBLOCK in your open method by returning immediately with -EAGAIN ("try it again") if the flag is set, after initiating device initialization. The driver may also implement a blocking open to support access policies in a way similar to file locks. We'll see one such implementation in the section "Blocking open as an Alternative to EBUSY" later in this chapter.

Some drivers may also implement special semantics for O_NONBLOCK; for example, an open of a tape device usually blocks until a tape has been inserted. If the tape drive is opened with O_NONBLOCK, the open succeeds immediately regardless of whether the media is present or not.
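
The effect of O_NONBLOCK on read can be observed from user space without any special device. The sketch below assumes a Linux system and uses an ordinary pipe whose read end is switched to nonblocking mode with fcntl; the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>    /* fcntl, F_GETFL, F_SETFL, O_NONBLOCK */
#include <unistd.h>

/*
 * Read one byte from a nonblocking pipe and return the errno seen
 * (0 on success). If write_first is nonzero, one byte is queued first.
 */
static int demo_nonblocking_read(int write_first)
{
    int fds[2], flags, err = 0;
    char c = 'x';

    if (pipe(fds) < 0)
        return errno;
    flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
        err = errno;
    else if (write_first && write(fds[1], &c, 1) < 0)
        err = errno;
    else if (read(fds[0], &c, 1) < 0)
        err = errno;   /* no data yet: the call returns instead of sleeping */
    close(fds[0]);
    close(fds[1]);
    return err;
}

/* Usage: an empty pipe reports EAGAIN, a pipe with data reads normally. */
static void demo_nonblock_example(void)
{
    assert(demo_nonblocking_read(0) == EAGAIN);
    assert(demo_nonblocking_read(1) == 0);
}
```

Without the flag, the first call would block indefinitely, because the writing end stays open and no data ever arrives.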

A Sample Implementation: scullpipe

The /dev/scullpipe devices (there are four of them by default) are part of the scull module and are used to show how blocking I/O is implemented.

Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens waiting processes as part of handling the interrupt. The scull driver works differently, so that it can be run without requiring any particular hardware or an interrupt handler. We chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake sleeping writer processes. The resulting implementation is similar to that of a FIFO (or named pipe) filesystem node, whence the name.

The device driver uses a device structure that embeds two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or runtime).
 
typedef struct Scull_Pipe {
  wait_queue_head_t inq, outq;  /* read and write queues */
  char *buffer, *end;       /* begin of buf, end of buf */
  int buffersize;         /* used in pointer arithmetic */
  char *rp, *wp;         /* where to read, where to write */
  int nreaders, nwriters;     /* number of openings for r/w */
  struct fasync_struct *async_queue; /* asynchronous readers */
  struct semaphore sem;      /* mutual exclusion semaphore */
  devfs_handle_t handle;     /* only used if devfs is there */
} Scull_Pipe;
 
ssize_t scull_p_read (struct file *filp, char *buf, size_t count,
        loff_t *f_pos)
{
  Scull_Pipe *dev = filp->private_data;

  if (f_pos != &filp->f_pos) return -ESPIPE;

  if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;
  while (dev->rp == dev->wp) { /* nothing to read */
    up(&dev->sem); /* release the lock */
    if (filp->f_flags & O_NONBLOCK)
      return -EAGAIN;
    PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
    if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
      return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
    /* otherwise loop, but first reacquire the lock */
    if (down_interruptible(&dev->sem))
      return -ERESTARTSYS;
  }
  /* ok, data is there, return something */
  if (dev->wp > dev->rp)
    count = min(count, dev->wp - dev->rp);
  else /* the write pointer has wrapped, return data up to dev->end */
    count = min(count, dev->end - dev->rp);
  if (copy_to_user(buf, dev->rp, count)) {
    up (&dev->sem);
    return -EFAULT;
  }
  dev->rp += count;
  if (dev->rp == dev->end)
    dev->rp = dev->buffer; /* wrapped */
  up (&dev->sem);

  /* finally, awaken any writers and return */
  wake_up_interruptible(&dev->outq);
  PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
  return count;
}
Note also, once again, the use of semaphores to protect critical regions of the code. The scull code has to be careful to avoid going to sleep when it holds a semaphore -- otherwise, writers would never be able to add data, and the whole thing would deadlock. This code uses wait_event_interruptible to wait for data if need be; it has to check for available data again after the wait, though. Somebody else could grab the data between when we wake up and when we get the semaphore back.

It's worth repeating that a process can go to sleep both when it calls schedule, either directly or indirectly, and when it copies data to or from user space. In the latter case the process may sleep if the user array is not currently present in main memory. If scull sleeps while copying data between kernel and user space, it will sleep with the device semaphore held. Holding the semaphore in this case is justified since it will not deadlock the system, and since it is important that the device memory array not change while the driver sleeps.

The implementation for write is quite similar to that for read (and, again, its first line will be explained later). Its only "peculiar" feature is that it never completely fills the buffer, always leaving a hole of at least one byte. Thus, when the buffer is empty, wp and rp are equal; when there is data there, they are always different.
 
static inline int spacefree(Scull_Pipe *dev)
{
  if (dev->rp == dev->wp)
    return dev->buffersize - 1;
  return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}

ssize_t scull_p_write(struct file *filp, const char *buf, size_t count,
        loff_t *f_pos)
{
  Scull_Pipe *dev = filp->private_data;
  
  if (f_pos != &filp->f_pos) return -ESPIPE;

  if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;
  
  /* Make sure there's space to write */
  while (spacefree(dev) == 0) { /* full */
    up(&dev->sem);
    if (filp->f_flags & O_NONBLOCK)
      return -EAGAIN;
    PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
    if (wait_event_interruptible(dev->outq, spacefree(dev) > 0))
      return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
    if (down_interruptible(&dev->sem))
      return -ERESTARTSYS;
  }
  /* ok, space is there, accept something */
  count = min(count, spacefree(dev));
  if (dev->wp >= dev->rp)
    count = min(count, dev->end - dev->wp); /* up to end-of-buffer */
  else /* the write pointer has wrapped, fill up to rp-1 */
    count = min(count, dev->rp - dev->wp - 1);
  PDEBUG("Going to accept %li bytes to %p from %p\n",
      (long)count, dev->wp, buf);
  if (copy_from_user(dev->wp, buf, count)) {
    up (&dev->sem);
    return -EFAULT;
  }
  dev->wp += count;
  if (dev->wp == dev->end)
    dev->wp = dev->buffer; /* wrapped */
  up(&dev->sem);

  /* finally, awaken any reader */
  wake_up_interruptible(&dev->inq); /* blocked in read() and select() */

  /* and signal asynchronous readers, explained later in Chapter 5 */
  if (dev->async_queue)
    kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
  PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
  return count;
}
The device, as we conceived it, doesn't implement blocking open and is simpler than a real FIFO. If you want to look at the real thing, you can find it in fs/pipe.c, in the kernel sources.
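The one-byte hole is easy to check outside the kernel. The following is a minimal user-space sketch, not part of the scull sources: it models the buffer with offsets instead of pointers, and the names struct ring, ring_spacefree, ring_used, ring_put, and ring_get are hypothetical. It reuses the arithmetic of spacefree but ignores the driver's extra limit of stopping at the end of the buffer on each call.

```c
#include <stddef.h>

/* Hypothetical user-space model of the scullpipe buffer: offsets, not pointers. */
struct ring {
    size_t size; /* total bytes in the buffer (dev->buffersize) */
    size_t rp;   /* read offset  (dev->rp - dev->buffer) */
    size_t wp;   /* write offset (dev->wp - dev->buffer) */
};

/* Free space, same arithmetic as spacefree(): one byte is always kept unused. */
static size_t ring_spacefree(const struct ring *r)
{
    if (r->rp == r->wp)
        return r->size - 1;
    return ((r->rp + r->size - r->wp) % r->size) - 1;
}

/* Bytes waiting to be read; 0 exactly when rp == wp. */
static size_t ring_used(const struct ring *r)
{
    return (r->wp + r->size - r->rp) % r->size;
}

/* Advance wp by at most n bytes of free space; returns the amount accepted. */
static size_t ring_put(struct ring *r, size_t n)
{
    size_t room = ring_spacefree(r);

    if (n > room)
        n = room;
    r->wp = (r->wp + n) % r->size;
    return n;
}

/* Advance rp by at most n stored bytes; returns the amount consumed. */
static size_t ring_get(struct ring *r, size_t n)
{
    size_t used = ring_used(r);

    if (n > used)
        n = used;
    r->rp = (r->rp + n) % r->size;
    return n;
}
```

With an 8-byte ring, a request to store 100 bytes is cut down to 7, after which ring_spacefree returns 0 while rp and wp still differ; that difference is what lets read tell a full buffer from an empty one.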

To test the blocking operation of the scullpipe device, you can run some programs on it, using input/output redirection as usual. Testing nonblocking activity is trickier, because the conventional programs don't perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retrials. The delay time is passed on the command line and is one second by default.
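The core of such a program can be sketched as follows. This is an illustration written for this text, not the nbtest source itself; set_nonblocking and copy_once are hypothetical helper names, and the sketch assumes a POSIX system.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Turn on O_NONBLOCK for an open descriptor; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * One copy attempt from "in" to "out".
 * Returns the number of bytes written, 0 on end-of-file, or a negative
 * errno value; -EAGAIN means "nothing could be done right now, retry later".
 * A short write drops the remainder: acceptable for a test tool only.
 */
static long copy_once(int in, int out)
{
    char buf[4096];
    ssize_t n, m;

    n = read(in, buf, sizeof(buf));
    if (n < 0)
        return -(long)errno;
    if (n == 0)
        return 0;
    m = write(out, buf, (size_t)n);
    if (m < 0)
        return -(long)errno;
    return (long)m;
}
```

A caller would set both descriptors nonblocking, then loop on copy_once, sleeping for the chosen delay after each attempt and stopping on any negative result other than -EAGAIN.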
Support for either system call requires support from the device driver to function. In version 2.0 of the kernel the device method was modeled on select (and no poll was available to user programs); from version 2.1.23 onward both were offered, and the device method was based on the newly introduced poll system call because poll offered more detailed control than select.
The driver's method will be called whenever the user-space program performs a poll or select system call involving a file descriptor associated with the driver. The device method is in charge of two steps: calling poll_wait on the wait queues that can signal a change in the poll status, and returning a bit mask describing the operations that could be performed immediately without blocking.
The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined in <linux/poll.h>) are used to indicate the possible operations:

POLLIN

This bit must be set if the device can be read without blocking.

POLLRDNORM

This bit must be set if "normal" data is available for reading. A readable device returns (POLLIN | POLLRDNORM).

POLLRDBAND

This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.

POLLPRI

High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.

POLLHUP

When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select will be told that the device is readable, as dictated by the select functionality.

POLLERR

An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write will return an error code without blocking.

POLLOUT

This bit is set in the return value if the device can be written to without blocking.

POLLWRNORM

This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).

POLLWRBAND

Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out-of-band data.

It's worth noting that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won't normally use these flags.

The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:
 
unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
  Scull_Pipe *dev = filp->private_data;
  unsigned int mask = 0;

  /*
   * The buffer is circular; it is considered full
   * if "wp" is right behind "rp". "left" is 0 if the
   * buffer is empty, and it is "1" if it is completely full.
   */
  int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;

  poll_wait(filp, &dev->inq, wait);
  poll_wait(filp, &dev->outq, wait);
  if (dev->rp != dev->wp) mask |= POLLIN | POLLRDNORM; /* readable */
  if (left != 1)     mask |= POLLOUT | POLLWRNORM; /* writable */

  return mask;
}
This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.

The poll code as shown is missing end-of-file support. The poll method should return POLLHUP when the device is at the end of the file. If the caller used the select system call, the file will be reported as readable; in both cases the application will know that it can actually issue the read without waiting forever, and the read method will return 0 to signal end-of-file.

With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trashcan where everyone can put data as long as there's at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel.

Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open; this task is left as an exercise for the reader.

The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.
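Waiting on several streams at once looks like this from the application side. The sketch below is an illustration under stated assumptions rather than code from the scull package: read_from_either is a hypothetical helper, and it uses only the standard user-space poll call.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for either descriptor to become readable, then read
 * from the first ready one. Returns the byte count from read() (0 on
 * end-of-file), or -1 on timeout or error. *which is set to 0 or 1 when a
 * descriptor was ready.
 */
static ssize_t read_from_either(int fd_a, int fd_b, int timeout_ms,
                                char *buf, size_t len, int *which)
{
    struct pollfd fds[2] = {
        { .fd = fd_a, .events = POLLIN },
        { .fd = fd_b, .events = POLLIN },
    };
    int i;

    if (poll(fds, 2, timeout_ms) <= 0)
        return -1; /* timeout (0) or error (-1) */

    for (i = 0; i < 2; i++) {
        /* POLLHUP also means read() will not block: it returns 0 */
        if (fds[i].revents & (POLLIN | POLLHUP)) {
            *which = i;
            return read(fds[i].fd, buf, len);
        }
    }
    return -1; /* only POLLERR or POLLNVAL was reported */
}
```

The bits tested in revents are the same ones a driver's poll method returns in its mask, which is why a driver that reports POLLIN | POLLRDNORM must be ready to satisfy the read that follows without sleeping.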

Reading data from the device

If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested and the driver is sure the remaining data will arrive soon. You can always return less data than you're asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte.

If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.

Writing to the device

If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable.

If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC ("No space left on device"), independently of the setting of O_NONBLOCK.

Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must consistently not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.

Although these are a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.
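Because write may accept less data than requested, an application that needs the whole buffer delivered has to loop. The following user-space sketch shows one way to do it; write_all is a hypothetical helper name, and waiting with poll when a nonblocking descriptor returns EAGAIN is a design choice of this example, not something the rules above require.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Write all len bytes to fd, coping with partial writes.
 * Returns len on success, -1 on error (errno set by the failing call).
 */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n > 0) {
            done += (size_t)n; /* partial write: keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue; /* interrupted by a signal: retry */
        if (n < 0 && errno == EAGAIN) {
            /* nonblocking and full: wait until the device is writable */
            struct pollfd p = { .fd = fd, .events = POLLOUT };

            if (poll(&p, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1; /* real error, or an unexpected zero-length write */
    }
    return (ssize_t)done;
}
```

This loop relies on the driver honoring the rules just listed: a device reported as writable must accept at least one byte, otherwise the loop would spin.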
If some application will ever need to be assured that data has been sent to the device, the fsync method must be implemented. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time, regardless of whether O_NONBLOCK is set. The datasync argument, present only in the 2.4 kernel, is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.

The fsync method has no unusual features. The call isn't time critical, so every device driver can implement it to the author's taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which in turn flushes all the blocks of the device, waiting for I/O to complete.

The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works. Whenever a user application calls either function, the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The structure is, for all practical purposes, an array of poll_table_entry structures allocated for a specific poll or select call. Each poll_table_entry contains the struct file pointer for the open device, a wait_queue_head_t pointer, and a wait_queue_t entry. When a driver calls poll_wait, one of these entries gets filled in with the information provided by the driver, and the wait queue entry gets put onto the driver's queue. The pointer to wait_queue_head_t is used to track the wait queue where the current poll table entry is registered, in order for free_wait to be able to dequeue the entry before the wait queue is awakened.

If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.

What's interesting in the implementation of poll is that the file operation may be called with a NULL pointer as poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.

Though the combination of blocking and nonblocking operations and the select method are sufficient for querying the device most of the time, some situations aren't efficiently managed by the techniques we've seen so far.

Let's imagine, for example, a process that executes a long computational loop at low priority, but needs to process incoming data as soon as possible. If the input channel is the keyboard, you are allowed to send a signal to the application (using the `INTR' character, usually CTRL-C), but this signaling ability is part of the tty abstraction, a software layer that isn't used for general char devices. What we need for asynchronous notification is something different. Furthermore, any input data should generate an interrupt, not just CTRL-C.

User programs have to execute two steps to enable asynchronous notification from an input file. First, they specify a process as the "owner" of the file. When a process invokes the F_SETOWN command using the fcntl system call, the process ID of the owner process is saved in filp->f_owner for later use. This step is necessary for the kernel to know just who to notify. In order to actually enable asynchronous notification, the user programs must set the FASYNC flag in the device by means of the F_SETFL fcntl command.

After these two calls have been executed, the input file can request delivery of a SIGIO signal whenever new data arrives. The signal is sent to the process (or process group, if the value is negative) stored in filp->f_owner.

For example, the following lines of code in a user program enable asynchronous notification to the current process for the stdin input file:
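A minimal sketch of those two steps, wrapped in a function so that it works for any descriptor, might look like the following. The names enable_sigio, sigio_handler, and got_sigio are hypothetical; the sketch uses sigaction rather than signal, and it spells the flag O_ASYNC, which on Linux with glibc is the same bit as FASYNC.

```c
#define _GNU_SOURCE /* for F_SETOWN and O_ASYNC on glibc */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigio; /* set by the handler on SIGIO */

static void sigio_handler(int signo)
{
    (void)signo;
    got_sigio = 1;
}

/*
 * Ask for SIGIO to be sent to this process when fd has new input.
 * Returns 0 on success, -1 on failure.
 */
static int enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigio_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    /* step 1: name the owner, saved by the kernel in filp->f_owner */
    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;

    /* step 2: turn on the FASYNC (O_ASYNC) flag, keeping the other flags */
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

Calling enable_sigio(STDIN_FILENO) gives the behavior described here for stdin; the main loop can then test got_sigio between units of work.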
The program named asynctest in the sources is a simple program that reads stdin as shown. It can be used to test the asynchronous capabilities of scullpipe. The program is similar to cat, but doesn't terminate on end-of-file; it responds only to input, not to the absence of input.

Note, however, that not all the devices support asynchronous notification, and you can choose not to offer it. Applications usually assume that the asynchronous capability is available only for sockets and ttys. For example, pipes and FIFOs don't support it, at least in the current kernels. Mice offer asynchronous notification because some programs expect a mouse to be able to send SIGIO like a tty does.

A more relevant topic for us is how the device driver can implement asynchronous signaling. The following list details the sequence of operations from the kernel's point of view:

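When F_SETOWN is invoked, nothing happens, except that a value is assigned to filp->f_owner.

When F_SETFL is executed to turn on FASYNC, the driver's fasync method is called. This method is called whenever the value of FASYNC is changed in filp->f_flags, to notify the driver of the change so it can respond properly. The flag is cleared by default when the file is opened. We'll look at the standard implementation of the driver method soon.

When data arrives, all the processes registered for asynchronous notification must be sent a SIGIO signal.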

While implementing the first step is trivial -- there's nothing to do on the driver's part -- the other steps involve maintaining a dynamic data structure to keep track of the different asynchronous readers; there might be several of these readers. This dynamic data structure, however, doesn't depend on the particular device involved, and the kernel offers a suitable general-purpose implementation so that you don't have to rewrite the same code in every driver.

The general implementation offered by Linux is based on one data structure and two functions (which are called in the second and third steps described earlier). The header that declares related material is <linux/fs.h> -- nothing new here -- and the data structure is called struct fasync_struct. As we did with wait queues, we need to insert a pointer to the structure in the device-specific data structure. Actually, we've already seen such a field in the section "A Sample Implementation: scullpipe".
Here's how scullpipe implements the fasync method:
 
int scull_p_fasync(fasync_file fd, struct file *filp, int mode)
{
  Scull_Pipe *dev = filp->private_data;

  return fasync_helper(fd, filp, mode, &dev->async_queue);
}
It's clear that all the work is performed by fasync_helper. It wouldn't be possible, however, to implement the functionality without a method in the driver, because the helper function needs to access the correct pointer to struct fasync_struct * (here &dev->async_queue), and only the driver can provide the information.

When data arrives, then, the following statement must be executed to signal asynchronous readers. Since new data for the scullpipe reader is generated by a process issuing a write, the statement appears in the write method of scullpipe.
 
 if (dev->async_queue)
   kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
It might appear that we're done, but there's still one thing missing. We must invoke our fasync method when the file is closed to remove the file from the list of active asynchronous readers. Although this call is required only if filp->f_flags has FASYNC set, calling the function anyway doesn't hurt and is the usual implementation. The following lines, for example, are part of the close method for scullpipe:
 
 /* remove this filp from the asynchronously notified filp's */
 scull_p_fasync(-1, filp, 0);
The difficult part of the chapter is over; now we'll quickly detail the llseek method, which is useful and easy to implement.

The llseek method implements the lseek and llseek system calls. We have already stated that if the llseek method is missing from the device's operations, the default implementation in the kernel performs seeks from the beginning of the file and from the current position by modifying filp->f_pos, the current reading/writing position within the file. Please note that for the lseek system call to work correctly, the read and write methods must cooperate by updating the offset item they receive as argument (the argument is usually a pointer to filp->f_pos).

You may need to provide your own llseek method if the seek operation corresponds to a physical operation on the device or if seeking from end-of-file, which is not implemented by the default method, makes sense. A simple example can be seen in the scull driver:
 
loff_t scull_llseek(struct file *filp, loff_t off, int whence)
{
  Scull_Dev *dev = filp->private_data;
  loff_t newpos;

  switch(whence) {
   case 0: /* SEEK_SET */
    newpos = off;
    break;

   case 1: /* SEEK_CUR */
    newpos = filp->f_pos + off;
    break;

   case 2: /* SEEK_END */
    newpos = dev->size + off;
    break;

   default: /* can't happen */
    return -EINVAL;
  }
  if (newpos<0) return -EINVAL;
  filp->f_pos = newpos;
  return newpos;
}
The only device-specific operation here is retrieving the file length from the device. In scull the read and write methods cooperate as needed, as shown in "read and write" in Chapter 3, "Char Drivers".

Although the implementation just shown makes sense for scull, which handles a well-defined data area, most devices offer a data flow rather than a data area (just think about the serial ports or the keyboard), and seeking those devices does not make sense. If this is the case, you can't just refrain from declaring the llseek operation, because the default method allows seeking. Instead, you should use the following code:
 
loff_t scull_p_llseek(struct file *filp, loff_t off, int whence)
{
  return -ESPIPE; /* unseekable */
}
This function comes from the scullpipe device, which isn't seekable; the error code is translated to "Illegal seek," though the symbolic name means "is a pipe." Because the position indicator is meaningless for nonseekable devices, neither read nor write needs to update it during data transfer.

It's interesting to note that since pread and pwrite have been added to the set of supported system calls, the lseek device method is not the only way a user-space program can seek a file. A proper implementation of unseekable devices should allow normal read and write calls while preventing pread and pwrite. This is accomplished by the following line -- the first in both the read and write methods of scullpipe -- which we didn't explain when introducing those methods:
 
  if (f_pos != &filp->f_pos) return -ESPIPE;
Offering access control is sometimes vital for the reliability of a device node. Not only should unauthorized users not be permitted to use the device (a restriction that is enforced by the filesystem permission bits), but sometimes only one authorized user should be allowed to open the device at a time.
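From user space, the effect of -ESPIPE can be observed on any unseekable descriptor. The sketch below uses an ordinary pipe as a stand-in, an assumption made so that it runs without the scull module loaded; is_seekable and try_pread are hypothetical helper names.

```c
#define _XOPEN_SOURCE 700 /* for pread */
#include <errno.h>
#include <unistd.h>

/* Returns 1 if fd accepts lseek, 0 if it reports ESPIPE, -1 on other errors. */
static int is_seekable(int fd)
{
    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)
        return 1;
    return errno == ESPIPE ? 0 : -1;
}

/* Positional read at offset 0; returns the bytes read or a negative errno. */
static long try_pread(int fd, char *buf, size_t len)
{
    ssize_t n = pread(fd, buf, len, 0);

    return n < 0 ? -(long)errno : (long)n;
}
```

On a pipe, is_seekable returns 0 and try_pread returns -ESPIPE, while a plain read on the same descriptor still succeeds. That is the combination an unseekable device is expected to offer: ordinary reads and writes work, positional ones are refused.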

The problem is similar to that of using ttys. In that case, the login process changes the ownership of the device node whenever a user logs into the system, in order to prevent other users from interfering with or sniffing the tty data flow. However, it's impractical to use a privileged program to change the ownership of a device every time it is opened, just to grant unique access to it.

Every device shown in this section has the same behavior as the bare scull device (that is, it implements a persistent memory area) but differs from scull in access control, which is implemented in the open and close operations.

The brute-force way to provide access control is to permit a device to be opened by only one process at a time (single openness). This technique is best avoided because it inhibits user ingenuity. A user might well want to run different processes on the same device, one reading status information while the other is writing data. In some cases, users can get a lot done by running a few simple programs through a shell script, as long as they can access the device concurrently. In other words, implementing a single-open behavior amounts to creating policy, which may get in the way of what your users want to do.

Allowing only a single process to open a device has undesirable properties, but it is also the easiest access control to implement for a device driver, so it's shown here. The source code is extracted from a device called scullsingle.
 
int scull_s_open(struct inode *inode, struct file *filp)
{
  Scull_Dev *dev = &scull_s_device; /* device information */
  int num = NUM(inode->i_rdev);

  if (!filp->private_data && num > 0)
    return -ENODEV; /* not devfs: allow 1 device only */
  spin_lock(&scull_s_lock);
  if (scull_s_count) {
    spin_unlock(&scull_s_lock);
    return -EBUSY; /* already open */
  }
  scull_s_count++;
  spin_unlock(&scull_s_lock);
  /* then, everything else is copied from the bare scull device */

  if ( (filp->f_flags & O_ACCMODE) == O_WRONLY)
    scull_trim(dev);
  if (!filp->private_data)
    filp->private_data = dev;
  MOD_INC_USE_COUNT;
  return 0;     /* success */
}
The close call, on the other hand, marks the device as no longer busy.
 
int scull_s_release(struct inode *inode, struct file *filp)
{
  scull_s_count--; /* release the device */
  MOD_DEC_USE_COUNT;
  return 0;
}
Normally, we recommend that you put the open flag scull_s_count (with the accompanying spinlock, scull_s_lock, whose role is explained in the next subsection) within the device structure (Scull_Dev here) because, conceptually, it belongs to the device. The scull driver, however, uses standalone variables to hold the flag and the lock in order to use the same device structure and methods as the bare scull device and minimize code duplication.

Consider once again the test on the variable scull_s_count just shown. Two separate actions are taken there: (1) the value of the variable is tested, and the open is refused if it is not 0, and (2) the variable is incremented to mark the device as taken. On a single-processor system, these tests are safe because no other process will be able to run between the two actions.

As soon as you get into the SMP world, however, a problem arises. If two processes on two processors attempt to open the device simultaneously, it is possible that they could both test the value of scull_s_count before either modifies it. In this scenario you'll find that, at best, the single-open semantics of the device is not enforced. In the worst case, unexpected concurrent access could create data structure corruption and system crashes.

Rather than a semaphore, scullsingle uses a different locking mechanism called a spinlock. Spinlocks will never put a process to sleep. Instead, if a lock is not available, the spinlock primitives will simply retry, over and over (i.e., "spin"), until the lock is freed. Spinlocks thus have very little locking overhead, but they also have the potential to cause a processor to spin for a long time if somebody hogs the lock. Another advantage of spinlocks over semaphores is that their implementation is empty when compiling code for a uniprocessor system (where these SMP-specific races can't happen). Semaphores are a more general resource that make sense on uniprocessor computers as well as SMP, so they don't get optimized away in the uniprocessor case.

Spinlocks can be the ideal mechanism for small critical sections. Processes should hold spinlocks for the minimum time possible, and must never sleep while holding a lock. Thus, the main scull driver, which exchanges data with user space and can therefore sleep, is not suitable for a spinlock solution. But spinlocks work nicely for controlling access to scull_s_count (even if they still are not the optimal solution, which we will see in Chapter 9, "Interrupt Handling").
Spinlocks can be more complicated than this, and we'll get into the details in Chapter 9, "Interrupt Handling". But the simple case as shown here suits our needs for now, and all of the access-control variants of scull will use simple spinlocks in this manner.

The astute reader may have noticed that whereas scull_s_open acquires the scull_s_lock lock prior to incrementing the scull_s_count flag, scull_s_release takes no such precautions. This code is safe because no other code will change the value of scull_s_count if it is nonzero, so there will be no conflict with this particular decrement.
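The requirement that the test and the update of scull_s_count behave as one indivisible step can be illustrated in user space. The sketch below is an analogy only: it uses C11 atomics instead of the kernel spinlock primitives, and single_count, single_open, and single_release are hypothetical names.

```c
#include <errno.h>
#include <stdatomic.h>

static atomic_int single_count; /* 0 = free, 1 = open; starts at 0 */

/*
 * Test the count and mark the device as taken in one atomic step.
 * Returns 0 on success, -EBUSY if somebody else already holds it.
 */
static int single_open(void)
{
    int expected = 0;

    /* succeeds only if the count is still 0 at the moment of the swap */
    if (!atomic_compare_exchange_strong(&single_count, &expected, 1))
        return -EBUSY;
    return 0;
}

/* Only the current holder calls this, so a plain store is enough. */
static void single_release(void)
{
    atomic_store(&single_count, 0);
}
```

Two threads calling single_open at the same time cannot both see 0, because the comparison and the store happen together. The spinlock in scull_s_open gives the same guarantee by keeping the second caller out until the first has finished both actions.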

The next step beyond a single system-wide lock is to let a single user open a device in multiple processes but allow only one user to have the device open at a time. This solution makes it easy to test the device, since the user can read and write from several processes at once, but assumes that the user takes some responsibility for maintaining the integrity of the data during multiple accesses. This is accomplished by adding checks in the open method; such checks are performed after the normal permission checking and can only make access more restrictive than that specified by the owner and group permission bits. This is the same access policy as that used for ttys, but it doesn't resort to an external privileged program.

Those access policies are a little trickier to implement than single-open policies. In this case, two items are needed: an open count and the uid of the "owner" of the device. Once again, the best place for such items is within the device structure; our example uses global variables instead, for the reason explained earlier for scullsingle. The name of the device is sculluid.

The open call grants access on first open, but remembers the owner of the device. This means that a user can open the device multiple times, thus allowing cooperating processes to work concurrently on the device. At the same time, no other user can open it, thus avoiding external interference. Since this version of the function is almost identical to the preceding one, only the relevant part is reproduced here:
 
 spin_lock(&scull_u_lock);
 if (scull_u_count && 
   (scull_u_owner != current->uid) && /* allow user */
   (scull_u_owner != current->euid) && /* allow whoever did su */
         !capable(CAP_DAC_OVERRIDE)) { /* still allow root */
     spin_unlock(&scull_u_lock);
     return -EBUSY;  /* -EPERM would confuse the user */
 }

 if (scull_u_count == 0)
   scull_u_owner = current->uid; /* grab it */

 scull_u_count++;
 spin_unlock(&scull_u_lock);
We chose to return -EBUSY and not -EPERM, even though the code is performing a permission check, in order to point a user who is denied access in the right direction. The reaction to "Permission denied" is usually to check the mode and owner of the /dev file, while "Device busy" correctly suggests that the user should look for a process already using the device.

This code also checks to see if the process attempting the open has the ability to override file access permissions; if so, the open will be allowed even if the opening process is not the owner of the device. The CAP_DAC_OVERRIDE capability fits the task well in this case.
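The decision made by that fragment can be isolated and tried out in user space. The following is a single-threaded model written for illustration, so it carries no locking; struct uid_gate, uid_gate_open, and uid_gate_release are hypothetical names, and the privileged argument stands in for the result of capable(CAP_DAC_OVERRIDE).

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical user-space model of the sculluid state. */
struct uid_gate {
    int   count; /* number of current opens */
    uid_t owner; /* uid holding the device while count > 0 */
};

/*
 * Same decision as the sculluid open fragment.
 * Returns 0 and counts the open, or -EBUSY if another user holds the device.
 */
static int uid_gate_open(struct uid_gate *g, uid_t uid, uid_t euid,
                         int privileged)
{
    if (g->count &&
        g->owner != uid &&   /* allow the owner */
        g->owner != euid &&  /* allow whoever did su to the owner */
        !privileged)         /* still allow a privileged process */
        return -EBUSY;

    if (g->count == 0)
        g->owner = uid; /* first opener becomes the owner */
    g->count++;
    return 0;
}

/* Drop one open; the owner field only matters while count > 0. */
static void uid_gate_release(struct uid_gate *g)
{
    g->count--;
}
```

Starting from an idle gate, uid 1000 can open it repeatedly, uid 1001 gets -EBUSY, and a process with uid 1001 but euid 1000 is let in, as is any caller for which privileged is nonzero.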

When the device isn't accessible, returning an error is usually the most sensible approach, but there are situations in which you'd prefer to wait for the device.

For example, if a data communication channel is used both to transmit reports on a timely basis (using crontab) and for casual usage according to people's needs, it's much better for the timely report to be slightly delayed rather than fail just because the channel is currently busy.

This is one of the choices that the programmer must make when designing a device driver, and the right answer depends on the particular problem being solved.

The scullwuid device is a version of sculluid that waits for the device on open instead of returning -EBUSY. It differs from sculluid only in the following part of the open operation:
 
 spin_lock(&scull_w_lock);
 while (scull_w_count && 
  (scull_w_owner != current->uid) && /* allow user */
  (scull_w_owner != current->euid) && /* allow whoever did su */
  !capable(CAP_DAC_OVERRIDE)) {
   spin_unlock(&scull_w_lock);
   if (filp->f_flags & O_NONBLOCK) return -EAGAIN; 
   interruptible_sleep_on(&scull_w_wait);
   if (signal_pending(current)) /* a signal arrived */
    return -ERESTARTSYS; /* tell the fs layer to handle it */
   /* else, loop */
   spin_lock(&scull_w_lock);
 }
 if (scull_w_count == 0)
   scull_w_owner = current->uid; /* grab it */
 scull_w_count++;
 spin_unlock(&scull_w_lock);
 
int scull_w_release(struct inode *inode, struct file *filp)
{
  scull_w_count--;
  if (scull_w_count == 0)
    wake_up_interruptible(&scull_w_wait); /* awaken other uid's */
  MOD_DEC_USE_COUNT;
  return 0;
}
The problem with a blocking-open implementation is that it is really unpleasant for the interactive user, who has to keep guessing what is going wrong. The interactive user usually invokes precompiled commands such as cp and tar and can't just add O_NONBLOCK to the open call. Someone who's making a backup using the tape drive in the next room would prefer to get a plain "device or resource busy" message instead of being left to guess why the hard drive is so silent today while tar is scanning it.

This kind of problem (different, incompatible policies for the same device) is best solved by implementing one device node for each access policy. An example of this practice can be found in the Linux tape driver, which provides multiple device files for the same device. Different device files will, for example, cause the drive to record with or without compression, or to automatically rewind the tape when the device is closed.
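
The difference between the two policies can be modeled in a few lines of ordinary C. This is a simplified user-space sketch under stated assumptions, not driver code: the names try_open, dev_state, POLICY_FAIL_BUSY, and POLICY_WAIT are invented for illustration, and the euid and CAP_DAC_OVERRIDE checks of the real open method are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* One policy per device node (names are illustrative only). */
enum access_policy { POLICY_FAIL_BUSY, POLICY_WAIT };

#define OPEN_GRANTED    0   /* the caller now holds the device */
#define OPEN_MUST_SLEEP 1   /* the caller should sleep and retry */

struct dev_state {
    int count;              /* number of current opens */
    unsigned int owner;     /* uid that owns the device while count > 0 */
};

/*
 * Decide what an open attempt by "uid" should do under a given policy.
 * Returns OPEN_GRANTED, OPEN_MUST_SLEEP, or a negative errno value.
 */
int try_open(struct dev_state *st, enum access_policy policy,
             unsigned int uid, bool nonblock)
{
    if (st->count > 0 && st->owner != uid) {
        if (policy == POLICY_FAIL_BUSY)
            return -EBUSY;                      /* sculluid-style node */
        return nonblock ? -EAGAIN : OPEN_MUST_SLEEP; /* scullwuid-style */
    }
    if (st->count == 0)
        st->owner = uid;                        /* grab it */
    st->count++;
    return OPEN_GRANTED;
}
```

Exposing each policy through its own node lets the user pick the behavior by choosing a file name, with no change to the application.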

Another technique to manage access control is creating different private copies of the device depending on the process opening it.

Clearly this is possible only if the device is not bound to a hardware object; scull is an example of such a "software" device. The internals of /dev/tty use a similar technique in order to give its process a different "view" of what the /dev entry point represents. When copies of the device are created by the software driver, we call them virtual devices -- just as virtual consoles use a single physical tty device.

The /dev/scullpriv device node implements virtual devices within the scull package. The scullpriv implementation uses the minor number of the process's controlling tty as a key to access the virtual device. You can nonetheless easily modify the sources to use any integer value for the key; each choice leads to a different policy. For example, using the uid leads to a different virtual device for each user, while using a pid key creates a new device for each process accessing it.

The decision to use the controlling terminal is meant to enable easy testing of the device using input/output redirection: the device is shared by all commands run on the same virtual terminal and is kept separate from the one seen by commands run on another terminal.
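
How the choice of key determines who shares a virtual device can be shown with a small user-space model. This is only a sketch: struct opener, enum key_policy, and key_for are hypothetical names, and the fields stand in for values the driver would read from the current process.

```c
/* Possible key choices; KEY_TTY matches what scullpriv does. */
enum key_policy { KEY_TTY, KEY_UID, KEY_PID };

/* What the driver would know about the opening process (illustrative). */
struct opener {
    int tty_minor;  /* minor number of the controlling tty */
    int uid;        /* user ID */
    int pid;        /* process ID */
};

/* Return the integer used to select the virtual device. */
int key_for(enum key_policy policy, const struct opener *p)
{
    switch (policy) {
    case KEY_UID:
        return p->uid;        /* one device per user */
    case KEY_PID:
        return p->pid;        /* one device per process */
    case KEY_TTY:
    default:
        return p->tty_minor;  /* one device per terminal */
    }
}
```

Two commands run by the same user on the same terminal get equal keys under KEY_TTY and KEY_UID but different keys under KEY_PID, which is why the tty key is convenient for testing with shell redirection.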

The open method looks like the following code. It must look for the right virtual device and possibly create one. The final part of the function is not shown because it is copied from the bare scull, which we've already seen.
 
/* The clone-specific data structure includes a key field */
struct scull_listitem {
  Scull_Dev device;
  int key;
  struct scull_listitem *next;
  
};

/* The list of devices, and a lock to protect it */
struct scull_listitem *scull_c_head;
spinlock_t scull_c_lock;

/* Look for a device or create one if missing */
static Scull_Dev *scull_c_lookfor_device(int key)
{
  struct scull_listitem *lptr, *prev = NULL;

  for (lptr = scull_c_head; lptr && (lptr->key != key); lptr = lptr->next)
    prev=lptr;
  if (lptr) return &(lptr->device);

  /* not found */
  lptr = kmalloc(sizeof(struct scull_listitem), GFP_ATOMIC);
  if (!lptr) return NULL;

  /* initialize the device */
  memset(lptr, 0, sizeof(struct scull_listitem));
  lptr->key = key;
  scull_trim(&(lptr->device)); /* initialize it */
  sema_init(&(lptr->device.sem), 1);

  /* place it in the list */
  if (prev) prev->next = lptr;
  else    scull_c_head = lptr;

  return &(lptr->device);
}

int scull_c_open(struct inode *inode, struct file *filp)
{
  Scull_Dev *dev;
  int key, num = NUM(inode->i_rdev);
 
  if (!filp->private_data && num > 0)
    return -ENODEV; /* not devfs: allow 1 device only */

  if (!current->tty) { 
    PDEBUG("Process \"%s\" has no ctl tty\n",current->comm);
    return -EINVAL;
  }
  key = MINOR(current->tty->device);

  /* look for a scullc device in the list */
  spin_lock(&scull_c_lock);
  dev = scull_c_lookfor_device(key);
  spin_unlock(&scull_c_lock);

  if (!dev) return -ENOMEM;

  /* then, everything else is copied from the bare scull device */
The release method does nothing special. It would normally release the device on last close, but we chose not to maintain an open count in order to simplify the testing of the driver. If the device were released on last close, you wouldn't be able to read the same data after writing to the device unless a background process were to keep it open. The sample driver takes the easier approach of keeping the data, so that at the next open, you'll find it there. The devices are released when scull_cleanup is called.

Here's the release implementation for /dev/scullpriv, which closes the discussion of device methods.
 
int scull_c_release(struct inode *inode, struct file *filp)
{
  /*
   * Nothing to do, because the device is persistent.
   * A `real' cloned device should be freed on last close
   */
  MOD_DEC_USE_COUNT;
  return 0;
}
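
For contrast, here is a sketch of what releasing on last close could look like. It is a user-space model and not the scullpriv code: struct clone_dev and the clone_* functions are hypothetical, and a kernel version would also need locking around the open count.

```c
#include <stdlib.h>
#include <string.h>

/* A toy "cloned device" that tracks how many times it is open. */
struct clone_dev {
    int    open_count;
    char  *data;
    size_t size;
};

void clone_open(struct clone_dev *dev)
{
    dev->open_count++;
}

/* Replace the stored data; returns 0 on success, -1 on allocation failure. */
int clone_write(struct clone_dev *dev, const char *buf, size_t len)
{
    char *copy = malloc(len);

    if (!copy)
        return -1;
    memcpy(copy, buf, len);
    free(dev->data);
    dev->data = copy;
    dev->size = len;
    return 0;
}

/* Drop one reference; free the data when the last opener goes away. */
void clone_release(struct clone_dev *dev)
{
    if (--dev->open_count == 0) {
        free(dev->data);
        dev->data = NULL;
        dev->size = 0;
    }
}
```

With this behavior, data written by one command is gone by the time the next command opens the device, unless some other process keeps it open in between. That is the inconvenience the sample driver avoids by keeping the data.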
Many parts of the device driver API covered in this chapter have changed between the major kernel releases. For those of you needing to make your driver work with Linux 2.0 or 2.2, here is a quick rundown of the differences you will encounter.

Wait Queues in Linux 2.2 and 2.0

A relatively small amount of the material in this chapter changed in the 2.3 development cycle. The one significant change is in the area of wait queues. The 2.2 kernel had a different and simpler implementation of wait queues, but it lacked some important features, such as exclusive sleeps. The new implementation of wait queues was introduced in kernel version 2.3.1.

In the 2.2 release, the type of the first argument to the fasync method changed. In the 2.0 kernel, a pointer to the inode structure for the device was passed instead of the integer file descriptor.

The third argument to the fsync file_operations method (the integer datasync value) was added in the 2.3 development series, meaning that portable code will generally need to include a wrapper function for older kernels. There is a trap, however, for people trying to write portable fsync methods: at least one distributor, which will remain nameless, patched the 2.4 fsync API into its 2.2 kernel. The kernel developers usually (usually...) try to avoid making API changes within a stable series, but they have little control over what the distributors do.

Memory access was handled differently in the 2.0 kernels, because the Linux virtual memory system was less well developed at that time. The new system was the key change that opened 2.1 development, and it brought significant improvements in performance; unfortunately, it was accompanied by yet another set of compatibility headaches for driver writers.

In the 2.0 API, the get_user macro fetched the value at the given address and returned it as its return value. Once again, the macro itself performed no verification of the address.

As an example of how the older calls are used, consider scull one more time. A version of scull using the 2.0 API would call verify_area in this way:
 
 case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
  tmp = scull_quantum;
  scull_quantum = get_user((int *)arg);
  put_user(tmp, (int *)arg);
  break;

 default: /* redundant, as cmd was checked against MAXNR */
  return -ENOTTY;
 }
  return 0;
The 2.0 kernel did not support the poll system call; only the BSD-style select call was available. The corresponding device driver method was thus called select, and operated in a slightly different way, though the actions to be performed are almost identical.

The scull driver deals with the incompatibility by declaring a specific select method to be used when it is compiled for version 2.0 of the kernel:
 
#ifdef __USE_OLD_SELECT__
int scull_p_poll(struct inode *inode, struct file *filp,
         int mode, select_table *table)
{
  Scull_Pipe *dev = filp->private_data;

  if (mode == SEL_IN) {
    if (dev->rp != dev->wp) return 1; /* readable */
    PDEBUG("Waiting to read\n");
    select_wait(&dev->inq, table); /* wait for data */
    return 0;
  }
  if (mode == SEL_OUT) {
    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp". "left" is 0 if the
     * buffer is empty, and it is "1" if it is completely full.
     */
    int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;
    if (left != 1) return 1; /* writable */
    PDEBUG("Waiting to write\n");
    select_wait(&dev->outq, table); /* wait for free space */
    return 0;
  }
  return 0; /* never exception-able */
}
#else /* Use poll instead, already shown */
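
The arithmetic behind the "left" variable is easy to check outside the kernel. The sketch below uses plain array indices instead of the driver's pointers; the function names are hypothetical, and ring_free is an added helper that is not in the driver. It assumes rp and wp are both smaller than size and that size is greater than zero.

```c
#include <stddef.h>

/*
 * Same expression as in the select method, on indices:
 * 0 when the buffer is empty (rp == wp), 1 when wp is right behind rp.
 */
size_t ring_left(size_t rp, size_t wp, size_t size)
{
    return (rp + size - wp) % size;
}

/* Data is available whenever the two indices differ. */
int ring_readable(size_t rp, size_t wp)
{
    return rp != wp;
}

/* The buffer accepts data unless it is completely full. */
int ring_writable(size_t rp, size_t wp, size_t size)
{
    return ring_left(rp, wp, size) != 1;
}

/*
 * Bytes that can still be written. One slot always stays unused so that
 * "full" and "empty" remain distinguishable.
 */
size_t ring_free(size_t rp, size_t wp, size_t size)
{
    return (rp + size - wp - 1) % size;
}
```

For an 8-byte buffer, rp = 3 and wp = 2 gives a "left" value of 1, so the buffer is full and a writer must wait, while rp = wp gives 0 and leaves 7 bytes writable.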

Prior to Linux 2.1, the llseek device method was called lseek instead, and it received different parameters from the current implementation. For that reason, under Linux 2.0 you were not allowed to seek a file, or a device, past the 2 GB limit, even though the llseek system call was already supported.

This chapter introduced the following symbols and header files.

#include

This header declares all the macros used to define ioctl commands. It is currently included by .

Macros used to decode a command. In particular, _IOC_DIR(nr) is an OR combination of _IOC_READ and _IOC_WRITE, while _IOC_TYPE(nr) returns the magic number.
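
The decoding macros can be tried from user space on a Linux system, since the same header is exported to applications. In this sketch the DEMO_ names, the magic number 'k', the command number 6, and the limit 14 are all made up for illustration; only the _IO, _IOWR, and _IOC_ macros come from the kernel header.

```c
#include <linux/ioctl.h>

/* Hypothetical command set, for illustration only. */
#define DEMO_IOC_MAGIC   'k'
#define DEMO_IOC_MAXNR   14
#define DEMO_IOCXQUANTUM _IOWR(DEMO_IOC_MAGIC, 6, int)

/* Return 1 if cmd belongs to this command set, 0 otherwise. */
int demo_cmd_valid(unsigned int cmd)
{
    if (_IOC_TYPE(cmd) != DEMO_IOC_MAGIC)
        return 0;   /* wrong magic number */
    if (_IOC_NR(cmd) > DEMO_IOC_MAXNR)
        return 0;   /* ordinal number out of range */
    return 1;
}

/* Return 1 if the command transfers data in either direction. */
int demo_cmd_has_data(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & (_IOC_READ | _IOC_WRITE)) != 0;
}
```

A driver's ioctl method would typically run checks like demo_cmd_valid before its switch statement and return -ENOTTY for commands that fail them.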

Calling any of these functions puts the current process to sleep on a queue. Usually, you'll choose the interruptible form to implement blocking read and write.

The wait_queue_t type is used when sleeping without calling sleep_on. Wait queue entries must be initialized prior to use; the task argument used is almost always current.

This function selects a runnable process from the run queue. The chosen process can be current or a different one. You won't usually call schedule directly, because the sleep_on functions do it internally.

This function puts the current process into a wait queue without scheduling immediately. It is designed to be used by the poll method of device drivers.

This function is a "helper" for implementing the fasync device method. The mode argument is the same value that is passed to the method, while fa points to a device-specific fasync_struct *.
