Linux Device Drivers, 2nd Edition
2nd Edition June 2001
ISBN 0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 13
mmap and DMA
This chapter delves into the area of Linux memory management, with an
emphasis on techniques that are useful to the device driver writer.
The material in this chapter is somewhat advanced, and not everybody
will need a grasp of it. Nonetheless, many tasks can only be done
through digging more deeply into the memory management subsystem; it
also provides an interesting look into how an important part of the
kernel works.
The material in this chapter is divided into three sections. The
first covers the implementation of the mmap system call, which allows the mapping of device memory directly into a
user process's address space. We then cover the kernel
kiobuf mechanism, which provides direct access to
user memory from kernel space. The kiobuf system
may be used to implement "raw I/O'' for certain kinds of devices.
The final section covers direct memory access (DMA) I/O operations,
which essentially provide peripherals with direct access to system
memory.
Of course, all of these techniques require an understanding of how
Linux memory management works, so we start with an overview of that
subsystem.
Linux is, of course, a virtual memory system, meaning that the
addresses seen by user programs do not directly correspond to the
physical addresses used by the hardware. Virtual memory introduces a
layer of indirection, which allows a number of nice things. With
virtual memory, programs running on the system can allocate far more
memory than is physically available; indeed, even a single process can
have a virtual address space larger than the system's physical memory.
Virtual memory also allows playing a number of tricks with the
process's address space, including mapping in device memory.
Several different kinds of addresses are in use on a Linux system,
and it is worth distinguishing them before going further:
- User virtual addresses: the regular addresses seen by user-space
programs.
- Physical addresses: the addresses used between the processor and
the system's memory.
- Bus addresses: the addresses used between peripheral buses and
memory; often they are the same as the physical addresses used by the
processor, but that is not necessarily the case.
- Kernel logical addresses: the normal address space of the kernel,
which maps a portion of main memory. Logical addresses usually differ
from physical addresses only by a constant offset; memory returned by
kmalloc has a logical address.
- Kernel virtual addresses: like logical addresses in that they map
from kernel space, but they need not have a direct, linear
correspondence to physical memory; vmalloc and
kmap return virtual addresses.
Recent developments have eliminated the limitations on memory, and
32-bit systems can now work with well over 4 GB of system memory
(assuming, of course, that the processor itself can address that much
memory). The limitation on how much memory can be directly mapped
with logical addresses remains, however. Only the lowest portion of
memory (up to 1 or 2 GB, depending on the hardware and the kernel
configuration) has logical addresses; the rest (high memory) does not.
High memory can require 64-bit physical addresses, and the kernel must
set up explicit virtual address mappings to manipulate it. Thus, many
kernel functions are limited to low memory only; high memory tends to
be reserved for user-space process pages.
The kernel doesn't need to worry about doing page-table lookups during
normal program execution, because they are done by the hardware.
Nonetheless, the kernel must arrange things so that the hardware can
do its work. It must build the page tables and look them up whenever
the processor reports a page fault, that is, whenever the page
associated with a virtual address needed by the processor is not
present in memory. Device drivers, too, must be able to build page
tables and handle faults when implementing mmap.
- pgd_offset(mm, address)
- pmd_offset(dir, address)
- pte_offset(dir, address)
These inline functions are used to retrieve the pgd,
pmd, and pte entries associated
with address. Page-table lookup begins with a
pointer to struct mm_struct. The pointer associated
with the memory map of the current process is
current->mm, while the pointer to kernel space
is described by &init_mm. Two-level processors
define pmd_offset(dir,add) as (pmd_t
*)dir, thus folding the pmd over the
pgd. Functions that scan page tables are always
declared as inline, and the compiler optimizes out
any pmd lookup.
- pte_present(pte)
This macro returns a boolean value that indicates whether the data
page is currently in memory. This is the most used of several
functions that access the low bits in the
pte -- the bits that are discarded by
pte_page. Pages may be absent, of course, if the
kernel has swapped them to disk (or if they have never been loaded).
The page tables themselves, however, are always present in the current
Linux implementation. Keeping page tables in memory simplifies the
kernel code because pgd_offset and friends never
fail; on the other hand, even a process with a "resident storage
size'' of zero keeps its page tables in real RAM, wasting some memory
that might be better used elsewhere.
Just seeing the list of these functions is not enough for you to
be proficient in the Linux memory management algorithms; real memory
management is much more complex and must deal with other
complications, like cache coherence. The previous list should
nonetheless be sufficient to give you a feel for how page management
is implemented; it is also about all that you will need to know, as a
device driver writer, to work occasionally with page tables. You can
get more information from the include/asm and
mm subtrees of the kernel source.
The memory areas of a process can be seen by looking in
/proc/pid/maps (where pid, of course, is replaced by a
process ID). /proc/self is a special case of
/proc/pid, because it always
refers to the current process. Each line of the file is made up of
the following fields:
start-end perm offset major:minor inode image
Each field in /proc/*/maps (except the image
name) corresponds to a field in struct
vm_area_struct, and is described in the following list.
- perm
A bit mask with the memory area's read, write, and execute
permissions. This field describes what the process is allowed to do
with pages belonging to the area. The last character in the field is
either p for "private'' or s
for "shared.''
- major:minor
The major and minor numbers of the device holding the file that
has been mapped. Confusingly, for device mappings, the major and
minor numbers refer to the disk partition holding the device special file
that was opened by the user, and not the device itself.
- image
The name of the file (usually an executable image) that has been
mapped.
A driver that implements the mmap method needs to
fill a VMA structure in the address space of the process mapping the
device. The driver writer should therefore have at least a minimal
understanding of VMAs in order to use them.
Let's look at the most important fields in struct
vm_area_struct (defined in
<linux/mm.h>). These fields may be used by
device drivers in their mmap implementation. Note
that the kernel maintains lists and trees of VMAs to optimize area
lookup, and several fields of vm_area_struct are
used to maintain this organization. VMAs thus can't be created at
will by a driver, or the structures will break. The main fields of
VMAs are as follows (note the similarity between these fields and the
/proc output we just saw):
- unsigned long vm_start;
- unsigned long vm_end;
The beginning and ending virtual addresses covered by this memory
area.
- unsigned long vm_pgoff;
The offset of the area in the file, in pages. When a file or device
is mapped, this is the file position of the first page mapped in this
area.
- unsigned long vm_flags;
A set of flags describing this area. The flags of the most interest
to device driver writers are VM_IO and
VM_RESERVED. VM_IO marks a VMA
as being a memory-mapped I/O region. Among other things, the
VM_IO flag will prevent the region from being
included in process core dumps. VM_RESERVED tells
the memory management system not to attempt to swap out this VMA; it
should be set in most device mappings.
- struct vm_operations_struct *vm_ops;
A set of functions that the kernel may invoke to operate on this
memory area; the driver's mmap method can replace
it to customize the area's behavior.
- void *vm_private_data;
A field that may be used by the driver to store its own information.
Like struct vm_area_struct, the
vm_operations_struct contains a number of methods;
those of interest here are described in the following list.
- protect
This method is intended to change the protection on a memory area,
but is currently not used. Memory protection is handled by the
page tables, and the kernel sets up the page-table entries
separately.
- struct page *(*nopage)(struct vm_area_struct *area, unsigned long
address, int write_access);
When a process tries to access a page that belongs to a valid VMA, but
that is currently not in memory, the nopage method is called (if it is defined) for the related area. The method
returns the struct page pointer for the physical
page, after, perhaps, having read it in from secondary storage. If
the nopage method isn't defined for the area, an
empty page is allocated by the kernel. The third argument,
write_access, counts as "no-share'': a nonzero
value means the page must be owned by the current process, whereas
0 means that sharing is possible.
- wppage
This method handles write-protected page faults but is currently
unused. The kernel handles attempts to write over a protected page
without invoking the area-specific callback. Write-protect faults are
used to implement copy-on-write. A private page can be shared across
processes until one process writes to it. When that happens, the page
is cloned, and the process writes on its own copy of the page. If the
whole area is marked as read-only, a SIGSEGV is
sent to the process, and the copy-on-write is not performed.
- swapout
This method is called when a page is selected to be swapped out. A
return value of 0 signals success; any other value signals an
error. In case of error, the process owning the page is sent a
SIGBUS. It is highly unlikely that a driver will
ever need to implement swapout; device mappings
are not something that the kernel can just write to disk.
Memory mapping is one of the most interesting features of modern Unix
systems. As far as drivers are concerned, memory mapping can be used
to provide user programs with direct access to device memory.
cat /proc/731/maps
08048000-08327000 r-xp 00000000 08:01 55505 /usr/X11R6/bin/XF86_SVGA
08327000-08369000 rw-p 002de000 08:01 55505 /usr/X11R6/bin/XF86_SVGA
40015000-40019000 rw-s fe2fc000 08:01 10778 /dev/mem
40131000-40141000 rw-s 000a0000 08:01 10778 /dev/mem
40141000-40941000 rw-s f4000000 08:01 10778 /dev/mem
...
The full list of the X server's VMAs is lengthy, but most of the
entries are not of interest here. We do see, however, three separate
mappings of /dev/mem, which give some insight
into how the X server works with the video card. The first mapping
shows a 16 KB region mapped at fe2fc000. This
address is far above the highest RAM address on the system; it is,
instead, a region of memory on a PCI peripheral (the video card). It
will be a control region for that card. The middle mapping is at
a0000, which is the standard location for video RAM
in the 640 KB ISA hole. The last /dev/mem mapping is a rather larger one at f4000000 and is
the video memory itself. These regions can also be seen in
/proc/iomem:
Mapping a device means associating a range of user-space addresses to
device memory. Whenever the program reads or writes in the assigned
address range, it is actually accessing the device. In the X server
example, using mmap allows quick and easy access
to the video card's memory. For a performance-critical application
like this, direct access makes a large difference.
As you might suspect, not every device lends itself to the
mmap abstraction; it makes no sense, for
instance, for serial ports and other stream-oriented devices. Another
limitation of mmap is that mapping is
PAGE_SIZE grained. The kernel can dispose of
virtual addresses only at the level of page tables; therefore, the
mapped area must be a multiple of PAGE_SIZE and
must live in physical memory starting at an address that is a multiple
of PAGE_SIZE. The kernel accommodates size
granularity by making a region slightly bigger if its size isn't a
multiple of the page size.
These limits are not a big constraint for drivers, because the program
accessing the device is device dependent anyway. It needs to know how
to make sense of the memory region being mapped, so the
PAGE_SIZE alignment is not a problem. A bigger
constraint exists when ISA devices are used on some non-x86 platforms,
because their hardware view of ISA may not be contiguous. For
example, some Alpha computers see ISA memory as a scattered set of
8-bit, 16-bit, or 32-bit items, with no direct mapping. In such
cases, you can't use mmap at all. The inability
to perform direct mapping of ISA addresses to Alpha addresses is due
to the incompatible data transfer specifications of the two systems.
Whereas early Alpha processors could issue only 32-bit and 64-bit
memory accesses, ISA can do only 8-bit and 16-bit transfers, and
there's no way to transparently map one protocol onto the other.
There are sound advantages to using mmap when
it's feasible to do so. For instance, we have already looked at the X
server, which transfers a lot of data to and from video memory;
mapping the graphic display to user space dramatically improves the
throughput, as opposed to an
lseek/write implementation. Another typical example is a program controlling a PCI
device. Most PCI peripherals map their control registers to a memory
address, and a demanding application might prefer to have direct
access to the registers instead of repeatedly having to call
ioctl to get its work done.
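For reference, the mmap method in the 2.4 kernel's file_operations structure has the following prototype:

```c
int (*mmap) (struct file *filp, struct vm_area_struct *vma);
```

Note that, unlike the system call, the method receives no offset or length arguments; that information is already filled in by the kernel in the VMA.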
The filp argument in the method is the same as that
introduced in Chapter 3, "Char Drivers", while vma
contains the information about the virtual address range that is used
to access the device. Much of the work has thus been done by the
kernel; to implement mmap, the driver only has to
build suitable page tables for the address range and, if necessary,
replace vma->vm_ops with a new set of
operations.
- int remap_page_range(unsigned long virt_add, unsigned long
phys_add, unsigned long size, pgprot_t prot);
This function builds the page tables that map the physical address
range starting at phys_add into the virtual address
range starting at virt_add, for
size bytes, using the protection bits found in
prot.
The arguments to remap_page_range are fairly
straightforward, and most of them are already provided to you in the
VMA when your mmap method is called. The one
complication has to do with caching: usually, references to device
memory should not be cached by the processor. Often the system BIOS
will set things up properly, but it is also possible to disable
caching of specific VMAs via the protection field. Unfortunately,
disabling caching at this level is highly processor dependent. The
curious reader may wish to look at the function
pgprot_noncached from
drivers/char/mem.c to see what's involved. We
won't discuss the topic further here.
If your driver needs to do a simple, linear mapping of device memory
into a user address space, remap_page_range is
almost all you really need to do the job. The following code comes
from drivers/char/mem.c and shows how this task
is performed in a typical module called
simple (Simple Implementation Mapping Pages
with Little Enthusiasm):
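A sketch of that mmap implementation follows; it is kernel code built on the 2.4 interfaces just described, and VMA_OFFSET is a helper macro from the book's sample code that yields the byte offset requested by the mapping:

```c
static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    /* Offsets beyond physical memory are I/O memory */
    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    /* Build the page tables for the whole range in one shot */
    if (remap_page_range(vma->vm_start, offset,
                         vma->vm_end - vma->vm_start,
                         vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}
```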
The /dev/mem code checks to see if the requested
offset (stored in vma->vm_pgoff) is beyond
physical memory; if so, the VM_IO VMA flag is set to
mark the area as being I/O memory. The VM_RESERVED
flag is always set to keep the system from trying to swap this area
out. Then it is just a matter of calling
remap_page_range to create the necessary page
tables.
Here, we will provide open and
close operations for our VMA. These operations
will be called anytime a process opens or closes the VMA; in
particular, the open method will be invoked
anytime a process forks and creates a new reference to the VMA. The
open and close VMA methods
are called in addition to the processing performed by the kernel, so
they need not reimplement any of the work done there. They exist as a
way for drivers to do any additional processing that they may require.
So, we will override the default vma->vm_ops
with operations that keep track of the usage count. The code is quite
simple -- a complete mmap implementation for a
modularized /dev/mem looks like the following:
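A sketch of such a complete implementation, as kernel code (2.4-era module usage counts; VMA_OFFSET is a helper macro from the book's sample code):

```c
void simple_vma_open(struct vm_area_struct *vma)
{
    MOD_INC_USE_COUNT;
}

void simple_vma_close(struct vm_area_struct *vma)
{
    MOD_DEC_USE_COUNT;
}

static struct vm_operations_struct simple_remap_vm_ops = {
    open:  simple_vma_open,
    close: simple_vma_close,
};

static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (vma->vm_ops)
        return -EINVAL;  /* safety check: should be NULL on entry */

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    if (remap_page_range(vma->vm_start, offset,
                         vma->vm_end - vma->vm_start, vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);  /* the kernel won't call open for us here */
    return 0;
}
```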
This code relies on the fact that the kernel initializes to
NULL the vm_ops field in the
newly created area before calling
f_op->mmap. The code just shown checks the
current value of the pointer as a safety measure, should something
change in future kernels.
The nopage method, therefore, must be implemented
if you want to support the mremap system
call. But once you have nopage, you can choose to
use it extensively, with some limitations (described later). This
method is shown in the next code fragment. In this implementation of
mmap, the device method only replaces
vma->vm_ops. The nopage method takes care of "remapping'' one page at a time and returning
the address of its struct page structure. Because
we are just implementing a window onto physical memory here, the
remapping step is simple -- we need only locate and return a
pointer to the struct page for the desired address.
An implementation of /dev/mem using
nopage looks like the following:
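A sketch of that nopage-based implementation, as kernel code (VMA_OFFSET and the open/close methods are as in the earlier simple examples):

```c
struct page *simple_vma_nopage(struct vm_area_struct *vma,
                               unsigned long address, int write_access)
{
    struct page *pageptr;
    unsigned long physaddr = address - vma->vm_start + VMA_OFFSET(vma);

    /* Physical address -> logical address -> struct page */
    pageptr = virt_to_page(__va(physaddr));
    get_page(pageptr);   /* increment the page's reference count */
    return pageptr;
}

static struct vm_operations_struct simple_nopage_vm_ops = {
    open:   simple_vma_open,
    close:  simple_vma_close,
    nopage: simple_vma_nopage,
};

static int simple_nopage_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    /* No page tables built here: nopage fills them in on demand */
    vma->vm_ops = &simple_nopage_vm_ops;
    simple_vma_open(vma);
    return 0;
}
```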
Since, once again, we are simply mapping main memory here, the
nopage function need only find the correct
struct page for the faulting address and increment
its reference count. The required sequence of events is thus to
calculate the desired physical address, turn it into a logical address
with __va, and then finally to turn it
into a struct page with
virt_to_page. It would be possible, in general,
to go directly from the physical address to the struct
page, but such code would be difficult to make portable
across architectures. Such code might be necessary, however, if one
were trying to map high memory, which, remember, has no logical
addresses. simple, being simple, does not
worry about that (rare) case.
The nopage method normally returns a pointer to a
struct page. If, for some reason, a normal page
cannot be returned (e.g., the requested address is beyond the device's
memory region), NOPAGE_SIGBUS can be returned to
signal the error. nopage can also return
NOPAGE_OOM to indicate failures caused by resource
limitations.
All the examples we've seen so far are reimplementations of
/dev/mem; they remap physical addresses into user
space. The typical driver, however, wants to map only the small
address range that applies to its peripheral device, not all of
memory. In order to map to user space only a subset of the whole
memory range, the driver needs only to play with the offsets. The
following lines will do the trick for a driver mapping a region of
simple_region_size bytes, beginning at physical
address simple_region_start (which should be page
aligned).
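Those lines look more or less like this (kernel code inside the mmap method; simple_region_start and simple_region_size are assumed to describe the device's I/O region):

```c
unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

/* Refuse to map past the end of the device's I/O region */
if (vsize > psize)
    return -EINVAL; /* spans too high */
remap_page_range(vma->vm_start, physical, vsize, vma->vm_page_prot);
```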
In addition to calculating the offsets, this code introduces a check
that reports an error when the program tries to map more memory than
is available in the I/O region of the target device. In this code,
psize is the physical I/O size that is left after
the offset has been specified, and vsize is the
requested size of virtual memory; the function refuses to map
addresses that extend beyond the allowed memory range.
Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device
area. If your driver has no nopage method, it
will never be notified of this extension, and the additional area will
map to the zero page. As a driver writer, you may well want to
prevent this sort of behavior; mapping the zero page onto the end of
your region is not an explicitly bad thing to do, but it is highly
unlikely that the programmer wanted that to happen.
Of course, a more thorough implementation could check to see if the
faulting address is within the device area, and perform the remapping
if that is the case. Once again, however, nopage will not work with PCI memory areas, so extension of PCI mappings is
not possible.
In Linux, a page of physical addresses is marked as "reserved'' in
the memory map to indicate that it is not available for memory
management. On the PC, for example, the range between 640 KB and 1 MB
is marked as reserved, as are the pages that host the kernel code
itself.
The limitations of remap_page_range can be seen
by running mapper, one of the sample
programs in misc-progs in the files provided on
the O'Reilly FTP site. mapper is a simple
tool that can be used to quickly test the mmap system call; it maps read-only parts of a file based on the
command-line options and dumps the mapped region to standard output.
The following session, for instance, shows that
/dev/mem doesn't map the physical page located at
address 64 KB -- instead we see a page full of zeros (the host
computer in this example is a PC, but the result would be the same on
other platforms):
morgana.root# ./mapper /dev/mem 0x10000 0x1000 | od -Ax -t x1
mapped "/dev/mem" from 65536 to 69632
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000
The inability of remap_page_range to deal with
RAM suggests that a device like scullp can't easily implement mmap, because its device
memory is conventional RAM, not I/O memory. Fortunately, a relatively
easy workaround is available to any driver that needs to map RAM into
user space; it uses the nopage method that we
have seen earlier.
The way to map real RAM to user space is to use
vm_ops->nopage to deal with page faults one at a
time. A sample implementation is part of the
scullp module, introduced in Chapter 7, "Getting Hold of Memory".
scullp is the page-oriented char device.
Because it is page oriented, it can implement
mmap on its memory. The code implementing
memory mapping uses some of the concepts introduced earlier in "Memory Management in Linux".
Before examining the code, let's look at the design choices that
affect the mmap implementation in
scullp.
-
scullp doesn't release device memory as
long as the device is mapped. This is a matter of policy rather than a
requirement, and it is different from the behavior of
scull and similar devices, which are
truncated to a length of zero when opened for writing. Refusing to
free a mapped scullp device allows a
process to overwrite regions actively mapped by another process, so
you can test and see how processes and device memory interact. To
avoid releasing a mapped device, the driver must keep a count of
active mappings; the vmas field in the device
structure is used for this purpose.
-
Memory mapping is performed only when the scullp order parameter is 0.
The parameter controls how get_free_pages is invoked (see Chapter 7, "Getting Hold of Memory", "get_free_page and Friends").
This choice is dictated by the internals of
get_free_pages, the allocation engine
exploited by scullp. To maximize allocation
performance, the Linux kernel maintains a list of free pages for each
allocation order, and only the page count of the first page in a
cluster is incremented by get_free_pages and
decremented by free_pages. The
mmap method is disabled for a
scullp device if the allocation order is
greater than zero, because nopage
deals with
single pages rather than clusters of pages. (Return to "A scull Using
Whole Pages: scullp" in Chapter 7, "Getting Hold of Memory" if you need
a refresher on
scullp and the memory allocation order
value.)
This implementation of scullp_mmap is very short,
because it relies on the nopage function to do
all the interesting work:
int scullp_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = INODE_FROM_F(filp);

    /* refuse to map if order is not 0 */
    if (scullp_devices[MINOR(inode->i_rdev)].order)
        return -ENODEV;

    /* don't do anything here: "nopage" will fill the holes */
    vma->vm_ops = &scullp_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = scullp_devices + MINOR(inode->i_rdev);
    scullp_vma_open(vma);
    return 0;
}
The purpose of the leading conditional is to avoid mapping devices
whose allocation order is not 0. scullp's
operations are stored in the vm_ops field, and a
pointer to the device structure is stashed in the
vm_private_data field. At the end,
vm_ops->open is called to update the usage count
for the module and the count of active mappings for the device.
void scullp_vma_open(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);
    dev->vmas++;
    MOD_INC_USE_COUNT;
}

void scullp_vma_close(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);
    dev->vmas--;
    MOD_DEC_USE_COUNT;
}
The function scullp_vma_to_dev simply returns the
contents of the vm_private_data field. It exists
as a separate function because kernel versions prior to 2.4 lacked
that field, requiring that other means be used to get that pointer.
See "Backward Compatibility" at the end of this chapter for
details.
Most of the work is then performed by nopage. In
the scullp implementation, the
address parameter to nopage is
used to calculate an offset into the device; the offset is then used
to look up the correct page in the scullp memory tree.
struct page *scullp_vma_nopage(struct vm_area_struct *vma,
                               unsigned long address, int write)
{
    unsigned long offset;
    ScullP_Dev *ptr, *dev = scullp_vma_to_dev(vma);
    struct page *page = NOPAGE_SIGBUS;
    void *pageptr = NULL; /* default to "missing" */

    down(&dev->sem);
    offset = (address - vma->vm_start) + VMA_OFFSET(vma);
    if (offset >= dev->size) goto out; /* out of range */

    /*
     * Now retrieve the scullp device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    offset >>= PAGE_SHIFT; /* offset is a number of pages */
    for (ptr = dev; ptr && offset >= dev->qset;) {
        ptr = ptr->next;
        offset -= dev->qset;
    }
    if (ptr && ptr->data) pageptr = ptr->data[offset];
    if (!pageptr) goto out; /* hole or end-of-file */
    page = virt_to_page(pageptr);

    /* got it, now increment the count */
    get_page(page);
out:
    up(&dev->sem);
    return page;
}
scullp uses memory obtained with
get_free_pages. That memory is addressed using
logical addresses, so all scullp_nopage has to do
to get a struct page pointer is to call
virt_to_page.
The scullp device now works as expected, as
you can see in this sample output from the
mapper utility. Here we send a directory
listing of /dev (which is long) to the
scullp device, and then use the
mapper utility to look at pieces of that
listing with mmap.
morgana% ls -l /dev > /dev/scullp
morgana% ./mapper /dev/scullp 0 140
mapped "/dev/scullp" from 0 to 140
total 77
-rwxr-xr-x 1 root root 26689 Mar 2 2000 MAKEDEV
crw-rw-rw- 1 root root 14, 14 Aug 10 20:55 admmidi0
morgana% ./mapper /dev/scullp 8192 200
mapped "/dev/scullp" from 8192 to 8392
0
crw-------   1 root     root     113,   1 Mar 26  1999 cum1
crw-------   1 root     root     113,   2 Mar 26  1999 cum2
crw-------   1 root     root     113,   3 Mar 26  1999 cum3
Although it's rarely necessary, it's interesting to see how a driver
can map a virtual address to user space using
mmap. A true virtual address, remember, is an
address returned by a function like vmalloc or
kmap -- that is, a virtual address mapped in
the kernel page tables. The code in this section is taken from
scullv, which is the module that works like
scullp but allocates its storage through
vmalloc.
Most of the scullv implementation is like
the one we've just seen for scullp, except
that there is no need to check the order parameter
that controls memory allocation. The reason for this is that
vmalloc allocates its pages one at a time,
because single-page allocations are far more likely to succeed than
multipage allocations. Therefore, the allocation order problem
doesn't apply to vmalloced space.
Most of the work of vmalloc is building page
tables to access allocated pages as a continuous address range. The
nopage method, instead, must pull the page
tables back apart in order to return a struct page
pointer to the caller. Therefore, the nopage implementation for scullv must scan the
page tables to retrieve the page map entry associated with the page.
The function is similar to the one we saw for
scullp, except at the end. This code
excerpt only includes the part of nopage that
differs from scullp:
pgd_t *pgd; pmd_t *pmd; pte_t *pte;
unsigned long lpage;

/*
 * After scullv lookup, "page" is now the address of the page
 * needed by the current process. Since it's a vmalloc address,
 * first retrieve the unsigned long value to be looked up
 * in page tables.
 */
lpage = VMALLOC_VMADDR(pageptr);
spin_lock(&init_mm.page_table_lock);
pgd = pgd_offset(&init_mm, lpage);
pmd = pmd_offset(pgd, lpage);
pte = pte_offset(pmd, lpage);
page = pte_page(*pte);
spin_unlock(&init_mm.page_table_lock);

/* got it, now increment the count */
get_page(page);
out:
up(&dev->sem);
return page;
The page tables are looked up using the functions introduced at the
beginning of this chapter. The page directory used for this purpose is
stored in the memory structure for kernel space,
init_mm. Note that
scullv obtains the
page_table_lock prior to traversing the page
tables. If that lock were not held, another processor could make a
change to the page table while scullv was
halfway through the lookup process, leading to erroneous results.
Based on this discussion, you might also want to map addresses
returned by ioremap to user space. This mapping
is easily accomplished because you can use
remap_page_range directly, without implementing
methods for virtual memory areas. In other words,
remap_page_range is already usable for building
new page tables that map I/O memory to user space; there's no need to
look in the kernel page tables built by vmalloc as
we did in scullv.
As of version 2.3.12, the Linux kernel supports an I/O abstraction
called the kernel I/O buffer, or
kiobuf. The kiobuf interface is intended to hide
much of the complexity of the virtual memory system from device
drivers (and other parts of the system that do I/O). Many features
are planned for kiobufs, but their primary use in the 2.4 kernel is to
facilitate the mapping of user-space buffers into the kernel.
- int alloc_kiovec(int nr, struct kiobuf **iovec);
- void free_kiovec(int nr, struct kiobuf **iovec);
These functions allocate and free, respectively, an array of
nr kiobufs (a kiovec).
- int map_user_kiobuf(int rw, struct kiobuf *iobuf, unsigned long
address, size_t len);
- void unmap_kiobuf(struct kiobuf *iobuf);
map_user_kiobuf maps len bytes
of the user-space buffer starting at address into
the given kiobuf; unmap_kiobuf undoes the mapping
when the operation is complete.
- int lock_kiovec(int nr, struct kiobuf *iovec[], int wait);
This function locks the pages of an already-mapped kiovec down in
memory.
Locking a kiovec in this manner (with
lock_kiovec) is unnecessary, however, for most
applications of kiobufs seen in device drivers.
Unix systems have long provided a "raw'' interface to some
devices -- block devices in particular -- which performs I/O
directly from a user-space buffer and avoids copying data through the
kernel. In some cases much improved performance can be had in this
manner, especially if the data being transferred will not be used
again in the near future. For example, disk backups typically read a
great deal of data from the disk exactly once, then forget about it.
Running the backup via a raw interface will avoid filling the system
buffer cache with useless data.
The Linux kernel has traditionally not provided a raw interface, for a
number of reasons. As the system gains in popularity, however, more
applications that expect to be able to do raw I/O (such as large
database management systems) are being ported. So the 2.3 development
series finally added raw I/O; the driving force behind the kiobuf
interface was the need to provide this capability.
Raw I/O is not always the great performance boost that some people
think it should be, and driver writers should not rush out to add the
capability just because they can. The overhead of setting up a raw
transfer can be significant, and the advantages of buffering data in
the kernel are lost. For example, note that raw I/O operations almost
always must be synchronous -- the write system
call cannot return until the operation is complete. Linux currently
lacks the mechanisms that user programs need to be able to safely
perform asynchronous raw I/O on a user buffer.
In this section, we add a raw I/O capability to the
sbull sample block driver. When kiobufs
are available, sbull actually registers two
devices. The block sbull device was
examined in detail in Chapter 12, "Loading Block Drivers". What we didn't see in
that chapter was a second, char device (called
sbullr), which provides raw access to the
RAM-disk device. Thus, /dev/sbull0 and
/dev/sbullr0 access the same memory; the former
uses the traditional, buffered mode, while the latter provides raw
access via the kiobuf mechanism.
It is worth noting that in Linux systems, there is no need for block
drivers to provide this sort of interface. The raw device, in
drivers/char/raw.c, provides this capability in
an elegant, general way for all block devices. The block drivers need
not even know they are doing raw I/O. The raw I/O code in
sbull is essentially a simplification of
the raw device code for demonstration purposes.
Raw I/O to a block device must always be sector aligned, and its
length must be a multiple of the sector size. Other kinds of devices,
such as tape drives, may not have the same constraints.
sbullr behaves like a block device and
enforces the alignment and length requirements. To that end, it
defines a few symbols:
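Those symbols are along these lines (a 512-byte hard sector is assumed, matching the check described next):

```c
#define SBULLR_SECTOR 512        /* insist on this sector size */
#define SBULLR_SECTOR_MASK (SBULLR_SECTOR - 1)
#define SBULLR_SECTOR_SHIFT 9    /* log2(SBULLR_SECTOR) */
```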
The sbullr raw device will be registered
only if the hard-sector size is equal to
SBULLR_SECTOR. There is no real reason why a
larger hard-sector size could not be supported, but it would
complicate the sample code unnecessarily.
The sbullr implementation adds little to
the existing sbull code. In particular,
the open and close methods
from sbull are used without modification.
Since sbullr is a char device, however, it
needs read and write methods. Both are defined to use a single transfer function, as
follows:
ssize_t sbullr_read(struct file *filp, char *buf, size_t size,
                    loff_t *off)
{
    Sbull_Dev *dev = sbull_devices +
                    MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, buf, size, off, READ);
}

ssize_t sbullr_write(struct file *filp, const char *buf, size_t size,
                     loff_t *off)
{
    Sbull_Dev *dev = sbull_devices +
                    MINOR(filp->f_dentry->d_inode->i_rdev);
    return sbullr_transfer(dev, (char *) buf, size, off, WRITE);
}
static int sbullr_transfer (Sbull_Dev *dev, char *buf, size_t count,
                loff_t *offset, int rw)
{
    struct kiobuf *iobuf;
    int result;

    /* Only block alignment and size allowed */
    if ((*offset & SBULLR_SECTOR_MASK) || (count & SBULLR_SECTOR_MASK))
        return -EINVAL;
    if ((unsigned long) buf & SBULLR_SECTOR_MASK)
        return -EINVAL;

    /* Allocate an I/O vector */
    result = alloc_kiovec(1, &iobuf);
    if (result)
        return result;

    /* Map the user I/O buffer and do the I/O. */
    result = map_user_kiobuf(rw, iobuf, (unsigned long) buf, count);
    if (result) {
        free_kiovec(1, &iobuf);
        return result;
    }
    spin_lock(&dev->lock);
    result = sbullr_rw_iovec(dev, iobuf, rw,
                    *offset >> SBULLR_SECTOR_SHIFT,
                    count >> SBULLR_SECTOR_SHIFT);
    spin_unlock(&dev->lock);

    /* Clean up and return. */
    unmap_kiobuf(iobuf);
    free_kiovec(1, &iobuf);
    if (result > 0)
        *offset += result << SBULLR_SECTOR_SHIFT;
    return result << SBULLR_SECTOR_SHIFT;
}
static int sbullr_rw_iovec(Sbull_Dev *dev, struct kiobuf *iobuf, int rw,
                int sector, int nsectors)
{
    struct request fakereq;
    struct page *page;
    int offset = iobuf->offset, ndone = 0, pageno, result;

    /* Perform I/O on each sector */
    fakereq.sector = sector;
    fakereq.current_nr_sectors = 1;
    fakereq.cmd = rw;

    for (pageno = 0; pageno < iobuf->nr_pages; pageno++) {
        page = iobuf->maplist[pageno];

        while (ndone < nsectors) {
            /* Fake up a request structure for the operation */
            fakereq.buffer = (void *) (kmap(page) + offset);
            result = sbull_transfer(dev, &fakereq);
            kunmap(page);
            if (result == 0)
                return ndone;

            /* Move on to the next one */
            ndone++;
            fakereq.sector++;
            offset += SBULLR_SECTOR;
            if (offset >= PAGE_SIZE) {
                offset = 0;
                break;
            }
        }
    }
    return ndone;
}
Some quick tests copying data show that a copy to or from an
sbullr device takes roughly two-thirds as much
system time as the same copy to the block
sbull device. The savings is gained by
avoiding the extra copy
through the buffer cache. Note that if the same data is read several
times over, that savings will evaporate -- especially for a real
hardware device. Raw device access is often not the best approach,
but for some applications it can be a major improvement.
Although kiobufs remain controversial in the kernel development
community, there is interest in using them in a wider range of
contexts. There is, for example, a patch that implements Unix pipes
with kiobufs -- data is copied directly from one process's address
space to the other with no buffering in the kernel at all. A patch
also exists that makes it easy to use a kiobuf to map kernel virtual
memory into a process's address space, thus eliminating the need for a
nopage implementation as shown earlier.
Direct memory access, or DMA, is the advanced topic that completes our
overview of memory issues. DMA is the hardware mechanism that allows
peripheral components to transfer their I/O data directly to and from
main memory without the need for the system processor to be involved
in the transfer. Use of this mechanism can greatly increase
throughput to and from a device, because a great deal of computational
overhead is eliminated.
To exploit the DMA capabilities of its hardware, the device driver
needs to be able to correctly set up the DMA transfer and synchronize
with the hardware. Unfortunately, because of its hardware nature, DMA
is very system dependent. Each architecture has its own techniques to
manage DMA transfers, and the programming interface is different for
each. The kernel can't offer a unified interface, either, because a
driver can't abstract too much from the underlying hardware
mechanisms. Some steps have been made in that direction, however, in
recent kernels.
This chapter concentrates mainly on the PCI bus, since it is currently
the most popular peripheral bus available. Many of the concepts are more
widely applicable, though. We also touch on how some other buses, such
as ISA and SBus, handle DMA.
Before introducing the programming details, let's review how a DMA
transfer takes place, considering only input transfers to simplify the
discussion.
-
-
-
The second case comes about when DMA is used asynchronously. This
happens, for example, with data acquisition devices that go on pushing
data even if nobody is reading them. In this case, the driver should
maintain a buffer so that a subsequent read call
will return all the accumulated data to user space. The steps
involved in this kind of transfer are slightly different:
-
-
-
The peripheral device writes the data to the buffer and raises another
interrupt when it's done.
-
A variant of the asynchronous approach is often seen with network
cards. These cards often expect to see a circular buffer (often
called a DMA ring buffer) established in memory
shared with the processor; each incoming packet is placed in the next
available buffer in the ring, and an interrupt is signaled. The
driver then passes the network packets to the rest of the kernel, and
places a new DMA buffer in the ring.
Another relevant item introduced here is the DMA buffer. To exploit
direct memory access, the device driver must be able to allocate one
or more special buffers, suited to DMA. Note that many drivers
allocate their buffers at initialization time and use them until
shutdown -- the word allocate in the previous
lists therefore means "get hold of a previously allocated buffer.''
The main problem with the DMA buffer is that when it is bigger than
one page, it must occupy contiguous pages in physical memory because
the device transfers data using the ISA or PCI system bus, both of
which carry physical addresses. It's interesting to note that this
constraint doesn't apply to the SBus (see "SBus" in Chapter 15, "Overview of Peripheral Buses"), which uses virtual
addresses on the peripheral bus. Some architectures
can also use virtual addresses on the PCI bus,
but a portable driver cannot count on that capability.
Although DMA buffers can be allocated either
at system boot or at
runtime, modules can only allocate their buffers at runtime. Chapter 7,
"Getting Hold of Memory" introduced these techniques: "Boot-Time
Allocation" talked about allocation at system boot, while
"The Real Story of kmalloc" and "get_free_page and Friends"
described allocation at runtime. Driver writers must take care to
allocate the right kind of memory when it will be used for DMA
operations -- not all memory zones are suitable. In particular,
high memory will not work for DMA on most systems -- the
peripherals simply cannot work with addresses that high.
Most devices on modern buses can handle 32-bit addresses, meaning that
normal memory allocations will work just fine for them. Some PCI
devices, however, fail to implement the full PCI standard and cannot
work with 32-bit addresses. And ISA devices, of course, are limited
to 24-bit addresses only.
For devices with this kind of limitation, memory should be allocated
from the DMA zone by adding the GFP_DMA flag to the
kmalloc or get_free_pages call. When this flag is present, only memory that can be addressed
with 24 bits will be allocated.
We have seen how get_free_pages (and therefore
kmalloc) can't return more than 128 KB (or, more
generally, 32 pages) of consecutive memory space. But the request is
prone to fail even when the allocated buffer is less than 128 KB,
because system memory becomes fragmented over time.[52]
Actually, there is another way to allocate DMA space: perform
aggressive allocation until you are able to get enough consecutive
pages to make a buffer. We strongly discourage this allocation
technique if there's any other way to achieve your goal. Aggressive
allocation results in high machine load, and possibly in a system
lockup if your aggressiveness isn't correctly tuned. On the other
hand, sometimes there is no other way available.
In practice, the code invokes kmalloc(GFP_ATOMIC)
until the call fails; it then waits until the kernel frees some pages,
and then allocates everything once again. If you keep an eye on the
pool of allocated pages, sooner or later you'll find that your DMA
buffer of consecutive pages has appeared; at this point you can
release every page but the selected buffer. This kind of behavior is
rather risky, though, because it may lead to a deadlock. We suggest
using a kernel timer to release every page in case allocation doesn't
succeed before a timeout expires.
We're not going to show the code here, but you'll find it in
misc-modules/allocator.c; the code is thoroughly
commented and designed to be called by other modules. Unlike every
other source accompanying this book, the allocator is covered by the
GPL. The reason we decided to put the source under the GPL is that it
is neither particularly beautiful nor particularly clever, and if
someone is going to use it, we want to be sure that the source is
released with the module.
A device driver using DMA has to talk to hardware connected to the
interface bus, which uses physical addresses, whereas program code
uses virtual addresses.
As a matter of fact, the situation is slightly more complicated than
that. DMA-based hardware uses bus, rather than
physical, addresses. Although ISA and PCI
addresses are simply physical addresses on the PC, this is not true
for every platform. Sometimes the interface bus is connected through
bridge circuitry that maps I/O addresses to different physical
addresses. Some systems even have a page-mapping scheme that can make
arbitrary pages appear contiguous to the peripheral bus.
The virt_to_bus conversion must be used when the
driver needs to send address information to an I/O device (such as an
expansion board or the DMA controller), while
bus_to_virt must be used when address information
is received from hardware connected to the bus.
The functions in this section require a struct
pci_dev structure for your device. The details of setting
up a PCI device are covered in Chapter 15, "Overview of Peripheral Buses". Note, however,
that the routines described here can also be used with ISA devices; in
that case, the struct pci_dev pointer should simply
be passed in as NULL.
The first question that must be answered before performing DMA is
whether the given device is capable of such operation on the current
host. Many PCI devices fail to implement the full 32-bit bus address
space, often because they are modified versions of old ISA hardware.
The Linux kernel will attempt to work with such devices, but it is not
always possible.
The function pci_dma_supported should be called
for any device that has addressing limitations:
int pci_dma_supported(struct pci_dev *pdev, dma_addr_t mask);
Here, mask is a simple bit mask describing which
address bits the device can successfully use. If the return value is
nonzero, DMA is possible, and your driver should set the
dma_mask field in the PCI device structure to the
mask value. For a device that can only handle 16-bit addresses, you
might use a call like this:
if (pci_dma_supported (pdev, 0xffff))
    pdev->dma_mask = 0xffff;
else {
    card->use_dma = 0;    /* We'll have to live without DMA */
    printk (KERN_WARNING "mydev: DMA not supported\n");
}
Recent kernels also provide pci_set_dma_mask, which
combines the check with setting the dma_mask field:
int pci_set_dma_mask(struct pci_dev *pdev, dma_addr_t mask);
For devices that can handle 32-bit addresses, there is no need to call
pci_dma_supported.
A DMA mapping is a combination of allocating a
DMA buffer and generating an address for that buffer that is
accessible by the device. In many cases, getting that address
involves a simple call to virt_to_bus; some
hardware, however, requires that mapping
registers be set up in the bus hardware as well. Mapping
registers are an equivalent of virtual memory for peripherals. On
systems where these registers are used, peripherals have a relatively
small, dedicated range of addresses to which they may perform DMA.
Those addresses are remapped, via the mapping registers, into system
RAM. Mapping registers have some nice features, including the ability
to make several distributed pages appear contiguous in the device's
address space. Not all architectures have mapping registers, however;
in particular, the popular PC platform has no mapping registers.
Setting up a useful address for the device may also, in some cases,
require the establishment of a bounce buffer.
Bounce buffers are created when a driver attempts to perform DMA on an
address that is not reachable by the peripheral device -- a
high-memory address, for example. Data is then copied to and from the
bounce buffer as needed. Making code work properly with bounce
buffers requires adherence to some rules, as we will see shortly.
The DMA mapping sets up a new type, dma_addr_t, to
represent bus addresses. Variables of type
dma_addr_t should be treated as opaque by the
driver; the only allowable operations are to pass them to the DMA
support routines and to the device itself.
-
-
These are set up for a single operation. Some architectures allow for
significant optimizations when streaming mappings are used, as we will
see, but these mappings also are subject to a stricter set of rules in
how they may be accessed. The kernel developers recommend the use of
streaming mappings over consistent mappings whenever possible. There
are two reasons for this recommendation. The first is that, on systems
that support them, each DMA mapping uses one or more mapping registers
on the bus. Consistent mappings, which have a long lifetime, can
monopolize these registers for a long time, even when they are not
being used. The other reason is that, on some hardware, streaming
mappings can be optimized in ways that are not available to consistent
mappings.
void *pci_alloc_consistent(struct pci_dev *pdev, size_t size,
dma_addr_t *bus_addr);
This function handles both the allocation and the mapping of the
buffer. The first two arguments are our PCI device structure and the
size of the needed buffer. The function returns the result of the DMA
mapping in two places. The return value is a kernel virtual address
for the buffer, which may be used by the driver; the associated bus
address, instead, is returned in bus_addr.
Allocation is handled in this function so that the buffer will be
placed in a location that works with DMA; usually the memory is just
allocated with get_free_pages (but note that the
size is in bytes, rather than an order value).
void pci_free_consistent(struct pci_dev *pdev, size_t size,
    void *cpu_addr, dma_addr_t bus_addr);
-
These two symbols should be reasonably self-explanatory. If data is
being sent to the device (in response, perhaps, to a
write system call),
PCI_DMA_TODEVICE should be used; data going to the
CPU, instead, will be marked with
PCI_DMA_FROMDEVICE.
-
-
dma_addr_t pci_map_single(struct pci_dev *pdev, void *buffer,
size_t size, int direction);
The return value is the bus address that you can pass to the device,
or NULL if something goes wrong.
void pci_unmap_single(struct pci_dev *pdev, dma_addr_t bus_addr,
size_t size, int direction);
You may be wondering why the driver can no longer work with a buffer
once it has been mapped. There are actually two reasons why this rule
makes sense. First, when a buffer is mapped for DMA, the kernel must
ensure that all of the data in that buffer has actually been written
to memory. It is likely that some data will remain in the processor's
cache, and must be explicitly flushed. Data written to the buffer by
the processor after the flush may not be visible to the device.
Second, consider what happens if the buffer to be mapped is in a
region of memory that is not accessible to the device. Some
architectures will simply fail in this case, but others will create a
bounce buffer. The bounce buffer is just a separate region of memory
that is accessible to the device. If a buffer is
mapped with a direction of PCI_DMA_TODEVICE, and a
bounce buffer is required, the contents of the original buffer will be
copied as part of the mapping operation. Clearly, changes to the
original buffer after the copy will not be seen by the device.
Similarly, PCI_DMA_FROMDEVICE bounce buffers are
copied back to the original buffer by
pci_unmap_single; the data from the device is not
present until that copy has been done.
void pci_sync_single(struct pci_dev *pdev, dma_addr_t bus_addr,
    size_t size, int direction);
Scatter-gather mappings are a special case of streaming DMA mappings.
Suppose you have several buffers, all of which need to be transferred
to or from the device. This situation can come about in several ways,
including from a readv or
writev system call, a clustered disk I/O request,
or a list of pages in a mapped kernel I/O buffer. You could simply
map each buffer in turn and perform the required operation, but there
are advantages to mapping the whole list at once.
One reason is that some smart devices can accept a
scatterlist of array pointers and lengths and
transfer them all in one DMA operation; for example, "zero-copy''
networking is easier if packets can be built in multiple pieces.
Linux is likely to take much better advantage of such devices in the
future. Another reason to map scatterlists as a whole is to take
advantage of systems that have mapping registers in the bus
hardware. On such systems, physically discontiguous pages can be
assembled into a single, contiguous array from the device's point of
view. This technique works only when the entries in the scatterlist
are equal to the page size in length (except the first and last), but
when it does work it can turn multiple operations into a single DMA
and speed things up accordingly.
-
-
int pci_map_sg(struct pci_dev *pdev, struct scatterlist *list,
int nents, int direction);
-
-
void pci_unmap_sg(struct pci_dev *pdev, struct scatterlist *list,
int nents, int direction);
void pci_dma_sync_sg(struct pci_dev *pdev, struct scatterlist *sg,
int nents, int direction);
-
-
-
-
The actual form of DMA operations on the PCI bus is very dependent on
the device being driven. Thus, this example does not apply to any real
device; instead, it is part of a hypothetical driver called
dad (DMA Acquisition Device). A driver for
this device might define a transfer function like this:
int dad_transfer(struct dad_dev *dev, int write, void *buffer,
                 size_t count)
{
    dma_addr_t bus_addr;

    /* Map the buffer for DMA */
    dev->dma_dir = (write ? PCI_DMA_TODEVICE : PCI_DMA_FROMDEVICE);
    dev->dma_size = count;
    bus_addr = pci_map_single(dev->pci_dev, buffer, count,
                              dev->dma_dir);
    dev->dma_addr = bus_addr;

    /* Set up the device (writeb and writel take the value first,
       then the register address) */
    writeb(DAD_CMD_DISABLEDMA, dev->registers.command);
    writeb(write ? DAD_CMD_WR : DAD_CMD_RD, dev->registers.command);
    writel(cpu_to_le32(bus_addr), dev->registers.addr);
    writel(cpu_to_le32(count), dev->registers.len);

    /* Start the operation */
    writeb(DAD_CMD_ENABLEDMA, dev->registers.command);
    return 0;
}
This function maps the buffer to be transferred and starts the device
operation. The other half of the job must be done in the interrupt
service routine, which would look something like this:
void dad_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct dad_dev *dev = (struct dad_dev *) dev_id;

    /* Make sure it's really our device interrupting */

    /* Unmap the DMA buffer */
    pci_unmap_single(dev->pci_dev, dev->dma_addr, dev->dma_size,
                     dev->dma_dir);

    /* Only now is it safe to access the buffer, copy to user, etc. */
    ...
}
SPARC-based systems have traditionally included a Sun-designed bus
called the SBus. This bus is beyond the scope of this chapter, but a
quick mention is worthwhile. There is a set of functions (declared in
<asm/sbus.h>) for performing DMA mappings on
the SBus; they have names like
sbus_alloc_consistent and
sbus_map_sg. In other words, the SBus DMA API
looks almost exactly like the PCI interface. A detailed look at the
function definitions will be required before working with DMA on the
SBus, but the concepts will match those discussed earlier for the PCI
bus.
The ISA bus allows for two kinds of DMA transfers: native DMA and ISA
bus master DMA. Native DMA uses standard DMA-controller circuitry on
the motherboard to drive the signal lines on the ISA bus. ISA bus
master DMA, on the other hand, is handled entirely by the peripheral
device. The latter type of DMA is rarely used and doesn't require
discussion here because it is similar to DMA for PCI devices, at least
from the driver's point of view. An example of an ISA bus master is
the 1542 SCSI controller, whose driver is
drivers/scsi/aha1542.c in the kernel sources.
-
The controller holds information about the DMA transfer, such as the
direction, the memory address, and the size of the transfer. It also
contains a counter that tracks the status of ongoing transfers. When
the controller receives a DMA request signal, it gains control of the
bus and drives the signal lines so that the device can read or write
its data.
- The peripheral device
-
The device must activate the DMA request signal when it's ready to
transfer data. The actual transfer is managed by the DMAC; the
hardware device sequentially reads or writes data onto the bus when
the controller strobes the device. The device usually raises an
interrupt when the transfer is over.
- The device driver
-
The original DMA controller used in the PC could manage four
"channels," each associated with one set of DMA registers.
Four devices could store their DMA information in the controller
at the same time. Newer PCs contain the equivalent of two DMAC
devices:[53] the second controller (master) is
connected to the system processor, and the first (slave) is
connected to channel 0 of the second controller.[54]
The channel argument is a number between 0 and 7
or, more precisely, a nonnegative number less than
MAX_DMA_CHANNELS. On the PC,
MAX_DMA_CHANNELS is defined as 8, to match the
hardware. The name argument is a string
identifying the device. The specified name appears in the file
/proc/dma, which can be read by user programs.
The return value from request_dma is 0 for
success and -EINVAL or -EBUSY if
there was an error. The former means that the requested channel is
out of range, and the latter means that another device is holding the
channel.
We also suggest that you request the DMA channel
after you've requested the interrupt line and
that you release it before the interrupt. This is
the conventional order for requesting the two resources; following the
convention avoids possible deadlocks. Note that every device using DMA
needs an IRQ line as well; otherwise, it couldn't signal the
completion of data transfer.
In a typical case, the code for open looks like
the following, which refers to our hypothetical
dad module. The dad device as
shown uses a fast interrupt handler without support for shared IRQ
lines.
int dad_open (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device;
    int error;

    /* ... */
    if ( (error = request_irq(my_device->irq, dad_interrupt,
                              SA_INTERRUPT, "dad", NULL)) )
        return error; /* or implement blocking open */

    if ( (error = request_dma(my_device->dma, "dad")) ) {
        free_irq(my_device->irq, NULL);
        return error; /* or implement blocking open */
    }
    /* ... */
    return 0;
}

void dad_close (struct inode *inode, struct file *filp)
{
    struct dad_device *my_device;

    /* ... */
    free_dma(my_device->dma);
    free_irq(my_device->irq, NULL);
    /* ... */
}
The driver needs to configure the DMA controller either when
read or write is called, or
when preparing for asynchronous transfers. This latter task is
performed either at open time or in response to
an ioctl command, depending on the driver and the
policy it implements. The code shown here is the code that is
typically called by the read or
write device methods.
This subsection provides a quick overview of the internals of the DMA
controller so you will understand the code introduced here. If you
want to learn more, we'd urge you to consult hardware manuals
describing the PC architecture. In particular, we don't deal with the
issue of 8-bit versus 16-bit data transfers. If you are writing device
drivers for ISA device boards, you should find the relevant
information in the hardware manuals for the devices.
-
-
-
Indicates whether the channel must read from the device
(DMA_MODE_READ) or write to it
(DMA_MODE_WRITE). A third mode exists,
DMA_MODE_CASCADE, which is used to release control
of the bus. Cascading is the way the first controller is connected to
the top of the second, but it can also be used by true ISA bus-master
devices. We won't discuss bus mastering here.
-
-
In addition to these functions, there are a number of housekeeping
facilities that must be used when dealing with DMA devices:
-
A DMA channel can be disabled within the controller. The channel
should be disabled before the controller is configured, to prevent
improper operation (the controller is programmed via eight-bit data
transfers, and thus none of the previous functions is executed
atomically).
-
-
-
This function clears the DMA flip-flop. The flip-flop is used to
control access to 16-bit registers. The registers are accessed by two
consecutive 8-bit operations, and the flip-flop is used to select the
least significant byte (when it is clear) or the most significant byte
(when it is set). The flip-flop automatically toggles when 8 bits
have been transferred; the programmer must clear the flip-flop (to set
it to a known state) before accessing the DMA registers.
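Putting the pieces together, the controller-programming code for our hypothetical dad driver might look like the following sketch. This is kernel-side code (the helpers come from <asm/dma.h>), and the function name and its arguments are illustrative:

```c
#include <asm/dma.h>
#include <asm/io.h>

/* Illustrative helper for the hypothetical dad driver: program the
 * DMA controller for one transfer on the given channel. */
int dad_dma_prepare(int channel, int mode, unsigned long buf,
                    unsigned int count)
{
    unsigned long flags;

    flags = claim_dma_lock();     /* the controller is a shared resource */
    disable_dma(channel);         /* no transfers while reprogramming */
    clear_dma_ff(channel);        /* start from a known flip-flop state */
    set_dma_mode(channel, mode);  /* DMA_MODE_READ or DMA_MODE_WRITE */
    set_dma_addr(channel, virt_to_bus((void *) buf));
    set_dma_count(channel, count);
    enable_dma(channel);          /* ready to honor DMA requests */
    release_dma_lock(flags);
    return 0;
}
```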
The only thing that remains to be done is to configure the device
board. This device-specific task usually consists of reading or
writing a few I/O ports. Devices differ in significant ways. For
example, some devices expect the programmer to tell the hardware how
big the DMA buffer is, and sometimes the driver has to read a value
that is hardwired into the device. For configuring the board, the
hardware manual is your only friend.
As with other parts of the kernel, both memory mapping and DMA have
seen a number of changes over the years. This section describes
the things a driver writer must take into account in order to write
portable code.
Changes to Memory Management
The 2.3 development series saw major changes in the way memory
management worked. The 2.2 kernel was quite limited in the amount of
memory it could use, especially on 32-bit processors. With 2.4, those
limits have been lifted; Linux is now able to manage all the memory
that the processor is able to address. Some things have had to change
to make all this possible; overall, however, the scale of the changes
at the API level is surprisingly small.
In the 2.2 kernel, for example, pte_page returned an
unsigned long value instead of struct page
*. The virt_to_page macro did not
exist at all; if you needed to find a struct page
entry you had to go directly to the memory map to get it. The macro
MAP_NR would turn a logical address into an index
in mem_map; thus, the current
virt_to_page macro could be defined (and, in
sysdep.h in the sample code, is defined) as
follows:
struct page has also changed with time; in
particular, the virtual field is present in Linux
2.4 only.
The vm_area_struct structure saw a number of
changes in the 2.1 development series, and more in 2.3. These included
the following:
In the 2.0 kernel, the init_mm structure was not
exported to modules. Thus, a module that wished to access
init_mm had to dig through the task table to find
it (as part of the init process). When
running on a 2.0 kernel, scullp finds
init_mm with this bit of code:
static struct mm_struct *init_mm_ptr;
#define init_mm (*init_mm_ptr) /* to avoid ifdefs later */

static void retrieve_init_mm_ptr(void)
{
    struct task_struct *p;

    for (p = current ; (p = p->next_task) != current ; )
        if (p->pid == 0)
            break;

    init_mm_ptr = p->mm;
}
This chapter introduced the following symbols related to memory
handling. The list doesn't include the symbols introduced in the first
section, as that section is a huge list in itself and those symbols
are rarely useful to device drivers.
- #include
-
-
-
-
-
-
-
-
-
These functions convert between kernel virtual and bus addresses. Bus
addresses must be used to talk to peripheral devices.
-
- int pci_dma_supported(struct pci_dev *pdev, dma_addr_t mask);
-
- void *pci_alloc_consistent(struct pci_dev *pdev, size_t size, dma_addr_t *bus_addr)
- void pci_free_consistent(struct pci_dev *pdev, size_t size, void *cpuaddr, dma_addr_t bus_addr);
-
-
- dma_addr_t pci_map_single(struct pci_dev *pdev, void *buffer, size_t size, int direction);
- void pci_unmap_single(struct pci_dev *pdev, dma_addr_t bus_addr, size_t size, int direction);
-
- void pci_sync_single(struct pci_dev *pdev, dma_addr_t bus_addr, size_t size, int direction)
-
Synchronizes a buffer that has a streaming mapping. This function
must be used if the processor must access a buffer while the streaming
mapping is in place (i.e., while the device owns the buffer).
-
The scatterlist structure describes an I/O
operation that involves more than one buffer. The macros
sg_dma_address and
sg_dma_len may be used to extract bus addresses
and buffer lengths to pass to the device when implementing
scatter-gather operations.
- pci_map_sg(struct pci_dev *pdev, struct scatterlist *list, int nents, int direction);
- pci_unmap_sg(struct pci_dev *pdev, struct scatterlist *list, int nents, int direction);
- pci_dma_sync_sg(struct pci_dev *pdev, struct scatterlist *sg, int nents, int direction)
© 2001, O'Reilly & Associates, Inc.