Category: LINUX
2009-11-25 22:17:57
I/O Schedulers and I/O Performance
In a modern system, the relative performance gap between disks and the rest of the system is quite large—and widening. The worst component of disk performance is the process of moving the read/write head from one part of the disk to another, an operation known as a seek. In a world where many operations are measured in a handful of processor cycles (which might take all of a third of a nanosecond each), a single disk seek can average over eight milliseconds—still a small number, to be sure, but 25 million times longer than a single processor cycle!
Given the disparity in performance between disk drives and the rest of the system, it would be incredibly crude and inefficient to send I/O requests to the disk in the order in which they are issued. Therefore, modern operating system kernels implement I/O schedulers, which work to minimize the number and size of disk seeks by manipulating the order in which I/O requests are serviced, and the times at which they are serviced. I/O schedulers work hard to lessen the performance penalties associated with disk access.
Disk Addressing
To understand the role of an I/O scheduler, some background information is necessary. Hard disks address their data using the familiar geometry-based addressing of cylinders, heads, and sectors, or CHS addressing. A hard drive is composed of multiple platters, each consisting of a single disk, spindle, and read/write head. You can think of each platter as a CD (or record), and the set of platters in a disk as a stack of CDs. Each platter is divided into circular ring-like tracks, like on a CD. Each track is then divided into an integer number of sectors.
To locate a specific unit of data on a disk, the drive’s logic requires three pieces of information: the cylinder, head, and sector values. The cylinder value specifies the track on which the data resides. If you lay the platters on top of one another, a given track forms a cylinder through each platter. In other words, a cylinder is represented by a track at the same distance from the center on each disk. The head value identifies the exact read/write head (and thus the exact platter) in question. The search is now narrowed down to a single track on a single platter. The disk then uses the sector value to identify an exact sector on the track. The search is now complete: the hard disk knows what platter, what track, and what sector to look in for the data. It can position the read/write head of the correct platter over the correct track, and read from or write to the requisite sector.
Thankfully, modern hard disks do not force computers to communicate with their disks in terms of cylinders, heads, and sectors. Instead, contemporary hard drives map a unique block number (such blocks are also called physical blocks or device blocks) onto each cylinder/head/sector triplet—effectively, a block maps to a specific sector. Modern operating systems can then address hard drives using these block numbers—a process known as logical block addressing (LBA)—and the hard drive internally translates the block number into the correct CHS address. Although nothing guarantees it, the block-to-CHS mapping tends to be sequential: physical block n tends to be physically adjacent on disk to physical block n + 1. This sequential mapping is important, as we shall soon see.
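As a rough sketch of that translation, here is the classic LBA-to-CHS arithmetic, assuming an idealized drive with a fixed geometry. The HEADS and SECTORS_PER_TRACK values are invented for illustration; real drives use zoned recording and internal remapping, so only the firmware knows the true layout.

    /* Idealized LBA-to-CHS conversion; geometry constants are
     * illustrative only. Note how adjacent block numbers land in
     * adjacent sectors, wrapping to the next head or cylinder. */
    #include <stdio.h>

    #define HEADS             16
    #define SECTORS_PER_TRACK 63

    struct chs { unsigned long c, h, s; };

    static struct chs lba_to_chs(unsigned long lba)
    {
        struct chs addr;
        addr.c = lba / (HEADS * SECTORS_PER_TRACK);
        addr.h = (lba / SECTORS_PER_TRACK) % HEADS;
        addr.s = lba % SECTORS_PER_TRACK + 1;  /* sectors count from 1 */
        return addr;
    }

    int main(void)
    {
        for (unsigned long lba = 1006; lba <= 1009; lba++) {
            struct chs a = lba_to_chs(lba);
            printf("block %lu -> cylinder %lu, head %lu, sector %lu\n",
                   lba, a.c, a.h, a.s);
        }
        return 0;
    }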
Filesystems, meanwhile, exist only in software. They operate on their own units, known as logical blocks (sometimes called filesystem blocks, or, confusingly, just blocks).
The logical block size must be an integer multiple of the physical
block size. In other words, a filesystem’s logical blocks map to one or
more of a disk’s physical blocks.
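To see the two units side by side on a live system, you can ask the kernel for both. A minimal sketch, with error handling trimmed; /dev/sda and / are placeholders for your own device and mount point, and opening the device node typically requires root:

    /* Print a disk's physical (device) block size and a mounted
     * filesystem's logical block size. Paths are placeholders. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/statvfs.h>
    #include <linux/fs.h>   /* BLKSSZGET */

    int main(void)
    {
        int fd, sector_size;
        struct statvfs sv;

        fd = open("/dev/sda", O_RDONLY);
        if (fd >= 0) {
            if (ioctl(fd, BLKSSZGET, &sector_size) == 0)
                printf("device block size: %d bytes\n", sector_size);
            close(fd);
        }

        if (statvfs("/", &sv) == 0)
            printf("filesystem block size: %lu bytes\n",
                   (unsigned long)sv.f_bsize);
        return 0;
    }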
The Life of an I/O Scheduler
I/O schedulers perform two basic operations: merging and sorting. Merging is
the process of taking two or more adjacent I/O requests, and combining
them into a single request. Consider two requests, one to read from
disk block 5, and another to read from disk blocks 6 through 7. These
requests can be merged into a single request to read from disk blocks 5
through 7. The total amount of I/O might be the same, but the number of
I/O operations is reduced by half. Sorting, the more important of the two
operations, is the process of arranging pending I/O requests in
ascending block order. For example, given I/O operations to blocks 52,
109, and 7, the I/O scheduler would sort these requests into the
ordering 7, 52, and 109. If a request was then issued to block 81, it
would be inserted between the requests to blocks 52 and 109. The I/O
scheduler would then dispatch the requests to the disk in the order
that they exist in the queue: 7, then 52, then 81, and finally 109. In this manner, the disk head’s movements are
minimized. Instead of potentially haphazard movements—here to there and
back, seeking all over the disk—the disk head moves in a smooth, linear
fashion. Because seeks are the most expensive part of disk I/O, performance is improved.
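As a toy illustration of merging and sorting together (the structures and names below are invented; the kernel's real request queue is far more elaborate):

    /* Toy elevator: keep pending requests sorted by starting block,
     * merging a new request into a neighbor when they touch. */
    #include <stdio.h>

    struct request { unsigned long start, len; }; /* blocks [start, start+len) */

    static struct request queue[64];
    static int nr_pending;

    static void submit(unsigned long start, unsigned long len)
    {
        int i;

        /* Merge: does the new request touch an existing one? */
        for (i = 0; i < nr_pending; i++) {
            if (queue[i].start + queue[i].len == start) {  /* back merge */
                queue[i].len += len;
                return;
            }
            if (start + len == queue[i].start) {           /* front merge */
                queue[i].start = start;
                queue[i].len += len;
                return;
            }
        }

        /* Sort: insert in ascending block order. */
        for (i = nr_pending; i > 0 && queue[i - 1].start > start; i--)
            queue[i] = queue[i - 1];
        queue[i].start = start;
        queue[i].len = len;
        nr_pending++;
    }

    int main(void)
    {
        submit(5, 1);                   /* read block 5 */
        submit(6, 2);                   /* blocks 6-7: merged into 5-7 */
        submit(52, 1);
        submit(109, 1);
        submit(81, 1);                  /* sorted between 52 and 109 */

        for (int i = 0; i < nr_pending; i++)  /* dispatch order */
            printf("blocks %lu-%lu\n", queue[i].start,
                   queue[i].start + queue[i].len - 1);
        return 0;
    }

Running it prints the dispatch order 5-7, 52, 81, 109, mirroring the examples above.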
Helping Out Reads
Each read request must return up-to-date data. Thus, if
the requested data is not in the page cache, the reading process must
block until the data can be read from disk—a potentially lengthy
operation. We call this performance impact read latency. A typical application might initiate several read I/O
requests in a short period. Because each request is individually
synchronized, the later requests are dependent on the earlier
ones’ completion. Consider reading every file in a directory. The
application opens the first file, reads a chunk of it, waits for data,
reads another chunk, and so on, until the entire file is read. Then the
application starts again, on the next file. The requests become
serialized: a subsequent request cannot be issued until the current request completes.
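The serialization is easy to see from user space: each read() below blocks until its data arrives, so the next request cannot even be issued until the current one completes. A minimal sketch; "data.bin" is a placeholder filename:

    /* Each read() blocks until the data is available (from the page
     * cache if we are lucky, from the disk if not). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("data.bin", O_RDONLY);

        if (fd < 0)
            return 1;
        while (read(fd, buf, sizeof buf) > 0)
            ;   /* cannot issue the next read until this one returns */
        close(fd);
        return 0;
    }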
This is in stark contrast to write requests, which (in
their default, nonsynchronized state) need not initiate any disk I/O
until some time in the future. Thus, from the perspective of a
user-space application, write requests stream, unencumbered by
the performance of the disk. This streaming behavior only compounds the
problem for reads: as writes stream, they can hog the kernel and disk’s
attention. This phenomenon is known as the writes-starving-reads problem. If an I/O scheduler always insertion-sorted new requests into its block-sorted queue, it would be possible to starve requests to
far-off blocks indefinitely. Consider our previous example. If new
requests were continually issued to blocks in, say, the 50s, the
request to block 109 would never be serviced. Because read latency is
critical, this behavior would greatly hurt system performance. Thus,
I/O schedulers employ a mechanism to prevent starvation. A simple approach—such as the one taken by the 2.4 Linux kernel’s I/O scheduler, the Linus Elevator—is
to simply stop insertion-sorting if there is a sufficiently old request
in the queue. This trades overall performance for per-request fairness and, in the case of reads, improves latency.
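In miniature, the heuristic is a single check on the submission path. A sketch with invented names and an invented cutoff, not the actual 2.4 kernel code:

    /* Once the oldest pending request has waited past the cutoff,
     * stop sorting new requests ahead of it. All names and the
     * AGE_LIMIT_SEC value are illustrative. */
    #include <stdio.h>
    #include <time.h>

    struct request { unsigned long block; time_t queued_at; };

    #define AGE_LIMIT_SEC 2

    static void add_to_tail(struct request *rq)        /* stub */
    { printf("tail-add block %lu (old request pending)\n", rq->block); }

    static void sort_into_queue(struct request *rq)    /* stub */
    { printf("insertion-sort block %lu\n", rq->block); }

    static void enqueue(struct request *rq, const struct request *oldest)
    {
        if (oldest && time(NULL) - oldest->queued_at > AGE_LIMIT_SEC)
            add_to_tail(rq);     /* fairness beats seek-optimal order */
        else
            sort_into_queue(rq);
    }

    int main(void)
    {
        struct request old = { 109, time(NULL) - 5 };  /* queued 5s ago */
        struct request incoming = { 54, time(NULL) };
        enqueue(&incoming, &old);   /* prints "tail-add block 54 ..." */
        return 0;
    }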
The problem is that this
heuristic is a bit too simplistic. Recognizing this, the 2.6 Linux
kernel witnessed the demise of the Linus Elevator, and unveiled several
new I/O schedulers in its place.
The Deadline I/O Scheduler
The Deadline I/O Scheduler was
introduced to solve the problems with the 2.4 I/O scheduler, and
traditional elevator algorithms in general. The Linus Elevator
maintains a sorted list of pending I/O requests. The I/O request at the
head of the queue is the next one to be serviced. The Deadline I/O
Scheduler keeps this queue, but kicks things up a notch by introducing
two additional queues: the read FIFO queue, and the write FIFO queue.
The items in each of these queues are sorted by submission time
(effectively, the first in is the first out). The read FIFO queue, as
its name suggests, contains only read requests. The write FIFO queue,
likewise, contains only write requests. Each request in the FIFO queues
is assigned an expiration value. The read FIFO queue has an expiration
time of 500 milliseconds. The write FIFO queue has an expiration time
of five seconds. When a new I/O request is submitted, it is
insertion-sorted into the standard queue, and placed at the tail of its
respective (read or write) FIFO queue. Normally, the hard drive is sent
I/O requests from the head of the standard sorted queue. This maximizes
global throughput by minimizing seeks, as the normal queue is sorted by
block number (as with the Linus Elevator). When the item at the head of one of the FIFO queues
grows older than the expiration value associated with its queue,
however, the I/O scheduler stops dispatching I/O requests from the
standard queue, and begins servicing requests from that queue—the
request at the head of the FIFO queue is serviced, plus a couple of
extras for good measure. The I/O scheduler needs to check and handle
only the requests at the head of the queue, as those are the oldest
requests. In this manner, the Deadline I/O Scheduler can enforce
a soft deadline on I/O requests. Although it makes no promise that an
I/O request will be serviced before its expiration time, the I/O
scheduler generally services requests near their expiration times.
Thus, the Deadline I/O Scheduler continues to provide good global
throughput without starving any one request for an unacceptably long
time. Because read requests are given shorter expiration times, the writes-starving-reads problem is minimized.
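The dispatch decision itself can be sketched compactly. The structures and names below are invented for illustration, though the expiration values match those given above:

    /* Deadline-style dispatch: an expired FIFO head preempts the
     * block-sorted queue; otherwise throughput wins. */
    #include <stdio.h>
    #include <stdbool.h>

    #define READ_EXPIRE_MS   500
    #define WRITE_EXPIRE_MS 5000

    struct request { unsigned long block; long queued_ms; bool is_read; };

    /* Heads of the three queues; NULL means empty. */
    static struct request *sorted_head, *read_fifo_head, *write_fifo_head;

    static bool expired(const struct request *rq, long now_ms)
    {
        if (!rq)
            return false;
        return now_ms - rq->queued_ms >
               (rq->is_read ? READ_EXPIRE_MS : WRITE_EXPIRE_MS);
    }

    static struct request *pick_next(long now_ms)
    {
        if (expired(read_fifo_head, now_ms))
            return read_fifo_head;
        if (expired(write_fifo_head, now_ms))
            return write_fifo_head;
        return sorted_head;   /* normal case: minimize seeks */
    }

    int main(void)
    {
        struct request rd = { 7, 0, true };      /* read queued at t=0 */
        struct request wr = { 902, 100, false }; /* write queued at t=100 */
        read_fifo_head = &rd; write_fifo_head = &wr; sorted_head = &wr;

        printf("t=400ms: block %lu\n", pick_next(400)->block); /* 902 */
        printf("t=600ms: block %lu\n", pick_next(600)->block); /* 7 */
        return 0;
    }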
The Anticipatory I/O Scheduler
The Deadline I/O Scheduler’s behavior is good, but not
perfect. Recall our discussion on read dependency. With the Deadline
I/O Scheduler, the first read request in a series of reads is serviced
in short order, at or before its expiration time, and the I/O scheduler
then returns to servicing I/O requests from the sorted queue—so far, so
good. But suppose the application then swoops in and hits us with
another read request? Eventually its expiration time will also
approach, and the I/O scheduler will submit it to the disk, which will
seek over to promptly handle the request, then seek back to continue
handling requests from the sorted queue. This seeking back and forth
can continue for some time because many applications exhibit this
behavior. While latency is kept to a minimum, global throughput is not
very good because the read requests keep coming in, and the disk has to
keep seeking back and forth to handle them. Performance would be
improved if the disk just took a break to wait for another read, and
did not move away to service the sorted queue again. But,
unfortunately, by the time the application is scheduled and submits its
next dependent read request, the I/O scheduler has already shifted
gears. The problem again stems from those darn dependent
reads—each new read request is issued only when the previous one is
returned, but by the time the application receives the read data, is
scheduled to run, and submits its next read request, the I/O scheduler
has moved on, and begun servicing other requests. This results in a
wasted pair of seeks for each read: the disk seeks to the read,
services it, and then seeks back. If only there were some way for the
I/O scheduler to know—to anticipate—that another read would
soon be submitted to the same part of the disk, instead of seeking back
and forth, it could wait in anticipation of the next read. Saving those
awful seeks certainly would be worth a few milliseconds of waiting. This is exactly how the Anticipatory I/O Scheduler
operates. It began life as the Deadline I/O Scheduler, but was gifted
with the addition of an anticipation mechanism. When a read request is
submitted, the Anticipatory I/O Scheduler services it within its
deadline, as usual. Unlike the Deadline I/O Scheduler, however, the
Anticipatory I/O Scheduler then sits and waits, doing nothing, for up
to six milliseconds. Chances are good that the application will issue
another read to the same part of the filesystem during those six
milliseconds. If so, that request is serviced immediately, and the
Anticipatory I/O Scheduler waits some more. If six milliseconds go by
without a read request, the Anticipatory I/O Scheduler decides it has
guessed wrong, and returns to whatever it was doing before (i.e.,
servicing the standard sorted queue). If even a moderate number of
requests are anticipated correctly, a great deal of time—two expensive
seeks’ worth at each go—is saved. Because most reads are dependent, the anticipation pays off much of the time.
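A toy rendering of that gamble (everything here is invented for illustration): the scheduler wins two seeks when the dependent read arrives inside the window, and loses only the brief wait when it does not:

    /* Anticipation in miniature: after a read completes, idle up to
     * 6 ms before seeking away. The real scheduler also tracks
     * per-process I/O statistics to decide whether waiting is
     * likely to pay off. */
    #include <stdio.h>

    #define ANTIC_WAIT_MS 6

    static void after_read(unsigned long next_read_block, long arrives_in_ms)
    {
        if (arrives_in_ms <= ANTIC_WAIT_MS)
            printf("waited %ldms, serviced block %lu with no seek\n",
                   arrives_in_ms, next_read_block);
        else
            printf("gave up after %dms, resumed the sorted queue\n",
                   ANTIC_WAIT_MS);
    }

    int main(void)
    {
        after_read(101, 2);   /* dependent read arrives quickly: win */
        after_read(101, 40);  /* application was slow: guess was wrong */
        return 0;
    }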
The CFQ I/O Scheduler
The Complete Fair Queuing (CFQ)
I/O Scheduler works to achieve similar goals, albeit via a different
approach. With CFQ, each process is assigned its own queue, and each
queue is assigned a timeslice. The I/O scheduler visits each queue in a
round-robin fashion, servicing requests from the queue until the
queue’s timeslice is exhausted, or until no more requests remain. In
the latter case, the CFQ I/O Scheduler will then sit idle for a brief
period—by default, 10 ms—waiting for a new request on the queue. If the
anticipation pays off, the I/O scheduler avoids seeking. If not, the
waiting was in vain, and the scheduler moves on to the next process’
queue. Within each process’ queue, synchronized requests (such
as reads) are given priority over nonsynchronized requests. In this
manner, CFQ favors reads and prevents the writes-starving-reads
problem. Because of the per-process queue setup, the CFQ I/O Scheduler
is fair to all processes, while still providing good global performance. The CFQ I/O Scheduler is well suited to most workloads and makes an excellent first choice.
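In outline, a CFQ-style round is a loop over per-process queues. The sketch below uses invented structures and a pretend per-request cost, but shows the timeslice-then-idle shape:

    /* One CFQ-style round-robin pass over per-process queues. */
    #include <stdio.h>

    #define IDLE_WINDOW_MS 10

    struct proc_queue {
        const char *comm;       /* process name, for the demo */
        int pending;            /* requests waiting in this queue */
        long timeslice_ms;
    };

    static void service_round(struct proc_queue *queues, int n)
    {
        for (int i = 0; i < n; i++) {
            struct proc_queue *q = &queues[i];
            long left = q->timeslice_ms;

            while (q->pending > 0 && left > 0) {
                q->pending--;   /* dispatch one request */
                left -= 2;      /* pretend each costs 2 ms */
            }
            if (q->pending == 0 && left > 0)
                printf("%s: idle up to %dms awaiting a new request\n",
                       q->comm, IDLE_WINDOW_MS);
        }
    }

    int main(void)
    {
        struct proc_queue queues[] = {
            { "mysqld",   5, 8 },
            { "updatedb", 1, 8 },
        };
        service_round(queues, 2);
        return 0;
    }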
The Noop I/O Scheduler
The Noop I/O Scheduler is the most basic of the
available schedulers. It performs no sorting whatsoever, only basic
merging. It is used for specialized devices that either do not require request sorting or that perform it themselves.
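Whichever scheduler suits your workload, on a 2.6 kernel the active one can be inspected and changed per device through sysfs; reading the file lists the available schedulers with the active one in brackets, e.g. "noop anticipatory deadline [cfq]". A small sketch, assuming a disk named sda (a placeholder) and root privileges for the write:

    /* Show, and optionally set, a disk's I/O scheduler via sysfs. */
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        const char *path = "/sys/block/sda/queue/scheduler";
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f)
            return 1;
        if (fgets(line, sizeof line, f))
            printf("%s: %s", path, line);
        fclose(f);

        if (argc > 1) {     /* e.g. ./setsched deadline */
            f = fopen(path, "w");
            if (!f)
                return 1;
            fprintf(f, "%s\n", argv[1]);
            fclose(f);
        }
        return 0;
    }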