
Which I/O controller is the fairest of them all?

By Jonathan Corbet
May 12, 2009
An I/O controller is a system component intended to arbitrate access to block storage devices; it should ensure that different groups of processes get specific levels of access according to a policy defined by the system administrator. In other words, it prevents I/O-intensive processes from hogging the disk. This feature can be useful on just about any kind of system which experiences disk contention; it becomes a necessity on systems running a number of virtualized (or containerized) guests. At the moment, Linux lacks an I/O controller in the mainline kernel. There is, however, no shortage of options out there. This article will look at some of the I/O controller projects currently pushing for inclusion into the mainline.

[Block layer structure] For the purposes of this discussion, it may be helpful to refer to your editor's bad artwork, as seen on the right, for a simplistic look at how block I/O happens in a Linux system. At the top, we have several sources of I/O activity. Some requests come from the virtual memory layer, which is cleaning out dirty pages and trying to make room for new allocations. Others come from filesystem code, and others yet will originate directly from user space. It's worth noting that only user-space requests are handled in the context of the originating process; that creates complications that we'll get back to. Regardless of the source, I/O requests eventually find themselves at the block layer, represented by the large blue box in the diagram.

Within the block layer, I/O requests may first be handled by one or more virtual block drivers. These include the device mapper code, the MD RAID layer, etc. Eventually a (perhaps modified) request heads toward a physical device, but first it goes into the I/O scheduler, which tries to optimize I/O activity according to a policy of its own. The I/O scheduler works to minimize seeks on rotating storage, but it may also implement I/O priorities or other policy-related features. When it deems that the time is right, the I/O scheduler passes requests to the physical block driver, which eventually causes them to be executed by the hardware.

All of this is relevant because it is possible to hook an I/O controller into any level of this diagram - and the various controller developers have done exactly that. There are advantages and disadvantages to doing things at each layer, as we will see.

dm-ioband

The dm-ioband patch by Ryo Tsuruta (and others) operates at the virtual block driver layer. It implements a new device mapper target (called "ioband") which prioritizes requests passing through. The policy is a simple proportional weighting system; requests are divided up into groups, each of which gets bandwidth according to the weight assigned by the system administrator. Groups can be determined by user ID, group ID, process ID, or process group. Administration is done with the dmsetup tool.

dm-ioband works by assigning a pile of "tokens" to each group. If I/O traffic is low, the controller just stays out of the way. Once traffic gets high enough, though, it will charge each group for every I/O request on its way through. Once a group runs out of tokens, its I/O will be put onto a list where it will languish, unloved, while other groups continue to have their requests serviced. Once all groups which are actively generating I/O have exhausted their tokens, everybody gets a new set and the process starts anew.
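
To make the token accounting concrete, here is a small user-space simulation of the scheme just described (a sketch of the idea only, not dm-ioband's code; the weights, token counts, and refill rule are illustrative assumptions):

[example]
/*
 * Toy model of the dm-ioband token idea: every group gets tokens in
 * proportion to its weight, pays one token per request, and a group that
 * runs dry is starved until all busy groups have run dry, at which point
 * everybody gets a fresh set.
 */
#include <stdio.h>

#define NGROUPS    3
#define TOKEN_BASE 100            /* tokens granted per unit of weight */
#define TICKS      600000         /* scheduling rounds to simulate */

int main(void)
{
    int  weight[NGROUPS]   = { 10, 20, 40 };   /* administrator's policy */
    long tokens[NGROUPS]   = { 0 };
    long serviced[NGROUPS] = { 0 };
    long total = 0;
    long t;
    int  g;

    for (g = 0; g < NGROUPS; g++)
        tokens[g] = (long)weight[g] * TOKEN_BASE;

    for (t = 0; t < TICKS; t++) {
        int starved = 0;

        for (g = 0; g < NGROUPS; g++) {
            if (tokens[g] > 0) {          /* group still has credit: */
                tokens[g]--;              /* charge it and service the I/O */
                serviced[g]++;
            } else {
                starved++;                /* request languishes on a list */
            }
        }

        if (starved == NGROUPS)           /* everyone is out of tokens: */
            for (g = 0; g < NGROUPS; g++) /* hand out a new set */
                tokens[g] = (long)weight[g] * TOKEN_BASE;
    }

    for (g = 0; g < NGROUPS; g++)
        total += serviced[g];
    for (g = 0; g < NGROUPS; g++)
        printf("group %d (weight %2d): %ld requests, %.1f%% of total\n",
               g, weight[g], serviced[g], 100.0 * serviced[g] / total);
    return 0;
}
[/example]

With weights of 10, 20, and 40, the serviced shares converge on roughly 14%, 29%, and 57% - the proportional split the weights describe.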

The basic dm-ioband code has a couple of interesting limitations. One is that it does not use the control group mechanism, as would normally be expected for a resource controller. It also has a real problem with I/O operations initiated asynchronously by the kernel. In many cases - perhaps the majority of cases - I/O requests are created by kernel subsystems (memory management, for example) which are trying to free up resources and which are not executing in the context of any specific process. These requests do not have a readily-accessible return label saying who they belong to, so dm-ioband does not know how to account for them. So they run under the radar, substantially reducing the value of the whole I/O controller exercise.

The good news is that there's a solution to both problems in the form of the blkio-cgroup patch, also by Ryo. This patch interfaces between dm-ioband and the control group mechanism, allowing bandwidth control to be applied to arbitrary control groups. Unlike some other solutions, dm-ioband still does not use control groups for bandwidth control policy; control groups are really only used to define the groups of processes to operate on.

The other feature added by blkio-cgroup is a mechanism by which the owner of arbitrary I/O requests can be identified. To this end, it adds some fields to the array of page_cgroup structures in the kernel. This array is maintained by the memory usage controller subsystem; one can think of struct page_cgroup as a bunch of extra stuff added into struct page. Unlike the latter, though, struct page_cgroup is normally not used in the kernel's memory management hot paths, and it's generally out of sight, so people tend not to notice when it grows. But, there is one struct page_cgroup for every page of memory in the system, so this is a large array.

This array already has the means to identify the owner for any given page in the system. Or, at least, it will identify an owner; there's no real attempt to track multiple owners of shared pages. The blkio-cgroup patch adds some fields to this array to make it easy to identify which control group is associated with a given page. Given that, and given that block I/O requests include the address of the memory pages involved, it is not too hard to look up a control group to associate with each request. Modules like dm-ioband can then use this information to control the bandwidth used by all requests, not just those initiated directly from user space.
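
As a rough model of that bookkeeping (a user-space illustration only; the real struct page_cgroup fields and helper functions are different, and the names below are made up), the idea is a per-page record of the owning group which request submission can consult:

[example]
#include <stdio.h>

#define NR_PAGES 16                /* a very small "system" */

static int page_owner[NR_PAGES];   /* control group id per page, 0 = none */

/* Record the owner when a group first touches (dirties) a page. */
static void charge_page(int pfn, int cgroup_id)
{
    page_owner[pfn] = cgroup_id;   /* only a single owner is remembered */
}

/* Attribute a request covering pages [pfn, pfn + nr) to a control group. */
static int request_owner(int pfn, int nr)
{
    int i;

    for (i = 0; i < nr; i++)
        if (page_owner[pfn + i])
            return page_owner[pfn + i];
    return 0;                      /* unknown: fall back to the root group */
}

int main(void)
{
    charge_page(4, 2);             /* control group 2 owns pages 4 and 5 */
    charge_page(5, 2);
    charge_page(9, 3);             /* control group 3 owns page 9 */

    printf("writeback of pages 4-5 charged to group %d\n", request_owner(4, 2));
    printf("writeback of pages 8-9 charged to group %d\n", request_owner(8, 2));
    return 0;
}
[/example]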

The advantages of dm-ioband include device-mapper integration (for those who use the device mapper), and a relatively small and well-contained code base - at least until blkio-cgroup is added into the mix. On the other hand, one must use the device mapper to use dm-ioband, and the scheduling decisions made there are unlikely to help the lower-level I/O scheduler implement its policy correctly. Finally, dm-ioband does not provide any sort of quality-of-service guarantees; it simply ensures that each group gets something close to a given percentage of the available I/O bandwidth.

io-throttle

The io-throttle patches by Andrea Righi take a different approach. This controller uses the control group mechanism from the outset, so all of the policy parameters are set via the control group virtual filesystem. The main parameter for each control group is the maximum bandwidth that group can consume; thus, io-throttle enforces absolute bandwidth numbers, rather than dividing up the available bandwidth proportionally as is done with dm-ioband. (Incidentally, both controllers can also place limits on the number of I/O operations rather than bandwidth.) There is also a "watermark" value; it sets a level of utilization below which throttling will not be performed. Each control group has its own watermark, so it is possible to specify that some groups are throttled before others.

Each control group is associated with a specific block device. If the administrator wants to set identical policies for three different devices, three control groups must still be created. But this approach does make it possible to set different policies for different devices.

One of the more interesting design decisions with io-throttle is its placement in the I/O structure: it operates at the top, where I/O requests are initiated. This approach necessitates the placement of calls to cgroup_io_throttle() wherever block I/O requests might be created. So they show up in various parts of the memory management subsystem, in the filesystem readahead and writeback code, in the asynchronous I/O layer, and, of course, in the main block layer I/O submission code. This makes the io-throttle patch a bit more invasive than some others.

There is an advantage to doing throttling at this level, though: it allows io-throttle to slow down I/O by simply causing the submitting process to sleep for a while; this is generally preferable to filling memory with queued BIO structures. Sleeping is not always possible - it's considered poor form in large parts of the virtual memory subsystem, for example - so io-throttle still has to queue I/O requests at times.
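
As a rough user-space illustration of that approach (this is not cgroup_io_throttle(); the 4 MiB/s limit and the accounting below are invented for the example), a submitter that has gone over its bandwidth budget is simply put to sleep until it is back under the limit:

[example]
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Sleep the caller until total_bytes submitted since start fits under bps. */
static void throttle(double start, long total_bytes, long bps)
{
    double earliest = start + (double)total_bytes / bps;
    double delay    = earliest - now_sec();

    if (delay > 0) {
        struct timespec ts;

        ts.tv_sec  = (time_t)delay;
        ts.tv_nsec = (long)((delay - ts.tv_sec) * 1e9);
        nanosleep(&ts, NULL);        /* the submitter, not the I/O, waits */
    }
}

int main(void)
{
    const long bps   = 4L << 20;     /* pretend limit: 4 MiB/s       */
    const long chunk = 1L << 20;     /* each "request" submits 1 MiB */
    double start = now_sec();
    long total = 0;
    int i;

    for (i = 0; i < 8; i++) {
        total += chunk;              /* account for the request ...      */
        throttle(start, total, bps); /* ... and sleep if over budget     */
        printf("%5.2fs: %ld MiB submitted\n", now_sec() - start, total >> 20);
    }
    return 0;
}
[/example]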

The io-throttle code does not provide true quality of service, but it gets a little closer. If the system administrator does not over-subscribe the block device, then each group should be able to get the amount of bandwidth which has been allocated to it. This controller handles the problem of asynchronously-generated I/O requests in the same way dm-ioband does: it uses the blkio-cgroup code.

The advantages of the io-throttle approach include relatively simple code and the ability to throttle I/O by causing processes to sleep. On the down side, operating at the I/O creation level means that hooks must be placed into a number of kernel subsystems - and maintained over time. Throttling I/O at this level may also interfere with I/O priority policies implemented at the I/O scheduler level.

io-controller

Both dm-ioband and io-throttle suffer from a significant problem: they can defeat the policies (such as I/O priority) being implemented by the I/O scheduler. Given that a bandwidth control module is, for all practical purposes, an I/O scheduler in its own right, one might think that it would make sense to do bandwidth control at the I/O scheduler level. The io-controller patches by Vivek Goyal do just that.

Io-controller provides a conceptually simple, control-group-based mechanism. Each control group is given a weight which determines its access to I/O bandwidth. Control groups are not bound to specific devices in io-controller, so the same weights apply for access to every device in the system. Once a process has been placed within a control group, it will have bandwidth allocated out of that group's weight, with no further intervention needed - at least, for any block device which uses one of the standard I/O schedulers.

The io-controller code has been designed to work with all of the mainline I/O schedulers: CFQ, Deadline, Anticipatory, and no-op. Making that work requires significant changes to those schedulers; they all need to have a hierarchical, fair-scheduling mechanism to implement the bandwidth allocation policy. The CFQ scheduler already has a single level of fair scheduling, but the io-controller code needs a second level. Essentially, one level implements the current CFQ fair queuing algorithm - including I/O priorities - while the other applies the group bandwidth limits. What this means is that bandwidth limits can be applied in a way which does not distort the other I/O scheduling decisions made by CFQ. The other I/O schedulers lack multiple queues (even at a single level), so the io-controller patch needs to add them.
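
A toy model of that two-level arrangement may help (purely conceptual; it is not the elevator code from the io-controller patches, and the weights, quanta, and queue counts are invented). The outer level credits each group in proportion to its weight, while the inner level stands in for CFQ's existing per-process fair queuing:

[example]
#include <stdio.h>

#define NGROUPS  2
#define NPROCS   2                 /* queues ("processes") per group       */
#define QUANTUM  10                /* credit granted per unit of weight    */
#define ROUNDS   3                 /* outer scheduling rounds to simulate  */

struct group {
    int  weight;                   /* control group weight (policy)        */
    int  deficit;                  /* credit left in the current round     */
    int  next_proc;                /* inner round-robin cursor             */
    long dispatched[NPROCS];       /* requests dispatched per process      */
};

static struct group groups[NGROUPS] = {
    { .weight = 1 },               /* e.g. a batch control group           */
    { .weight = 3 },               /* e.g. an interactive control group    */
};

int main(void)
{
    const int cost = 1;            /* pretend every request costs one unit */
    int r, g, p;

    for (r = 0; r < ROUNDS; r++) {
        for (g = 0; g < NGROUPS; g++) {
            struct group *grp = &groups[g];

            /* Outer level: credit the group according to its weight. */
            grp->deficit += grp->weight * QUANTUM;

            /* Dispatch from this group while it still has credit. */
            while (grp->deficit >= cost) {
                grp->deficit -= cost;
                /* Inner level: fair queuing among the group's processes,
                 * standing in for what CFQ already does per process. */
                grp->dispatched[grp->next_proc]++;
                grp->next_proc = (grp->next_proc + 1) % NPROCS;
            }
        }
    }

    for (g = 0; g < NGROUPS; g++) {
        printf("group %d (weight %d):", g, groups[g].weight);
        for (p = 0; p < NPROCS; p++)
            printf(" proc%d=%ld", p, groups[g].dispatched[p]);
        printf("\n");
    }
    return 0;
}
[/example]

The group with weight 3 dispatches three times as many requests as the group with weight 1, while each group's processes still share that group's allocation fairly between themselves.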

Vivek's patch starts by stripping the current multi-queue code out of CFQ, adding multiple levels to it, and making it part of the generic elevator code. That allows all of the I/O schedulers to make use of it with (relatively) little code churn. The CFQ code shrinks considerably, but the other schedulers do not grow much. Vivek, too, solves the asynchronous request problem with the blkio-cgroup code.

This approach has the clear advantage of performing bandwidth throttling in ways consistent with the other policies implemented by the I/O scheduler. It is well contained, in that it does not require the placement of hooks in other parts of the kernel, and it does not require the use of the device mapper. On the other hand, it is by far the largest of the bandwidth controller patches, it cannot implement different policies for different devices, and it doesn't yet work reliably with all I/O schedulers.

Choosing one

The proliferation of bandwidth controllers has been seen as a problem for at least the last year. There is no interest in merging multiple controllers, so, at some point, it will become necessary to pick one of them to put into the mainline. It has been hoped that the various developers involved would get together and settle on one implementation, but that has not yet happened, leading Andrew Morton to recently proclaim:

I'm thinking we need to lock you guys in a room and come back in 15 minutes.

Seriously, how are we to resolve this? We could lock me in a room and come back in 15 days, but there's no reason to believe that I'd emerge with the best answer.

At the Storage and Filesystem Workshop in April, the participants appear to have been leaning heavily toward a solution at the I/O scheduler level - and, thus, io-controller. The cynical among us might be tempted to point out that Vivek was in the room, while the developers of the competing offerings were not. But such people should also ask why an I/O scheduling problem should be solved at any other level.

In any case, the developers of dm-ioband and io-throttle have not stopped their work since this workshop was held, and the wider kernel community has not yet made a decision in this area. So the picture remains only slightly less murky than before. About the only clear area of consensus would appear to be the use of blkio-cgroup for the tracking of asynchronously-generated requests. For the rest, the locked-room solution may yet prove necessary.


Which I/O controller is the fairest of them all?

Posted May 12, 2009 19:54 UTC (Tue) by nrafique (subscriber, #55312)

Thanks Jonathan for the very nice article. There is one thing I would like to point out though. You mentioned that io-controller has the limitation that "it cannot implement different policies for different devices". This limitation can be easily fixed. In fact, at Google, we already have a (relatively simple) patch for allowing us to set different weights for different devices. And we have been testing Vivek's patches with that patch applied. We would make sure that this patch is included in the next posting by Vivek.

why not stack them

Posted May 13, 2009 0:28 UTC (Wed) by pflugstad (subscriber, #224)

While I agree that putting the I/O controller at the existing I/O scheduler layer seems to make the most sense, I would have thought that an obvious solution to this type of thing would be to modify the block layer setup so that you can stack I/O controllers and schedulers, instead of integrating io-controller into each I/O scheduler as seems to be being done here (please correct me if I'm mistaken).

That way, you can possibly select a different io controller and io scheduler combo and you don't have all the code churn if you want yet another io controller algorithm.

why not stack them

Posted May 13, 2009 1:28 UTC (Wed) by vgoyal (subscriber, #49279)

I personally think that IO controller is basically an IO scheduling operation. CFQ already provides fairness between processes. IO controller extends the same concept to provide fairness between process groups also.

Now if fairness between processes and fairness among process groups are implemented in two different places, we run into the issue of one not knowing about the policy of the other, with higher-level process-group fairness breaking the notions of the lower-level IO scheduler. For example, a higher-level controller would not know anything about process classes (RT, BE, IDLE), priorities within classes, or how reads are favored over writes by the IO scheduler.

I draw a parallel with the cpu controller, where process group scheduling is implemented along with process scheduling within the group. They are not different entities stacked on top of each other.

So stacking one on top of the other is probably not the best idea. Implementing it at the elevator layer lets code be shared between the four IO schedulers and avoids duplication.

The only limitation of doing it at the IO scheduler level is that we provide hierarchical IO scheduling only at the leaf nodes where an IO scheduler is running, and not at intermediate higher-level logical devices. This is in line with IO scheduling as it stands today, where the notion of process io priority or classes only takes effect on leaf nodes and not on intermediate logical devices.

If we really need control at higher-level logical devices (which I am not very sure about), then we need to come up with something better. One suggestion was to dynamically update the weights of individual groups on the end devices (I think that is hard to implement). Or maybe we also need a higher-level controller, which breaks the underlying IO controller but provides fairness among groups?

I personally think that by implementing hierarchical control at the IO scheduler level we get at least one thing right and maybe cover the majority of cases. At least we should get that right first.

Which I/O controller is the fairest of them all?

Posted May 13, 2009 4:50 UTC (Wed) by jejb (subscriber, #6654)

It's not quite correct to say that only Vivek was in the room at the storage summit: Fernando Cao presented the state of dm-ioband and the reasons for wanting it quite eloquently. What was missing was io-throttle.

dm-ioband solves exactly the problem of allocating bandwidth to virtual machines because it operates at the VM level. The problem is that it's too high up in the stack to do this efficiently (it's many layers away from the physical I/O driver, which is where the actual bandwidth allocations are done). io-throttle looks to be substantially similar to dm-ioband in this regard, so it would suffer from the same problem. Conversely, io-controller can regulate bandwidth exactly and efficiently at the physical layer, but it's too far away from the virtual machine layers to ensure exact compliance with VM-based limits (in fact one can prove cases where one VM can run away with all the bandwidth).

Where I thought we'd got to at the storage summit was an appreciation that the physical layer is the correct one to regulate at (being the closest to the device), but that to solve the VM bandwidth allocation problem we'd also need much of the machinery of dm-ioband (the page tracking and I/O accounting per VM); what it would do is periodically apply corrections to the static limits of io-controller to ensure the per-VM limits were respected (a bit like the way irqbalanced operates to keep interrupts well distributed).

Unfortunately the agreement seems to be misphasing again ... fortunately I can say this is Jens' problem now ...
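
A toy version of the correction loop being proposed might look like the sketch below (entirely hypothetical, not part of any posted patch; the targets, weights, and adjustment rule are invented): a daemon periodically compares each VM's achieved bandwidth with its target and nudges the per-device weight it feeds to the scheduler-level controller.

[example]
#include <stdio.h>

#define NVMS 2

struct vm {
    double target_mbps;    /* per-VM bandwidth goal set by the admin    */
    double achieved_mbps;  /* throughput measured over the last period  */
    int    weight;         /* weight currently set on the end device(s) */
};

/* One iteration of the periodic correction: nudge each VM's weight
 * toward its target, clamped to a sane range. */
static void correct(struct vm *vms, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        double err = vms[i].target_mbps - vms[i].achieved_mbps;

        vms[i].weight += (int)(err / vms[i].target_mbps * 10);
        if (vms[i].weight < 1)
            vms[i].weight = 1;
        if (vms[i].weight > 1000)
            vms[i].weight = 1000;
    }
}

int main(void)
{
    struct vm vms[NVMS] = {
        { .target_mbps = 60, .achieved_mbps = 40, .weight = 100 },
        { .target_mbps = 20, .achieved_mbps = 35, .weight = 100 },
    };
    int i;

    correct(vms, NVMS);    /* the daemon would do this every few seconds */

    for (i = 0; i < NVMS; i++)
        printf("VM %d: target %.0f MB/s, got %.0f MB/s -> new weight %d\n",
               i, vms[i].target_mbps, vms[i].achieved_mbps, vms[i].weight);
    return 0;
}
[/example]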

Which I/O controller is the fairest of them all?

Posted May 13, 2009 14:23 UTC (Wed) by vgoyal (subscriber, #49279)

A higher-level piece of software (a daemon or something else) doing dynamic weight adjustments for the groups at the physical device level, to achieve the bandwidth goal at the virtual block device, sounds like a good idea. It might be a little complicated to implement though. :-) At LSF we sort of had agreement to go in that direction, but the dm-ioband developers have to respond to this scheme of things and see whether it satisfies their requirements. If it does, then they can probably start development on a daemon for weight adjustment while we stabilize the IO scheduler based io controller.

I am not sure what kinds of storage configurations are common, but an IO scheduler based solution alone should work just fine for all kinds of hardware RAID and for disks directly attached to the system without any software RAID. The only problematic case seems to be software RAID, where bandwidth allocation would take place at a higher-level logical device.

Within software RAID, one also needs to figure out which configurations are of particular concern. For striped configurations, it may be fair to assume that IO is evenly distributed across the various disks; if we can provide proportional bandwidth on each individual disk, that should also translate into a proportional bandwidth division for the logical device.


VMs, and multiple disks priorities

Posted May 14, 2009 7:26 UTC (Thu) by zmi (subscriber, #4829)

I was missing virtual machines (Xen, VMware, ...) from the article; I just found the user comment above that mentions them. Isn't the need for I/O controllers especially urgent once you start to use VMs? On a normal machine, if it's I/O intensive you typically run only one application on it (database, webserver, SAP) and don't really need control - just a faster disk subsystem.

But with multiple VMs running, the need to prioritize them grows, and if an I/O controller gets into the kernel, it should help here. XenServer (which has been free since April) already allows priorities for VMs/disks; I wonder how they implemented it?

Also, the other place where I see a need is priorities for different mount points. Each mount point should get a priority, so that you can say /importantdb has a higher priority than /standarddb. They could go to the same disk subsystem, or another, but the point is that I want I/O on /importantdb not to block because /standarddb is doing its backup.

Which I/O controller is the fairest of them all?

Posted May 13, 2009 9:20 UTC (Wed) by danpb (subscriber, #4831)

The combination of dm-ioband + blkio-cgroup ends up with a really overcomplicated userspace interface for controlling I/O. To quote from the blkio-cgroup announcement, you have to do:

[quote]
make new bio cgroups and put some processes in them.

# mkdir /cgroup/grp1
# mkdir /cgroup/grp2
# echo 1234 > /cgroup/grp1/tasks
# echo 5678 > /cgroup/grp2/tasks

Now, check the ID of each blkio cgroup which is just created.

# cat /cgroup/grp1/blkio.id
2
# cat /cgroup/grp2/blkio.id
3

Finally, attach the cgroups to "ioband1" and assign them weights.

# dmsetup message ioband1 0 type cgroup
# dmsetup message ioband1 0 attach 2
# dmsetup message ioband1 0 attach 3
# dmsetup message ioband1 0 weight 2:30
# dmsetup message ioband1 0 weight 3:60
[/quote]

Now, consider if this was done entirely within cgroups, without any use of dmsetup. You could achieve the same result with

[example]
make new bio cgroups and put some processes in them.

# mkdir /cgroup/grp1
# mkdir /cgroup/grp2
# echo 1234 > /cgroup/grp1/tasks
# echo 5678 > /cgroup/grp2/tasks

Then assign them weights.

# echo "2:30" > /cgroup/grp1/blkio.weight
# echo "3:60" > /cgroup/grp2/blkio.weight
[/example]

This has the added benefit that if you're writing management APIs for this, you only need to be able to create/delete files & directories, and not worry about spawning external processes like dmsetup. So when the time comes to use this capability in libvirt, I'm rather hoping the kernel guys have decided on a pure cgroups userspace interface without device-mapper in the way.

Which I/O controller is the fairest of them all?

Posted May 13, 2009 14:28 UTC (Wed) by vgoyal (subscriber, #49279)

Agreed that a pure cgroup based interface for io control should be the goal, as it makes life easier for users of the infrastructure. The IO scheduler based IO controller has just a cgroup interface to do the control, with no need for device mapper tools.

Which I/O controller is the fairest of them all?

Posted May 15, 2009 19:37 UTC (Fri) by im14u2c (subscriber, #5246)

This array already has the means to identify the owner for any given page in the system. Or, at least, it will identify an owner; there's no real attempt to track multiple owners of shared pages. The blkio-cgroup patch adds some fields to this array to make it easy to identify which control group is associated with a given page. Given that, and given that block I/O requests include the address of the memory pages involved, it is not too hard to look up a control group to associate with each request. Modules like dm-ioband can then use this information to control the bandwidth used by all requests, not just those initiated directly from user space.

It's not obvious why you'd want to ding the owner of a page for the I/O associated with writing it out, unless that task is also blocked waiting for more pages. It seems like whatever is currently causing memory pressure might also be a good candidate to bill for this I/O, and that task may not be the task that owns the page getting operated on in the case of VM traffic.

For example, suppose I load $MEMHOG, and it allocates lots of pages pushing 20 or 30 small, well behaved apps to disk. Why should $MEMHOG be able to charge all this traffic it's triggering to those 20 or 30 small tasks, effectively getting more than its share of bandwidth through the back door?

Which I/O controller is the fairest of them all?

Posted May 16, 2009 20:16 UTC (Sat) by giraffedata (subscriber, #1954)

It's not obvious why you'd want to ding the owner of a page for the I/O associated with writing it out, unless that task is also blocked waiting for more pages.

I agree. Allocating I/O simply doesn't work in Linux, where most of the I/O is actually done independently by the kernel, not directly by users. The kernel needs instead to allocate among users the user-visible resources that the disk I/O supports. Charge a user for writing to a file. For dirtying a page. For nonlocal memory/file references. You can suspend process execution when these things (in aggregate) go over budget and thereby speed up other processes.

But if we want something small and simple, charging a disk write to the owner of the page is a pretty good approximation of this in some of the most important cases, because in those cases the owning task is blocked waiting for pages: what it's doing is continually writing to disk-backed files through the cache.
