deadline算法的核心就是在传统的电梯算法中加入了请求超时的机制,该机制主要体现在两点:(1)请求超时时,对超时请求的选择。(2)没有请求超时时,当扫描完电梯最后一个request后,准备返回时,对第一个request的选择。基于以上两点,平衡了系统i/o吞吐量和响应时间。此外,该算法开考虑到了读操作对写操作造成的饥饿。
算法核心数据结构:
struct deadline_data {
struct rb_root sort_list[2];
struct list_head fifo_list[2];
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
/*
* settings that change how the i/o scheduler behaves
*/
int fifo_expire[2];
int fifo_batch;
int writes_starved;
int front_merges;
};
sort_list:按照request中的sector号大小,把每个request组织在以sort_list为根的红黑树中。这样方便快速查找。总共有读写两棵树。
fifo_list:按照超时先后顺寻,把request链入filo_list,同样也是分为读写两个队列。
因此,任何一个request,在未提交给设备的请求队列之前,都会同时存在于以上两个结构中。
next_rq:指向sort_list中的下一个请求。
batching:调度算法可能连续提交多个请求,batching用于记录当前连续提交的request数目V灰猙atching < fifo_batch,都可以继续进行连续提交。
starved:提交读request而造成写饥饿的次数。如果starved超过writes_starved,则需要提交写request,从而避免写饥饿。
deadline I/O调度器中最重要的两个方法是:deadline_add_request和deadline_dispatch_requests。它们分别向调度器队列(非请求队列)添加request和把调度器队列中的请求分发到块设备的请求队列。
static void
deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
deadline_add_rq_rb(dd, rq);
/*
* set expire time (only used for reads) and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
}
deadline_add_request把请求分别加入红黑树和fifo。并记录请求的超时时间。这儿借用了request的donelist的next字段。因为在deadline调度队列的请求,绝不可能被调入请求完成队列:
#define rq_set_fifo_time(rq,exp) ((rq)->donelist.next = (void *) (exp))
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int reads = !list_empty(&dd->fifo_list[READ]);
const int writes = !list_empty(&dd->fifo_list[WRITE]);
struct request *rq;
int data_dir;
/*
* batches are currently reads XOR writes
*/
if (dd->next_rq[WRITE])
rq = dd->next_rq[WRITE];
else
rq = dd->next_rq[READ];
if (rq) {
/* we have a "next request" */
if (dd->last_sector != rq->sector)
/* end the batch on a non sequential request */
dd->batching += dd->fifo_batch;
if (dd->batching < dd->fifo_batch)
/* we are still entitled to batch */
goto dispatch_request;
}
/*
* at this point we are not running a batch. select the appropriate
* data direction (read / write)
*/
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
if (writes && (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;
data_dir = READ;
goto dispatch_find_request;
}
/*
* there are either no reads or writes have been starved
*/
if (writes) {
dispatch_writes:
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
dd->starved = 0;
data_dir = WRITE;
goto dispatch_find_request;
}
return 0;
dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
rq = dd->next_rq[data_dir];
}
dd->batching = 0;
dispatch_request:
/*
* rq is the selected appropriate request.
*/
dd->batching++;
deadline_move_request(dd, rq);
return 1;
}
deadline_dispatch_requests是deadline算法的核心。主要分为三个部分:(1)确定是要分发读还是写 request。影响处理分发类型因素主要包括是否处于batching阶段以及是否发生写饥饿。(2)确定方向以后,根据读写方向找到该方向上下一个请求进行分发。影响定位下一个请求因素主要包括请求是否超时。(3)找到合适的request后,调用deadline_move_request分发给块设备的请求队列。
1.确定读写方向.
a.首先,根据是否处于batching来确定当前处理读写的方向。因为如果处在batching过程中,就意味着调度程序需要连续处理同一方向的请求。这样可以有效增加系统吞吐量。因此,根据batching的方向,可以确定当前处理请求的方向。而读写batching是互斥的:
/*
* batches are currently reads XOR writes
*/
if (dd->next_rq[WRITE])
rq = dd->next_rq[WRITE];
else
rq = dd->next_rq[READ];
通过这个判断,保证了batching只会在某一方向进行,而不会交错。因为在deadline_move_request中有:
dd->next_rq[READ] = NULL;
dd->next_rq[WRITE] = NULL;
dd->next_rq[data_dir] = deadline_latter_request(rq);
也就是在batching方向的next_rq才可能指向下一个request。
if (rq) {
/* we have a "next request" */
if (dd->last_sector != rq->sector)
/* end the batch on a non sequential request */
dd->batching += dd->fifo_batch;
if (dd->batching < dd->fifo_batch)
/* we are still entitled to batch */
goto dispatch_request;
}
如果存在下一个request,也就是没有扫描到电梯的末尾。则判断该request是否和上一个request相连。如果相连,并且batching的request数没有超过fifo_batch,则当前这个request就是我们要分发的request。因此直接跳到最后,把request分发到设备请求队列。此时将忽略写饥饿和超时的处理。如果不连续,则要结束batching。
b.如果此时并不处于batching过程中,则根据是否造成写饥饿“超标”来确定读写方向。
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
if (writes && (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;
data_dir = READ;
goto dispatch_find_request;
}
/*
* there are either no reads or writes have been starved
*/
if (writes) {
dispatch_writes:
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
dd->starved = 0;
data_dir = WRITE;
goto dispatch_find_request;
}
调度器先处理read,也就是read请求优先。但在处理过程中考虑到了写饥饿。如果此时还有写请求,则写饥饿计数+1,如果写饥饿次数大于了writes_starved,则写饥饿已经“超标”了,因此直接跳到 dispath_writes去处理写请求。如果写饥饿没有“超标”,则继续处理读请求。
2.根据读写方向,找到当前要处理的请求:
dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
rq = dd->next_rq[data_dir];
}
dd->batching = 0;
在寻找指定方向上的请求时,考虑了请求的超时时间。这就是deadline的算法核心所在。调度器首先调用deadline_check_fifo来检查队列中队首,也就是最老的一个请求是否超时。如果超时则指定当前处理的请求为该超时的请求。但如果没有超时,但已经扫描了电梯的末尾:!dd->next_rq[data_dir]。此时需要返回到电梯首部。但与传统的电梯算法不同,deadline调度器不是返回到sector最小的request开始继续扫描。而是返回到等待时间最久的那个requst,从那个request 开始,沿sector递增方向继续扫描。也就是说,如果超时,或者扫描到电梯尾,都会返回来处理等待最久的request,并从这个request开始继续进行电梯扫描。当然,如果既没有发生超时,也没有扫描到电梯末尾,则沿sector递增方向上的下一个request就是当前要处理的request。
3.找到要处理的request后,把它分发到块设备的请求队列。
整个deadline调度器比较简洁,总共只有400多行。它充分考虑了batching,写饥饿,请求超时这三大因素。在保证吞吐量的基础上,有考虑到了响应延时。