Linux Anticipatory （预测）I/O 调度算法分析笔记-kernel

kernel_bird的ChinaUnix博客kernelbird.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

kernel_bird

博客访问： 42541
博文数量： 16
博客积分： 365
博客等级：一等列兵
技术积分： 135
用户组：普通用户
注册时间： 2012-05-08 13:58

文章分类

全部博文（16）

Linux内核（13）
未分配的博文（3）

文章存档

2014年（1）

2013年（1）

2012年（14）

我的朋友

30893896

相关博文

Linux Anticipatory （预测）I/O 调度算法分析笔记

分类： LINUX

2012-05-08 14:33:26

as_add_reques是调度算法的入口。AS和Deadline比较类似，都是先把request加入 sector 排序的红黑树，然后再把requst加入fifo。只不过AS因为加入了预测，需要在加入requst时，调用as_update_rq来更新当前算法所维护的状态。此外，二者不同之处是，Deadline是以读写来区分request的方向，而AS是以是否同步来区分方向：data_dir = rq_is_sync(rq); AS把读和同步写都归为一个方向：sync（也就是read）进行处理。这个是合理的，因为AS是基于进程进行预测的。相同的进程较可能具有相同的同异步读写方式。此外，把async/sync read, sync write归于read方向，并对其进行预测，是因为读和同步写往往更为紧迫。一个写操作往往需要读来更新buffer cache，因此读操作会直接影响到后面的写。而async write往往是后台进程的行为，如pdflush，紧迫程度相对较低。所以，AS的这种分类方法，以及对”read”的预测是合理的。
static void as_add_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
int data_dir;
RQ_SET_STATE(rq, AS_RQ_NEW);
data_dir = rq_is_sync(rq);
rq->elevator_private = as_get_io_context(q->node);
rq->elevator_private = as_get_io_context(q->node) 获得当前io_context。AS用于预测的统计数据都保存在io_context中（准确地说应该是io_context的 as_io_context中）。每个进程有自己的io_context结构，创建新进程时，可以通过设置CLONE_IO标志来拷贝父进程的 io_context。由于io_context是AS预测的基础和基本单位，不同的进程有不同的io_context，所以说AS是针对进程进行预测的。

if (RQ_IOC(rq)) {
as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
}
as_add_rq_rb(ad, rq);
/*
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);

as_update_rq(ad, rq); /* keep state machine up to date */
as_update_rq更新AS算法的统计信息。
Static void as_update_rq(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
/* keep the next_rq cache up to date */
ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
AS 算法允许有条件的回扫。ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir])在当前request和同一方向上下一个request间进行选择。确定选择那个request为下一个requst的因素有三个：1.当前request的起始sector(s1)，2.当前下一个request的起始sector(s2)，3.以及同方向上完成的最后一个request的结束sector: ad->last_sector[data_dir](last)。其中第三个参数代表了磁头的当前位置。注意，s1, s2都有可能小于last（s2为什么可能大于last？）。as_choose_req分别计算s1,s2到last的距离。如果是在last前面，并且该距离不超过MAXBACK，则当前距离为(last –s)* BACK_PENALTY. 比较加入惩罚因子的距离（顺着当前方向的惩罚因子为1），选择距离较小的一个request。因此，如果request离当前磁头位置较近，即便是它在磁头前面，也可能产生回扫。

if (ad->antic_status == ANTIC_WAIT_REQ
   || ad->antic_status == ANTIC_WAIT_NEXT) {
  if (as_can_break_anticipation(ad, rq))
   as_antic_stop(ad);
}
选定好request后，如果当前AS正处于预测状态(ANTIC_WAIT_REQ& ANTIC_WAIT_NEXT)，则需要根据request自身的信息以及AS维护的预测状态信息来判断是否要停止预测（调用 as_can_break_anticipation进行判断）。如果当前request是我们所要预测的，则调用as_antic_stop来停止预测（设置预测状态为ANTIC_FINISHED）。同时，as_antic_stop还会启动AS的工作队列，把预测到的request分发到设备，具体详见AS预测状态转化图。
as_can_break_anticipation需要更深入分析

static int as_dispatch_request(struct request_queue *q, int force)
{
struct as_data *ad = q->elevator->elevator_data;
const int reads = !list_empty(&ad->fifo_list[REQ_SYNC]);
const int writes = !list_empty(&ad->fifo_list[REQ_ASYNC]);
struct request *rq;

/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch_data_dir == REQ_ASYNC && !reads) {
if (ad->current_write_count == 0 || !writes)
ad->write_batch_idled = 1;
}
要理解上面代码，首先需要理解current_write_count的意义。current_write_count的注释是： how many requests left this batch 。但是我觉得其真正的含义是在write batch时，当前最多能分发的write 请求数。因为，在开始write batch时，current_write_count被赋值为write_batch_count，即write batch允许分发的最大write 请求数。当把一个write 请求分发到设备时，current_write_count递减。当current_write_count递减到0时（as_move_to_dispatch），as_batch_expired会返回write batch超时。由此可见，write batch的超时，实际上是jiffies和最大write batch请求数目确定的，即write_batch_count和current_write_count。在 update_write_batch（需要具体分析）中，会根据这些信息，动态更新write_batch_count，从而实现动态调整write bath超时的功能。
因此，上面代码表示，如果当前batch方向为write (async write)并且没有读请求，如果此时write batch的指标用完（(ad->current_write_count == 0），或者队列中没有write 请求。则标记write batch为idle。设置write_batch_idled = 1，是为了在update_write_batch中根据现在的batch情况，动态调整write_batch_count。

if (!(reads || writes)
  || ad->antic_status == ANTIC_WAIT_REQ
  || ad->antic_status == ANTIC_WAIT_NEXT
  || ad->changed_batch)//wait the compeletion of batched request????
  return 0;//in the antic state. wait for more request.
首先检查是否有读写，没有读写请求，自然立即返回。
然后检查AS是否处于预测状态的WAIT_REQ和WAIT_NEXT。因为，在这两个状态下，AS都不能向设备分发request，需要等待当前 request完成，或等待下一个request提交到AS，或者等待预测结束，具体参见AS 预测状态转换图。注意，整个预测都过程，都是在batch阶段进行的。
此时，如果batch的方向发生变化，也不能分发request到设备，必须等到上次batch的所有request都完成了，才能继续处理下一个batch。changed_batch==1表示batch方向发生变化，但是老的batch的所有request并没有处理完。changed_batch在新开始一个read或者write batch时被置1. 在分发request的时候，会检查changed_batch，如果batch发生改变，并且上一个batch还有剩下的request （nr_dispatched）没有处理完，则直接返回，不继续分发request。当处理完一个write request，在as_completed_request中，会检查是不是老 batch的最后request（ad->changed_batch && ad->nr_dispatched == 1）。如果是则，把changed_batch设为0，表明老batch的request都已处理完。由此可见，AS从batch方向改变到老batch 处理完所有请求前，不会分发任何request。

if (!(reads && writes && as_batch_expired(ad))) {
/*
* batch is still running or no reads or no writes
*/
rq = ad->next_rq[ad->batch_data_dir];

if (ad->batch_data_dir == REQ_SYNC && ad->antic_expire) {
if (as_fifo_expired(ad, REQ_SYNC))//fifo expired will interupt anticipation.
goto fifo_expired;

  if (as_can_anticipate(ad, rq)) {
   as_antic_waitreq(ad);
   return 0;
  }
}
if (rq) {
  /* we have a "next request" */
  if (reads && !writes)
   ad->current_batch_expires =
    jiffies + ad->batch_expire[REQ_SYNC];
  goto dispatch_request;
}
}
这段是处理batch阶段请求的逻辑。主要考虑两点：
1.read 请求是否有超时。如果超时，将打断batching，直接处理超时的请求。这个和deadline不同，在deadline中，如果处于batching 阶段，不管是否超时，都不会打断batching。不过，该超时的检查，只在当前为read batch并且启动预测的情况下（ad->batch_data_dir == REQ_SYNC && ad->antic_expire）才进行。这是因为，如果进行预测，很有可能导致等待的read 请求超时。如果不在这儿检查超时，最坏的情况可能是预测超时后再处理等待的请求，这样，在经常预测不成功的情况，会大大降低性能。如果我们禁止了预测功能，即antic_expire ==0，则就和deadline一样了，会直接分发请求。
2.是否可以进行预测。如果可以，则调用as_antic_waitreq来更改AS的预测状态，并立即返回，等待预测的请求。如果不能进行预测，则要分发当前batch的请求。as_can_anticipate需要具体分析。
此外，如果当前没有write请求，AS自动延长read batch的时间。current_batch_expires保存当前batch的到期时间。

/*
* at this point we are not running a batch. select the appropriate
* data direction (read / write)
*/
if (reads) {
  BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_SYNC]));
  if (writes && ad->batch_data_dir == REQ_SYNC)
   /*
    * Last batch was a read, switch to writes
    */
   goto dispatch_writes;
  if (ad->batch_data_dir == REQ_ASYNC) {
   WARN_ON(ad->new_batch);
   ad->changed_batch = 1;
  }
  ad->batch_data_dir = REQ_SYNC;
  rq = rq_entry_fifo(ad->fifo_list[REQ_SYNC].next);
  ad->last_check_fifo[ad->batch_data_dir] = jiffies;
  goto dispatch_request;
}
/*
* the last batch was a read
*/

if (writes) {
dispatch_writes:
BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_ASYNC]));

if (ad->batch_data_dir == REQ_SYNC) {
ad->changed_batch = 1;

   /*
    * new_batch might be 1 when the queue runs out of
    * reads. A subsequent submission of a write might
    * cause a change of batch before the read is finished.
    */
   ad->new_batch = 0;
  }
  ad->batch_data_dir = REQ_ASYNC;
  ad->current_write_count = ad->write_batch_count;
  ad->write_batch_idled = 0;
  rq = rq_entry_fifo(ad->fifo_list[REQ_ASYNC].next);
  ad->last_check_fifo[REQ_ASYNC] = jiffies;
  goto dispatch_request;
}

BUG();
return 0;
以上部分是选择新的batch方向。因为代码能运行到此处，说明batch已经超时，因此需要选择batch方向。
首先尝试选择read为batch新方向。如果前一个batch方向为read，则跳过去选择write为新方向。这样，read和write batch便可以交替执行。如果选择下一个batch为read方向，则需要修改一些参数，并从read方向的fifo中取出下一个请求分发给设备。
选择write方向和read方向类似，不过如果当前batch方向变成了write方向（也就是上次batch方向为read方向），则new_batch会被清零。new_batch为1表示等待第一个read请求完成。new_batch置1有两个地方:
1．在下面的代码中，在调用as_move_to_dispatch把请求提交给设备前，会检查当前请求是否为第一个read 请求。如果batch方向改变了，并且新方向为read，则该请求肯定是read batch的第一个请求。
2．在完成了一个请求调用as_completed_request时，如果完成的请求为batch方向的最后一个请求，而当前batch方向变为read时，这个时候也会把new_batch设为1.这说明当前完成的request为write。在完成该request前，batch已经改变方向为read，需要等待第一个read请求完成。同理，在as_completed_request中，如果检查到当前完成的请求为read方向的第一个请求时（ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)），new_batch会被清零。
new_batch是干什么用的呢？也就是为什么需要等待第一个read请求完成呢？纵观整个AS，只有两个地方用到了new_batch。
1. 在as_batch_expired中，如果检查到new_batch为1，就立即放回。也就是说，如果当前是在等待第一个read batch，则batch不会超时。问题：当new_batch为1时，当前batch方向一定为read吗？如果是，为什么在 as_completed_request里面有这样的判断呢？ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)
2.在as_completed_request中，如果当前为read batch并且在等待第一个read请求完成时，才会调用update_write_batch来更新writebatch。注释提到：Start counting the batch from when a request of that direction is actually serviced. This should help devices with big TCQ windows and writeback caches。究竟是为什么呢？

dispatch_request:
/*
* If a request has expired, service it.
*/
if (as_fifo_expired(ad, ad->batch_data_dir)) {
fifo_expired:
rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
}
if (ad->changed_batch) {
WARN_ON(ad->new_batch);

if (ad->nr_dispatched)//if changed batch, wait for the compeletion of bached request.
return 0;

  if (ad->batch_data_dir == REQ_ASYNC)
   ad->current_batch_expires = jiffies +
     ad->batch_expire[REQ_ASYNC];
  else
   ad->new_batch = 1;

ad->changed_batch = 0;
}
/*
* rq is the selected appropriate request.
*/
as_move_to_dispatch(ad, rq);

return 1;
}

Dispatch request的整个过程可以归纳为：1 去定batch方向。2 在batch方向找request进行分发。
对于1，如果在batch阶段，则继续，否则需要重新选择batch方向。
对于2，在有了batch方向后，需要根据该方向的fifo是否超时来判断是取顺序的下一个请求还是取超时请求。
所以，从整个过程来看，AS（包括deadline）只会处于两个阶段：read batch和write batch（中间可能存在changed batch，此阶段可以看成上一个batch的延续）。而在每个阶段，只会处理该方向的request，要么是顺序的request（包括预测到的 request），要么是超时的request。

AS算法流程图

AS预测状态转化图

阅读(536) | 评论(0) | 转发(0) |

上一篇：详解Linux内核之双向循环链表

下一篇：hlist 哈希链表

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6