Blktrace: Principles and Usage - up哥小号 - ChinaUnix blog (upge.blog.chinaunix.net)

Blktrace: Principles and Usage

Category: Linux

2012-12-06 15:57:56

Tracing a specific disk or partition with blktrace

Parsing blktrace data with blkparse

Usage example

Output explained

Appendix: meaning of each action

Attachment: blktrace.pdf

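Introduction to blktrace

blktrace is a user-space tool that collects detailed information about disk I/O at the point where it passes through the block device layer (the block layer, hence "blk trace"): request submission, queueing, merging, completion, and so on.

The block device layer is the "block layer" box in the figure below (borrowed from 褚霸).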

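How blktrace works

(1) When tracing, blktrace creates one thread per logical CPU on the machine and binds each thread to its own logical CPU to collect data.

(2) blktrace calls ioctl with three arguments: the file descriptor of the traced block device, the request _IOWR(0x12,115,struct blk_user_trace_setup), and &blk_user_trace_setup. This system call hands the setup to the kernel, which calls the corresponding functions to handle it. Under the debugfs mount point (default /sys/kernel/debug) there is then one trace file per CPU, each opened by one thread and so with its own file descriptor, and the kernel delivers the trace data through these debugfs files.

(3) blktrace has to be used together with blkparse, which parses the binary data that blktrace produces in its own format.

(4) blkparse only opens the files blktrace produced, reads the data for display, and prints per-CPU statistics at the end. The action letters it shows (A, U, Q and so on; see below) are not stored in the trace: blkparse computes t->action & 0xffff and converts the resulting number into the letter itself.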
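
The conversion in step (4) can be sketched as a small lookup. This is only an illustration: action_letter is a made-up helper name, and the numeric values are assumed from the enum in blktrace_api.h, so check them against the header in your own source tree before relying on them.

```shell
# Map the low 16 bits of a trace's action field to the letter blkparse prints.
# ASSUMPTION: the numeric values follow the enum order in blktrace_api.h.
action_letter() {
    case $(( $1 & 0xffff )) in
        1)  echo Q ;;   # queue
        2)  echo M ;;   # back merge
        3)  echo F ;;   # front merge
        4)  echo G ;;   # get request
        5)  echo S ;;   # sleep request
        6)  echo R ;;   # requeue
        7)  echo D ;;   # issue to driver
        8)  echo C ;;   # complete
        9)  echo P ;;   # plug
        10) echo U ;;   # unplug
        11) echo T ;;   # unplug due to timer
        12) echo I ;;   # insert
        13) echo X ;;   # split
        14) echo B ;;   # bounce
        15) echo A ;;   # remap
        *)  echo '?' ;; # anything this sketch does not know
    esac
}

# The upper 16 bits carry category flags, so they are masked off first.
action_letter $(( 0x00020001 ))
```

The mask is the important part: two records with different category bits but the same low 16 bits are displayed with the same letter.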

Installing blktrace

1. yum install blktrace

2. Or get the source code and build it yourself:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt

cd bt

make

make install

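Using blktrace

Mounting debugfs

As described under how blktrace works, blktrace relies on the kernel to deliver trace data through the debugfs file system (debugfs lives in memory).

So debugfs has to be mounted before blktrace is used:

mount -t debugfs debugfs /sys/kernel/debug

Alternatively, add the following line to /etc/fstab so that it is mounted automatically at boot:

debugfs /sys/kernel/debug debugfs defaults 0 0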
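
Before mounting by hand it can be worth checking whether debugfs is already mounted. The sketch below assumes the usual /proc/mounts layout (device, mount point, file system type, options); debugfs_mounted and ensure_debugfs are illustrative names, not part of blktrace.

```shell
# Succeed if a debugfs entry exists in a mounts table.
# $1 is the table to read; it defaults to /proc/mounts.
debugfs_mounted() {
    awk '$3 == "debugfs" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}

# Mount only when needed (the mount itself requires root).
ensure_debugfs() {
    if debugfs_mounted; then
        echo "debugfs already mounted"
    else
        mount -t debugfs debugfs /sys/kernel/debug && echo "debugfs mounted"
    fi
}
```

Call ensure_debugfs as root before starting blktrace. The table path is a parameter only so that the check can be tried against a copy of the file.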

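Tracing a specific disk or partition with blktrace

See man blktrace for the full syntax; only the common usage is covered here.

Output to files

mkdir test && cd test # blktrace writes its data to the current directory by default. As mentioned under how blktrace works, there is one thread per logical CPU and each produces one file, so there will be as many files as CPUs.

blktrace -d /dev/sda -o test1

# Traces /dev/sda. The output files are named test1.blktrace.N, with N from 0 to (number of CPUs - 1). They contain binary data that blkparse has to parse.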
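
To make the naming concrete, here is a minimal sketch that lists the files one would expect for a given output name and CPU count. It assumes CPUs are numbered 0 to N-1 without gaps; expected_trace_files is an illustrative helper, not a blktrace feature.

```shell
# List the per-CPU files that "blktrace -o <base>" is expected to create.
# ASSUMPTION: CPUs are numbered 0..N-1 with no gaps.
expected_trace_files() {
    base=$1
    ncpu=$2
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        echo "$base.blktrace.$cpu"
        cpu=$((cpu + 1))
    done
}

expected_trace_files test1 4
```

On a real machine the CPU count could be taken from getconf _NPROCESSORS_ONLN, for example expected_trace_files test1 "$(getconf _NPROCESSORS_ONLN)".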

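Output to the terminal

blktrace -d /dev/sda -o - | blkparse -i -

Passing "-" as the output name sends the trace to standard output, but that is raw binary and unreadable, so it is piped to blkparse to be parsed in real time.

blkparse's -i option normally takes a file name. For blktrace, "-" as the output stands for the terminal (this symbol is hard-coded in the source), and blkparse likewise takes "-" to mean that it should parse what arrives from the pipe.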

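Parsing blktrace data with blkparse

See man blkparse for the full syntax; only the common usage is covered here.

Parsing files

blkparse -i test1 # parses all of test1.blktrace.N, with N from 0 to (number of CPUs - 1); only files that contain data are counted

Real-time parsing

Real-time parsing is what the "Output to the terminal" example for blktrace above does.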

Usage example

Terminal 1 (leave this running):

blktrace /dev/sda -o - | blkparse -i -

Terminal 2:

dd if=/dev/zero of=/root/a1 bs=4k count=1000

Terminal 1 then shows:

8,0 16 3041 94.435078912 891 A W 72411584 + 8 <- (8,2) 71884224

8,0 16 3042 94.435079691 891 Q W 72411584 + 8 [flush-8:0]

8,0 16 3043 94.435080790 891 M W 72411584 + 8 [flush-8:0]

8,0 16 3044 94.435083089 891 A W 72411592 + 8 <- (8,2) 71884232
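
With a longer capture it is often easier to start from a per-action count than to read every line. The sketch below counts the action letter (the sixth field) in output like the sample above; count_actions is an illustrative helper, and it assumes the default blkparse format.

```shell
# Count how often each action letter appears in default blkparse output.
# Only lines that start with "major,minor" are counted, so the summary
# that blkparse prints at the end is skipped.
count_actions() {
    awk 'NF >= 7 && $1 ~ /^[0-9]+,[0-9]+$/ { n[$6]++ }
         END { for (a in n) print a, n[a] }' | sort
}

count_actions <<'EOF'
8,0 16 3041 94.435078912 891 A W 72411584 + 8 <- (8,2) 71884224
8,0 16 3042 94.435079691 891 Q W 72411584 + 8 [flush-8:0]
8,0 16 3043 94.435080790 891 M W 72411584 + 8 [flush-8:0]
8,0 16 3044 94.435083089 891 A W 72411592 + 8 <- (8,2) 71884232
EOF
```

For the four sample lines this reports two A records and one each of M and Q. For saved traces the same function can be fed with blkparse -i test1 | count_actions.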

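Output explained

This is the default output format. In the code, a common header is printed first with the default format, and then, depending on the action, further fields may or may not follow.

The header is printed with -f "%D %2c %8s %5T.%9t %5p %2a %3d ".

Each letter selects a field, and the number is the field width in characters, just as in printf.

For example:

8,0 16 3042 94.435079691 891 Q W 72411584 + 8 [flush-8:0]

Matching this line against the default header format:

(1) 8,0 corresponds to %D, the major and minor device numbers.

(2) 16 corresponds to %2c, the CPU id.

(3) 3042 corresponds to %8s, the sequence number (a number that blkparse generates itself; the actual I/O carries no such number).

(4) 94.435079691 corresponds to %5T.%9t, the time as "seconds.nanoseconds".

(5) 891 corresponds to %5p, the process id.

(6) Q corresponds to %2a, the action. The actions are listed in the table below (Q, for instance, means "IO handled by request queue code"); the appendix describes each one in more detail.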

The following table shows the various actions which may be output.

Act  Description
A    IO was remapped to a different device
B    IO bounced
C    IO completion
D    IO issued to driver
F    IO front merged with request on queue
G    Get request
I    IO inserted onto request queue
M    IO back merged with request on queue
P    Plug request
Q    IO handled by request queue code
S    Sleep request
T    Unplug due to timeout
U    Unplug request
X    Split

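(7) W corresponds to %3d, the RWBS field (W means a write). The letters mean the following:

It contains at least one character of "RWD" (R for read, W for write, D for block discard).

It may be followed by "BS" (B for barrier, S for synchronous).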
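
The seven header fields can be pulled apart with a few lines of awk. This is a minimal sketch that assumes the default format described above; parse_header is an illustrative name.

```shell
# Split the header of one default-format blkparse line into named fields.
parse_header() {
    echo "$1" | awk '{
        split($1, dev, ",")   # %D: major,minor
        split($4, ts, ".")    # %5T.%9t: seconds.nanoseconds
        printf "major=%s minor=%s cpu=%s seq=%s sec=%s nsec=%s pid=%s action=%s rwbs=%s\n",
               dev[1], dev[2], $2, $3, ts[1], ts[2], $5, $6, $7
    }'
}

parse_header '8,0 16 3042 94.435079691 891 Q W 72411584 + 8 [flush-8:0]'
```

Everything after the seventh field depends on the action, so this sketch leaves it alone.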

After the header, the rest of the line is printed as follows (this is how the source code does it):

switch (act[0]) {
case 'R': /* Requeue */
case 'C': /* Complete */
    if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
        char *p = dump_pdu(pdu_buf, pdu_len);
        if (p)
            fprintf(ofp, "(%s) ", p);
        fprintf(ofp, "[%d]\n", t->error);
    } else {
        if (elapsed != -1ULL) {
            if (t_sec(t))
                fprintf(ofp, "%llu + %u (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, t->error);
            else
                fprintf(ofp, "%llu (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        elapsed, t->error);
        } else {
            if (t_sec(t))
                fprintf(ofp, "%llu + %u [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), t->error);
            else
                fprintf(ofp, "%llu [%d]\n",
                        (unsigned long long) t->sector,
                        t->error);
        }
    }
    break;
case 'D': /* Issue */
case 'I': /* Insert */
case 'Q': /* Queue */
case 'B': /* Bounce */
    if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
        char *p;
        fprintf(ofp, "%u ", t->bytes);
        p = dump_pdu(pdu_buf, pdu_len);
        if (p)
            fprintf(ofp, "(%s) ", p);
        fprintf(ofp, "[%s]\n", name);
    } else {
        if (elapsed != -1ULL) {
            if (t_sec(t))
                fprintf(ofp, "%llu + %u (%8llu) [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, name);
            else
                fprintf(ofp, "(%8llu) [%s]\n", elapsed,
                        name);
        } else {
            if (t_sec(t))
                fprintf(ofp, "%llu + %u [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), name);
            else
                fprintf(ofp, "[%s]\n", name);
        }
    }
    break;
case 'M': /* Back merge */
case 'F': /* Front merge */
case 'G': /* Get request */
case 'S': /* Sleep request */
    if (t_sec(t))
        fprintf(ofp, "%llu + %u [%s]\n",
                (unsigned long long) t->sector, t_sec(t), name);
    else
        fprintf(ofp, "[%s]\n", name);
    break;
case 'P': /* Plug */
    fprintf(ofp, "[%s]\n", name);
    break;
case 'U': /* Unplug IO */
case 'T': /* Unplug timer */
    fprintf(ofp, "[%s] %u\n", name, get_pdu_int(t));
    break;
case 'A': /* remap */
    get_pdu_remap(t, &r);
    fprintf(ofp, "%llu + %u <- (%d,%d) %llu\n",
            (unsigned long long) t->sector, t_sec(t),
            MAJOR(r.device_from), MINOR(r.device_from),
            (unsigned long long) r.sector_from);
    break;
case 'X': /* Split */
    fprintf(ofp, "%llu / %u [%s]\n", (unsigned long long) t->sector,
            get_pdu_int(t), name);
    break;
case 'm': /* Message */
    fprintf(ofp, "%*s\n", pdu_len, pdu_buf);
    break;
default:
    fprintf(stderr, "Unknown action %c\n", act[0]);
    break;
}

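So, taking the sample lines one at a time:

8,0 16 3042 94.435079691 891 Q W 72411584 + 8 [flush-8:0]

Here act[0] = 'Q'. The 72411584 that follows is the starting sector number relative to 8:0 (that is, sda). "+ 8" means the 8 consecutive sectors from there (a sector is 512 bytes by default, so 8 sectors is 4 KB). The trailing [flush-8:0] is the name of the process.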
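
The "sector + count" pair translates directly into a byte offset and length on the device. A minimal sketch, assuming 512-byte sectors as in the text; extent_bytes is an illustrative name.

```shell
# Convert "<sector> + <count>" into a byte offset and a length.
# ASSUMPTION: 512-byte sectors, as stated in the text.
extent_bytes() {
    sector=$1
    count=$2
    echo "offset=$((sector * 512)) length=$((count * 512))"
}

extent_bytes 72411584 8
```

For the sample line the length comes out as 4096 bytes, which matches the bs=4k used by the dd command in the usage example.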

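8,0 16 3041 94.435078912 891 A W 72411584 + 8 <- (8,2) 71884224

Here act[0] = 'A'. 72411584 is the starting sector number relative to 8:0 (that is, sda), and "(8,2) 71884224" says that the same I/O starts at sector 71884224 relative to the /dev/sda2 partition. (Because /dev/sda2 is a partition on the sda disk, a position on sda2 first has to be mapped onto sda.)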
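
The two sector numbers on a remap line differ by a fixed amount, which for a plain partition would be the partition's starting sector on the disk. The sketch below extracts that difference; remap_offset is an illustrative name, and the field positions assume the default blkparse format.

```shell
# From an 'A' (remap) line, print the source device and the difference
# between the sector on the target device and the sector on the source.
remap_offset() {
    echo "$1" | awk '$6 == "A" {
        # Fields: ... A RWBS <sector> + <count> <- (major,minor) <sector_from>
        gsub(/[()]/, "", $12)
        print "from=" $12, "start=" ($8 - $13)
    }'
}

remap_offset '8,0 16 3041 94.435078912 891 A W 72411584 + 8 <- (8,2) 71884224'
```

Both A lines in the sample output give the same difference, 527360 sectors, which is consistent with a constant partition offset.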

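Sector numbers are contiguous on the disk, and the file system divides the disk into blocks of several sectors each, so block number = sector number × 512 / block size (in other words, sector number / sectors per block).

With the block number you can find the corresponding inode:

debugfs -R 'icheck <block number>' <disk or partition>

If the block number was computed from a sector number relative to sda2, then debugfs -R 'icheck <block number>' /dev/sda2 finds the corresponding inode.

With the inode you can find out which file it is:
find / -inum your_inode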
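
The arithmetic for the lookup above can be sketched as follows. The 4096-byte block size is only an assumption for the example (on ext file systems the real value appears as "Block size" in the output of tune2fs -l), and sector_to_block and show_lookup are illustrative names. The script prints the debugfs command rather than running it.

```shell
# Convert a partition-relative sector number into a file system block number.
# $1: sector number (512-byte units), $2: file system block size in bytes.
sector_to_block() {
    echo $(( $1 * 512 / $2 ))
}

# Print (not run) the inode lookup command for a sector on a partition.
show_lookup() {
    block=$(sector_to_block "$1" "$2")
    echo "debugfs -R 'icheck $block' $3"
}

# ASSUMPTION: /dev/sda2 uses 4096-byte blocks.
show_lookup 71884224 4096 /dev/sda2
```

Note that the sector passed in is the one relative to sda2 (71884224 from the remap line), not the one relative to the whole disk.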

For a worked example, see this post by an expert at Taobao: http://blog.tao.ma/?p=61

Appendix: meaning of each action

C - complete: A previously issued request has been completed. The output will detail the sector and size of that request, as well as the success or failure of it.

D - issued: A request that previously resided on the block layer queue or in the io scheduler has been sent to the driver.

I - inserted: A request is being sent to the io scheduler for addition to the internal queue and later service by the driver. The request is fully formed at this time.

Q - queued: This notes intent to queue io at the given location. No real request exists yet.

B - bounced: The data pages attached to this bio are not reachable by the hardware and must be bounced to a lower memory location. This causes a big slowdown in io performance, since the data must be copied to/from kernel buffers. Usually this can be fixed by using better hardware: either a better io controller, or a platform with an IOMMU.

m - message: Text message generated via kernel call to blk_add_trace_msg.

M - back merge: A previously inserted request exists that ends on the boundary of where this io begins, so the io scheduler can merge them together.

F - front merge: Same as the back merge, except this io ends where a previously inserted request starts.

G - get request: To send any type of request to a block device, a struct request container must be allocated first.

S - sleep: No request structures were available, so the issuer has to wait for one to be freed.

P - plug: When io is queued to a previously empty block device queue, Linux will plug the queue in anticipation of future ios being added before this data is needed.

U - unplug: Some request data is already queued in the device, so start sending requests to the driver. This may happen automatically if a timeout period has passed (see next entry) or if a number of requests have been added to the queue.

T - unplug due to timer: If nobody requests the io that was queued after plugging the queue, Linux will automatically unplug it after a defined period has passed.

X - split: On raid or device mapper setups, an incoming io may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of io.

A - remap: For stacked devices, incoming io is remapped to the device below it in the io stack. The remap action details what exactly is being remapped to what.

Attachment (official documentation): blktrace.pdf


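Comment from tanssy, 2014-06-11 14:35:51:

Is the time in the fourth column of the blkparse output the time at which the action started, or the time at which it finished?
And is the duration of an action then the timestamp on the next line minus the timestamp on this line?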
