Using oprofile to Analyze Performance Bottlenecks - gliethttp - ChinaUnix Blog

Using oprofile to Analyze Performance Bottlenecks

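1. Overview

oprofile is a powerful performance analysis tool for Linux, similar to Intel VTune.

It supports two sampling modes: event-based sampling and time-based sampling.

With event-based sampling, oprofile counts occurrences of a specific event (for example, L2 cache misses).
Each time the count reaches a user-defined threshold, oprofile records one sample. This mode requires
hardware performance counters inside the CPU. Most modern CPUs have them; the Loongson 2E has two.

With time-based sampling, oprofile relies on the OS timer interrupt and records one sample on every tick.
This mode exists to support CPUs that have no performance counters. It is less precise than event-based
sampling, and because it depends on the timer interrupt, oprofile cannot profile code that runs with
interrupts disabled.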
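
The event-based mode described above can be pictured with a small model. This is only a sketch: struct perf_counter, counter_init and counter_feed are hypothetical names made up for illustration, not part of oprofile or of any kernel API. The model counts down from a reset value and records one sample each time the count runs out.

```c
#include <stdint.h>

/* Hypothetical model of one hardware performance counter (illustration only). */
struct perf_counter {
    uint64_t reset_value; /* events between two samples, e.g. 1000 */
    uint64_t remaining;   /* events left until the next sample */
    uint64_t samples;     /* samples recorded so far */
};

/* Prepare the counter. A reset value of 0 is treated as 1 so the model always makes progress. */
void counter_init(struct perf_counter *c, uint64_t reset_value)
{
    c->reset_value = reset_value ? reset_value : 1;
    c->remaining = c->reset_value;
    c->samples = 0;
}

/* Feed n events into the counter and record a sample each time it runs out. */
void counter_feed(struct perf_counter *c, uint64_t n)
{
    while (n >= c->remaining) {
        n -= c->remaining;
        c->samples++;                  /* the threshold was reached: take one sample */
        c->remaining = c->reset_value; /* re-arm the counter */
    }
    c->remaining -= n;
}
```

With a reset value of 1000, feeding 2500 events leaves two samples recorded and 500 events to go before the third. A smaller reset value gives finer resolution at the cost of more sampling overhead.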

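On Linux, oprofile consists of two parts: a kernel module (oprofile.ko) and a user-space daemon (oprofiled).
The kernel module accesses the performance counters, or registers the time-based sampling function (via
register_timer_hook, so that the timer interrupt handler can call it when it finally runs profile_tick),
and stores the samples in a kernel buffer. The daemon runs in the background, collects the data from
kernel space, and writes it to files.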

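2. Installing oprofile

Taking the Loongson 2E platform as an example: to use oprofile, first boot a kernel built with oprofile
support. Then install these three packages: oprofile, oprofile-common, oprofile-gui. The core package is
oprofile-common, which provides the following tools:

/usr/bin/oprofiled the daemon
/usr/bin/opcontrol the control front end that handles user interaction; the most frequently used tool
/usr/bin/opannotate annotates source or assembly code with the collected data
/usr/bin/opreport produces an overview by binary image or by symbol
/usr/bin/ophelp lists the events oprofile supports
/usr/bin/opgprof produces profile data in gprof format
...

oprofile has already been ported to the Loongson 2E, including the user-space tool packages, and is ready to use.

A test kernel with oprofile already enabled is located at:

The deb packages for the user-space tools are located at: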

3. oprofile Quick Start

a. Initialize

opcontrol --init

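This command loads the oprofile.ko module and mounts oprofilefs. On success, a number of files and
directories appear under /dev/oprofile/, such as: cpu_type, dump, enable, pointer_size, stats/

b. Configure

This step mainly sets the event to count, the sample count, and the CPU modes to count in (user mode, kernel mode).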

opcontrol --setup --event=CYCLES:1000::0:1

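This sets the counted event to CYCLES, i.e. processor clock cycles are counted.
The sample count is 1000, i.e. oprofile takes one sample every 1000 clock cycles.
Cycles spent in kernel mode are not counted (kernel field = 0).
Cycles spent in user mode are counted (user field = 1).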

--event=name:count:unitmask:kernel:user

   name:    event name, e.g. CYCLES or ICACHE_MISSES
   count: reset counter value e.g. 100000
   unitmask: hardware unit mask e.g. 0x0f
   kernel: whether to profile kernel: 0 or 1
   user:    whether to profile userspace: 0 or 1
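
To make the five-field layout concrete, here is a minimal parsing sketch. It is not oprofile's own parser: struct event_spec, next_field and parse_event_spec are hypothetical names, and the defaults used for omitted fields (unit mask 0, kernel 1, user 1) are assumptions made for this illustration. Empty fields, such as the unit mask in CYCLES:1000::0:1, are kept in position rather than skipped.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the fields of name:count:unitmask:kernel:user. */
struct event_spec {
    char name[64];
    unsigned long count;
    unsigned long unitmask;
    int kernel;
    int user;
};

/* Copy the next ':'-separated field into buf. Empty fields are preserved.
 * Returns 1 if a field was present, 0 once the input is exhausted. */
static int next_field(const char **cursor, char *buf, size_t buflen)
{
    const char *start = *cursor;
    const char *end;
    size_t len;

    if (start == NULL)
        return 0;
    end = strchr(start, ':');
    len = end ? (size_t)(end - start) : strlen(start);
    if (len >= buflen)
        len = buflen - 1;          /* truncate over-long fields */
    memcpy(buf, start, len);
    buf[len] = '\0';
    *cursor = end ? end + 1 : NULL;
    return 1;
}

/* Parse an event specification. Returns 0 on success, -1 if the name is missing. */
int parse_event_spec(const char *text, struct event_spec *out)
{
    char field[64];
    const char *cursor = text;

    /* Defaults for omitted fields: assumptions of this sketch. */
    out->count = 0;
    out->unitmask = 0;
    out->kernel = 1;
    out->user = 1;

    if (!next_field(&cursor, field, sizeof field) || field[0] == '\0')
        return -1;
    strcpy(out->name, field);      /* safe: field and name are both 64 bytes */

    if (next_field(&cursor, field, sizeof field) && field[0])
        out->count = strtoul(field, NULL, 0);
    if (next_field(&cursor, field, sizeof field) && field[0])
        out->unitmask = strtoul(field, NULL, 0);   /* base 0 accepts 0x0f */
    if (next_field(&cursor, field, sizeof field) && field[0])
        out->kernel = strtol(field, NULL, 10) != 0;
    if (next_field(&cursor, field, sizeof field) && field[0])
        out->user = strtol(field, NULL, 10) != 0;
    return 0;
}
```

Parsing CYCLES:1000::0:1 with this sketch gives name CYCLES, count 1000, an unset unit mask, kernel 0 and user 1, matching the explanation above.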

c. Start

opcontrol --start

d. Run the program to be analyzed

./ffmpeg -c cif -vcodec mpeg4 -i /root/paris.yuv paris.avi

e. Retrieve the data

opcontrol --dump
opcontrol --stop

f. Analyze the results

opreport -l ./ffmpeg

This prints output like the following:

CPU: GODSON2E, speed 0 MHz (estimated)
Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 10000
samples  %       symbol name
11739 27.0148  pix_abs16_c
6052    13.9274  pix_abs16_xy2_c
4439    10.2154  ff_jpeg_fdct_islow
2574    5.9235  pix_abs16_y2_c
2555    5.8798  dct_quantize_c
2514    5.7854  pix_abs8_c
2358    5.4264  pix_abs16_x2_c
1388    3.1942  diff_pixels_c
964    2.2184  ff_estimate_p_frame_motion
852    1.9607  simple_idct_add
768    1.7674  sse16_c
751    1.7283  ff_epzs_motion_search
735    1.6914  pix_norm1_c
619    1.4245  pix_sum_c
561    1.2910  mpeg4_encode_blocks
558    1.2841  encode_thread
269    0.6190  put_no_rnd_pixels16_c
255    0.5868  dct_unquantize_h263_inter_c

......
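
If you want to post-process a report like this, each symbol row can be read with a single sscanf call. The sketch below is only an illustration: struct symbol_row and parse_symbol_row are hypothetical names, and it assumes rows in the three-column layout shown above (samples, percentage, symbol name). Header lines are rejected because they do not start with a number.

```c
#include <stdio.h>

/* One row of the "samples  %  symbol name" table (hypothetical helper type). */
struct symbol_row {
    unsigned long samples;
    double percent;
    char symbol[64];
};

/* Parse one report line. Returns 0 for a symbol row, -1 for anything else
 * (the CPU line, the column header, or the "......" elision). */
int parse_symbol_row(const char *line, struct symbol_row *row)
{
    int matched = sscanf(line, "%lu %lf %63s",
                         &row->samples, &row->percent, row->symbol);
    return matched == 3 ? 0 : -1;
}
```

Feeding it every line of the report and keeping the rows that return 0 gives a list that can be sorted or summed, for example to see how much of the total the pix_abs* functions account for together.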

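4. Example

oprofile can analyze processor cycles, TLB misses, branch mispredictions, cache misses, interrupt handlers, and more.
Use opcontrol --list-events to list the events that can be monitored on the current processor.

Below we analyze a poorly written example:

[Code with a cache problem: cache.c]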
+++++++++++++++++++++++++++++++++++++++++++++++

int matrix[2047][7];

void bad_access()
{
    int k, j, sum = 0;

    for(k = 0; k < 7; k++)
        for(j = 0; j < 2047; j++)
            sum += matrix[j][k] * 1024;
}

int main()
{
    int i;

    for(i = 0; i < 100000; i++)
        bad_access();

    return 0;
}

+++++++++++++++++++++++++++++++++++++++++++++++

Compile it: gcc -g cache.c -o cache

Profile it with oprofile:

opcontrol --init

opcontrol --setup --event=DCACHE_MISSES:500::0:1

opcontrol --start && ./cache && opcontrol --dump && opcontrol --stop

The opannotate output is:

/*
* Command line: opannotate --source ./cachee
*
* Interpretation of command line:
* Output annotated source file with samples
* Output all files
*
* CPU: GODSON2E, speed 0 MHz (estimated)
* Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
*/
/*
* Total samples for file : "/comcat/test/pmc.test/cachee.c"
*
*    34 100.000
*/

            :int matrix[2047][7];
            :
            :void bad_access()
            :{ /* bad_access total:    33 97.0588 */
            : int k, j, sum = 0;
            :
            : for(k = 0; k < 7; k++)
33 97.0588 :       for(j = 0; j < 2047; j++)
            :          sum += matrix[j][k] * 1024;
            :
            :}
            :
            :int main()
            :{ /* main total:    1  2.9412 */
            : int i;
            :
   1  2.9412 : for(i = 0; i< 10000; i++)
            :             bad_access();
            :
            : return 0;
            :
            :}
            :

The opreport output is:

GodSonSmall:/comcat/test/pmc.test# opreport -l ./cache
CPU: GODSON2E, speed 0 MHz (estimated)
Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
samples  %       symbol name
33    97.0588  bad_access
1       2.9412  main

bad_access() accounts for 33 cache-miss samples, 97% of the total.

After rewriting bad_access() as good_access():

void good_access()
{
    int k, j, sum = 0;

    for(k = 0; k < 2047; k++)
        for(j = 0; j < 7; j++)
            sum += matrix[k][j] * 1024;
}

CPU: GODSON2E, speed 0 MHz (estimated)
Counted ICACHE_MISSES events (Instruction Cache misses number ) with a unit mask of 0x00 (No unit mask) count 500
samples  %       symbol name
22    95.6522  good_access
1       4.3478  main

After the change, the function's cache-miss samples drop from 33 to 22 (95% of the now smaller total of 23).
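
The difference between the two loop orders comes down to how far apart consecutive accesses land in memory. The sketch below makes that visible. element_offset, sum_column_first and sum_row_first are hypothetical names introduced here; unlike the original functions, the two traversals return their sum, a deliberate change so that an optimizing compiler cannot discard the loops.

```c
#include <stddef.h>

#define ROWS 2047
#define COLS 7

int matrix[ROWS][COLS];

/* Byte offset of matrix[row][col] from the start of the array.
 * C stores 2-D arrays row by row, so one row occupies COLS * sizeof(int) bytes. */
size_t element_offset(size_t row, size_t col)
{
    return (row * COLS + col) * sizeof(int);
}

/* Same order as bad_access(): the inner loop walks down a column, so each
 * access is COLS * sizeof(int) bytes away from the previous one. */
long sum_column_first(void)
{
    long sum = 0;
    int k, j;

    for (k = 0; k < COLS; k++)
        for (j = 0; j < ROWS; j++)
            sum += (long)matrix[j][k] * 1024;
    return sum;
}

/* Same order as good_access(): the inner loop walks along a row, so each
 * access is only sizeof(int) bytes away from the previous one. */
long sum_row_first(void)
{
    long sum = 0;
    int k, j;

    for (k = 0; k < ROWS; k++)
        for (j = 0; j < COLS; j++)
            sum += (long)matrix[k][j] * 1024;
    return sum;
}
```

Both functions compute the same value; only the order of memory accesses differs. Assuming a 4-byte int, the column-first loop strides 28 bytes per access while the row-first loop strides 4, so the row-first version makes fuller use of each cache line it loads.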
You can also use gprof: compile your program with -pg -g.

Running the program then produces gmon.out in the current directory.

gprof ./your_program_name

This displays the profile.

----------------------------------------

oprofile gives more precise results:

opcontrol --reset
opcontrol --init
opcontrol --setup --event=CYCLES:1000
opcontrol --start && ./your_program_name && opcontrol --dump && opcontrol --stop

opreport -l ./your_program_name

This displays the results. With oprofile, compiling with -g is all that is needed.
