1. 系统运行环境
客观的反应系统的执行环境,通常包括如CPU 利用率,Memory 使用情况, Storage 剩余情况等。这些资料也非常重要,比如可以快速的知道,当时是否有Process 在疯狂的执行,当时是不是处于严重的low memory 情况, Storage 是否有耗尽的情况发生等。
2. CPU Usage
追查CPU 利用率可大体的知道,当时机器是否有Process 在疯狂的运行, 当时系统运行是否繁忙。通常死机分析,只需要抓取基本的使用情况即可。下面说一下一般的抓取方式
2.1 top
top 可以简单的查询Cpu 的基本使用情况
Usage: top [ -m max_procs ] [ -n iterations ] [ -d delay ] [ -s sort_column ] [ -t ] [ -h ]
-m num Maximum number of processes to display.
-n num Updates to show before exiting.
-d num Seconds to wait between updates.
-s col Column to sort by (cpu,vss,rss,thr).
-t Show threads instead of processes.
-h Display this help screen.
注意的是top 的CPU% 是按全部CPU 来计算的,如果以单线程来计算,比如当时有开启4个核心,那么最多吃到25%.
个人常见的操作方式如: top -t -m 5 -n 2
I.E 正常情况
User 2%, System 12%, IOW 0%, IRQ 0%
User 14 + Nice 0 + Sys 67 + Idle 471 + IOW 0 + IRQ 0 + SIRQ 2 = 554
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
2423 2423 1 8% R 2316K 1128K root top top
270 270 0 1% S 2160K 924K root aee_resmon /system/bin/aee_resmon
159 159 0 0% D 0K 0K root bat_thread_kthr
3 3 0 0% S 0K 0K root ksoftirqd/0
57 57 0 0% D 0K 0K root hps_main
User 1%, System 7%, IOW 0%, IRQ 0%
User 10 + Nice 0 + Sys 41 + Idle 494 + IOW 0 + IRQ 0 + SIRQ 0 = 545
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
2423 2423 1 8% R 2324K 1152K root top top
57 57 0 0% D 0K 0K root hps_main
242 419 0 0% S 8600K 4540K shell mobile_log_d /system/bin/mobile_log_d
982 991 0 0% S 4364K 1156K media_rw sdcard /system/bin/sdcard
272 272 0 0% S 30680K 9048K root em_svr /system/bin/em_svr
从上面可以看出, 系统基本运行正常,没有很吃CPU 的进程.
I.E 异常情况
User 59%, System 4%, IOW 2%, IRQ 0%
User 1428 + Nice 0 + Sys 110 + Idle 811 + IOW 67 + IRQ 0 + SIRQ 1 = 2417
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
16132 32195 3 14% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32190 1 14% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32188 2 14% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32187 0 14% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 1% R 2068K 1020K shell top top
User 67%, System 3%, IOW 7%, IRQ 0%
User 1391 + Nice 0 + Sys 75 + Idle 435 + IOW 146 + IRQ 0 + SIRQ 1 = 2048
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
16132 32195 3 16% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32188 2 16% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32190 0 16% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32187 1 16% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 2% R 2196K 1284K shell top top
可以明显的看到,贵司的mms 的4个thread 都有进入了deadloop, 分别占用了一个cpu core. 同时可以快速抓取他们的java trace, 进一步可以看到当时MMS 的四个backtrace, 以便快速分析.
"Thread-1401" prio=5 tid=32 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x4264f860 self=0x7b183558
| sysTid=32195 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078705952
| state=S schedstat=( 3284456714198 104216273858 383002 ) utm=324720 stm=3725 core=5
at com.yulong.android.mms.c.f.d(MmsChatDataServer.java:~1095)
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:7582)
at java.lang.Thread.run(Thread.java:841)
"Thread-1400" prio=5 tid=31 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x41f5d8f0 self=0x7be2a8e8
| sysTid=32190 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078029504
| state=S schedstat=( 3284905134412 105526230562 382946 ) utm=324805 stm=3685 core=5
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
"Thread-1399" prio=5 tid=30 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x42564d28 self=0x7b0e6838
| sysTid=32188 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2077662640
| state=S schedstat=( 3288042313685 103203810616 375959 ) utm=325143 stm=3661 core=7
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
"Thread-1398" prio=5 tid=29 SUSPENDED
| group="main" sCount=1 dsCount=0 obj=0x4248e5a8 self=0x7be0d128
| sysTid=32187 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2079251904
| state=S schedstat=( 3287248372432 105116936413 379634 ) utm=325055 stm=3669 core=6
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
当时处于suspend, 即意味着当时这四个thread 正在执行java code, 而抓取backtrace 时强制将thread suspend 。看起来客户改动所致,并且客户有Proguard, 麻烦客户自己review 代码。
2.2 cputime
通常用cputime 来打印一段时间内, CPU 的利用率的统计情况, 资讯比top 更加详细.
cputime -h
Usage: cputime [-start/-stop] [-n iterations] [-d delay] [-e time] [ -m max_count ] [-p] [-s sort_column] [-i id] [-h]
-start Start cpu time monitor.
-stop Stop cpu time monitor.
-n num Updates to show before exiting.
-d num Seconds to wait between updates.
-m num Maximum number of information to display.
-e num Enable CPU time monitor and stop monitor after num seconds. If no this parameter will
show last cputime monitor data.
-p Show process instead of thread. If no this parameter default will show thread information.
-s col Column to sort by time/user/kernel/id/isr_c/isr_t(cputime/user time/kernel time/id/isr_count/
isr_time). If no this parameter default will sort by cputime.
-i id show isr information of thread id.
-h Display this help screen.
* Example *
cputime -e 100: Enable cputime monitor and stop after 100 seconds. Then show thread cputime, and sort by cputime.
cputime -e 200 -s user: Enable cputime monitor and stop after 200 seconds. Then show thread cputime, and sort by user time.
cputime: Show thread cputime, and sort by cputime.
cputime -p -s id: Show process cputime, and sort by process id.
2.3 ftrace
ftrace 可以纪录CPU 最为详细的执行情况, 即linux scheduler 的执行情况. 通常默认只开启 sched_switch.
如何抓取ftrace 可以查询相关的FAQ.
2.4 Kernel core status
有的时候我们需要追查一下,当时Kernel 的基本调度情况,以及接收中断的情况,以判断当前CPU 执行的基本情况是否异常。比如有时候如果某个中断上来太过频繁,就容易导致系统运行缓慢,甚至死机。
* CPU Sched status
adb shell cat proc/sched_debug
Use sysrq-trigger
* CPU interrupts
adb shell cat proc/interrupts
adb shell cat proc/irq/irq_id/spurious
3. Memory Usage
Memory Usage, 我们通常会审查,系统当时memory 是否足够, 是否处于low memory 状态, 是否可能出现因无法申请到memory 而卡死的情况.
常见的一些基本信息如下:
* meminfo: basic memory status
adb shell cat proc/meminfo
adb shell cat proc/pid/maps
adb shell cat proc/pid/smaps
* procrank info: all process memory status
adb shell procrank
adb shell procmem pid
adb shell dumpsys meminfo pid
* zoneinfo:
adb shell cat proc/zoneinfo
* buddyinfo:
adb shell cat /proc/buddyinfo
4. Storage Usage
查看Storage 的情况,通常主要是查询data 分区是否已经刷满, sdcard 是否已经刷满, 剩余的空间是否足够。以及是否有产生超大文件等。
通常使用的命令如 df
df
Filesystem Size Used Free Blksize
/dev 446.0M 128.0K 445.8M 4096
/sys/fs/cgroup 446.0M 12.0K 445.9M 4096
/mnt/secure 446.0M 0.0K 446.0M 4096
/mnt/asec 446.0M 0.0K 446.0M 4096
/mnt/obb 446.0M 0.0K 446.0M 4096
/system 1.2G 915.3M 355.5M 4096
/data 1.1G 136.7M 1010.1M 4096
/cache 106.2M 48.0K 106.2M 4096
/protect_f 4.8M 52.0K 4.8M 4096
/protect_s 4.8M 48.0K 4.8M 4096
/mnt/cd-rom 1.2M 1.2M 0.0K 2048
/mnt/media_rw/sdcard0 4.6G 1.1G 3.4G 32768
/mnt/secure/asec 4.6G 1.1G 3.4G 32768
/storage/sdcard0 4.6G 1.1G 3.4G 32768
以及ls, du 等命令,如du
du -help
usage: du [-H | -L | -P] [-a | -d depth | -s] [-cgkmrx] [file ...]
du -LP -d 1
8 ./lost+found
88 ./local
384 ./misc
48 ./nativebenchmark
912 ./nativetest
8 ./dontpanic
13376 ./data
8 ./app-private
8 ./app-asec
129424 ./app-lib
8 ./app
136 ./property
16 ./ssh
116312 ./dalvik-cache
8 ./resource-cache
48 ./drm
8 ./mediadrm
8 ./security
3888 ./nvram
8 ./amit
8 ./acdapi
88
32 ./sec
8 ./user
16 ./media
16 ./agps_supl
8 ./anr
8 ./gps_mnl
8 ./nfc_socket
16 ./ccci_cfg
32 ./mdlog
1312 ./system
176 ./recovery
32 ./backup
274688 .