2012年(5)
分类:
2012-11-02 13:57:15
原文地址:一次Oracle rac宕机分析 作者:风之幻想
某客户rac数据库2号节点实例自动宕节点,以下为分析报告
一、现象回顾:
2号节点发生故障时,alert日志显示如下:
Thread 2 advanced to log sequence 77740 (LGWR switch)
Current log# 24 seq# 77740 mem# 0: /dev/rcrm4_rd2_91_2G
Current log# 24 seq# 77740 mem# 1: /dev/rcrm4_rd2_92_2G
Mon Mar 28 09:45:23 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:46:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:47:53 BEIST 2011
Thread 2 advanced to log sequence 77741 (LGWR switch)
Current log# 25 seq# 77741 mem# 0: /dev/rcrm5_rd2_101_2G
Current log# 25 seq# 77741 mem# 1: /dev/rcrm5_rd2_102_2G
Mon Mar 28 09:48:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:49:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:50:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:51:52 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:52:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:53:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:54:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:55:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:56:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:57:53 BEIST 2011
PMON failed to acquire latch, see PMON dump
Mon Mar 28 09:58:44 BEIST 2011
Received an instance abort message from instance 1 (reason 0x0)
Please check instance 1 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481
Mon Mar 28 09:58:45 BEIST 2011
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/crmdb/bdump/crmdb2_diag_3732132.trc
Mon Mar 28 09:58:45 BEIST 2011
Shutting down instance (abort)
License high water mark = 2290
从alert日志来看,二号节点主要表现为PMON进程不能获得latch(PMON failed to acquire latch),超时等待10分钟,之后接受到一号节点将二号节点宕机指令(Received an instance abort message from instance 1 (reason 0x0)),LMD0进程将实例abort(LMD0: terminating instance due to error 481),并在宕机之前系统自动做了systemstate dump(System state dump is made for local instance),宕机之前用户在2节点的会话数为2290个(License high water mark = 2290)。
采用ass.awk工具分析crmdb2_diag_3732132.trc可以得到如下结果:
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.9 - Processing crmdb2_diag_3732132.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
3: last wait for 'DIAG idle wait'
4: waiting for 'rdbms ipc message'
5: last wait for 'ges lmd/lmses to freeze in rcfg - mrcvr'
6: last wait for 'ges remote message'
7: waiting for 'gc object scan'
8: last wait for 'gc object scan'
9: last wait for 'gc object scan'
10: last wait for 'gc object scan'
11: waiting for 'gc object scan'
12: last wait for 'gc object scan'
13: last wait for 'gc object scan'
14: last wait for 'gc object scan'
15: waiting for 'latch: object queue header operation'[Latch 700000d1fc80b20]
16: waiting for 'latch: cache buffers chains'[Latch 700000cdf26c988]
17: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
18: waiting for 'latch: object queue header operation'[Latch 700000d1fe74fd0]
19: last wait for 'gc object scan'
20: last wait for 'gc object scan'
21: last wait for 'gc object scan'
22: waiting for 'latch: cache buffers chains'[Latch 700000cea8a6a80]
23: waiting for 'rdbms ipc message'
24: waiting for 'gcs drm freeze in enter server mode'
25: waiting for 'gcs drm freeze in enter server mode'
26: waiting for 'gcs drm freeze in enter server mode'
27: waiting for 'gcs drm freeze in enter server mode'
28: waiting for 'gcs drm freeze in enter server mode'
29: waiting for 'gcs drm freeze in enter server mode'
30: waiting for 'gcs drm freeze in enter server mode'
31: waiting for 'gcs drm freeze in enter server mode'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: last wait for 'smon timer'
35: waiting for 'rdbms ipc message'
36: waiting for 'rdbms ipc message'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39:
40:
41: waiting for 'rdbms ipc message'
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 700000d1fe74fd0 ??? Blocker
Latch 700000d1fc80b20 ??? Blocker
Latch 700000cdf26c988 ??? Blocker
Latch 700000cea8a6a80 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 700000d1fe74fd0 Child object queue header operation
Latch 700000d1fc80b20 holding (efd=4) 700000d1fc80b20 Child obj
Latch 700000cdf26c988 Child cache buffers chains
Latch 700000cea8a6a80 Child cache buffers chains
762178 Lines Processed.
进一步通过分析crmdb2_diag_3732132.trc文件(分析过程详见附录),进程之间有如下关系:
(1)当故障发生时,orapid=7占用着latch: object queue header operation,addr=700000d1fc80b20,orapid=8,10,11,12,13,14,15,19,21处于等待状态,且会 话均为dead状态。
(2)当故障发生时,orapid=9,20占用着latch: object queue header operation,addr=700000d1fe74fd0,orapid=2,17,18处于等待状态,其中2进程为alive,17,18进程均 为dead状态。
(3)当故障发生时orapid=16等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态
(4)当故障发生时orapid=12等待latch: cache buffers chains,addr=700000cdf26c988,进程为dead状态。
(5)值得注意的是故障发生时dbwr进程均处于dead状态,等待事件如下:
Dumping Session Wait History
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488290 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488292 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488295 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488293 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488291 sec
=1, =0, =0
for 'gcs drm freeze in enter server mode' count=1 wait_time=0.488289 sec
故障发生时各日志文件显示如下:
(1)1号节点日志信息:
crmdb1_lmon_4579776.trc显示,当1号节点LMON进程检测LMD进程到存在异常时(超时900秒),即向2号节点发起请 求(requesting memberkill of instances w/o ftdones),2号节点收到之后由LMD完成宕库操作。
*** 2011-03-28 09:58:43.010
* kjfcdrmrcfg: waited 901 secs (6501391,6502292,900) for lmd to receive all ftdones, requesting memberkill of instances w/o ftdones:
kjfsprn: sync status inst 0 tmout 900 (sec)
kjfsprn: sync propose inc 4 level 691378
kjfsprn: dmap ver 4 (step 0)
kjfsprn: sync inc 4 level 691378
kjfsprn: sync bitmap 0 1
DEFER MSG QUEUE ON LMD0 IS EMPTY
SEQUENCES:
0:0.0 1:210167409.0
alert_crmdb1.log显示:
* kjfc_kill_no_sync: send abort msg to node 1 (noftdn)
Mon Mar 28 09:58:45 BEIST 2011
(2)2号节点信息:
crmdb2_lmd0_2154764.trc显示:2号节点lmd进程收到指令,准备宕节点,宕节点之前(5秒钟)使用diag进程进行systemstate dump。
*** 2011-03-28 09:58:44.845
Received a special admin message (type 2) from node 0
Message body: flag 0x5 data 0x0 0x0 0x0
Abort the instance
ksuitm: waiting up to [5] seconds before killing DIAG(3732132)
二、故障分析
LMS(GLOBAL CACHE SERVICE PROCESS)进程主要用来管理集群内数据块的访问,并在不同实例的BUFFER CACHE中传输块镜像。LMS进程跨集群管理数据库的请求,并保证在所有实例的BUFFER CACHE中一个数据块的镜像只能出现一次。LMS进程靠着在实例中传递消息来协调数据块的访问,当一个实例请求数据块时,该实例的LMD进程发出一个数 据块资源的请求,该请求只向MASTER数据块的实例的LMD进程,MASTER实例的LMD进程同时正在使用的实例的LMD进程释放该资源,这时拥有该 资源的实例的LMS进程会创建一个数据块镜像的一致性读,然后把该数据块传递到请求该资源的实例的BUFFER CACHE中。LMS进程保证了在每一时刻只能允许一个实例去更新数据块,并负责保持该数据块的镜像纪录(包含更新数据块的状态FLAG)。在故障发生 时,由进程状态关系图,可以看到LMS进程处于latch free(latch: object queue header operation和latch: cache buffers chains)且状态为dead状态(LMS0除外)。也就意味着1,2号节点之间block传输出现了问题。由于PMON进程和LMS进程相互争用 latch(latch: object queue header operation),进而PMON获取不了latch,出现超时等待10分钟,也就无法恢复有问题的进程。LMD (GLOBAL ENQUEUE SERVICE DAEMON)进程主要管理对全局队列和资源的访问,并更新相应队列的状态,处理来自于其他实例的资源请求。当2号节点LMS进程出现问题时,无法更新相 应队列状态,由日志可以看出2号节点处于空闲事件等待状态(ges remote message)LMON(GLOBAL ENQUEUE SERVICE MONITOR)进程主要监测群集内的全局队列和全局资源,管理实例和处理异常并相应的群集队列进行恢复操作。当1号节点LMON进程检测到2号节点出现 异常时,发送killmember指令,由2号节点lmd进程完成abort操作。我们同时注意到故障发生时,ckpt,dbwr进程均处于dead状 态,为保证一致性处于pending i/o状态,dbwr等待事件为gcs drm freeze in enter server mode,该等待事件在Oracle有如下解释(metalink doc:4755405.8):
Under heavy load, DRM sync can take very long as LMON is waiting for LMS
processes to empty their send queues. This wait is unnecessary as
correctness is already guaranteed by the sync channel flush mechanism.
This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).
也就是在在高负载到情况下,尤其是需要节点之间大规模传递一致性块时,DRM sync功能LMON会等待 LMS进程清空发送队列,当然在LMS进程清空发送队列时可能会引起latch free等待,进而引起GES问题,而Oracle GES(GES包括LMON和LMD进程)检测到有问题时,将会abort实例(metalink:ID 9920699.8)。Oracle还指出DRM功能可能会超过5分钟。针对这一现象,Oracle定位为bug:6960699
三、解决办法:
屏蔽DRM功能,设置如下隐含参数
_gc_affinity_time=0
_gc_undo_affinity=FALSE
附录:
经过awk格式化之后的文件可以看出latch 700000d1fc80b20成为首号怀疑对象,进而在跟踪文件crmdb2_diag_3732132.trc寻找相关信息。
PROCESS 7:
----------------------------------------
SO: 700000d64f4f4c0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=7, calls cur/top: 700000d69842360/700000d6983c508, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3830616
OSD pid info: Unix process pid: 3830616, image: oracle@crmdb02 (LMS0)
从红色标注可以看出process=7(LMS0)进程holding child latch#=7,addr=700000d1fc80b20的object queue header operation锁,从而导致了process=15进程等待。
查看PROCESS=15状态:
PROCESS 15:
----------------------------------------
SO: 700000d8cf24760, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=15, calls cur/top: 700000d698433f8/700000d6983db38, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
gotten 688460698 times wait, failed first 3268358 sleeps 1757249
gotten 19572 times nowait, failed: 109
possible holder pid = 7 ospid=3830616
on wait list for 700000d1fc80b20
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3260452 (DEAD)
OSD pid info: Unix process pid: 3260452, image: oracle@crmdb02 (LMS8)
从红色标注可以看出process=15(LMS8)进程正在等待child latch#=7,addr=700000d1fc80b20,且进程已处于DEAD状态。
用操作系统命令查看,发现只有一个会话等待此child latch
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "waiting for 700000d1fc80b20" |wc -l
1
进一步查看共有9个会话holding此child latch。
[ora10g@mcprod ~]$ cat crmdb2_diag_3732132.trc|grep "holding (efd=4) 700000d1fc80b20" |wc -l
9
查看PROCESS=8状态。
PROCESS 8:
----------------------------------------
SO: 700000d8cf23f70, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=8, calls cur/top: 700000d69842620/700000d6983c7c8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fc80b20 Child object queue header operation level=5 child#=77
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000009)
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (2, 1301277525, 2)
waiter count=1
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3503016 (DEAD)
OSD pid info: Unix process pid: 3503016, image: oracle@crmdb02 (LMS1)
可以看出process=8(LMS1)状态等同于process=15。
查看PROCESS=9,PROCESS=20状态
PROCESS 9:
----------------------------------------
SO: 700000d60f24a78, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=9, calls cur/top: 700000d698428e0/700000d6983caa0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 2982786 (DEAD)
OSD pid info: Unix process pid: 2982786, image: oracle@crmdb02 (LMS2)
可以看到process=9(LMS2), process=20(LMSd),会话holding了addr=700000d1fe74fd0 child#91的child latch,进而导致了process=17(LMSa)和process=18(LMSb)等待。
查看PROCESS=17状态
PROCESS 17:
----------------------------------------
SO: 700000d5cf23598, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 700000d67b9d3f8/700000d6983e0d0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 295048 (DEAD)
OSD pid info: Unix process pid: 295048, image: oracle@crmdb02 (LMSa)
查看PROCESS=18状态
PROCESS 18:
----------------------------------------
SO: 700000d58f42690, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=18, calls cur/top: 700000d69843f10/700000d6983e390, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclobj_2:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
on wait list for 700000d1fe74fd0
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3428506
OSD pid info: Unix process pid: 3428506, image: oracle@crmdb02 (LMSb)
可以看到process=17,18均等待addr=700000d1fe74fd0 child latch。
PROCESS=10(LMS3),PROCESS=11(LMS4),PROCESS=12(LMS5),PORCESS=13(LMS6),PROCESS=14(LMS7),PROCESS=15(LMS8),PROCESS=19(LMSc),PROCESS=21(LMSe) 状态均为类似状态,即进程holding child latch#=7,addr=700000d1fc80b20的object queue header operation锁,从而导致了process=15进程等待。
查看PROCESS=16状态
PROCESS 16:
----------------------------------------
SO: 700000d60f25268, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=16, calls cur/top: 700000d69843990/700000d6983ddf8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 98
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 700000d58f43670 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kclgrantlk_2:
waiting for 700000cdf26c988 Child cache buffers chains level=1 child#=162223
Location from where latch is held: kcbchg: kslbegin: bufs not pinned:
Context saved from call: 562086705
state=busy(exclusive) (val=0x200000000000086f) holder orapid = 2159
waiters [orapid (seconds since: put on list, posted, alive check)]:
16 (140, 1301277525, 2)
waiter count=1
gotten 4188007 times wait, failed first 98161 sleeps 554
gotten 461256 times nowait, failed: 124
possible holder pid = 2159 ospid=3392358
on wait list for 700000cdf26c988
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3560010 (DEAD)
查看process=22状态
PROCESS 22:
----------------------------------------
SO: 700000d8cf24f50, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=22, calls cur/top: 700000d67b9dc38/700000d6983eea8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 94
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 700000d58f43670 1 6
last post sent: 0 0 135
last post sent-location: kclcget
last process posted by me: 700000d60f53498 3 1
(latch info) wait_event=0 bits=0
Location from where call was made: kclexpandlock:
waiting for 700000cea8a6a80 Child cache buffers chains level=1 child#=61810
Location from where latch is held: kcbzgb: scan from tail. nowait:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000456) holder orapid = 1110
waiters [orapid (seconds since: put on list, posted, alive check)]:
22 (279, 1301277525, 2)
waiter count=1
gotten 5121038 times wait, failed first 15184 sleeps 753
gotten 586404 times nowait, failed: 1427022
possible holder pid = 1110 ospid=2978810
on wait list for 700000cea8a6a80
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 3109298 (DEAD)
OSD pid info: Unix process pid: 3109298, image: oracle@crmdb02 (LMSf)
最后检查process=2状态
PROCESS 2:
----------------------------------------
SO: 700000d60f24288, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 700000d6983b730/700000d6983b730, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403214633274144 133 2
last post received-location: kslges
last process to post me: 700000d5cf85938 137 1
last post sent: 0 0 121
last post sent-location: kcbzww
last process posted by me: 700000d58f7f6e0 198 1
(latch info) wait_event=0 bits=6
Location from where call was made: kcbw_unlink_q:
waiting for 700000d1fe74fd0 Child object queue header operation level=5 child#=91
Location from where latch is held: kclobj_2:
Context saved from call: 0
state=busy(shared) (val=0x4000000000000002)
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (2, 1301277525, 2)
17 (2, 1301277525, 2)
waiter count=2
gotten 687558442 times wait, failed first 3419261 sleeps 1918111
gotten 19762 times nowait, failed: 151
possible holder pid = 9 ospid=2982786
holding (efd=8) 700000cf3a6e170 Child cache buffers lru chain level=2 child#=91
Location from where latch is held: kcbzfw: buffer header
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
63 (50, 1301277525, 2)
waiter count=1
holding (efd=8) 700000ce2a79ef0 Child cache buffers chains level=1 child#=81019
Location from where latch is held: kcbso1: set no access:
Context saved from call: 0
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
Process Group: DEFAULT, pseudo proc: 700000d6919aea0
O/S info: user: oracle, term: UNKNOWN, ospid: 1274566
OSD pid info: Unix process pid: 1274566, image: oracle@crmdb02 (PMON)