aix 16G内存 单机 10.2.0.4
应用反馈无法访问数据库了。
sqlplus / as sysdba 挂起
看alert.log 不断输出
Tue Apr 6 10:01:41 2021
PMON failed to acquire latch, see PMON dump
Tue Apr 6 10:02:41 2021
PMON failed to acquire latch, see PMON dump
Tue Apr 6 10:02:55 2021
ksvcreate: Process(m000) creation failed
Tue Apr 6 10:03:41 2021
PMON failed to acquire latch, see PMON dump
kill -9 pmon进程后,告警日志输出以下内容:
Tue Apr 6 10:46:30 2021
MMNL: terminating instance due to error 472
Tue Apr 6 10:46:30 2021
Errors in file /home/oracle/admin/orcl/udump/orcl_ora_5865490.trc:
ORA-00600: internal error code, arguments: [504], [0x700000010007608], [640], [5], [session allocation], [0], [0], [0x000000000]
Tue Apr 6 10:46:31 2021
Errors in file /home/oracle/admin/orcl/udump/orcl_ora_5865490.trc:
ORA-07445: exception encountered: core dump [kgscDump+01dc] [SIGSEGV] [Address not mapped to object] [0x7C000FAC7C017000] [] []
ORA-00600: internal error code, arguments: [504], [0x700000010007608], [640], [5], [session allocation], [0], [0], [0x000000000]
Tue Apr 6 10:46:31 2021
奇怪的是,告警日志中提示MMNL进程由于472错误终止实例。
重启实例正常了。
检查当时的其他trc文件,发现mmnl的trc里有很多信息。
而 mmon 的跟踪文件中不断输出以下信息:
再看 pmon 的跟踪文件输出以下信息:
。。。
。。。
还有如下信息:
----------------------------------------
SO: 70000006f6b84c0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=92, calls cur/top: 0/70000002ed1a930, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=280
holding (efd=6) 70000006b9d8148 Child library cache level=5 child#=2
Location from where latch is held: kghfrunp: clatch: nowait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
8 (2181, 1617676902, 0)
waiter count=1
holding (efd=6) 7000000100e7668 Child shared pool level=7 child#=1
Location from where latch is held: kghfrunp: alloc: session dur:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
89 (2450, 1617676902, 0)
94 (2450, 1617676902, 0)
78 (2445, 1617676902, 0)
28 (2442, 1617676902, 0)
84 (2442, 1617676902, 0)
省略500字
85 (2070, 1617676902, 0)
130 (2061, 1617676902, 3)
138 (1992, 1617676902, 0)
146 (1923, 1617676902, 0)
waiter count=46
Process Group: DEFAULT, pseudo proc: 70000006f783298
O/S info: user: oracle, term: UNKNOWN, ospid: 5865490
OSD pid info: Unix process pid: 5865490, image: oracle@p550-zjgl1
Short stack dump:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044F0<-skgpwwait+00bc<-ksliwat+06c0<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-kksLockWait+01fc<-kgxWait+0168<-kgxExclusive+00bc<-kksFreeHeapGetMutex+0158<-kksCursorFreeCallBack+0088<-kglobf0+0264<-kglhpd_internal+0228<-kglhpd+0010<-kghfrx+0028<-kghfrunp+0b44<-kghfnd+07e8<-kghalo+0a24<-ksp_param_handle_alloc+0168<-kspcrec+01bc<-ksucre+0408<-ksucresg+017c<-kpolna+02c4<-kpogsk+00c4<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Dump of memory from 0x070000006F657380 to 0x070000006F657588
70000006F657380 00000004 00000000 07000000 6946E648 [............iF.H]
70000006F657390 00000010 000313A7 07000000 2ED1A930 [...............0]
70000006F6573A0 00000003 000313A7 07000000 6FD593E0 [............o...]
70000006F6573B0 0000000B 000313A7 07000000 6F9CBEA8 [............o...]
70000006F6573C0 00000004 0003129B 00000000 00000000 [................]
70000006F6573D0 00000000 00000000 00000000 00000000 [................]
Repeat 26 times
70000006F657580 00000000 00000000 [........]
*** 2021-04-06 10:41:42.821
PMON unable to acquire latch 7000000100e7668 Child shared pool level=7 child#=1
Location from where latch is held: kghfrunp: alloc: session dur:
Context saved from call: 0
state=busy, wlstate=free
通过筛选possible holder关键字发现:
可能时pid=92 130的两个进程持有资源导致pmon异常。
停库前做了 HANG ANALYSIS,根据输出的文件,只找到了 92 这个pid的信息
印象中,10点此数据库不是非常忙,应该是解析方面出了问题,如果杀掉92这个会话可能会轻量级解决。
下次再发生类似问题时,优先看pmon的trc。
搜集信息方法:
sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump systemstate 258
等待30秒
oradebug hanganalyze 3
oradebug dump systemstate 258
oradebug tracefile_name
exit
或者
alter session set events 'immediate trace name systemstate level 258';
alter session set events 'immediate trace name hanganalyze level 3';
等待1分钟
alter session set events 'immediate trace name systemstate level 258';
alter session set events 'immediate trace name hanganalyze level 3';
参考:
Patch For Bug:5377099 Is Superseded By Patch For Bug:8426816 (Doc ID 817033.1)
其他参考:
awk
-f ass109.awk
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_9976.trc
ass109.zip
https://fritshoogland.files.wordpress.com/2020/04/mutexes-2.pdf
https://andreynikolaev.files.wordpress.com/2016/03/latches_and_mutexes_in_oracle_12c.pdf