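We have had a run of outages lately: several databases went down, each for a different reason. I am writing this one up now and will dig deeper when I have time.
A test database crashed suddenly
Start with alert.log: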
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/eastdb/orcl/trace/orcl_cjq0_3462.trc:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Fri Apr 16 08:46:18 2021
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Fri Apr 16 08:46:21 2021
Process W000 died, see its trace file
Fri Apr 16 08:46:21 2021
PMON (ospid: 3335): terminating the instance due to error 474
Fri Apr 16 08:46:22 2021
System state dump requested by (instance=1, osid=3335 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_diag_3370.trc
Instance terminated by PMON, pid = 3335
The key message is error 474, i.e. ORA-00474: the SMON process terminated with an error.
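When the alert.log is long, pulling out the "died" and "terminating the instance" lines by script can save some scrolling. The following is only a minimal sketch: the sample lines are copied from the excerpt above, while the function name `summarize_alert` and the two regular expressions are my own assumptions about the line format, not anything Oracle provides.

```python
import re
from collections import Counter

# Sample lines copied from the alert.log excerpt above.
ALERT_LINES = [
    "kkjcre1p: unable to spawn jobq slave process",
    "Process J000 died, see its trace file",
    "Process W000 died, see its trace file",
    "Fri Apr 16 08:46:21 2021",
    "PMON (ospid: 3335): terminating the instance due to error 474",
    "Instance terminated by PMON, pid = 3335",
]

# Assumed line shapes, derived only from the excerpt.
DIED_RE = re.compile(r"^Process (\w+) died, see its trace file")
TERM_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)


def summarize_alert(lines):
    """Return (Counter of died process names, list of termination events)."""
    died = Counter()
    terminations = []
    for raw in lines:
        line = raw.strip()
        m = DIED_RE.match(line)
        if m:
            died[m.group(1)] += 1
            continue
        m = TERM_RE.match(line)
        if m:
            terminations.append({
                "by": m.group(1),
                "ospid": int(m.group(2)),
                # Format the bare number as the usual ORA-nnnnn code.
                "error": f"ORA-{int(m.group(3)):05d}",
            })
    return died, terminations


died, terminations = summarize_alert(ALERT_LINES)
print(dict(died))
print(terminations)
```

To run it against a real file, pass `open(path)` instead of `ALERT_LINES`; the path is whatever your diag destination uses.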
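What does SMON do? In short, the System Monitor background process performs instance recovery at startup and does housekeeping such as cleaning up temporary segments that are no longer in use.
So where do we start when analyzing an SMON failure?
The best starting point is still the DIAG trace file, here orcl_diag_3370.trc.
Search for "process 13:smon"; the 13 is SMON's Oracle process ID on this instance and will differ on other machines.
Then search further down for Session Wait History and check for abnormal waits: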
Session Wait History:
elapsed time of 0.263819 sec since current wait
0: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247545 seq_num=7117 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.439011 sec of elapsed time
1: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247544 seq_num=7116 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.253953 sec of elapsed time
2: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247543 seq_num=7115 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.030880 sec of elapsed time
3: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247542 seq_num=7114 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.047717 sec of elapsed time
4: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247541 seq_num=7113 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.007141 sec of elapsed time
5: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247540 seq_num=7112 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.176498 sec of elapsed time
6: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247539 seq_num=7111 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.183811 sec of elapsed time
7: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247538 seq_num=7110 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.088497 sec of elapsed time
8: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247537 seq_num=7109 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.262751 sec of elapsed time
9: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247536 seq_num=7108 snap_id=1
wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
wait times: max=5 min 0 sec
wait counts: calls=1 os=99
occurred after 0.029236 sec of elapsed time
Sampled Session History of session 66 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.
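Nothing abnormal here: all ten entries are the idle 'smon timer' wait, each a 5-minute sleep (sleep time=0x12c, i.e. 300 seconds).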
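Checking ten entries by eye is easy, but the same check can be scripted for longer dumps. This is a rough sketch that assumes the entry layout seen in the Session Wait History above; `parse_waits` and both regular expressions are hypothetical helpers, and only two of the ten entries are reproduced as sample input.

```python
import re

# Two entries copied from the Session Wait History above;
# the remaining eight have exactly the same shape.
WAIT_HISTORY = """\
0: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247545 seq_num=7117 snap_id=1
1: waited for 'smon timer'
sleep time=0x12c, failed=0x0, =0x0
wait_id=9247544 seq_num=7116 snap_id=1
"""

EVENT_RE = re.compile(r"^\d+: waited for '([^']+)'")
SLEEP_RE = re.compile(r"^sleep time=(0x[0-9a-fA-F]+)")


def parse_waits(text):
    """Return a list of (event_name, sleep_seconds) tuples."""
    waits = []
    event = None
    for raw in text.splitlines():
        line = raw.strip()
        m = EVENT_RE.match(line)
        if m:
            event = m.group(1)
            continue
        m = SLEEP_RE.match(line)
        if m and event is not None:
            # The trace prints the sleep time in hex; 0x12c is 300.
            waits.append((event, int(m.group(1), 16)))
            event = None
    return waits


waits = parse_waits(WAIT_HISTORY)
print(waits)
print("only 'smon timer' waits:", all(name == "smon timer" for name, _ in waits))
```

Any event other than 'smon timer' showing up in the result would be the first thing to look at.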
Time to look elsewhere, namely at the PMON trace file.
Jump straight to the bottom:
0BF4F4EC0 00000000 00000000 00000000 00000000 [................]
Repeat 113 times
0BF4F55E0 BF4F55E0 00000000 BF4F55E0 00000000 [.UO......UO.....]
0BF4F55F0 00000000 00000000 BF4F55F8 00000000 [.........UO.....]
0BF4F5600 BF4F55F8 00000000 00000000 00000000 [.UO.............]
0BF4F5610 00000000 00000000 00000000 00000000 [................]
Repeat 1 times
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1332<-ksulhdcb()+499<-ksucln()+1243<-ksbrdp()+971<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36

----- End of Abridged Call Stack Trace -----

*** 2021-04-16 08:46:21.779
PMON (ospid: 3335): terminating the instance due to error 474
ksuitm: waiting up to [5] seconds before killing DIAG(3370)
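The call stack trace matters a great deal when locating a problem.
My feeling is that the key function here is ksucln().
My guess is that this is SMON's usual line of work again: something went wrong while cleaning up objects. Bear in mind, though, that this stack comes from PMON's own trace and passes through ksuitm(), the routine that prints the "waiting up to [5] seconds before killing DIAG" message, so it shows how PMON reacted rather than what SMON was doing.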
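The abridged stack is easier to read one frame per line. The sketch below only splits the string from the PMON trace on the `<-` separator, where `a<-b` means that `a` was called by `b`; the helper name `frames` is my own, and no meaning is attached to the Oracle internal function names beyond what the trace itself shows.

```python
# The abridged call stack copied from the PMON trace above.
STACK = (
    "ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53"
    "<-ksuitm()+1332<-ksulhdcb()+499<-ksucln()+1243<-ksbrdp()+971"
    "<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266"
    "<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36"
)


def frames(stack):
    """Split an abridged call stack into function names, innermost first."""
    return [part.split("(")[0] for part in stack.split("<-")]


names = frames(STACK)

# Print the outermost caller first so the call flow reads top-down.
for depth, name in enumerate(reversed(names)):
    print(f"{depth:2d}  {name}")
```

Read top-down, ksucln() calls ksulhdcb(), which calls ksuitm(), and everything below that is the dump machinery.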
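Known issues related to SMON crashes
ORA-474: SMON process terminated with error

1- ORA-00474: SMON process terminated with error during parallel transaction recovery

Solution:
Turn off parallel recovery by adding the following parameter to your init<SID>.ora,
fast_start_parallel_rollback = FALSE
and bounce (restart) the instance.

For more details, see:
Ora-600 [15789] and Ora-474 (Doc ID 1094645.1)

2- Instance crash with ORA-600 [504] and ORA-474; ORA-600 [kcbnew_3] may precede them (versions below 11.2.0.2)

Solution:
Upgrade to 10.2.0.5, 11.2.0.2 or later
or
check My Oracle Support for the availability of one-off patch 9084487 for your platform.

For more details, see:
ORA-00600 [504] and ORA-474 crashing the database (Doc ID 1209577.1)

3- ORA-600 [13011] and ORA-474 reported in the alert log, where the failing SQL in the trace looks like "delete from smon_scn_time where scn = (select min(scn) from smon_scn_time)"

Solution:
Analyze table smon_scn_time with validate structure cascade and rebuild all of its indexes.

For more details, see:
Instance terminated with error ORA-00474: SMON process terminated with error (Doc ID 1361872.1)

If the error is reported for a different table, try the same solution (analyze the reported table and rebuild its indexes).

For troubleshooting this error in more detail, see:
Understanding and diagnosing ORA-00600 [13011] errors (Doc ID 1392778.1)

4- Instance crash with ORA-474 and ORA-600 [4464] / ORA-600 [4427] (on versions below 11.2.0.2)

This is Bug 11814907 (instance restarts with ORA-00474: SMON process terminated with error), closed as a duplicate of Bug 9857702 (ORA-600 [4464]).

Solution:
Upgrade to 11.2.0.2 or later, or install interim patch 9857702 (if available for your platform).

5- ORA-00600 [KDOURP_INORDER2] and ORA-00474 reported in the alert log (versions below 11.2)

This is Bug 7627304 (ORA-00600 [KDOURP_INORDER2] and ORA-00474: SMON, PMON terminates the instance), closed as a duplicate of Bug 7662491 (instance crash / ORA-600 [KDDUMMY_BLKCHK] hit during recovery).

Solution:
Upgrade to 11.2 or install interim patch 7662491 (if available for your platform).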
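References:
Troubleshooting ORA-46x and ORA-47x xxxx Process Terminated With Error (Doc ID 1907129.1)
SRDC - Instance Termination (non-RAC) Issues : Checklist of Evidence to Supply (Doc ID 2507010.1)
Database System Monitor Process (SMON) (Doc ID 1495163.1)
For crash problems, the diagnostic data can be collected with tfactl. While we are at it, take a look at its help output, which is extensive.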
[oracle@shdb01 ~]$ tfactl diagcollect -srdc -help

Service Request Data Collection (SRDC).

Usage : /opt/oracle.ahf/tfa/bin/tfactl diagcollect -srdc [-tag ] [-z ] [-last | -from -to | -for ] -database
-tag The files will be collected into tagname directory inside
  repository
-z The collection zip file will be given this name within the
  TFA collection repository
-last Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours
-since Same as -last. Kept for backward compatibility.
-from "Mon/dd/yyyy hh:mm:ss" From
  or "yyyy-mm-dd hh:mm:ss"
  or "yyyy-mm-ddThh:mm:ss"
  or "yyyy-mm-dd"
-to "Mon/dd/yyyy hh:mm:ss" To
  or "yyyy-mm-dd hh:mm:ss"
  or "yyyy-mm-ddThh:mm:ss"
  or "yyyy-mm-dd"
-for "Mon/dd/yyyy" For .
  or "yyyy-mm-dd"

can be any of the following,
DBCORRUPT Required Diagnostic Data Collection for a Generic Database Corruption
Listener_Services SRDC - Data Collection for TNS-12516 / TNS-12518 / TNS-12519 / TNS-12520.
Naming_Services SRDC - Data Collection for ORA-12154 / ORA-12514 / ORA-12528.
ORA-00020 SRDC for database ORA-00020 Maximum number of processes exceeded
ORA-00060 SRDC for ORA-00060. Internal error code.
ORA-00494 SRDC for ORA-00494.
ORA-00600 SRDC for ORA-00600. Internal error code.
ORA-00700 SRDC for ORA-00700. Soft internal error.
ORA-01031 SRDC - How to Collect Standard Information for ORA - 1031 /ORA -1017 during SYSDBA connections
ORA-01555 SRDC - ORA-1555: Checklist of Evidence to Supply (Doc ID 1682708.1)
ORA-01578 SRDC - Required Diagnostic Data Collection for ORA-01578
ORA-01628 SRDC for database ORA-01628 Snapshot too Old problems
ORA-04020 SRDC for ORA-04020
ORA-04021 SRDC for ORA-04021.
ORA-04030 SRDC for ORA-04030. OS process private memory was exhausted.
ORA-04031 SRDC for ORA-04031. More shared memory is needed in the shared/streams pool.
ORA-07445 SRDC for ORA-07445. Exception encountered, core dump.
ORA-08102 SRDC - Required Diagnostic Data Collection for ORA-08102.
ORA-08103 SRDC - Required Diagnostic Data Collection for ORA-08103.
ORA-12751 SRDC for ORA-12751. Internal error code.
ORA-22924 SRDC - ORA-22924 or ORA-1555 on LOB data: Checklist of Evidence to Supply (Doc ID 1682707.1)
ORA-27300 SRDC for ORA-27300. OS system dependent operation:open failed with status: (status).
ORA-27301 SRDC for ORA-27301. OS failure message: (message).
ORA-27302 SRDC for ORA-27302. failure occurred at: (module).
ORA-30036 SRDC for database ORA-30036 Unable to extend Undo Tablespace problems
TNS-12154 SRDC - Data Collection for TNS-12154.
TNS-12514 SRDC - Data Collection for TNS-12514.
TNS-12516 SRDC - Data Collection for TNS-12516.
TNS-12518 SRDC - Data Collection for TNS-12518.
TNS-12519 SRDC - Data Collection for TNS-12519.
TNS-12520 SRDC - Data Collection for TNS-12520.
TNS-12528 SRDC - Data Collection for TNS-12528.
ahf SRDC - Data Collection for orachk or exachk issue, after running orachk -debug or exachk -debug.
crs SRDC FOR CRS
crsasm SRDC FOR ASM CRS Related Errors
crsasmcell SRDC FOR ASM CRS CELL Related Errors
dbacl SRDC - How to Collect Standard Information for Access Control Lists (ACLs).
dbaqgen SRDC - How To Collect Information For Troubleshooting Problem In An Oracle Advanced Queuing Environment.
dbaqmon SRDC - How to Collect Information for Troubleshooting Queue Monitor (QMON) Issues.
dbaqnotify SRDC - How to Collect Information for Troubleshooting Notification in an Advanced Queuing Environment.
dbaqperf SRDC - How To Collect Information For Troubleshooting Performance In An Oracle Advanced Queuing Environment.
dbaqpurge SRDC - How to Collect Information for Troubleshooting Non-Purged Messages in an Advanced Queuing Environment
dbasm SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
dbaudit SRDC - How to Collect Standard Information for Database Auditing
dbaum SRDC - AUM : Checklist of Evidence to Supply (Doc ID 1682741.1)
dbaumwaitevents SRDC - Wait Events related to Undo: Checklist of Evidence to Supply (Doc ID 1682723.1)
dbawrspace SRDC for database AWR space problems
dbbeqconnection SRDC - Bequeath Connection Issues: Checklist of Evidence to Supply (Doc ID 1928047.1)
dbdatapatch SRDC - Data Collection for Datapatch issues.
dbddlerrors SRDC - DDL Errors: Checklist of Evidence to Supply
dbemon SRDC - How to Collect Information for Troubleshooting Event Monitor (EMON) Issues
dbenqdeq SRDC - How to Collect Standard Information for Advanced Queueing Issues Using TFA Collector (Recommended) or Manual Steps
dbexp SRDC - How to Collect Information for Troubleshooting Export (EXP) Related Problems
dbexpdp SRDC - Diagnostic Collection for DataPump Export Generic Issues
dbexpdpapi SRDC - Diagnostic Collection for DataPump Export API Issues
dbexpdpperf SRDC - Diagnostic Collection for DataPump Export Performance Issues
dbexpdptts SRDC - Data to supply for Transportable Tablespace Datapump and original EXPORT, IMPORT
dbfra SRDC - Required diagnostic data collection for FRA related errors.
dbfs SRDC for dbfs.
dbggclassicmode SRDC for DOC ID 1913426.1, 1913376.1 and 1912964.1
dbggintegratedmode SRDC for GoldenGate extract/replicat abends problems.
dbhang SRDC for database Hang problems
dbimp SRDC - Diagnostic Collection for Traditional Import Issues
dbimpdp SRDC - Diagnostic Collection for DataPump Import (IMPDP) Generic Issues
dbimpdpperf SRDC - Diagnostic Collection for DataPump Import (IMPDP) Performance Issues
dbinstall SRDC for Oracle RDBMS install problems.
dbinstancecrash SRDC - Instance Termination (non-RAC) Issues : Checklist of Evidence to Supply (Doc ID 2507010.1)
dbinvalidcomp SRDC - Invalid Components and Objects : Checklist of Evidence to Supply
dbinvalidobj SRDC - Objects Getting Invalidated: Checklist of Evidence to Supply
dbparameterfiles SRDC - Parameter Files :Checklist of Evidence to Supply.
dbparameters SRDC - Database Parameters: Checklist of Evidence to Supply.
dbpartition SRDC - Data to Supply for Create/Maintain Partitioned/Subpartitioned Table/Index Issues
dbpartitionperf SRDC - Data to Supply for Slow Create/Alter/Drop Commands Against Partitioned Table/Index
dbpatchconflict SRDC for Oracle RDBMS patch conflict problems.
dbpatchinstall
dbperf SRDC for database performance problems
dbplugincompliance SRDC - Collect Relevant Diagnostic Information For All Compliance Related Issues Within Enterprise Manager 12c and 13c for Oracle Database.
dbpreupgrade SRDC for database preupgrade problems.
dbprocmgmt SRDC - Generic Process Management and Related Issues: Checklist of Evidence to Supply (Doc ID 2500734.1)
dbrac SRDC FOR RAC Specific Issues
dbracinst SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
dbracmin Minimal SRDC FOR RAC Specific Issues
dbracperf SRDC for RAC database performance problems
dbrman SRDC - Required diagnostic data collection for RMAN related errors.
dbrmanperf SRDC - Required diagnostic data collection for RMAN Performance(1671509.1).
dbscn SRDC for database SCN problems.
dbshutdown SRDC - Shutdown Issues : Checklist of Evidence to Supply (Doc ID 1906473.1)
dbslowddl SRDC - Slow DDL: Checklist of Evidence to Supply
dbspatialexportimport SRDC - Data Collection for Oracle Spatial Export/Import Issues.
dbspatialinstall SRDC - Data Collection for Oracle Spatial Installation Issues.
dbsqlperf SRDC - How to Collect Standard Information for a SQL Performance Problem Using TFA Collector.
dbstandalonedbca SRDC - DBCA Issues: Checklist of Evidence to Supply
dbstartup SRDC - Startup Issues: Checklist of Evidence to Supply (Doc ID 1905616.1)
dbtde SRDC - How to Collect Standard Information for Transparent Data Encryption (TDE) (Doc ID 1905607.1)
dbtextinstall SRDC - Data Collection for Oracle Text Installation Issues - 12c.
dbtextupgrade SRDC - Data Collection for Oracle Text Upgrade Issues - 12c.
dbundocorruption SRDC - Required Diagnostic Data Collection for UNDO Corruption.
dbunixresources SRDC to capture diagnostic data for DB issues related to O/S resources
dbupgrade SRDC for database upgrade problems.
dbvault SRDC - How to Collect Standard Information for Database Vault
dbwindowsresources SRDC - DB on Windows Resources : Checklist of Evidence to Supply.
dbwinservice SRDC - OracleService on Windows: Checklist of Evidence to Supply (Doc ID 1918781.1)
dbxdb SRDC for database XDB Installation and Invalid Object problems
dnfs SRDC for DNFS.
emagentgeneric SRDC - Collect Trace/Log Information for Enterprise Manager Management Agent Generic Issues
emagentpatching SRDC - Collect Trace/Log Information for Failures during Enterprise Manager 13c Management Agent Patching.
emagentperf EM SRDC - Collect Diagnostic Data for EM Agent Performance Issues.
emagentstartup SRDC - Collecting Logs for Enterprise Manager 13c Agent Startup Errors.
emagtpatchdeploy SRDC - Collecting Log Files for EM 13c Agent or Agent Patch Deployment.
emagtupgpatch SRDC - Collecting Log Files for EM 13c Agent Upgrade or Local Installation or Patching.
emcliadd EM SRDC - Errors during the adding of a database/listener/ASM target via EMCLI.
emclusdisc EM SRDC - Cluster target, cluster (RAC) database or ASM target is not discovered.
emdbaasdeploy SRDC - Collect Trace/Log Information For Failures During Database As A Service(DBAAS) Deployment.
emdbsys EM SRDC - Database system target is not discovered/detected/removed/renamed correctly.
emdebugoff SRDC for unsetting EM Debug.
emdebugon SRDC for setting EM Debug.
emfleetpatching SRDC - Collecting Diagnostic Data for Enterprise Manager Fleet Maintenance Patching Issues.
emgendisc EM SRDC - General error is received when discovering or removing a database/listener/ASM target.
emmetricalert SRDC for EM Metric Events not Raised and General Metric Alert Related Issues.
emomscrash SRDC - Collect Diagnostic Data for all Enterprise Manager OMS Crash / Restart Performance Issues.
emomsheap SRDC - Collecting Diagnostic Data for Enterprise Manager OMS Heap Usage Alert Performance Issues.
emomshungcpu SRDC - Collecting Diagnostic Data for Enterprise Manager OMS hung or High CPU Usage Performance Issues.
emomspatching SRDC - Collect Trace/Log Information for Failures during Enterprise Manager 13c OMS Patching.
empatchplancrt SRDC - Collecting Diagnostic Data for Enterprise Manager Patch Plan Creation Issues.
emprocdisc EM SRDC - Database/listener/ASM target is not discovered/detected by the discovery process.
emtbsmetric SRDC - Collect Relevant Diagnostic Information For All Tablespace Space Used (%) Metric Issues Within Enterprise Manager For Oracle Database 12c and 13c.
esexalogic SRDC - Exalogic Full Exalogs Data Collection Information.
exservice SRDC - Exadata: Storage Software Service Or Offload Server Service Failures.
exsmartscan SRDC - Exadata: Smart Scan Not Working Issues.
gg_abend SRDC for DOC ID 2650417.1
ggintegratedmodenodb SRDC for GoldenGate extract/replicat abends problems.
gridinfra SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
gridinfrainst SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
instterm SRDC for instance terminated events, such as ORA-00469: ORA-00470: ORA-00480: ORA-00490: ORA-00491, ORA-00492, ORA-00493, ORA-00495, ORA-00496, ORA-00497, ORA-00498
internalerror SRDC for all other types of internal database errors.
ora1000 SRDC - Open Cursors:Checklist of Evidence to Supply.
ora18 SRDC - ORA-18 or Sessions Parameter: Checklist of Evidence to Supply.
ora25319 SRDC - How to Collect Information for Troubleshooting an ORA-25319 Error in an Advanced Queuing Environment.
ora4023 SRDC - ORA-4023 : Checklist of Evidence to Supply
ora4063 SRDC - ORA-4063 : Checklist of Evidence to Supply
ora445 SRDC - ORA-445 or Unable to Spawn Process: Checklist of Evidence to Supply (Doc ID 2500730.1)
xdb600 SRDC - Required Diagnostic Data Collection for XDB ORA-00600 and ORA-07445 Internal Error Issues using TFA Collector
xdbinstall SRDC - Required Diagnostic Data Collection for XDB Installation and Invalid Object for Issues for 12c and Onward
zlgeneric SRDC - Zero Data Loss Recovery Appliance (ZDLRA) Data Collection.
[oracle@shdb01 ~]$
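For this incident the `dbinstancecrash` entry in the list above looks like the closest match, since it points at the same Doc ID 2507010.1 cited in the references. Below is a hedged sketch that only assembles such a command line from the options shown in the help output; it does not run tfactl. The helper `build_srdc_command`, the database name `orcl` and the time window are assumptions made for illustration.

```python
import shlex


def build_srdc_command(srdc_type, database, start, end, tfactl="tfactl"):
    """Assemble an argv list using only options listed in the help text above."""
    return [
        tfactl, "diagcollect",
        "-srdc", srdc_type,
        "-from", start,       # "yyyy-mm-dd hh:mm:ss", one of the documented formats
        "-to", end,
        "-database", database,
    ]


# Assumed values: database name taken from the diag path in alert.log,
# time window taken from the ASH query used later in this post.
argv = build_srdc_command(
    "dbinstancecrash", "orcl", "2021-04-15 13:00:00", "2021-04-16 10:00:00"
)

# Print a shell-quoted command for review before running it by hand.
print(shlex.join(argv))
```

Printing instead of executing is deliberate: the collection may ask follow-up questions, so it is safer to paste the command into a terminal yourself.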
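Going back through alert.log from the earliest warning, it turns out that no AWR snapshot had been generated since 14:00 on the 15th.
Check dba_hist_active_sess_history to see what the database was busy with before the problem: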
set lines 500
set long 9999
set pages 999
set serveroutput on size 1000000
alter session set nls_date_format = 'yyyy/mm/dd hh24:mi:ss';
alter session set nls_timestamp_format = 'yyyy-mm-dd hh24.mi.ss.ff';
select instance_number, sample_id,sample_time,count(*) cnt
from dba_hist_active_sess_history where SAMPLE_TIME between
TO_TIMESTAMP('2021/04/15 13:00', 'yyyy/mm/dd hh24:mi') and
TO_TIMESTAMP('2021/04/16 10:00', 'yyyy/mm/dd hh24:mi')
group by instance_number, sample_id,sample_time
order by instance_number, sample_id,sample_time;
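No rows there either (there were hardly any sessions before the crash anyway; what workload would a test database have at 8:30 in the morning?).
The M000 process left no trace file; only the J000 trace repeats this message every 2 seconds: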
Process J000 is dead ... state=KSOSP_SPAWNED
The operating system's messages log shows OOM errors from the period when things started going wrong:
Apr 11 18:58:27 host auditd[1737]: Audit daemon rotating log files
Apr 11 22:05:14 host auditd[1737]: Audit daemon rotating log files
Apr 12 11:09:15 host auditd[1737]: Audit daemon rotating log files
Apr 12 20:10:36 host auditd[1737]: Audit daemon rotating log files
Apr 13 13:01:16 host auditd[1737]: Audit daemon rotating log files
Apr 13 19:23:24 host auditd[1737]: Audit daemon rotating log files
Apr 14 13:49:20 host auditd[1737]: Audit daemon rotating log files
Apr 15 12:23:09 host auditd[1737]: Audit daemon rotating log files
Apr 15 17:34:48 host kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Apr 15 17:34:48 host kernel: oracle cpuset=/ mems_allowed=0
Apr 15 17:34:48 host kernel: Pid: 3388, comm: oracle Tainted: G --------------- T 2.6.32-431.el6.x86_64 #1
Apr 15 17:34:48 host kernel: Call Trace:
Apr 15 17:34:48 host kernel: [] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 15 17:34:48 host kernel: [] ? dump_header+0x90/0x1b0
Apr 15 17:34:48 host kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 15 17:34:48 host kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 15 17:34:48 host kernel: [] ? select_bad_process+0xe1/0x120
Apr 15 17:34:48 host kernel: [] ? out_of_memory+0x220/0x3c0
Apr 15 17:34:48 host kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 15 17:34:48 host kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 15 17:34:48 host kernel: [] ? __page_cache_alloc+0x87/0x90
Apr 15 17:34:48 host kernel: [] ? find_get_page+0x1e/0xa0
Apr 15 17:34:48 host kernel: [] ? filemap_fault+0x1a7/0x500
Apr 15 17:34:48 host kernel: [] ? __do_fault+0x54/0x530
Apr 15 17:34:48 host kernel: [] ? handle_pte_fault+0xf7/0xb00
Apr 15 17:34:48 host kernel: [] ? rb_reserve_next_event+0xb4/0x370
Apr 15 17:34:48 host kernel: [] ? native_sched_clock+0x13/0x80
Apr 15 17:34:48 host kernel: [] ? rb_reserve_next_event+0xb4/0x370
Apr 15 17:34:48 host kernel: [] ? native_sched_clock+0x13/0x80
Apr 15 17:34:48 host kernel: [] ? handle_mm_fault+0x22a/0x300
Apr 15 17:34:48 host kernel: [] ? __do_page_fault+0x138/0x480
Apr 15 17:34:48 host kernel: [] ? thread_group_times+0x3d/0x120
Apr 15 17:34:48 host kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
Apr 15 17:34:48 host kernel: [] ? mmput+0x1e/0x120
Apr 15 17:34:48 host kernel: [] ? trace_nowake_buffer_unlock_commit+0x43/0x60
Apr 15 17:34:48 host kernel: [] ? ftrace_raw_event_sys_exit+0xb9/0xc0
Apr 15 17:34:48 host kernel: [] ? do_page_fault+0x3e/0xa0
Apr 15 17:34:48 host kernel: [] ? page_fault+0x25/0x30
Apr 15 17:34:48 host kernel: Mem-Info:
Apr 15 17:34:48 host kernel: Node 0 DMA per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 3: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: Node 0 DMA32 per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 3: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: Node 0 Normal per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 186, btch: 31 usd: 23
Apr 15 17:34:48 host kernel: CPU 3: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: active_anon:463690 inactive_anon:140220 isolated_anon:0
Apr 15 17:34:48 host kernel: active_file:245 inactive_file:504 isolated_file:0
Apr 15 17:34:48 host kernel: unevictable:0 dirty:11 writeback:0 unstable:0
Apr 15 17:34:48 host kernel: free:22140 slab_reclaimable:10286 slab_unreclaimable:84990
Apr 15 17:34:48 host kernel: mapped:11740 shmem:46989 pagetables:215832 bounce:0
Apr 15 17:34:48 host kernel: Node 0 DMA free:15684kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15292kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 3000 4010 4010
Apr 15 17:34:48 host kernel: Node 0 DMA32 free:55048kB min:50372kB low:62964kB high:75556kB active_anon:1627600kB inactive_anon:333656kB active_file:916kB inactive_file:1996kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:32kB writeback:0kB mapped:23000kB shmem:120460kB slab_reclaimable:23224kB slab_unreclaimable:199140kB kernel_stack:27016kB pagetables:538512kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 0 1010 1010
Apr 15 17:34:48 host kernel: Node 0 Normal free:17828kB min:16956kB low:21192kB high:25432kB active_anon:227160kB inactive_anon:227224kB active_file:64kB inactive_file:20kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:12kB writeback:0kB mapped:23960kB shmem:67496kB slab_reclaimable:17920kB slab_unreclaimable:140820kB kernel_stack:7400kB pagetables:324816kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:16 all_unreclaimable? no
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 0 0 0
Apr 15 17:34:48 host kernel: Node 0 DMA: 1*4kB 4*8kB 2*16kB 2*32kB 3*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15684kB
Apr 15 17:34:48 host kernel: Node 0 DMA32: 5534*4kB 1983*8kB 236*16kB 85*32kB 110*64kB 28*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55120kB
Apr 15 17:34:48 host kernel: Node 0 Normal: 4451*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17828kB
Apr 15 17:34:48 host kernel: 50408 total pagecache pages
Apr 15 17:34:48 host kernel: 2357 pages in swap cache
Apr 15 17:34:48 host kernel: Swap cache stats: add 13218790, delete 13216433, find 24808505/26126131
Apr 15 17:34:48 host kernel: Free swap = 0kB
Apr 15 17:34:48 host kernel: Total swap = 4194296kB
Apr 15 17:34:48 host kernel: 1048560 pages RAM
Apr 15 17:34:48 host kernel: 67274 pages reserved
Apr 15 17:34:48 host kernel: 179030 pages shared
Apr 15 17:34:48 host kernel: 816137 pages non-shared
Apr 15 17:34:48 host kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Apr 15 17:34:48 host kernel: [ 465] 0 465 2814 1 1 -17 -1000 udevd
Apr 15 17:34:48 host kernel: [ 1589] 0 1589 47371 136 0 0 0 vmtoolsd
Apr 15 17:34:48 host kernel: [ 1737] 0 1737 23300 73 2 -17 -1000 auditd
Apr 15 17:34:48 host kernel: [ 1739] 0 1739 20521 80 1 0 0 audispd
Apr 15 17:34:48 host kernel: [ 1740] 0 1740 5301 42 0 0 0 sedispatch
Apr 15 17:34:48 host kernel: [ 1814] 0 1814 2705 44 2 0 0 irqbalance
Apr 15 17:34:48 host kernel: [ 1833] 32 1833 4759 22 0 0 0 rpcbind
Apr 15 17:34:48 host kernel: [ 1942] 0 1942 3396 44 3 -17 -1000 lldpad
Free swap = 0kB ?
Most likely the host simply ran out of memory. I will deploy OSWatcher (osw) and keep observing.
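To put numbers on the memory guess, the page counts in the oom-killer report can be converted to MiB. This is a back-of-the-envelope sketch, assuming 4 KiB base pages (the x86_64 default); the figures are copied from the Mem-Info lines above and the helper names are my own.

```python
PAGE_KIB = 4  # assumption: 4 KiB base pages (x86_64 default)

# Page counts copied from the oom-killer report above.
REPORT_PAGES = {
    "pages RAM": 1048560,
    "active_anon": 463690,
    "inactive_anon": 140220,
    "active_file": 245,
    "inactive_file": 504,
    "free": 22140,
    "slab_unreclaimable": 84990,
    "pagetables": 215832,
}
SWAP_TOTAL_KB = 4194296
SWAP_FREE_KB = 0


def pages_to_mib(pages):
    """Convert a page count to MiB under the PAGE_KIB assumption."""
    return pages * PAGE_KIB / 1024


def summarize(report):
    """Return (ram_mib, {counter: (size_mib, percent_of_ram)})."""
    ram_mib = pages_to_mib(report["pages RAM"])
    rows = {}
    for name, pages in report.items():
        if name == "pages RAM":
            continue
        size = pages_to_mib(pages)
        rows[name] = (size, 100.0 * size / ram_mib)
    return ram_mib, rows


ram_mib, rows = summarize(REPORT_PAGES)
print(f"RAM: {ram_mib:.1f} MiB")
for name, (size, pct) in rows.items():
    print(f"{name:20s}{size:9.1f} MiB {pct:5.1f}% of RAM")
print(f"swap used: {(SWAP_TOTAL_KB - SWAP_FREE_KB) / 1024:.1f} MiB of {SWAP_TOTAL_KB / 1024:.1f} MiB")
```

Under that assumption the host has about 4 GiB of RAM, swap is completely used, almost no file cache is left to reclaim, and page tables alone take roughly 843 MiB, about a fifth of RAM. Page tables that large often go with many processes mapping a big shared memory segment without HugePages, but that is an inference on my part, not something the log states, which is another reason to watch the host with osw.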