Category: Oracle

2021-04-16 20:01:32


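We have had a burst of outages recently: several instances crashed, each for a different reason. I am recording them here first and will dig deeper when I have time.

The test database suddenly crashed

Start with alert.log: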

kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/eastdb/orcl/trace/orcl_cjq0_3462.trc:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Fri Apr 16 08:46:18 2021
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_cjq0_3462.trc:
Fri Apr 16 08:46:21 2021
Process W000 died, see its trace file
Fri Apr 16 08:46:21 2021
PMON (ospid: 3335): terminating the instance due to error 474
Fri Apr 16 08:46:22 2021
System state dump requested by (instance=1, osid=3335 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /home/ora/diag/rdbms/orcl/orcl/trace/orcl_diag_3370.trc
Instance terminated by PMON, pid = 3335
The key piece of information is error 474 (ORA-00474), which means the SMON process terminated with an error.
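When several instances crash in the same period, it can help to pull the termination line out of each alert.log mechanically. The following is a minimal Python sketch and not part of the original investigation; the function name `find_terminations` and the regular expression are assumptions modelled only on the PMON line in the alert.log excerpt.

```python
import re

# Matches lines such as:
#   PMON (ospid: 3335): terminating the instance due to error 474
# The pattern is an assumption based on the excerpt, not an official format.
TERMINATION = re.compile(
    r"^(?P<proc>\w+) \(ospid: (?P<ospid>\d+)\): "
    r"terminating the instance due to error (?P<code>\d+)"
)


def find_terminations(lines):
    """Return (process, ospid, ORA code) for every instance-termination line."""
    hits = []
    for line in lines:
        m = TERMINATION.match(line.strip())
        if m:
            code = f"ORA-{int(m['code']):05d}"  # 474 -> ORA-00474
            hits.append((m["proc"], int(m["ospid"]), code))
    return hits


sample = [
    "Process W000 died, see its trace file",
    "Fri Apr 16 08:46:21 2021",
    "PMON (ospid: 3335): terminating the instance due to error 474",
]
print(find_terminations(sample))
```

In practice the `lines` argument would be an open alert.log file object; the sample list here only reuses three lines from the excerpt.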



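What does SMON do? It is the System Monitor background process: it performs instance recovery at startup and cleans up temporary segments that are no longer in use.
So where do you start when analyzing an SMON crash?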

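The best starting point is still the DIAG trace file, here orcl_diag_3370.trc.
Search for "process 13:smon". The 13 is the Oracle process ID (pid) of SMON on this machine; it will differ on other machines.
Then search further down for Session Wait History and check for any abnormal waits: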


Session Wait History:
        elapsed time of 0.263819 sec since current wait
     0: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247545 seq_num=7117 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.439011 sec of elapsed time
     1: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247544 seq_num=7116 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.253953 sec of elapsed time
     2: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247543 seq_num=7115 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.030880 sec of elapsed time
     3: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247542 seq_num=7114 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.047717 sec of elapsed time
     4: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247541 seq_num=7113 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.007141 sec of elapsed time
     5: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247540 seq_num=7112 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.176498 sec of elapsed time
     6: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247539 seq_num=7111 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.183811 sec of elapsed time
     7: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247538 seq_num=7110 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.088497 sec of elapsed time
     8: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247537 seq_num=7109 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.262751 sec of elapsed time
     9: waited for 'smon timer'
        sleep time=0x12c, failed=0x0, =0x0
        wait_id=9247536 seq_num=7108 snap_id=1
        wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
        wait times: max=5 min 0 sec
        wait counts: calls=1 os=99
        occurred after 0.029236 sec of elapsed time
    Sampled Session History of session 66 serial 1
    ---------------------------------------------------
    The sampled session history is constructed by sampling
    the target session every 1 second. The sampling process
    captures at each sample if the session is in a non-idle wait,
    an idle wait, or not in a wait. If the session is in a
    non-idle wait then one interval is shown for all the samples
    the session was in the same non-idle wait. If the
    session is in an idle wait or not in a wait for
    consecutive samples then one interval is shown for all
    the consecutive samples. Though we display these consecutive
    samples in a single interval the session may NOT be continuously
    idle or not in a wait (the sampling process does not know).

    The history is displayed in reverse chronological order.
Nothing abnormal here: every entry is the idle 'smon timer' wait.
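As a quick sanity check on that reading, the hex `sleep time` parameter in the wait history can be decoded and compared with the reported wait duration. This is a small Python sketch; the helper name `smon_timer_sleep_seconds` is hypothetical, and it relies on the first parameter of the 'smon timer' event being a sleep time in seconds.

```python
import re


def smon_timer_sleep_seconds(wait_line):
    """Extract the hex 'sleep time' parameter from a wait-history line.

    Returns the value as an integer number of seconds, or None if the
    line does not contain the parameter.
    """
    m = re.search(r"sleep time=0x([0-9a-fA-F]+)", wait_line)
    return int(m.group(1), 16) if m else None


line = "sleep time=0x12c, failed=0x0, =0x0"
secs = smon_timer_sleep_seconds(line)
minutes, seconds = divmod(secs, 60)
print(f"{secs} s = {minutes} min {seconds} sec")
```

0x12c is 300 seconds, which matches the "5 min 0 sec" wait times in the trace: SMON was simply sleeping its normal five-minute timer and not stuck on anything.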

Time to look elsewhere: the PMON trace file.
Go straight to the bottom of it:


0BF4F4EC0 00000000 00000000 00000000 00000000 [................]
        Repeat 113 times
0BF4F55E0 BF4F55E0 00000000 BF4F55E0 00000000 [.UO......UO.....]
0BF4F55F0 00000000 00000000 BF4F55F8 00000000 [.........UO.....]
0BF4F5600 BF4F55F8 00000000 00000000 00000000 [.UO.............]
0BF4F5610 00000000 00000000 00000000 00000000 [................]
  Repeat 1 times
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1332<-ksulhdcb()+499<-ksucln()+1243<-ksbrdp()+971<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36

----- End of Abridged Call Stack Trace -----

*** 2021-04-16 08:46:21.779
PMON (ospid: 3335): terminating the instance due to error 474
ksuitm: waiting up to [5] seconds before killing DIAG(3370)
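The call stack trace is very important for locating the problem.
My feeling is that the key function here is ksucln().

My guess is that it comes back to SMON's usual job: it hit a problem while cleaning up objects.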
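The abridged stack is easier to read once it is split into frames. Here is a minimal Python sketch; the function name `parse_abridged_stack` is hypothetical, and the only assumption about the format is what the trace itself shows: frames separated by `<-`, innermost frame first, each written as `name()+offset`. Note that this stack comes from the PMON trace file, so it shows PMON's own path to terminating the instance.

```python
def parse_abridged_stack(stack):
    """Split 'f()+off<-g()+off<-...' into (function, offset) pairs.

    Frames are returned in the order they appear in the trace,
    i.e. innermost (most recent) call first.
    """
    frames = []
    for frame in stack.strip().split("<-"):
        name, _, offset = frame.partition("()+")
        frames.append((name, int(offset)))
    return frames


stack = (
    "ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53"
    "<-ksuitm()+1332<-ksulhdcb()+499<-ksucln()+1243<-ksbrdp()+971"
    "<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266"
    "<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36"
)

frames = parse_abridged_stack(stack)

# Print in call order: outermost caller first, innermost call last.
for depth, (name, offset) in enumerate(reversed(frames)):
    print(f"{'  ' * depth}{name}+{offset}")
```

Read in call order, ksucln() is reached from ksbrdp() and in turn leads to ksuitm(), the function named in the final "waiting up to [5] seconds before killing DIAG" message.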


Known issues related to SMON termination

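ORA-474: SMON process terminated with error

1 - ORA-00474: SMON process terminated with error during parallel transaction recovery

Solution:

Turn off parallel recovery by adding the following parameter to your init@SID.ora:
fast_start_parallel_rollback = FALSE
Then bounce the instance.

For more details, see:

Ora-600 [15789] and Ora-474 (Doc ID 1094645.1)


2 - ORA-600 [504] and ORA-474 crash the instance; ORA-600 [kcbnew_3] can also bring it down (versions below 11.2.0.2)

Solution:

Upgrade to 10.2.0.5, or to 11.2.0.2 or later.

Check on MOS whether one-off patch 9084487 is available for your platform.

For more details, see:

ORA-00600 [504] and ORA-474 causing a database crash (Doc ID 1209577.1)

3 - ORA-600 [13011] and ORA-474 reported in the alert log, where the failing SQL in the trace is similar to "delete from smon_scn_time where scn = (select min(scn) from smon_scn_time)"

Solution:

Run ANALYZE TABLE smon_scn_time VALIDATE STRUCTURE CASCADE and rebuild all of its indexes.

For more details, see:

Instance terminated with error ORA-00474: SMON process terminated with error (Doc ID 1361872.1)

If the error is reported against a different table, try the same solution (analyze the reported table and rebuild its indexes).

For troubleshooting this error in more detail, see:

Understanding and diagnosing ORA-00600 [13011] errors (Doc ID 1392778.1)

4 - Instance crash with ORA-474 and ORA-600 [4464] / ORA-600 [4427] (on versions below 11.2.0.2)

This is Bug 11814907 (instance restarts with ORA-00474: SMON process terminated with error), closed as a duplicate of Bug 9857702 (ORA-600 [4464]).

Solution:

Upgrade to 11.2.0.2 or later, or install interim patch 9857702 (if available for your platform).

5 - ORA-00600 [KDOURP_INORDER2] and ORA-00474 reported in the alert log (versions below 11.2)

This is Bug 7627304 (ORA-00600 [KDOURP_INORDER2] and ORA-00474: SMON terminated, PMON terminates the instance), closed as a duplicate of Bug 7662491 (instance crash / ORA-600 [KDDUMMY_BLKCHK] hit during recovery).

Solution:

Upgrade to 11.2 or install interim patch 7662491 (if available for your platform).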


References:
Troubleshooting ORA-46x and ORA-47x xxxx Process Terminated With Error (Doc ID 1907129.1)
SRDC - Instance Termination (non-RAC) Issues : Checklist of Evidence to Supply (Doc ID 2507010.1)
Database System Monitor Process (SMON) (Doc ID 1495163.1)

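For crash problems, the diagnostic data can be collected with tfactl. While we are at it, take a look at its help output, which is quite extensive.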

[oracle@shdb01 ~]$ tfactl diagcollect -srdc -help
Service Request Data Collection (SRDC).
Usage : /opt/oracle.ahf/tfa/bin/tfactl diagcollect -srdc [-tag ] [-z ] [-last | -from -to | -for ] -database
-tag The files will be collected into tagname directory inside
repository
-z The collection zip file will be given this name within the
TFA collection repository
-last Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours
-since Same as -last. Kept for backward compatibility.
-from "Mon/dd/yyyy hh:mm:ss" From
or "yyyy-mm-dd hh:mm:ss"
or "yyyy-mm-ddThh:mm:ss"
or "yyyy-mm-dd"
-to "Mon/dd/yyyy hh:mm:ss" To
or "yyyy-mm-dd hh:mm:ss"
or "yyyy-mm-ddThh:mm:ss"
or "yyyy-mm-dd"
-for "Mon/dd/yyyy" For .
or "yyyy-mm-dd"
can be any of the following,
DBCORRUPT Required Diagnostic Data Collection for a Generic Database Corruption
Listener_Services SRDC - Data Collection for TNS-12516 / TNS-12518 / TNS-12519 / TNS-12520.
Naming_Services SRDC - Data Collection for ORA-12154 / ORA-12514 / ORA-12528.
ORA-00020 SRDC for database ORA-00020 Maximum number of processes exceeded
ORA-00060 SRDC for ORA-00060. Internal error code.
ORA-00494 SRDC for ORA-00494.
ORA-00600 SRDC for ORA-00600. Internal error code.
ORA-00700 SRDC for ORA-00700. Soft internal error.
ORA-01031 SRDC - How to Collect Standard Information for ORA - 1031 /ORA -1017 during SYSDBA connections
ORA-01555 SRDC - ORA-1555: Checklist of Evidence to Supply (Doc ID 1682708.1)
ORA-01578 SRDC - Required Diagnostic Data Collection for ORA-01578
ORA-01628 SRDC for database ORA-01628 Snapshot too Old problems
ORA-04020 SRDC for ORA-04020
ORA-04021 SRDC for ORA-04021.
ORA-04030 SRDC for ORA-04030. OS process private memory was exhausted.
ORA-04031 SRDC for ORA-04031. More shared memory is needed in the shared/streams pool.
ORA-07445 SRDC for ORA-07445. Exception encountered, core dump.
ORA-08102 SRDC - Required Diagnostic Data Collection for ORA-08102.
ORA-08103 SRDC - Required Diagnostic Data Collection for ORA-08103.
ORA-12751 SRDC for ORA-12751. Internal error code.
ORA-22924 SRDC - ORA-22924 or ORA-1555 on LOB data: Checklist of Evidence to Supply (Doc ID 1682707.1)
ORA-27300 SRDC for ORA-27300. OS system dependent operation:open failed with status: (status).
ORA-27301 SRDC for ORA-27301. OS failure message: (message).
ORA-27302 SRDC for ORA-27302. failure occurred at: (module).
ORA-30036 SRDC for database ORA-30036 Unable to extend Undo Tablespace problems
TNS-12154 SRDC - Data Collection for TNS-12154.
TNS-12514 SRDC - Data Collection for TNS-12514.
TNS-12516 SRDC - Data Collection for TNS-12516.
TNS-12518 SRDC - Data Collection for TNS-12518.
TNS-12519 SRDC - Data Collection for TNS-12519.
TNS-12520 SRDC - Data Collection for TNS-12520.
TNS-12528 SRDC - Data Collection for TNS-12528.
ahf SRDC - Data Collection for orachk or exachk issue, after running orachk -debug or exachk -debug.
crs SRDC FOR CRS
crsasm SRDC FOR ASM CRS Related Errors
crsasmcell SRDC FOR ASM CRS CELL Related Errors
dbacl SRDC - How to Collect Standard Information for Access Control Lists (ACLs).
dbaqgen SRDC - How To Collect Information For Troubleshooting Problem In An Oracle Advanced Queuing Environment.
dbaqmon SRDC - How to Collect Information for Troubleshooting Queue Monitor (QMON) Issues.
dbaqnotify SRDC - How to Collect Information for Troubleshooting Notification in an Advanced Queuing Environment.
dbaqperf SRDC - How To Collect Information For Troubleshooting Performance In An Oracle Advanced Queuing Environment.
dbaqpurge SRDC - How to Collect Information for Troubleshooting Non-Purged Messages in an Advanced Queuing Environment
dbasm SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
dbaudit SRDC - How to Collect Standard Information for Database Auditing
dbaum SRDC - AUM : Checklist of Evidence to Supply (Doc ID 1682741.1)
dbaumwaitevents SRDC - Wait Events related to Undo: Checklist of Evidence to Supply (Doc ID 1682723.1)
dbawrspace SRDC for database AWR space problems
dbbeqconnection SRDC - Bequeath Connection Issues: Checklist of Evidence to Supply (Doc ID 1928047.1)
dbdatapatch SRDC - Data Collection for Datapatch issues.
dbddlerrors SRDC - DDL Errors: Checklist of Evidence to Supply
dbemon SRDC - How to Collect Information for Troubleshooting Event Monitor (EMON) Issues
dbenqdeq SRDC - How to Collect Standard Information for Advanced Queueing Issues Using TFA Collector (Recommended) or Manual Steps
dbexp SRDC - How to Collect Information for Troubleshooting Export (EXP) Related Problems
dbexpdp SRDC - Diagnostic Collection for DataPump Export Generic Issues
dbexpdpapi SRDC - Diagnostic Collection for DataPump Export API Issues
dbexpdpperf SRDC - Diagnostic Collection for DataPump Export Performance Issues
dbexpdptts SRDC - Data to supply for Transportable Tablespace Datapump and original EXPORT, IMPORT
dbfra SRDC - Required diagnostic data collection for FRA related errors.
dbfs SRDC for dbfs.
dbggclassicmode SRDC for DOC ID 1913426.1, 1913376.1 and 1912964.1
dbggintegratedmode SRDC for GoldenGate extract/replicat abends problems.
dbhang SRDC for database Hang problems
dbimp SRDC - Diagnostic Collection for Traditional Import Issues
dbimpdp SRDC - Diagnostic Collection for DataPump Import (IMPDP) Generic Issues
dbimpdpperf SRDC - Diagnostic Collection for DataPump Import (IMPDP) Performance Issues
dbinstall SRDC for Oracle RDBMS install problems.
dbinstancecrash SRDC - Instance Termination (non-RAC) Issues : Checklist of Evidence to Supply (Doc ID 2507010.1)
dbinvalidcomp SRDC - Invalid Components and Objects : Checklist of Evidence to Supply
dbinvalidobj SRDC - Objects Getting Invalidated: Checklist of Evidence to Supply
dbparameterfiles SRDC - Parameter Files :Checklist of Evidence to Supply.
dbparameters SRDC - Database Parameters: Checklist of Evidence to Supply.
dbpartition SRDC - Data to Supply for Create/Maintain Partitioned/Subpartitioned Table/Index Issues
dbpartitionperf SRDC - Data to Supply for Slow Create/Alter/Drop Commands Against Partitioned Table/Index
dbpatchconflict SRDC for Oracle RDBMS patch conflict problems.
dbpatchinstall
dbperf SRDC for database performance problems
dbplugincompliance SRDC - Collect Relevant Diagnostic Information For All Compliance Related Issues Within Enterprise Manager 12c and 13c for Oracle Database.
dbpreupgrade SRDC for database preupgrade problems.
dbprocmgmt SRDC - Generic Process Management and Related Issues: Checklist of Evidence to Supply (Doc ID 2500734.1)
dbrac SRDC FOR RAC Specific Issues
dbracinst SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
dbracmin Minimal SRDC FOR RAC Specific Issues
dbracperf SRDC for RAC database performance problems
dbrman SRDC - Required diagnostic data collection for RMAN related errors.
dbrmanperf SRDC - Required diagnostic data collection for RMAN Performance(1671509.1).
dbscn SRDC for database SCN problems.
dbshutdown SRDC - Shutdown Issues : Checklist of Evidence to Supply (Doc ID 1906473.1)
dbslowddl SRDC - Slow DDL: Checklist of Evidence to Supply
dbspatialexportimport SRDC - Data Collection for Oracle Spatial Export/Import Issues.
dbspatialinstall SRDC - Data Collection for Oracle Spatial Installation Issues.
dbsqlperf SRDC - How to Collect Standard Information for a SQL Performance Problem Using TFA Collector.
dbstandalonedbca SRDC - DBCA Issues: Checklist of Evidence to Supply
dbstartup SRDC - Startup Issues: Checklist of Evidence to Supply (Doc ID 1905616.1)
dbtde SRDC - How to Collect Standard Information for Transparent Data Encryption (TDE) (Doc ID 1905607.1)
dbtextinstall SRDC - Data Collection for Oracle Text Installation Issues - 12c.
dbtextupgrade SRDC - Data Collection for Oracle Text Upgrade Issues - 12c.
dbundocorruption SRDC - Required Diagnostic Data Collection for UNDO Corruption.
dbunixresources SRDC to capture diagnostic data for DB issues related to O/S resources
dbupgrade SRDC for database upgrade problems.
dbvault SRDC - How to Collect Standard Information for Database Vault
dbwindowsresources SRDC - DB on Windows Resources : Checklist of Evidence to Supply.
dbwinservice SRDC - OracleService on Windows: Checklist of Evidence to Supply (Doc ID 1918781.1)
dbxdb SRDC for database XDB Installation and Invalid Object problems
dnfs SRDC for DNFS.
emagentgeneric SRDC - Collect Trace/Log Information for Enterprise Manager Management Agent Generic Issues
emagentpatching SRDC - Collect Trace/Log Information for Failures during Enterprise Manager 13c Management Agent Patching.
emagentperf EM SRDC - Collect Diagnostic Data for EM Agent Performance Issues.
emagentstartup SRDC - Collecting Logs for Enterprise Manager 13c Agent Startup Errors.
emagtpatchdeploy SRDC - Collecting Log Files for EM 13c Agent or Agent Patch Deployment.
emagtupgpatch SRDC - Collecting Log Files for EM 13c Agent Upgrade or Local Installation or Patching.
emcliadd EM SRDC - Errors during the adding of a database/listener/ASM target via EMCLI.
emclusdisc EM SRDC - Cluster target, cluster (RAC) database or ASM target is not discovered.
emdbaasdeploy SRDC - Collect Trace/Log Information For Failures During Database As A Service(DBAAS) Deployment.
emdbsys EM SRDC - Database system target is not discovered/detected/removed/renamed correctly.
emdebugoff SRDC for unsetting EM Debug.
emdebugon SRDC for setting EM Debug.
emfleetpatching SRDC - Collecting Diagnostic Data for Enterprise Manager Fleet Maintenance Patching Issues.
emgendisc EM SRDC - General error is received when discovering or removing a database/listener/ASM target.
emmetricalert SRDC for EM Metric Events not Raised and General Metric Alert Related Issues.
emomscrash SRDC - Collect Diagnostic Data for all Enterprise Manager OMS Crash / Restart Performance Issues.
emomsheap SRDC - Collecting Diagnostic Data for Enterprise Manager OMS Heap Usage Alert Performance Issues.
emomshungcpu SRDC - Collecting Diagnostic Data for Enterprise Manager OMS hung or High CPU Usage Performance Issues.
emomspatching SRDC - Collect Trace/Log Information for Failures during Enterprise Manager 13c OMS Patching.
empatchplancrt SRDC - Collecting Diagnostic Data for Enterprise Manager Patch Plan Creation Issues.
emprocdisc EM SRDC - Database/listener/ASM target is not discovered/detected by the discovery process.
emtbsmetric SRDC - Collect Relevant Diagnostic Information For All Tablespace Space Used (%) Metric Issues Within Enterprise Manager For Oracle Database 12c and 13c.
esexalogic SRDC - Exalogic Full Exalogs Data Collection Information.
exservice SRDC - Exadata: Storage Software Service Or Offload Server Service Failures.
exsmartscan SRDC - Exadata: Smart Scan Not Working Issues.
gg_abend SRDC for DOC ID 2650417.1
ggintegratedmodenodb SRDC for GoldenGate extract/replicat abends problems.
gridinfra SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
gridinfrainst SRDC AUTOMATION: ENHANCE ASM/DBFS/DNFS/ACFS COLLECTIONS
instterm SRDC for instance terminated events, such as ORA-00469: ORA-00470: ORA-00480: ORA-00490: ORA-00491, ORA-00492, ORA-00493, ORA-00495, ORA-00496, ORA-00497, ORA-00498
internalerror SRDC for all other types of internal database errors.
ora1000 SRDC - Open Cursors:Checklist of Evidence to Supply.
ora18 SRDC - ORA-18 or Sessions Parameter: Checklist of Evidence to Supply.
ora25319 SRDC - How to Collect Information for Troubleshooting an ORA-25319 Error in an Advanced Queuing Environment.
ora4023 SRDC - ORA-4023 : Checklist of Evidence to Supply
ora4063 SRDC - ORA-4063 : Checklist of Evidence to Supply
ora445 SRDC - ORA-445 or Unable to Spawn Process: Checklist of Evidence to Supply (Doc ID 2500730.1)
xdb600 SRDC - Required Diagnostic Data Collection for XDB ORA-00600 and ORA-07445 Internal Error Issues using TFA Collector
xdbinstall SRDC - Required Diagnostic Data Collection for XDB Installation and Invalid Object for Issues for 12c and Onward
zlgeneric SRDC - Zero Data Loss Recovery Appliance (ZDLRA) Data Collection.
[oracle@shdb01 ~]$
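Of the profiles listed, dbinstancecrash is the one whose description cites the same instance-termination checklist (Doc ID 2507010.1) given in the references. As a hedged illustration only, the Python sketch below assembles a collection command from the options shown in the help text. The helper name `build_srdc_command` is hypothetical, the database name `orcl` and the time window are assumptions taken from the trace paths and timestamps in this post, and the position of the profile name right after `-srdc` is inferred from the usage line, whose placeholders are missing in the output above. Nothing here has been run against tfactl.

```python
import shlex


def build_srdc_command(profile, database, start, end):
    """Assemble a 'tfactl diagcollect -srdc' command line as a string.

    Only options that appear in the help output are used:
    -srdc, -from, -to and -database. The command is built, not executed.
    """
    args = [
        "tfactl", "diagcollect",
        "-srdc", profile,
        "-from", start,
        "-to", end,
        "-database", database,
    ]
    # shlex.join quotes arguments containing spaces, so the result is
    # safe to paste into a shell.
    return shlex.join(args)


cmd = build_srdc_command(
    profile="dbinstancecrash",
    database="orcl",                # assumed from the diag directory name
    start="2021-04-15 13:00:00",    # assumed window around the incident
    end="2021-04-16 10:00:00",
)
print(cmd)
```

Building the string separately makes it easy to review the exact time window before running the collection on the affected host.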
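Working through alert.log from the earliest warning onward, I found that AWR snapshots had already stopped being generated at 14:00 on the 15th.

Query dba_hist_active_sess_history to see what the database was busy with before the problem: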

set lines 500
set long 9999
set pages 999
set serveroutput on size 1000000 
alter session set nls_date_format = 'yyyy/mm/dd hh24:mi:ss';
alter session set nls_timestamp_format = 'yyyy-mm-dd hh24.mi.ss.ff';

select instance_number, sample_id,sample_time,count(*) cnt
from dba_hist_active_sess_history where SAMPLE_TIME between 
 TO_TIMESTAMP('2021/04/15 13:00', 'yyyy/mm/dd hh24:mi') and
TO_TIMESTAMP('2021/04/16 10:00', 'yyyy/mm/dd hh24:mi')
group by instance_number, sample_id,sample_time
order by instance_number, sample_id,sample_time;  

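No data there either (there were hardly any sessions before the crash anyway; what workload would a test database have at 8:30 in the morning?).

The M000 process left no trace file; only the J000 trace shows this message every 2 seconds: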
Process J000 is dead ... state=KSOSP_SPAWNED

The operating system's messages log shows OOM errors at the time of the problem:

Apr 11 18:58:27 host auditd[1737]: Audit daemon rotating log files
Apr 11 22:05:14 host auditd[1737]: Audit daemon rotating log files
Apr 12 11:09:15 host auditd[1737]: Audit daemon rotating log files
Apr 12 20:10:36 host auditd[1737]: Audit daemon rotating log files
Apr 13 13:01:16 host auditd[1737]: Audit daemon rotating log files
Apr 13 19:23:24 host auditd[1737]: Audit daemon rotating log files
Apr 14 13:49:20 host auditd[1737]: Audit daemon rotating log files
Apr 15 12:23:09 host auditd[1737]: Audit daemon rotating log files
Apr 15 17:34:48 host kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Apr 15 17:34:48 host kernel: oracle cpuset=/ mems_allowed=0
Apr 15 17:34:48 host kernel: Pid: 3388, comm: oracle Tainted: G --------------- T 2.6.32-431.el6.x86_64 #1
Apr 15 17:34:48 host kernel: Call Trace:
Apr 15 17:34:48 host kernel: [] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 15 17:34:48 host kernel: [] ? dump_header+0x90/0x1b0
Apr 15 17:34:48 host kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 15 17:34:48 host kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 15 17:34:48 host kernel: [] ? select_bad_process+0xe1/0x120
Apr 15 17:34:48 host kernel: [] ? out_of_memory+0x220/0x3c0
Apr 15 17:34:48 host kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 15 17:34:48 host kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 15 17:34:48 host kernel: [] ? __page_cache_alloc+0x87/0x90
Apr 15 17:34:48 host kernel: [] ? find_get_page+0x1e/0xa0
Apr 15 17:34:48 host kernel: [] ? filemap_fault+0x1a7/0x500
Apr 15 17:34:48 host kernel: [] ? __do_fault+0x54/0x530
Apr 15 17:34:48 host kernel: [] ? handle_pte_fault+0xf7/0xb00
Apr 15 17:34:48 host kernel: [] ? rb_reserve_next_event+0xb4/0x370
Apr 15 17:34:48 host kernel: [] ? native_sched_clock+0x13/0x80
Apr 15 17:34:48 host kernel: [] ? rb_reserve_next_event+0xb4/0x370
Apr 15 17:34:48 host kernel: [] ? native_sched_clock+0x13/0x80
Apr 15 17:34:48 host kernel: [] ? handle_mm_fault+0x22a/0x300
Apr 15 17:34:48 host kernel: [] ? __do_page_fault+0x138/0x480
Apr 15 17:34:48 host kernel: [] ? thread_group_times+0x3d/0x120
Apr 15 17:34:48 host kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
Apr 15 17:34:48 host kernel: [] ? mmput+0x1e/0x120
Apr 15 17:34:48 host kernel: [] ? trace_nowake_buffer_unlock_commit+0x43/0x60
Apr 15 17:34:48 host kernel: [] ? ftrace_raw_event_sys_exit+0xb9/0xc0
Apr 15 17:34:48 host kernel: [] ? do_page_fault+0x3e/0xa0
Apr 15 17:34:48 host kernel: [] ? page_fault+0x25/0x30
Apr 15 17:34:48 host kernel: Mem-Info:
Apr 15 17:34:48 host kernel: Node 0 DMA per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: CPU 3: hi: 0, btch: 1 usd: 0
Apr 15 17:34:48 host kernel: Node 0 DMA32 per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 3: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: Node 0 Normal per-cpu:
Apr 15 17:34:48 host kernel: CPU 0: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: CPU 2: hi: 186, btch: 31 usd: 23
Apr 15 17:34:48 host kernel: CPU 3: hi: 186, btch: 31 usd: 0
Apr 15 17:34:48 host kernel: active_anon:463690 inactive_anon:140220 isolated_anon:0
Apr 15 17:34:48 host kernel: active_file:245 inactive_file:504 isolated_file:0
Apr 15 17:34:48 host kernel: unevictable:0 dirty:11 writeback:0 unstable:0
Apr 15 17:34:48 host kernel: free:22140 slab_reclaimable:10286 slab_unreclaimable:84990
Apr 15 17:34:48 host kernel: mapped:11740 shmem:46989 pagetables:215832 bounce:0
Apr 15 17:34:48 host kernel: Node 0 DMA free:15684kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15292kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 3000 4010 4010
Apr 15 17:34:48 host kernel: Node 0 DMA32 free:55048kB min:50372kB low:62964kB high:75556kB active_anon:1627600kB inactive_anon:333656kB active_file:916kB inactive_file:1996kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:32kB writeback:0kB mapped:23000kB shmem:120460kB slab_reclaimable:23224kB slab_unreclaimable:199140kB kernel_stack:27016kB pagetables:538512kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 0 1010 1010
Apr 15 17:34:48 host kernel: Node 0 Normal free:17828kB min:16956kB low:21192kB high:25432kB active_anon:227160kB inactive_anon:227224kB active_file:64kB inactive_file:20kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:12kB writeback:0kB mapped:23960kB shmem:67496kB slab_reclaimable:17920kB slab_unreclaimable:140820kB kernel_stack:7400kB pagetables:324816kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:16 all_unreclaimable? no
Apr 15 17:34:48 host kernel: lowmem_reserve[]: 0 0 0 0
Apr 15 17:34:48 host kernel: Node 0 DMA: 1*4kB 4*8kB 2*16kB 2*32kB 3*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15684kB
Apr 15 17:34:48 host kernel: Node 0 DMA32: 5534*4kB 1983*8kB 236*16kB 85*32kB 110*64kB 28*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55120kB
Apr 15 17:34:48 host kernel: Node 0 Normal: 4451*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17828kB
Apr 15 17:34:48 host kernel: 50408 total pagecache pages
Apr 15 17:34:48 host kernel: 2357 pages in swap cache
Apr 15 17:34:48 host kernel: Swap cache stats: add 13218790, delete 13216433, find 24808505/26126131
Apr 15 17:34:48 host kernel: Free swap = 0kB
Apr 15 17:34:48 host kernel: Total swap = 4194296kB
Apr 15 17:34:48 host kernel: 1048560 pages RAM
Apr 15 17:34:48 host kernel: 67274 pages reserved
Apr 15 17:34:48 host kernel: 179030 pages shared
Apr 15 17:34:48 host kernel: 816137 pages non-shared
Apr 15 17:34:48 host kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Apr 15 17:34:48 host kernel: [ 465] 0 465 2814 1 1 -17 -1000 udevd
Apr 15 17:34:48 host kernel: [ 1589] 0 1589 47371 136 0 0 0 vmtoolsd
Apr 15 17:34:48 host kernel: [ 1737] 0 1737 23300 73 2 -17 -1000 auditd
Apr 15 17:34:48 host kernel: [ 1739] 0 1739 20521 80 1 0 0 audispd
Apr 15 17:34:48 host kernel: [ 1740] 0 1740 5301 42 0 0 0 sedispatch
Apr 15 17:34:48 host kernel: [ 1814] 0 1814 2705 44 2 0 0 irqbalance
Apr 15 17:34:48 host kernel: [ 1833] 32 1833 4759 22 0 0 0 rpcbind
Apr 15 17:34:48 host kernel: [ 1942] 0 1942 3396 44 3 -17 -1000 lldpad

Free swap = 0kB?
Most likely the host ran out of memory. I will deploy OSWatcher (OSW) and keep observing.
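To put the Mem-Info counters in perspective, they can be converted from pages to MiB. The following is a rough Python sketch using only numbers copied from the OOM dump; the 4 kB page size is an assumption that is consistent with the dump (the summary `pagetables:215832` pages equals the 538512 kB + 324816 kB reported for the DMA32 and Normal zones), and the dictionary keys are names chosen here for illustration.

```python
PAGE_KB = 4  # assumed base page size on x86_64, consistent with the dump


def pages_to_mib(pages):
    """Convert a page count from the Mem-Info block to MiB."""
    return pages * PAGE_KB / 1024


# Values copied from the kernel OOM report in the messages log.
oom = {
    "ram_pages": 1048560,
    "active_anon": 463690,
    "inactive_anon": 140220,
    "pagetables": 215832,
    "free": 22140,
    "swap_total_kb": 4194296,
    "swap_free_kb": 0,
}

ram_mib = pages_to_mib(oom["ram_pages"])
anon_mib = pages_to_mib(oom["active_anon"] + oom["inactive_anon"])
pt_mib = pages_to_mib(oom["pagetables"])
free_mib = pages_to_mib(oom["free"])
pt_share = oom["pagetables"] / oom["ram_pages"]
swap_used_kb = oom["swap_total_kb"] - oom["swap_free_kb"]

print(f"RAM         : {ram_mib:8.0f} MiB")
print(f"anonymous   : {anon_mib:8.0f} MiB")
print(f"page tables : {pt_mib:8.0f} MiB ({pt_share:.1%} of RAM)")
print(f"free        : {free_mib:8.0f} MiB")
print(f"swap used   : {swap_used_kb / 1024:8.0f} MiB of {oom['swap_total_kb'] / 1024:.0f} MiB")
```

On a host with about 4 GiB of RAM, page tables alone account for roughly 843 MiB, about a fifth of memory, and all 4 GiB of swap is in use. That supports the out-of-memory reading and gives a concrete figure to watch once OSWatcher data is available.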