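During the Spring Festival holiday I got a call from a customer: after the hosts were rebooted, not a single RAC instance would come up (a 4-node RAC: two fully configured 570s plus two half-configured 570s). One host was booting extremely slowly, another was throwing errors, and two of the four nodes were reporting hardware errors! Luckily I was spending this Spring Festival in Shanghai, so after a quick rundown of the situation I rushed to the site. On the way I tried to get hold of a hardware engineer. Damn it, there is only one engineer in Shanghai, and he has moved over to sales, so he probably wasn't going to show up. I phoned the company to have a hardware engineer dispatched and nobody picked up. I phoned the company's 800 number and, damn it, still nobody picked up. So much for 7×24 support and the 800 hotline; it's all hot air. Before the deal is signed they promise the moon, when something breaks you can't find anyone, and when you finally do they send a rookie. I might as well handle it myself, amateur that I am... If I weren't on good terms with the customer, they would have blown up long ago... OK, rant over, on to the problem...
I don't know the hardware well, so I started by checking why RAC would not come up, beginning with the crsd process log: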
2011-02-07 15:03:03.869: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2011-02-07 15:03:05.254: [ COMMCRS][351]clsc_connect: (1103b91d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_secu_crs))
2011-02-07 15:03:05.254: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2011-02-07 15:03:05.256: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2011-02-07 15:03:06.590: [ COMMCRS][353]clsc_connect: (1103b91d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_secu_crs))
2011-02-07 15:03:06.590: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2011-02-07 15:03:06.590: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2011-02-07 15:03:07.973: [ COMMCRS][355]clsc_connect: (1103b91d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_secu_crs))
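When a crsd log repeats the same few lines like the excerpt above, a quick tally makes it obvious which dependency CRS is waiting on. The following is only a minimal sketch, not an Oracle tool: the function name `summarize_crsd_log` and both regular expressions are my own assumptions, derived purely from the lines shown above, and other Clusterware versions may format these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the crsd excerpt above (assumption: the format is stable).
NOT_READY = re.compile(r"CSS is not ready\. Received status (\d+) from CSS")
NO_LISTENER = re.compile(
    r"no listener at \(ADDRESS=\(PROTOCOL=ipc\)\(KEY=([^)]+)\)\)"
)


def summarize_crsd_log(lines):
    """Count 'CSS is not ready' status codes and unreachable IPC keys."""
    statuses = Counter()
    keys = Counter()
    for line in lines:
        m = NOT_READY.search(line)
        if m:
            statuses[int(m.group(1))] += 1
        m = NO_LISTENER.search(line)
        if m:
            keys[m.group(1)] += 1
    return statuses, keys


# Three lines copied from the excerpt above.
sample = [
    "2011-02-07 15:03:03.869: [ CRSRTI][1]32CSS is not ready. "
    "Received status 3 from CSS. Waiting for good status ..",
    "2011-02-07 15:03:05.254: [ COMMCRS][351]clsc_connect: (1103b91d0) "
    "no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_secu_crs))",
    "2011-02-07 15:03:05.256: [ CRSRTI][1]32CSS is not ready. "
    "Received status 3 from CSS. Waiting for good status ..",
]

statuses, keys = summarize_crsd_log(sample)
print(dict(statuses), dict(keys))
```

If every hit points at the same IPC key, as it does here, the daemon that should be listening on that key is the next thing to look at.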
So cssd had not started. I went on to check the cssd log and found the following:
[ CSSD]2011-02-07 15:13:08.415 >node3: Copyright 2011, Oracle version 10.2.0.4.0
[ CSSD]2011-02-07 15:13:08.415 >node3: CSS daemon log for node node1, number 1, in cluster crs
[ CSSD]2011-02-07 15:13:08.421 [1] >TRACE: clssscmain: local-only set to false
[ CSSD]2011-02-07 15:13:08.427 [1] >TRACE: clssnmReadNodeInfo: added node 1 (node1) to cluster
[ CSSD]2011-02-07 15:13:08.431 [1] >TRACE: clssnmReadNodeInfo: added node 2 (node2) to cluster
[ CSSD]2011-02-07 15:13:08.436 [1] >TRACE: clssnmReadNodeInfo: added node 3 (node3) to cluster
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CSSD))
[ CSSD]2011-02-07 15:13:08.441 [1] >TRACE: clssnmReadNodeInfo: added node 4 (node4) to cluster
[ CSSD]2011-02-07 15:13:08.444 [1] >TRACE: clssgmInitCMInfo: Wait for remote node termination set to 805306368 seconds
[ CSSD]2011-02-07 15:13:08.446 [1029] >TRACE: clssnm_skgxninit: Compatible vendor clusterware not in use
[ CSSD]2011-02-07 15:13:08.446 [1029] >TRACE: clssnm_skgxnmon: skgxn init failed
[ CSSD]2011-02-07 15:13:08.447 [1] >TRACE: clssnmNMInitialize: misscount set to (30)
[ CSSD]2011-02-07 15:13:08.448 [1] >TRACE: clssnmNMInitialize: Network heartbeat thresholds are: impending reconfig 15000 ms, reconfig start (misscount) 30000 ms
[ CSSD]2011-02-07 15:13:08.451 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/voting1)
[ CSSD]2011-02-07 15:13:08.452 [1030] >TRACE: clssnmvDPT: spawned for disk 0 (/dev/voting1)
[ CSSD]2011-02-07 15:13:08.453 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (1//dev/voting2)
[ CSSD]2011-02-07 15:13:08.453 [1287] >TRACE: clssnmvDPT: spawned for disk 1 (/dev/voting2)
[ CSSD]2011-02-07 15:13:08.455 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (2//dev/voting3)
[ CSSD]2011-02-07 15:13:08.455 [1544] >TRACE: clssnmvDPT: spawned for disk 2 (/dev/voting3)
[ CSSD]2011-02-07 15:13:10.464 [1030] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/voting1)
[ CSSD]2011-02-07 15:13:10.464 [1801] >TRACE: clssnmvKillBlockThread: spawned for disk 0 (/dev/voting1) initial sleep interval (1000)ms
[ CSSD]2011-02-07 15:13:10.464 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(13) wrtcnt(604) LATS(4844712) Disk lastSeqNo(604)
[ CSSD]2011-02-07 15:13:10.464 [1030] >TRACE: clssnmReadDskHeartbeat: node(3) is down. rcfg(11) wrtcnt(604) LATS(4844712) Disk lastSeqNo(604)
[ CSSD]2011-02-07 15:13:10.464 [1030] >TRACE: clssnmReadDskHeartbeat: node(4) is down. rcfg(14) wrtcnt(3085) LATS(4844712) Disk lastSeqNo(3085)
[ CSSD]2011-02-07 15:13:10.481 [1544] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (2//dev/voting3)
[ CSSD]2011-02-07 15:13:10.481 [2058] >TRACE: clssnmvKillBlockThread: spawned for disk 2 (/dev/voting3) initial sleep interval (1000)ms
[ CSSD]2011-02-07 15:13:10.481 [1544] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(13) wrtcnt(604) LATS(4844729) Disk lastSeqNo(604)
[ CSSD]2011-02-07 15:13:10.481 [1544] >TRACE: clssnmReadDskHeartbeat: node(3) is down. rcfg(11) wrtcnt(605) LATS(4844729) Disk lastSeqNo(605)
[ CSSD]2011-02-07 15:13:10.487 [1287] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (1//dev/voting2)
[ CSSD]2011-02-07 15:13:10.487 [2315] >TRACE: clssnmvKillBlockThread: spawned for disk 1 (/dev/voting2) initial sleep interval (1000)ms
[ CSSD]2011-02-07 15:13:10.488 [1] >TRACE: clssnmFatalInit: fatal mode enabled
[ CSSD]2011-02-07 15:13:10.500 [2829] >TRACE: clssnmClusterListener: Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=node1-priv)(PORT=49895))
[ CSSD]2011-02-07 15:13:10.500 [2829] >TRACE: clssnmClusterListener: Probing node node2 (2), probcon(1113fa5d0)
[ CSSD]2011-02-07 15:13:10.500 [2829] >TRACE: clssnmClusterListener: Probing node node3 (3), probcon(11156db50)
[ CSSD]2011-02-07 15:13:10.501 [2829] >TRACE: clssnmClusterListener: Probing node node4 (4), probcon(111570730)
[ CSSD]2011-02-07 15:13:10.501 [2829] >TRACE: clssnmDiscHelper: node2, node(2) connection failed, con (1113fa5d0), probe(1113fa5d0)
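The only thing that looked like an error was "clssnm_skgxnmon: skgxn init failed". I searched Metalink for it and found nothing useful. In fact, I had overlooked an important piece of information in this log: [ CSSD]2011-02-07 17:29:53.412 [1] >TRACE: clssgmInitCMInfo: Wait for remote node termination set to 805306368 seconds. Because of that I wasted a lot of time checking logs and rebooting hosts, and when I ran ps -ef|grep d.bin I also overlooked the parameter values of the oprocd process.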
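A value like 805306368 seconds is easy to skim past in a wall of TRACE lines, so it can help to pull the numeric settings out of the cssd log and print them in human units. This is a rough sketch under my own assumptions: the helper name `extract_css_settings` and the dictionary keys are hypothetical, and the regexes are based only on the two log lines quoted in this post. I am not asserting what the normal value of this setting is; the point is simply that 805306368 seconds is more than 25 years, next to a misscount of 30 seconds.

```python
import re

# Patterns based on the cssd lines quoted in this post (assumed stable format).
WAIT_RE = re.compile(r"Wait for remote node termination set to (\d+) seconds")
MISSCOUNT_RE = re.compile(r"misscount set to \((\d+)\)")


def extract_css_settings(lines):
    """Return the numeric settings found in cssd log lines (last value wins)."""
    settings = {}
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            settings["remote_node_wait_s"] = int(m.group(1))
        m = MISSCOUNT_RE.search(line)
        if m:
            settings["misscount_s"] = int(m.group(1))
    return settings


def seconds_to_days(seconds):
    """Convert seconds to days so oversized values stand out."""
    return seconds / 86400


# Two lines copied from the cssd log above.
sample = [
    "[ CSSD]2011-02-07 15:13:08.444 [1] >TRACE: clssgmInitCMInfo: "
    "Wait for remote node termination set to 805306368 seconds",
    "[ CSSD]2011-02-07 15:13:08.447 [1] >TRACE: clssnmNMInitialize: "
    "misscount set to (30)",
]

settings = extract_css_settings(sample)
for name, value in settings.items():
    print(f"{name}: {value} s = {seconds_to_days(value):.2f} days")
```

Run against a full ocssd log, a listing like this would have put the odd timeout right next to misscount instead of leaving it buried among the heartbeat messages.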