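Judging from the cssd log, everything looked fairly normal, so why wouldn't it come up? According to the Oracle note Troubleshooting 10g and 11.1 Clusterware Reboots [ID 265769.1], two processes can cause a node reboot: ocssd.bin and oprocd. But the node was not rebooting at all. crsctl start crs did start these processes, yet crsctl check crs simply hung. After rebooting the node several times I found that cssd could in fact come up: occasionally crsctl check css showed the css process running, but the status check took a very long time to return. I vaguely suspected an IO problem, and the customer also suggested checking IO first. A dd test showed that the OCR file was readable, so IO did not appear to be the problem. After a while, all the *d.bin processes had stopped, and the node still had not rebooted.....
With no answer for the moment, I went to check the two midrange servers that had had hardware problems: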
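The dd read test mentioned above could look roughly like the sketch below. It is only an illustration: the helper name `read_check`, the 1 MiB block size and the default block count are assumptions, not the values used in the original session. A successful read only proves that the file is readable; it says nothing about latency, so it cannot rule out slow IO by itself.

```shell
# Hypothetical helper (not from the original session): read the first
# COUNT blocks of 1 MiB from a file or device with dd and report the result.
read_check() {
    path=$1
    count=${2:-10}    # number of 1 MiB blocks to read (assumed default)
    if dd if="$path" of=/dev/null bs=1048576 count="$count" 2>/dev/null; then
        echo "OK: read $path"
    else
        echo "FAIL: could not read $path"
        return 1
    fi
}
```

Running it as `time read_check <path to the OCR file> 100` (substitute the real OCR location) also gives a rough idea of how long the read takes, which is closer to what a slow crsctl check css hints at.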
errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
BFE4C025 0207150811 P H sysplanar0 UNDETERMINED ERROR
BFE4C025 0207150011 P H sysplanar0 UNDETERMINED ERROR
A6DF45AA 0207145811 I O RMCdaemon The daemon is started.
9DBCFDEE 0207145411 T O errdemon ERROR LOGGING TURNED ON
192AC071 0207142011 T O errdemon ERROR LOGGING TURNED OFF
BFE4C025 0207140611 P H sysplanar0 UNDETERMINED ERROR
A6DF45AA 0207140011 I O RMCdaemon The daemon is started.
2BFA76F6 0207135611 T S SYSPROC SYSTEM SHUTDOWN BY USER
9DBCFDEE 0207135811 T O errdemon ERROR LOGGING TURNED ON
192AC071 0201223311 T O errdemon ERROR LOGGING TURNED OFF
A6DF45AA 0130225711 I O RMCdaemon The daemon is started.
2BFA76F6 0130225511 T S SYSPROC SYSTEM SHUTDOWN BY USER
9DBCFDEE 0130225711 T O errdemon ERROR LOGGING TURNED ON
192AC071 0130225211 T O errdemon ERROR LOGGING TURNED OFF
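The TIMESTAMP column in the summary above is in MMDDhhmmYY form, so 0207150811 is Feb 7, 15:08, 2011, which matches the Date/Time in the detailed report. A small sketch for decoding it follows; the function name `decode_errpt_ts` is made up for illustration, and it assumes a year in the 2000s.

```shell
# Decode an errpt summary TIMESTAMP (MMDDhhmmYY) into "20YY-MM-DD hh:mm".
# Assumes a 21st-century year, which holds for the entries in this log.
decode_errpt_ts() {
    ts=$1
    mon=$(printf '%s' "$ts" | cut -c1-2)
    day=$(printf '%s' "$ts" | cut -c3-4)
    hh=$(printf '%s' "$ts" | cut -c5-6)
    mi=$(printf '%s' "$ts" | cut -c7-8)
    yy=$(printf '%s' "$ts" | cut -c9-10)
    printf '20%s-%s-%s %s:%s\n' "$yy" "$mon" "$day" "$hh" "$mi"
}

decode_errpt_ts 0207150811    # prints 2011-02-07 15:08
```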
errpt -aj BFE4C025
LABEL: SCAN_ERROR_CHRP
IDENTIFIER: BFE4C025
Date/Time: Mon Feb 7 15:08:54 BEIST 2011
Sequence Number: 16758
Machine Id: 00CE63EF4C00
Node Id: secusz
Class: H
Type: PERM
Resource Name: sysplanar0
Resource Class: planar
Resource Type: sysplanar_rspc
Location:
Description
UNDETERMINED ERROR
Failure Causes
UNDETERMINED
Recommended Actions
RUN SYSTEM DIAGNOSTICS.
Detail Data
PROBLEM DATA
0644 00E0 0000 05FC 9600 8E00 0000 0000 0000 0000 4942 4D00 5048 0030 0100 3F30
2011 0207 0642 4350 2011 0207 0642 4350 4500 0106 0000 0000 0000 0000 0000 0000
501D 5CD8 501D 5CD8 5548 0018 0100 3F30 6103 4400 0000 0000 0000 A004 0000 0000
5053 00F0 0101 3F30 0201 0002 0000 00E8 003C 0004 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 3131 3030 3135 3130 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 2020 C000 0028 4C2B 4C14 5537 3837 392E 3030
312E 4451 4458 4644 5700 0000 4944 1CCD 5057 5253 504C 5900 0000 0000 0000 0000
0000 0000 0000 0000 5045 1800 3931 3137 2D35 3730 3036 4536 3345 4600 0000 0000
.... ....
The system recommends running diagnostics, so let's do that:
The diag output shows:
The following informational event was reported by Platform Firmware.
CEC hardware System resources deconfigured by system due to prior error event.
Supporting data:
SRC: B150FD00
Additional Words: 2-010000F0 3-28DA0110 4-C1009002 5-000000FF
6-00000002 7-00000000 8-00000000 9-00000000
Error log information:
Date: Mon Feb 7 13:56:02 BEIST 2011
Sequence number: 2408
Label: SCAN_ERROR_CHRP
Press Enter or Cancel to return to the
application.
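A quick Google search suggested that some memory had been deconfigured, and that with CEC-type errors like this you should not reboot the machine. I asked Nongxian (农仙), who also said it was a memory problem, so that seemed more or less confirmed! But we had already rebooted several times and nothing seemed wrong; lsattr -El mem0 showed the correct amount of memory, so apparently things had recovered after a reboot. So I cleared the error log, rebooted the machine, and everything checked out as normal. As for why the memory had been deconfigured in the first place, Nongxian said you have to connect to ASMI to find the cause, which I don't know how to do! I'll leave that one to the host engineers. It has no impact on the system now, at least for the moment!
Checking the system log on the other machine, I ran a diag there as well: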
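The lsattr check above can be scripted if it needs repeating after each reboot. This is a minimal sketch: the helper name `mem_check` and the idea of passing in the expected size are assumptions, and it relies only on the `size` attribute (total physical memory in MB) that lsattr -El mem0 reports.

```shell
# Hypothetical helper: read `lsattr -El mem0` output on stdin and compare
# the `size` attribute (in MB) with the amount the machine should have.
mem_check() {
    expected=$1
    awk -v expected="$expected" '
        $1 == "size" { size = $2 }
        END {
            if (size == "") { print "UNKNOWN: no size attribute found"; exit 2 }
            if (size + 0 == expected + 0) { print "OK: " size " MB" }
            else { print "MISMATCH: " size " MB, expected " expected " MB"; exit 1 }
        }'
}
```

On the AIX host it would be called as `lsattr -El mem0 | mem_check <expected MB>`, with the expected value taken from the machine's configuration.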
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):
11001510: Power/Cooling subsystem Unrecovered Error, bypassed with loss of
redundancy. Refer to the system service documentation for more
information.
Additional Words: 2-003C0004 3-00000000 4-00000000 5-00000000
6-00000000 7-00000000 8-00000000 9-00000000
Error log information:
Date: Mon Feb 7 15:08:54 BEIST 2011
Sequence number: 16758
Label: SCAN_ERROR_CHRP
Priority: L FRU: PWRSPLY Location:
U7879.001.DQDXFDW
Priority: L FRU: 10N8505 S/N: YL11C7157160 CCIN: 28EA Location:
U7879.001.DQDYBNR-P1-C8
Use Enter to continue.
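So it was a power problem. I assumed a power supply had failed, but after talking it over with the on-site engineers it turned out that one power supply's plug had not been seated properly when they powered the machine on. It is fine now, but the amber light on the system panel was still lit, so I cleared it with a command and then rebooted this host. The boot was painfully slow: an hour later it still had not finished!!.... While the host was rebooting I went back to the RAC problem, and this time I found something new :)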