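I had already written this post once, so how did I manage to delete it... How frustrating...
While I still remember some of it, I am writing it up again today:
Continuing from the previous post: I connected to node1 and kept checking the logs. After starting the stack with crsctl start crs, I checked the state of the processes: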
ps -ef|grep d.bin
[node1:root:/] ps -ef|grep d.bin
root 188748 141248 0 19:54:44 - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 805306365000 -hsi 0:0:50:75:90
root 238046 239336 0 19:54:44 - 0:03 /u01/app/oracle/product/10.2.0/crs/bin/ocssd.bin
root 176750 3187142 0 21:32:30 pts/1 0:00 grep d.bin
root 124406 144798 0 19:54:44 - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/crsd.bin reboot
root 152910 184994 0 19:54:25 - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/evmd.bin
[node1:root:/]
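This time I finally did not overlook the arguments of the oprocd process! 805306365000 looks very much like an overflow value. I searched Metalink with "oprocd 805306365000" as the keyword and found
Note ID 1277538.1, which describes a situation similar to mine, down to the same error messages: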
[ CSSD]2011-02-07 18:05:17.317 [4371] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2011-02-07 18:05:17.317 [4371] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2011-02-07 18:05:18.343 [2829] >WARNING: clssnmeventhndlr: Receive failure with node 3 (user), state 3, con(1112f4ef0), probe(0), rc=11
[ CSSD]2011-02-07 18:05:18.343 [2829] >TRACE: clssnmDiscHelper: node3, node(3) connection failed, con (1112f4ef0), probe(0)
[ CSSD]2011-02-07 18:05:18.343 [2829] >TRACE: clssnmDiscHelper: node 3 clean up, con (1112f4ef0), init state 3, cur state 3
[ CSSD]2011-02-07 18:05:18.343 [3857] >TRACE: clssgmPeerEventHndlr: receive failed, node 3 (user) (111e1b5b0), rc 11
[ CSSD]2011-02-07 18:05:18.343 [3857] >TRACE: clssgmPeerDeactivate: node 3 (node3), death 0, state 0x1 connstate 0xf
[ CSSD]2011-02-07 18:05:18.343 [3857] >TRACE: clssgmPeerListener: discarded 0 future msgsfor 3
[ CSSD]2011-02-07 18:05:18.417 [1] >ERROR: clssgmStartNMMon: timed out waiting on nested NM reconfig. Self-sacrificing to kick others awake.
[ CSSD]2011-02-07 18:05:18.417 [1] >ERROR: StartCMMon(): clssnmNMDetach failed - 2
[ CSSD]2011-02-07 18:05:18.417 [1] >ERROR: ###################################
[ CSSD]2011-02-07 18:05:18.417 [1] >ERROR: clssscExit: CSSD aborting
[ CSSD]2011-02-07 18:05:18.417 [1] >ERROR: ###################################
Earlier, searching with "clssnmNMDetach failed - 2" as the keyword had not turned up any useful notes. Next I ran crsctl get css:
[node1:oracle:/home/oracle] crsctl get css diagwait
805306368[node1:oracle:/home/oracle]
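Right, the error is pretty much what the note describes, so this should be the cause! A small moment of joy. But why did this value become so large? Now is not the time to chase that question; first change it the way Oracle recommends and look for the cause later: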
Solution
- Schedule a downtime to reset the diagwait back to 13 per Document 559365.1
- Ensure CRS is down on all cluster nodes before modifying the diagwait parameter setting.
- Check diagwait setting and oprocd process after CRS restart.
The following output is expected:
> crsctl get css diagwait
13
> ps -ef | grep oprocd (now show -m 10000)
root 21062 20881 0 Dec 11 ? 0:23 /oracle/crs/product/10/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:
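A side note on the numbers: the note pairs diagwait 13 with -m 10000, and node1 pairs diagwait 805306368 with -m 805306365000. Both pairs fit margin = (diagwait - 3) * 1000 ms. That formula is only inferred from these two data points in this post, not taken from Oracle documentation, and the function names below are made up for illustration. A minimal bash sketch of the arithmetic, plus a helper that pulls the -m value out of a ps line:

```shell
#!/bin/bash
# Hypothetical helpers. The formula is an assumption inferred from the two
# (diagwait, -m) pairs shown in this post, not from Oracle documentation.

# Print the oprocd margin in milliseconds expected for a diagwait in seconds.
expected_margin_ms() {
    echo $(( ($1 - 3) * 1000 ))
}

# Read `ps -ef` output on stdin and print the value that follows "-m"
# on the first oprocd line. The field is printed as text, so large values
# are not rounded.
oprocd_margin_from_ps() {
    awk '/oprocd/ { for (i = 1; i < NF; i++) if ($i == "-m") { print $(i + 1); exit } }'
}

expected_margin_ms 13          # prints 10000
expected_margin_ms 805306368   # prints 805306365000
```

If the inference holds, comparing `ps -ef | oprocd_margin_from_ps` with `expected_margin_ms 13` after a restart is a quick way to see whether the new diagwait was actually picked up.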
Steps to make the change:
- Execute as root
#crsctl stop crs
#<CRS_HOME>/bin/oprocd stop
- Ensure that Clusterware stack is down on all nodes by executing
#ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no processes. If there are clusterware processes running and you proceed to the next step, you will corrupt your OCR. Do not continue until the clusterware processes are down on all the nodes of the cluster.
- From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force
- Check if diagwait is set successfully by executing the following command. The command should return 13. If diagwait is not set, the following message will be returned: "Configuration parameter diagwait is not defined"
#crsctl get css diagwait
- Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs
- Validate that the node is running by executing:
#crsctl check crs
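The "make sure everything is down" step above is the one that can corrupt the OCR if it is skipped, so it may be worth scripting. Below is a rough sketch, assuming bash and a grep that supports -E. The function name clusterware_is_down is hypothetical, and the pattern is simply the one from the note. It also filters out the grep/egrep process itself, which the plain command in the note would list as a false hit.

```shell
#!/bin/bash
# Hypothetical helper: reads `ps -ef` output on stdin, prints any clusterware
# daemons still running, and returns 1 if there are any (0 if the stack
# looks down). Lines belonging to grep/egrep itself are ignored.
clusterware_is_down() {
    if grep -E 'crsd\.bin|ocssd\.bin|evmd\.bin|oprocd' | grep -v grep; then
        return 1
    fi
    return 0
}

# Example use on each node, as root, before changing diagwait:
#   ps -ef | clusterware_is_down && echo "stack is down on this node"
```

This only looks at the node it runs on; the note requires the same result on every node of the cluster before running crsctl set css diagwait.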
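I stopped CRS and issued the change command:
#crsctl set css diagwait 13 -force
It sat there for ages without returning. I opened a new window and ran crsctl get css diagwait, which also returned nothing, reported no error, and could not be cancelled either. With no other option, I opened another connection and rebooted the machine once more. Before rebooting, I deleted everything under /tmp/.oracle/. Once the machine was back up, crsctl get css diagwait returned 13, so it looked like the change had gone through after all! I checked the process arguments again and they were now correct, so CRS should have been able to start. I ran crsctl check crs again and, after a very long wait, finally saw css come up, but the other two processes still would not start... Lost again... The logs showed no obvious errors either. Every command took ages: the diagwait change never finished, yet the parameter was changed; css had started, yet crsctl check css still took ages... node2 had been rebooting for over an hour and was still not back...
I started to suspect the storage again. The list of fallback plans still had two untried options: shutting down all nodes and starting a single node to test CRS, and restarting the storage. Since the system seemed to hang after every command, shutting down the other 3 nodes and testing with one node felt unlikely to help. I remembered that during an earlier node delete/add CRS would not start (I do not remember clearly whether it was CRS failing to start or some other problem); every method had been tried without result, and restarting the storage fixed it. So the remaining plan was to restart the storage!
What do I do if CRS still will not start after the storage is restarted? If I have to reinstall, there are still twenty-four or twenty-five hours until the market opens, which would be enough for that alone, but with the data import, application testing and so on added, finishing within that time looks doubtful. And if CRS fails to start at the Clusterware stage of the reinstall, then... Just thinking about these consequences drives me crazy... I went to the machine room to have a look; node2 still had not finished rebooting... Never mind... I just pressed the power button...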