The First RAC Failure I Handled After Spring Festival -- 3 - itpub.com.cn - ChinaUnix Blog

Niconicowang.blog.chinaunix.net


Category: Oracle

2011-02-14 15:50:59

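I had already written this post once, so how did I manage to delete it..... frustrating.......
While I still remember some of it, I am writing it up again today:
Continuing from the previous part: I connected to node1 and kept checking the logs. After starting the stack with crsctl start crs, I checked the state of the processes: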
ps -ef|grep d.bin
[node1:root:/] ps -ef|grep d.bin
    root 188748 141248   0 19:54:44      - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 805306365000 -hsi 0:0:50:75:90
    root 238046 239336   0 19:54:44      - 0:03 /u01/app/oracle/product/10.2.0/crs/bin/ocssd.bin
    root 176750 3187142   0 21:32:30 pts/1 0:00 grep d.bin
    root 124406 144798   0 19:54:44      - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/crsd.bin reboot
    root 152910 184994   0 19:54:25      - 0:00 /u01/app/oracle/product/10.2.0/crs/bin/evmd.bin
[node1:root:/]
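The value to look at in that listing is the argument after -m on the oprocd.bin line. A minimal sketch for pulling it out of `ps -ef` output, assuming the argument layout shown above; the function name `oprocd_margin_arg` is made up for illustration:

```sh
# Reads `ps -ef` output on stdin and prints the value that follows
# the -m flag on every line that mentions oprocd.bin.
oprocd_margin_arg() {
    awk '/oprocd\.bin/ { for (i = 1; i < NF; i++) if ($i == "-m") print $(i + 1) }'
}

# Possible usage on a live node:
#   ps -ef | oprocd_margin_arg
```

Fed the oprocd.bin line from the listing above, this prints 805306365000.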
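This time I finally did not overlook the arguments of the oprocd process! 805306365000 looked very much like an overflowed value. I searched Metalink with "oprocd 805306365000" as the keywords and found Note ID 1277538.1, which describes a situation similar to mine, down to identical error messages: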
[    CSSD]2011-02-07 18:05:17.317 [4371] >TRACE:   clssnmSendingThread: sending status msg to all nodes
[    CSSD]2011-02-07 18:05:17.317 [4371] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes
[    CSSD]2011-02-07 18:05:18.343 [2829] >WARNING: clssnmeventhndlr: Receive failure with node 3 (user), state 3, con(1112f4ef0), probe(0), rc=11
[    CSSD]2011-02-07 18:05:18.343 [2829] >TRACE:   clssnmDiscHelper: node3, node(3) connection failed, con (1112f4ef0), probe(0)
[    CSSD]2011-02-07 18:05:18.343 [2829] >TRACE:   clssnmDiscHelper: node 3 clean up, con (1112f4ef0), init state 3, cur state 3
[    CSSD]2011-02-07 18:05:18.343 [3857] >TRACE:   clssgmPeerEventHndlr: receive failed, node 3 (user) (111e1b5b0), rc 11
[    CSSD]2011-02-07 18:05:18.343 [3857] >TRACE:   clssgmPeerDeactivate: node 3 (node3), death 0, state 0x1 connstate 0xf
[    CSSD]2011-02-07 18:05:18.343 [3857] >TRACE:   clssgmPeerListener: discarded 0 future msgsfor 3
[    CSSD]2011-02-07 18:05:18.417 [1] >ERROR:   clssgmStartNMMon: timed out waiting on nested NM reconfig. Self-sacrificing to kick others awake.
[    CSSD]2011-02-07 18:05:18.417 [1] >ERROR:   StartCMMon(): clssnmNMDetach failed - 2
[    CSSD]2011-02-07 18:05:18.417 [1] >ERROR:   ###################################
[    CSSD]2011-02-07 18:05:18.417 [1] >ERROR:   clssscExit: CSSD aborting
[    CSSD]2011-02-07 18:05:18.417 [1] >ERROR:   ###################################
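Most of a CSSD log is TRACE noise, so it can help to look only at the WARNING and ERROR entries. A small sketch, assuming the `>LEVEL:` marker format seen in the excerpt above; the function name `cssd_problems` and the file name in the usage comment are placeholders:

```sh
# Reads a CSSD log on stdin and prints only the WARNING and ERROR entries.
cssd_problems() {
    grep -E '>(WARNING|ERROR):'
}

# Possible usage (the log file name is a placeholder):
#   cssd_problems < ocssd.log
```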
At first I had searched with "clssnmNMDetach failed - 2" as the keywords and found no useful notes. Then I ran crsctl get css:
[node1:oracle:/home/oracle] crsctl get css diagwait
805306368[node1:oracle:/home/oracle]
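OK, the error matches what the note describes, so this should be the cause! I was a little pleased. But why had this value become so large? Now was not the time to chase that question; first apply the fix Oracle gives, and track down the cause later: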
Solution - Schedule a downtime to reset the diagwait back to 13 per Document 559365.1

- Ensure CRS is down on all cluster nodes before modifying the diagwait parameter setting.

- Check diagwait setting and oprocd process after CRS restart.

The following output is expected:

> crsctl get css diagwait
13
> ps -ef | grep oprocd (now show -m 10000)
root 21062 20881 0 Dec 11 ? 0:23 /oracle/crs/product/10/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:
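A quick sanity check on the numbers: the -m 805306365000 seen on node1 and the -m 10000 expected here both fit the relation margin = (diagwait - reboottime) * 1000 with a reboottime of 3 seconds. Both the formula and the 3-second value are assumptions that happen to reproduce the two figures; neither is stated in the excerpts quoted in this post. The function name is made up for illustration:

```sh
# Assumption: oprocd margin in ms = (diagwait - reboottime) * 1000,
# with reboottime assumed to be 3 seconds.
oprocd_margin_ms() {
    diagwait_s=$1
    reboottime_s=${2:-3}
    echo $(( (diagwait_s - reboottime_s) * 1000 ))
}

oprocd_margin_ms 13          # prints 10000
oprocd_margin_ms 805306368   # prints 805306365000
printf '0x%X\n' 805306368    # prints 0x30000000
```

The last line shows that the bad diagwait value is exactly 0x30000000, which supports the suspicion that it is a corrupted value and not something anyone set deliberately.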
Steps to make the change:

Execute as root

#crsctl stop crs
#/bin/oprocd stop
Ensure that Clusterware stack is down on all nodes by executing

#ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no processes. If there are clusterware processes running and you proceed to the next step, you will corrupt your OCR. Do not continue until the clusterware processes are down on all the nodes of the cluster.
From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:

#crsctl set css diagwait 13 -force
Check if diagwait is set successfully by executing the following command. The command should return 13. If diagwait is not set, the following message will be returned: "Configuration parameter diagwait is not defined"

#crsctl get css diagwait
Restart the Oracle Clusterware on all the nodes by executing:

#crsctl start crs
Validate that the node is running by executing:

#crsctl check crs
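
The "this should return no processes" check in the steps above is easy to get wrong by eye, so it could be wrapped in a small guard. This is only a sketch: the function name `clusterware_is_down` is made up, the pattern is copied from the egrep in the steps, and it only looks at the node it runs on, whereas the note requires the stack to be down on every node.

```sh
# Reads `ps -ef` output on stdin. Exits 0 only if no clusterware daemon
# (crsd.bin, ocssd.bin, evmd.bin, oprocd) is listed; lines belonging to
# a grep command itself are ignored.
clusterware_is_down() {
    awk 'BEGIN { found = 0 }
         /crsd\.bin|ocssd\.bin|evmd\.bin|oprocd/ && !/grep/ { found = 1 }
         END { exit found }'
}

# Possible usage, repeated on each node before changing diagwait:
#   ps -ef | clusterware_is_down && echo "stack is down on this node"
```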

I stopped CRS and issued the command to make the change:
#crsctl set css diagwait 13 -force
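It ran for ages without returning. I opened a new window and ran crsctl get css diagwait, which also returned nothing, reported no error, and could not be cancelled. With no other option, I opened another connection and rebooted the machine once more; before rebooting, I deleted everything under /tmp/.oracle/. Once the machine was back up, crsctl get css diagwait showed 13, so the change had apparently gone through! I checked the process arguments again and they were now correct, so CRS should be able to start. I ran crsctl check crs again and, after a long wait, finally saw css come up, but the other two daemons still would not start.... Once again I was lost.....

The logs showed no obvious errors. Every command took forever after I issued it. I had not waited for the diagwait command to finish, yet the parameter had still been changed; css had started, yet crsctl check css still took forever..... node2 had been rebooting for more than an hour and was still not back.... I started to suspect the storage again. Two of the fallback plans were still untried: shutting down all nodes and bringing up a single node to test CRS, and restarting the storage. Since the system seemed to hang after every command, shutting down the other 3 nodes and testing with a single node did not feel likely to work either. I remembered that during an earlier delete/add node operation CRS would not start (I do not remember clearly whether it was CRS failing to start or some other problem); every method had been tried without result, and restarting the storage fixed it. So the remaining plan was to restart the storage!

But if CRS still would not start after the storage restart, what would I do? If I had to reinstall, there were still twenty-four or twenty-five hours before the market opened, which would be enough for the installation itself, but with importing the data, testing the application and so on, finishing within that window looked doubtful. And if CRS refused to start at the Clusterware stage of the reinstall, then...... just thinking about these consequences drove me crazy...... I went to the machine room to take a look; node2 still had not finished rebooting..... Never mind... I just pressed the power button.....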
