2012-11-22 17:18:04

Original article: A RAC Troubleshooting Case    Author: itpub.com.cn

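Last Saturday at midnight, just as I was about to go to bed, the phone rang. A call at that hour is never good news. I didn't recognize the number, and only after picking up did I learn it was an HP engineer we had contracted, who was handling a failure at a customer site. The customer runs a two-node RAC on two HP midrange servers. Something done on the customer side had left the second node unable to enter multi-user mode; most likely someone had been careless on the system and deleted some operating system files, so the machine could only boot into maintenance mode. The second node therefore had to be reinstalled. The HP engineer cloned the system from the other node onto the second node and then changed the IP address, hostname and so on. Once Service Guard was configured, HA came up, but starting CRS on the second node reported the following error: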

Attempting to start CRS stack
Failure at scls_scr_create with code 1
Internal Error Information:
  Category: 1234
  Operation: scls_scr_create
  Location: mkdir
  Other: Unable to make user dir
  Dep: 2
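After struggling for a long time with no progress, I thought about rebooting and letting the system bring CRS up by itself. But after talking to the HP engineer I learned that on these hosts CRS is started manually after boot, so a reboot would be pointless. On Unix and Linux, the CRS start/stop scripts live in the init.d directory. I'm not very familiar with HP-UX and had to ask: on HP-UX that directory is /sbin/init.d, not /etc/init.d. CRS is started from that directory with the ./init.crs script, used as follows:
# ./init.crs xxx    <-- pass any bogus argument to make it print its usage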
Usage: ./init.crs {stop|start|enable|disable}
# ./init.crs start
This time the error message was actually useful:

/sbin/init.d/init.cssd[537]: /var/opt/oracle/scls_scr/rqtmsdb2/root/cssrun: Cannot create the specified file.
Startup will be queued to init within 30 seconds.
The error shows that CRS cannot create the cssrun file.
Let's check:
# cd /var/opt/oracle/scls_scr/rqtmsdb2/root/
sh: /var/opt/oracle/scls_scr/rqtmsdb2/root/:  not found.
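Huh, the directory doesn't exist!
# cd /var/opt/oracle/scls_scr/
One look at ls -l explains everything:

# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb1
Because this system was cloned from the first node, the directory that should be named rqtmsdb2 is still named rqtmsdb1. No wonder!
Fix it: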

# mv rqtmsdb1 rqtmsdb2
# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb2
# cd rq*
# ls -l
total 16
drwxr-xr-x   2 orarac     sys             96 Dec 31  2010 orarac
drwxr-xr-x   2 root       sys           8192 Nov 17 09:55 root
# cd root
# ls -l
total 48
-rw-rw-rw-   1 root       root             8 Nov 17 15:33 crsdboot
-rw-r--r--   1 root       sys              7 Dec 31  2010 crsstart
-rw-rw-rw-   1 root       sys              6 Nov 17 15:33 cssrun
-rw-r--r--   1 root       sys              0 Nov 17 15:33 noclsmon
-rw-rw-rw-   1 root       root             0 Nov 17 15:33 nooprocd
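
A cloned node can be checked for this kind of leftover before CRS is started. The following is a minimal sketch and not part of the original fix: check_scls_dir is a hypothetical helper, and it assumes the directory under /var/opt/oracle/scls_scr is named after the output of hostname, as it was on this system.

```shell
# Sketch: report whether the per-node scls_scr directory exists.
# check_scls_dir [SCLS_ROOT] [NODE_NAME]
# Assumption: the directory is named after `hostname` (true on this system).
check_scls_dir() {
    scls_root=${1:-/var/opt/oracle/scls_scr}
    node=${2:-$(hostname)}
    if [ -d "$scls_root/$node/root" ]; then
        echo "ok: $scls_root/$node/root exists"
        return 0
    fi
    # Show what is there instead, e.g. the source node's name after a clone.
    echo "missing: $scls_root/$node/root"
    echo "directories present under $scls_root:"
    ls "$scls_root" 2>/dev/null
    return 1
}
```

Running check_scls_dir with no arguments on the cloned node would have listed rqtmsdb1 as the only directory present, which points straight at the rename above.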
Start CRS again:

# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
# ps -ef|grep d.bin
    root 18734 22410 1 02:22:49 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
    root 2059 1 0 22:03:36 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
  orarac 18782 2057 0 02:23:09 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
  orarac 19013 19012 0 02:23:14 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
# /ora_soft/oracle/product/crs/bin/crlctl stop crs
sh: /ora_soft/oracle/product/crs/bin/crlctl: not found.
# /ora_soft/oracle/product/crs/bin/crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
# ps -ef|grep d.bin
    root 21987 22410 0 02:24:53 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# ps -ef|grep d.bin
    root 23992 22410 0 02:32:59 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
    root 23995 22410 0 02:33:05 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
    root 21829 1 0 02:24:44 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
  orarac 24152 21817 0 02:33:18 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
  orarac 24299 24298 0 02:33:21 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
    root 24577 22410 0 02:33:31 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl status
Unknown parameter: status
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
#
This time it started normally!
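
In the session above I kept retyping ps -ef|grep d.bin while waiting for the daemons. The same wait can be scripted. What follows is only a sketch under assumptions: wait_until and crs_is_up are hypothetical helper names, the crsctl path is the one from this system, and "healthy" simply means the three "appears healthy" lines shown above are all present.

```shell
# wait_until TRIES DELAY CMD [ARGS...]
# Run CMD up to TRIES times, sleeping DELAY seconds after each failure.
wait_until() {
    tries=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# crs_is_up [CRSCTL_PATH]
# Succeeds only if `crsctl check crs` prints a healthy line for CSS, CRS and EVM.
crs_is_up() {
    crsctl_bin=${1:-/ora_soft/oracle/product/crs/bin/crsctl}
    # The exit status is ignored on purpose; only the text is inspected.
    out=$("$crsctl_bin" check crs 2>&1) || :
    for d in CSS CRS EVM; do
        echo "$out" | grep -q "^$d appears healthy" || return 1
    done
    return 0
}
```

For example, wait_until 30 10 crs_is_up && echo "CRS is up" polls every 10 seconds for up to 30 attempts after ./init.crs start.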
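Next I went back to check the first node. The HP engineer had told me nothing had been touched on it, and I believed him; after all, cloning a system should not require any change on the source node. But reality is cruel!
I typed the commands:
# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
No d.bin processes ever appeared and nothing happened, so I went back to check the operating system log: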

Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2104.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2116.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
Nov 18 03:34:16 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
So there is some error information after all. Here is one of those files:

# cat /tmp/crsctl.2104
Failed 3 to bind listening endpoint:(ADDRESS=(PROTOCOL=tcp)(HOST=rqtmsdb1-priv))
#
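The listening endpoint could not be bound to the private IP. I checked /etc/hosts and found that this node's own private IP was missing; only the second node's private IP was there. I then checked /etc/hosts on the second node, compared the two files, and added the first node's private IP:
192.168.0.1     rqtmsdb1-priv
Not checking /etc/hosts at the very start was a real mistake. Whatever you are told, always verify it yourself! Once again /etc/hosts tripped me up in a RAC environment! Earlier, while I was configuring RAC at another customer site, an engineer had removed the system-default localhost entry, and it took me a whole day to discover that the missing localhost entry was the cause.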
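
Since /etc/hosts has now bitten me twice, a quick check before starting CRS seems worthwhile. This is a minimal sketch, assuming the names that matter are localhost plus each node's private name; missing_hosts is a hypothetical helper, and rqtmsdb2-priv in the usage line is an assumed name that follows the pattern of rqtmsdb1-priv.

```shell
# missing_hosts HOSTS_FILE NAME...
# Print every NAME that has no uncommented entry in HOSTS_FILE.
# Returns 1 if at least one name is missing, 0 otherwise.
missing_hosts() {
    hosts_file=$1; shift
    rc=0
    for name in "$@"; do
        # Strip comments, then look for NAME among the hostname/alias fields.
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }' "$hosts_file"; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}
```

On the first node, missing_hosts /etc/hosts localhost rqtmsdb1-priv rqtmsdb2-priv would have printed rqtmsdb1-priv before the entry was added.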
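I started CRS again, and this time it came up normally! I thought everything was fine and I could go to bed, but it turned out the VIP still had a problem.
Starting the cluster resources with crs_start -all failed: the VIP would not come up, so everything after it failed as well. This error is easy to deal with, as I had solved it before. First, enable debugging for the VIP resource:

# /ora_soft/oracle/product/crs/bin/crsctl debug log res "ora.rqtmsdb1.vip:5"
Then start the VIP resource on its own:

# /ora_soft/oracle/product/crs/bin/srvctl start nodeapps -n rqtmsdb1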

rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.rqtmsdb1.vip'.
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-0215: Could not start resource 'ora.rqtmsdb1.LISTENER_RQTMSDB1.lsnr'.
#
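
The debug output is long because each retry prints the same trace again; the actual cause sits in a few lines without a timestamp. As a rough sketch, those lines can be pulled out with a filter. vip_failures is a hypothetical helper, and its patterns are taken only from the log above, so other Clusterware versions may word these messages differently.

```shell
# vip_failures: read VIP debug output on stdin and print each distinct
# failure line once, in the order it first appeared.
# Patterns are copied from the messages seen in this article's log.
vip_failures() {
    grep -E 'is not defined|checked failed|failed to bring up VIP|^CRS-[0-9]+:' |
        awk '!seen[$0]++'
}
```

Piping the start command through it, for example srvctl start nodeapps -n rqtmsdb1 2>&1 | vip_failures, reduces the output above to a handful of lines, the first of which is the "Default gateway is not defined" message.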

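No default gateway was configured. Checking the IP address configuration as well, I found that the IP address was configured on lan2. Only when I asked did I learn that lan0 had kept having problems, so this time they had switched to lan2. Why didn't anyone say so earlier?!

When the VIP starts, it pings the default gateway; if the gateway is unreachable, the VIP will not come up. Once the HP engineer had configured the default gateway, the VIP was moved over to lan2, the interface the IP address now lives on:
First delete the existing interface configuration: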
su - oracle
oifcfg delif -global
Then configure it again:

$ oifcfg setif -global lan2/172.16.7.0:public
$ oifcfg setif -global lan3/192.168.0.0:cluster_interconnect

#/ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb2 -A 172.16.7.23/255.255.255.0/lan2

#/ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb1 -A 172.16.7.22/255.255.255.0/lan2
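
Before retrying, it may be worth confirming that a default route really exists now, since the VIP check fails without one. The sketch below is an assumption-laden helper, not something from the original session: default_gateway is a hypothetical name, and it assumes netstat -rn lists the default route with "default" (or "0.0.0.0") in the first column and the gateway in the second.

```shell
# default_gateway: read `netstat -rn` output on stdin and print the gateway
# of the first default route. Returns 1 if no default route is listed.
# Assumption: column 1 is the destination, column 2 the gateway.
default_gateway() {
    awk '$1 == "default" || $1 == "0.0.0.0" { print $2; found = 1; exit }
         END { exit (found ? 0 : 1) }'
}
```

Used as netstat -rn | default_gateway || echo "no default route", it either prints the gateway address, which can then be pinged by hand, or reports that the route is still missing.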

After these changes I ran crs_start -all again and RAC started successfully. Done for the night, off to bed!