Category: System Operations
2008-09-01 16:09:20
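This morning I logged in to thoribm78 and thoribm79 and found that not one of the 8 ports had come up.
thoribm78# gabconfig -a
thoribm78# lltconfig -a list
lltconfig printed nothing either, which means LLT had not started. I half suspected a problem with the new version: after the reboot, not a single port came up.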
thoribm79# gabconfig -a
===============================================================
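The empty gabconfig output above can also be checked in a script. A minimal sketch, assuming the usual `gabconfig -a` layout in which every started port prints one "Port <x> gen ... membership ..." line; the function name count_gab_ports is made up for illustration.

```shell
# Count GAB port membership lines in `gabconfig -a` output read from stdin.
# Assumption: each started port appears as "Port <x> gen <id> membership <nodes>".
count_gab_ports() {
    # grep -c exits non-zero when the count is 0, so mask that with `|| true`.
    grep -c '^Port [a-z] gen .* membership ' || true
}

# Usage on a cluster node (not run here):
#   gabconfig -a | count_gab_ports
```

A result of 0, as on both nodes here, means no port has registered with GAB, not even port a (GAB itself).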
thoribm78# date
Sat Aug 23 18:29:05 PDT 2008
Check the VCS logs:
thoribm78# tail -f engine_A.log
2008/08/23 08:11:36 VCS NOTICE V-16-1-11027 VCS engine startup arguments=-restart
2008/08/23 08:11:36 VCS NOTICE V-16-1-11050 VCS engine version=5.0
2008/08/23 08:11:36 VCS NOTICE V-16-1-11051 VCS engine join version=5.0.30.0
2008/08/23 08:11:36 VCS NOTICE V-16-1-11052 VCS engine pstamp=Veritas-5.0MP3-07/20/08-09:29:00
2008/08/23 08:11:36 VCS NOTICE V-16-1-10114 Opening GAB library
2008/08/23 08:11:36 VCS NOTICE V-16-1-10619 'HAD' starting on: thoribm78
2008/08/23 08:11:36 VCS INFO V-16-1-10196 Cluster logger started
2008/08/23 08:11:36 VCS INFO V-16-1-10125 GAB timeout set to 30000 ms
2008/08/23 08:11:36 VCS ERROR V-16-1-10116 GabHandle::open failed errno = 161
2008/08/23 08:11:36 VCS ERROR V-16-1-11033 GAB open failed. Exiting
thoribm79# tail -f engine_A.log
2008/08/23 08:01:54 VCS NOTICE V-16-1-10322 System thoribm79 (Node '1') changed state from EXITING to EXITED
2008/08/23 08:09:07 VCS INFO V-16-1-10196 Cluster logger started
2008/08/23 08:09:07 VCS NOTICE V-16-1-11022 VCS engine (had) started
2008/08/23 08:09:07 VCS NOTICE V-16-1-11050 VCS engine version=5.0
2008/08/23 08:09:07 VCS NOTICE V-16-1-11051 VCS engine join version=5.0.30.0
2008/08/23 08:09:07 VCS NOTICE V-16-1-11052 VCS engine pstamp=Veritas-5.0MP3-07/20/08-09:29:00
2008/08/23 08:09:07 VCS NOTICE V-16-1-10114 Opening GAB library
2008/08/23 08:09:11 VCS NOTICE V-16-1-10619 'HAD' starting on: thoribm79
2008/08/23 08:09:14 VCS INFO V-16-1-10125 GAB timeout set to 30000 ms
2008/08/23 08:09:28 VCS CRITICAL V-16-1-11306 Did not receive cluster membership, manual intervention may be needed for seeding
V-16-1-11033 GAB open failed. Exiting
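Roughly: VCS (HAD) tried to open GAB, GAB was still not up when the timeout expired, and HAD had no choice but to exit.
Searching for "V-16-1-11033 GAB open failed. Exiting" turned up an article in which LLT and GAB could not start because their startup scripts were missing: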
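To pull just the failures out of a long engine_A.log, a small filter helps. This is a sketch that assumes the line layout seen in the excerpts above (date, time, "VCS", severity, message ID, text); vcs_errors is a hypothetical helper name.

```shell
# Print only ERROR and CRITICAL entries from a VCS engine log on stdin.
# Assumption: field 3 is the literal "VCS" and field 4 is the severity.
vcs_errors() {
    awk '$3 == "VCS" && ($4 == "ERROR" || $4 == "CRITICAL")'
}

# Usage (not run here):
#   vcs_errors < engine_A.log
```

On the thoribm78 log this would leave the two V-16-1-10116 and V-16-1-11033 lines; on thoribm79 it would leave the V-16-1-11306 seeding message.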
gab error v-15-1-20109 port h registration failed gab not configured
Details:
On reboot, LLT and GAB don't get started.
We see the following error on bootup in messages file:
May 8 10:37:25 emarssu2 Had[2611]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
May 8 10:37:25 emarssu2 gab: [ID 243033 kern.notice] GAB ERROR V-15-1-20109 Port h registration failed, GAB not configured
May 8 10:37:25 emarssu2 Had[2611]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10116 GabHandle::open failed errno = 261
May 8 10:37:25 emarssu2 Had[2611]: [ID 702911 daemon.notice] VCS ERROR V-16-1-11033 GAB open failed. Exiting
May 8 10:37:29 emarssu2 llt: [ID 296307 kern.notice] LLT Protocol unavailable
May 8 10:37:29 emarssu2 gab: [ID 983455 kern.notice] GAB INFO V-15-1-20022 GAB unavailable
In Solaris 10, there should be two files in /etc/rc2.d for LLT and GAB
# ls -l *llt
-rwxr--r-- 2 root sys 1539 Sep 29 2005 S70llt
-rwxr--r-- 1 root sys 2073 Apr 5 2004 X70llt
# ls -l *gab
-rwxr--r-- 3 root sys 1979 Nov 10 2005 S92gab
-rwxr--r-- 1 root sys 2346 Dec 10 2003 X92gab
If they are missing then LLT and GAB may not startup.
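The check described in the article can be scripted. A minimal sketch: the directory is passed in as an argument because the rc path differs between platforms (the article names /etc/rc2.d for Solaris 10), and missing_rc_scripts is a made-up function name.

```shell
# Report which LLT/GAB start scripts (S*llt, S*gab) are absent from the
# rc directory given as $1. Prints one missing pattern per line.
missing_rc_scripts() {
    dir=$1
    for name in llt gab; do
        found=no
        # The glob stays literal when nothing matches, so test with -e.
        for f in "$dir"/S*"$name"; do
            if [ -e "$f" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "S*$name"
        fi
    done
}

# Usage (not run here):
#   missing_rc_scripts /etc/rc2.d
```

No output means both start scripts are present. Only the S-prefixed names are matched, since the X-prefixed copies in the listing are not run at boot.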
For some unknown reason, my VCS engine startup argument is no longer
-onenode. Actually I can set the parameter manually with the command
"hastart -onenode", but I want to know where I can change the argument to
"onenode" so that the VCS engine starts up that way automatically. My platform is sun 9.
Thanks more.
2006/07/28 23:38:31 VCS NOTICE V-16-1-11050 VCS engine version=4.0
2006/07/28 23:38:31 VCS NOTICE V-16-1-11051 VCS engine join version=4.1
2006/07/28 23:38:31 VCS NOTICE V-16-1-11052 VCS engine pstamp=4.0 07/23/04-15:56:00
2006/07/28 23:38:31 VCS NOTICE V-16-1-10114 Opening GAB library
2006/07/28 23:38:31 VCS NOTICE V-16-1-10619 'HAD' starting on: ztenms01
2006/07/28 23:38:32 VCS INFO V-16-1-10125 GAB timeout set to 15000 ms
2006/07/28 23:38:32 VCS ERROR V-16-1-10116 GabHandle::open failed errno = 261
2006/07/28 23:38:32 VCS ERROR V-16-1-11033 GAB open failed. Exiting
2006/07/29 01:55:33 VCS NOTICE V-16-1-11022 VCS engine (had) started
2006/07/29 01:55:33 VCS NOTICE V-16-1-11027 VCS engine startup arguments=-onenode
2006/07/29 01:55:33 VCS NOTICE V-16-1-11022 VCS engine (had) started
2006/07/29 01:55:33 VCS NOTICE V-16-1-11027 VCS engine startup arguments=-onenode
2006/07/29 01:56:10 VCS NOTICE V-16-1-11050 VCS engine version=4.0
2006/07/29 01:56:10 VCS NOTICE V-16-1-11051 VCS engine join version=4.1
2006/07/29 01:56:10 VCS NOTICE V-16-1-11052 VCS engine pstamp=4.0 07/23/04-15:56:00
2006/07/29 01:56:22 VCS NOTICE V-16-1-10115 Using GABSIM
Fix GAB to start
If there is no GAB, VCS will not be able to communicate with any other
nodes and thus will start in single-node mode.
OK, so first fix /etc/llttab and /etc/llthosts (both machines must have
the same links and the same cluster id, and the hosts file must know
about the other node), then fix /etc/gabtab to something like:
/sbin/gabconfig -n2
This will make sure that GAB will only start once there are 2 nodes in
the cluster.
OK, to check if LLT has started properly: lltstat -vnv
To check if GAB started: gabconfig -a
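The seed count in gabtab is worth reading back before a reboot, since GAB waits for that many nodes. A rough sketch, assuming the count is written as -n followed by digits as in the example line above; gab_seed_count is a hypothetical name.

```shell
# Print the node count that follows -n on the gabconfig line of a gabtab file.
# $1 = path to gabtab. Prints nothing if no such line is found.
gab_seed_count() {
    sed -n 's/.*gabconfig.*-n *\([0-9][0-9]*\).*/\1/p' "$1"
}

# Usage (not run here):
#   gab_seed_count /etc/gabtab
```

For a two-node cluster like the one in this post the expected answer is 2.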
Mike Wang wrote:
> For some unknown reason, my VCS engine startup argument is no longer
> -onenode. Actually I can set the parameter manually with the command
> "hastart -onenode", but I want to know where I can change the argument to
> "onenode" so that the VCS engine starts up that way automatically. My platform is
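I then checked the startup scripts on both machines; they were all there. It occurred to me to check the LLT and GAB configuration files as well.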
thoribm78# ls -rlt llt*
-rw-r--r-- 1 root system 110 Mar 24 09:16 llttab.bak
-rw-rw-r-- 1 root system 102 Aug 21 00:35 llt.conf
thoribm78# ls -rlt *tab
-rw-r--r-- 1 root system 2447 Dec 11 2007 bootptab
-rw-r--r-- 1 root system 4069 Aug 21 00:45 hasnaptab
-rw-r--r-- 1 root system 209 Aug 21 00:48 vrtswebtab
-rw-r--r-- 1 root system 23 Aug 21 02:31 vcsmmtab
-rw-r--r-- 1 root system 108 Aug 21 02:31 org.llttab
-rw-r--r-- 1 root system 23 Aug 21 02:31 gabtab
-rw-rw-r-- 1 oracle oinstall 697 Aug 22 01:36 oratab
-rw-r--r-- 1 root system 3674 Aug 23 08:04 inittab
-rw-r--r-- 1 root system 14 Aug 23 08:05 rmtab
-rw-r--r-- 1 root system 37 Aug 23 08:06 xtab
thoribm78# ls -rlt gab*
-rw-r--r-- 1 root system 23 Aug 21 02:31 gabtab
thoribm79# cd /etc/
thoribm79# ls -rlt llt*
-rw-rw-r-- 1 root system 102 Aug 21 00:35 llt.conf
-rw-r--r-- 1 root system 25 Aug 21 02:31 llthosts
-rw-r--r-- 1 root system 108 Aug 21 02:31 llttab
thoribm79# more llthosts
0 thoribm78
1 thoribm79
thoribm79# ls -rlt *tab
-rw-r--r-- 1 root system 2447 Dec 11 2007 bootptab
-rw-r--r-- 1 root system 4069 Aug 21 00:45 hasnaptab
-rw-r--r-- 1 root system 209 Aug 21 00:48 vrtswebtab
-rw-r--r-- 1 root system 23 Aug 21 02:31 vcsmmtab
-rw-r--r-- 1 root system 23 Aug 21 02:31 gabtab
-rw-r--r-- 1 root system 108 Aug 21 02:31 llttab
-rw-rw-r-- 1 oracle oinstall 697 Aug 22 01:36 oratab
-rw-r--r-- 1 root system 3853 Aug 23 08:06 inittab
thoribm79#
thoribm79# ls -rlt gab*
-rw-r--r-- 1 root system 23 Aug 21 02:31 gabtab
thoribm79# more llttab
set-node thoribm79
set-cluster 666
link en2 /dev/dlpi/en:2 - ether - -
link en3 /dev/dlpi/en:3 - ether - -
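Only then did I see it: on thoribm78 the LLT configuration files llthosts and llttab were both missing, so LLT could not start. Because LLT and GAB are the foundation VCS runs on, losing these files kept LLT and GAB from starting anywhere in the cluster.
With llthosts missing, the system could not tell which hosts make up the cluster, which is why the lltconfig command on thoribm78 produced no output.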
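Comparing directory listings by eye is how the missing files were found here; the same check is easy to script. A minimal sketch that takes the configuration directory as an argument (/etc on these hosts); missing_vcs_conf is a made-up name.

```shell
# List which of the LLT/GAB configuration files are missing from directory $1.
missing_vcs_conf() {
    dir=$1
    for f in llttab llthosts gabtab; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
        fi
    done
}

# Usage (not run here):
#   missing_vcs_conf /etc
```

Going by the listings above, thoribm78 would report llttab and llthosts, and thoribm79 would report nothing.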
thoribm79# rcp llttab llthosts thoribm78:/etc/
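Reboot both nodes.
After the reboot the 8 ports still had not started. lltconfig -a list now produced output, but both private NICs on thoribm78 showed as down; lltstat -vnv made this even clearer. At first I thought the NICs on thoribm78 had failed, but that seemed unlikely.
Checking the LLT and GAB configuration files again, I found that the llttab copied straight across needed to be edited. After correcting it and rebooting the hosts, everything was normal.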
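The post does not say which line of the copied llttab had to change. In the file shown earlier, set-node names the local host, so a copy taken from thoribm79 would still say thoribm79. The sketch below rewrites only that line, on the assumption that it was the culprit; the link device names may also need checking. llttab_for_node is a hypothetical helper.

```shell
# Print a copy of the llttab in $2 with its set-node line pointing at node $1.
# Writes to stdout only; the original file is left untouched.
llttab_for_node() {
    sed "s/^set-node[[:space:]].*/set-node $1/" "$2"
}

# Usage (not run here): review the output before installing it.
#   llttab_for_node thoribm78 /etc/llttab > /tmp/llttab.new
```

llthosts, by contrast, is meant to be identical on every node, so copying it unchanged is fine.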
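Postscript: LLT and GAB are the foundation of VCS. LLT starts first, then GAB; if the log says GAB has not started, the other 6 ports certainly cannot start either. If GAB or LLT will not start, first check that the LLT and GAB startup scripts and configuration files exist and are correct.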
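One cheap correctness check along these lines is that the node named in llttab also appears in llthosts. A minimal sketch, assuming set-node holds a host name (it can also be given in other forms, which this does not handle); node_in_llthosts is a made-up name.

```shell
# Succeed (exit 0) if the set-node name in llttab ($1) is listed as a
# host name in llthosts ($2); fail otherwise.
node_in_llthosts() {
    node=$(awk '$1 == "set-node" { print $2; exit }' "$1")
    # No set-node line at all counts as a failure.
    [ -n "$node" ] || return 1
    awk -v n="$node" '$2 == n { found = 1 } END { exit !found }' "$2"
}

# Usage (not run here):
#   node_in_llthosts /etc/llttab /etc/llthosts || echo "set-node not in llthosts"
```

This catches a node name that is absent from llthosts altogether, but not a name that is valid for the cluster yet wrong for the local machine.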