1. On the NetApp storage, run the following commands:
fcp status
lun show
lun show -m
lun show -v
igroup show
fcp show initiator
rdfile /etc/messages
cf status
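2. On the NetApp storage, run cf takeover to fail the active controller over to its partner. The storage fault LED turns red. Checking the lun, igroup, and fcp status at the same time confirms that the LUNs and links are normal, so the problem is not on the storage side.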
3. Check the AIX system log. errpt -a shows a large number of errors on fcs0 and fcs1, and hdisk2 through hdisk13 report failed paths. The output of lsdev -Cc disk is as follows:
hdisk10 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk11 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk12 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk13 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk2 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk3 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk4 Available 05-00-01 MPIO Other FC SCSI Disk Drive
hdisk5 Available 05-00-01 MPIO Other FC SCSI Disk Drive
hdisk6 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk7 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk8 Available 04-00-02 MPIO Other FC SCSI Disk Drive
hdisk9 Available 04-00-02 MPIO Other FC SCSI Disk Drive
4. Check the disk paths with lspath. The output is as follows:
# lspath
Enabled hdisk0 sas0-----local disk
Enabled hdisk1 sas0-----local disk
Enabled hdisk2 fscsi0----NetApp storage
Enabled hdisk3 fscsi0
Enabled hdisk4 fscsi0
Enabled hdisk5 fscsi0
Enabled hdisk6 fscsi0
Enabled hdisk7 fscsi0
Enabled hdisk8 fscsi0
Enabled hdisk9 fscsi0
Enabled hdisk10 fscsi0
Enabled hdisk11 fscsi0
Enabled hdisk12 fscsi0
Enabled hdisk13 fscsi0
Enabled hdisk2 fscsi0
Enabled hdisk3 fscsi0
Enabled hdisk4 fscsi0
Enabled hdisk5 fscsi0
Enabled hdisk6 fscsi0
Enabled hdisk7 fscsi0
Enabled hdisk8 fscsi0
Enabled hdisk9 fscsi0
Enabled hdisk10 fscsi0
Enabled hdisk11 fscsi0
Enabled hdisk12 fscsi0
Enabled hdisk13 fscsi0
Missing hdisk2 fscsi1
Missing hdisk3 fscsi1
Enabled hdisk4 fscsi1
Enabled hdisk5 fscsi1
Missing hdisk6 fscsi1
Missing hdisk7 fscsi1
Missing hdisk8 fscsi1
Missing hdisk9 fscsi1
Missing hdisk10 fscsi1
Missing hdisk11 fscsi1
Missing hdisk12 fscsi1
Missing hdisk13 fscsi1
Missing hdisk2 fscsi1
Missing hdisk3 fscsi1
Enabled hdisk4 fscsi1
Enabled hdisk5 fscsi1
Missing hdisk6 fscsi1
Missing hdisk7 fscsi1
Missing hdisk8 fscsi1
Missing hdisk9 fscsi1
Missing hdisk10 fscsi1
Missing hdisk11 fscsi1
Missing hdisk12 fscsi1
Missing hdisk13 fscsi1
The paths can be activated manually, but after a system reboot they are reported as missing again.
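To make the missing paths in the listing above easier to spot, the lspath output can be tallied per adapter and status. The following is a minimal sketch and not part of the original procedure: the function name summarize_paths is hypothetical, and it assumes the three-column "status device parent" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: tally lspath lines by parent adapter and status.
# Assumes each input line is "<status> <hdisk> <parent>", as in the listing above.
# Lines with any other number of fields (such as a prompt line) are ignored.
summarize_paths() {
    awk 'NF == 3 { count[$3 " " $1]++ }
         END { for (k in count) print k, count[k] }' | sort
}

# Typical use on the AIX host:
#   lspath | summarize_paths
```

Each output line has the form "adapter status count", so an adapter with a non-zero Missing count stands out immediately.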
5. The output above shows that some paths are missing. After installing the NetApp ntap_aix_host_utilities package, running sanlun lun show -p gives the following result:
fas3140a:/vol/oracle/q0/asm2 (LUN 8)
80.0g (85905637376) lun state: GOOD
Filer_CF_State: Cluster Enabled Multipath_Policy: None
Multipath-provider: None
-------- --------- ------------ ---- ------- -------
host filer primary partner
path path device host filer filer
state type filename HBA port port
-------- --------- ------------ ---- ------- -------
up secondary hdisk9 fcs1 0b
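This output confirms that MPIO is not in effect and that only one path is up.
From this, the conclusions are:
1. Because MPIO is not in effect on AIX, after a takeover/giveback on the storage the system cannot find another path through which to reach the storage disks.
2. Because the storage disks cannot be reached, the Oracle RAC CRS service cannot find the OCR and voting disks and tries to restart the ocssd service. Once the retry time exceeds the default threshold (200 seconds), RAC evicts the node, a core file is dumped under ORACLE_HOME/CRS_HOME, and the system reboots.
3. After the reboot, the CRS service detects that the ONS, GSD, and OCSSD services take longer to start than the REBOOT time, considers the node failed, and keeps evicting it. This leaves Oracle RAC in a failed state in which only the vip process starts normally.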
Solution:
1. Download the NetApp MPIO ODM package for AIX from the official NetApp website, install it, and check the installed filesets:
NetApp.mpio_attach_kit.pcmodm 5.0.0.0
NetApp.mpio_attach_kit.iscsi 5.0.0.0
NetApp.mpio_attach_kit.fcp 5.0.0.0
NetApp.mpio_attach_kit.config 5.0.0.0
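2. Remove fcs0, fcs1, and their child devices:
rmdev -Rdl fcs0
rmdev -Rdl fcs1
3. Use lsdev -Cc adapter | grep fcs and lsdev -Cc disk to confirm that the devices have been removed. This operation does not damage data; it only changes the device definitions in the ODM.
4. Reboot the system so that it detects and configures the HBAs and their child devices automatically.
5. The output of lsdev -Cc disk is now as follows: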
hdisk0 Available 00-08-00 SAS Disk Drive
hdisk1 Available 00-08-00 SAS Disk Drive
hdisk2 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk3 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk4 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk5 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk6 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk7 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk8 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk9 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk10 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk11 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk12 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
hdisk13 Available 04-00-02 MPIO NetApp FCP Default PCM Disk
6. Also run sanlun lun show -p. The output is as follows:
ONTAP_PATH: fas3140a:/vol/oracle/q0/asm2
LUN: 8
LUN Size: 80.0g (85905637376)
Host Device: hdisk9
LUN State: GOOD
Controller_CF_State: Cluster Enabled
Controller Partner: fas3140b
Multipath Provider: AIX Native
Multipathing Algorithm: round_robin
--------- ----------- ------ ------ ----------- ----------
MPIO      Controller  AIX           Controller  AIX MPIO
path      path        MPIO   host   target HBA  path
status    type        path   HBA    port        priority
--------- ----------- ------ ------ ----------- ----------
Enabled   secondary   path0  fcs0   0b          1
Enabled   primary     path1  fcs0   0d          1
Enabled   secondary   path2  fcs1   0d          1
Enabled   primary     path3  fcs1   0b          1
This shows that the disks mapped from the storage to the host are now multipathed.
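The same check can be made for every disk at once from lspath output, rather than reading the sanlun report LUN by LUN. This is only a sketch under stated assumptions: the function name single_path_disks is hypothetical, the adapter names fscsi0 and fscsi1 are taken from the earlier lspath listing, and the input is assumed to be "status device parent" lines.

```shell
#!/bin/sh
# Hypothetical helper: list FC disks that lack an Enabled path on fscsi0 or fscsi1.
# Reads "<status> <hdisk> <parent>" lines on stdin; non-FC parents (e.g. sas0) are ignored.
single_path_disks() {
    awk '$3 ~ /^fscsi/ { seen[$2] = 1 }
         $1 == "Enabled" && $3 ~ /^fscsi/ { ok[$2 " " $3] = 1 }
         END {
             for (d in seen) {
                 k0 = d " fscsi0"
                 k1 = d " fscsi1"
                 if (!(k0 in ok) || !(k1 in ok)) print d
             }
         }' | sort
}

# Typical use on the AIX host:
#   lspath | single_path_disks
```

Empty output means every FC disk has at least one Enabled path through each adapter.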
7. Set rw_timeout to 60 seconds on the disks mapped from NetApp (run on both nodes):
chdev -l hdisk2 -a rw_timeout=60
chdev -l hdisk3 -a rw_timeout=60
chdev -l hdisk4 -a rw_timeout=60
chdev -l hdisk5 -a rw_timeout=60
chdev -l hdisk6 -a rw_timeout=60
chdev -l hdisk7 -a rw_timeout=60
chdev -l hdisk8 -a rw_timeout=60
chdev -l hdisk9 -a rw_timeout=60
chdev -l hdisk10 -a rw_timeout=60
chdev -l hdisk11 -a rw_timeout=60
chdev -l hdisk12 -a rw_timeout=60
chdev -l hdisk13 -a rw_timeout=60
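Typing twelve nearly identical commands invites slips, so the list above could also be generated with a loop. The sketch below is an illustration only: the function name gen_rw_timeout_cmds is hypothetical, and it assumes the NetApp disks are numbered contiguously (hdisk2 through hdisk13 here). It only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print one chdev command per hdisk in a numeric range.
# Arguments: first disk number, last disk number, optional timeout (default 60).
# Nothing is changed on the system; the commands are only written to stdout.
gen_rw_timeout_cmds() {
    first=$1
    last=$2
    timeout=${3:-60}
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "chdev -l hdisk$i -a rw_timeout=$timeout"
        i=$((i + 1))
    done
}

# Review the commands first:
#   gen_rw_timeout_cmds 2 13
# Then apply them:
#   gen_rw_timeout_cmds 2 13 | sh
```

Afterwards, lsattr -El hdisk2 -a rw_timeout shows the value in effect for a given disk.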
8. Set the CRS CSS misscount threshold to 120 seconds (run on both nodes):
crsctl set css misscount 120
9. Run cf takeover/cf giveback on the storage. The host no longer reboots, and Oracle RAC keeps running normally:
#crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.hbepi.db application ONLINE ONLINE jkdb2
ora....i1.inst application ONLINE ONLINE jkdb1
ora....i2.inst application ONLINE ONLINE jkdb2
ora....SM1.asm application ONLINE ONLINE jkdb1
ora....B1.lsnr application ONLINE ONLINE jkdb1
ora.jkdb1.gsd application ONLINE ONLINE jkdb1
ora.jkdb1.ons application ONLINE ONLINE jkdb1
ora.jkdb1.vip application ONLINE ONLINE jkdb1
ora....SM2.asm application ONLINE ONLINE jkdb2
ora....B2.lsnr application ONLINE ONLINE jkdb2
ora.jkdb2.gsd application ONLINE ONLINE jkdb2
ora.jkdb2.ons application ONLINE ONLINE jkdb2
ora.jkdb2.vip application ONLINE ONLINE jkdb2
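When repeating the takeover/giveback test, the crs_stat -t listing can be filtered so that only resources that are not fully online are shown. This is a minimal sketch: the function name crs_not_online is hypothetical, and it assumes the column order shown above (Name, Type, Target, State, Host).

```shell
#!/bin/sh
# Hypothetical helper: print resources whose Target or State is not ONLINE.
# Reads crs_stat -t output on stdin. Rows need at least four fields because
# the Host column may be empty for a resource that is offline.
crs_not_online() {
    awk 'NF >= 4 && $1 ~ /^ora\./ && ($3 != "ONLINE" || $4 != "ONLINE") {
             print $1, $3, $4
         }'
}

# Typical use on a cluster node:
#   crs_stat -t | crs_not_online
```

Empty output means every listed resource has both Target and State set to ONLINE.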
Note: the AIX operating system was at a low level (5.3.0.6) and has been upgraded to 5.3.0.8.