2011-12-30 03:08:12

OK, this is part II of the post “PowerHA failure scenario when dealing with SAN-booted LPARs” (see part I for the background).

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOSes serving the active node (LPAR). On both VIOSes we ran the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (more precisely, VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? After some time, the LPAR holding the active HACMP/PowerHA Resource Groups (jkwha001d) evicted itself from the HACMP cluster, and the formerly passive node jkwha002d became the active one (Oracle started there). The error log on jkwha001d shows the root cause:

root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)

As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves a memory image of its running kernel to the Logical Volumes configured with the “sysdump” type and then reboots, so that the crash can be investigated afterwards. Here, however, you won't find anything: FC storage connectivity was lost completely (even for rootvg), so the dump could not even be written. So how does the system know a dump happened at all? Most likely the crash state of the LPAR is recorded somewhere in the PowerVM firmware area; it's one of the RAS features of AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…

OK, but we wanted real proof, so we performed a double-check. We suspected that the VIOS might have some magic way of communicating with the LPAR, and to exclude that factor we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (jkwha001d being passive), and all MPIO paths to the NetApp cluster (storage1, storage2) reported as “Enabled” by “lspath”. In NetApp terminology an igroup (“initiator group”) is the object responsible for things like LUN masking. If you remove a LUN's access for certain FC WWPNs, you end up in a situation where AIX hdisks point at non-existing SCSI targets, and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).

On the 2nd storage controller (storage2), the igroup jkwha002d_boot controlled access to the OS-level LUNs (rootvg, etc.):

storage2> lun show -g jkwha002d_boot
/vol/jkwha002d_boot/boot/rootvg   40.0g (42953867264)   (r/w, online, mapped)
storage2>

Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:

storage2> lun show -g jkwha002d_cluster
/vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)    (r/w, online, mapped)
/vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)   (r/w, online, mapped)
/vol/sap_jw1_sapdata/sapdata2/sapdata02.lun     15g (16106127360)   (r/w, online, mapped)
storage2>

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads and writes are managed by AIX LVM, with VGs built from LUNs on both controllers):

storage1> lun show -g jkwha002d_cluster
/vol/clus_001d_002d_hb/hb/hbvg     20m (20971520)      (r/w, online, mapped)
/vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)    (r/w, online, mapped)
/vol/sap_jw1_sapbin/sapbin/sapbin01.lun    100g (107374182400)  (r/w, online, mapped)
/vol/sap_jw1_sapdata/sapdata1/sapdata01.lun     15g (16106127360)   (r/w, online, mapped)
storage1>

Let’s do it:

storage1> igroup remove
usage:
igroup remove [ -f ] ...
- removes node(s) from an initiator group. The node may also
specified as its alias. See "fcp wwpn-alias" for help with aliases
By default, if lun maps exist for the initiator group, the node is
not removed from the group. The -f option can be used to force the
removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0

… and HACMP failover won't work. The active LPAR node (jkwha002d) ends up in a zombie state. If you had SSH sessions opened beforehand, everything would seem to work (ls on /etc succeeds, and so do some command utilities), but only because those things are cached in AIX's filesystem cache. Everything else fails with I/O errors, and lspath cries too (it won't display all MPIO paths as failed, but that's a story for another post). Some examples follow, and a small probe sketch that makes the cache effect visible comes after the listings:

root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
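
To make the cache effect visible, here is a minimal probe sketch (my own illustration, not part of the original test; plain C, assuming O_DIRECT is available, as it is on JFS2 and on Linux). It reads the same file twice, once through the filesystem cache and once bypassing it: on a healthy node both reads succeed, while on a zombie node the cached read may still succeed even as the direct read fails with an I/O error.

#define _GNU_SOURCE           /* for O_DIRECT on Linux; AIX declares it in fcntl.h */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try one read of `path` with the given extra open() flags and report the result. */
static void try_read(const char *path, int extra_flags, const char *label)
{
    void *buf;
    /* O_DIRECT requires an aligned buffer; 4 KiB covers common block sizes. */
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return;
    }
    int fd = open(path, O_RDONLY | extra_flags);
    if (fd < 0) {
        printf("%s: open failed: %s\n", label, strerror(errno));
        free(buf);
        return;
    }
    ssize_t n = read(fd, buf, 4096);
    if (n < 0)
        printf("%s: read failed: %s\n", label, strerror(errno));
    else
        printf("%s: read %zd bytes\n", label, n);
    close(fd);
    free(buf);
}

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/etc/hosts";
    try_read(path, 0,        "cached read");   /* may be served from RAM */
    try_read(path, O_DIRECT, "direct read");   /* forces a real disk I/O */
    return 0;
}

Compile it with a plain cc and point it at any file on the affected filesystem; the divergence between the two reads is exactly the zombie state described above.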

Oracle is going to panic (which is good behaviour in my opinion):

root@jkwha002d:# tail alert_JW1.log
Mon Jan  3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#

What's most interesting is that Oracle asks AIX to write those messages to the alert log file, and they are readable with commands like “tail”, but “cat” won't work (you can't read the whole alert log file because you have no I/O!). What's even more interesting is that you won't see those messages after rebooting, once the kernel memory is gone. If you don't have I/O, how are you going to write/fsync this file? You aren't: the writes only ever reach the in-memory filesystem cache and are never flushed to disk.
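
A minimal sketch of why this happens (my own illustration; the file name is hypothetical): with all paths dead, write() typically still returns success because the data only lands in the filesystem cache; the failure only surfaces when something forces a flush, such as fsync(), and the never-flushed pages simply vanish with the reboot.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file on an affected filesystem. */
    int fd = open("/oracle/JW1/saptrace/demo.log",
                  O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* With all paths dead this write() can still return success:
       the data lands in the filesystem cache, not on disk, which is
       why "tail" sees fresh alert-log lines on the zombie node. */
    const char *msg = "ORA-27072: File I/O error\n";
    ssize_t n = write(fd, msg, strlen(msg));
    printf("write() returned %zd\n", n);

    /* The truth comes out when the cache is flushed: fsync() fails,
       and after reboot the unflushed data never existed. */
    if (fsync(fd) != 0)
        printf("fsync() failed: %s\n", strerror(errno));

    close(fd);
    return 0;
}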

Another thing: the active HACMP node will still be active, it will keep holding the Resource Groups, etc. Failover won't happen. A possible solution to this problem would be a kernel-based check verifying that at least /etc is accessible. Why kernel-based? Because you need some form of heartbeat pinned in memory (an AIX kernel module, or an uncachable binary always resident in RAM and running at real-time scheduling priority) that would continuously probe the disk. If the probe failed several times in a row, it should reboot the node, which would trigger a Resource Group failover to the 2nd node in the cluster.
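
A minimal user-space sketch of such a heartbeat, under stated assumptions (my own illustration, not a tested implementation: the probe file, interval and failure threshold are arbitrary, and the reboot(2) call shown is the Linux-style one; on AIX the reboot would have to be done with the platform's own mechanism, ideally from a kernel extension as argued above):

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/reboot.h>

#define PROBE_FILE   "/etc/hosts"   /* any small file on rootvg */
#define MAX_FAILURES 5
#define INTERVAL_SEC 2

int main(void)
{
    /* Pin all current and future pages in RAM so the watchdog itself
       never needs disk I/O once it is running. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* Real-time priority so a dying I/O subsystem cannot starve us. */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        perror("sched_setscheduler");   /* non-fatal in this sketch */

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    int failures = 0;
    for (;;) {
        /* O_DIRECT bypasses the filesystem cache: a cached read would
           still "succeed" with every MPIO path dead. */
        int fd = open(PROBE_FILE, O_RDONLY | O_DIRECT);
        if (fd < 0 || read(fd, buf, 4096) < 0)
            failures++;
        else
            failures = 0;
        if (fd >= 0)
            close(fd);

        if (failures >= MAX_FAILURES) {
            /* Storage is gone: force a reboot so the Resource Groups
               fail over to the surviving cluster node. */
            reboot(RB_AUTOBOOT);
        }
        sleep(INTERVAL_SEC);
    }
}

Even this is only an approximation of the kernel-based approach: a user-space process, however well pinned and prioritized, still depends on things like open() path resolution not blocking forever, which is exactly why the kernel is the safer place for such a check.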

Note: typical HACMP scripting is, at least in theory, not enough: even if it forced running /sbin/halt, how can you be sure that all the libc contents required by /sbin/halt are still in memory?

Perhaps I'll write a part III of this article :)

-Jakub.
