Category: Servers and Storage

2011-09-09 13:12:11

Well, we’ve now had our first failure. A filer failed, but it was not service impacting. We maintain a pair of filers in each of two distinct sites. The local pair back each other up, and both serve disk. The data is mirrored to the other site, which serves as a remote backup in the event of site failure. We believe this is a fairly fault-tolerant solution, and it is probably not an unusual design choice.

What we experienced was a single filer failing, due to some issue with a PCI card. This filer then spun up as a vFiler on the partner node at the local site, with no real interruption to disk being served to hosts. Obviously there was minor CIFS impact, and no doubt something would have been affected had we been serving NFS from this head. Primarily we serve FC LUNs, so the failover was pretty seamless. A couple of server admins noted that their hosts were reporting paths to their disks being down.

Overall, not a dramatic event. This is good; this is why we invest in technology. No longer is failure “not an option”: failure is simply not service impacting. All good.

The recovery process? Not so ideal, in my opinion.

As mentioned previously, we maintain two filer heads per site and have two sites. The particular failure we experienced should have been a no-brainer with regard to recovery. We maintain mirror copies of our data at the opposite site. Those mirrors are either asynchronous (where the business has decided that a certain amount of data loss is acceptable in the event of site failure) or synchronous (where the business has decided that the RPO for this data is zero: no data loss is acceptable). Again, this should all be pretty standard stuff.

After the problematic PCI card was replaced, we followed the procedure given by support for the filer giveback. The procedure was:

On the failed filer:

  • Boot Data ONTAP – boot_ontap
  • Press Ctrl-C to enter the boot menu, select option 5 (Maintenance mode), then issue the following:
    • fcadmin devmap
    • fcadmin config
    • storage show disk -p
    • aggr status -r

There were no errors reported, so we continued with:

  • Leave maintenance mode – halt
  • boot_ontap
  • Ensure the filer is “waiting for giveback”

On the partner filer:

  • fcadmin devmap
  • fcadmin config
  • partner fcadmin config
  • storage show disk -p
  • priv set advanced
  • cf monitor all
  • priv set
  • cf status

We could see from the partner node that the failed node was ready for giveback, so we continued with:

  • cifs terminate -t 10
  • cf giveback
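Rather than eyeballing the console for the “waiting for giveback” state, the check can be scripted against captured output. A minimal sketch, assuming you have saved the partner’s `cf status` output into a variable; the hostnames and exact message text here are illustrative, as the real wording varies by Data ONTAP release:

```shell
# Sample `cf status` output captured from the partner node
# (illustrative text -- the real message wording may differ).
cf_status='partner-node has taken over failed-node.
failed-node is waiting for giveback.'

# Only proceed with `cifs terminate -t 10` and `cf giveback` if the
# failed node actually reports that it is waiting for giveback.
if echo "$cf_status" | grep -q 'waiting for giveback'; then
    echo "safe to proceed with giveback"
else
    echo "do NOT run cf giveback yet"
fi
```

This sort of guard is cheap insurance when the giveback is being driven from a runbook rather than interactively.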

And that should have been that. All the pre-giveback checks passed and didn’t report any errors. I was initially concerned when trying to boot into maintenance mode and being presented with:

In a cluster, you MUST ensure that the partner is (and remains) down, or that takeover is manually disabled on the partner node, because clustering software is not started or fully enabled in Maintenance mode.

FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED
Continue with boot?

Quite a scary message to be presented with, but the support engineers confirmed this was normal and that I should continue with the process. So, we pressed on with the giveback.

partner-node(takeover)> snapmirror status

Snapmirror is on.
Source                  Destination              State          Lag        Status
remote-node:volume      local-node:volume        Snapmirrored   00:04:42   Idle
 [...]
partner-node(takeover)> cf giveback
[partner-node (takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
[partner-node (takeover): replication.givebackCancel:error]: SnapMirror transfer in progress or in-sync; canceling giveback.
[partner-node (takeover): cf.rsrc.givebackVeto:error]: Cluster monitor: snapmirror: giveback cancelled due to active state
[partner-node (takeover): cf.fm.givebackCancelled:warning]: Cluster monitor: giveback cancelled
partner-node(takeover)> snapmirror status
Snapmirror is on.
Source                  Destination              State          Lag        Status
remote-node:volume      local-node:volume        Snapmirrored   00:04:56   Idle
[...]
partner-node(takeover)>

The problem was that we had synchronous relationships between the local and remote sites: one filer in each site performs synchronous replication, and its partner performs asynchronous replication. The restriction appears to be that all synchronous relationships need to be stopped before a filer giveback can proceed.
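The veto can be spotted in advance from the `snapmirror status` output: any relationship whose Status column shows it is actively in sync will cancel the giveback. A minimal sketch of that check, assuming the 7-Mode column layout shown in the transcript above; the volume names and the `In-sync` status string are illustrative:

```shell
# Sample `snapmirror status` output (assumption: 7-Mode column layout;
# hostnames and volume names are illustrative).
status_output='Snapmirror is on.
Source                  Destination              State          Lag        Status
remote-node:vol_sync    local-node:vol_sync     Snapmirrored   -          In-sync
remote-node:vol_async   local-node:vol_async    Snapmirrored   00:04:42   Idle'

# List destinations that are actively in sync -- these are the
# relationships that appear to veto `cf giveback` until stopped.
echo "$status_output" | awk 'NR > 2 && $5 == "In-sync" { print $2 }'
```

Running this before `cf giveback` tells you exactly which relationships need attention, instead of discovering them one veto message at a time.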

This seems incredible. A business decision has been made that this is mission-critical data, and the filer keeps that data replicated in real time. But to perform a giveback, you need to put your data at risk. How disappointing. The rest of the process was pretty smooth, and we were pretty happy with how things were going. But to then expose your data to risk just to perform a standard task seems foolish.
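For completeness, one way to satisfy the restriction is to take the synchronous relationship out of sync for the duration of the giveback and re-establish it afterwards. This is only a sketch using the standard 7-Mode SnapMirror commands; whether `snapmirror quiesce` alone is sufficient for a fully synchronous relationship is an assumption on my part, so confirm the exact sequence with support before relying on it:

```
partner-node(takeover)> snapmirror quiesce local-node:volume    # pause the sync relationship (assumption: sufficient for sync mode)
partner-node(takeover)> cf giveback                             # giveback should no longer be vetoed
local-node> snapmirror resync local-node:volume                 # re-establish the mirror once the node is back
```

The window between the quiesce and the resync is exactly the data-at-risk exposure complained about above, which is why it grates for RPO-zero data.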

This is certainly not something we experienced with our SVC solution. We had a very similar SVC implementation, but all data was replicated in real time: two SVC nodes per site, all nodes serving disk. We experienced a single SVC node failure. The impact? Disks went into write-through mode. We fixed the issue with the failed node and brought it back into the cluster. The impact? Disk reads and writes were cached once more. It was entirely uneventful, as it should have been.
