Category: Servers & Storage
2011-09-09 13:12:11
Well, we’ve now had our first failure. A filer failed, but it was not service-impacting. We maintain a pair of filers at each of two distinct sites. The local pair back each other up, and both serve disk. The data is mirrored to the other site, which serves as a remote backup in the event of a site failure. We hope this is a fairly fault-tolerant solution, and it’s probably not an unusual design choice.
What we experienced was a single filer failing due to an issue with a PCI card. That filer then spun up as a vFiler on the partner node at the local site. There was no real interruption to disk being served to hosts. Obviously there was minor CIFS impact, and no doubt something would have been affected had we been serving NFS from this head. We primarily serve FC LUNs, so the failover was pretty seamless; a couple of server admins noted that their hosts were reporting paths to their disks as down.
Overall, not a dramatic event. This is good; this is why we invest in technology. No longer is failure not an option: failure is not service-impacting. All good.
The recovery process? Not so ideal, in my opinion.
As mentioned previously, we maintain two filer heads per site and have two sites. The particular failure we experienced should have been a no-brainer with regard to recovery. We maintain mirror copies of our data at the opposite site. Those mirrors are either asynchronous (where the business has decided that a certain amount of data loss is acceptable in the event of site failure) or synchronous (where the business has decided that the RPO for the data is zero: no data loss is acceptable). Again, this should all be pretty standard stuff.
After the problematic PCI card was replaced, we followed the procedure given by support for the filer giveback. The procedure given was:
On the failed filer:
There were no errors reported, so we continued with:
On the partner filer:
We could see from the partner node that the failed node was ready for giveback, so we continued with:
And that should have been that. All the pre-giveback checks passed and didn’t report any errors. I was initially concerned when trying to boot into maintenance mode and being presented with:
In a cluster, you MUST ensure that the partner is (and remains) down, or that takeover is manually disabled on the partner node, because clustering software is not started or fully enabled in Maintenance mode.
FAILURE TO DO SO CAN RESULT IN YOUR FILESYSTEMS BEING DESTROYED
Continue with boot?
Quite a scary message to be presented with, but the support guys confirmed this was normal and that I should continue with the process. So we pressed on with the giveback.
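For context, a generic ONTAP 7-mode takeover/giveback sequence looks roughly like the sketch below. This is a hedged reconstruction rather than the exact procedure support gave us, and the node names are hypothetical:

```
# On the repaired node: boot it back up. It should report that it is
# waiting for giveback from its partner.

# On the partner node, which is currently in takeover:
partner-node(takeover)> cf status      # confirm the partner is in takeover
                                       # and the repaired node is ready
partner-node(takeover)> cf giveback    # hand resources back to the node
```

After the giveback, `cf status` on either node should show the cluster back in its normal state, with each head serving its own resources again.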
partner-node(takeover)> snapmirror status

The problem was that we had synchronous relationships between the local and remote sites – one filer at each site performs synchronous replication, and its partner performs asynchronous replication. The restriction appears to be that all synchronous relationships need to be stopped before a filer giveback can proceed.
This seems incredible. A business decision has been made that this is mission-critical data, and the filer keeps that data replicated in real time; yet to perform a giveback, you need to put that data at risk. How disappointing. The rest of the process was pretty smooth, and we were pretty happy with how things were going. But to then expose your data to risk just to perform a standard task seems foolish.
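If I understand the restriction correctly, the workaround is to quiesce each synchronous relationship before the giveback and resume it afterwards. A hedged sketch (the filer name and destination volume are hypothetical):

```
# On the filer holding the synchronous destination volume:
dst-filer> snapmirror status                       # list relationships/state
dst-filer> snapmirror quiesce dst-filer:sync_vol   # pause the sync mirror

# ... perform the cf giveback on the partner node ...

dst-filer> snapmirror resume dst-filer:sync_vol    # bring the mirror back
                                                   # into sync
```

The window between the quiesce and the resume is exactly the exposure: for that period, the data the business declared must have a zero RPO is no longer being replicated in real time.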
This is certainly not something we experienced with our SVC solution. We had a very similar SVC implementation, with all data replicated in real time: two SVC nodes per site, all nodes serving disk. We experienced a single SVC node failure. The impact? Disks went into write-through mode. We fixed the issue with the failed node and brought it back into the cluster. The impact? Disk reads and writes were cached once more. It was entirely uneventful, as it should have been.