Category: Servers & Storage

2009-06-24 23:41:59

Over the past two days I ran extensive tests on EMC hot spares. The trigger was a failed system (vault) disk at a customer site that caused the global write cache to be disabled. Once the write cache is disabled, I/O performance drops sharply: I/O-intensive applications slow down and respond sluggishly. Observing with vmstat showed idle at roughly 10%-20%, indicating the system was I/O busy, and the wait column became non-zero (2-5 at the customer site).
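As a rough illustration of the vmstat observation above, here is a hypothetical Python sketch that flags samples matching the symptom pattern (low idle, non-zero wait). The thresholds and the simplified (idle, wait) pairs are assumptions for illustration, not the output of any EMC tool:

```python
# Hypothetical sketch: flag I/O pressure from captured vmstat samples.
# Thresholds (idle <= 20%, wait > 0) mirror the symptoms described above.

def io_pressure(idle_pct, wait):
    """Return True when a sample matches the 'write cache disabled' pattern."""
    return idle_pct <= 20 and wait > 0

# (idle %, wait) pairs, e.g. transcribed from `vmstat 2` output
samples = [(15, 3), (90, 0), (10, 5)]
flagged = [io_pressure(i, w) for i, w in samples]
print(flagged)  # [True, False, True]
```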

The customer actually had one hot spare each of 36 GB, 73 GB, and 146 GB, while the five system (vault) disks (disk0-disk4) were 73 GB. But even after a hot spare had fully taken over for the failed system disk, the write cache remained disabled. Since the replacement part had not arrived, we ended up pulling the 146 GB hot spare and inserting it directly into the failed system disk's slot.

From the process above we drew the following conclusions (for the customer's environment):

1)     With the write cache disabled, system performance drops sharply.

2)     If a system (vault) disk fails, the write cache is disabled regardless of whether a hot spare is configured.

3)     Only once the system disk is back to normal (after data synchronization completes, which typically takes about 1 hour for a 73 GB disk) is the write cache automatically re-enabled.

4)     If a non-system disk fails, the write cache is not disabled.

5)     A larger hot spare can stand in for a smaller failed disk.

6)     If one of the SPS units fails, the write cache is also disabled.
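The conclusions above can be summarized as a toy decision function. The names and structure are illustrative only, not any EMC API; the `rebuild_complete` condition reflects the CX600 behaviour observed at the customer site:

```python
# Toy model of the write-cache rules observed above (CX600 behaviour).

def write_cache_enabled(failed_disk_is_vault, rebuild_complete, sps_failed):
    """Write cache stays enabled unless a vault (system) disk or an SPS fails;
    on the CX600 it re-enables only after the vault disk rebuild completes."""
    if sps_failed:                                   # conclusion 6
        return False
    if failed_disk_is_vault and not rebuild_complete:  # conclusions 2 and 3
        return False
    return True

# Non-vault disk failure: cache stays enabled (conclusion 4).
print(write_cache_enabled(False, False, False))  # True
# Vault disk failed, rebuild still running: cache disabled (conclusions 2-3).
print(write_cache_enabled(True, False, False))   # False
```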

 

We ran the same tests in our CX500 test environment. Most of the results matched, but point 3) differed; there the conclusion was:

As soon as the system disk starts synchronizing (rather than waiting for it to finish), the write cache can return to the enabled state. This greatly shortens the time the write cache stays disabled, which is exactly the behaviour we want.

So perhaps this is a CX600 bug that has been fixed on the CX500? Or can the problem be avoided simply by upgrading the microcode?

Below are some hot-spare-related excerpts from EMC's manuals:

Hot spare - A single global spare disk that serves as a temporary replacement for a failed disk in a RAID 5, 3, 1, or 1/0 LUN. Data from the failed disk is reconstructed automatically on the hot spare. It is reconstructed from the parity data or mirrored data on the working disks in the LUN; therefore, the data on the LUN is always accessible. A hot spare LUN cannot belong to a storage group.
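The parity-based reconstruction described above can be sketched at the byte level. This is plain XOR arithmetic for illustration, not CLARiiON internals:

```python
# Sketch of how RAID 5 parity lets a hot spare be rebuilt: the missing
# disk's stripe unit is the XOR of the surviving units plus parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xAA\x55"
parity = xor_blocks([d0, d1, d2])          # computed when the stripe is written
rebuilt_d1 = xor_blocks([d0, d2, parity])  # reconstruct the failed disk's data
print(rebuilt_d1 == d1)  # True
```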

 

 

RAID type    Number of disks you can use
RAID 5       3 - 16
RAID 3       5 or 9 (CX-series)
RAID 1/0     2, 4, 6, 8, 10, 12, 14, 16
RAID 1       2
RAID 0       3 - 16
Disk         1
Hot spare    1

 

Note: If you have LUNs consisting of FC drives, allocate an FC drive as a hot spare. If you have LUNs consisting of ATA drives, allocate an ATA drive as a hot spare.

 

 

Rebuild priority - The rebuild priority is the relative importance of reconstructing data on either a hot spare or a new disk that replaces a failed disk in a LUN. It determines the amount of resources the SP devotes to rebuilding instead of to normal I/O activity. Table 8-3 lists and describes the rebuild time associated with each rebuild value.

 

Value     Target rebuild time in hours
ASAP      0 (as quickly as possible; this is the default)
HIGH      6
MEDIUM    12
LOW       18

 

 

The rebuild priorities correspond to the target times listed above. The storage system attempts to rebuild the LUN in the target time or less. The actual time to rebuild the LUN depends on the I/O workload, the LUN size, and the LUN RAID type. For a RAID group with multiple LUNs, the highest priority specified for any LUN in the group is used for all LUNs in the group.
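As a sanity check on these target times, here is a short sketch of the sustained rate a rebuild would need for a given disk size; the arithmetic is illustrative only and ignores workload and RAID-type effects:

```python
# Back-of-the-envelope: sustained rebuild rate needed to hit a target time.

def required_mb_per_s(disk_gb, target_hours):
    """MB/s needed to rewrite disk_gb gigabytes within target_hours."""
    return disk_gb * 1024 / (target_hours * 3600)

# A 73 GB disk rebuilt at MEDIUM priority (12 h target):
print(round(required_mb_per_s(73, 12), 2))  # 1.73 MB/s
```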

Rebuilding a RAID 5, 3, 1, or 1/0 LUN

You can monitor the rebuilding of a new disk from the General tab of its Disk Properties dialog box (page 14-15).

A new disk module’s state changes as follows:

1. Powering up - The disk is powering up.

2. Rebuilding - The storage system is reconstructing the data on the new disk from the information on the other disks in the LUN. If the disk is the replacement for a hot spare that is being integrated into a redundant LUN, the state is Equalizing instead of  Rebuilding. In this situation, the storage system is simply copying the data from the hot spare onto the new disk.

3. Enabled - The disk is bound and assigned to the SP being used as the communication channel to the enclosure.

 

A hot spare’s state changes as follows:

1. Rebuilding - The SP is rebuilding the data on the hot spare.

2. Enabled - The hot spare is fully integrated into the LUN, or the failed disk has been replaced with a new disk and the SP is copying the data from the hot spare onto the new disk.

3. Ready - The copy is complete. The LUN consists of the disks in the original slots and the hot spare is on standby.
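The three hot-spare states quoted above can be modeled as a tiny state machine; the event names here are invented for illustration and do not come from the manual:

```python
# The hot spare lifecycle (Rebuilding -> Enabled -> Ready) as a toy
# state machine. Event names are ad hoc labels for the manual's steps.

HOT_SPARE_TRANSITIONS = {
    ("Ready", "lun_disk_fails"): "Rebuilding",   # SP rebuilds data on the spare
    ("Rebuilding", "rebuild_done"): "Enabled",   # spare fully integrated in LUN
    ("Enabled", "copy_back_done"): "Ready",      # data copied to new disk, standby
}

def next_state(state, event):
    """Follow a transition; unknown events leave the state unchanged."""
    return HOT_SPARE_TRANSITIONS.get((state, event), state)

s = "Ready"
for ev in ("lun_disk_fails", "rebuild_done", "copy_back_done"):
    s = next_state(s, ev)
print(s)  # "Ready"
```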

 

Rebuilding occurs at the same time as user I/O. The rebuild priority for the LUN determines the duration of the rebuild process and the amount of SP resources dedicated to rebuilding. A High or ASAP (as soon as possible) rebuild priority consumes many resources and may significantly degrade performance. A Low rebuild priority consumes fewer resources with less effect on performance. You can determine the rebuild priority for a LUN from the General tab of its LUN Properties dialog box (page 14-14).

 

Failed vault disk with storage-system write caching enabled

If you are using write caching, the storage system uses the disks listed in Table 14-3 for its cache vault. If one of these disks fails, the storage system dumps its write cache image to the remaining disks in the vault; then it writes all dirty (modified) pages to disk and disables write caching.

Storage-system write caching remains disabled until a replacement disk is inserted and the storage system rebuilds the LUN with the replacement disk in it. You can determine whether storage-system write caching is enabled or disabled from the Cache tab of its Properties dialog box (page 14-14).

 

Storage-system type      Cache vault disks
CX3-series, CX-series    0-0 through 0-4

 "What is the High Availability Cache Vault (HACV) setting and what is the risk of setting it on or off?"

    
ID: emc126011
Date Created: 01/16/2006
Last Modified: 05/10/2007
Status: Approved
Audience: Customer
   
Knowledgebase Solution  
  

Question:  What is the High Availability Cache Vault (HACV) setting and what is the risk of setting it on or off?
Question:  Purpose of High Availability Cache Vault (HACV) on a CLARiiON CX- and DL-Series array
Environment:  Product: CLARiiON CX200
Environment:  Product: CLARiiON CX300
Environment:  Product: CLARiiON CX300i
Environment:  Product: CLARiiON CX400
Environment:  Product: CLARiiON CX500
Environment:  Product: CLARiiON CX500i
Environment:  Product: CLARiiON CX600
Environment:  Product: CLARiiON CX700
Environment:  Product: CLARiiON DL300
Environment:  Product: CLARiiON DL310
Environment:  Product: CLARiiON DL700
Environment:  Product: CLARiiON DL710
Environment:  Product: CLARiiON CX3-10c
Environment:  Product: CLARiiON CX3-20
Environment:  Product: CLARiiON CX3-20c
Environment:  Product: CLARiiON CX3-20F
Environment:  Product: CLARiiON CX3-40
Environment:  Product: CLARiiON CX3-40c
Environment:  Product: CLARiiON CX3-40F
Environment:  Product: CLARiiON CX3-80
Problem:  Does the HA Cache Vault prevent write cache from disabling in case another critical component fails?
Problem:  What does the HA Cache Vault check box in Navisphere Manager do?
Problem:  What events does HA Cache Vault protect the write cache from?
Fix:  If you enable the HA cache vault (HACV), a single vault drive failure will cause the write cache to become disabled, thus reducing the risk of losing data in the event of a second drive failing. If you disable the HACV, a single vault drive failure does not disable the write cache, leaving data at risk if a second drive fails. When you disable the HACV, you will receive a warning message stating that this operation will allow write caching to continue even if one of the cache vault drives fails. If there is already a failure on one of the cache vault drives, this operation will not re-enable the write cache. Cache will not re-enable in the event of an SP reboot until the fault condition is corrected.

The following table describes the consequences of having HACV enabled or disabled when a problem occurs.

                                 HACV Matrix

                                  HACV enabled             HACV disabled
Problem                           Cache state   Data loss  Cache state   Data loss
                                  after failure            after failure

No disk failures                  Enabled       No         Enabled       No
Above and SP panic or reboot      Enabled       No         Enabled       No
Above and double SP panic         Disabled      Yes        Disabled      Yes
Above and array power cycles      Enabled       No         Enabled       No

Single vault disk fails           Disabled      No         Enabled       No
Above and SP panic or reboot      Disabled      No         Disabled      No
Above and double SP panic         Disabled      No         Disabled      Yes
Above and array power cycles      Disabled      No         Disabled      No

Second vault disk fails           Disabled      User LUNs  Disabled      User LUNs
Above and SP panic or reboot      Disabled      No         Disabled      No
Above and double SP panic         Disabled      No         Disabled      Possible *
Above and array power cycles      Disabled      No         Disabled      Possible *




* When a second vault disk fails, cache will disable and begin to de-stage because this takes more time than a dump of cache memory to the vault. There is a window of vulnerability for LUNs to end up with dirty cache.
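The key rows of the matrix above can be encoded as a lookup table for illustration; the key strings and value labels here are ad hoc, not an EMC data structure:

```python
# Part of the HACV matrix as a lookup table.
# Key: (problem, hacv_enabled) -> (cache state after failure, data loss)

HACV_MATRIX = {
    ("single vault disk fails", True):  ("Disabled", "No"),
    ("single vault disk fails", False): ("Enabled",  "No"),
    ("single vault disk fails + double SP panic", True):  ("Disabled", "No"),
    ("single vault disk fails + double SP panic", False): ("Disabled", "Yes"),
}

def cache_outcome(problem, hacv_enabled):
    """Look up cache state and data-loss risk for a failure scenario."""
    return HACV_MATRIX[(problem, hacv_enabled)]

# With HACV disabled, a single vault disk failure leaves the cache enabled...
print(cache_outcome("single vault disk fails", False))
# ...but a subsequent double SP panic then risks data loss.
print(cache_outcome("single vault disk fails + double SP panic", False))
```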

Notes:  The following issues will cause the cache to become disabled regardless of the HACV setting:
AC power loss
SP failure
Fan failure
Power supply failure
SPS failure
User-induced cache disable
Insufficient number of available cache pages
Over-temperature

Notes:  HA Cache Vault is defined in Navisphere Manager Help as:
"HA Cache Vault Note:

Supported only on CX-Series storage systems. Determines the availability of storage-system write caching when a single drive in the cache vault fails. When the check box is enabled (default), write caching is disabled if a single vault disk fails. When the check box is cleared, write caching is not disabled if a single disk fails.

Important:  Disabling the HA Cache Vault check box puts the data at risk if another cache vault disk should fail."
 
