Category: Servers & Storage

2009-06-24 23:41:59

Over the past two days I ran extensive tests on EMC hot spares. The trigger was a failed system (vault) disk at a customer site that caused the global write cache to be disabled. Once the write cache is disabled, I/O performance drops sharply: I/O-intensive applications slow down and respond sluggishly. Observing with vmstat showed idle at roughly 10%-20%, indicating the system was I/O busy, and the wait column became non-zero (2-5 at the customer site).
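As a rough illustration of the vmstat observation above, here is a hypothetical Python sketch that flags samples matching the symptom pattern (low idle, non-zero wait). The thresholds and the simplified (idle, wait) pairs are assumptions for illustration, not the output of any EMC tool:

```python
# Hypothetical sketch: flag I/O pressure from captured vmstat samples.
# Thresholds (idle <= 20%, wait > 0) mirror the symptoms described above.

def io_pressure(idle_pct, wait):
    """Return True when a sample matches the 'write cache disabled' pattern."""
    return idle_pct <= 20 and wait > 0

# (idle %, wait) pairs, e.g. transcribed from `vmstat 2` output
samples = [(15, 3), (90, 0), (10, 5)]
flagged = [io_pressure(i, w) for i, w in samples]
print(flagged)  # [True, False, True]
```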

The customer actually had one hot spare each of 36 GB, 73 GB, and 146 GB, while the five system (vault) disks (disk0-disk4) were 73 GB. But even after a hot spare had fully taken over for the failed system disk, the write cache remained disabled. Since the replacement part had not arrived, we ended up pulling the 146 GB hot spare and inserting it directly into the failed system disk's slot.

From the process above we drew the following conclusions (for the customer's environment):

1)     With the write cache disabled, system performance drops sharply.

2)     If a system (vault) disk fails, the write cache is disabled regardless of whether a hot spare is configured.

3)     Only once the system disk is back to normal (after data synchronization completes, which typically takes about 1 hour for a 73 GB disk) is the write cache automatically re-enabled.

4)     If a non-system disk fails, the write cache is not disabled.

5)     A larger hot spare can stand in for a smaller failed disk.

6)     If one of the SPS units fails, the write cache is also disabled.
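The conclusions above can be summarized as a toy decision function. The names and structure are illustrative only, not any EMC API; the `rebuild_complete` condition reflects the CX600 behaviour observed at the customer site:

```python
# Toy model of the write-cache rules observed above (CX600 behaviour).

def write_cache_enabled(failed_disk_is_vault, rebuild_complete, sps_failed):
    """Write cache stays enabled unless a vault (system) disk or an SPS fails;
    on the CX600 it re-enables only after the vault disk rebuild completes."""
    if sps_failed:                                   # conclusion 6
        return False
    if failed_disk_is_vault and not rebuild_complete:  # conclusions 2 and 3
        return False
    return True

# Non-vault disk failure: cache stays enabled (conclusion 4).
print(write_cache_enabled(False, False, False))  # True
# Vault disk failed, rebuild still running: cache disabled (conclusions 2-3).
print(write_cache_enabled(True, False, False))   # False
```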

 

We ran the same tests in our CX500 test environment. Most of the results matched, but point 3) differed; there the conclusion was:

As soon as the system disk starts synchronizing (rather than waiting for it to finish), the write cache can return to the enabled state. This greatly shortens the time the write cache stays disabled, which is exactly the behaviour we want.

So perhaps this is a CX600 bug that has been fixed on the CX500? Or can the problem be avoided simply by upgrading the microcode?

Below are some hot-spare-related excerpts from EMC's manuals:

Hot spare - A single global spare disk that serves as a temporary replacement for a failed disk in a RAID 5, 3, 1, or 1/0 LUN. Data from the failed disk is reconstructed automatically on the hot spare. It is reconstructed from the parity data or mirrored data on the working disks in the LUN; therefore, the data on the LUN is always accessible. A hot spare LUN cannot belong to a storage group.
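The parity-based reconstruction described above can be sketched at the byte level. This is plain XOR arithmetic for illustration, not CLARiiON internals:

```python
# Sketch of how RAID 5 parity lets a hot spare be rebuilt: the missing
# disk's stripe unit is the XOR of the surviving units plus parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xAA\x55"
parity = xor_blocks([d0, d1, d2])          # computed when the stripe is written
rebuilt_d1 = xor_blocks([d0, d2, parity])  # reconstruct the failed disk's data
print(rebuilt_d1 == d1)  # True
```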

 

 

RAID type    Number of disks you can use
RAID 5       3 - 16
RAID 3       5 or 9 (CX-series)
RAID 1/0     2, 4, 6, 8, 10, 12, 14, 16
RAID 1       2
RAID 0       3 - 16
Disk         1
Hot spare    1

 

Note: If you have LUNs consisting of FC drives, allocate an FC drive as a hot spare. If you have LUNs consisting of ATA drives, allocate an ATA drive as a hot spare.

 

 

Rebuild priority - The rebuild priority is the relative importance of reconstructing data on either a hot spare or a new disk that replaces a failed disk in a LUN. It determines the amount of resources the SP devotes to rebuilding instead of to normal I/O activity. Table 8-3 lists and describes the rebuild time associated with each rebuild value.

 

Value     Target rebuild time in hours
ASAP      0 (as quickly as possible; this is the default)
HIGH      6
MEDIUM    12
LOW       18

 

 

The rebuild priorities correspond to the target times listed above. The storage system attempts to rebuild the LUN in the target time or less. The actual time to rebuild the LUN depends on the I/O workload, the LUN size, and the LUN RAID type. For a RAID group with multiple LUNs, the highest priority specified for any LUN in the group is used for all LUNs in the group.
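As a sanity check on these target times, here is a short sketch of the sustained rate a rebuild would need for a given disk size; the arithmetic is illustrative only and ignores workload and RAID-type effects:

```python
# Back-of-the-envelope: sustained rebuild rate needed to hit a target time.

def required_mb_per_s(disk_gb, target_hours):
    """MB/s needed to rewrite disk_gb gigabytes within target_hours."""
    return disk_gb * 1024 / (target_hours * 3600)

# A 73 GB disk rebuilt at MEDIUM priority (12 h target):
print(round(required_mb_per_s(73, 12), 2))  # 1.73 MB/s
```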

Rebuilding a RAID 5, 3, 1, or 1/0 LUN

You can monitor the rebuilding of a new disk from the General tab of its Disk Properties dialog box (page 14-15).

A new disk module’s state changes as follows:

1. Powering up - The disk is powering up.

2. Rebuilding - The storage system is reconstructing the data on the new disk from the information on the other disks in the LUN. If the disk is the replacement for a hot spare that is being integrated into a redundant LUN, the state is Equalizing instead of  Rebuilding. In this situation, the storage system is simply copying the data from the hot spare onto the new disk.

3. Enabled - The disk is bound and assigned to the SP being used as the communication channel to the enclosure.

 

A hot spare’s state changes as follows:

1. Rebuilding - The SP is rebuilding the data on the hot spare.

2. Enabled - The hot spare is fully integrated into the LUN, or the failed disk has been replaced with a new disk and the SP is copying the data from the hot spare onto the new disk.

3. Ready - The copy is complete. The LUN consists of the disks in the original slots and the hot spare is on standby.
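The three hot-spare states quoted above can be modeled as a tiny state machine; the event names here are invented for illustration and do not come from the manual:

```python
# The hot spare lifecycle (Rebuilding -> Enabled -> Ready) as a toy
# state machine. Event names are ad hoc labels for the manual's steps.

HOT_SPARE_TRANSITIONS = {
    ("Ready", "lun_disk_fails"): "Rebuilding",   # SP rebuilds data on the spare
    ("Rebuilding", "rebuild_done"): "Enabled",   # spare fully integrated in LUN
    ("Enabled", "copy_back_done"): "Ready",      # data copied to new disk, standby
}

def next_state(state, event):
    """Follow a transition; unknown events leave the state unchanged."""
    return HOT_SPARE_TRANSITIONS.get((state, event), state)

s = "Ready"
for ev in ("lun_disk_fails", "rebuild_done", "copy_back_done"):
    s = next_state(s, ev)
print(s)  # "Ready"
```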

 

Rebuilding occurs at the same time as user I/O. The rebuild priority for the LUN determines the duration of the rebuild process and the amount of SP resources dedicated to rebuilding. A High or ASAP (as soon as possible) rebuild priority consumes many resources and may significantly degrade performance. A Low rebuild priority consumes fewer resources with less effect on performance. You can determine the rebuild priority for a LUN from the General tab of its LUN Properties dialog box (page 14-14).

 

Failed vault disk with storage-system write caching enabled

If you are using write caching, the storage system uses the disks listed in Table 14-3 for its cache vault. If one of these disks fails, the storage system dumps its write cache image to the remaining disks in the vault; then it writes all dirty (modified) pages to disk and disables write caching.

Storage-system write caching remains disabled until a replacement disk is inserted and the storage system rebuilds the LUN with the replacement disk in it. You can determine whether storage-system write caching is enabled or disabled from the Cache tab of its Properties dialog box (page 14-14).

 

Storage-system type      Cache vault disks
CX3-series, CX-series    0-0 through 0-4

 "What is the High Availability Cache Vault (HACV) setting and what is the risk of setting it on or off?"

    
ID: emc126011
Date Created: 01/16/2006
Last Modified: 05/10/2007
Status: Approved
Audience: Customer
   
Knowledgebase Solution  
  

Question:  What is the High Availability Cache Vault (HACV) setting and what is the risk of setting it on or off?
Question:  Purpose of High Availability Cache Vault (HACV) on a CLARiiON CX- and DL-Series array
Environment:  Product: CLARiiON CX200
Environment:  Product: CLARiiON CX300
Environment:  Product: CLARiiON CX300i
Environment:  Product: CLARiiON CX400
Environment:  Product: CLARiiON CX500
Environment:  Product: CLARiiON CX500i
Environment:  Product: CLARiiON CX600
Environment:  Product: CLARiiON CX700
Environment:  Product: CLARiiON DL300
Environment:  Product: CLARiiON DL310
Environment:  Product: CLARiiON DL700
Environment:  Product: CLARiiON DL710
Environment:  Product: CLARiiON CX3-10c
Environment:  Product: CLARiiON CX3-20
Environment:  Product: CLARiiON CX3-20c
Environment:  Product: CLARiiON CX3-20F
Environment:  Product: CLARiiON CX3-40
Environment:  Product: CLARiiON CX3-40c
Environment:  Product: CLARiiON CX3-40F
Environment:  Product: CLARiiON CX3-80
Problem:  Does the HA Cache Vault prevent write cache from disabling in case another critical component fails?
Problem:  What does the HA Cache Vault check box in Navisphere Manager do?
Problem:  What events does HA Cache Vault protect the write cache from?
Fix:  If you enable the HA cache vault (HACV), a single vault drive failure will cause the write cache to become disabled, thus reducing the risk of losing data in the event of a second drive failing. If you disable the HACV, a single vault drive failure does not disable the write cache, leaving data at risk if a second drive fails. When you disable the HACV, you will receive a warning message stating that this operation will allow write caching to continue even if one of the cache vault drives fails. If there is already a failure on one of the cache vault drives, this operation will not re-enable the write cache. Cache will not re-enable in the event of an SP reboot until the fault condition is corrected.

The following table describes the consequences of having HACV enabled or disabled when a problem occurs.

                                 HACV Matrix

                                  HACV enabled             HACV disabled
Problem                           Cache state   Data loss  Cache state   Data loss
                                  after failure            after failure

No disk failures                  Enabled       No         Enabled       No
Above and SP panic or reboot      Enabled       No         Enabled       No
Above and double SP panic         Disabled      Yes        Disabled      Yes
Above and array power cycles      Enabled       No         Enabled       No

Single vault disk fails           Disabled      No         Enabled       No
Above and SP panic or reboot      Disabled      No         Disabled      No
Above and double SP panic         Disabled      No         Disabled      Yes
Above and array power cycles      Disabled      No         Disabled      No

Second vault disk fails           Disabled      User LUNs  Disabled      User LUNs
Above and SP panic or reboot      Disabled      No         Disabled      No
Above and double SP panic         Disabled      No         Disabled      Possible *
Above and array power cycles      Disabled      No         Disabled      Possible *




* When a second vault disk fails, cache will disable and begin to de-stage because this takes more time than a dump of cache memory to the vault. There is a window of vulnerability for LUNs to end up with dirty cache.
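The key rows of the matrix above can be encoded as a lookup table for illustration; the key strings and value labels here are ad hoc, not an EMC data structure:

```python
# Part of the HACV matrix as a lookup table.
# Key: (problem, hacv_enabled) -> (cache state after failure, data loss)

HACV_MATRIX = {
    ("single vault disk fails", True):  ("Disabled", "No"),
    ("single vault disk fails", False): ("Enabled",  "No"),
    ("single vault disk fails + double SP panic", True):  ("Disabled", "No"),
    ("single vault disk fails + double SP panic", False): ("Disabled", "Yes"),
}

def cache_outcome(problem, hacv_enabled):
    """Look up cache state and data-loss risk for a failure scenario."""
    return HACV_MATRIX[(problem, hacv_enabled)]

# With HACV disabled, a single vault disk failure leaves the cache enabled...
print(cache_outcome("single vault disk fails", False))
# ...but a subsequent double SP panic then risks data loss.
print(cache_outcome("single vault disk fails + double SP panic", False))
```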

Notes:  The following issues will cause the cache to become disabled regardless of the HACV setting:
AC power loss
SP failure
Fan failure
Power supply failure
SPS failure
User-induced cache disable
Insufficient number of available cache pages
Over-temperature

Notes:  HA Cache Vault is defined in Navisphere Manager Help as:
"HA Cache Vault Note:

Supported only on CX-Series storage systems. Determines the availability of storage-system write caching when a single drive in the cache vault fails. When the check box is enabled (default), write caching is disabled if a single vault disk fails. When the check box is cleared, write caching is not disabled if a single disk fails.

Important:  Disabling the HA Cache Vault check box puts the data at risk if another cache vault disk should fail."
 
