Introductory Storage Articles (2): RAID - lxhhust - ChinaUnix Blog



Category: LINUX

2010-12-14 17:08:33

Introductory Storage Articles (Part 2)

Title: RAID: High-Performance, Reliable Secondary Storage

Source: ACM Computing Surveys

Authors: Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson

Affiliation:

Abstract

Striping across multiple disks improves performance.

Redundancy improves reliability.

RAID (Redundant Arrays of Inexpensive Disks)

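1. Introduction

Improvements in disk performance have lagged behind improvements in microprocessor technology, and disk arrays are one way to close the gap. A plain disk array, however, is highly vulnerable to disk failures, which is why RAID emerged.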

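2. Background

Disk terminology: platters, arms, heads, actuator, sectors, tracks, cylinders.

disk service time = seek time + rotational latency + data transfer time

Data transfer time is a function of the platter's rotation speed, the density of the magnetic media, and the radial distance of the head from the center of the platter.

head positioning time = seek time + rotational latency

In general, head positioning time accounts for a larger share of the service time than data transfer time.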
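To make the service-time breakdown concrete, here is a minimal sketch in Python. It assumes the average rotational latency is half a revolution, and the drive parameters (10 ms seek, 6000 RPM, 4 MB/s transfer rate, 4 KB request) are made-up illustrative values rather than figures from the paper.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency, assumed to be half a revolution (in ms)."""
    return 0.5 * 60_000.0 / rpm


def service_time_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s):
    """Return (positioning, transfer, total) time in milliseconds."""
    positioning = seek_ms + rotational_latency_ms(rpm)
    transfer = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return positioning, transfer, positioning + transfer


# Hypothetical drive and request, chosen only for illustration.
pos, xfer, total = service_time_ms(
    seek_ms=10.0, rpm=6000, request_bytes=4096, transfer_mb_per_s=4.0
)
print(f"positioning={pos:.3f} ms, transfer={xfer:.3f} ms, total={total:.3f} ms")
```

With these numbers, positioning dominates the cost of a small request, which is the point the notes make about head positioning time.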

Data Paths

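Take a read as an example.

(1) On the platter surface, information is recorded as changes in magnetic polarity. These "flux reversals" are converted into digital pulses by the lowest-level read electronics.

(2) ST506/412 is a standard that defines an interface to the disk system at this low level.

(3) The pulses are decoded to separate the data bits from the timing-related flux reversals.

(4) The bits are aligned into bytes and error-correcting codes are applied. Data is then served to higher levels in blocks over a peripheral bus interface such as SCSI. SCSI and IPI-3 also include a mapping layer that translates logical block numbers into physical cylinder, track, and sector.

(5) A string is a set of disks that share the same data path.

(6) In the paper's figure, each horizontal line represents a standard interface.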

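3. Disk Array Basics

Data Striping and Redundancy

Data striping improves performance.

Redundancy improves reliability.

Parallelism comes in two forms:

(1) Multiple independent requests can be serviced in parallel by separate disks, which reduces I/O queueing time.

(2) A single multi-block request can be serviced by several disks working together, which raises the effective transfer rate of that request.

Fine-grained data interleaving means that every I/O request, regardless of size, accesses all the disks. This raises the data transfer rate for every request, but only one logical I/O request can be in service at any given time, and all the disks must waste time positioning for every request.

Coarse-grained data interleaving means that a small I/O request accesses only a few disks, while a large request can still access all of them. Several small requests can therefore be serviced at the same time, and large requests still get the full transfer rate of the array.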
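The difference between the two granularities can be sketched in a few lines of Python. The function below assumes simple round-robin striping, where striping unit u lives on disk u mod N; the function name, the 4-disk array, and the unit sizes are illustrative assumptions.

```python
def disks_touched(offset, length, n_disks, unit_bytes):
    """Set of disks a request touches under round-robin striping.

    offset and length are in bytes; unit_bytes is the striping unit.
    """
    first_unit = offset // unit_bytes
    last_unit = (offset + length - 1) // unit_bytes
    if last_unit - first_unit + 1 >= n_disks:
        return set(range(n_disks))  # the request spans every disk
    return {unit % n_disks for unit in range(first_unit, last_unit + 1)}


# Fine-grained (1-byte unit): even a 512-byte request uses all four disks.
fine = disks_touched(0, 512, n_disks=4, unit_bytes=1)
# Coarse-grained (4 KB unit): the same small request uses a single disk,
# so other small requests can proceed on the remaining disks.
coarse_small = disks_touched(8192, 512, n_disks=4, unit_bytes=4096)
# A large request still spreads across the whole array.
coarse_large = disks_touched(0, 65536, n_disks=4, unit_bytes=4096)
print(fine, coarse_small, coarse_large)
```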

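Redundancy raises two questions:

(1) How to compute the redundant information

Parity is used most often, though some schemes use Hamming codes or Reed-Solomon codes.

(2) How to distribute the redundant information

One approach concentrates the redundant information on a small number of disks; the other distributes it uniformly across all the disks.

The second approach avoids hot spots and keeps the load balanced.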
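Since parity is the most common choice, a small Python sketch may help: the parity block is the bytewise XOR of the data blocks, and any single lost block is the XOR of the surviving blocks and the parity. The helper name and the sample bytes are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk (sample values only).
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print(parity.hex(), rebuilt == data[1])
```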

Basic RAID Organizations

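Non-Redundant (RAID Level 0)

Lowest cost, because it employs no redundancy.

It has the best write performance (it never needs to update redundant information) but not the best read performance: RAID 1 can do better, because a mirrored array can direct a read to the disk with the shortest seek and rotational delay. A single disk failure causes data loss.

Mainly used in supercomputing environments, where performance and capacity matter more than reliability.

Mirrored (RAID Level 1)

With mirroring, a read can be served by whichever disk has the shortest queueing, seek, and rotational delay.

Mainly used in database applications, where availability and transaction rate are more important than storage efficiency.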

Memory-Style ECC (RAID Level 2)

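Uses Hamming codes for the redundancy and so needs fewer redundant disks than RAID 1.

Bit-Interleaved Parity (RAID Level 3)

Uses a single parity disk; data is interleaved bit-wise.

Each read request accesses all the data disks, and each write request accesses all the data disks plus the parity disk. Only one request can therefore be serviced at a time.

Block-Interleaved Parity (RAID Level 4)

Data is interleaved in blocks (the striping unit).

A read request smaller than the striping unit accesses only a single data disk.

A write request smaller than the striping unit must update both the data disk and the parity disk.

A small write therefore costs four disk I/Os: read the old data, read the old parity, write the new data, and write the new parity.

Every write has to update the parity disk, which makes the parity disk a bottleneck.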
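The four I/Os of a small write can be traced with a toy model in Python. This is a sketch, not a real implementation: the class name Raid4, the in-memory "disks", and the I/O counter are all assumptions made for illustration. The new parity is computed as old parity XOR old data XOR new data.

```python
class Raid4:
    """Toy block-interleaved array: the last disk holds parity."""

    def __init__(self, n_data_disks, n_stripes, block_size=4):
        self.disks = [
            [bytes(block_size) for _ in range(n_stripes)]
            for _ in range(n_data_disks + 1)
        ]
        self.parity_disk = n_data_disks
        self.io_count = 0  # counts simulated disk accesses

    def _read(self, disk, stripe):
        self.io_count += 1
        return self.disks[disk][stripe]

    def _write(self, disk, stripe, block):
        self.io_count += 1
        self.disks[disk][stripe] = block

    def small_write(self, disk, stripe, new_data):
        """Read-modify-write: two reads followed by two writes."""
        old_data = self._read(disk, stripe)
        old_parity = self._read(self.parity_disk, stripe)
        new_parity = bytes(
            p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data)
        )
        self._write(disk, stripe, new_data)
        self._write(self.parity_disk, stripe, new_parity)


array = Raid4(n_data_disks=3, n_stripes=2)
array.small_write(0, 0, b"\x01\x02\x03\x04")
array.small_write(2, 0, b"\x10\x20\x30\x40")
print(array.io_count, array.disks[array.parity_disk][0].hex())
```

Both writes land on different data disks but on the same parity disk, which is exactly where the bottleneck comes from.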

Block-interleaved Distributed-Parity (RAID Level 5)

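Distributing the parity blocks across all the disks removes the parity-disk bottleneck. A further benefit is that every disk takes part in servicing reads. RAID 5 has the best small-read, large-read, and large-write performance of any redundant disk array. Small writes, however, are less efficient than with mirroring because they require a read-modify-write.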

A useful property of the left-symmetric parity distribution is that whenever you traverse the striping units sequentially, you will access each disk once before accessing any disk twice. This property reduces disk conflicts when servicing large requests.
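The property can be checked with a short Python sketch. It assumes the usual definition of the left-symmetric layout: parity starts on the last disk and moves one disk to the left with each stripe, and the data units of a stripe start on the disk just after the parity and wrap around. The function names are illustrative.

```python
def parity_disk(stripe, n_disks):
    """Disk holding the parity unit of a stripe (rotates right to left)."""
    return (n_disks - 1 - stripe) % n_disks


def data_disk(unit, n_disks):
    """Return (stripe, disk) for a sequentially numbered data unit."""
    stripe, index = divmod(unit, n_disks - 1)
    disk = (parity_disk(stripe, n_disks) + 1 + index) % n_disks
    return stripe, disk


# Walk the first ten striping units of a 5-disk array in order.
visited = [data_disk(unit, 5)[1] for unit in range(10)]
print(visited)
```

Under this placement, data unit u always lands on disk u mod N, so a sequential scan visits every disk once before returning to any of them.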

P+Q Redundancy (RAID Level 6)

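Uses Reed-Solomon codes to protect against up to two disk failures with the minimum of two redundant disks.

A read-modify-write requires six disk accesses.

Performance and Cost Comparisons

Disk arrays are evaluated on three metrics: reliability, performance, and cost.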

Ground Rules and Observations

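Most secondary storage systems, and disk arrays in particular, are throughput oriented. We therefore care more about aggregate throughput than about, say, the response time of an individual request. This view has a technical basis: as asynchronous I/O, prefetching, read caching, and write buffering become widespread, throughput matters more and more.

In throughput-oriented systems, performance can grow linearly as components are added, so cost has to be part of the comparison. Performance should therefore be expressed per second per dollar rather than just per second.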

Comparisons

Reliability

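In RAID 5, data is lost when a second disk in the same parity group fails before the first has been repaired. The mean time until this happens is

MTTF(disk)^2 / (N × (G - 1) × MTTR(disk))

where N is the total number of disks, G is the number of disks in one parity group, MTTF(disk) is the mean time to failure of a single disk, and MTTR(disk) is the mean time to repair a single disk.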
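The reliability estimate is easy to turn into a quick calculation. The Python sketch below assumes the expression from the RAID survey, MTTF(disk)^2 / (N × (G - 1) × MTTR(disk)); the function name and the sample values (100 disks, parity groups of 16, a 200,000-hour disk MTTF, a 1-hour repair time) are illustrative assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def raid5_mttf_hours(mttf_disk_h, mttr_disk_h, n_disks, group_size):
    """Mean time until two disks of one parity group are down together."""
    return mttf_disk_h ** 2 / (n_disks * (group_size - 1) * mttr_disk_h)


hours = raid5_mttf_hours(
    mttf_disk_h=200_000, mttr_disk_h=1, n_disks=100, group_size=16
)
years = hours / HOURS_PER_YEAR
print(f"{hours:.0f} hours, about {years:.0f} years")
```

Note how strongly the result depends on the repair time: doubling MTTR(disk) halves the estimate.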

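System Crashes and Parity Inconsistency

A system crash can interrupt a write and leave the array inconsistent: for example, the data may already have been updated on disk while the parity has not yet been written.

For a block-interleaved disk array, enough information must be logged in non-volatile storage before each write and kept until the write completes. This logging can be implemented in hardware or in software.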

Uncorrectable Bit-Errors

Disk manufacturers generally agree that reading a disk is very unlikely to cause permanent errors. Most uncorrectable errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages.

That is, reading rarely damages a disk; most of these errors come from bad writes or from aging media.

One good way to limit the impact is to predict when a disk is about to run into such problems. The VAXsimPLUS tool does this by monitoring the warnings that disks report.

Correlated Disk Failures

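Simple disk array models assume that disk failures are independent, but in practice environmental and manufacturing factors frequently cause correlated disk failures.

Disks are generally more likely to fail either very early or very late in their lifetimes. Early failures are frequently caused by transient defects which may not have been detected during the manufacturer’s burn-in process; late failures occur when a disk wears out.

That is, failures cluster at the two ends of a disk's life: early ones come from defects that were missed before shipping, late ones from wear-out.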

Implementation Considerations

Avoiding Stale Data

When a disk fails, the logical sectors corresponding to the failed disk must be marked invalid before any request that would normally access to the failed disk can be serviced. This invalid mark prevents users from reading corrupted data on the failed disk.

That is, when a disk fails, its logical sectors must be marked invalid so that no data is read from the failed disk.

When an invalid logical sector is reconstructed to a spare disk, the logical sector must be marked valid before any write request that would normally write to the failed disk can be serviced. This ensures that ensuing writes update the reconstructed data on the spare disk.

That is, once a failed sector has been rebuilt on the new disk, it must be marked valid again.

The valid/invalid state information can be maintained as a bit-vector either on a separate device or by reserving a small amount of storage on the disks currently configured into the disk array.

That is, the valid/invalid information can be kept on a separate device or in a small reserved area on the array's own disks.

Regenerating Parity after a System Crash

System crashes can result in inconsistent parity by interrupting write operations.

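That is, a crash that interrupts a write can leave the parity inconsistent with the data.

• Before servicing any write request, the corresponding parity sectors must be marked inconsistent.

That is, before a write request is executed, all the parity sectors it affects must be marked inconsistent.

• When bringing a system up from a system crash, all inconsistent parity sectors must be regenerated.

That is, after the system restarts, every parity sector still marked inconsistent must be regenerated.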
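The two rules above can be sketched as follows in Python. Everything here is a simplification for illustration: blocks are plain integers, the set `inconsistent` stands in for a record kept in non-volatile storage, and the `crash_before_parity` flag simulates a crash between the data write and the parity write.

```python
from functools import reduce
from operator import xor


def stripe_parity(blocks):
    """Parity of one stripe: XOR of its data blocks."""
    return reduce(xor, blocks, 0)


def write_block(stripes, parity, inconsistent, stripe, index, new_block,
                crash_before_parity=False):
    inconsistent.add(stripe)            # rule 1: mark before writing
    stripes[stripe][index] = new_block
    if crash_before_parity:
        return                          # simulated crash: parity is stale
    parity[stripe] = stripe_parity(stripes[stripe])
    inconsistent.discard(stripe)        # data and parity agree again


def recover(stripes, parity, inconsistent):
    """Rule 2: regenerate every parity block still marked inconsistent."""
    for stripe in sorted(inconsistent):
        parity[stripe] = stripe_parity(stripes[stripe])
    inconsistent.clear()


stripes = [[1, 2, 3], [4, 5, 6]]
parity = [stripe_parity(s) for s in stripes]
inconsistent = set()

write_block(stripes, parity, inconsistent, 0, 1, 9, crash_before_parity=True)
stale = parity[0] != stripe_parity(stripes[0])   # parity no longer matches
marked = set(inconsistent)
recover(stripes, parity, inconsistent)
print(stale, marked, parity)
```

Only the stripes recorded as inconsistent are recomputed, so recovery does not have to scan the whole array.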

Operating with a Failed Disk

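First approach: demand reconstruction

When the array has a spare disk, an access to a failed sector immediately triggers reconstruction onto that spare.

Second approach: parity sparing

The parity space is used as the spare: the lost data is reconstructed in place of the parity. When a new device is added, the data is copied to the new disk and the parity is regenerated.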

Orthogonal RAID

？

4. Advanced Topics

Improving Small Write Performance for RAID Level 5

(1)Buffering and Caching

Write buffering, also called asynchronous writes, acknowledges a user’s write before the write goes to disk.

That is, the write is acknowledged before it reaches the disk. This reduces response time but does not improve throughput, and it can leave data inconsistent. Under heavy load the buffer fills up quickly and the benefit fades.

By writing larger units, small writes can be turned into full stripe writes, thus eliminating altogether the Achilles heel of RAID level 5 workloads [Menon93a]. Write buffering also allows better disk scheduling by writing multiple blocks at one time.

That is, small writes are coalesced, and writing several blocks at once allows better disk scheduling.

Read caching is normally used in disk systems to improve the response time and throughput when reading data. In a RAID level 5 disk array, however, it can serve a secondary purpose. If the old data required for computing the new parity is in the cache, read caching reduces the number of disk accesses required for small writes from four to three.

That is, read caching lowers response time and raises throughput. In RAID 5, if the old data needed to compute the new parity is already in the cache, a small write needs one disk access fewer.

By also caching recently written parity, the read of the old parity can sometimes be eliminated, further reducing the number of disk accesses for small writes from three to two.

That is, caching recently written parity can reduce the number of disk accesses further.
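The access counts quoted above follow from simple bookkeeping, sketched here in Python. The function name and its two boolean parameters are illustrative assumptions; the two writes (new data, new parity) always happen, and each cache hit removes one read.

```python
def small_write_disk_accesses(old_data_cached, old_parity_cached):
    """Disk accesses for one RAID 5 small write, given cache hits."""
    accesses = 2                 # write new data + write new parity
    if not old_data_cached:
        accesses += 1            # must read the old data from disk
    if not old_parity_cached:
        accesses += 1            # must read the old parity from disk
    return accesses


for data_hit, parity_hit in [(False, False), (True, False), (True, True)]:
    print(data_hit, parity_hit,
          small_write_disk_accesses(data_hit, parity_hit))
```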

Floating Parity

Floating parity clusters parity into cylinders, each containing a track of free blocks. Whenever a parity block needs to be updated, the new parity block can be written on the rotationally nearest, unallocated block following the old parity block.

？？

To efficiently implement floating parity, directories for the locations of unallocated blocks and parity blocks must be stored in primary memory.

？？

Parity Logging

Parity logging reduces the overhead for small writes by delaying the read of the old parity and the write of the new parity. Instead of immediately updating the parity, an update image, the difference between the old and new parity, is temporarily written to a log. Delaying the update allows the parity to be grouped together in large contiguous blocks that can be updated more efficiently.

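That is, the read of the old parity and the write of the new parity are deferred, so the parity updates can be grouped and applied efficiently.

This delay takes place in two parts. First, the parity update image is stored temporarily in non-volatile memory. When this memory, which could be a few tens of KB, fills up, the parity update image is written to the log. When the log fills up, the parity update image is read into memory along with all the old parity and is applied to the old parity. The resulting new parity is then written to disk.

That is, the parity update images are first held in non-volatile memory; when that fills up they are written to the log; and when the log fills up they are applied to the old parity and the new parity is written to disk in one pass.

Parity logging reduces the small write overhead from four disk accesses to a little more than two disk accesses.

That is, the scheme cuts the cost of a small write from four disk accesses to slightly more than two.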
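A minimal Python sketch of the two-stage delay follows. It is a toy model under stated assumptions: blocks are integers, the class name ParityLog and the slot counts are invented for illustration, and the update image is computed as old data XOR new data, which equals the difference (XOR) between the old and new parity.

```python
class ParityLog:
    """Toy parity logging: buffer -> log -> batched parity update."""

    def __init__(self, parity, buffer_slots, log_slots):
        self.parity = parity          # parity blocks "on disk"
        self.nv_buffer = []           # stands in for non-volatile memory
        self.log = []                 # stands in for the on-disk log
        self.buffer_slots = buffer_slots
        self.log_slots = log_slots

    def small_write(self, data, stripe, index, new_block):
        # One read (old data) and one write (new data) happen right away.
        update_image = data[stripe][index] ^ new_block
        data[stripe][index] = new_block
        self.nv_buffer.append((stripe, update_image))
        if len(self.nv_buffer) >= self.buffer_slots:
            self.log.extend(self.nv_buffer)   # one large sequential write
            self.nv_buffer.clear()
        if len(self.log) >= self.log_slots:
            self.apply_log()

    def apply_log(self):
        """Apply all logged update images to the old parity in one pass."""
        for stripe, image in self.log:
            self.parity[stripe] ^= image
        self.log.clear()


data = [[1, 2], [3, 4]]
parity = [1 ^ 2, 3 ^ 4]
plog = ParityLog(parity, buffer_slots=2, log_slots=4)
plog.small_write(data, 0, 0, 5)
plog.small_write(data, 1, 1, 6)
parity_while_logged = list(plog.parity)   # parity not yet updated
plog.small_write(data, 0, 1, 8)
plog.small_write(data, 1, 0, 9)           # log is full: updates applied
print(parity_while_logged, plog.parity)
```

Each small write pays for one data read and one data write immediately; the parity work is amortized over large batched transfers, which is where "a little more than two" accesses comes from.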

Declustered Parity

???

Exploiting On-Line Spare Disks

As Figure 9 illustrates, distributed sparing distributes the capacity of a spare disk across all the disks in the disk array [Menon91].

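That is, the capacity of the spare disk is spread across all the disks in the array.

Parity sparing is similar to distributed sparing except that it uses the spare capacity to store parity information [Reddy91, Chandy93].

That is, unlike distributed sparing, it uses the spare capacity to hold additional parity information.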

Data Striping in Disk Arrays

Two factors must be balanced:

(1) Maximize the amount of useful data that each disk transfers with each logical I/O. Typically, a disk must spend some time seeking and rotating between each logical I/O that it services. This positioning time represents wasted work: no data is transferred during this time. It is hence beneficial to maximize the amount of useful work done in between these positioning times.

That is, maximize the data moved per logical I/O, because positioning is expensive.

(2) Utilize all disks.

That is, keep every disk busy so that the load is balanced.
