
storage R&D guy.










Category: Servers and Storage

2015-07-14 15:27:09


The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is used along with an instance of the Hadoop DistributedFileSystem (DFS). A file stored in the DRFS (the source file) is divided into stripes consisting of several blocks. For each stripe, a number of parity blocks are stored in the parity file corresponding to this source file. This makes it possible to recompute blocks in the source file or parity file when they are lost or corrupted.

The main benefit of the DRFS is the increased protection against data corruption it provides. Because of this increased protection, replication levels can be lowered while maintaining the same availability guarantees, which results in significant storage space savings.


HDFS RAID is a module that sits on top of HDFS, much like an application. It provides a DRFS instance that runs on top of HDFS.



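A discussion found online about HDFS storage and why introducing RAID is necessary (the shaded text is quoted):


The scale in the figure mentioned above is PB-level data.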

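HDFS-RAID borrows the striping concept from RAID: every X blocks of a file form one stripe that is encoded as a unit, where X is the stripe length. For example, if a file /foo/bar has 16 blocks and the stripe length is 10, the file has 2 stripes. Each stripe is an independent encoding unit, and both encoding and decoding operate per stripe. In this example, blocks 1-10 are encoded as stripe 1 and blocks 11-16 as stripe 2; generating the parity for stripe 1 does not involve any block of stripe 2, and vice versa. (This is erasure coding; I will study it more closely later.)
HDFS RAID currently uses two coding schemes: XOR and RS (Reed-Solomon).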
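
The stripe arithmetic in the /foo/bar example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `split_into_stripes` is hypothetical and is not part of HDFS RAID.

```python
def split_into_stripes(num_blocks, stripe_length):
    """Group 1-based block numbers into consecutive stripes of at most stripe_length."""
    if stripe_length <= 0:
        raise ValueError("stripe_length must be positive")
    blocks = list(range(1, num_blocks + 1))
    return [blocks[i:i + stripe_length] for i in range(0, num_blocks, stripe_length)]


# /foo/bar from the text: 16 blocks, stripe length 10
stripes = split_into_stripes(16, 10)
for number, stripe in enumerate(stripes, start=1):
    print(f"stripe {number}: blocks {stripe[0]}-{stripe[-1]}")
```

The last stripe is simply shorter (blocks 11-16); it is still encoded on its own.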

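XOR coding is relatively simple, and its error-correction capability is weaker. It computes the parity with exclusive-or and generates only 1 parity block per stripe. For the /foo/bar file above, the two stripes yield two parity blocks, which together form the parity file /raidxor/foo/bar. With one replica per block, if a stripe loses or corrupts a single block, XOR can recover it; if more than one block is lost, it cannot.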
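
A minimal sketch of XOR parity on toy data, assuming all blocks in a stripe have the same length. The helper `xor_blocks` is a hypothetical name for illustration, not the HDFS RAID API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of three tiny "blocks" and its single parity block
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Lose the middle block: XOR of the survivors and the parity gives it back
survivors = [stripe[0], stripe[2], parity]
recovered = xor_blocks(survivors)
print(recovered == stripe[1])
```

If two blocks of the stripe were missing, the XOR of the remaining data would only yield the XOR of the two lost blocks, not either block itself, which is why a single parity block tolerates only one loss.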

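RS coding is more complex to implement, but its error-correction capability is stronger, and it is widely used commercially, for example in CDs, DVDs, and the WiMAX communication protocol. Its distinguishing feature is that the user can choose the parity length (parity len) to balance space savings against reliability. With parity len = 4, four parity blocks are generated for each stripe, and the parity blocks form the file /raidrs/foo/bar. It can tolerate the simultaneous loss of 4 blocks in the same stripe and still recover them. This is the coding scheme HDFS RAID mainly uses.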
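
To make the space/reliability trade-off concrete, here is a rough count of physical blocks for the /foo/bar example. The replication factors used below (3 copies without RAID, 1 copy of both source and parity blocks with RS) are assumptions chosen for illustration; the text does not state them, and the function name is hypothetical.

```python
import math


def physical_blocks(num_blocks, stripe_length, parity_length,
                    source_replication, parity_replication):
    """Count stored blocks: replicated source blocks plus replicated parity blocks."""
    num_stripes = math.ceil(num_blocks / stripe_length)
    parity_blocks = num_stripes * parity_length
    return num_blocks * source_replication + parity_blocks * parity_replication


# Assumed settings, for illustration only
plain = 16 * 3                              # 16 blocks, 3 replicas, no parity
raided = physical_blocks(16, 10, 4, 1, 1)   # stripe length 10, parity len 4
print(plain, raided)
```

Under these assumptions the file needs 2 stripes and therefore 8 parity blocks, so 24 blocks are stored instead of 48, while each stripe can still lose up to 4 blocks.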

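Note that HDFS RAID recommends spreading the blocks of one stripe (including its parity blocks) across different datanodes, so that the failure of a single datanode does not hinder recovery of those blocks. The reasoning is the same as for plain replication, where the three replicas of a block must not sit on the same datanode.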
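
The placement rule can be expressed as a simple check over one stripe. This is a sketch with made-up block and datanode names; `datanodes_with_multiple_blocks` is a hypothetical helper, not something HDFS RAID provides under that name.

```python
from collections import Counter


def datanodes_with_multiple_blocks(placement):
    """placement maps each block of ONE stripe (data or parity) to its datanode.

    Returns the datanodes that hold more than one block of the stripe.
    """
    counts = Counter(placement.values())
    return sorted(node for node, count in counts.items() if count > 1)


# Hypothetical stripe: three data blocks and one parity block
placement = {"blk_1": "dn1", "blk_2": "dn2", "blk_3": "dn1", "parity_1": "dn3"}
print(datanodes_with_multiple_blocks(placement))
```

Here dn1 holds two blocks of the stripe, so losing dn1 costs the stripe two blocks at once, which a single XOR parity block could not repair.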

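The difference needs to be spelled out more carefully here: the original fault tolerance from keeping 3 copies of each block versus the erasure-code fault tolerance after adding RAID. When comparing how many failures each can tolerate, the failures run in different directions. Replication with x copies is vertical: a given block can lose x-1 of its copies. Erasure coding is horizontal: among the x blocks of a stripe that has y parity blocks, up to y blocks of that stripe can be lost.
One question here: if plain replication is also viewed horizontally, can a file lose any number of blocks? It seems so: viewed horizontally, as long as 1 of the 3 copies of each block survives, the data of the whole file can still be read.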
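
The two directions can be written down as two small predicates. This is a sketch of the reasoning above rather than anything taken from HDFS RAID; both function names are hypothetical.

```python
def readable_with_replication(surviving_copies_per_block):
    """Vertical view: the file is readable if every block keeps at least one copy."""
    return all(copies >= 1 for copies in surviving_copies_per_block)


def recoverable_with_erasure_code(lost_blocks_per_stripe, parity_length):
    """Horizontal view: every stripe may lose at most parity_length blocks."""
    return all(lost <= parity_length for lost in lost_blocks_per_stripe)


# Replication: many copies are gone, but each block still has one survivor
print(readable_with_replication([1, 1, 3, 2]))
# Replication: one block has lost all of its copies
print(readable_with_replication([3, 0, 3, 3]))
# Erasure code with 4 parity blocks: stripe 1 lost 4 blocks, stripe 2 lost 1
print(recoverable_with_erasure_code([4, 1], parity_length=4))
# Erasure code with 4 parity blocks: stripe 1 lost 5 blocks
print(recoverable_with_erasure_code([5, 0], parity_length=4))
```

So with replication the total number of lost copies does not matter, only whether some block has lost all of them; with erasure coding what matters is the number of lost blocks within each stripe.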

Architecture and implementation

HDFS Raid consists of several software components:

  • the DRFS client, which provides application access to the files in the DRFS and transparently recovers any corrupt or missing blocks encountered when reading a file,
  • the RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS,

  • the BlockFixer, which periodically recomputes blocks that have been lost or corrupted,

  • the RaidShell utility, which allows the administrator to manually trigger the recomputation of missing or corrupt blocks and to check for files that have become irrecoverably corrupted,

  • the ErasureCode, which provides the encoding and decoding of the bytes in blocks.

the DRFS client:

The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException (because the source file contains corrupt or missing blocks), the DRFS client catches these exceptions, locates the parity file for the current source file and recomputes the missing blocks before returning them to the application.

It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files, it does not insert these missing blocks back into the file system. Instead, it discards them once the application request has been fulfilled. The BlockFixer daemon and the RaidShell tool can be used to persistently fix bad blocks.

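In other words, it sits on top of the DFS client, intercepts all incoming requests (from the description above, these should be requests to read a file) and passes them to the underlying client. When the underlying client throws a checksum or block-missing exception, the DRFS client catches it, locates the parity data for the bad block, repairs the data, and only then returns it to the user.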
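
The wrapping behaviour can be sketched with a toy in-memory model. Everything here is hypothetical: `ToyDfsClient`, `ToyRaidClient` and `BlockMissingError` are stand-ins invented for the example, XOR is used as the code for simplicity, and block indexes are 0-based. It is not how the real Java client is written.

```python
from functools import reduce


class BlockMissingError(Exception):
    """Stand-in for the exception raised by the underlying client."""


class ToyDfsClient:
    """In-memory stand-in for the DFS: path -> list of blocks (None = missing)."""

    def __init__(self, files):
        self.files = files

    def num_blocks(self, path):
        return len(self.files[path])

    def read_block(self, path, index):
        data = self.files[path][index]
        if data is None:
            raise BlockMissingError(f"{path}: block {index}")
        return data


def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class ToyRaidClient:
    """Passes reads through; on a missing block, rebuilds it from the stripe."""

    def __init__(self, dfs, stripe_length, parity_prefix="/raidxor"):
        self.dfs = dfs
        self.stripe_length = stripe_length
        self.parity_prefix = parity_prefix

    def read_block(self, path, index):
        try:
            return self.dfs.read_block(path, index)
        except BlockMissingError:
            return self._recompute(path, index)

    def _recompute(self, path, index):
        stripe = index // self.stripe_length
        start = stripe * self.stripe_length
        end = min(start + self.stripe_length, self.dfs.num_blocks(path))
        survivors = [self.dfs.read_block(path, i)
                     for i in range(start, end) if i != index]
        survivors.append(self.dfs.read_block(self.parity_prefix + path, stripe))
        return xor_blocks(survivors)  # returned to the caller, never written back


files = {
    "/foo/bar": [b"ab", None, b"ef"],  # block 1 is missing
    "/raidxor/foo/bar": [xor_blocks([b"ab", b"cd"]), b"ef"],
}
client = ToyRaidClient(ToyDfsClient(files), stripe_length=2)
recovered = client.read_block("/foo/bar", 1)
print(recovered, files["/foo/bar"][1])
```

After the read, the stored file still has its missing block: the recomputed bytes are handed to the caller and then dropped, matching the note that persistent repair is left to other components.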



The RaidNode periodically scans all paths for which the configuration specifies that they should be stored in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks and selects those that have not been recently modified (by default, not within the last 24 hours). Once it has selected a source file, it iterates over all its stripes and creates the appropriate number of parity blocks for each stripe. The parity blocks are then concatenated together and stored as the parity file corresponding to this source file. Once the parity file has been created, the replication factor for the corresponding source file is lowered as specified in the configuration. The RaidNode also periodically deletes parity files that have become orphaned or outdated.
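
The selection rule described above (more than 2 blocks, not modified within the last 24 hours) can be sketched as a filter. The function name, the tuple layout and the sample paths are all hypothetical; only the two thresholds come from the text.

```python
def select_files_to_raid(files, now, quiet_period_secs=24 * 3600):
    """files: iterable of (path, num_blocks, mtime_secs).

    Keeps files with more than 2 blocks whose last modification is at least
    quiet_period_secs old.
    """
    return [path for path, num_blocks, mtime in files
            if num_blocks > 2 and now - mtime >= quiet_period_secs]


now = 1_000_000  # arbitrary timestamp in seconds, for the example
candidates = [
    ("/data/old_big", 16, now - 3 * 24 * 3600),   # large, untouched for 3 days
    ("/data/old_small", 2, now - 3 * 24 * 3600),  # only 2 blocks
    ("/data/fresh_big", 16, now - 3600),          # modified an hour ago
]
selected = select_files_to_raid(candidates, now)
print(selected)
```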

There are currently two implementations of the RaidNode:

  • LocalRaidNode, which computes parity blocks locally at the RaidNode. Since computing parity blocks is a computationally expensive task, the scalability of this approach is limited.

  • DistributedRaidNode, which dispatches MapReduce tasks to compute parity blocks.

