Category: Servers and Storage
2015-07-14 15:27:09
Original article: HDFS-RAID Basic Concepts Study Notes. Author: parrot18
The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is used along with an instance of the Hadoop DistributedFileSystem (DFS). A file stored in the DRFS (the source file) is divided into stripes consisting of several blocks. For each stripe, a number of parity blocks are stored in the parity file corresponding to this source file. This makes it possible to recompute blocks in the source file or parity file when they are lost or corrupted.
The main benefit of the DRFS is the increased protection against data corruption it provides. Because of this increased protection, replication levels can be lowered while maintaining the same availability guarantees, which results in significant storage space savings.
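My understanding:
HDFS RAID is a module that sits on top of HDFS, much like an application. It provides a DRFS instance that runs on top of HDFS.
Files stored in the DRFS are divided into stripes. (RAID techniques generally stripe the data, either vertically or horizontally, and pair it with parity data; the parity may live on one dedicated disk or be distributed across all disks. More detail is easy to find online, and the book 《大话存储》 is a good reference.) Each stripe has corresponding parity data, which is stored in the parity file. The parity file is what makes error correction possible. Simple odd/even parity tolerates only a single error.
The main purpose of the DRFS is to reduce the number of HDFS replicas, for example from 3 to 2, while still keeping the data reliable. This reduces the disk capacity needed for storage.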
The following discussion of HDFS storage, and of why introducing RAID is necessary, was found online (the shaded text is quoted):
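To make file storage more reliable, distributed file systems generally split each file into Blocks and store several replicas of every Block on different servers. The open-source distributed file system HDFS uses the same technique. This approach wastes a considerable amount of space, however: every Block of an HDFS file has three replicas. If a file is 120MB and the Block size is 64MB, the file has two Blocks with three replicas each, so a 120MB file consumes 360MB of HDFS storage (64MB*3 + 56MB*3), which is 3 times the size of the original file (300%). As an HDFS cluster keeps growing, more and more disks are needed to hold these Block replicas. Once an HDFS cluster reaches the scale shown in the figure below, probably any company would start considering whether another technique should be used to offset the space wasted by multiple replicas.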
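The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: the helper name `replicated_usage` is made up for illustration, and it assumes the last Block is not padded to the full Block size, which matches the 64MB*3 + 56MB*3 figure in the text.

```python
import math

MB = 1024 * 1024

def replicated_usage(file_size, block_size, replication):
    """Return (block_count, raw_bytes) for a file stored with plain replication.

    Hypothetical helper for illustration only; it is not part of any Hadoop API.
    """
    block_count = math.ceil(file_size / block_size)
    # The last Block only occupies as much space as it has data,
    # so raw usage is simply file size times the replica count.
    raw_bytes = file_size * replication
    return block_count, raw_bytes

blocks, raw = replicated_usage(120 * MB, 64 * MB, 3)
print(blocks, raw // MB)  # 2 Blocks, 360 MB of raw storage
```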
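XOR coding is relatively simple, and its error-correcting ability is correspondingly weaker. It uses the XOR operation to generate the parity, and each stripe produces only one parity Block. For the example file /foo/bar, which has two stripes, there are two parity Blocks, and together they form the parity file /raidxor/foo/bar. With one replica per Block, if a single Block in a stripe is lost or corrupted, it can be recovered through XOR; if more than one Block is lost, recovery is not possible.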
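A minimal sketch of the XOR scheme just described, working on small in-memory byte strings instead of real HDFS Blocks. The function names `xor_blocks` and `recover_missing` are invented for this example, and all Blocks in a stripe are assumed to have the same length.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_missing(stripe, parity):
    """Rebuild the single missing block (marked None) of a stripe."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block per stripe")
    survivors = [block for block in stripe if block is not None]
    # XOR of the parity with all surviving blocks yields the lost block.
    return xor_blocks(survivors + [parity])

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)            # one parity block per stripe
damaged = [stripe[0], None, stripe[2]]  # the second block is lost
rebuilt = recover_missing(damaged, parity)
```

Losing two blocks of the same stripe makes `recover_missing` raise, which mirrors the limit stated in the text.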
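RS (Reed-Solomon) coding is more complex to implement but has stronger error-correcting ability, and it is widely used commercially, for example in CDs, DVDs, and the WiMAX communication protocol. Its distinguishing feature is that the user can choose the parity length (parity len) to balance storage savings against reliability. With parity len = 4, four parity Blocks are generated for each stripe, and they form the file /raidrs/foo/bar. It can tolerate the simultaneous loss of up to 4 Blocks in the same stripe and still recover them. HDFS Raid mainly uses this encoding.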
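The trade-off between storage savings and reliability can be made concrete with a small calculation. This is a sketch under assumptions that are not in the text: the stripe length of 10 is a hypothetical value, source and parity Blocks are assumed to be kept at one replica each, and only full stripes are considered. The helper `effective_replication` is invented for illustration.

```python
def effective_replication(stripe_len, parity_len, source_repl=1, parity_repl=1):
    """Raw bytes stored per byte of user data, for full stripes.

    stripe_len is a hypothetical value here; the text only fixes parity_len.
    """
    stored = stripe_len * source_repl + parity_len * parity_repl
    return stored / stripe_len

rs = effective_replication(stripe_len=10, parity_len=4)   # tolerates 4 lost Blocks per stripe
xor = effective_replication(stripe_len=10, parity_len=1)  # tolerates 1 lost Block per stripe
print(f"RS: {rs:.1f}x, XOR: {xor:.1f}x, plain 3-way replication: 3.0x")
```

Under these assumptions a longer parity costs more space but survives more simultaneous losses, and both schemes stay well below the 3x cost of plain replication.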
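Note that HDFS Raid recommends placing the Blocks of a stripe (including its parity Blocks) on different datanodes, so that the failure of a single datanode does not prevent those Blocks from being recovered. The reasoning is the same as for redundant replicas: the three replicas of a Block must not be placed on the same datanode.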
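One way to picture this rule is a check that flags datanodes holding more than one Block of the same stripe. The sketch below is purely illustrative: `colocated_blocks`, the block names, and the datanode names are all made up, and it does not reflect how HDFS Raid actually places Blocks.

```python
from collections import defaultdict

def colocated_blocks(placement):
    """Given {block_id: datanode} for one stripe, return the datanodes
    that hold more than one Block of that stripe."""
    by_node = defaultdict(list)
    for block_id, node in placement.items():
        by_node[node].append(block_id)
    return {node: sorted(ids) for node, ids in by_node.items() if len(ids) > 1}

# Hypothetical stripe: two source Blocks and one parity Block.
good = {"src-0": "dn1", "src-1": "dn2", "parity-0": "dn3"}
bad = {"src-0": "dn1", "src-1": "dn1", "parity-0": "dn3"}

print(colocated_blocks(good))  # no datanode holds two Blocks of the stripe
print(colocated_blocks(bad))   # dn1 holds two, so losing dn1 loses two Blocks at once
```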
HDFS Raid consists of several software components:
the RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS,
the BlockFixer, which periodically recomputes blocks that have been lost or corrupted,
the RaidShell utility, which allows the administrator to manually trigger the recomputation of missing or corrupt blocks and to check for files that have become irrecoverably corrupted,
the ErasureCode, which provides the encoding and decoding of the bytes in blocks.
The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException (because the source file contains corrupt or missing blocks), the DRFS client catches these exceptions, locates the parity file for the current source file and recomputes the missing blocks before returning them to the application.
It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files, it does not insert these missing blocks back into the file system. Instead, it discards them once the application request has been fulfilled. The BlockFixer daemon and the RaidShell tool can be used to persistently fix bad blocks.
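The DRFS client sits on top of the DFS client, intercepts all incoming requests (judging from the text above, these should be requests to read a file) and forwards them to the underlying client. When the underlying client throws a checksum or block-missing exception, the DRFS client catches it, locates the parity data for the bad block, repairs the data, and only then returns the data to the user.
Note: after recomputing a damaged block, the DRFS returns the requested data to the user and then simply discards the recomputed block; it does not store the recovered block back into the file system. Permanently fixing bad blocks is the job of the BlockFixer and the RaidShell.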
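The read path described above can be sketched as a thin wrapper around an underlying client. Everything in this example is a stand-in written for illustration: `FakeDfsClient`, `DrfsClient`, and the two exception classes are hypothetical Python names rather than Hadoop classes, XOR is used as the simplest possible decoder, and only full stripes are handled. The parity path follows the /raidxor/foo/bar convention from the quoted text.

```python
from functools import reduce

class ChecksumError(Exception):
    """Stand-in for the checksum failure raised by the underlying DFS."""

class BlockMissingError(Exception):
    """Stand-in for the missing-block failure raised by the underlying DFS."""

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class FakeDfsClient:
    """Hypothetical in-memory DFS: {(path, block_index): bytes or None}."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, path, index):
        data = self.blocks.get((path, index))
        if data is None:
            raise BlockMissingError(f"{path} block {index}")
        return data

class DrfsClient:
    """Passes reads through and recomputes a block only when the read fails."""
    def __init__(self, dfs, stripe_len, parity_root="/raidxor"):
        self.dfs = dfs
        self.stripe_len = stripe_len
        self.parity_root = parity_root

    def read_block(self, path, index):
        try:
            return self.dfs.read_block(path, index)
        except (ChecksumError, BlockMissingError):
            # Recompute for this request only; nothing is written back.
            return self._recompute(path, index)

    def _recompute(self, path, index):
        stripe = index // self.stripe_len
        first = stripe * self.stripe_len
        pieces = [self.dfs.read_block(self.parity_root + path, stripe)]
        pieces += [self.dfs.read_block(path, i)
                   for i in range(first, first + self.stripe_len) if i != index]
        return xor_blocks(pieces)

store = {
    ("/foo/bar", 0): b"AAAA",
    ("/foo/bar", 1): None,  # this block is missing
    ("/raidxor/foo/bar", 0): xor_blocks([b"AAAA", b"BBBB"]),
}
client = DrfsClient(FakeDfsClient(store), stripe_len=2)
data = client.read_block("/foo/bar", 1)
```

After the call, `data` holds the recomputed bytes while the entry in `store` is still `None`, which is the behaviour the note describes: the repair is not persisted by the client.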
RaidNode
The RaidNode periodically scans all paths for which the configuration specifies that they should be stored in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks and selects those that have not been recently modified (default is within the last 24 hours). Once it has selected a source file, it iterates over all its stripes and creates the appropriate number of parity blocks for each stripe. The parity blocks are then concatenated together and stored as the parity file corresponding to this source file. Once the parity file has been created, the replication factor for the corresponding source file is lowered as specified in the configuration. The RaidNode also periodically deletes parity files that have become orphaned or outdated.
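The selection rule and the parity count can be sketched as follows. This is an illustration only: `FileInfo`, `select_candidates`, and `parity_block_count` are hypothetical names, the stripe length of 10 and parity length of 4 are example values, and the real RaidNode reads its thresholds from configuration.

```python
import math
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    block_count: int
    mtime: float  # last modification time, seconds since the epoch

def select_candidates(files, now, min_age_seconds=24 * 3600):
    """Keep files with more than 2 blocks that were not modified recently."""
    return [f for f in files
            if f.block_count > 2 and now - f.mtime >= min_age_seconds]

def parity_block_count(block_count, stripe_len, parity_len):
    """Parity blocks needed: parity_len for every (possibly partial) stripe."""
    return math.ceil(block_count / stripe_len) * parity_len

now = 10 * 24 * 3600.0
files = [
    FileInfo("/data/old_big", 25, now - 3 * 24 * 3600),
    FileInfo("/data/old_small", 2, now - 3 * 24 * 3600),  # too few blocks
    FileInfo("/data/fresh_big", 25, now - 3600),          # modified an hour ago
]
chosen = select_candidates(files, now)
parity_blocks = parity_block_count(chosen[0].block_count, stripe_len=10, parity_len=4)
```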
There are currently two implementations of the RaidNode:
LocalRaidNode, which computes parity blocks locally at the RaidNode. Since computing parity blocks is a computationally expensive task, the scalability of this approach is limited.
DistributedRaidNode, which dispatches map reduce tasks to compute parity blocks.