HDFS-RAID Basic Concepts: Study Notes

Category: Servers and Storage

2015-07-14 15:27:09

Original post: HDFS-RAID Basic Concepts: Study Notes, by parrot18

Overview

The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is used along with an instance of the Hadoop DistributedFileSystem (DFS). A file stored in the DRFS (the source file) is divided into stripes consisting of several blocks. For each stripe, a number of parity blocks are stored in the parity file corresponding to this source file. This makes it possible to recompute blocks in the source file or parity file when they are lost or corrupted.

The main benefit of the DRFS is the increased protection against data corruption it provides. Because of this increased protection, replication levels can be lowered while maintaining the same availability guarantees, which results in significant storage space savings.

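My understanding:

HDFS RAID is a module that sits on top of HDFS, much like an application. It provides a DRFS instance that runs on top of HDFS.

Files stored in the DRFS are divided into stripes. (RAID techniques generally split data into stripes, vertically or horizontally, and pair them with parity data; the parity may be kept on one dedicated disk or distributed across the disks. More detail is easy to find online, and the book 《大话存储》 is a good reference.) Each stripe has corresponding parity data, stored in the parity file. The parity file is what makes error correction possible. Simple parity tolerates a single error.

The main purpose of the DRFS is to reduce the number of HDFS replicas, for example from 3 to 2, while still keeping the data reliable. This reduces the disk capacity needed for storage.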

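A discussion found online of how HDFS stores data and why introducing RAID is necessary (the shaded text is quoted):

To improve the reliability of file storage, distributed file systems generally split files into blocks and store multiple replicas of each block on different servers; the open-source distributed file system HDFS uses the same technique. This approach wastes a good deal of space, however. Each block of an HDFS file has three replicas. If a file is 120MB and the block size is 64MB, the file has two blocks with three replicas each, so a 120MB file consumes 360MB of HDFS storage (64MB*3+56MB*3), that is, 3 times the size of the original file (300%). As an HDFS cluster keeps growing, more and more disks are needed to hold these block replicas. Once an HDFS cluster reaches the scale shown in the figure below, probably any company would have to consider adopting some other technique to offset the space wasted by multiple replicas.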
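The arithmetic in the quoted passage can be checked with a small sketch. The function name and its parameters are made up for illustration; only the 120MB file, the 64MB block size and the three replicas come from the text.

```python
import math

def replicated_storage_mb(file_mb, block_mb=64, replicas=3):
    """Return (block_sizes, total_mb) for a file stored with full replication."""
    n_blocks = math.ceil(file_mb / block_mb)
    # Every block is full-sized except possibly the last one.
    sizes = [min(block_mb, file_mb - i * block_mb) for i in range(n_blocks)]
    return sizes, sum(sizes) * replicas

sizes, total = replicated_storage_mb(120)
print(sizes, total)  # [64, 56] 360
```

The last block only occupies 56MB, so the total is exactly 3 times the file size rather than 3 times two full blocks.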

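The scale in the figure mentioned above is petabytes of data.

Hadoop-hdfs-raid is currently a wrapper around the existing Hadoop rather than code embedded in the Hadoop code base, because embedding it would increase code complexity and instability. Starting from hadoop-2.0, it is to exist as a separate project.

A discussion found online of the basic fault-tolerance and error-correction ideas in HDFS-RAID: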

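HDFS-RAID borrows the concept of RAID striping: it treats every X blocks of a file as one stripe for encoding, where X is the stripe length. For example, if the file /foo/bar has 16 blocks and the stripe length is 10, the file has 2 stripes. Each stripe is an independent encoding unit, and both encoding and decoding work stripe by stripe. For the file above, blocks 1-10 are encoded as stripe1 and blocks 11-16 as stripe2; generating the parity for stripe1 does not involve any block of stripe2, and vice versa. This is erasure coding, which I will study more closely later.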
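As a rough illustration of the striping described above, the sketch below groups block numbers into stripes. The helper name is hypothetical and is not part of HDFS RAID; it only reproduces the /foo/bar example (16 blocks, stripe length 10).

```python
def split_into_stripes(num_blocks, stripe_length):
    """Group 1-based block numbers into stripes of at most stripe_length blocks."""
    blocks = list(range(1, num_blocks + 1))
    return [blocks[i:i + stripe_length]
            for i in range(0, num_blocks, stripe_length)]

stripes = split_into_stripes(16, 10)
print(len(stripes))  # 2
print(stripes[1])    # [11, 12, 13, 14, 15, 16]
```

The last stripe is simply shorter than the stripe length; each stripe is then encoded on its own.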

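HDFS Raid currently uses two coding schemes: XOR and RS (Reed-Solomon).

XOR can produce only a single parity byte.

RS can produce any given number of parity bytes.

XOR coding is comparatively simple, and its error-correction capability is weaker. It uses exclusive-or to generate the parity and produces only 1 parity block per stripe. For the /foo/bar file above, the two stripes yield two parity blocks, which together form the parity file /raidxor/foo/bar. With one replica per block, if a stripe loses or corrupts one block, XOR can recover it; if more than one block is lost, it cannot.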
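A minimal sketch of XOR parity, assuming all blocks in a stripe have the same size (a real implementation would have to pad a short last block). The function names are illustrative only and do not come from the HDFS RAID code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block of a stripe from the rest plus parity."""
    return xor_blocks(surviving_blocks + [parity])

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Pretend the second block was lost and rebuild it.
recovered = recover_block([stripe[0], stripe[2]], parity)
print(recovered == stripe[1])  # True
```

With two blocks missing, the XOR of the survivors and the parity only gives the XOR of the two lost blocks, which is why a single parity block cannot repair more than one loss per stripe.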

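RS coding is more complex to implement but has stronger error-correction capability, and it is widely used commercially, for example in CDs, DVDs, and the WiMAX communication protocol. Its distinguishing feature is that the user can choose the parity length (parity len) to balance space savings against reliability. With parity len = 4, four parity blocks are generated, forming the file /raidrs/foo/bar. It can tolerate the simultaneous loss of 4 blocks in the same stripe and still recover them. This is the coding scheme HDFS Raid mainly uses.

Note that HDFS Raid recommends placing the blocks of the same stripe (including parity blocks) on different datanodes, so that the failure of one datanode does not hamper recovery of those blocks. The reasoning is the same as for replication, where the three replicas of a block must not sit on the same datanode.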
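The placement recommendation can be expressed as a simple check. This is only a sketch: the mapping from block id to datanode name is a made-up representation, not an HDFS data structure.

```python
def stripe_is_well_spread(placement):
    """placement maps each block id of one stripe (data and parity) to a datanode.

    Returns True when no two blocks of the stripe share a datanode.
    """
    nodes = list(placement.values())
    return len(nodes) == len(set(nodes))

good = {"b1": "dn1", "b2": "dn2", "parity1": "dn3"}
bad = {"b1": "dn1", "b2": "dn1", "parity1": "dn3"}
print(stripe_is_well_spread(good))  # True
print(stripe_is_well_spread(bad))   # False
```

In the second placement, losing dn1 would take out two blocks of the stripe at once, which XOR parity could not repair.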

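It is worth distinguishing carefully between the original fault tolerance from 3-way block replication and the erasure-code fault tolerance after RAID is introduced. When comparing how many failures each can tolerate, the failures run in different directions. Replication with x copies is vertical: a given block can lose x-1 of its copies. Erasure coding is horizontal: for the x blocks in a stripe with y parity blocks, the stripe can tolerate y bad blocks.

A question arises here: if the original replication scheme is also viewed horizontally, can a file lose any number of blocks? Is that right? Viewed horizontally, as long as 1 of the 3 copies of each block survives, the data of the whole file can still be read.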
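The two tolerance rules can be put side by side in a sketch. Both functions are illustrative assumptions rather than HDFS code, and the erasure-code rule assumes one replica per block, as in the XOR example earlier.

```python
def readable_with_replication(surviving_copies_per_block):
    """A replicated file is readable if every block still has at least one copy."""
    return all(copies >= 1 for copies in surviving_copies_per_block)

def stripe_recoverable(lost_blocks, parity_len):
    """An erasure-coded stripe is recoverable if it lost at most parity_len blocks."""
    return lost_blocks <= parity_len

# Three blocks with 3 replicas each: every block lost 2 copies, file still readable.
print(readable_with_replication([1, 1, 1]))  # True
# One block lost all 3 copies: the file is no longer fully readable.
print(readable_with_replication([3, 3, 0]))  # False
# RS with parity len = 4: losing 4 blocks in a stripe is fine, losing 5 is not.
print(stripe_recoverable(4, 4), stripe_recoverable(5, 4))  # True False
```

Under these assumptions the answer to the question is yes: with replication, any number of copies may be lost across the file as long as no single block loses all of its copies.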

Architecture and implementation

HDFS Raid consists of several software components:

the DRFS client, which provides application access to the files in the DRFS and transparently recovers any corrupt or missing blocks encountered when reading a file,
the RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS,
the BlockFixer, which periodically recomputes blocks that have been lost or corrupted,
the RaidShell utility, which allows the administrator to manually trigger the recomputation of missing or corrupt blocks and to check for files that have become irrecoverably corrupted,
the ErasureCode, which provides the encode and decode of the bytes in blocks.

Below I look up and study material on each of these parts in turn.

The DRFS client:

The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException (because the source file contains corrupt or missing blocks), the DRFS client catches these exceptions, locates the parity file for the current source file and recomputes the missing blocks before returning them to the application.

It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files, it does not insert these missing blocks back into the file system. Instead, it discards them once the application request has been fulfilled. The BlockFixer daemon and the RaidShell tool can be used to persistently fix bad blocks.

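It sits on top of the DFS client, intercepts all incoming calls (judging from the text above, these should be read requests, i.e. reading a file), and passes them on to the underlying client. When the underlying client throws a checksum or block-missing exception, the DRFS client catches it, locates the parity data for the bad block, repairs the data, and only then returns the data to the user.

Note: after the DRFS client recomputes a damaged block, it returns the data to the user and then discards the recomputed block; it does not write the recovered block back into the file system. Permanently fixing bad blocks is the job of the BlockFixer and RaidShell.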
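The read path described above can be sketched as a toy wrapper. Everything here is hypothetical: the class names, the in-memory block list and the single XOR parity block stand in for the real DFS client, parity file and erasure code, and BlockMissingError merely mimics the exception thrown by the underlying client.

```python
from functools import reduce

class BlockMissingError(Exception):
    """Stand-in for the exception the underlying client raises for a bad block."""

class ToyDfsClient:
    """Hypothetical underlying client: a list of blocks, None marks a lost block."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, index):
        data = self.blocks[index]
        if data is None:
            raise BlockMissingError(index)
        return data

class ToyDrfsClient:
    """Wraps the underlying client and rebuilds a lost block from XOR parity."""
    def __init__(self, dfs, parity):
        self.dfs = dfs
        self.parity = parity  # one XOR parity block for a one-stripe file

    def read_block(self, index):
        try:
            return self.dfs.read_block(index)
        except BlockMissingError:
            others = [self.dfs.read_block(i)
                      for i in range(len(self.dfs.blocks)) if i != index]
            columns = zip(self.parity, *others)
            # The rebuilt block is returned but never written back to self.dfs.
            return bytes(reduce(lambda a, b: a ^ b, col) for col in columns)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*data))
dfs = ToyDfsClient([data[0], None, data[2]])
client = ToyDrfsClient(dfs, parity)
print(client.read_block(1))  # b'BBBB'
print(dfs.blocks[1])         # None, the stored block is still missing
```

The second print shows the point made in the note: the application gets correct data, but the underlying store stays damaged until a separate repair step runs.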

RaidNode

The RaidNode periodically scans all paths for which the configuration specifies that they should be stored in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks and selects those that have not been recently modified (default is within the last 24 hours). Once it has selected a source file, it iterates over all its stripes and creates the appropriate number of parity blocks for each stripe. The parity blocks are then concatenated together and stored as the parity file corresponding to this source file. Once the parity file has been created, the replication factor for the corresponding source file is lowered as specified in the configuration. The RaidNode also periodically deletes parity files that have become orphaned or outdated.

There are currently two implementations of the RaidNode:

LocalRaidNode, which computes parity blocks locally at the RaidNode. Since computing parity blocks is a computationally expensive task, the scalability of this approach is limited.
DistributedRaidNode, which dispatches map reduce tasks to compute parity blocks.
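The selection rule and the parity-block count described for the RaidNode can be sketched as follows. FileInfo and both function names are invented for illustration; only the "more than 2 blocks" and "not modified within the last 24 hours" criteria come from the text.

```python
import math
import time
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    num_blocks: int
    mtime: float  # last modification time, seconds since the epoch

def select_files_to_raid(files, now=None, min_age_hours=24):
    """Pick files with more than 2 blocks that were not modified recently."""
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    return [f.path for f in files if f.num_blocks > 2 and f.mtime <= cutoff]

def parity_block_count(num_blocks, stripe_length, parity_len):
    """Number of parity blocks in the parity file: parity_len per stripe."""
    return math.ceil(num_blocks / stripe_length) * parity_len

now = 1_000_000.0
files = [
    FileInfo("/foo/bar", 16, now - 48 * 3600),    # large and old: selected
    FileInfo("/foo/small", 2, now - 48 * 3600),   # too few blocks
    FileInfo("/foo/fresh", 16, now - 1 * 3600),   # modified an hour ago
]
print(select_files_to_raid(files, now=now))  # ['/foo/bar']
print(parity_block_count(16, 10, 1))         # 2
```

The second result matches the XOR example earlier in the notes, where the 16-block file /foo/bar with stripe length 10 ends up with two parity blocks.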

