dedup util 数据块零碰撞算法-platinum-ChinaUnix博客

Platinum's BLogplatinum.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

platinum

博客访问： 2273220
博文数量： 230
博客积分： 9346
博客等级：中将
技术积分： 3418
用户组：普通用户
注册时间： 2006-01-26 01:58

文章分类

全部博文（230）

Troubleshooting（2）
Programing（11）
Security（4）
Documentation（38）
Kernel（14）
Network（28）
Shell（7）
System（18）
Others（55）
TIPS（37）
未分配的博文（16）

文章存档

2015年（30）

2014年（7）

2013年（12）

2012年（2）

2011年（3）

2010年（42）

2009年（9）

2008年（15）

2007年（74）

2006年（36）

我的朋友

最近访客

推荐博文

dedup util 数据块零碰撞算法

分类：

2010-06-20 01:38:02

dedup util中使用md5算法计算数据块hashkey。md5是128位的hash值，理论上产生碰撞的概率非常小，据说比磁盘发生物理损坏的概率还要小几个数据级。然而，虽然说概率非常微小，但产生碰撞的可能性真实存在，王小云教授的团队已经找到快速发现碰撞的算法。在重复数据删除技术中，鉴于性能考虑，主流做法是使用碰撞概率更小的hash算法，如sha256, sha512，或者同时使用两种以上hash算法，或者使用非平凡的hash算法，从而减小碰撞概率，从而可以假设不会发生碰撞。数据碰撞的存在，使数据存在潜在的安全问题，这让很多用户不敢放心使用重复数据删除技术及相关产品。dedup util是一款轻量级数据打包归档工具，我认为它的数据安全性要高于性能，因此以牺牲部分性能为代价，对md5碰撞问题进行彻底解决，确保数据的安全性。

假设md5不会产生碰撞，那么hash_md5(block)产生的值是唯一的，也就是说一个数据块对应一个唯一的hash值，可以用1:1映射来表示。现在，这个假设被否定，可能存在两个数据块的md5哈稀值相同，即hash_md5(block1) = hash_md5(block2)，多个数据块对应一个hash值，这时就需要使用1:n映射来表示。新的hashtable表项使用三元组表示：

其中，md5_hashkey表示数据块md5值，block_nr表示md5哈稀值相同的数据块数量，block_IDs表示这些数据块逻辑块号。在程序设计中，将block_nr与block_IDs合并在一块，结构如下所示：

+----------------------------------------------------+
| block_nr | block_ID1 | block_ID2 | ... | block_IDn |
+----------------------------------------------------+ 　

这实际上是使用链接法来解决hash碰撞问题，每个桶的长度不定。对于每个数据块，都要与桶中的数据块进行逐位的比较，以保证数据块的唯一性。算法描述如下：

1、计算数据块md5值，hkey = hash_md5(block)；

2、查找hashtable，bindex = hash_value(hkey, htable)；

3、如果未发现哈稀表项，即bindex == NULL，则直接将hkey插入hashtable, 并且block_nr = 1，block_ID1 = g_unique_block_nr；

4、如果发现哈稀表项，则作如下处理

　4.1 遍历哈稀桶，将逻辑号对应的物理数据块与当前数据块进行比较；

　4.2 如果找到相同数据块，则该数据块已经存在，无需处理；

4.3 如果未找到相同数据块，则将数据块插入桶尾，并将数据块写入文件，并且block_nr++, block_IDn = g_unique_block_nr；

不作碰撞处理的算法，未发现md5值哈稀表项则直接返回逻辑块号，没有如上算法的第4步。这里需要进行数据块的比较，性能损失主要发生在这里。算法细节请直接参考源码src/dedup.c中的dedup_regfile，部分代码摘录如下：

pos = 0;
while (rwsize = read(fd, buf, g_block_size))
{
/* if the last block */
if (rwsize != g_block_size)
break;
/* calculate md5 */
md5(buf, rwsize, md5_checksum);
/* check hashtable with hashkey
NOTE: no md5 collsion problem, but lose some performace
hashtable entry format: (md5_key, block_id list)
+--------------------------------+
| id num | id1 | id2 | ... | idn |
+--------------------------------+
*/
unsigned int cbindex;
int bflag = 0;
unsigned int *bindex = (block_id_t *)hash_value((void *)md5_checksum, htable);
/* the block exists */
if (bindex != NULL)
{
int i;
for (i = 0; i < *bindex; i++)
{
if (0 == block_cmp(buf, fd_bdata, *(bindex + i + 1), g_block_size))
{
cbindex = *(bindex + i + 1);
bflag = 1;
break;
}
}
}
/* insert hash entry and write unique block into bdata*/
if (bindex == NULL || (bindex != NULL && bflag == 0))
{
if (bindex == NULL)
bflag = 1;
bindex = (bflag) ? (block_id_t *)malloc(BLOCK_ID_SIZE * 2) :
(block_id_t *)realloc(bindex, BLOCK_ID_SIZE * ((*bindex) + 1));
if (NULL == bindex)
{
perror("malloc/realloc in dedup_regfile");
break;
}
*bindex = (bflag) ? 1 : (*bindex) + 1;
*(bindex + *bindex) = g_unique_block_nr;
cbindex = g_unique_block_nr;
hash_insert((void *)strdup(md5_checksum), (void *)bindex, htable);
write(fd_bdata, buf, rwsize);
g_unique_block_nr++;
}
metadata[pos] = cbindex;
memset(buf, 0, g_block_size);
memset(md5_checksum, 0, 16 + 1);
pos++;
}

pos
 = 0;
	while (rwsize = read(fd, buf, g_block_size)) 
	{
		/* if the last block */
		if (rwsize != g_block_size)
			break;
		/* calculate md5 */
		md5(buf, rwsize, md5_checksum);
		/* check hashtable with hashkey 
		   NOTE: no md5 collsion problem, but lose some performace 
		   hashtable entry format: (md5_key, block_id list)
		   +--------------------------------+
		   | id num | id1 | id2 | ... | idn |
		   +--------------------------------+
		*/
		unsigned int cbindex;
		int bflag = 0;
		unsigned int *bindex = (block_id_t *)hash_value((void *)md5_checksum, 
htable);
		/* the block exists */
		if (bindex != NULL)
		{
			int i;
			for (i = 0; i < *bindex; i++)
			{
				if (0 == block_cmp(buf, fd_bdata, *(bindex + i + 1), g_block_size))
				{
					cbindex = *(bindex + i + 1);
					bflag = 1;
					break;
				}
			}
		}
		/* insert hash entry and write unique block into bdata*/
		if (bindex == NULL || (bindex != NULL && bflag == 0))
		{
			if (bindex == NULL)
				bflag = 1;
			bindex = (bflag) ? (block_id_t *)malloc(BLOCK_ID_SIZE * 2) :
				(block_id_t *)realloc(bindex, BLOCK_ID_SIZE * ((*bindex) + 1));
			if (NULL == bindex)
			{
				perror("malloc/realloc in dedup_regfile");
				break;
			}
			*bindex = (bflag) ? 1 : (*bindex) + 1;
			*(bindex + *bindex) = g_unique_block_nr;
			cbindex = g_unique_block_nr;
			hash_insert((void *)strdup(md5_checksum), (void *)bindex, htable);
			write(fd_bdata, buf, rwsize);
			g_unique_block_nr++;
		}
		metadata[pos] = cbindex;
		memset(buf, 0, g_block_size);
		memset(md5_checksum, 0, 16 + 1);
		pos++;
	}

阅读(2356) | 评论(0) | 转发(0) |

上一篇：Linux下的 C/C++ 开发与调试工具

下一篇：一个利用缓冲区溢出执行代码的例子

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6