基于 Dedup 的数据打包技术-platinum-ChinaUnix博客

Platinum's BLogplatinum.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

platinum

博客访问： 2277565
博文数量： 230
博客积分： 9346
博客等级：中将
技术积分： 3418
用户组：普通用户
注册时间： 2006-01-26 01:58

文章分类

全部博文（230）

Troubleshooting（2）
Programing（11）
Security（4）
Documentation（38）
Kernel（14）
Network（28）
Shell（7）
System（18）
Others（55）
TIPS（37）
未分配的博文（16）

文章存档

2015年（30）

2014年（7）

2013年（12）

2012年（2）

2011年（3）

2010年（42）

2009年（9）

2008年（15）

2007年（74）

2006年（36）

我的朋友

最近访客

推荐博文

基于 Dedup 的数据打包技术

分类：

2010-06-20 01:28:21

基于Dedup的数据打包技术

作者简介：刘爱贵，研究方向为网络存储、数据挖掘和分布式计算；毕业于中科院，目前从事存储软件研发工作。 Email: Aigui.Liu@gmail.com
注：作者学识和经验水平有限，如有错误或不当之处，敬请批评指正。

0、引言
    Tar, winrar, winzip是最为常见的数据打包工具软件，它们把文件集体封装成一个单独的数据包，从而方便数据的分布、传输、归档以及持久保存等目的。这类工具通常都支持数据压缩技术，从而有效减少数据的存储空间，常用压缩算法有Huffman编码、Z77/z78、LZW等。压缩算法的原理是通过对数据的重新编码，高频率数据片段采用较短的编码，低频率数据片段采用较长的编码，从而获得全局的上数据量较小的文件表示。

1、Dedup原理
   Deduplication，即重复数据删除，它是一种非常新的且流行度很高的存储技术，可以大大减少数据的数量。重复数据删除技术，通过数据集中重复的数据，从而消除冗余数据。借助dedup技术，可以提高存储系统的效率，有效节约成本、减少传输过程中的网络带宽。同时它也是一种绿色存储技术，能有效降低能耗(存储空间小了，所需要存储系统磁盘也就少了，自然所需要电能就减少了)。
   dedup按照消重的粒度可以分为文件级和数据块级。文件级的dedup技术也称为单一实例存储(SIS, Single Instance Store)，数据块级的重复数据删除，其消重粒度更小，可以达到4-24KB之间。显然，数据块级的可以提供更高的数据消重率，因此目前主流的 dedup产品都是数据块级的。重复数据删除原理如下图所示。将文件都分割成数据块(可以是定长或变长的数据块)，采用MD5或SHA1等Hash算法 (可以同时使用两种或以上hash算法，或CRC校验等，以获得非常小概率的数据碰撞发生)为数据块计算FingerPrint。具有相同FP指纹的数据块即可认为是相同的数据块，存储系统中仅需要保留一份。这样，一个物理文件在存储系统中就对应一个逻辑表示，由一组FP组成的元数据。当进行读取文件时，先读取逻辑文件，然后根据FP序列，从存储系统中取出相应数据块，还原物理文件副本。
   重复数据删除目前主要应用于数据备份，因此对数据进行多次备份后，存在大量重复数据，非常适合dedup技术。事实上，dedup技术可以用于很多场合，包括在线数据、近线数据、离线数据存储系统，甚至可以在文件系统、卷管理器、NAS、SAN中实施。还可以用于网络数据传输，当然也可以应用于数据打包技术。dedup技术可以帮助众多应用降低数据存储量，节省网络带宽，提高存储效率、减小备份窗口，绿色节能。这里，基于dedup实现一种数据打包技术。

2、基于Dedup的数据打包模型

数据包文件的数据布局：

Header

Unique block data

File metadata

数据包由三部分组成：文件头(header)、唯一数据块集(unique block data)和逻辑文件元数据(file metadata)。其中，header为一个结构体，定义了数据块大小、唯一数据块数量、数据块ID大小、包中文件数量、元数据在包中的位置等元信息。文件头后紧接就存储着所有唯一的数据块，大小和数量由文件头中元信息指示。在数据块之后，就是数据包中文件的逻辑表示元数据，由多个实体组成，结构如下所示，一个实体表示一个文件。解包时根据文件的元数据，逐一提取数据块，还原出当初的物理文件。

逻辑文件的元数据表示：

Entry header

pathname

Entry data

Last block data

逻辑文件的实体头中记录着文件名长度、数据块数量、数据块ID大小和最后一个数据块大小等信息。紧接着是文件名数据，长度在实体头中定义。文件名数据之后，存储着一组唯一数据块的编号，编号与唯一数据块集中的数据块一一对应。最后存储着文件最后一个数据块，由于这个数据块大小通常比正常数据块小，重复概率非常小，因此单独保存。

3、原型实现
基于上面的数据布局，就可以实现支持重复数据删除的数据打包方法。本人在Linux系统上实现了一个原型，实现中使用了hashtable来记录和查询唯一数据块信息，使用MD5算法计算数据块指纹，并使用zlib中的z77压缩算法对删除了重复数据后的数据包进行压缩。hashtable, MD5, z77算法和实现，这里不作介绍，有兴趣的读者可以参考相关资源。下面给出dedup.h, dedup.c undedup.c源码文件。目前实现的原型还相对比较粗糙。

/* dedup.h */

+ expand source view plain copy to clipboard print ?

#ifndef _DEDUP_H
#define _DEDUP_H
#include "md5.h"
#include "hash.h"
#include "hashtable.h"
#include "libz.h"
#ifdef __cplusplus
extern "C" {
#endif
/*
* deduplication file data layout
* --------------------------------------------------
* | header | unique block data | file metadata |
* --------------------------------------------------
*
* file metedata entry layout
* -----------------------------------------------------------------
* | entry header | pathname | entry data | last block data |
* -----------------------------------------------------------------
*/
typedef unsigned int block_id_t;
#define BLOCK_SIZE 4096 /* 4K Bytes */
#define BACKET_SIZE 10240
#define MAX_PATH_LEN 255
#define BLOCK_ID_SIZE (sizeof(block_id_t))
/* deduplication package header */
#define DEDUP_MAGIC_NUM 0x1329149
typedef struct _dedup_package_header {
unsigned int block_size;
unsigned int block_num;
unsigned int blockid_size;
unsigned int magic_num;
unsigned int file_num;
unsigned long long metadata_offset;
} dedup_package_header;
#define DEDUP_PKGHDR_SIZE (sizeof(dedup_package_header))
/* deduplication metadata entry header */
typedef struct _dedup_entry_header {
unsigned int path_len;
unsigned int block_num;
unsigned int entry_size;
unsigned int last_block_size;
int mode;
} dedup_entry_header;
#define DEDUP_ENTRYHDR_SIZE (sizeof(dedup_entry_header))
#ifdef __cplusplus
}
#endif
#endif

#ifndef _DEDUP_H
#define _DEDUP_H
#include "md5.h"
#include "hash.h"
#include "hashtable.h"
#include "libz.h"
#ifdef __cplusplus
extern "C" {
#endif
/* 
 * deduplication file data layout
 * --------------------------------------------------
 * |  header  |  unique block data |  file metadata |
 * --------------------------------------------------
 *
 * file metedata entry layout
 * -----------------------------------------------------------------
 * |  entry header  |  pathname  |  entry data  |  last block data |
 * -----------------------------------------------------------------
 */
typedef unsigned int block_id_t;
#define BLOCK_SIZE	4096	 /* 4K Bytes */
#define BACKET_SIZE     10240
#define MAX_PATH_LEN	255
#define BLOCK_ID_SIZE   (sizeof(block_id_t))
/* deduplication package header */
#define DEDUP_MAGIC_NUM	0x1329149
typedef struct _dedup_package_header {
	unsigned int block_size;
	unsigned int block_num;
	unsigned int blockid_size;
	unsigned int magic_num;
	unsigned int file_num;
	unsigned long long metadata_offset;
} dedup_package_header;
#define DEDUP_PKGHDR_SIZE	(sizeof(dedup_package_header))
/* deduplication metadata entry header */
typedef struct _dedup_entry_header {
	unsigned int path_len;
	unsigned int block_num;
	unsigned int entry_size;
	unsigned int last_block_size;
	int mode;
} dedup_entry_header;
#define DEDUP_ENTRYHDR_SIZE	(sizeof(dedup_entry_header))
#ifdef __cplusplus
}
#endif
#endif

/* dedup.c */

+ expand source view plain copy to clipboard print ?

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include "dedup.h"
/* unique block number in package */
static unsigned int g_unique_block_nr = 0;
/* regular file number in package */
static unsigned int g_regular_file_nr = 0;
/* block length */
static unsigned int g_block_size = BLOCK_SIZE;
/* hashtable backet number */
static unsigned int g_htab_backet_nr = BACKET_SIZE;
void show_md5(unsigned char md5_checksum[16])
{
int i;
for (i = 0; i < 16; i++)
{
printf("%02x", md5_checksum[i]);
}
}
void show_pkg_header(dedup_package_header dedup_pkg_hdr)
{
printf("block_size = %d\n", dedup_pkg_hdr.block_size);
printf("block_num = %d\n", dedup_pkg_hdr.block_num);
printf("blockid_size = %d\n", dedup_pkg_hdr.blockid_size);
printf("magic_num = 0x%x\n", dedup_pkg_hdr.magic_num);
printf("file_num = %d\n", dedup_pkg_hdr.file_num);
printf("metadata_offset = %lld\n", dedup_pkg_hdr.metadata_offset);
}
int dedup_regfile(char *fullpath, int prepos, int fd_bdata, int fd_mdata, hashtable *htable, int debug)
{
int fd;
char *buf = NULL;
unsigned int rwsize, pos;
unsigned char md5_checksum[16 + 1] = {0};
unsigned int *metadata = NULL;
unsigned int block_num = 0;
struct stat statbuf;
dedup_entry_header dedup_entry_hdr;
if (-1 == (fd = open(fullpath, O_RDONLY)))
{
perror("open regulae file");
return errno;
}
if (-1 == fstat(fd, &statbuf))
{
perror("fstat regular file");
goto _DEDUP_REGFILE_EXIT;
}
block_num = statbuf.st_size / g_block_size;
metadata = (unsigned int *)malloc(BLOCK_ID_SIZE * block_num);
if (metadata == NULL)
{
perror("malloc metadata for regfile");
goto _DEDUP_REGFILE_EXIT;
}
buf = (char *)malloc(g_block_size);
if (buf == NULL)
{
perror("malloc buf for regfile");
goto _DEDUP_REGFILE_EXIT;
}
pos = 0;
while (rwsize = read(fd, buf, g_block_size))
{
/* if the last block */
if (rwsize != g_block_size)
break;
/* calculate md5 */
md5(buf, rwsize, md5_checksum);
/* check hashtable with hashkey */
unsigned int *bindex = (block_id_t *)hash_value((void *)md5_checksum, htable);
if (bindex == NULL)
{
bindex = (unsigned int *)malloc(BLOCK_ID_SIZE);
if (NULL == bindex)
{
perror("malloc in dedup_regfile");
break;
}
/* insert hash entry and write unique block into bdata*/
*bindex = g_unique_block_nr;
hash_insert((void *)strdup(md5_checksum), (void *)bindex, htable);
write(fd_bdata, buf, rwsize);
g_unique_block_nr++;
}
metadata[pos] = *bindex;
memset(buf, 0, g_block_size);
memset(md5_checksum, 0, 16 + 1);
pos++;
}
/* write metadata into mdata */
dedup_entry_hdr.path_len = strlen(fullpath) - prepos;
dedup_entry_hdr.block_num = block_num;
dedup_entry_hdr.entry_size = BLOCK_ID_SIZE;
dedup_entry_hdr.last_block_size = rwsize;
dedup_entry_hdr.mode = statbuf.st_mode;
write(fd_mdata, &dedup_entry_hdr, sizeof(dedup_entry_header));
write(fd_mdata, fullpath + prepos, dedup_entry_hdr.path_len);
write(fd_mdata, metadata, BLOCK_ID_SIZE * block_num);
write(fd_mdata, buf, rwsize);
g_regular_file_nr++;
_DEDUP_REGFILE_EXIT:
close(fd);
if (metadata) free(metadata);
if (buf) free(buf);
return 0;
}
int dedup_dir(char *fullpath, int prepos, int fd_bdata, int fd_mdata, hashtable *htable, int debug)
{
DIR *dp;
struct dirent *dirp;
struct stat statbuf;
char subpath[MAX_PATH_LEN] = {0};
if (NULL == (dp = opendir(fullpath)))
{
return errno;
}
while ((dirp = readdir(dp)) != NULL)
{
if (strcmp(dirp->d_name, ".") == 0 || strcmp(dirp->d_name, "..") == 0)
continue;
sprintf(subpath, "%s/%s", fullpath, dirp->d_name);
if (0 == lstat(subpath, &statbuf))
{
if (debug)
printf("%s\n", subpath);
if (S_ISREG(statbuf.st_mode))
dedup_regfile(subpath, prepos, fd_bdata, fd_mdata, htable,debug);
else if (S_ISDIR(statbuf.st_mode))
dedup_dir(subpath, prepos, fd_bdata, fd_mdata, htable, debug);
}
}
closedir(dp);
return 0;
}
int dedup_package(int path_nr, char **src_paths, char *dest_file, int debug)
{
int fd, fd_bdata, fd_mdata, ret = 0;
struct stat statbuf;
hashtable *htable = NULL;
dedup_package_header dedup_pkg_hdr;
char **paths = src_paths;
int i, rwsize, prepos;
char buf[1024 * 1024] = {0};
if (-1 == (fd = open(dest_file, O_WRONLY | O_CREAT, 0755)))
{
perror("open dest file");
ret = errno;
goto _DEDUP_PKG_EXIT;
}
htable = create_hashtable(g_htab_backet_nr);
if (NULL == htable)
{
perror("create_hashtable");
ret = errno;
goto _DEDUP_PKG_EXIT;
}
fd_bdata = open("./.bdata", O_RDWR | O_CREAT, 0777);
fd_mdata = open("./.mdata", O_RDWR | O_CREAT, 0777);
if (-1 == fd_bdata || -1 == fd_mdata)
{
perror("open bdata or mdata");
ret = errno;
goto _DEDUP_PKG_EXIT;
}
g_unique_block_nr = 0;
g_regular_file_nr = 0;
for (i = 0; i < path_nr; i++)
{
if (lstat(paths[i], &statbuf) < 0)
{
perror("lstat source path");
ret = errno;
goto _DEDUP_PKG_EXIT;
}
if (S_ISREG(statbuf.st_mode) || S_ISDIR(statbuf.st_mode))
{
if (debug)
printf("%s\n", paths[i]);
/* get filename position in pathname */
prepos = strlen(paths[i]) - 1;
if (strcmp(paths[i], "/") != 0 && *(paths[i] + prepos) == '/')
{
*(paths[i] + prepos--) = '\0';
}
while(*(paths[i] + prepos) != '/' && prepos >= 0) prepos--;
prepos++;
if (S_ISREG(statbuf.st_mode))
dedup_regfile(paths[i], prepos, fd_bdata, fd_mdata, htable, debug);
else
dedup_dir(paths[i], prepos, fd_bdata, fd_mdata, htable, debug);
}
else
{
if (debug)
printf("%s is not regular file or directory.\n", paths[i]);
}
}
/* fill up dedup package header */
dedup_pkg_hdr.block_size = g_block_size;
dedup_pkg_hdr.block_num = g_unique_block_nr;
dedup_pkg_hdr.blockid_size = BLOCK_ID_SIZE;
dedup_pkg_hdr.magic_num = DEDUP_MAGIC_NUM;
dedup_pkg_hdr.file_num = g_regular_file_nr;
dedup_pkg_hdr.metadata_offset = DEDUP_PKGHDR_SIZE + g_block_size * g_unique_block_nr;
write(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE);
/* fill up dedup package unique blocks*/
lseek(fd_bdata, 0, SEEK_SET);
while(rwsize = read(fd_bdata, buf, 1024 * 1024))
{
write(fd, buf, rwsize);
memset(buf, 0, 1024 * 1024);
}
/* fill up dedup package metadata */
lseek(fd_mdata, 0, SEEK_SET);
while(rwsize = read(fd_mdata, buf, 1024 * 1024))
{
write(fd, buf, rwsize);
memset(buf, 0, 1024 * 1024);
}
if (debug)
show_pkg_header(dedup_pkg_hdr);
_DEDUP_PKG_EXIT:
close(fd);
close(fd_bdata);
close(fd_mdata);
unlink("./.bdata");
unlink("./.mdata");
hash_free(htable);
return ret;
}
void usage()
{
printf("Usage: dedup [OPTION...] \n");
printf("\nPackage files with deduplicaton technique.\n");
printf("Mandatory arguments to long options are mandatory for short options too.\n");
printf(" -z, --compress filter the archive through compress\n");
printf(" -b, --block block size for deduplication, default is 4096\n");
printf(" -t, --hashtable hashtable backet number, default is 10240\n");
printf(" -d, --debug print debug messages\n");
printf(" -h, --help give this help list\n");
printf("\nReport bugs to .\n");
}
int main(int argc, char *argv[])
{
char tmp_file[] = "./.dedup\0";
int bz = 0, bhelp = 0, bdebug = 0;
int ret = -1, c;
struct option longopts[] =
{
{"compress", 0, &bz, 'z'},
{"block", 1, 0, 'b'},
{"hashtable", 1, 0, 't'},
{"debug", 0, &bdebug, 'd'},
{"help", 0, &bhelp, 'h'},
{0, 0, 0, 0}
};
/* parse options */
while ((c = getopt_long (argc, argv, "zb:t:dh", longopts, NULL)) != EOF)
{
switch(c)
{
case 'z':
bz = 1;
break;
case 'b':
g_block_size = atoi(optarg);
break;
case 't':
g_htab_backet_nr = atoi(optarg);
break;
case 'd':
bdebug = 1;
break;
case 'h':
case '?':
default:
bhelp = 1;
break;
}
}
if (bhelp == 1 || (argc - optind) < 2)
{
usage();
return 0;
}
if (bz)
{
/* dedup and compress */
ret = dedup_package(argc - optind -1 , argv + optind + 1, tmp_file, bdebug);
if (ret == 0)
{
ret = zlib_compress_file(tmp_file, argv[optind]);
unlink(tmp_file);
}
}
else
{
/* dedup only */
ret = dedup_package(argc - optind - 1, argv + optind + 1, argv[optind], bdebug);
}
return ret;
}

#include 
<stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <unistd.h>
#include <getopt.h>
#include <fcntl.h>
#include <dirent.h>
#include <errno.h>
#include "dedup.h"
/* unique block number in package */
static unsigned int g_unique_block_nr = 0;
/* regular file number in package */
static unsigned int g_regular_file_nr = 0;
/* block length */
static unsigned int g_block_size = BLOCK_SIZE;
/* hashtable backet number */
static unsigned int g_htab_backet_nr = BACKET_SIZE;
void show_md5(unsigned char md5_checksum[16])
{
	int i;
	for (i = 0; i < 16; i++)
	{
        	printf("%02x", md5_checksum[i]);
	}
}
void show_pkg_header(dedup_package_header dedup_pkg_hdr)
{
        printf("block_size = %d\n", dedup_pkg_hdr.block_size);
        printf("block_num = %d\n", dedup_pkg_hdr.block_num);
	printf("blockid_size = %d\n", dedup_pkg_hdr.blockid_size);
	printf("magic_num = 0x%x\n", dedup_pkg_hdr.magic_num);
	printf("file_num = %d\n", dedup_pkg_hdr.file_num);
	printf("metadata_offset = %lld\n", dedup_pkg_hdr.metadata_offset);
}
int dedup_regfile(char *fullpath, int prepos, int fd_bdata, int 
fd_mdata, hashtable *htable, int debug)
{
	int fd;
	char *buf = NULL;
	unsigned int rwsize, pos;
	unsigned char md5_checksum[16 + 1] = {0};
	unsigned int *metadata = NULL;
	unsigned int block_num = 0;
	struct stat statbuf;
	dedup_entry_header dedup_entry_hdr;
	if (-1 == (fd = open(fullpath, O_RDONLY)))
	{
		perror("open regulae file");
		return errno;
	}
	if (-1 == fstat(fd, &statbuf))
	{
		perror("fstat regular file");
		goto _DEDUP_REGFILE_EXIT;
	}
	block_num = statbuf.st_size / g_block_size;
	metadata = (unsigned int *)malloc(BLOCK_ID_SIZE * block_num);
	if (metadata == NULL)
	{
		perror("malloc metadata for regfile");
		goto _DEDUP_REGFILE_EXIT;
	}
	buf = (char *)malloc(g_block_size);
	if (buf == NULL)
	{
		perror("malloc buf for regfile");
		goto _DEDUP_REGFILE_EXIT;
	}
	pos = 0;
	while (rwsize = read(fd, buf, g_block_size)) 
	{
		/* if the last block */
		if (rwsize != g_block_size)
			break;
		/* calculate md5 */
		md5(buf, rwsize, md5_checksum);
		/* check hashtable with hashkey */
		unsigned int *bindex = (block_id_t *)hash_value((void *)md5_checksum, 
htable);
		if (bindex == NULL)
		{
			bindex = (unsigned int *)malloc(BLOCK_ID_SIZE);
			if (NULL == bindex)
			{
				perror("malloc in dedup_regfile");
				break;
			}
			/* insert hash entry and write unique block into bdata*/
			*bindex = g_unique_block_nr;
			hash_insert((void *)strdup(md5_checksum), (void *)bindex, htable);
			write(fd_bdata, buf, rwsize);
			g_unique_block_nr++;
		}
		metadata[pos] = *bindex;
		memset(buf, 0, g_block_size);
		memset(md5_checksum, 0, 16 + 1);
		pos++;
	}
	/* write metadata into mdata */
	dedup_entry_hdr.path_len = strlen(fullpath) - prepos;
	dedup_entry_hdr.block_num = block_num;
	dedup_entry_hdr.entry_size = BLOCK_ID_SIZE;
	dedup_entry_hdr.last_block_size = rwsize;
	dedup_entry_hdr.mode = statbuf.st_mode;
	write(fd_mdata, &dedup_entry_hdr, sizeof(dedup_entry_header));
	write(fd_mdata, fullpath + prepos, dedup_entry_hdr.path_len);
	write(fd_mdata, metadata, BLOCK_ID_SIZE * block_num);
	write(fd_mdata, buf, rwsize);
	g_regular_file_nr++;
_DEDUP_REGFILE_EXIT:
	close(fd);
	if (metadata) free(metadata);
	if (buf) free(buf);
	return 0;
}
int dedup_dir(char *fullpath, int prepos, int fd_bdata, int fd_mdata, 
hashtable *htable, int debug)
{
	DIR *dp;
	struct dirent *dirp;
	struct stat statbuf;
	char subpath[MAX_PATH_LEN] = {0};
	if (NULL == (dp = opendir(fullpath)))
	{
		return errno;
	}
	while ((dirp = readdir(dp)) != NULL)
	{
		if (strcmp(dirp->d_name, ".") == 0 || strcmp(dirp->d_name, "..")
 == 0)
			continue;
		sprintf(subpath, "%s/%s", fullpath, dirp->d_name);
		if (0 == lstat(subpath, &statbuf))
		{
			if (debug)
				printf("%s\n", subpath);
			if (S_ISREG(statbuf.st_mode)) 
				dedup_regfile(subpath, prepos, fd_bdata, fd_mdata, htable,debug);
			else if (S_ISDIR(statbuf.st_mode))
				dedup_dir(subpath, prepos, fd_bdata, fd_mdata, htable, debug);
		}
	}
	closedir(dp);
	return 0;
}
int dedup_package(int path_nr, char **src_paths, char *dest_file, int 
debug)
{
	int fd, fd_bdata, fd_mdata, ret = 0;
	struct stat statbuf;
	hashtable *htable = NULL;
	dedup_package_header dedup_pkg_hdr;
	char **paths = src_paths;
	int i, rwsize, prepos;
	char buf[1024 * 1024] = {0};
	if (-1 == (fd = open(dest_file, O_WRONLY | O_CREAT, 0755)))
	{
		perror("open dest file");
		ret = errno;
		goto _DEDUP_PKG_EXIT;
	}
	htable = create_hashtable(g_htab_backet_nr);
	if (NULL == htable)
	{
		perror("create_hashtable");
		ret = errno;
		goto _DEDUP_PKG_EXIT;
	}
	fd_bdata = open("./.bdata", O_RDWR | O_CREAT, 0777);
	fd_mdata = open("./.mdata", O_RDWR | O_CREAT, 0777);
	if (-1 == fd_bdata || -1 == fd_mdata)
	{
		perror("open bdata or mdata");
		ret = errno;
		goto _DEDUP_PKG_EXIT;
	}
	g_unique_block_nr = 0;
	g_regular_file_nr = 0;
	for (i = 0; i < path_nr; i++)
	{
		if (lstat(paths[i], &statbuf) < 0)
		{
			perror("lstat source path");
			ret = errno;
			goto _DEDUP_PKG_EXIT;
		}
		if (S_ISREG(statbuf.st_mode) || S_ISDIR(statbuf.st_mode))
		{
			if (debug)
				printf("%s\n", paths[i]);
			/* get filename position in pathname */	
			prepos = strlen(paths[i]) - 1;
			if (strcmp(paths[i], "/") != 0 && *(paths[i] + prepos) == 
'/')
			{
				*(paths[i] + prepos--) = '\0';
			}
			while(*(paths[i] + prepos) != '/' && prepos >= 0) 
prepos--;
			prepos++;
			if (S_ISREG(statbuf.st_mode))
				dedup_regfile(paths[i], prepos, fd_bdata, fd_mdata, htable, debug);
			else
				dedup_dir(paths[i], prepos, fd_bdata, fd_mdata, htable, debug);
		}	
		else 
		{
			if (debug)
				printf("%s is not regular file or directory.\n", paths[i]);
		}
	}
	/* fill up dedup package header */
	dedup_pkg_hdr.block_size = g_block_size;
	dedup_pkg_hdr.block_num = g_unique_block_nr;
	dedup_pkg_hdr.blockid_size = BLOCK_ID_SIZE;
	dedup_pkg_hdr.magic_num = DEDUP_MAGIC_NUM;
	dedup_pkg_hdr.file_num = g_regular_file_nr; 
	dedup_pkg_hdr.metadata_offset = DEDUP_PKGHDR_SIZE + g_block_size * 
g_unique_block_nr;
	write(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE);
	/* fill up dedup package unique blocks*/
	lseek(fd_bdata, 0, SEEK_SET);
	while(rwsize = read(fd_bdata, buf, 1024 * 1024))
	{
		write(fd, buf, rwsize);
		memset(buf, 0, 1024 * 1024);
	}
	/* fill up dedup package metadata */
	lseek(fd_mdata, 0, SEEK_SET);
	while(rwsize = read(fd_mdata, buf, 1024 * 1024))
	{
		write(fd, buf, rwsize);
		memset(buf, 0, 1024 * 1024);
	}
	if (debug)
		show_pkg_header(dedup_pkg_hdr);
_DEDUP_PKG_EXIT:
	close(fd);
	close(fd_bdata);
	close(fd_mdata);
	unlink("./.bdata");
	unlink("./.mdata");
	hash_free(htable);
	
	return ret;
}
void usage()
{
	printf("Usage:  dedup [OPTION...] <target file> <source files 
...>\n");
	printf("\nPackage files with deduplicaton technique.\n");
	printf("Mandatory arguments to long options are mandatory for short 
options too.\n");
	printf("  -z, --compress   filter the archive through compress\n");
	printf("  -b, --block      block size for deduplication, default is 
4096\n");
	printf("  -t, --hashtable  hashtable backet number, default is 
10240\n");
	printf("  -d, --debug      print debug messages\n");
	printf("  -h, --help       give this help list\n");
	printf("\nReport bugs to <Aigui.Liu@gmail.com>.\n");
}
int main(int argc, char *argv[])
{
	char tmp_file[] = "./.dedup\0";
	int bz = 0, bhelp = 0, bdebug = 0;
	int ret = -1, c;
	struct option longopts[] =
	{
		{"compress", 0, &bz, 'z'},
		{"block", 1, 0, 'b'},
		{"hashtable", 1, 0, 't'},
		{"debug", 0, &bdebug, 'd'},
		{"help", 0, &bhelp, 'h'},
		{0, 0, 0, 0}
	};
	/* parse options */
	while ((c = getopt_long (argc, argv, "zb:t:dh", longopts, NULL)) != 
EOF)
	{
		switch(c) 
		{
		case 'z':
			bz = 1;
			break;
		case 'b':
			g_block_size = atoi(optarg);
			break;
		case 't':
			g_htab_backet_nr = atoi(optarg);
			break;
		case 'd':
			bdebug = 1;
			break;
		case 'h':
		case '?':
		default:
			bhelp = 1;
			break;
		}
	}
	if (bhelp == 1 || (argc - optind) < 2)
	{
		usage();
		return 0;
	}
	if (bz)
	{
		/* dedup and compress */
		ret = dedup_package(argc - optind -1 , argv + optind + 1, tmp_file, 
bdebug);
		if (ret == 0)
		{
			ret = zlib_compress_file(tmp_file, argv[optind]);
			unlink(tmp_file);
		}
	} 
	else 
	{
		/* dedup only */
		ret = dedup_package(argc - optind - 1, argv + optind + 1, 
argv[optind], bdebug);
	}
	return ret;
}

/* undedup.c */

+ expand source view plain copy to clipboard print ?

#include
#include
#include
#include
#include
#include
#include
#include
#include "dedup.h"
/* block length */
static unsigned int g_block_size = BLOCK_SIZE;
void show_pkg_header(dedup_package_header dedup_pkg_hdr)
{
printf("block_size = %d\n", dedup_pkg_hdr.block_size);
printf("block_num = %d\n", dedup_pkg_hdr.block_num);
printf("blockid_size = %d\n", dedup_pkg_hdr.blockid_size);
printf("magic_num = 0x%x\n", dedup_pkg_hdr.magic_num);
printf("file_num = %d\n", dedup_pkg_hdr.file_num);
printf("metadata_offset = %lld\n", dedup_pkg_hdr.metadata_offset);
}
int prepare_target_file(char *pathname, char *basepath, int mode)
{
char fullpath[MAX_PATH_LEN] = {0};
char path[MAX_PATH_LEN] = {0};
char *p = NULL;
int pos = 0, fd;
sprintf(fullpath, "%s/%s", basepath, pathname);
p = fullpath;
while (*p != '\0')
{
path[pos++] = *p;
if (*p == '/')
mkdir(path, 0755);
p++;
}
fd = open(fullpath, O_WRONLY | O_CREAT, mode);
return fd;
}
int undedup_regfile(int fd, dedup_entry_header dedup_entry_hdr, char *dest_dir, int debug)
{
char pathname[MAX_PATH_LEN] = {0};
block_id_t *metadata = NULL;
unsigned int block_num = 0;
char *buf = NULL;
char *last_block_buf = NULL;
long long offset, i;
int fd_dest, ret = 0;
metadata = (block_id_t *) malloc(BLOCK_ID_SIZE * dedup_entry_hdr.block_num);
if (NULL == metadata)
return errno;
buf = (char *)malloc(g_block_size);
last_block_buf = (char *)malloc(g_block_size);
if (NULL == buf || NULL == last_block_buf)
{
ret = errno;
goto _UNDEDUP_REGFILE_EXIT;
}
read(fd, pathname, dedup_entry_hdr.path_len);
read(fd, metadata, BLOCK_ID_SIZE * dedup_entry_hdr.block_num);
read(fd, last_block_buf, dedup_entry_hdr.last_block_size);
fd_dest = prepare_target_file(pathname, dest_dir, dedup_entry_hdr.mode);
if (fd_dest == -1)
{
ret = errno;
goto _UNDEDUP_REGFILE_EXIT;
}
if (debug)
printf("%s/%s\n", dest_dir, pathname);
/* write regular block */
block_num = dedup_entry_hdr.block_num;
for(i = 0; i < block_num; ++i)
{
offset = DEDUP_PKGHDR_SIZE + metadata[i] * g_block_size;
lseek(fd, offset, SEEK_SET);
read(fd, buf, g_block_size);
write(fd_dest, buf, g_block_size);
}
/* write last block */
write(fd_dest, last_block_buf, dedup_entry_hdr.last_block_size);
close(fd_dest);
_UNDEDUP_REGFILE_EXIT:
if (metadata) free(metadata);
if (buf) free(buf);
if (last_block_buf) free(last_block_buf);
return ret;
}
int undedup_package(char *src_file, char *dest_dir, int debug)
{
int fd, i, ret = 0;
dedup_package_header dedup_pkg_hdr;
dedup_entry_header dedup_entry_hdr;
unsigned long long offset;
if (-1 == (fd = open(src_file, O_RDONLY)))
{
perror("open source file");
return errno;
}
if (read(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE) != DEDUP_PKGHDR_SIZE)
{
perror("read dedup_package_header");
ret = errno;
goto _UNDEDUP_PKG_EXIT;
}
if (debug)
show_pkg_header(dedup_pkg_hdr);
offset = dedup_pkg_hdr.metadata_offset;
for (i = 0; i < dedup_pkg_hdr.file_num; ++i)
{
if (lseek(fd, offset, SEEK_SET) == -1)
{
ret = errno;
break;
}
if (read(fd, &dedup_entry_hdr, DEDUP_ENTRYHDR_SIZE) != DEDUP_ENTRYHDR_SIZE)
{
ret = errno;
break;
}
ret = undedup_regfile(fd, dedup_entry_hdr, dest_dir, debug);
if (ret != 0)
break;
offset += DEDUP_ENTRYHDR_SIZE;
offset += dedup_entry_hdr.path_len;
offset += dedup_entry_hdr.block_num * dedup_entry_hdr.entry_size;
offset += dedup_entry_hdr.last_block_size;
}
_UNDEDUP_PKG_EXIT:
close(fd);
return ret;
}
void usage()
{
printf("Usage: undedup [OPTION...] \n");
printf("\nUnpackage files with deduplicaton technique.\n");
printf("Mandatory arguments to long options are mandatory for short options too.\n");
printf(" -z, --uncompress filter the archive through uncompress\n");
printf(" -c, --directory change to directory, default is PWD\n");
printf(" -d, --debug print debug messages\n");
printf(" -h, --help give this help list\n");
printf("\nReport bugs to .\n");
}
int main(int argc, char *argv[])
{
char tmp_file[] = "./.dedup\0";
char path[MAX_PATH_LEN] = ".\0";
int bz = 0, bhelp = 0, bdebug = 0;
int ret = -1, c;
struct option longopts[] =
{
{"compress", 0, &bz, 'z'},
{"directory", 1, 0, 'c'},
{"debug", 0, &bdebug, 'd'},
{"help", 0, &bhelp, 'h'},
{0, 0, 0, 0}
};
while ((c = getopt_long (argc, argv, "zc:dh", longopts, NULL)) != EOF)
{
switch(c)
{
case 'z':
bz = 1;
break;
case 'c':
sprintf(path, "%s", optarg);
break;
case 'd':
bdebug = 1;
break;
case 'h':
case '?':
default:
bhelp = 1;
break;
}
}
if (bhelp == 1 || (argc - optind) < 1)
{
usage();
return 0;
}
if (bz)
{
/* uncompress and undedup */
ret = zlib_decompress_file(argv[optind], tmp_file);
if (ret == 0)
{
ret = undedup_package(tmp_file, path, bdebug);
unlink(tmp_file);
}
}
else
{
/* only undedup */
ret = undedup_package(argv[optind], path, bdebug);
}
return ret;
}

#include 
<stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <getopt.h>
#include <fcntl.h>
#include <errno.h>
#include "dedup.h"
/* block length */
static unsigned int g_block_size = BLOCK_SIZE;
void show_pkg_header(dedup_package_header dedup_pkg_hdr)
{
        printf("block_size = %d\n", dedup_pkg_hdr.block_size);
        printf("block_num = %d\n", dedup_pkg_hdr.block_num);
        printf("blockid_size = %d\n", dedup_pkg_hdr.blockid_size);
        printf("magic_num = 0x%x\n", dedup_pkg_hdr.magic_num);
        printf("file_num = %d\n", dedup_pkg_hdr.file_num);
        printf("metadata_offset = %lld\n", 
dedup_pkg_hdr.metadata_offset);
}
int prepare_target_file(char *pathname, char *basepath, int mode)
{
	char fullpath[MAX_PATH_LEN] = {0};
	char path[MAX_PATH_LEN] = {0};
	char *p = NULL;
	int pos = 0, fd;
	sprintf(fullpath, "%s/%s", basepath, pathname);
	p = fullpath;
	while (*p != '\0')
	{
		path[pos++] = *p;
		if (*p == '/')
			mkdir(path, 0755);
		p++;
	} 
	fd = open(fullpath, O_WRONLY | O_CREAT, mode);
	return fd;
}
int undedup_regfile(int fd, dedup_entry_header dedup_entry_hdr, char 
*dest_dir, int debug)
{
	char pathname[MAX_PATH_LEN] = {0};
	block_id_t *metadata = NULL;
	unsigned int block_num = 0;
	char *buf = NULL;
	char *last_block_buf = NULL;
	long long offset, i;
	int fd_dest, ret = 0;
	metadata = (block_id_t *) malloc(BLOCK_ID_SIZE * 
dedup_entry_hdr.block_num);
	if (NULL == metadata)
		return errno;
	buf = (char *)malloc(g_block_size);
	last_block_buf = (char *)malloc(g_block_size);
	if (NULL == buf || NULL == last_block_buf)
	{
		ret = errno;
		goto _UNDEDUP_REGFILE_EXIT;
	}
	read(fd, pathname, dedup_entry_hdr.path_len);
	read(fd, metadata, BLOCK_ID_SIZE * dedup_entry_hdr.block_num);
	read(fd, last_block_buf, dedup_entry_hdr.last_block_size);
	fd_dest = prepare_target_file(pathname, dest_dir, 
dedup_entry_hdr.mode);
	if (fd_dest == -1)
	{
		ret = errno;
		goto _UNDEDUP_REGFILE_EXIT;
	}
	if (debug)
		printf("%s/%s\n", dest_dir, pathname);
	/* write regular block */
	block_num = dedup_entry_hdr.block_num;
	for(i = 0; i < block_num; ++i)
	{
		offset = DEDUP_PKGHDR_SIZE + metadata[i] * g_block_size;
		lseek(fd, offset, SEEK_SET);
		read(fd, buf, g_block_size);
		write(fd_dest, buf, g_block_size);
	}
	/* write last block */
	write(fd_dest, last_block_buf, dedup_entry_hdr.last_block_size);
	close(fd_dest);
_UNDEDUP_REGFILE_EXIT:
	if (metadata) free(metadata);
	if (buf) free(buf);
	if (last_block_buf) free(last_block_buf);
	return ret;
}
int undedup_package(char *src_file, char *dest_dir, int debug)
{
	int fd, i, ret = 0;
	dedup_package_header dedup_pkg_hdr;
	dedup_entry_header dedup_entry_hdr;
	unsigned long long offset;
        if (-1 == (fd = open(src_file, O_RDONLY)))
        {
                perror("open source file");
                return errno;
        }
	if (read(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE) != 
DEDUP_PKGHDR_SIZE)
	{
		perror("read dedup_package_header");
		ret = errno;
		goto _UNDEDUP_PKG_EXIT;
	}
	if (debug)
		show_pkg_header(dedup_pkg_hdr);
	offset = dedup_pkg_hdr.metadata_offset;
	for (i = 0; i < dedup_pkg_hdr.file_num; ++i)
	{
		if (lseek(fd, offset, SEEK_SET) == -1)
		{
			ret = errno;
			break;
		}
			
		if (read(fd, &dedup_entry_hdr, DEDUP_ENTRYHDR_SIZE) != 
DEDUP_ENTRYHDR_SIZE)
		{
			ret = errno;
			break;
		}
		ret = undedup_regfile(fd, dedup_entry_hdr, dest_dir, debug);
		if (ret != 0)
			break;
		offset += DEDUP_ENTRYHDR_SIZE;
		offset += dedup_entry_hdr.path_len;
		offset += dedup_entry_hdr.block_num * dedup_entry_hdr.entry_size;
		offset += dedup_entry_hdr.last_block_size;
	}
_UNDEDUP_PKG_EXIT:
	close(fd);
	return ret;
}
void usage()
{
        printf("Usage:  undedup [OPTION...] <source file>\n");
	printf("\nUnpackage files with deduplicaton technique.\n");
	printf("Mandatory arguments to long options are mandatory for short 
options too.\n");
	printf("  -z, --uncompress  filter the archive through uncompress\n");
	printf("  -c, --directory   change to directory, default is PWD\n");
	printf("  -d, --debug       print debug messages\n");
	printf("  -h, --help        give this help list\n");
	printf("\nReport bugs to <Aigui.Liu@gmail.com>.\n");
}
int main(int argc, char *argv[])
{
	char tmp_file[] = "./.dedup\0";
	char path[MAX_PATH_LEN] = ".\0";
	int bz = 0, bhelp = 0, bdebug = 0;
	int ret = -1, c;
	struct option longopts[] =
	{
		{"compress", 0, &bz, 'z'},
		{"directory", 1, 0, 'c'},
		{"debug", 0, &bdebug, 'd'},
		{"help", 0, &bhelp, 'h'},
		{0, 0, 0, 0}
	};
	while ((c = getopt_long (argc, argv, "zc:dh", longopts, NULL)) != EOF)
	{
		switch(c)
		{
		case 'z':
			bz = 1;
			break;
		case 'c':
			sprintf(path, "%s", optarg);
			break;
		case 'd':
			bdebug = 1;
			break;
		case 'h':
		case '?':
		default:
			bhelp = 1;
			break;
		}
	}
	if (bhelp == 1 || (argc - optind) < 1)
	{
		usage();
		return 0;
	}
	if (bz)
	{
		/* uncompress and undedup */
		ret = zlib_decompress_file(argv[optind], tmp_file);
		if (ret == 0)
		{
			ret = undedup_package(tmp_file, path, bdebug);
			unlink(tmp_file);
		}
	}
	else
	{
		/* only undedup */
		ret = undedup_package(argv[optind], path, bdebug);
	}
	return ret;
}

/* dedup usage */

Usage: dedup [OPTION...]

Package files with deduplicaton technique.

-z, --compress filter the archive through compress

-b, --block block size for deduplication, default is 4096

-t, --hashtable hashtable backet number, default is 10240

-d, --debug print debug messages

-h, --help give this help list

/* undedup usage */

Usage: undedup [OPTION...]

Unpackage files with deduplicaton technique.

-z, --uncompress filter the archive through uncompress
-c, --directory   change to directory, default is PWD
-d, --debug       print debug messages
-h, --help        give this help list

4、初步测试
这里使用linux最新的kernel源码进行测试，并与tar工具进行比较。从下载linux-2.6.32.tar.gz文件，并解压出源文件，然后分别使用tar和dedup工具进行打包，分别得到以下几个文件。

Filename	File size	commands
linux-2.2.6.32.tar	382392320 (365MB)	tar cvf linux-2.6.32.tar linux-2.6.32/
linux-2.2.6.32.tar.dd	380381944 (363M)	dedup linux-2.6.32.tar.dd linux-2.6.32.tar
linux-2.2.6.32.dd	357325910 (341MB)	dedup linux-2.6.32.dd linux-2.6.32/
linux-2.2.6.32.tar.gz	84322110 (81MB)	gzip -c linux-2.6.32.tar > linux-2.6.32.tar.gz
linux-2.2.6.32.tar.dd.gz	83978234 (81MB)	gzip -c linux-2.6.32.tar.dd > linux-2.6.32.tar.dd.gz
linux-2.2.6.32.dd.gz	83674306 (80MB)	gzip -c linux-2.6.32.dd > linux-2.6.32.dd.gz

linux-2.6.32.tar.gz解压出来的kernel源码文件数据很多，使用这个文件来测试应该具有普遍的意义。通过初步的测试结果，我们可以看出，即使在这样不明确数据是否具备较高重复率的情况下，dedup技术也能较明显地减少数据包的数据量。在数据重复率很高的测试用例下，比如全0或全1 的大文件，dedup要远远优于tar。比如，全0的64MB文件，tar+gzip的结果为65KB，而dedup的结果才有286字节。

5、TODO
1、变长数据块。目前是定长数据块的实现，技术上较为简单，变长数据块可能会获得更高的数据压缩率。
2、相似文件识别。如果两个文件只有很小的差别，比如在某处插入了若干字节，找出这些数据块并单独处理，可能会提高数据压缩率。

阅读(1429) | 评论(0) | 转发(0) |

上一篇：重复数据删除技术

下一篇：Linux下的 C/C++ 开发与调试工具

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6