2010-03-27 16:11:42
Large scale redundancy removal
In many situations we need to remove duplicate items from a set. The usual approach is to read the items into a hash table one by one and write the distinct items to a result file at the end. This works well when the set is small, but as the set grows, performance and memory problems arise: the hash table degrades and the memory it needs keeps growing. How can we solve this problem?
Define X as the set and let x be an item of X. Let H = {h | h = hash(x), x belongs to X} be the set of hash values of X. We can divide H into n parts by choosing boundaries h0 <= h1 <= ... <= h(n-1): p1 = {h | h0 <= h < h1}, p2 = {h | h1 <= h < h2}, ..., p(n-1) = {h | h(n-2) <= h < h(n-1)}, pn = {h | h(n-1) <= h}. Now we traverse the set X and output x into file i if hash(x) belongs to pi. If computing the hash value is expensive, we can also store the hash value in file i so it does not have to be recomputed. Then we traverse file 1 through file n, remove the duplicates within each subset using the common in-memory method, and write the results into a single output file.
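For example, if the hash values are 32-bit unsigned integers and the boundaries are chosen to be equally spaced, the first pass only needs to map each hash value to a partition index and append the item to the corresponding file. A minimal C# sketch (the equal spacing and the names below are assumptions, not the only possible choice):

class HashPartitioner
{
    // Maps a 32-bit hash value to one of n partitions p1..pn, assuming
    // equally spaced boundaries h_i = i * (uint.MaxValue / n); the last
    // partition is open-ended, matching pn = {h | h >= h(n-1)}.
    public static int PartitionIndex(uint hash, int n)
    {
        uint width = uint.MaxValue / (uint)n;
        int index = (int)(hash / width);
        return index >= n ? n - 1 : index;
    }
}

In the first pass, each item x would then be appended to the file whose index is PartitionIndex(hash(x), n).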
Usually the set is a list of strings. In the first step we use the string length as the hash value to divide the list into several parts; in the second step we compare the string values themselves to remove the duplicates within each part.
Here is a C# sketch of this approach.
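The sketch assumes the input is a text file with one string per line; the file names (input.txt, result.txt, part0.txt, ...), the number of partitions, and the range width are illustrative choices, not fixed requirements. The first pass splits the input by string length, and the second pass removes duplicates within each partition with a HashSet.

using System.Collections.Generic;
using System.IO;

class LargeScaleDedup
{
    // Illustrative parameters: 16 partitions, each covering a range of string lengths.
    const int PartitionCount = 16;
    const int RangeWidth = 8;   // partition i holds strings of length [i*8, (i+1)*8); the last one is open-ended

    static int PartitionIndex(string s)
    {
        int index = s.Length / RangeWidth;
        return index >= PartitionCount ? PartitionCount - 1 : index;
    }

    static void Main()
    {
        string inputPath = "input.txt";    // one string per line (illustrative path)
        string outputPath = "result.txt";
        string[] partFiles = new string[PartitionCount];

        // Pass 1: split the input into partition files by string length.
        var writers = new StreamWriter[PartitionCount];
        for (int i = 0; i < PartitionCount; i++)
        {
            partFiles[i] = "part" + i + ".txt";
            writers[i] = new StreamWriter(partFiles[i]);
        }
        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writers[PartitionIndex(line)].WriteLine(line);
        }
        foreach (var w in writers)
            w.Close();

        // Pass 2: remove duplicates within each partition using an in-memory
        // HashSet, then append the unique strings to the single result file.
        using (var output = new StreamWriter(outputPath))
        {
            for (int i = 0; i < PartitionCount; i++)
            {
                var seen = new HashSet<string>();
                using (var reader = new StreamReader(partFiles[i]))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        if (seen.Add(line))        // Add returns false for a duplicate
                            output.WriteLine(line);
                }
                File.Delete(partFiles[i]);         // temporary partition file no longer needed
            }
        }
    }
}

Each partition only has to fit in memory on its own, so the peak memory use is bounded by the largest partition rather than by the whole set.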