2010-03-27 16:11:42

Large-scale redundancy removal

     In many situations we need to remove duplicate items from a set. The usual approach is to read the items into a hash table one by one and write the distinct ones to a result file at the end. This works well when the set is small, but as the set grows, performance and memory problems arise: the hash table degrades and the memory it occupies keeps growing. How do we solve this problem?
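
     For reference, here is a minimal sketch of the straightforward in-memory approach (the file names are hypothetical); this is exactly the version that breaks down once the distinct items no longer fit in memory:

using System.Collections.Generic;
using System.IO;

class NaiveDedup
{
    static void Main()
    {
        // One in-memory set holds every distinct line seen so far, so memory
        // grows with the number of unique items; that is the weakness at scale.
        HashSet<string> seen = new HashSet<string>();
        using (StreamReader reader = new StreamReader("input.txt"))
        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            while (!reader.EndOfStream)
            {
                string line = reader.ReadLine();
                if (seen.Add(line))          // Add returns false for a duplicate
                    writer.WriteLine(line);
            }
        }
    }
}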

     Define X as the set and x as an item of X. Let H = {h | h = hash(x), x belongs to X} be the hash set of X. We can divide H into n parts: p1 = {h | h0 <= h < h1}, p2 = {h | h1 <= h < h2}, ..., p(n-1) = {h | h(n-2) <= h < h(n-1)}, pn = {h | h(n-1) <= h}. Now we traverse the set X and write x into file i if hash(x) belongs to pi. If computing the hash value is expensive, we can store the hash result in file i as well, so it need not be recomputed later. Then we read file 1 through file n in turn, remove the duplicates within each subset using the ordinary in-memory method, and append the results to a single output file.
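
     As a sketch of the range partition described above (assuming n equal-width buckets over the non-negative hash values; the helper name is mine, not part of the original post), the bucket index of an item can be computed like this, and x is then written to writers[BucketOf(x, n)]. The implementation further below substitutes string length for this hash:

// Maps an item to one of n equal-width hash ranges p1..pn.
static int BucketOf(string x, int n)
{
    int h = x.GetHashCode() & int.MaxValue;            // clear the sign bit so 0 <= h <= int.MaxValue
    return (int)((long)h * n / (1L + int.MaxValue));   // scale h proportionally into 0..n-1
}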

     Usually the set is a list of strings. In the first step we use the string length as the hash value to divide the list into several parts (for example, with a bucket width of 20 characters, a 35-character string falls into the second bucket); in the second step we compare the full string values within each part to remove the duplicates.

     Here is a C# implementation:

using System;
using System.Collections;
using System.IO;

class Deduplicator
{
    // Writes the distinct lines of inputFileName to outputFileName;
    // duplicates are logged to outputFileName + ".duplicate.txt".
    public static void Deduplicate(string inputFileName, string outputFileName)
    {
        int interval = 20;   // partition by line length in steps of 20 characters
        int dn = 50;         // number of partitions (temporary files)
        string[] fileNames = new string[dn];
        StreamWriter[] writers = new StreamWriter[dn];
        for (int i = 0; i < dn; i++)
        {
            fileNames[i] = outputFileName + "." + i.ToString();
            writers[i] = new StreamWriter(fileNames[i]);
        }

        // Step 1: scatter the input lines into partition files,
        // using line length as the cheap first-pass hash.
        StreamReader reader = new StreamReader(inputFileName);
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            int n = line.Length / interval;
            if (n > dn - 1) n = dn - 1;   // clamp very long lines into the last partition
            writers[n].WriteLine(line);
        }
        reader.Close();

        for (int i = 0; i < dn; i++)
        {
            writers[i].Close();
        }

        Console.WriteLine("Finished partition");

        // Step 2: deduplicate each partition independently with an in-memory hash table.
        StreamWriter duplicateWriter = new StreamWriter(outputFileName + ".duplicate.txt");
        StreamWriter output = new StreamWriter(outputFileName);
        int count = 0;
        for (int i = 0; i < dn; i++)
        {
            Hashtable hi = new Hashtable();
            StreamReader ireader = new StreamReader(fileNames[i]);
            while (!ireader.EndOfStream)
            {
                string line = ireader.ReadLine();

                if (hi.ContainsKey(line) == false)
                {
                    hi.Add(line, null);        // first occurrence: keep it
                    output.WriteLine(line);
                }
                else
                {
                    duplicateWriter.WriteLine(line);   // seen before: log the duplicate
                }
            }
            ireader.Close();
            count += hi.Count;
        }
        output.Close();
        duplicateWriter.Close();
        Console.WriteLine("Done: " + count);

        // Remove the temporary partition files.
        for (int i = 0; i < dn; i++)
        {
            File.Delete(fileNames[i]);
        }
    }
}
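
     With the snippet wrapped in a class as above, a run is simply:

Deduplicator.Deduplicate("input.txt", "output.txt");   // file names are hypothetical

     Note that peak memory is now bounded by the largest partition rather than by the whole set. String length is a very coarse hash, though: if most lines have similar lengths the partitions will be skewed, and a real hash function (as in the general scheme above) spreads the load more evenly, at the cost of computing it for every line.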

