2010-03-27 16:11:42
Large scale redundancy removal
In many situations we need to remove duplicate items from a set. The usual approach is to read the items into a hash table one by one and write the distinct items to a result file at the end. This works well when the set is small, but as the set grows, performance and memory problems arise: the hash table degrades and the memory it needs keeps growing. How can we solve this problem?
Define X as the set and let x be an item of X. Let H = {h | h = hash(x), x belongs to X} be the set of hash values of X. We can divide H into n parts by choosing boundaries h0 <= h1 <= ... <= h(n-1): p1 = {h | h0 <= h < h1}, p2 = {h | h1 <= h < h2}, ..., p(n-1) = {h | h(n-2) <= h < h(n-1)}, pn = {h | h(n-1) <= h}. Now we traverse the set X and output x into file i if hash(x) belongs to pi. If computing the hash value is expensive, we can also store the hash value in file i so it does not have to be recomputed. Then we traverse file 1 through file n, remove the duplicates within each subset using the common in-memory method, and write the results into a single output file.
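For example, if the hash values are 32-bit unsigned integers and the boundaries are chosen to be equally spaced, the first pass only needs to map each hash value to a partition index and append the item to the corresponding file. A minimal C# sketch (the equal spacing and the names below are assumptions, not the only possible choice):

class HashPartitioner
{
    // Maps a 32-bit hash value to one of n partitions p1..pn, assuming
    // equally spaced boundaries h_i = i * (uint.MaxValue / n); the last
    // partition is open-ended, matching pn = {h | h >= h(n-1)}.
    public static int PartitionIndex(uint hash, int n)
    {
        uint width = uint.MaxValue / (uint)n;
        int index = (int)(hash / width);
        return index >= n ? n - 1 : index;
    }
}

In the first pass, each item x would then be appended to the file whose index is PartitionIndex(hash(x), n).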
Usually the set is a list of strings. In the first step we use the string length as the hash value to divide the list into several parts; in the second step we compare the string values themselves to remove the duplicates within each part.
Here is a C# sketch of this approach.
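The sketch assumes the input is a text file with one string per line; the file names (input.txt, result.txt, part0.txt, ...), the number of partitions, and the range width are illustrative choices, not fixed requirements. The first pass splits the input by string length, and the second pass removes duplicates within each partition with a HashSet.

using System.Collections.Generic;
using System.IO;

class LargeScaleDedup
{
    // Illustrative parameters: 16 partitions, each covering a range of string lengths.
    const int PartitionCount = 16;
    const int RangeWidth = 8;   // partition i holds strings of length [i*8, (i+1)*8); the last one is open-ended

    static int PartitionIndex(string s)
    {
        int index = s.Length / RangeWidth;
        return index >= PartitionCount ? PartitionCount - 1 : index;
    }

    static void Main()
    {
        string inputPath = "input.txt";    // one string per line (illustrative path)
        string outputPath = "result.txt";
        string[] partFiles = new string[PartitionCount];

        // Pass 1: split the input into partition files by string length.
        var writers = new StreamWriter[PartitionCount];
        for (int i = 0; i < PartitionCount; i++)
        {
            partFiles[i] = "part" + i + ".txt";
            writers[i] = new StreamWriter(partFiles[i]);
        }
        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writers[PartitionIndex(line)].WriteLine(line);
        }
        foreach (var w in writers)
            w.Close();

        // Pass 2: remove duplicates within each partition using an in-memory
        // HashSet, then append the unique strings to the single result file.
        using (var output = new StreamWriter(outputPath))
        {
            for (int i = 0; i < PartitionCount; i++)
            {
                var seen = new HashSet<string>();
                using (var reader = new StreamReader(partFiles[i]))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        if (seen.Add(line))        // Add returns false for a duplicate
                            output.WriteLine(line);
                }
                File.Delete(partFiles[i]);         // temporary partition file no longer needed
            }
        }
    }
}

Each partition only has to fit in memory on its own, so the peak memory use is bounded by the largest partition rather than by the whole set.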