Category: Python/Ruby

2008-06-30 15:28:19

Fastest way to uniqify a list in Python

Suppose you have a list in Python that looks like this:

 ['a','b','a']
 # or like this:
 [1,2,2,2,3,4,5,6,6,6,6]

and you want to remove all duplicates so you get this result:

 ['a','b']
 # or
 [1,2,3,4,5,6]

How do you do that? ...the fastest way? I wrote a couple of alternative implementations and ran a quick benchmark loop over them to find out which was fastest. (I haven't looked at memory usage.) The slowest function was 78 times slower than the fastest.

However, there's one very important difference between the various functions: some are order preserving and some are not. In an order-preserving function, apart from the removed duplicates, the output order is guaranteed to match the input order, e.g. uniqify([1,2,2,3]) == [1,2,3].

Here are the functions:

 def f1(seq):
    # not order preserving
    set = {}
    map(set.__setitem__, seq, [])
    return set.keys()

 def f2(seq): 
    # order preserving
    checked = []
    for e in seq:
        if e not in checked:
            checked.append(e)
    return checked

 def f3(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

 def f4(seq): 
    # order preserving
    noDupes = []
    [noDupes.append(i) for i in seq if not noDupes.count(i)]
    return noDupes

 def f5(seq, idfun=None): 
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

 def f6(seq):
    # Not order preserving
    from sets import Set  # needs Python 2.3's sets module
    set = Set(seq)
    return list(set)
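
A benchmark loop along these lines will reproduce the comparison (a sketch using the standard timeit module; the test data and repeat count here are assumptions, not the originals):

 import timeit

 testdata = [1,2,2,2,3,4,5,6,6,6,6] * 100

 for name in ('f1', 'f2', 'f3', 'f4', 'f5', 'f6'):
     timer = timeit.Timer('%s(testdata)' % name,
                          'from __main__ import %s, testdata' % name)
     print name, timer.timeit(number=1000)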

And what you've all been waiting for (if you're still reading). Here are the results:

 * f2 13.24
 * f4 11.73
 * f5 0.37
 f1 0.18
 f3 0.17
 f6 0.19

 (* order preserving)

Clearly f5 is the "best" solution. Not only is it really, really fast; it's also order preserving and supports an optional transform function, which makes it possible to do this:

 >>> a=list('ABeeE')
 >>> f5(a)
 ['A','B','e','E']
 >>> f5(a, lambda x: x.lower())
 ['A','B','e'] 

UPDATE

From the comments I've now added a couple more functions to the benchmark. Some of them can't uniqify a list of objects that aren't hashable unless a special hashing function is passed in.
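
For instance, a list of (unhashable) lists still works with f5 if you pass a hashable transform such as tuple:

 >>> f5([[1,2], [3], [1,2]], idfun=tuple)
 [[1, 2], [3]]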

Here are the new results:

 * f5 10.1
 * f5b 9.99
 * f8 6.49
 * f10 6.57
 * f11 6.6
 f1 4.28
 f3 3.55
 f6 4.03
 f7 2.59
 f9 2.58

f2 and f4 were too slow for this test data.


Comments

- 14th August 2006
Keep in mind you can also use:

>>> lst = [1, 1, 3, 4, 4, 5, 6, 7, 6]
>>> set(lst)
set([1, 3, 4, 5, 6, 7])
Paul Boddie - 14th August 2006
Isn't that what f6 does, apart from the final conversion to a list again?
- 14th August 2006
Right. I totally missed f6().
none - 14th August 2006
It would have been interesting to test against more complex objects which redefine __cmp__.

- Sylvain

PS: Your comment system seems to be broken as I currently have the name and email fields filled up with Lawrence details.
- 14th August 2006
Thanks for the bug report. I'm working on fixing it. It's due to the fact that I've started caching the pages on a proxying server. Thanks.
- 14th August 2006
Should be fixed now.
- 14th August 2006
problem with the previous comment:
>>> lst = ['a','a','b',1,1,2,3,4,5]
>>> set(lst)
set(['a', 1, 2, 'b', 4, 5, 3])
Justin - 14th August 2006
Looks like you are using Set from Python 2.3 and not set from Python 2.4.
djo - 14th August 2006
Peter - worth reading this:

Short version: map() can be trivially parallelized, for() can't. Functional-tasting solutions are going to run faster in multi-processor environments. Worth considering.
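
As a very rough sketch of that idea (using the multiprocessing module, which arrived later, in Python 2.6; the transform function is a made-up stand-in, and only the per-item mapping is parallelized, not the dedup itself):

from multiprocessing import Pool

def transform(x):
    # hypothetical pure, side-effect-free per-item function
    return x.lower()

if __name__ == '__main__':
    pool = Pool(4)                  # four worker processes
    markers = pool.map(transform, ['A', 'b', 'a', 'B'])
    pool.close()
    pool.join()
    print markers                   # prints ['a', 'b', 'a', 'b']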
Martijn Faassen - 14th August 2006
So how would you parallelize uniqify, trivially or not?
djo - 15th August 2006
That's a good point; I just jumped in with something I thought was useful when I saw map(), the comparison implicit in __setitem__ passed me by. Using map() to generate side-effects is pretty alien to me.

In that case, I guess I'd do something based loosely on quicksort. It's nearly there already, it simply needs to throw away duplicate results when concatenating the child arrays (easy to do when the child arrays are guaranteed in-order and unique), and because it's a divide and conquer algorithm it should lend itself to a parallel environment.
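
Something like this, say (an untested sketch; note the result comes out sorted rather than in input order, and the parallelism would come from running the two recursive calls on separate processors):

def merge_unique(a, b):
    # merge two sorted, duplicate-free lists, keeping one copy of dupes
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i]); i += 1
        elif a[i] > b[j]:
            result.append(b[j]); j += 1
        else:
            result.append(a[i]); i += 1; j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result

def sort_unique(seq):
    # divide and conquer: the two halves are independent subproblems
    if len(seq) <= 1:
        return list(seq)
    mid = len(seq) // 2
    return merge_unique(sort_unique(seq[:mid]), sort_unique(seq[mid:]))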
- 15th August 2006
Not much point in parallelizing sorting unless the number of elements is high enough that the splitting is actually worth it; plus, at least for Python, there is still the GIL to worry about...
Brantley Harris - 14th August 2006
This comes up often enough for me that I think there should be a built-in function.
- 14th August 2006
Your benchmark code likely doesn't do what you think it does for the class instance case. The 'X' instances always compare different so everything is unique.

Here's my example, with a bit of a cheat to make the 'idfun=None' case faster:

def f7(seq, idfun=None):
    return list(_f7(seq, idfun))

def _f7(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)
            if x in seen:
                continue
            seen.add(x)
            yield x

Since your benchmark didn't test that case I figured I could ignore it. :) The timing numbers I get are

* f2 66.65
* f4 66.13
* f5 2.19
* f7 1.91
f1 1.06
f3 0.97
f6 0.99

and these are in line with my own benchmark. Function call overhead in Python is high. Most of the performance difference comes from calling idfun.

I also don't incur the overhead of doing the list.append lookup for each new element. The list making is all in C code. There is overhead for the generator but that's pretty fast. You may also end up preferring an iterator solution over making full lists.

Regarding the parallelization of 'map': that assumes a pure functional environment. Consider:

>>> class Counter(object):
...     def __init__(self): self.count = 0
...     def __call__(self, x):
...         i = self.count; self.count += 1; return i
...
>>> map(Counter(), [9, 8, 3, 6, 1])
[0, 1, 2, 3, 4]
>>>
Manuzhai - 14th August 2006
I'd say that list(set([...])) is short enough? Or you could just work from the set directly. And I think the built-in set got a lot faster going from 2.3 to 2.4, due to it being rewritten in C.
Fuzzyman - 15th August 2006
Even in Python 2.4 the built-in set is based on the Python dictionary. This is why the speed results were so similar.

In Python 2.5 it has been optimised further with a custom internal data-structure, and *should* be faster.

Fuzzyman
- 16th August 2006
But not ordered, unfortunately.
Dave Kirby - 14th August 2006
Firstly, "set" is a built-in type in 2.4, and it is bad form to name variables the same as existing types or modules - it can lead to confusing code and subtle bugs.

Secondly, the sets.Set class used in f6 is implemented in Python, both in 2.3 and 2.4, while the set builtin type is implemented in C so should be significantly faster.

Here is an alternative order-preserving function. I have not timed it, but it should be pretty fast:

def f7(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]
Michael Urman - 15th August 2006
What's most interesting is how changing the data seriously changes the actual performance. So if you know what your data looks like, you might be able to do better than f5. If you don't know, f5 is probably a safe bet.

For instance, by changing the data to list('abcab'), and running the test 1000 times instead of 10, f2 becomes second fastest, and almost twice as fast as f5:
* f2 0.14
* f4 0.22
* f5 0.28
f1 0.16
f3 0.12
f6 0.45
- 15th August 2006
Kirby had the right idea but his implementation is more complicated than it needs to be.

def f7(seq): return list(set(seq))

You can drop the wrapping list() if you just need a sequence and not a real list. In that case the function is just

f7 = set
- 15th August 2006
Alright, so my list of changes:

Don't use sets.Set, use set built-in for all functions.

Use Andrew Dalke's f5() (though returning a generator is not what was required, reworked to return a list using list.append)

Use my f7(), which is:

def f7(seq):
    return {}.fromkeys(seq).keys()

Use Dave Kirby's f7() as f8().

Modify the test case to be huge:

blahlist = range(100000)
testdata = map(lambda a: a % 3, blahlist) + blahlist

Stop using f2() and f4(), they lose every time.

Results (old test case on the right):

* f5 54.688 --- 0.328
* f8 43.797 --- 0.297
f1 32.047 --- 0.156
f3 24.14 --- 0.157
f6 13.531 --- 0.14
f7 14.251 --- 0.156

So it looks like using the set built-in beats both sets.Set() and dict.

The other thing is there are probably very few cases where you'd want to unique-ify a list AND care about the order of that list at the same time. Removing duplicate transmissions is in that category (though you probably ought to fix your protocol if that's the case), but if there are others I can't think of them.
Anonymous - 15th August 2006
If you recode the block

if marker in seen: continue
seen[marker] = 1
result.append(item)

to

if marker not in seen:
    seen[marker] = 1
    result.append(item)

the performance of f5 doubles!
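
Folded back into a full function, that restructuring looks like this (presumably the f5b that appears in the updated results above):

def f5b(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker not in seen:
            seen[marker] = 1
            result.append(item)
    return result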
Anonymous - 17th August 2006
Fastest non-preserving (?):

def f7(seq):
    S, L = set, list
    return S(L(seq))
Jiri Vit - 18th August 2006
Speed tip:
make local references in functions for builtins

def foo():
    L = []
    AppendToListL = L.append
    ...
    ...
    AppendToListL(item)
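
Applied to f5, the tip would look something like this (a sketch; f5c is a made-up name here, and the gain from this micro-optimization is modest and version-dependent):

def f5c(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    append = result.append   # local reference avoids repeated attribute lookup
    for item in seq:
        marker = idfun(item)
        if marker not in seen:
            seen[marker] = 1
            append(item)
    return result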
thesamet - 20th August 2006
The results just confirm a known strategy for optimizing Python code: replace Python logic with a C implementation that does the same thing. If it's built-in, all the better. :)
- 21st August 2006
If the nicely simple list(set(seq)) appeals, then this variation may as well, as it's fast and preserves order:

def f13(seq):
    # order preserving
    uniq = set(seq)
    return [item for item in seq if item in uniq]
- 21st August 2006
Oops, never mind the above... I should really have had a coffee first (and written a test case too...). Dave Kirby's version gets a +1 from me.
- 21st August 2006
The coffee-corrected return for f13 (above) should be:

return [item for item in seq if item in uniq and not uniq.remove(item)]

Essentially the inverse of Dave Kirby's approach but just a hair faster for whatever reason set() only knows.
nn - 22nd August 2006
My bet is on the memory allocation strategy. Adding to a set repeatedly allocates more memory for the set. I guess removing from the set does not deallocate memory until the end of the function.
Maxime Biais - 14th February 2007
I always use this one (order preserving, generator-based, concise, but doesn't work on non-hashable items).

from itertools import ifilter
def _f14(iterable):
    m = set()
    return ifilter(lambda x: not (m.__contains__(x) or m.add(x)), iterable)

def f14(lst):
    return list(_f14(lst))

My results:
* f5 1.94
* f5b 1.89
* f8 1.18
* f10 1.24
* f11 1.23
* f13 1.31
* f14 1.87
f1 0.67
f3 0.64
f6 0.68
f7 0.38
f9 0.41

Not so bad ;)