About the author

Author of 《Python科学计算》 (Python Scientific Computing)

Series: combining the functionality of arrays and dictionaries

			In [3]:
		

import numpy as np

import pandas as pd

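In the pandas version used here, a Series object is essentially a NumPy array, so NumPy's array functions can operate on a Series directly. In addition to using positions as subscripts, a Series can also be accessed with label subscripts, which makes it similar to a dictionary. Every Series object actually consists of two arrays: index, which holds the labels, and values, which holds the data.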

The following creates a Series object and inspects these two attributes:

			In [4]:
		

s = pd.Series([1,2,3,4,5], index=["a","b","c","d","e"])

print s.index

print s.values

Index([a, b, c, d, e], dtype=object)

[1 2 3 4 5]

Element access on a Series supports both positions and labels:

			In [5]:
		

print s[2], s["d"]

3 4

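Series also supports positional slicing and label slicing. Positional slicing follows Python's slicing rules: it includes the start position but excludes the end position. Label slicing, however, includes both the start label and the end label. The reason for this design is that when slicing by label we usually do not know the order of the labels, so if the end label were excluded it would be hard to tell which label comes just before it.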

			In [6]:
		

C((u"s[1:3]", s[1:3]), (u"s['b':'d']", s['b':'d']))

s[1:3]     s['b':'d']

------     ----------

b    2     b    2

c    3     c    3

           d    4
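The two slicing rules can be made explicit with the `.iloc` (position) and `.loc` (label) accessors. The following is a minimal sketch, assuming Python 3 and a recent pandas release rather than the Python 2 environment used in the cells above. The `C` display helper from the cell above is not defined in this post, so the sketch simply prints each result.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Positional slice: start included, stop excluded (normal Python rule).
by_position = s.iloc[1:3]

# Label slice: both the start label and the stop label are included.
by_label = s.loc["b":"d"]

print(by_position.tolist())  # values at positions 1 and 2
print(by_label.tolist())     # values at labels b, c and d
```

Using the accessors avoids any ambiguity about whether an integer subscript means a position or a label.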

Like a NumPy array, a Series can be accessed with a list or an array of positions; it can also be accessed with a list or an array of labels.

			In [7]:
		

C((u"s[[1,3,2]]", s[[1,3,2]]), (u"s[['b','d','c']]", s[['b','d','c']]))

s[[1,3,2]]     s[['b','d','c']]

----------     ----------------

b    2         b    2

d    4         d    4

c    3         c    3
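List-based selection can also be written with the explicit accessors. This is a short sketch, assuming Python 3 and a recent pandas release; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# A list (or array) of positions selects elements in the order given.
picked_by_position = s.iloc[np.array([1, 3, 2])]

# A list (or array) of labels does the same, keyed by label.
picked_by_label = s.loc[["b", "d", "c"]]

# Both selections contain the same labels and values in the same order.
print(picked_by_position.equals(picked_by_label))
```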

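As can be seen, a Series has the functionality of both an array and a dictionary, so it also supports some dictionary methods, for example Series.iteritems():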

			In [8]:
		

print list(s.iteritems())

[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
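`iteritems()` follows the Python 2 dictionary API. For readers on newer environments: with Python 3 and recent pandas releases the same pairs come from `Series.items()`, since `iteritems()` was removed in pandas 2.0. A minimal sketch under that assumption:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# items() yields (label, value) pairs lazily, like dict.items().
pairs = list(s.items())
print(pairs)

# Dictionary-style membership tests look at the labels, not the values.
print("c" in s, 3 in s)
```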

The magic of Series is all in the Index

An Index object is also a subclass of ndarray, and its values attribute returns the underlying ndarray:

			In [31]:
		

index = s.index

print index.__class__.mro()

index.values

[<class 'pandas.core.index.Index'>, <type 'numpy.ndarray'>, <type 'object'>]

					Out[31]:
				

array(['a', 'b', 'c', 'd', 'e'], dtype=object)

An Index can be treated as a one-dimensional array and supports all of the array subscript operations:

			In [35]:
		

print index[[1, 3]]

print index[index > 'c']

print index[1::2]

Index([b, d], dtype=object)

Index([d, e], dtype=object)

Index([b, d], dtype=object)
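For comparison, here is a sketch of the same operations on Python 3 with a recent pandas release. One difference is worth labeling as an assumption about the newer environment: there, `Index` is no longer a subclass of `ndarray`, so the sketch checks this with `isinstance` and obtains the array through `to_numpy()`.

```python
import numpy as np
import pandas as pd

# dtype=object mirrors the dtype shown in the outputs above.
index = pd.Index(["a", "b", "c", "d", "e"], dtype=object)

# In recent pandas, Index wraps an array instead of subclassing ndarray.
print(isinstance(index, np.ndarray))
print(type(index.to_numpy()))

# The array-style subscripts from the cell above still apply.
print(list(index[[1, 3]]))        # list of positions
print(list(index[index > "c"]))   # boolean mask
print(list(index[1::2]))          # slice with a step
```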

An Index also has the mapping functionality of a dictionary: it maps each value in the array to its position:

			In [43]:
		

print index.get_loc('c')

index.get_indexer(['a', 'c', 'z'])

2

					Out[43]:
				

array([ 0,  2, -1])
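The `-1` entries returned by `get_indexer()` mark values that are not in the Index, which makes it easy to separate hits from misses. A minimal sketch, assuming Python 3 and a recent pandas release; the names `wanted` and `found` are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.Index(["a", "b", "c", "d", "e"], dtype=object)
wanted = np.array(["a", "c", "z"], dtype=object)

# get_loc() maps a single value to its position.
print(index.get_loc("c"))

# get_indexer() maps many values at once; -1 means "not found".
positions = index.get_indexer(wanted)
found = positions >= 0

print(positions.tolist())
print(wanted[found].tolist(), wanted[~found].tolist())
```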

The dictionary functionality of an Index object is provided by the Engine object inside it:

			In [18]:
		

e = s.index._engine

print e

			In [22]:
		

e.get_loc('b')

					Out[22]:
				

1

			In [42]:
		

e.get_indexer(np.array(['a', 'd', 'e', 'z'], 'O'))

					Out[42]:
				

array([ 0,  3,  4, -1])

The dictionary functionality of the Engine object is in turn provided by its mapping attribute:

			In [19]:
		

ht = e.mapping

print ht

			In [53]:
		

ht.get_item('d')

					Out[53]:
				

3

			In [55]:
		

ht.lookup(np.array(['a', 'd', 'e', 'z'], 'O'))

					Out[55]:
				

array([ 0,  3,  4, -1])
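Conceptually, the mapping object is a hash table from each value to its position. The sketch below imitates the two calls shown above in pure Python. It illustrates the idea only and is not the pandas implementation: the class name `PositionTable` is hypothetical, and the method names are simply borrowed from the cells above.

```python
class PositionTable:
    """Toy stand-in for a hash table that maps unique values to positions."""

    def __init__(self, values):
        self._positions = {}
        for pos, value in enumerate(values):
            if value in self._positions:
                raise ValueError("values must be unique: %r" % (value,))
            self._positions[value] = pos

    def get_item(self, value):
        # A missing value raises KeyError, as a plain dict lookup would.
        return self._positions[value]

    def lookup(self, values):
        # Vectorized form: -1 marks values that are not in the table.
        return [self._positions.get(v, -1) for v in values]


table = PositionTable(["a", "b", "c", "d", "e"])
print(table.get_item("d"))
print(table.lookup(["a", "d", "e", "z"]))
```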

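When every value in an Index is unique, a HashTable can be used to map each value to its position. If the values are not unique, pandas falls back to one of two slower algorithms: binary search when the values are sorted, and a comparison against every value otherwise.

The internal implementation of pandas also takes memory into account: for a large, sorted Index with unique values it uses binary search as well, which avoids the extra memory overhead of a HashTable.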
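The three lookup strategies can be sketched in pure Python. This is a simplified illustration under stated assumptions, not pandas' actual code: the function names are hypothetical, a `dict` stands in for the hash table, and the standard `bisect` module stands in for the binary search.

```python
from bisect import bisect_left, bisect_right


def loc_hashed(table, key):
    """Unique values: one hash lookup returns a single position."""
    return table[key]


def loc_sorted(sorted_values, key):
    """Sorted values: two binary searches bound the run of matches."""
    start = bisect_left(sorted_values, key)
    stop = bisect_right(sorted_values, key)
    if start == stop:
        raise KeyError(key)
    return slice(start, stop)


def loc_scan(values, key):
    """Unsorted duplicates: compare every element, return a boolean mask."""
    mask = [v == key for v in values]
    if not any(mask):
        raise KeyError(key)
    return mask


unique = ["d", "a", "c", "b"]
table = {value: pos for pos, value in enumerate(unique)}
print(loc_hashed(table, "c"))
print(loc_sorted(["a", "b", "b", "c"], "b"))
print(loc_scan(["b", "a", "c", "b"], "b"))
```

The hash lookup costs roughly constant time but needs the extra table; the binary search needs no extra memory but requires sorted values; the scan works on anything but touches every element.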

Below we create these three kinds of Index objects:

			In [80]:
		

N = 10000

unique_keys = np.array(list(set(pd.core.common.rands(5) for i in xrange(N))), 'O')

duplicate_keys = unique_keys.copy()

duplicate_keys[-1] = duplicate_keys[0]

sorted_keys = np.sort(duplicate_keys)

unique_index = pd.Index(unique_keys)

sorted_index = pd.Index(sorted_keys)

duplicate_index = pd.Index(duplicate_keys)

to_search = unique_keys[N-2]
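The cell above relies on `pd.core.common.rands` and `xrange`, which are tied to the Python 2 and pandas versions used in this post. The following sketch builds three comparable Index objects on Python 3 using only the standard `random` and `string` modules for key generation. The helper `random_key`, the fixed seed and the explicit shuffle are assumptions added for reproducibility, and `is_monotonic_increasing` is the spelling of the sortedness flag in recent pandas releases.

```python
import random
import string

import numpy as np
import pandas as pd

N = 10000
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_key(length=5):
    """Hypothetical stand-in for rands(): a random string of letters."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


# sorted() then shuffle() gives unique keys in a reproducible random order.
keys = sorted({random_key() for _ in range(N)})
rng.shuffle(keys)
unique_keys = np.array(keys, dtype=object)

duplicate_keys = unique_keys.copy()
duplicate_keys[-1] = duplicate_keys[0]   # introduce one repeated value
sorted_keys = np.sort(duplicate_keys)

unique_index = pd.Index(unique_keys, dtype=object)
sorted_index = pd.Index(sorted_keys, dtype=object)
duplicate_index = pd.Index(duplicate_keys, dtype=object)

for name, idx in [("unique_index", unique_index),
                  ("sorted_index", sorted_index),
                  ("duplicate_index", duplicate_index)]:
    # Print whether the values are unique and whether they are sorted.
    print(name, idx.is_unique, idx.is_monotonic_increasing)
```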

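Every Index has the attributes is_unique and is_monotonic, which indicate whether its values are unique and whether they are sorted, respectively. The following program displays these attributes for the three Index objects above: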

			In [101]:
		

from itertools import product

def dataframe_fromfunc(func, index, columns):

    return pd.DataFrame([[func(idx, col) for col in columns] for idx in index],

            index = index, columns = columns)

predicates = ["is_unique", "is_monotonic"]

index = ["unique_index", "sorted_index", "duplicate_index"]

dataframe_fromfunc(lambda idx, pred:getattr(globals()[idx], pred), index, predicates)

					Out[101]:
				

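All Index objects support get_loc(), but only an Index with unique values supports get_indexer(). The position information returned by get_loc() also differs among the three: unique_index returns an integer position, sorted_index returns a slice object for the repeated key, and duplicate_index returns a boolean array.

All three kinds of return value can be used as a subscript to access values in an ndarray.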

			In [103]:
		

print unique_index.get_loc(unique_keys[0])

print sorted_index.get_loc(unique_keys[0])

print duplicate_index.get_loc(unique_keys[0])

0

slice(8528, 8530, None)

[ True False False ..., False False  True]
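To make the three kinds of return value concrete, here is a small sketch with five hand-picked labels instead of the random keys above (Python 3 and a recent pandas release assumed). It shows that an integer, a slice and a boolean array can each be used directly as an ndarray subscript.

```python
import numpy as np
import pandas as pd

data = np.array([10, 20, 30, 40, 50])

# Unique values, sorted values with a repeat, unsorted values with a repeat.
unique_index = pd.Index(["b", "a", "e", "d", "c"], dtype=object)
sorted_index = pd.Index(["a", "b", "b", "c", "d"], dtype=object)
duplicate_index = pd.Index(["b", "a", "d", "c", "b"], dtype=object)

for idx in (unique_index, sorted_index, duplicate_index):
    loc = idx.get_loc("b")
    # Each kind of location works unchanged as an ndarray subscript.
    print(type(loc).__name__, data[loc])
```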

The following compares the lookup speed of the three:

			In [81]:
		

%timeit unique_index.get_loc(to_search)

1000000 loops, best of 3: 828 ns per loop

			In [82]:
		

%timeit sorted_index.get_loc(to_search)

100000 loops, best of 3: 15.2 us per loop

			In [83]:
		

%timeit duplicate_index.get_loc(to_search)

1000 loops, best of 3: 284 us per loop
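Outside IPython, a comparable measurement can be made with the standard `timeit` module. The sketch below is only a rough harness, assuming Python 3 and a recent pandas release. Zero-padded string keys replace the random keys for brevity, and absolute timings will differ from the figures above depending on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

N = 10000
# Zero-padded strings, shuffled so the unique keys are not sorted.
rng = np.random.RandomState(0)
unique_keys = np.array(["k%05d" % i for i in rng.permutation(N)], dtype=object)
duplicate_keys = unique_keys.copy()
duplicate_keys[-1] = duplicate_keys[0]

indexes = {
    "unique_index": pd.Index(unique_keys, dtype=object),
    "sorted_index": pd.Index(np.sort(duplicate_keys), dtype=object),
    "duplicate_index": pd.Index(duplicate_keys, dtype=object),
}
to_search = unique_keys[N - 2]

timings = {}
for name, idx in indexes.items():
    idx.get_loc(to_search)  # warm-up: build the lookup structures first
    seconds = timeit.timeit(lambda: idx.get_loc(to_search), number=1000)
    timings[name] = seconds / 1000
    print("%-16s %.2f us per call" % (name, timings[name] * 1e6))
```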
