Python tips: a few notes - xuanhan863 - ChinaUnix blog

Python tips: a few notes

Category: Python/Ruby

2011-08-01 15:32:53

Sometimes we need to create an object without going through its initializer, for example when deserializing.
>>> class A(object):
...     def __init__(self, x, y):
...         print "init", x, y
...         self.x = x
...         self.y = y
...     def test(self):
...         print "test:", self.x, self.y
...
>>> class _Empty(object): pass
...
>>> o = _Empty()
>>> o.__class__ = A
>>> o.x = 1
>>> o.y = 2
>>> o.test()
test: 1 2
>>> type(o)
<class '__main__.A'>
>>> isinstance(o, A)
True
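The same result can be had without a placeholder class by calling __new__ directly, which allocates the instance but never runs __init__. This is a minimal sketch written for Python 3 (the session above is Python 2); the class A mirrors the one above, and restore() is a hypothetical helper name, not something from the original post.

```python
class A(object):
    def __init__(self, x, y):
        print("init", x, y)
        self.x = x
        self.y = y

    def test(self):
        return "test: {0} {1}".format(self.x, self.y)


def restore(cls, state):
    """Build an instance of cls without calling cls.__init__ (hypothetical helper)."""
    obj = cls.__new__(cls)       # allocates the object; __init__ is not run
    obj.__dict__.update(state)   # fill in the attributes, as a deserializer would
    return obj


o = restore(A, {"x": 1, "y": 2})
print(o.test())
print(isinstance(o, A))
```

Nothing is printed by __init__ here, which is the point: the object exists and behaves like an A, yet the initializer was skipped.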
For classic classes (Python 2 only) we can also call types.InstanceType() directly, which is a little simpler.
>>> class A:
...     def __init__(self, x, y):
...         print "init:", x, y
...         self.x = x
...         self.y = y
...     def test(self):
...         print "test:", self.x, self.y
...
>>> import types
>>> a1 = types.InstanceType(A, dict(x = 1, y = 2))
>>> a1.test()
test: 1 2
>>> a1.__class__
<class __main__.A at 0x1025869b0>
>>> class _Empty: pass
...
>>> a2 = _Empty()
>>> a2.__class__ = A
>>> a2.x = 1
>>> a2.y = 2
>>> a2.test()
test: 1 2
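So what distinguishes one Python object from another comes down to __class__, __bases__ and __dict__; the rest is easy to arrange.
While on the subject of the types module, here is another bit of "dynamic" programming that occasionally comes in handy.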
>>> import sys, types
>>> sys.modules["X"] = types.ModuleType("X", "test module")
>>> import X
>>> X.__doc__
'test module'
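A slightly fuller sketch of the same trick, written for Python 3: the module object is given an attribute before it is registered, so that a later import hands back a usable module. The attribute name answer is made up for illustration.

```python
import sys
import types

mod = types.ModuleType("X", "test module")   # (name, docstring)
mod.answer = 42                               # hypothetical attribute, for illustration only
sys.modules["X"] = mod                        # register it so "import X" can find it

import X                                      # resolved from sys.modules, no file needed

print(X.__doc__, X.answer)
```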

file
Working with files in Python is straightforward. A file can be opened with the built-in file() or open().
>>> with open("a.txt", "w") as f:
...     f.write("Hello World!\n")
...
>>> !cat a.txt
Hello World!
>>> !file a.txt
a.txt: ASCII text
In practice, prefer open() for opening files and keep file for type checks.
open() returns a File Object. A mode has to be given when opening a file.
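r: read-only.
w: write-only. An existing file is truncated.
a: append.
b: binary mode. (Combined with the other modes; it matters on systems that distinguish text files from binary files, such as Windows.)
r+: update; readable and writable, does not truncate the file.
w+: update; readable and writable, discards the existing content.
a+: update; readable and writable, writes always go to the end.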
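The difference between the modes is easiest to see by running them against one small file. This is a rough sketch for Python 3; the file name modes.txt and the scratch directory are assumptions made for the example.

```python
import os
import tempfile

# Work in a throwaway directory so nothing real is overwritten.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "modes.txt")   # hypothetical file name

with open(path, "w") as f:     # "w": create, or truncate if it exists
    f.write("one\n")

with open(path, "a") as f:     # "a": writes always land at the end
    f.write("two\n")

with open(path, "r+") as f:    # "r+": read and write, no truncation
    f.write("ONE")             # overwrites the first three characters in place

with open(path, "r") as f:
    after_update = f.read()

with open(path, "w+") as f:    # "w+": read and write, but truncates first
    after_truncate = f.read()

print(repr(after_update), repr(after_truncate))

os.remove(path)
os.rmdir(workdir)
```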
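1. iter & with
File Object implements the iterator protocol and the with statement (context manager), so the contents of a file can be walked directly like this.
>>> with open("a.txt", "r") as f:
...     for s in f: print s
...
Iterating over the file object reads one line at a time, so it is fine for large files; readlines(), by contrast, loads every line into memory and is a poor fit for them. readline() gives the same line-by-line traversal explicitly.
>>> with open("a.txt", "r") as f:
...     while True:
...         line = f.readline()
...         if not line: break
...         print line
...
Note that an "empty line" in a text file still contains at least a line terminator (\n or \r\n), so it is not equal to "" (empty).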
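The point about empty lines can be checked directly. The sketch below, for Python 3, writes a three-line file whose middle line is blank and reads it back both ways; the temporary file is created only for the demonstration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first\n\nthird\n")   # the middle line is "empty"

lines = []
with open(path, "r") as f:
    for line in f:                # lazy: one line at a time
        lines.append(line)

readline_count = 0
with open(path, "r") as f:
    while True:
        line = f.readline()
        if not line:              # only "" signals end of file
            break
        readline_count += 1       # the blank line is "\n", which is truthy

print(lines, readline_count)
os.remove(path)
```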
2. Encoding
In real projects we usually need to read and write text files in a specific encoding. In that case use codecs.open() in place of open().
>>> import codecs
>>> with codecs.open("a.txt", "w", "utf-16") as f:
...     f.write("abc")
...
>>> !file a.txt
a.txt: Little-endian UTF-16 Unicode text, with no line terminators
>>> with codecs.open("a.txt", "w", "utf-32") as f:
...     f.write("abc")
...
>>> !file a.txt
a.txt: Unicode text, UTF-32, little-endian
(For more on encodings, see "Python Library: Encoding".)
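In Python 3 the built-in open() takes an encoding argument itself, so codecs.open() is no longer needed for this. A minimal sketch, assuming Python 3 and a throwaway temporary file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path, "w", encoding="utf-16") as f:
    f.write("abc")

with open(path, "rb") as f:
    raw = f.read()            # byte order mark + 2 bytes per character

with open(path, "r", encoding="utf-16") as f:
    text = f.read()           # the BOM is consumed by the decoder

print(len(raw), text)
os.remove(path)
```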
3. Binary
On *nix the file functions make no distinction between text files and binary files; in practice the only difference is whether we write strings or bytes.
To read or write binary data, numbers and the like must be converted into byte arrays rather than strings.
>>> import array
>>> a = array.array("i")
>>> a.append(0x12)
>>> a.append(0x34)
>>> with open("a.dat", "wb") as f:
...     a.tofile(f)
...
>>> !xxd -g 1 a.dat
0000000: 12 00 00 00 34 00 00 00 ....4...
>>> with open("a.dat", "rb") as f:
...     b = array.array("i")
...     b.fromfile(f, 2)
...     for x in b: print hex(x)
...
0x12
0x34
We can also use struct to get the bytes laid out the way a C struct would be. (memoryview, available in Python 2.7, makes this more convenient.)
>>> import struct
>>> with open("a.dat", "wb") as f:
...     f.write(struct.pack("ii", 0x12, 0x34))
...
>>> !xxd -g 1 a.dat
0000000: 12 00 00 00 34 00 00 00 ....4...
>>> with open("a.dat", "rb") as f:
...     s = f.read(8)
...     for x in struct.unpack("ii", s): print hex(x)
...
0x12
0x34
(The number of format characters in fmt must match the number of arguments to pack and unpack; see the official documentation for details.)
struct also provides the buffer-oriented functions pack_into() and unpack_from(), a staple of day-to-day development.
A buffer object can be created with bytearray, ctypes.create_string_buffer() or array.
>>> buffer = bytearray(100)
>>> struct.pack_into("ilc", buffer, 0, 1234, 1234567890, "a")
>>> buffer
bytearray(b'\xd2\x04\x00\x00\x00\x00\x00\x00\xd2\x02\x96I\x00\x00\x00\x00a...\x00')
>>> struct.unpack_from("ilc", str(buffer), 0) # must be converted to str
(1234, 1234567890, 'a')
>>> buffer = array.array("c", "\0" * 100)
>>> struct.pack_into("ilc", buffer, 0, 1234, 1234567890, "a")
>>> struct.unpack_from("ilc", buffer, 0)
(1234, 1234567890, 'a')
>>> import ctypes
>>> buffer = ctypes.create_string_buffer(100)
>>> struct.pack_into("ilc", buffer, 0, 1234, 1234567890, "a")
>>> buffer.raw
'\xd2\x04\x00\x00\x00\x00\x00\x00\xd2\x02\x96I\x00\x00\x00\x00 ... \x00'
>>> struct.unpack_from("ilc", buffer, 0)
(1234, 1234567890, 'a')
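The "ilc" format above uses native sizes and alignment, which is why the dump shows padding after the first integer and why the layout can differ between platforms. The sketch below, for Python 3, swaps in the explicit little-endian format "<iqc" (an assumption made here for reproducibility, not the format used in the session) so the byte layout is the same everywhere.

```python
import struct

# "<" selects little-endian with standard sizes and no padding:
# int32 (4 bytes) + int64 (8 bytes) + char (1 byte) = 13 bytes.
fmt = "<iqc"
size = struct.calcsize(fmt)

buf = bytearray(100)
struct.pack_into(fmt, buf, 0, 1234, 1234567890, b"a")   # write in place at offset 0
values = struct.unpack_from(fmt, buf, 0)                 # read back from the same offset

print(size, values)
```

In Python 3, unpack_from() accepts the bytearray directly, so the str() conversion from the session above is not needed.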

fileinput provides a convenient way to iterate over several files.
>>> from fileinput import *
>>> !cat a.txt
0
1
2
3
4
5
6
7
8
9
>>> !cat b.txt
a
b
c
d
e
f
g
h
i
j
>>> for line in input(["a.txt", "b.txt"]):
...     print "[{0}] {1}:{2} - {3}".format(lineno(), filename(), filelineno(), line)
...
[1] a.txt:1 - 0
[2] a.txt:2 - 1
[3] a.txt:3 - 2
[4] a.txt:4 - 3
[5] a.txt:5 - 4
[6] a.txt:6 - 5
[7] a.txt:7 - 6
[8] a.txt:8 - 7
[9] a.txt:9 - 8
[10] a.txt:10 - 9
[11] b.txt:1 - a
[12] b.txt:2 - b
[13] b.txt:3 - c
[14] b.txt:4 - d
[15] b.txt:5 - e
[16] b.txt:6 - f
[17] b.txt:7 - g
[18] b.txt:8 - h
[19] b.txt:9 - i
[20] b.txt:10 - j
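fileinput opens files in text mode by default, so it counts in units of lines.
lineno: cumulative number of lines read so far.
filename: name of the file currently being read.
filelineno: line number within the current file.
isfirstline: whether the line just read is the first line of its file.
We can of course abandon the current file at any point and move on to the next one.
>>> for line in input(["a.txt", "b.txt"]):
...     print lineno(), filename(), filelineno(), line
...     if filename() == "a.txt" and filelineno() > 3: nextfile()
...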
1 a.txt 1 0
2 a.txt 2 1
3 a.txt 3 2
4 a.txt 4 3
5 b.txt 1 a
6 b.txt 2 b
7 b.txt 3 c
8 b.txt 4 d
9 b.txt 5 e
10 b.txt 6 f
11 b.txt 7 g
12 b.txt 8 h
13 b.txt 9 i
14 b.txt 10 j
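Note: lineno() counts the lines actually read, which is why b.txt starts at 5 above.
We can even "edit" the original files in place; naturally a backup of the originals should be kept.
>>> for line in input(["a.txt", "b.txt"], inplace = 1, backup = ".bak"):
...     print "[{0}] {1}:{2} {3}".format(lineno(), filename(), filelineno(), line)
...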
>>> !cat a.txt
[1] a.txt:1 0
[2] a.txt:2 1
[3] a.txt:3 2
[4] a.txt:4 3
[5] a.txt:5 4
[6] a.txt:6 5
[7] a.txt:7 6
[8] a.txt:8 7
[9] a.txt:9 8
[10] a.txt:10 9
>>> !cat a.txt.bak
0
1
2
3
4
5
6
7
8
9
See? With inplace = 1, whatever is written to stdout goes into the original file, and the original content ends up in the backup file.
The openhook parameter lets us supply our own File Object. The built-in hook_encoded(), for example, opens the file through codecs.open() so that an encoding can be specified.
>>> with codecs.open("x.txt", "w", "gb2312") as f:
...     f.write(u"我们\n")
...     f.write(u"中国\n")
...
>>> !xxd -g 1 x.txt
0000000: ce d2 c3 c7 0a d6 d0 b9 fa 0a ..........
>>> for line in input(["x.txt"], openhook = hook_encoded("gb2312")):
...     print line
...
我们
中国
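For reference, the counters behave the same way in Python 3, where the object returned by fileinput.input() can also be used as a context manager. This is a small sketch with two scratch files whose names and contents are invented for the example; it calls the methods on the returned object rather than the module-level functions.

```python
import fileinput
import os
import tempfile

# Build two small input files in a scratch directory (names are arbitrary).
workdir = tempfile.mkdtemp()
paths = []
for name, body in (("a.txt", "0\n1\n2\n"), ("b.txt", "a\nb\n")):
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write(body)
    paths.append(path)

rows = []
with fileinput.input(files=paths) as stream:
    for line in stream:
        rows.append((stream.lineno(),                       # running total
                     os.path.basename(stream.filename()),   # current file
                     stream.filelineno(),                   # line within that file
                     line.rstrip("\n")))

for row in rows:
    print(row)

for path in paths:
    os.remove(path)
os.rmdir(workdir)
```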

File Descriptor

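Besides File Objects, we can also read and write files directly through the familiar file descriptors.

On Unix-like systems the kernel keeps a file table for each process. The table holds the file descriptors, which are increasing non-negative integers, together with information about each open file: offset, inode, metadata and so on. They are similar to Windows file handles.

File descriptors usually start at 3, because 0 to 2 are already assigned to STDIN, STDOUT and STDERR.
>>> sys.stdin.fileno()
0
>>> sys.stdout.fileno()
1
>>> sys.stderr.fileno()
2
A File Object has a fileno() method that returns the descriptor of the open file. (The descriptor-level functions all live in the os module.)
>>> with open("a.dat", "w") as f:
...     f.write("123")
...     fd = f.fileno()
...     os.write(fd, "abc") # descriptor-level function
...
3
>>> !xxd -g 1 a.dat
0000000: 61 62 63 31 32 33 abc123
The bytes "abc" land first because f.write() is buffered and only flushed when the file is closed, whereas os.write() goes straight to the descriptor.
fdopen() can wrap a file descriptor in a File Object.
>>> fd = os.open("a.dat", os.O_RDWR, 0664)
>>> os.read(fd, 100)
'abc123'
>>> os.lseek(fd, 0, os.SEEK_SET)
0
>>> f = os.fdopen(fd, "r+")
>>> f.fileno(), fd
(3, 3)
>>> f.read()
'abc123'
>>> os.close(fd)
As a rule, close a file the same way it was opened.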
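A compact version of the same round trip for Python 3, where the descriptor-level functions take and return bytes. The temporary file is an assumption of the example; here the wrapping File Object is closed rather than the raw descriptor, since closing it releases the descriptor as well.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abc123")               # descriptor-level write takes bytes
os.lseek(fd, 0, os.SEEK_SET)          # rewind to the start
head = os.read(fd, 3)                 # leaves the offset at 3

f = os.fdopen(fd, "r+b")              # wrap the same descriptor in a File Object
same_fd = (f.fileno() == fd)
rest = f.read()                       # continues from the descriptor's offset
f.close()                             # also closes fd; no separate os.close(fd)

print(head, rest, same_fd)
os.remove(path)
```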

For details on these functions, see the official documentation.

6. tempfile

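Never used a temporary file in your work? Surely not.

TemporaryFile: creates a temporary File Object that is deleted automatically when closed.
NamedTemporaryFile: also creates a temporary file object, but the file name is available, and the delete parameter decides whether the file is removed automatically.
SpooledTemporaryFile: like TemporaryFile, except that data is written to disk only once it exceeds the max_size threshold. rollover() forces the write.
>>> import os
>>> from tempfile import *
>>> with NamedTemporaryFile(suffix = ".tmp", prefix = "my_", delete = False) as f:
...     global name
...     name = f.name
...     f.write("abc1234")
...
>>> name
'/var/folders/C-/C-m2K8KYFfamHQriaTh5vE+++TI/-Tmp-/my_UMvMBV.tmp'
>>> with open(name, "r") as f:
...     print f.read()
...
abc1234
>>> os.remove(name)
If all we need is an available temporary file name, os.tempnam() will do. (It is open to race conditions, issues a RuntimeWarning, and was removed in Python 3; mkstemp() is the safer choice.)
>>> os.tempnam()
'/var/tmp/tmp.tDUBtZ'
mkstemp: returns a temporary file descriptor and file name; deleting the file is up to us.
>>> f = mkstemp()
>>> f
(3, '/var/folders/C-/C-m2K8KYFfamHQriaTh5vE+++TI/-Tmp-/tmpggNY3k')
>>> os.close(f[0])
>>> os.path.exists(f[1])
True
>>> os.remove(f[1])
mkdtemp: creates a temporary directory with permissions 0700, that is "drwx------".
>>> d = mkdtemp()
>>> d
'/var/folders/C-/C-m2K8KYFfamHQriaTh5vE+++TI/-Tmp-/tmpqlNdTp'
>>> os.path.isdir(d)
True
>>> os.rmdir(d)
gettempdir: returns the directory where temporary files are stored.
gettempprefix: returns the default prefix for temporary file names.
>>> gettempdir()
'/var/folders/C-/C-m2K8KYFfamHQriaTh5vE+++TI/-Tmp-'
>>> gettempprefix()
'tmp'
Note: I am on Mac OS X 10.6, so it is normal for the output to differ from Linux.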
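The calls above carry over to Python 3 almost unchanged. The sketch below repeats the NamedTemporaryFile, mkstemp and mkdtemp steps with explicit cleanup; the only change assumed here is mode="w", because Python 3 opens temporary files in binary mode by default.

```python
import os
import tempfile

# delete=False keeps the file after the with-block so it can be reopened by name.
with tempfile.NamedTemporaryFile(mode="w", suffix=".tmp", prefix="my_",
                                 delete=False) as f:
    name = f.name
    f.write("abc1234")

with open(name, "r") as f:
    content = f.read()
os.remove(name)

fd, path = tempfile.mkstemp()         # returns (descriptor, absolute path)
os.close(fd)
existed = os.path.exists(path)
os.remove(path)                       # mkstemp never deletes the file for us

d = tempfile.mkdtemp()
was_dir = os.path.isdir(d)
os.rmdir(d)

print(content, existed, was_dir)
```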

shutil

copyfile: copies a file. The destination must be a valid file name, not a directory. Only the content is copied, not permissions or status data.
>>> from shutil import *
>>> !echo "abc" > a.txt
>>> !chmod 0764 a.txt
>>> mkdir b
>>> ls -l
total 24
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 a.txt*
drwxr-xr-x 2 yuhen staff 68 11 6 14:31 b/
>>> copyfile("a.txt", "./b")
------------------------------------------------------------
Traceback (most recent call last):
IOError: [Errno 21] Is a directory: './b'
>>> copyfile("a.txt", "b.txt")
>>> ls -l
total 32
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 a.txt*
drwxr-xr-x 2 yuhen staff 68 11 6 14:31 b/
-rw-r--r-- 1 yuhen staff 4 11 6 14:32 b.txt
copymode: copies only the permission bits, not owner, group or content. The destination file must exist.
>>> copymode("a.txt", "c.txt")
------------------------------------------------------------
Traceback (most recent call last):
OSError: [Errno 2] No such file or directory: 'c.txt'
>>> !touch c.txt
>>> ls -l
total 32
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 a.txt*
drwxr-xr-x 2 yuhen staff 68 11 6 14:31 b/
-rw-r--r-- 1 yuhen staff 4 11 6 14:32 b.txt
-rw-r--r-- 1 yuhen staff 0 11 6 14:34 c.txt
>>> copymode("a.txt", "c.txt")
>>> ls -l
total 32
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 a.txt*
drwxr-xr-x 2 yuhen staff 68 11 6 14:31 b/
-rw-r--r-- 1 yuhen staff 4 11 6 14:32 b.txt
-rwxrw-r-- 1 yuhen staff 0 11 6 14:34 c.txt*
copystat: copies only permission bits, status times (modify, access) and the like, not owner, group or content. The destination file must exist.
>>> ls -l
total 24
-rw-r--r-- 1 yuhen staff 4 11 6 14:58 a.txt
-rw-r--r-- 1 yuhen staff 0 11 6 14:58 d.txt
>>> copystat("a.txt", "d.txt")
>>> !stat -x a.txt
File: "a.txt"
Size: 4 FileType: Regular File
Mode: (0644/-rw-r--r--) Uid: ( 501/ yuhen) Gid: ( 20/ staff)
Device: 14,2 Inode: 3322771 Links: 1
Access: Sat Nov 6 14:58:01 2010
Modify: Sat Nov 6 14:58:00 2010
Change: Sat Nov 6 14:58:00 2010
>>> !stat -x d.txt
File: "d.txt"
Size: 0 FileType: Regular File
Mode: (0644/-rw-r--r--) Uid: ( 501/ yuhen) Gid: ( 20/ staff)
Device: 14,2 Inode: 3322772 Links: 1
Access: Sat Nov 6 14:58:01 2010
Modify: Sat Nov 6 14:58:00 2010
Change: Sat Nov 6 14:58:44 2010
copy: copies a file; the destination may be a directory. An existing destination file is overwritten. Content and permission bits are copied.
copy2: like copy(), but also copies the status information, as if copystat() had been called.
>>> copy("a.txt", "e.txt")
>>> ls -l
total 40
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 a.txt*
drwxr-xr-x 2 yuhen staff 68 11 6 14:31 b/
-rw-r--r-- 1 yuhen staff 4 11 6 14:32 b.txt
-rwxrw-r-- 1 yuhen staff 0 11 6 14:34 c.txt*
-rwxrw-r-- 1 yuhen staff 0 11 6 14:36 d.txt*
-rwxrw-r-- 1 yuhen staff 4 11 6 14:39 e.txt*
>>> copy("a.txt", "b/x.txt")
>>> copy("a.txt", "b")
>>> ls -l b
total 16
-rwxrw-r-- 1 yuhen staff 4 11 6 14:40 a.txt*
-rwxrw-r-- 1 yuhen staff 4 11 6 14:39 x.txt*
copytree: recursively copies an entire directory, preserving permissions and status. A destination directory name is required, and it must not already exist.

Note: since Python 2.6, ignore_patterns() can be used to filter what gets copied.
>>> ls -l b
total 8
-rw-r--r-- 1 yuhen staff 4 11 6 14:58 a.txt
-rw-r--r-- 1 yuhen staff 0 11 6 14:58 d.txt
>>> mkdir c
>>> copytree("./b", "./c")
------------------------------------------------------------
Traceback (most recent call last):
OSError: [Errno 17] File exists: './c'
>>> copytree("./b", "./c/")
------------------------------------------------------------
Traceback (most recent call last):
OSError: [Errno 17] File exists: './c/'
>>> copytree("./b", "./c/b")
rmtree: recursively deletes an entire directory and the files in it.
>>> ls -l d
total 16
-rwxrw-r-- 1 yuhen staff 4 11 6 14:40 a.txt*
-rwxrw-r-- 1 yuhen staff 4 11 6 14:39 x.txt*
>>> rmtree("./d")
move: moves a file or directory. On the same file system this amounts to os.rename(); across file systems the source is copied with copy2() and then removed.
>>> move("a.txt", "./b/c.dat")
>>> move("./b", "./c")
>>> ls -lR c
total 0
drwxr-xr-x 6 yuhen staff 204 11 6 14:52 b/

c/b:
total 24
-rwxrw-r-- 1 yuhen staff 4 11 6 14:40 a.txt*
-rw-r--r-- 1 yuhen staff 0 11 6 14:48 b.dat
-rwxrw-r-- 1 yuhen staff 4 11 6 14:31 c.dat*
-rwxrw-r-- 1 yuhen staff 4 11 6 14:39 x.txt*
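The whole sequence can be replayed safely inside a scratch directory. This is an illustrative sketch for Python 3, with file names loosely following the session above; it checks only what is easy to observe portably (which files exist, and that copy2() keeps the modification time).

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "a.txt")
with open(src, "w") as f:
    f.write("abc\n")
os.mkdir(os.path.join(root, "b"))

shutil.copyfile(src, os.path.join(root, "b.txt"))      # content only
copied = shutil.copy(src, os.path.join(root, "b"))     # destination may be a directory
shutil.copy2(src, os.path.join(root, "e.txt"))         # content + mode + timestamps

# The target of copytree must not exist yet.
shutil.copytree(os.path.join(root, "b"), os.path.join(root, "c"))
shutil.move(os.path.join(root, "e.txt"), os.path.join(root, "c", "e.txt"))

tree = sorted(os.listdir(os.path.join(root, "c")))
mtime_kept = (int(os.stat(src).st_mtime) ==
              int(os.stat(os.path.join(root, "c", "e.txt")).st_mtime))

print(tree, mtime_kept)
shutil.rmtree(root)                                     # remove everything again
```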

Python Library: time

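Time in Python is a bit of a "mess": besides the familiar datetime module there is also the time module, which is tied to *nix and C.

Related concepts:

Absolute time: a specific point in time, e.g. 2010-11-1 13:48:05.
Relative time: an offset before or after some point in time, e.g. 5 minutes ago.
Epoch: a reference point in time, usually 1970-1-1 00:00:00 UTC. *nix systems express absolute time as the number of seconds elapsed since the epoch.
Coordinated Universal Time (UTC): the common reference for the world's time zones; China, for instance, is UTC+8.
Daylight Saving Time (DST): also known as summer time. Thankfully China has abolished it; what a nuisance.

(Note: see Baidu Baike for details on UTC and DST.)

The number of seconds since the epoch returned by time() can be stored directly as a float, or the individual fields can be held in a struct_time.
>>> from time import *
>>> t = time()
>>> t
1289031328.7737219
>>> st = gmtime(t)
>>> st
time.struct_time(tm_year=2010, tm_mon=11, tm_mday=6, tm_hour=8, tm_min=15, tm_sec=28, tm_wday=5, tm_yday=310, tm_isdst=0)
>>> st2 = localtime(t)
>>> st2
time.struct_time(tm_year=2010, tm_mon=11, tm_mday=6, tm_hour=16, tm_min=15, tm_sec=28, tm_wday=5, tm_yday=310, tm_isdst=0)
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2010, 11, 6, 16, 16, 33, 414042)
The converted struct_time values show that gmtime() presents the time() value in UTC, whereas localtime() presents it in the zone configured on our system, UTC+8 Beijing time.

To convert a struct_time back to the epoch, call mktime() or calendar.timegm().
>>> mktime(st) # wrong for a gmtime() result: mktime() assumes local time
1289002528.0
>>> mktime(st2)
1289031328.0
>>> import calendar
>>> calendar.timegm(st)
1289031328
To summarize the conversions:
(time) --> epoch --(gmtime)--> UTC struct_time --(calendar.timegm)--> epoch
(time) --> epoch --(localtime)--> LOCAL_TIME struct_time --(mktime)--> epoch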
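Both round trips can be verified in a few lines. The sketch below, for Python 3, starts from the epoch value shown in the session (truncated to whole seconds) and checks that each inverse function brings it back, whatever time zone the machine is set to.

```python
import calendar
import time

t = 1289031328                        # epoch value from the session above, truncated
utc = time.gmtime(t)                  # struct_time in UTC
local = time.localtime(t)             # struct_time in the system's time zone

back_from_utc = calendar.timegm(utc)          # inverse of gmtime()
back_from_local = int(time.mktime(local))     # inverse of localtime()

print(utc.tm_hour, back_from_utc, back_from_local)
```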
Other related functions:

ctime: converts an epoch value to a string (in local time).
asctime: converts a struct_time to a string.
>>> t
1289031328.7737219
>>> st
time.struct_time(tm_year=2010, tm_mon=11, tm_mday=6, tm_hour=8, tm_min=15, tm_sec=28, tm_wday=5, tm_yday=310, tm_isdst=0)
>>> ctime(t)
'Sat Nov  6 16:15:28 2010'
>>> asctime(st)
'Sat Nov  6 08:15:28 2010'
clock: returns the CPU time consumed by the current process, in seconds.
sleep: pauses execution (in seconds; fractions are allowed, for millisecond- or microsecond-level pauses).
>>> clock()
0.56022400000000006
>>> sleep(0.1)
strftime: formats a struct_time as a string.
strptime: parses a string into a struct_time.
>>> st
time.struct_time(tm_year=2010, tm_mon=11, tm_mday=6, tm_hour=8, tm_min=15, tm_sec=28, tm_wday=5, tm_yday=310, tm_isdst=0)
>>> s = strftime("%Y-%m-%d %H:%M:%S", st)
>>> s
'2010-11-06 08:15:28'
>>> strptime(s, "%Y-%m-%d %H:%M:%S")
time.struct_time(tm_year=2010, tm_mon=11, tm_mday=6, tm_hour=8, tm_min=15, tm_sec=28, tm_wday=5, tm_yday=310, tm_isdst=-1)
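The format and parse steps are inverses of each other as far as the date and time fields go. A short Python 3 sketch, reusing the epoch value from the session, makes that explicit:

```python
import calendar
import time

st = time.gmtime(1289031328)
s = time.strftime("%Y-%m-%d %H:%M:%S", st)
parsed = time.strptime(s, "%Y-%m-%d %H:%M:%S")

# The format carries no DST information, so tm_isdst comes back as -1.
round_trip = calendar.timegm(parsed)   # treat the parsed fields as UTC again

print(s, parsed.tm_isdst, round_trip)
```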

Python Library: re
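Regular expressions are one of the most important tools for working with strings.

1. Basics

Special characters (must be escaped): . ^ $ * + ? { } [ ] \ | ( )
Character classes:

\d: decimal digit, same as [0-9]
\D: non-digit, same as [^0-9]
\s: whitespace, same as [ \t\n\r\f\v].
\S: non-whitespace.
\w: letter, digit or underscore, same as [a-zA-Z0-9_].
\W: anything \w does not match.
.: any character (except a newline, unless the s flag is set).
|: alternation (or).
^: negation (as the first character inside [ ]) or start of string.
$: end of string.
\b: word boundary (a whole word rather than a substring of another word).
\B: non-word boundary.

Repetition:

*: zero or more of the preceding element.
?: zero or one of the preceding element.
+: one or more of the preceding element.
{n, m}: n to m repetitions.
{n}: exactly n repetitions.
{n,}: n or more repetitions.
{,m}: same as {0,m}.

Appending "?" to a quantifier is the usual way to avoid greedy matching.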
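A quick illustration of greedy versus non-greedy matching, written for Python 3; the sample strings are made up for the example.

```python
import re

s = "<a>123</a><b>45</b>"
greedy = re.findall(r"<.+>", s)      # ".+" runs on to the last ">"
lazy = re.findall(r"<.+?>", s)       # ".+?" stops at the first ">"

# The same suffix works on counted repetition: take as few as allowed.
bounded = re.findall(r"\d{2,3}?", "12345")

print(greedy)
print(lazy)
print(bounded)
```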

2. Functions

re has several important functions:

match(): matches at the start of the string.
search(): scans the string and returns the first match.
findall(): finds all matches and returns them as a list.
finditer(): finds all matches and returns them as an iterator.

match and search match only once and return None when nothing matches.
>>> import re
>>> s = "12abc345ab"
>>> m = re.match(r"\d+", s)
>>> m.group(), m.span()
('12', (0, 2))
>>> m = re.match(r"\d{3,}", s)
>>> m is None
True
>>> m = re.search(r"\d{3,}", s)
>>> m.group(), m.span()
('345', (5, 8))
>>> m = re.search(r"\d+", s)
>>> m.group(), m.span()
('12', (0, 2))
findall returns a list (possibly empty); finditer yields MatchObject instances like those returned by match/search.
>>> ms = re.findall(r"\d+", s)
>>> ms
['12', '345']
>>> ms = re.findall(r"\d{5}", s)
>>> ms
[]
>>> for m in re.finditer(r"\d+", s): print m.group(), m.span()
...
12 (0, 2)
345 (5, 8)
>>> for m in re.finditer(r"\d{5}", s): print m.group(), m.span()
...
>>>
3. MatchObject

The object returned by match and search, and yielded by finditer, is a MatchObject.

group(): returns the complete matched string.
start(): start position of the match.
end(): end position of the match.
span(): tuple of the start and end positions.
groups(): returns the captured groups.
groupdict(): returns the named groups.

>>> m = re.match(r"(\d+)(?P<letter>[abc]+)", s)
>>> m.group()
'12abc'
>>> m.start()
0
>>> m.end()
5
>>> m.span()
(0, 5)
>>> m.groups()
('12', 'abc')
>>> m.groupdict()
{'letter': 'abc'}
group() accepts several arguments and returns the groups with the given numbers.
>>> m.group(0)
'12abc'
>>> m.group(1)
'12'
>>> m.group(2)
'abc'
>>> m.group(1,2)
('12', 'abc')
>>> m.group(0,1,2)
('12abc', '12', 'abc')
start(), end() and span() take a group number as well. As with group(), number 0 stands for the whole match.
>>> m.start(0), m.end(0)
(0, 5)
>>> m.start(1), m.end(1)
(0, 2)
>>> m.start(2), m.end(2)
(2, 5)
>>> m.span(0)
(0, 5)
>>> m.span(1)
(0, 2)
>>> m.span(2)
(2, 5)
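3. Compilation flags

Flags can be passed as arguments such as re.I and re.M, or embedded in the expression itself as "(?iLmsux)".

s: single line. "." matches every character, including newlines.
i: ignore case.
L: makes "\w" match locale-specific characters; support for Chinese seems poor.
m: multi-line. "^" and "$" also match at the start and end of each line.
x: ignores extra whitespace in the pattern, making the expression easier to read.
u: Unicode.

Let's try it.
>>> re.findall(r"[a-z]+", "%123Abc%45xyz&")
['bc', 'xyz']
>>> re.findall(r"[a-z]+", "%123Abc%45xyz&", re.I)
['Abc', 'xyz']
>>> re.findall(r"(?i)[a-z]+", "%123Abc%45xyz&")
['Abc', 'xyz']
Isn't it much nicer written like this?
>>> pattern = r"""
... (\d+) #number
... ([a-z]+) #letter
... """
>>> re.findall(pattern, "%123Abc\n%45xyz&", re.I | re.S | re.X)
[('123', 'Abc'), ('45', 'xyz')]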
4. Groups

(1) Named groups: (?P<name>...)
>>> for m in re.finditer(r"(?P<number>\d+)(?P<letter>[a-z]+)", "%123Abc%45xyz&", re.I):
...     print m.groupdict()
...
{'number': '123', 'letter': 'Abc'}
{'number': '45', 'letter': 'xyz'}
(2) Non-capturing groups: (?:...)

Takes part in matching but is not returned.
>>> for m in re.finditer(r"(?:\d+)([a-z]+)", "%123Abc%45xyz&", re.I):
...     print m.groups()
...
('Abc',)
('xyz',)
(3) Backreferences: \<number> or (?P=name)

Refers back to an earlier group.
>>> for m in re.finditer(r"<a>\w+</a>", "%<a>123Abc</a>%<b>45xyz</b>&"):
...     print m.group()
...
<a>123Abc</a>
>>> for m in re.finditer(r"<(\w)>\w+</\1>", "%<a>123Abc</a>%<b>45xyz</b>&"):
...     print m.group()
...
<a>123Abc</a>
<b>45xyz</b>
>>> for m in re.finditer(r"<(?P<tag>\w)>\w+</(?P=tag)>", "%<a>123Abc</a>%<b>45xyz</b>&"):
...     print m.group()
...
<a>123Abc</a>
<b>45xyz</b>
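(4) Assertions

Positive lookahead (?=...): the group content must appear to the right; it is not returned.
Negative lookahead (?!...): the group content must not appear to the right; it is not returned.
Positive lookbehind (?<=...): the group content must appear to the left; it is not returned.
Negative lookbehind (?<!...): the group content must not appear to the left; it is not returned.
>>> for m in re.finditer(r"\d+(?=[ab])", "%123Abc%45xyz%780b&", re.I):
...     print m.group()
...
123
780
For more information, read the official documentation or a more specialized book.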
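To see all four assertions side by side, here is a short Python 3 sketch on the same sample string. Only the first pattern comes from the text; the other three are examples made up for illustration.

```python
import re

s = "%123Abc%45xyz%780b&"

# Lookahead: digits followed by "a" or "b" (case-insensitive).
ahead = re.findall(r"\d+(?=[ab])", s, re.I)

# Negative lookahead: digits NOT followed by "a", "b" or another digit.
not_ahead = re.findall(r"\d+(?![ab\d])", s, re.I)

# Lookbehind: digits preceded by "%".
behind = re.findall(r"(?<=%)\d+", s)

# Negative lookbehind: letters not preceded by a digit.
not_behind = re.findall(r"(?<!\d)[a-z]+", s, re.I)

print(ahead, not_ahead, behind, not_behind)
```

The \d inside the negative lookahead stops the engine from backtracking to a shorter run of digits, such as "12" in front of "3".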

5. Modifying strings

(1) split: splits the string using pattern as the delimiter. With "(pattern)", the delimiters are returned as well.
>>> re.split(r"\W", "abc,123,x")
['abc', '123', 'x']
>>> re.split(r"(\W)", "abc,123,x")
['abc', ',', '123', ',', 'x']
(2) sub: replaces substrings.

The number of replacements can be limited.
>>> re.sub(r"[a-z]+", "*", "abc,123,x")
'*,123,*'
>>> re.sub(r"[a-z]+", "*", "abc,123,x", 1)
'*,123,x'
subn() is much like sub(), but returns "(new string, number of replacements)".
>>> re.subn(r"[a-z]+", "*", "abc,123,x")
('*,123,*', 2)
The replacement can also be a function, so that each match can be replaced with a different result.
>>> def repl(m):
...     print m.group()
...     return "*" * len(m.group())
...
>>> re.subn(r"[a-z]+", repl, "abc,123,x")
abc
x
('***,123,*', 2)
