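While browsing the Python documentation, I got curious about how efficient string concatenation is in Python. There are three common approaches:
1. '+' or '+=': concatenation with operators.
2. 'str.join': concatenation with the join method of str.
3. 'io.StringIO': concatenation through an in-memory string IO buffer.
Test environment: Mac OS X 10.9, Python 3.3
Test file: one complete HTML document of the GNU C Library, 4.9 MB, 95605 lines
Guess before starting: #1 should be the slowest, with #2 and #3 roughly tied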
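For reference, the three approaches listed above can be sketched as follows. This is only an illustration on a tiny hand-made list; the function names and the sample data are my own and are not part of the benchmark script.

```python
import io


def concat_plus(lines):
    """Approach 1: repeated += on a str."""
    result = ''
    for line in lines:
        result += line
    return result


def concat_join(lines):
    """Approach 2: one str.join call over all the pieces."""
    return ''.join(lines)


def concat_stringio(lines):
    """Approach 3: write the pieces into an in-memory text buffer."""
    buf = io.StringIO()
    for line in lines:
        buf.write(line)
    return buf.getvalue()


# Hypothetical sample input; any iterable of str works the same way.
sample = ['first line\n', 'second line\n', 'third line\n']
print(concat_plus(sample) == concat_join(sample) == concat_stringio(sample))
```

All three produce the same string; they differ only in how the intermediate data is held while the result is being built.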
Full source (test_main.py):
import argparse
import io
import os
import sys


def run_case1(fp):
    """baseline"""
    # Only iterates over the file, so the other cases can be compared
    # against the bare cost of reading it.
    filesize = 0
    for line in fp:
        filesize += len(line)


def run_case2(fp):
    """string addition"""
    src = ''
    for line in fp:
        src += line


def run_case3(fp):
    """string join"""
    src = ''.join(line for line in fp)


def run_case4(fp):
    """string in-memory IO"""
    output = io.StringIO()
    for line in fp:
        output.write(line)
    src = output.getvalue()


def collect_rusage(fp, count, proc):
    # Run proc(fp) `count` times, each time in a freshly forked child, and
    # yield the resource usage the kernel reports for that child.
    for i in range(count):
        fp.seek(0)
        if os.fork() == 0:
            # Child: do the work, then exit without running cleanup handlers.
            proc(fp)
            os._exit(os.EX_OK)
        # Parent: reap the child. wait3() takes waitpid-style options, so
        # pass 0 here; os.WEXITED is meant for os.waitid().
        pid, status, rusage = os.wait3(0)
        yield rusage.ru_utime, rusage.ru_stime, rusage.ru_maxrss


def main():
    parser = argparse.ArgumentParser(description='Python str')
    parser.add_argument('infile', nargs='?', type=argparse.FileType('r'),
                        default='-', help='text source [default: sys.stdin]')
    parser.add_argument('-c', '--count', type=int, default=10,
                        help='loop count for each case')
    args = parser.parse_args()

    # main logic begins
    testcases = [run_case1, run_case2, run_case3, run_case4]
    for case in testcases:
        total = list(collect_rusage(args.infile, args.count, case))
        avg_utime = sum(e[0] for e in total)/len(total)
        avg_stime = sum(e[1] for e in total)/len(total)
        avg_maxrss = sum(e[2] for e in total)/len(total)
        print('{} utime: {:.06f}, stime: {:.06f}, maxrss: {}'.format(
            case.__doc__, avg_utime, avg_stime, int(avg_maxrss)))


if __name__ == '__main__':
    main()
The script was run three times. In each run, every case was executed 100 times and the results averaged. Output:
bash-3.2$ python test_main.py The\ GNU\ C\ Library.html --count 100
baseline utime: 0.019205, stime: 0.001823, maxrss: 1407590
string addition utime: 0.027731, stime: 0.005384, maxrss: 12171386
string join utime: 0.025856, stime: 0.007648, maxrss: 22402129
string in-memory IO utime: 0.029018, stime: 0.007745, maxrss: 22345195

bash-3.2$ python test_main.py The\ GNU\ C\ Library.html --count 100
baseline utime: 0.018908, stime: 0.001806, maxrss: 1397063
string addition utime: 0.027610, stime: 0.005370, maxrss: 12161966
string join utime: 0.026054, stime: 0.007611, maxrss: 22383452
string in-memory IO utime: 0.029314, stime: 0.007732, maxrss: 22340034

bash-3.2$ python test_main.py The\ GNU\ C\ Library.html --count 100
baseline utime: 0.019428, stime: 0.001793, maxrss: 1395957
string addition utime: 0.027489, stime: 0.005308, maxrss: 12176220
string join utime: 0.025874, stime: 0.007610, maxrss: 22412247
string in-memory IO utime: 0.028902, stime: 0.007883, maxrss: 22341509

bash-3.2$ wc -l The\ GNU\ C\ Library.html
95605 The GNU C Library.html
bash-3.2$ ls -l The\ GNU\ C\ Library.html
-rw-r--r--@ 1 Guorui staff 5123001 3 10 10:50 The GNU C Library.html
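To compare the cases on total CPU cost rather than on utime and stime separately, the figures can be added up and the baseline subtracted. The small sketch below does this arithmetic for the first run only; the dictionary keys are my own shorthand for the four cases, and the numbers are copied from the output above.

```python
# (utime, stime) per case, copied from the first run above.
first_run = {
    'baseline': (0.019205, 0.001823),
    '+=': (0.027731, 0.005384),
    'str.join': (0.025856, 0.007648),
    'io.StringIO': (0.029018, 0.007745),
}


def total_cpu(case):
    """User time plus system time for one case."""
    utime, stime = first_run[case]
    return utime + stime


def overhead(case):
    """CPU time spent beyond merely reading the file (the baseline)."""
    return total_cpu(case) - total_cpu('baseline')


for case in first_run:
    print('{:<12} total: {:.6f}  overhead: {:.6f}'.format(
        case, total_cpu(case), overhead(case)))
```

On these numbers '+=' and 'str.join' land within about half a millisecond of each other in total CPU time, with 'io.StringIO' a few milliseconds behind.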
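The final results give some food for thought:
1. '+' is appallingly slow, so much so that the code above does not include it: I gave up waiting at the computer for a result. This matches what the documentation says, so avoid it in everyday code.
2. '+=' turned out, to my surprise, to be the most efficient approach overall: its total CPU time (utime + stime) is the lowest by a small margin, although 'str.join' uses slightly less user time, and its maxrss is roughly half that of the other two. I did not find an explanation in the documentation, so I will have to study the source when I get a chance.
3. 'io.StringIO' is slightly slower than 'str.join', but not by much. In practice this difference can be ignored, so pick whichever suits the situation.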
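A possible explanation for result 2, offered here as an assumption rather than something the measurements above prove: CPython can often grow a string in place for `s += t` when `s` is the only reference to it, which avoids copying the whole string on every step. This is an implementation detail of CPython, not a language guarantee. The sketch below is one way to probe it. Both function names and the synthetic input are hypothetical, and the second function deliberately keeps an extra reference alive so the in-place path cannot apply.

```python
import timeit


def build_in_place(lines):
    """Append with += while `out` is the only reference to the string."""
    out = ''
    for line in lines:
        out += line
    return out


def build_with_alias(lines):
    """Same loop, but a second name still refers to the old string."""
    out = ''
    for line in lines:
        alias = out   # extra reference, so += must build a new string
        out += line
    return out


if __name__ == '__main__':
    # Synthetic input (an assumption, not the test file used above).
    lines = ['x' * 50 + '\n'] * 2000
    for func in (build_in_place, build_with_alias):
        seconds = timeit.timeit(lambda: func(lines), number=5)
        print('{:<18} {:.4f}s'.format(func.__name__, seconds))
```

If the aliased version is clearly slower on a given interpreter, that is consistent with the in-place behaviour. Other Python implementations may not show any difference, which is one reason 'str.join' and 'io.StringIO' remain the portable choices.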