Text-processing performance compared: awk, Python, Perl - wcw - ChinaUnix blog

2011-02-14 00:27:24

The scripts below are written in Python, awk and Perl (the Perl script in two variants), and they all do the same thing:

diff.sh f1 f2

The first whitespace-separated field of every line in f1 and f2 is the key. If the key of a line in f2 does not occur in f1, that line of f2 is printed.

For example, if a.dat contains

    1 a
    2 a

and b.dat contains

    1 b
    3 b

then diff.sh a.dat b.dat prints

    3 b
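The rule above can be sketched in a few lines of Python, using the a.dat and b.dat contents from the example as in-memory lists. This is only an illustration, not one of the benchmarked scripts: the function names `keys_of` and `lines_with_new_keys` are made up here, and skipping blank lines is an assumption that the original scripts do not make.

```python
def keys_of(lines):
    """Return the set of first whitespace-separated fields of the given lines."""
    # Assumption: blank lines are skipped instead of raising an error.
    return {line.split()[0] for line in lines if line.strip()}


def lines_with_new_keys(f1_lines, f2_lines):
    """Yield each line of f2 whose key does not occur in f1."""
    seen = keys_of(f1_lines)
    for line in f2_lines:
        fields = line.split()
        if fields and fields[0] not in seen:
            yield line


# The example data from the text, one list element per line.
a_dat = ["1 a\n", "2 a\n"]
b_dat = ["1 b\n", "3 b\n"]

result = list(lines_with_new_keys(a_dat, b_dat))
print("".join(result), end="")
```

Only the key "3" is missing from a.dat, so the single line "3 b" is printed.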

Code:

diff.py

    #!/usr/bin/python
    import sys

    if len(sys.argv) != 3:
        print("Usage: " + sys.argv[0] + " file1 file2")
        sys.exit(-1)

    file1 = sys.argv[1]
    file2 = sys.argv[2]

    # Remember the first field of every line in file1.
    list1 = {}
    for line in open(file1):
        list1[line.split()[0]] = 1

    # Print each line of file2 whose key was not seen in file1.
    for line in open(file2):
        key = line.split()[0]
        if key not in list1:
            sys.stdout.write(line)

diff.sh

    #!/bin/bash
    # Needs bash ([[ ]], function) and gawk (ARGIND).

    if [[ $# -lt 2 ]]; then
        echo "Usage: $0 file1 file2"
        exit 1
    fi

    function do_diff() {
        if [[ $# -lt 2 ]]; then
            echo "Usage: $0 file1 file2"
            return 1
        fi
        if [[ ! -f $1 ]]; then
            echo "$1 is not file"
            return 2
        fi
        if [[ ! -f $2 ]]; then
            echo "$2 is not file"
            return 3
        fi
        # ARGIND is a gawk extension: the index of the file currently being read.
        awk '
        BEGIN { FS = OFS = " " }
        ARGIND == 1 {
            arr[$1] = 1;
        }
        ARGIND == 2 {
            if (!($1 in arr)) {
                print $0;
            }
        }
        ' "$1" "$2"
    }

    do_diff "$1" "$2"

diff.pl

    #!/usr/bin/perl -w
    exit if (1 > $#ARGV);

    # Load the first field of every line of the first file into a hash.
    my %map_orig;
    my $file_orig = shift @ARGV;
    open FH, "<$file_orig" or die "can't open file: $file_orig";
    while (<FH>) {
        chomp;
        #$map_orig{$_} = 1;
        my ($filed) = split /\s+/;
        $map_orig{$filed} = 1;
    }
    close (FH);

    # Print each line of the second file whose key is not in the hash.
    my $file_diff = shift @ARGV;
    open FH, "<$file_diff" or die "can't open file: $file_diff";
    while (<FH>) {
        chomp;
        my ($filed) = split /\s+/;
        print "$_\n" if (!defined$map_orig{$filed});
    }
    close (FH)

diff2.pl

    #!/usr/bin/perl -w
    exit if (1 > $#ARGV);

    # Same as diff.pl, but fields are split with split(" ") instead of /\s+/.
    my %map_orig;
    my $file_orig = shift @ARGV;
    open FH, "<$file_orig" or die "can't open file: $file_orig";
    while (<FH>) {
        chomp;
        #$map_orig{$_} = 1;
        my ($filed) = split(" ");
        $map_orig{$filed} = 1;
    }
    close (FH);

    my $file_diff = shift @ARGV;
    open FH, "<$file_diff" or die "can't open file: $file_diff";
    while (<FH>) {
        chomp;
        my ($filed) = split(" ");
        print "$_\n" if (!defined$map_orig{$filed});
    }
    close (FH)

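All four scripts above use the same algorithm: read the keys of the first file into a map, then read the keys of the second file and print to standard output every line whose key is not in the map. diff.pl and diff2.pl differ only in how they split a line: the former uses the regex /\s+/, the latter uses split(" "), which in Perl is a special case that splits on runs of whitespace and ignores leading whitespace, much like awk's default field splitting.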
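The difference between a regex split and a plain whitespace split can be illustrated with a rough Python analogy. This is not the Perl code from the post: `re.split` stands in for the regex variant and `str.split()` for the whitespace variant, and the sample line is invented for the illustration.

```python
import re

# Invented sample line with leading whitespace.
line = "  42 some value\n"

# Regex split on runs of whitespace: leading and trailing whitespace
# each produce an empty field.
regex_fields = re.split(r"\s+", line)

# Argument-less str.split(): leading and trailing whitespace is ignored.
plain_fields = line.split()

print(regex_fields)
print(plain_fields)
```

For input without leading whitespace, such as the "key value" lines used in this test, both approaches yield the same first field, so the choice only affects speed, not the output.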

Test method: time diff.xx f1 f2 > out

The test file f1 has 123183923 lines, each of the form:

key value (two fields)

File size: 2.5 GB

f2 has 439116 lines, each also of the form:

key value (two fields)

File size: 5.6 MB
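The measurement described above (wall-clock time of one run with standard output redirected to a file) could also be taken from Python. The following is a minimal sketch, not the method used in the post: the helper name `time_command` and the script name in the usage comment are hypothetical, and it measures only elapsed real time, like the "real" line of time.

```python
import subprocess
import sys
import time


def time_command(argv, out_path):
    """Run argv with stdout redirected to out_path; return elapsed seconds."""
    with open(out_path, "wb") as out:
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(argv, stdout=out, check=True)
        return time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 3:
    # Hypothetical usage: python time_diff.py ./diff.py f1 f2
    elapsed = time_command(sys.argv[1:4], "out")
    print("%.1f s" % elapsed)
```

Running each script several times and averaging, as the post does for the Perl variants, reduces the influence of varying machine load.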

Results (the "real" time reported by time):

diff.py: 3m46s = 226s

diff.sh: 3m49s = 229s

diff.pl: (7m21s + 7m12s)/2 ≈ 437s

diff2.pl: (7m41s + 7m34s)/2 ≈ 458s
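The conversion from the run times quoted above to seconds, and the averaging over two runs, can be checked with a short Python sketch. The helper names `to_seconds` and `mean_seconds` are made up for this illustration; the input strings are the run times quoted in the post.

```python
import re


def to_seconds(text):
    """Convert a duration such as '7m21s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)


def mean_seconds(*runs):
    """Average several durations given in 'XmYs' form."""
    return sum(to_seconds(r) for r in runs) / len(runs)


timings = {
    "diff.py": mean_seconds("3m46s"),
    "diff.sh": mean_seconds("3m49s"),
    "diff.pl": mean_seconds("7m21s", "7m12s"),
    "diff2.pl": mean_seconds("7m41s", "7m34s"),
}

# Print each mean and its ratio to the Python script's time.
for name, secs in timings.items():
    print(name, secs, round(secs / timings["diff.py"], 2))
```

The last column shows how many times slower each script is than diff.py.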

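The results show that awk and Python perform about the same, while Perl is clearly slower. Python's dict seems to be very well optimized; that it can keep up with awk came as a real surprise to me.

All tests ran on the same machine in the same environment, but they are not strictly fair (machine load, for instance, may have differed slightly between runs).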

--------------

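Someone suspected that the regex was what made diff.pl slow, so I dropped the regex and used a simple string split instead (diff2.pl). Performance did not improve as expected; if anything, it got slightly worse.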


Comments:

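guxing1841, 2014-03-24 16:22:20

What this really measures is the speed of the line-reading loop. Python's newer for xx in open(xxxx) idiom is indeed about as fast as awk and faster than Perl, but if you write f = open(xxxxx) followed by while True: f.readline, Python becomes slower than a snail, haha. None of this is representative of text-processing performance. As for hash tables, awk's hash table is not in the same league as Perl's. I have not tested whether Python's dict or Perl's hash is faster; if anyone has, please post the results.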

ChinaUnix user, 2011-03-06 16:00:50

Very good, bookmarked. Let me recommend a blog that offers many free software and programming e-books for download: http://free-ebooks.appspot.com

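ChinaUnix user, 2011-02-28 11:46:36

If you are a programmer with a bit of common sense and a comparison of "text-processing efficiency" leads you to the conclusion that "Perl is clearly worse", the first thing to do is check the Perl code carefully. 99.9% of the time, such an absurd conclusion is caused by the code or by the environment.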

ontherd, 2011-02-25 10:46:15

A real expert, fluent in every scripting language.

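ChinaUnix user, 2011-02-22 11:12:28

This post shows two things: 1. The author does not know Perl, so the Perl code reads like BASIC. 2. The author wants to put Perl down and insists on \s+, which causes a lot of regex work. Why doesn't your Python use a regex? awk could as well. Why doesn't the Perl code simply split on a space?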
