外部排序过程:
1) 将文件大小分块,并对每个分块进行快速排序,输出多个已经排序完成的小文件
2) 合并这些小文件,得到最终输出文件
假定源文件有M个记录,每个分块包含N个记录,其中 M >= N,我们将得到 upper_bound( M // N )个分块 (P)。
1)时间复杂度: O1 = N * lg(N) * P
2)时间复杂度: O2 = N * P
总体时间复杂度: O = N * lg(N) * P + N * P
通常P << lg(N),则时间复杂度主要取决于 N * lg(N) * P,当分块越少时,复杂度越少,但占用内存空间越多,当P = 1时,没有分块发生,等效于快速排序,时间复杂度为 N * lg(N);当P接近N时(分块太多),时间复杂度接近 N^2 * lg(N)
merge 算法:
源于上次的mergesort算法(可做一定优化):
def merge(A, B):
C = []
la = len(A)
lb = len(B)
ln = la + lb
j = 0
k = 0
for i in range(ln):
if j >= la:
C.append(B[k])
k += 1
continue;
if k >= lb:
C.append(A[j])
j += 1
continue;
if A[j] > B[k]:
C.append(B[k])
k += 1
else:
C.append(A[j])
j += 1
return C
|
我们将文件分成了P个部分,因此我们需要合并这P个记录,我们只需要将其中记录两两合并,递归一下即可。由于单词merge的复杂度为O(N),因此merge所有记录的复杂度为 O(N * P)
算法list_merge:
from mergesort import merge
def list_merge(T):
ln = len(T)
R = []
if ln == 1:
return T
La = T.pop()
Lb = T.pop()
L = merge(La, Lb)
T.append(L)
return list_merge(T)
|
==>
def list_merge(T):
return reduce(merge, T)
完整的主程序(esort.py):
注意输入文件格式为每行一个整数,纯文本格式。
from __future__ import with_statement
from quicksort import quicksort
from list_merge import list_merge
from os import unlink
"""
sort L, and output the result to file @op
"""
def blk_sort(L, op):
quicksort(L)
for x in L:
op.write(str(x) + '\n')
"""
split @infile based on @threshold,
each splitted file is sorted seperately
return with the name list of splitted files
"""
def split_file(infile, threshold):
n = 0
nth = 0
L = []
outList = []
with open(infile, "rt") as inp:
for line in inp:
if (n == 0):
outfile = infile + '.out.' + str(nth)
outList.append(outfile)
print 'sorting ' + outfile
try:
outp = open(outfile, "wb")
except:
print "unable to write file " + outfile
return
L.append(int(line))
n += 1
if n >= threshold:
blk_sort(L, outp)
n = 0
nth += 1
L = []
outp.close()
if n > 0:
blk_sort(L, outp)
outp.close()
return outList
def _blk_merge(T):
return list_merge(T)[0]
def blk_merge(outFiles, threshold, outFinal):
numFiles = len(outFiles)
# how many records to fetch for each file
records = threshold // numFiles
# EOFed files
eofFiles = 0
exit_loop = False
tmp = []
T = []
R = []
opList = []
ofp = open(outFinal, "wb")
for f in outFiles:
try:
op = open(f, "rb")
opList.append(op)
except:
print 'open file ' + f + ' error.'
return;
while not exit_loop:
for o in opList:
tmp = []
for r in range(records):
try:
tmp.append(int(o.next()))
except StopIteration:
eofFiles += 1
if eofFiles >= numFiles:
exit_loop = True
break;
T.append(tmp)
R = _blk_merge(T)
T = []
for x in R:
ofp.write(str(x) + '\n')
ofp.close()
for f in opList:
f.close()
for fname in outFiles:
unlink(fname)
def esort(infile, threshold, outfile):
outList = split_file(infile, threshold)
blk_merge(outList, threshold, outfile)
if __name__ == '__main__':
import sys
argc = len(sys.argv)
if argc != 4:
print sys.argv[0] + ' '
sys.exit(1)
infile = sys.argv[1]
threshold = int(sys.argv[2])
outfile = sys.argv[3]
esort(infile, threshold, outfile)
|
对1M大小的随机数据文件,每个分块大小为100K和10K的情况:
$ time python ./esort.py 1m.dat 100000 output.dat > /dev/null
real 0m25.818s
user 0m25.669s
sys 0m0.122s
$ time python ./esort.py 1m.dat 10000 output.dat > /dev/null
real 0m45.356s
user 0m45.248s
sys 0m0.098s
$
|
阅读(2172) | 评论(1) | 转发(0) |