Chinaunix首页 | 论坛 | 博客
  • 博客访问: 175115
  • 博文数量: 42
  • 博客积分: 2185
  • 博客等级: 大尉
  • 技术积分: 455
  • 用 户 组: 普通用户
  • 注册时间: 2009-06-11 21:32
文章分类

全部博文(42)

文章存档

2012年(5)

2011年(13)

2010年(6)

2009年(18)

我的朋友

分类: Python/Ruby

2009-12-05 21:31:08

外部排序过程:

1) 将文件大小分块,并对每个分块进行快速排序,输出多个已经排序完成的小文件
2) 合并这些小文件,得到最终输出文件

假定源文件有M个记录,每个分块包含N个记录,其中 M >= N,我们将得到 upper_bound( M // N )个分块 (P)。

1)时间复杂度:  O1 = N * lg(N) * P
2)时间复杂度:  O2 = N * P

总体时间复杂度: O = N * lg(N) * P + N * P
通常P << lg(N),则时间复杂度主要取决于 N * lg(N) * P,当分块越少时,复杂度越少,但占用内存空间越多,当P = 1时,没有分块发生,等效于快速排序,时间复杂度为 N * lg(N);当P接近N时(分块太多),时间复杂度接近  N^2 * lg(N)

merge 算法:
源于上次的mergesort算法(可做一定优化):

def merge(A, B):
    C = []
    la = len(A)
    lb = len(B)
    ln = la + lb
    j = 0
    k = 0

    for i in range(ln):
        if j >= la:
            C.append(B[k])
            k += 1
            continue;
        if k >= lb:
            C.append(A[j])
            j += 1
            continue;
        if A[j] > B[k]:
            C.append(B[k])
            k += 1
        else:
            C.append(A[j])
            j += 1

    return C


我们将文件分成了P个部分,因此我们需要合并这P个记录,我们只需要将其中记录两两合并,递归一下即可。由于单词merge的复杂度为O(N),因此merge所有记录的复杂度为 O(N * P)

算法list_merge:

from mergesort import merge

def list_merge(T):
    ln = len(T)

    R = []

    if ln == 1:
        return T

    La = T.pop()
    Lb = T.pop()
    L = merge(La, Lb)
    T.append(L)

    return list_merge(T)


==>
def list_merge(T):
    return reduce(merge, T)

完整的主程序(esort.py):
注意输入文件格式为每行一个整数,纯文本格式。


from __future__ import with_statement

from quicksort import quicksort
from list_merge import list_merge

from os import unlink

"""
sort L, and output the result to file @op
"
""
def blk_sort(L, op):
    quicksort(L)
    for x in L:
        op.write(str(x) + '\n')

"""
split @infile based on @threshold,
each splitted file is sorted seperately
return with the name list of splitted files
"
""
def split_file(infile, threshold):
    n = 0
    nth = 0
    L = []
    outList = []

    with open(infile, "rt") as inp:
        for line in inp:
            if (n == 0):
                outfile = infile + '.out.' + str(nth)
                outList.append(outfile)
                print 'sorting ' + outfile
                try:
                    outp = open(outfile, "wb")
                except:
                    print "unable to write file " + outfile
                    return
            L.append(int(line))
            n += 1
            if n >= threshold:
                blk_sort(L, outp)
                n = 0
                nth += 1
                L = []
                outp.close()

    if n > 0:
        blk_sort(L, outp)
        outp.close()

    return outList

def _blk_merge(T):
    return list_merge(T)[0]

def blk_merge(outFiles, threshold, outFinal):
    numFiles = len(outFiles)

    # how many records to fetch for each file
    records = threshold // numFiles

    # EOFed files
    eofFiles = 0

    exit_loop = False

    tmp = []
    T = []
    R = []
    opList = []

    ofp = open(outFinal, "wb")

    for f in outFiles:
        try:
            op = open(f, "rb")
            opList.append(op)
        except:
            print 'open file ' + f + ' error.'
            return;

    while not exit_loop:
            for o in opList:
                tmp = []
                for r in range(records):
                    try:
                        tmp.append(int(o.next()))
                    except StopIteration:
                        eofFiles += 1
                        if eofFiles >= numFiles:
                            exit_loop = True
                        break;
                T.append(tmp)
            R = _blk_merge(T)
            T = []
            for x in R:
                ofp.write(str(x) + '\n')

    ofp.close()

    for f in opList:
        f.close()

    for fname in outFiles:
        unlink(fname)

def esort(infile, threshold, outfile):
    outList = split_file(infile, threshold)

    blk_merge(outList, threshold, outfile)

if __name__ == '__main__':
    import sys

    argc = len(sys.argv)

    if argc != 4:
        print sys.argv[0] + ' '
        sys.exit(1)

    infile = sys.argv[1]
    threshold = int(sys.argv[2])
    outfile = sys.argv[3]

    esort(infile, threshold, outfile)


对1M大小的随机数据文件,每个分块大小为100K和10K的情况:


$ time python ./esort.py 1m.dat 100000 output.dat > /dev/null

real 0m25.818s

user 0m25.669s

sys 0m0.122s


$ time python ./esort.py 1m.dat 10000 output.dat > /dev/null

real 0m45.356s

user 0m45.248s

sys 0m0.098s

$



阅读(2118) | 评论(1) | 转发(0) |
给主人留下些什么吧!~~

chinaunix网友2010-03-15 04:24:29

R = [] if ln == 1: return T La = T.pop() Lb = T.pop() R 是做什么的呢 T.POP是出来一个数组还是出来单独的一个数字? T 怎么得来的