Chinaunix首页 | 论坛 | 博客
  • 博客访问: 231385
  • 博文数量: 65
  • 博客积分: 1410
  • 博客等级: 上尉
  • 技术积分: 655
  • 用 户 组: 普通用户
  • 注册时间: 2006-03-31 13:33
文章分类

全部博文(65)

文章存档

2016年(1)

2010年(4)

2009年(4)

2008年(4)

2007年(40)

2006年(12)

我的朋友

分类:

2007-04-24 11:28:43

Alexander Ratushnyak 发明的无损压缩算法。可以将维基百科上面摘取的100M大小的内容压缩到17M,这个压缩比很高了。原文在维基百科上,考虑到维基随时可能会访问不到,这里先收藏一下内容。


PAQ
From Wikipedia, the free encyclopedia
Jump to: navigation, search
PAQ is a series of open source data compression archivers that have evolved through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage). The best compression in this series is obtained by PAQAR 4.0 by Alexander Ratushnyak, released July 25, 2004 (updated July 27, 2004) for most non-text data, or PASqDa 4.1 by Przemyslaw Skibinski, released July 1, 2005, for English text. Both compressors are surpassed on some benchmarks by WinRK in PWCM mode by Malcolm Taylor, released in Jan. 2005. PWCM (PAQ weighted context mixing) is an independently developed closed source implementation of the PAQ algorithm.

Algorithm
PAQ uses a context mixing algorithm. Context mixing is related to PPM in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighed combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context need not be contiguous. Most PAQ versions collect next-symbol statistics for the following contexts:

n-grams. The context is the last n bytes before the predicted symbol (as in PPM).
whole word n-grams, ignoring case and nonalphabetic characters (useful in text files).
“sparse” contexts, for example, the second and fourth bytes preceding the predicted symbol (useful in some binary formats)
“analog” contexts, consisting of the high order bits of previous 8 or 16 bit words (useful for multimedia files)
two dimensional contexts (useful for images, tables, and spreadsheets). The row length is determined by finding the stride length of repeating byte patterns.
Although contexts usually fall on byte boundaries, the predicted symbols are actually single bits. In each distinct context, two counts are maintained, n0, a count of zero bits, and n1, a count of 1 bits. In order to favor recent history, half of the count over 2 is discarded when the opposite bit is observed. For example, if the current state associated with a context is (n0,n1) = (12,3) and a 1 is observed, then the counts are updated to (7,4).

A bit is arithmetic coded with space proportional to its probability, either P(1) or P(0) = 1 - P(1). The probabilities are computed by weighted addition of the 0 and 1 counts:

S0 = ∑ win0i
i
S1 = ∑ win1i
i

S = S0 + S1

P(0) = S0 / S

P(1) = S1 / S

where wi is the weight of the i’th model. In early versions of PAQ, the weights were fixed and set in an ad-hoc manner. In later versions, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is y, then the weight adjustment is:

err = (P(1) − y) / (S0S1)

wi < - wi + (Sn1i - S1(n0i + n1i))err

PAQ derivations
PAQ is free software distributed under the GNU General Public License. This has allowed other authors to fork the PAQ compression engine and add new features such as a graphical user interface or better speed (at the expense of compression ratio). The most notable PAQ-clones are:

WinUDA 0.290, based on PAQ6 but faster
UDA 0.300, based on PAQ8
KGB, is basically PAQ6v2 with a GUI
Emilcont based on PAQ6

阅读(5550) | 评论(0) | 转发(0) |
0

上一篇:Virtue Linux

下一篇:solaris 10硬盘安装

给主人留下些什么吧!~~