PAQ
From Wikipedia, the free encyclopedia
Jump to: navigation, search
PAQ is a series of open source data compression archivers that have
evolved through collaborative development to top rankings on several
benchmarks measuring compression ratio (although at the expense of
speed and memory usage). The best compression in this series is
obtained by PAQAR 4.0 by Alexander Ratushnyak, released July 25, 2004
(updated July 27, 2004) for most non-text data, or PASqDa 4.1 by
Przemyslaw Skibinski, released July 1, 2005, for English text. Both
compressors are surpassed on some benchmarks by WinRK in PWCM mode by
Malcolm Taylor, released in Jan. 2005. PWCM (PAQ weighted context
mixing) is an independently developed closed source implementation of
the PAQ algorithm.
Algorithm
PAQ uses a context mixing algorithm. Context mixing is related to PPM
in that the compressor is divided into a predictor and an arithmetic
coder, but differs in that the next-symbol prediction is computed using
a weighed combination of probability estimates from a large number of
models conditioned on different contexts. Unlike PPM, a context need
not be contiguous. Most PAQ versions collect next-symbol statistics for
the following contexts:
n-grams. The context is the last n bytes before the predicted symbol (as in PPM).
whole word n-grams, ignoring case and nonalphabetic characters (useful in text files).
“sparse” contexts, for example, the second and fourth bytes preceding the predicted symbol (useful in some binary formats)
“analog” contexts, consisting of the high order bits of previous 8 or 16 bit words (useful for multimedia files)
two dimensional contexts (useful for images, tables, and spreadsheets).
The row length is determined by finding the stride length of repeating
byte patterns.
Although contexts usually fall on byte boundaries, the predicted
symbols are actually single bits. In each distinct context, two counts
are maintained, n0, a count of zero bits, and n1, a count of 1 bits. In
order to favor recent history, half of the count over 2 is discarded
when the opposite bit is observed. For example, if the current state
associated with a context is (n0,n1) = (12,3) and a 1 is observed, then
the counts are updated to (7,4).
A bit is arithmetic coded with space proportional to its probability, either P(1) or P(0) = 1 - P(1). The probabilities are computed by weighted addition of the 0 and 1 counts:
S0 = ∑ win0i
i
S1 = ∑ win1i
i
S = S0 + S1
P(0) = S0 / S
P(1) = S1 / S
where wi is the weight of the i’th model. In early versions of PAQ, the weights were fixed and set in an ad-hoc manner. In later versions, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is y, then the weight adjustment is:
err = (P(1) − y) / (S0S1)
wi < - wi + (Sn1i - S1(n0i + n1i))err
PAQ derivations
PAQ is free software distributed under the GNU General Public License.
This has allowed other authors to fork the PAQ compression engine and
add new features such as a graphical user interface or better speed (at
the expense of compression ratio). The most notable PAQ-clones are:
WinUDA 0.290, based on PAQ6 but faster
UDA 0.300, based on PAQ8
KGB, is basically PAQ6v2 with a GUI
Emilcont based on PAQ6