Arithmetic Coding
Arithmetic Coding and PPM
Arithmetic coding is a general technique for coding the outcome of a
stochastic process based on an adaptive model. The expected bit rate
is the cross-entropy rate of the model versus the actual process.
PPM, prediction by partial matching, is an adaptive statistical model
of a symbol sequence which models the likelihood of the next byte based
on a (relatively short) suffix of the sequence of previous bytes.
Byte Stream Models
Although arithmetic coding works with any sample space, taking the
outcomes to be bytes allows the coding of text in arbitrary character
sets as well as binary files. Encoding proceeds a byte at a time and
provides output a bit at a time; decoding proceeds a bit at a time and
provides output a byte at a time. The end-of-file outcome is coded as
a separate outcome, mirroring the behavior of byte reading in both
Java and C streams.
Suppose b[0],...,b[m-1] is a sequence of bytes of length
m where b[m-1]=EOF. A suitable model
provides an estimate P(b[n]|b[0],...,b[n-1]) for the
probability of the nth byte in the stream or end-of-file
given its observation of the previous n-1 bytes. The
number of bits required to encode b[n] is approximately
-log_2 P_n(b[n]|b[0],...b[n-1]). A slight penalty is
incurred based on the precision of the underlying arithmetic.
Encoding with Cumulative Probabilities
In a sequence of bytes b[0], b[1], ..., b[m-1], the byte
b[n] is coded based on an ordering b_0, ...,
b_256 of the all 256 bytes and the end-of-file symbol.
Assuming b[n]=b_k for some k, the arithmetic
encoder stores the sub-interval
[ SUM_j < k P(b[j]|b[0],...,b[j-1]),
SUM_j <= k P(b[j]|b[0],...,b[j-1]) ]
of [0.0,1.0]. Independently of the indexing, the width
of the interval is P(b[n]|b[0],...,b[n-1]). In practice,
the interval is approximated using three integers low,
high, and total, which define the interval
[low/total,high/total]. The difference
high-low must be non-zero for every outcome with non-zero
probability. A slight loss in coding precision arises from this
integer approximation and is bounded by the precision induced by
total.
Decoding with Cumulative Probabilities
For decoding, the total used to encode the next symbol's
interval is provided by the model. This requires each possible outcome
P(.|b[0],...,b[n-1]) to be coded with
the same total value. Given the total, the decoder returns a
count k such that low<=k<high. The model
then determines the symbol from the count k. The coder
encodes byte b[n] based on the previous symbols
b[0],...,b[n-1], enabling it to update its model after
every symbol. The decoder is able to reconstruct the same model after
decoding each symbol. The only restriction on the models are that
they are trained only on the previous n-1 symbols; in
particular, they cannot use information about the symbol being coded.
Models
The arithmetic coder is distributed with several illustrative models,
culminating in PPMC, prediction by partial matching with model C.
When introduced in the 1980s, PPM provided the best known compression
on general text and binary data such as images. It has since been
beaten by embellishments, first PPMC a few years later which had
better fallback strategies for unseen contexts, then PPM* a decade
after that, which kept track of unbounded length contexts, and then
more recently, PPMZ, which provides several improvements to both
estimation and implementation.
Uniform Model
Uniform distributions assign the same probability to every outcome.
The uniform distribution codes a byte b as the interval
[low,high,total] = [b,b+1,257] and end-of-file as
[256,257,257]. The expected entropy rate is -log_2
1/257, just over 8 bits. The loss is due to encoding
end-of-file. For a regular file, the file system keeps track of its
length so that all 256 bytes may be included. Read and write in
languages like C and Java return bytes in the range 0-255 and return
-1 if the end-of-file is reached; essentially 257 outcomes. An
alternative implementation of arithmetic coding would store the length
n separately in log_2 n bits. Although this
is typically more efficient than coding end-of-file in a model, it
disrupts the streaming nature of coding/decoding.
Adaptive Unigram Model
The adaptive unigram model keeps a count of previously seen bytes and
estimates the likelihood of a byte b based on its
frequency count(b), where each byte's count is
initialized to 1. Adding 1 to unseen events
in this way is known as Laplace smoothing. Whenever the total count
exceeds the maximum, the counts are rescaled by dividing by
2 and rounding up to at least 1. The model
is adaptive in that it updates after each byte; it is a unigram
because strings of length 1 are used. A binary search is used during
decoding to determine the interval, but the totals are updated by a
linear traversal of all bytes before the updated byte in the order. A
more efficient implementation would involve Fenwick trees (Fenwick
1994) or a frequency-balanced model (Moffat 1999).
PPM: Prediction by Partial Matching
The PPM model (Cleary and Witten 1984) predicts the next byte based on
a sequence of previous bytes. It generalizes the byte frequency model
by keeping counts for sequences of bytes b_1,..,b_n for some value of
n (to be precise, we write PPM(n) for the nth order PPM
model).
The PPM model illustrates the final aspect of the arithmetic coding
model -- escapes. A model has the option of informing the coder that
it doesn't currently know how to code the symbol and providing the
likelihood interval for this "escape" outcome. Then, the coder asks
again. With PPM, after each escape, the context is narrowed to a
shorter prefix of the current byte, eventually grounding out in the
uniform distribution. At each escaped context in the backoff, outcomes
that have been seen in longer contexts are excluded from the counts.
The PPM model is configurable by constructor for maximum context
length, and rescaling and maximum count parameters can be tuned
through constants in the code.
I/O: BitInput and BitOutput
Arithmetic coding relies on processing files a single bit at a
time. The classes BitInput and BitOutput
wrap ordinary input and output streams to provide buffered access to
individual bits.
Sets of Bytes
In order to implement PPM with tight compression, it is necessary to
keep track of which symbols have been seen in escaped contexts so that
their counts may be subtracted from the narrower contexts. The
representation of the bytes that have been seen is handled with a
compact, generic implementation of byte sets. The members are stored
in the bits four long values and accesses through shifting and
masking. (This was found to be faster and smaller than a binary
array.)
Other Arithmetic Coders
There are several other arithmetic coders available, all written in straight C.
Witten, Neal and Cleary's
The original, from Witten, I. H., R. Neal,and J. G. Cleary. 1987. Arithmetic coding for
data compression. Communications of the ACM 30(6): 520-540.
ftp://ftp.cpsc.ucalgary.ca/pub/projects/ar.cod/cacm-87.shar
Mark Nelson's
Mark Nelson's refactoring of the original CACM code
for Dr. Dobbs Journal, February 1991:
http://dogma.net/markn/articles/arith/part1.htm
Radford Neal's
Radford Neal's low-precision version:
ftp://ftp.cs.toronto.edu/pub/radford/lowp_ac.shar
Alistair Moffat's
Alistair Moffat's latest, based on
Moffat, Neal, and Witten (1998), along with pointers to older systems:
http://www.cs.mu.oz.au/~alistair/arith_coder/
Charles Bloom's
PPMZ page with
source code and results. This is currently the best performing coder
in terms of compression and speed, including several improvements to
both estimation itself and to computing adaptive models efficiently.
Test Corpora
These corpora are commonly used to evaluate compression schemes.
Benchmark Results
Jeff Gilchrist's Archive Comparison Test. Includes extensive links to existing compressors.
Compression FAQs
Compression FAQ
Other Tutorials
Art of Lossless Data Compression
References for Arithmetic Coding and PPM
- Aberg, J., Y. Sharktov, and B.J.M. Schmeets. 1997. Towards understanding and improving escape probabilities in PPM.
IEEE Data Compression Conference:52-61.
- Aberg, J. and Y. Shtarkov. 1997. Text compression by context tree
weighting. IEEE Data Compression Conference:377-386.
- Aberg, J., Y. M. Shtarkov, and B.J.M. Smeets. 1997.
Multialphabet coding with separate alphabet description. In Proc. of
Compression and Complexity of SEQUENCES 1997, pages 56-65. IEEE
Computer Society.
- Bell, T.C., J.G. Cleary, and I.H. Witten. 1990. Text
compression. Prentice Hall, Englewood Cliffs, NJ.
Introduces Calgary Corpus
- T.C. Bell and I.H. Witten. 1994. Relationship between greedy parsing
and symbolwise text compression. Journal of the ACM, 41(4):708--724.
Connection between Lempel-Ziv and statistical coding
- T.C. Bell, I.H. Witten and J.G. Cleary. 1989. Modeling for text
compression. ACM Computing Surveys 21(4):557-591.
- J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei. 1986. A
locally adaptive data compression scheme. Communications of the ACM
29(4):320--330.
Code based on recency statistic using move-to-front lists.
- Bloom, C. 1995. New techniques in context modeling and arithmetic encoding.
Introduces deferred summation and PPMCB, which involves only keeping
a single outcome for long contexts.
- Bloom, C. 1995. Solving the problems of context modeling.
Introduces PPMZ, containing local-order estimation (LOE) to predict
which order model to use and secondary escape estimation (SEE), which estimates
the escape probabilities dynamically.
- Bunton, S. 1997. An executable taxonomy of on-line modeling algorithms.
In IEEE Data Compression Conference. 42-51.
- Bunton, S. 1997. On-Line Stochastic Processes in Data Compression.
University of Washington.
- Cheney, J. 2001. Compressing XML with multiplexed hierarchical
models. In IEEE Data Compression Conference, pages
163--172.
- Cleary, J. G. and W.J. Teahan. 1995. Experiments in the zero frequency problem.
IEEE Data Compression Conference 480.
- Cleary, J.G. and I.H. Witten. 1984. Data compression using
adaptive coding and partial string matching. IEEE Transactions on
Communications. 32(4):396-402.
Introduces
PPM
- Cleary, J.G., W.J. Teahan, and I.H. Witten. 1995. Unbounded length contexts for PPM.
IEEE Data Compression Conference (DCC).
- Cleary, J.G., and W.J. Teahan. 1997. Unbounded context lengths for
PPM. The Computer Journal 40(2/3):67-75.
Introduces PPM*
- Cormack, G.V. and R.N. Horspool. 1984. Algorithms for adaptive Huffman codes.
Information Processing Letters 18(3):159-165.
- Cormack, G.V. and R.N. Horspool. 1987. Data compression using
dynamic Markov modeling. The Computer Journal
30(6):541-550.
- Cover, T. and J. Thomas. 1991. Elements of Information
Theory. John Wiley and Sons, Inc. New York.
- Drinic, M. and D. Kirovski. 2002.
PPMexe: PPM for Compressing Software.
IEEE Data Compression Conference.
- Fenwick, P.M. 1994. A new data structure for cumulative frequency
tables. Software Practice and Experience 24(3):327-336.
- Howard, P.G. and J.S. Vitter. 1992. Analysis of Arithmetic Coding for
Data Compression.
Information Processing and Management 28(6):749-763.
- Howard, P.G. and J.S. Vitter. 1992. Practical Implementations of
Arithmetic Coding. In Images and Text Compression. Kluwer
Academic Publishers.
- Howard, P.G. and J.S. Vitter. 1993. Design and analysis of fast text compression based on quasi-arithmetic coding.
IEEE Data Compression Conference. 98-107.
Introduces PPMD.
- Howard, P.G. and J.S. Vitter. 1994. Arithmetic Coding for Data Compression. IEEE Data Compression Conference 82(6):857-865.
Table-based probability acceleration.
- Jones, C. 1981. An efficient coding system for long source
sequences. IEEE Trans. Information Theory
27(3):280-291.
- Hischberg, D.S. and D.A. Lelewer. 1992. Context modeling for text
compression. In J.A. Storer, Image and Text
Compression. 113-144.
- Langdon, G.G., Jr., and J. Rissanen. 1982. A simple general binary
source code. IEEE Trans. Information Theory
27:800-803.
- Levene, M. and P. Wood. 2002.
XML Structure Compression. Manuscript, Birkbeck College, University of London.
- Liefke, H. and D. Suciu. 2000. XMILL: An efficient compressor for
XML data. In Proceedings of the ACM SIGMOD International Conference
on Management of Data. 153-164, Dallas, Texas.
- Manning, C. D. and Schuetze, H. 1999. Foundations of
Statistical Natural Language Processing. MIT Press.
Great intro to statistical language modeling.
- Moffat, A. 1990. Implementing the PPM data compression
scheme. IEEE Transactions on Communication
38(11):1917-1921.
Introduces PPMC
- Moffat, A. 1990b. Linear time adaptive arithmetic coding. IEEE
Transactions on Information Theory 38(11):1917-1921.
- Moffat, A., R.M. Neal, and I.H. Witten. 1998. Arithmetic coding revisited.
ACM Transactions on Information Systems 16(3):256-294.
- Moffat. A. 1999. An improved data structure for cumulative probability tables.
Software - Practice and Experience 29(7):647-659
- Moffat, A. and A. Turpin. 2002. Compression and
Coding Algorithms. Kluwer Academic Publishers.
- Pasco, R. 1976. Source Coding Algorithms for Fast Data
Compression. Ph.D. Thesis. Stanford University.
- Pennebaker, W.B. and J.L. Mitchell. 1988. Probability estimation
for the Q-coder. IBM Journal of Research and Development
32(6):737-752.
- Pennebaker, W.B., J.L. Mitchell, G.G. Langdon, Jr., and
R.B. Arps. 1988. An Overview of the Basic Principles of the Q-Coder
Adaptive Binary Arithmetic Coder. IBM J. Res. Develop.
32:717-726.
- Stuiver, L. and A. Moffat. 1998. Piecewise integer mapping for
arithmetic coding.
IEEE Data Compression Conference. 1-10.
- Rissanen, J. 1976. Generalized Kraft Inequality and Arithmetic
Coding. IBM Journal of Research and Development 20(3):198-203.
Theory of arithmetic coding. Could someone please let me know if Rissanen was
the inventor and this is the original reference.
- Rissanen, J. 1984. Universal coding, information, prediction, and esitmation.
IEEE Transactions on Information Theory 29(4):629-636.
- Rissanen, J. and G.G. Langdon. 1979. Arithmetic coding. IBM
Journal of Research and Development 23(2):149-162.
- Rissanen, J. and K.M. Mohiuddin. 1989. A multiplication-free
multialphabet arithmetic code. IEEE Transactions on
Communications 37(2):93-98.
- Rubin, F. 1979. Arithmetic stream coding using fixed precision
registers. IEEE Trans. Information Theory 25(6):672-675.
- Schindler, M. 1998. A fast renormalisation for arithmetic
coding. IEEE Data Compression Conference 572.
- Shannon, C. E. 1948. A mathematical theory of communication.
Bell System Technical Journal 27:379-423.
Introduces the entire field of information theory. Among other things, shows that the entropy rate of the source stochastic process determines a lower bound on the expected bits per symbol in a coder.
- Shkarin, D. ????. PPM: one step to practicality. Manuscript.
- Slattery, M.J. and J.L. Mitchell. The Qx-coder. 1998. IBM Web page.
- Teahan, W. J. and J. G. Cleary. 1997. Models of English text. DCC:12.
- Turpin, A. and A. Moffat. 1997. Efficient approximate adaptive
coding. IEEE Data Compression Conference. 357-366.
- Witten, I.H. and T. Bell. 1991. The zero frequency problem. IEEE Transactions
on Information Systems 37(4):1085.
- Witten, I.H., A. Moffat and T.C. Bell. 1999. Managing
Gigabytes: Compressing and Indexing Documents and Images, Second
Edition. Morgan-Kaufmann Publishers.
- Witten, I.H., R.M. Neal and J.G. Cleary. 1987. Arithmetic coding
for data compression. Communications of the ACM 30(6):520-540.
Practice of arithmetic coding for compression
- Witten, I.H., R.M. Neal, and J.G. Cleary. 1988. Compress and
compact discussed further. Communications of the ACM 31(9): 1140-1145.
Copyright 2002. Bob Carpenter. Maintained by Bob Carpenter,
carp@colloquial.com.