Port of PPMZ

Helsinki University of Technology
Laboratory of Information Processing Science

Hannu Peltola and Jorma Tarhio:
PPMZ for Linux
We have ported PPMZ to Linux: ppmz.tar.gz. See more details below.
General

PPM (Prediction by Partial Match) is a classic compression algorithm. It predicts the probability of the next character from the characters that immediately precede it. PPMZ is an efficient variant of PPM, developed and implemented in C by Charles Bloom. The source code of PPMZ consists of two parts: ppmz and crblib.
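To make the idea of context-based prediction concrete, here is a minimal sketch of a single order-1 model: it estimates the probability of a byte from the one byte before it. This is not Bloom's code and not how PPMZ is structured. A real PPM coder blends several context lengths and escapes to shorter contexts when a symbol is unseen. All names (Order1Model, model_train, model_prob) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* counts[p][c]: how often byte c followed byte p in the text seen so far.
   totals[p]:    how often byte p was followed by any byte. */
typedef struct {
    unsigned counts[256][256];
    unsigned totals[256];
} Order1Model;

static void model_init(Order1Model *m)
{
    memset(m, 0, sizeof *m);
}

/* Update the statistics with every adjacent byte pair in buf. */
static void model_train(Order1Model *m, const unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 1; i < len; i++) {
        m->counts[buf[i - 1]][buf[i]]++;
        m->totals[buf[i - 1]]++;
    }
}

/* Estimated probability that `next` follows `prev`.
   Returns 0.0 for an unseen context; a real PPM coder would
   instead escape to a shorter context at this point. */
static double model_prob(const Order1Model *m,
                         unsigned char prev, unsigned char next)
{
    if (m->totals[prev] == 0)
        return 0.0;
    return (double)m->counts[prev][next] / (double)m->totals[prev];
}
```

After training on the 11 bytes of "abracadabra", the context 'a' has been followed by 'b' twice, 'c' once and 'd' once, so model_prob(&m, 'a', 'b') is 2/4.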
License

The source code is covered by the Bloom Public License.
Port of PPMZ to gcc

We have ported PPMZ v9.1 to Linux with gcc 2.95.2. Some routines could not be compiled without changes. The changes were kept minimal and were made using conditional compilation. The identifier 'unix' marks changes related to the operating system. Details are presented in the files ppmz/gcc.compile and crblib/gcc.compile.
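The conditional-compilation pattern described above can be sketched as follows. Only the identifier 'unix' comes from the port. The path-separator example, the macro PATH_SEPARATOR and the function base_name are hypothetical illustrations, not excerpts from ppmz or crblib. Whether 'unix' is predefined by the compiler or passed explicitly (for example with -Dunix) is also an assumption here; gcc predefines it in its default GNU mode on Linux, but not in strict ISO modes such as -ansi.

```c
#include <string.h>

/* Select the platform-specific variant at compile time.
   The original code path stays intact in the #else branch. */
#ifdef unix
#define PATH_SEPARATOR '/'
#else
#define PATH_SEPARATOR '\\'
#endif

/* Return the file-name part of a path (hypothetical helper). */
static const char *base_name(const char *path)
{
    const char *sep = strrchr(path, PATH_SEPARATOR);
    return sep ? sep + 1 : path;
}
```

Keeping both variants side by side in the same file is what lets the change stay minimal: the unmodified code still compiles on the original platform.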
Old versions are preserved with the extension '.old'. Unmodified versions are preserved with the extension '.original'.
The port is available as ppmz.tar.gz (0.6 MB). An executable for processors using the i386 instruction set and a second, statically linked executable are also provided. For safety reasons we recommend recompiling.
A port to Sparc is under construction.
Notes about PPMZ

PPMZ uses a lot of memory, especially with large input files. Any test that records running time should also report the available main memory.
"HeaderLen not included in report of results because all info in the header is not necessary for compression decompression. see ppmzhead.c for details":
Header consists of:

4-char signature, for convenient decoding
ulong CRC , for convenient error-checking
ulong rawlen, so that the buffer size can be known (i.e. I can use fread instead of fgetc)
3-ulong RunTransform info, again for array allocation (the fact that these are not needed is proved by the fact that they are not passed to UnRunTransform)
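The field list above can be pictured as a C structure. The following is only a sketch: it assumes that ulong is a 32-bit unsigned integer and that the fields are stored little-endian in the order listed, neither of which is stated here (see ppmzhead.c for the actual layout). The names PpmzHeaderSketch, HEADER_SKETCH_BYTES and parse_header_sketch are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Header fields in the order listed above (assumed layout). */
typedef struct {
    char     signature[4];           /* 4-char signature */
    uint32_t crc;                    /* CRC for error checking */
    uint32_t raw_len;                /* uncompressed length, for buffer sizing */
    uint32_t run_transform_info[3];  /* RunTransform info, for array allocation */
} PpmzHeaderSketch;

/* Size under the 32-bit ulong assumption: 4 + 4 + 4 + 3*4 bytes. */
enum { HEADER_SKETCH_BYTES = 4 + 4 + 4 + 3 * 4 };

/* Read a 32-bit little-endian value (byte order is an assumption). */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Decode HEADER_SKETCH_BYTES bytes from buf into *h. */
static void parse_header_sketch(const unsigned char *buf, PpmzHeaderSketch *h)
{
    int i;
    memcpy(h->signature, buf, 4);
    h->crc     = read_u32_le(buf + 4);
    h->raw_len = read_u32_le(buf + 8);
    for (i = 0; i < 3; i++)
        h->run_transform_info[i] = read_u32_le(buf + 12 + 4 * i);
}
```

Reading the fields byte by byte, rather than casting the buffer to the struct, avoids depending on the compiler's padding and on the byte order of the machine.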

Test results

The PPMZ home page gives test results on the files of the Calgary Corpus; the corpus can also be obtained from Canterbury. We repeated the tests with the ported version of PPMZ and obtained the following results:

PPMZ v9.1 results on the Calgary Corpus (sizes in bytes)

file     raw size            compressed
         by Bloom    by us   by Bloom    by us

bib        111261   111261      24256    24256
book1      768771   768771     212733   212733
book2      610856   610856     143075   143074
geo        102404   102400      51635    59446
news       377109   377109     105725   105722
obj1        21504    21504       9854     9853
obj2       246814   246814      68804    68801
paper1      53161    53161      14772    14772
paper2      82199    82199      22749    22748
pic        513216   513216      50685    50685
progc       39611    39611      11180    11179
progl       71646    71646      13185    13185
progp       49379    49379       9122     9124
trans       93695    93695      14508    14508

Currently the Canterbury Corpus is the most popular compression benchmark. Below are the results for PPMZ and the best results reported on the results page of the Canterbury Corpus.

PPMZ v9.1 results on the Canterbury Corpus (sizes in bytes)

file  raw size   compressed   bpc
                              PPMZ    best reported

text    152089        39576   2.081   2.20
fax     513216        50685   0.790   0.77
Csrc     11150         2603   1.867   2.08
Excl   1029744       139748   1.085   0.83
SPRC     38240        11715   2.450   2.58
tech    426754        97460   1.827   1.95
poem    481861       133508   2.216   2.36
html     24603         6744   2.192   2.32
list      3721         1048   2.253   2.40
man       4227         1514   2.865   2.98
play    125179        36547   2.335   2.49
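
The bpc (bits per character) figures for PPMZ in the table above follow from the two size columns: eight bits per compressed byte, divided by the number of input bytes. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Bits per character: compressed size in bits divided by the
   number of characters (bytes) in the uncompressed input. */
static double bits_per_char(unsigned long raw_size,
                            unsigned long compressed_size)
{
    assert(raw_size > 0);
    return 8.0 * (double)compressed_size / (double)raw_size;
}
```

For the text row, bits_per_char(152089, 39576) gives approximately 2.08, in agreement with the table. Lower values mean better compression.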

This page is maintained by Jorma Tarhio and Hannu Peltola, E-mail: tarhio at cs.hut.fi.
This page has been updated on April 10, 2002
URL: http://www.cs.hut.fi/u/tarhio/ppmz/
