Helsinki University of Technology
Laboratory of Information Processing Science
Hannu Peltola and Jorma Tarhio:
PPMZ for Linux
We have ported PPMZ to Linux: ppmz.tar.gz. See more details below.
General
PPM (Prediction by Partial Match) is a classic compression algorithm. PPM predicts the probability of a given character based on the characters that immediately precede it. PPMZ is an efficient version of PPM. PPMZ has been developed and programmed in C by Charles Bloom. The source code of PPMZ consists of two parts: ppmz and crblib.
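To illustrate the idea of predicting a character from its predecessors, here is a minimal sketch of an order-1 context model in C. It is not PPMZ's method: the type Order1Model and the functions model_init, model_train and model_prob are hypothetical names made up for this example, and only a single preceding character is used as context. A real PPM coder combines several context lengths and falls back to shorter contexts when a symbol has not been seen.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical order-1 model (illustration only, not PPMZ code):
   counts[c][s] is how often symbol s followed context character c. */
typedef struct {
    unsigned counts[256][256];
    unsigned totals[256];
} Order1Model;

/* Reset all statistics to zero. */
static void model_init(Order1Model *m)
{
    memset(m, 0, sizeof *m);
}

/* Update the statistics with every (previous, current) pair in buf. */
static void model_train(Order1Model *m, const unsigned char *buf, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        m->counts[buf[i - 1]][buf[i]]++;
        m->totals[buf[i - 1]]++;
    }
}

/* Estimate P(sym | ctx) from the counts. Returns -1.0 when the context
   has never been seen; a real PPM coder would escape to a shorter
   context at this point instead. */
static double model_prob(const Order1Model *m, unsigned char ctx,
                         unsigned char sym)
{
    if (m->totals[ctx] == 0)
        return -1.0;
    return (double)m->counts[ctx][sym] / (double)m->totals[ctx];
}
```

After training on the string "abracadabra", the context 'a' has been followed by 'b' twice, and by 'c' and 'd' once each, so model_prob(&m, 'a', 'b') gives 0.5. The model is about 257 KB in size, so it is better declared static than placed on the stack.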
License
The source code is covered by the Bloom Public License.
Port of PPMZ to gcc
We have ported PPMZ v9.1 to Linux using gcc 2.95.2. Some routines could not be compiled without changes. We kept the changes minimal and made them with conditional compilation. The identifier 'unix' marks the changes related to the operating system. Details are presented in the files ppmz/gcc.compile and crblib/gcc.compile.
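The conditional compilation described above follows a common C pattern, sketched below. The macro PATH_SEPARATOR and the function base_name are hypothetical and do not come from the PPMZ sources; the actual changes are the ones listed in the gcc.compile files. The sketch assumes that 'unix' is either predefined by the compiler (gcc does this on Unix-like targets in its default GNU mode) or passed on the command line with -Dunix.

```c
#include <string.h>

/* Hypothetical example of an OS-dependent change guarded by 'unix'.
   The real PPMZ changes are documented in gcc.compile. */
#ifdef unix
#define PATH_SEPARATOR '/'
#else
#define PATH_SEPARATOR '\\'
#endif

/* Return a pointer to the file name part of a path, i.e. the text
   after the last separator, or the whole path if there is none. */
static const char *base_name(const char *path)
{
    const char *p = strrchr(path, PATH_SEPARATOR);
    return p ? p + 1 : path;
}
```

Guarding only the differing lines in this way leaves the original code path intact for other compilers, which is consistent with keeping the changes minimal.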
Old versions are preserved with the extension '.old'. Unmodified versions are preserved with the extension '.original'.
The port is available as ppmz.tar.gz (0.6 MB). There is also an executable for processors using the i386 instruction set, as well as a statically linked executable. For safety reasons we recommend recompiling from the source.
A port to Sparc is under construction.
Notes about PPMZ
PPMZ uses a lot of memory, especially with large input files. Any tests that record running time should also report the available main memory.
"HeaderLen not included in report of results because all info in the header is not necessary for compression decompression. see ppmzhead.c for details":
Header consists of:
- 4-char signature, for convenient decoding
- ulong CRC, for convenient error-checking
- ulong rawlen, so that the buffer size can be known (i.e. I can use fread instead of fgetc)
- 3-ulong RunTransform info, again for array allocation (the fact that these are not needed is proved by the fact that they are not passed to UnRunTransform)
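The header fields listed above can be pictured as a C structure. This is only a sketch: the names PpmzHeaderSketch, HEADER_LEN_SKETCH and payload_size are hypothetical, the field order simply follows the list, and ulong is assumed to be 32 bits wide, as unsigned long is on i386. The actual layout is defined in ppmzhead.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical picture of the header (illustration only; see
   ppmzhead.c for the real code). Assumes a 32-bit ulong. */
typedef struct {
    char     signature[4];      /* 4-char signature */
    uint32_t crc;               /* CRC for error checking */
    uint32_t rawlen;            /* length of the uncompressed data */
    uint32_t run_transform[3];  /* RunTransform info */
} PpmzHeaderSketch;

/* Header length under the 32-bit assumption: 4 + 4 + 4 + 3 * 4 bytes. */
enum { HEADER_LEN_SKETCH = 4 + 4 + 4 + 3 * 4 };

/* Size of a compressed file with the header excluded, as in the
   reported results. Returns 0 if the file is shorter than the header. */
static size_t payload_size(size_t file_size)
{
    return file_size > HEADER_LEN_SKETCH ? file_size - HEADER_LEN_SKETCH : 0;
}
```

Under these assumptions the header takes 24 bytes, so a compressed file of 124 bytes would be reported as 100 bytes.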
Test results
The PPMZ home page gives some test results on the files of the Calgary Corpus, which is also available from the Canterbury Corpus site. We repeated the tests with the ported version of PPMZ and got the following results:

PPMZ v9.1 results on the Calgary Corpus

file        raw size              compressed
            by Bloom     by us    by Bloom     by us
bib           111261    111261       24256     24256
book1         768771    768771      212733    212733
book2         610856    610856      143075    143074
geo           102404    102400       51635     59446
news          377109    377109      105725    105722
obj1           21504     21504        9854      9853
obj2          246814    246814       68804     68801
paper1         53161     53161       14772     14772
paper2         82199     82199       22749     22748
pic           513216    513216       50685     50685
progc          39611     39611       11180     11179
progl          71646     71646       13185     13185
progp          49379     49379        9122      9124
trans          93695     93695       14508     14508

Currently the Canterbury Corpus is the most popular compression benchmark. Below are the results for PPMZ and the best results reported on the result page of the Canterbury Corpus.

PPMZ v9.1 results on the Canterbury Corpus

file    raw size    compressed    bpc
                                  PPMZ     best reported
text      152089         39576    2.081    2.20
fax       513216         50685    0.790    0.77
Csrc       11150          2603    1.867    2.08
Excl     1029744        139748    1.085    0.83
SPRC       38240         11715    2.450    2.58
tech      426754         97460    1.827    1.95
poem      481861        133508    2.216    2.36
html       24603          6744    2.192    2.32
list        3721          1048    2.253    2.40
man         4227          1514    2.865    2.98
play      125179         36547    2.335    2.49
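The bpc figures for PPMZ can be recomputed from the raw and compressed sizes. The sketch below assumes that bpc means bits per character, that is 8 * compressed / raw; the function name bits_per_char is made up for this example.

```c
/* Bits per character: 8 bits for each compressed byte, divided by
   the number of characters (bytes) in the original file.
   Returns 0.0 for an empty input to avoid dividing by zero. */
static double bits_per_char(unsigned long raw, unsigned long compressed)
{
    if (raw == 0)
        return 0.0;
    return 8.0 * (double)compressed / (double)raw;
}
```

For the file text, 8 * 39576 / 152089 is about 2.0817, which agrees with the 2.081 given in the table.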
This page is maintained by Jorma Tarhio and Hannu Peltola, E-mail: tarhio at cs.hut.fi.
This page has been updated on April 10, 2002
URL: http://www.cs.hut.fi/u/tarhio/ppmz/