The mifluz classes WordDBCompress and WordBitCompress do the compression/decompression
work. From the list of keys stored in a page, the compressor extracts several lists
of numbers. The numbers within each list share statistical properties that
allow good compression.
The WordDBCompress_compress_c and WordDBCompress_uncompress_c functions are C callbacks that are called by the page compression code in BerkeleyDB. These callbacks in turn call the WordDBCompress compress/uncompress methods. WordDBCompress creates a WordBitCompress object that acts as a buffer holding the compressed stream.
Compression algorithm.
Most DB pages contain redundant data because mifluz chose
to store one word occurrence per entry.
Because of this choice the pages have a very simple structure.
Here is a real-world example of what a page can look like (key structure: word identifier + 4 numerical fields):
756 1 4482 1 10b
756 1 4482 1 142
756 1 4484 1 40
756 1 449f 1 11e
756 1 4545 1 11
756 1 45d3 1 545
756 1 45e0 1 7e5
756 1 45e2 1 830
756 1 45e8 1 545
756 1 45fe 1 ec
756 1 4616 1 395
756 1 461a 1 1eb
756 1 4631 1 49
756 1 4634 1 48
.... etc ....
To compress a page, we chose to encode only the differences between adjacent entries. A flag is stored for each entry indicating which fields have changed. When a field differs from the same field in the previous entry, the compressor stores the difference, which is likely to be small since the entries are sorted.
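The per-entry scheme described above can be sketched as follows. This is a minimal illustration and not the WordDBCompress code: the function names, the bitmask layout of the flag, and the use of signed differences are assumptions of the sketch, and the sample values are assumed to be hexadecimal.

```python
def delta_encode(entries):
    """Encode each entry as (flags, diffs) relative to the previous entry.

    flags is a bitmask (assumed layout): bit i is set when field i differs
    from the same field in the previous entry. diffs holds the signed
    difference of each changed field, in field order. The first entry is
    returned unchanged as the base.
    """
    base = entries[0]
    encoded = []
    prev = base
    for cur in entries[1:]:
        flags = 0
        diffs = []
        for i, (p, c) in enumerate(zip(prev, cur)):
            if c != p:
                flags |= 1 << i
                diffs.append(c - p)
        encoded.append((flags, diffs))
        prev = cur
    return base, encoded


def delta_decode(base, encoded):
    """Rebuild the original entries from the base entry and the deltas."""
    entries = [tuple(base)]
    prev = list(base)
    for flags, diffs in encoded:
        cur = list(prev)
        remaining = iter(diffs)
        for i in range(len(cur)):
            if flags & (1 << i):
                cur[i] += next(remaining)
        entries.append(tuple(cur))
        prev = cur
    return entries


# First four entries of the example page, read as hexadecimal (assumption).
page = [
    (0x756, 1, 0x4482, 1, 0x10B),
    (0x756, 1, 0x4482, 1, 0x142),
    (0x756, 1, 0x4484, 1, 0x40),
    (0x756, 1, 0x449F, 1, 0x11E),
]

base, encoded = delta_encode(page)
for flags, diffs in encoded:
    print(f"flags={flags:05b} diffs={diffs}")
```

Only the fields that changed produce a number, so an entry that differs from its predecessor in one field costs one flag and one small difference. Note that in this sketch the last field can yield a negative difference when an earlier field changes; how the real code handles that case is not covered here.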
The basic idea is to build columns of numbers, one for each field, and then compress them individually. In the example above, the first and second columns will compress very well since all their values are the same. The third column will also compress well since the differences between successive values are small, which leaves a small set of numbers to encode.
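As a rough illustration of the column view, the sketch below splits the first six entries of the example page into one list per field and computes the differences between successive values in each list. The helper names are hypothetical, the values are assumed to be hexadecimal, and the actual bit-level coding done by WordBitCompress is not shown.

```python
def to_columns(entries):
    """Turn a list of entries into one list of values per field."""
    return [list(column) for column in zip(*entries)]


def column_deltas(column):
    """Differences between successive values of one column."""
    return [b - a for a, b in zip(column, column[1:])]


# First six entries of the example page, read as hexadecimal (assumption).
page = [
    (0x756, 1, 0x4482, 1, 0x10B),
    (0x756, 1, 0x4482, 1, 0x142),
    (0x756, 1, 0x4484, 1, 0x40),
    (0x756, 1, 0x449F, 1, 0x11E),
    (0x756, 1, 0x4545, 1, 0x11),
    (0x756, 1, 0x45D3, 1, 0x545),
]

columns = to_columns(page)
deltas = [column_deltas(column) for column in columns]

for index, column in enumerate(deltas):
    print(f"field {index}: {column}")
```

The first two columns reduce to runs of zeros, and the third to small non-negative numbers, which is the kind of regularity a per-column coder can exploit.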