I wrote a simple filter for genomic data: it reads three bases at a time and packs them into one byte. Since there are four distinct bases, each base takes 2 bits, so a triplet takes 6 bits. Packing four bases per byte would not be convenient, because amino-acid transcription happens on a triplet basis and a loss of correlation might occur. The remaining two bits implement a limited RLE scheme; please look at the code below for details.

I made some experiments on the E.coli file of the Canterbury large corpus; the file was first packed with the filter below.

Code:
program gen; {$R-,S-}
var
  aa, oaa, cnt, lim, trip, base: byte;
  nb: word;
  fin, fout: file;
  fname: string[12];
begin
  write('Input file? ');
  readln(fname);
  write('RLE? ');
  readln(lim);
  if lim > 3 then lim := 3;            { repeat count must fit in 2 bits }
  assign(fin, fname);
  reset(fin, 1);
  fname := copy(fname, 1, pos('.', fname) - 1);
  assign(fout, fname + '.out');
  rewrite(fout, 1);
  oaa := 64;                           { 64 = sentinel: no pending triplet yet }
  cnt := 0;
  while not eof(fin) do
  begin
    aa := 0;
    for trip := 0 to 2 do
    begin
      if not eof(fin) then
        blockread(fin, base, 1, nb)
      else
      begin
        writeln('Sequence truncation');
        halt(1);
      end;
      case base of
        ord('c'), ord('C'): base := 0;
        ord('g'), ord('G'): base := 1;
        ord('a'), ord('A'): base := 2;
        ord('t'), ord('T'), ord('u'), ord('U'): base := 3;
      else
        begin
          writeln('The file does not appear to be a genome');
          halt(1);
        end;
      end;
      aa := aa + base shl (trip shl 1); { 2 bits per base, 6 bits per triplet }
    end;
    if (aa = oaa) and (cnt < lim) then
      inc(cnt)                          { extend the current run }
    else
    begin
      if oaa < 64 then                  { flush the pending triplet }
      begin
        oaa := oaa + cnt shl 6;         { repeat count in the top 2 bits }
        blockwrite(fout, oaa, 1, nb);
      end;
      oaa := aa;
      cnt := 0;
    end;
  end;
  if oaa < 64 then                      { flush the last pending triplet }
  begin
    oaa := oaa + cnt shl 6;
    blockwrite(fout, oaa, 1, nb);
  end;
  close(fin);
  close(fout);
end.
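To make the byte format concrete, here is an illustrative sketch in Python (the post's code is Pascal; the function names and the reading of the top two bits as "extra repetitions of the triplet" are my assumptions, not part of the original):

```python
# Byte layout: low 6 bits hold one triplet (2 bits per base, C=0 G=1 A=2 T=3,
# least significant base first); top 2 bits hold how many extra times that
# triplet repeats (0..3).

BASES = "CGAT"  # index matches the case statement in the Pascal program


def encode(seq: str, lim: int = 3) -> bytes:
    """Pack a base string the same way the filter does."""
    codes = {b: i for i, b in enumerate(BASES)}
    out = bytearray()
    prev, cnt = None, 0
    # ignore a trailing partial triplet (the Pascal program aborts instead)
    for i in range(0, len(seq) - len(seq) % 3, 3):
        trip = sum(codes[seq[i + j].upper()] << (2 * j) for j in range(3))
        if trip == prev and cnt < lim:
            cnt += 1                           # extend the current run
        else:
            if prev is not None:
                out.append(prev + (cnt << 6))  # flush the pending triplet
            prev, cnt = trip, 0
    if prev is not None:
        out.append(prev + (cnt << 6))          # flush the last triplet
    return bytes(out)


def decode(packed: bytes) -> str:
    """Inverse of encode: expand triplets and their repeat counts."""
    out = []
    for byte in packed:
        count = byte >> 6                      # extra repetitions (0..3)
        trip = byte & 0x3F                     # three 2-bit base codes
        bases = "".join(BASES[(trip >> (2 * j)) & 3] for j in range(3))
        out.append(bases * (count + 1))
    return "".join(out)
```

For example, six consecutive A's pack into the single byte 42 + (1 << 6): triplet value 42 with one extra repetition.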

The output of the above filter was then compressed with gzip and with glue (see the post Grouped (ROLZ) LZW compressor). Here are the results:

original file size    4638690
gzip size             1341254
glue size             1345972
filtered file size    1508628
gzip filtered size    1165463
glue filtered size    1388177
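As a quick check of what the filter buys (my own arithmetic on the sizes above, not figures from the experiment):

```python
original = 4_638_690       # E.coli, raw
filtered = 1_508_628       # after the triplet filter
gzip_raw = 1_341_254
gzip_filtered = 1_165_463

packed_no_rle = original // 3             # pure 3-bases-per-byte packing
rle_saving = packed_no_rle - filtered     # bytes saved by the 2-bit RLE
gzip_gain = 1 - gzip_filtered / gzip_raw  # gzip improvement from filtering

print(packed_no_rle, rle_saving, f"{gzip_gain:.1%}")
```

So only about 37 kB of the size reduction comes from the RLE; the rest is the 3-to-1 packing, and filtering buys gzip roughly a 13% smaller output.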

I chose to make this comparison because I suspected that the order-1 prediction inside glue would behave well on a DNA source, and indeed the difference in compression on the original file is small. For the filtered file, however, the story is sadder (to me): gzip benefits from the filter and compresses better, glue does not.

My interpretation is that increasing the number of symbols gives glue a larger dictionary, but at the same time it raises the average number of bits spent on the Huffman-coded group selection, and prediction over amino-acid codes is worse than over single bases. gzip, on the other hand, benefits from the expanded alphabet, and its entropy reduction based on the most recent occurrence makes the difference, as usual.