I wrote a simple filter for genomic data: it reads three bases at a time and packs them into one byte. Since there are four distinct bases, each base takes 2 bits, so a triplet takes 6 bits. Packing four bases per byte would not be convenient, because amino-acid transcription happens on a triplet basis and a loss of correlation might occur. The remaining two bits implement a limited RLE scheme; please look at the code below for details.

I made some experiments on the E.coli file of the Canterbury large corpus; the file was first packed with the filter below.

Code:
program gen; {$R-,S-}
var
  aa, oaa, cnt, lim, trip, base: byte;
  nb: word;
  fin, fout: file;
  fname: string[12];
begin
  write('Input file? ');
  readln(fname);
  write('RLE? ');
  readln(lim);
  if lim > 3 then lim := 3;            { repeat count must fit in 2 bits }
  assign(fin, fname);
  reset(fin, 1);
  fname := copy(fname, 1, pos('.', fname) - 1);
  assign(fout, fname + '.out');
  rewrite(fout, 1);
  oaa := 64;                           { 64 = sentinel: no pending triplet yet }
  cnt := 0;
  while not eof(fin) do
  begin
    aa := 0;
    for trip := 0 to 2 do
    begin
      if not eof(fin) then
        blockread(fin, base, 1, nb)
      else
      begin
        writeln('Sequence truncation');
        halt(1);
      end;
      case base of
        ord('c'), ord('C'): base := 0;
        ord('g'), ord('G'): base := 1;
        ord('a'), ord('A'): base := 2;
        ord('t'), ord('T'), ord('u'), ord('U'): base := 3;
      else
        begin
          writeln('The file does not appear to be a genome');
          halt(1);
        end;
      end;
      aa := aa + base shl (trip shl 1); { 2 bits per base, 6 bits per triplet }
    end;
    if (aa = oaa) and (cnt < lim) then
      inc(cnt)                          { extend the current run }
    else
    begin
      if oaa < 64 then                  { flush the pending triplet }
      begin
        oaa := oaa + cnt shl 6;         { repeat count in the top 2 bits }
        blockwrite(fout, oaa, 1, nb);
      end;
      oaa := aa;
      cnt := 0;
    end;
  end;
  if oaa < 64 then                      { flush the last pending triplet }
  begin
    oaa := oaa + cnt shl 6;
    blockwrite(fout, oaa, 1, nb);
  end;
  close(fin);
  close(fout);
end.
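To make the byte format concrete, here is an illustrative sketch in Python (the post's code is Pascal; the function names and the reading of the top two bits as "extra repetitions of the triplet" are my assumptions, not part of the original):

```python
# Byte layout: low 6 bits hold one triplet (2 bits per base, C=0 G=1 A=2 T=3,
# least significant base first); top 2 bits hold how many extra times that
# triplet repeats (0..3).

BASES = "CGAT"  # index matches the case statement in the Pascal program


def encode(seq: str, lim: int = 3) -> bytes:
    """Pack a base string the same way the filter does."""
    codes = {b: i for i, b in enumerate(BASES)}
    out = bytearray()
    prev, cnt = None, 0
    # ignore a trailing partial triplet (the Pascal program aborts instead)
    for i in range(0, len(seq) - len(seq) % 3, 3):
        trip = sum(codes[seq[i + j].upper()] << (2 * j) for j in range(3))
        if trip == prev and cnt < lim:
            cnt += 1                           # extend the current run
        else:
            if prev is not None:
                out.append(prev + (cnt << 6))  # flush the pending triplet
            prev, cnt = trip, 0
    if prev is not None:
        out.append(prev + (cnt << 6))          # flush the last triplet
    return bytes(out)


def decode(packed: bytes) -> str:
    """Inverse of encode: expand triplets and their repeat counts."""
    out = []
    for byte in packed:
        count = byte >> 6                      # extra repetitions (0..3)
        trip = byte & 0x3F                     # three 2-bit base codes
        bases = "".join(BASES[(trip >> (2 * j)) & 3] for j in range(3))
        out.append(bases * (count + 1))
    return "".join(out)
```

For example, six consecutive A's pack into the single byte 42 + (1 << 6): triplet value 42 with one extra repetition.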

The output of the above filter was then compressed with gzip and with glue (see the post Grouped (ROLZ) LZW compressor). Here are the results:

original file size    4638690
gzip size             1341254
glue size             1345972
filtered file size    1508628
gzip filtered size    1165463
glue filtered size    1388177
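As a quick check of what the filter buys (my own arithmetic on the sizes above, not figures from the experiment):

```python
original = 4_638_690       # E.coli, raw
filtered = 1_508_628       # after the triplet filter
gzip_raw = 1_341_254
gzip_filtered = 1_165_463

packed_no_rle = original // 3             # pure 3-bases-per-byte packing
rle_saving = packed_no_rle - filtered     # bytes saved by the 2-bit RLE
gzip_gain = 1 - gzip_filtered / gzip_raw  # gzip improvement from filtering

print(packed_no_rle, rle_saving, f"{gzip_gain:.1%}")
```

So only about 37 kB of the size reduction comes from the RLE; the rest is the 3-to-1 packing, and filtering buys gzip roughly a 13% smaller output.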

I chose to make this comparison because I suspected that the order-1 prediction inside glue would behave well on a DNA source, and indeed the difference in compression on the original file is small. For the filtered file, however, the story is sadder (to me): gzip benefits from the filter and compresses better, glue does not.

My interpretation is that increasing the number of symbols gives glue a larger dictionary, but at the same time it raises the average number of bits spent on the Huffman-coded group selection, and prediction over amino-acid codes is worse than over single bases. gzip, on the other hand, benefits from the expanded alphabet, and its entropy reduction based on the most recent occurrence makes the difference, as usual.