A new member of the paq8* family - specialized for (but not limited to) genomic sequencing data.
~ 6x faster than paq8px
Code:
- Forked from paq8px_v200 (https://github.com/hxim/paq8px)
- Removed most models except MatchModel, NormalModel, WordModel, DMCForest/DMCModel, LSTMModel
- Removed most transformations except EOL
- Removed most detections except Text
- Removed all pre-training
- Removed most command line options except Adaptive Learning Rate (A) and LSTM model (L)
- Removed some NormalModel mixer contexts, removed the fast order 1 model
- Renamed WordModel to LineModel, removed all contexts except some linemodel ones
- Increased the number of NormalModel contexts: 14->24
- Increased MatchModel minimum match length: 5->16, stepsize: 2->8
- Tuned mixer scaling factor: 940, 80 -> 2048, 256
This release does not yet perform any transformations for the FASTA or 2bit formats. Please apply your own transformations before compressing.
Suggested transformation for FASTA (multi-FASTA) files: 1) split the file into two sections, one with the titles and another with the sequence data; 2) remove line breaks from the sequence data.
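The suggested FASTA transformation could be sketched roughly as follows in Python. This is only an illustration, not part of paq8gen: the function names (split_fasta, fasta_to_sections) are made up for this example, and it assumes that one line break is kept between records in the sequence section so the records stay separable. It does not store the original line width, so restoring the exact original file would need that information saved separately.

```python
from typing import Iterable, List, Tuple


def split_fasta(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split (multi-)FASTA lines into titles and unwrapped sequences.

    Hypothetical helper for illustration only. Lines that appear before
    the first '>' title line are ignored.
    """
    titles: List[str] = []
    sequences: List[str] = []
    chunks: List[str] = []
    in_record = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new title closes the previous record, if there was one.
            if in_record:
                sequences.append("".join(chunks))
            titles.append(line)
            chunks = []
            in_record = True
        elif in_record:
            # Sequence line: collect it without its line break.
            chunks.append(line)
    if in_record:
        sequences.append("".join(chunks))
    return titles, sequences


def fasta_to_sections(lines: Iterable[str]) -> Tuple[str, str]:
    """Return (title section, sequence section) as two strings.

    Assumption: one newline separates records inside each section.
    """
    titles, sequences = split_fasta(lines)
    return "\n".join(titles), "\n".join(sequences)
```

The two returned strings can then be written one after the other into a single file (or into two files) and given to the compressor.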
Notes:
- Although it supports the usual -1..-12 compression levels, high levels do not seem to be needed: compression is almost the same at lower memory levels as at higher ones (for the FASTA format at least).
- Although it contains the LSTM model, that model seems to have no benefit on this kind of data.
- Although it still has the -a command line switch (Adaptive Learning Rate), the switch seems to have no benefit on this kind of data.
Source: https://github.com/GotthardtZ/paq8gen
Windows binaries: https://github.com/GotthardtZ/paq8gen/releases/tag/v1
Contributions and tweaks are welcome.