
Thread: SARS-CoV-2 Coronavirus Data Compression Benchmark

  1. #1
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts

    SARS-CoV-2 Coronavirus Data Compression Benchmark

    Quote Originally Posted by Shelwien View Post
    We don't know yet, ask again in June.
    For now, there's this: https://coronavirus.innar.com
    Looking at that data file, the most obvious transform to try initially is newline removal. The sequence is line-wrapped which hugely harms LZ matches. The first 1000 sequences cover 494041 lines of fasta. Hence:

    Code:
    jkb$ head -494041 ../coronavirus.fasta|xz | wc -c
    73256
    Or the same data pasting seq onto a single line:

    Code:
    jkb$ (head -494041 ../coronavirus.fasta|sed 's/>.*/<&</' | tr -d '\n';echo) |xz > _.xz
    jkb$ wc -c _.xz
    39036 _.xz
    jkb$ xz -d < _.xz | tr '<' '\012'|fold -60 | tail -n +2 > _2
    jkb$ head -494041 ../coronavirus.fasta|md5sum
    1e644e17c4dbe17b7b0fed633f107106  -
    jkb$ md5sum _2
    1e644e17c4dbe17b7b0fed633f107106  _2
    That's 47% smaller! Clearly this doesn't continue forever: over time the line wraps become less and less important, and full-length data has line wraps at the same places so they become part of the LZ match itself. However I suspect it's one of the key differences between 2-bit and FASTA (the other being the efficiency of entropy encoding small alphabets, which really harms Huffman).

    Edit: for full dataset xz gets 1651360. That's a significant win over the 2bit version. Brotli --quality=11 gives 1937169 on this transformed file too.


    I dusted off the old sam_comp SequenceSqueeze entry. I'm sure it's no longer state of the art (it really shouldn't be anyway!), but I'm curious how it does. One thing that's unexpected in that data set, though, is the large runs of Ns. It needs a new model to cope with those.

    The philosophy there is to align all the reads to a common reference genome using an off-the-shelf sequence alignment / mapping tool (eg bwa mem) and then encode the SAM file using POS + CIGAR + SEQ to anchor every base pair to a reference genome coordinate. It's then possible to build per-site models and encode accordingly. On the first 1000 seqs (as above) it got 35660 bytes. However it performs badly on the entire data set. I'm unsure why.
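    To make that anchoring idea concrete, here's a minimal sketch (not the actual sam_comp code; the helper names and the toy record are mine): walk a CIGAR string and emit, for every base, the reference coordinate it maps to, so per-site frequency tables can be trained.

    Code:
    import re
    from collections import defaultdict

    CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

    def anchor_bases(pos, cigar, seq):
        """Yield (ref_coord, base) pairs; ref_coord is None for unaligned bases."""
        ref, query = pos, 0
        for n, op in ((int(n), op) for n, op in CIGAR_RE.findall(cigar)):
            if op in "M=X":            # consumes both reference and query
                for i in range(n):
                    yield ref + i, seq[query + i]
                ref += n; query += n
            elif op in "IS":           # insertion / soft-clip: no reference anchor
                for i in range(n):
                    yield None, seq[query + i]
                query += n
            elif op in "DN":           # deletion / skip: consumes reference only
                ref += n

    # Per-site model: counts of each base observed at every reference coordinate.
    site_counts = defaultdict(lambda: defaultdict(int))
    for refpos, base in anchor_bases(1, "3M1D2M", "ACGTT"):
        if refpos is not None:
            site_counts[refpos][base] += 1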

    This still may not be as good as a large LZ tool as it cannot handle correlations. So a more accurate multi-seq encoding tool would produce a phylogenetic tree, updating it for each new sequence and encoding each base based on the probability it is in a particular branch of that tree. This means it starts to learn correlations. Eg we've seen 3 mutations already, which means we're 75% likely to be lineage X, 15% lineage Y and 10% something else. That makes upcoming SNPs more predictable.
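    As a toy sketch of that lineage-mixture idea (the lineage SNP profiles, probabilities and error rate below are invented for illustration): keep a posterior over candidate lineages, predict each site as a mixture of the lineages' expected bases, and reweight the posterior after every decoded base.

    Code:
    # Toy lineage-mixture predictor. Each lineage is {site: expected_base};
    # unlisted sites are assumed to match the reference.
    reference = "ACGTACGTAC"
    lineages = {
        "X": {2: "T", 7: "A"},       # hypothetical SNP profiles
        "Y": {2: "T"},
        "other": {},
    }
    posterior = {name: 1.0 / len(lineages) for name in lineages}
    EPS = 0.01                       # per-base error / de-novo mutation probability

    def predict(site):
        """P(base) at this site as a mixture over the current lineage posterior."""
        probs = {b: 0.0 for b in "ACGT"}
        for name, w in posterior.items():
            expected = lineages[name].get(site, reference[site])
            for b in "ACGT":
                probs[b] += w * ((1 - EPS) if b == expected else EPS / 3)
        return probs

    def update(site, observed):
        """Reweight lineages by how well they predicted the observed base."""
        for name in posterior:
            expected = lineages[name].get(site, reference[site])
            posterior[name] *= (1 - EPS) if observed == expected else EPS / 3
        total = sum(posterior.values())
        for name in posterior:
            posterior[name] /= total

    for site, base in enumerate("ACTTACGAAC"):
        p = predict(site)            # probabilities handed to the entropy coder
        update(site, base)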

    I'd expect heavy CM tools to learn and do better anyway, but the top performing tools are pretty costly.
    Last edited by JamesB; 1st January 2021 at 12:32.

  2. Thanks (2):

    Dresdenboy (31st December 2020),Mike (31st December 2020)

  3. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    I decided to also try the aligned coronavirus.fasta file through Scramble to the CRAM 3.0 format. I had to do some text shenanigans to reinject the original FASTA sequences as bwa seems to have a bug (or deficiency?) in that it replaces IUPAC codes with N. Anyway, after all that hackery, I got:

    Code:
    jkb@Orion:~/covid19/io_lib$ ./progs/scramble -9jZ -s50000 -V3.0 -r ../NC_045512.fasta ../coronavirus-iupac.sam _30.cram
    jkb@Orion:~/covid19/io_lib$ ls -l _30.cram
    -rw-r--r-- 1 jkb jkb 1582477 Dec 31 21:05 _30.cram
    
    # Optimised for ratio (uses about 6Gb RAM to encode):
    jkb@Orion:~/covid19/io_lib$ ./progs/scramble -9jZ -s1500000 -V3.0 -r ../NC_045512.fasta ../coronavirus-iupac.sam _30.cram
    jkb@Orion:~/covid19/io_lib$ ls -l _30.cram
    -rw-r--r-- 1 jkb jkb 1140638 Dec 31 21:22 _30.cram
    
    
    # simple decode
    jkb@Orion:~/covid19/io_lib$ samtools fasta _30.cram|fold -60 |sed 's/>.*/& /' > _2
    [M::bam2fq_mainloop] discarded 0 singletons
    [M::bam2fq_mainloop] processed 44918 reads
    jkb@Orion:~/covid19/io_lib$ cmp _2 ../coronavirus.fasta
    jkb@Orion:~/covid19/io_lib$
    It needs the ../NC_045512.fasta reference file to decode again, so a compressed copy of that should be added to file size.

    Note this isn't something unusual or custom for this exercise. It's the dominant file format for the ENA sequence archive combined with a standard tool. I did enable some less-used compression codecs though (bzip2 and lzma, normally disabled as we favour speed over size) and upped the slice size to 5x as many seqs. Upping the slice size further I can get to 1140638 bytes. That's quite competitive, although it's using a lot of memory by that stage to buffer the data prior to compressing the blocks.

    Decoder size is quite big (samtools.xz is about 205Kb), but it's a toolkit that does lots of other jobs, and for sure a decoder could be considerably smaller if it were purely for competition purposes rather than part of a general tool chain.

    Note I'd expect tools like SPRING or PgRC to do better still. As does the CRAM 4.0 draft (1096577 bytes).

  4. Thanks:

    xinix (1st January 2021)

  5. #3
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts

    SARS-CoV-2 Coronavirus Data Compression Benchmark

    Call for participation!

    https://coronavirus.innar.com/

    More information:

    https://arxiv.org/abs/2012.12013
    https://arxiv.org/pdf/2012.12013

    ( from another thread )

    Thank you for the shout out, Shelwien!

    Thank you for so many great ideas, James!

    JKB: However it performs badly on the entire data set. I'm unsure ...
    My hypothesis: I downloaded and included everything from the
    "Severe acute respiratory syndrome coronavirus 2 data hub" as of 13 Dec 2020. I have seen many papers excluding sequences which have a very different lengths compared to the reference file, but it seems that including everything could also be "a feature and not a bug". However, it could mess with some approaches, which look for differences.

    Happy New Year!
    Innar

  6. Thanks:

    Gotty (3rd January 2021)

  7. #4
    Member
    Join Date
    Nov 2015
    Location
    -
    Posts
    75
    Thanks
    300
    Thanked 18 Times in 14 Posts
    Quote Originally Posted by JamesB View Post
    I decided to also try the aligned coronavirus.fasta file through Scramble to the CRAM 3.0 format. I had to do some text shenanigans to reinject the original FASTA sequences as bwa seems to have a bug (or deficiency?) in that it replaces IUPAC codes with N. Anyway, after all that hackery, I got:
    Hi!
    Is it realistic to encode enwik8 in CRAM format?

  8. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by xinix View Post
    Hi!
    Is it realistic to encode enwik8 in CRAM format?
    No. It's a custom format designed for storing sequence alignments. It's not doing anything super smart or complex compression-wise, but structured data means structured application of standard compression tools. See:


    https://github.com/samtools/hts-spec...ter/CRAMv3.pdf

    I have a fork of my code that adds Bsc and Zpaq as internal codecs for the data chunks. It obviously helps, but at a CPU cost. There are other formats which are tackling the same space (such as Spring, Deez, or the mythical MPEG work). Some of these will be ahead of CRAM, but CRAM has momentum and for now is one of the more commonly used formats. BAM is the most common for this sort of data, but it's maybe double the size. There are still over a million CRAM files out there though, comprising multiple petabytes of storage. Given that volume of data it also needs to be fast and multi-threadable. It tries to aim for the sweet spot in speed vs ratio.
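    To illustrate what "structured application of standard compression tools" means in practice (a generic column-split sketch, not CRAM's actual layout): split the record fields into homogeneous streams and let each stream pick whichever general-purpose codec wins on it.

    Code:
    import bz2, lzma, zlib

    # Toy records: (name, length, bases). Real CRAM has many more data series.
    records = [("seq1", 29903, "ACGT" * 100), ("seq2", 29871, "ACGG" * 100)]

    # Split fields into homogeneous column streams, then compress each independently.
    columns = {
        "names":   "\n".join(r[0] for r in records).encode(),
        "lengths": b"".join(r[1].to_bytes(4, "little") for r in records),
        "bases":   "".join(r[2] for r in records).encode(),
    }
    codecs = {"zlib": zlib.compress, "bzip2": bz2.compress, "lzma": lzma.compress}

    for name, data in columns.items():
        best = min(((codec, len(fn(data))) for codec, fn in codecs.items()),
                   key=lambda t: t[1])
        print(name, "->", best)      # each column may choose a different codec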

    Note one key goal of CRAM (not applicable in this benchmark) is indexing and random access. Typically we align sequences to a reference genome (or to their assembly "contig" consensus sequences) and sort by position so browsers can dip in and explore specific regions. Eg see https://jbrowse.org/jbrowse1.html (showing delta to the reference).

    PS. Thanks whoever for splitting this off to its own topic.

  9. Thanks:

    xinix (1st January 2021)

  10. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by innar View Post
    My hypothesis: I downloaded and included everything from the
    "Severe acute respiratory syndrome coronavirus 2 data hub" as of 13 Dec 2020. I have seen many papers excluding sequences which have a very different lengths compared to the reference file, but it seems that including everything could also be "a feature and not a bug". However, it could mess with some approaches, which look for differences.

    Happy New Year!
    Innar
    If you use a sequence aligner (bwa, minimap, bowtie, SNAP, etc) then you'll find out which ones match fully, which match partially ("soft-clips" so the end fragments are written verbatim) and which don't match at all. It's reasonable to try different techniques for these different data types - eg denovo assembly of the unmapped records - but I think bwa basically aligned pretty much everything. Some sequence tools just go direct for multiple-sequence assembly rather than aligning to a reference. Spring does this. It can give better results sometimes. I still think though the best method would be a tree based analysis, to avoid repeatedly encoding the same combinations of variants.

    There were a lot of incomplete segments ("N"s). I assume this is where the original data failed, although it depends on sequencing technology. I know most about the Illumina ones as my work involves those. They have 100ish PCR primer pairs to amplify up a small part of the genome (an "amplicon"), with those fragments overlapping each other. Thus if all PCR primers work then you get full coverage. If a couple of them fail you'll get a drop out and part of your genome is unknown, which is the runs of Ns. Eg see the 65C figure in https://www.protocols.io/view/covid-...=64939&step=11 for an example of amplicon 64 dropping to zero coverage.

    The actual raw data is available too - rather than just the consensus sequences. That's MUCH MUCH bigger and an entirely different and more complex problem! This is what CRAM was really designed for. I just cheated a bit by massaging the consensus sequences into the same format. I must admit I didn't expect it'd do that well. It's not really optimal for sequence only, as the bulk of our effort went into the more voluminous types of data (quality strings and auxiliary tags).

    You can find some of the raw data in EBI's ENA page. My search to https://www.ebi.ac.uk/ena/browser/advanced-search was: tax_eq(2697049) AND submitted_format="cram"

    An example hit (one of >80,000): https://www.ebi.ac.uk/ena/browser/view/ERR4329202

  11. #7
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Here's a quick entry based on LILY and a pre-processor for the 2bit format, for a total of 1.169.683 bytes for the 2 files ("lily.zip" and "1").
    Just run "3.bat" to test it. On my machine it takes 116s to decompress.

    The pre-processor is more suited for LZ-based compressors. I only tested with 7zip (LZMA), the best it could do was:
    Code:
    1.806.659 bytes       coronavirus.fasta
    1.954.416 bytes       coronavirus.2Bit
    1.651.564 bytes       coronavirus.2Bit pre-processed
    So still a bit far from paq8l, but much faster decompression.

    Just for kicks I'm running (or trying to run...) paq8px_v200 on it, at level 12 and with the LSTM active, on the 2Bit file. Should hopefully be done before 2022

  12. Thanks (3):

    Gotty (3rd January 2021),jethro (5th January 2021),xinix (1st January 2021)

  13. #8
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by mpais View Post
    Here's a quick entry based on LILY and a pre-processor for the 2bit format, for a total of 1.169.683 bytes for the 2 files ("lily.zip" and "1").
    Just run "3.bat" to test it. On my machine it takes 116s to decompress.
    Thanks! I have successfully validated the entry - it took 102s to decompress on a Windows 10 Education @ Intel Core i5-7500 CPU @ 3.40 GHz, 16GB RAM (without internet) - md5sum matches. I have added the entry by mpais as the new leading entry of the benchmark.

  14. Thanks:

    Gotty (3rd January 2021)

  15. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    I'm confused on the 2bit encoding. The biorxiv link refers to the UCSC genome browser. Googling I found this:

    http://www.genome.ucsc.edu/FAQ/FAQformat.html#format7

    That describes a 2-bit format as "T - 00, C - 01, A - 10, G - 11" plus indicators for blocks of N, but it makes no reference to IUPAC (ambiguity codes), eg R, Y, S etc. These do occur in the FASTA files. I don't see any twoBit to FASTA conversion tool mentioned in that UCSC page. Does such a thing exist? If so has it actually been validated to confirm the encoding isn't lossy? My suspicion is that it is. It needs validating as it'd break the competition if it's not a reversible transform.
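    One quick validation (a sketch; it assumes coronavirus.fasta is in the working directory): scan the sequence lines for anything outside A, C, G, T, N - any hit means a plain 2-bit+N encoding cannot be lossless.

    Code:
    from collections import Counter

    counts = Counter()
    with open("coronavirus.fasta") as fh:
        for line in fh:
            if not line.startswith(">"):           # skip the ID/description rows
                counts.update(line.strip().upper())

    iupac = {c: n for c, n in counts.items() if c not in "ACGTN"}
    print("IUPAC ambiguity codes found:" if iupac else "Only ACGTN present.", iupac)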

    Edit: as mentioned above, if you want a basic transform to improve off-the-shelf compressors, consider simply un-line-wrapping the sequence portion. That's a HUGE win, far more so than conversion to 2-bit for many tools.

  16. Thanks:

    Gotty (3rd January 2021)

  17. #10
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Hi James,

    Your suspicion is correct, 2bit is lossy regarding IUPAC. However, to my knowledge, coronavirus.fasta does not contain them (have you noticed otherwise?):

    $cat coronavirus.fasta | grep N | grep -v ">" | wc -l
    446916
    $cat coronavirus.fasta | grep S | grep -v ">" | wc -l (excluding the ID rows with ">")
    0
    $cat coronavirus.fasta | grep R | grep -v ">" | wc -l
    0
    $cat coronavirus.fasta | grep Y | grep -v ">" | wc -l
    0

    Source of twoBitToFa:
    https://hgwdev.gi.ucsc.edu/~kent/src...ls/twoBitToFa/

    Source of faToTwoBit:
    https://hgwdev.gi.ucsc.edu/~kent/src...ls/faToTwoBit/

    I did have to manually (sed etc) help the twoBitToFa transform to get back the original (identical) file, though for participants this is not necessary as long as the decompressed 2bit version is identical and the 2bit conversion has not been a lossy transform. I agree -- that last one is a big 'IF' and I have to validate it. My intuition is that in the short term the .2bit file will give better results, but in the long term .fasta (where the algorithm itself has to have a better transform) would be amenable to more compression. Of course, if 2bit is lossy (re: the information it contains) in this case, I am prepared to remove it.

    Regarding un-line-wrapping: I see value in using a dataset downloaded directly from a respected repository, and instead I would acknowledge that clever set of bytes
    (sed 's/>.*/<&</'|tr -d '\n' and tr '<' '\012'|fold -60 | tail -n +2), which clearly shows that an additional ca 30 bytes at compression and decompression can make a huge difference. Pre-un-line-wrapping the data would not acknowledge such a clever and simple transform

    Innar

  18. #11
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by JamesB View Post
    Edit: as mentioned above, if you want a basic transform to improve off-the-shelf compressors, consider simply un-line-wrapping the sequence portion. That's a HUGE win, far more so than conversion to 2-bit for many tools.
    Well, your result for xz is a near match to the one I got with LZMA, so all things being the same, the compression time for the 2Bit file is reduced significantly. However...

    Quote Originally Posted by innar View Post
    However, to my knowledge, coronavirus.fasta does not contain them (have you noticed otherwise?):
    No, JamesB is right, there are a lot of IUPAC codes in the coronavirus.fasta file, I just checked. So the files aren't really equivalent, sorry.
    I see you spent a lot of time running paq8l and cmix on the 2Bit file, but since the information content between the 2 versions isn't the same, I think you need to choose just one of them.

  19. Thanks:

    Gotty (3rd January 2021)

  20. #12
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Here's a new submission, again based on LILY but this time on the fasta variant.
    Since the rules are the same as those of the LTCB, the submission size is 1.109.894 bytes ("lily.zip" and "1"): in the zip I'm including the (minified) source code for the preprocessor, so "2.exe" can simply be obtained by compiling that code.

    I believe the choice of challenge format isn't suited to the stated goal, it has the same problem as the LTCB.
    If the purpose is to create better models so as to advance "the understanding of the underlying context and data" in general, a format where the whole dataset is public will inevitably lead to overfitting.
    Even if that is somehow acceptable, I fail to see of what use it would be if I simply submit a decompressor + payload that regenerates the dataset but from which no further insights can be gleaned.

  21. Thanks (3):

    Gotty (3rd January 2021),JamesB (3rd January 2021),xinix (3rd January 2021)

  22. #13
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    127
    Thanks
    43
    Thanked 33 Times in 22 Posts
    Quote Originally Posted by innar View Post
    Hi James,

    Your suspicion is correct, 2bit is lossy regarding IUPAC. However, to my knowledge, coronavirus.fasta does not contain them (have you noticed otherwise?):

    $cat coronavirus.fasta | grep N | grep -v ">" | wc -l
    446916
    $cat coronavirus.fasta | grep S | grep -v ">" | wc -l (excluding the ID rows with ">")
    0
    $cat coronavirus.fasta | grep R | grep -v ">" | wc -l
    0
    $cat coronavirus.fasta | grep Y | grep -v ">" | wc -l
    0

    Source of twoBitToFa:
    https://hgwdev.gi.ucsc.edu/~kent/src...ls/twoBitToFa/

    Source of faToTwoBit:
    https://hgwdev.gi.ucsc.edu/~kent/src...ls/faToTwoBit/


    Innar
    I don't know about your system but mine produces 7543 results for R

    Code:
    cat coronavirus.fasta | grep R | grep -v ">" | wc -l
    7543

  23. #14
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by mpais View Post
    No, JamesB is right, there are a lot of IUPAC codes in the coronavirus.fasta file, I just checked. So the files aren't really equivalent, sorry.
    I see you spent a lot of time running paq8l and cmix on the 2Bit file, but since the information content between the 2 versions isn't the same, I think you need to choose just one of them.
    Quote Originally Posted by mpais View Post
    Here's a new submission, again based on LILY but this time on the fasta variant.
    @innar: how about keeping the fasta variant? Since the 2bit variant lost some information, it is not the original one, or "the correct one", or the useful one for scientific purposes anyway. Also, the fasta is easier to read.
    If you agree, could you please rerun the baselines with fasta? No need to hurry. (Do you need a hand or hands from us in that?)
    It would also be interesting to rerun the baseline entries with the preprocessed file (see here). As also suggested by JamesB: those line breaks seriously hurt compression. Since all entries will use some preprocessing anyway, why not give that advantage to the baselines, too?
    Last edited by Gotty; 3rd January 2021 at 14:01.

  24. #15
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by mpais View Post
    Well, your result for xz is a near match to the one I got with LZMA, so all things being the same, the compression time for the 2Bit file is reduced significantly. However...
    It's a bit odd. I found xz to give better results than xz -9. I'm not sure what the default compression level is, but I didn't expect it to be better. That was on un-line-wrapped fasta, and it's much better than line-wrapped.

    Zstd shows a much more extreme benefit from removing the line wrapping, both in size and time.

    Code:
    jkb@Orion:~/covid19$ time zstd -19 < coronavirus.fasta|wc -c
    4154091
    
    real    0m56.072s
    user    0m55.453s
    sys     0m0.672s
    jkb@Orion:~/covid19$ time zstd -19 < coronavirus.unwrapped.fasta|wc -c
    2077998
    
    real    0m25.400s
    user    0m24.875s
    sys     0m0.531s
    I think presenting the data in this manner, or at least providing a trivial little script that converts to it, is sensible. It permits us to explore standard algorithms as part of our baseline.

    Once again, the optimal compression ratio isn't with the highest compression level:

    Code:
    jkb@Orion:~/covid19$ time zstd -15 < coronavirus.unwrapped.fasta|wc -c
    2066543
    
    real    0m19.471s
    user    0m19.047s
    sys     0m0.438s
    That's really quite speedy for that ratio.

    Edit: Although Brotli --quality=9 hammers it. 6.5s compression time and 1639413 bytes long. Better than xz too:

    Code:
    jkb@Orion:~/covid19$ time brotli --quality=9 < coronavirus.fasta | wc -c
    4959323
    
    real    0m12.830s
    user    0m12.281s
    sys     0m0.438s
    jkb@Orion:~/covid19$ time brotli --quality=9 < coronavirus.unwrapped.fasta | wc -c
    1639413
    
    real    0m6.478s
    user    0m6.000s
    sys     0m0.469s

  25. #16
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by mpais View Post
    I believe the choice of challenge format isn't suited to the stated goal, it has the same problem as the LTCB.
    If the purpose is to create better models so as to advance "the understanding of the underlying context and data" in general, a format where the whole dataset is public will inevitably lead to overfitting.
    Even if that is somehow acceptable, I fail to see of what use it would be if I simply submit a decompressor + payload that regenerates the dataset but from which no further insights can be gleaned.
    The goal is an interesting and valid one, given there are many such datasets of 1000s of similar copies of the same genome. For evaluating the real-world benefit of such a tool, however, we'd be less interested in compressor / decompressor size, as that's a one-off cost and we may have 1000s of data sets we'd be using it on. This especially holds true when the compressed data is so small - around 1Mb. It starts to get into the realm of who can compress their decoder the best, which is a rather meta-competition. I see the logic though - we're trying to avoid overfitting the data or storing custom data-set-specific metrics in the decompressor. However the more general case would be many comparable datasets and averaging across multiple files.

    That's a lot more work and time though, so I think it's fine to explore just the one small file. I think it may be prudent though to have two leaderboards - with and without decompressor size. For the "without" case, let the submission authors state what files constitute the compressed file (eg the main file, an associated dictionary, and a little script to reformat the output) and what bits are general purpose and shouldn't be included.

  26. #17
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    127
    Thanks
    43
    Thanked 33 Times in 22 Posts
    From what I understand, the coronavirus file is the DNA of many coronaviruses? So they are very similar and thus the file is highly compressible. Maybe bsdiff is a good fit then? Because for each sequence, basically, you can't do better than 2 bits per ATGC.

    EDIT: So this is the reason why brotli is better than lzma. REP dists matter a lot for this file and brotli has many of them?

  27. #18
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by algorithm View Post
    From what I understand, the coronavirus file is the DNA of many coronaviruses? So they are very similar and thus the file is highly compressible. Maybe bsdiff is a good fit then? Because for each sequence, basically, you can't do better than 2 bits per ATGC.

    EDIT: So this is the reason why brotli is better than lzma. REP dists matter a lot for this file and brotli has many of them?

    Unsure on the differences in brotli vs lzma, but note they were very close in size (just not speed: brotli level 9, which beat qualities 10 and 11 on size, is hugely faster than lzma).

    Regarding the data, yes it's many coronavirus genomes. Some complete end-to-end and some not, be they truncated or just missing segments (base "N"). The full-length genome is around 30Kb (uncompressed), so that gives you an indication of the minimum window size an LZ algorithm would need, and in practice it needs much more to cope with the poor ones (eg to skip back a few Mb maybe to find one that doesn't have a run of "N"s at a key point).

    Obviously you can do better than 2 bits per ATGC in highly redundant data sets; the LZ-based tools are doing just that. The modelling-based ones gain from similar approaches too. A short genome like this doesn't need a complex model to be able to predict what comes next, especially as viral genomes don't contain lots of redundancy and repetitive elements.
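    A tiny demonstration of that on synthetic data (toy sequences, not the benchmark file): an adaptive order-k context model whose cross-entropy drops well below 2 bits/base once near-identical copies of the same genome start repeating.

    Code:
    import math, random
    from collections import defaultdict

    def bits_per_base(seq, k=8):
        """Cross-entropy of an adaptive order-k model with add-one smoothing."""
        counts = defaultdict(lambda: defaultdict(int))
        total_bits = 0.0
        for i, base in enumerate(seq):
            ctx = seq[max(0, i - k):i]
            c = counts[ctx]
            p = (c[base] + 1) / (sum(c.values()) + 4)   # 4-symbol alphabet
            total_bits += -math.log2(p)
            c[base] += 1
        return total_bits / len(seq)

    random.seed(0)
    genome = "".join(random.choice("ACGT") for _ in range(1000))  # one toy "genome"
    collection = genome * 20                                      # twenty identical copies
    print(round(bits_per_base(genome), 2), round(bits_per_base(collection), 2))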

  28. #19
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Some musings, to explain how my own code works and the sizes it gets.

    Enabling the libbsc support in my experimental CRAM branch I get this:

    Code:
    -rw-r--r-- 1 jkb jkb 973413 Jan  3 14:24 _.cram
    (without BSC enabled it was 1005359 bytes)
    
    jkb@Orion:~/covid19/io_lib$ ./progs/cram_size _.cram
    Block CORE              , total size           0
    Block content_id      11, total size        6476 g             n            RN
    Block content_id      12, total size          31          8                 QS
    Block content_id      13, total size         199 g    0   8                 IN
    Block content_id      14, total size       86408   l                        SC
    Block content_id      15, total size          68       14                   BF
    Block content_id      17, total size       22206 g               B          AP
    Block content_id      19, total size          13          8                 MQ
    Block content_id      25, total size       30340  b              B          RL
    Block content_id      26, total size       31119 g l                        FN
    Block content_id      27, total size       56404             3   B          FC
    Block content_id      28, total size      590966   l             B          FP
    Block content_id      29, total size        1264 g      4                   DL
    Block content_id      30, total size       14821      0   8      B          BA
    Block content_id      31, total size       93584       1         B          BS
    Block content_id      33, total size          69       14                   RI
    Block content_id      37, total size        1219        4                  
    Block content_id      42, total size       36170 g l             B          BB
    The various codec symbols there are "n"=name tokeniser, "B"=Bsc, "g"=Gzip(deflate), "l"=LZMA, and 0-9=various rANS codecs. Multiples per line as this contains discrete slices of data and each slice may choose a different codec.

    The 2 letter codes are the type of data being encoded. In order: Read Name, Quality Score, INserted bases, BAM Flag, Alignment Position, Map Quality, Read Length, Feature Number, Feature Code, Feature Position, Deletion Length, BAse, Base Substitution code, Reference ID, and Base Block (multi-base edits).

    The sequence is stored as a delta to the reference, with a sequence aligning at a specific position (AP) and a CRAM interpretation of the SAM/BAM "CIGAR" string. Eg in SAM (textual) we may see:

    CIGAR 7909S14032M3D3449M5D4384M, meaning 7909 soft-clipped (unaligned bases), 14032 matching (NB match can be match or mismatch, so it refers to both reference and query sequence marching together in sync), a 3 base-pair deletion (3D), 3449 match, 5 deletion, and finally 4384 match. These coordinates get turned into features, which have a code (mismatch, deletion, insertion, soft-clip, etc) in "FC" and a position in the sequence (relative to last feature) in "FP". These FP values appear to be the bulk of the file contents. Bsc does a better job than vanilla CRAM which is limited to bzip2 - that was 713xxx in size IIRC.

    The deletion lengths are stored in the "DL" data series (tiny). Base differences in M-regions are either a substitution code (ACGTN -> ACGTN as a numeric "BS") or, if the difference is something else, stored verbatim in "BA". The BS field is the next largest. Finally "SC" is the soft-clipped bases which weren't aligned against the genome.
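    A rough sketch of how those series relate (my own simplification, not the actual CRAM codec): each difference against the reference becomes a feature code, a position delta relative to the previous feature (the FP series), and either a compact ref->alt substitution code (BS) or a verbatim base (BA).

    Code:
    # Numeric ref->alt substitution codes over the ACGTN alphabet (the BS series).
    SUB_CODE = {(r, a): i for i, (r, a) in enumerate(
        (r, a) for r in "ACGTN" for a in "ACGTN" if r != a)}

    def features(ref, query):
        """Emit (FC, FP, payload) per difference; FP is relative to the last feature."""
        out, last = [], 0
        for pos, (r, q) in enumerate(zip(ref, query)):
            if r != q:
                if r in "ACGTN" and q in "ACGTN":
                    fc, payload = "X", SUB_CODE[(r, q)]   # substitution -> BS code
                else:
                    fc, payload = "B", q                  # anything else -> verbatim base (BA)
                out.append((fc, pos - last, payload))
                last = pos
        return out

    print(features("ACGTACGTAC", "ACGAACGTCC"))
    # [('X', 3, <T->A code>), ('X', 5, <A->C code>)]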

    Obvious things that can be improved are:

    - SC: I'd expect all of this data to be aligned properly, but the sequence aligner I used (bwa) split things up a bit into multiple records, which gave a high proportion of clipped bases. Changing aligner or parameters would probably shrink the file by another 50Kb or so (guesswork).

    - FP: this may perhaps be best done via a "tab stop" type of idea. Many of the differences are at consistent genome locations, while others are either random denovo mutations or sequencing errors.

    - FP: again, maybe we'd want to produce several reference sequences instead of just 1, encompassing the most abundant sequences. "RI" is currently constant, but it'd then become an indicator of which reference copy was aligned against. That means we can collapse many of the commonly occurring mutations into a single number representing the virus lineage. What's left after that is simply random differences. This is comparable to what I mentioned above about maintaining a lineage tree and encoding to a model of the likelihood of being in each lineage based on what's gone before. (A toy sketch of the reference-selection step follows below.)
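    Here's the toy reference-selection sketch referred to above (the panel and the mismatch-count distance are simplifications; real data needs indel-aware alignment): pick whichever panel reference needs the fewest edits and store (RI, deltas).

    Code:
    def mismatches(a, b):
        """Difference count over the aligned length (toy; ignores indels)."""
        return sum(x != y for x, y in zip(a, b))

    def encode_against_panel(seq, panel):
        """Choose the panel reference with fewest differences; return (RI, deltas)."""
        ri, ref = min(enumerate(panel), key=lambda t: mismatches(seq, t[1]))
        deltas = [(i, b) for i, (r, b) in enumerate(zip(ref, seq)) if r != b]
        return ri, deltas

    panel = ["ACGTACGTAC", "ACGTACGTCC", "ACGAACGTAC"]   # hypothetical lineage references
    print(encode_against_panel("ACGAACGTAT", panel))      # -> (2, [(9, 'T')])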

    I should probably try and package this up as an entry, but the decoder is large as it's a general-purpose tool. The gzipped binary is 340Kb, and it needs another 9.5Kb for the gzipped reference genome. Anyway, I list this mainly to encourage people to think on it more.


    Edit: make that 921815 bytes (with bsc). I tweaked the parameters a bit.
    Last edited by JamesB; 3rd January 2021 at 17:36. Reason: Better :-)

  29. Thanks:

    Gotty (3rd January 2021)

  30. #20
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    I improved the transform a little, this entry gets down to 1.105.536 bytes ("lily.zip" + "1"). Same as before, the source code for the preprocessor is included.
    I think this is about as low as LILY will go; no point in using complex transforms when it would simply make a lot more sense to just write a specialized compressor.

    LZMA seems to benefit proportionally more from the transform though, best is now 1.483.227 bytes.

    @JamesB:

    I agree that the goal is interesting, but, from my layman's point of view, your experimental CRAM results are much more important to advance the field than an entry based on a closed-source highly specific decompressor, even if your entry is beaten by 10%.

  31. Thanks (2):

    Gotty (3rd January 2021),JamesB (3rd January 2021)

  32. #21
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    I have not yet validated the decompression, but for cmix v18 on coronavirus.fasta (96ae68def050d85b1b7f2f1565cb848f) the compressed archive is 939,904 bytes. I plan to calculate new baselines without line breaks (or whatever format seems most reasonable in consensus), but I would like to get opinions about the preprocessed file Gotty proposed. If I understand correctly, the identifiers are moved to the beginning of the file in the same order, which makes it a reversible transform, but no longer compatible with the FASTA standard? So, in a way, having (only) the FASTA version without line breaks would be most preferable?

    Gotty - is there a big difference between your preprocessed files and the JamesB's coronavirus.unwrapped.fasta?

  33. #22
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    coronavirus.unwrapped.fasta : JamesB removed the line breaks from the sequence data - if I understand correctly.
    coronavirus.fasta.preprocessed.full: in addition to the above un-wrapping, I also moved all descriptions to the top while keeping their order (let's call it un-interleaving).
    I would go for both transforms: un-wrapping + un-interleaving. Both are reversible, both help compression. It's not a problem if after the transformation we lose fasta-compatibility, as the process is reversible.
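    A sketch of that combined transform and its inverse (filenames and the 60-column re-wrap width are assumptions about this particular dataset): all '>' description lines first, in their original order, then one unwrapped sequence per line.

    Code:
    def forward(fasta_path, out_path):
        """Un-wrap the sequences and move all '>' lines to the top (order kept)."""
        names, seqs, cur = [], [], []
        with open(fasta_path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if cur:
                        seqs.append("".join(cur)); cur = []
                    names.append(line)
                else:
                    cur.append(line)
        if cur:
            seqs.append("".join(cur))
        with open(out_path, "w") as out:
            out.write("\n".join(names) + "\n" + "\n".join(seqs) + "\n")

    def inverse(pre_path, fasta_path, width=60):
        """Re-interleave and re-wrap to restore the original FASTA layout."""
        lines = open(pre_path).read().splitlines()
        n = sum(1 for l in lines if l.startswith(">"))
        with open(fasta_path, "w") as out:
            for name, seq in zip(lines[:n], lines[n:]):
                out.write(name + "\n")
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")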

    @JamesB,
    could you have a look at coronavirus.fasta.preprocessed.full? What do you think?

  34. #23
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    My inclination is to prefer straight FASTA, but separating names from sequences is an improvement too.

    Ideally however you'd produce two files: names only, and sequences only. The nature of the data is very different and it helps tools to know where that junction is, rather than pasting them together and hoping the compression tool can spot the signal change (or that it's small enough that it doesn't harm greatly).

    PS. I got to 862990 bytes now. More room still in the tank for sure. I may go for something a bit bigger though with a smaller decoder. Right now my zip would come in at 1217473 mainly due to a large "scramble" binary. I need to hack it and cull most of the dead-weight in there.

  35. #24
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    >>Ideally however you'd produce two files: names only, and sequences only.
    Moving the names to a separate file is a good idea. Even dropping the names altogether... We are to match those sequences anyway, the names section can have any content - it should not matter for the purpose of the challenge.

    >>PS. I got to 862990 bytes now.
    Well done, JamesB!

  36. #25
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    paq8px_v200 has finally finished compressing the 2Bit variant:
    Code:
    paq8px_v200 -12L
    1.155.540 bytes in 200300.84 sec
    I suggest just keeping the FASTA file then, either as-is or with just the sequences instead.

  37. #26
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Here's a slightly reduced cram_to_fasta conversion tool. It still has a lot of unnecessary code in there. I culled some of the low hanging fruit, but lots still to do. This was just using LZMA and rANS, which made the CRAM file bigger, but not hugely so (881855 bytes).

    The reference fasta is needed to decode it (cv_edit_multi.fasta plus index which is about 91k as it's 3 copies). Obviously that could be lzma compressed too given I'm using lzma internally. I don't know how you're doing the measurement, but I think it's probably fair to count zipped data + meta-data as one value, and maybe zipped data + meta-data + program as a second value.

    The zip is 1082138 bytes.

    PS. I haven't really explored what the sweet spot is for multiple references. It's a balance between having the most commonly occurring mutations as a reference, reducing the size taken to store the deltas, vs the growth in specifying which reference we delta against and obviously the reference itself. (I'm aware there is a solution staring me in the face here - using the code for specifying previous sequences as references so it bootstraps, but I don't have that coded up.)

    Edit: Sorry I forgot to mention this is an Ubuntu Linux binary, built under Windows Subsystem for Linux. (I assume it works on normal Linux, but I haven't tested it so that assumption may be wrong!). Source is simply https://github.com/jkbonfield/io_lib with a small custom cram to fastq conversion tool rather than needing scramble to go from CRAM to SAM plus a trivial perl one-liner.
    Attached Files

  38. #27
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by Gotty View Post
    >>Ideally however you'd produce two files: names only, and sequences only.
    Moving the names to a separate file is a good idea. Even dropping the names altogether... We are to match those sequences anyway, the names section can have any content - it should not matter for the purpose of the challenge.
    Well it depends on what you're trying to achieve I guess. The names are identifiers which may, in some FASTA formats, also have meta-data following them. Suitable meta-data here could be approximate geographic location and a time stamp, which would enable researchers to track mutations and spread over time.

    It's also worth noting that the order is arbitrary. If we want a lossless compression then we need to store that of course, but it's arguable that it's unnecessary.

    One workable strategy would be to sort the sequences into their lineages, so that the same sequences colocate together. It'd significantly help LZ algorithms. Indeed it may even help them by enough to offset the cost of storing which Nth record it is (~80-90kb).
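    A rough sketch of that reordering (greedy nearest-neighbour chaining on a cheap k-mer fingerprint, not a real lineage assignment; k is arbitrary): store the permutation alongside the reordered sequences so the original order is recoverable.

    Code:
    def kmer_set(seq, k=12):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def greedy_order(seqs, k=12):
        """Chain each sequence to its most similar not-yet-used neighbour (O(n^2))."""
        fps = [kmer_set(s, k) for s in seqs]
        order, used, cur = [0], {0}, 0
        while len(order) < len(seqs):
            best = max((i for i in range(len(seqs)) if i not in used),
                       key=lambda i: len(fps[cur] & fps[i]))
            order.append(best); used.add(best); cur = best
        return order

    seqs = ["ACGT" * 30, "TTCA" * 30, "ACGT" * 29 + "ACGA", "TTCA" * 29 + "TTGA"]
    perm = greedy_order(seqs)                 # e.g. [0, 2, 1, 3] - similar seqs colocate
    reordered = [seqs[i] for i in perm]       # this is what the LZ stage sees
    restored = [None] * len(seqs)
    for new_pos, old_pos in enumerate(perm):
        restored[old_pos] = reordered[new_pos]   # stored permutation keeps it lossless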

  39. #28
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by JamesB View Post
    One workable strategy would be to sort the sequences into their lineages, so that the same sequences colocate together. It'd significantly help LZ algorithms. Indeed it may even help them by enough to offset the cost of storing which Nth record it is (~80-90kb).
    James, in the case of infinite computing power or some massively ingenious algorithm, the generalized version of that strategy would be to find the permutation of sequences which compresses the most (not just according to their lineages or other pre-known external information). It is considerably cheaper to store the order than to search for the best permutation. Have you or anyone else seen any algorithms/papers working in that direction?

  40. #29
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by innar View Post
    James, in the case of infinite computing power or some massively ingenious algorithm, the generalized version of that strategy would be to find the permutation of sequences which compresses the most (not just according to their lineages or other pre-known external information). It is considerably cheaper to store the order than to search for the best permutation. Have you or anyone else seen any algorithms/papers working in that direction?

    No, but some of the other specialised sequence-collection compression tools may.

    There are a lot of tools doing raw fastq compression of the unaligned data, but IMO I view this as a fruitless exercise.

    The instruments produce millions of (mostly) small fragments. They either get assembled to a genome, and then potentially realigned against it to provide a visualisation of the evidence for that assembly, or they get directly aligned to a reference genome. This format (2008) was "BAM"; little more than a binary serialisation of the data chucked into block-wise gzip (bgzf) to permit random access. In 2012-3 it migrated to CRAM, which is where I joined the fray. CRAM wasn't my invention, but I did move it along considerably in subsequent versions.

    So why do I view compressing raw fastq as fruitless? It's not the end product, but an interim along the way. Because all the best fastq compressors are basically internally doing what the scientists do downstream - denovo assembly (ie looking for overlapping sequences), alignments vs a known reference, or possibly a hybrid of the two: reference-guided assembly. If you're going to do that, you may as well actually keep the data in that format, in genome alignment order, and use it directly, rather than just permitting the user to uncompress the original fastq and then start again with the process of assembly, alignment, etc. Also it's simply easier to use an off-the-shelf assembler or aligner and compress its output than to write your own.

    That said, genome alignment and sequence assembly can be CPU-expensive operations and you may wish to do that on a remote machine in the cloud, so there is a benefit to FAST local algorithms that compress your fastq prior to transferring elsewhere. So quick compression isn't fruitless, just the high-end optimal stuff (IMO). My own approach to this is to use an ultra-fast approximate sequence aligner (eg "SNAP") and then CRAM again, but once again it's just pragmatism. The tools exist already and it's just a quick bit of glue logic to combine them together.

    As for CRAM, it was designed for the raw data and this task is a bit different - it's compression of a collection of similar genome sequences. This is a specialist field and there are other tools dedicated to it. It's quite likely some of those will have evaluated the reordering process. Maybe something like clustering (again minimisers and sketches are the likely fast methods here) or phylogenetic tree building? I think the fact that CRAM works so well on this data is purely because the genome happens to be tiny, so it's far closer to raw sequence data than to a full genome, especially given most full genomes are human so the sequence lengths are chromosomal (250Mb for the largest).

    Edit: I should point out there are probably techniques that this field can learn from bioinformatics, particularly denovo assembly tools. They need to do all-vs-all comparisons of millions of fragments in order to identify which appear to be similar to one another. The most recent techniques use minimisers and sketches (see https://academic.oup.com/bioinformat...4/i110/3953951 for some terms and references) - basically fingerprint techniques for rapidly identifying which fragments are likely to contain the same portions of text. This is basically an identical task to dedup analysis for data compression, or for a non-streaming compressor it could form a starting point of a rapid approximate LZ matcher. It may be possible to build it on-the-fly for streaming too. Right now most LZ methods are still utilising standard hash tables.
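    For anyone unfamiliar with the term, a minimal minimiser sketch (k and window size are arbitrary; real tools use rolling hashes for speed): the lowest-hashing k-mer in each window of w consecutive k-mers joins the fingerprint, so similar sequences share most of their minimisers.

    Code:
    import hashlib

    def minimisers(seq, k=15, w=10):
        """Set of (position, k-mer) minimisers: the lowest-hashing k-mer in every
        window of w consecutive k-mers."""
        def h(kmer):
            return hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        picked = set()
        for start in range(len(kmers) - w + 1):
            best = min(range(start, start + w), key=lambda i: h(kmers[i]))
            picked.add((best, kmers[best]))
        return picked

    a = "ACGTTGCAAGGCTTACCGTAAGGTTCCAAGTT" * 4
    b = a[:60] + "T" + a[61:]                 # same sequence with one substitution
    fa, fb = minimisers(a), minimisers(b)
    print(len(fa & fb) / len(fa | fb))        # Jaccard similarity stays high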

  41. Thanks:

    Mike (5th January 2021)

  42. #30
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    76
    Thanks
    21
    Thanked 25 Times in 20 Posts
    Is there a chance for a "low hanging fruit" LZ77+EC based compressor which is able to handle insertions/deletions and single mutations better than current ones, using their rep-offsets to help? E.g. there is a very low probability per base of seeing a mutation while matching a long sequence against a similar one.
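    A toy illustration of why rep-offsets suit this (nothing like a real LZ parser; the offset here is fixed at 0, i.e. the sequence is assumed to align head-on to a similar previous one): a single substitution deep inside a long match then costs one literal plus a cheap "continue at the same offset" token rather than a fresh full offset.

    Code:
    def rep_delta(ref, seq):
        """Toy encoder: runs of agreement with ref become ('REP', length) tokens
        (implicitly 'same offset as before'); each mutation is a single literal."""
        tokens, i = [], 0
        while i < len(seq):
            n = 0
            while i + n < len(ref) and i + n < len(seq) and seq[i + n] == ref[i + n]:
                n += 1
            if n:
                tokens.append(("REP", n))
                i += n
            else:
                tokens.append(("LIT", seq[i]))
                i += 1
        return tokens

    ref = "ACGT" * 2000
    mut = ref[:5000] + "T" + ref[5001:]       # one substitution inside a long match
    print(rep_delta(ref, mut))                # [('REP', 5000), ('LIT', 'T'), ('REP', 2999)]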


