
Thread: SARS-CoV-2 Coronavirus Data Compression Benchmark

  1. #31
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by innar View Post
    I plan to calculate new baselines without line breaks (or whatever format seems most reasonable in consensus), but I would like to get opinions about the preprocessed file Gotty proposed? If I understand correctly - the identifiers are moved to the beginning of the file in the same order. Which makes it a reversible transform, but not compatible with FASTA standard any more? So, in a way, having (only) the FASTA version without line breaks would be most preferable?
    Quote Originally Posted by JamesB View Post
    The names are identifiers which may, in some FASTA formats, also have meta-data following them. Suitable meta-data here could be approximate geographic location and a time stamp, which would enable researchers to track mutations and spread over time.
    Based on both of your notes, I'm leaning toward (voting for) the transformation JamesB suggested for re-running the baselines: un-line-wrapping the original file.
    Transformed file attached.
    Attached Files

  2. Thanks:

    innar (10th January 2021)

  3. #32
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    If I am not mistaken, this command makes JamesB's approach identical to Gotty's transform:

    cat coronavirus.fasta | sed 's/>.*/<&</' | tr -d '\n' | tr '<' '\n' | tail -n +2 > coronavirus.fasta.un-wrapped
    printf "\n" >> coronavirus.fasta.un-wrapped

    (added \n to align with Gotty's file)
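
    For readers less familiar with sed/tr, here is the same pipeline broken out step by step (functionally identical to the one-liner above; only the layout and the comments differ):
    Code:
    # Step-by-step version of the un-wrapping pipeline above (bash).
    sed 's/>.*/<&</' coronavirus.fasta |          # surround each header line with < markers
      tr -d '\n' |                                # delete all newlines, joining the wrapped sequence lines
      tr '<' '\n' |                               # turn the markers back into newlines
      tail -n +2 > coronavirus.fasta.un-wrapped   # drop the empty leading line
    printf "\n" >> coronavirus.fasta.un-wrapped   # trailing newline, to align with Gotty's file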

    Both (Gotty's transform and JamesB's sed/tr/tail logic) produce a file with md5sum e0d7c063a65c7625063c17c7c9760708

    Would JamesB or somebody else mind double checking that the command under *nix produces a correct un-wrapped FASTA file?

    Thanks!

    PS! I started rerunning paq8l and cmix v18 on this file. I will announce on the web and elsewhere that the benchmark focuses on this transform once some extra eyes have confirmed the correctness of the file (I would prefer the original file + standard *nix commands for the transform).

  4. #33
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Verified under Lubuntu 19.04 64 bit.
    Generated successfully.
    1) the md5 checksum matches, 2) the content matches my version.

  5. Thanks:

    JamesB (11th January 2021)

  6. #34
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 168 Times in 124 Posts
    Code:
    CMV 0.3.0 alpha 1
    coronavirus.fasta (.2bit version is much worse)
    
    Compressed size              Time (1)     Options
    First 200 MiB   Whole file   Whole file
          207 KiB       940888         45 h   -m2 (decompression not verified)
          176 KiB       845497         ~5 d   -m2,0,0x1ba36a7f (optimized options based on the first 1+100 MiB) (decompression not verified)
    
    (1) Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz (up to 3.50GHz), 8 GB RAM DDR3, Windows 8.1 64 bits.
    I haven't tested the reverse-complement order yet (1 2).
    Edit: reverse-complement order with block size 2000 and re-feed last 16 bytes is worse.
    Last edited by Mauro Vezzosi; 26th January 2021 at 22:29. Reason: Updated optimized options

  7. #35
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    32
    Thanks
    5
    Thanked 10 Times in 5 Posts
    Hi Innar and all!

    I've just tried NAF [1] on the FASTA file. Among the different options, "-15" worked best on this data, producing a 1,590,505-byte archive.
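
    For reference, a compression and round-trip check would look roughly like the sketch below; the ennaf/unnaf binary names and the -o flag are assumptions based on the NAF repository [1] rather than options quoted in this post:
    Code:
    # Hedged sketch: compress at level 15 (the setting reported above) and verify a lossless round trip.
    ennaf -15 -o coronavirus.naf coronavirus.fasta        # binary name and -o flag assumed from [1]
    unnaf coronavirus.naf > coronavirus.roundtrip.fasta   # unnaf is assumed to write FASTA to stdout
    md5sum coronavirus.fasta coronavirus.roundtrip.fasta  # checksums should match if lossless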

    NAF is a compressor for archiving FASTA/Q files. It basically divides the input into headers, mask and sequence (and quality for FASTQ), and compresses each stream with zstd. This allows for good compactness and very fast decompression. I use NAF to store and work with terabytes of sequence data.

    (BTW, NAF supports IUPAC codes).

    Many other sequence compressors exist; some of them are compared in [2] and might be interesting to try on this data. That benchmark includes a 1.2 GB Influenza dataset, which should produce similar results to the Coronavirus one. Also note that the "Compressors" page [3] has useful notes about various compressors.

    [1] https://github.com/KirillKryukov/naf
    [2] http://kirr.dyndns.org/sequence-compression-benchmark/
    [3] http://kirr.dyndns.org/sequence-comp...ge=Compressors

    Cheers!
    Last edited by Kirr; 19th January 2021 at 05:30.

  8. #36
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    32
    Thanks
    5
    Thanked 10 Times in 5 Posts
    Quote Originally Posted by JamesB View Post
    There are a lot of tools doing raw fastq compression of the unaligned data, but IMO I view this as a fruitless exercise.
    I basically agree that there's no point doing elaborate work compressing the raw data that will be analyzed anyway.

    When dealing with FASTQ files, the usual tasks are: 1. Get them off the sequencer to the analysis machine. 2. Transfer them to computation nodes. 3. Archive them. 4. Send them to collaborators.

    The compressor for these purposes has to be quick, reasonably strong, and reliable (robust, portable). Robustness is perhaps the most important quality, which is not at all apparent from benchmarks. (This could be why gzip is still widely used).

    Among the dozens of compressors and methods, few seem to be designed for practical (industrial) use. Namely DSRC, Alapy, gtz, and NAF. DSRC unfortunately seems unmaintained (bugs are not being fixed). Alapy and gtz are closed source and non-free (also gtz phones home). So I currently use NAF for managing FASTQ data (no surprise). NAF's "-1" works well for one-time transfer (where you just need to get the data from machine A to machine B as quickly as possible). And "-22" works for archiving and distributing FASTQ data.

    One nice recent development in the field is the transition to reduced resolution of base qualities. In typical FASTQ data the sequence is easy to compress, but the qualities occupy the main bulk of the archive. Therefore some compressors have an option to round the qualities to reduce resolution. Recent instruments can now produce binned qualities from the beginning, making compression much easier.

    CRAM and other reference-based methods work nicely in cases where they are applicable. However, there are fields like metagenomics (or compressing the reference genome itself) where we don't have a reference. In such cases we still need general reference-free compression. The interesting thing is that these days data volumes are so large that a specialized tool optimized for specific data or workflows can make a meaningful difference. And yet most sequence databases still use gzip.

  9. #37
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    32
    Thanks
    5
    Thanked 10 Times in 5 Posts
    Regarding the use of 2bit data. First of all, fa2twobit itself is lossy. I found that it did not preserve IUPAC codes, line lengths or even sequence names (it truncates them). Also, using the 2bit data still requires decompression (other than with BLAT), while FASTA is a universal sequence exchange format. So I would rather remove the 2bit representation from the contest. Anyone interested can trivially convert the DNA into 2-bit themselves if they need it, potentially avoiding fa2twobit's limitations.

    But this raises a bigger question. Many (most actually) of the existing sequence compressors have compatibility issues. Some compress just plain DNA (no headers), some don't support N, IUPAC codes, mask (upper/lower case), etc.. Some have their own idea about what sequence names should look like (e.g. max length). Some compress only FASTQ, and not FASTA. How can these various tools be compared, when each is doing its own thing?

    When designing my benchmark [1] last year, I decided to try my best to adapt each of the broken/incomplete tools so that it still performs a useful task. So I made a wrapper for each tool, which takes the input data (a huge FASTA file) and transforms it into a format acceptable to that tool. E.g., if some compressor does not know about N, my wrapper will pick out all the Ns from the input, store them separately (compressed), and present the N-less sequence to the tool. Then, another wrapper works in reverse during decompression, reconstructing the exact original FASTA file. The wrapped compressors therefore all perform the same task and can be compared on it.
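
    As a toy illustration of that N-splitting idea (a shell sketch with a hypothetical input.fasta, not the benchmark's actual wrapper), the forward half could look something like this; decompression would re-insert the recorded N runs at their stored positions:
    Code:
    # Toy sketch of the N-splitting step (bash/awk); assumes the concatenated sequence fits in memory
    # and ignores headers, which a real wrapper would store as a separate stream.
    grep -v '^>' input.fasta | tr -d '\n' > seq.raw
    awk '{ off = 0; s = $0
           while (match(s, /N+/)) {
             printf "%d %d\n", off + RSTART, RLENGTH   # absolute position and length of each N run
             off += RSTART + RLENGTH - 1
             s = substr(s, RSTART + RLENGTH)
           } }' seq.raw > n_runs.txt                   # small side stream, compressed separately
    tr -d 'N' < seq.raw > seq.noN                      # N-less stream handed to the wrapped compressor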

    All my wrappers and tools used by them are available [2]. This should make it relatively easy to adapt any existing compressor to work on FASTA files. In a similar way, a general-purpose compressor can be wrapped using those tools, to allow stronger compression of FASTA files. It could be an interesting experiment to try wrapping various general-purpose compressors and adding them to the benchmark, along with the non-wrapped ones.

    [1] http://kirr.dyndns.org/sequence-compression-benchmark/
    [2] http://kirr.dyndns.org/sequence-comp...?page=Wrappers

  10. Thanks (2):

    innar (20th January 2021),JamesB (22nd January 2021)

  11. #38
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Dear Kirr,

    Thank you so much for such deep insights. I had an unexpected health issue, which took me away from the computer for a few weeks, but I will bounce back soon, hopefully by the end of this week, and work through the whole backlog of submissions, including yours. Your contribution is highly appreciated!

    Meanwhile, if anyone checks this forum, I would like to relay a question I got from one of the top 50 researchers in genetics: if someone suddenly got a (let's say) 20% (30%? 50%?) better [compression] result than others - how could this be turned into insight for professionals with deep knowledge of coronaviruses? What representation or visualization of the results (or tools) would enable a person who knows nothing about compression algorithms, but a lot about coronaviruses, to understand how such compression came about? I think this is an important and fundamental question for many benchmarks: how do we feed the "intelligence of better compression" back to the field? Any ideas?

  12. Thanks:

    Kirr (20th January 2021)

  13. #39
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    32
    Thanks
    5
    Thanked 10 Times in 5 Posts
    Thanks Innar, and please take good rest and get well!

    Quote Originally Posted by innar View Post
    if someone suddenly got a (let's say) 20% (30%? 50%?) better [compression] result than others - how could this be turned into insight for professionals with deep knowledge of coronaviruses? What representation or visualization of the results (or tools) would enable a person who knows nothing about compression algorithms, but a lot about coronaviruses, to understand how such compression came about? I think this is an important and fundamental question for many benchmarks: how do we feed the "intelligence of better compression" back to the field? Any ideas?
    One straightforward path for compression improvements to affect biological knowledge is by improving the accuracy of alignment-free sequence comparison. Better compression means a better approximation of Kolmogorov complexity and more accurate information distances. This can be used for many purposes, e.g., comparing entire genomes or investigating the history of repeat expansions.
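
    To make that concrete, the normalized compression distance NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)) can be estimated with any compressor standing in for C; a minimal sketch with xz as the stand-in and hypothetical input files:
    Code:
    # Minimal NCD sketch in bash; a stronger domain compressor (NAF, cmix, paq8gen, ...) can replace xz.
    csize() { xz -9e -c "$1" | wc -c; }
    ncd() {
      local cx cy cxy min max
      cx=$(csize "$1"); cy=$(csize "$2")
      cxy=$(cat "$1" "$2" | xz -9e -c | wc -c)
      if [ "$cx" -lt "$cy" ]; then min=$cx; max=$cy; else min=$cy; max=$cx; fi
      echo "scale=4; ($cxy - $min) / $max" | bc
    }
    ncd genomeA.fa genomeB.fa   # hypothetical files; output closer to 0 means more similar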

    I'm not sure if improved compression can help in studying coronaviruses specifically, because coronaviruses can be easily aligned, which allows more accurate comparison without compression. But many other topics can greatly benefit from better compression. E.g. see [1] for some overview.

    Quote Originally Posted by innar View Post
    20% (30%? 50%?) better [compression] result than others
    One other thing. I think there's too much emphasis on compression strength in this field. This is understandable, because in information science we dream about computing Kolmogorov complexity, so any step closer to approximating it must be welcome. However, compressor users often have a different balance of priorities, where compression strength is just one of several useful qualities. (This again explains the longevity of gzip in genomics.)

    I realized that many compressor developers mainly care about compression strength. They will spend huge effort fine-tuning their method to gain an extra 0.01% of compactness. But they are fine if their compressor works only on plain DNA sequence (no support for sequence names, N, IUPAC codes, or in some cases even end-of-line characters). Or if their compressor takes days (or weeks) to compress a genome (more problematic, but still common, is when it takes days for decompression too). Maybe it feels great to get that 0.01% of compactness, but it's often disconnected from applications.

    What users want in a compressor is a combination of reliability, strength, speed, compatibility and ease of use. The funny thing is that I did not want to develop a compressor. But I wanted to use one, because I was routinely transferring huge data back and forth among computation nodes. I was shocked to realize that among the ton of available DNA compressors there was not one suitable for my purposes. (Never mind another ton of papers describing compression methods without providing any actual compressor.)

    Currently, NAF is perfect for my own needs. But if I ask myself how it could be made even better, the answer (for me as a user) is not just "20% better compactness" (even though that would be great too). Instead it may be something like: 1. Random access (without sacrificing much compression strength). 2. A library for easier integration in other tools. 3. A built-in simple substring searcher. 4. Something about FASTQ qualities. Etc. E.g., [2] is an interesting recent development for practical uses.

    [1] Zielezinski et al. (2019) "Benchmarking of alignment-free sequence comparison methods" Genome Biology, 20:144, https://doi.org/10.1186/s13059-019-1755-7
    [2] Hoogstrate et al. (2020) "FASTAFS: file system virtualisation of random access compressed FASTA files" https://www.biorxiv.org/content/10.1....377689v1.full

  14. #40
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    It's also worth considering that this works both ways. The data compression community can improve bioinformatics tools, but the reverse is true too. I strongly suspect there may be applications for using minimisers as an alternative to hashing in rapid-LZ applications or when doing dedup. The ability to quickly identify matches across datasets tens of GB in size is something the bioinformatics community has put a lot of effort into. Similarly, some of the rapid approximate alignment algorithms may give better ways of describing not-quite-perfect matches than a series of neighbouring LZ steps.

  15. #41
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 168 Times in 124 Posts
    Code:
    cmv c -m2,0,0x0ba36a7f coronavirus.fasta coronavirus.fasta.cmv
     845497 coronavirus.fasta.cmv (~1700 MiB, ~5 days, decompression not verified)
     218149 cmv.exe.zip (7-Zip 9.20)
    1063646 Total

  16. Thanks:

    JamesB (27th January 2021)

  17. #42
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Nice work. I think that's a bit smaller than the smallest CRAM I managed to produce (albeit *somewhat* slower!).

    I wonder what correlations it's discovering.

  18. #43
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 168 Times in 124 Posts
    Quote Originally Posted by JamesB View Post
    I wonder what correlations it's discovering.
    It's not easy to say:
    - the 97 sparse match models seem useful;
    - less important, but still more important than the other models, are the 15 bit history on order 0 and the 3 on order 1;
    - the other models seem to have less influence;
    - final mixer is in its longest chain: 6 mixers -> 3 mixers -> 1 mixer.

    This file surprised me:
    - cmv works better on .fasta, other programs do better on .2bit.
    - on average cmv is a little better than cmix only in small DNA files (maybe just because it initially fits faster), but on long files (I mean > a few hundred KB or a few MB) cmix is usually better.
    - reverse-complement order worsens the compression (both on initial and unwrapped .fasta).

    Now I am working on the unwrapped version to find some good cmv options. I still don't know exactly how much better it will be; it looks like it could be ~10%, or just a few tens of KB, smaller than the initial .fasta.
    The time currently taken to compress the entire file is "only" about 3 days.

  19. #44
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    It doesn't surprise me that it works better on the fasta, as the 2bit encoding is just obfuscating things. The reason it helps other compressors is that it turns a naive O1 encoding model into O4 (for example). Simple stats work better on the bit-packed file, but anything diving deep into the correlations will be harmed. It may also harm some LZ matches if there are insertions, as it may get a "frame shift". (E.g. a single 3-base codon being removed means all following bytes then differ in 2bit form.)

    I also wouldn't expect reverse complementing to work here, as these are consensus genomes which are all in the same orientation, rather than raw sequencing fragments which are often in random orientations.

    The extreme compression ratios here will also be really stressing the accuracy of probability models. We don't normally see such high compression ratios so 16 bit or even smaller counters are fine, but I think they'd be detrimental here.

  20. #45
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    I removed all the useless stuff from LILY and adapted it; this new entry weighs in at 853.542 bytes (13.007 bytes for the zipped decompressor "lily.zip" and 840.535 bytes for the payload "1").

    Since the execution time isn't important, the decompressor is optimized for size instead of speed. Decompression time is 230s on an i7 5820k@4.4Ghz, so about 5,5 MB/s.

    Code:
    Usage:
    2.exe 1 coronavirus.fasta

  21. Thanks:

    Mike (30th January 2021)

  22. #46
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Well, it seems this is actually easier than I expected, so here's a nice >100KB improvement.
    The total for this entry is 749.473 bytes (decompressor "lily.zip": 13.119 bytes, payload "1": 736.354 bytes)

    This version is a bit slower, mostly because I increased memory usage a bit.
    Decompression time is 257s, so about 5MB/s.
    Usage is the same as the previous version.

  23. Thanks (4):

    hexagone (31st January 2021),JamesB (4th February 2021),Mike (31st January 2021),schnaader (31st January 2021)

  24. #47
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Well done, mpais!
    Does this LILY have some version number or marker to distinguish it (her?) from the previous LILY?
    How does it (or a generalized version of it) perform on this testset? https://encode.su/threads/2105-DNA-Corpus

  25. #48
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by Gotty View Post
    Well done, mpais!
    Does this LILY have some version number or marker to distinguish it (her?) from the previous LILY?
    How does it (or a generalized version of it) perform on this testset? https://encode.su/threads/2105-DNA-Corpus
    It's based on the same core techniques used in LILY, and since it's just the decompressor I didn't see a problem in just keeping the name (naming stuff is probably the hardest problem in programming ).

    The real challenge here is that since the compression ratios are so high, the decompressor size is not negligible at all, so I'll probably hit diminishing returns much faster than expected, and it won't make sense to go for extreme compression if it bloats the decompressor size by more than it gains in compression.

    Still, it might be interesting to see how low we can go, if nothing else because it might be useful to the experts (@JamesB, @innar, @Kirr, feel free to correct me) to have a better tool to assist in using NCD in their field, even if the compression speed is just 2MB/s.

  26. #49
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by mpais View Post
    Still, it might be interesting to see how low we can go, if nothing else because it might be useful to the experts (@JamesB, @innar, @Kirr, feel free to correct me) to have a better tool to assist in using NCD in their field, even if the compression speed is just 2MB/s.
    Do you think paq8gen could be a good tool to see how low we can go? There you are not restricted by speed/memory/exe size.
    Funny how you and I have had such similar results at almost the same time, twice already in this challenge. (This time: LILY: 736.354, paq8gen_v3: 723'319 bytes.)
    Even if the current paq8gen beats the current LILY by 13K, the exe still makes "the package" bigger (no wonder, it's a full-blown compressor - nothing is stripped to make it smaller). I don't mind - I don't intend to strip the exe. My goal is to make paq8gen a universal genetic data compressor tool - not one tailored to the current challenge, though the current challenge helps tremendously in giving it an "initial" shape.

  27. #50
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by Gotty View Post
    Do you think paq8gen could be a good tool to see how low we can go? There you are not restricted with speed/memory/exesize
    Sure, but paq is just too slow. I'd wager the best option would be to use CRAM and just write specific codecs for each segment.

    Quote Originally Posted by Gotty View Post
    Even if the current paq8gen beats the current LILY by 13K
    Well, this version of LILY is purposefully handicapped
    But it's nice to know it can keep up with paq8gen whilst being much faster.

  28. #51
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by mpais View Post
    Sure, but paq is just too slow. I'd wager the best option would be to use CRAM and just write specific codecs for each segment.
    Impressive work by the way on your entry.

    As for CRAM, the standard has a huge amount of bloat and so the decoder is inherently quite large, unless I were to spend a lot of time removing fluff from it. I was just idly playing with this though as a curio more than anything else.

    You're right that the more interesting thing from my perspective is writing specific codecs for each type of data.

  29. #52
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Here's a new entry, the total is 712.323 bytes (decompressor "lily.zip": 13.456 bytes, payload "1": 698.867 bytes). Again, same usage as the last version.

    Since the purpose of the challenge is to go as low as possible, I massively increased the memory used (now almost 3,3GB) to make sure it could break the 700.000 byte barrier.
    As expected, in squeezing those last couple of KB, performance took a nosedive, so decompression time is now 479s (about 2,66 MB/s).

  30. Thanks (3):

    Gotty (7th February 2021),JamesB (16th February 2021),Mike (6th February 2021)

  31. #53
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Well done!
    Very impressive!
    Still head-to-head with paq8gen.

    @innar: the challenge website hasn't been updated lately. Do you need any help in re-running the baselines? Is the challenge still on?

  32. #54
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Hi Gotty and others!
    The challenge is still on! Unfortunately, I had health issues which limited my ability to access computers. I'm recovering now. Since you so kindly offered help, I will send you a message ASAP.

  33. #55
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Another entry, this time the total is 682.262 bytes (decompressor "lily.zip": 13.754 bytes, payload "1": 668.508 bytes).
    Decompression time is now 539s (2.37 MB/s)

  34. Thanks:

    innar (16th February 2021)

  35. #56
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by mpais View Post
    Another entry, this time the total is 682.262 bytes (decompressor "lily.zip": 13.754 bytes, payload "1": 668.508 bytes).
    Decompression time is now 539s (2.37 MB/s)
    Hi mpais!
    Thanks! Working through the backlog now.
    What is the correct command for execution?

    For both of your entries, entry 6 and entry 7, 2.exe coronavirus.fasta results in an empty file. I tested on Windows 10 machines with 4GB RAM and 8GB RAM - could it be a memory or command-line issue?

  36. #57
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by innar View Post
    Hi mpais!
    Thanks! Working through the backlog now.
    What is the correct command for execution?

    For both of your entries, entry 6 and entry 7, 2.exe coronavirus.fasta results in an empty file. I tested on Windows 10 machines with 4GB RAM and 8GB RAM - could it be a memory or command-line issue?
    Code:
    Usage:
    2.exe 1 coronavirus.fasta
    This last version uses about 3.8GB of RAM.
    There's also no need to test previous entries; each new entry supersedes all previous ones.

  37. #58
    Member
    Join Date
    Dec 2020
    Location
    Tallinn
    Posts
    16
    Thanks
    4
    Thanked 3 Times in 3 Posts
    I had a typo in my previous message - yes, "2.exe 1 coronavirus.fasta" was the command I was testing, but on 4GB, 8GB, and (now) 16GB Windows 10 computers it produces a 0-byte coronavirus.fasta after 2-3 seconds and exits. Any hints? Your first entry decompressed without any issues, but it had another executable in the .bat?

  38. #59
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by innar View Post
    I had a typo in my previous message - yes, "2.exe 1 coronavirus.fasta" was the command I was testing, but on 4GB, 8GB, and (now) 16GB Windows 10 computers it produces a 0-byte coronavirus.fasta after 2-3 seconds and exits. Any hints? Your first entry decompressed without any issues, but it had another executable in the .bat?
    The first entries were for the generic LILY, so they needed a separate executable to undo the transformation; these new versions are designed for this specific task, so they don't need it.

    I also tested on a Windows 10 tablet (Intel Core m3-7y30 with 4GB) and, aside from being 2,5x slower (mostly due to all the swapping), it worked fine, so I don't think the problem is in the memory allocation.

    It does, however, require an AVX2-capable processor (no runtime detection, to keep the file size down) - could that be the problem?

    @all
    Is anyone else having the same problem?

  39. #60
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    I could reproduce your latest entry without problems on my laptop.

    The resulting file size after "2.exe 1 coronavirus.fasta" was 1,339,868,341 bytes, and the md5sum matches the sum from the benchmark site (96ae...). Decompression took 566 seconds; peak memory usage was ~3000 MB (task manager).

    Test system: AMD Ryzen 5 4600H, 6 x 3.00 GHz, 16 GB RAM, Windows 10 Professional 64-Bit
    http://schnaader.info
    Damn kids. They're all alike.

  40. Thanks:

    mpais (14th February 2021)


