Page 3 of 3
Results 61 to 68 of 68

Thread: paq8gen - sequence compressor

  1. #61
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    741
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by mpais View Post
    Also, it's funny seeing how you find and tackle the same things as I do, like the low precision problem (you really went scorched earth on that one, 28 bits of precision?).
    It's funny indeed. LILY and paq8gen are always head to head. So following that pattern, now paq8gen must go under 600K. OK. Don't know how on earth it's gonna be, yet. Feel free to add your magic there.

    Quote Originally Posted by mpais View Post
    If you're serious about the sars-cov-2 benchmark, why not just create a new branch of paq8gen on the repo dedicated to it, instead of polluting the main version with a bunch of #defines?
    I'm not that serious about squeezing the last bits from the exe. (You can see it's not my priority.) Pushing compression further is what I'm into. When enhancing paq8gen I'm also checking my changes against two other corpora: what I call the small DNA corpus (https://encode.su/threads/2105-DNA-Corpus) and the large one: https://tinyurl.com/DNAcorpus (referenced from GeCo3: https://github.com/cobilab/geco3). So for me it's a threefold challenge.
    Paq8gen is far from a proper genomic sequence compressor yet: for example, it should support a line-unwrapping transform and automatically reorder sequence names and sequences. Or look for palindromes... Or anything I haven't even heard of.
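To make the line-unwrapping idea concrete, here is a minimal Python sketch. It is not paq8gen code; it assumes FASTA-style input where each record starts with `>`, all sequence lines in a record share one width except possibly the last, and headers contain no further `>`. The names `unwrap` and `rewrap` are made up for illustration.

```python
def unwrap(fasta):
    """Split FASTA text into (header, joined_sequence, line_width) records.

    Assumption: within a record every sequence line has the same width,
    except possibly the last one, so storing one width is enough to undo it.
    """
    records = []
    for block in fasta.split(">")[1:]:
        header, *lines = block.splitlines()
        width = len(lines[0]) if lines else 0
        records.append((header, "".join(lines), width))
    return records


def rewrap(records):
    """Inverse of unwrap(): re-insert the line breaks at the stored width."""
    out = []
    for header, seq, width in records:
        out.append(">" + header)
        if width:  # skip records that had no sequence lines
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


text = ">seq1\nACGT\nACGT\nAC\n>seq2\nGGCC\nTT\n"
records = unwrap(text)
restored = rewrap(records)
```

The point of such a transform would be that the model sees one uninterrupted nucleotide stream per sequence, while the stored width keeps the round trip lossless for input that matches the assumptions.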
    If you've got ideas and have some time - feel free to join and let's make it a good entropy estimator in the genomic-compression field.

  2. #62
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    741
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    I don't have personal reason with mpais, I just see the result of paq8gen is better than lily.
    Oh, I thought it could be more.
    OK, I'm glad then!

  3. #63
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    741
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Back to the ACGT-alphabet-reordering.
    These are the results from the small DNA corpus:

    [Image: ACGT-order-small-dna-corpus.png]

    There is no clear winner as to which reordering is the "general best". There are some green areas though.

    Meanwhile I found the optimal alphabet-transform for the sars-cov-2 challenge: it's GCAT (meaning: 'G'->0; 'C'->1; 'A'->2; 'T'->3).
    It's followed closely by GCTA, CGAT and CGTA. Hmm... So G likes to be together with C, and A with T. Looks like we've got something - see: https://en.wikipedia.org/wiki/Comple...cular_biology)
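As an illustration of what such an alphabet transform does, here is a small Python sketch. It only assumes the mapping stated above ('G'->0; 'C'->1; 'A'->2; 'T'->3) and a plain 2-bit code per symbol; how paq8gen applies the transform internally is not shown in this thread, and all function names are hypothetical.

```python
def make_alphabet_map(order):
    """Map each nucleotide to its index in `order`, e.g. 'GCAT' -> G:0, C:1, A:2, T:3."""
    if sorted(order) != sorted("ACGT"):
        raise ValueError("order must be a permutation of ACGT")
    return {base: code for code, base in enumerate(order)}


def encode(seq, order="GCAT"):
    """Translate a nucleotide string into a list of 2-bit symbol codes."""
    table = make_alphabet_map(order)
    return [table[base] for base in seq]


def to_bits(codes):
    """Write each code as two bits, most significant bit first."""
    return "".join(format(c, "02b") for c in codes)


def first_bit_groups(order):
    """Symbols that share the first bit of the 2-bit code: (bit 0, bit 1)."""
    return order[:2], order[2:]


codes = encode("GATTACA", "GCAT")
bits = to_bits(codes)
groups = first_bit_groups("GCAT")
```

With GCAT (and equally with GCTA, CGAT, CGTA) the first bit of each code separates {G, C} from {A, T}, which is one way to read the "G with C, A with T" observation for a bitwise model.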

    But it does not really match with the results from the small DNA corpus. The difference in compression is significant.
    Who's got some insight?
    Last edited by Gotty; 22nd February 2021 at 00:19.

  4. #64
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,138
    Thanks
    320
    Thanked 1,401 Times in 803 Posts
    The difference between a bytewise and a bitwise model is that after an occurrence of some symbol,
    a bytewise model increases the probability of only that symbol, while decreasing all others,
    whereas a bitwise model also increases the probability of some groups of other symbols with a matching prefix.
    So we can view the alphabet permutation as an approximation of a non-binary APM.
    This would mean that optimal permutations can easily differ not only between files, but even between contexts.
    (This can even be implemented in practice by building symbol probability distributions from binary counters
    separately for each context, then converting these distributions to a common binary tree for mixing and coding.)
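A toy Python sketch of the bytewise/bitwise difference, under loud assumptions: a 4-symbol alphabet with 2-bit codes, a simple exponential update with a made-up rate of 0.25, and uniform starting probabilities. Real counters in paq-style compressors are more elaborate; this only shows where the probability mass goes after one update.

```python
RATE = 0.25  # hypothetical adaptation rate, chosen only for readable numbers


def update_bytewise(probs, symbol, rate=RATE):
    """Move the whole distribution toward a one-hot vector for `symbol`."""
    return [p + rate * ((1.0 if s == symbol else 0.0) - p)
            for s, p in enumerate(probs)]


class BitwiseModel:
    """Three binary counters arranged as a two-level tree over 2-bit symbols."""

    def __init__(self):
        self.root = 0.5           # P(first bit = 1)
        self.second = [0.5, 0.5]  # P(second bit = 1 | first bit)

    def update(self, symbol, rate=RATE):
        hi, lo = symbol >> 1, symbol & 1
        self.root += rate * (hi - self.root)
        self.second[hi] += rate * (lo - self.second[hi])

    def probs(self):
        """Symbol probabilities implied by the three binary counters."""
        out = []
        for s in range(4):
            hi, lo = s >> 1, s & 1
            p_hi = self.root if hi else 1.0 - self.root
            p_lo = self.second[hi] if lo else 1.0 - self.second[hi]
            out.append(p_hi * p_lo)
        return out


# Observe symbol 0 (code 00) once in each model.
bytewise = update_bytewise([0.25] * 4, symbol=0)
model = BitwiseModel()
model.update(0)
bitwise = model.probs()
```

In this setup the bytewise model treats symbols 1, 2 and 3 identically after the update, while the bitwise model leaves symbol 1 (code 01, same first bit as the observed symbol) with more probability than symbols 2 and 3. Which symbols share a prefix is exactly what an alphabet permutation changes.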

    Btw, alphabet permutation is not limited to permutations of 0..3 even for target files with only 4 symbols.
    A code like {0001,0010,0100,1000} would have a different behavior than any of 0..3 permutations.
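To see why a code like {0001,0010,0100,1000} behaves differently, the Python sketch below lists the bit prefixes at which a real binary decision is still open (both 0 and 1 can follow). It assumes bits are coded most significant bit first; `decision_nodes` is a hypothetical helper, not something from paq8gen.

```python
def decision_nodes(codes):
    """Return the bit prefixes after which both a 0 and a 1 are still possible."""
    nodes = set()
    length = len(codes[0])
    for i in range(length):
        for prefix in {c[:i] for c in codes}:
            next_bits = {c[i] for c in codes if c.startswith(prefix)}
            if len(next_bits) == 2:
                nodes.add(prefix)
    # Shorter prefixes first, then lexicographic order.
    return sorted(nodes, key=lambda p: (len(p), p))


balanced = ["00", "01", "10", "11"]          # any permutation of 0..3
one_hot = ["0001", "0010", "0100", "1000"]   # the code from the post

balanced_nodes = decision_nodes(balanced)
one_hot_nodes = decision_nodes(one_hot)
```

Both codes need three binary decisions, but the 2-bit code forms a balanced tree (one split into two pairs, then one split inside each pair), whereas the one-hot code forms a chain: "is it 1000?", then "is it 0100?", then "0010 or 0001?". All remaining bits are fully determined and cost nothing with an ideal model, so the two codes group the symbols differently for the binary counters.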

  5. #65
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by Gotty View Post
    So following that pattern, now paq8gen must go under 600K.
    As I said in the challenge thread, I expect paq8gen to get down to 57x.xxx, at least, so no pressure.

    Quote Originally Posted by Gotty View Post
    If you've got ideas and have some time - feel free to join and let's make it a good entropy estimator in the genomic-compression field.
    The thing is, I don't think anyone in that field would even use it.

    Maybe innar, Kirr and JamesB have a different opinion on that, but it seems people in that field just cook up their own flawed compressors (see the benchmark that Kirr put together, especially his comments on each compressor) as long as they suit their particular workflow. For archiving and transmitting the large datasets they use, paq8gen is just too slow. And judging by innar's comments, it seems the field is not convinced of the use of compression algorithms as a way to find similarities between sequences. That's why I don't see the point in continuing to work on LILY for this either.

    I'd rather work on something fun and useful, work is boring enough as it is, which is why I proposed fairytale in the first place, it would have kept me busy writing open-source versions of LILY, EMMA, LEA, etc. Alas, that ship has sailed, so maybe it's time for a new hobby.

  6. Thanks:

    Mike (22nd February 2021)

  7. #66
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    573
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Quote Originally Posted by mpais View Post
    I'd rather work on something fun and useful, work is boring enough as it is, which is why I proposed fairytale in the first place, it would have kept me busy writing open-source versions of LILY, EMMA, LEA, etc. Alas, that ship has sailed, so maybe it's time for a new hobby.
    You want to work on Fairytale? No-one is stopping you, to be fair.

    Maybe there aren't a lot of people on board just yet but you know that can change in a heartbeat. I know I for one would test the hell out of the thing. Maybe I'm not a developer, but that hasn't stopped me from collaborating on lots of projects. Just look at all the bugs I've found in precomp, for instance...
    Last edited by Gonzalo; 22nd February 2021 at 21:25. Reason: typo

  8. #67
    Member
    Join Date
    Jan 2017
    Location
    uk
    Posts
    15
    Thanks
    0
    Thanked 7 Times in 3 Posts
    I'm not sure if this is relevant or not: https://jpeg.org/items/20201226_jpeg...ouncement.html

  9. #68
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 833 Times in 341 Posts
    Quote Originally Posted by Gonzalo View Post
    You want to work on Fairytale? No-one is stopping you, to be fair
    Sure, but no one helped either. We've discussed this before: writing a full-blown archiver as ambitious as that, in my free time, is delusional. There's a reason most of us here write single-file compressors. An archiver needs to implement a lot more functionality (adding content to an existing archive, deleting content from an archive, partial extraction, etc.) and we wanted to do it in an OS- and ISA-independent way, i.e., it should just work out of the box on Windows, Linux, MacOS, etc., and on x86, ARM, MIPS, RISC-V, etc. That is way above my skill level.

    Anyway, that is off-topic and I don't want to derail the thread. Gotty is doing fine work on paq8gen; the paq8 legacy is in good hands.
