Thread: ALAPY and fastqz

  1. #1
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    27
    Thanks
    4
    Thanked 8 Times in 4 Posts

    ALAPY and fastqz

    Every once in a while my benchmark turns up something interesting. Recently I added another compressor called ALAPY, and quickly noticed that its "best" mode performs nearly identically to "fastqz c". Its compression is slightly weaker and slightly faster than that of "fastqz c". It also uses nearly the same amount of memory.

    Data size vs compression ratio
    Data size vs compression speed
    Data size vs decompression speed
    Data size vs compression memory
    Data size vs decompression memory

    The above graphs show only these two compressors, to make it easier to see how similarly they perform across my entire test corpus. This kind of similarity is highly unusual. I'm not aware of any other pair of independent compressors mirroring each other so precisely.

    Which logically leads to the hypothesis that these two compressors are possibly not independent.

    You can see the homepages of both:
    fastqz: http://mattmahoney.net/dc/fastqz/
    ALAPY: http://alapy.com/services/alapy-compressor/

    fastqz is a nice compressor. In particular, it shows the strongest compression on one of my favorite datasets, a 2.76 GB bacterial genome set (Helicobacter):

    Compression ratio vs compression + decompression speed

    And sure enough, ALAPY is right behind.

    fastqz does not allow 'N' in the sequence, while ALAPY does. Therefore my wrapper removes N from sequences before feeding them to fastqz, and skips this step with ALAPY. This may account for some of the differences in both compression strength and speed. It also means that if ALAPY is related to fastqz, at least it's not a verbatim copy. In addition, ALAPY has 2 other levels that are weaker and faster.
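
    For illustration, the N-removal step could look roughly like the sketch below. This is not the actual wrapper code: the function names are hypothetical, and recording the positions of the removed runs so they can be restored later is an assumption of the sketch, not something stated above.

```python
def strip_n(seq):
    """Return (sequence without N, list of (start, length) runs of N).

    Run positions refer to the original sequence. Hypothetical helper.
    """
    kept = []
    runs = []
    i = 0
    while i < len(seq):
        if seq[i] in "Nn":
            start = i
            while i < len(seq) and seq[i] in "Nn":
                i += 1
            runs.append((start, i - start))
        else:
            kept.append(seq[i])
            i += 1
    return "".join(kept), runs


def restore_n(stripped, runs):
    """Reinsert the recorded runs of N (always as upper-case 'N')."""
    out = []
    pos = 0  # next unread index in `stripped`
    cur = 0  # length of the output rebuilt so far
    for start, length in runs:
        take = start - cur          # non-N bases before this run
        out.append(stripped[pos:pos + take])
        pos += take
        out.append("N" * length)
        cur = start + length
    out.append(stripped[pos:])      # tail after the last run
    return "".join(out)
```

    Here the stripped sequence would go to the compressor and the run list would be kept separately. Lower-case 'n' comes back as upper-case 'N', which is a simplification.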

    More points to consider:
    1. fastqz was released in 2012, ALAPY in 2017.
    2. fastqz is open source, ALAPY is not.
    3. fastqz is described in peer-reviewed literature, ALAPY is not (as far as I'm aware).
    4. fastqz is free, ALAPY is not (it is protected by a long EULA, and the freely downloadable version is limited to 1 instance at a time).
    5. fastqz was made by Matt Mahoney, while we don't know who developed ALAPY.

    Notably, ALAPY is a rare example of a biological compressor that was not described in a paper. (The usual problem is the opposite: too many papers describe unavailable compressors, or even just algorithms without a shared implementation.)

    Now, hypothetically speaking, is it OK to copy and modify fastqz and start selling it? fastqz is generously shared under the BSD 2-clause license. The 2nd clause says: "Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution."

    Is the modified code considered "redistribution" or not? I guess it's not but I'm not sure.

    Note that I only tested the latest version of ALAPY, 1.3.0. The GitHub repo has older binaries; it might be interesting to look at them.
    Last edited by Kirr; 23rd December 2019 at 02:16. Reason: Typos

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    I hadn't heard of ALAPY before. Interesting.

    It's possible it's another ZPAQ based thing with various configurations available, which may explain why it's close to fastqz in behaviour. Or maybe it's just something totally different. I didn't look at the binary to see.

    I really should dust off fqzcomp and apply the newer codecs I have for CRAM 4. It can sometimes be *considerably* smaller on quality values and the read name tokeniser is better too. Sequence compression was always fqzcomp's weakness, which is why I did samcomp (align to a genome and store the deltas). More modern fastq compressors (eg Spring, Pgrc) are basically lightweight sequence assemblers with the same align & delta approach. It works well on deeply sequenced organisms, but can be memory hungry and slow.
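
    As a toy illustration of the "align and store the deltas" idea (not samcomp's actual format), a read can be stored as a position plus the bases that differ from the reference. The function names are hypothetical, and the sketch assumes an ungapped alignment with substitutions only:

```python
def encode_delta(reference, read, pos):
    """Encode a read aligned without gaps at `pos` as (pos, length, mismatches).

    `mismatches` lists (offset, base) pairs where the read differs
    from the reference. Hypothetical helper, substitutions only.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches


def decode_delta(reference, record):
    """Rebuild the read from the reference and its delta record."""
    pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)
```

    A read that matches the reference exactly costs only a position and a length, which is why this pays off with deep coverage.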

    There are more crude approximate approaches such as sorting by minimisers. Eg see https://github.com/samtools/samtools/pull/1093

    That's a fast and crude data collation technique with a simple rolling hash + XOR (so that poly-A doesn't become the minimiser) as the sort key. It gets maybe 70% of the way there to optimal encoding but without the huge memory and time requirements.
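
    A minimal sketch of the idea, not the samtools implementation: pack each k-mer into a 2-bit rolling hash, XOR it with a constant so that poly-A (hash 0) is no longer the smallest value, and use the smallest result in the read as the sort key. The value of k, the mask constant and the function name are all assumptions, and reads are assumed to contain only A, C, G and T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MASK_PATTERN = 0x5A5A5A5A5A5A5A5A  # arbitrary constant, an assumption


def minimiser(seq, k=8):
    """Smallest masked k-mer hash in `seq`, using a 2-bit rolling hash."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    kmask = (1 << (2 * k)) - 1       # keep only the last k bases
    xor_mask = MASK_PATTERN & kmask  # poly-A (hash 0) maps to xor_mask, not 0
    h = 0
    best = None
    for i, base in enumerate(seq):
        h = ((h << 2) | CODE[base]) & kmask  # roll in the next base
        if i >= k - 1:                       # a full k-mer is available
            key = h ^ xor_mask
            if best is None or key < best:
                best = key
    return best


def collate(reads, k=8):
    """Sort reads so that reads sharing a minimiser end up adjacent."""
    return sorted(reads, key=lambda r: minimiser(r, k))
```

    Reads that overlap tend to share a minimiser and therefore land next to each other, which is what helps the downstream compressor.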

  3. #3
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    27
    Thanks
    4
    Thanked 8 Times in 4 Posts
    Originally Posted by JamesB:
    I really should dust off fqzcomp and apply my newer codecs I have for CRAM 4. It can sometimes be *considerably* smaller on quality values and the read name tokeniser is better too.
    That's nice and I'll be glad to benchmark your new version.

    Originally Posted by JamesB:
    Sequence compression was always fqzcomp's weakness, which is why I did samcomp (align to a genome and store the deltas). More modern fastq compressors (eg Spring, Pgrc) are basically lightweight sequence assemblers with the same align & delta approach. It works well on deeply sequenced organisms, but can be memory hungry and slow.
    True, this works well on deep-coverage sets of short reads. It's something I look forward to investigating in detail.

    I don't think Spring does assembly? ( https://github.com/shubhamchandak94/Spring ).

    Originally Posted by JamesB:
    There are more crude approximate approaches such as sorting by minimisers. Eg see https://github.com/samtools/samtools/pull/1093

    That's a fast and crude data collation technique with a simple rolling hash + XOR (so that poly-A doesn't become the minimiser) as the sort key. It gets maybe 70% of the way there to optimal encoding but without the huge memory and time requirements.
    Something I'd be curious to benchmark as well, if it's available in a usable form.