
Thread: Mermaid and Selkie join Kraken

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts

    Mermaid and Selkie join Kraken

    Charles Bloom has just published a fascinating series of posts introducing two new compressors – Mermaid and Selkie:

    Kraken was a huge step in the Pareto frontier that pushed the achievable speedup factor way up beyond what other compressors were doing. There's a pre-Kraken curve where we thought the best possible tradeoff existed, that most other compressors in the world roughly lie on (or under). Kraken set a new frontier way up on its own with nobody to join it; Mermaid & Selkie are the partners on that new curve that have their peaks at higher speeds than Kraken.
    I like the looks of Mermaid especially: LZ4 speed with better than zlib compression. It's fascinating how well they've optimized their decompressors for the PS4. There must be a market there – I didn't realize that games on Blu-ray would still be short on space and could benefit so much from good compression.

  2. #2
    Member
    Join Date
    Feb 2016
    Location
    Perth, Australia
    Posts
    9
    Thanks
    11
    Thanked 3 Times in 1 Post
    Optical media is very slow (from memory the PS4 drive is 6X, which is under 30MB/s transfer speed), so compression is less about being short on space and more about saving time on expensive reads. Even the hard drive in the PS4 is slow enough that fast decompression is a net win in load time. That's why Charles talks so much about the Pareto frontiers in terms of disk read speed on his blogs (i.e. what decompressor will give you the shortest load time given a certain disk speed).

    Mermaid and especially Selkie sound fantastic.

  3. #3
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    752
    Thanks
    216
    Thanked 283 Times in 165 Posts
    Quote Originally Posted by DirtyPunk View Post
    ... 6X, which is under 30MB/s transfer speed ...
    If disk bandwidth is 30 MB/s, what is the benefit of over 1 GB/s decompression speed? Assuming 1:10 compression density, any decompression speed above 300 MB/s doesn't really speed up the loading in wall time?

  4. #4
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    If disk bandwidth is 30 MB/s, what is the benefit of over 1 GB/s decompression speed? Assuming 1:10 compression density, any decompression speed above 300 MB/s doesn't really speed up the loading in wall time?
    Ideally you'd use different compressors for different media speeds, if accelerating loading is what you care about. eg. you're right that on 30 MB/s media like optical, you want a higher-compression slower-to-decode codec.

    But it's not that simple. Most games are not doing their data loading with the CPUs just sitting idle for the decompressor to use the whole thing. In fact they're typically using 90% or more of the CPU already, and expect the decompressor to run in only 10% of the available CPU time. That means a codec which runs at 50 MB/s on a full core will actually run at 5 MB/s when it only gets 10% of the time.

    And if you can make the codec even faster so they only have to give you 5% of the CPU, that makes them very happy.

    Another practical issue is that these systems cache the optical disc to hard disk or flash storage. So the first load is from optical, but later loads are from faster media. A slow decompressor might be right for the optical media, but then becomes the bottleneck from faster media. In theory you could transcode but in practice that's a PITA and it's nicer to avoid it.

    eg. we think it's important for us to be the least-bad, in the sense that people often won't use a codec in exactly the way you think is optimal. For example, a very slow decoder like lzma is pretty reasonable over very slow media like downloading on a cell network, but if you keep the data compressed with lzma and use it during loading from flash storage, then it's a huge bottleneck. Rather than work really hard to improve that cell network download time by 1%, we can improve the flash load time by 10X.
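
    To make that CPU-budget arithmetic concrete, here is a toy model in C (all numbers are made-up illustrations, not measurements of any real codec):

    #include <stdio.h>

    /* Toy model of level loading: the disk read and decompression overlap,
     * and the decompressor only gets a fraction of one core. */
    int main(void) {
        double uncompressed_mb = 1024.0;  /* data needed for a level         */
        double ratio           = 2.0;     /* compression ratio (2:1)         */
        double disk_mb_s       = 100.0;   /* a PS4-era HDD, roughly          */
        double decode_mb_s     = 500.0;   /* full-core decode speed          */
        double cpu_share       = 0.10;    /* the 10% CPU budget quoted above */

        double read_time   = (uncompressed_mb / ratio) / disk_mb_s;
        double decode_time = uncompressed_mb / (decode_mb_s * cpu_share);
        /* With read and decode pipelined, the slower of the two dominates. */
        double load_time   = read_time > decode_time ? read_time : decode_time;

        printf("read %.1fs, decode %.1fs, load ~%.1fs\n",
               read_time, decode_time, load_time);
        return 0;
    }

    With these assumed numbers the 500 MB/s codec effectively decodes at 50 MB/s and becomes the bottleneck (about 20s of decode vs a 5s read), which is why shrinking the CPU share a codec needs matters so much.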
    Last edited by cbloom; 3rd August 2016 at 20:05.

  5. #5
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Also, essentially no PS4 game actually loads just off Blu-Ray, precisely because it's too slow (the transfer rate is slow too, but the really bad news is seek times, since there's a lot of on-demand asset streaming). They all require a multi-GB hard-drive install these days. The HDs are at least somewhat more reasonable (100-120 MB/s usually). The idea is to put at least the data with more or less random access patterns on the hard drive, and to only stream large things with nice sequential access patterns (e.g. cinematics) from the optical disc where possible.

    Separately, games *are* starting to run out of the 50GB they get out of dual-layer Blu-Ray these days. At this year's GDC (last March), several people were interested in higher compression ratios because of that. Nobody wants to do multi-disc, it's a giant pain. Note this doesn't mean 50GB of *unique* content: for anything intended to be streamed from optical disc (as opposed to install-to-HD packages), there's frequently a fair amount of replication to reduce the really expensive seeks (80-150ms each, sometimes more). Basically, the goal is to have tolerable loading times while still fitting on disc. Compression is part of that, since it increases the net bandwidth you get out of sequential reads, but it's also important because it increases the amount of "slack" space so you can build large coarse-grained asset packs (with some replication) that reduce seeks. It's a complicated balancing act.

    That said, optical media are (thankfully) on the way out. But it's still a concern right now.

    Medium term, not only are we gonna see less optical media, consumer devices are also moving away from HDs and towards Flash/SSDs (and the mobile world has always been there, of course). For example, the iPhone 6S has NVMe storage with >400MB/s read. These devices are still constrained for space; they have plenty of storage bandwidth, but not all that much CPU power. That's why we've been focusing on faster codecs recently. Since our focus is on asset loading from non-volatile storage, and NV storage speeds have gone up significantly while available CPU power has stagnated, the sweet spot for CPU cycles per byte is trending downwards rapidly.

  6. #6
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Yup, as Fabian says we have storage speeds going up, CPU speeds stagnating. (And storage speeds look like they will continue to trend towards RAM speed in the next few years.)

    On top of that you have RAM getting bigger, display res going up, GPUs getting faster -

    the result is that games need an absolutely massive amount of data to do high-detail rendering. It's not shocking for a game to load 1 GB for a level, and that will just keep going up.

    On the other end, we do still have size queens that just want things as small as possible, mainly in the downloadable mobile space. So there's a very wide spectrum of needs at the moment.
    Last edited by cbloom; 3rd August 2016 at 20:05.

  7. #7
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by cbloom View Post
    Yup, as Fabian says we have storage speeds going up, CPU speeds stagnating. (And storage speeds look like they will continue to trend towards RAM speed in the next few years.)

    On top of that you have RAM getting bigger, display res going up, GPUs getting faster -

    the result is that games need an absolutely massive amount of data to do high-detail rendering. It's not shocking for a game to load 1 GB for a level, and that will just keep going up.

    On the other end, we do still have size queens that just want things as small as possible, mainly in the downloadable mobile space. So there's a very wide spectrum of needs at the moment.
    Charles, have you tried modern compilers for the Windows builds? I assume you're locked into clang 3.7 or whatever for the PS4, but you used pretty old compilers on Windows according to your post on Zstd variance. I think you might have some headroom to exploit with newer compilers. Have you tried Visual Studio 2015 or the Intel Compiler 16.0? Microsoft had an interesting post on their new C++ compiler here.

    The STOKE superoptimizer from Stanford looks powerful too. I'm amazed by the boost they get on some workloads, especially in their "conditionally-correct optimization" paper.

  8. #8
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by SolidComp View Post
    Charles, have you tried modern compilers for the Windows builds? I assume you're locked into clang 3.7 or whatever for the PS4, but you used pretty old compilers on Windows according to your post on Zstd variance.
    Yeah it's a good point, I suppose I should try some other compilers on my own code to see if they help, though personally I really hate these modern optimizing compilers.

    Definitely from what I saw, if you use LZ4 or ZStd you want to build them with a modern GCC -O3 ; even if your main project is in MSVC you should build them that way and link them in.
    Last edited by cbloom; 3rd August 2016 at 20:05.

  9. #9
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by cbloom View Post
    Yeah it's a good point, I suppose I should try some other compilers on my own code to see if they help, though personally I really hate these modern optimizing compilers.

    Definitely from what I saw, if you use LZ4 or ZStd you want to build them with a modern GCC -O3 ; even if your main project is in MSVC you should build them that way and link them in.
    Do you see any difference if you tell gcc that the targets are modern CPUs with SIMD, e.g. with the -msse4.2 or -march=btver2 flags? By default, I don't think it even assumes SSE2 is available. SSE 4.2 is a safe bet for anything made in the last six or seven years, and I think btver2 would match up with the Jaguar CPUs in the PS4 and Xbox (which includes not only SSE 4.2, but also AVX, carry-less multiplication, F16C, and some of the newer bit manipulation instructions). AVX might be slow in that generation though.

    Separately, can you elaborate on this statement from your Kraken Performance post?:

    Oodle+wbpe is an excellent text compressor.
    What kind of numbers are we looking at? Would it be better than brotli? I've also just come across XWRT – it looks like it might be better than brotli for text, and I'm surprised it's just been sitting there for years. I wonder what the performance details are. Maybe browser makers should evaluate brotli along with several other solutions and settle on the best option. I feel like that process was skipped.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    SSE 4.2 was first introduced by Intel 8 years ago, and by AMD 5 years ago. Even 5 years ago my friend bought an Intel Celeron Merom (SSSE3 only). Taking into account that old Atoms don't support SSE4 either, it's probably only safe to assume for CPUs produced in the last 3 years or so.

    AVX includes only FP operations, so it's useless for compression. CLMUL is pretty slow on most CPUs, wasn't supported by Celeron/Pentium even a year ago, and can't be used by the automatic vectorizer, since it's a very specific operation.

    Maybe browser makers should evaluate brotli along with several other solutions and settle on the best option. ...
    They can hardly consider paid libs, while the Oodle authors are hardly going to work on an OSS one. That's the authors' choice: work for yourself or for a big company.

  11. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    486
    Thanks
    169
    Thanked 166 Times in 114 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    AVX includes only FP operations, so it's useless for compression. CLMUL is pretty slow on most CPUs, wasn't supported by Celeron/Pentium even a year ago, and can't be used by the automatic vectorizer, since it's a very specific operation.
    The gather/scatter instructions may help with integer maths, as well as the blends, bitwise-or, and shuffles.

    AVX2 and AVX-512 are much newer but also much more useful beasts to consider. Ideally we'd have algorithms that can be written to exploit these abilities without requiring them. For example, entropy coding a buffer with 16 start points means AVX-512 can multiplex all 16 coders in parallel. Older systems just have to do it the long way instead, but probably still gain some speed. If the API is permitted to return data in 16 buffers instead of one contiguous one, that helps speed too. Untested, but I can see plenty of room for optimisation there. Even with SSE2 it ought to be possible to stripe 4 entropy encoders in parallel, but I think the lack of gather is ultimately what cripples the older SIMD instruction sets for this workload.

    I also wonder whether sometimes it's actually better to rewrite integer code in floating point if you can make use of the fast FMA instructions. Perhaps it's OK if the numbers are small and you can guarantee accuracy. Throughput is good but latency is poor, so it depends on what else you're doing too.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,716
    Thanks
    271
    Thanked 1,185 Times in 656 Posts
    Actually it's possible to emulate integer arithmetic with floating-point variables.
    It's done by adding 1<<mantissa_size to all values.
    Once upon a time I made a 64-bit FPU-based range coder using that fact.
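
    A minimal sketch of that trick in C (assuming IEEE-754 doubles with a 52-bit mantissa; just an illustration of the bias idea, not Shelwien's actual range coder):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Adding 2^52 to a small non-negative integer held in a double forces it
       into the low mantissa bits, so float adds behave like integer adds as
       long as the values stay well below 2^52 and no rounding occurs. */
    static uint64_t low_bits(double d) {
        uint64_t u;
        memcpy(&u, &d, sizeof u);      /* reinterpret the IEEE-754 bit pattern */
        return u & ((1ull << 52) - 1); /* keep the 52 mantissa bits */
    }

    int main(void) {
        const double BIAS = 4503599627370496.0; /* 2^52 = 1 << mantissa_size */
        double a = BIAS + 1000.0;
        double b = BIAS + 234.0;
        /* a + b would double the bias, so subtract it once to stay "biased". */
        double sum = a + (b - BIAS);
        printf("%llu\n", (unsigned long long)low_bits(sum)); /* prints 1234 */
        return 0;
    }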

  13. Thanks (2):

    JamesB (2nd August 2016),Matt Mahoney (8th August 2016)

  14. #13
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by SolidComp View Post
    Do you see any difference if you tell gcc that the targets are modern CPUs with SIMD, e.g. with the -msse4.2 or -march=btver2 flags? ...
    I haven't seen that the compilers do a great job of automatically generating very useful SSE4. You have to detect it and manually write good loops that use it. I bet compilers would do a good job of using things like SHRX, but you can't just enable BMI for a compile across the board. It would be cool if GCC could build obj's for multiple targets and select one at runtime based on cpu flags.
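
    (As an aside, newer GCCs do have something along these lines: function multi-versioning via the target_clones attribute, which compiles several variants of a function and dispatches on CPU features at load time. A minimal sketch, assuming GCC 6+ on x86 Linux/glibc; the function itself is just a made-up example:)

    #include <stddef.h>
    #include <stdint.h>

    /* GCC emits one clone per listed target and installs an ifunc resolver
       that picks the best supported clone based on CPUID at startup. */
    __attribute__((target_clones("default", "sse4.2", "avx2")))
    uint32_t sum_bytes(const uint8_t *p, size_t n)
    {
        uint32_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += p[i];   /* simple loop the vectorizer can work with */
        return s;
    }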

    Separately, can you elaborate on this statement from your Kraken Performance post?:

    What kind of numbers are we looking at? Would it be better than brotli? I've also just come across XWRT – it looks like it might be better than brotli for text, and I'm surprised it's just been sitting there for years. I wonder what the performance details are. Maybe browser makers should evaluate brotli along with several other solutions and settle on the best option. I feel like that process was skipped.
    Yes, Oodle+wbpe or Oodle+XWRT beats brotli (without a preprocessor) on text. Specialized word-replacers definitely work! (I've only tried English text and HTML, and haven't looked at other-language UTF-8; this isn't an area I look at much.) GLZA does even better on text, but I haven't looked into its space-speed performance.

    For example
    enwik8 + wbpe + 7z -> 23,782,942
    enwik8 + xwrt + 7z -> 23,671,028
    Oodle + wbpe -> 24,106,208
    Last edited by cbloom; 3rd August 2016 at 20:04.

  15. #14
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    486
    Thanks
    169
    Thanked 166 Times in 114 Posts
    Intel CC can generate code for different CPUs and dispatch on the fly, but I was told this is done at a function-by-function level, so the overhead of checking the CPU can be significant if you have lots of tiny fast-returning functions. (Probably not a good thing to do anyway.)

    In theory you can use things like "#pragma omp simd" with various hints to encourage vectorisation. I've seen it speed up icc but not so much gcc, and the icc speedup was only enough to bring it in line with the auto-generated gcc output anyway. Useful hints can be to add alignment information and to disable aliasing or manually add "restrict" keywords. These may then make it more likely to generate good SIMD code, but it still feels like poking around in the dark. We had to manually add cmov assembly statements to some code because most compilers weren't analysing the costs correctly. :/
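
    For illustration, a sketch of the kind of hints being described (the OpenMP 4.0 simd pragma, restrict, and an alignment clause; the function and its parameters are made up):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical loop: add two byte streams. The restrict qualifiers promise
       no aliasing, the aligned clause promises 32-byte alignment, and the
       pragma asks the compiler to vectorise the loop.
       Compile with e.g. gcc -O3 -fopenmp-simd or icc -qopenmp-simd. */
    void add_streams(uint8_t *restrict dst,
                     const uint8_t *restrict a,
                     const uint8_t *restrict b,
                     size_t n)
    {
        #pragma omp simd aligned(dst, a, b : 32)
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)(a[i] + b[i]);
    }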

    The other useful thing about the Intel compiler is -qopt-report (=5?). This generates lots of really useful information about vectorising loops and why the compiler decided not to do it. It can give some big hints, like whether it thinks there is a dependency between loop iterations or unaligned memory accesses, and it also gives you the cost analysis. It's not perfect, but it is a helpful tool. I also use google perf-tools a lot for instruction-level profiling, as it can indicate hot spots for cache misses, branch mispredictions, etc.

  16. #15
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    SSE 4.2 was first introduced by Intel 8 years ago, and by AMD 5 years ago. Even 5 years ago my friend bought an Intel Celeron Merom (SSSE3 only). Taking into account that old Atoms don't support SSE4 either, it's probably only safe to assume for CPUs produced in the last 3 years or so.

    AVX includes only FP operations, so it's useless for compression. CLMUL is pretty slow on most CPUs, wasn't supported by Celeron/Pentium even a year ago, and can't be used by the automatic vectorizer, since it's a very specific operation.


    They can hardly consider paid libs, while the Oodle authors are hardly going to work on an OSS one. That's the authors' choice: work for yourself or for a big company.
    Sorry, I missed this comment. When I said that brotli hadn't been tested among all the alternatives before making decisions on which next-gen compressor browsers should support, I didn't mean anything proprietary or paid*. I was only thinking of open-source alternatives, mostly Zstd and zling. I think zling is looking very good, a nice piece of work, a worthwhile improvement over gzip. If it can be further optimized to bring its decompression speed up to gzip levels on the relevant content, then I think it would be the best choice, over Zstd and brotli. zling doesn't get enough attention. I'm also curious about XWRT, which looks like it could easily be made better than brotli, if it isn't already.

    brotli and Zstd use enormous resources for compression compared to zlib/gzip. If gzip was too heavy for something like HAProxy, sparking the creation of SLZ, then Zstd and brotli are out of the question for resource-sensitive servers. brotli's dictionary could probably be better, more tailored to its use case of compressing web content, smaller but with longer strings. A sane binary format would be best – people forget that uncompressed size matters on the client, because it takes up RAM.

    As far as Kraken, Selkie, and Mermaid, they're not designed for text, but I think maybe some open-source future spin-off of RAD's work would be better than brotli or Zstd, when combined with the text compression methods Charles was talking about.

    * I don't rule out paid codecs completely, because I don't rule out paid browsers. I think the world could really use a browser company right now, and that it would be surprisingly profitable. 1. No one is trying to build the best browser. 2. It's actually possible to build much better browsers than Chrome or Edge. This is underexposed. We could have much faster browsing without anything changing on servers. The browser makers that give us "free" browsers have all sorts of interests and goals in doing so that compromise the quality of the browsers. Mozilla is more a political organization than one focused on building the best browser (and it shows). Google is an ad company. Microsoft is a confused company. None of them have a customer relationship with the users of their browsers, and that shapes the incentives. Google uses random, junky open-source libraries for lots of Chrome, including the JPEG2000 support in their PDF reader, which of course had awful security holes that were exposed recently, and will have more. No one's really trying to meet a higher standard, and they use primitive programming languages that guarantee that their software is insecure. They're stuck in old paradigms. A browser company could do a lot better.

  17. #16
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by SolidComp
    brotli and Zstd use enormous resources for compression compared to zlib/gzip. If gzip was too heavy for something like HAProxy, sparking the creation of SLZ, then Zstd and brotli are out of the question for resource-sensitive servers.
    Well, it depends on how you use it. Compression that happens on-the-fly on the server, as the request is being handled, has very different trade-offs from "baking" compressed assets once and storing them. A transparent proxy is an extreme case: it doesn't know anything about the data it handles, and can't really do any work up front (because it doesn't know what data it's going to be serving ahead of time). That absolutely biases it towards faster encoders. (Spending more time encoding might pay off eventually if the asset in question is going to be requested a lot, but the proxy doesn't know that in advance!)

    At the other extreme, when you have a high-traffic site with a bunch of known static assets like style sheets, fonts and JS files that change rarely (compared to the number of times they're served), extra time spent preparing a pre-compressed version once is amortized well. There's still a threshold on what kind of delay is tolerable if assets do change and need to be re-compressed, but it's orders of magnitude higher than your threshold of pain on a server that's compressing on-the-fly for a live request.

    Quote Originally Posted by SolidComp
    As far as Kraken, Selkie, and Mermaid, they're not designed for text, but I think maybe some open-source future spin-off of RAD's work would be better than brotli or Zstd, when combined with the text compression methods Charles was talking about.

    My opinion (personal, not RAD's or whatever) is that the most interesting future work is in more specialized compressors, along various axes, and allowing more varied (and more useful) methods of access. Purely sequential decoding is not a great place to be in. Compressing gigabytes of unspecified data as one giant solid blob: really not all that interesting outside of an archiver or a Hutter prize attempt (to me anyway).

    I got more data than I know how to deal with already. I'd prefer accessing and organizing that data to get less painful over time, not more. I want some compression (because storage is never enough), but I also want indexing and better ways to retrieve data. Really, I want all of the above: data that is compact, indexed, fast to access the parts I care about, and really fast to search to identify those parts.

    Previous attempts to make storage more database-y have failed, in part for performance reasons, but I hope that maybe with ever faster random-access storage access, we're not that far from the point (or might have crossed it already) where we can actually make this happen for most of the data I personally handle ("document-y" things). That's a lot more exciting to me than just squeezing an extra % off file sizes anyway.

  18. Thanks (3):

    Cyan (16th August 2016),JamesB (16th August 2016),Turtle (18th August 2016)

  19. #17
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    486
    Thanks
    169
    Thanked 166 Times in 114 Posts
    Quote Originally Posted by fgiesen View Post
    My opinion (personal, not RAD's or whatever) is that the most interesting future work is in more specialized compressors, along various axes, and allowing more varied (and more useful) methods of access. Purely sequential decoding is not a great place to be in. Compressing gigabytes of unspecified data as one giant solid blob: really not all that interesting outside of an archiver or a Hutter prize attempt (to me anyway).

    I got more data than I know how to deal with already. I'd prefer accessing and organizing that data to get less painful over time, not more. I want some compression (because storage is never enough), but I also want indexing and better ways to retrieve data. Really, I want all of the above: data that is compact, indexed, fast to access the parts I care about, and really fast to search to identify those parts.

    Previous attempts to make storage more database-y have failed, in part for performance reasons, but I hope that maybe with ever faster random-access storage access, we're not that far from the point (or might have crossed it already) where we can actually make this happen for most of the data I personally handle ("document-y" things). That's a lot more exciting to me than just squeezing an extra % off file sizes anyway.
    This is exactly the state of affairs in the Bioinformatics community with DNA sequencing data. We need to be able to perform quick random access of our data for analysis, sometimes random access within one sample and sometimes finding the same region (loci) across many samples. The holy grail is something that can achieve compression (deduplication) between samples while not harming streaming rates of a single sample.

    Previously, and to be fair still dominantly, people use(d) BAM for this, which is basically Z_FULL_FLUSH'd deflate streams with all the data types munged together. Now there is CRAM, which is around half the size while still having decent random access; it splits by data type, but still uses fairly standard algorithms per type of data. There are better formats still, with custom compression algorithms, but as with everything it becomes a tradeoff between space, time and granularity of random access.
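
    To illustrate the general pattern behind such formats (independent compressed blocks plus an offset index for random access), here is a minimal sketch using zlib; the block size, struct layout and names are my own, not BAM's or CRAM's:

    #include <stdint.h>
    #include <stdlib.h>
    #include <zlib.h>

    #define BLOCK_SIZE (64 * 1024)  /* arbitrary block size for this sketch */

    typedef struct {
        uint64_t uncompressed_offset; /* where the block starts in the original data    */
        uint64_t compressed_offset;   /* where its compressed bytes start in the output */
        uint32_t compressed_size;
    } BlockIndexEntry;

    /* Compress `n` bytes in independent blocks, recording one index entry per
     * block. Because each block is compressed on its own, any byte offset can
     * later be served by finding its block in the index (binary search on
     * uncompressed_offset) and inflating just that block.
     * Error handling is omitted for brevity. */
    unsigned char *compress_blocks(const unsigned char *src, size_t n,
                                   BlockIndexEntry **index_out,
                                   size_t *nblocks_out, size_t *csize_out)
    {
        size_t nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        BlockIndexEntry *index = malloc(nblocks * sizeof *index);
        size_t cap = (size_t)compressBound(BLOCK_SIZE) * nblocks;
        unsigned char *dst = malloc(cap);
        size_t out_pos = 0;

        for (size_t b = 0; b < nblocks; b++) {
            size_t in_off  = b * BLOCK_SIZE;
            uLong  in_len  = (uLong)(n - in_off < BLOCK_SIZE ? n - in_off : BLOCK_SIZE);
            uLongf out_len = (uLongf)(cap - out_pos);
            compress2(dst + out_pos, &out_len, src + in_off, in_len, 6);
            index[b].uncompressed_offset = in_off;
            index[b].compressed_offset   = out_pos;
            index[b].compressed_size     = (uint32_t)out_len;
            out_pos += out_len;
        }
        *index_out = index; *nblocks_out = nblocks; *csize_out = out_pos;
        return dst;
    }

    The price of this kind of layout is some ratio loss at block boundaries; bigger blocks compress better but make the random-access granularity coarser, which is exactly the space/time/granularity tradeoff mentioned above.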

    To see why this matters, just look at the difference between the cost curves of compute and of DNA sequencing. (The cost of storage is also dropping exponentially, but I think it's only halving every 14 months or so.) As sequencing costs come down, people design experiments to sequence more individuals, so the amount of data we have to store grows roughly in inverse proportion to the cost. Scary, but fun.

    [Attached image: costperMb2015_4.jpg – chart of DNA sequencing cost per raw megabase over time]

  20. #18
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    737
    Thanks
    230
    Thanked 233 Times in 143 Posts
    James,
    I didn't know about the big drop a year ago (~$60 per 1x read of a human genome??) - is it still Illumina?
    Oxford Nanopore doesn't seem to be economically competitive yet (?)
    Very promising is Starlight ( https://tools.thermofisher.com/conte...cms_091831.pdf ) - it doesn't read a single strand once (PacBio) or twice (Oxford), but multiple times ...

    Unfortunately these prices seem to require that you already have the device; for a private person, full sequencing (or even exome sequencing) still seems astronomically costly ...

  21. #19
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    486
    Thanks
    169
    Thanked 166 Times in 114 Posts
    Quote Originally Posted by Jarek View Post
    James,
    I didn't know about the big drop a year ago (~$60 per 1x read of a human genome??) - is it still Illumina?
    Oxford Nanopore doesn't seem to be economically competitive yet (?)
    Very promising is Starlight ( https://tools.thermofisher.com/conte...cms_091831.pdf ) - it doesn't read a single strand once (PacBio) or twice (Oxford), but multiple times ...

    Unfortunately these prices seem to require that you already have the device; for a private person, full sequencing (or even exome sequencing) still seems astronomically costly ...
    That chart is a mix of everything. The initial drop around 2003 was 454 pyrosequencing. The big drops after that were successive releases of Solexa / Illumina sequencing, with other manufacturers pricing their costs to be competitive with Illumina mainly. The most recent drop is probably due to the Illumina X10 machines. They're costly, but if you get enough throughput then they become cheap. This means that practically it's only large centres that have them or places offering sequencing as a service where you send a sample and get data back.

    Note that the per-Mb in that chart is per megabase of raw data, not per megabase of genome covered. Typically people want 15x coverage or more to ensure most locations are covered (fragment placement is mostly random, so the higher the depth, the less chance of having a hole in coverage), to ensure accuracy of consensus calls, and to get enough data to cover both alleles for SNP calling. I.e. you want the DNA covered well for both copies of your chromosomes. That brings the price up much higher, but it's still getting very cheap.
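
    (For a rough sense of why ~15x: under an idealized model where fragments land uniformly at random, per-base depth is approximately Poisson with mean c, so the fraction of bases with zero coverage is about e^-c. At c = 15 that is roughly 3×10^-7, i.e. on the order of a thousand uncovered bases across a ~3 Gb genome, before even considering consensus accuracy or covering both alleles.)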

    The newer technologies are very interesting, but a bit of a different breed, although it's not correct to say PacBio only covers data once - they have something called a SMRTbell where the DNA gets read multiple times over. Starlight I think is a modification of the IonTorrent, but I hope it is successful, as the more instrument choices we have, the cheaper it becomes. By and large though the newer technologies give much longer fragments but have very high error rates. E.g. 5-10% error vs 0.1-0.3% or something like that. That doesn't mean they're useless. If the errors are largely randomly distributed then higher depth gives a mostly error-free consensus (PacBio), but if the errors are systematic (pyrosequencing, some ONT data too) then depth doesn't rescue you completely. The long reads though are fantastic for finding larger-range mutations such as structural rearrangements (common in cancers) or for de novo sequencing & assembly of unknown organisms. Plus the ONT MinION is *tiny* and can be used in the field. We're getting close to the Star Trek medical tricorder. Fun fun.

    James

    PS. Getting back to data compression; fundamentally the sequence base calls aren't the bulk of the data, plus we can delta them to a known reference genome. It's the per-base quality/confidence scores that cripple compression ratios. More recently several groups, including myself, have been looking at lossy compression of quality values and only storing them in places where they help determine the consensus / SNP assignments.

  22. Thanks:

    Jarek (16th August 2016)

  23. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    737
    Thanks
    230
    Thanked 233 Times in 143 Posts
    Sure, coverage is crucial; this $60 × 15 is probably how to get below the famous $1000 barrier.
    However, what matters is the (coverage, error rate) pair - this graph doesn't seem to express that - it should be normalized to some final error rate.
    Indeed, the USB-stick-like Oxford MinION is cool, but it's far from being economically feasible for sequencing a human.

    By PacBio reading once, I didn't mean coverage, but just reading the strand once - which is an issue, as single-molecule sequencing has a lot of noise (~15% and more).
    Nanopore has this "2D sequencing": read a strand then its complementary strand - kind of reading the strand twice, and so reducing noise.

    This StarLight concept is from Life, but it is far from IonTorrent (454 with current instead of light): they have a DNA strand in solution and attach multiple polymerases with qdots, which are excited and transfer energy via FRET to one of 4 dyes distinguishing the bases - finally they see multiple sparklings in the solution ("starlight") - from up to 30 polymerases on a single strand ... of length up to 250k (!)

  24. #21
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by fgiesen View Post

    My opinion (personal, not RAD's or whatever) is that the most interesting future work is in more specialized compressors, along various axes, and allowing more varied (and more useful) methods of access. Purely sequential decoding is not a great place to be in. Compressing gigabytes of unspecified data as one giant solid blob: really not all that interesting outside of an archiver or a Hutter prize attempt (to me anyway).

    I got more data than I know how to deal with already. I'd prefer accessing and organizing that data to get less painful over time, not more. I want some compression (because storage is never enough), but I also want indexing and better ways to retrieve data. Really, I want all of the above: data that is compact, indexed, fast to access the parts I care about, and really fast to search to identify those parts.

    Previous attempts to make storage more database-y have failed, in part for performance reasons, but I hope that maybe with ever faster random-access storage access, we're not that far from the point (or might have crossed it already) where we can actually make this happen for most of the data I personally handle ("document-y" things). That's a lot more exciting to me than just squeezing an extra % off file sizes anyway.
    I agree about the potential for specialized compression. Actually, I'm surprised that after all these years, we still don't have even a first attempt at a web-specific compressor, something that is HTML-aware, CSS-aware, etc. Have you seen the XPack XML compressor? (PDF) It's pretty slick with how it has a specific module for compressing URLs, which seems pretty obvious – HTML files have lots of URLs, and we should think about how best to compress them, since most aren't exactly repeated. And it has a module for attributes, which also seems sensible, and one for content, and others.

    More to your point, it supports search and other operations on the compressed file. See their Table 1 for some of the results.

    A good web format would provide the DOM and the AST up front, or make them readily scannable. There's no reason why a browser should have to do any work to construct a DOM – we should just tell it what the DOM is.

    On the Chromium "bug" thread about adding brotli support, there was one guy who kept asking about web-specific dictionaries, like for CSS and HTML, but no one responded to him for some reason. I'd want to replace long strings like the meta viewports, the FB/Twitter metadata boilerplate and lots of other stuff with symbols of less than one byte. One limitation of brotli is that its dictionary is based on the past, even though it will only occupy the future. If you have dictionary entries designed around modern and anticipated standards and keywords, people will pivot toward that dictionary – it will influence what goes into HTML, CSS, and JS files. That's better than scanning dusty corpora. Of course, we shouldn't be inflating and parsing text files anyway – it's a huge waste of energy, time, and bandwidth.
    Last edited by SolidComp; 17th August 2016 at 16:16.

  25. #22
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by SolidComp View Post
    I'm surprised that after all these years, we still don't have even a first attempt at a web-specific compressor, something that is HTML-aware, CSS-aware, etc.
    XWRT (XML-WRT) is a specialized XML/HTML/text compressor: https://github.com/inikep/XWRT. The first version is already 10 years old.

  26. Thanks:

    SolidComp (17th August 2016)

  27. #23
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    As demonstrated in the thread "Transfer-Encoding & Data Compression", the bandwidth saving from using a content-encoding other than gzip for text is negligible.
    And most importantly, the (growing) global internet traffic is due to videos, images, advertising, ad tracking, ... but not text.

  28. #24
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    752
    Thanks
    216
    Thanked 283 Times in 165 Posts
    Quote Originally Posted by dnd View Post
    As demonstrated in the thread "Transfer-Encoding & Data Compression", the bandwidth saving from using a content-encoding other than gzip for text is negligible.
    I think there was no general agreement about that, and I'd like to think it is not true. I have heard about a 10 % median latency reduction measured from a brotli deployment.

    http://httparchive.org/trends.php gives us more details on the composition of modern webpages.

  29. #25
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    I think there was no general agreement about that, and I'd like to think it is not true. I have heard about a 10 % median latency reduction measured from a brotli deployment.

    http://httparchive.org/trends.php gives us more details on the composition of modern webpages.
    Thanks Jyrki, this is a very helpful resource. Did you notice the big drop in Google API usage starting in March, 2016? (There's a graph for this if you scroll down). This puzzles me. They also switched the user agent from IE 9 to Chrome in March. Why would Chrome make fewer calls to Google APIs than IE 9? Actually, the way they measure it is binary – whether a site calls googleapis.com or not. I wonder if there's a difference in caching behavior that would explain the graph.

  30. #26
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    239
    Thanks
    95
    Thanked 47 Times in 31 Posts
    Quote Originally Posted by inikep View Post
    XWRT (XML-WRT) is a specialized XML/HTML/text compressor: https://github.com/inikep/XWRT. The first version is already 10 years old
    Przemyslaw, this is excellent work. What is the decompression speed for -l2? Have you talked to Jyrki, Google, Mozilla, et al. about getting XWRT into browsers? It might help to have energy use data like this (PDF).

  31. #27
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by SolidComp View Post
    What is the decomp speed for -l2? Have you talked to Jyrki, Google, Mozilla, et al about getting XWRT into browsers? I
    It's not so easy to change existing standards. But XWRT+brotli is still better than brotli with its static dictionary:
    http://encode.su/threads/2353-improving-brotli

  32. #28
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    24
    Thanks
    7
    Thanked 10 Times in 8 Posts
    Had some thoughts on Mermaid's format design. The tokens are somewhat similar to those of LZ5: it's O_MMMM_LLL, where O is whether to reuse the most recent offset (or otherwise to read the offset from the stream), MMMM is the match length, and LLL is the length of literals. Because there is no implied minimum match length, the fast case is (flagbyte >= 24), or in other words (flagbyte >= 0_0011_000), which means that the minimum match length is 3, unless the O flag is set, in which case there is no minimum match length. Mermaid differs from LZ4 and LZ5 in that the lengths do not overflow into extra bytes: it operates preferably in terms of short literals and relatively short matches. To encode a long match or a long run of literals, it uses values (flagbyte < 24). So the penalty for a longer string is a branch misprediction, which is about 20 cycles; the gains must then be lucrative (those longer strings must be not just a bit longer, but quite a bit longer). The worst case would be that you pay those 20 cycles to copy 8 literals, just because the length does not fit into LLL. That would be a bad deal. Indeed, Mermaid requires long literal runs to be at least 64 bytes long. So the literal runs in the range 8..63 are not directly representable! How is this even possible? Unlike matches, which you may skip if you don't like one, literals cannot be skipped at will.
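
    To make that layout concrete, here is how I read the packet byte (my own reconstruction from the description above, not Oodle's actual decoder; the field semantics may well differ in detail):

    #include <stdint.h>

    /* Hypothetical O_MMMM_LLL packet byte as inferred above (not the official
       Mermaid spec): O = reuse the most recent offset, MMMM = match length,
       LLL = literal run length. Values below 24 are escapes for long
       literal runs or long matches. */
    typedef struct {
        int reuse_offset; /* O: 1 = reuse last offset, 0 = offset follows in the stream */
        int match_len;    /* MMMM */
        int lit_len;      /* LLL: 0..7 literals preceding the match */
    } Packet;

    static int parse_packet(uint8_t b, Packet *p)
    {
        if (b < 24)
            return 0;                  /* escape: long literal run or long match */
        p->reuse_offset = b >> 7;
        p->match_len    = (b >> 3) & 15;
        p->lit_len      = b & 7;
        return 1;                      /* fast path */
    }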

    This brings me to an earlier point that when the reuse-offset flag is set, there is no minimum match length. At first I thought it was a mistake. Why would one want to use match length 0? A better design would simply exclude such a possibility at the level of format. I guess the minimum match with the reuse-offset flag should be 2. The token then could be MMMM_O_LLL, and the fast case would be (flagbyte >=40), which reads exactly "min length 2 with the O flag or 3 otherwise" (0010_1_000 resp. 0011_0_000). This leaves more values to encode long lengths.

    But wait, some literal lengths are not representable. This means the encoder can be forced to spill literals into an extra token 0000_1_LLL. That's not a great deal either, but the penalty can be less than a branch misprediction.

    It is not at all obvious why such a format can deliver better compression or faster decompression.

  33. Thanks:

    introspec (31st January 2019)

  34. #29
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    68
    Thanks
    26
    Thanked 0 Times in 0 Posts
    svpv,
    in RLLZ the string length code is output only one time, and a "length of literals" field is not needed.

    Has anyone implemented RLLZ already? I stopped coding in 2010.

    https://encode.su/threads/3013-Reduc...put-LZ77-codes?
    Last edited by compgt; 31st January 2019 at 12:39.

  35. #30
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    The combined match+literal run is called a "packet" in Mermaid/Selkie. We can add extra packets with match length 0 to handle literal runs longer than 7 bytes (but less than escape length), and likewise for matches.

    It turns out to be far more valuable (at Mermaid/Selkie's space/speed design point) to avoid mispredicted branches than it is to be squeamish about extra packets. Non-escape packets are fast, and branch mispredictions from escapes *really* hurt. The thresholds for escapes are as high as they are for a reason.
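
    (So, for instance, a literal run of 20 bytes, below the long-literal escape threshold, would simply be emitted as a few consecutive packets with match length 0, e.g. 7+7+6 literals, rather than taking the escape path.)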

  36. Thanks:

    introspec (1st February 2019)

