
Thread: Microsoft’s Project Zipline

  1. #1
    Member (Planet Earth)

    Microsoft’s Project Zipline

    "Microsoft’s Project Zipline compression algorithm yields dramatically better results, up to 2X high compression ratios versus the commonly used Zlib-L4 64KB model."

    "We are open sourcing Project Zipline compression algorithms, hardware design specifications, and Verilog source code for register transfer language (RTL) with initial content available today and more coming soon."

    https://azure.microsoft.com/en-us/bl...t-cloud-scale/

    The GitHub link gives a 404 here at the moment:
    https://github.com/opencomputeproject/Project-Zipline

    News:
    https://techcrunch.com/2019/03/14/zi...for-the-cloud/

  2. Thanks (4):

    Bulat Ziganshin (15th March 2019), Cyan (15th March 2019), jibz (15th March 2019), Shelwien (14th March 2019)

  3. #2
    Administrator Shelwien

  4. #3
    Member FunkyBob
    Am I missing something?

    This same article is cut'n'pasted across dozens of pointless "news outlet" sites, but I've yet to see anything resembling a description of the algorithm.

    Not to mention, what exactly do they mean by "the Zlib-L4 64KB model"?

    A 2x claim means little without a lot more detail, as we all know: data types, variances, block sizes, encoding and decoding speeds... The fact it's a hardware implementation is intriguing, but it also skews any speed measurements.


  5. Thanks:

    Hakan Abbas (15th March 2019)

  6. #4
    Administrator Shelwien
    They probably mean this:
    [Attached image: qa_dc.png]

  7. #5
    Administrator Shelwien
    It appeared.
    https://github.com/opencomputeprojec...cification.pdf

    "XP10 is a lossless compression data format developed by Microsoft."

    It seems to be LZ77 + Huffman with a window of up to 16M and predefined Huffman tables / dictionaries.

    Don't get confused by the weird terminology:
    a "predefined prefix" is a dictionary, and an "MTF match" is a rep match :)

    No C source atm, and I suspect it might never appear, since it seems we're supposed to compile something from the provided Verilog sources
    via https://www.synopsys.com/verificatio...ation/vcs.html
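
    For reference, a minimal sketch of what such a rep-match / "MTF match" offset queue usually looks like; the queue depth and all names here are my own illustration, not taken from the XP10 spec:
    Code:
    #include <stdint.h>

    #define NREPS 4  /* illustrative queue depth */

    typedef struct { uint32_t off[NREPS]; } rep_queue;

    /* "MTF match": reuse the offset in slot i and move it to the front. */
    static uint32_t rep_use(rep_queue *q, int i)
    {
        uint32_t off = q->off[i];
        for (; i > 0; i--)
            q->off[i] = q->off[i - 1];
        q->off[0] = off;
        return off;
    }

    /* A regular match with a brand-new offset pushes everything back one slot. */
    static void rep_push(rep_queue *q, uint32_t off)
    {
        for (int i = NREPS - 1; i > 0; i--)
            q->off[i] = q->off[i - 1];
        q->off[0] = off;
    }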

  8. Thanks (2):

    Cyan (16th March 2019), Hakan Abbas (18th March 2019)

  9. #6
    Member (Selo Bliny-S'edeny)
    They use multisymbol Huffman alphabet that absorbs short match lengths:
    * 0..255 are literals;
    * 256..256+15 are rep0 matches with length 0..14 and >= 15;
    * 256+16..256+16+15 are rep1 matches with length 0..14 and >= 15, etc.
    704 "short symbols" in total (p.26).

    This should be good for decoding speed, because short matches are decoded in a single step, and the extra decoding steps for longer matches must amortize well anyway (per output byte). On the other hand, there is no length-offset correlation, so it's not a strong contender in high compression. The format otherwise seems quite simple, with no complications like in brotli; it looks almost like a semester project.
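
    As a reading aid, here's a tiny sketch of that symbol layout in C. 256 + 28*16 = 704 matches the count above; the meaning of groups beyond rep0/rep1 and the escape handling are my guesses, not from the spec:
    Code:
    typedef struct {
        int is_literal;
        int literal;   /* 0..255, valid when is_literal is set */
        int group;     /* match kind: 0 = rep0, 1 = rep1, ... (28 groups) */
        int len;       /* length code 0..14, or -1 for ">= 15, extra bits follow" */
    } token;

    /* Map one Huffman symbol (0..703) to a literal or a match kind + length. */
    token decode_symbol(int sym)
    {
        token t = {0};
        if (sym < 256) {              /* 0..255: plain literals */
            t.is_literal = 1;
            t.literal = sym;
        } else {                      /* 16 symbols per match group */
            int m = sym - 256;
            t.group = m / 16;
            int slot = m % 16;
            t.len = (slot < 15) ? slot : -1;  /* slot 15 = escape for len >= 15 */
        }
        return t;
    }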

  10. #7
    Administrator Shelwien
    I wonder how it is related to LZX/LZMS...

  11. #8
    Member (Switzerland)
    Quote Originally Posted by Shelwien View Post
    predefined huffman tables
    In brotli we had two great interns exploring predefined prefix codes back in the day. Neither effort was able to demonstrate significant (above 0.1%) savings, so I suspect that it is not easy to get substantial savings with predefined prefix codes. There might be a streaming advantage, however: one might get more payload data into the first segment.

  12. #9
    Administrator Shelwien
    > Neither effort was able to demonstrate significant (above 0.1 %) savings,
    > so I suspect that it is not easy to get substantial savings with predefined prefix codes.

    Unfortunately current zipline "source" doesn't contain any predefined tables.

    1) There are some options for "predefined codes".
    We could use predefined Huffman length tables, but store a symbol reorder table (even if it's not normally used); see the sketch after the stats below.
    It's also possible to invent several kinds of "parametric" codes.
    2) It likely only makes sense in combination with parsing optimization.
    3) kzip likes to create large (MBs) deflate blocks, which is equivalent to reusing the same table.
    4) A test case: base64 data; it appears quite frequently in web traffic and might benefit from a special static table.
    5) Even for 7z, deflate headers take 0.25% on average, so maybe a 0.1% gain is not bad? It's 1k per 1M of data.

    Code:
    [book1_gz5.raw]
    type=02 cbytes=56991 ubytes=135228 tokens=32767 hdrbits=571/455923/0.13%
    type=02 cbytes=58523 ubytes=140555 tokens=32767 hdrbits=597/468180/0.13%
    type=02 cbytes=58643 ubytes=141259 tokens=32767 hdrbits=569/469140/0.12%
    type=02 cbytes=59121 ubytes=143988 tokens=32767 hdrbits=605/472968/0.13%
    type=02 cbytes=58942 ubytes=143638 tokens=32767 hdrbits=569/471534/0.12%
    type=12 cbytes=26012 ubytes= 64103 tokens=14294 hdrbits=572/208089/0.27%
    
    [book1__kzip.raw]
    type=12 cbytes=109162 ubytes=276198 tokens=65536 hdrbits=636/873289/0.07%
    type=12 cbytes=112171 ubytes=288887 tokens=65536 hdrbits=0/897364/0.00%
    type=12 cbytes=78106  ubytes=203686 tokens=45293 hdrbits=0/624843/0.00%
    
    [book1_7z9.raw]
    type=02 cbytes= 3677 ubytes=  7705 tokens= 2977 hdrbits=423/29416/1.44%
    type=02 cbytes= 3244 ubytes=  7634 tokens= 1987 hdrbits=397/25945/1.53%
    type=02 cbytes= 6047 ubytes= 15216 tokens= 3412 hdrbits=465/48370/0.96%
    type=02 cbytes=11851 ubytes= 30375 tokens= 6436 hdrbits=428/94808/0.45%
    type=02 cbytes=23863 ubytes= 61262 tokens=12986 hdrbits=474/190897/0.25%
    type=02 cbytes=23813 ubytes= 61045 tokens=12947 hdrbits=476/190504/0.25%
    type=02 cbytes=24032 ubytes= 61174 tokens=13157 hdrbits=479/192250/0.25%
    type=02 cbytes=24204 ubytes= 61049 tokens=13218 hdrbits=490/193625/0.25%
    type=02 cbytes=23875 ubytes= 61169 tokens=12984 hdrbits=476/190995/0.25%
    type=02 cbytes=23729 ubytes= 61010 tokens=12968 hdrbits=467/189831/0.25%
    type=02 cbytes=23677 ubytes= 61090 tokens=12814 hdrbits=502/189414/0.27%
    type=02 cbytes=23609 ubytes= 60942 tokens=12910 hdrbits=490/188870/0.26%
    type=02 cbytes=23463 ubytes= 61142 tokens=12679 hdrbits=499/187703/0.27%
    type=02 cbytes=23610 ubytes= 61165 tokens=12796 hdrbits=509/188874/0.27%
    type=02 cbytes=23183 ubytes= 60943 tokens=12484 hdrbits=493/185457/0.27%
    type=12 cbytes=13861 ubytes= 35850 tokens= 7464 hdrbits=499/110882/0.45%
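
    To make option 1 concrete, here's a toy sketch: a fixed length staircase plus a per-block symbol reorder table, so the header only needs to store the permutation. The staircase and all names are invented for illustration; nothing here is from the Zipline spec.
    Code:
    #include <stdint.h>
    #include <stdlib.h>

    #define NSYM 288   /* deflate-sized literal/length alphabet */

    static uint8_t predef_len[NSYM];

    /* Rank i gets a length from a staircase: 1 code of length 4, 2 of
       length 5, 4 of length 6, ... The Kraft sum stays below 1, so a
       valid canonical prefix code exists for these lengths. */
    static void init_predef(void)
    {
        int i = 0;
        for (int len = 4; len <= 15 && i < NSYM; len++)
            for (int k = 0; k < (1 << (len - 4)) && i < NSYM; k++)
                predef_len[i++] = (uint8_t)len;
    }

    typedef struct { int sym; uint32_t freq; } rank_t;

    static int by_freq_desc(const void *a, const void *b)
    {
        const rank_t *x = a, *y = b;
        return (x->freq < y->freq) - (x->freq > y->freq);
    }

    /* The i-th most frequent symbol of the block gets the i-th predefined
       length; perm[] is the only thing the block header would store. */
    static void assign_lengths(const uint32_t freq[NSYM],
                               uint8_t len[NSYM], int perm[NSYM])
    {
        rank_t r[NSYM];
        for (int i = 0; i < NSYM; i++) { r[i].sym = i; r[i].freq = freq[i]; }
        qsort(r, NSYM, sizeof r[0], by_freq_desc);
        for (int i = 0; i < NSYM; i++) {
            perm[i] = r[i].sym;
            len[r[i].sym] = predef_len[i];
        }
    }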

  13. #10
    Member (Switzerland)
    Quote Originally Posted by Shelwien View Post
    4) A test case: base64 data; it appears quite frequently in web traffic and might benefit from a special static table.
    5) Even for 7z deflate headers take 0.25% on average, so maybe 0.1% gain is not bad? Its 1k per 1M of data.
    Simple codes like base64 compress well.

    Approximations in code create inefficiency quickly.

    ... And it is a lot of complexity for 0.1%; perhaps investing the complexity elsewhere would lead to better performance. Honestly, I was not shy of complexity in brotli's design, and still I decided to leave this out. Of course, I hope that they got it right and are able to get actual savings...

