Thread: LZMA markup tool

  #1 - Shelwien (Administrator)

    LZMA markup tool

    http://nishi.dreamhosters.com/u/lzma_markup_v0.rar

    I made a utility for visualizing lzma's data parsing;
    the result looks like this:
    http://nishi.dreamhosters.com/u/lzma_book1__a0_a1.png
    (left: lzma output for book1 in -a0 mode (279k); right: -a1 (260k))

    Normal matches are green, literals are gray, and background changes
    show the element edges.
    Also, rep0 matches are blue, rep1-rep3 matches are red, and rep0long
    literals are white.
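
    For reference, here is a minimal sketch of how a dump-to-html converter
    can map token types to the colors above. The enum and function names are
    hypothetical, not taken from the actual tool's source (which is in the archive):

    Code:
// Hypothetical token kinds for a parsed lzma stream dump.
enum TokenKind { Literal, Match, Rep0Long, Rep0, Rep1, Rep2, Rep3 };

// Background color per token kind, matching the legend above:
// normal matches green, literals gray, rep0 blue, rep1-3 red, rep0long white.
const char* TokenColor(TokenKind k) {
  switch (k) {
    case Match:    return "#00c000"; // normal match - green
    case Literal:  return "#c0c0c0"; // literal - gray
    case Rep0:     return "#4040ff"; // rep0 match - blue
    case Rep1:
    case Rep2:
    case Rep3:     return "#ff4040"; // rep1-rep3 match - red
    case Rep0Long: return "#ffffff"; // rep0long literal - white
  }
  return "#000000";
}

// Each token is then emitted as e.g.
//   <span style="background:COLOR">text</span>
// so a background change marks the element edge.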

  #2 - Piotr Tarsa (Member)

    Wonderful!

    But the colors are too dark for my taste :P It would be better if you used external CSS.

    Also, you could add some scripts to highlight the source match when pointing at the destination match with the mouse pointer. And when clicking a match, the highlight could be permanent until clicking on another match (then there will be another permanent highlight) or on no match (then there would be no permanent highlights).

    Update:
    You could also add a box displaying the distance of the highlighted match together with its cost.
    Last edited by Piotr Tarsa; 17th May 2011 at 13:33.

  #3 - Shelwien (Administrator)

    > Wonderful!

    Thanks

    > But the colors are too dark for my taste :P
    > It would be better if you used external CSS.

    The source for the dump-to-html converter is in the archive,
    so you can fix that.

    > Also, you could add some scripts to highlight the source match when
    > pointing at the destination match with the mouse pointer. And when
    > clicking a match, the highlight could be permanent until clicking on
    > another match (then there will be another permanent highlight) or on
    > no match (then there would be no permanent highlights).

    The main problem is that it's hard for browsers to render the markup
    even as it is. Also, most of the matches near the end of the text are
    relatively long-distance ones... you won't see the target in most cases.

    > You could also add a box displaying the distance of the highlighted
    > match together with its cost.

    That's more realistic. I guess I can generate the text lines as tables with 2 rows
    and add the match info above the actual line.

  #4 - Shelwien (Administrator)

    LZMA rep-codes removal experiment

    http://nishi.dreamhosters.com/u/lzma_delrep_v0.rar

    lzma includes a popular trick where recent match distances
    are saved and can be reused instead of being coded in full.
    To be specific, it remembers the last 4 distances, which can
    be reused for matches, and, in addition, a 1-byte match
    can be encoded using the rank-0 saved distance.
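
    For clarity, here is a minimal sketch of how such a 4-slot distance history
    is typically maintained in an lzma-style coder (the names are mine, not taken
    from the lzma source; the actual code stores distance-1 internally):

    Code:
// The four saved distances, most recent first (rep0..rep3).
unsigned reps[4] = {1, 1, 1, 1};

// A normal match with a fully coded distance pushes out the oldest rep.
void OnFullMatch(unsigned dist) {
  reps[3] = reps[2]; reps[2] = reps[1]; reps[1] = reps[0];
  reps[0] = dist;
}

// A rep match reuses reps[i] and moves that distance to rank 0.
void OnRepMatch(int i) {
  unsigned d = reps[i];
  for (int j = i; j > 0; j--) reps[j] = reps[j - 1];
  reps[0] = d;
}

// A 1-byte rep0 match (a "rep0 literal") just copies one byte from
// distance reps[0] and leaves the history unchanged.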

    Now, the "delrep" utility here is able to replace some
    or all of the rep-coded matches with full ones (it works
    with lzmarec's stream dumps).
    Btw, this delrep actually decodes the original file
    from .rec dumps (it's required to handle the 1-byte rep0 matches),
    so it may be useful for some other purposes - the archive
    includes the source.

    Code:
    0: do nothing
    1: remove rep0 literals
    2: remove rep1-3 matches
    3: remove rep1-3 matches and rep0 literals
    4: remove rep0-3 matches and rep0 literals
    
    +r columns contain lzmarec-4b results
    
       enwik8           MIPS              enwik8+r MIPS+r  
    0  24557177         7954481           24318978 7889109 
    1  24573200         7980682           24332512 7912665 
    2  24621542         8160415           24379312 8064485 
    3  24636837 +0.324% 8186018  +2.910%  24392531 8086898 
    4  25021760 +1.891% 9230753 +16.044%  24739163 8910968

  #5 - Shelwien (Administrator)

    http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar

    Tried forcibly restoring rep-codes after removing them.
    For enwik8 it looks kinda like this:

    Code:
    opt=08 dictsize=08000000 filesize=100000000
    Count the tokens in .rec: 11042877
    Load the tokens into an array. Done
    Compute backward rep sets. Done
    Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
    Recompute rep-codes: 173979 litR0, 222219 rep0, 51422 rep1-3
    Store the results. Done
    But compression actually got worse (24557177->2460537), despite the fact that
    some distances were discarded and many literals were encoded with 4 bits instead of 9.
    It seems the lzma encoder really takes the entropy coding of these things into account -
    all literals are encoded in the context of the rep0 byte, so explicit rep0 literals are rarely helpful.

    So, I tried to only restore previously existing rep0 literals:

    Code:
    Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
    Recompute rep-codes: 34547 litR0, 222219 rep0, 51422 rep1-3
    That was still no good (24557177->24558297).

    Then I tried to change match distances to the values used in _future_ matches
    (of course, only when the referenced strings match too):

    Code:
    Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
    Recompute rep-codes: 34496 litR0, 226143 rep0, 58052 rep1-3
    And this time it was an improvement (24557177->24556587);
    it clearly found quite a few extra hits.
    But unfortunately this gain is far from universal.

    I'm still wondering how to properly do parsing optimization
    with repeat codes...
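
    For reference, the basic "recompute rep-codes" pass above is conceptually
    just a forward scan over the token stream; a rough sketch follows (the token
    layout is made up for illustration, it's not the actual .rec format, and the
    "future matches" trick is not included):

    Code:
#include <vector>

// Simplified token; not the actual .rec layout.
struct Token { bool is_match; unsigned len, dist; int rep; }; // rep: -1 = full distance

// Forward pass: turn a plain match back into a rep-coded one whenever its
// distance equals one of the 4 most recently used distances.
void RecomputeReps(std::vector<Token>& toks) {
  unsigned reps[4] = {1, 1, 1, 1};
  for (Token& t : toks) {
    if (!t.is_match) continue;          // (1-byte rep0 literals omitted here)
    t.rep = -1;
    for (int i = 0; i < 4; i++)
      if (t.dist == reps[i]) { t.rep = i; break; }
    if (t.rep >= 0) {                   // rep match: move its distance to rank 0
      unsigned d = reps[t.rep];
      for (int j = t.rep; j > 0; j--) reps[j] = reps[j - 1];
      reps[0] = d;
    } else {                            // full match: push the new distance
      reps[3] = reps[2]; reps[2] = reps[1]; reps[1] = reps[0];
      reps[0] = t.dist;
    }
  }
}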

  #6 - Shelwien (Administrator)

    I made a deflate-to-lzma converter to estimate the potential of
    direct deflate recompression.

    http://nishi.dreamhosters.com/u/defl2rec_v0.rar

    Code:
               deflate   lzma  plzma
    book1.raw   312896 307867 303167 -3.10% // gzip -9 book1
    book1a.raw  299739 298310 293844 -1.96% // 7z -tgzip -mx=9 book1
    wcc386.raw  313715 297747 287960 -8.20% // gzip -9 wcc386
    wcc386a.raw 303048 295316 286573 -5.43% // 7z -tgzip -mx=9 wcc386
    The gain seems to be surprisingly small - comparable to lzmarec's gain on lzma.
    At least it's more noticeable on binaries, where lzma's alignment tricks help.
    Also note that this doesn't include block info - for gzip files it can be
    reproduced (like precomp does), but not for 7z's deflate streams.
    Maybe it's better to skip this step and just go for full recompression.

    Btw, here's a comparison of lzma vs deflate parsing:
    http://nishi.dreamhosters.com/u/lzma_defl.png
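
    Conceptually the conversion is just a token remapping, since lzma's match
    ranges are a superset of deflate's; a minimal sketch under assumed token
    structs (these are not the actual defl2rec structures):

    Code:
// Assumed token structs for illustration; not the actual defl2rec formats.
struct DeflToken { bool is_match; unsigned char lit; unsigned len, dist; }; // len 3..258, dist 1..32768
struct LzmaToken { bool is_match; unsigned char lit; unsigned len, dist; }; // len 2..273, dist up to dictsize

// Deflate's literal/match stream maps 1:1 onto lzma tokens; no token ever
// needs to be split or re-parsed, only re-entropy-coded.
LzmaToken Convert(const DeflToken& t) {
  LzmaToken o;
  o.is_match = t.is_match;
  o.lit  = t.lit;
  o.len  = t.len;   // 3..258 fits inside lzma's 2..273
  o.dist = t.dist;  // 1..32768 fits easily in lzma's distance range
  return o;
}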

  #7 - Piotr Tarsa (Member)

    Surprisingly small? Try compressing those files with LZMA with a 32 KiB window and then draw conclusions.

  #8 - Shelwien (Administrator)

    > Try compressing those files with LZMA with a 32 KiB window and then draw conclusions.

    You missed the point. I meant that there are different recompression methods for deflate:
    1. Entropy coding replacement (like in lzmarec)
    2. Extracting the data and compressing the deflate structures in the context of the uncompressed data
    3. Extracting the data and bruteforcing zlib options to match the compressed data (like in precomp)

    Here (1) is the simplest and fastest (for me, anyway).
    But I have to compete with precomp, where any compression method can be used
    (precomp doesn't work on 3 of my 4 .raw files, though).

    Anyway, I added delrep to the defl2rec script to make use of lzma's repeat codes,
    and I also tested plain plzma compression with a 32k window and a 1M window.

    Code:
               deflate   defl2rec_v0     plzma_32k      plzma_1M
    book1.raw   312896  303106 -3.12% 289463  -7.48% 258151 -17.49%
    book1a.raw  299739  293759 -1.99% 289463  -3.42% 258151 -13.87%
    wcc386.raw  313715  284628 -9.27% 270962 -13.62% 268030 -14.56%
    wcc386a.raw 303048  282752 -6.69% 270962 -10.58% 268030 -11.55%
    See? The "surprisingly small" gain is about 3x smaller than the (potentially)
    possible result with precomp+plzma, and on top of that precomp allows using PPM/CM,
    while deflate-to-lzma conversion doesn't.

  #9 - willvarfar (Member)

    In the vein of the original topic, I've found the browser exceedingly badly adapted to showing this kind of markup.

    [Attached image: lzma_book1__a0_a1.png]

    And I struggled to use the most basic data controls to do it too: http://stackoverflow.com/questions/2...text-in-python
    [Attached image: statsviewer01d.png]

    Heatmaps are a generic tool and would be useful to many people dabbling in compression.

    A utility program that can display a (compressed) markup file (that could be very large), or stream it in over a pipe, would be excellent.

    It could use various zoom settings so you can zoom out and see something pretty like fv or zoom in and see extra stats and details.



    Just saying, in case anyone can contribute...
    Last edited by willvarfar; 21st June 2011 at 09:15.

  #10 - Shelwien (Administrator)

    I don't see what your problem is with using a browser for this.
    Sure, we'd need to split the map if it gets too large, but for files like book1 modern browsers can still render it.
    Anyway, inventing a new markup language just because html is a little slow seems unreasonable.
    If you really want it to scroll smoothly, I guess there's the option of printing that html, or saving it as an image
    via some capture tool.
    I probably could make a fast viewer for these maps, but what's the point if I only need them like once a year?..

  #11 - Piotr Tarsa (Member)

    Maybe you could directly output an 8-bit TGA file? That shouldn't be hard.
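
    For what it's worth, an uncompressed 8-bit grayscale TGA is just an 18-byte
    header followed by raw pixels, so a writer is only a few lines; a minimal
    sketch (the function name and layout are my own, just to illustrate the format):

    Code:
#include <stdio.h>

// Write an uncompressed 8-bit grayscale TGA (image type 3, 18-byte header).
void WriteTga8(const char* path, const unsigned char* pix, int w, int h) {
  unsigned char hdr[18] = {0};
  hdr[2]  = 3;                                    // image type: uncompressed grayscale
  hdr[12] = w & 0xFF;  hdr[13] = (w >> 8) & 0xFF; // width, little-endian
  hdr[14] = h & 0xFF;  hdr[15] = (h >> 8) & 0xFF; // height, little-endian
  hdr[16] = 8;                                    // bits per pixel
  hdr[17] = 0x20;                                 // top-left origin
  FILE* f = fopen(path, "wb");
  if (!f) return;
  fwrite(hdr, 1, sizeof(hdr), f);
  fwrite(pix, 1, (size_t)w * h, f);
  fclose(f);
}
    (Note that the width and height fields are 16-bit, so a single TGA tops out
    at 65535 pixels per side.)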

  #12 - Shelwien (Administrator)

    book1 has 16623 lines with ~60 symbols per line.
    Suppose we used an 8x16 font; then it's
    16623*60*8*16 = 127664640 bytes (grayscale/palette)
    16623*60*8*16*3 = 382993920 bytes (RGB)
    So it'd hit 4G already at 10x the size of book1 - no enwik8 markup for you anyway.

    Also, as I said, you can already capture the browser's output as an image, as it is.

  #13 - Paul W. (Member)

    I just stumbled onto this old thread...

    It looks like the site hosting the lzma markup tool is now dead.

    Does this exist somewhere else now? (I couldn't google it up.) It seems like a very cool visualization tool.

  #14 - Mike (Member)

    Quote Originally Posted by Paul W.:
    I just stumbled onto this old thread...

    It looks like the site hosting the lzma markup tool is now dead.

    Does this exist somewhere else now? (I couldn't google it up.) It seems like a very cool visualization tool.
    A copy of Eugene's tools from this thread is in the attachment.
    Attached Files
