
Thread: lzma submodel shares / redundancy measurement

  1. #1 Shelwien (Administrator)

    lzma submodel shares / redundancy measurement

    Any ideas for files where "rep" coding could make more sense?

    Code:
    enwik8, 100M bytes, compressed file length = 24560059 
     - 13 bytes lzma header - 1 byte rc pad = 24560045
    
         Submodel  log2_entropy   rc_output
    ------------------------------------------
        c_IsRepG2:     1879.37      1879.37
        c_IsRepG1:     5340.18      5340.18
        c_IsRepG0:     9346.86      9346.91
     c_IsRep0Long:    14287.23     14287.22
          c_IsRep:   149490.93    149497.97
        c_SpecPos:   356311.28    356311.19
       c_Literal2:   415920.15    415922.79
     c_LenChoice2:   445556.96    445556.61
        c_LenHigh:   589577.69    589580.39
         c_LenLow:   768648.29    768648.04
      c_LenChoice:   809017.17    809016.98
         c_LenMid:  1019553.67   1019553.34
        c_IsMatch:  1117219.48   1117219.17
       c_Literal1:  1543005.86   1543010.79
          c_Align:  3115007.86   3115007.06
        c_PosSlot:  4310701.82   4310706.69
      direct bits:  9889156.75   9889156.78
    ------------------------------------------
            Total: 24560021.55  24560041.48
    Code:
    icl.exe, 3128224 bytes, compressed file length = 996740
    
         Submodel  log2_entropy   rc_output
    ----------------------------------------
     c_IsRep0Long:      257.64       257.63
        c_IsRepG2:     1361.74      1361.74
        c_IsRepG1:     3656.54      3656.54
        c_IsRepG0:     5445.74      5445.74
     c_LenChoice2:     7072.96      7072.96
         c_LenMid:    12381.42     12381.42
        c_LenHigh:    13681.88     13682.00
          c_IsRep:    24667.16     24667.10
      c_LenChoice:    26850.71     26850.68
        c_SpecPos:    27668.61     27668.60
       c_Literal2:    57264.92     57264.93
        c_IsMatch:    72904.86     72904.86
         c_LenLow:    89370.24     89370.24
          c_Align:    96537.44     96537.42
        c_PosSlot:   148082.87    148083.39
      direct bits:   153976.38    153976.38
       c_Literal1:   255540.85    255540.80
    ----------------------------------------
            Total:   996721.96    996722.44

  2. #2 Fu Siyuan (Member)
    Sorry, what do you mean by ""rep" coding could make more sense"?

  3. #3 Shelwien (Administrator)
    I just posted bytes per submodel for real lzma samples, and as you can see, 4 of the 5 rep-related submodels are kinda useless.
    (lzma remembers the 4 last match offsets in vars rep0..rep3 and encodes some flags to reuse these
    instead of coding the actual match distance; rep0long is also an alternative literal coding).
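
    For reference, these flags form a decision tree roughly like this (a sketch in the spirit of the lzma decoder, with contexts, state and rep-queue updates omitted; each BIT() is one adaptively-coded flag):
    Code:
    if( BIT(c_IsMatch)==0 ) {
      DecodeLiteral();                           // c_Literal1 / c_Literal2
    } else if( BIT(c_IsRep)==0 ) {
      DecodeLen(); DecodeDist();                 // normal match, dist becomes the new rep0
    } else if( BIT(c_IsRepG0)==0 ) {
      if( BIT(c_IsRep0Long)==0 ) CopyByte(rep0); // "rep0long": single byte from distance rep0
      else DecodeLen();                          // match at distance rep0
    } else if( BIT(c_IsRepG1)==0 ) { UseRep(1); DecodeLen(); }
      else if( BIT(c_IsRepG2)==0 ) { UseRep(2); DecodeLen(); }
      else                         { UseRep(3); DecodeLen(); }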

    But from what I see, it's much more important that nearly half of the compressed enwik8 data is in fact uncompressed
    ("direct bits" are distance bits which are encoded with probability 0.5, without any modelling).

    Anyway, I need advice on other filetypes to check out, which may have different entropy distributions,
    specifically a higher load in the rep* submodels.

  4. #4 Member (CA USA)
    > Anyway, I need advice on other filetypes to check out, which may have different entropy distributions,
    > specifically a higher load in the rep* submodels.

    You might try the repetitive text corpus, which might have different statistics than flat files.
    Repetitive texts are (by construction) amenable and sensitive to LZ matching choices.

  5. #5 Sebastian (Member)
    I have no idea about LZMA internals, but remembering the last n offsets helps mostly on strictly binary files. Try acrord32.exe.
    Also, it's plain useless on text and could even hurt.

    edit: also, "model load" is not very interesting. More interesting is the relative amount of distances that get replaced by reps: it's about 0.4% on book1, 4% on enwik8 and ~15% on acrord32.exe, according to my LZ77 model.

  6. #6 Shelwien (Administrator)
    > Repetitive texts are (by construction) amenable and sensitive to LZ matching choices.

    Thanks, but I want something widely available.
    The question is whether we can discard these submodels for speed, or whether they're actually necessary somewhere.
    Also I don't really care about bmp and wav; they need different handling anyway.

    > I have no idea about LZMA internals but remembering the last n offsets helps mostly on strictly binary files. Try acrord32.exe

    I tried a different .exe, but whatever... Here's what it shows for acrord32:
    Code:
        Submodel  log2_entropy   rc_output
    -----------------------------------------
       c_IsRepG2:     1119.93     1119.93
       c_IsRepG1:     2699.04     2699.05
    c_LenChoice2:     6160.97     6160.97
       c_IsRepG0:     6885.46     6885.57
    c_IsRep0Long:    11165.32    11165.33
       c_LenHigh:    11554.21    11554.28
        c_LenMid:    12182.39    12182.39
     c_LenChoice:    25887.69    25887.73
       c_SpecPos:    35270.09    35270.08
         c_IsRep:    35767.31    35767.30
         c_Align:    97838.61    97838.58
      c_Literal2:   110549.25   110549.34
       c_IsMatch:   113864.48   113864.54
        c_LenLow:   116106.02   116106.10
       c_PosSlot:   168798.71   168799.07
     direct bits:   195839.25   195839.25
      c_Literal1:   448120.29   448120.52
    -----------------------------------------
            Total:  1399809.01  1399810.02
    Same thing basically... anything else?

  7. #7 Sebastian (Member)
    Look at my edit; coding costs are imo useless for estimating the amount of savings from this specific submodel. If you had a runlength model and it compressed your 500k zero-string to 1 byte, would you say it's useless?

  8. #8 Shelwien (Administrator)
    > coding costs are imo useless for estimating the amount of savings from this specific submodel

    Well, I have a different opinion.

    > if you had a runlength model and it compressed your 500k zero-string to 1 byte, would you say it's useless?

    Yes, I would, because a precise order-0 model would do as well as or better than RLE in such a case
    (not w.r.t. speed, but that's another matter).

    Well, sure, you could be right in some cases, but here we don't have a smart model
    that could predict reps with high probability, so they're still encoded with a few bits
    on average, and so the submodel share does mean something.

    Also, I actually implemented this measurement to look for submodels to optimize first;
    _all_ counters in lzma are:
    Code:
        // P = probability of bit==0, scaled to SCALE = 1<<SCALElog (11 by default)
        if( flag ) {
          P -= (P >> 5);
        } else {
          P += ((SCALE-P) >> 5);
        }
    And from that point of view, the low-load submodels certainly aren't worth attention.
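
    In other words (my reading of the update above), both branches are one and the same exponential moving average, up to rounding:
    Code:
    P' = P + (target - P)/32,  target = (flag==0) ? SCALE : 0
    so P tracks p(bit==0) with a fixed weight of 2^-5 per update, no matter how few bits a submodel actually codes.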

  9. #9 Sebastian (Member)
    >Well, I have a different opinion.
    How would you be able to tell how good a submodel is just by looking at its log2-entropy?

    >Yes, I would, because a precise order-0 model would do as well as or better than RLE in such a case
    That was just an example to clarify my point.

    >but here we don't have a smart model that could predict reps with high probability
    Imo it's hard to predict reps because they depend on the current path through the matches, which gets optimized on the fly.
    But I don't know how LZMA does it; I only remember that I didn't bother with it.

    The savings are relatively small there, you're right, but try disabling reps on acrord32 and see how much you lose.

    >And from that point of view, the low-load submodels certainly aren't worth attention.
    Sure, most bits are spent on distance coding, and from your point of view you can see where to spend time first, but believe me, reps are worth the 1% you can save on some files.

  10. #10 Shelwien (Administrator)
    > How would you be able to tell how good a submodel is just by looking at its log2-entropy?

    With a single black-box model, sure, I wouldn't know anything.
    But here I know the data type and the model implementation, and have results from different
    models for comparison, so I think I can make reasonable assumptions.

    Sure, you're right that the correct way is to disable a submodel and check what happens,
    but I can't do that here for certain reasons.

    > reps are worth the 1% you can save on some files.

    rep0, yes (see c_IsRep), but there're also rep1-rep3 and rep0long; I'm not saying
    they're useless, I'm asking for a file where they could be really helpful.

    Anyway, here I added bit counts (n = n0+n1, n0, n1); does it change anything?
    Code:
        Submodel log2_entropy   rc_output         n        n0        n1
    --------------------------------------------------------------------
       c_IsRepG2:     1879.37     1879.37     15444      9212      6232
       c_IsRepG1:     5340.18     5340.18     47285     31841     15444
       c_IsRepG0:     9346.86     9346.91    302266    254981     47285
    c_IsRep0Long:    14287.23    14287.22    254981     33770    221211
         c_IsRep:   149490.93   149497.97   7194915   6892649    302266
       c_SpecPos:   356311.28   356311.19   2828456   1436057   1392399
      c_Literal2:   415920.15   415922.79   7574598   4125744   3448854
    c_LenChoice2:   445556.96   445556.61   4560629   3407744   1152885
       c_LenHigh:   589577.69   589580.39   9223080   7101537   2121543
        c_LenLow:   768648.29   768648.04   7801548   2744842   5056706
     c_LenChoice:   809017.17   809016.98   7161145   2600516   4560629
        c_LenMid:  1019553.67  1019553.34  10223232   4509811   5713421
       c_IsMatch:  1117219.48  1117219.17  11011166   3816251   7194915
      c_Literal1:  1543005.86  1543010.79  22955410  12323212  10632198
         c_Align:  3115007.86  3115007.06  24699276  12364200  12335076
       c_PosSlot:  4310701.82  4310706.69  41355894  21048170  20307724
     direct bits:  9889156.75  9889156.78  79113254  39990542  39122712
    --------------------------------------------------------------------
            Total: 24560021.55 24560041.48 236322579 122691079 113631500
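
    Btw, the log2_entropy / n0 / n1 columns come from bookkeeping of roughly this kind (a sketch, not the actual tool; the hook point and names are assumptions, and the rc_output column presumably differs only by actual coder rounding):
    Code:
    #include <cmath>
    #include <cstdint>

    enum { SCALElog = 11, SCALE = 1<<SCALElog };

    struct SubmodelStats {
      double bits;       // sum of -log2(p(coded bit)) over this submodel
      uint64_t n0, n1;   // bit counts; n = n0+n1
    };

    // called for every binary flag right before the rc call;
    // P/SCALE is the counter's current probability of bit==0
    void Account( SubmodelStats& s, unsigned P, unsigned bit ) {
      double p0 = double(P)/SCALE;
      s.bits += -std::log2( bit ? 1-p0 : p0 );
      ++( bit ? s.n1 : s.n0 );
    }
    // log2_entropy column = s.bits/8, in bytes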

  11. #11 Shelwien (Administrator)

    I've got some interesting results about lzma recompression:
    Code:
            lzma        ccmx 5   paq8px-6  ppmd-o2  09g2a   fpaq0mw
    id    1297570.82   1342043   1289363   1319879  1310547 1457638
    lit   3632355.36   2460211   2322619   2437335  2381820 2703777
    len   1958933.58   3760672   3644006
    dist 17671181.72  19295724  17764194
         24560041.48
    
    1289363+2322619+1958933.58+17671181.72 = 23242097.3  // best id + best lit + lzma len + lzma dist
    
    (1-23242097.3/24560041.48)*100 = 5.366%
    This is computed from the previously posted stats for enwik8.
    I dumped the raw lzma codes (id is the "operation type": literal/match/rep0-3/rep0long),
    and it appears that lzma is quite inefficient at least at literal coding
    (though paq8 compressed the ids better as well).

    Unfortunately, universal compressors are pretty bad at number coding, so the
    possible improvement for len/dist couldn't be estimated (though it certainly
    can be reached, at least by tuning the lzma models).
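
    The dump itself can be as simple as splitting the parsed token stream into per-field files and compressing each separately (a sketch; the Token/NextToken interface is hypothetical and the field encodings are illustrative):
    Code:
    #include <cstdio>
    #include <cstdint>

    enum Id { LIT, MATCH, REP0, REP1, REP2, REP3, REP0LONG };
    struct Token { Id id; uint8_t sym; uint32_t len, dist; };
    bool NextToken( Token& t ); // assumed: yields decoded lzma tokens (parser not shown)

    int main() {
      FILE* f_id   = fopen( "enwik8.id",   "wb" );
      FILE* f_lit  = fopen( "enwik8.lit",  "wb" );
      FILE* f_len  = fopen( "enwik8.len",  "wb" );
      FILE* f_dist = fopen( "enwik8.dist", "wb" );
      for( Token t; NextToken(t); ) {
        fputc( t.id, f_id );                        // "operation type" stream
        if( t.id==LIT ) fputc( t.sym, f_lit );      // literal bytes
        else if( t.id!=REP0LONG ) {                 // rep0long has implicit len=1
          fwrite( &t.len, 4,1, f_len );
          if( t.id==MATCH ) fwrite( &t.dist, 4,1, f_dist ); // reps reuse old distances
        }
      }
      fclose(f_id); fclose(f_lit); fclose(f_len); fclose(f_dist);
    }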

  12. #12 Sebastian (Member)
    >I dumped the raw lzma codes

    so "x<match>xy" is coded as seperate streams and one is "xxy" ?

    That would result in a different encoding, because the contexts for "xy" are the last bytes of the preceding match in LZMA imo (at least that's how I did it).
    But there's additionally order-1 modelling with the byte following the last match.

  13. #13 Shelwien (Administrator)
    There's also stuff like length values used as context for distances, etc.
    But still, it's a fact: literals from an lzma stream encoded with -lc8 (i.e. order-1) can be compressed that much better without extra contexts.

    I suppose there was a concern about lzma speed, because in a very unlikely "worst case" it can turn into
    a very complex (and thus slow) bitwise order-1 coder (i.e. 8 rc calls per byte).

    ...I mean, in a case like my enwik8.lzma, with only 3816251 literals per 100M of data, even a very strong model
    there won't affect the speed much; but there may be different cases, with a much larger literal share.

  14. #14 Sebastian (Member)
    >I suppose there was a concern about lzma speed, because in very unlikely "worst case"

    Isn't every literal processed as 8 single-bit values? At least that was the impression I got while looking at the source. Maybe I got something wrong.

  15. #15 Shelwien (Administrator)
    I mean that it should be possible to generate/find a file where all bytes would be encoded as literals (there would be redundancy,
    but no matches), which would likely be the worst case for lzma decoding speed.

    btw, the lzma literal model works like this:
    Code:
      uint pbMask = (1<<pb)-1, lpMask = (1<<lp)-1, lc8 = 8-lc;
      psym = previous uncompressed byte
    [...]
      // a literal right after a match is coded against the byte at distance rep0,
      // until the decoded prefix diverges from it ("matched" literal, c_Literal2)
      if( state>=kNumLitStates ) {
        uint matchbyte = 0x100 + dic[rep0pos()]; // first byte at rep0, not next after match!
        for( sym=1; sym<0x100; ) {
          uint mbprefix = (matchbyte<<=1) >> 8;  // prefix of the match byte, incl. leading 1
          sym += sym + BIT(c_Literal2[filepos&lpMask][psym>>lc8][mbprefix&1][sym]);
          if( mbprefix!=sym ) break;             // diverged -> fall back to plain coding
        }
        }
      } else sym=1;
      for(; sym<0x100; sym+=sym+BIT(c_Literal1[filepos&lpMask][psym>>lc8][sym]) );
    which is what I meant by "a very complex (and thus slow) bitwise order-1 coder".

  16. #16 Sebastian (Member)
    >I mean that it should be possible to generate/find a file where all bytes would be encoded as literals
    On "geo", my optimizer did eliminate nearly all matches.

    >the lzma literal model works like this

    Ok, maybe I had LZMA1; is this from LZMA2? And "sym" is encoded as one rc call? Looks strange, but I had to use plain order-2 to beat it.

  17. #17 Shelwien (Administrator)
    LZMA2 = LZMA1 + small format changes (block headers) for multithreading support;
    it even calls the same LZMA1 function for decoding.
    And what I posted is LZMA1, just not the original source.

    As to rc calls, it's 8 per symbol, as I said: BIT() fetches the probability from the counter,
    reads the bit from the rc with it, and updates the counter.
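
    Decoder-side, such a BIT() would look something like this (a sketch assembled from the snippets in this thread, not the actual source):
    Code:
    uint BIT( uint& P ) {                    // P = counter, p(bit==0) = P/SCALE
      rc_Renorm();
      uint rnew = (range>>SCALElog) * P;     // size of the 0-interval
      uint bit  = (code >= rnew);
      range = bit ? (code-=rnew, range-rnew) : rnew;
      if( bit ) P -= (P >> 5);               // same counter update as posted above
      else      P += ((SCALE-P) >> 5);
      return bit;
    }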

  18. #18 Sebastian (Member)
    Ok, I got you wrong; you just meant that it's possible that every symbol gets encoded as a literal, thus forcing 8 rc calls per symbol.

  19. #19 Shelwien (Administrator)
    best lzma result for geo:
    Code:
        Submodel log2_entropy   rc_output         n        n0        n1
    --------------------------------------------------------------------
       c_IsMatch:      534.35      534.45     83340     58931     24409
      c_Literal2:     5972.49     5972.52     79896     49757     30139
      c_Literal1:    39860.01    39860.07    391552    233859    157693
         c_IsRep:      285.97      285.97     24409       303     24106
       c_IsRepG0:      520.27      520.30     24106     22974      1132
    c_IsRep0Long:     2152.62     2152.62     22974     10655     12319
       c_IsRepG1:      119.07      119.07      1132       715       417
       c_IsRepG2:       50.53       50.53       417       267       150
     c_LenChoice:      146.10      146.12     13754     13586       168
        c_LenLow:      379.94      380.00     40758     40124       634
    c_LenChoice2:       20.48       20.48       168        75        93
        c_LenMid:       25.71       25.71       225       105       120
       c_LenHigh:       52.12       52.12       744       511       233
       c_PosSlot:      165.61      165.61      1818       937       881
       c_SpecPos:       11.56       11.56        95        22        73
     direct bits:      222.00      222.00      1776       858       918
         c_Align:       68.49       68.49       964       226       738
    --------------------------------------------------------------------
            Total:    50587.33    50587.62    688128    433905    254223

  20. #20 Sebastian (Member)
    LZMA always surprised me with its performance on geo. Try hitting 50k without using sparse contexts or something like that; the order-2 mix from the other thread gets around 54k.
    I suspect it deinterleaves the file.

  21. #21 Shelwien (Administrator)
    > I suspect it deinterleaves the file.

    More or less; there are position masks, and that result (50605) was obtained with lc0 lp2 pb2, which
    is basically order-0 with an ofs%4 context. With lp0 pb0 it shows 54-55k too.

  22. #22 Shelwien (Administrator)
    Changing counter precision in lzma:
    Code:
    SCALElog icl_lzma e8_lzma 
    11       996740   24557177 // default
    12       996292   24545514
    13       996158   24544642
    14       996143   24547568
    15       996166   24551325
    Also, it's interesting that commenting out that unnecessary line (implied in the lzma code)
    Code:
      // decoder for the "direct bits" (p=0.5, no modelling)
      uint rc_Bits( uint l ) { if( DECODE==0 ) return 0;
        uint x=0; do {
          rc_Renorm();
    //      range &= ~1;
          uint rnew = (range>>1) * 1;
          uint bit = code >= rnew;
          range = bit ? code-=rnew,range-rnew : rnew;
          x += x + bit; 
        } while( --l!=0 );
        return x;
      }
    doesn't really affect the output size; what's more, the arithmetic code somehow syncs back
    to the version with that range cutting, so only small spots are actually different.

    Another unfortunate result is that increasing the range-update precision doesn't help either:
    Code:
    (range >> SCALElog) * P -> (qword(range)*(P<<(32-SCALElog)))>>32
    24544642  -> 24544655
    24547568  -> 24547754
    But the less precise version actually gives a part of 0's interval to 1, which is probably the reason.
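
    To spell that out (my interpretation): for the same range and P,
    Code:
    uint r0_lo = (range >> SCALElog) * P;                       // drops low bits of range first
    uint r0_hi = (qword(range) * (P << (32-SCALElog))) >> 32;   // ~ (range*P)>>SCALElog
    // floor(range/2^11)*P <= floor(range*P/2^11), so r0_lo <= r0_hi:
    // the less precise split always hands a part of the 0-interval to the
    // 1 branch, which apparently acts as a slightly useful bias here.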
