
Thread: Optimal Preprocessing/Parsing for LZ?

  1. #61 (cbloom)
    I implemented Bulat's idea for parsing with statistics moving forward.

    That is, when you arrive at position "pos" you know the best way to arrive there, so you go back to the source of the best arrival and bring the statistics forward from there.
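    The forward-arrival idea can be sketched with a toy cost model (the prices and the single hypothetical match here are made up for illustration; this is not LZA's real coder). Each position stores its cheapest arrival: the cost and the position it came from, which is the link you trace back to bring statistics forward.

    ```c
    /* Toy forward-arrival parse: each slot holds the cheapest way to get there. */
    #include <assert.h>
    #include <limits.h>

    #define N 8

    typedef struct { int cost; int from; } Arrival;

    int main(void)
    {
        Arrival arr[N + 1];
        arr[0].cost = 0; arr[0].from = -1;
        for (int i = 1; i <= N; i++) { arr[i].cost = INT_MAX; arr[i].from = -1; }

        for (int pos = 0; pos < N; pos++) {
            if (arr[pos].cost == INT_MAX) continue;
            /* literal step: 9 bits (hypothetical flat price) */
            if (arr[pos].cost + 9 < arr[pos + 1].cost) {
                arr[pos + 1].cost = arr[pos].cost + 9;
                arr[pos + 1].from = pos;
            }
            /* one hypothetical match of length 4 at pos 2, priced at 14 bits */
            if (pos == 2 && arr[pos].cost + 14 < arr[pos + 4].cost) {
                arr[pos + 4].cost = arr[pos].cost + 14;
                arr[pos + 4].from = pos;
            }
        }
        /* "bring the statistics forward": the best arrival at 6 points back to
           its source (2), so the model state is carried along that edge */
        assert(arr[6].from == 2);
        assert(arr[6].cost == 2 * 9 + 14); /* two literals, then the match */
        return 0;
    }
    ```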

    My LZA :

    24700820 uncompressed
    9267278 statistics carried forward (Bulat) (LZMA-style parse)
    9315794 statistics updated only on 16k chunk boundaries (LZMA-style parse)

    for reference :
    9290222 old chain-N optimal parse at lowest level
    9431777 7zip high profile
    Last edited by cbloom; 3rd August 2016 at 21:36.

  2. Thanks (3): Bulat Ziganshin (13th January 2015), Cyan (26th January 2015), wgarvin (4th February 2015)

  3. #62 (cbloom)
    I just found a nice compromise.

    For reference :

    9267278 statistics carried forward (Bulat) (LZMA-style parse)
    9315794 statistics updated only on 16k chunk boundaries (LZMA-style parse)

    with 4k chunks : 9306334

    But it occurred to me that instead of just cutting at 4k to do a statistics update, I could let that boundary be flexible. Instead I wait to cut until there are no decisions in the parse. This occurs at either a literal (no matches cross that location) or a long enough match (because fast-bytes still forces taking long matches).

    with 4k flexible cut : 9299690

    But the real advantage of the flexible cut is that there's no penalty for making the update interval smaller. In fact you can make it as small as 1 byte!

    So the procedure is :

    set prices from current statistics
    parse ahead Bulat/LZMA style using current prices
    stop when the parse hits a no-decision point
    reverse the parse & output codes, updating statistics

    with small flexible cut : 9276018

    In practice I think you want a tiny minimum parse-ahead distance, because there is a little bit of cpu overhead in switching modes between parsing and outputting, so you don't want to actually do it on every byte when you're in a chunk of literals.
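    The flexible cut can be illustrated with a toy: scan forward and cut at the first position that no match crosses. The (start, len) match list here is hypothetical, not real match-finder output.

    ```c
    /* Toy "flexible cut": find the first position no match spans. */
    #include <assert.h>

    typedef struct { int start, len; } Match;

    /* returns first position > from that no match in m[] crosses */
    static int flexible_cut(int from, const Match *m, int nm, int end)
    {
        for (int pos = from + 1; pos < end; pos++) {
            int crossed = 0;
            for (int i = 0; i < nm; i++)
                if (m[i].start < pos && m[i].start + m[i].len > pos)
                    crossed = 1;
            if (!crossed)
                return pos;  /* no decision spans this point: safe to flush */
        }
        return end;
    }

    int main(void)
    {
        Match m[] = { {0, 3}, {2, 4}, {8, 5} };
        /* matches cover [0,3) and [2,6): first uncrossed position is 6 */
        assert(flexible_cut(0, m, 3, 16) == 6);
        /* from 6, nothing crosses position 7 (the next match starts at 8) */
        assert(flexible_cut(6, m, 3, 16) == 7);
        return 0;
    }
    ```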
    Last edited by cbloom; 3rd August 2016 at 21:35.

  4. Thanks (2): Bulat Ziganshin (13th January 2015), Cyan (13th January 2015)

  5. #63 (Bulat Ziganshin)
    AFAIK, LZMA breaks out of the parsing loop and goes to the encoding loop once it finds a match longer than fast_bytes:


    if (mainLen >= p->numFastBytes)
    {
        *backRes = matches[numPairs - 1] + LZMA_NUM_REPS;
        MovePos(p, mainLen - 1);
        return mainLen;
    }

    And there are lots of similar ifs. Also, I thought that LZMA's maximum block size is #define kNumOpts (1 << 12)
    Last edited by Bulat Ziganshin; 13th January 2015 at 20:55.

  6. #64
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    A few more notes.

    I tried enhancing this parse by doing a "chain-N" reparse on the forward output step.

    So the process is :

    LZMA-style parse forward to best arrival costs
    reverse parse
    change "cost from start" to "cost to end"
    reparse forward with chain-N instead of just outputting the previously found parse.


    It didn't work. I'm not entirely sure why. I had to increase N (of chain-N) to 4 before I got any improvement over the previous result, and by then it's getting quite slow.

    My suspicion is that the LZMA-style parse gives optimal costs from the start but not optimal costs to the end, which is what chain-N needs. I'm not sure.

    Anyway I thought of something better which is next...
    Last edited by cbloom; 3rd August 2016 at 21:35.

  7. #65 (cbloom)
    There's an improvement which scales nicely and fits within the framework of this parse.

    Instead of just storing the 1 cheapest way to arrive at each position, store N ways to arrive. Just storing the N cheapest ways to arrive seems to work pretty well (*).

    Then when you're costing outputs from the position, instead of just using the one arrival state, try all N arrivals.

    One of the arrivals that is not cheapest might wind up being a cheaper way to get into the future, because it has more useful repeat matches or something.

    Basically this lets the parse be less greedy. The original Bulat/LZMA parse is greedy in that you can never take a more expensive step in order to find a cheaper future step. This way gives you a little chance to do that.

    At each position you need N times the storage for the parse (what match did you choose, what position did you come from). You also have to store which of the N arrivals you came from. Then when you walk backwards, you trace back through the N different choices at each position.

    You can think of the position being augmented with a decimal place like {pos.arrival} ; then you just reverse those links and walk back through it forward to output, as before.

    The speed is not N times worse because most of the work at each position can be reused. (the normal matches are the same for all arrivals, it's just the repeat matches that change)
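    A minimal sketch of keeping the N cheapest arrivals per position (the field names and insertion routine here are hypothetical, not the real parser's):

    ```c
    /* Keep the N cheapest arrivals per position, sorted cheapest-first.
       Each entry also records which source position and which of its
       arrivals it came from, for the backward trace. */
    #include <assert.h>
    #include <limits.h>

    #define NARRIVALS 2

    typedef struct { int cost; int from_pos; int from_arrival; } Arrival;

    static void try_arrival(Arrival *slot, int cost, int from_pos, int from_arrival)
    {
        for (int i = 0; i < NARRIVALS; i++) {
            if (cost < slot[i].cost) {
                for (int j = NARRIVALS - 1; j > i; j--)  /* shift worse ones down */
                    slot[j] = slot[j - 1];
                slot[i].cost = cost;
                slot[i].from_pos = from_pos;
                slot[i].from_arrival = from_arrival;
                return;
            }
        }
        /* more expensive than all N kept arrivals: dropped */
    }

    int main(void)
    {
        Arrival slot[NARRIVALS] = { { INT_MAX, -1, -1 }, { INT_MAX, -1, -1 } };
        try_arrival(slot, 40, 3, 0);
        try_arrival(slot, 35, 5, 1);  /* cheaper: becomes slot[0] */
        try_arrival(slot, 50, 7, 0);  /* too expensive: dropped */
        assert(slot[0].cost == 35 && slot[0].from_pos == 5);
        assert(slot[1].cost == 40 && slot[1].from_pos == 3);
        return 0;
    }
    ```

    The second-cheapest arrival (cost 40 here) is exactly the kind of entry that can win later if it carries a more useful repeat offset.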

    One note :

    with this style of parse, the chunks can no longer be arbitrarily small.

    The reason is that when you flush a chunk, you make a definite decision about which path through the arrivals you take. The multiple possible paths collapse down to one chosen path, you update the statistics, and at the end of the chunk you have only 1 arrival to start from.

    I found the optimal flush length to be around 256, which balances keeping the statistics up to date vs. keeping your parse choice open for a while.
    Last edited by cbloom; 3rd August 2016 at 21:34.

  8. Thanks (3): Bulat Ziganshin (23rd January 2015), Cyan (26th January 2015), Turtle (6th October 2015)

  9. #66 (cbloom)
    Results :

    Fez_Essentials.pak original 17784477

    lzma best : 10319525

    my old best parse :
    backward-forward parse with chain-4 : 9739717

    Bulat/LZMA forward parse, 1 best arrival : 9780036
    Bulat/LZMA forward parse, 4 arrivals : 9512780

    Woot!
    Last edited by cbloom; 3rd August 2016 at 21:34.

  10. Thanks (3): comp1 (23rd January 2015), Cyan (26th January 2015), ne0n (23rd January 2015)

  11. #67 (cbloom)
    I just had a look in the LZMA code and want to write down what I saw.

    (I could be wrong; it's hard to follow!)

    The way execution flows in LZMA is pretty weird; it jumps around :

    The outer loop is for the encoding position.
    At each encoding position, he asks for the result of the optimal parse ahead.
    The optimal parse ahead caches results found in some sliding window ahead.
    If the current pos is in the window, return it, else fill a new window.

    To fill a window :
    find matches at current position.
    set window_len = longest match length
    fill costs going forward as described by Bulat above
    as you go forward, if a match length crosses past window_len, extend the window
    if window reaches "fast bytes" then break out
    when you reach the end of the window, reverse the parse and fill the cache
    now you can return the parse result at the base position of the window


    So, some notes :

    1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse.
    The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

    2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh.

    3. LZMA only stores 1 arrival. As I noted previously, this has the drawback of missing out on some non-greedy steps. *However* - and this is something I finally appreciate - LZMA also considers some multi-operation steps.

    LZMA considers the price to go forward using :
    literal
    rep matches
    normal matches

    If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

    literal + rep0 match
    rep + literal + rep0
    normal match + literal + rep0

    These are sort of redundant in that they can be made from the things we already considered, but they help with the 1-arrival problem. In particular they let you carry forward a useful offset that might not be on the cheapest arrival otherwise.

    ( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos )

    That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.
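    A toy numeric illustration of why the combined step matters, with made-up prices:

    ```c
    /* Hypothetical prices (bits). A pricier literal that keeps rep0 alive,
       followed by a cheap rep0 match, can beat the per-position-cheapest
       path. With only 1 stored arrival, the parse keeps the 10-bit arrival
       at pos+1 and the better path is lost unless "literal + rep0" is
       priced explicitly as one multi-operation step. */
    #include <assert.h>

    int main(void)
    {
        /* path A: cheapest single steps - a 10-bit match into pos+1,
           then a 30-bit normal match for the next 4 bytes */
        int path_a = 10 + 30;
        /* path B: a 12-bit literal into pos+1 that preserves the rep0
           offset, then a 9-bit rep0 match for the same 4 bytes */
        int path_b = 12 + 9;
        assert(path_b < path_a);  /* the gap match wins overall */
        return 0;
    }
    ```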
    Last edited by cbloom; 3rd August 2016 at 21:33.

  12. #68 (nburns)
    I'm not sure if this is what you're already talking about, but the idea occurred to me to treat the match offsets as relative to a cursor that remembers its last position; you could then use d.p. to find the optimal sequence of cursor moves (assuming that the offset bits vary by distance) and that could/should be able to pick up on and properly incentivize gap matches. The d.p. table size might be prohibitive, though.

    The d.p. table would store the optimal cost for every pair of (length, cursor position). That's O(n^2) elements, of course, which doesn't scale. But if it works, it would be a starting point.

    I think you might be describing the same thing, basically, pared down to be more practical.

    Edit:
    It occurred to me (a bit late), this doesn't necessarily directly solve the gap-match problem. It would chip away at it, somewhat indirectly, though, to the extent that the gap-match problem is a special case of the larger problem of saving space by capitalizing on match-source locality. E.g., it would create an incentive to fill the gap using match(es) that didn't stray too far from the ideal gapped match -- in other words, to look for 3-for-1 deals.

    I'm not claiming this is original or a particularly great idea. I guess I'd just like to gauge where you are at relative to this.
    Last edited by nburns; 10th February 2015 at 19:43.

  13. #69 (cbloom)
    A little more looking at LZMA.

    1. LZMA can store a direct arrival or a multi-step arrival. This is the prev2 business.

    (I was trying to figure out how to implement a multi-step arrival, like considering match+literal+rep ; the problem is if you try to store that up ahead at parse[match_len+1+rep_len] using only one arrival, then there might not be a path back that takes you through those states. eg. the cheapest state at parse[match_len+1] might come from somewhere else by the time you do the backwards trace. The solution is to explicitly store the multi-step arrival in a separate channel so that you can get it back.)
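    The separate-channel idea can be sketched like this (field names are hypothetical, not LZMA's actual prev2 structures):

    ```c
    /* Each node can record a direct arrival and, separately, a multi-step
       arrival (e.g. match + literal + rep0), so the backward trace can
       recover the intermediate operations instead of losing them when a
       cheaper direct arrival overwrites the slot. */
    #include <assert.h>

    typedef struct {
        int price;
        int prev_pos;   /* direct arrival: which position we came from */
        int has_prev2;  /* is the current best arrival multi-step? */
        int prev2_pos;  /* origin of the multi-step path, stored separately */
    } Node;

    int main(void)
    {
        Node n = { 0 };
        /* direct arrival from position 10 */
        n.price = 57; n.prev_pos = 10; n.has_prev2 = 0;
        /* later, a cheaper match+literal+rep0 path arrives; record its
           origin in the separate channel so the trace-back can pass
           through the intermediate literal and rep0 states */
        n.price = 52; n.has_prev2 = 1; n.prev2_pos = 6;
        assert(n.has_prev2 && n.prev2_pos == 6 && n.price == 52);
        return 0;
    }
    ```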

    2. LZMA defers a lot of the work of figuring out the arrival state until you actually arrive there. It doesn't update the markov history or reps or various things. They get figured out at the arrival. This is a speed win because you only arrive once but you might fill the same state many times.

    3. LZMA fills all lengths. That is, if it finds a match of length match_len, it also fills the arrivals at match_len-1, match_len-2, etc. I've done a lot of testing that indicates this is not necessary; there's very little compression gain from doing all lengths. However, if your code cost is well optimized there's also very little speed penalty to doing all lengths.
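    Filling all lengths looks roughly like this toy (flat made-up price, single position; real pricing varies per length):

    ```c
    /* A found match of length L also proposes arrivals at every
       truncated length down to the minimum match length. */
    #include <assert.h>
    #include <limits.h>

    #define MAXPOS 16
    #define MIN_MATCH 2

    int main(void)
    {
        int cost[MAXPOS];
        for (int i = 0; i < MAXPOS; i++) cost[i] = INT_MAX;
        cost[0] = 0;

        int pos = 0, match_len = 5, price = 14;  /* hypothetical flat price */
        for (int len = match_len; len >= MIN_MATCH; len--)
            if (cost[pos] + price < cost[pos + len])
                cost[pos + len] = cost[pos] + price;

        /* every truncation of the length-5 match got an arrival */
        assert(cost[5] == 14 && cost[4] == 14 && cost[3] == 14 && cost[2] == 14);
        return 0;
    }
    ```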
    Last edited by cbloom; 3rd August 2016 at 21:33.


