
Thread: LZRW

  1. #1
     Cyan (Member)

    LZRW

    Hi

    I'm currently exercising my scarce neuronal connections on a fairly complex and interesting problem related to the 20-year-old LZRW (Ross Williams) algorithm.

    As most readers of this forum probably already know, LZRW differs from standard LZSS by giving the decoder a match position as an index into a reference table (usually a hash table, or a bucketed hash table).

    http://www.ross.net/compression/introduction.html

    It requires the decoder to build the same table, but is fairly fast and efficient on the compression side.

    The LZRW scheme features a "sweet spot" for compression ratio:
    the larger the table, the better its match-finding capability, hence a better compression ratio.
    However, the larger the table, the more bits are needed to describe a table slot to the decoder, hence a worse compression ratio.
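
    For concreteness, here is a minimal sketch (my own illustration, not Ross Williams' code) of the kind of inner loop this implies; TABLE_BITS, the 3-byte hash and the emit_* pseudo-calls are assumptions:

    Code:
    #include <stdint.h>
    #include <string.h>

    #define TABLE_BITS 14                    /* the parameter under discussion */
    #define TABLE_SIZE (1u << TABLE_BITS)

    /* Illustrative hash of the next 3 bytes (not Ross Williams' exact function). */
    static uint32_t hash3(const uint8_t *p)
    {
        uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
        return (v * 2654435761u) >> (32 - TABLE_BITS);
    }

    /* Core of the scheme: on a match, the encoder emits the slot index h
     * (TABLE_BITS bits) instead of an offset; the decoder maintains the same
     * table, so the index is enough to locate the match source.
     * A larger TABLE_BITS finds more matches, but every match token grows.    */
    static void lzrw_like_scan(const uint8_t *src, size_t len)
    {
        static const uint8_t *table[TABLE_SIZE];   /* zero-initialized          */
        const uint8_t *p = src;

        while (p + 3 <= src + len) {
            uint32_t h = hash3(p);
            const uint8_t *candidate = table[h];
            table[h] = p;                          /* decoder does the same update */

            if (candidate && memcmp(candidate, p, 3) == 0) {
                /* emit_match(h, length): the slot index costs TABLE_BITS bits */
                p += 3;                            /* a real coder extends the match */
            } else {
                /* emit_literal(*p) */
                p++;
            }
        }
    }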


    So the problem is this:
    while trying to find out which table size would be "optimal" for this scheme, I unsurprisingly concluded that the optimal size differs from one file to another.

    For example, with "enwik8", this optimal number is 14 bits.
    For "open office", it is 19 bits.
    For very small files, this number can go down to 10 or less.

    The question is: how to find this optimal size without brute-forcing all the possibilities?

    Obviously, building massive and slow statistics to solve this issue would also defeat the point, since LZRW algorithms are used for their speed.

    Not sure if this problem has a simple answer, though....

    (http://fastcompression.blogspot.com/...iams-lzrw.html)

  2. #2
     inikep (Programmer)
    Quote Originally Posted by Cyan View Post
    It requires the decoder to build the same table
    True, though not for LZRW1 and LZRW1-a, which are simple LZSS.

    Quote Originally Posted by Cyan View Post
    but is fairly fast and efficient on the compression side.
    It's not efficient for decompression (look at the QuickLZ results, which encodes an index into a hash table). I was not able to get even 300 MB/s with something similar to LZRW2+, but with LZSS I can reach decompression speeds up to 700 MB/s on a single core (look also at the decompression speed of Snappy or LZF).
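
    To illustrate where the decompression cost comes from, here is a rough sketch (my own, not QuickLZ or LZRW source) contrasting the two match-decoding paths; the token layout and the hash are made up:

    Code:
    #include <stdint.h>

    /* LZSS-style: the token carries an offset, so decoding a match is just a
     * copy from already-decoded output.  No table to maintain.                */
    static uint8_t *decode_match_lzss(uint8_t *dst, uint32_t offset, uint32_t len)
    {
        const uint8_t *src = dst - offset;
        while (len--)
            *dst++ = *src++;              /* byte-wise copy also handles overlaps */
        return dst;
    }

    /* LZRW2+-style: the token carries a hash-table slot, so the decoder has to
     * keep the same table as the encoder and update it for the data it outputs.
     * The extra hashing and table maintenance is what caps decompression speed. */
    static uint8_t *decode_match_lzrw(uint8_t *dst, uint32_t slot, uint32_t len,
                                      const uint8_t **table, unsigned table_bits)
    {
        const uint8_t *src = table[slot]; /* slot was sent using table_bits bits */
        uint8_t *start = dst;
        while (len--)
            *dst++ = *src++;
        /* mirror the encoder's table update (illustrative hash, not LZRW's own) */
        uint32_t v = (uint32_t)start[0] | ((uint32_t)start[1] << 8) | ((uint32_t)start[2] << 16);
        table[(v * 2654435761u) >> (32 - table_bits)] = start;
        return dst;
    }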

    In-memory test (compression and decompression) with ENWIK8 using 1 core of Athlon X4 2.8 GHz (32-bit compilation under gcc 4.5.2 (MinGW) -O3 -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math --param inline-unit-growth=999 -DNDEBUG):

    Code:
    lzjb 2010        = 842 ms (115 MB/s), 100000000 -> 68711273, 410 ms (238 MB/s)
    lzo 2.04 1x      = 964 ms (101 MB/s), 100000000 -> 53481944, 374 ms (261 MB/s)
    fastlz 0.1 -1    = 614 ms (159 MB/s), 100000000 -> 55239233, 390 ms (250 MB/s)
    fastlz 0.1 -2    = 723 ms (135 MB/s), 100000000 -> 54163013, 362 ms (269 MB/s)
    lzf 3.6 vf       = 738 ms (132 MB/s), 100000000 -> 53945381, 267 ms (365 MB/s)
    lzf 3.6 uf       = 667 ms (146 MB/s), 100000000 -> 57695415, 276 ms (353 MB/s)
    lzrw1            = 891 ms (109 MB/s), 100000000 -> 59669043, 438 ms (222 MB/s)
    lzrw1-a          = 830 ms (117 MB/s), 100000000 -> 59448006, 372 ms (262 MB/s)
    lzrw2            = 749 ms (130 MB/s), 100000000 -> 55312164, 464 ms (210 MB/s)
    lzrw3            = 722 ms (135 MB/s), 100000000 -> 52468327, 488 ms (200 MB/s)
    lzrw3-a          = 1574 ms (62 MB/s), 100000000 -> 47874203, 484 ms (201 MB/s)
    snappy 1.0       = 472 ms (206 MB/s), 100000000 -> 58350605, 199 ms (490 MB/s)
    quicklz 1.5.0 -1 = 429 ms (227 MB/s), 100000000 -> 52334371, 381 ms (256 MB/s)
    quicklz 1.5.0 -2 = 1140 ms (85 MB/s), 100000000 -> 45883075, 442 ms (220 MB/s)
    Quote Originally Posted by Cyan View Post
    while trying to find out which table size would be "optimal" for this scheme, I unsurprisingly concluded that the optimal size differs from one file to another.
    It's a common problem in compression, e.g. choosing the optimal order for PPM.

    Quote Originally Posted by Cyan View Post
    The question is: how to find this optimal size without brute-forcing all the possibilities?
    1. Take the size of the input data into account (bigger file = bigger hash size).
    2. You can brute-force only a part of the data (e.g. 10% or 100 KB). If the hash size is 2^13, try 2^12, 2^13, and 2^14, then check the results for 2^12 and 2^14 and keep moving in the direction that gives better compression (see the sketch below).
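
    A minimal sketch of point 2, assuming a hypothetical compress_sample() callback that compresses a fixed sample (e.g. the first 100 KB) with a given table size and returns the compressed length:

    Code:
    #include <stddef.h>

    /* Hypothetical callback supplied by the actual LZRW coder: compress the
     * sample with a 2^bits-slot table and return the compressed size.        */
    typedef size_t (*sample_fn)(const void *sample, size_t sample_len, int bits);

    /* Brute-force only a sample: start from a guess, look at its neighbours,
     * then keep walking in the direction that compresses better.             */
    int pick_table_bits(const void *sample, size_t sample_len, int guess,
                        int min_bits, int max_bits, sample_fn compress_sample)
    {
        int    best      = guess;
        size_t best_size = compress_sample(sample, sample_len, best);

        int dir = 0;                      /* which neighbour of the guess wins? */
        if (best > min_bits && compress_sample(sample, sample_len, best - 1) < best_size)
            dir = -1;
        else if (best < max_bits && compress_sample(sample, sample_len, best + 1) < best_size)
            dir = +1;

        while (dir != 0) {
            int next = best + dir;
            if (next < min_bits || next > max_bits)
                break;
            size_t size = compress_sample(sample, sample_len, next);
            if (size >= best_size)
                break;                    /* stop as soon as compression worsens */
            best      = next;
            best_size = size;
        }
        return best;
    }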
    Last edited by inikep; 30th March 2011 at 11:24.

  3. #3
     toffer (Programmer)
    There's no need to brute-force. You can use genetic algorithms, which I have always advertised. Depending on the "look" of the cost-function surface, i.e. the entropy H(x) as a function of your parameter x, a simple hill-climbing algorithm could be fine. Furthermore, I suggest the following:

    1. write a (simple) optimizer
    2. find the optimal parameter values x(L) for files of different sizes L
    3. inikep's observation (1.) is likely to be true, so you can fit a polynomial which predicts x(L)
    4. for further tuning on an unknown file of size L:
    4a. estimate x(L)
    4b. use a local search to run into a (local) minimum
    (a rough sketch of steps 2-4a follows below)
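
    A minimal sketch of steps 2-4a, under the assumption that a degree-1 polynomial in log2(L) is enough; x[i] are the optimal sizes measured offline on training files of length len[i], and all names are illustrative:

    Code:
    #include <math.h>
    #include <stddef.h>

    /* Steps 2/3: given optimal table sizes x[i] measured offline for training
     * files of length len[i], fit x ~ a + b*log2(len) by least squares.       */
    void fit_bits_vs_logsize(const size_t *len, const double *x, int n,
                             double *a, double *b)
    {
        double sl = 0, sx = 0, sll = 0, slx = 0;
        for (int i = 0; i < n; i++) {
            double l = log2((double)len[i]);
            sl  += l;
            sx  += x[i];
            sll += l * l;
            slx += l * x[i];
        }
        *b = (n * slx - sl * sx) / (n * sll - sl * sl);
        *a = (sx - *b * sl) / n;
    }

    /* Step 4a: the prediction is only the starting point for the local search. */
    int predict_table_bits(double a, double b, size_t file_len)
    {
        double x = a + b * log2((double)file_len);
        return (int)(x + 0.5);            /* round to the nearest integer size  */
    }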

    Hope that helps.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  4. #4
     Cyan (Member)
    Thanks for the suggestion.
    Unfortunately, it is not so simple: except for very small files (<1 MB), the optimal table size is not a function of the file length.

    To give an example:
    The optimal size for enwik8 (100 MiB) is 14 bits.
    The optimal size for enwik9 (1 GiB) is ... 14 bits.

    In between, we have open office (~400 MiB), for which the optimal size is 19 bits.
    And there are specific corner cases, such as audio files, which are hardly compressible by LZ and are better off with 9-10 bits, whatever their length.

    Also note that dynamically changing the table size requires something like a re-hash, which is quite costly, potentially destroys information (when shrinking) and costs memory. So it should not happen too often (doing it occasionally is okay).
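
    To make the re-hash cost explicit, here is a rough sketch (my own, assuming a flat table of positions and an illustrative hash); note that the decoder would have to perform the exact same resize at the exact same point in the stream:

    Code:
    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t hash_pos(const uint8_t *p, unsigned bits)
    {
        uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
        return (v * 2654435761u) >> (32 - bits);
    }

    /* Re-insert every stored position with the new hash width.  This is a full
     * extra pass over the table plus an allocation, and when shrinking, several
     * old slots map to the same new slot, so all but one entry are lost -- the
     * "information destructive" part mentioned above.                          */
    const uint8_t **rehash_table(const uint8_t **old_tab, unsigned old_bits,
                                 unsigned new_bits)
    {
        size_t new_size = (size_t)1 << new_bits;
        const uint8_t **new_tab = calloc(new_size, sizeof(*new_tab));
        if (!new_tab)
            return NULL;

        for (size_t i = 0; i < ((size_t)1 << old_bits); i++)
            if (old_tab[i])
                new_tab[hash_pos(old_tab[i], new_bits)] = old_tab[i];

        free((void *)old_tab);            /* cast only to silence const warnings */
        return new_tab;
    }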

    Finally, we could consider brute-forcing a sample data chunk to find the best settings, but:
    1) a short data chunk may not be representative of the file, especially at the beginning;
    2) a large data chunk costs time, and if we have enough time for this, we probably have enough time to compress with something other than LZRW.

    So I'm trying to find the right metric to track in order to trigger the decision to expand or shrink the table. It should be an easy one to follow (i.e. fast to compute).

    The reason I'm asking is that this problem may already have been explored, or maybe not. It simply wouldn't be smart to tackle it head-on without first checking for prior work.

    Regards
    Last edited by Cyan; 31st March 2011 at 10:52.

  5. #5
     toffer (Programmer)
    You possibly misunderstood. I wasn't talking about dynamically changing the table size. Is that what you want to do?

    Predicting x (note that a constant is a polynomial, too) isn't meant to give you an exact solution, just an estimate, which you can use as an initial value for further optimization. If you know that the cost function is convex, you can quickly find the global minimum. So did you try to figure out whether the cost function is convex? Other attributes might be more appropriate, e.g. the source entropy estimated with a simple method, or the distribution of match distances (which can be recorded for some big initial table size).

    Of course you can use a (carefully chosen) sample instead of the whole data to do the job. Based on your observations it should be rather large (< 20 MB?).

  6. #6
     Cyan (Member)
    Hi toffer.
    I'm not too familiar with these concepts.

    Regarding the convexity of the cost function, I can't answer, since I don't have a cost function.
    I can, however, try to correlate some metrics with the observed results.
    The problem is: which metrics should I calculate and follow?

    Match distance can indeed be a good idea. I'll try it.
    For instance, I'm already tracking match length (some correlation observed) and collision rate (no substantial correlation), as well as distance from the beginning (no correlation, unfortunately).
    Any idea for a metric that could correlate with the optimal LZRW table size is welcome.

    Then finding the optimal value will be a matter of trials and measurements. I'll take care of that part.


    I wasn't talking about dynamically changing the table size. Is that what you want to do?
    Indeed.
    This is how LZRW works: using more samples to increase the hit rate means a larger table;
    using fewer bits to describe the table slot means a smaller table.
    This is what's at stake in this exercise.

    I don't expect it to be possible to find the correct table size by just studying the first few KB of data. A 2-pass algorithm, or sampling the first 20 MB, is also out of the question, since it would kill performance and therefore make other algorithms more attractive.
    So the main idea is to change the size while compressing (and therefore while scanning the data), adapting on the fly to measured metrics. We don't need something hyper-reactive; a casual check and resize will be enough.
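
    A minimal sketch of the shape such a casual check could take; the per-block counters, the check interval, the miss-ratio metric and the thresholds are all placeholders (as noted above, collision rate alone didn't correlate well, so the metric itself is the open question):

    Code:
    #include <stdint.h>

    /* Hypothetical statistics accumulated for free since the last check.       */
    typedef struct {
        uint32_t matches;                 /* matches emitted                      */
        uint32_t collisions;              /* slot occupied but bytes didn't match */
    } block_stats_t;

    #define CHECK_INTERVAL 64             /* blocks between two "casual checks"   */

    /* Every CHECK_INTERVAL blocks, look at a cheap running metric and nudge the
     * table size by at most one bit; the actual resize/re-hash happens elsewhere. */
    int maybe_adjust_bits(const block_stats_t *s, uint32_t blocks_seen,
                          int cur_bits, int min_bits, int max_bits)
    {
        if (blocks_seen % CHECK_INTERVAL != 0)
            return cur_bits;

        uint32_t lookups = s->matches + s->collisions;
        if (lookups == 0)
            return cur_bits;

        double miss_ratio = (double)s->collisions / lookups;
        if (miss_ratio > 0.50 && cur_bits < max_bits)
            return cur_bits + 1;          /* many collisions: table may be too small */
        if (miss_ratio < 0.10 && cur_bits > min_bits)
            return cur_bits - 1;          /* slots rarely contested: try shrinking   */
        return cur_bits;
    }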

    (The actual resizing algorithm is a topic of its own, outside the scope of this thread.)

    Regards
    Last edited by Cyan; 31st March 2011 at 17:01.

  7. #7
     Sebastian (Member)
    Your cost function is simply the output size, to be minimized!

    Regarding LZRW, for me it's in the same class as LZP and ROLZ, so the same strategies apply here.

  8. #8
     willvarfar (Member)
    http://cbloomrants.blogspot.com/2008/10/10-10-08-2.html seems very useful and discusses LZSS.
    Last edited by willvarfar; 1st April 2011 at 11:40.

  9. #9
     Cyan (Member)
    it's in the same class as LZP and ROLZ, so the same strategies apply here.
    Interesting, could you please elaborate, Sebastian? Which strategies?


    @willvarfar: very interesting reading. It seems a bit outside the scope of the original question about the optimal LZRW hash size, but still very instructive.

  10. #10
     willvarfar (Member)
    It depends on whether you want fast decompression, fast compression, or both.

    Perhaps if you place your hashes in a tree, then you only need to encode enough bits to discriminate between them?

  11. #11
     toffer (Programmer)
    As Sebastian said, in your case the cost function is the entropy of your model (i.e. LZRW), H(x), as a function of some arbitrary parameter vector x, measured on some data set T. And you want to solve

    x^* = \arg\min_{x \in X} H(x)

    where x^* is the optimal solution and X is the whole set of feasible solutions (which should be a convex set, too), such that

    H(x^*) \le H(x) \quad \forall x \in X.

    Convexity can be defined like this: if you fix some values x and y, you can calculate H(x) and H(y). H(z), where z is any point on the line connecting x and y, is always "below" (or equal to) the linear interpolation of H:

    H(\lambda x + (1 - \lambda) y) \le \lambda H(x) + (1 - \lambda) H(y), \quad \lambda \in [0, 1].

    In particular, that means there is just a single optimal solution x^* (in X), and you can follow a path through X which decreases H. You estimate an initial value for your parameters, x_0. Then you find another value in the neighbourhood of x_0, call it x_1, and so on... If H is convex you can find a sequence such that

    H(x_0) \ge H(x_1) \ge H(x_2) \ge \dots \ge H(x^*).

    In your case finding such a sequence is easy, since you've just got a single parameter x, which is the log of your hash table size. The set of feasible values is X = {1, 2, 3, ..., n}. So you calculate an initial value x_0, e.g. x_0 = 10. The candidate for the next step might be x_1 = 11. If H(x_1) < H(x_0) you know that your next step needs to increase to reach x^*, otherwise you need to decrease. You may want to have a look at certain optimization methods to find out how to compute x_{k+1} based on x_k.

    Last edited by toffer; 2nd April 2011 at 15:15.

  12. #12
     Cyan (Member)
    place your hashes in a tree, then you only need to encode enough bits to discriminate between them
    @willvarfar: that sounds like a clever idea. Is there any implementation of this concept so far?
    CPU cycles may be quite stressed by this strategy, so the relative speed/gain trade-off remains to be tested.

    you want fast decompression, fast compression, or both
    Best case: I want to keep as much of LZRW's speed as possible (especially on compression) while still being able to select the "best" settings dynamically. Obviously, a little slowdown is acceptable.

    It could be that this objective is unrealistic, though.


    @toffer:
    OK, these are complex territories for my braincell activity level.
    If I understand correctly, you suggest starting with any value, then blindly trying to increment or decrement it, keeping the modification if the new result is better, and continuing from this point.

    What's not clear is whether x(k+1) must be tried on the same file chunk as before. That would provide a fair and valid comparison, but would also cost quite some time; it would be awfully close to brute force, albeit with the convexity property limiting the number of tested values.

    Alternatively, if you suggest trying x(k+1) on the next file chunk, then I'm not sure the results would remain comparable, and it therefore seems difficult to draw conclusions from the comparison. Or maybe this approximation could be considered "good enough"?...

  13. #13
     toffer (Programmer)
    I simply cannot answer these questions, as it depends on the homogeneity of your data. You simply have to try.

    And these techniques are not really comparable (in terms of complexity) to a brute-force search. You could make some plots of the compression (per block / per several blocks / per file) for quite a number of parameter values and have a look.
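
    A trivial way to produce such plots, assuming the same kind of hypothetical compress_sample() callback as sketched earlier in the thread, printing one CSV line per (block, table-bits) pair for gnuplot or a spreadsheet:

    Code:
    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical callback: compress one block with a 2^bits-slot table. */
    typedef size_t (*sample_fn)(const void *block, size_t block_len, int bits);

    /* Print compressed size per block and per table size, as CSV. */
    void plot_costs(const unsigned char *data, size_t len, size_t block_len,
                    int min_bits, int max_bits, sample_fn compress_sample)
    {
        printf("block,bits,compressed\n");
        for (size_t off = 0, blk = 0; off < len; off += block_len, blk++) {
            size_t n = (len - off < block_len) ? len - off : block_len;
            for (int bits = min_bits; bits <= max_bits; bits++)
                printf("%zu,%d,%zu\n", blk, bits,
                       compress_sample(data + off, n, bits));
        }
    }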

