
Thread: lzdatagen - generate lz compressible data

  1. #1
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    124
    Thanks
    106
    Thanked 71 Times in 51 Posts

    lzdatagen - generate lz compressible data

    Inspired by the discussion about generating compressible test data in this thread, I wrote a small program based on a recent paper. It is a simplified version, which is really not much different from Shelwien's lzgen.

    https://github.com/jibsen/lzdatagen

    Windows 64-bit executable attached (compiled with GCC 5.3.0).
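
    In case anyone wants the gist without reading the source, here is a minimal sketch of the general approach (not the actual lzdatagen code - the distributions, cutoffs and parameter values below are just placeholders): emit short runs of random literals from a small alphabet, interleaved with copies from earlier positions, with lengths and distances drawn from skewed distributions, so the output ends up with LZ-style redundancy.

    Code:
    /* Minimal sketch of the general approach (not the actual lzdatagen code):
     * random literal runs from a small alphabet, interleaved with copies from
     * earlier positions, lengths/distances drawn from skewed distributions.
     * All parameter values here are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Draw a roughly geometric value in [min, max); p controls the skew. */
    static size_t skewed(size_t min, size_t max, double p)
    {
        size_t v = min;
        while (v + 1 < max && (double) rand() / RAND_MAX < p) {
            ++v;
        }
        return v;
    }

    static void generate(unsigned char *buf, size_t size)
    {
        size_t pos = 0;

        while (pos < size) {
            if (pos > 16 && rand() % 2) {
                /* Copy a match from an earlier position. */
                size_t len = skewed(4, 64, 0.8);
                size_t dist = skewed(1, pos, 0.99);

                if (len > size - pos) { len = size - pos; }
                /* Byte-by-byte copy so overlapping matches work like in LZ. */
                for (size_t i = 0; i < len; ++i) {
                    buf[pos + i] = buf[pos - dist + i];
                }
                pos += len;
            } else {
                /* Emit a short run of random literals from a small alphabet. */
                size_t len = skewed(1, 32, 0.7);

                if (len > size - pos) { len = size - pos; }
                for (size_t i = 0; i < len; ++i) {
                    buf[pos + i] = (unsigned char) ('a' + rand() % 26);
                }
                pos += len;
            }
        }
    }

    int main(void)
    {
        static unsigned char buf[1 << 20];   /* 1 MiB of test data */

        srand(12345);                        /* fixed seed for repeatability */
        generate(buf, sizeof(buf));
        fwrite(buf, 1, sizeof(buf), stdout);
        return 0;
    }
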
    Attached Files

  2. Thanks (2):

    Bulat Ziganshin (3rd July 2016),comp1 (3rd July 2016)

  3. #2
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    jibz,

    What are you expecting the data to be used for? I'm guessing that it's for testing whether the basic LZ + entropy coding parts of compressors are working ok, with respect to minimum match lengths, scalability, etc.

    The SDGen paper strikes me as bogus and dangerous, though. If they understand what modern LZ77s like LZX and LZMA are actually doing, they don't show it. (It's really not LZ77 anymore.) Random data with a realistic distribution of match lengths and distances will systematically make LZMA look slow for not much benefit, when in fact it kicks ass on a lot of data that you find in the storage management layers of various things (numeric arrays, strictly strided record structures, long string matches with small substitutions, etc.).

    They seem to be pitching simplistic random data for performance prediction, and that's a recipe for getting basic things systematically wrong. Compression is about recognizing and exploiting various kinds of patterns, and randomizing data systematically destroys all sorts of patterns, even if you preserve realistic statistics.

    (This was a big problem in memory allocator research for decades. After 30 years, it turned out that everybody was throwing away the most important patterns in real data, and almost all of the 200+ published papers were full of experimental results that were meaningless and mostly wrong. They had just been measuring consequences of artificial noise, where the actual data were intensely patterned.)

  4. Thanks (3):

    Bulat Ziganshin (6th July 2016),jibz (6th July 2016),Jyrki Alakuijala (5th July 2016)

  5. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    It's easy enough to make a generator based on actual lzma though - just add a limit (or wrap-around) for distance values, then let it decode random data after feeding it a file for model training. Maybe also disable stats updates, to avoid model degeneration.
    If anyone is interested, I can make it.
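
    Not the real patch, but the distance handling boils down to something like this:

    Code:
    /* Rough illustration of the distance wrap-around (not the real patch):
     * fold any distance decoded from the random input back into the range of
     * data generated so far, so match copies never reach before the start of
     * the output. With stats updates disabled, the trained model stays fixed. */
    static size_t clamp_distance(size_t raw_dist, size_t pos)
    {
        /* pos = bytes already produced; a valid distance is 1..pos. */
        if (pos == 0) {
            return 0;               /* caller has to emit a literal instead */
        }
        return 1 + (raw_dist % pos);
    }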

    But the original idea was simply to generate data for codec tests - real files are better, but impossible to distribute with the codec source.

  6. Thanks (4):

    Cyan (5th July 2016),jibz (6th July 2016),Mike (5th July 2016),Paul W. (5th July 2016)

  7. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Ok, here it is: http://nishi.dreamhosters.com/u/lzmagen_v0.rar
    Only lzma mode of plzmalib is patched though.

    Code:
    echo Compressing the seed file
    pmd c book1 book1.plz
    
    rem lzma params can be added after output filename eg.
    rem pmd c book1 book1.plz 25 9999 273 3 0 2
    rem (d, mc, fb, lc, lp, pb)
    rem but they also have to be specified the same way for "decoder"
    
    echo Decompressing the seed file + generating 100*100k=10M of random data
    pmd d100 book1.plz book1.unp

  8. Thanks (3):

    Bulat Ziganshin (6th July 2016),Mike (5th July 2016),Paul W. (5th July 2016)

  9. #5
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Wow, that was quick. Thanks!

  10. #6
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    124
    Thanks
    106
    Thanked 71 Times in 51 Posts
    Quote Originally Posted by Paul W. View Post
    What are you expecting the data to be used for? I'm guessing that it's for testing whether the basic LZ + entropy coding parts of compressors are working ok, with respect to minimum match lengths, scalability, etc.
    That was my thought, yes. Besides the more synthetic unit tests that check the limits of a library, it is useful to verify that it compresses and decompresses (more or less) normal data. The Squash unit tests, for instance, include a piece of lorem ipsum text, and Brotli includes various test files.

    The lzdgen tool was meant more as an example; the data generation code itself is just four functions and could be used to generate data and run tests in memory without writing files to disk.

    Also, I saw Shelwien's code, found that paper, and thought it would be fun to try implementing the algorithm, but in a way that did not require input data.
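
    Picking up on the in-memory idea, a roundtrip test could look roughly like the sketch below. gen_test_data() is just a placeholder for whatever generator is used (the real lzdatagen API may differ), and zlib's compress()/uncompress() stand in for the codec under test (error handling omitted):

    Code:
    /* Sketch of an in-memory roundtrip test: generate a buffer, compress it,
     * decompress it, and check the result matches the original. */
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    void gen_test_data(unsigned char *buf, size_t size);  /* placeholder */

    int roundtrip_ok(size_t size)
    {
        unsigned char *orig = malloc(size);
        uLongf comp_size = compressBound(size);
        unsigned char *comp = malloc(comp_size);
        unsigned char *back = malloc(size);
        uLongf back_size = size;
        int ok;

        gen_test_data(orig, size);

        ok = compress(comp, &comp_size, orig, size) == Z_OK
          && uncompress(back, &back_size, comp, comp_size) == Z_OK
          && back_size == size
          && memcmp(orig, back, size) == 0;

        free(orig); free(comp); free(back);
        return ok;
    }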

  11. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    From my POV, such generators are great for an archiver's built-in benchmark code. Such a benchmark should provide repeatable results and use multi-megabyte data to match modern algos. So using a small "learning set" and generating much more data with the same stats looks like the best solution possible. Of course, that requires a compressor that is close to optimal for the intended datasets. For example, such an approach will fail with rep/srep, which store LZ match offsets as plain 32-bit data.

    Also, as Eugene said, there is the question of how the statistics change. Ideally, they should change in the same way as in real data. This means a better approach would be to use block-static compressors that store stats in the block header, e.g. zlib or zstd. Then we can provide only the block header contents to the decoder and use random numbers as the block bodies. Usually, block headers are about 3-10% of the compressed data - that gives the best compressed size. But for our purpose, we can compress in a non-optimal way, making the block headers/stats only 0.1-1% of the compressed data.
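
    For example, to generate 100 MB of synthetic data that compresses 4:1 (the ratio here is just an assumption for illustration), the compressed stream would be about 25 MB, so keeping only the block headers at 0.5% of that means roughly 128 KB of seed material:

    Code:
    /* Back-of-the-envelope for the header-only seed idea (the 4:1 ratio is
     * an assumption; the 0.5% header fraction is in the range given above). */
    #include <stdio.h>

    int main(void)
    {
        double target_mb   = 100.0;   /* synthetic data we want to generate */
        double ratio       = 4.0;     /* assumed compression ratio */
        double header_frac = 0.005;   /* headers as fraction of compressed size */

        double compressed_mb = target_mb / ratio;              /* ~25 MB */
        double seed_kb = compressed_mb * header_frac * 1024;   /* ~128 KB */

        printf("seed (headers only): ~%.0f KB for %.0f MB of output\n",
               seed_kb, target_mb);
        return 0;
    }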

    Some time ago, I developed a block-static PPM algorithm called CM (it stands for context modeling). A bug in one version essentially implemented such a generator. Here is an example of its output:

    Code:
    HAMLET    sowho our youching?
    
    HORATIO    What greated,
        Is of good do speak aboutemponick to a withnotiss.
    
    HORATIO    Aant body said my what arry we has him our his have blo(k harc(yt,
        S    IgOShequepearps.
        Where ick,
        Afor; andmil mind our yourself, not to my you
        Taill.
    
    HORATIO    I must lordwedy, my lord?
    
    HAMLET    I abMuffer,
        Whicharktorows in
        Hance[homit think you, thou hast not as.
    I have attached sources/executables to the post.
    Attached Files
    Last edited by Bulat Ziganshin; 6th July 2016 at 17:53.

  12. Thanks (3):

    jibz (6th July 2016),Mike (6th July 2016),Paul W. (7th July 2016)
