
Thread: Specific case - Highly redundant data in a pattern

  1. #1
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts

    Specific case - Highly redundant data in a pattern

    Here's the question. We all know that different kinds of data compress best with different algorithms; in other words, not every algorithm will compress every type of data efficiently. Despite this, there are some programs which can, at the cost of speed and resource usage, squeeze almost every file to the last bit, like the PAQ series.
    Well, I have a file which refuses to agree with that statement. I tried several programs, and when I thought I had reached the highest ratio available (paq8pxd14, 13.67 hours running), nanozip gave me a file one third the size of the paq8 output... at ~3 MB/s. This was rather unexpected, since I was prepared for behavior à la 7-zip... And I said WOW


    I know artificial data is not welcome at research time, and for good reasons. Nonetheless, mine seems to have a lot in common with real-world data such as logs, for example. We will probably never know what's inside nanozip, and I do NOT want to start something like a disassembly race. What I'm wondering is whether, by looking at this example, we could improve current compressors to achieve higher ratios and speeds, at least for a particular kind of data. And whether this matters at all... I might be wrong, after all.

    The file can be built this way:

    Code:
    @echo off
    for /l %%1 in (0,5,100) do (
        for /l %%2 in (0,5,80) do (
            for /l %%3 in (0,10,9999) do (
                for /l %%4 in (0,5,100) do (
                    echo q%%1 - qc %%2 - plc %%3 - peq %%4 >> LOG.txt
                )
            )
        )
    )
    Be careful: running this will take a long time. It's always faster to download the attachment.
    The code is just to show how the file was made.
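    For reference, here is a rough Python equivalent of the batch script (a sketch only; it assumes echo keeps the trailing space before ">>" and writes CR LF line endings, and that for /l includes the end value):
    Code:
    # Rough Python equivalent of the batch generator above (illustrative sketch).
    def write_log(path="LOG.txt"):
        with open(path, "w", newline="") as f:              # keep \r\n as written
            for q in range(0, 101, 5):                      # for /l %%1 in (0,5,100)
                for qc in range(0, 81, 5):                  # for /l %%2 in (0,5,80)
                    for plc in range(0, 10000, 10):         # for /l %%3 in (0,10,9999)
                        for peq in range(0, 101, 5):        # for /l %%4 in (0,5,100)
                            f.write(f"q{q} - qc {qc} - plc {plc} - peq {peq} \r\n")

    if __name__ == "__main__":
        write_log()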

    Here is a size comparison:

    Code:
    252,469,833 LOG.txt
    011,719,224 LOG.txt.sx - Slugx
    008,361,937 LOG.txt.ppmd - fazip
    006,093,273 LOG.txt.grzip - fazip
    003,929,599 LOG.txt.lzma - fazip
    001,030,742 LOG.txt.ccmx
    000,923,984 LOG.txt.xz
    000,882,957 LOG.pmm - ppmonstr
    000,072,422 LOG.txt.paq8pxd14
    000,023,744 LOG.nz - nanozip -co
    000,023,571 LOG.nz - nanozip -cO
    So... what do you think?
    Attached Files

  2. #2
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    886
    Thanks
    52
    Thanked 107 Times in 85 Posts
    I can compress it even smaller; sadly, the method is not good for other data.

    But here it is (in 254 bytes):
    for %%1 in (0,5,100) do (
    for %%2 in (0,5,80) do (
    for %%3 in (0,10,9999) do (
    for %%4 in (0,5,100) do (
    echo q%%1 - qc %%2 - plc %%3 - peq %%4 >> LOG.txt
    )
    )
    )
    )

    So the best compression for your data does not help you gain compression on other files.
    I believe this subject was already covered in the last debate: using artificially created data is pretty meaningless, especially something derived from such a little "seed".

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    it is handled by delta/table compression

  4. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Haha! Yes, I know. Just like pi in zpaq. The goal here is to reach a good result without such tricks, as if we didn't know there is a seed. Like nz does.
    @Bulat: did you try delta? It actually hurts compression. Not all lines have the same length, so we cannot build a table.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    which tricks do you want to avoid? it's a natural fit for delta/table compression, although my actual implementation can't handle it. but it's a matter of the specific implementation rather than the idea: the table conversion should split it into separate channels, and delta compression can handle arithmetic sequences. you just need to make both filters flexible enough to handle this specific case too. i think that's the exhaustive answer to your initial question
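
    To make the idea concrete, here is a toy sketch (Python, purely illustrative; not the actual delta/table filters mentioned above) of what channel splitting plus delta means for this particular file format:
    Code:
    # Parse the four numeric fields out of every "q.. - qc .. - plc .. - peq .."
    # line, keep each field in its own channel, and store only the difference
    # from the previous value in that channel.
    import re

    LINE = re.compile(r"q(\d+) - qc (\d+) - plc (\d+) - peq (\d+)")

    def split_and_delta(path="LOG.txt"):
        channels = [[], [], [], []]
        prev = [0, 0, 0, 0]
        with open(path) as f:
            for line in f:
                m = LINE.match(line)
                if not m:            # unparseable lines would need a literal channel
                    continue
                for i, field in enumerate(m.groups()):
                    value = int(field)
                    channels[i].append(value - prev[i])   # delta within the channel
                    prev[i] = value
        return channels

    # After the transform each channel is almost entirely zeros plus one repeated
    # step value and occasional negative resets, which any back-end coder squeezes
    # to almost nothing.  A real filter would also store the line template so the
    # transform stays invertible.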

  6. #6
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    which tricks do you want to avoid? it's a natural fit for delta/table compression, although my actual implementation can't handle it. but it's a matter of the specific implementation rather than the idea: the table conversion should split it into separate channels, and delta compression can handle arithmetic sequences. you just need to make both filters flexible enough to handle this specific case too. i think that's the exhaustive answer to your initial question
    That's what I'm talking about. I thought this example might help developers improve existing algorithms, like your delta. And yes, it is indeed an exhaustive answer.
    By "tricks" I mean very specific algorithms that won't help at all with anything different; in this case, trying to use my bat file is obviously useless for any other file/stream in the entire world.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,479
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Pattern recognition boils down to artificial intelligence if you want a compressor to recognize patterns it never saw. Even if a compressor can detect 100 patterns and deal with them exceptionally well, someone can find another 100 patterns that look simple to decompose and compress, but that the compressor doesn't cope well with (i.e. it doesn't detect them and compresses them with an inferior generic method).

    In general, compression algorithm authors tune their products to data that's available to them or to data they generate themselves. If someone else brings a different type of data, then the compression effectiveness will depend on luck, i.e. the probability that the new data shares some similarities with the data the compressor was tuned to (so that the filters, models, transformers, etc. will recognize the patterns and handle them appropriately).

    NanoZip has lots of efficient filters, probably because its author, Sami Runsas, has extensive experience with developing them.
    Last edited by Piotr Tarsa; 18th September 2014 at 19:39.

  8. Thanks:

    Gonzalo (27th September 2016)

  9. #8
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Yes, it falls into AI... AI should be the definitive answer.
    As for 'conventional' filters, it depends on the author's criteria: whether he considers a specific case common enough to write code for it, and whether the program can figure out when to apply each filter. An obvious example is FA's $bmp: mm+grzip is good for a few big images, but for a large number of small BMPs, rep+lzma outperforms $bmp by far, mainly because it is solid. Again, AI is involved.
    Last edited by Gonzalo; 18th September 2014 at 20:37.

  10. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    I tried a few more tests with zip, 7zip, rar, and zpaq.
    Code:
         23,571 LOG.nz
        230,914 log-s8.3ci1.1.1m16.zpaq
        236,154 log-s8.3ci1.1.1m.zpaq
        288,313 log-s8.3ci1.1.1.zpaq
        288,868 log-s8.3ci1.1.2.zpaq
        296,130 log-s8.3ci1.1.1.1.zpaq
        297,834 log-s8.3ci1.1.zpaq
        535,992 log-s8.3ci1.zpaq
        543,587 log-m5.zpaq
        664,789 log-x8.3ci1.zpaq
        911,738 log-m4.zpaq
        911,738 log-m3.zpaq
      1,102,795 log-mx.7z
      2,746,010 log-s8.0ci1.1.1.1.2.2m.zpaq
      3,242,625 log-s8.0ci1.1.1.1.2.2.zpaq
      3,363,159 log-s8.0ci1.1.1.1.2.zpaq
      3,430,516 log-s8.0ci1.1.1.1.1.zpaq
      4,221,318 log-s8.0ci1.1.1.1.zpaq
      4,971,745 log-s8.0ci1.1.1.zpaq
      7,169,691 log-s8.0ci1.1.zpaq
      9,043,672 log-ma5-m5.rar
     12,145,086 log-s8.3c.zpaq
     17,979,430 log-9.zip
     18,924,323 log-s8.0ci1.zpaq
     26,773,587 log-m1.zpaq
     30,295,432 log-m2.zpaq
     92,530,637 log-s8.0c.zpaq
    252,469,833 LOG.txt
    zpaq -m1..-m5 are the built-in -method 1..5. Method s8.3ci1.1.1m16 means the following:
    s = streaming mode.
    8 = 2^8 MB = 256 MB block (fits whole file).
    3 = BWT.
    c = indirect order 0 context model (ICM) after BWT.
    i1.1.1 = ISSE chain order 1, 2, 3 after ICM.
    m16 = order 1 mixer (16 bit context).

    Streaming mode (s vs x) saves 29 KB by not storing dedup fragment hashes and pointers. -m3 and -m4 are equivalent to x6.3ci1 for this file, which means the following:
    x = journaling mode.
    6 = 2^6 MB = 64 MB blocks (allows 4 threads).
    3 = BWT.
    ci1 = order 0-1 ICM-ISSE chain.

    Other files might use a different algorithm for -m3 and -m4 depending on an analysis of the input. It usually uses BWT for text and highly compressible data but CM or LZ77 or no compression for less compressible data. I also tried some direct CM models like s8.0ci1.1.1.1.2.2m which is an order 0-1-2-3-4-6-8 ICM-ISSE chain and order 0 mixer.
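
    Purely as a reading aid, here is a rough sketch (illustrative only, not a complete parser of zpaq's -method syntax) that decodes these strings into the components listed above:
    Code:
    import re

    # Decode a streaming/journaling method string following the component
    # meanings above: s/x mode, 2^N MB blocks, 0 = no transform / 3 = BWT,
    # then c (order-0 ICM), i... (ISSE chain increments), m (mixer).
    def describe(method):
        mode, blocksize, transform, rest = re.match(
            r"([sx])(\d+)\.(\d)(.*)", method).groups()
        parts = [
            "streaming" if mode == "s" else "journaling",
            f"{2 ** int(blocksize)} MB blocks",
            "BWT" if transform == "3" else "no transform",
        ]
        for comp in re.findall(r"c|i[\d.]+|m\d*", rest):
            if comp == "c":
                parts.append("order-0 ICM")
            elif comp.startswith("i"):
                parts.append("ISSE chain +" + comp[1:])
            else:
                parts.append(("mixer " + comp[1:]).strip())
        return ", ".join(parts)

    print(describe("s8.3ci1.1.1m16"))
    # streaming, 256 MB blocks, BWT, order-0 ICM, ISSE chain +1.1.1, mixer 16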

    Methods 1 and 2 are LZ77. I was surprised that 2 compresses worse. It may be because it uses larger blocks (64 MB vs. 16 MB), which require longer codes to encode short match offsets. Also, both compress worse than zip, possibly because zpaq uses fixed codes with no literal compression. Larger blocks don't help if all the offsets are short.

  11. Thanks:

    Gonzalo (19th September 2014)

  12. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Another test with a text file containing the numbers 1 to 1000000, CR LF terminated. Nanozip wins again.

    Code:
        3,404 n.nz
        4,160 n-cc.nz
       13,194 n-m5.zpaq
       14,169 n.pmm - ppmonstr
       29,696 n-m5-ma5.rar
      139,712 n-mx.7z
      422,930 n-m4.zpaq
      422,930 n-m3.zpaq
    1,377,237 n.pmd - ppmd
    2,131,145 n-9.zip
    3,074,043 n-m2.zpaq
    4,694,110 n-m1.zpaq
    7,888,896 n
    Create the file n like this:
    Code:
    @echo off
    for /l %%1 in (1,1,1000000) do (
      echo %%1
    )

  13. Thanks:

    Gonzalo (27th September 2016)

  14. #11
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Matt: Imagine this scenario:
    1) A database of known algorithms, including zpaq methods, and of which ones are allowed to work together; for instance, an "x" match finder with "y" and "z" coders plus "a", "b" and "c" filters and transformations in the process.
    2) Another database of known data types: text (each codepage separately), raw PCM, and so on...
    3) The key: a program that tries every possible combination (a brute-force approach), collecting stats about the ratio, speed and resource usage of each one (see the sketch after this list).
    4) A program (it can be the same as 3) collecting info about the file being processed. I mean every number and any particularity that can help the compressor guess in the future which method to choose: name, extension, CRC, size, entropy, detected magic numbers or headers, detected tables, distribution of the 0-255 byte values in the body, etc., including apparently meaningless statistics that could be useful in a 'forensic' analysis.
    5) An update to stages 1 and 2. Ideally this should be done by the automated process. Over time, by recursively updating the best-case scenario for each data type, we could also discover variations within these data types (for example, our test files are $text at first sight, but fit better in another category), or even new categories, by looking at the log and seeing which files share similar conditions. And, of course, gain certainty about how to compress even data we have never seen before.
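
    A minimal sketch of step 3, using Python's built-in codecs as stand-ins for the algorithm database (a real system would call external compressors, work on samples of big files, and also log the file features from step 4):
    Code:
    # Brute-force every (codec, level) pair on a file and record ratio and speed.
    import bz2, lzma, time, zlib

    CODECS = {
        "zlib": (lambda d, lvl: zlib.compress(d, lvl), range(1, 10)),
        "bz2":  (lambda d, lvl: bz2.compress(d, lvl),  range(1, 10)),
        "lzma": (lambda d, lvl: lzma.compress(d, preset=lvl), range(0, 10)),
    }

    def benchmark(path):
        data = open(path, "rb").read()
        results = []
        for name, (compress, levels) in CODECS.items():
            for level in levels:
                start = time.time()
                packed = compress(data, level)
                elapsed = max(time.time() - start, 1e-9)
                results.append({
                    "codec": name,
                    "level": level,
                    "ratio": len(packed) / len(data),
                    "MB/s": len(data) / elapsed / 1e6,
                })
        return sorted(results, key=lambda r: r["ratio"])

    if __name__ == "__main__":
        for row in benchmark("LOG.txt")[:5]:   # the five best ratios for this file
            print(row)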
    Last edited by Gonzalo; 19th September 2014 at 06:27.

  15. #12
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Obviously, this concept has its drawbacks and challenges. But I think it would be a step forward in the AI direction.

    One obvious difficulty is the insane resource usage. This can be circumvented by running the software in 'screensaver' mode when the system is idle, for instance at night. We could also connect several computers and have them share their results.
    Another major difficulty is analysing the whole thing after a given run. I wonder: can a present-day program do such a job? Would the conclusions it reaches be useful in a real compression session? Or would we have to do it by hand? I have no experience at all with AI development, so I cannot give an answer.

    To all: What do you think? It is possible, but is it practicable?
    Can we join efforts and build it? The knowledge obtained this way would be useful to the entire community.
    Just think about it!
    Last edited by Gonzalo; 19th September 2014 at 06:26.

  16. #13
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Only humans can look at a file and think: 'Oh! This looks like an encrypted file.' Or: 'I will try an exe filter, since this seems to be an executable, although I can't find the usual header...'
    If data packing is an AI problem... when will a program be capable of 'thinking' that way? Maybe the answer starts with collecting all possible data from the real world, just like a baby does within its first months of life, before it starts to imitate its parents and interact with its environment. Actually, every living thing keeps gathering information from its environment until it dies; only this way can it predict what comes next and make decisions.
    Last edited by Gonzalo; 22nd September 2014 at 00:38.

