
Thread: Bunch of stupid questions

  1. #1 chornobyl (Member, ua/kiev):

    Bunch of stupid questions

    Does anybody know what an LZ method that uses incomplete matches
    (difference compensation) would be called?
    For example, 123456 and 123356 would be a match, plus some compensation data: 000100.
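
    To make the idea concrete, here is a minimal C sketch (names are illustrative,
    not from any real codec) of coding a same-length candidate as a reference match
    plus per-symbol compensation:

        #include <stdio.h>

        /* "Incomplete match": the candidate is coded as a pointer to the
           reference plus compensation data, zero wherever they agree. */
        static void compensate(const char *ref, const char *cand, int len, int *comp)
        {
            for (int i = 0; i < len; i++)
                comp[i] = ref[i] - cand[i];   /* decode: cand[i] = ref[i] - comp[i] */
        }

        int main(void)
        {
            int comp[6];
            compensate("123456", "123356", 6, comp);
            for (int i = 0; i < 6; i++)
                printf("%d", comp[i]);        /* prints 000100, as in the example */
            printf("\n");
            return 0;
        }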

  2. #2 Shelwien (Administrator, Kharkov, Ukraine):
    Only *diff utilities do that.
    There are no LZ compressors like that, because it would be awfully
    slow to search for such patterns, and compression might even
    decrease after that, because this support for "patches" provides
    an alternative way to generate the same sequence with different
    codes.

    Such a thing can be implemented with a CM+LZ combination, though:
    first process the data with CM (e.g. paq), replace the symbols
    that get too small a probability, and store the patches somewhere.
    Then any LZ would get somewhat improved compression, if there
    really were a lot of mistypes or something like that.
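
    A rough sketch of that preprocessing, with a toy order-0 counter standing in
    for a real CM such as paq (the threshold and all names here are assumptions):

        #include <stdio.h>
        #include <string.h>

        /* Sketch of the CM+LZ preprocessing idea; the "CM" is a toy
           order-0 counter, not paq. Symbols the model finds too
           improbable are replaced by its favourite symbol, and the
           originals are stored as patches so the transform stays
           reversible. */
        typedef struct { int pos; unsigned char orig; } Patch;

        int main(void)
        {
            unsigned char data[] = "abababababaXabababab";
            int n = (int)strlen((char *)data);
            int freq[256] = {0};
            Patch patches[32];
            int np = 0;

            for (int i = 0; i < n; i++) {
                int best = 0;                      /* most frequent symbol so far */
                for (int c = 1; c < 256; c++)
                    if (freq[c] > freq[best]) best = c;
                /* "too small probability": much rarer than the favourite */
                if (i > 4 && freq[data[i]] * 8 < freq[best]) {
                    patches[np].pos = i;
                    patches[np].orig = data[i];
                    np++;
                    data[i] = (unsigned char)best; /* smooth the data for LZ */
                }
                freq[data[i]]++;
            }
            printf("smoothed: %s (%d patch(es))\n", (char *)data, np);
            return 0;
        }
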
    Last edited by Shelwien; 26th May 2008 at 15:11.

  3. #3 chornobyl:
    Yes, it would be slow, but I think comparing LZD vs DELTA+LZ
    is like comparing PPM vs BWT,
    because current implementations have many limitations.
    Also, IMHO it would be superior for multimedia data like WAV and BMP.

  4. #4 Shelwien:
    Ah, I misunderstood.
    I thought you were talking about sparse matches or something like that.

    But anyway, LZ is only useful for types of data with a high probability of
    long matches. And numeric data which can be better compressed with
    delta coding doesn't commonly contain any matches.

    I mean, of course a delta model would be useful.
    But an algorithm able to use that kind of correlation just wouldn't be called LZ.
    Last edited by Shelwien; 26th May 2008 at 15:23.

  5. #5 chornobyl:
    I didn't mean that the data is numeric; those numbers were just an example.
    I can't see any reason why it's not LZ.

    Compensation should be limited somehow, maybe to -8..+7 (16 values);
    then it would take half of the data space.
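
    For illustration, a sketch of that packing in C (the +8 bias and the
    clamping are my assumptions, not a fixed format):

        #include <stdio.h>

        /* Sketch of the limited compensation: clamp each difference to
           -8..+7 (16 values) and pack two per byte, so the compensation
           stream takes half the space of the raw data. */
        static unsigned char pack2(int d0, int d1)
        {
            if (d0 < -8) d0 = -8;  if (d0 > 7) d0 = 7;
            if (d1 < -8) d1 = -8;  if (d1 > 7) d1 = 7;
            return (unsigned char)(((d0 + 8) << 4) | (d1 + 8));
        }

        int main(void)
        {
            unsigned char b = pack2(-3, 5);
            int d0 = ((b >> 4) & 15) - 8;   /* unpack */
            int d1 = (b & 15) - 8;
            printf("%d %d\n", d0, d1);      /* prints -3 5 */
            return 0;
        }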

    PS: I have a feeling that when Bulat reads this he will add it to FA,
    which would be great.
    Last edited by chornobyl; 26th May 2008 at 16:04.

  6. #6 Shelwien:
    It's better to compress whole numbers, not their digits or something.
    Also, I've never seen an example of data with "hidden matches".
    Usually, if there are matches in the delta sequence, it means that the original
    data at the locations with matching deltas is similar too.
    And there's no reason to compress data without matches using LZ.

    > I didn't mean that the data is numeric; those numbers were just an example

    And if the data isn't numeric, then we return to the topic of
    sparse matches. Of course it's possible to create an LZ generalization
    with support for arbitrary patterns, not only substrings; it would
    certainly be LZ, and I guess it could have better compression than
    traditional LZ, but the compression time kills the whole idea.

    Btw, there's an example of that in the form of motion prediction with
    exhaustive search in video compression - check its speed to see
    how a sparse LZ could work.

  7. #7 chornobyl:

One More Question

    Is there any way to do "blind" Huffman decompression,
    when you don't have the stored tree and have to rebuild it?

    The reason why I'm asking is simple - preprocessing of ANY Huffman-compressed
    data, not just JPEG or ZLIB streams.

    Exact, precise data is not needed; the goal is only to restore the correct
    probabilities, i.e. to undo the Huffman entropy coding.

  8. #8 Shelwien:
    > Is there any way to do "blind" Huffman decompression,
    > when you don't have the stored tree and have to rebuild it?

    Of course, if there's enough redundancy in the code.
    Though that's already cryptography, I think.
    Ah, and the code had better be static, or else.
    And considering that all the common Huffman-based
    formats use blockwise encoding, you'd have to be
    able to at least split the stream into blocks.

    > the reason why I'm asking is simple - preprocessing of ANY Huffman-compressed
    > data, not just JPEG or ZLIB streams
    > exact, precise data is not needed; the goal is only to restore the correct
    > probabilities, i.e. to undo the Huffman entropy coding

    It's a good application of LZ optimal parsing (preferably one which uses a
    precise code-length metric, not some random weights for matches in a sequence).
    I mean, the only "catch" in a Huffman code is its prefix property, so it
    should be possible to recover it, given plenty of (redundant) data, by converting
    bits to bytes and applying optimal parsing.
    That is, if there's enough redundancy in the data, most matches would
    start at the beginning of some Huffman code.
    But there probably won't be enough redundancy in the sequence of LZ matches
    for a 32k block of data (after which the Huffman tree is switched).

    So I guess the only option is to drop the idea of recovering the tree
    and just concentrate on removing its redundancy.
    For that you'd need some good bitwise CM. An easy hack might be to take
    paq8/lpaq/paq9, remove the bit-context zeroing on byte boundaries, and add
    a context shift and mask on each bit instead (see the sketch below).
    Another optimization would be adding some optimal-parsing counterpart to CM.
    It may look like simulating the insertion of a special separator symbol at each
    position and comparing the compression with and without it.
    (A good example of this is the worse CM compression of texts with removed
    spaces, despite the smaller file size.)

  9. #9 chornobyl:

    Mixing

    Nowadays CM algos use mixing.
    What about this sort of mixing?

    for i = 1 to n do
        a' = sqrt(a*b)
        b' = 0.5*(a+b)
        a = a'
        b = b'
    end;
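
    For reference, this iteration converges quadratically to the
    arithmetic-geometric mean of a and b, which lies between sqrt(a*b) and
    (a+b)/2 - so as a mixing rule it would give one fixed compromise between
    geometric and arithmetic averaging, with no adaptive weights. A runnable
    sketch:

        #include <stdio.h>
        #include <math.h>

        /* The iteration from the post: a and b converge quadratically to
           the arithmetic-geometric mean AGM(a,b), which lies between
           sqrt(a*b) and (a+b)/2. */
        static double agm(double a, double b)
        {
            for (int i = 0; i < 64 && fabs(a - b) > 1e-15; i++) {
                double a1 = sqrt(a * b);
                double b1 = 0.5 * (a + b);
                a = a1;
                b = b1;
            }
            return a;
        }

        int main(void)
        {
            /* two "probabilities": geometric mean 0.4, arithmetic mean 0.5 */
            printf("%f\n", agm(0.2, 0.8));  /* ~0.449, between the two */
            return 0;
        }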

  10. #10 chornobyl:

    Mad Idea

    I tried to visualize common LZ methods as I understand them; correct me if I'm wrong.

    2008-02-05
    [Attached image: m+.png]

  11. #11 toffer (Programmer, Erfurt, Germany):
    I don't really understand what you visualized there.

  12. #12 chornobyl:
    Something like the dictionary fill, which depends on the file position.

  13. #13 toffer:
    In my understanding the LZ77 dictionary is the sliding window, sized N bytes. So the dictionary fills up while processing bytes 0...N-1 and remains full after that.

    LZ78-based compressors behave similarly: the dictionary fills up until the available memory is exhausted. Afterwards the dictionary can be kept (1), rebuilt, or discarded (2). That produces either a sawtooth graph (2) or a limited ramp function, like LZ77 (1).
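
    The two shapes as a sketch (N, and treating growth as linear, are
    simplifications on my part):

        #include <stdio.h>

        /* Sketch of the two fill shapes described above: LZ77's sliding
           window ramps up to N and stays full, while an LZ78 dictionary
           that is discarded when full gives a sawtooth. */
        static long lz77_fill(long pos, long N)         /* limited ramp */
        {
            return pos < N ? pos : N;
        }

        static long lz78_discard_fill(long pos, long N) /* sawtooth */
        {
            return pos % N;
        }

        int main(void)
        {
            for (long pos = 0; pos <= 1000; pos += 250)
                printf("pos=%4ld  lz77=%4ld  lz78(discard)=%4ld\n",
                       pos, lz77_fill(pos, 400), lz78_discard_fill(pos, 400));
            return 0;
        }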

  14. #14 chornobyl:
    Can you draw it?

  15. #15 Bulat Ziganshin (Programmer, Uzbekistan):
    Funny pictures, but really there are only 2 variants: a sliding dictionary that is filled and kept full, and a block-split dictionary that is filled, kept full until the end of the block, and then cleared before the next block.

  16. #16 toffer:
    Here we go...

    @Bulat:
    Blockwise processing is implementation-specific, and so is partially rebuilding the dictionary (third graph). I just showed the most "general" cases for streaming data processing.
    [Attached image: dict.PNG]

  17. #17 chornobyl:
    To Bulat:
    What about LZW? How would it look?

  18. #18 toffer:
    LZW is an LZ78 derivative, so it'll look like that. Some LZ78 variants might grow the dictionary nonlinearly, since they add multiple phrases per byte.

  19. #19 Bulat Ziganshin:
    Most LZ* algos grow the dictionary non-linearly - growth is very fast at first and then slows down, for the obvious reason.

    Yes, there are algos which clear the dictionary only partially, but that doesn't depend on the algorithm: you can do it with LZ78, PPM and probably any other algo in the universe.

  20. #20 chornobyl:

    B-symbols in CM

    Usually CM algos predict the current symbol from the previous ones.
    What about predicting from the NEXT symbol, at least one?
    Just like video codecs produce B-frames and predict them from the next/previous/both frames.

  21. #21 Shelwien:
    > Usually CM algos predict the current symbol from the previous ones.

    Well, because that's most useful.

    > What about predicting from the NEXT symbol, at least one?

    There are different ways to do that.

    1. It's possible to calculate the probability distribution
    for the two next symbols and fix it up using backward-context
    statistics. Nobody does that, for the obvious reason (speed).

    2. It's possible to just exclude the last symbol from the
    context - this is called a sparse context and is commonly
    used in all strong compressors; see the sketch below.
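
    A toy illustration of such a context (the layout is made up, not taken from
    any particular compressor): predict buf[i] from buf[i-2] and buf[i-3],
    skipping buf[i-1]:

        #include <stdio.h>

        /* Toy sparse context: the context for buf[i] is built from
           buf[i-2] and buf[i-3], skipping the most recent symbol. */
        static unsigned sparse_ctx(const unsigned char *buf, long i)
        {
            return ((unsigned)buf[i - 2] << 8) | buf[i - 3];
        }

        int main(void)
        {
            const unsigned char buf[] = "abcdef";
            /* context for buf[4] ('e') is made of 'c' and 'b', not 'd' */
            printf("%04x\n", sparse_ctx(buf, 4));
            return 0;
        }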

    > Just like video codecs produce B-frames and predict them
    > from the next/previous/both frames.

    In the video case it's basically reordering, and there's some
    sense in doing that because a frame is relatively large.
    If we applied this approach to general data, it would be
    block reordering, not something on the symbol level.
    And that's called segmentation and is used in some compressors.

    Also, symbol-order reversing is used sometimes, mostly in
    BWT compressors, as "right" contexts seem to provide better
    statistics clustering for texts.
    And special preprocessors or submodels, like E8, might reorder
    the bytes in some specific areas where it's known to be useful.

    Then, there's an "extraction" approach - it's possible to implement
    a multi-pass CM compressor, using both "left" and "right" contexts:
    just encode the locations of some symbols or words and remove them
    (or replace them with something else).

    And straight reordering on the symbol level is not reasonable at all,
    as it requires encoding a lot of redundant information.

  22. #22 chornobyl:

    Compression of big similar files

    I faced a situation where I needed to compress two alike but not identical files.
    These were 399MB wavpack files, one of which got damaged during transmission.
    The straight approach is to compress them with a 400MB dictionary and forget about it.
    But that is almost fantastic with my 512MB of RAM.
    So I took 7-Zip, split the files into 45MB pieces and compressed them with a 48MB dictionary; sorted by extension, files like A.001 and B.001 appeared together, and the final size was 410MB, which I had expected.

    This is the first time ROLZ compression has been done by human+LZ77.
    Last edited by chornobyl; 1st November 2008 at 23:59.

  23. #23 Shelwien:
    Once I made some utilities for such cases.
    http://shelwien.googlepages.com/patch_v1.rar

    Well, the idea was to be able to

    1) patch a broken file on a remote server with a valid local copy
    2) patch a broken file with a valid copy on a remote server
    3) create a patch file for distribution when multiple people have
    the broken file.

    And archivers are certainly not very convenient for that.
    Kinda offtopic though.

  24. #24 Bulat Ziganshin:
    xdelta?

  25. #25 chornobyl:

Improve CM by LZ, and not vice versa

    Most known implementations improve LZ methods by adding some CM stages where LZ does not perform well or CM does much better. So why not add some LZ model, and if it finds that LZ gets better results on a certain data block, mark that block as "LZcompressed" (see the sketch below)? The main problems are:
    - both algos work simultaneously, so this may take twice the time and memory (or more) to compress
    - the problem of smart blocking.
    Of course we assume that an "LZcompressed" block is not discarded from the CM stats; it is processed the same way on compression and decompression. This also suggests a slightly different approach:
    1. compress the data with CM in the ordinary way
    2. try to improve the results by applying LZ to the same data
    3. combine blocks from both results to get better compression

    PS: all of the above may turn out to be obsolete, if this stuff already exists and is called a "match model".
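
    A sketch of the per-block selection step; compress_cm/compress_lz are
    hypothetical stand-ins (the dummies below just make it runnable), and a real
    version would still have to feed "LZcompressed" blocks through the CM stats,
    as said above:

        #include <stdio.h>
        #include <string.h>

        /* Per block: code with both, keep whichever output is smaller,
           and spend one flag byte to say which coder was used. */
        typedef long (*coder_fn)(const unsigned char *blk, long len);

        static long fake_cm(const unsigned char *b, long n) { (void)b; return n / 2; }
        static long fake_lz(const unsigned char *b, long n) { (void)b; return n / 3; }

        static long choose_block(const unsigned char *blk, long len,
                                 coder_fn cm, coder_fn lz, int *used_lz)
        {
            long scm = cm(blk, len);
            long slz = lz(blk, len);
            *used_lz = slz < scm;               /* one flag per block */
            return 1 + (*used_lz ? slz : scm);  /* +1 byte for the flag */
        }

        int main(void)
        {
            unsigned char blk[4096];
            int used_lz;
            memset(blk, 'x', sizeof blk);
            long out = choose_block(blk, sizeof blk, fake_cm, fake_lz, &used_lz);
            printf("block coded to %ld bytes, lz=%d\n", out, used_lz);
            return 0;
        }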

  26. #26 encode (The Founder, Moscow, Russia):
    There is no variant of LZ, even a super-designed one, that can beat a properly designed CM. The only difference is speed. The only advantage of hybrid schemes such as LZ+CM is that we may use LZ for very long matches and CM in the other cases, gaining some speed.

  27. #27 Shelwien:
    Of course the usual LZ implementations are not any
    good from the modelling point of view - they allow multiple
    encodings of the same data, and thus are considerably redundant.
    But it's possible to make an LZ-like mixing scheme, with the codelengths
    of all available matches estimated for a given position and the corresponding
    rc intervals integrated into a symbol probability distribution.
    Now, that would be superior to common CM methods, as they only consider
    the symbol probabilities at a single point, while the LZ-like approach
    would be something like separately mixing all the substring instances.
    But the complexity of that is obviously too high, and if we tried
    to approximate it somehow, it'd probably turn into a common CM eventually.

    In short, LZ is a considerably different modelling approach, and thus
    it can better handle some kinds of data, sometimes even despite the
    redundancy of LZ compressors.

    And as to blockwise switching of CM/LZ - yeah, it's useful,
    but rarely implemented; I'm only sure about rar and nanozip.
    That's probably because archiver developers and compressor
    developers are usually different people.

    While match models and the like are quite common these days, they're
    still mostly a speed optimization and don't have the same
    kind of probability estimation as a usual LZ.

  28. #28 biject.bwts (Member, Texas):
    Quote Originally Posted by encode:
    There is no variant of LZ, even a super-designed one, that can beat a properly designed CM. The only difference is speed. The only advantage of hybrid schemes such as LZ+CM is that we may use LZ for very long matches and CM in the other cases, gaining some speed.
    Actually, there are LZWs that can do better than a properly designed CM,
    at least for some files.

    The reason mine often compress some files better, at least until the
    internal tables fill up, is that they are bijective from the
    space of all files to the space of all files.

    You can check them out at

    http://bijective.dogma.net/compreslzw1.htm

    Actually, people seem to forget that there is no best compressor
    for all files. Even the identity transform beats the best super-compressor
    for some files, since compressing some files means
    expanding others.

    The only reason CM does so well is that, as it compresses,
    it tries to figure out which model is best. But if one is
    only interested in certain types of files and couldn't care less
    about another set of files, CM is not the best.

  29. #29 osmanturan (Programmer, Mersin, Turkiye):
    Quote Originally Posted by biject.bwts:
    Actually, people seem to forget that there is no best compressor for all files.

    Yes, you are right. That's the main reason why WinRAR is still a competitor to 7-Zip: WinRAR tries to classify the data it compresses, while 7-Zip just blindly compresses everything.

    Quote Originally Posted by biject.bwts:
    The only reason CM does so well is that, as it compresses, it tries to figure out which model is best. But if one is only interested in certain types of files and couldn't care less about another set of files, CM is not the best.

    CM is a very generic term from my point of view. Someone can optimize an "optimizable CM coder" for a specific file (look at toffer's M1). To me this way is superior to well-optimized LZ-style compressors. Also, some more specific things can be done: a floating-point model etc. (look at Shelwien's GEO packer, which outperforms even PAQ).

