
Thread: Universal LZ Format

  #1 Kennon Conrad (Member, Bothell, Washington, USA)

    Universal LZ Format

    The question of where Tree's compression algorithm fits in the spectrum of LZ programs has been a curiosity to me since Dr. Mahoney first suggested it was like LZW last Christmas and since nburns commented about fitting Tree into an existing LZ format last spring. So, in an attempt to clarify how Tree and other recursive grammar-based compression algorithms fit within the LZ spectrum, I have written a short and rough beginning outline of a paper describing what I am calling the Universal LZ General Format. It describes a single, relatively simple general format that supports added restrictions that turn it into LZ77, LZ78, LZW or grammar transformations. Even though the discussion is minimal, I hope it helps shed some light on why Tree is able to compress enwik9 to a file that is about 8% smaller than any other LZ algorithm achieves, in spite of what is likely inferior code in some or many cases.
    Attached Files
    Last edited by Kennon Conrad; 30th September 2014 at 08:36.

  #2 nburns (Member, San Diego)
    I can't see much similarity between Tree (and other similar compressors) and LZW. The only similarity is that both Tree and LZW associate phrases with single numbers or codes. This is LZW's main loop:

    1. find longest match in dictionary
    2. output number for match
    3. insert longest match + 1 additional symbol

    LZW dictates both the dictionary contents and the match selection. The variants that I've looked at tend to keep a similar pattern. On each cycle, they select a phrase for the following symbols (usually the longest match) and add a longer version of that phrase.
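    To make that loop concrete, here is a minimal sketch of a greedy LZW encoder in C. It is my own illustration rather than code from any compressor discussed in this thread: it uses a slow linear-search dictionary, a fixed 65536-entry limit, and prints codes as decimal numbers instead of packing variable-width bits.

    Code:
    /* Minimal greedy LZW encoder sketch: find the longest dictionary match,
       output its code, then insert match + one more symbol. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_ENTRIES 65536

    static unsigned char *dict[MAX_ENTRIES];
    static int dict_len[MAX_ENTRIES];
    static int dict_size = 0;

    static void add_entry(const unsigned char *s, int len) {
        if (dict_size >= MAX_ENTRIES) return;        /* real coders reset or prune here */
        dict[dict_size] = malloc(len);
        memcpy(dict[dict_size], s, len);
        dict_len[dict_size] = len;
        dict_size++;
    }

    int main(void) {
        static unsigned char buf[1 << 16];
        int n = (int)fread(buf, 1, sizeof buf, stdin);
        for (int i = 0; i < 256; i++) {              /* roots: every single byte */
            unsigned char c = (unsigned char)i;
            add_entry(&c, 1);
        }
        int pos = 0;
        while (pos < n) {
            int best = -1, best_len = 0;
            for (int j = 0; j < dict_size; j++)      /* 1. find longest match in dictionary */
                if (dict_len[j] > best_len && pos + dict_len[j] <= n &&
                    memcmp(dict[j], buf + pos, dict_len[j]) == 0) {
                    best = j;
                    best_len = dict_len[j];
                }
            printf("%d ", best);                     /* 2. output number for match */
            if (pos + best_len < n)                  /* 3. insert longest match + 1 symbol */
                add_entry(buf + pos, best_len + 1);
            pos += best_len;
        }
        printf("\n");
        return 0;
    }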

    I think Tree is conceptually similar to LZ77, because LZ77 doesn't dictate phrase selection. LZ77's format is a list of backward references to phrases. Tree moves the reference forward to the phrase itself, which (1) requires an escape symbol, and (2) limits the scope and size of the dictionary. All phrases have been enumerated prior to the first time they're used; codes can then be explicitly part of the definition or derived from the order (Tree's system is complex). Tree requires that the phrases are well-nested, giving the dictionary a tree structure (not the same tree as it's named after, I don't think). In LZ77, all phrases are independent and may overlap each other. An LZ stream has a straightforward translation to Tree format as long as all of its matches are well-nested and all references point to the first match. LZ-type (start, length) references are strictly more expressive than Tree format.
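    A tiny made-up example of the well-nesting distinction (mine, not from the attached paper): for the text "ababab", an LZ77 parse can emit the literals a, b and then a match (offset 2, length 4) that overlaps its own source, whereas a Tree-style grammar would define the phrase at its first occurrence and reuse it, so every later use points back to that fully nested definition:

    text:     a b a b a b
    LZ77:     a  b  (offset 2, length 4)      <- match may overlap its source
    grammar:  A = a b ; output: A A A         <- phrases nested, defined at first use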
    Last edited by nburns; 30th September 2014 at 10:49.

  #3 Piotr Tarsa (Member, Kraków, Poland)
    I think Tree in its current form resembles LZ78 more than LZ77. LZW isn't the only valid LZ78 variant; you can devise different ones: http://encode.su/threads/216-LZW-LZM...ZAP-comparison And the LZW format doesn't force you to select the longest matches; that's up to you. Also, LZ78-derived algorithms seem not to categorize symbols by recency (which would enable efficient dictionary index modelling) and they do full dictionary restarts on space exhaustion (instead of, e.g., pruning the most useless entries).
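    As a concrete, entirely generic illustration of recency categorization (not taken from any LZ78 coder, just one way dictionary indexes could be remapped so that recently reused entries get small numbers an entropy coder can exploit), here is a small move-to-front sketch in C:

    Code:
    /* Recency ranking: replace each dictionary index by its position in a
       most-recently-used list, then move it to the front. */
    #include <stdio.h>

    #define DICT_SIZE 8

    int main(void) {
        int mru[DICT_SIZE];                        /* mru[0] = most recently used index */
        for (int i = 0; i < DICT_SIZE; i++) mru[i] = i;

        int stream[] = { 5, 5, 2, 5, 2, 7 };       /* raw dictionary indexes to encode */
        int n = (int)(sizeof stream / sizeof stream[0]);

        for (int i = 0; i < n; i++) {
            int rank = 0;
            while (mru[rank] != stream[i]) rank++; /* position in the recency list */
            printf("index %d -> rank %d\n", stream[i], rank);
            for (int j = rank; j > 0; j--)         /* move the entry to the front */
                mru[j] = mru[j - 1];
            mru[0] = stream[i];
        }
        return 0;
    }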


    Overall, my intuition about LZ77 vs LZ78 (when considering them as classes of algorithms) is that LZ77 is about outputting offsets into a window of previous data, while LZ78 is about outputting indexes to entries in a dictionary, and those indexes don't have to have anything in common with offsets. But then there would be a problem with coders like LZP (Lempel-Ziv Predictive) or SR (Symbol Ranking) - they don't correspond to either description.
    OTOH, I could say that the LZ class consists of algorithms that are able to encode variable-length substrings of past data with a single indivisible token. With that description LZP falls into the LZ category, but SR still doesn't (and this is probably OK, since SR intuitively doesn't look like LZ).

    BWT, CM, PPM, bare entropy coders, etc. are either bytewise or bitwise: they divide the input into fixed-size chunks (i.e. either bits or bytes, usually) and compress each one separately. That differentiates them from LZ-class algorithms.

    Quote Originally Posted by Kennon Conrad View Post
    Even though the discussion is minimal, I hope it helps shed some light on why Tree is able to compress enwik9 to a file that is about 8% smaller than any other LZ algorithm achieves, in spite of what is likely inferior code in some or many cases.
    I think the main reason is that Tree analyzes the data offline and selects only the most useful substrings to be addressable. That reduces address lengths (where an address is either an offset or an index; in Tree's case it's an index). I think LZ78-type algorithms could benefit a lot from offline analysis, from periodic pruning of useless entries (if new entries are defined implicitly as in LZW, which has no way to define a new token other than by selecting a previous one as a base), and from variable-length coding of dictionary entries (Tree assigns codes of different lengths, unlike LZW).

  #4 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns View Post
    I can't see much similarity between Tree (and other similar compressors) and LZW. The only similarity is that both Tree and LZW associate phrases with single numbers or codes. This is LZW's main loop:

    1. find longest match in dictionary
    2. output number for match
    3. insert longest match + 1 additional symbol

    LZW dictates both the dictionary contents and the match selection. The variants that I've looked at tend to keep a similar pattern. On each cycle, they select a phrase for the following symbols (usually the longest match) and add a longer version of that phrase.

    I think Tree is conceptually similar to LZ77, because LZ77 doesn't dictate phrase selection. LZ77's format is a list of backward references to phrases. Tree moves the reference forward to the phrase itself, which (1) requires an escape symbol, and (2) limits the scope and size of the dictionary. All phrases have been enumerated prior to the first time they're used; codes can then be explicitly part of the definition or derived from the order (Tree's system is complex). Tree requires that the phrases are well-nested, giving the dictionary a tree structure (not the same tree as it's named after, I don't think). In LZ77, all phrases are independent and may overlap each other. An LZ stream has a straightforward translation to Tree format as long as all of its matches are well-nested and all references point to the first match. LZ-type (start, length) references are strictly more expressive than Tree format.
    The point I am trying to make is that if you take Tree and add the restrictions that symbols must be length 1 and that dictionary entries must be created for every output symbol, then the compression algorithm could be reduced to LZW. That would make compression fast, but it would not leave any freedom for the designer.

    Tree is similar to LZ78/LZW in that it has a dictionary and outputs compressed symbols in the history in place of raw symbols.

    Tree is similar to LZ77 because the compressor needs to decide which strings to use, and those strings can have lengths of more than 2. The choices are reduced for Tree because it is looking at a compressed stream while LZ77 is looking at raw data, but they are also more complicated because Tree can create dictionary symbols from prior symbols with unique offsets.

    If you look at Tree from the Universal LZ General Format perspective, it has fewer restrictions than either LZ77 or LZ78, allowing more freedom for the algorithm designer to make good decisions.

  #5 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Piotr Tarsa View Post
    I think Tree in its current form resembles LZ78 more than LZ77. LZW isn't the only valid LZ78 variant; you can devise different ones: http://encode.su/threads/216-LZW-LZM...ZAP-comparison And the LZW format doesn't force you to select the longest matches; that's up to you. Also, LZ78-derived algorithms seem not to categorize symbols by recency (which would enable efficient dictionary index modelling) and they do full dictionary restarts on space exhaustion (instead of, e.g., pruning the most useless entries).


    Overall, my intuition about LZ77 vs LZ78 (when considering them as classes of algorithms) is that LZ77 is about outputting offsets into a window of previous data, while LZ78 is about outputting indexes to entries in a dictionary, and those indexes don't have to have anything in common with offsets. But then there would be a problem with coders like LZP (Lempel-Ziv Predictive) or SR (Symbol Ranking) - they don't correspond to either description.
    OTOH, I could say that the LZ class consists of algorithms that are able to encode variable-length substrings of past data with a single indivisible token. With that description LZP falls into the LZ category, but SR still doesn't (and this is probably OK, since SR intuitively doesn't look like LZ).

    BWT, CM, PPM, bare entropy coders, etc. are either bytewise or bitwise: they divide the input into fixed-size chunks (i.e. either bits or bytes, usually) and compress each one separately. That differentiates them from LZ-class algorithms.


    I think the main reason is that Tree analyzes the data offline and selects only the most useful substrings to be addressable. That reduces address lengths (where an address is either an offset or an index; in Tree's case it's an index). I think LZ78-type algorithms could benefit a lot from offline analysis, from periodic pruning of useless entries (if new entries are defined implicitly as in LZW, which has no way to define a new token other than by selecting a previous one as a base), and from variable-length coding of dictionary entries (Tree assigns codes of different lengths, unlike LZW).
    I purposely avoided the topic of what predictive algorithms can do with the offset ranks. I think they can be added to any LZ format, although the effectiveness would vary.

    I see LZ77 and LZ78 as being on opposite ends of the Universal LZ General Format. Tree can blend the best of both by removing some of the restrictions.

    LZ78 type algorithms seem under-developed to me. It seems silly to me to put a word in the dictionary for every symbol when it is unlikely that more than 10% or so will be used. The cost of sending a 10% probability escape symbol on 1/10 symbols is not much compared to the cost of having a dictionary filled with 90% useless entries (assuming a large dictionary). I think some LZ78/LZW algorithms use dictionary indexes because it is convenient rather than because it is optimal. On the other hand, making predictions is probably easier with dictionary indexes.
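    Rough numbers for that trade-off (my own back-of-the-envelope figures, not measured from any coder):

    escape cost:      -log2(0.1) = 3.32 bits per escape, about 0.33 bits per output symbol when 1 symbol in 10 is an escape
    oversized index:  a dictionary with 10x the needed entries adds log2(10) = 3.32 bits to every dictionary index sent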

    The most important LZ78/LZW limitation that Tree eliminates is the inability to create dictionary symbols from more than 1 or 2 prior symbols. This limitation severely impacts the efficiency of deduplicating long, rare matches.

    I think SR falls within the Universal LZ General Format as an algorithm with fixed symbol length 1 (not coded) and a fancy algorithm for the offset.
    Last edited by Kennon Conrad; 30th September 2014 at 12:17.

  #6 biject.bwts (Member, Texas)
    Well, I am a fan of BIJECTIVE LZW. You state Tree is 8% better than any other LZ method for enwik9. It would be nice if you, at least in this thread, gave a compressed size to shoot for. The LZW format, when implemented correctly, always forces the longest match at each point during compression. At each point you have a fixed number of values to choose from, and there is no way to pick the wrong number.

    I have posted on this site a 2-root method that will even compress enwik9 without a dictionary reset. Of course a 4- or 16-root version would be better. I consider bijective nonlength transforms done to the file first fair to use before the implementation of the LZW algorithm, which is an entropy compressor.

    It gets 232,638,564 bytes: enwik9.scott_trans.lzw1_64m
    while I see on Matt's compression page that
    plzma_v3b c2 1000000000 999999999 273 8 0 0 6000 1 1 1 7 gets 193,240,160 bytes using LZMA

  #7 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by biject.bwts View Post
    Well, I am a fan of BIJECTIVE LZW. You state Tree is 8% better than any other LZ method for enwik9. It would be nice if you, at least in this thread, gave a compressed size to shoot for.
    The best Tree has done is 177,253,956 bytes, posted yesterday by Sportman.


  #8 nburns (Member, San Diego)
    Quote Originally Posted by Piotr Tarsa View Post
    I think Tree in its current form resembles LZ78 more than LZ77. LZW isn't the only valid LZ78 variant; you can devise different ones: http://encode.su/threads/216-LZW-LZM...ZAP-comparison And the LZW format doesn't force you to select the longest matches; that's up to you. Also, LZ78-derived algorithms seem not to categorize symbols by recency (which would enable efficient dictionary index modelling) and they do full dictionary restarts on space exhaustion (instead of, e.g., pruning the most useless entries).
    If you don't use the longest match in LZW, then you can't insert a new phrase, either, because match+1ch is already in the dictionary. I suppose that non-greedy match selection wouldn't cripple the algorithm fatally; the format can represent it either way. It always seemed kind of critical that it add a phrase each cycle, though, because long phrases require many cycles to grow and this makes it slow to adapt; making it slower seems like a questionable idea.
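    A quick made-up example of that coupling: with "a", "ab" and "abc" already in the dictionary and input "abc...", choosing the shorter match "ab" means the entry that would be inserted, "ab" plus the next symbol "c", is "abc", which already exists, so nothing new is added that cycle.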

    Overall, my intuition about LZ77 vs LZ78 (when considering them as classes of algorithms) is that LZ77 is about outputting offsets into a window of previous data, while LZ78 is about outputting indexes to entries in a dictionary, and those indexes don't have to have anything in common with offsets. But then there would be a problem with coders like LZP (Lempel-Ziv Predictive) or SR (Symbol Ranking) - they don't correspond to either description.
    OTOH, I could say that the LZ class consists of algorithms that are able to encode variable-length substrings of past data with a single indivisible token. With that description LZP falls into the LZ category, but SR still doesn't (and this is probably OK, since SR intuitively doesn't look like LZ).
    Here's my intuition about LZ77 and LZ78. LZ77 was obviously a great algorithm, but it didn't solve the problem of how to choose matches, which is obviously a lot of the work of implementing an LZ77 compressor. There are infinitely-many matching schemes that are possible, and match finding is a nontrivial problem requiring special-purpose data structures and potentially a lot of memory and computation resources. I think Lempel and Ziv introduced LZ78 as an algorithm that was completely ready to implement and would be fast even on the computers of 1978. They made it simple and online, using only relatively simple data structures, and specified it precisely -- no questions were left unanswered. The fact that LZ78 was complete also made it possible to analyze in greater detail from a theory perspective. So I see LZ77 as something open-ended and underspecified and LZ78 as a complete and fully-specified algorithm that's ready to implement.

    BWT, CM, PPM, bare entropy coders, etc. are either bytewise or bitwise: they divide the input into fixed-size chunks (i.e. either bits or bytes, usually) and compress each one separately. That differentiates them from LZ-class algorithms.


    I think the main reason is that Tree analyzes the data offline and selects only the most useful substrings to be addressable. That reduces address lengths (where an address is either an offset or an index; in Tree's case it's an index).
    Yes. Existing LZ algorithms use fast heuristics and/or make other kinds of compromises to be fast and simple. Tree aims to get as close as possible to the global optimum, and it manages to outperform existing LZ algorithms at the cost of being very slow.

    I think LZ78-type algorithms could benefit a lot from offline analysis, from periodic pruning of useless entries (if new entries are defined implicitly as in LZW, which has no way to define a new token other than by selecting a previous one as a base), and from variable-length coding of dictionary entries (Tree assigns codes of different lengths, unlike LZW).
    I don't think those algorithms would still resemble LZ78 after those changes. I'd give them a new name.
    Last edited by nburns; 1st October 2014 at 04:02.

  #9 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns View Post
    Existing LZ algorithms use fast heuristics and/or make other kinds of compromises to be fast and simple. Tree aims to get as close as possible to the global optimum, and it manages to outperform existing LZ algorithms at the cost of being very slow.
    It seems like a few LZ77 style algorithms try to reach maximum LZ compression (LzTurbo, Lzham) but take a lot longer to compress than some of the faster LZ77 implementations. I suspect perfect LZ77 cannot reach the compression level of Tree because it doesn't (as far as I know) use a compressed history. I think compressed history works best when dictionary symbols are created at the first appearance and LZ77 does not have any way to do that. I could be wrong, but I think creating a compression program that is much faster than TreeCompress but still outperforms LZ77 is possible.

  #10 SoraK05 (Member, Kenya)
    LZ is a sliding window. It looks for a common pattern (generally scanning the string left to right) to relate to the past by its offset/length, and repeats this concept.
    For something more optimal, provided it looks at byte patterns and not something like variable bit lengths, once an initial pass is done to determine the general repetitive areas used to identify the offset/length and give the symbol, that data can be internally calculated and optimized to:
    a) Use the fewest possible entries to cover the whole file
    b) Prioritize these entries so that each effectively covers as much of the file as possible

    This is similar to looking at strings / 8-bit / byte patterns and then going further to find combinations where 16-bit / 24-bit / 32-bit 'clones/clusters' can be effective, except there is a notion of length and offset to reduce the overhead where applicable, so length can be used in a variable / suitable manner for smaller results.

    For something like 1 GB of 0 bits in a row, for storage purposes, it might top out at its limit/constraint - presuming, say, a maximum match length of 65536 bytes if that is the limit - and have to repeat this multiple times (and perhaps stack the matches as some form of encapsulation).
    The length aspect caters for more in this scenario than simply encoding the strings in chunks like zip would.

    This method ends up being more effective for data with a lot of 'clones/clusters' of common sets of patterns throughout a file, with some decent capturing of runs in a row.
    This is why 7z in general performs more effectively for some archival purposes than some other tools while also being relatively snappy.

    Patterns that recur a specific set distance apart, like every 24 bytes, have no way to be captured for that available '24-byte gap' prevalence in the file, since LZ looks only left-to-right and doesn't take structure at distances of more than 1 byte apart into account.


    Thanks for the PDF.
    Last edited by SoraK05; 1st October 2014 at 15:17.

  #11 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by SoraK05 View Post
    LZ is a sliding window. It looks for a common pattern (generally scanning the string left to right) to relate to the past by its offset/length, and repeats this concept.
    For something more optimal, provided it looks at byte patterns and not something like variable bit lengths, once an initial pass is done to determine the general repetitive areas used to identify the offset/length and give the symbol, that data can be internally calculated and optimized to:
    a) Use the fewest possible entries to cover the whole file
    b) Prioritize these entries so that each effectively covers as much of the file as possible
    LZ performs best when the window size is the entire file.

    With explicit dictionary creation, it is possible to create dictionary entries based on upcoming data rather than past data.

    Algorithms that create the least possible symbols are not optimal because they will create symbols that take more bits to encode than it would take to encode two or more much higher frequency symbols.

  #12 SoraK05 (Member, Kenya)
    Quote Originally Posted by Kennon Conrad View Post
    Algorithms that create the least possible symbols are not optimal because they will create symbols that take more bits to encode than it would take to encode two or more much higher frequency symbols.
    This is what I am referring to:

    00110010011001001100100110010000100000100001000001 00010011001001

    THE HEAVIEST OF CURRENT SCAN IS:
    00100 = 9
    store data = 5 + 9 = 14
    raw data = 5 * 9 = 45
    proportion = 45 / 14 = 3.2142857142857142857142857142857 = HEAVIER
    This is from the minimal variable string scan I mention in my writeup looking at strings.
    http://encode.su/threads/2048-Ultima...ll=1#post40690

    In this example, you will barely get anything if you look at 8-bit / 1-byte patterns, but looking at variable bits you get most of the file with a 5-bit string.
    You end up, IIRC, with a 5-bit, 3-bit, 1-bit and 1-bit entry, and priority for construction of the variable code tree is based on the weight, so that the 5-bit string gets the smallest bit representation. It doesn't care about the length, such as whether it is 5 bits or 3 bits, but about whether it is the 'heaviest', meaning it covers and represents the most of the file. As the heavier result, regardless of its bit length like 5/3/8, encompasses more of the file, it gets the smallest representation for the maximum compression from covering the file the most.
    This method requires rescanning for each entry and marking the offsets of all occurrences in a temporary mapping file, so you have all the minimal possible entries for the file and minimal slots/occurrences without any clash, for the all-round smallest form.

    This is technically the smallest possible code tree entry list and representation associated when the file is written all round, before other stuff mentioned to be specific to position eg.

    One method mentioned (clone/cluster) is after doing this process to run another scan to combine entries together, such as if the 5 bit and the 3 bit appear clustered together frequently in a file this alone gets a representation, and in a variable manner multiple (up to n) or a couple entries (at least 2 or more) can be combined based on weight (so can have mixed, some entries as 2, some as 4 eg) for the same approach of having the least possible entries in the code tree that cover the whole file the most and associated by weight to have the smallest code tree representation when the file is written all round.

    The writeup describes any repetition of entries in a row on its own entirely as its own code tree branch with its own weight association for presence in the file for smallest written file and is treated as a plain number, such that 1GB of 0 bit can have '0' and '1GB(-1)' as a number directly, with the code tree branch recognized for 'in a row' for its prefix. This is unlike LZ which looks for any patterns to relate for an offset and length, where the length can end up doing something similar.

    The idea is to make a central code tree then a secondary code tree (and more with internal repeats) used for the final entries in the temporary mapping file suitable just for it.

    All entries in the code trees are associated by their branch after sorting them for minimal code tree entries and priority of weight for smallest written form individually, such that all branches are recognized for their specific properties and for example the branch in the initial tree for in a row is recognized specifically for its entry as a number.

    LZ isn't as specific as this but for space/time it can be effective.
    And yes, in general processing the entire file is more effective and even looking at specific areas I mention in the writeup (mainly to separate possible compressible/uncompressible to leave some areas raw with recognition by length). This is before moving/shuffling data to get more repetition for priorities like in a row, common clusters, separate potential compressible/uncompressible areas (still WIP, [TO DO]).
    Last edited by SoraK05; 1st October 2014 at 20:45.

  #13 Kennon Conrad (Member, Bothell, Washington, USA)
    In my experience, tokenizing patterns with sizes that are not multiples of the size of the native data format or that repeat at intervals other than multiples of the size of the native data format generally makes compression worse.

  #14 nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad View Post
    It seems like a few LZ77 style algorithms try to reach maximum LZ compression (LzTurbo, Lzham) but take a lot longer to compress than some of the faster LZ77 implementations. I suspect perfect LZ77 cannot reach the compression level of Tree because it doesn't (as far as I know) use a compressed history. I think compressed history works best when dictionary symbols are created at the first appearance and LZ77 does not have any way to do that. I could be wrong, but I think creating a compression program that is much faster than TreeCompress but still outperforms LZ77 is possible.
    Re-reading my statement, I didn't manage to say much there. I was trying to say that LZ algorithms have made it a goal to be online or nearly online, whereas Tree completely abandons that idea and will continue making passes until it runs out of things to compress.

    The LZ format is really dependent on a subsequent Huffman/EC pass to achieve high compression levels, because the (start,length) pairs are too big by themselves. To get top-level compression, the match selection algorithm has to be aware of the Huffman pass and optimize for it. I'm not aware that any LZ algorithm has gone far enough in coordinating the two passes to compete with Tree.

    Another thing in Tree's favor is that it has no sliding window; it matches across the whole file. It might be somewhat unfair to these LZ algorithms to compare them with Tree without including a preprocessor for long-range matches.

  #15 SoraK05 (Member, Kenya)
    Quote Originally Posted by Kennon Conrad View Post
    In my experience, tokenizing patterns with sizes that are not multiples of the size of the native data format or that repeat at intervals other than multiples of the size of the native data format generally makes compression worse.
    As long as the initial code tree backend data contains the relevant lengths for the entries, then when any prefixes are read in the compressed content the exact string of any length (or whatever you code) can be determined. This is mainly some additional backend data solely to describe the length of the variable string content, and it lives only in the header; for its relatively minor size, this length association data will allow for breaking even and (considerably) more compression.
    Variable strings, rather than forced 8-bit patterns, are more likely to compress more from their presence and to break even (considerably) on the minor length association to the string entries, catching repetitive variable strings which cross over from byte to byte and within the bit makeup of the bytes.

    In regards to cpu calculation, for minimal storage this is considered beyond programming as 'no object'.



    EDIT:
    For the most part, the length has to be defined at some point, including for executable instructions.
    A method that is dynamic in the header, giving one size where all entries are the same, or letting them fluctuate in an organized way, should be efficient.

    For a custom self-extracting version, besides keeping the instruction steps minimal for the architecture, these things can be hardcoded and specific portions of the executable appended/modified.
    Last edited by SoraK05; 2nd October 2014 at 02:05.

  #16 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns View Post
    The LZ format is really dependent on a subsequent Huffman/EC pass to achieve high compression levels, because the (start,length) pairs are too big by themselves. To get top-level compression, the match selection algorithm has to be aware of the Huffman pass and optimize for it. I'm not aware that any LZ algorithm has gone far enough in coordinating the two passes to compete with Tree.
    It seems like making direct comparisons between compressors is difficult in many cases. I suspect some of the LZ algorithms that get the best compression try pretty hard based on memory usage and maximum block size (PLZMA, LzTurbo, Lzham for example), but I am not sure either. Lzham is open source, so I should take a look at the code at some point to get a better idea of what it does.

    Quote Originally Posted by nburns View Post
    Another thing in Tree's favor is that it has no sliding window; it matches across the whole file. It might be somewhat unfair to these LZ algorithms to compare them with Tree without including a preprocessor for long-range matches.
    Yes, it is unfair to Ilia's LZ algorithms. They seem very practical, keeping memory and compression time both at reasonable values. But Plzma and LzTurbo both fit all of enwik9 in their sliding window.

  #17 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by SoraK05 View Post
    As long as the initial code tree backend data contains the relevant lengths for the entries, then when any prefixes are read in the compressed content the exact string of any length (or whatever you code) can be determined. This is mainly some additional backend data solely to describe the length of the variable string content, and it lives only in the header; for its relatively minor size, this length association data will allow for breaking even and (considerably) more compression.
    Variable strings, rather than forced 8-bit patterns, are more likely to compress more from their presence and to break even (considerably) on the minor length association to the string entries, catching repetitive variable strings which cross over from byte to byte and within the bit makeup of the bytes.
    I was not questioning the ability to do what you describe, I was questioning the effectiveness.

    Let's take a closer look at the example you provided and pretend you could encode the length/offset code for the "00100" at the cost of an escape symbol (which is not possible, but puts a lower bound on the order 0 entropy):

    data: 00110010011001001100100110010000100000100001000001 00010011001001
    prevalent bit pattern: 00100 (let's call this "A")
    new data: 0011A11A11A11AA0AA0A010011A1

    original data: 43 0's, 21 1's => order 0 entropy 58.43 bits (43 x 0.574 + 21 * 1.608)
    new data: 7 0's, 12 1's, 9 A's, plus 4 0's, 1 1's and 1 escape for the dictionary (total 11 0's, 13 1's, 9 A's, 1 escape) => order 0 entropy 58.28 bits (13 x 1.387 + 11 x 1.628 + 9 x 1.918 + 1 x 5.087)

    That would provide 0.15 bits of savings if you could encode the offset/length code in just 1 escape, but you can't, so you have not compressed this data at all.
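    For anyone who wants to check these figures, here is a small C sketch (mine, not Tree code) that computes the order-0 cost from the symbol counts above; it reproduces the totals to within the rounding used in this post.

    Code:
    /* Order-0 entropy check: a symbol with count c out of n total symbols
       costs -log2(c/n) bits under an ideal order-0 coder. */
    #include <math.h>
    #include <stdio.h>

    static double order0_bits(const int *counts, int k) {
        int n = 0;
        for (int i = 0; i < k; i++) n += counts[i];
        double bits = 0.0;
        for (int i = 0; i < k; i++)
            if (counts[i] > 0)
                bits += counts[i] * -log2((double)counts[i] / n);
        return bits;
    }

    int main(void) {
        int original[]  = { 43, 21 };         /* 0's and 1's in the raw string */
        int rewritten[] = { 11, 13, 9, 1 };   /* 0's, 1's, A's and the escape  */
        printf("original : %.2f bits\n", order0_bits(original, 2));   /* ~58.43 */
        printf("rewritten: %.2f bits\n", order0_bits(rewritten, 4));  /* ~58.29; 58.28 above
                                                                         comes from rounded
                                                                         per-symbol costs */
        return 0;
    }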
    Last edited by Kennon Conrad; 2nd October 2014 at 07:50.

  #18 Bulat Ziganshin (Programmer, Uzbekistan)
    Quote Originally Posted by Kennon Conrad View Post
    Lzham is open source, so I should take a look at the code at some point to get a better idea of what it does.
    for optimal parsing, you can read http://encode.su/threads/1895-LZ-Optimal-parsing and http://cbloomrants.blogspot.ru/2008/10/10-10-08-7.html
    for lzma encoding tricks, you can read http://mattmahoney.net/dc/dce.html#Section_523 or http://cbloomrants.blogspot.ru/2010/...ting-lzma.html

    lzma sources are also available but extremely hard to read


  #19 biject.bwts (Member, Texas)
    Quote Originally Posted by Kennon Conrad View Post
    I was not questioning the ability to do what you describe, I was questioning the effectiveness.

    Let's take a closer look at the example you provided and pretend you could encode the length/offset code for the "00100" at the cost of an escape symbol (which is not possible, but puts a lower bound on the order 0 entropy):

    data: 00110010011001001100100110010000100000100001000001 00010011001001
    prevalent bit pattern: 00100 (let's call this "A")
    new data: 0011A11A11A11AA0AA0A010011A1

    original data: 43 0's, 21 1's => order 0 entropy 58.43 bits (43 x 0.574 + 21 * 1.608)
    new data: 7 0's, 12 1's, 9 A's, plus 4 0's, 1 1's and 1 escape for the dictionary (total 11 0's, 13 1's, 9 A's, 1 escape) => order 0 entropy 58.28 bits (13 x 1.387 + 11 x 1.628 + 9 x 1.918 + 1 x 5.087)

    That would provide 0.15 bits of savings if you could encode the offset/length code in just 1 escape, but you can't, so you have not compressed this data at all.

    For fun I took the above string and removed the one space.
    Then I had a file of 43 0's and 21 1's.
    I then used a012b01 to bijectively convert that file to a binary byte file 9 bytes long,
    then used lzw1_m32 to compress the result (it's a 2-root bijective LZW code);
    this compresses to the same size, 9 bytes.
    Then I ran b012a01 to get the underlying bit string
    01110001001000011110000101101010100101010111000000 01100001010111
    which is 36 0's and 28 1's, so no compression, and the zero order entropy is worse.

    Just felt like doing it
    Last edited by biject.bwts; 2nd October 2014 at 20:00. Reason: a typo

  #20 SoraK05 (Member, Kenya)
    64 bits is short for most data arrangements to compress, but if this kind of thing happened frequently in a file, using variable-length strings rather than a fixed length like 8 bits would break even. It is for illustration.
    I know each entry, and it looks like the 5 bit, 2 bit, 1 bit and 1 bit entries would each have their own code tree prefix, for overall minimal entries in the code tree and by weight, so the 'heaviest' one (nothing to do with length/frequency) would get the smallest representation. This uses a temporary mapping file to acknowledge the 'A' spots, 'B' spots etc. with no clash and write the file according to it for minimal spots.
    I describe it as a proportional shrink.
    It should write the smallest possible form overall, before anything to do with positioning like in a row/clone/chunk length or things more specific to areas, shuffling, mirror+inverse etc.

    It is just to demonstrate in principle that it is the 'heaviest', i.e. the most compression captured, which focuses on the minimal entries that are used the most in the file, giving each the ideal least representation for the smallest possible file written. You will have fewer entries in the code tree which cater for the whole file and their exact positions, and from fewer code tree entries, fewer possible bits for each of the prefixes used when writing, giving the least to the heaviest and most covered.
    The overhead of storing the variable lengths of entries rather than a static length is generally negligible compared to the compression gained, and where the lengths are static, including them all being 6 bits for some reason, this can be specifically acknowledged and given a static code tree prefix instead of a variable one if it applies based on common weights which won't break even. These things sort of add up.


    EDIT:
    To me, and perhaps I should mention this in my text, compression to a minimal size involves catering for the widest spectrum of files with the available algorithm tools. Hence, having multiple approaches concurrently and according to weight. You can capture more files in the spectrum for their prevalence with the minimal variable strings than with static ones overall, plus some adjustments to compensate. Where the size of the EXE is affected, like mentioned, a custom EXE can be made using only the modules applied to the file and tweaks (with possible automation of appending and hardcoding) as its own self-extracting unit.
    Last edited by SoraK05; 2nd October 2014 at 18:33.

  #21 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by SoraK05 View Post
    64 bits is short for most data arrangements to compress, but if this kind of thing happened frequently in a file, using variable-length strings rather than a fixed length like 8 bits would break even. It is for illustration.
    I know each entry, and it looks like the 5 bit, 2 bit, 1 bit and 1 bit entries would each have their own code tree prefix, for overall minimal entries in the code tree and by weight, so the 'heaviest' one (nothing to do with length/frequency) would get the smallest representation. This uses a temporary mapping file to acknowledge the 'A' spots, 'B' spots etc. with no clash and write the file according to it for minimal spots.
    I describe it as a proportional shrink.
    It should write the smallest possible form overall, before anything to do with positioning like in a row/clone/chunk length or things more specific to areas, shuffling, mirror+inverse etc.

    It is just to demonstrate in principle that it is the 'heaviest', i.e. the most compression captured, which focuses on the minimal entries that are used the most in the file, giving each the ideal least representation for the smallest possible file written. You will have fewer entries in the code tree which cater for the whole file and their exact positions, and from fewer code tree entries, fewer possible bits for each of the prefixes used when writing, giving the least to the heaviest and most covered.
    The overhead of storing the variable lengths of entries rather than a static length is generally negligible compared to the compression gained, and where the lengths are static, including them all being 6 bits for some reason, this can be specifically acknowledged and given a static code tree prefix instead of a variable one if it applies based on common weights which won't break even. These things sort of add up.
    You should read Matt's dce page. He has provided the link. Instead of describing "proportional shrink" and providing a long description, you could just call it Huffman coding and skip the description.

    Your idea is not attractive to me. Relying on the presence of recurring random patterns to accomplish compression is not likely to yield much. On the other hand, if you take advantage of the native data format size to make intelligent decisions about where to look for recurring patterns, then you would actually be talking about LZ-style algorithms.
    Last edited by Kennon Conrad; 2nd October 2014 at 18:37.

  #22 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by biject.bwts View Post
    For fun I took the above string and removed the one space.
    Then I had a file of 43 0's and 21 1's.
    I then used a012b01 to bijectively convert that file to a binary byte file 9 bytes long,
    then used lzw1_m32 to compress the result (it's a 2-root bijective LZW code);
    this compresses to the same size, 9 bytes.
    Then I ran b012a01 to get the underlying bit string
    01110001001000011110000101101010100101010111000000 01100001010111
    which is 36 0's and 18 1's, so no compression, and the zero order entropy is worse.

    Just felt like doing it
    But you did compress the string by 10 bits! Much better than less than 0.15 bits.....

    Edit: Was that a typo or a premonition?
    Last edited by Kennon Conrad; 3rd October 2014 at 07:57.

  #23 biject.bwts (Member, Texas)
    >type test.txt
    00110010011001001100100110010000100000100001000001 00010011001001

    >bwts test.txt test_bwts.x

    >type test_bwts.x
    11101000001100001101111111100000000000000000000001 01110000100000

    >y:mtfq test_bwts.x test_bwts_mtfq.x

    >type test_bwts_mtfq.x
    11100011110101110100111111101111111111111111111110 00110111001111

    >a012b01 test_bwts_mtfq.x test_bwts_mtfq.a
    Bijective convert ascii 0 and 1 to binary file version 20071227
    on test_bwts_mtfq.x to test_bwts_mtfq.a

    .
    number of ones 49 number of zeros 15
    string has 64 bits and 8.0 bytes
    *** DONE ***

    looking at entropy here -49*log(49/64)/log(2) - 15*log(15/64)/log(2) = 50.2758587042270195308261313..
    thought it best to use 2 state bijective compressor

    >x:arb2x_32 test_bwts_mtfq.a test_bwts_mtfq_2x.a


    >b012a01 test_bwts_mtfq_2x.a test.xx
    Bijective convert binary 0 and 1 to ascii file version 20071227
    on test_bwts_mtfq_2x.a to test.xx

    .
    number of ones 28 number of zeros 26
    string has 54 bits and 6.750 bytes
    *** DONE ***
    zero order entropy of -28×(log(28/54))/(log(2))-26×(log(26/54))/(log(2)) = 53.94655462754599...

    >type test.xx
    01010001110100000111011110110101111110011001100010 0100

    So now I save 10 bits

  #24 SoraK05 (Member, Kenya)
    I prefer not to use the term Huffman coding since anyone can make a code tree, whether variable or static, to represent anything.
    I'm being specific to an initial run of just strings.

    Once you do a second run in a similar manner on the results of the data, for all the entries available and eligible for clustering (so you don't force grouping, and some things can stay as 1 entry, or be 3 combined, 4 combined etc. for an ideal result), not only would variable strings be more suitable to the content, but the variable clustering of common patterns would be more specific than the way LZ, I believe, captures common clusters of patterns with the offset/length.
    For all entries at that point, one can look specifically for entries in a row and capture them all, and have individual prefixes just for 'in a row' repetition as a number. Instead of having multiple offset/length entries, you have in advance a single entry in the initial code tree, in an already minimized form, for a cluster, and then an occurrence in the file directly as a least-bit prefix; where there is any repetition in a row to acknowledge, it is done as a number which can be used anywhere, just like the cluster prefix, where appropriate.
    It can even cluster, as an internal repeat, the entries which appear cloned after this process, with specific prefixes which, only when met, are temporarily decoded in temp space for their layer.

    An entry which has 4 clustered entries in one, with its own prefix, would instead be an offset+length in LZ. Where there is repetition in a row of this 4-entry cluster, you have one prefix for the cluster and one prefix for the repetition in a row as a number, where LZ has offset+length and then another length.


    You avoid all the data for offset+length by having the exact string data associated to minimal prefixes in advance, as well as in-a-row repeats as an exact number which can be used anywhere too. There is no requirement to associate this same data to offset+length; you can get it all in advance with its own minimal prefixes. LZ is quicker when considering space/time.


    And I see this biject method rearranges data? I've also mentioned rearranging data to break even through the relinking/association of areas and matching common data, to capture more prevalence and to be specific about the priority of what kind of prevalence.
    The writing is WIP but does cater for a range.


    Anyway, these are just suggestions to consider for getting minimal effective data out of the LZ scan to make it more efficient, with less reliance on things like frames or specific limits on the length in bytes allowed for an entry, etc.
    Last edited by SoraK05; 3rd October 2014 at 01:52.

  #25 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Bulat Ziganshin View Post
    for optimal parsing, you can read http://encode.su/threads/1895-LZ-Optimal-parsing and http://cbloomrants.blogspot.ru/2008/10/10-10-08-7.html
    for lzma encoding tricks, you can read http://mattmahoney.net/dc/dce.html#Section_523 or http://cbloomrants.blogspot.ru/2010/...ting-lzma.html

    lzma sources are also available but extremely hard to read
    Thanks, those are some very helpful links. I ended up downloading lzham and tornado. IMHO, the tornado code is much cleaner and easier to follow. I'm not saying lzham is bad, just that the tornado code is very well structured. I generally don't write C++ code and sometimes struggle with deciphering objects, but I found tornado easy to follow. It has pretty much everything I expected, like arithmetic coding, combined length/offset codes and smart symbol selection. Nice job! (Just don't ask me to decipher the (UTF-16?) comments!)


  #26 nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad View Post
    It seems like making direct comparisons between compressors is difficult in many cases. I suspect some of the LZ algorithms that get the best compression try pretty hard based on memory usage and maximum block size (PLZMA, LzTurbo, Lzham for example), but I am not sure either. Lzham is open source, so I should take a look at the code at some point to get a better idea of what it does.
    It's hard to compare compressors using different techniques, but it isn't impossible. When I was exploring a wide variety of algorithms a while back, I found myself repeatedly realizing that, even though algorithms look very different on the surface, the deeper you examine them, the more you realize they're all after the same thing. I also found that superficially-different techniques have an uncanny tendency to yield file sizes that are almost identical to each other. So that supports the idea that they're all approximating the same underlying entropy (even the error tends to be consistent, using dissimilar encodings but a similar degree of sophistication).

    Making the offsets and lengths Huffman-friendly isn't the only way to trim down LZ. For instance, the match source can be described as an offset counting backwards from the current position (this is the standard way, I think, or something quite similar to this). But there are infinitely many alternate ways to describe the match source position. Any total order across the text positions can be used, e.g. the rank obtained from suffix-sorting the text up to the current position. Considering different total orders and reference positions to measure the offset from, you can usually find some better way that makes a smaller number, usually doing additional computation along the way. That's one way to attack the size of LZ. Other ways are things like expecting matches with short gaps -- like 'President Clinton' and 'President Bill Clinton'. cbloom has written a great deal about LZ on his rants website. Highly efficient LZ algorithms like lzma employ multiple ways of describing matches or non-matching sections of text and switch between them. It's my general opinion that one good idea is equal to or better than several mediocre ideas strung together. Many of those same techniques would work in Tree, too. But they'd make it more complex and slow down decoding.
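    A small illustration of the "make the number smaller" idea, with figures invented purely for illustration: lzma keeps the last few match distances and can say "same distance as a recent match" instead of re-sending the offset, so a raw offset of, say, 1,734,212 (about 21 bits before entropy coding) for the second 'President ... Clinton' occurrence can collapse to a repeat-distance slot costing only a few bits.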


  #27 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns View Post
    Making the offsets and lengths Huffman-friendly isn't the only way to trim down LZ. For instance, the match source can be described as an offset counting backwards from the current position (this is the standard way, I think, or something quite similar to this). But there are infinitely many alternate ways to describe the match source position. Any total order across the text positions can be used, e.g. the rank obtained from suffix-sorting the text up to the current position. Considering different total orders and reference positions to measure the offset from, you can usually find some better way that makes a smaller number, usually doing additional computation along the way. That's one way to attack the size of LZ. Other ways are things like expecting matches with short gaps -- like 'President Clinton' and 'President Bill Clinton'. cbloom has written a great deal about LZ on his rants website. Highly efficient LZ algorithms like lzma employ multiple ways of describing matches or non-matching sections of text and switch between them. It's my general opinion that one good idea is equal to or better than several mediocre ideas strung together. Many of those same techniques would work in Tree, too. But they'd make it more complex and slow down decoding.
    Yes, for instance, adding a context model on the back end of LZ has shown good results if you are willing to have the code run slower. I am sure Tree could create smaller files with smarter encoding and decoding as well as a context model, but for now I don't want to slow down decompression since that seems to be the best performance aspect of Tree. It could also create smaller files with smarter capital encoding, smarter UTF-8 recognition, arithmetic coding, supporting matches with skips, adding RLE, etc. But people know these things, many wouldn't have a large overall impact, and, as you say, they would slow down decompression.

    I may be wrong, but I think Tree spends way too much time during compression analyzing what is nearly the same data over and over. A smart linear approach should be able to eliminate this problem and yield compressed files much faster without compromising the compressed file size too much. I think there are advantages to using the universal format to create dictionaries and use compressed references like LZ78 (but much smaller dictionary size) while retaining the ability to use a single reference for long, rare matches like LZ77. It allows for algorithms that approach the overall compression problem as a top-down deduplication math problem rather than a local pattern matching math problem. But words are fairly meaningless without a demonstration, so that is what I am working on.

  #28 nburns (Member, San Diego)
    I belatedly looked at the pdf. A few diagrams could be extremely useful to help communicate what the universal format and the subtypes actually look like.

    Quote Originally Posted by Kennon Conrad View Post
    I may be wrong, but I think Tree spends way too much time during compression analyzing what is nearly the same data over and over. A smart linear approach should be able to eliminate this problem and yield compressed files much faster without compromising the compressed file size too much. I think there are advantages to using the universal format to create dictionaries and use compressed references like LZ78 (but much smaller dictionary size) while retaining the ability to use a single reference for long, rare matches like LZ77. It allows for algorithms that approach the overall compression problem as a top-down deduplication math problem rather than a local pattern matching math problem. But words are fairly meaningless without a demonstration, so that is what I am working on.
    Note that in LZ77, there is no such thing as top-down. Top-down implies traversing a tree from root to leaf. LZ77 matches don't form a tree.

  #29 Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns View Post
    I belatedly looked at the pdf. A few diagrams could be extremely useful to help communicate what the universal format and the subtypes actually look like.

    Note that in LZ77, there is no such thing as top-down. Top-down implies traversing a tree from root to leaf. LZ77 matches don't form a tree.
    Yes, top-down is only suitable when there is a dictionary. Here is a simple diagram with a few comments.

    The idea is actually pretty simple and I suppose even CM fits the model, always using a length code of 1.
    Attached Thumbnails: Universal_LZ_Diagram.jpg (126.5 KB)

  #30 nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad View Post
    Yes, top-down is only suitable when there is a dictionary. Here is a simple diagram with a few comments.

    The idea is actually pretty simple and I suppose even CM fits the model, always using a length code of 1.
    Forgive me for being lazy and not working very hard at making sense of that. What I had in mind was a diagram of how the physical bits would be laid out. A C struct definition would be equivalent.
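    For what it's worth, here is a purely hypothetical sketch of the kind of C layout being asked about, pieced together from the elements discussed in this thread (literals, an escape that defines a new dictionary entry, LZ78-style indexes and LZ77-style offset/length pairs). The field names and widths are invented; this is not the actual Universal LZ General Format from the attached paper.

    Code:
    /* Hypothetical token layout for a generalized LZ stream (illustration only). */
    enum token_kind { TOKEN_LITERAL, TOKEN_DEFINE, TOKEN_DICT_REF, TOKEN_WINDOW_REF };

    struct token {
        enum token_kind kind;
        union {
            unsigned char literal;                      /* raw symbol                        */
            struct { unsigned length; } define;         /* escape: next `length` symbols
                                                           form a new dictionary entry       */
            struct { unsigned index; } dict;            /* LZ78/LZW-style dictionary index   */
            struct { unsigned offset, length; } window; /* LZ77-style back reference         */
        } u;
    };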
