
Thread: Tree alpha v0.1 download

  1. #541
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    I think very simple skip-gram context modeling should help a lot for DNA---just first-order Markov-like context modeling of sequences of your Tree strings, but tolerating one mismatched symbol.

    I'm pretty sure that's going to work well for text as well, so you can often recognize things like "in the <A> of the <B>" with a first-order model, skipping a "single" symbol, where <A> and <B> might be single words or whole common noun phrases that Tree's front end has collapsed to single symbols. By doing most of the PPM-like multi-order modeling up front, you should get a vocabulary of symbols for which a single-order model works surprisingly well, if it can handle the "single" skips that kept the front-end from recognizing the whole larger phrase in the first place. Cool.
    All sorts of modeling could be added to Tree. The only place I wouldn't expect much benefit is with an order 1 model of adjacent symbols since the good recurring matches have already been combined into symbols. But skip contexts and context modeling of adjacent subsymbols should provide better compression.
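The skip-gram idea can be sketched as a skip-1 bigram model: condition each symbol on the symbol two positions back, ignoring whatever sits in between. A minimal sketch (the toy stream and the function name are illustrative, not Tree's actual code):

```python
from collections import Counter, defaultdict

def skip1_counts(symbols):
    """First-order model with one skipped symbol: count how often each
    target follows the symbol two positions back."""
    counts = defaultdict(Counter)
    for i in range(2, len(symbols)):
        counts[symbols[i - 2]][symbols[i]] += 1
    return counts

# Toy symbol stream: "in the <A> of the <B>" with varying middles.
stream = ["in", "the", "middle", "of", "the", "road",
          "in", "the", "heat", "of", "the", "night"]
counts = skip1_counts(stream)
print(counts["the"].most_common(1))  # "the" two back predicts "of"
```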

  2. #542
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Did you know he had been working on grammar compression, or was it a coincidence?
    I had no idea, but it worked out well. When I stopped by his office and asked for a few minutes of his time, he seemed disinterested and said he had to leave soon. I must have looked desperate, because my impression was that he wasn't going to give me any time, but after a brief hesitation, he asked if 5 minutes would be enough. As soon as I mentioned a grammar compressor, top-down, and Markov chain probabilities, he was suddenly very interested and made a comment about getting visitors who don't know what they are talking about. He gave me all the time he had and has been nothing but encouraging since.

    Quote Originally Posted by nburns View Post
    Actually, MTF works ok for symbols that repeat at a stable frequency: if a symbol has probability p, then you expect to see one every 1/p symbols... and MTF assigns a reasonable number on average. I remember this being mentioned in the paper.

    Of course, there are plenty of ways to get poor behavior from MTF.
    If the frequency is perfectly stable, it seems like MTF will lose. For instance, if a symbol occurs once every 100 symbols, it's going to be impossible for most MTF schemes to give a 1% probability to the 100th queue position.
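    The stable-frequency failure is easy to make concrete: with 100 symbols cycling round-robin, every symbol is always found at the very back of the queue, so a recency-biased code over queue positions assigns it far less than its true 1% probability. A minimal sketch:

```python
def mtf_ranks(stream):
    """Emit the 1-based MTF queue position of each symbol (0 = first sight)."""
    queue, out = [], []
    for s in stream:
        if s in queue:
            r = queue.index(s) + 1
            queue.remove(s)
            out.append(r)
        else:
            out.append(0)  # first occurrence: would need an escape code
        queue.insert(0, s)
    return out

# 100 symbols, each occurring exactly once every 100 positions.
stream = list(range(100)) * 5
ranks = mtf_ranks(stream)
# After the first pass, every symbol is always found at position 100:
assert set(ranks[100:]) == {100}
```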

  3. #543
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by nburns View Post
    Actually, MTF works ok for symbols that repeat at a stable frequency: if a symbol has probability p, then you expect to see one every 1/p symbols... and MTF assigns a reasonable number on average. I remember this being mentioned in the paper.
    Yes. The problem is that if you do have stable frequencies, Huffman generally wins by a little, because it's optimal for that, and in benchmarks MTF tends to predictably come out "ok", but no better, and its competitiveness doesn't count for much---why use what's usually second best, and only sometimes best?

    For example, there's a number of papers comparing things like Huffman and MTF-plus-Elias-coding and maybe splay-tree coding, where they throw raw data at zeroth-order models, and of course Huffman usually wins. And often most of the data is text---a stream of letters or a stream of words with a fairly stable Zipfy distribution. So of course Huffman usually wins, with MTF not far behind, but behind.

    What people seem to have systematically missed is that (as you've said before) MTF is just a transform, and like any handy transform it will make some regularities clearer and others more obscure. You don't want to just use it directly as a model, but as a feature extractor to tell you what model is appropriate for the data at hand.

    And often a transform is best used in conjunction with other transforms---compose transforms in one way, and certain features will pop out, but if you compose them in another way, other features will pop out.

    MTF is particularly interesting for compression when combined with a straight distance transform, like what Matt's fv program plots. (Essentially showing you LZ77-style sliding-window match offsets.)

    When you look at FV plots, it's obvious there's rich structure in most kinds of data except plain text. You get horizontal striping because of strict strides in the data, multiple stripes at different offsets due to nested arrays, bands of fuzzy horizontal striping because of approximate strides due to variable-length records, patches of "snow" where you hit data that don't have those kinds of regularity (e.g., text or variable-length instructions), rising features where data repeat at larger and larger intervals because new things are interleaved with them, falling features because data repeat in reverse order, etc.

    If you MTF transform the output of the match distance transform, you get something interesting, similar to what LZMA does for modeling LZ77 match offsets. LZMA keeps a short (4-element) MTF queue of recently-seen offsets, so that it can predict and exploit strided repeats.

    You might think a short MTF queue of match distances would let you keep track of strides at several distances, so that you could recognize nested array structures with major and minor repeat intervals, and sometimes it will, but often it won't. Often you get a lot of matches at one particular distance (the minor strides, like an inner loop), and matches at various other distances too. What the short MTF queue does in that case is to make the predictor tolerant of some noise---if you're getting a lot of matches at one distance and some at various other distances, that one important distance is likely to be among the ones in the queue.

    LZMA is weird, and it's hard to understand what it's really doing, because it's applying the MTF-of-distance transform after another funny transform---variable-length string matching, which chops the data up in funny ways.

    Let's consider something simpler. Suppose you just do one-byte matching, with a match distance transform. That is, for every byte value, you keep a pointer back to the previous occurrence of that value, and when you see a repeat, you just emit the offset backward from the current pointer.

    If you have a repeating pattern like abcdeabcdeabcdeabcde..., then after the first abcde, you'll just get a string of 5's---you're always seeing the same letter you saw 5 letters ago.

    If you MTF transform that, you'll get a corresponding sequence of 1's. Runs of 1's in MTF-of-match-distance correspond to string copies. Cool.

    But it's not that simple, because you can have repeats within repeats that throw things off. Suppose the repeating pattern has a repeat in it, like abcad, so that we're modeling the string abcadabcadabcad... the a's are going to match at different distances than the other letters, and after the first abcad, we'll get a match distance of 2 for a, then 5's for b and c, then a 3 for the second a in the pattern, and another 5 for the d, so we'll see 25535 over and over again. When we MTF transform that string of match distances, we'll get a repeating pattern of 42132.
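    The one-byte match distance transform and the MTF pass over it can be sketched directly (a toy illustration of the abcde/abcad examples above):

```python
def match_distances(data):
    """For each byte, the distance back to its previous occurrence (0 if none)."""
    last, out = {}, []
    for i, b in enumerate(data):
        out.append(i - last[b] if b in last else 0)
        last[b] = i
    return out

def mtf(seq):
    queue, out = [], []
    for x in seq:
        if x in queue:
            out.append(queue.index(x) + 1)
            queue.remove(x)
        else:
            out.append(0)  # first sight
        queue.insert(0, x)
    return out

d1 = match_distances("abcde" * 4)
print(d1[5:])       # all 5's: always matching 5 back
print(mtf(d1[5:]))  # a 0, then a run of 1's---the signature of a string copy

d2 = match_distances("abcad" * 4)
print(d2[5:10])     # the repeating 2,5,5,3,5 pattern
```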

    I think that's why a 4-element MTF queue is a good choice for LZMA's modeling of LZ77 offsets. It's the minimum length queue that you need to track one stride with one awkward bobble in it.

    That's often enough to let you find the inner repeating pattern (minor stride) in nested repeating patterns, which is probably the biggest win.

    If you're lucky, and there are no bobbles in your inner repeating pattern, it's also enough to let you track nested repeats, and get another win.

    --

    I think there's room for improvement, though, if you don't just use the output of MTF for prediction, as LZMA does (as I understand it). A better predictor could look at the actual data items, the match distances, and the MTF-of-match distribution, and combine the information intelligently. For example, in the example above, it might notice that the extra A's in the pattern are causing the problem, and the stride is really just 5, with a bobble at 4. It would know to just predict whatever's 5 bytes back, and ignore the bobble, because it doesn't actually matter. (It doesn't improve the prediction over just noticing the stride of 5.)

    You could do that in an adaptive context-mixer, if you had an MTF queue of recently-seen match distances---not for LZ77 coding, but just for prediction. You'd have a predictor for each match distance, and pick which one works better. The predictor for a stride of 5 would clearly dominate the predictors for 2 and 3.
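    A toy version of competing per-stride predictors, scored on the abcad example (a sketch of the idea, not an actual context mixer):

```python
def stride_accuracy(data, stride):
    """Fraction of positions where the byte `stride` back predicts the current byte."""
    hits = sum(data[i] == data[i - stride] for i in range(stride, len(data)))
    return hits / (len(data) - stride)

data = "abcad" * 40
for s in (2, 3, 5):
    print(s, stride_accuracy(data, s))
# The stride-5 predictor is right every time; strides 2 and 3 only hit
# on the repeated 'a', so a mixer would quickly weight stride 5 up.
```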
    Last edited by Paul W.; 11th January 2015 at 11:58. Reason: typo

  4. #544
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    actually, lzma doesn't find matches and then apply MTF. it checks possible matches at the last 4 offsets and then uses a complex heuristic to select between MTF-based matches and natural ones


  6. #545
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Not that I've seen. If anybody else has gotten and published the idea, it hasn't gotten the attention it deserves, or I'd have seen it in things I have read.
    That's surprising. It seems like it is a simple concept that should be known.

    Here's an example that shows the possible benefit of distinguishing between ergodic and non-ergodic symbols for MTF.

    Suppose we have a 32 symbol file with symbol probabilities as follows:
    p(A) = 0.5
    p(B) = 0.5 for the first half of the file, 0.0 for the second half of the file
    p(C) = 0.0 for the first half of the file, 0.5 for the second half of the file

    If we encode the file with Huffman codes, A gets a 1 bit code and B and C get 2 bit codes, so the Huffman encoded file would total 16 x 1 + 8 x 2 + 8 x 2 = 48 bits.

    If we encode the file with an MTF queue, MTF queue position probabilities should be approximately as follows:
    p(1) = 15/32
    p(2) = 15/32
    p(3) = 2/32
    and the encoded file would typically contain 40.794 bits (slightly less if the file did not have a B or a C).

    If we encode the file with an ergodic-aware MTF queue (only B and C put in the queue), probabilities should be approximately as follows:
    p(A) = 0.5
    p(1) = 14/32
    p(2) = 2/32
    and now the encoded file would typically contain 40.696 bits.

    So for this example, MTF saved 7.206 bits and ergodic aware MTF saved 7.304 bits. The advantage I see is that by modeling ergodic symbols separately from non-ergodic symbols, you can avoid having to skew the probabilities of symbols that should not have their probabilities skewed by recency. Where it works, it saves some bits plus the time to maintain the mtf queue positions of the ergodic symbols. As noted earlier, the impact on Tree was not insignificant. A partial implementation produced a compressed enwik9 file that was almost 0.2% smaller and could be decoded more than 10% faster.
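    The three figures above can be reproduced from the ideal code lengths (matching to within a rounding digit):

```python
from math import log2

def bits(counts, total):
    """Ideal code length: each symbol coded at its empirical probability."""
    return sum(n * -log2(n / total) for n in counts if n)

# Huffman on the stable assignment: A = 1 bit, B = C = 2 bits.
huffman_bits = 16 * 1 + 8 * 2 + 8 * 2
# Plain MTF: queue positions 1 and 2 each ~15/32, position 3 ~2/32.
mtf_bits = bits([15, 15, 2], 32)
# Ergodic-aware: A coded directly at p = 0.5; only B and C use the queue.
aware_bits = 16 * 1 + bits([14, 2], 32)
print(huffman_bits, round(mtf_bits, 3), round(aware_bits, 3))
```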
    Last edited by Kennon Conrad; 11th January 2015 at 21:40.


  8. #546
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    actually, lzma doesn't find matches and then apply MTF. it checks possible matches at the last 4 offsets and then uses a complex heuristic to select between MTF-based matches and natural ones
    Oh. OK.

    I'm not sure what that really means, though... does it basically change the story I gave above, or just complicate it?

    My impression is (or was) that LZMA does bounded "optimal" parsing to pick which match to use and what match length, e.g., taking a shorter match sometimes in preference to a longer one so as not to interfere with a better match upstream.

    I'm not clear on how that affects the MTF-of-match-distance thing, e.g., when LZMA would choose a "natural" match over an MTF match.

    I take it you mean just using an LZ77 offset, not relative to one of the match distances in the MTF queue, which will then be context-modeled and range coded... but I don't understand the context modeling and coding, and when it would be likely to compress to something smaller than an MTF code. I'd think it'd almost always be substantially bigger, unless the context modeling is finding other common regularities and exploiting them really well. If it's able to do that, I'm curious what regularities it's sensitive to.

  9. #547
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    lzma employs optimum parsing that just finds all possible matches for every position. then it finds the combination of matches/literals that gives the smallest encoded size using the current encoding costs. i detailed it a bit more in my OP thread. the little MTF queue just adds more matches to check - in addition to the "natural" matches, lzma finds the match length at those offsets and checks these match possibilities too when searching for the best match combination
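    A sketch of what checking the rep-offset queue adds, under my reading of this description: at each position, in addition to the match finder's natural matches, the parser computes the match length at each recently used offset and lets optimal parsing weigh those candidates too (the data and offsets are illustrative):

```python
def match_len(data, pos, dist, max_len=273):
    """Length of the match at `dist` bytes back from `pos`."""
    n = 0
    while pos + n < len(data) and n < max_len and data[pos + n] == data[pos + n - dist]:
        n += 1
    return n

def rep_candidates(data, pos, rep_offsets):
    """Extra match candidates from the recently-used-offset queue; an
    optimal parser would weigh these against the natural matches."""
    return [(d, match_len(data, pos, d)) for d in rep_offsets if d <= pos]

data = b"abcadabcadabcad"
print(rep_candidates(data, 10, [5, 3, 2, 7]))
```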
    Last edited by Bulat Ziganshin; 13th January 2015 at 22:33.


  11. #548
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I had no idea, but it worked out well. When I stopped by his office and asked for a few minutes of his time, he seemed disinterested and said he had to leave soon. I must have looked desperate, because my impression was that he wasn't going to give me any time, but after a brief hesitation, he asked if 5 minutes would be enough. As soon as I mentioned grammar compressor, top-down, and Markov chain probabilities, he was suddenly very interested and made a comment about getting visitors that don't know what they are talking about. He gave me all the time he had and has been nothing but encouraging since.
    Cool.

    If the frequency is perfectly stable, it seems like MTF will lose. For instance, if a symbol occurs once every 100 symbols, it's going to be impossible for most MTF schemes to give a 1% probability to the 100th queue position.
    You would only want to use MTF where you expect a lot of symbols to repeat. The only benefit of MTF over coding the raw offset is that you get to eliminate repeats, which gives smaller ranks. But I think your point was that MTF gives unpredictable codes, and you're right. I think that's why MTF and splay trees aren't very popular in compression (except MTF in the BWT).


    Quote Originally Posted by Paul W. View Post
    For example, there's a number of papers comparing things like Huffman and MTF-plus-Elias-coding and maybe splay-tree coding, where they throw raw data at zeroth-order models, and of course Huffman usually wins. And often most of the data is text---a stream of letters or a stream of words with a fairly stable Zipfy distribution. So of course Huffman usually wins, with MTF not far behind, but behind.
    Huffman and MTF aren't exactly mutually-exclusive. You still need to use something like Huffman after MTF. I think I get your point, though.


    What people seem to have systematically missed is that (as you've said before) MTF is just a transform, and like any handy transform it will make some regularities clearer and others more obscure. You don't want to just use it directly as a model, but as a feature extractor to tell you what model is appropriate for the data at hand.

    ...

    I think there's room for improvement, though, if you don't just use the output of MTF for prediction, as LZMA does (as I understand it). A better predictor could look at the actual data items, the match distances, and the MTF-of-match distribution, and combine the information intelligently. For example, in the example above, it might notice that the extra A's in the pattern are causing the problem, and the stride is really just 5, with a bobble at 4. It would know to just predict whatever's 5 bytes back, and ignore the bobble, because it doesn't actually matter. (It doesn't improve the prediction over just noticing the stride of 5.)

    You could do that in an adaptive context-mixer, if you had an MTF queue of recently-seen match distances---not for LZ77 coding, but just for prediction. You'd have a predictor for each match distance, and pick which one works better. The predictor for a stride of 5 would clearly dominate the predictors for 2 and 3.
    You can't really do anything sophisticated with MTF. I think it's best used in situations where you don't want to think too hard and you just want to grab some low-hanging fruit. If you know your data very well, you can undoubtedly do something better with it.

  12. #549
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    You would only want to use MTF where you expect a lot of symbols to repeat. The only benefit of MTF over coding the raw offset is that you get to eliminate repeats, which gives smaller ranks. But I think your point was that MTF gives unpredictable codes, and you're right. I think that's why MTF and splay trees aren't very popular in compression (except MTF in the BWT).
    I wasn't trying to make a point that MTF gives unpredictable codes. I was trying to make the point that MTF will not be effective if symbols repeat at regular intervals, because an MTF queue that favors recency cannot give the symbol a 1/N probability at position N. I think MTF provides benefits over dictionary encoding only when the symbol occurrences are more grouped together than you would expect from a random distribution, and that there can be benefits, especially for text, in having a symbol ranking algorithm distinguish between symbols that tend to occur in clusters, symbols that are randomly distributed, and symbols that are more evenly spaced than a random distribution.

  13. #550
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    MTF transform, pattern detection, plain-OR-MTF entropy

    Quote Originally Posted by nburns View Post
    Cool.
    You can't really do anything sophisticated with MTF.
    Maybe not with an MTF-entropy coder per se, but the MTF transform can be used in sophisticated ways. (It's what makes the compressed memory system in OS X work well, for example, dynamically adjusting the split between compressed and uncompressed RAM with a simple online cost-benefit analysis.)

    You can do things like robustly detecting the sizes of loops, including nested loops, somewhat variable-sized loops without a strict stride, loops over gradually growing or shifting sets of items, and loops whose ordering varies a lot locally, and loops whose order reverses from iteration to iteration. (Real programs do all those things.)

    You just can't do that sort of thing with, say, string-matching and frequency counting, or Markov models. Small local variations throw them off from the get-go, and they're just not smart enough to notice aggregate patterns like loopiness anyway, except in the simplest cases. They're just not robust in the face of minor local variations, because you get an explosion of literal strings or contexts. (And you need explosively more input to get useful frequencies of those zillions of specific strings.) MTF is robust, not brittle.

    For caching policies, you can use the MTF transform to tell you things like when to use LRU vs. MRU and at what scales. (E.g., keep the 2200 most recently used items, because your inner loop fits in RAM, but evict things after that in a more MRU-like way, because your outer loop doesn't.)
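    Loop-size detection from the MTF/LRU transform can be sketched with stack distances: a histogram of reuse distances for a nested loop puts one spike at the inner loop's working set and another at the outer loop's (the trace shape is illustrative):

```python
from collections import Counter

def stack_distances(trace):
    """MTF/LRU stack distance of each reference (0 = first touch)."""
    stack, out = [], []
    for x in trace:
        if x in stack:
            d = stack.index(x) + 1
            stack.remove(x)
        else:
            d = 0
        out.append(d)
        stack.insert(0, x)
    return out

# Inner loop runs twice over each 8-item block; outer loop cycles 3 blocks.
trace = []
for _ in range(4):                 # outer iterations
    for block in range(3):
        items = list(range(block * 8, (block + 1) * 8))
        trace += items * 2         # inner loop: two passes per block

hist = Counter(d for d in stack_distances(trace) if d)
print(sorted(hist.items()))
# Exactly two spikes: distance 8 (inner-loop reuse) and distance 24
# (outer-loop reuse over all 3 x 8 = 24 items).
```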

    Similar patterns occur in various binary data files, as well as programs' memory reference traces. (You can pretty much see them in fv plots of various binary files.)

    I think it's best used in situations where you don't want to think too hard and you just want to grab some low-hanging fruit. If you know your data very well, you can undoubtedly do something better with it.
    Counting repeats is just grabbing low-hanging fruit, too. Whether you should use frequency or recency depends on what kind of fruit you're picking.

    MTF is a basic transform, very well suited to compression (and to caching), because it's about something fundamentally related to both compressibility (and cacheability)---how many things recur how often.

    Like any good transform, MTF tends to make some important patterns more apparent (aggregate behavior like loopiness), and others more obscure (like stable frequencies of particular items). You have to know when and how to use it.

    Its coolest uses are in pattern detection---figuring out what regularities are present in the data, so you can adjust your model---rather than just MTF-entropy-coding.

    --

    The simplest version of that is to gang a normal entropy coder and an MTF-entropy coder together, and only output the result of whichever works better. You're putting in an optional MTF modeling step (or not) before actual entropy coding.

    Where Tree is discriminating between particular symbols that behave ergodically and those for which MTF is better, globally, this would dynamically adjust to local aggregate behavior---stretches of input that are behaving ergodically or transiently.

    A blockwise coder could just add a flag bit at the front saying whether the block was compressed with plain or MTF entropy, and the cost would be negligible for the usual sizes of blocks. When your stats are stable enough that plain beats MTF, you do the usual thing, and when you have a block with more transient stats, you use MTF. If your coding is strictly blockwise (no memory of previous blocks), the decoder wouldn't have to run both sub-decoders---just the one the encoder chose for the block at hand.

    A symbolwise coder could adapt every n symbols, using whichever coder worked better lately, and avoid needing an escape to tell the decoder what to do. (But an escape or a flag bit every n symbols would let the decoder avoid having to run two decoders all the time to figure that out.)

    A subtler scheme could more intelligently combine symbol-specific global information and aggregate local information. (E.g., by flagging each symbol with two bits to say how reliably ergodic it is, and the coder using that to decide what to do with it during transients.)
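    The blockwise plain-or-MTF idea can be sketched with order-0 ideal code lengths plus a one-bit flag (toy blocks, not a real coder):

```python
from collections import Counter
from math import log2

def order0_bits(seq):
    """Ideal order-0 code length of a block, in bits."""
    n = len(seq)
    return sum(c * -log2(c / n) for c in Counter(seq).values())

def mtf_transform(seq):
    queue, out = [], []
    for x in seq:
        if x in queue:
            i = queue.index(x)
            out.append(i + 1)
            queue.pop(i)
        else:
            out.append(0)  # first sight
        queue.insert(0, x)
    return out

def choose_coder(block):
    """1 flag bit, then whichever of plain / MTF order-0 coding is cheaper."""
    plain = order0_bits(block)
    mtf = order0_bits(mtf_transform(block))
    return ("mtf", mtf + 1) if mtf < plain else ("plain", plain + 1)

print(choose_coder("aabb" * 25))                  # stable, unpredictable order: plain wins
print(choose_coder("a" * 50 + "b" * 50))          # transient stats: MTF wins
```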
    Last edited by Paul W.; 14th January 2015 at 15:10. Reason: typos


  15. #551
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I wasn't trying to make a point that MTF gives unpredictable codes. I was trying to make the point that MTF will not be effective if symbols repeat at regular intervals, because an MTF queue that favors recency cannot give the symbol a 1/N probability at position N. I think MTF provides benefits over dictionary encoding only when the symbol occurrences are more grouped together than you would expect from a random distribution, and that there can be benefits, especially for text, in having a symbol ranking algorithm distinguish between symbols that tend to occur in clusters, symbols that are randomly distributed, and symbols that are more evenly spaced than a random distribution.
    Maybe I'm not sure what you're talking about. MTF doesn't assign probabilities at all. It only assigns numbers. You can assign probabilities to those numbers however you want. (Although, you can't give each rank the probability of 1/rank, because it doesn't converge to 1.)
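    For a finite n-position queue, though, you can normalize the 1/rank weights by the harmonic number H_n (a Zipf distribution) so that they do sum to 1; a sketch:

```python
from math import log2

def zipf_probs(n):
    """Normalize 1/rank over a finite n-position queue by H_n so it sums to 1."""
    h = sum(1 / r for r in range(1, n + 1))
    return [1 / (r * h) for r in range(1, n + 1)]

p = zipf_probs(256)
print(sum(p))                      # 1.0 (up to float error)
print(-log2(p[0]), -log2(p[99]))   # code length in bits for rank 1 vs rank 100
```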

  16. #552
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    Maybe not with an MTF-entropy coder per se, but the MTF transform can be used in sophisticated ways. (It's what makes the compressed memory system in OS X work well, for example, dynamically adjusting the split between compressed and uncompressed RAM with a simple online cost-benefit analysis.)

    You can do things like robustly detecting the sizes of loops, including nested loops, somewhat variable-sized loops without a strict stride, loops over gradually growing or shifting sets of items, and loops whose ordering varies a lot locally, and loops whose order reverses from iteration to iteration. (Real programs do all those things.)

    You just can't do that sort of thing with, say, string-matching and frequency counting, or Markov models. Small local variations throw them off from the get-go, and they're just not smart enough to notice aggregate patterns like loopiness anyway, except in the simplest cases. They're just not robust in the face of minor local variations, because you get an explosion of literal strings or contexts. (And you need explosively more input to get useful frequencies of those zillions of specific strings.) MTF is robust, not brittle.

    For caching policies, you can use the MTF transform to tell you things like when to use LRU vs. MRU and at what scales. (E.g., keep the 2200 most recently used items, because your inner loop fits in RAM, but evict things after that in a more MRU-like way, because your outer loop doesn't.)
    LRU is an important principle in systems. It's not exactly the last word, though -- it's more like the first word. It doesn't guarantee optimal performance, but it's a good choice. Compression problems are off-line, and you can expend more CPU resources, so compression can often do better.

  17. #553
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Maybe I'm not sure what you're talking about. MTF doesn't assign probabilities at all. It only assigns numbers. You can assign probabilities to those numbers however you want. (Although, you can't give each rank the probability of 1/rank, because it doesn't converge to 1.)
    You had said MTF works ok for symbols that repeat at a stable frequency. I was making the point that if the symbol frequency is perfectly stable, then it is best to not use MTF with that symbol. MTF is only better than a dictionary for specific symbols when the spread of the symbol is less than you would have with a random distribution, which is what you would have if symbol probability was constant.

    In some ways, what I initially viewed as an MTF hack to provide fast decompression is actually an imperfect test of what happens when you try to rank symbols based on both recency and overall probability. It has shown possible benefits of adjusting recency based probability predictions with overall symbol probability. My general conclusion is that for a lot of text-like files, the less likely a dictionary symbol is, the more likely it is to be located near the other instances of the same symbol. The question for me is how to best exploit this characteristic. Paul has suggested splay trees and heaps. Piotr has suggested adjusting individual queue probabilities. I am sure Tree is not optimal. There are lots of possible directions. My list of things to look at seems to only get longer and longer....

    In terms of LRU, it looks like a similar scheme could be used to identify template symbols - pages, table headers, foreign translations, etc. in enwiks. It's interesting to me that so many can be identified by recognizing their lack of local repetitions.
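    One simple way to tell clustered, random, and evenly spaced symbols apart is the coefficient of variation of the gaps between occurrences: roughly 0 for even spacing, about 1 for a random (geometric) spread, and above 1 for clusters. A sketch (the position lists are toy data):

```python
from statistics import mean, pstdev

def gap_cv(positions):
    """Coefficient of variation of the gaps between occurrences:
    ~0 for evenly spaced, ~1 for random (geometric), >1 for clustered."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

even      = list(range(0, 1000, 100))          # every 100 symbols
clustered = [0, 1, 2, 3, 500, 501, 502, 503]   # two tight bursts
print(gap_cv(even), gap_cv(clustered))
```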
    Last edited by Kennon Conrad; 15th January 2015 at 19:21.

  18. #554
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.18

    Tree v0.18 has several improvements:

    • TreeCompress64 compresses large files up to about 90% faster (slightly over 1/2 the prior time).
    • TreeCompress64 uses more memory for smaller files, often providing better compression at the cost of being slower (but generally faster than the prior version).
    • TreeCompress64 now supports a -r command line option to set memory usage and allows up to 32 GB of memory to be used for the dynamically allocated data structures.
    • TreeBitEncode now supports a -v command line option to print information about the compressed symbols.
    • TreeBitEncode now supports code lengths as short as two bits (was 6) for better encoding of files that contain high frequency symbols.

    Details:
    • Faster compression has been achieved mostly by doing more parallel processing. TreeCompress64 now creates 26 helper threads per compression cycle instead of 9, with up to 7 helper threads active at once instead of 3. This works best on processors with at least 8 cores, but still works okay on my 4 core A8-5500. The number of LCP tree build helper threads has been increased from 9 to 12. There is 1 new helper thread that manages the list of new symbol candidates. There are 7 new helper threads that each scan 1/8th of the input data for overlapping new symbol candidates. There are 6 new helper threads that each perform the new symbol substitutions on about 1/7th of the data.
    • TreeCompress64 can be given a -r command line argument like "-r10" to set the ratio of the size of the dynamic memory allocation to input file size. The default value is 6.5. The minimum value is 5.0 with a minimum actual allocation size of 1690 MB. There is no maximum value, but the program limits the size to 32 GB if the -r setting would cause more than 32 GB of memory to be allocated. The program adds an extra 4 x input file size for files that are not UTF-8 compliant. The purpose is to allow adjustment of the initial window size to see how much the window size affects compression.
    • TreeBitEncode's -v option causes the program to output symbol information to stdout. This can be redirected to a file that will contain the symbol information. The information is essentially a list of the grammar ranked from most frequently used to least frequently used. The list includes the rank of each symbol, the baseline dictionary code length, the number of occurrences of the symbol, the (misnamed) dispersity bit if the symbol is eligible to go in the general mtf queue, and the string the symbol represents.
    • TreeBitEncode's new 2 bit minimum symbol code length should have been implemented when the partial ANS coding (or whatever it is called) was implemented, but I forgot to do it.
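    The -r sizing rules work out to a simple formula. A minimal sketch of them (a paraphrase of the behavior described above, not Tree's actual code; the order in which the floor, cap, and non-UTF-8 addend are applied is my assumption):

```python
def tree_alloc_mb(input_mb, r=6.5, utf8_compliant=True):
    """Approximate TreeCompress64's -r allocation sizing as described:
    ratio r (minimum 5.0) times input size, 1690 MB floor, 32 GB cap,
    plus an extra 4 x input size for non-UTF-8 files."""
    r = max(r, 5.0)                      # -r settings below 5.0 are clamped
    size = r * input_mb
    if not utf8_compliant:
        size += 4 * input_mb             # extra 4 x input file size
    return min(max(size, 1690), 32 * 1024)

print(tree_alloc_mb(1000))               # 6500.0 with the default -r6.5
print(tree_alloc_mb(100))                # 1690 (floor applies)
print(tree_alloc_mb(1000, r=100))        # 32768 (cap applies)
```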

    Results for enwik9:
    TreeCompress64: 174,357,336 bytes, compressed in 4901 sec., decompressed in 7.2 sec.
    (was 174,399,062 bytes, 9362 sec./7.2 sec.)
    memory: TreeCompress64: vm 10,688 MB, pm 6,009 MB; TreeDecode: vm 1166 MB, pm 320 MB

    Results for enwik8:
    TreeCompress64: 21,639,204 bytes, compressed in 311 sec., decompressed in 0.7 sec.
    (was 21,772,096 bytes)

    I tried TreeCompress64 -r100 on my i7-4790K, which has 16 GB of RAM. This increased the initial window size from 11,258,732 symbols to 128,903,531 symbols and allows the window to expand to include the entire file much earlier in the compression process. Results were as follows:
    173,856,720 bytes, compressed in 14,965 sec., decompressed in 7.2 sec.
    memory: TreeCompress64: vm 37,052 MB, pm 15,370 MB; TreeDecode: vm 1166 MB, pm 319 MB

    So the compressed file was about 0.5 MB smaller, which is about half the difference between the compressed file sizes produced by TreeCompress and TreeCompress64. Disk usage was 100% for much of the compression run, but given the much larger tree builds, the impact on run time didn't seem too bad. When I tried the same thing with a 2 GB virtual drive (14 GB of available RAM instead of 16 GB), it took almost 3x longer to compress.
    Attached Files

  19. The Following User Says Thank You to Kennon Conrad For This Useful Post: Paul W. (19th January 2015)

  20. #555
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Now that I fixed the problem with TreeBitEncode, I wanted to post some results from selected files from the DNA corpus. I chose to compare Tree to results in the DNASequitur paper co-authored by one of my college professors: http://www.broadinstitute.org/~neva/...nasequitur.pdf. For convenience, I have put two tables from this paper into a .jpg:

    [Attached image: DNA_results.jpg]

    I don't have exactly comparable results for Tree, but here is what I do have using TreeCompress64. Dictionary size includes the 4 members of the original alphabet:
    HUMDYSTROP: longest match 22 symbols, 26 symbol dictionary, compressed file size 10,588 bytes, 2.18 bps
    HUMGHCSA: longest match 408 symbols, 988 symbol dictionary, compressed file size 12,580 bytes, 1.51 bps
    MTPACGA: longest match 116 symbols, 156 symbol dictionary, compressed file size 26,452 bytes, 2.11 bps
    MPOCPCG: longest match 31 symbols, 86 symbol dictionary, compressed file size 31,876 bytes, 2.11 bps
    VACCG: longest match 560 symbols, 67 symbol dictionary, compressed file size 50,344 bytes, 2.10 bps

    So really only HUMGHCSA gets compressed well (which is not uncommon). Encoding is still not very efficient: all the compressed files except HUMGHCSA have an order-0 entropy before embedding the dictionary that is at least 10% smaller than the output of TreeBitEncode. Even so, the results are competitive with the listed compressors that are not DNA specific, and on HUMGHCSA Tree outperforms the other non-DNA-specific compressors. The longest match/repeat is longer in every case except VACCG, where DNASequitur's longest repeat is the same length.
    Last edited by Kennon Conrad; 17th January 2015 at 21:48.

  21. The Following User Says Thank You to Kennon Conrad For This Useful Post: Paul W. (18th January 2015)

  22. #556 (Paul W.)
    I'm curious how tree would do on large DNA files, like a fly or mouse genome. A lot of the repetition in DNA for macroscopic organisms is from large-scale copies of whole genes, whole chromosomes, and even the whole genome. (Much of our DNA was duplicated at least twice early in evolution, in major duplication events, and there's scads of dead or dormant retroviruses all through it too, with many copies of the same viruses.)

    Viruses and bacteria have much less extra DNA like that (they can't afford it, being so small, so evolution tends to weed it out). And if you only look at a local stretch of DNA around one or a few genes, you won't see much of the kind of repetition we do have.

  23. #557 (Paul W.)
    It seems to me that a good compression algorithm should respect and exploit the basic regularities in gene structure.

    Each gene is basically a rule in a production system, with a left hand side and a right hand side. They have different regularities and irregularities.

    (A genome is basically a computer program, for a stochastic, massively parallel production system. Each gene is a rule whose preconditions may or may not be met, and most rules are effectively boolean, e.g. IF (A and B and not C) THEN (E and F and G). In the general case, the preconditions can have degrees, and the rule will fire probabilistically depending on how well a condition is met, so it's something like a fuzzy logic rule system.)

    The stretch of the gene that acts as the left hand side of a rule is the "control region" and it works very differently than the stretch that acts as the right hand side (the "coding region"). The control region (LHS) has various sites where signaling proteins can dock; some of those sites enhance gene activity (normal preconditions of the rule) and others repress it (NOT conditions). The coding region (RHS) isn't used that way at all---it's transcribed to make an RNA copy of that stretch of the gene.

    Most RNA transcripts of the coding region are then translated to make a protein, in a rigid way: three-letter sequences translate to particular amino acids in the protein sequence. If you can parse those three-letter sequences as tokens, I think you should be able to get better compression. Not all combinations of three letters are allowed, at least in the coding region. (There's an error correction scheme that fixes three-letter sequences that don't map to an allowed amino acid. I think it's applied to all of the DNA, not just the coding regions, so that the control region will have a similar vocabulary but different regularities.)
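    A compressor front end could exploit that rigid three-letter framing by tokenizing coding regions into codons. A minimal sketch, assuming the reading frame is known or guessed (`codons` is a hypothetical helper, not something from Tree):

```python
def codons(seq, frame=0):
    """Split a DNA string into non-overlapping 3-letter tokens,
    starting at the given reading frame (0, 1, or 2)."""
    body = seq[frame:]
    usable = len(body) - len(body) % 3   # drop any trailing partial codon
    return [body[i:i + 3] for i in range(0, usable, 3)]

print(codons("ATGGCCATTGTA"))      # ['ATG', 'GCC', 'ATT', 'GTA']
print(codons("ATGGCCATTGTA", 1))   # ['TGG', 'CCA', 'TTG']
```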

    Most proteins generated by the DNA->RNA->protein transcription process are purely computational signaling molecules. The protein folds up into a particular predictable shape (usually) with various-shaped bumps exposed. The shape of a bump determines whether the protein can dock to a particular binding site in the control region of another gene (or the same gene, for feedbacks). Those specific shapes implement the vocabulary of the rule system. (E.g. the terms A, B, not C, D, E, and F in the rule IF A and B and not C THEN D and E and F) The signaling protein as a whole implements the entire right-hand side of the rule.

    Some proteins are used differently---"structural" proteins are used to make actual stuff. I have no real idea what the regularities in structural coding regions are, but I'd guess there are some.

    There are higher-level regularities in the genome; it's a fairly modular program, with ensembles of rules called "genetic regulatory networks" that are basically subroutines. If you can more or less cluster those, you can probably get somewhat better compression, but it's tricky.

    One way it's tricky is that very different coding sequences can fold up into molecules with the similar-shaped bumps that serve the same signaling function---it's all about how the protein folds up, and which stretches of the protein are exposed as bumps, and details of those bumps. Unfortunately, a lot of that stuff is likely to be fairly incompressible, because much of the protein is just there to make some other part stick out, or influence its intermediate shapes during folding. (E.g., by getting in the way of something else and preventing a wrong folding step.) So lots of randomly-evolved sequences can serve the same physical spacing and folding-control functions within the protein. I suspect that there are several regularities anyway, for various subtle reasons, but I don't know what they are or how strong they are. It's going to be pretty noisy.

    Another factor that's going to affect the statistics of sequences within genes is whether those genes are functional or junk. Most of our DNA is junk, and junk DNA accumulates more random mutations than functional DNA.

    So... an ideal DNA compressor would at least distinguish between control regions and coding regions, and between coding regions for signaling vs. structural proteins, and probably guess whether the gene looked functional or like junk, and maybe cluster genes with similar binding sites in the control region. (They may have some similarities in the coding region too.) It likely would group together genes that are basically copies of the same retrovirus, and other sets of genes that are basically copies of each other.

    A simple-ish algorithm may be able to get much of those effects with trigram tokenization, skip n-gram matching to make it tolerant of local mutations in copies of genes, and basic clustering based on that.
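    The skip-tolerant matching step could be as simple as comparing equal-length token windows while allowing a single mismatch. A minimal sketch (hypothetical helper, not any existing implementation):

```python
def matches_with_one_skip(a, b):
    """True if equal-length token sequences differ in at most one
    position (one 'skipped' symbol, e.g. a point mutation)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

print(matches_with_one_skip(list("ACGTAC"), list("ACGGAC")))  # True
print(matches_with_one_skip(list("ACGTAC"), list("AGGGAC")))  # False
```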

  24. The Following 2 Users Say Thank You to Paul W. For This Useful Post: Bulat Ziganshin (19th January 2015), Kennon Conrad (19th January 2015)

  25. #558 (Paul W.)
    Kennon,

    Can you tell if you're getting any benefit from the MTF queues when compressing DNA? (I wouldn't expect much at all, but if I'm wrong, that'd be interesting.)

  26. #559 (Member, San Diego)
    Quote Originally Posted by Kennon Conrad View Post
    You had said MTF works ok for symbols that repeat at a stable frequency. I was making the point that if the symbol frequency is perfectly stable, then it is best to not use MTF with that symbol.
    I think we're coming at the same point from two different angles.

  27. #560 (Kennon Conrad)
    Quote Originally Posted by Paul W. View Post
    I'm curious how tree would do on large DNA files, like a fly or mouse genome. A lot of the repetition in DNA for macroscopic organisms is from large-scale copies of whole genes, whole chromosomes, and even the whole genome. (Much of our DNA was duplicated at least twice early in evolution, in major duplication events, and there's scads of dead or dormant retroviruses all through it too, with many copies of the same viruses.)

    Viruses and bacteria have much less extra DNA like that (they can't afford it, being so small, so evolution tends to weed it out). And if you only look at a local stretch of DNA around one or a few genes, you won't see much of the kind of repetition we do have.
    I suspect compression would be very slow, but reasonably good compared to other non-DNA specific compressors, kind of like it did on HUMGHCSA. I would be willing to try a file if you can provide a link to a larger file in a simple format. I tried to find something I could understand and easily use but was not successful.

  28. #561 (Kennon Conrad)
    Paul, thanks for the writeup on DNA. That's probably 10x more than I knew before.

    I am pretty sure that tokenization is not compatible with optimal compression ratios. I think an ideal context model can always do at least as well without the overhead of tokenization.

    I'm getting a bit off-topic now, but one of the things I have learned while developing Tree is that tokenization always increases the order-0 entropy of a symbol stream when the number of repeats is less than or equal to two times the number of symbols in the file multiplied by the Markov chain probability of the string being tokenized. I think most short grams for DNA files fail this requirement because of the relatively high symbol probability compared to text. The shortest profitable gram Tree found in the five DNA files I posted results for was a 6-gram.
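    That condition is easy to check numerically. A minimal sketch (a paraphrase of the rule as stated, with made-up illustrative numbers):

```python
def tokenization_profitable(repeats, num_symbols, string_prob):
    """Paraphrase of the stated rule: replacing a string with a new token
    only helps order-0 entropy when its repeat count exceeds
    2 * num_symbols * P(string)."""
    return repeats > 2 * num_symbols * string_prob

# A 3-gram over a near-uniform 4-letter alphabet has P ~ (1/4)**3, so in a
# 100,000-symbol file it needs more than 2 * 100000 / 64 = 3125 repeats:
print(tokenization_profitable(3000, 100_000, 0.25 ** 3))  # False
print(tokenization_profitable(4000, 100_000, 0.25 ** 3))  # True
```

    This illustrates why short DNA grams rarely pay off: with only 4 common symbols, even a 3-gram has a high Markov chain probability, so the required repeat count is enormous compared to text.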

  29. #562 (Kennon Conrad)
    Quote Originally Posted by Paul W. View Post
    Kennon,

    Can you tell if you're getting any benefit from the MTF queues when compressing DNA? (I wouldn't expect much at all, but if I'm wrong, that'd be interesting.)
    Probably not, but Tree's MTF queues still need some work for small files like the ones I tested. For example, having a 64 element queue when there are only 3 symbols that qualify for the queue is not very efficient.

    It does appear that there are recency relationships that could yield some profit if handled efficiently. I added a print statement to TreeBitEncode to print the MTF queue position hit statistics for symbols with fewer than 20 instances. Here are some of the statistics:

    HUMDYSTROP: 7 of 21 two instance symbols (1 repeat) hit in queue position 1 (~4 expected for random distribution)
    HUMGHCS: 5 of 826 two instance symbols hit in queue position 1 (slightly less than the expected value for a random distribution)
    MTPACGA: 13 of 101 two instance symbols hit in queue position 1
    MPOCPCG: 11 of 56 two instance symbols hit in queue position 1
    MPOCPCG: 10 of the 40 instances of the 20 three instance symbols hit in queue position 1
    VACCG: Queue hit positions for the 45 two instance symbols: 26, 8, 5, 2, 1, 0, 0, 2, 1 (and the rest 0)
    VACCG: Queue hit positions for the 10 three instance symbols: 14, 4, 2 (and the rest 0)

    One additional thing I found particularly interesting is from HUMGHCS. There is a significant spike in the queue hits in the three instance queue in queue positions 46 - 49 and another spike at queue position 54. The compressed file has 107 three instance symbols, giving 214 possible queue hits. The hit count for positions 46 - 49 is 5/5/15/8 and position 54 gets 10 hits. Queue position 1 gets 6 hits, position 7 gets 8 hits, position 15 gets 4 hits, position 51 gets 4 hits and all other positions get no more than 3 hits.

    Update: Paul, now it's bugging me that I am totally ignorant about DNA and don't know the meaning of the human growth hormone having its most compressible strings show up at what appear to be periodic intervals. Why do certain strings repeat at nearly regular intervals?
    Last edited by Kennon Conrad; 19th January 2015 at 10:15.

  30. #563 (Paul W.)
    I modified Matt's fv slightly (changing the color scheme) and plotted all the files in the DNA corpus with it.

    fv plots LZ77-style match distances for 1-, 2-, 4-, and 8-byte strings. In my color scheme, 1-byte matches are blue, and 8-byte matches are black (reversed from Matt's original), while 2- and 4- byte matches are red and green (same as in original).

    Here's a plot of the humdyst file (which I assume is the same thing as HUMDYSTROP):

    [Attached image: humdyst_fv.png]

    This looks like the output of an ergodic Markov model---random "snow" colorized in bands. Short strings tend to repeat at short distances, longer strings at longer distances, hence the banding.

    There's really no clear blue (1-byte) band at the bottom, though, because with our tiny DNA alphabet (4 letters, all pretty common), you get a lot of 2-byte matches at very short distances, as well as a whole lot of 1-byte matches at those distances. With my color scheme, wherever you have 1-byte (blue) and 2-byte (red) matches, it comes out purple. There's a lot of purple at the bottoms of all the pictures for this corpus.

    (With Matt's original color scheme, one-byte matches were black, which tends to obscure anything else that's going on at the same match distances---mixing anything with black makes it all go black.)

    Now here's humghcs:

    [Attached image: humghcs_fv.png]

    The horizontal 1-pixel-wide black lines in the upper part of the picture mean that you're seeing 8-letter matches at nearly fixed distances of things from about seven thousand to about fifty thousand bytes back in the input. (The dark major ticks on the log scale on the left hand side are factors of ten. Most of the long horizontal lines are above the 10K major tick.)

    What this means is that you have huge strings, thousands of letters, that have been copied very-nearly-literally, so that 8-byte substrings tend to keep being recognized at the same distance, even if there are intervening mismatches. (e.g., due to a local random mutation)

    Such big string copies with little differences are common in DNA---a whole lot of our DNA is just copies of other parts of our DNA, slightly modified by random mutations.

    The most common kind of mutation is a single-point substitution mutation, where one letter gets replaced with another but the string is otherwise the same. For that, you'd expect exactly what you see above where you encounter a copy of something you've seen before---an absolutely horizontal stripe in the match distances, interrupted at mutations but resuming immediately after.

    Other common mutations include local insertions and deletions---a small snippet of DNA gets added or dropped. That will change the match distance slightly, but it may not show up in the picture.

    Since the picture is log-scaled, we can't tell whether we're getting exactly the same match distances, as you'd expect if you had only substitutions, or slight variations in the match distance as you'd expect from small insertions and deletions. A few small "string edits" like that would not usually change which horizontal row of pixels the match distance mapped to.
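    The log-scaling argument is easy to check directly: a small insertion or deletion shifts the match distance by only a few letters, which maps to the same pixel row. A minimal sketch with made-up plot parameters (not fv's actual scaling):

```python
import math

def pixel_row(distance, rows=512, max_distance=1_000_000):
    """Map a match distance to one of `rows` log-scaled pixel rows
    (hypothetical fv-like scaling; parameters are invented)."""
    return int(rows * math.log(distance) / math.log(max_distance))

# A 3-letter insertion barely changes a ~10,000-byte match distance,
# so both distances land on the same log-scaled row:
print(pixel_row(10_000) == pixel_row(10_003))  # True
print(pixel_row(10_000) == pixel_row(20_000))  # False
```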
    Last edited by Paul W.; 19th January 2015 at 12:28. Reason: typos

  31. The Following User Says Thank You to Paul W. For This Useful Post: Kennon Conrad (19th January 2015)

  32. #564 (Paul W.)
    Notice that in the humghcs picture, some of the horizontal lines seem to have a little jog in them, going along at one distance, then going up or down a pixel and continuing horizontally from there.

    That's what you'd expect if you had mostly "point mutations" (single-letter changes) and a few larger insertions or deletions. A point mutation that's a substitution wouldn't change the match distance at all. A point mutation that's adding or dropping a single letter would increase or decrease the match distance by 1, not usually enough to show up on a log plot. Insertions and deletions of small snippets of DNA would change the match distance by a little more, but not usually enough to show up on a log plot.

    I'm guessing LZMA would do fine on humghcs, because it can exploit exact repeats in match distances, and tolerate occasional small shifts in match distances. (Either by recognizing a resumption of string-copying at the same match distance after a "bobble", or by rapidly adapting to a new match distance using its MTF queue of recently-seen match distances.)
    Last edited by Paul W.; 20th January 2015 at 10:07. Reason: typo

  33. #565 (Kennon Conrad)
    Quote Originally Posted by Paul W. View Post
    I'm guessing LZMA would do fine on humghcs, because it can exploit exact repeats in match distances, and tolerate occasional small shifts in match distances. (Either by recognizing a resumption of string-copying at the same match distance after a "bobble", or by rapidly adapting to a new match distance using its MTF queue of recently-seen match distances.)
    I tried lzham and got 1.262 bits per byte.

  34. #566 (Matt Mahoney, Expert)
    I updated LTCB with tree 0.18. http://mattmahoney.net/dc/text.html#1734

  35. #567 (Paul W.)

    more on DNA-as-program, compression, offsets

    Quote Originally Posted by Kennon Conrad View Post
    Update: Paul, now it's bugging me that I am totally ignorant about DNA and don't know the meaning of the human growth hormone having its most compressible strings show up at what appear to be periodic intervals. Why do certain strings repeat at nearly regular intervals?

    I don't know anything about the human growth hormone locus specifically, but I know a fair bit about DNA and evolution...

    [Update: I realize this is fairly redundant with what I've said before lately, but maybe the elaborations will help some. ]

    [ climbing on my soapbox for a few minutes... ]

    As I sketched a couple of posts ago, DNA is basically a bunch of rules (genes) written in a weird way in a 4-letter alphabet, for a chemical/mechanical computer.

    The nucleus of a cell is mainly a computer that controls the rest of the cell and is part of a massively distributed computing system that controls the whole organism. Life is 99 percent computation. (Von Neumann was right about that---life is a primarily computational phenomenon, and biology done right is mainly computer science of a very different sort than we're used to with von Neumann machines.)

    At a very low level of abstraction, you have a crude string-processing computer, operating on strings of nucleotides (G,C,T,A) but usually interpreting them as three-letter tokens corresponding to a 20-something symbol alphabet that can be mapped to amino acids. There are little nanomachines (e.g., ribosomes) that go through DNA letter-by-letter copying it, or token-by-token transcribing it to RNA.

    A lot of people, including a lot of geneticists, will tell you that the genome "isn't like a computer program" because they don't know what it means for something to be a computer. A lot of biologists' idea of "a computer" is a von Neumann machine running a FORTRAN program, and gene processing isn't much like a von Neumann machine, and isn't sequential by default, so to them it's "not like a computer." They'll tell you it's a "bad metaphor" to talk about the genome as a computer program. They could not be more wrong. (Some geneticists do get this and will say that a genome literally is a computer program.)

    A lot of computer scientists try to analogize it to a multitape Turing machine---that's really tempting because you have little machines processing something very like a tape, sequentially, symbol-by-symbol---and don't get much useful out of that, so to them it's "not like a computer", too. It's not a very useful metaphor for understanding gene function.

    But those are the wrong kinds of computers to think about, and that's the wrong level of abstraction. If you get what the little sequential token-processing machines are really doing for you, it's not mainly acting like a Turing Machine, where the rubber meets the road. The little string-processors are like functional units below the level of the "machine language", which is a rule language, like the mathematician Emil Post's production systems, which predate Turing machines. The symbols in the real machine language aren't nucleotides (GCTA) or the three-letter tokens corresponding to 20-odd amino acids. They're binding site shapes, which are the terms in a (fuzzy) propositional logic-like language. They give you a vocabulary of tokens for your rules, and there are thousands of them, not four or 20-something.

    So the gene-processing machinery in an organism isn't like a computer in a way that gives you a good introductory metaphor. It literally is a computer of a sort that most people, even most computer scientists, aren't familiar with.

    One of Turing's major achievements was to show that anything Post's production systems could do, his Turing Machines could do, too---and that Post's universal production systems and his Universal Turing Machines could emulate each other. IMO, for most purposes, production systems are a better computational model than Turing Machines, and the big advantage of Turing machines is just that anybody can see intuitively how they could be implemented by very simple hardware. Beyond that, they're really awkward compared to rule systems, which is why nobody ever really implements computing systems that way---"it's only a model." Rule systems, on the other hand, are very handy for lots of things and we use them all the time. (E.g., every time you type "make.") And they're relatively easy to implement with molecules.

    You could say that production systems predate Turing machines by a decade or two in terms of when they appear in the literature---or by a couple of billion years, because nature invented them first and they turned out to be workable and really, really useful. Emil Post wouldn't have been able to write about production systems if he and his ancestors hadn't been made out of them all along.

    A lot of stuff in molecular biology can be understood in terms of how you program nondeterministic production systems. Genetic regulatory networks are just subroutines---but they're not like FORTRAN subroutines, because the language isn't sequential by default. They're like the modules you get programming a forward-chaining rule language like OPS5. (Not a backward-chaining rule language like Prolog or make.) You see all sorts of stuff that's familiar in computer science terms if---and only if---you're familiar with forward-chaining rule systems and parallel and distributed systems. (And having to get by with a propositional logic when you'd really rather have a decent first-order logic with variable binding and functions.)

    The rule language of genes is for the most part a simple one. The terms in the language are just propositions (possibly negated) with fuzzy truth values. (Truthiness values?)

    There are some weird complications beyond that---the whole thing has been kludged for over a billion years---but I think that's the right basic story.

    [ steps off soapbox ]

    One of the ways this basic conceptual model may actually matter, and even matter to compressing DNA sequences, is that there are certain stereotyped ways that you program a nondeterministic and/or parallel forward-chaining propositional production system.

    You very often want to impose sequentiality on rule firing, for two basic reasons:

    1) You're dealing with sequential phenomena, and you want the right rules to come into play at the right time. This is true of a lot of things genes do, e.g., responding to external events or maintaining homeostasis (self-regulation). A cell is like a factory and much of the genome is a reactive program for keeping that factory running with everything staying within operational parameters. (When you start running out of a certain kind of building block, you switch on the module to make more, and when you've got enough, you switch it off.) There's also an exquisitely choreographed and highly staged process for the cell to divide itself into two cells, with a very strict sequence of major steps.

    2) It's just easier to program things sequentially than in parallel. Evolution knows this at least as well as a human trying to program OPS5 to do something "algorithmic," and it uses the same basic trick, over and over again.

    That trick is to introduce a term into a rule so that it can't fire until its turn comes, and for each rule in a sequence to assert the term that enables the next rule in the sequence.

    So for example, suppose you have the following rules and want to execute them sequentially:

    1: IF (A and B and not C) THEN (D and E and F)
    2: IF (P and not Q) THEN (S and R)
    3: IF (not G and H) THEN (I)

    You basically just introduce new terms for the steps, and shove them into your preconditions and right-hand sides appropriately:

    IF (STEP1 and A and B and not C) THEN (D and E and F and STEP2)
    IF (P and not Q and STEP2) THEN (S and STEP3 and R)
    IF (not G and STEP3 and H) THEN (I)

    It's a little trickier than that, but that's the basic trick, and evolution uses it a whole hell of a lot.
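    The STEP trick can be demonstrated with a toy synchronous forward-chaining engine, where every enabled rule fires in the same round; the negated terms from the example are omitted to keep the sketch short, and all names are hypothetical:

```python
def run(rules, facts):
    """Fire every enabled rule per round until nothing new can fire.
    Returns the list of rounds, each a list of rule indices that fired."""
    facts = set(facts)
    rounds = []
    while True:
        ready = [i for i, (lhs, rhs) in enumerate(rules)
                 if lhs <= facts and not rhs <= facts]
        if not ready:
            break
        rounds.append(ready)
        for i in ready:
            facts |= rules[i][1]   # assert the right-hand side
    return rounds

# With STEP tokens, each rule enables the next, forcing sequential firing:
sequenced = [
    ({"STEP1", "A", "B"}, {"D", "E", "F", "STEP2"}),
    ({"P", "STEP2"}, {"S", "R", "STEP3"}),
    ({"STEP3", "H"}, {"I"}),
]
# Without them, all three rules fire at once in the first parallel round:
unsequenced = [
    ({"A", "B"}, {"D", "E", "F"}),
    ({"P"}, {"S", "R"}),
    ({"H"}, {"I"}),
]

print(run(sequenced, {"A", "B", "P", "H", "STEP1"}))  # [[0], [1], [2]]
print(run(unsequenced, {"A", "B", "P", "H"}))         # [[0, 1, 2]]
```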

    For example, the Hox genes (homeobox complex genes) that control basic development do this all over the place. (Hox genes are hugely important---once the Hox complex evolved, we fairly rapidly evolved all the basic body plans we see in macroscopic animals today, and much of that evolution occurred by duplicating most or all of the Hox genes and evolving new more-specialized versions of the duplicates, with even more flag variables to coordinate sub-steps of steps of development, etc. It's a horribly ugly kludge, but it works well enough for fruitflies and humans, squid and starfish, etc.)

    This would seem to imply several likely regularities that could in principle be recognized.

    For a simple pair of rules that execute in sequence, you'll typically have an occurrence of a stepping term on the right-hand side of the first rule, and the same term on the left-hand side of the second. (That is, in the coding region of the first gene, and in the control region of the second.) Unfortunately, because of the way the actual machinery works with protein folding and docking and so on, the term will generally look different when it occurs on the left-hand side than when it occurs on the right. (On the left you have a sequence that functions as binding site. On the right you have some not-all-local properties of the sequence that result in the folded molecule having a bump of the right shape.)

    For rules that have been duplicated (like most of the Hox genes) you'll have related rules using the same terms. (E.g., a cluster of rules that switch on during a particular stage of development). Many of those terms will occur twice or four times or eight times, depending on how many times those genes have been duplicated. Others won't, because they'll have gotten mutated to discriminate between substeps or whatever, so you'll get some threes and sevens and so on, too.

    ---

    The easiest wins are going to be in recognizing "homologies" that result from duplications of individual genes (that happens) or whole bunches of genes. (Mass duplication events are relatively rare, and usually don't result in a viable organism, but when they do result in a viable organism, they sometimes are very beneficial---they enable evolution by re-specializing copies of existing ensembles of rules that work well together, rather than starting from scratch.)

    A good way to think of this is in terms of editing a text file.

    (1) Evolution tends to copy the whole text file, with a few minor one-letter edits (random point mutations), as part of normal reproduction. (The average person has about five new point mutations, most of which don't matter because they're changes to inactive junk. These changes accumulate over the generations.)

    (2) Sometimes it copies or deletes or inserts a whole sentence (gene), because the gene copying machinery respects gene boundaries, but sometimes makes a mistake and skips a gene, or makes an extra copy and ends up inserting it elsewhere. Sometimes it deletes or inserts a fragment of a gene, because it makes errors about gene boundaries too. You get long string matches with small mismatches from this.

    (3) Occasionally it deletes or duplicates really big chunks of the file. (Big fragments of chromosomes, whole chromosomes, or most or all of the chromosomes.) You get lots of long string matches from this, with each long string occurring 2 or 4 or 8 times, depending on whether it's been duplicated once, re-duplicated, or re-reduplicated.

    (4) Retroviruses sometimes insert themselves into your DNA, sometimes with the same virus (thousands of DNA letters) in many places. Most of them get mutated to death, so you have a bunch of copies of old, long-dead parasitic viruses littered through your DNA, some of them millions of years old and blindly copied down the generations along with normal genes.

    (5) This process iterates over evolutionary time, with your long strings tending to get regularly point-mutated into pairs of shorter ones with a mismatch in the middle, and some of them deleted or duplicated now and then, and sometimes with medium-sized edits.

    --

    In terms of simple compression algorithms, it means that if you look at big files like whole genomes for macroscopic organisms, you'll find

    (1) a lot of longish strings that occur 2 or 4 or 8 times, but sometimes 3 or 7 times because one of the copies got deleted, and
    (2) almost-consecutive groups of shorter strings that result from local mutations within a longer string that got copied and then mutated in minor ways. They'll match at the same or nearly the same offsets because you're really doing a long string copy with local bobbles---you match a fragment of the string up to a mutation, mismatch at the mutation, and then resume matching the next unmutated stretch of the string.

    Consider a gene that gets duplicated once, and then one of the copies gets randomly point-mutated once. Assume you encounter the unmutated gene first. When you get to the copy, you'll recognize the first part (up to the mutation) as matching the first part of the original string, get a mismatch at the mutation, then get another match of the rest of the string, matching it to the rest of the original.

    In the case of a one-letter substitution mutation, which is common, LZ77 will get a match to the first part of the original, a one-byte literal, and another match to the rest of the original at the exact same match distance. LZMA will recognize this as a hit to MTF 1 in its MTF list of recent match offsets, and give it a very short code. Cool. (LZMA wasn't the first LZ77-derivative to do that.)
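    To make the substitution case concrete, here's a toy greedy LZ77 parse in Python (a sketch with made-up sequences, not LZMA's actual match finder or coder): a duplicated string with one substituted letter parses as match, one literal, match, with both matches at the same distance, which is exactly the repeat-offset pattern LZMA codes cheaply.

    ```python
    def lz77_parse(data, min_match=4):
        """Naive greedy LZ77 parse: returns ('lit', byte) and
        ('match', (distance, length)) tokens."""
        tokens, i = [], 0
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(i):  # O(n^2) search, fine for a demo
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_match:
                tokens.append(('match', (best_dist, best_len)))
                i += best_len
            else:
                tokens.append(('lit', data[i]))
                i += 1
        return tokens

    # A "gene" duplicated once, with a single point mutation in the copy.
    original = "ACGTACGTTTGACCA"
    mutated = original[:7] + "G" + original[8:]  # substitute position 7
    genome = original + mutated

    tokens = lz77_parse(genome)
    dists = [v[0] for k, v in tokens if k == 'match']
    # The copy parses as match / literal / match, and both of those
    # matches are at the same distance: len(original).
    print(dists)  # -> [4, 15, 15]
    ```

    The last two matches share the distance 15 (the length of the original string), so a coder with an MTF queue of recent distances pays almost nothing for the second one.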

    In the case of a one-letter insertion or deletion---also common---that won't work. You'll get a match at a distance that's off by one from the previous match distance (one higher for an insertion, one lower for a deletion). That likely won't be in the MTF queue of recent offsets at all. LZMA only recognizes exact strides it's seen lately, not the close-but-inexact strides you get from small insertions and deletions.

    (What you want for that is something like what our WK algorithms do, but applied to match offsets instead of actual data. WKdm just assumes that its input resembles a series of integers, which are often going to be numerically similar to some very-recently-seen integers, but not necessarily exactly the same. I think a splay tree coder should get a similar effect, in a more elegant and general way, if it's done right; I'm not sure it's been done right.)
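    As an illustration of that idea, here's a hypothetical sketch (the class, names, and classification scheme are mine, not WKdm's or LZMA's actual machinery) of a repeat-offset queue that recognizes off-by-one distances as well as exact repeats:

    ```python
    from collections import deque

    class NearOffsetModel:
        """Hypothetical: an MTF queue of recent match distances that also
        flags near misses (off by one), as one-letter insertions and
        deletions in DNA would produce."""
        def __init__(self, size=4):
            self.recent = deque(maxlen=size)

        def classify(self, dist):
            for slot, d in enumerate(self.recent):
                if dist == d:
                    kind = ('rep', slot)    # exact repeat: shortest code
                    break
                if abs(dist - d) == 1:
                    kind = ('near', slot)   # off-by-one: slightly longer code
                    break
            else:
                kind = ('new', None)        # pay for a full-length distance
            self.recent.appendleft(dist)
            return kind

    m = NearOffsetModel()
    print(m.classify(1000))  # ('new', None)
    print(m.classify(1000))  # ('rep', 0)   -- exact repeat, like LZMA's rep0
    print(m.classify(999))   # ('near', 0)  -- a one-letter deletion upstream
    ```

    The third call is the case LZMA currently misses: the distance shrank by one because a letter was deleted between the two copies, but it's still trivially predictable from the queue.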

    Basically, LZMA can exploit strictly horizontal lines in fv plots, where you're essentially copying huge long strings, but with substitutions. That lets LZMA get the effect of approximate string matching, sort of like skip-n-grams in important common cases.

    You could add a feature like that to Tree if you remember the most recently-seen positions of your strings, and it would presumably help a lot on the kinds of binary data it works well for in LZMA. You wouldn't need to reserve as much code space as for normal MTF codes---just a few short codes for a handful of MTF offsets, and a few more (slightly less short) codes for close-but-inexact matches. For DNA the first big win is likely to be exact and off-by-one matches to the first few MTF-offset positions, so that point mutations don't throw matching off nearly so much.

    I'm not sure how that should affect your string-picking heuristics, but I'm guessing doing it any reasonable way would help. (And I'm not sure how your string-picking is affected by the MTF codes now).

    You could do something like your "dispersity" bit, too, to keep from polluting your code space much at all. You could have a "stridiness" bit per symbol as well.
    Last edited by Paul W.; 19th January 2015 at 23:58. Reason: repetitiveness warning

  36. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (20th January 2015)

  37. #568
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Hmmm... that last brain dump was fairly redundant with what I have said before...

    Kennon, let me know if the elaborations made anything clearer the second time around.

  38. #569
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I tried lzham and got 1.262 bits per byte.
    Interesting... that's less than 25 percent worse than DNACompress, and much better than any of the other algorithms in the table you posted in post #555. (bzip2 is second-best at 1.73) For that file, lzham mostly closes the gap between general-purpose and domain-specific DNA sequence compression.

    Have you tried it on the other files?

    I'm wondering if LZMA would do a little better than lzham. My understanding is that lzham sacrifices a bit of compression for decoding speed, and that may involve sacrificing some of the sophisticated context-modeling-and-arithmetic coding LZMA does in the back end. (Not that I really understand what LZMA does.)

    That could be important for DNA, where the tiny alphabet and almost-noise-like local stats make life hard.

  39. #570
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    domain-independent and domain-specific stuff for DNA sequences

    Looking at the paper you linked to in post #555, I see that DNACompress is doing some version of what I suggested about detecting homologies and tolerating slight mismatches. And it's closed-source and mysterious.

    But LZMA is really doing much the same thing without getting all domain-specific and proprietary about it. It's tolerating slight mismatches in long nearly-identical strings, which allows it to detect and exploit some homologies in DNA sequences, among many other things. Not because it's designed for DNA, but because it's designed to handle binary data well, and binary data of various kinds tend to have that general kind of problem.

    ---

    I suspect that modifying LZMA to tolerate slight differences in recently-seen match offsets would be a similar domain-independent win, and specifically good for DNA sequences.

    Think about writing an array of records to a file, where each record has a variable-length string in it. (Or is just written out that way, e.g., using textual representation of a numeric field or something like that.) The fv plot will clearly show the strides, but they'll wobble a bit depending on how many characters it takes to write out the variable-length part. And that slight variation will often foil LZMA's stride recognition (MTF offsets).
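    A toy illustration of that wobble (the record format is entirely made up): the fixed part of each record recurs at distances that drift by a character or two as the width of the variable-length field changes, so no single repeat distance keeps paying off, even though every distance is within a couple of the previous one.

    ```python
    # Records with a variable-width numeric field followed by a fixed part.
    records = "".join("name=%d;flag=yes|" % n for n in (7, 42, 9, 1234, 8))

    # Find each occurrence of the fixed part ";flag=yes|".
    positions, start = [], 0
    while True:
        p = records.find(";flag=yes|", start)
        if p < 0:
            break
        positions.append(p)
        start = p + 1

    # Distance between consecutive occurrences: it wobbles with the
    # number of digits in the variable field.
    distances = [b - a for a, b in zip(positions, positions[1:])]
    print(distances)  # -> [17, 16, 19, 16]
    ```

    An exact MTF queue of offsets sees 17, then 16, then 19, then 16: mostly misses. A tolerant matcher that accepts "close to a recent distance" would code all of these cheaply.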

    ---

    There's something else DNACompress is doing that is very domain-specific---I don't see it being useful for any other kind of data.

    DNACompress is recognizing "palindromes"---where a piece of DNA has been moved or duplicated and stuck back into the sequence *backwards*, so that you see a certain pattern and then later see the same data in reverse.

    Recognizing reversed sequences is a good idea for compressing binary data generally---you can see some reversal patterns in fv plots of various kinds of binary data if you know what to look for, and not just in DNA. Strict reversals are rare in text, but not rare in binary data.

    But what geneticists call "palindromes" in DNA sequences isn't what the rest of us call palindromes.

    What DNACompress is looking for are reversed and complemented patterns, for a specific complementing function.

    Here's what that's about:

    DNA is a double-stranded molecule with two backbones, where each base (letter) is always opposed by a particular other letter on the opposite strand. Each strand is the negative of the other, physically, and they contain exactly the same information, so by convention we prefer one strand and write its sequence, because the other strand's sequence can be trivially inferred from that.

    But things get weird when you take a piece of DNA and splice it back in backwards. Now the strands' roles are reversed---the strand we don't usually talk about swaps roles with the one we usually do, and we have to talk about it. So instead of seeing the same sequence of letters in backward order, we see those letters' opposites, i.e., their opposite numbers on the opposite strand. What was background becomes foreground.
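    Concretely, the complementing function pairs A with T and C with G, so a spliced-in-backwards fragment reads as the reverse complement of the original sequence. A minimal sketch:

    ```python
    # The fixed base-pairing function: A<->T, C<->G.
    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def reverse_complement(seq):
        """What a fragment looks like after being spliced back in backwards."""
        return seq.translate(COMPLEMENT)[::-1]

    print(reverse_complement("GATTACA"))  # -> TGTAATC
    ```

    Note that reverse-complementing twice gives back the original sequence, and a sequence that equals its own reverse complement (e.g. GAATTC, the EcoRI recognition site) is what geneticists call a palindrome---not a palindrome in the everyday string sense.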

    I don't see how that corresponds to any reasonably common regularity in any other kind of data, so modifying LZMA to do that would make it domain-specific.

    It could still be interesting to do. A couple of minor tweaks to LZMA (one domain-specific, one not) might make it competitive with the best known DNA sequence compressor, which just happens to be closed-source and proprietary.
    Last edited by Paul W.; 20th January 2015 at 03:30. Reason: typos

  40. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (20th January 2015)

