
Thread: Tree alpha v0.1 download

  1. #511
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    No problem. Looks like you moved up several spots http://mattmahoney.net/dc/text.html
    Results from Timer64:

    TreeCompress:
    Virtual Memory 1779 MB
    Physical Memory 1693 MB

    TreeCompress64:
    Virtual Memory 8099 MB
    Physical Memory 5002 MB

    TreeCompress64's Process Time is 2.35x Global Time, so an average of 2.35 threads are used. One of these days I should try adding support for up to 8 threads and multithread as much of the remaining single threaded code as I can.

    I was testing the Calgary Corpus and found a couple of encoding/decoding bugs that only show up on relatively small files. New version coming....in a few days.

  2. #512
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks. I updated the memory results. http://mattmahoney.net/dc/text.html#1738

  3. #513
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.17

    Tree v0.17 includes an explicit dispersion indicator bit for symbols that occur more than 20 times and have a code length of 11 or more bits. The dispersion decision code is tuned to set the bit for fewer symbols than would give optimum compression, in order to improve decoding speed.

    The general MTF queue has been changed. What was a regular mtf queue for the first 10 symbols is now a circular buffer of length 8. The total MTF queue length is now 256.

    Coding bugs that caused some small files to not encode or decode correctly have been fixed. Now Tree even works on an empty file.

    New results for enwik9 and enwik8 are below. Compression time (and RAM usage) are essentially unchanged.

    enwik9:
    TreeCompress: 173,461,100 bytes decodes in 7.1 sec. (was 173,846,372 bytes, 8.1 sec.)
    TreeCompress64: 174,399,062 bytes decodes in 7.2 sec. (was 174,796,156 bytes, 8.2 sec.)

    enwik8:
    TreeCompress: 21,564,704 bytes (was 21,602,648 bytes)
    TreeCompress64: 21,772,096 bytes
    Attached Files

  4. #514
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Nice results, but beware the overtuning to enwiks :]

  5. #515
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  6. #516
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Thanks, but the decoding times should be 1 second faster.

  7. #517
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Nice results, but beware the overtuning to enwiks :]
    I guess I shouldn't have said "tuned" as much as "set". I purposely set the dispersion threshold significantly above the expected mtf hit rate to keep symbols out of the mtf queue unless there is significant benefit. I didn't feel that it was worth the time to update the mtf queue every time a symbol occurs if it is only going to save a few bits. For instance, for a symbol with a code length of 11, I would expect a mtf hit for approximately 12% of the instances of that symbol with random distribution. I set the threshold for symbols to 1/(code_length - 6), which is 20% for code length 11. The change was tested with enwik9 to verify the results matched expectations and I played with the threshold to make sure the effects were as expected, but went with my initial idea.
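
    In pseudocode-ish C, the decision boils down to something like this (a minimal sketch with made-up names, not the actual Tree code):

    Code:
    #include <stdio.h>

    /* Sketch of the dispersion-bit decision described above (hypothetical
       names, not Tree's source). A symbol only gets the bit set if its
       observed MTF hit rate clears 1/(code_length - 6), which is deliberately
       above the hit rate expected from a randomly dispersed symbol. */
    int set_dispersion_bit(int code_length, long occurrences, long mtf_hits)
    {
        if (occurrences <= 20 || code_length < 11)
            return 0;                                 /* handled by other rules */
        double hit_rate  = (double)mtf_hits / occurrences;
        double threshold = 1.0 / (code_length - 6);   /* 20% for 11-bit codes   */
        return hit_rate >= threshold;
    }

    int main(void)
    {
        /* Code length 11: a randomly dispersed symbol hits the queue roughly
           12% of the time, which stays below the 20% threshold. */
        printf("%d\n", set_dispersion_bit(11, 1000, 120));  /* 0 */
        printf("%d\n", set_dispersion_bit(11, 1000, 300));  /* 1 */
        return 0;
    }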

    In general, I think MTF with known global probabilities is more complicated than MTF with unknown global probabilities where it makes more sense to just keep a global queue. That isn't so good when global probabilities are known because ideally, the symbol probabilities for the queue positions should vary according to the global probability and dispersity of that symbol. Maybe there is an approach that gives optimal compression and good speed, but if it exists, I don't know it.

  8. #518
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.17a

    I found a bug while encoding OBJ1 of the Calgary Corpus. A fix is attached.

    Test results with the Calgary Corpus, Silesia Corpus and a few other files.

    First, from one of the Sequitur papers: http://www.cs.waikato.ac.nz/~ihw/pap...W-Compress.pdf

    [Attached image: Sequitur_Results.jpg (results table from the Sequitur paper)]

    For comparison purposes (Tree/Tree64 in compressed file bits per uncompressed byte):

    Calgary Corpus files:

    BIB: 2.44/2.38
    BOOK1: 2.49/2.43
    BOOK2: 2.13/2.09
    GEO: 4.94/4.93
    NEWS: 2.58/2.54
    OBJ1: 3.92/3.93
    OBJ2: 2.56/2.51
    PAPER1: 2.82/2.73
    PAPER2: 2.72/2.62
    PIC: 0.86/0.88
    PROGC: 2.85/2.83
    PROGL: 1.91/1.93
    PROGP: 1.89/1.90
    TRANS: 1.62/1.60
    AVG: 2.55/2.52

    (vs. 3.64 for compress, 2.69 for gzip, 2.64 for Sequitur, and 2.49 for PPMC)

    A bigger file:

    BIBLE (KING JAMES VERSION): 1.49/1.48
    (vs. 2.77 for compress, 2.32 for gzip, 1.84 for Sequitur, and 1.92 for PPMC)
    4,047,392 bytes: 49/10 seconds to compress, about 15 msec to decompress

    Silesia Corpus (Tree/Tree64 compressed file size) http://mattmahoney.net/dc/silesia.html:
    dickens: 2,404,940/2,421,048
    mozilla: 16,233,680/16,184,312
    mr: 3,201,076/3,221,588
    nci: 1,723,148/1,780,000
    ooffice: 2,660,440/2,633,624
    osdb: 2,708,424/2,727,656
    reymont: 1,252,208/1,190,044
    samba: 4,958,524/4,293,456
    sao: 4,958,524/4,924,840
    webster: 7,090,864/7,162,204
    x-ray: 5,595,204/5,591,248
    xml: 432,688/435,360
    total: 52,494,284/52,565,380

    Compression time was 5225 seconds for Tree and 1108 seconds for Tree64. Decoding time was 1.4 seconds for files from either version.
    Attached Files

  9. The Following 4 Users Say Thank You to Kennon Conrad For This Useful Post:

    Mike (17th December 2014),Paul W. (19th December 2014),Piotr Tarsa (17th December 2014),surfersat (17th December 2014)

  10. #519
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  11. #520
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I guess I shouldn't have said "tuned" as much as "set". I purposely set the dispersion threshold significantly above the expected mtf hit rate to keep symbols out of the mtf queue unless there is significant benefit. I didn't feel that it was worth the time to update the mtf queue every time a symbol occurs if it is only going to save a few bits. For instance, for a symbol with a code length of 11, I would expect a mtf hit for approximately 12% of the instances of that symbol with random distribution. I set the threshold for symbols to 1/(code_length - 6), which is 20% for code length 11. The change was tested with enwik9 to verify the results matched expectations and I played with the threshold to make sure the effects were as expected, but went with my initial idea.

    In general, I think MTF with known global probabilities is more complicated than MTF with unknown global probabilities where it makes more sense to just keep a global queue. That isn't so good when global probabilities are known because ideally, the symbol probabilities for the queue positions should vary according to the global probability and dispersity of that symbol. Maybe there is an approach that gives optimal compression and good speed, but if it exists, I don't know it.
    I get the sense that you are making MTF out to be a more sophisticated algorithm than it is. All MTF does is count the number of different symbols that appear in between each pair of matching symbols and output this number. It's a nifty algorithm because it can be trivially used as a bijective transform; it maps a sequence S* from alphabet S back onto itself; the output is completely unambiguous and self-contained.

    I mention this not to be pedantic -- or not only to be pedantic -- but because I think the definition might be helpful. It might help in making a clear distinction between MTF and the other parts of Tree's pipeline that are not part of MTF. I have been getting confused, for one.

    MTF knows nothing of probabilities. It only does one thing, which is output a number, a number which is fully determined by the symbol sequence and nothing else. That number could be called the MTF number -- it's unique to MTF and fully specified by the MTF algorithm {fully specified when there is a previous symbol, anyway... there are various ways of dealing with the first instance of each symbol}.
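
    For concreteness, plain byte-alphabet MTF fits in a few lines (a standalone sketch, nothing to do with Tree's code):

    Code:
    #include <stdio.h>
    #include <string.h>

    /* Classic move-to-front: each byte is replaced by its current position in
       a recency-ordered list (its "MTF number") and then moved to position 0.
       For a repeated byte, that position equals the number of distinct bytes
       seen since its previous occurrence. */
    static void mtf_encode(const unsigned char *in, unsigned char *out, size_t n)
    {
        unsigned char list[256];
        for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;

        for (size_t i = 0; i < n; i++) {
            unsigned char c = in[i];
            int pos = 0;
            while (list[pos] != c) pos++;
            out[i] = (unsigned char)pos;
            memmove(list + 1, list, (size_t)pos);  /* shift, then move c to front */
            list[0] = c;
        }
    }

    int main(void)
    {
        const unsigned char text[] = "abracadabra";
        unsigned char codes[sizeof text - 1];
        mtf_encode(text, codes, sizeof text - 1);
        for (size_t i = 0; i < sizeof text - 1; i++) printf("%d ", codes[i]);
        printf("\n");   /* first occurrences are large, repeats become small */
        return 0;
    }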

    Vis-a-vis the algorithm you're describing, I'm assuming that you are translating the symbol references into MTF numbers and then feeding this stream of MTF numbers to a subsequent modeling step, and so on. If you're not generating MTF numbers, not even transiently, then I'd venture to say that this is not MTF. That would not be a bad thing. MTF is basically a hack -- it's just a convenient heuristic and it can easily be beaten. Whatever you are doing is surely better.

    In any case:

    I've been thinking about MTF. You said you're maintaining multiple MTF queues. You can get similar behavior to multiple queues using a single queue if you reorder the data before giving it to MTF. In other words, you could first send all of the queue-1 symbols, then send all of the queue-2 symbols, ..., etc. The MTF queue will adapt to each new stream in an essentially constant number of steps, and then it will be just like a brand new queue. Of course, you could also create a mechanism to trigger it to clear itself and start over, if you wanted.

    Re-ordering the data would probably slow down decoding, which is probably not what you want. But one queue is a bit easier to visualize, I think.

  12. #521
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yeah, my terminology may not be the best. Tree uses a hybrid MTF/dictionary scheme. A hybrid scheme is used in place of a standard MTF implementation because I think a standard MTF implementation would take more time to decode and possibly be less effective given that the global frequencies for each symbol are known and some interesting correlations exist between global frequencies and MTF position probabilities.

    I would claim that Tree's scheme captures much of the benefit of local correlations that can be captured with a standard full MTF scheme followed by an MTF model that adjusts position probabilities according to the global frequency of each symbol without having to actually adjust the probabilities. The main reason for having multiple queues at this point in time is to provide code lengths that more accurately match a symbol's probability than a standard MTF queue would provide if the global probability was ignored. I am interested in implementing it as a single queue with adjusted probabilities, but haven't thought of a way to implement it that seems fast enough to try.

    In terms of Tree's scheme being confusing, I will try to explain it another way. The scheme is equivalent to a single MTF queue implementation, but with some extra rules or limitations. For the purposes of discussion, I will use code lengths from enwik9 (and maybe not exactly right at that), but these will vary from file to file.

    MTF queue position 0 (7 bits)
    Contains the most recent (non expired) 2 instance symbol
    MTF queue positions 1 - 6 (8 bits)
    5 positions contain the 5 most recent >20 instance symbols
    1 position contains the most recent (non expired) 3 instance symbol
    MTF queue positions 7 - 13 (9 bits)
    3 positions contain the 6th - 8th most recent >20 instance symbols
    1 position contains the 2nd most recent (non expired) 2 instance symbol
    1 position contains the most recent (non expired) 4 instance symbol
    1 position contains the most recent (non expired) 5 instance symbol
    1 position contains the most recent (non expired) 6 instance symbol
    MTF queue positions 14 - 28 (10 bits)
    8 positions contain the 9th - 16th most recent >20 instance symbols
    2 positions contain the 3rd - 4th most recent (non expired) 2 instance symbols
    1 position contains the most recent (non expired) 7 instance symbol
    1 position contains the most recent (non expired) 8 instance symbol
    1 position contains the most recent (non expired) 9 instance symbol
    1 position contains the most recent (non expired) 10 instance symbol
    1 position contains the most recent (non expired) 11 instance symbol

    ETC.

    With additional rules:
    1. Symbols with global frequencies that call for a dictionary code length of less than or equal to 11 bits are always placed in an MTF position with that code length.
    2. Symbols with global frequencies that call for a dictionary code length of at least 11 bits are never moved to an MTF position with a code length longer than the dictionary code length.
    3. Symbols with a dispersion bit that is not set are placed only in MTF positions with code lengths corresponding to their global frequency.
    4. Symbols that occur more than 20 times are placed in MTF positions corresponding to their global frequency if more than 256 other symbols that occur more than 20 times have been seen more recently.
    5. Symbols that occur up to 20 times are placed in MTF positions corresponding to their global frequency if more than 64 other non expired symbols that have the same global frequency have been seen more recently.

    Clearly this isn't optimal (a symbol that occurs 11 times shouldn't start out with a code that is 2 bits longer than a symbol that occurs 2000 times) and I will probably make some adjustments at some point.

    I guess you could describe it as a hack of the mtf hack that limits mtf handling and provides (imperfect) position adjustments. But it is a way to (in effect) adjust MTF positions quickly and keep MTF from consuming a lot of time on the decoding side.
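
    As a toy illustration of the tiered layout above (hypothetical values that only loosely follow the enwik9 numbers, not Tree's actual tables), the position-to-code-length mapping is just a small lookup:

    Code:
    #include <stdio.h>
    #include <stddef.h>

    /* Each tier spans a range of queue positions and every position in a tier
       shares one code length, so a queue hit is coded by tier plus position
       within the tier rather than by a separate per-position probability. */
    struct mtf_tier { int first_pos, last_pos, code_bits; };

    static const struct mtf_tier tiers[] = {
        {  0,  0,  7 },
        {  1,  6,  8 },
        {  7, 13,  9 },
        { 14, 28, 10 },
    };

    int code_bits_for_position(int pos)
    {
        for (size_t i = 0; i < sizeof tiers / sizeof tiers[0]; i++)
            if (pos >= tiers[i].first_pos && pos <= tiers[i].last_pos)
                return tiers[i].code_bits;
        return -1;   /* outside the illustrated range */
    }

    int main(void)
    {
        for (int pos = 0; pos <= 28; pos += 7)
            printf("position %2d -> %d bits\n", pos, code_bits_for_position(pos));
        return 0;
    }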

    Tree's code isn't sophisticated enough yet to take advantage of clearing the mtf queue. I think I need another layer of code calculations to make significant improvements. rANS looks simple when I see examples, but I haven't felt comfortable enough with it yet to give it a try.

  13. #522
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yeah, my terminology may not be the best. Tree uses a hybrid MTF/dictionary scheme. A hybrid scheme is used in place of a standard MTF implementation because I think a standard MTF implementation would take more time to decode and possibly be less effective given that the global frequencies for each symbol are known and some interesting correlations exist between global frequencies and MTF position probabilities.

    I would claim that Tree's scheme captures much of the benefit of local correlations that can be captured with a standard full MTF scheme followed by an MTF model that adjusts position probabilities according to the global frequency of each symbol without having to actually adjust the probabilities. The main reason for having multiple queues at this point in time is to provide code lengths that more accurately match a symbol's probability than a standard MTF queue would provide if the global probability was ignored. I am interested in implementing it as a single queue with adjusted probabilities, but haven't thought of a way to implement it that seems fast enough to try.

    In terms of Tree's scheme being confusing, I will try to explain it another way. The scheme is equivalent to a single MTF queue implementation, but with some extra rules or limitations. For the purposes of discussion, I will use code lengths from enwik9 (and maybe not exactly right at that), but these will vary from file to file.

    MTF queue position 0 (7 bits)
    Contains the most recent (non expired) 2 instance symbol
    MTF queue positions 1 - 6 (8 bits)
    5 positions contain the 5 most recent >20 instance symbols
    1 position contains the most recent (non expired) 3 instance symbol
    MTF queue positions 7 - 13 (9 bits)
    3 positions contain the 6th - 8th most recent >20 instance symbols
    1 position contains the 2nd most recent (non expired) 2 instance symbol
    1 position contains the most recent (non expired) 4 instance symbol
    1 position contains the most recent (non expired) 5 instance symbol
    1 position contains the most recent (non expired) 6 instance symbol
    MTF queue positions 14 - 28 (10 bits)
    8 positions contain the 9th - 16th most recent >20 instance symbols
    2 positions contain the 3rd - 4th most recent (non expired) 2 instance symbols
    1 position contains the most recent (non expired) 7 instance symbol
    1 position contains the most recent (non expired) 8 instance symbol
    1 position contains the most recent (non expired) 9 instance symbol
    1 position contains the most recent (non expired) 10 instance symbol
    1 position contains the most recent (non expired) 11 instance symbol
    Oh. I think you posted something like that already, and maybe I forgot. I am tending to think that you are abusing and misusing terminology to some extent by calling this MTF, though. It was probably inspired by MTF originally, but at this point there may well be a better name for it. This may be a novel application, but similar things have been done before for other applications.

    Symbol Ranking might be a better name, but I myself am not 100% sure what exactly it means. If you search for that term on google, you'll get some hits from Peter Fenwick papers that may be worth looking at. Also, Matt has written a Symbol Ranking compressor, so maybe he could be of help (or it may already be in DCE).

  14. #523
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Oh. I think you posted something like that already, and maybe I forgot. I am tending to think that you are abusing and misusing terminology to some extent by calling this MTF, though. It was probably inspired by MTF originally, but at this point there may well be a better name for it. This may be a novel application, but similar things have been done before for other applications.

    Symbol Ranking might be a better name, but I myself am not 100% sure what exactly it means. If you search for that term on google, you'll get some hits from Peter Fenwick papers that may be worth looking at. Also, Matt has written a Symbol Ranking compressor, so maybe he could be of help (or it may already be in DCE).
    DCE does have a section on Symbol Ranking. The first sentence is as follows:

    Symbol ranking, or move-to-front (MTF), is a transform in which the alphabet is maintained in a queue from newest to oldest and coded by its position.

    I guess I would say the implementation consists of a set of MTF queues that have an overall effect of ranking symbols. Matt's algorithm has arithmetic coding on top of the Huffman codes, which is something I have been thinking about adding. I am pretty sure that even just context modeling the overall mtf/dictionary/new symbol probabilities would result in a noticeable improvement in the compression ratio for many files.

  15. #524
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    Symbol Ranking & frequency vs. recency

    My impression is that in practice, the term "symbol ranking" is used for context models where the information per context is an MTF list (a.k.a. MTF queue).

    Matt's SR2 algorithm is inspired by a fixed-length-context algorithm of Fenwick's, which was designed for efficient hardware implementation.

    Fenwick had another, stronger SR algorithm that was much more similar to PPM, with a multi-order model and fallback from higher orders to lower orders if the higher-order contexts made no predictions or the predictions failed. (Like with PPM, you start with the highest applicable order, and take the first prediction that matches. In effect, Fenwick's multi-order SR would concatenate MTF lists from highest down, until it found a match.)

    http://wenku.baidu.com/view/816fe52b...e09122bd5.html

    IMO that SR algorithm hasn't gotten enough attention. Its performance was very similar to PPMC's---very slightly worse on average, but often the same or better and sometimes a little worse, for the benchmarks he used. Statistically, it was a tie. (IIRC, that was for the Calgary files and maybe a few other things.)

    I think that's interesting because it seems to show that

    (1) most of the benefit of PPM-style statistics is in doing what SR does---the most valuable predictions (by far) are usually about the recent past predicting the near future, and more detailed stats are often irrelevant or misleading
    (2) if your statistics are stable, you can beat MTF by a little with actual statistics that have more precision, but
    (3) if your statistics aren't stable, MTF is usually better---it's just okay for stable stats, but often kicks ass on unstable ones, where it responds very quickly to changes.

    Fenwick himself discounted the value of his own algorithm, essentially presenting SR as a coarse approximation of PPM, with the bias toward the most recent things being more or less a random only okay sample of a real distribution. He also compared it unfavorably to BWT, which implicitly does a different approximation of PPM, which is also only okay, but can be pretty fast with modern BWT algorithms.

    I think he was kinda wrong on both counts.

    Compared to PPM, SR is usually either better or worse, because you either have relatively stable stats where you can make relatively fine quantitative distinctions and PPM wins, or you have transients where MTF is likely to do better, because the older stats are likely just wrong.

    Compared to BWT, SR is also usually either better or worse. If your stats are stable enough at the BWT's block granularity, BWT has an edge because it's a little better at approximating PPM. But if you have transients at granularities smaller than BWT blocks, SR is likely to be better, because it's much "twitchier" and heavily weights recent behavior over older behavior.

    --

    If I'm right about those things, I think it's relevant to Tree, because Tree is now getting something important basically right: you want to discriminate between things for which you have stable stats and can compute fairly precise probabilities, and things for which transient stats are better and you need a "twitchier" model, and precision of probabilities is less important than adapting fast enough that your probabilities aren't just wrong.

    It comes down to cases, and the two most important broad cases are (1) where your stats are pretty stable and normal statistical modeling and coding wins, and (2) where your stats are unstable and MTF-like modeling (plus reasonable EC) usually wins.

    People commonly blur between those things in practice (in high-end compressors) by using Markov-ish stats with decay factors to weight more recent stats more heavily, and there's something right about that. If your stats are very stable, then weighting more recent ones more heavily just loses some precision in estimation. (Recent samples are about as likely to be representative as older ones, so you're effectively just reducing your sample size somewhat, and getting more noise in your stats.) But if your stats are unstable due to transients, you win much more than that---recent stats are likely to be much more representative of future ones, and while you lose modest precision, you gain considerable representativeness.
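
    As a concrete toy (my own sketch, not taken from any particular compressor), decayed counts look like this, and the decay factor is exactly the stable-vs-transient dial described above:

    Code:
    #include <stdio.h>

    #define ALPHABET 4

    static double counts[ALPHABET];

    /* On every observation all counts shrink by the decay factor, so recent
       symbols dominate; decay = 1.0 degenerates to plain global counting. */
    static void observe(int sym, double decay)
    {
        for (int i = 0; i < ALPHABET; i++) counts[i] *= decay;
        counts[sym] += 1.0;
    }

    static double probability(int sym)
    {
        double total = 0.0;
        for (int i = 0; i < ALPHABET; i++) total += counts[i];
        return total > 0.0 ? counts[sym] / total : 1.0 / ALPHABET;
    }

    int main(void)
    {
        /* A transient: symbol 0 dominates for a while, then symbol 1 takes
           over. With decay 0.99 the estimate tracks the switch quickly; with
           plain counts it would still heavily favor symbol 0. */
        for (int i = 0; i < 500; i++) observe(0, 0.99);
        for (int i = 0; i < 100; i++) observe(1, 0.99);
        printf("P(symbol 1) after the switch: %.2f\n", probability(1));
        return 0;
    }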

    I think it's better to discriminate between the cases, if you can, and treat each case just right, rather than blurring them together, but that ends up being "AI-complete"---you have an arbitrarily difficult task of recognizing the nature of the input and applying the right techniques.
    Last edited by Paul W.; 26th December 2014 at 05:06. Reason: typos (incl. duplicated text)

  16. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (7th January 2015)

  17. #525
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Interesting. Fenwick's algorithm overall performs very similarly to Tree, with an average of 2.51 bits/byte vs. 2.52 bits/byte for Tree64.

    Everything I have observed about encoding of Tree's compressor output is consistent with what you are saying. The order 0 entropy of the tokenized file when compressing enwik9 is typically around 212 MB (with a minimum of around 203 MB - 206 MB a little before completion of the tokenization process), yet the final encoded file is less than 175 MB. I don't have an exact breakdown of how much of the savings comes from embedding the dictionary, using a dynamic dictionary, capital symbol modeling, and MTF, but I think MTF is about 40% of the total savings. But the savings from the dynamic dictionary also comes from unstable statistics, and that's probably at least a few MB more.

    One thought I have had, but am not close to trying, is to create some sort of token filter for encoding that looks at local effects and modifies the tokenized file accordingly. The idea would be to try to remove matches that were created or used but not helpful because of poor dispersion.

    Compression is way more complicated than I thought when I started out. I think Tree has the right general ideas in measuring symbol dispersion and adjusting (some of the) symbol weights accordingly, but the implementation is still far from optimal. That seems to support your notion that getting the precision of probabilities exactly right is less important than the ability to adapt.

  18. #526
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    My impression is that in practice, the term "symbol ranking" is used for context models where the information per context is an MTF list (a.k.a. MTF queue).
    That makes intuitive sense. MTF is a very descriptive gem of a name: the name tells you pretty much what the algorithm does. Symbol ranking sounds more general. If your algorithm does much more than "move to front," then why call it Move-to-Front? For consistency, you should at least add to it, e.g.: MTFADSOS -- Move-to-Front-and-Do-Some-Other-Stuff.

    Matt's SR2 algorithm is inspired by a fixed-length-context algorithm of Fenwick's, which was designed for efficient hardware implementation.

    Fenwick had another, stronger SR algorithm that was much more similar to PPM, with a multi-order model and fallback from higher orders to lower orders if the higher-order contexts made no predictions or the predictions failed. (Like with PPM, you start with the highest applicable order, and take the first prediction that matches. In effect, Fenwick's multi-order SR would concatenate MTF lists from highest down, until it found a match.)

    http://wenku.baidu.com/view/816fe52b...e09122bd5.html

    IMO that SR algorithm hasn't gotten enough attention. Its performance was very similar to PPMC's---very slightly worse on average, but often the same or better and sometimes a little worse, for the benchmarks he used. Statistically, it was a tie. (IIRC, that was for the Calgary files and maybe a few other things.)

    I think that's interesting because it seems to show that

    (1) most of the benefit of PPM-style statistics is in doing what SR does---the most valuable predictions (by far) are usually about the recent past predicting the near future, and more detailed stats are often irrelevant or misleading
    (2) if your statistics are stable, you can beat MTF by a little with actual statistics that have more precision, but
    (3) if your statistics aren't stable, MTF is usually better---it's just okay for stable stats, but often kicks ass on unstable ones, where it responds very quickly to changes.
    That is interesting. In all honesty, SR doesn't really seem that interesting as an algorithm -- it doesn't seem to have improved on any existing algorithm in any meaningful way. But as an experiment, it seems to have provided some useful insights. I also think it's interesting to see how it compared to AC, since it achieves sub-bit accuracy by doing successive comparisons with individual symbols semi-randomly, rather than by computing actual fractions -- sort of like dither.

    Fenwick himself discounted the value of his own algorithm, essentially presenting SR as a coarse approximation of PPM, with the bias toward the most recent things being more or less a random only okay sample of a real distribution. He also compared it unfavorably to BWT, which implicitly does a different approximation of PPM, which is also only okay, but can be pretty fast with modern BWT algorithms.

    I think he was kinda wrong on both counts.

    Compared to PPM, SR is usually either better or worse, because you either have relatively stable stats where you can make relatively fine quantitative distinctions and PPM wins, or you have transients where MTF is likely to do better, because the older stats are likely just wrong.

    Compared to BWT, SR is also usually either better or worse. If your stats are stable enough at the BWT's block granularity, BWT has an edge because it's a little better at approximating PPM. But if you have transients at granularities smaller than BWT blocks, SR is likely to be better, because it's much "twitchier" and heavily weights recent behavior over older behavior.
    The BWT has a major weakness when the data are highly non-stationary, because the BWT basically completely discards all position information (unless you add it back in somehow) and makes an all-out bet on context lengths.

    Recency and context lengths are both pretty good indicators of significance, and sometimes you should prefer one, sometimes you should prefer the other. Different kinds of data favor one or the other in various mixtures.

    I was planning to write a bit more. I may come back and add more at some point.

    --

    If I'm right about those things, I think it's relevant to Tree, because Tree is now getting something important basically right: you want to discriminate between things for which you have stable stats and can compute fairly precise probabilities, and things for which transient stats are better and you need a "twitchier" model, and precision of probabilities is less important than adapting fast enough that your probabilities aren't just wrong.
    This is an interesting discussion, and I could write at length about my observations and ideas on this topic. One thing I'm wondering is whether we should start a new thread. For one thing, this thread is already practically its own forum-within-a-forum, and the discussion would be easier to keep track of somewhere else.

    The stationary assumption is convenient and popular, because it's by far the easiest problem to solve. If statistics are the same throughout the data, that means only one set of stats to keep track of, and the whole problem of compression becomes a whole lot simpler. Unfortunately, I don't think any real data are like that, because perfectly-stationary data are like perfect circles or perfectly-straight lines -- they only exist in math. I think it's not necessarily easy to keep track of what the statistics are doing at any given time and react, and that is probably why those algorithms just ignore the problem and do one thing. Advanced CM compressors try to attack this problem fairly directly, I think. But I think there might be room for better algorithms there, because I think CM compressors sort of throw a lot of machine-learning algorithms at the problem and let them figure it out (if someone, like Matt, wants to correct my impressions, please do).

    So, the main issue might be coming up with (abstract, mental) models of the kinds of changes statistics go through and then coming up with efficient ways to extract the patterns from real data. One pattern is to have two different types of data placed sequentially, like a text file followed by a wav audio file. In that case, the compressor wants to just throw out the first model at the end of the text and start collecting statistics again. That's a pretty simple pattern, and if someone can think of an efficient way to just locate the break-point, then the two parts could be separately processed with the BWT, and that would make the BWT much more efficient.
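
    Just to make the break-point idea concrete, here is a rough sketch of one possible detector (my own toy, not a worked-out algorithm): slide two adjacent windows over the data and flag the offset where their byte histograms differ the most.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define WIN 4096

    static void histogram(const unsigned char *p, size_t n, double h[256])
    {
        for (int i = 0; i < 256; i++) h[i] = 0.0;
        for (size_t i = 0; i < n; i++) h[p[i]] += 1.0 / n;
    }

    static double l1_distance(const double a[256], const double b[256])
    {
        double d = 0.0;                       /* 0 = identical, 2 = disjoint */
        for (int i = 0; i < 256; i++) d += fabs(a[i] - b[i]);
        return d;
    }

    /* Returns the offset with the largest histogram change between the WIN
       bytes before it and the WIN bytes after it: a candidate split point. */
    static size_t find_break(const unsigned char *data, size_t n)
    {
        double left[256], right[256], best = -1.0;
        size_t best_pos = 0;
        for (size_t pos = WIN; pos + WIN <= n; pos += WIN / 4) {
            histogram(data + pos - WIN, WIN, left);
            histogram(data + pos, WIN, right);
            double d = l1_distance(left, right);
            if (d > best) { best = d; best_pos = pos; }
        }
        return best_pos;
    }

    int main(void)
    {
        /* Synthetic test: 64 KB of letters followed by 64 KB of noise. */
        size_t n = 131072;
        unsigned char *buf = malloc(n);
        if (!buf) return 1;
        for (size_t i = 0; i < n / 2; i++) buf[i] = (unsigned char)('a' + i % 26);
        for (size_t i = n / 2; i < n; i++) buf[i] = (unsigned char)(rand() & 0xff);
        printf("candidate break near %zu (true break at %zu)\n",
               find_break(buf, n), n / 2);
        free(buf);
        return 0;
    }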

    It comes down to cases, and the two most important broad cases are (1) where your stats are pretty stable and normal statistical modeling and coding wins, and (2) where your stats are unstable and MTF-like modeling (plus reasonable EC) usually wins.

    People commonly blur between those things in practice (in high-end compressors) by using Markov-ish stats with decay factors to weight more recent stats more heavily, and there's something right about that. If your stats are very stable, then weighting more recent ones more heavily just loses some precision in estimation. (Recent samples are about as likely to be representative as older ones, so you're effectively just reducing your sample size somewhat, and getting more noise in your stats.) But if your stats are unstable due to transients, you win much more than that---recent stats are likely to be much more representative of future ones, and while you lose modest precision, you gain considerable representativeness.

    I think it's better to discriminate between the cases, if you can, and treat each case just right, rather than blurring them together, but that ends up being "AI-complete"---you have an arbitrarily difficult task of recognizing the nature of the input and applying the right techniques.
    Last edited by nburns; 26th December 2014 at 22:24.

  19. The Following User Says Thank You to nburns For This Useful Post:

    Paul W. (7th January 2015)

  20. #527
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Kennon,

    It would be interesting to see what common strings your algorithm decides are high vs. low "dispersion."

    ---

    Do be aware that what you're calling "dispersion" isn't what computational linguists and corpus wranglers call "dispersion." It's related, and correlated, but different.

    What they call "dispersion" is (roughly) the fraction of documents in which a word/phrase/ngram appears (usually in some large varied corpus). That's usually adjusted for document length in some reasonably obvious way, because things are more likely to appear in long documents than short ones.

    What you're measuring is something different, and significant, which deserves its own name. (And a well-chosen one.)

    My impression is that you're basically discriminating between items for which

    1) recency is a better predictor of whether you'll see them soon, versus

    2) things for which base rates (presumed-stable stats) are a better predictor.

    (Or as I understand it, since your MTF lists are not huge, and you can't tell in general which way the code would be shorter, you're discriminating between things where particularly high recency is a better predictor than base rates.)

    I'm not sure what to call that, but it's worth defining and naming clearly.

  21. #528
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Kennon,

    It would be interesting to see what common strings your algorithm decides are high vs. low "dispersion."

    ---

    Do be aware that what you're calling "dispersion" isn't what computational linguists and corpus wranglers call "dispersion." It's related, and correlated, but different.

    What they call "dispersion" is (roughly) the fraction of documents in which a word/phrase/ngram appears (usually in some large varied corpus). That's usually adjusted for document length in some reasonably obvious way, because things are more likely to appear in long documents than short ones.

    What you're measuring is something different, and significant, which deserves its own name. (And a well-chosen one.)

    My impression is that you're basically discriminating between items for which

    1) recency is a better predictor of whether you'll see them soon, versus

    2) things for which base rates (presumed-stable stats) are a better predictor.

    (Or as I understand it, since your MTF lists are not huge, and you can't tell in general which way the code would be shorter, you're discriminating between things where particularly high recency is a better predictor than base rates.)

    I'm not sure what to call that, but it's worth defining and naming clearly.

    I redid the top 100K symbol list, using the best compressed enwik9 file from the latest release (TreeCompress only). The list is attached. There is one problem with the list - the code that generates the list doesn't print extended UTF-8 characters correctly, but that only shows up occasionally, is usually pretty obvious, and doesn't significantly impact the data (IMHO).

    Thanks for the heads up on my usage of the term dispersity. I will do some research and think about it. I suppose something like "clustered" or "local repetition rate" may be better. It's similar to what Nburns pointed out about MTF vs. symbol ranking. It's very helpful to find out about terms that may be misleading since I have started working on a paper, so getting the terminology right is very important. Writing the paper may be almost as big of a challenge as writing the code...

    Your general idea about Tree's recency discrimination is basically correct. I think I could determine exact bit impacts of any combination of symbols being allowed into the MTF queues with more code and execution time, but have been targeting reasonably effective compression with fast decoding rather than optimal compression (given the existing constraints) and have limited time. I would have to think about it some more, but my impression is that keeping symbols out of the MTF queues that are only slightly more efficiently coded when allowed in the queues allows the other symbols that are put in the mtf queues to be coded more efficiently because their average queue position is lower.
    Attached Files

  22. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (8th January 2015)

  23. #529
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    It's interesting that a big fraction of your strings have the high-recency bit set, once you get past the first few hundred strings. Many things are clearly better coded with MTF entropy than with base rate entropy, and I suspect that if you had a big enough MTF queue, it would be the clear majority---most things that are not both high-frequency and high dispersion (like "the", "a", "and", "if the", "and a", etc.) have strong locality.

    That's really important and should be in your paper as clearly as humanly possible. It explains the somewhat disappointing practical results for MTF coding and splay tree coding, despite their provable "optimality" under seemingly reasonable assumptions. Plain MTF and splay tree coders lose by a modest constant factor where there are strong, stable (ergodic Markov) regularities, but win where things are unstable and exhibit any significant locality. If you can discriminate between those two basic cases, and treat each appropriately, you will generally win.

    IMO that's a breakthrough in basic compression theory. It shows why "universal" compression algorithms aren't universal---because ergodic Markov models are really, really stupid---and what to do instead.

    Don't undersell this.

  24. #530
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think I could determine exact bit impacts of any combination of symbols being allowed into the MTF queues with more code and execution time, but have been targeting reasonably effective compression with fast decoding rather than optimal compression (given the existing constraints) and have limited time. I would have to think about it some more, but my impression is that keeping symbols out of the MTF queues that are only slightly more efficiently coded when allowed in the queues allows the other symbols that are put in the mtf queues to be coded more efficiently because their average queue position is lower.
    For a paper, it'd be well worth doing a run that takes a few days to figure out what the compression wins are for much longer MTF queues. That space/time tradeoff curve is important, and you can reinterpret it assuming a better implementation. (E.g., if you have linear searches taking ridiculous time, you can argue that a log n search of a reasonable priority queue implementation would take a log n factor off what you measured.) Even without that, it's theoretically important how much MTF can win over ergodic Markov assumptions, compression-wise, even if you don't have a cheap implementation.

    In terms of practical space/time tradeoffs, I'd think that using some reasonable priority queue implementation (with a heap embedded in an array in the usual way, or splay trees) would give you a nice "knob" to adjust space vs. time---you could dial up somewhat higher ratios and decompression times, but still be pretty fast, and hit several points on the Pareto frontier.

    BTW, I noticed recently that Danny Sleator (the splay tree guy, a prof at CMU) gives away C code for splay trees that keep track of the size of left subtrees---i.e., what you need for a priority queue, allowing you to quickly determine ranks or search by rank order.
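
    For illustration, the same rank bookkeeping can also be done with a Fenwick tree (my own sketch with made-up names, not Sleator's code): give each symbol's most recent occurrence a timestamp slot and count how many active slots are newer, which is exactly its MTF position, in O(log n) per query.

    Code:
    #include <stdio.h>
    #include <string.h>

    #define MAXT (1 << 20)   /* maximum number of timestamps (input length) */
    #define NSYM 256

    static int bit[MAXT + 1];

    static void bit_add(int i, int v) { for (; i <= MAXT; i += i & -i) bit[i] += v; }
    static int  bit_sum(int i)        { int s = 0; for (; i > 0; i -= i & -i) s += bit[i]; return s; }

    int main(void)
    {
        const unsigned char *text = (const unsigned char *)"abracadabra";
        int last_slot[NSYM] = {0};
        int t = 0;

        for (const unsigned char *p = text; *p; p++) {
            t++;
            if (last_slot[*p]) {
                /* MTF position = number of symbols whose latest slot is newer */
                int rank = bit_sum(MAXT) - bit_sum(last_slot[*p]);
                printf("'%c' -> MTF position %d\n", *p, rank);
                bit_add(last_slot[*p], -1);      /* retire the stale slot */
            } else {
                printf("'%c' -> new symbol\n", *p);
            }
            bit_add(t, +1);                      /* this occurrence is newest */
            last_slot[*p] = t;
        }
        return 0;
    }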

  25. #531
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    It's interesting that a big fraction of your strings have the high-recency bit set, once you get past the first few hundred strings.
    Yes, and the fraction would be even higher if I were willing to trade some decompression speed for a slightly better compression ratio. When I look over the data, I also find it interesting that single letters, strings that have numbers, nouns, and names tend to get marked as "dispersive" more often than other types of strings, especially more than verbs and verbish (?) strings. Another category I find interesting that is not necessarily clear in the file I posted is the presence of strings that have very few or no local repetitions, far fewer than one would expect with random spacing. These tend to be strings related to the structure of the text, like headers for tables, xml markup, and the abbreviations for various languages used to list the page titles in a variety of languages. A really smart encoder could track these and predict which one is next with a fairly high success ratio.

    Quote Originally Posted by Paul W. View Post
    IMO that's a breakthrough in basic compression theory. It shows why "universal" compression algorithms aren't universal---because ergodic Markov models are really, really stupid---and what to do instead.

    Don't undersell this.
    As in it belongs in the abstract?


    I am thinking I will explore MTF further because I would like more data on some of the things you are also interested in. I suspect better ranking would have more impact than longer queues, but can't be sure without testing. First, though, I want to finish up some TreeCompress64 improvements that are providing similar compression in 40+% less time and user settable RAM usage to test compression ratio vs. initial block size.

  26. #532
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I have one additional comment on the size of Tree's 256 deep "general" MTF queue. It covers a bigger portion of the data than one might expect at first thought. For example, when encoding the tokenized enwik9 file, about 27% of symbols go into this queue and each symbol represents an average of almost 12 characters, so on average the queue covers about the 256 x 12 / 0.27 = 11,378 most recent characters in the uncompressed enwik9 file.

  27. #533
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Thanks for the heads up on my usage of the term dispersity. I will do some research and think about it. I suppose something like "clustered" or "local repetition rate" may be better. It's similar to what Nburns pointed out about MTF vs. symbol ranking. It's very helpful to find out about terms that may be misleading since I have started working on a paper, so getting the terminology right is very important. Writing the paper may be almost as big of a challenge as writing the code...
    I'd probably put the highest priority on not getting terminology wrong; e.g., if you have a choice between using a word you're not sure about versus explaining the concept in simpler terms, err on the side of caution.

  28. #534
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, getting the terminology wrong on a paper would be bad. It's not so bad here because people can ask questions and it helps me learn, but those don't apply to a paper. I will do the best I can for a first draft, but since I have never published a paper and am not really a subject expert I think it would be good if I am not the only author. I am fortunate to have two offers from CS professors to review the paper, including the professor whose class inspired the idea behind Tree and who also is an author of another grammar compression paper. What do you think is the probability of finding a professor from a class that was 30 years ago in his office at a random time on a random afternoon?
    Last edited by Kennon Conrad; 10th January 2015 at 09:04.

  29. #535
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    As in it belongs in the abstract?
    I think so. You want to be careful how you say it, though---you don't want to undersell it, but you don't want to seem to oversell it, either. You want to get it exactly right.

    As I see it your main points should be

    1. Outside-in n-gram frequency analysis is good for picking good strings offline, to direct the parse and let you use LZ codes to get much the same effect as PPM gets online with a bunch of orders of n-grams, and that costs up front but gives you very fast decoding---no explicit modeling of multiple orders at decode time. Outside-in frequency analysis is like a generalization of Statistical Substring Reduction, where you adjust frequencies of substrings to account for parsing long strings preferentially.

    2. Doing this in terms of bytes gives you a kind of automatic tokenization based mainly on empirical statistics, not explicit knowledge of English orthography and grammar, and Tree tends to chop things up pretty well, not just for plain written English, but for automatically template-filled "artificial" English text and un-Englishlike markup goop. For plain text, you get mostly reasonable words and phrases, and for markup, you get funny-looking stuff that works well for boilerplate removal, etc. (I'm not sure how strong a claim you can make here, but you want to make the strongest claim you can support, and it deserves some serious thought... more on that later.)

    3. The implementation strategy also makes it easy to model long-range dependencies between the phrases you parse, which cannot be captured by ergodic Markov models. (To me that's by far the most interesting thing about Tree, and it's a mostly different thing from hitting the Pareto frontier for decompression time.) Your outside-in grammar processing gives you a kind of simple automatic structure discovery and feature detection. Once you have those basic features, you can do higher-level pattern recognition and adaptation... recognizing which phrases are highly dispersed vs. MTF-friendly is just the beginning of what you can do.

    (To me, Tree represents a whole new class of algorithms that I expect to be interesting and important. Putting MTF coding in there isn't just a little hack to get a little boost in compression---it's just the first and simplest example of non-Markov adaptive stuff you can do with the infrastructure you've built for this new class of algorithms.)

    4. You can discriminate between strings which do behave in a roughly ergodic Markov fashion---recurring with relatively fixed frequencies---and strings that don't, and treat each class appropriately.

    5. Simple recency modeling (MTF) works well for the non-ergodic class, giving you a nice boost in compression---maybe a big one if you make the MTF queue big enough. This shows what Bentley et al. got wrong: MTF is good for some stuff but not other stuff, so don't do it for the other (stably frequent) stuff. MTF is better than people think, but you've got to use it for what it's good for, and not what it's bad for.

    6. The same basic techniques could be applied in preprocessors, and/or at the word granularity. For example, you could use published corpus n-gram statistics, analyzed outside-in, to automatically generate a preprocessing dictionary for compressing text in different languages. (That's what I'd planned to do, and may still do.) Don't leave this out of your paper just because you haven't done it yet, or because you think it's obvious. Go ahead and say it, so that if somebody like me goes and does that now-obvious thing, they'll have to cite you.

    --

    About #3...

    You want to make what you're doing sound important, because it is, but you don't want to overdo it and sound like you're making too big a deal of your still-relatively-simple stuff, or that you're brashly (or worse, falsely) claiming that you know how well other instances of this new class of algorithms will work. You want to come off as somebody who knows where they're going and why it's interesting, not as a self-important blowhard.

    There's an art to that. One way of getting away with painting a fairly big picture and saying fairly big-sounding stuff is to talk about your long-range goals, and cast the motivation of the paper you're writing in terms of those goals---you're just explaining why you did the stuff you did so far, and why you think it's interesting, and that's okay. So you get to talk a little bit about where you think it leads, without claiming to have gotten there yet. You're just telling people why you're doing what you're doing, and if they think it's really interesting and promising too, that's great, but if they don't it's no harm, no foul. And you got a few more ideas in print with your name on them.

    --

    It would be nice to compare and contrast what you can do in a CM framework with what you can do in a Tree-like framework. They're both going beyond ergodic Markov modeling, but in different ways.

    I wish I understood what the PAQ algorithms are really doing under the hood. As I understand it, they're doing some non-ergodic and non-Markov modeling with preprocessing into tokens, part-of-speech tagging, skipgrams, other funny contexts, and adaptation between multiple context models sensitive to those kinds of things... but it's all voodoo to me at this point. Too many moving parts that too few people understand, and not enough clear publications to make it accessible.

    --

    BTW, I think you probably ought to come up with a better name than Tree... that name is too much about the implementation of the front end and not enough about the basic modeling and coding issues. As I understand it, you could reimplement the front end using suffix arrays rather than trees, and it wouldn't matter much---what matters is that you're finding the n-grams and doing the frequency analysis right, outside-in.

    --

    It would be interesting to see how well Tree works on non-English text, especially ideographic text like Chinese and Japanese, which is very different from text in English or most European languages. I'm curious whether Tree is good at finding reasonable phrase boundaries in those languages, where there are no spaces between words and words are often written with two characters.

    (Unfortunately, I can't read Chinese or Japanese, so if I looked at the top Tree strings for text in those languages, I wouldn't be able to tell if the phrases were reasonable-seeming or not.)

    --

    Have you tried Tree on the DNA data you posted? How'd it do?

    DNA data is interesting, because it often has long verbatim or very-nearly-verbatim repeats, but in a tiny alphabet. I'd expect that to do a really good job, you'd need some kind of skip-gram modeling.

    (One reason DNA has long repeats is that in most species, it's full of junk inserted by retroviruses---we've got zillions of copies of a bunch of viruses scattered through our DNA, gradually mutated here and there over millions of years. Another reason for long repeats is that most of our genes have been copied several times, and then evolved new specialized functions by mutation. So there's a whole lot of slightly-modified copies of long strings of nucleotides, differing mostly in the occasional single-point mutation, where one letter is changed to another or deleted, or an extra letter is inserted. A simple non-Markov predictor that can tolerate one wrong character and keep copying a string should kick ass.)
    Last edited by Paul W.; 10th January 2015 at 12:34. Reason: deleted dangling quoted text

  30. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (10th January 2015)

  31. #536
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thanks for the extensive comments. It's helpful.

    I had not seen the Bentley paper. Are there papers that discuss optimal mtf benefits when distinguishing between ergodic and non-ergodic symbols? I am almost certain that a more advanced method than I am using could provide better compression. It seems like the ideal solution (if ignoring overhead) would be to have an "ergodicity" value for each symbol that would indicate the decay rate of the symbol probability.
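
    One way I can picture that working (just a sketch of the idea with made-up numbers, not anything implemented in Tree): each symbol keeps its global probability plus a boost that decays, at that symbol's own rate, with the distance since its last occurrence.

    Code:
    #include <stdio.h>
    #include <math.h>

    /* A per-symbol "ergodicity" knob: a decay near 1 means the recency boost
       persists for a long time (bursty symbol, trust recency); a boost of
       zero means "fully ergodic, just trust the base rate". */
    struct symbol_model {
        double base_p;   /* global (dictionary) probability          */
        double boost;    /* extra probability right after a hit      */
        double decay;    /* per-symbol relaxation rate, in (0, 1)    */
    };

    static double estimate(const struct symbol_model *m, long gap_since_last)
    {
        return m->base_p + m->boost * pow(m->decay, (double)gap_since_last);
    }

    int main(void)
    {
        struct symbol_model bursty  = { 0.0005, 0.05, 0.98 };  /* clustered   */
        struct symbol_model ergodic = { 0.0005, 0.00, 0.98 };  /* evenly used */
        for (long gap = 1; gap <= 256; gap *= 4)
            printf("gap %3ld: bursty %.4f  ergodic %.4f\n",
                   gap, estimate(&bursty, gap), estimate(&ergodic, gap));
        return 0;
    }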

    I agree Tree could have a better name. I have considered renaming it but only want to do that once and need a good name. It should be related to what it does rather than the data structures it uses.

    Yes, I tried Tree on the DNA data. I would say the compression was interesting, the encoding was terrible, and the amount of tokenization is highly dependent on the -m (cutoff threshold) parameter. I wanted to improve encoding before publishing results. I say the compression is interesting because the dictionary size tends to be pretty small, as small as 64 entries for vaccg when tokenized with the working version of TreeCompress64. So apparently the compressor finds a few recurring strings and decides the rest should just be entropy encoded. That may be good, but I'm not sure; I haven't looked closely at what gets tokenized.

  32. #537
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I have an idea for improving your parsing.

    When you have a choice between strings that overlap at a particularly common, dispersed symbol, pick the one with the common-and-dispersed symbol on the left.

    E.g., for a sentence like "walk to the end of the road", you'd tend to group "the" with the noun to its right, like "walk to [the end] of [the road]".

    Except that "to the" and "of the" are so common and dispersed that they'd be considered common-and-dispersed symbols unto themselves, so you'd get "walk [to_the end] [of_the road]".

    If you do this right, I think you get putting-the-space-on-the-left-of-a-word for free.

    A bias the other way would probably work, too. I suspect it's more important to have a bias at all than which way it goes: a consistent bias gives parsing that is closer to bijective, i.e., it avoids a lot of extra overlapping strings wasting code space.

    I think this should help with binary data, too---maybe more.
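    As a toy illustration of the tie-break (made-up names, and real candidate scoring would obviously involve more than length):

    Code:
    #include <iostream>
    #include <string>

    // A candidate string the parser could emit for part of the input.
    struct Candidate { int start, len; std::string text; };

    // Two candidates overlap at a common-and-dispersed token 'shared'.
    // Prefer the candidate that has that token on its left edge.
    const Candidate& pick(const Candidate& a, const Candidate& b,
                          const std::string& shared) {
        bool aLeft = a.text.compare(0, shared.size(), shared) == 0;
        bool bLeft = b.text.compare(0, shared.size(), shared) == 0;
        if (aLeft != bLeft) return aLeft ? a : b;  // the proposed bias
        return a.len >= b.len ? a : b;             // otherwise prefer the longer
    }

    int main() {
        // "walk to the end ...": the two candidates overlap at "the".
        Candidate left  {1, 2, "to the"};
        Candidate right {2, 2, "the end"};
        std::cout << "keep [" << pick(left, right, "the").text << "]\n";  // [the end]
    }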
    Last edited by Paul W.; 10th January 2015 at 23:56.

  33. #538
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Thanks for the extensive comments. It's helpful.

    I had not seen the Bentley paper. Are there papers that discuss optimal mtf benefits when distinguishing between ergodic and non-ergodic symbols?
    Not that I've seen. If anybody else has had the idea and published it, it hasn't gotten the attention it deserves, or I'd have seen it in the things I have read.

    I think that's an easy, basic, important result that ought to be in the literature. MTF coders and splay tree coders got a flurry of attention for a few years, then people lost interest because the few experiments people did gave mixed and somewhat disappointing results.

    I'm convinced that's because people did not understand what MTF and especially splay tree coders are actually good for, what they're not good for, and how you can easily discriminate between cases and use them appropriately.

    I am almost certain that a more advanced method than I am using could provide better compression.
    I'm pretty certain of that too, and I think I know a couple of simple methods that are very likely to work well.

    It seems like the ideal solution (ignoring overhead) would be to have an "ergodicity" value for each symbol that indicates the decay rate of the symbol's probability.
    If we're really ignoring overhead, I think the ideal solution would involve an MTF hit-rate histogram that shows how often a symbol reoccurs after how long a gap. There's a lot of exploitable information in an MTF histogram that an intelligent algorithm could use---e.g., it could tell at what scales a symbol is "dispersed" or reoccurs in a clustered way. E.g., for Wikipedia, you could make a pretty good guess whether a phrase:

    1) tends to recur a bunch of times in quick succession if it's used at all (like XML boilerplate for list items), or
    2) is scattered through some articles and doesn't occur at all in most (like subject-specific English phrases), or
    3) is scattered through lots of articles but tends not to recur within an article (like section headings common to many similar articles but only used once each).
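    Concretely, the bookkeeping could look something like the toy sketch below (made-up names; gaps bucketed by log2 of the distance since the symbol's previous occurrence). A type-1 symbol piles its mass into the low buckets, a type-3 symbol into the high buckets, and a type-2 symbol shows both.

    Code:
    #include <array>
    #include <cstdio>
    #include <unordered_map>

    // Per-symbol histogram of reoccurrence gaps, bucketed by log2(gap).
    struct GapHistogram {
        std::array<long, 32> bucket{};  // bucket b counts gaps in [2^b, 2^(b+1))
        void add(long gap) {
            int b = 0;
            while (b < 31 && (1L << (b + 1)) <= gap) ++b;
            ++bucket[b];
        }
    };

    int main() {
        // Toy stream: symbol 7 recurs in a tight burst, symbol 9 is spread out.
        int stream[] = { 7, 7, 7, 1, 2, 3, 9, 4, 5, 6, 8, 9, 1, 2, 3, 4, 9 };
        std::unordered_map<int, long> last_pos;
        std::unordered_map<int, GapHistogram> hist;
        long pos = 0;
        for (int sym : stream) {
            auto it = last_pos.find(sym);
            if (it != last_pos.end()) hist[sym].add(pos - it->second);
            last_pos[sym] = pos++;
        }
        for (auto& [sym, h] : hist) {
            std::printf("symbol %d gap buckets:", sym);
            for (int b = 0; b < 6; ++b) std::printf(" %ld", h.bucket[b]);
            std::printf("\n");
        }
    }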

    My students and I have used the equivalent LRU miss-rate histogram to make intelligent caching decisions, e.g., about when to diverge from LRU when faced with a loop that's a little too big for your memory size, or how much memory to use for the compression cache in OS X. With even a very crude, approximate miss-rate histogram, you can do an online cost-benefit analysis that tells you which replacement policy works better, or how big your compression cache should be, and dynamically adjust things to suit the actual workload.

    Yes, I tried Tree on the DNA data. I would say the compression was interesting, the encoding was terrible, and the amount of tokenization is highly dependent on the -m (cutoff threshold) parameter. I wanted to improve encoding before publishing results. I say the compression is interesting because the dictionary size tends to be pretty small, as small as 64 entries for vaccg when tokenized with the working version of TreeCompress64. So apparently the compressor finds a few recurring strings and decides the rest should just be entropy encoded. That may be good, but I'm not sure; I haven't looked closely at what gets tokenized.
    I think very simple skip-gram context modeling should help a lot for DNA---just first-order Markov-like context modeling of sequences of your Tree strings, but tolerating one mismatched symbol.

    I'm pretty sure that's going to work well for text as well, so you can often recognize things like "in the <A> of the <B>" with a first-order model, skipping a "single" symbol, where <A> and <B> might be single words or whole common noun phrases that Tree's front end has collapsed to single symbols. By doing most of the PPM-like multi-order modeling up front, you should get a vocabulary of symbols for which a single-order model works surprisingly well, if it can handle the "single" skips that kept the front-end from recognizing the whole larger phrase in the first place. Cool.
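    A toy sketch of that kind of order-1 skip context, with made-up strings standing in for the symbols Tree's front end would produce (illustration only, not a real model):

    Code:
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // Pretend these are symbols after the front end collapsed common phrases.
        std::vector<std::string> stream = {
            "in_the", "middle", "of_the", "night",
            "in_the", "heart",  "of_the", "city",
            "in_the", "back",   "of_the", "room" };

        // Skip-gram counts: (symbol two back) -> (next symbol) -> count.
        // The symbol in between is skipped, so the varying word doesn't matter.
        std::map<std::string, std::map<std::string, int>> skip1;
        for (size_t i = 2; i < stream.size(); ++i)
            ++skip1[stream[i - 2]][stream[i]];

        // After "training", "in_the" strongly predicts "of_the" two symbols later.
        for (auto& [next, n] : skip1["in_the"])
            std::cout << "in_the <skip> " << next << " : " << n << "\n";
    }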

  34. #539
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, getting the terminology wrong on a paper would be bad. It's not so bad here because people can ask questions and it helps me learn, but those don't apply to a paper. I will do the best I can for a first draft, but since I have never published a paper and am not really a subject expert I think it would be good if I am not the only author. I am fortunate to have two offers from CS professors to review the paper, including the professor whose class inspired the idea behind Tree and who also is an author of another grammar compression paper. What do you think is the probability of finding a professor from a class that was 30 years ago in his office at a random time on a random afternoon?
    Great. Did you know he had been working on grammar compression, or was it a coincidence?

  35. #540
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    5. Simple recency modeling (MTF) works well for the non-ergodic class, giving you a nice boost in compression---maybe a big one if you make the MTF queue big enough. This shows what Bentley et al. got wrong: MTF is good for some stuff but not other stuff, so don't do it for the other (stably frequent) stuff. MTF is better than people think, but you've got to use it for what it's good for, and not what it's bad for.
    Actually, MTF works ok for symbols that repeat at a stable frequency: if a symbol has probability p, then you expect to see one every 1/p symbols... and MTF assigns it a reasonable position (and hence code length) on average. I remember this being mentioned in the paper.

    Of course, there are plenty of ways to get poor behavior from MTF.
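    A quick toy simulation of that point (not from the paper; the Zipf-ish distribution and queue size are arbitrary): for an i.i.d. source, a symbol's average MTF position stays modest and roughly tracks how many distinct symbols you expect to see between its occurrences.

    Code:
    #include <algorithm>
    #include <cstdio>
    #include <iterator>
    #include <list>
    #include <random>
    #include <vector>

    int main() {
        // Fixed, Zipf-ish probabilities over 64 symbols (a stably frequent source).
        const int N = 64;
        std::vector<double> w(N);
        for (int i = 0; i < N; ++i) w[i] = 1.0 / (i + 1);
        std::discrete_distribution<int> dist(w.begin(), w.end());
        std::mt19937 rng(42);

        std::list<int> mtf;  // front = most recently used
        for (int i = 0; i < N; ++i) mtf.push_back(i);
        std::vector<long long> rank_sum(N, 0), count(N, 0);

        for (int step = 0; step < 200000; ++step) {
            int sym = dist(rng);
            auto it = std::find(mtf.begin(), mtf.end(), sym);
            rank_sum[sym] += std::distance(mtf.begin(), it);
            ++count[sym];
            mtf.erase(it);
            mtf.push_front(sym);  // move to front
        }
        for (int i : { 0, 7, 63 })
            std::printf("symbol %2d: average MTF position %.1f\n",
                        i, double(rank_sum[i]) / std::max<long long>(1, count[i]));
    }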


    --

    BTW, I think you probably ought to come up with a better name than Tree... that name is too much about the implementation of the front end and not enough about the basic modeling and coding issues. As I understand it, you could reimplement the front end using suffix arrays rather than trees, and it wouldn't matter much---what matters is that you're finding the n-grams and doing the frequency analysis right, outside-in.
    Tree isn't such a terrible name if you relate it to the fact that Tree's matches are hierarchical and form trees. That's the major difference from, say, LZ77.

    On the other hand, Tree is such a common word in computer science. It's kind of like having a dog named Dog.
