
Thread: Tree alpha v0.1 download

  1. #241 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.9.1

    An updated version of the two-thread TreeDecode program is now included in the release package; it is the only version that will be used from now on.

    Only TreeDecode has changed. The buffer for transferring data to the output thread is now circular, and MTF queue position and code length decoding tables have been added. Memory usage for enwik9 is now about 415 MB, vs. about 890 MB for the initial two-threaded TreeDecode or 420 MB for the single-threaded TreeDecode. The increase in decompression time for small files with the two-thread version has been eliminated. Decompression of enwik9 is about 1% faster.
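    As background, a minimal sketch of the kind of single-producer/single-consumer circular buffer this refers to is below. It is not Tree's actual code: the buffer size, element type, and function names are made up, and it uses C11 atomics instead of volatile variables.

    Code:
/* Minimal sketch of a single-producer/single-consumer circular buffer between
   a decoder thread and an output (writer) thread. Illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 4096                      /* power of two so indexes can be masked */
#define RING_MASK (RING_SIZE - 1)

static unsigned char ring[RING_SIZE];
static atomic_uint head = 0;                /* next slot the decoder writes */
static atomic_uint tail = 0;                /* next slot the output thread reads */

/* Decoder thread: push one decoded byte, spinning while the ring is full. */
static void ring_put(unsigned char b) {
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    while (h - atomic_load_explicit(&tail, memory_order_acquire) == RING_SIZE)
        ;                                   /* full: wait for the output thread */
    ring[h & RING_MASK] = b;
    atomic_store_explicit(&head, h + 1, memory_order_release);
}

/* Output thread: pop one byte, spinning while the ring is empty. */
static unsigned char ring_get(void) {
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    while (atomic_load_explicit(&head, memory_order_acquire) == t)
        ;                                   /* empty: wait for the decoder */
    unsigned char b = ring[t & RING_MASK];
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return b;
}

static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        if (ring_get() != (unsigned char)i) { puts("mismatch"); return NULL; }
    puts("all bytes transferred in order");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    for (int i = 0; i < 100000; i++)        /* main acts as the decoder thread */
        ring_put((unsigned char)i);
    pthread_join(t, NULL);
    return 0;
}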
    Attached Files

  2. #242 - Kennon Conrad (Member, Bothell, Washington, USA)
    From Bulat's Fast LZ4+EC Compressor thread:

    Quote Originally Posted by Piotr Tarsa:
    Discussion about Tree is scattered through many threads, but I want to add yet again that there should be some arithmetic coding in Tree, I think. For me the most obvious application of arithmetic coding would be to code a flag before each code, telling whether it's a code from the MTF queue or a global code. MTF indexes could be coded in blocks with quasi-static arithmetic coding (could be ANS), i.e. a block of N indexes would be coded at once and, before that block, a table with frequencies of MTF index values would be transmitted (or even a full ANS table if that doesn't hurt compression ratio too much).

    I just think that the local repetitiveness is too varying to assign global codes to MTF indexes values.

    Transmitting MTF indexes in blocks would require the Tree bit encoder to do some additional buffering, but that shouldn't be a big problem. The decoder shouldn't be slower - I think it would be even faster.
    Piotr, here is some data from decoding enwik9 to validate your idea that the local repetitiveness is too varying to assign global codes to MTF index values:
    [Attached image: SymbolTypes.jpg]

    The effects are smeared out by using 100K symbol bins, but plenty of variation is still shown.

    Tree has 4 main symbol types as follows:
    1. Fixed code length symbols (Huffman-like code tree)
    2. Start new definition escape symbols (followed by a "frequency" code and a code for the number of composing symbols)
    3. Symbols from the 64-deep MTF queue (for symbols that occur up to 20 times)
    4. Symbols from the 1-deep MTF queue (for symbols that occur more than 20 times but have a fixed code length of at least 11 bits)

    The current implementation of Tree uses the unfilled fixed code tree space for the start new definition escape. That is nowhere near perfect, but very simple. It also seems like an area that ANS is well suited for. So, based on your idea and comments from others such as Jarek and nburns, this is what I am thinking of as a starter that can later be extended for additional compressed file size improvements.

    Use ANS for encoding the four main symbol types, each given a probability of N/128. (Tree has a maximum code length of 25 bits and I want to stay within 32 bits, and 2**(32-25) is 128. Not sure if it matters; this is a path I knew existed but haven't really explored until now.) Tree will pass the "probability state" to the encoders for each of the four types of data and then output the type code plus symbol as a combined symbol that is 0 - 7 bits longer than the current symbol, depending on the type probability state. I think I would use tables to encode and decode the definition escapes and MTF symbols, but use math to encode and decode the fixed code length symbols to keep the table size at 2**25 entries. A first cut would probably only support varying definition escape probabilities because I seem to work best when I make small incremental changes. Once I have that in place, I think it should not be too hard to start predicting MTF probabilities, and it also gives a better framework for subsymbol context modeling.
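    As a concrete illustration of the kind of coder described above, here is a minimal static rANS sketch over four symbol types, each given a probability of N/128. It follows the common byte-wise rANS formulation rather than anything Tree-specific, and the frequency values are made up for the example:

    Code:
/* Minimal static rANS sketch for four symbol "types" with probabilities N/128.
   Illustrative only; not Tree's coder. */
#include <stdint.h>
#include <stdio.h>

#define SCALE_BITS 7                          /* probabilities are N / 128   */
#define M          (1u << SCALE_BITS)
#define RANS_L     (1u << 23)                 /* renormalization lower bound */

static const uint32_t freq[4] = {100, 4, 14, 10};        /* example N values, sum = 128 */
static const uint32_t cum[5]  = {0, 100, 104, 118, 128}; /* cumulative frequencies      */

/* Encode types in reverse order; bytes are written backwards from buf_end.
   Returns a pointer to the first byte of the finished stream. */
static uint8_t *rans_encode(const uint8_t *types, int n, uint8_t *buf_end) {
    uint32_t x = RANS_L;
    uint8_t *p = buf_end;
    for (int i = n - 1; i >= 0; i--) {
        uint32_t f = freq[types[i]];
        uint32_t x_max = ((RANS_L >> SCALE_BITS) << 8) * f;
        while (x >= x_max) { *--p = (uint8_t)x; x >>= 8; }   /* renormalize   */
        x = (x / f) * M + (x % f) + cum[types[i]];           /* state update  */
    }
    for (int i = 0; i < 4; i++) { *--p = (uint8_t)x; x >>= 8; }  /* flush 32-bit state */
    return p;
}

static void rans_decode(const uint8_t *p, uint8_t *types, int n) {
    uint32_t x = 0;
    for (int i = 0; i < 4; i++) x = (x << 8) | *p++;         /* reload state  */
    for (int i = 0; i < n; i++) {
        uint32_t slot = x & (M - 1);
        uint8_t s = 0;
        while (cum[s + 1] <= slot) s++;                      /* find the type */
        types[i] = s;
        x = freq[s] * (x >> SCALE_BITS) + slot - cum[s];     /* undo update   */
        while (x < RANS_L) x = (x << 8) | *p++;              /* renormalize   */
    }
}

int main(void) {
    uint8_t msg[8] = {0, 0, 2, 0, 1, 3, 0, 0}, out[8], buf[64];
    uint8_t *start = rans_encode(msg, 8, buf + sizeof(buf));
    rans_decode(start, out, 8);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);       /* expect 0 0 2 0 1 3 0 0 */
    printf("(%d stream bytes)\n", (int)(buf + sizeof(buf) - start));
    return 0;
}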

  3. #243 - nburns (Member, San Diego)
    This paper (attached) describes a scheme that has the same implementation issues as Tree. I skimmed it, and I don't think it uses an entropy-based cost function, but otherwise it's very close.
    Attached Files

  4. The Following User Says Thank You to nburns For This Useful Post: biject.bwts (14th July 2014)

  5. #244 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    This paper (attached) describes a scheme that has the same implementation issues as Tree. I skimmed it, and I don't think it uses an entropy-based cost function, but otherwise it's very close.
    Interesting, that is probably the closest I have seen. Another significant difference appears to be Tree's ability to reference dictionary symbols from within the dictionary. Some variations of their algorithm outperform Tree on one of the object files and the picture file, which isn't surprising since Tree is not so good on those. Tree did better on the rest, some as much as about 20%.

  6. #245 - nburns (Member, San Diego)
    Here is a data structure for doing substitutions efficiently.
    Attached Files

  7. #246 - nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad:
    Tree did better on the rest, some as much as about 20%.
    The numbers don't look all that good at a glance. That is the GREEDY algorithm from the other paper that lists five or so different algorithms. The one they call COMPRESSIVE is this (I don't see it available for free).
    Last edited by nburns; 14th July 2014 at 18:09.

  8. #247 - Kennon Conrad (Member, Bothell, Washington, USA)
    Apparently for Tree there are relationships between the current context and the mtf queue locations for the next symbol. For the version of TreeCompress I am using (not released), 78,401,922 non-definition escape symbols are used to write the compressed enwik9 file. They break down as follows:

    All 78,401,922 symbols:
    mtf queues: 4,214,166 (5.4%)
    mtf1 symbols: 2,504,788 (3.2%)
    fixed length code portion of the code tree: 71,682,968 (91.4%)

    4,629,453 of these symbols end with the special capital encode symbol. For the symbols that follow those 4,629,453 symbols:
    mtf queues: 711,479 (15.4%)
    mtf1 symbols: 324,990 (7.0%)
    fixed length code portion of the code tree: 3,592,984 (77.6%)

    For the 73,772,469 symbols that follow a symbol that does not end with the special capital encode symbol:
    mtf queues: 3,502,687 (4.7%)
    mtf1 symbols: 2,179,798 (3.0%)
    fixed length code portion of the code tree: 68,089,984 (92.3%)

    This seems to imply that the relative weight of predictions for recent context histories should be adjusted based on the success of the individual histories in predicting the next symbol. Now I am curious whether existing high end compressors do this.
    Last edited by Kennon Conrad; 25th July 2014 at 16:55.

  9. #248 - nburns (Member, San Diego)
    I think CM does what you're talking about, and it's been exploited to extraordinary degrees by Matt and others.

  10. #249 - Piotr Tarsa (Member, Kraków, Poland)
    But usual CMs work on the bit level, not on the variable-length-token level. And you don't need a CM to start with - at the beginning you can use a single model which generates probabilities.

    PAQ9 has an LZP model and somehow marries it with its CM models, so you can probably look at it.
    Besides that, IIRC Ilya (encode) experimented with some complex LZ backends, but I'm not sure.
    Oh, and there's ZPAQ, which probably allows for some complex modelling of matches, doesn't it?

  11. #250 - Kennon Conrad (Member, Bothell, Washington, USA)
    I think Tree has effects that call for a unique CM - one that works not only at the variable length token level, but also at the trailing subtoken levels. Very little benefit can be derived by predicting the next symbol based on the previous symbol, because if that were the case, then the symbols would have been tokenized. There are, however, good opportunities for predicting the initial portion of the next symbol from the later portion of the previous symbol. The simplest method I can think of with good benefits is to drill all the way down and context-match the last character in the previous symbol with the first character in the next symbol. But I am pretty sure there are also good benefits to looking at the word level and natural prefix/suffix level, skip contexts, etc.
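    As a rough illustration of the simplest idea above, here is a sketch that just counts which first characters follow which last characters across token boundaries and turns the counts into a smoothed probability. It is purely illustrative: the table layout, the +1 smoothing, and the toy data are all assumptions, not Tree's model.

    Code:
/* Sketch: count how often each first character follows each last character
   across token boundaries, then use the counts as a prediction. */
#include <stdio.h>
#include <string.h>

static unsigned counts[256][256];   /* [last char of prev token][first char of next token] */

static void observe(unsigned char last_c, unsigned char first_c) {
    counts[last_c][first_c]++;
}

/* Probability estimate for a candidate first character given the previous
   token's last character, with simple +1 smoothing. */
static double predict(unsigned char last_c, unsigned char first_c) {
    unsigned total = 0;
    for (int c = 0; c < 256; c++) total += counts[last_c][c];
    return (counts[last_c][first_c] + 1.0) / (total + 256.0);
}

int main(void) {
    const char *tokens[] = {"the", "quick", "brown", "fox", "the", "quiet", "dog"};
    int n = 7;
    for (int i = 1; i < n; i++)
        observe((unsigned char)tokens[i - 1][strlen(tokens[i - 1]) - 1],
                (unsigned char)tokens[i][0]);
    printf("P('q' after a token ending in 'e') ~ %.3f\n", predict('e', 'q'));
    return 0;
}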

    My impression is that some CMs adjust the model's weight according to its success, but I'm not so sure about adjusting the weights of individual histories. I haven't seen it, but that's not saying a whole lot.

    Yes, I will start with a single model. I tend to do best with small, incremental changes. But I will put in numerical coding first. I think I can fit all the tables in memory if I do everything in byte chunks and use some math. Table management will probably be pretty CPU intensive, but getting rid of a lot of bit shifts might make up for that speed-wise.

    For now, I am making simple improvements to eliminate some unused codes and impossible codes when the previous symbol ends with the special capital encode symbol. My initial non-ideal implementation compresses enwik9 to 179.7 MB instead of 181.3 MB, but I need to finish the decoder, which unfortunately will probably be at least 10% - 20% slower.
    Last edited by Kennon Conrad; 25th July 2014 at 19:36.

  12. #251 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    PAQ9 uses long range LZP followed by a context model (ICM-ISSE chain). The LZP step hurts compression a little because the match codes aren't modeled as well as straight text, but speeds up compression a lot on highly compressible input.

    PAQ8 and ZPAQ -methods 4 and 5 use match models and mixers with no preprocessing. A match model looks for matching contexts and predicts whatever bit came next. ZPAQ uses ISSE chains like PAQ9. PAQ8 uses mostly ICM + mixers.
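    A hedged, byte-level simplification of the match model idea is sketched below (real PAQ/ZPAQ match models work at the bit level and feed a mixer; the context length, hash size, and names here are arbitrary choices): a hash of the last few bytes remembers where that context last occurred, and the byte that followed becomes the prediction.

    Code:
/* Byte-level match model sketch: remember where each recent context last
   occurred and predict the byte that followed it. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CTX_LEN   8
#define HASH_BITS 16
#define HASH_SIZE (1u << HASH_BITS)

static int32_t last_pos[HASH_SIZE];          /* context hash -> last position + 1 (0 = empty) */

static uint32_t hash_context(const uint8_t *buf, int pos) {
    uint32_t h = 2166136261u;                /* FNV-1a over the last CTX_LEN bytes */
    for (int i = pos - CTX_LEN; i < pos; i++)
        h = (h ^ buf[i]) * 16777619u;
    return h >> (32 - HASH_BITS);
}

int main(void) {
    const uint8_t *text = (const uint8_t *)
        "the quick brown fox jumps over the quick brown dog";
    int n = (int)strlen((const char *)text);
    int hits = 0, predictions = 0;

    for (int pos = CTX_LEN; pos < n; pos++) {
        uint32_t h = hash_context(text, pos);
        if (last_pos[h]) {                   /* context seen before: predict its next byte */
            predictions++;
            if (text[last_pos[h] - 1] == text[pos]) hits++;
        }
        last_pos[h] = pos + 1;               /* remember where this context ended */
    }
    printf("match model was right on %d of %d predictions\n", hits, predictions);
    return 0;
}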

  13. #252 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Matt Mahoney:
    PAQ9 uses long range LZP followed by a context model (ICM-ISSE chain). The LZP step hurts compression a little because the match codes aren't modeled as well as straight text, but speeds up compression a lot on highly compressible input.

    PAQ8 and ZPAQ -methods 4 and 5 use match models and mixers with no preprocessing. A match model looks for matching contexts and predicts whatever bit came next. ZPAQ uses ISSE chains like PAQ9. PAQ8 uses mostly ICM + mixers.
    Your comment was way over my head when I first read it. After doing some research, I think I see that ISSE chains adjust predictions based on the context history, and I see that the mixers adjust predictions according to their success rates. But what I am wondering is whether the weightings of the individual context history predictions are ever adjusted according to their individual success rates. For instance, if the very most recent history is the best one, does the prediction adjustment in ISSE ever get more heavily weighted toward the very most recent history and less toward the older histories?

  14. #253 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v10.0

    Tree v10.0 represents a test of how much compression can be improved by eliminating unused codes from the code tree while the tree is building, and by adding a (relatively lame) context model for the capital encode symbol for files that are capital encoded. I also adjusted the compressor to favor two-symbol matches that are near each other (rather than far apart), but the impact appears to be very minor. A couple of unlikely-to-occur bugs were also fixed. The header for capital encoded files has been expanded, so TreeBitEncode and TreeDecode are not compatible with previous versions.

    Results for enwik9:
    Compression (running a second compression simultaneously, so perhaps a bit slower than normal):
    TreePreEncode: 20 sec.
    TreeParagraphs: 1485 sec.
    TreeWords: 393 sec.
    TreeCompress: 70,732 sec.
    TreeBitEncode: 33 sec.
    Total: 72,663 sec.

    Decompression (TreeDecode): 18.5 sec.
    Compressed file size: 178,949,848 bytes
    Compressed TreeDecode.c: 12,174 bytes
    Total size: 178,962,022 bytes (was 181,331,708 bytes)
    LTCB ranking would be #36 instead of #39.

    Results for enwik8:
    Compression:
    TreePreEncode: 0.6 sec.
    TreeParagraphs: 39 sec.
    TreeWords: 24 sec.
    TreeCompress: 3697 sec.
    TreeBitEncode: 4.0 sec.
    Total: 3765 sec.

    Decompression (TreeDecode): 2.4 sec.
    Compressed file size: 22,072,432 bytes (was 22,366,748 bytes)

    I ran a slightly different version of TreeCompress without TreeWords and got 177,744,588 bytes for enwik9, but compression time was about 185K seconds. It seems like the cause of the difference is important to understand for writing a one-pass (or maybe 2-3 passes) tokenizer version.

    Decompression is about 25% slower for capital encoded files and it varies for other files.

    The special capital encode symbol model is responsible for all but about 500K of the improvement in enwik9. It is used for about 5% of the symbol predictions. The general model has not been adjusted for the presence of the capital encode symbol model.
    Attached Files
    Last edited by Kennon Conrad; 15th August 2014 at 10:04.

  15. #254 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    Tree v10.0
    I did the same test as in the lzturbo 1.2 thread:

    Input:
    635,346,739 bytes, IIS log file

    Output:
    15,851,524 bytes, 11,233.358 sec. - 1.125 sec., TreeComp.bat

    Is it possible to disable dual core support or select single core in treedecode?

  16. #255 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    I did the same test as in the lzturbo 1.2 thread:

    Input:
    635,346,739 bytes, IIS log file

    Output:
    15,851,524 bytes, 11,233.358 sec. - 1.125 sec., TreeComp.bat
    Thanks for posting this result. It is very encouraging.

    Quote Originally Posted by Sportman:
    Is it possible to disable dual core support or select single core in treedecode?
    Not currently, but it shouldn't be too difficult to add this capability. Let me get back to you on this in a few days.

  17. #256 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    Is it possible to disable dual core support or select single core in treedecode?
    You may try this version of TreeDecode. To run with a single thread, add -t1 on the command line after TreeDecode. For example, "TreeDecode -t1 enwik9.tree enwik9.rest" will use only one thread.

    Decompression time appears to be slightly longer than it was without having an option for the number of threads, and single-threaded performance is probably worse than it could be because I didn't eliminate the use of volatile variables in the single-threaded version.
    Attached Files

  18. #257 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    You may try this version of TreeDecode.
    Same IIS log file:
    1140 msec., TreeDecode
    1890 msec., TreeDecode -t1
    Last edited by Sportman; 2nd September 2014 at 19:50.

  19. #258 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    Same IIS log file:
    1140 msec., TreeDecode
    1890 msec., TreeDecode -t1
    Thanks. Not the greatest single-threaded performance, but making it faster would make the code significantly more verbose and harder to maintain, so I think it will stay this way for now.

  20. #259 - Kennon Conrad (Member, Bothell, Washington, USA)
    From the Fast LZ+EC Compressor thread:

    Quote Originally Posted by Piotr Tarsa:
    I wonder whether there is an efficient and concise way to mix ANS (asymmetric numeral systems) with ABC (asymmetric binary coder) and plain prefix codes (e.g. canonical Huffman) - with the goal of not building large lookup tables for prefix codes. If yes, then that could simplify the decoder when mixing arithmetic coding and bit coding.
    I'm not sure this is exactly what you had in mind, but this is what I am thinking will work for Tree, at least for now. Hopefully I will explain this clearly enough.

    I am adding 10-bit ANS coding, but maintaining all individual symbol codes and mtf queues as powers of two. For starters, non-integer code sizes (probability N of 1024) will only be supported for the special define symbol, the sum of the mtf queue sizes, and the sum of the mtf1 queue sizes. Probabilities for the special define symbol and individual mtf queues could be adjusted by about 0.1%, which should give enough accuracy to reduce bit losses from inaccurate static special define symbol and mtf queue probabilities. Since most symbols are more than 9 bits, the ANS state will often be 0 before sending the symbol and the majority of symbols will be able to be coded and decoded unmodified.
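    A tiny sketch of small-step probability adaptation at 10-bit precision (N of 1024) is below. The shift-based update rule and the adaptation rate are assumptions for illustration only, not Tree's planned rule.

    Code:
/* Sketch: a 10-bit probability (N of 1024) nudged toward each observed
   outcome by a fixed fraction of the remaining gap. Illustrative only. */
#include <stdio.h>

#define PROB_BITS   10
#define PROB_ONE    (1 << PROB_BITS)   /* probabilities are N / 1024        */
#define ADAPT_SHIFT 5                  /* step size: 1/32 of the remaining gap */

static int update(int p, int occurred) {
    if (occurred)
        p += (PROB_ONE - p) >> ADAPT_SHIFT;   /* move toward 1 */
    else
        p -= p >> ADAPT_SHIFT;                /* move toward 0 */
    return p;
}

int main(void) {
    int p = 100;                       /* start near 10% for an example symbol */
    int pattern[12] = {0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1};
    for (int i = 0; i < 12; i++) {
        p = update(p, pattern[i]);
        printf("after symbol %2d: p = %4d / 1024 (%.1f%%)\n",
               i, p, 100.0 * p / PROB_ONE);
    }
    return 0;
}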

    It seems like this would also be a step that could be extended to add relatively efficient context modelling. It doesn't seem practical to build a Huffman-like code tree for every model. I think I need to organize the symbols by leading subsymbols and need better resolution than 1/2**N for the probability of each symbol set that starts with any particular lowest level subsymbol (the starting alphabet).

    BTW, I found out one interesting fact about the latest Tree-compressed enwik9 file. The symbol that has the deepest depth for its leading character has a depth of 28. In other words, for this symbol, the code needs to look at the starting symbol that makes up the symbol (and repeat) 28 times before it gets to a symbol from the original alphabet. I am surprised it's that deep. The deepest last character is 17 symbols deep, which is closer to what I would have guessed. It's going to require a lot of RAM to keep histories for 4 million symbols (peak #) x 17 ending subsymbols. Fortunately, the average is less than 17, but it still seems like too much memory. Hopefully there are smart ways to ignore some of the contexts or combine contexts. Combining contexts for matching subsymbols might be best since that would get it back down to 4 million contexts, which still seems like a huge number. If I do that and also limit contexts to those that occur perhaps 100 or more times, maybe the number starts looking more reasonable. It would be easy enough to filter out contexts when the symbol is defined since the frequency (to the closest power of 2) is sent explicitly.
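    A small sketch of that leading-subsymbol depth measurement, assuming a dictionary where entries below the original alphabet size are terminal characters and each higher entry records its leading component symbol (the names and the toy dictionary are made up):

    Code:
/* Sketch: how many expansion steps the leading subsymbol of a dictionary
   symbol needs before reaching the original alphabet. Illustrative only. */
#include <stdio.h>

#define ALPHABET_SIZE 256

/* first_component[s - ALPHABET_SIZE] = leading symbol of dictionary entry s.
   Toy dictionary: 256 = "th", 257 = "the", 258 = "the " */
static const int first_component[] = {'t', 256, 257};

static int leading_depth(int sym) {
    int depth = 0;
    while (sym >= ALPHABET_SIZE) {            /* keep expanding the leading symbol */
        sym = first_component[sym - ALPHABET_SIZE];
        depth++;
    }
    return depth;                             /* steps until an original character */
}

int main(void) {
    printf("depth of symbol 258: %d\n", leading_depth(258));   /* prints 3 */
    return 0;
}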

    Any thoughts are appreciated. I have never tried implementing code that did anything like this, so if I am missing some key idea(s), it would be good to find out now. Also, if there is any prior research on context modeling when dissecting a multi-level dictionary, I would be very interested in hearing about it.

  21. #260 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    I added your results to LTCB. http://mattmahoney.net/dc/text.html#1789

  22. The Following User Says Thank You to Matt Mahoney For This Useful Post: Kennon Conrad (18th August 2014)

  23. #261 - Piotr Tarsa (Member, Kraków, Poland)
    Kennon Conrad:
    I was talking about interleaving ANS, ABC and Huffman coding in one bit stream. FWIU, ANS and ABC can be fully interleaved when the state size is equal in both coding schemes. A problem remains with mixing Huffman into that mix, but since encoding and decoding are symmetrical in ANS/ABC/Huffman, you can have a separate Huffman coder and do the interleaving on the lowest level, i.e. the level of the separate bit strings that are produced in single coding steps.

  24. #262 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Piotr Tarsa:
    Kennon Conrad:
    I was talking about interleaving ANS, ABC and Huffman coding in one bit stream. FWIU, ANS and ABC can be fully interleaved when the state size is equal in both coding schemes. A problem remains with mixing Huffman into that mix, but since encoding and decoding are symmetrical in ANS/ABC/Huffman, you can have a separate Huffman coder and do the interleaving on the lowest level, i.e. the level of the separate bit strings that are produced in single coding steps.
    Okay. I guess I'm not so clear on ANS. Reading a bit further, I think what I am really looking at is a mix of ABC and Huffman. I see talk of pseudo-random number generators with ANS and I'm not doing anything like that. All of my symbols would still have Huffman codes except the special define symbol, but the Huffman codes would be run through an ABC coder, which sets the probabilities of 4 things: an mtf symbol (any), an mtf1 symbol, a static Huffman code, and the special define symbol. When any of the first three are coded, the particular fixed Huffman code would be run through the ABC coder so that it gets the appropriate probability. I don't see any big problems, but I have never done this before and could be missing something important. I might have been wrong when I said the ANS state would be 0 after most symbols. I think the probability adjustment may cause it to mostly not be 0. Not a big deal, I can figure that part out when I get there.

    Perhaps this is not close to what you were talking about, but nonetheless my interpretation of your idea seems to have led to something that will fairly easily allow me to adjust Tree's probabilities in fairly small steps for mtf symbols and the special define symbol without slowing down encoding and decoding by a huge amount. It's basically ABC on top of Huffman, with the ABC coding doing the probability adjustments and Huffman providing the ability to keep static codes throughout decoding (except for the symbols that get moved in or out of the mtf queues).

    Piotr, I see significant error in the current probabilities for the special define symbol and mtf symbols along the lines of what you suggested and would like to improve this. If you have an idea you think is better, a simple description would be golden.

  25. #263 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.11

    Tree v0.11 is being released because it reduces the Pareto Frontier decompression time for enwik9 at 178 MB - 194 MB+ compressed file sizes by more than 10%. This version of the code was created as a step toward adding dynamic predictions. I didn't really expect decoding to be so much faster.

    Only TreeBitEncode and TreeDecode have been changed, plus the example batch files have been modified to take the input file name as a command line argument.

    All prior versions of Tree build prefix trees as symbols are created, and the decoder uses code length table(s) and symbol table(s) to decode symbol codes. Tree v0.1 - v0.9 use one code length table and one symbol table. Tree v0.10 uses two code length tables and two symbol tables for capital encoded files (1 x 2^25 + 4 x 2^25 + 1 x 2^22 + 4 x 2^22 bytes for enwik9).

    Tree v0.11 uses 4K sorted single-length symbol bins with a 4K code length lookup table. It calculates symbol indexes from the input, the code length, and the index of the first bin for that code length. The lookup tables are dynamically adjusted as symbols are received. The unfilled portion of the table is used to code the special new symbol code and to shorten other codes after the largest possible power-of-two number of bins has been allocated to the special new symbol code.
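    A minimal sketch of this style of decoding for canonical prefix codes is below: symbols of each code length occupy contiguous bins, and the symbol index is computed from the code value, its length, and the first bin for that length. It is illustrative only, not Tree's implementation; the table names and toy code lengths are made up.

    Code:
/* Canonical prefix-code round trip using first-code/first-bin tables.
   Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define NSYM    6
#define MAX_LEN 4

static const uint8_t len[NSYM] = {2, 2, 3, 3, 3, 3};   /* code lengths (Kraft-complete) */

static uint32_t first_code[MAX_LEN + 1];  /* smallest canonical code of each length  */
static uint32_t first_bin[MAX_LEN + 1];   /* index of first symbol of each length    */
static uint8_t  sorted_sym[NSYM];         /* symbols sorted into length-ordered bins */
static uint32_t code_of[NSYM];            /* canonical code assigned to each symbol  */

static void build_tables(void) {
    uint32_t count[MAX_LEN + 1] = {0};
    for (int s = 0; s < NSYM; s++) count[len[s]]++;
    uint32_t code = 0, bin = 0;
    for (int l = 1; l <= MAX_LEN; l++) {
        code <<= 1;
        first_code[l] = code;
        first_bin[l]  = bin;
        code += count[l];
        bin  += count[l];
    }
    uint32_t next[MAX_LEN + 1];
    for (int l = 1; l <= MAX_LEN; l++) next[l] = first_bin[l];
    for (int s = 0; s < NSYM; s++) {
        uint32_t b = next[len[s]]++;
        sorted_sym[b] = (uint8_t)s;
        code_of[s]    = first_code[len[s]] + (b - first_bin[len[s]]);
    }
}

/* Extend the code one bit at a time until it falls inside the range of codes
   of that length, then index the sorted bin directly. */
static int decode_one(uint64_t stream, int *pos) {
    uint32_t code = 0;
    for (int l = 1; l <= MAX_LEN; l++) {
        code = (code << 1) | (uint32_t)((stream >> (63 - (*pos)++)) & 1);
        uint32_t cnt = (l < MAX_LEN ? first_bin[l + 1] : (uint32_t)NSYM) - first_bin[l];
        if (code - first_code[l] < cnt)
            return sorted_sym[first_bin[l] + (code - first_code[l])];
    }
    return -1;                            /* invalid stream */
}

int main(void) {
    build_tables();
    const int msg[5] = {4, 0, 2, 1, 5};
    uint64_t stream = 0;                  /* pack codes MSB-first */
    int w = 0;
    for (int i = 0; i < 5; i++) {
        stream |= (uint64_t)code_of[msg[i]] << (64 - w - len[msg[i]]);
        w += len[msg[i]];
    }
    int pos = 0;
    for (int i = 0; i < 5; i++) printf("%d ", decode_one(stream, &pos));
    printf("\n");                         /* expect: 4 0 2 1 5 */
    return 0;
}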

    Results for v0.11 using the tokenized files generated with v0.10:

    enwik9:
    TreeBitEncode: 28 seconds (was 33 seconds)
    TreeDecode: 13.3 seconds (was 18.5* seconds)
    compressed file size: 178,773,808 bytes (was 178,949,848 bytes)
    RAM usage: about 340 MB (was about 475 MB)
    *: Pareto Frontier

    enwik8:
    TreeBitEncode: 3.5 seconds (was 4.0 seconds)
    TreeDecode: 1.6 seconds (was 2.4 seconds)
    compressed file size: 22,076,556 bytes (was 22,072,432 bytes)

    The test machine is still an A8-5500 @ 3.2 GHz. Decompression of non-text files can be worse. I took a couple of short-cuts to get this working as a starting point for predicting probabilities and don't want to spend a lot of time on code that would just get taken back out later.
    Attached Files

  26. #264 - Sportman (Member, Planet Earth)
    Same IIS log file:
    16,194,436 bytes (before 15,851,524 bytes)
    Decode:
    1140 msec (before 1140 msec)
    -t1:
    1734 msec (before 1890 msec)

  27. #265 - Kennon Conrad (Member, Bothell, Washington, USA)
    Interesting. My first thought is that thread #2 (sending decoded symbols to the output) dominates the decoding time on this file due to the very high compression ratio. Is this IIS log file publicly available?

  28. #266 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    Same IIS log file:
    16,194,436 bytes (before 15,851,524 bytes)
    Decode:
    1140 msec (before 1140 msec)
    -t1:
    1734 msec (before 1890 msec)
    Hi Sportman,

    I think the attached version of TreeDecode will run much faster on your ~97.5% reduced IIS log file with either one or two threads. If my simulation is correct, this will reduce decompression time for either option by several hundred milliseconds. I suspect the modest increase in file size is caused by one or more relatively high-frequency symbols that I took a short-cut on. When mtf symbols have a separate model, the inefficiency can be easily removed.
    Attached Files

  29. #267 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    Is this IIS log file publicly available?
    Sorry, I'm not allowed to spread it.

  30. #268 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    If my simulation is correct, this will reduce decompression time for either option by several hundred milliseconds.
    Are you ready?

    Same IIS log file:
    609 msec. (before 1140 msec.), TreeDecode
    1156 msec. (before 1724 msec.), TreeDecode -t1

    I do not know what you did, but it's magical.

  31. The Following User Says Thank You to Sportman For This Useful Post: Kennon Conrad (3rd September 2014)

  32. #269 - Matt Mahoney (Expert, Melbourne, Florida, USA)

  33. #270 - Kennon Conrad (Member, Bothell, Washington, USA)
    I bought a new computer because testing compression was taking longer than I wanted. My new computer is a 4.0 GHz i7-4790K with 16 GB of RAM @ 1866 MHz and a 128 GB SSD. TreeCompress produced a slightly bigger file, but I'm not exactly sure why.

    Timings are as follows:

    enwik9
    Old computer (A8-5500 @ 3.2 GHz):

    Compress:
    TreePreEncode: 20 sec.
    TreeParagraphs: 1485 sec.
    TreeWords: 393 sec.
    TreeCompress: 70,732 sec.
    TreeBitEncode: 28 sec.
    178,773,808 byte file in 72,658 sec.

    Decompress:
    TreeDecode: 13.3 sec.

    New computer:
    Compress:
    TreePreEncode: 2.5 sec.
    TreeParagraphs: 786 sec.
    TreeWords: 195 sec.
    TreeCompress: 35,762 sec.
    TreeBitEncode: 19 sec.
    178,782,844 byte file in 36,765 sec.

    Decompress:
    TreeDecode: 7.9 sec.

    enwik8
    Old computer:

    Compress:
    TreePreEncode: 0.6 sec.
    TreeParagraphs: 39 sec.
    TreeWords: 24 sec.
    TreeCompress: 3697 sec.
    TreeBitEncode: 3.5 sec.
    22,076,556 byte file in 3764 sec.

    Decompress:
    TreeDecode: 1.6 sec.

    New computer:
    Compress:
    TreePreEncode: 0.4 sec.
    TreeParagraphs: 21 sec.
    TreeWords: 12 sec.
    TreeCompress: 1874 sec.
    TreeBitEncode: 1.9 sec.
    22,076,556 byte file in 1910 sec.

    Decompress:
    TreeDecode: 0.8 sec.

    AFAIK, that's the first time enwik9 has been reconstructed from a file of less than 250 MB in less than 9 seconds.


