
Thread: Tree alpha v0.1 download

  #211 - Piotr Tarsa (Member, Kraków, Poland)
    Decent improvement!

  #212 - Sportman (Member, Planet Earth)
    Tree has become a data-eating monster now.

    enwik9 181,398,728 bytes 36,167.298 sec. - 12.165 sec.

    FP.log 592,068 bytes (UTF-8) and AMillionRandomDigits.bin 340,476 bytes (sounds impossible) both crash on decompression.


  #213 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    Tree has become a data-eating monster now.

    enwik9 181,398,728 bytes 36,167.298 sec. - 12.165 sec.

    FP.log 592,068 bytes (UTF-8) and AMillionRandomDigits.bin 340,476 bytes (sounds impossible) both crash on decompression.
    Sportman, I have duplicated the problem on AMillionRandomDigits and will check it out.

    For FP.LOG I also get compression from 20,617,071 bytes to 592,068 bytes, but decompression runs to completion in 93 milliseconds and the output verifies. Can you double check your results and/or post both the compressed and encoded files? Alternatively, could you unzip the attachment and compare the file to your compressed file?
    Attached Files
    Last edited by Kennon Conrad; 22nd June 2014 at 03:54.

  #214 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    Can you double check your results and/or post both the compressed and encoded files? Alternatively, could you unzip the attachment and compare the file to your compressed file?
    Your FP.log.tree is identical.
    I even see that FP.log.rest from the old test is identical to FP.log.

    After rerun I get:

    1,910,386 bytes FP.log.tcom (old test)
    1,915,995 bytes FP.log.tcom (new test, FP.log.twor is identical to old test)
    596,336 bytes FP.log.tree (new test)

    TreeCompress FP.log.twor FP.log.tcom
    Allocated 1862270976 bytes for data processing
    Read 18582836 byte input file
    cap encoded: 0, UTF8_complaint 0
    Found 15989215 tokens, 232 defines
    1: 15989215 syms, dict. size 232, 5.5146 bits/sym, o0e 11021749 bytes
    Tree scan 19/19 7a.0 - 1e7.1e7, score[7500] = 39603304887.51
    2: 13355224 syms, dict. size 316, 5.5040 bits/sym, o0e 9188427 bytes
    Tree scan 12/12 75.0 - 23b.23b, score[7039] = 19802818940.04
    3: 11951939 syms, dict. size 363, 5.4920 bits/sym, o0e 8204950 bytes
    Tree scan 9/9 1d7.0 - 26a.26a, score[7550] = 12942326667.54
    (cut)
    70: 549686 syms, dict. size 67504, 12.1758 bits/sym, o0e 836605 bytes
    Tree scan 1/1 0.0 - 108af.108af, score[4] = 4.24
    71: 549654 syms, dict. size 67507, 12.1765 bits/sym, o0e 836610 bytes
    Tree scan 1/1 0.0 - 108b2.108b2, score[1] = 7.69
    72: 549653 syms, dict. size 67508, 12.1766 bits/sym, o0e 836612 bytes
    Tree scan 1/1 0.0 - 108b3.108b3
    Run time 112 seconds.

    TreeBitEncode FP.log.tcom FP.log.tree
    cap encoded: 0, UTF8_compliant 0
    Read 549653 tokens including 67508 token definition symbols
    Tokenize done
    Embed tokens done
    Compressed file of size 596336 bytes uses a 67246 'letter' alphabet
    elapsed time = 0.046000 seconds.

    Windows TreeDecode crash popup message:
    Problem signature:
    Problem Event Name: APPCRASH
    Application Name: TreeDecode.exe
    Application Version: 0.0.0.0
    Application Timestamp: 53a524e5
    Fault Module Name: StackHash_c7e4
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 00000000
    Exception Code: c0000005
    Exception Offset: PCH_03_FROM_TreeDecode+0x000038D4

  #215 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.8a

    TreeBitEncode has been fixed so it doesn't crash on AMillionRandomDigits.

    Sportman, FP.LOG is crashing because TreeWords (and TreeParagraphs) only works on UTF-8 compliant text files. FP.LOG is not UTF-8 compliant. Adding support for other files or aborting the program with a clear message would be good things to do, but I haven't gotten around to it yet.
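    For reference, a minimal structural UTF-8 check of the kind that could drive such an abort might look like the sketch below. The helper names are hypothetical and this is not Tree's actual code; it only checks lead/continuation byte structure, which would be enough to refuse a file like FP.LOG with a clear message instead of letting TreeWords misbehave later.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper, not Tree's code: accepts any structurally valid
       UTF-8 byte stream (it does not reject overlong forms or surrogates). */
    static int utf8_ok(const unsigned char *p, size_t len) {
        size_t i = 0, k, extra;
        while (i < len) {
            unsigned char c = p[i];
            if (c < 0x80)                extra = 0;  /* ASCII byte        */
            else if ((c & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence   */
            else if ((c & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence   */
            else if ((c & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence   */
            else return 0;                           /* invalid lead byte */
            if (extra > 0 && i + extra >= len) return 0;      /* truncated */
            for (k = 1; k <= extra; k++)
                if ((p[i + k] & 0xC0) != 0x80) return 0;  /* bad continuation */
            i += extra + 1;
        }
        return 1;
    }

    /* How a TreeWords-style front end could use it: */
    static void require_utf8_or_die(const unsigned char *buf, size_t len,
                                    const char *prog) {
        if (!utf8_ok(buf, len)) {
            fprintf(stderr, "%s: input is not UTF-8 compliant; "
                            "skip TreeWords/TreeParagraphs for this file\n", prog);
            exit(1);
        }
    }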
    Attached Files

  #216 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    FP.LOG is not UTF-8 compliant.
    TreeDecode also crashes when I save FP.log with Notepad as a 20,617,098-byte UTF-8 file, FP2.log.

    TreePreEncode FP2.log FP2.log.tpre
    Pre encoding 20617098 bytes
    Wrote 1 byte header and 20617098 encoded bytes in 0.046 seconds.

    TreeParagraphs FP2.log.tpre FP2.log.tpar

    Allocated 1795162112 bytes for data processing
    Read 20617099 byte input file
    Found 20617072 tokens, 0 defines, maximum unicode value 0xfeff
    1: 20617072 syms, dict. size 0, 5.3936 bits/sym, o0e 13900017 bytes
    2: 20614409 syms, dict. size 12, 5.3937 bits/sym, o0e 13898373 bytes
    3: 20613145 syms, dict. size 14, 5.3937 bits/sym, o0e 13897566 bytes
    Run time 0 seconds.

    TreeWords FP2.log.tpar FP2.log.twor

    Allocated 1795162112 bytes for data processing
    Read 20613337 byte input file
    Found 20613145 tokens, 14 defines, maximum unicode value 0xfeff
    1: 20613145 syms, dict. size 14, 5.3937 bits/sym, o0e 13897566 bytes
    2: 16149572 syms, dict. size 228, 5.4995 bits/sym, o0e 11101916 bytes
    3: 15989216 syms, dict. size 232, 5.5146 bits/sym, o0e 11021752 bytes
    Run time 0 seconds.

    TreeCompress FP2.log.twor FP2.log.tcom
    Allocated 1862270976 bytes for data processing
    Read 18582863 byte input file
    cap encoded: 0, UTF8_complaint 1
    Found 15989216 tokens, 232 defines, maximum unicode value 0xfeff
    1: 15989216 syms, dict. size 232, 5.5146 bits/sym, o0e 11021752 bytes
    Tree scan 19/19 7a.0 - 300e7.300e7, score[7500] = 39603308937.10
    2: 13355225 syms, dict. size 316, 5.5040 bits/sym, o0e 9188430 bytes
    Tree scan 12/12 75.0 - 3013b.3013b, score[7038] = 19802821473.39
    3: 11951940 syms, dict. size 363, 5.4920 bits/sym, o0e 8204953 bytes
    Tree scan 9/9 300d7.0 - 3016a.3016a, score[7550] = 12942369893.30
    (cut)
    68: 545867 syms, dict. size 66232, 12.1908 bits/sym, o0e 831821 bytes
    Tree scan 1/1 0.0 - 402b7.402b7, score[1] = 5.71
    69: 545857 syms, dict. size 66233, 12.1911 bits/sym, o0e 831822 bytes
    Tree scan 1/1 0.0 - 402b8.402b8, score[1] = 4.31
    70: 545851 syms, dict. size 66234, 12.1912 bits/sym, o0e 831824 bytes
    Tree scan 1/1 0.0 - 402b9.402b9
    Run time 112 seconds.

    TreeBitEncode FP2.log.tcom FP2.log.tree
    cap encoded: 0, UTF8_compliant 1
    Read 545851 tokens including 66234 token definition symbols
    Tokenize done
    Embed tokens done
    Compressed file of size 593324 bytes uses a 66021 'letter' alphabet
    elapsed time = 0.046000 seconds.

    TreeDecode FP2.log.tree FP2.log.rest
    Problem signature:
    Problem Event Name: APPCRASH
    Application Name: TreeDecode.exe
    Application Version: 0.0.0.0
    Application Timestamp: 53a682a0
    Fault Module Name: StackHash_c7e4
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 00000000
    Exception Code: c0000005
    Exception Offset: PCH_03_FROM_TreeDecode+0x000038D4


  #217 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.8b

    Quote Originally Posted by Sportman:
    TreeDecode also crashes when I save FP.log with Notepad as a 20,617,098-byte UTF-8 file, FP2.log.
    I really appreciate you finding files that cause problems.

    TreeBitEncode has been modified to fix the bug. I am surprised this one didn't cause problems on a lot more files. Compression should be the same on files that were not affected by the bug.

    My results for FP.log are as follows:
    FP.log (without TreeParagraphs or TreeWords): 592,068 bytes
    FP2.log (without TreeParagraphs or TreeWords): 592,048 bytes
    FP2.log (with TreeParagraphs and TreeWords): 593,596 bytes
    FP2.log (with TreeParagraphs, without TreeWords): 579,824 bytes
    FP2.log (without TreeParagraphs, with TreeWords): 589,556 bytes

    It is interesting that running either TreeParagraphs or TreeWords alone makes a smaller file, but running both makes a bigger file. I think it is related to the losses caused by using integer-length codes. I should probably try generating Huffman codes rather than codes that are more like Polar codes.
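    For comparison, and purely as an illustration rather than Tree's coder, Huffman code lengths can be computed from symbol counts by repeatedly merging the two least frequent live nodes and reading off each leaf's depth. The counts below are made up:

    #include <stdio.h>

    #define NSYM 6

    int main(void) {
        long count[2 * NSYM] = {45, 13, 12, 16, 9, 5};  /* made-up symbol counts */
        int  parent[2 * NSYM];
        int  alive[2 * NSYM];
        int  n = NSYM, live = NSYM, i;

        for (i = 0; i < 2 * NSYM; i++) { parent[i] = -1; alive[i] = (i < NSYM); }

        while (live > 1) {                    /* merge until one root remains */
            int lo1 = -1, lo2 = -1;
            for (i = 0; i < n; i++) {
                if (!alive[i]) continue;
                if (lo1 < 0 || count[i] < count[lo1]) { lo2 = lo1; lo1 = i; }
                else if (lo2 < 0 || count[i] < count[lo2]) lo2 = i;
            }
            parent[lo1] = parent[lo2] = n;    /* new internal node */
            count[n] = count[lo1] + count[lo2];
            alive[lo1] = alive[lo2] = 0;
            alive[n] = 1;
            n++;
            live--;
        }

        for (i = 0; i < NSYM; i++) {          /* leaf depth = code length */
            int len = 0, j = i;
            while (parent[j] >= 0) { j = parent[j]; len++; }
            printf("symbol %d: count %ld, code length %d bits\n", i, count[i], len);
        }
        return 0;
    }

    For these counts the symbol seen 45 times gets a 1-bit code while the two rarest get 4 bits; a scheme that rounds each symbol's code length independently gives up a little compression relative to this.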
    Attached Files

  #218 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.8c

    I found more encoder bugs while testing the files in calgary.tar. I also cleaned up TreeDecode a little and have it running about 1% faster.

    Results for calgary.tar files:
    BIB: 33,860 (30.4%)
    BOOK1: 239,280 (31.4%)
    BOOK2: 162,368 (26.6%)
    GEO: 65,052 (63.5%)
    NEWS: 122,428 (32.5%)
    OBJ1: 11,340 (52.7%)
    OBJ2: 81,880 (33.2%)
    PAPER1: 18,832 (35.4%)
    PAPER2: 28,204 (34.3%)
    PIC: 53,436 (10.4%)
    PROGC: 14,004 (35.4%)
    PROGL: 16,920 (23.6%)
    PROGP: 11,700 (23.7%)
    TRANS: 19,556 (20.9%)

    Total: 878,860 bytes

    That's 3,644 bytes more than is achieved by compressing calgary.tar.
    Attached Files

  #219 - Piotr Tarsa (Member, Kraków, Poland)
    Try some grouping by content type, e.g. split the Calgary corpus into 4 sets:
    - GEO,
    - OBJ1, OBJ2,
    - PIC,
    - rest
    Then tar every segment and compress all of them separately.

  #220 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Piotr Tarsa:
    Try some grouping by content type, e.g. split the Calgary corpus into 4 sets:
    - GEO,
    - OBJ1, OBJ2,
    - PIC,
    - rest
    Then tar every segment and compress all of them separately.
    That gives 840,384 bytes, about 4% less.

  #221 - nburns (Member, San Diego)
    I think the problem with going bottom-up, like RePair does, is that it causes you to base your decisions on bogus statistics.

    For instance, you may create a pair based on a count of 100 occurrences of that pair. But if, later, you create a higher-order token that contains that pair and appears 25 times, the first pair no longer occurs 100 times; it only occurs about 75 times (100 - 25). So you're going to hopelessly overcount early tokens.

    The top-down approach can count accurately how often a token will appear in the final stream.
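    A tiny made-up illustration of that overcounting effect (not RePair itself): count a digram, replace a longer phrase that contains it with a single token, and count again.

    #include <stdio.h>
    #include <string.h>

    /* count occurrences of a two-byte pair in a string */
    static int count_pair(const char *s, const char *pair) {
        int n = 0;
        for (; *s && s[1]; s++)
            if (s[0] == pair[0] && s[1] == pair[1]) n++;
        return n;
    }

    int main(void) {
        char text[] = "the cat and the dog see the fish in the thin mist";
        char reduced[sizeof text];
        int i = 0, j = 0;

        printf("\"th\" pairs before: %d\n", count_pair(text, "th"));

        /* tokenize the longer phrase "the " as the single symbol '#' */
        while (text[i]) {
            if (strncmp(text + i, "the ", 4) == 0) { reduced[j++] = '#'; i += 4; }
            else reduced[j++] = text[i++];
        }
        reduced[j] = '\0';

        printf("\"th\" pairs after:  %d\n", count_pair(reduced, "th"));
        return 0;
    }

    On this toy input the digram is counted 5 times up front but only once after the longer phrase becomes a token, so a count taken before the merge badly overstates how often the pair will appear in the final stream.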

  #222 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    I think the problem with going bottom-up, like RePair does, is that it causes you to base your decisions on bogus statistics.

    For instance, you may create a pair based on a count of 100 occurrences of that pair. But if, later, you create a higher-order token that contains that pair and appears 25 times, the first pair no longer occurs 100 times; it only occurs about 75 times (100 - 25). So you're going to hopelessly overcount early tokens.

    The top-down approach can count accurately how often a token will appear in the final stream.
    I don't intend to ever go bottom up. I do think many reasonable top-down decisions can be made using a relatively small window at the start of the data along with Markov chains and then adding really long matches as they are found.

    I still think about your ideas about alternate data structures. If I were starting over today, I would seriously consider building an LCP array containing the string pointer, the next LCP array element, the previous LCP array element, and the common prefix length of adjacent elements, plus a separate array of sorted LCP indexes. I think that would be 20 bytes per input byte and provide a structure that can be used similarly to Tree's tree for scoring purposes.
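    A rough sketch of that layout, with field names that are mine rather than Tree's: with 32-bit fields the per-suffix record is 16 bytes, and the separate sorted-index array adds 4 more, which is where the 20 bytes per input byte comes from.

    #include <stdint.h>
    #include <stdlib.h>

    struct lcp_entry {
        uint32_t text_pos;  /* string pointer: where this suffix starts in the input */
        uint32_t next;      /* next element in LCP array (lexicographic) order       */
        uint32_t prev;      /* previous element in LCP array order                   */
        uint32_t lcp;       /* common prefix length shared with the adjacent element */
    };                      /* 16 bytes per input byte */

    static struct lcp_entry *entries;          /* 16n bytes for n input bytes */
    static uint32_t         *sorted_lcp_index; /*  4n bytes for n input bytes */

    static int alloc_lcp_arrays(size_t n) {
        entries          = malloc(n * sizeof *entries);
        sorted_lcp_index = malloc(n * sizeof *sorted_lcp_index);
        return entries != NULL && sorted_lcp_index != NULL;
    }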

  #223 - Kennon Conrad (Member, Bothell, Washington, USA)
    If anyone is interested, the attachment contains two files. One shows the first ~500 coded strings and the other contains the last ~500 coded strings of enwik8 with the code length for symbols that previously resided in the dictionary.
    Attached Files
    Last edited by Kennon Conrad; 25th June 2014 at 09:07.

  #224 - nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad:
    I don't intend to ever go bottom up. I do think many reasonable top-down decisions can be made using a relatively small window at the start of the data along with Markov chains and then adding really long matches as they are found.
    I know that you rejected bottom-up early. Bottom-up tends to yield simpler algorithms, though. I was holding out some hope that there might be some bottom-up strategy that's equivalent, which would show the way to a simple and fast algorithm. It's starting to look like top-down may be strictly more powerful than bottom-up, though, so there may be no purely bottom-up solution that's worthwhile.

    Quote Originally Posted by Kennon Conrad:
    I still think about your ideas about alternate data structures. If I were starting over today, I would seriously consider building an LCP array containing the string pointer, the next LCP array element, the previous LCP array element, and the common prefix length of adjacent elements, plus a separate array of sorted LCP indexes. I think that would be 20 bytes per input byte and provide a structure that can be used similarly to Tree's tree for scoring purposes.
    Scoring seems pretty easy, at least comparatively. There are suffix array-based data structures and algorithms that will give you exactly what you need. The trickier problem seems to be overlaps.

    The biggest reason that I would use arrays is basically laziness. The code is already out there, and you can get interesting results very quickly. And then there's the fact that trees are unlikely to win in the end, anyway.

  #225 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    I know that you rejected bottom-up early. Bottom-up tends to yield simpler algorithms, though. I was holding out some hope that there might be some bottom-up strategy that's equivalent, which would show the way to a simple and fast algorithm. It's starting to look like top-down may be strictly more powerful than bottom-up, though, so there may be no purely bottom-up solution that's worthwhile.
    For large files like enwik9, it is quite possible to use smaller windows of data initially, as long as all of the common prefixes are eventually analyzed to pick up the long, distant matches. The compressed file size can be pretty similar with much faster compression. Since Tree is experimental, I am aiming for maximum compression with this method rather than fast compression.

    The future single-pass version of TreeCompress will utilize this technique. Overlaps will also be much less of a problem in terms of compression time because the overlapping symbol candidates will be dealt with at the point where they are first encountered.

  #226 - nburns (Member, San Diego)
    I think it makes the problem of overlapping tokens simpler if you are always looking for pairs, rather than arbitrarily-long matches. RePair is pretty simple, at least from a high-level theory perspective. I wonder if it might help simplify the Tree approach if you only looked for tokens of one particular length at a time. You would start with the maximum token length (say, n, or n/2) and iterate through each possible length all the way to zero.

  #227 - Sportman (Member, Planet Earth)
    Quote Originally Posted by nburns:
    I wonder if it might help simplify the Tree approach if you only looked for tokens of one particular length at a time. You would start with the maximum token length (say, n, or n/2) and iterate through each possible length all the way to zero.
    Some of the things you have suggested in this Tree thread (including this one) I tested with krc in the past, before you wrote them, and none gave better compression; only the opposite, worse compression in my experiments. Strangely enough, I even got worse compression when I increased the maximum keyword length from 19 to 32 while krc still used the maximal-profit-first method, which has worked best so far.

    I have no idea what Tree does, as I can't read or program C code, but it does something very good with big files, and very quickly.

  #228 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    I think it makes the problem of overlapping tokens simpler if you are always looking for pairs, rather than arbitrarily-long matches. RePair is pretty simple, at least from a high-level theory perspective. I wonder if it might help simplify the Tree approach if you only looked for tokens of one particular length at a time. You would start with the maximum token length (say, n, or n/2) and iterate through each possible length all the way to zero.
    My results from testing Tree with that approach are similar to Sportman's. If the token finder is too greedy on length, it creates a lot of less-than-ideal symbol boundaries. My experimentation showed me that starting with long, frequent matches works best, and those often are not the same as the longest matches.

  #229 - nburns (Member, San Diego)
    Quote Originally Posted by Sportman:
    Some of the things you have suggested in this Tree thread (including this one) I tested with krc in the past, before you wrote them, and none gave better compression; only the opposite, worse compression in my experiments.
    Thanks for the feedback. Now that I'm picturing it in my head, I can imagine where long matches could be poor choices, because choosing them could split shorter matches with greater numbers of repetitions and potentially prevent the optimal choice from being made later.

    Quote Originally Posted by Sportman:
    Strangely enough, I even got worse compression when I increased the maximum keyword length from 19 to 32 while krc still used the maximal-profit-first method, which has worked best so far.
    I assume your cost function is based on Kennon's. Tree and krc seem to depend a lot on fudge-factors and heuristics. I would love to come up with a well-conceived theory of these compressors, but I don't have much to show, yet.

    Quote Originally Posted by Sportman:
    I have no idea what Tree does, as I can't read or program C code, but it does something very good with big files, and very quickly.
    It's not the most reader-friendly C code I've seen. Kennon seems to know it extremely well, though.

  #230 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    I can imagine where long matches could be poor choices, because choosing them could split shorter matches with greater numbers of repetitions and potentially prevent the optimal choice from being made later.
    That is exactly right.

    Quote Originally Posted by nburns:
    I assume your cost function is based on Kennon's. Tree and krc seem to depend a lot on fudge-factors and heuristics. I would love to come up with a well-conceived theory of these compressors, but I don't have much to show, yet.
    Two main theories that Tree exploits heavily are Markov modeling and the observation (sorry, I can't remember whose it is or its exact statement) that the number of symbols that occur N times is approximately halved whenever N doubles. I would love to find the optimum dictionary set, but AFAIK that problem has not been solved.

    Quote Originally Posted by nburns:
    It's not the most reader-friendly C code I've seen. Kennon seems to know it extremely well, though.
    I think you are putting things mildly. Back in the days when I managed software engineers, I would not have been pleased about several aspects of the code, including the lack of headers with high-level descriptions, the lack of sufficient comments, the single-file approach, and the lack of function calls. Maintenance or changes by anyone who didn't know the code would be a nightmare. Over time it is slowly getting better, but I have empathy for anyone who tries to figure out exactly what the code is doing in certain places.

  #231 - nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad:
    Two main theories that Tree exploits heavily are Markov modeling and the observation (sorry, I can't remember whose it is or its exact statement) that the number of symbols that occur N times is approximately halved whenever N doubles. I would love to find the optimum dictionary set, but AFAIK that problem has not been solved.
    The part I emphasized sounds somewhat like a Zipf distribution: if you put symbols in order of decreasing frequency, with f_1 = the frequency of the most common symbol (using =~ to mean approximately equal),

    f_i =~ f_1 / i

    I was going to say that implies what you wrote, but actually, what you wrote is a little different. You can expect the distribution to be Zipf, and I think your statement will be approximately correct for some segments of the plot, at least. A Zipf distribution becomes a straight line on a log-log plot.

    By the way, it's not hard to generate and plot statistics, and plotting this data would help to make the case for these assumptions (or refute them).
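    For example, tallying how many symbols fall into each doubling occurrence bin is only a few lines of C (a rough sketch, not Tree's code; the demo counts are made up). Under the "halves when N doubles" assumption, each bin should hold roughly twice as many symbols as the next bin up:

    #include <stdio.h>

    /* count how many symbols land in each occurrence bin [2^b, 2^(b+1)) */
    static void print_count_histogram(const long *occurrences, int num_symbols) {
        long bins[32] = {0};
        int  i, b;

        for (i = 0; i < num_symbols; i++) {
            long c = occurrences[i];
            for (b = 0; c > 1 && b < 31; b++)   /* b = floor(log2(count)) */
                c >>= 1;
            bins[b]++;
        }
        for (b = 0; b < 32; b++)
            if (bins[b])
                printf("symbols occurring %ld-%ld times: %ld\n",
                       1L << b, (1L << (b + 1)) - 1, bins[b]);
    }

    int main(void) {
        long demo[] = {1, 1, 1, 1, 2, 2, 3, 5, 6, 9, 17};   /* made-up counts */
        print_count_histogram(demo, (int)(sizeof demo / sizeof demo[0]));
        return 0;
    }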

    Quote Originally Posted by Kennon Conrad:
    I think you are putting things mildly. Back in the days when I managed software engineers, I would not have been pleased about several aspects of the code, including the lack of headers with high-level descriptions, the lack of sufficient comments, the single-file approach, and the lack of function calls. Maintenance or changes by anyone who didn't know the code would be a nightmare. Over time it is slowly getting better, but I have empathy for anyone who tries to figure out exactly what the code is doing in certain places.
    Rules are made to be broken, right? The worst case is code that no one can read. As long as one person can read it, it's a good start.

    I personally hate hearing lectures on proper style, because most of it is usually just opinions. But a computer program can be thought of as a description of an idea, meant for people to read. If you are really thinking clearly about an idea, you will be able to express it in simplest terms. A clearly-coded algorithm is like an idea expressed in simplest terms.
    Last edited by nburns; 26th June 2014 at 21:33.

  #232 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by nburns:
    The part I emphasized sounds somewhat like a Zipf distribution: if you put symbols in order of decreasing frequency, with f_1 = the frequency of the most common symbol (using =~ to mean approximately equal),

    f_i =~ f_1 / i

    I was going to say that implies what you wrote, but actually, what you wrote is a little different. You can expect the distribution to be Zipf, and I think your statement will be approximately correct for some segments of the plot, at least. A Zipf distribution becomes a straight line on a log-log plot.

    By the way, it's not hard to generate and plot statistics, and plotting this data would help to make the case for these assumptions (or refute them).
    Yes, I was thinking of Zipf distributions. If you assume f(i) ~= f(1)/i, it follows that for arbitrary n, f(n) ~= f(1)/n and f(2n) ~= f(1)/(2n). Since f(2n) changes half as fast as f(n), the frequency changes half as fast for symbols that are half as frequent, meaning there will be twice as many tokens in each integer bin once the frequencies are converted to numbers of occurrences (because of the limited length of the input). Here are the numbers of symbols in Tree's compressed file by symbol code length in bits (ignoring MTF):

    bits 7: tokens 2
    bits 8: tokens 2
    bits 9: tokens 11
    bits 10: tokens 26
    bits 11: tokens 46
    bits 12: tokens 100
    bits 13: tokens 259
    bits 14: tokens 619
    bits 15: tokens 1595
    bits 16: tokens 4166
    bits 17: tokens 8763
    bits 18: tokens 17344
    bits 19: tokens 34866
    bits 20: tokens 68495
    bits 21: tokens 137280
    bits 22: tokens 148260
    bits 23: tokens 540385
    bits 24: tokens 1254445
    bits 25: tokens 6665387

    It's not exactly a Zipf distribution, as the tail is heavy and there seems to be an interesting "hole" at 22 bits, but other than that it is pretty close. The tail was much less heavy in earlier versions, when I stopped tokenization once the order-0 entropy change reached 0. I compress more deeply now because embedding the tokens in the data stream reduces the symbol count, making tokenization more profitable than before.

    The thing that surprised me most about the data is that 1.02 million of the 4.41 million symbols that occur twice (one definition and one reference) are coded from the first MTF location for two-instance symbols. A symbol that occurs twice has a 23% probability of being the next symbol of that type, instead of the 0.000023% probability a static model would predict.

    Quote Originally Posted by nburns:
    Rules are made to be broken, right? The worst case is code that no one can read. As long as one person can read it, it's a good start.
    Yes, and as long as the code worked well, I rarely mentioned the other issues. For me, the worst case is code that does not work, is not well planned or executed, and is owned by a lousy programmer. Basically, you have to get rid of the programmer and then start over. Since this is experimental, I definitely break rules. It's hard to plan well when you don't really know exactly where you are going, and I don't test nearly as thoroughly as I would for a commercial or OEM release.
    Last edited by Kennon Conrad; 27th June 2014 at 07:50.

  #233 - nburns (Member, San Diego)
    Quote Originally Posted by Kennon Conrad:
    Yes, and as long as the code worked well, I rarely mentioned the other issues. For me, the worst case is code that does not work, is not well planned or executed, and is owned by a lousy programmer. Basically, you have to get rid of the programmer and then start over. Since this is experimental, I definitely break rules. It's hard to plan well when you don't really know exactly where you are going, and I don't test nearly as thoroughly as I would for a commercial or OEM release.
    If no one can read code, then it will be broken as soon as a bug or new feature demands a change, so it doesn't make that much difference whether it works at the present instant.

    I'm not a big believer in either rules or planning. For anything that's worth doing right, the first version is just a throwaway prototype anyway. The only way to learn how something ought to work is by making it work. Then you can write it again based on your new idea of how it should be, and, if necessary, repeat.
    Last edited by nburns; 27th June 2014 at 12:09.

  #234 - Kennon Conrad (Member, Bothell, Washington, USA)

    Tree v0.9

    Tree v0.9 has lots of changes, none of which make a huge difference to compression.

    Highlights of the changes are as follows:

    TreeWords: Changed the cut-off threshold to a slightly bigger number to improve the final compressed file size a little.

    TreeCompress: Added command-line option -m for tailoring the cutoff point for compression. For example, -m5.5 will stop tokenization when the score declines to 5.5 instead of the default value of 4.0. This was added to allow smaller dictionaries to be created for exploring context modeling. Also adjusted the symbol candidate scoring formula to slightly improve the final compressed file size.

    TreeBitEncode: Significantly reduced RAM usage by storing only a window of the input data in RAM instead of the entire input file. Cleaned up wasteful base symbol encoding, optimized some code, added functions, and improved variable names. Runtime is about 20% less than v0.8.

    TreeDecode: Increased the supported file size and cleaned up wasteful base symbol encoding. Optimized some code, added separate code for UTF-8-compliant files, and improved variable names. Runtime is about 5% less than v0.8.

    Results for enwik9:
    Compression time: 70,723 seconds
    Compressed file size: 181,324,992 bytes
    Decompression time: 21.8 seconds
    TreeDecode.c.zip file size: 6,716 bytes
    Compressed file + decompressor size: 181,331,708 bytes

    This improves Tree's Pareto Frontier enwik9 performance slightly on both compressed size and decompression speed.

    Results for enwik8:
    Compression time: 3,963 seconds
    Compressed file size: 22,366,748 bytes
    Decompression time: 2.4 seconds

    Results for Silesia Corpus:
    dickens: 2,443,536
    mozilla: 16,802,964
    mr: 3,215,928
    nci: 1,596,396
    ooffice: 2,727,272
    osdb: 2,705,404
    reymont: 1,260,236
    samba: 4,380,212
    sao: 4,999,596
    webster: 7,298,008
    x-ray: 5,613,704
    xml: 448,608

    Total size: 53,491,864 bytes

    All results verified. The test machine is still my A8-5500 based system.

    Sportman's genefile now verifies, not that you'd want to keep your computer busy for that many days (10 for me).
    Attached Files
    Last edited by Kennon Conrad; 6th July 2014 at 03:11.

  #235 - Kennon Conrad (Member, Bothell, Washington, USA)

    Major Pareto Frontier Improvement for enwik9

    I had never written a multi-threaded program from scratch until today; I had only worked on portions of multi-threaded programs that other people set up. I tried adding a separate output thread to TreeDecode, and it wasn't nearly as hard as I thought it would be.

    TreeDecode2T is attached. Since I only released Tree v0.9 yesterday, please consider this a supplement to v0.9 and an alternative to TreeDecode.

    Decompression of enwik9:
    Was 21.8 seconds
    Now 15.3 seconds

    Decompression of enwik8:
    Was 2.4 seconds
    Now 1.9 seconds

    Compressed TreeDecode2T.c file size: 7118 bytes

    It can take longer to run on small files, something I can work on.

    Sportman, if you would like to try this, I think lzturbo will lose its Pareto Frontier status and Tree will have the fastest decompression of enwik9 from about 181.4 MB to 279.4 MB (crush).
    Attached Files
    Last edited by Kennon Conrad; 7th July 2014 at 01:17.

  #236 - Matt Mahoney (Expert, Melbourne, Florida, USA)


  #237 - Sportman (Member, Planet Earth)
    Quote Originally Posted by Kennon Conrad:
    Sportman, if you would like to try this, I think lzturbo will lose its Pareto Frontier status and Tree will have the fastest decompression of enwik9 from about 181.4 MB to 279.4 MB (crush).
    I did a quick test for enwik8 only:

    25,870,196 bytes, 26.839 sec. - 0.355 sec., lzturbo -49 -p12 - lzturbo -d
    25,870,196 bytes, 26.839 sec. - 1.190 sec., lzturbo -49 -p12 - lzturbo -d -p1
    25,870,196 bytes, 26.839 sec. - 0.633 sec., lzturbo -49 -p12 - lzturbo -d -p2
    25,870,196 bytes, 26.839 sec. - 0.619 sec., lzturbo -49 -p12 - lzturbo -d -p3
    25,870,196 bytes, 26.839 sec. - 0.353 sec., lzturbo -49 -p12 - lzturbo -d -p4
    25,870,196 bytes, 26.839 sec. - 0.355 sec., lzturbo -49 -p12 - lzturbo -d -p12
    24,416,777 bytes, 106.652 sec. - 1.192 sec., lzturbo -49 -p12 -b110 - lzturbo -d
    24,416,777 bytes, 106.652 sec. - 1.187 sec., lzturbo -49 -p12 -b110 - lzturbo -d -p1
    24,416,777 bytes, 106.652 sec. - 1.188 sec., lzturbo -49 -p12 -b110 - lzturbo -d -p2
    24,416,777 bytes, 106.652 sec. - 1.189 sec., lzturbo -49 -p12 -b110 - lzturbo -d -p3
    24,416,777 bytes, 106.652 sec. - 1.190 sec., lzturbo -49 -p12 -b110 - lzturbo -d -p4
    24,416,777 bytes, 106.652 sec. - 1.195 sec., lzturbo -49 -p12 -b110 - lzturbo -d -p12
    22,366,748 bytes, 2,243.018 sec - 1.248 sec., TreeCompUTF8 - TreeDecode
    22,366,748 bytes, 2,243.018 sec - 1.044 sec., TreeCompUTF8 - TreeDecode2T
    Last edited by Sportman; 7th July 2014 at 09:25.


  #238 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Sportman:
    I did a quick test for enwik8 only (full results in the previous post).
    Ah, thank you. I forgot lzturbo has faster modes that don't compress quite as well. I am happy to get confirmation that TreeDecode is now faster than the best-compression option for lzturbo, even if Tree doesn't have as much of the Pareto Frontier as I initially thought. Maybe I should try 64-bit code like lzturbo so there is less data shifting. Multi-threading TreeCompress will not be so easy.

    Update 7/8/14: It's interesting to note the differences in approaches to multi-threaded decompression. lzturbo can run a multi-threaded decompression if the data is split into separate blocks, which hurts the compressed file size. Tree multi-threads the decompression of a single block: one thread determines the dictionary references and the other outputs the strings from the dictionary. I think a third thread to handle the input, so the code isn't waiting on disk, could make decompression faster.
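    As a very rough sketch of that producer/consumer split (pthreads and every name here are for illustration only, not TreeDecode2T's actual code): the decoding thread pushes dictionary references into a ring buffer and the output thread pops them and writes the corresponding strings.

    #include <pthread.h>
    #include <stdio.h>

    #define RING_SIZE  4096
    #define END_MARKER 0xFFFFFFFFu

    static unsigned ring[RING_SIZE];
    static int head = 0, tail = 0, fill = 0;
    static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    /* called by the symbol-decoding thread */
    static void push_reference(unsigned dict_index) {
        pthread_mutex_lock(&ring_lock);
        while (fill == RING_SIZE)
            pthread_cond_wait(&not_full, &ring_lock);
        ring[head] = dict_index;
        head = (head + 1) % RING_SIZE;
        fill++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&ring_lock);
    }

    /* called by the output thread */
    static unsigned pop_reference(void) {
        unsigned dict_index;
        pthread_mutex_lock(&ring_lock);
        while (fill == 0)
            pthread_cond_wait(&not_empty, &ring_lock);
        dict_index = ring[tail];
        tail = (tail + 1) % RING_SIZE;
        fill--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&ring_lock);
        return dict_index;
    }

    /* stand-in for the routine that copies a dictionary entry's bytes out */
    static void write_dictionary_string(FILE *out, unsigned dict_index) {
        fprintf(out, "<%u>", dict_index);
    }

    static void *output_thread(void *arg) {
        FILE *out = arg;
        unsigned idx;
        while ((idx = pop_reference()) != END_MARKER)
            write_dictionary_string(out, idx);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        unsigned i;
        pthread_create(&tid, NULL, output_thread, stdout);
        for (i = 0; i < 8; i++)        /* pretend the decoder produced these */
            push_reference(i);
        push_reference(END_MARKER);    /* tell the output thread to finish */
        pthread_join(tid, NULL);
        putchar('\n');
        return 0;
    }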
    Last edited by Kennon Conrad; 8th July 2014 at 10:14.

  #239 - lunaris (Member, Brazil)
    For the repack scene Tree can be good, but it needs to be integrated into freearc/zpaq using some library. Also, the decompression memory cannot be more than 1 GB.

  #240 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by lunaris:
    For the repack scene Tree can be good, but it needs to be integrated into freearc/zpaq using some library. Also, the decompression memory cannot be more than 1 GB.
    I have no control over what the authors of Freearc and Zpaq do, nor do I know any specifics of how that could be done. Up to now, Tree is experimental, but I do appreciate the thought. The first TreeDecode2T used a lot more memory than TreeDecode, but a new version is now using slightly less - about 415 MB for enwik9. There are options available for reducing memory further, but up to now I have prioritized speed and simplicity over memory usage.
    Last edited by Kennon Conrad; 8th July 2014 at 10:20.


