
Thread: Tree alpha v0.1 download

  1. #631
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    This is what I mean by "Once a common phrase gets paired one way with something, all the other symbols on the same end tend to get promoted because the Markov probability of those strings tends to go down while the number of instances stays the same".

    Suppose the phrase " of the" was only preceded by " many", " some" or " one" and only followed by " best", " worst", " first" and " last". Further, suppose the best combination found that includes " of the" was " of the first". When " of the first" is deduplicated, the probability of the symbol representing " of the" decreases because many of its instances are removed when substituting the symbol representing " of the first" (as does the probability of " first"). The counts for " many of the", " some of the" and " one of the" all tend to decrease roughly in proportion to the decrease in the number of symbols representing " of the", so the Markov chain probability of those strings tends not to change much (with exceptions) and neither does their ratio of matches to Markov chain probability. However, the counts for " of the best", " of the worst" and " of the last" are not changed by deduplicating " of the first", so their ratio of matches to Markov chain probability goes up, making deduplication of those strings more profitable.

    It is not perfect, but I think it helps. It doesn't work as well if there are correlations between the symbols preceding " of the" and following " of the", but if the correlation is strong then the whole combination will typically get deduplicated first instead.
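
    As a toy illustration (a C sketch with invented numbers, not the actual scoring code), think of a string's profitability as the ratio of its observed matches to the count an order-1 Markov chain over symbols would predict. Deduplicating " of the first" lowers the count of the " of the" symbol, which lowers the Markov-predicted count of " of the best" while its observed matches stay the same, so its ratio rises:

    /* Toy illustration of why deduplicating " of the first" makes strings
     * like " of the best" look more profitable: the score is observed
     * matches divided by the count an order-1 Markov chain over symbols
     * would predict.  All numbers are invented for the example. */
    #include <stdio.h>

    /* Expected count of the pair (A B) under a symbol-level Markov chain:
     * count(A) * count(B) / total_symbols. */
    static double expected_pairs(double count_a, double count_b, double total) {
        return count_a * count_b / total;
    }

    int main(void) {
        double total = 1000000.0;       /* total symbols in the tokenized text */
        double of_the = 5000.0;         /* count of the symbol " of the"       */
        double best = 800.0;            /* count of the symbol " best"         */
        double of_the_best = 300.0;     /* observed matches of " of the best"  */

        double score_before = of_the_best / expected_pairs(of_the, best, total);

        /* Deduplicate " of the first": suppose 2000 instances of " of the"
         * are absorbed into the new symbol.  The " of the best" matches are
         * untouched and the total shrinks as pairs collapse into one symbol. */
        double of_the_after = of_the - 2000.0;
        double total_after = total - 2000.0;

        double score_after = of_the_best / expected_pairs(of_the_after, best, total_after);

        printf("score before: %.2f\n", score_before);   /* 75.00  */
        printf("score after:  %.2f\n", score_after);    /* 124.75 */
        return 0;
    }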

  2. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (11th March 2015)

  3. #632
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Hmmm... I would think that wouldn't actually tend to help a lot, for phrases of that form, as they actually occur in English text. I'd think there are a whole bunch of words (and phrases) that fairly commonly appear before "of the", in a rather Zipfy distribution (one, many, most, some, half, two, the largest, the majority, a minority, the smallest, the furthest, nine out of ten, the tallest, the heaviest, the deepest, etc...)

    I'd also think that there's a similar Zipfy distribution of the stuff following "of the" (best, worst, first, last, tall, worthwhile, etc...)

    So I'd expect Tree to follow those Zipf distributions in a fairly interleaved way, grabbing one or a few of the phrases that end with "of the", then one or a few of the phrases that start with "of the", and jumping back and forth between ways of parsing things, rather as though it were merge-sorting the Zipf distributions.

    I'd think this would lead to a fair number of extra productions for alternative ways of parsing the same phrases, and pollute your code space significantly, but I'm not sure.

  4. #633
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I looked closer. My hypothetical example wasn't very good. The symbol for " of the" is the 509th symbol created, which is fine, but most of the symbols created for strings that include the " of the" symbol use it somewhere in the middle of the string. The symbols created earliest in the deduplication process tend to represent long strings (over 100 characters), but here's a good portion of the relatively short ones:

    #1564: "democratic Crepublic of the Ccongo"
    #1969: "quorum of the Ctwelve Capostles"
    #1978: "victoria of the Cunited Ckingdom|Cqueen Cvictoria"
    #2460: "indigenous peoples of the Camericas|Cnative Camerican"
    #2462: "president of the Cunited Cstates|Cpresident"
    #3119: "one of the most"
    #3755: "members of the"
    #5583: "state of the Cunion Caddress"
    #6324: ", recipient of the [[Cnobel Cpeace Cprize]]"

    So I think the concept probably applies more strongly to other types of (less common) symbols where the diversity of what appears on one side differs from the diversity of what appears on the other. I have watched the tokenizer enough to know that sometimes, once it picks one combination, it will suddenly start picking a bunch of similar combinations with only a different starting or ending symbol.

    There definitely is some redundancy in the grammar, although I'm not sure that wouldn't be true for the minimum entropy grammar if we knew how to find it. I don't know the cost of the "pollution" but would guess it is a few percent of the compressed enwik9 file size.

  5. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (12th March 2015)

  6. #634
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    I think text is largely non-ergodic. It's got lots of regularities in it that can be captured by recency models, once you filter out most of the ergodic Markov noise (the frequent, dispersed "stop words").

    My guess is that the right division of labor is something like fifty-fifty. Text consists largely of relatively ergodic phrases interspersed with clearly non-ergodic ones with various long-range dependencies---the large majority of symbols is non-ergodic, but some ergodic ones are very frequent, so it comes out roughly even in terms of how often you see which.
    I have been working on the arithmetic coding (AC) and it was easy to get some statistics on where Tree currently stands on ergodic vs. non-ergodic symbols, using the output of TreeCompress64 on enwik9:

    86,056,501 symbols are sent (excluding EOF), consisting of:
    New Symbols: 7,638,735 symbols (8.9%)
    MTF Symbols: 11,644,488 symbols (13.5%)
    Dictionary Symbols: 66,773,278 symbols (77.6%)

    Of the 86,056,501 symbols transmitted, the breakdown is like this in terms of whether they are considered ergodic:
    New Symbols: 245,986 ergodic (3.2%); 7,392,749 non-ergodic (96.8%)
    Established Symbols: 49,177,297 ergodic (62.7%); 29,240,469 non-ergodic (37.3%)
    Total: 49,423,283 ergodic (57.4%); 36,633,218 non-ergodic (42.6%)

    When I biased the ergodicity bit to only be set when I felt the compression ratio benefit exceeded the time cost, the number of symbols moving through the mtf queues became about 4 million fewer than what appeared optimal for compression ratio. If I had not done that, the ratio would be more like 53%/47%.

  7. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (21st March 2015)

  8. #635
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.1

    It has been over a year now since Tree's first release. I have learned a lot since then both about compression in general and also about what Tree does relative to other compressors. One of the things I have learned is that Tree isn't a very good name. While my compressor uses tree structures, they are just tools to implement the idea. Therefore, I have decided to give my program a new name.

    This version is called GLZA, which stands for Grammatical Ziv-Lempel Arithmetic. It is generally similar to Tree64, but now uses adaptive order 0 arithmetic coding for symbol types, mtf queues, mtf queue positions, new symbol lengths, new symbol frequencies and the ergodicity bit, and has separate models for symbol types. In some cases there are separate models for symbols following the special capital escape symbol and for existing vs. new symbols. The maximum supported UTF-8 character has been increased to Unicode value 0x80000, and a bug in the compressor when determining the UTF-8 compliance of files that end with a partial UTF-8 character has been fixed. The ergodic formula was adjusted to allow a few more symbols into the >20 instance mtf queue. Dictionary codes for symbols whose preceding symbols are in the >20 instance mtf queue are expanded to include the code space for those symbols if they are in the same code bin or consume an entire bin. Compression ratios are better. Decoding is significantly slower than Tree's but still on the Pareto Frontier for enwik8/9.

    Results for enwik9:
    GLZAformat+GLZAcompress+GLZAencode: 4716 seconds, 6027 MB RAM, 171,131,068 bytes
    GLZAdecode: 12.5 seconds, verifies

    Results for enwik8:
    GLZAformat+GLZAcompress+GLZAencode: 298 seconds, 1394 MB RAM, 21,225,310 bytes
    GLZAdecode: 1.4 seconds, verifies

    A .zip file of GLZAdecode.c and GLZAmodel.h is 14,391 bytes.

    I tested the set of 10 wikis from SqueezeChart, each 100,000,000 bytes:
    arwiki (Arabic): Compress 314 seconds, 15,978,131 bytes; Decompress 1.0 seconds
    dewiki (German): Compress 339 seconds, 19,208,373 bytes; Decompress 1.3 seconds
    enwiki (English): Compress 321 seconds, 21,439,295 bytes; Decompress 1.4 seconds
    eswiki (Spanish): Compress 399 seconds, 20,525,039 bytes; Decompress 1.4 seconds
    frwiki (French): Compress 316 seconds, 19,609,487 bytes; Decompress 2.0 seconds
    hiwiki (Hindi): Compress 556 seconds, 10,833,966 bytes; Decompress 0.7 seconds
    ptwiki (Portuguese): Compress 336 seconds, 18,566,311 bytes; Decompress 1.2 seconds
    ruwiki (Russian): Compress 254 seconds, 14,459,964 bytes; Decompress 0.9 seconds
    trwiki (Turkish): Compress 418 seconds, 18,478,542 bytes; Decompress 1.2 seconds
    zhwiki (Chinese): Compress 284 seconds, 22,303,762 bytes; Decompress 1.3 seconds

    All files verified. It looks like Chinese is hardest to compress. It seems odd that French takes by far the longest to decompress.

    Calgary.tar now compresses to 829,224 bytes instead of 853,112 bytes.
    Attached Files
    Last edited by Kennon Conrad; 27th April 2015 at 10:56.

  9. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Matt Mahoney (27th April 2015),surfersat (27th April 2015)

  10. #636
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Is there a convenient place to download the truncated (100,000,000-byte) Chinese wiki file? (I'm having trouble finding the full file on wikimedia.org too.)

    Chinese is generally harder to compress than alphabetic languages, because each character conveys more information and because there are no spaces between words. Just Huffman coding or LZing it doesn't work well at all. You get better results with PPM or with custom Chinese-word-based preprocessing/compression, but I'm not sure how much better.

    You may actually be doing very well---you're getting compression only about four percent worse than for the English version---so it'd be worth comparing it to other algorithms.

  11. #637
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Is there a convenient place to download the truncated (100,000,000-byte) Chinese wiki file? (I'm having trouble finding the full file on wikimedia.org too.)
    I got the files from here: http://www.compressionratings.com/fi...zechart_xml.7z

  12. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (27th April 2015)

  13. #638
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  14. #639
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I tried ppmd on the Chinese file with the maximum memory (256MB), all 3 statistics modes, and with various max orders (4 through 16). The best combination of settings I found was to use either of the non-default stats modes and max order 11.

    ppmd e -o11 -r1 -m256 <filename> resulted in a 22,844,566-byte file. (-r1 is best suited to relatively stationary stats)
    ppmd e -o11 -r2 -m256 <filename> resulted in a 22,893,568-byte file. (-r2 is best suited to even more stationary stats)

    Using the default stats mode, compression was worse and the optimal max order was only 7, giving a 23,761,401 byte file.

    Your 22,303,762 bytes is better than PPMD with parameter-fiddling. Cool.

    lzma -9 gave me a 25,256,625-byte file.

  15. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (28th April 2015)

  16. #640
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Kennon:

    >Calgary.tar now compresses to 829,224 bytes instead of 853,112 bytes.

    Any idea why such a big improvement? (My guess would be that the arithmetic coding helps more on the non-text files like Geo, because for text you're usually dealing with long-enough strings that fractional bit coding doesn't matter much.)

    (I'm guessing you're not yet exploiting the arithmetic coding to allow you to dynamically vary the code space allocated to dictionary vs. MTF codes, or you'd have mentioned it.)

  17. #641
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Thanks. Can you tell me why the underline for Pareto Frontier decompression speed is gone?

  18. #642
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    I tried ppmd on the Chinese file with the maximum memory (256MB), all 3 statistics modes, and with various max orders (4 through 16). The best combination of settings I found was to use either of the non-default stats modes and max order 11.

    ppmd e -o11 -r1 -m256 <filename> resulted in a 22,844,566-byte file. (-r1 is best suited to relatively stationary stats)
    ppmd e -o11 -r2 -m256 <filename> resulted in a 22,893,568-byte file. (-r2 is best suited to even more stationary stats)

    Using the default stats mode, compression was worse and the optimal max order was only 7, giving a 23,761,401 byte file.

    Your 22,303,762 bytes is better than PPMD with parameter-fiddling. Cool.

    lzma -9 gave me a 25,256,625-byte file.
    PPMD seems to not give the greatest compression due to the memory limitations. I tried PPMonstr on the Chinese enwiki and got 19,416,105 bytes. It takes 93.3 seconds to decompress though.

    I also tried plzma, lzturbo and lzham (the best compression ratio LZx compressors on LTCB). The best compression was with plzma at 24,245,718 bytes. So in this case, GLZA beats other LZ compressors on compression ratio by 8% while maintaining fast decompression.

  19. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Gonzalo (10th May 2015),Paul W. (28th April 2015)

  20. #643
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Kennon:

    >Calgary.tar now compresses to 829,224 bytes instead of 853,112 bytes.

    Any idea why such a big improvement? (My guess would be that the arithmetic coding helps more on the non-text files like Geo, because for text you're usually dealing with long-enough strings that fractional bit coding doesn't matter much.)

    (I'm guessing you're not yet exploiting the arithmetic coding to allow you to dynamically vary the code space allocated to dictionary vs. MTF codes, or you'd have mentioned it.)
    When compressing the files individually (not the reported result), the biggest percentage improvements were in BIB and PAPER2 (5%) and the smallest in PIC (1%). GEO was 2%.

    I didn't really add fractional bit dictionary codes (yet), except for >20 instance symbols with preceding symbols that are in the mtf queue. All dictionary entries with 10 or fewer bits still use a power of two number of dictionary bins. Since the number of dictionary bins usually isn't a power of 2, I guess they are fractional, but only after being rounded to a power of 2.

    I did mention that the code has adaptive order 0 encoding of symbol types. Perhaps this needs more explanation. For each symbol, the first thing decoded is the symbol type. This is generally one of four things:

    • Dictionary symbol
    • New symbol
    • <= 20 instance mtf queue symbol
    • > 20 instance mtf queue symbol


    The exception is that files that have low recency "scores" do not use MTF queues and therefore only have dictionary symbols and new symbols.

    There are four symbol type models as follows:
    • Level 0, not preceded by a capital symbol
    • Level 0, preceded by a capital symbol
    • Level >=1, not preceded by a capital symbol
    • Level >=1, preceded by a capital symbol

    where level is the current depth in the hierarchy of the grammar (i.e. 0 when not defining new symbols). Each model has a half-life of 64 symbols.
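
    Roughly speaking, each of these models behaves something like the C sketch below (a simplified illustration; the increment and rescaling constants here are arbitrary and the real code differs). Halving all of the counts every 64 symbols is what gives the model its half-life:

    #define NUM_TYPES 4   /* dictionary, new, <=20 instance MTF, >20 instance MTF */

    typedef struct {
        unsigned int count[NUM_TYPES];
        unsigned int seen;              /* symbols observed since the last rescale */
    } TypeModel;

    static void type_model_init(TypeModel *m) {
        for (int i = 0; i < NUM_TYPES; i++)
            m->count[i] = 1;            /* never assign zero probability */
        m->seen = 0;
    }

    /* Probability estimate for one type, as a fraction of the running total. */
    static double type_model_prob(const TypeModel *m, int type) {
        unsigned int total = 0;
        for (int i = 0; i < NUM_TYPES; i++)
            total += m->count[i];
        return (double)m->count[type] / (double)total;
    }

    /* Update after coding one symbol of the given type. */
    static void type_model_update(TypeModel *m, int type) {
        m->count[type] += 16;           /* arbitrary increment for the sketch */
        if (++m->seen == 64) {          /* half-life: halve everything every 64 symbols */
            for (int i = 0; i < NUM_TYPES; i++)
                m->count[i] = (m->count[i] + 1) >> 1;
            m->seen = 0;
        }
    }

    Four instances of this structure, selected by grammar level and by whether the previous symbol was a capital symbol, give the four contexts listed above.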

    With Tree v0.19, the >20 instance MTF queue had a fixed probability of 1/16 and the <=20 instance MTF queues had a fixed probability per file. With GLZA, these probabilities vary according to the context and file position. Calgary.tar uses MTF quite a bit. At renormalization when encoding calgary.tar, the <=20 instance MTF queues' probability now varies between 1/128 and 77/128 and the >20 instance MTF queue probability varies between 1/128 and 48/128. Since the trends tend to be significantly longer than the half-life of the prediction adjustments, there is a big win from adapting the symbol type probability. This is the biggest cause of compression ratio improvement on most files, but it is only one of six model categories, each of which tends to help the compression ratio on the roughly 40 files I tried. The most complicated model is the symbol instances model, which uses both the grammar level and the symbol length as context. I like it much better than Tree's fixed combined length/instances probability table that was tailored to enwik9.
    Last edited by Kennon Conrad; 28th April 2015 at 09:13.

  21. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (28th April 2015)

  22. #644
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Thanks. Can you tell me why the underline for Pareto Frontier decompression speed is gone?
    Oops! Fixed. I found a few other mistakes that I fixed too.

  23. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (28th April 2015)

  24. #645
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Decompression speed seems quite impressive; I am just about to test it. It's good to see that work on suffix trees and their use for compression and transformations is still active here, and also on BWT, which is slightly different. GLZA, BCM, maybe some others.

    Well, I did a quick test of GLZA 0.1 on my laptop, which is old and slow already, but the compression ratio seems quite impressive, while the decoder doesn't use much memory and is fast at the same time. (Tested on an Aspire One, Intel(R) Atom(TM) CPU N2600 @ 1.60GHz, 2 cores, Ubuntu 64-bit)


    wine GLZAformat.exe enwik8 enwik8.f
    F.Time: 2.05 s user, 0.94 s system, 0:04.62 total; Used 64% CPU, memory: 108548 kB

    wine GLZAcompress.exe enwik8.f enwik8.c
    C.Time: 6225.89 s user, 71.38 s system, 34:39.51 total; Used 302% CPU, memory: 1428756 kB

    wine GLZAencode.exe enwik8.c enwik8.e
    E.Time: 30.83 s user, 0.79 s system, 0:38.87 total; Used 81% CPU, memory: 230188 kB

    wine GLZAdecode.exe enwik8.e enwik8.d
    D.Time: 25.51 s user, 0.97 s system, 0:16.75 total; Used 158% CPU, memory: 127420 kB

    enwik8 100000000 -> enwik8.e 21225310
    Edit: I am also really surprised by your zhwik8 results; there you seem to compress better than most BWT compressors.

    BTW, I also plan to do a maintenance release of the qad (BWT) compressor. I remember it was quite buggy, so maybe I will try to improve the memory management there as well.

    Well anyway good luck
    Last edited by quadro; 20th May 2015 at 22:25.

  25. #646
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thank you for the encouraging words and testing. I may be going the wrong way on decompression speed in striving to achieve better compression ratios, but since this is experimental, I want to see how good the ratios can get while maintaining relatively fast decompression with relatively low RAM usage. If you are interested, there should be a new version soon that takes the compressed size of enwik8 from 21,225,310 bytes to 20,806,973 bytes and the zhwik8 file from 22,333,294 bytes to 21,776,425 bytes (unless I make more changes).

    Also, good luck with qad. The thread was quite popular when development was active.

  26. #647
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    A good improvement, it seems; is it on the encode stage? Maybe you could keep a fast mode option too, but I see you are still experimenting.

  27. #648
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, on the encoding stage. I have been thinking about having different encoding speed options, but I want to figure out what works well first and then try to optimize the speed of that. It seems like there should be a way to get rid of a lot of divides in the decoder by encoding backwards, even though GLZA doesn't have the 2^N bins that ANS requires. But yes, in the long run GLZA should be able to have faster, less efficient encoding options.

  28. #649
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.2

    GLZA v0.2 has two significant algorithmic changes compared to GLZA v0.1:

    1. GLZAencode and GLZAdecode now have a model for each character that predicts the first character of the next symbol from the last character of the previous symbol, except that if the file is UTF-8 compliant, all extended UTF-8 characters are put in a single model. To support this, dictionary symbols are now put into separate bins based on both the first character of the string represented by the symbol and the symbol code length.
    2. GLZAformat now delta filters files whose order 0 entropy is reduced by a delta filter with a byte gap of 1, 2, or 4 bytes. GLZAcompress and GLZAencode were modified to include two bits in the header to indicate the delta filter selection, and GLZAdecode was modified to undo the delta filtering.

    The code is only intended for win64. GLZAdecode will now run out of virtual memory on 32 bit builds. I might try to improve that someday.

    The leading character prediction generally provides better compression ratios but causes decompression to take about 30% longer. Physical memory usage for decoding is generally slightly less than before.
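
    The general shape of the model is something like the following C sketch (simplified; the actual data layout, the extended UTF-8 grouping and the adaptation constants are different). The last character of the previous symbol selects a context, and an adaptive count table over possible first characters in that context feeds the arithmetic coder:

    #define NUM_CHARS 256

    typedef struct {
        unsigned int count[NUM_CHARS][NUM_CHARS]; /* [prev last char][next first char] */
        unsigned int total[NUM_CHARS];
    } FirstCharModel;

    static void first_char_init(FirstCharModel *m) {
        for (int c = 0; c < NUM_CHARS; c++) {
            for (int f = 0; f < NUM_CHARS; f++)
                m->count[c][f] = 1;               /* avoid zero probabilities */
            m->total[c] = NUM_CHARS;
        }
    }

    static double first_char_prob(const FirstCharModel *m,
                                  unsigned char prev_last, unsigned char next_first) {
        return (double)m->count[prev_last][next_first] / (double)m->total[prev_last];
    }

    static void first_char_update(FirstCharModel *m,
                                  unsigned char prev_last, unsigned char next_first) {
        m->count[prev_last][next_first] += 8;     /* arbitrary increment */
        m->total[prev_last] += 8;
        /* a real coder would also rescale the counts periodically */
    }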

    GLZA is mostly aimed at text so delta filtering isn't necessarily the most relevant thing, but it helps performance considerably on certain benchmark files such as mr and x-ray in the Silesia Corpus.
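
    The filter selection amounts to comparing order 0 entropies. A simplified C sketch of the idea (the real code differs in its details):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    static double entropy_from_counts(const size_t counts[256], size_t total) {
        double bits = 0.0;
        for (int b = 0; b < 256; b++) {
            if (counts[b]) {
                double p = (double)counts[b] / (double)total;
                bits -= p * log2(p);
            }
        }
        return bits;                                  /* bits per byte */
    }

    /* Order 0 entropy of the deltas at the given byte gap (0 = raw bytes).
     * Assumes len > gap. */
    static double order0_entropy(const uint8_t *buf, size_t len, size_t gap) {
        size_t counts[256] = {0};
        for (size_t i = gap; i < len; i++)
            counts[(uint8_t)(buf[i] - (gap ? buf[i - gap] : 0))]++;
        return entropy_from_counts(counts, len - gap);
    }

    /* Pick the byte gap (0 = no filter, else 1, 2 or 4) with the lowest entropy. */
    static size_t pick_delta_gap(const uint8_t *buf, size_t len) {
        const size_t gaps[4] = {0, 1, 2, 4};
        size_t best = 0;
        double best_bits = order0_entropy(buf, len, 0);
        for (int i = 1; i < 4; i++) {
            double bits = order0_entropy(buf, len, gaps[i]);
            if (bits < best_bits) {
                best_bits = bits;
                best = gaps[i];
            }
        }
        return best;
    }

    Undoing this sketch's filter is just the reverse pass: add buf[i - gap] back to buf[i], working from the front of the file.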

    Test Results:
    enwik8:
    encode 100,000,000 -> 20,806,740 bytes, 299 seconds, 1394 MB RAM
    decode 1.8 seconds, 45 MB RAM

    enwik9:
    encode 1,000,000,000 -> 167,274,338 bytes, 4713 seconds, 6027 MB RAM
    decode 16.4 seconds, 306 MB RAM

    A .zip file of GLZAdecode.c and GLZAmodel.h is 15,218 bytes.

    So, GLZA v0.2 produces a compressed enwik9 file that is 3,856,730 bytes smaller than v0.1 but it takes 3.9 seconds longer to decompress it. Perhaps it's not the most practical change, but it does put another point on the Pareto frontier for decompression speed vs. compression ratio and can be considered a step along the path to subsymbol based modeling and predictions that should provide better compression ratios (and slower decompression).

    Silesia Corpus (total uncompressed size is 211,938,580 bytes) compresses to 48,330,061 bytes as follows:
    dickens: 10,192,446 -> 2,317,338 (4.398:1)
    mozilla: 51,220,480 -> 15,352,413 (3.336:1)
    mr: 9,970,564 -> 2,678,018 (3.723:1)
    nci: 33,553,445 -> 1,617,539 (20.744:1)
    ooffice: 6,152,192 -> 2,544,614 (2.418:1)
    osdb: 10,085,684 -> 2,563,659 (3.934:1)
    reymont: 6,627,202 -> 1,062,699 (6.236:1)
    samba: 21,606,400 -> 3,905,765 (5.532:1)
    sao: 7,251,944 -> 4,770,160 (1.520:1)
    webster: 41,458,703 -> 6,808,745 (6.089:1)
    x-ray: 8,474,240 -> 4,300,194 (1.971:1)
    xml: 5,345,280 -> 408,917 (13.072:1)

    Serially decompressing the files takes approximately 4.95 seconds. I think that is faster than all compressors with a better compression ratio.

    Calgary.tar as a single file now compresses to 799,786 bytes vs. 829,224 bytes for GLZA v0.1. Compression as individual files is slightly worse. I think the leading character prediction models do not get enough information to make accurate predictions for small files. I might be able to improve that later.
    Attached Files

  29. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Nania Francesco (25th May 2015),Paul W. (24th May 2015)

  30. #650
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    GLZA v.01 on MCB [test on Intel Core i7 920 2.66 GHZ 6GB RAM]

    name=C:\test\A10.jpg 842468->837628
    name=C:\test\AcroRd32.exe 3870784->1566104
    name=C:\test\english.dic 4067439->957766
    name=C:\test\FlashMX.pdf 4526946->3770645
    name=C:\test\FP.LOG 20617071->547875
    name=C:\test\MSO97.DLL 3782416->1903974
    name=C:\test\ohs.doc 4168192->828588
    name=C:\test\rafale.bmp 4149414->1115362
    name=C:\test\vcfiu.hlp 4121418->695869
    name=C:\test\world95.txt 2988578->518376
    number of files=10 original=53134726 compressed=12742187 ENC=487.0979919434 sec. DEC=1.0930000544 sec.

    GLZA v.02 crashes in decoding with Rafale.bmp

    GLZA is a great compressor!
    Last edited by Nania Francesco; 25th May 2015 at 01:56.

  31. The Following User Says Thank You to Nania Francesco For This Useful Post:

    Kennon Conrad (25th May 2015)

  32. #651
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
    Quote Originally Posted by Kennon Conrad View Post
    GLZA v0.2
    I get a crash on decode (also a crash with v0.1):

    Unhandled exception at 0x000000000040854F in GLZAdecode.exe: 0xC0000005: Access violation writing location 0x000000007F66E14D.

    000000000040854F mov byte ptr [rcx+r8],dil

  33. The Following User Says Thank You to Sportman For This Useful Post:

    Kennon Conrad (26th May 2015)

  34. #652
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.2a

    Quote Originally Posted by Nania Francesco View Post
    GLZA v.01 on MCB [test on Intel Core i7 920 2.66 GHZ 6GB RAM]
    GLZA v.02 crashes in decoding with Rafale.bmp
    GLZA v0.2a has a fix for the bug that caused rafale.bmp decoding to crash. The problem was in the GLZAcompress code that glues together the results of the multi-threaded new symbol substitution. That code hadn't changed for months, but the bug was triggered by adding delta filtering of rafale.bmp.

    On a side note, rafale.bmp compresses a little better without the delta filter. I will probably raise the threshold in the next regular release so that delta filtering is only used if the order 0 entropy with delta filtering is at least 5% less than the order 0 entropy without delta filtering.
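
    In terms of the entropy sketch a few posts up, the proposed change is just a margin on the comparison, something like:

    /* Only delta filter when the filtered order 0 entropy is at least 5%
     * below the raw order 0 entropy (the exact threshold may change). */
    static int should_delta_filter(double raw_bits, double filtered_bits) {
        return filtered_bits <= 0.95 * raw_bits;
    }
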
    Attached Files

  35. #653
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    I get crash by decode (also crash by v0.1):

    Unhandled exception at 0x000000000040854F in GLZAdecode.exe: 0xC0000005: Access violation writing location 0x000000007F66E14D.

    000000000040854F mov byte ptr [rcx+r8],dil
    A little more information would be helpful. Is this happening with one of your log files? If so, can you post the first few lines and last few lines output by GLZAcompress and the output of GLZAencode? Otherwise, a link to the file would be very helpful.

  36. #654
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    GLZA v.0.2a on MCB [test on Intel Core i7 920 2.66 GHZ 6GB RAM]

    name=C:\test\A10.jpg 842468->836497
    name=C:\test\AcroRd32.exe 3870784->1549148
    name=C:\test\english.dic 4067439->937452
    name=C:\test\FlashMX.pdf 4526946->3752588
    name=C:\test\FP.LOG 20617071->529846
    name=C:\test\MSO97.DLL 3782416->1881712
    name=C:\test\ohs.doc 4168192->824363
    name=C:\test\rafale.bmp 4149414->1070173
    name=C:\test\vcfiu.hlp 4121418->690986
    name=C:\test\world95.txt 2988578->503359
    number of files=10 original=53134726 compressed=12576124 ENC=489.7820129395 sec. DEC=2.7069999218 sec. ?

  37. #655
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Nania Francesco View Post
    DEC=2.7069999218 sec. ?
    That's considerably worse than I expected, but I didn't think about it too much. I get a similar time increase, from 588 msec to 1322 msec. The text files are taking 30% - 50% longer, but the binary files except rafale.bmp are taking 100% - 300% longer. The worst one is FlashMX.pdf, which is the biggest compressed file and is taking 468 msec vs. 124 msec. I suspect the problem is that, on average, it is taking considerable time to calculate and check ranges for the 256 possible starting characters in the model since the probabilities may be fairly evenly distributed.

    I am kind of new to modeling, so I'm not exactly sure what would be best. The files where this is the biggest speed problem are also the ones where the model helps the least, so maybe an option to disable the model and encode like v0.1 would be best. The other options I can think of that might help would be to model bits or nibbles instead of bytes or to have a binary tree for finding the symbol from the range code.
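
    For the last option, a binary search over a cumulative frequency table is one way to do it. A sketch (not GLZA code) of finding the starting character from the decoder's target value:

    #include <stdint.h>

    #define NUM_CHARS 256

    /* cum[c] holds the sum of the frequencies of characters 0..c-1 and
     * cum[NUM_CHARS] holds the total.  Returns the character c such that
     * cum[c] <= target < cum[c + 1], in at most 8 comparisons instead of
     * a linear scan over up to 256 ranges. */
    static int find_start_char(const uint32_t cum[NUM_CHARS + 1], uint32_t target) {
        int lo = 0, hi = NUM_CHARS;        /* invariant: cum[lo] <= target < cum[hi] */
        while (hi - lo > 1) {
            int mid = (lo + hi) >> 1;
            if (cum[mid] <= target)
                lo = mid;
            else
                hi = mid;
        }
        return lo;
    }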

  38. #656
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I haven't tested this yet, but it seems a nice improvement compression-wise.

    // a brief comparison: I think the improvement in compression is a good trade-off if it's general for most text data, as you can always add some speed or another mode later. I will try to do more tests later.

    (Tested on Netbook Aspire One Intel(R) Atom(TM) CPU N2600 @ 1.60GHz, 2 cores Ubuntu 64-bit )
    wine GLZAformat.exe enwik8 enwik8.f
    F.Time: 3.58 s user, 0.99 s system, 0:11.99 total; Used 38% CPU, memory: 106808 kB

    wine GLZAcompress.exe enwik8.f enwik8.c
    C.Time: 6797.01 s user, 77.82 s system, 37:01.98 total; Used 309% CPU, memory: 1427596 kB

    wine GLZAencode.exe enwik8.c enwik8.e
    E.Time: 36.77 s user, 0.90 s system, 0:46.44 total; Used 81% CPU, memory: 206136 kB

    wine GLZAdecode.exe enwik8.e enwik8.d
    D.Time: 35.95 s user, 1.12 s system, 0:20.49 total; Used 180% CPU, memory: 176032 kB

    enwik8 100000000 -> enwik8.e 20806740
    Last edited by quadro; 26th May 2015 at 23:35.

  39. #657
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    I tried to test GLZA with WCC2015 and eventually there was a reported error in the decompression test with GAME1. Here are the partial results.
    APP1:
    79^ SIZE=26418097 Comp.=1697.60 s. Dec=4.87 s. C.EFF. = 55380 D. EFF.=1213 C./D. EFF.=28296

    APP2:
    34^ SIZE=39134183 Comp.=7317.54 s. Dec= 8.66 s. C.EFF. = 235727 D. EFF.=1842 C./D. EFF.=118784

    APP3:
    84^ SIZE= 22696000 Comp.=1054.11 s. Dec. =5.23 s. C.EFF. = 34639 D. EFF. = 1075 C./D. EFF.=17857

    APP4:
    64^ SIZE=66490961 Comp.= 9237.90 s. Dec.= 12.96 s. C.EFF. =298272 D. EFF. =3074 C./D. EFF.= 150673

    AUDIO:
    73^ SIZE=90004250 Comp.=5829.02 s. Dec.=23.27 s. C.EFF. =190129 D. EFF. =4345 C./D. EFF.= 97237
    Last edited by Nania Francesco; 26th May 2015 at 16:23.

  40. #658
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Nania Francesco View Post
    I tried to test GLZA with WCC2015 and eventually there was a reported error in decompression tests with GAME1.
    Is there a link somewhere to download GAME1?

  41. #659
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Read private message!

  42. #660
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Nania Francesco View Post
    Read private message!
    Thanks. Was the reported error about exceeding the symbol limit or did it crash?
