
Thread: Tree alpha v0.1 download

  1. #721
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by FatBit View Post
    Sorry, I am scatterbrained and was reading too fast. I used cmd

    Rem SMILE format file processing
    Set DictionarySize=100
    GLZAFormat32.exe %1 %1.glzf
    GLZACompress32.exe %1.glzf %1.glzc
    GLZAEncode32.exe %1.glzc %1.glze
    GLZAEncode32.exe -v -m%DictionarySize% %1.glzc %1.glze > %1.SymbolList.asc

    not

    Rem SMILE format file processing
    Set DictionarySize=100
    GLZAFormat32.exe %1 %1.glzf
    GLZACompress32.exe -m%DictionarySize% %1.glzf %1.glzc
    GLZAEncode32.exe %1.glzc %1.glze
    GLZAEncode32.exe -v %1.glzc %1.glze > %1.SymbolList.asc

    Do you think (if you use a freeware development environment) it would be possible to zip everything into one file and send it to me? As we discussed previously, I will have to modify the line with pattern recognition.
    I am not familiar with compiling programs; for the last ten or more years I have "produced" office macros etc.

    FatBit apologizes + Best regards
    Ah, that makes sense. BTW, you don't need to run GLZAEncode more than once; the last invocation in each example would suffice.

    Zip all of exactly what in one file?
    Last edited by Kennon Conrad; 31st January 2016 at 11:48.

  2. #722
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Yes, please, as one 7z or zip file + how you compile.

    Now all is ok, data files, logs and results are enclosed.

    Remark - the data files are NOT exactly the same; one is sorted with duplicates removed.

    Best regards,

    FatBit
    Attached Files

  3. #723
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by FatBit View Post
    Yes, please, as one 7z or zip file + how you compile.

    Now all is ok, data files, logs and results are enclosed.

    Remark - the data files are NOT exactly the same; one is sorted with duplicates removed.

    Best regards,

    FatBit
    I was planning to wait for the next release because the code base is a little bloated with separate versions of some programs, but here it is. To compile, I use minGW and just type "gcc -O2 -o <program> <program>.c -static -lpthread", except I leave off the -lpthread part for GLZAformat and GLZAencode since they are not multithreaded.
    Attached Files

  4. #724
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    thank you for your answer and the file. My question was about the compiler. You use minGW. Is it important which version, are there any special settings, etc., to be sure I reproduce exactly the same exe as you?
    Are you interested in my results after my score node function modification?

    Best regards,

    FatBit

  5. #725
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by FatBit View Post
    Dear Mr. Conrad,

    thank you for your answer and the file. My question was about the compiler. You use minGW. Is it important which version, are there any special settings, etc., to be sure I reproduce exactly the same exe as you?
    Are you interested in my results after my score node function modification?

    Best regards,

    FatBit
    I have been using minGW 4.9.0, but update occasionally. I do not use any special settings, just the command I posted above. Compilation should work with any version but the binaries may not match and the execution time may be a little different. I think you will need POSIX for the threading libraries.

    Yes, I am curious how the results will look. Allowing the carriage return and line feed at the start of matches is probably the way to go since you will be sorting, but it's a little trickier than what I posted because you have to deal with both characters appearing at the start of a string and then some chemical formula letters. It might be best to add a variable to track whether formula letters have been seen yet and then only block matches when the variable is set and you see a carriage return or line feed.
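
    A minimal sketch of that idea (hypothetical code, not the actual GLZA source; the function and variable names are made up for illustration):

    Code:
    /* Block a candidate string only if a CR or LF appears after chemical
       formula letters have already been seen, as described above. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    bool match_allowed(const uint8_t* str, size_t len) {
      bool formula_seen = false;
      for (size_t i = 0; i < len; i++) {
        uint8_t c = str[i];
        if (c == '\r' || c == '\n') {
          if (formula_seen)
            return false;      /* line break inside a formula: block the match */
        } else {
          formula_seen = true; /* any other character counts as formula content */
        }
      }
      return true;
    }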

  6. #726

  7. #727
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thank you. For now, GLZA cannot handle files greater than 2 GB, and very little testing has been done over 1 GB. I do not know how to get things like SEEK_END to work right with fread in C on Windows.

  8. #728
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    542
    Thanks
    194
    Thanked 174 Times in 81 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Thank you. For now, GLZA cannot handle files greater than 2 GB, and very little testing has been done over 1 GB. I do not know how to get things like SEEK_END to work right with fread in C on Windows.
    You can look at the source code of Precomp for one way to do it, both on Windows and Linux. Basically, it uses the 64-bit capable file functions fgetpos, fsetpos, .... You can either compile on a 64-bit machine, define _FILE_OFFSET_BITS as 64, or use the 64-bit function variants. Getting the 64-bit file size was a bit trickier; see the fileSize64 function.

    All the read/write functions like fgetc/fputc, fread/fwrite and so on still work this way.
    http://schnaader.info
    Damn kids. They're all alike.

  9. The Following User Says Thank You to schnaader For This Useful Post:

    Kennon Conrad (21st February 2016)

  10. #729
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    In zpaq I use the 64 bit versions fseeko() and ftello() like this:
    Code:
    #ifdef _MSC_VER  // Microsoft C++
    #define fseeko(a,b,c) _fseeki64(a,b,c)
    #define ftello(a) _ftelli64(a)
    #else
    #ifndef unix
    #ifndef fseeko
    #define fseeko(a,b,c) fseeko64(a,b,c)
    #endif
    #ifndef ftello
    #define ftello(a) ftello64(a)
    #endif
    #endif
    #endif
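
    For example, a minimal sketch (not from zpaq) of getting the size of a file that may exceed 2 GB, assuming the macro block above is in effect:

    Code:
    #include <stdio.h>
    #include <stdint.h>

    int64_t file_size_64(FILE* f) {
      if (fseeko(f, 0, SEEK_END) != 0) return -1; /* seek to end with a 64-bit offset */
      int64_t size = (int64_t)ftello(f);          /* 64-bit position = file size */
      fseeko(f, 0, SEEK_SET);                     /* rewind for subsequent reads */
      return size;
    }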

  11. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (27th February 2016)

  12. #730
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.4

    GLZA v0.4 has lots of small changes compared to GLZA v0.3. In general the code is a little cleaner and more portable; it now works when compiled on Win32 and Linux systems. The changes also include the following:

    GLZAcompress: New string scoring formula. The new formula calculates the estimated entropy reduction for each substitution candidate and multiplies the calculated profit by (the new symbol entropy divided by the string entropy) raised to the power p, where p can be set with the command line option -p. The default value for p is 2.0. Setting p to 0.0 is essentially the most compressive choice; bigger p values bias the program toward picking longer strings.
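
    As a rough illustration of that formula (a sketch only; the names are hypothetical and this is not the actual GLZAcompress code):

    Code:
    #include <math.h>

    /* Score = estimated profit (entropy reduction, in bits) scaled by
       (new symbol entropy / string entropy)^p, per the description above. */
    double score_candidate(double profit_bits, double new_symbol_entropy,
                           double string_entropy, double p /* -p, default 2.0 */) {
      return profit_bits * pow(new_symbol_entropy / string_entropy, p);
    }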

    GLZAencode: Added the -m command line option to allow the user to override GLZAencode's decision on whether or not to use MTF. -m0 disables MTF and -m1 forces MTF on. Some of the models were tweaked and the model initialization was improved.

    GLZAcompressFast: Fixed bugs that caused crashes on many files. Changed the scoring formula. Added a bit of lookahead to the new dictionary symbol selection code.

    In general compression ratios are a little better (but not always), especially for small text files. For instance, grammar.lsp from the Canterbury Corpus now compresses to 1203 bytes instead of 1349 bytes.


    Results for enwik9 with GLZAcompress using command line options -m5.5 -p3.5:
    165,117,094 bytes in 5971 seconds, 6024 MB RAM; decodes in 14.9 seconds

    Results for enwik8 with GLZAcompress using command line options -m5.5 -p3.5:
    20,497,514 bytes in 294 seconds, 1414 MB RAM; decodes in 1.8 seconds

    A .zip file of GLZAdecode.c, GLZA.h and GLZAmodel.h is 20,056 bytes.


    Compression results for GLZAcompressFast and other high compression ratio LZ compressors on the .txt files from Canterbury Corpus and Large Canterbury Corpus:

    alice29.txt (152,089 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 43,517 bytes in 0.078 sec, 41 MB RAM
    Brotli: 46,510 bytes in 0.187 seconds, 19 MB RAM
    LZHAMTEST: 51,398 bytes in 0.062 seconds, 15 MB RAM (8 threads)
    LzTurbo: 49,245 bytes in 0.046 seconds, 7 MB RAM

    asyoulik.txt (125,179 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 39,517 bytes in 0.062 sec, 41 MB RAM
    Brotli: 42,739 bytes in 0.156 seconds, 16 MB RAM
    LZHAMTEST: 47,258 bytes in 0.062 seconds, 14 MB RAM (8 threads)
    LzTurbo: 45,301 bytes in 0.046 seconds, 6 MB RAM

    lcet10.txt (426,754 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 104,237 bytes in 0.139 sec, 44 MB RAM
    Brotli: 113,455 bytes in 0.609 seconds, 46 MB RAM
    LZHAMTEST: 125,751 bytes in 0.171 seconds, 26 MB RAM (8 threads)
    LzTurbo: 120,501 bytes in 0.171 seconds, 11 MB RAM

    plrabn12.txt (481,861 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 139,838 bytes in 0.155 sec, 44 MB RAM
    Brotli: 163,332 bytes in 0.671 seconds, 52 MB RAM
    LZHAMTEST: 174,383 bytes in 0.171 seconds, 28 MB RAM (8 threads)
    LzTurbo: 166,085 bytes in 0.187 seconds, 11 MB RAM

    world192.txt (2,473,400 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 443,501 bytes in 0.717 sec, 49 MB RAM
    Brotli: 475,298 bytes in 3.922 seconds, 211 MB RAM
    LZHAMTEST: 509,970 bytes in 1.390 seconds, 47 MB RAM (8 threads)
    LzTurbo: 475,925 bytes in 1.375 seconds, 41 MB RAM

    bible.txt (4,047,392 bytes):
    GLZAformat + GLZAcompressFast + GLZAencode: 767,165 bytes in 1.153 sec, 273 MB RAM
    Brotli: 891,507 bytes in 6.578 seconds, 220 MB RAM
    LZHAMTEST: 903,251 bytes in 1.781 seconds, 61 MB RAM (8 threads)
    LzTurbo: 879,673 bytes in 2.328 seconds, 64 MB RAM

    In all 6 cases GLZAformat + GLZAcompressFast + GLZAencode provides significantly better compression ratios. Except for the smallest two files, compression time is less. For the largest two files, compression is more than 5x faster than Brotli and about 2x faster than LzTurbo.
    Attached Files

  13. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Paul W. (11th March 2016),Sportman (14th March 2016)

  14. #731
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I updated LTCB with your results. http://mattmahoney.net/dc/text.html#1651

  15. #732
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    I don't want to be unpleasant, but compression speed measurements on small files, especially when I/O is involved, are not precise enough.
    But on enwik9, lzturbo without multithreading and on an older CPU compresses more than 3 times faster than glza (see LTCB); decompression is also faster.
    You are also not reporting the number of threads used for glza.

    LzTurbo is not only a text compressor like glza; it is a general purpose compressor.
    Glza is more comparable to bwt compressors like bsc or nanozip.

    I suggest again that you also compare glza on enwik8/9 against the text compressor nanozipltcb on the same hardware.

    From an algorithmic view it is also interesting how glza compares with "text preprocessing" (like xwrt or glza preprocessing) + lz77.
    In this case preprocessing can reduce the file size to 40-50%, which can then be compressed a lot faster with lz77.

    Is it possible to post a preprint of "Compression Ratios with LZ-Class Decompression Speed"?

  16. #733
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I updated LTCB with your results. http://mattmahoney.net/dc/text.html#1651
    Thanks. The decompressor size should be 20,056 bytes. Would you mind fixing that?

  17. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Matt Mahoney (12th March 2016)

  18. #734
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by dnd View Post
    I don't want to be unpleasant, but compression speed measurements on small files, especially when I/O is involved, are not precise enough.
    But on enwik9, lzturbo without multithreading and on an older CPU compresses more than 3 times faster than glza (see LTCB); decompression is also faster.
    You are also not reporting the number of threads used for glza.
    Excuse me, but I think there has been quite a bit of compression speed testing of the files in the Canterbury corpora. Small files are not good for determining exact percentages, but there is nothing wrong with posting and observing the general results. If anything, the results were biased against GLZA since it writes and reads two intermediate files and has to reprocess the data, and has three separate times, each rounded up, instead of one (best of >= 5 attempts).

    You seem to be missing the difference between using GLZAcompress and using GLZAcompressFast. On enwik9, GLZAcompressFast is approximately 7x faster than GLZAcompress and almost 2 times faster than lzturbo. GLZAcompressFast and the other programs for which I did not list the number of threads use one thread.

    For enwik9, GLZAcompressFast+TreeEncode/TreeDecode = faster compression, better compression ratio, and faster decompression than lzturbo in single threaded -49 (best compression) mode.

    Quote Originally Posted by dnd View Post
    LzTurbo is not only a text compressor like glza; it is a general purpose compressor.
    Glza is more comparable to bwt compressors like bsc or nanozip.
    GLZA is NOT only a text compressor. It is best suited for text but is capable of compressing other types of files. GLZA achieves better average compression ratios than lzturbo on the Calgary Corpus, the Canterbury Corpus, the Large Canterbury Corpus (including E.coli) and the Silesia Corpus.

    GLZA is much more comparable to other LZ compressors than to BWT compressors. The asymmetric compression time/decompression time and RAM usage are consistent with other LZ compressors and unlike BWT compressors. GLZA has been on the Pareto Frontier for enwik9 decompression RAM usage. AFAIK, BWT decompressors are typically memory hogs.

    Quote Originally Posted by dnd View Post
    I suggest again that you also compare glza on enwik8/9 against the text compressor nanozipltcb on the same hardware.
    Maybe someday. It's not a high priority since nanozipltcb is not a general purpose compressor and is a completely different animal.

    Quote Originally Posted by dnd View Post
    Is it possible to post a preprint of "Compression Ratios with LZ-Class Decompression Speed"?
    The paper will be posted on encode.su when it is done, immediately prior to DCC 2016. Thank you for asking.

  19. #735
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    GLZA v0.4 has lots of small changes compared to GLZA v0.3.
    Input:
    20,518,953 bytes, 147 HTML files, front pages of top domains, 10 from each of 15 TLDs, and 14 languages.

    Output:
    4,721,070 bytes, 0.694 sec, gzip 1
    4,665,424 bytes, 0.515 sec, bro 0
    4,573,257 bytes, 0.699 sec, gzip 2
    4,464,299 bytes, 0.718 sec, gzip 3
    4,416,461 bytes, 0.535 sec, bro 1
    4,343,877 bytes, 0.511 sec, zstd 1
    4,263,541 bytes, 0.516 sec, zstd 2
    4,214,300 bytes, 0.765 sec, gzip 4
    4,193,271 bytes, 0.563 sec, zstd 3
    4,142,743 bytes, 0.580 sec, bro 2
    4,084,855 bytes, 0.607 sec, zstd 4
    4,083,316 bytes, 0.815 sec, gzip 5
    4,042,256 bytes, 0.604 sec, bro 3
    4,020,976 bytes, 0.878 sec, gzip 6
    4,006,714 bytes, 0.915 sec, gzip 7
    3,990,195 bytes, 1.023 sec, gzip 8
    3,987,255 bytes, 1.125 sec, gzip 9
    3,914,830 bytes, 0.609 sec, zstd 5
    3,862,690 bytes, 0.689 sec, bro 4
    3,855,103 bytes, 0.649 sec, zstd 6
    3,797,696 bytes, 0.691 sec, zstd 7
    3,781,379 bytes, 0.726 sec, zstd 8
    3,760,955 bytes, 0.768 sec, zstd 9
    3,742,919 bytes, 0.849 sec, zstd 10
    3,740,046 bytes, 0.958 sec, zstd 11
    3,728,670 bytes, 1.123 sec, zstd 12
    3,725,722 bytes, 1.156 sec, zstd 13
    3,721,856 bytes, 1.574 sec, zstd 14
    3,696,612 bytes, 1.853 sec, zstd 15
    3,669,233 bytes, 2.969 sec, zstd 17
    3,668,762 bytes, 2.711 sec, zstd 16
    3,658,770 bytes, 3.539 sec, zstd 18
    3,630,933 bytes, 3.883 sec, zstd 19
    3,610,115 bytes, 0.860 sec, bro 5
    3,605,099 bytes, 6.044 sec, zstd 21
    3,605,099 bytes, 6.044 sec, zstd 20
    3,579,147 bytes, 0.923 sec, bro 6
    3,547,521 bytes, 1.163 sec, bro 7
    3,537,036 bytes, 1.384 sec, bro 8
    3,524,334 bytes, 1.750 sec, bro 9
    3,289,709 bytes, 156.650 sec, glza
    3,204,431 bytes, 32.427 sec, bro 11

    GLZA never finished on the HTML code with the JavaScript redirect from the http://www.cctv.com/ homepage (before the auto redirect), so I used the page it redirected to in all tests.

    Quote Originally Posted by Kennon Conrad View Post
    GLZAcompressFast: Fixed bugs that caused crashes on many files.
    It still crashed on multiple files or never finished.

  20. #736
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    GLZA never finished on the HTML code with the JavaScript redirect from the http://www.cctv.com/ homepage (before the auto redirect), so I used the page it redirected to in all tests.


    It still crashed on multiple files or never finished.
    If you can provide a sample I will make sure it works.

  21. #737
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    dnd,

    From some preliminary experiments (your mileage may vary, usual hedges) it looks like GLZA 0.4 compresses text considerably better than XWRT+LZMA, and close to PPMd -o16 -m256 -r1.

    (I'm talking about regular very-asymmetrical GLZA, not GLZAcompressFast. Haven't tried that yet.)

    That's for 10 randomly selected 12,000,000-byte chunks of the (concatenated) plain text files from the Project Gutenberg "Best of" CD-ROM, so it should more or less resemble what you'd get for "solid mode" compression of something like that---a bunch of books per chunk. (The chunk size was chosen to enable comparisons to PPMd -m256 -r1 without quite hitting PPMd's memory limitations.)

    That's for "xwrt -l0" (to preprocess only) and " lzma -9e" (to compress using a bigger model and more aggressive parsing than xwrt's options for its built-in LZMA). I didn't try XWRT+PPMd, assuming decompression would still be pretty slow (and not feeling like doing experiments with different model orders to trade off ratio vs. time). I didn't enable XWRT's prebuilt English-and-XML-oriented dictionary, figuring it wouldn't help (or not much, and might hurt) for large blobs of plain text; AFAIK it mostly helps on smaller files compressed alone.

    Anyway, in that experiment

    1. PPMd wins by only a little bit on ratio, even with its biggest model and highest order
    2. GLZA 0.4 is less than two percent worse
    3. XWRT-LZMA is a shade over 7 percent worse than GLZA on average (at least 5.0 and up to 9.7 percent), and
    4. LZMA alone is a few percent worse than that

    This is consistent with my working hypothesis that LZ77 is just the wrong algorithm for compressing text. You can make it somewhat better by implementing a preprocessor with a whole different dictionary at a higher level, but LZ77 offsets-and-literals isn't a great intermediate representation, and you might as well just use a fundamentally better dictionary compressor, particularly a grammar-based compressor.

    UPDATE: I found better parameters for xwrt, and I computed more averages, and learned some things. GLZA still beats xwrt by a considerable margin on this test (12-million byte chunks of a collection of plain text books), but xwrt+lzma is beating lzma alone by about as much. (So GLZA is beating plain LZMA by more than I realized, too.) A different test suggests that a significant chunk of the benefit of xwrt is coming from EOL modeling, which GLZA doesn't have yet. More later when I'm satisfied I'm using xwrt reasonably well.
    Last edited by Paul W.; 30th March 2016 at 21:10.

  22. The Following 2 Users Say Thank You to Paul W. For This Useful Post:

    dnd (19th March 2016),Kennon Conrad (15th March 2016)

  23. #738
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    If you can provide a sample I will make sure it works.
    <script type="text/javascript">var nowUrl = window.location.href;if(nowUrl.indexOf('pageId=000 000000000000000000000000000')==-1){var url='http://www.cctv.com/default.shtml';window.location.href=url;}</script>

  24. The Following User Says Thank You to Sportman For This Useful Post:

    Kennon Conrad (15th March 2016)

  25. #739
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.4.1

    There was a problem with GLZAencode getting into an infinite loop on files whose most frequent grammar symbol occurred <= 15 times. I fixed that problem, fixed a problem with processing empty files, and modified GLZAencode to print out the final grammar size and the number of grammar rules. The grammar size is the sum of the sizes of the production rules, where the size of a production rule is 1 plus the length of its right side (for example, a rule A -> B C d would have size 4). Thank you Sportman for reporting the problem!
    Attached Files
    Last edited by Kennon Conrad; 16th March 2016 at 10:58.

  26. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (18th March 2016)

  27. #740
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA v0.5

    GLZA v0.5 has a lot of changes vs. v0.4, mostly targeted at improving compression ratios but also at improving memory use. The most significant changes I remember off the top of my head are:

    GLZAformat:

    Improved delta encoding. Instead of just subtracting bytes for strides of 1, 2 or 4, it now detects strides of 1 - 4, for even strides attempts to determine whether the data is big-endian or little-endian, and for strides of 4 attempts to determine whether the data has one 4-byte channel or two 2-byte channels. Added reordering of data blocks by "columns" for files with strides of 2 and 4. It is not perfect, but it helps compression significantly on GEO, mr, x-ray, .wav files, .bmp files, .tif files, etc.
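
    For reference, a minimal sketch of the underlying stride-based delta transform (hypothetical code, not the actual GLZAformat source; the endianness and channel handling are omitted):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* In-place byte-wise delta coding with a fixed stride: each byte becomes
       the difference from the byte one stride earlier. */
    void delta_encode(uint8_t* buf, size_t len, size_t stride) {
      for (size_t i = len; i-- > stride; )  /* walk backwards so earlier bytes stay intact */
        buf[i] = (uint8_t)(buf[i] - buf[i - stride]);
    }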

    Reworked and lowered the threshold for classification of data as text. Added capital lock encoding (can be disabled with -l0) and automatic space after linefeed additions for capital encoded files.

    GLZAcompress:

    The suffix tree nodes now have a length field and the code uses this to build and use a more normal suffix tree (with branches containing more than one symbol). Now it is normal except that if the non-overlapping count changes there will be an extra branch/node for each non-overlapping count. The node sibling array size was changed from 4 to 2 to reduce the node size. These changes could have been used to reduce memory usage, but for large files the savings is instead used (by default) to read more data. The typical corpus files now usually fit in memory right from the start, but it can take quite a while to build and evaluate the suffix tree. Memory usage can be set with the -r# option, with # being millions of bytes (this was previously -m# and worked differently).

    The default production cost is no longer fixed at 6.0 bits. Now, at the start of each deduplication cycle, it is set to 1.5 plus log2(number of symbols / alphabet size).

    For capital encoded files, the first deduplication pass only analyzes strings that start with a space and end before a space. This can be disabled with -w0.

    GLZAencode:
    Removed a couple of inefficiencies, tweaked some models and improved memory use.

    GLZAdecode:
    Moved capital decoding to thread #2. Added string length fields to the production rule structure to support hierarchical dictionary strings and reuse of dictionary memory. Also added code to reuse expired production rule structures. Memory use is much better.


    Overall compression is generally a little better (and a lot better on certain files) but occasionally worse. Compression can be significantly slower, especially at the start. Decompression is typically about 10% slower. Decompression RAM use is quite good. I see enwik9 decompression RAM of less than 200 MB of physical memory (for default parameters) vs. 330 MB for v0.4 (which was already Pareto Frontier).


    Silesia corpus results:
    dickens: 2,279,946 bytes
    mozilla: 15,256,099 bytes
    mr: 2,364,198 bytes
    nci: 1,481,796 bytes
    ooffice: 2,535,426 bytes
    osdb: 2,583,662 bytes
    reymont: 1,056,915 bytes
    samba: 3,796,034 bytes
    sao: 4,787,155 bytes
    webster: 6,570,397 bytes
    x-ray: 3,790,684 bytes
    xml: 386,249 bytes
    Total: 46,888,561 bytes

    Compression of DNA files seems to be pretty good now. For instance, GLZA v0.5 gets 1.97 bpB for E.coli vs. 2.02 for GLZA v0.4. It also does better than it used to on certain weird files now such as aaa.txt (100000 bytes -> 99 bytes) and alphabet.txt (100000 bytes -> 35 bytes). And lastly, here is a comparison of compressed file sizes vs. Brotli, Lzham and Lzturbo on the Canterbury Corpus .txt files:

    GLZA/Brotli/LZHAMTEST/LzTurbo:

    alice29.txt: 40,737 / 46,510 / 51,398 / 49,245 (114.2% / 126.2% / 120.9%)
    asyoulik.txt: 38,841 / 42,739 / 47,258 / 45,301 (110.0% / 121.7% / 116.6%)
    lcet10.txt: 97,385 / 113,455 / 125,751 / 120,501 (116.5% / 129.1% / 123.7%)
    plrabn12.txt: 137,017 / 163,332 / 174,383 / 166,085 (119.2% / 127.3% / 121.2%)
    world192.txt: 409,783 / 475,298 / 509,970 / 475,925 (116.0% / 124.4% / 116.1%)
    bible.txt: 713,265 / 891,507 / 903,251 / 879,673 (125.0% / 126.6% / 123.3%)


    I think GLZA is now reasonably complete from an experimental standpoint. There are many additional things that could be tried that add complexity (decoding time), but I think it is now pretty good for an initial implementation of GLZ. It seems the experiment has been successful in showing that text files can be compressed more effectively than with LZ77-style algorithms. Now I need to decide whether to turn this into a plug-in or keep exploring some more complicated options that are probably only interesting from a science standpoint, if at all.
    Attached Files
    Last edited by Kennon Conrad; 9th June 2016 at 08:57.

  28. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Matt Mahoney (13th June 2016),Sportman (11th July 2016)

  29. #741
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Comparison of GLZA 0.2 and 0.5

    Code:
      Silesia dicke mozil   mr   nci ooff  osdb reym samba  sao webst x-ray  xml Compressor -options
    --------- ----- ----- ---- ----- ---- ----- ---- ----- ---- ----- ----- ---- -------------------
     46888561  2279 15256 2364  1481 2535  2583 1056  3796 4787  6570  3790  386 glza 0.5
     48330061  2317 15352 2678  1617 2544  2563 1062  3905 4770  6808  4300  408 glza 0.2

  30. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (14th June 2016)

  31. #742
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I ran some tests on enwik9 to find the current Pareto frontier for decompression RAM vs. compressed file size. For GLZA, all tests were run with the -l0 option for GLZAformat (disable capital lock encoding) and the -p3.5 (profit purity power, 0 = most compressive, higher values favor longer strings) and -c# (production rule cost in bits, # below) options for GLZAcompress. For PLZMA, LZHAM and LzTurbo, options for maximum compression were used. RAM is physical RAM as measured with timer64. Results were as follows:

    Compressed file size/decode time/RAM
    GLZA -c5.5: 164,777,214/16.2 sec/194 MB
    GLZA -c10: 165,154,202/15.6 sec/162 MB
    GLZA -c20: 166,200,811/14.7 sec/120 MB
    GLZA -c40: 168,613,852/13.9 sec/83 MB
    GLZA -c100: 173,592,934/13.1 sec/48 MB
    GLZA -c200: 179,198,500/13.0 sec/31 MB
    GLZA -c400: 186,018,893/12.9 sec/20 MB
    PLZMA: 193,240,163/52.0 sec/145 MB
    LzTurbo: 193,605,881/12.2 sec/1141 MB
    GLZA -c1000: 196,500,926/13.6 sec/12 MB
    LZHAM: 202,211,210/4.7 sec/514 MB
    GLZA -c2000: 205,666,995/14.5 sec/9 MB
    GLZA -c4000: 216,209,970/15.6 sec/7 MB
    Brotli: 223,597,274/2.6 sec/18 MB

    AFAIK, all GLZA results above are Pareto Frontier for decompression RAM.

  32. #743
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Could be. LTCB memory usage is only for compression. For CM, PPM, and usually BWT the decompression memory is about the same. GLZA is the top ranked compressor not using one of these.

  33. #744
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think GLZA is now reasonably complete from an experimental standpoint. There are many additional things that could be tried that add complexity (decoding time), but I think it is now pretty good for an initial implementation of GLZ. It seems the experiment has been successful in showing that text files can be compressed more effectively than with LZ77-style algorithms. Now I need to decide whether to turn this into a plug-in or keep exploring some more complicated options that are probably only interesting from a science standpoint, if at all.
    FWIW, I just took a quick look at the code, and I have a few suggestions that could help make it a bit easier for other code to include, as well as some things which (IMHO) should be fixed regardless of whether you decide to encourage that. In no particular order:


    • I see you rely on pthreads, which is a bit tricky on Windows. It's a bit self-serving, but I maintain a project called TinyCThread which is an implementation of the C11 threads API which works with pthreads or Windows. It's one source file and one header, easy to drop into your project and use, and if you're on a platform which supports C11 threads natively it will just use that instead…
    • Replace all the fprintf calls with a macro which can be defined away by the preprocessor.
    • Please just drop the current contents of GLZA.h, just use the C99 types. Even MSVC has inttypes.h these days…
    • It would be relatively easy to just strip out everything but the argument passing stuff from main(), and have a few functions which accept FILE* arguments. Not perfect, but a good first step.
    • exit(0) and returning 0 from main() typically indicate that the process was successful, but you use it on errors. See man 3 exit (in particular, the "Notes" section). You probably want to use EXIT_FAILURE on error and EXIT_SUCCESS on success.
    • Don't rely on implicit int types for function parameters. Preprocessor macro arguments are untyped, but arguments to functions like InitNewLastCharBin aren't.
    • You might want to consider replacing a lot of those macros with static functions. That gives you type safety, better debugging, more optimization options, and better tooling support.
    • Please output a help screen if no arguments are passed or -h or --help is passed. If I were you I would consider using something like getopt to parse arguments… if you want something portable, we use parg in Squash and I'm quite happy with it.
    • You should probably use atomics instead of a volatile variable. Unfortunately there isn't really a good lightweight abstraction layer, but there are C11 atomics, gcc extensions, clang extensions, Windows' interlocked operations, etc. Volatile doesn't necessarily do what you want (IIRC operations can be reordered across loads/stores on volatiles), and it prevents a lot of optimizations.


    If you're interested, I would be willing to help with at least some of those, though I can't promise too much time…
    Last edited by nemequ; 20th June 2016 at 05:48.

  34. The Following User Says Thank You to nemequ For This Useful Post:

    Bulat Ziganshin (20th June 2016)

  35. #745
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    It is very helpful. Not many people take the time to look at the code and provide comments. My programming skills are not the best, but hopefully a little better than when I started. Threads are new to me for this project, and most or all of my past experience with volatiles is with assembly code where you don't need to worry about the compiler changing the order of execution. I thought they weren't supposed to do that with C either, but am not certain.

    I have a few questions/comments that I added in italics into your post:

    Quote Originally Posted by nemequ View Post
    FWIW, I just took a quick look at the code, and I have a few suggestions that could help make it a bit easier for other code to include, as well as some things which (IMHO) should be fixed regardless of whether you decide to encourage that. In no particular order:


    • I see you rely on pthreads, which is a bit tricky on Windows. It's a bit self-serving, but I maintain a project called TinyCThread which is an implementation of the C11 threads API which works with pthreads or Windows. It's one source file and one header, easy to drop into your project and use, and if you're on a platform which supports C11 threads natively it will just use that instead… Can you explain why pthreads is tricky with Windows or provide a link? I would like to understand this. I literally looked it up on the internet and pthreads is what I found.
    • Replace all the fprintf calls with a macro which can be defined away by the preprocessor. Is that better than a command line option (which I was thinking about)? Is it for a smaller executable?
    • Please just drop the current contents of GLZA.h, just use the C99 types. Even MSVC has inttypes.h these days… You are talking about types like uint_32, right? I started out with things like unsigned int but learned it's not so portable. Those are a little ugly but I suppose I can get used to them.
    • It would be relatively easy to just strip out everything but the argument passing stuff from main(), and have a few functions which accept FILE* arguments. Not perfect, but a good first step. I was actually getting ready to work in this area but then found a couple of issues and inefficiencies with the current release that sidetracked me. I think a v1.0 should be a single program and not save intermediate files (maybe with exceptions for testing/optimization).
    • exit(0) and returning 0 from main() typically indicate that the process was successful, but you use it on errors. See man 3 exit (in particular, the "Notes" section). You probably want to use EXIT_FAILURE on error and EXIT_SUCCESS on success. Okay, I can fix this.
    • Don't rely on implicit int types for function parameters. Preprocessor macro arguments are untyped, but arguments to functions like InitNewLastCharBin aren't. Oops, good catch! Those functions were macros until recently and I forgot to add the types. I am surprised the compiler didn't complain.
    • You might want to consider replacing a lot of those macros with static functions. That gives you type safety, better debugging, more optimization options, and better tooling support. What are your thoughts on inline functions? The macros seem fastest in a lot of cases, otherwise I wouldn't have them, but in some cases they cause the executables to be noticeably bigger.
    • Please output a help screen if no arguments are passed or -h or --help is passed. If I were you I would consider using something like getopt to parse arguments… if you want something portable, we use parg in Squash and I'm quite happy with it. For efficiency, I will wait until I have a single program to add better help.
    • You should probably use atomics instead of a volatile variable. Unfortunately there isn't really a good lightweight abstraction layer, but there are C11 atomics, gcc extensions, clang extensions, Windows' interlocked operations, etc. Volatile doesn't necessarily do what you want (IIRC operations can be reordered across loads/stores on volatiles), and it prevents a lot of optimizations. You are not the first person to say something like this. I guess I just don't understand why a volatile isn't safe and fast compared to other options, as long as 32 bit values are atomic. It seems like the compiler shouldn't move code around that is dependent on reading a volatile register. I hadn't heard of atomic variables, are they in standard C? I don't quite get it, but it looks like the same thing with a new name, and now the convention for C++.


    If you're interested, I would be willing to help with at least some of those, though I can't promise too much time…
    I think helping me understand the reasoning is a good start. I don't mind making code changes in general, so fixing pthreads, if it's bad, would probably be the most helpful.

  36. #746
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Kennon Conrad View Post
    It is very helpful. Not many people take the time to look at the code and provide comments. My programming skills are not the best, but hopefully a little better than when I started. Threads are new to me for this project, and most or all of my past experience with volatiles is with assembly code where you don't need to worry about the compiler changing the order of execution. I thought they weren't supposed to do that with C either, but am not certain.
    You may not have had to worry about the compiler, but the CPU was probably changing the order around. You have to emit special instructions to prevent it (google memory fences). For C11, the key concept is "sequential consistency for data-race-free programs" (SC-DRF), which basically means that as long as you don't write any data races the result of your program must appear as though it had all run in the order you specified. Before C11, though, there was no memory model for C; there were no threads, so it didn't matter. So, if you want to avoid complete insanity in the results, you need to avoid races, and volatile does almost no good. Even if you rely on non-standard compiler extensions, volatile's usefulness is pretty limited.

    Quote Originally Posted by Kennon Conrad View Post
    Can you explain why pthreads is tricky with Windows or provide a link? I would like to understand this. I literally looked it up on the internet and pthreads is what I found.
    Windows doesn't support pthreads; it has its own API for concurrency (like CreateThread instead of pthread_create), so what you're doing is using a third-party library that emulates pthreads on Windows.

    C11 standardizes an API that is basically a simplified version of pthreads, but most common standard libraries don't implement it (musl is probably the most popular exception). TinyCThread implements it using either pthreads or Windows API threads, depending on the platform, just like winpthread (unless the system actually has C11 threads, in which case TinyCThread will #include that and not do anything else).
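
    To give a feel for the C11 threads API, here is a minimal sketch (not taken from GLZA) that works with <threads.h> or TinyCThread's header:

    Code:
    #include <threads.h>
    #include <stdio.h>

    static int worker(void* arg) {
      printf("hello from thread, arg = %d\n", *(int*)arg);
      return 0;
    }

    int main(void) {
      thrd_t t;
      int arg = 42, result;
      if (thrd_create(&t, worker, &arg) != thrd_success)
        return 1;
      thrd_join(t, &result);   /* result receives worker's return value */
      return result;
    }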

    Probably the biggest difference is the size and complexity of the implementation. Like I said, C11 is *simplified*; the requirements are a lot simpler, so the implementation is much simpler, smaller, and lighter. It's also easier to distribute since it's just one source file and one header. Compare the two repositories, you'll quickly get a sense of how much extra winpthreads has to do… for example, https://github.com/vancegroup-mirror...rigin/create.c vs https://github.com/tinycthread/tinyc...cthread.c#L591

    TBH this is probably the most disputable suggestion I made. You do already have a solution which works for you, and it's hard to argue with that. I don't think it's optimal (for example, people have to download it separately in order to compile GLZA?), but it works.

    Quote Originally Posted by Kennon Conrad View Post
    Is that better than a command line option (which I was thinking about)? Is it for a smaller executable?
    It can be combined with a command line option if you want. Have a run-time switch, but #ifdef the whole thing so the switch is only available if it hasn't been disabled at compile-time.

    It's mostly about calling GLZA from other code, not from the command line. How would you feel if every time you called printf the standard C library printed out a debugging message that said you were calling printf, and gave details about the format string, arguments, etc.? People calling GLZA will feel the same way when calling your code. For an executable, though, a run-time switch could make more sense.
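
    A minimal sketch of what I mean (hypothetical names, not something GLZA already has): a logging macro that is gated by a run-time flag but can also be compiled away entirely:

    Code:
    #include <stdio.h>
    #include <stdbool.h>

    #ifndef GLZA_NO_VERBOSE
    static bool glza_verbose = false;   /* set from a command line switch */
    #define GLZA_LOG(...) \
      do { if (glza_verbose) fprintf(stderr, __VA_ARGS__); } while (0)
    #else
    #define GLZA_LOG(...) do { } while (0)  /* defined away by the preprocessor */
    #endif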

    Quote Originally Posted by Kennon Conrad View Post
    You are talking about types like uint_32, right? I started out with things like unsigned int but learned it's not so portable. Those are a little ugly but I suppose I can get used to them.
    Yes, types like uint32_t. If you want to make your code usable as a library you need to stay out of the global namespace; U32 used to be a popular name for a type before C99 was widely available because people needed a type with a specific size (and each definition would be mapped to a native type which happened to be the right size on that particular architecture… maybe U16 was an int, maybe a short), and if people try to use two different libraries that both want to define U32, conflicts will arise. That is where namespacing comes in, and that means you pretty much get to choose between something like GLZA_U32 and uint32_t, at least in your public API. Defining custom types where built-in types will do is a great way to irritate people consuming the API; it's another type to get used to, which may or may not be different from what they expect, and may or may not be compatible with what they are using. You may not think "uint32_t" looks good, but virtually every C programmer knows *exactly* what it means.

    MSVC is pretty much the only compiler which doesn't fully support C99, and since 2013 even they have had stdint.h, so these days it's generally considered best practice to use the types from stdint.h. Even if it doesn't exist, it's easy to find implementations floating around from the days before everyone shipped stdint.h, no need to try to deal with it in your code.

    Quote Originally Posted by Kennon Conrad View Post
    Oops, good catch! Those functions were macros until recently and I forgot to add the types. I am surprised the compiler didn't complain.
    Oh, the compiler complains here. The relevant warning (for gcc, at least) is -Wimplicit-int. I don't know about your platform, but it's on by default on GCC 6.1.1 on Fedora 24.

    Quote Originally Posted by Kennon Conrad View Post
    What are your thoughts on inline functions? The macros seem fastest in a lot of cases, otherwise I wouldn't have them, but in some cases they cause the executables to be noticeably bigger.
    Eh, it's complicated. First, functions with the inline function specifier aren't necessarily inlined by the compiler; the C standard is very clear that such decisions are up to the compiler. I actually really like how the standard puts it:

    A function declared with an inline function specifier is an inline function. Making a function an inline function suggests that calls to the function be as fast as possible.

    The extent to which such suggestions are effective is implementation-defined.
    If you *really* want something inlined, you have to use compiler extensions, like __attribute__((always_inline)) for GCC. Or use a macro, but I think it's pretty clear that I'm not generally a fan of macros replacing functions… it's too easy to introduce subtle bugs or get data types wrong.

    That said, I'm not usually a fan of forcing inlines, unless you're completely sure… inlining code isn't always a good thing. The larger your code the less the CPU can keep cached, and cache misses kill performance. Also, that cache is shared with data, so if the program takes up more space on the CPU the data has less, and you end up with cache misses on the data, too. It also makes LTO annoyingly slow and memory-intensive, but that's probably not critical.

    Leaving the decision up to the compiler means it can make different decisions for different CPUs; perhaps for a CPU with a large cache the compiler will inline more, but for an older CPU with less cache it will inline less. The flexibility also opens the door for some pretty big performance boosts with PGO, though that's a complicated subject.

    Quote Originally Posted by Kennon Conrad View Post
    You are not the first person to say something like this. I guess I just don't understand why a volatile isn't safe and fast compared to other options, as long as 32 bit values are atomic. It seems like the compiler shouldn't move code around that is dependent on reading a volatile register. I hadn't heard of atomic variables, are they in standard C? I don't quite get it, but it looks like the same thing with a new name, and now the convention for C++.
    I started typing a quick reply to this, but after 5 paragraphs and not even really scratching the surface I think it would be better to just tell you that Herb Sutter gave a fantastic talk about this a few years ago called "Atomic Weapons". It's 2 parts, each a little over an hour, and it's *well* worth the time. The very end of part 2 talks about volatile, but it's hard to understand the significance of what he says without watching the rest.


    Frankly, I found the whole thing fascinating.

    Also, note that 32 bit values aren't necessarily atomic, and volatile doesn't imply atomic. Volatile also doesn't really do much for memory order, you need atomics, a mutex, or an explicit fence for that.

  37. The Following User Says Thank You to nemequ For This Useful Post:

    Bulat Ziganshin (20th June 2016)

  38. #747
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nemequ View Post
    You may not have had to worry about the compiler, but the CPU was probably changing the order around. You have to emit special instructions to prevent it (google memory fences). For C11, the key concept is "sequential consistency for data-race-free programs" (SC-DRF), which basically means that as long as you don't write any data races the result of your program must appear as though it had all run in the order you specified. Before C11, though, there was no memory model for C; there were no threads, so it didn't matter. So, if you want to avoid complete insanity in the results, you need to avoid races, and volatile does almost no good. Even if you rely on non-standard compiler extensions, volatile's usefulness is pretty limited.
    For now, I want to focus on GLZAdecode since that's where I care about speed the most. I did a little experiment and only turned the volatile types into volatile atomics. It worked but decoding time went up by 37%. I imagine it only gets worse if I try mutexes, etc. I really don't want to take that type of execution speed penalty so I am not sure what to do.

    GLZA's purpose for using volatiles is to ensure the compiler 1) doesn't optimize the read loops into infinite loops and 2) actually reads the memory location. The details of avoiding races lie in the implementation. I learned today it is called a lock-free single-producer, single-consumer circular queue: http://www.codeproject.com/script/Ar...3510&av=311159. My queue management is different from the standard implementation in how it manages the queue, but that's not really important. The important parts are that it can't write to a location that isn't empty and won't empty a location until after it has processed the value. uint32_t will always be atomic on x86/x64 and a lot of other modern processors even though the C standard doesn't guarantee it, but I do realize this is not the case with some small/old/weird processors.

    AFAIK, the queue is robust on x86/x64 because I have run hundreds of trillions of symbols through the queue on two different computers, never had a problem, and have had no reports of such behavior. It's not proof, but it seems like I should have seen or heard something by now if it was not reliable. Maybe it's bad from a purist standpoint, I can't really argue with that; but it's pretty efficient. It would be nice if there was a way to have the compiler deal with it efficiently besides using volatiles, coding carefully, and assuming variables will be atomic.


    The CreateThread thing seems totally valid. I was okay with downloading POSIX but it is a hassle and some people may not want it. I am not sure what is most common, probably worth looking at some code examples....

    Quote Originally Posted by nemequ View Post
    Oh, the compiler complains here. The relevant warning (for gcc, at least) is -Wimplicit-int. I don't know about your platform, but it's on by default on GCC 6.1.1 on Fedora 24.
    Not mine. There's a couple of switches I have to turn on. It has been that way across several GCC versions. There may be a way to configure it but I haven't looked.

  39. #748
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    C11 atomics are not so much about making sure you read or write a complete value; like you said, integer access is *usually* atomic on most modern architectures. The significant part of atomics is that they prevent the compiler and CPU from reordering loads and stores in ways that they shouldn't.

    Volatiles can't be reordered with respect to one another, but regular loads and stores can be reordered across volatile accesses. Based on a rather brief look at your code, what you seem to have done is basically just make everything that cares about order volatile. This means that, yes, the ordering is correct, but performance is terrible because nothing can be optimized.

    Naively converting every volatile to a atomic (or worse, a volatile and atomic) is going to make performance much worse, as you've seen. Especially if you just use fully sequentially consistent accesses everywhere (which is the default); for example, careful use of acquire and release operations can result in code that is just as fast as regular accesses (no atomics or volatile) on x86, but also safe on other architectures. However, the big gain from atomics for GLZA is that you should be able to just remove the volatile qualifier (not replace it with an atomic) on a lot of variables.
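
    For instance, here is a minimal sketch (hypothetical, not GLZA's actual queue) of a single-producer, single-consumer slot handshake done with C11 acquire/release atomics instead of volatile; on x86/x64 these loads and stores compile down to ordinary moves:

    Code:
    #include <stdatomic.h>
    #include <stdint.h>

    #define SLOT_EMPTY UINT32_MAX

    static _Atomic uint32_t slot = SLOT_EMPTY;

    void produce(uint32_t symbol) {
      /* wait until the consumer has emptied the slot */
      while (atomic_load_explicit(&slot, memory_order_acquire) != SLOT_EMPTY)
        ;
      atomic_store_explicit(&slot, symbol, memory_order_release);
    }

    uint32_t consume(void) {
      uint32_t symbol;
      /* wait until the producer has filled the slot */
      while ((symbol = atomic_load_explicit(&slot, memory_order_acquire)) == SLOT_EMPTY)
        ;
      atomic_store_explicit(&slot, SLOT_EMPTY, memory_order_release);
      return symbol;
    }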

    If I were you, the first thing I would do (at least as far as this issue is concerned, in general the first thing I would do would be to get rid of those enormous macros) would probably be to remove all instances of "volatile" from the code. Then I would compile with ThreadSanitizer enabled (-fsanitize=thread). TSan will emit reports about data races as they happen, and I would just keep fixing them until you can run GLZA without seeing any. If you're smart about how you resolve the races, you should end up with code which is faster, and is safe on all platforms.

  40. #749
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nemequ View Post
    Based on a rather brief look at your code, what you seem to have done is basically just make everything that cares about order volatile. This means that, yes, the ordering is correct, but performance is terrible because nothing can be optimized.
    I'm sorry if I am a little slow at times and obstinate. I am not following what you are saying here. In general, I use lock-free circular queues for interthread communication. Is that what you are referring to when you say "everything that cares about order"?

    I do not consider performance to be terrible. GLZA was rated #1 in text decoding efficiency by the World Compression Challenge 2015: http://heartofcomp.altervista.org/MOC/TEXT/MOCADE.htm. For GLZAdecode, the main drivers for decoding time on large files have been cache misses for symbol data lookups and the modeling. The critical thread for timing (main) rarely read volatile memory locations until v0.5, so I think the impact of using volatiles is pretty small. With v0.5, the symbol data is also volatile but the timing impact was fairly small. But if you know how to make the code run faster, that would be a wonderful thing.

    Quote Originally Posted by nemequ View Post
    Naively converting every volatile to a atomic (or worse, a volatile and atomic) is going to make performance much worse, as you've seen. Especially if you just use fully sequentially consistent accesses everywhere (which is the default); for example, careful use of acquire and release operations can result in code that is just as fast as regular accesses (no atomics or volatile) on x86, but also safe on other architectures. However, the big gain from atomics for GLZA is that you should be able to just remove the volatile qualifier (not replace it with an atomic) on a lot of variables.
    Can you provide an example of a variable that is now volatile that does not need to be either volatile or atomic?

    Quote Originally Posted by nemequ View Post
    If I were you, the first thing I would do (at least as far as this issue is concerned, in general the first thing I would do would be to get rid of those enormous macros) would probably be to remove all instances of "volatile" from the code. Then I would compile with ThreadSanitizer enabled (-fsanitize=thread). TSan will emit reports about data races as they happen, and I would just keep fixing them until you can run GLZA without seeing any. If you're smart about how you resolve the races, you should end up with code which is faster, and is safe on all platforms.
    The code has macros because they make the code faster. Believe me, they wouldn't be there otherwise. At times I have been tempted to remove them anyway and just live with all the extra pushes and pops. If I take all of the volatiles out of GLZAdecode, it will immediately freeze up because this will become an infinite loop:

    while (out_symbol_array[0x80] != MAX_U32_VALUE);


    How can this be made to wait until the condition is true and then exit the loop without a volatile, atomic, or something like that?

  41. #750
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Sorry about the slow response; I always forget to check encode.su unless I'm bored, especially the "Download Area" forum :/

    Quote Originally Posted by Kennon Conrad View Post
    I'm sorry if I am a little slow at times and obstinate. I am not following what you are saying here. In general, I use lock-free circular queues for interthread communication. Is that what you are referring to when you say "everything that cares about order"?
    Not exactly. I'd really encourage you to watch those videos I posted; I know they're long, but they have a lot of very important information and do a much better job of explaining it than I can. Still, I'll give it a shot:

    Let's start with pre-C11 C. There is no memory model; the standard doesn't know anything about threads (there is a bit of an exception for signals, but it's not really anything that can help us, so let's just ignore it). The compiler, the CPU, the memory cache, and anything else are completely free to reorder a program however they see fit, so long as it doesn't alter the *result* of the program.

    So, let's consider a simple example, an overflow-safe addition func:

    Code:
    #include <stdbool.h>
    #include <limits.h>

    /* Stores si_a + si_b into *res and returns true, or returns false
       (leaving *res untouched) if the addition would overflow. */
    bool f(int* res, int si_a, int si_b) {
      if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
          ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        return false;
      } else {
        *res = si_a + si_b;
      }
      return true;
    }
    Now, let's imagine that someone (compiler, CPU, whatever) sees all that branching and decides that instead, what it wants to do is just do the `*res = si_a + si_b;` store right away and keep going, while it does all those checks in the background. If the checks fail, it will just roll back to the old value. And yes, this happens all the time. Effectively, it changed your code to something like:

    Code:
    /* What the transformation above effectively produces: the store to *res
       happens before the overflow checks and is rolled back if they fail. */
    bool f(int* res, int si_a, int si_b) {
      int oldval = *res;
      *res = si_a + si_b;   /* speculative store, briefly visible to other threads */
      if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
          ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        *res = oldval;      /* roll back on failure */
        return false;
      }
      return true;
    }
    The result is the same, so this is perfectly legal. The problem comes when you introduce threads; some other thread can come along and read *res in between when it changes to an invalid value and when it is reverted. That's a data race.

    In this case, I believe a volatile will actually eliminate the problem since you're not allowed to generate spurious writes to a volatile, but there are plenty of reorderings which don't involve a spurious write. To oversimplify, you can have something like

    Code:
    /* shared variables, plain (non-atomic, non-volatile) */
    int a, b, c, d;

    /* thread 1 */
    a = 8;
    b = 13;
    c = 21;
    d = 34;
    Now, let's imagine another thread does this:

    Code:
    /* thread 2 */
    if (d == 34)
      assert(c == 21);
    That assertion can fail; the computer is completely free to execute those stores in whatever order it wants. If the store for c hasn't occurred yet, the value is undefined.

    In order to get around this with volatile, we would have to make all four variables volatile (accessing volatile variables cannot be re-ordered across accessing *other* volatile variables). Simply making d volatile is insufficient, since non-volatile accesses can be moved across a volatile access.

    With atomics, we can give the compiler more flexibility in how it optimizes things, but still stay safe. If we just make d atomic (and let's stick to fully sequentially consistent atomics for now, because if you don't things start to get complicated), the computer is no longer allowed to reorder the stores for a, b, and c to below the store for d, so if another thread comes along and reads d and sees that it is 34, it knows all the stores before it have completed and the data is consistent.
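    To make that concrete, here is a minimal sketch of the same example using C11 atomics (the function names and the split into writer/reader are just for illustration):

    Code:
    #include <stdatomic.h>
    #include <assert.h>

    int a, b, c;      /* plain data */
    _Atomic int d;    /* plain loads/stores on an _Atomic object are seq_cst */

    /* thread 1 */
    void writer(void) {
      a = 8;
      b = 13;
      c = 21;
      atomic_store(&d, 34);   /* the earlier stores cannot be moved below this */
    }

    /* thread 2 */
    void reader(void) {
      if (atomic_load(&d) == 34)
        assert(c == 21);      /* guaranteed: the store to c happens-before this read */
    }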

    If that explanation isn't clear (which wouldn't surprise me, it's a rather complicated area, and hard to simplify), all I can really say is watch those videos. He does a really good job explaining everything, and it doesn't hurt that he has a *lot* more time to do it.

    I do not consider performance to be terrible. GLZA was rated #1 in text decoding efficiency by World Compression Challenge 2015
    Sorry, I phrased that very poorly. I didn't mean that GLZA's performance is terrible, certainly not compared to other compressors targeting a similar ratio; I meant that misusing volatile can lead to terrible performance, especially when compared to careful use of the right type of atomics, where you can sometimes get the guarantees you need with zero performance cost, since x86 tends to provide rather strong guarantees.

    Can you provide an example of a variable that is now volatile that does not need to be either volatile or atomic?
    That may sound easy, but it would require a pretty deep understanding of your codebase, which I *really* don't have time for. All I can really tell you is that if you're relying on volatile for synchronization, you're almost certainly leaving a lot of performance on the table (since the compiler can't optimize volatiles, and operations on non-volatiles are free to move across operations on volatiles), depending on compiler-specific behavior (before C11, lots of compilers assigned extra significance to volatile because they didn't really have a choice; see the very end of https://bartoszmilewski.com/2008/12/...mory-ordering/), or relying on x86's guarantees (which tend to be much stronger than those of other architectures).

    The code has macros because they make the code faster. Believe me, they wouldn't be there otherwise. At times I have been tempted to remove them anyway and just live with all the extra pushes and pops.

    • It may be faster on some platforms but much slower on others. I mentioned earlier that code sharing means less of the CPU cache is taken up by your program, which means there is more room for data, which means the program can run faster on CPUs with smaller caches. If you use functions, the compiler has the freedom to choose whether or not to inline, and it can make a different decision for different CPUs.
    • Macros make code difficult to read, and easy to write incorrectly, since there is no type information. If you've worked with dynamically typed languages, you know what I'm talking about.


    Even if you *really* want to ignore the first point, macros are still inferior; inline might not do the trick (since it's just a suggestion to the compiler), but most compilers have compiler-specific ways of forcing the compiler to do your bidding; GCC (and all the compilers which pretend to be GCC, which is basically everyone but MSVC) have __attribute__((always_inline)), MSVC has __forceinline. However, I would encourage you to just use "inline" (if anything). If you need to squeeze out that last drop of performance then PGO will probably give you much better results.
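    To make the difference concrete, here is a hedged sketch of how one of those macros might become a function without giving up inlining (the names are made up for illustration, not taken from GLZA):

    Code:
    /* per-compiler force-inline keyword */
    #if defined(_MSC_VER)
    #  define GLZA_INLINE __forceinline
    #elif defined(__GNUC__)
    #  define GLZA_INLINE inline __attribute__((always_inline))
    #else
    #  define GLZA_INLINE inline
    #endif

    /* a typed, debuggable replacement for a hypothetical table-lookup macro */
    static GLZA_INLINE unsigned int lookup_symbol(const unsigned int *table, unsigned int index) {
      return table[index];
    }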

    If I take all of the volatiles out of GLZAdecode, it will immediately freeze up because this will become an infinite loop:

    while (out_symbol_array[0x80] != MAX_U32_VALUE);


    How can this be made to wait until the condition is true and then exit the loop without a volatile, atomic, or something like that?
    If out_symbol_array[0x80] is something which is written to by another thread, you probably need *something* for synchronization. Perhaps an atomic_load in the while loop, and an atomic_store in whatever is writing it. Perhaps neither; it's quite possible that synchronizing somewhere else will do the trick. GLZA is a rather large program, and it's not really possible to just look at a tiny piece of it and say what you need to do; the answer depends on everywhere you access that variable, and everything that needs to happen before or after you do so.
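    For what it's worth, here is a minimal sketch of the atomic_load approach for that loop. The declarations are hypothetical (GLZA's real ones aren't shown here) and assume MAX_U32_VALUE is UINT32_MAX:

    Code:
    #include <stdatomic.h>
    #include <stdint.h>

    #define MAX_U32_VALUE UINT32_MAX

    /* hypothetical declaration; the real array lives elsewhere in GLZAdecode */
    extern _Atomic uint32_t out_symbol_array[256];

    /* consumer thread: wait until the producer publishes the sentinel value */
    static void wait_for_symbols(void) {
      while (atomic_load_explicit(&out_symbol_array[0x80], memory_order_acquire)
             != MAX_U32_VALUE)
        ;   /* could yield or pause here instead of pure busy-waiting */
    }

    /* the producer would pair this with something like:
       atomic_store_explicit(&out_symbol_array[0x80], value, memory_order_release); */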
