type time_t <ctime>
Time type
Alias of a fundamental arithmetic type capable of representing times, such as those returned by function time.
For historical reasons, it is generally implemented as an integral value representing the number of seconds elapsed since 00:00 hours, Jan 1, 1970 UTC (i.e., a Unix timestamp), although libraries may implement this type using alternative time representations.
Portable programs should not use values of this type directly, but always rely on calls to elements of the standard library to translate them to portable types.
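For example, a minimal sketch of the portable approach (standard C only; the format string is just an example):
Code:
#include <stdio.h>
#include <time.h>

int main(void) {
    time_t now = time(NULL);          /* opaque arithmetic value            */
    struct tm *utc = gmtime(&now);    /* translate via the standard library */
    char buf[64];
    if (utc != NULL &&
        strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", utc) > 0)
        printf("%s\n", buf);          /* portable, human-readable form      */
    return 0;
}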
times elapsed for clock 'realtime': 3.000057
times elapsed for clock 'monotonic': 3.000057
times elapsed for clock 'process cputime id': 0.000007
times elapsed for clock 'thread cputime id': 0.000007
Precision seems to be pretty good. Check if your compiler gives similar results (I hope so).
Yes, I get similar results. I still don't understand why my compiler gives global time for clock(), but I guess that mystery will remain unsolved. I agree the precision is pretty good, much better than with clock().
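For reference, a minimal sketch of how numbers like the ones above can be produced with clock_gettime (POSIX; the sleep(3) stands in for whatever was actually being timed, and older glibc versions may need -lrt):
Code:
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const clockid_t ids[] = { CLOCK_REALTIME, CLOCK_MONOTONIC,
                              CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID };
    const char *names[] = { "realtime", "monotonic",
                            "process cputime id", "thread cputime id" };
    struct timespec t0[4], t1[4];

    for (int i = 0; i < 4; i++)
        clock_gettime(ids[i], &t0[i]);
    sleep(3);                                   /* stand-in for the timed work */
    for (int i = 0; i < 4; i++) {
        clock_gettime(ids[i], &t1[i]);
        double dt = (double)(t1[i].tv_sec - t0[i].tv_sec)
                  + (double)(t1[i].tv_nsec - t0[i].tv_nsec) / 1e9;
        printf("times elapsed for clock '%s': %f\n", names[i], dt);
    }
    return 0;
}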
I am curious though about one thing you did that I didn't do in post #835:
Maybe silesia.tar has a low compression ratio because GLZA is not good for non-uniform files? Does it remove symbols from the dictionary?
Yes, GLZA is not as effective for non-uniform files as for uniform files. It seems to also generally not be as effective as LZMA for files that are not text-like. GLZA does remove low frequency symbols from the dictionary but this is likely an area that could be improved. Here's a plot of the dictionary size vs. file position for Silesia.tar. In total, there are 1,525,516 dictionary entries, but most come and go:
Here is GLZA version 0.10. The main goal for this version was to simplify the code and improve decompression speed.
Functionally, the "-f" flag (faster compression) is now the default and "-x" flag (extreme compression) can be used to get the previous default slower but usually better compression.
The convoluted multi-mtf queue scheme that could hold up to 1152 symbols has been replaced by a single queue holding up to 256 symbols. To reduce the number of symbol moves between the mtf queue and the dictionary, a new signal type has been added that tells the decoder whether or not to put an mtf-eligible symbol into the queue. So now symbols never fall out of the mtf queue (without being used) and the queue shuffling associated with queue misses is gone. The side benefit is that the queue can be smaller, giving less overhead for queue hits and also lower average queue positions. The complication is that now the code has to figure out which symbols to put in the queue. This is not so easy; I think the code could be better, but it's the best I could do for now.
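This is not GLZA's actual code, but a minimal sketch of the kind of bounded move-to-front queue described above: a hit moves the symbol to the front, and an insert pushes everything back one position, dropping the oldest entry when the queue is full (in GLZA the encoder decides up front whether a symbol goes into the queue, so the decoder never deals with queue misses):
Code:
#include <string.h>

#define MTF_QUEUE_SIZE 256

typedef struct {
    unsigned symbols[MTF_QUEUE_SIZE];
    int count;
} MtfQueue;

/* Queue hit: move the entry at position pos to the front. */
static void mtf_queue_hit(MtfQueue *q, int pos) {
    unsigned symbol = q->symbols[pos];
    memmove(&q->symbols[1], &q->symbols[0], (size_t)pos * sizeof(unsigned));
    q->symbols[0] = symbol;
}

/* Insert a new symbol at the front; when full, the oldest entry is dropped. */
static void mtf_queue_insert(MtfQueue *q, unsigned symbol) {
    int to_shift = (q->count < MTF_QUEUE_SIZE) ? q->count : MTF_QUEUE_SIZE - 1;
    memmove(&q->symbols[1], &q->symbols[0], (size_t)to_shift * sizeof(unsigned));
    q->symbols[0] = symbol;
    if (q->count < MTF_QUEUE_SIZE)
        q->count++;
}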
There were quite a few changes to simplify the code, too many to remember them all, and a few bug fixes including one that may have affected Gonzalo earlier. The overall results are that compression and decoding are (a little) faster and encoding is slower. Test results for enwik9 show a decompression speed improvement of about 15% and a compression improvement of almost 1 MB, putting it at #20 on LTCB, something that has been a goal of mine for quite a while:
enwik9:
Code:
GLZA c enwik9 enwik9.glza: 167,832,309 bytes in 594 seconds, 7452 MB RAM.
GLZA d enwik9.glza enwik9: 12.1 seconds, 355 MB RAM.
Code:
GLZA c -x -p3 enwik9 enwik9.glza: 163,768,203 bytes in 8184 seconds, 8205 MB RAM
GLZA d enwik9.glza enwik9: 12.2 seconds, 429 MB RAM.
enwik8:
Code:
GLZA c enwik8 enwik8.glza: 20,753,713 bytes in 25.9 seconds, 952 MB RAM.
GLZA d enwik8.glza enwik8: 1.35 seconds, 47 MB RAM.
Code:
GLZA c -x -p3 enwik8 enwik8.glza: 20,356,097 bytes in 420 seconds, 2140 MB RAM
GLZA d enwik8.glza enwik8: 1.40 seconds, 57 MB RAM.
I didn't realize until after the latest release that one of the 4-byte fields in the dictionary entry structure wasn't really necessary, so I took it out and got about 2% faster decompression and 3-4% less RAM usage when decompressing enwik9 (I did not rerun the compression from v0.10):
enwik9:
Code:
GLZA c enwik9 enwik9.glza: 167,832,309 bytes in 594 seconds, 7452 MB RAM.
GLZA d enwik9.glza enwik9: 11.9 seconds, 344 MB RAM.
Code:
GLZA c -x -p3 enwik9 enwik9.glza: 163,768,203 bytes in 8184 seconds, 8205 MB RAM
GLZA d enwik9.glza enwik9: 11.9 seconds, 413 MB RAM.
enwik8:
Code:
GLZA c enwik8 enwik8.glza: 20,753,713 bytes in 25.9 seconds, 952 MB RAM.
GLZA d enwik8.glza enwik8: 1.30 seconds, 45 MB RAM.
Code:
GLZA c -x -p3 enwik8 enwik8.glza: 20,356,097 bytes in 420 seconds, 2140 MB RAM
GLZA d enwik8.glza enwik8: 1.36 seconds, 54 MB RAM.
A .zip file of the source code is now 69,935 bytes.
Here is a new version of GLZA, version 0.11. It's the same basic idea but there are lots of changes, including the following highlights:
GLZAcompress:
Added an order 1 trailing character/leading character deduplication candidate scoring formula. This can be used by providing a "-o1.0" command line option. Without this option, it will use the order 0 score formula. Values between 0.0 and 1.0 will use a blend of the two candidate score formulas (a schematic sketch of the blend follows after this list). Using the -o option forces extreme (-x) compression mode and slows down compression, maybe 50% on average. Overall, I'd say order 1 and order 0 typically provide similar compression ratios: order 0 typically creates a smaller grammar, but order 1 creates a grammar that is slightly more predictable with an order 1 model. On the files I tested, on average -o0.5 worked a little better than -o0.0 or -o1.0.
Increased the maximum input file size from 2 GB to almost 4 GB. I didn't test this much at all. User beware, it may not work and you better have a lot of RAM (up to 64 GB) and time, especially if you are using the -x or -o options.
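For reference, the blend mentioned in the first item above is conceptually just a weighted mix of the two candidate scores; schematically (hypothetical names, not the actual GLZA code):
Code:
/* Hypothetical sketch of the -o blend weight: w = 0.0 uses only the order 0
   candidate score, w = 1.0 only the order 1 score, and values in between
   mix the two. */
static double blended_candidate_score(double order0_score,
                                      double order1_score,
                                      double w)
{
    return (1.0 - w) * order0_score + w * order1_score;
}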
GLZAencode:
Added additional models and blending of model predictions. It's probably not the best implementation but does generally improve the compression ratio a little.
Simplified the selective MTF queue selection code and made it generally more effective.
For capital encoded files, changed the MTF queue to three queues: one for symbols that start with a space character, one for symbols that start with an alphabetic (a-z) character, and one for the rest of the symbols. Also added a field to indicate which queue the next symbol is coming out of (unless the most recent trailing character was a capital encode character, in which case it goes straight to the a-z queue).
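Schematically, the queue selection described in the last item might look like this (hypothetical names, not the actual GLZA code):
Code:
/* Hypothetical sketch: pick one of the three MTF queues for a capital
   encoded file based on a symbol's leading character. */
enum MtfQueueId { MTF_QUEUE_SPACE, MTF_QUEUE_ALPHA, MTF_QUEUE_OTHER };

static enum MtfQueueId select_mtf_queue(unsigned char leading_char) {
    if (leading_char == ' ')
        return MTF_QUEUE_SPACE;
    if (leading_char >= 'a' && leading_char <= 'z')
        return MTF_QUEUE_ALPHA;
    return MTF_QUEUE_OTHER;
}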
GLZAdecode:
Modified the code to support GLZAencode changes.
Changed the dictionary structure from a set of indexes into an array of dictionary structures per starting symbol/code length combination to a set of arrays of dictionary structures per starting symbol/code length combination. Now the code has to move dictionary structure data into and out of the mtf queue but it doesn't use as much RAM and the dictionary data can be accessed more directly.
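In rough struct terms, the change is something like the following (hypothetical field names; the real structures hold considerably more data):
Code:
/* Before: each starting symbol / code length bin held indexes into one big
   shared array of dictionary entries. */
struct DictEntry { unsigned symbol; /* ... other per-entry data ... */ };

struct BinOld {
    unsigned num_entries;
    unsigned *entry_index;     /* indexes into a global DictEntry array */
};

/* After: each bin owns its own array of entries, so lookups go straight to
   the data; the trade-off is copying entry data into and out of the MTF
   queue. */
struct BinNew {
    unsigned num_entries;
    struct DictEntry *entries; /* per-bin array of dictionary entries */
};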
Results for enwik8/9 are below, with options -o0.6, -p4, -r20000 giving the best compression ratios of the combinations I tested. The "-x -p3" option is shown for comparison to the previous version.
Thank you for the error report. The default fast compression option had a bug in the list management when removing overlapping deduplication candidates. This is fixed now; it now gives 1,144,182 bytes for the compressed file, 0.3119 bpB. Using the -x option gives 0.2860 bpB.
Kennon, very nice compressor! Wow!! So can you explain to me very clearly and in a few words exactly (mostly) how it works, so that I can replicate your algorithm? I read the pdf... I see you are looking for the best longest phrases, like Byte Pair Encoding, when you mention "954 characters. It appears 1694 times in the encoded file". What encoded file? You mean wiki9, don't you? Or by encoded do you mean the tree holding wiki9 lol?
1) So you build a tree, a trie? (Of all of wiki9.) See figure_1 below.
2) Are you finding the common "codes" in wiki9 by a form of Byte Pair Encoding? Ex. "[but afterwards][ the][ cat][ was in the ][box]".
3) And are you replacing these windows (where they sit in wiki9, like Huffman does) with their own "huffman-like" code? Ex. "[10100000][1001][0100][101110011][010]".
4) Can you explain the "best code" discovery again? So you look for what in the tree/trie? A node that extends very long and has few children for miles? Or is also frequent? Or is near the root?
GLZA has preprocessing that can include a delta transformation and capital encoding, the grammar creation can make initial passes to deduplicate words, and the encoder can use MTF where it finds it to be beneficial. I will ignore those for the purpose of this discussion. The main algorithm creates a straight-line grammar and then uses arithmetic coding to send the grammar. Most of these features can be disabled with command line options. Note: Tree used Huffman coding; it was renamed GLZA when the encoding was changed to arithmetic coding to obtain slightly better compression ratios.
Byte Pair Encoding looks for the most common pairs. GLZA (like Matthias Gallé's IRR) looks for the "best" recurring strings of any length. Yes, the reference to the repeating string of 954 characters that appears 1694 times is regarding enwik9.
1. Yes, GLZA builds trees (tries?). Specifically, they are suffix trees. The grammar creation in GLZA is recursive so it builds one tree per pass, determines which grammar rules to create, and substitutes the left side grammar rule symbol for each instance of the right hand side string. GLZA with the -x option will build the entire tree if there is enough memory, otherwise it will stop adding to the tree when it runs out of allocated memory (command line option).
2. GLZA's grammar creation algorithm is somewhat similar to byte pair encoding, but GLZA can create grammar rules for strings of arbitrary length instead of only pairs. So for BPE to create a grammar rule for " cat", it would have to first create grammar rules for " c" and "at", then create another rule to combine those two (other buildups are possible). On the other hand, GLZA would more likely create a grammar rule for " cat" first, then later create grammar rules for " c" and "at" and substitute those rules into the grammar rule for " cat" (a small substitution sketch follows after this list).
3. Each grammar rule gets its own symbol. As mentioned earlier, the symbols are arithmetically encoded. Huffman encoding could work, but arithmetic encoding provides a slightly better compression ratio.
4. After the tree is built, GLZA traverses the entire tree looking for the "most profitable" strings for which to create grammar rules. Finding the "most profitable" grammar rules to create is a difficult problem, probably not solvable exactly, so GLZA does the best it can based on testing of what seems to work. At the heart of the decision process, GLZA uses a blend of two methods (with some adjustments for leading spaces and capital encode symbols) by default (command line option). One method basically determines the ratio of the reduction in order 0 entropy to the order 0 entropy of the unmodified strings (which may include grammar rule symbols) it is evaluating, and blends that value with the overall reduction in order 0 entropy. The other determines the ratio of the reduction in order 1 entropy, using the first and last characters of the strings represented by the grammar symbols for any non-terminal grammar symbols, to the order 1 entropy of the strings it is evaluating, and blends that with the reduction in order 1 entropy. For reference, IRR chooses one grammar rule per pass, the one that provides the most order 0 entropy reduction (a rough sketch of that kind of estimate also follows below). In practice, GLZA's algorithm tends to favor common strings that are longer and less frequent than IRR's choices. That being said, I feel strongly that there are better algorithms for choosing the best strings that are awaiting discovery. I suspect the best ones might be based more on analyzing the structure of the suffix tree than on calculations of the entropy reduction.
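To make points 2 and 4 a bit more concrete, here are two small sketches. Neither is GLZA's actual code. The first just shows what "substituting a rule into the data" means: every occurrence of the rule's right-hand side in the symbol sequence is replaced by the rule's new left-hand-side symbol.
Code:
/* Replace every occurrence of 'pattern' (the rule's right-hand side, plen
   symbols long) in 'seq' with the single symbol 'new_symbol' (the rule's
   left-hand side).  Rewrites in place and returns the new length. */
static int substitute_rule(int *seq, int len,
                           const int *pattern, int plen, int new_symbol)
{
    int out = 0;
    for (int i = 0; i < len; ) {
        int match = (i + plen <= len);
        for (int j = 0; match && j < plen; j++)
            if (seq[i + j] != pattern[j])
                match = 0;
        if (match) { seq[out++] = new_symbol; i += plen; }
        else       { seq[out++] = seq[i++]; }
    }
    return out;
}
The second is a rough order 0 profit estimate of the IRR flavor mentioned above, i.e. how many bits creating one rule might save. It ignores the knock-on changes to the other symbol probabilities, so it is only an approximation and not the formula GLZA actually uses:
Code:
#include <math.h>

/* Rough order 0 estimate: 'count' occurrences of a phrase whose symbols
   currently cost 'phrase_bits' bits (sum of -log2 of their probabilities)
   are replaced by one new symbol; the phrase itself is then coded once in
   the rule definition. */
static double order0_profit(double phrase_bits, long count, long total_symbols)
{
    double new_symbol_bits = -log2((double)count / (double)total_symbols);
    double bits_before = (double)count * phrase_bits;
    double bits_after  = phrase_bits + (double)count * new_symbol_bits;
    return bits_before - bits_after;   /* larger means more profitable */
}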
I hope this helps. Please let me know if anything was not clear or if you have additional questions.
Yes, I understand the arithmetic coding part; I have made a compressor so far that makes 100 MB into 40 MB.
Ok, I think I understand your algorithm before even reading your post above. I'm going to give it a shot; tell me if anything is wrong. Your suffix tree digests all of wiki9 and makes it into a tree. Now you are looking for the longest, most recurring string, the "best one". Once you find it, you set it aside and give it the smallest code, if you get me, and maybe even delete it from the tree. Now you look for the next best string. Repeat until all text/branches are erased from the tree. Wow, this is a very simple cheat which makes 100 MB into 18 MB losslessly. So it's just looking for the best ""Huffman codes"", so to speak, to get the most bits for the littlest cost to say it.
Will my tree need to be 1,000 branches long to store duplicates that happen to be 1,000 chars long? Will every child leaf have 1,000 parents? How do I fit all that into RAM?
I'm really interested. Can we chat on Facebook or Gmail Chat in real time so I can understand your algorithm much better and explain it to others and simplify it. What's your gmail?
I see you posted top 100K wiki9 symbols in the Tree (see below), but you said the duplicates were 500-5000 letters long, what are these short strings below?:
10%): ", he became the"
84575 95 20 (0.000110%): " serotonin"
84576 95 20 (0.000110%): " on [[1 Cjuly]]"
84577 95 20 (0.000110%): " and the main"
84578 95 20 (0.000110%): "/archives/"
84579 95 20 (0.000110%): ".
Originally Posted by Self_Recursive_Data
Will my tree need to be 1,000 branches long to store duplicates that happen to be 1,000 chars long? Will every child leaf have 1,000 parents? How do I fit all that into RAM?
No. A suffix tree for a string of length N will never have more than 2N nodes. Some of the edges could have 1000 symbols, though. Suffix trees do consume a lot of RAM.
Originally Posted by Self_Recursive_Data
I'm really interested. Can we chat on Facebook or Gmail Chat in real time so I can understand your algorithm much better and explain it to others and simplify it. What's your gmail?
I do not use Facebook or Gmail.
Originally Posted by Self_Recursive_Data
I see you posted top 100K wiki9 symbols in the Tree (see below), but you said the duplicates were 500-5000 letters long, what are these short strings below?:
10%): ", he became the"
84575 95 20 (0.000110%): " serotonin"
84576 95 20 (0.000110%): " on [[1 Cjuly]]"
84577 95 20 (0.000110%): " and the main"
84578 95 20 (0.000110%): "/archives/"
84579 95 20 (0.000110%): ".
There are some very long common strings in enwik9, but not all common strings are that long. I am not sure which post this came from, but it is a partial listing of the symbols and the (capital encoded) strings they represent. These are the 84575th - 84589th most common symbols in the final grammar (tied); each symbol appears 95 times in the final grammar, is put into the 20-bit code length dictionary bin, and has a global probability of 0.000110%.
I ran GLZA on wiki8 and it took 15 minutes.
Can you send me all the duplicate strings in wiki8 it finds, from greatest to least? To be clear, I expect to see it start with those big 800-letter-long strings, ex.:
A symmetric game is a game where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them
An important subset of sequential games consists of games of perfect information. A game is one of perfect information
Economists have used game theory to analyze a wide array of economic
Alternatively, some authors claim that
I'm making my own and will compare results very soon. Though for wiki9 you only show one string in your pdf, and wiki9 will be too big to work with anyway, hence I'm wondering what strings it collected for wiki8.
It depends on what command line options are used. Please read the readme.txt file in the release. It sounds like you should turn off the initial pass word deduplication with command line option -w0, use the -x option if you are not already doing so, and use the -v2 command line option to get it to output the dictionary in the order it was created and redirect the standard output to a file for viewing.
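Putting those options together, something like this should do it (the output file name here is just an example):
Code:
GLZA c -x -w0 -v2 enwik8 enwik8.glza > enwik8_dictionary.txt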
-x makes it "stop working" and isn't in the readme.txt. What does -x do?
How did you do arithmetic coding with such a big vocab? I know you look at the front of codes, ex. [a]aaaaaaaaaaa, but then how do you force it to choose the right one still...
You must not be using the latest version, GLZA v0.11.2. -x enables extreme parsing. This is necessary if you want the program to build the entire suffix tree in the first pass. This was the only mode before the -x option was added to allow the default compression mode to be faster.
The dictionary codes are transmitted in three parts. First, a code for the first character is sent using an order 1 model, then the dictionary "Huffman" code length is sent, then the index of the entry is sent, all using arithmetic coding.
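Not the real decoder, but schematically the decode side of that three-part scheme looks something like this (the decode_* helpers are stand-ins for the arithmetic decoder calls, not GLZA's actual API):
Code:
/* Stand-ins for the three arithmetic decoder models (hypothetical). */
int decode_first_char(int prev_char);                   /* order 1 model    */
int decode_code_length(int first_char);                 /* "Huffman" length */
int decode_bin_index(int first_char, int code_length);  /* entry in the bin */

struct DictRef { int first_char; int code_length; int index; };

/* Decode one dictionary reference in three stages. */
static struct DictRef decode_dictionary_symbol(int prev_char) {
    struct DictRef r;
    r.first_char  = decode_first_char(prev_char);
    r.code_length = decode_code_length(r.first_char);
    r.index       = decode_bin_index(r.first_char, r.code_length);
    return r;
}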
So you predict one of the millions of strings based on 1 char prior to it?
Lol, yes. Somebody please correct me if I am wrong, but as far as I know that is state of the art for a compressor that uses grammar rules. Using higher order symbols or rules on either side would provide better compression ratios, but would require more memory and be slower. With GLZA and enwik9 and similar text-like files, the next character is a space roughly 50% of the time.
Originally Posted by Self_Recursive_Data
Don't you have millions of ranges to jump to in your arithmetic encoding?
For each starting character, there can be up to 4,096 bins for dictionary codes and each bin can hold up to 32,768 dictionary entries. So a divide by up to 2^12 in one stage and a divide by up to 2^15 in the last stage.