
Thread: Tree alpha v0.1 download

  1. #841
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,495
    Thanks
    26
    Thanked 131 Times in 101 Posts
    I tried time but only got 1 second resolution. Is that normal?
    It seems so, unfortunately. http://www.cplusplus.com/reference/ctime/time_t/ says:
    type time_t <ctime>
    Time type
    Alias of a fundamental arithmetic type capable of representing times, as those returned by function time.

    For historical reasons, it is generally implemented as an integral value representing the number of seconds elapsed since 00:00 hours, Jan 1, 1970 UTC (i.e., a unix timestamp). Although libraries may implement this type using alternative time representations.

    Portable programs should not use values of this type directly, but always rely on calls to elements of the standard library to translate them to portable types.
    On the other hand, the function 'clock_gettime' suggested by user 'algorithm' seems to be working on my system. I've implemented a program using the documentation at https://linux.die.net/man/3/clock_gettime and a snippet from https://gist.github.com/diabloneo/9619917.

    Example code ( https://ideone.com/6XcQ7d ):
    Code:
    #define _POSIX_C_SOURCE 199309L

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* result = stop - start, keeping tv_nsec in the range [0, 1e9) */
    void timespec_diff(const struct timespec *start, const struct timespec *stop,
                       struct timespec *result) {
        if ((stop->tv_nsec - start->tv_nsec) < 0) {
            result->tv_sec = stop->tv_sec - start->tv_sec - 1;
            result->tv_nsec = stop->tv_nsec - start->tv_nsec + 1000000000;
        } else {
            result->tv_sec = stop->tv_sec - start->tv_sec;
            result->tv_nsec = stop->tv_nsec - start->tv_nsec;
        }
    }

    void timespec_print(const char *description,
                        const struct timespec *start, const struct timespec *stop) {
        struct timespec result;
        timespec_diff(start, stop, &result);
        double seconds = result.tv_sec + result.tv_nsec / 1e9;
        printf("times elapsed for clock '%s': %lf\n", description, seconds);
    }

    int main(void) {
        struct timespec start_clock_realtime;
        struct timespec start_clock_monotonic;
        struct timespec start_clock_process_cputime_id;
        struct timespec start_clock_thread_cputime_id;

        struct timespec finish_clock_realtime;
        struct timespec finish_clock_monotonic;
        struct timespec finish_clock_process_cputime_id;
        struct timespec finish_clock_thread_cputime_id;

        clock_gettime(CLOCK_REALTIME, &start_clock_realtime);
        clock_gettime(CLOCK_MONOTONIC, &start_clock_monotonic);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_clock_process_cputime_id);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start_clock_thread_cputime_id);

        sleep(3); // seconds

        clock_gettime(CLOCK_REALTIME, &finish_clock_realtime);
        clock_gettime(CLOCK_MONOTONIC, &finish_clock_monotonic);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &finish_clock_process_cputime_id);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &finish_clock_thread_cputime_id);

        timespec_print("realtime",
                       &start_clock_realtime, &finish_clock_realtime);
        timespec_print("monotonic",
                       &start_clock_monotonic, &finish_clock_monotonic);
        timespec_print("process cputime id",
                       &start_clock_process_cputime_id, &finish_clock_process_cputime_id);
        timespec_print("thread cputime id",
                       &start_clock_thread_cputime_id, &finish_clock_thread_cputime_id);

        return 0;
    }

    Result:
    Code:
    times elapsed for clock 'realtime': 3.000057
    times elapsed for clock 'monotonic': 3.000057
    times elapsed for clock 'process cputime id': 0.000007
    times elapsed for clock 'thread cputime id': 0.000007
    Precision seems to be pretty good. Check if your compiler gives similar results (I hope so).

  2. Thanks (4):

    Bulat Ziganshin (23rd September 2017),Kennon Conrad (24th September 2017),Mike (23rd September 2017),Shelwien (23rd September 2017)

  3. #842
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Yes, I get similar results. I still don't understand why my compiled programs give global time for clock(), but I guess that mystery will remain unsolved. I agree the precision is pretty good, much better than with clock().

    I am curious though about one thing you did that I didn't do in post #835:

    Code:
    double seconds = result.tv_sec + result.tv_nsec / 1e9;
    Will compilers always convert result.tv_nsec and 1e9 to double before doing the divide?

  4. #843
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,564
    Thanks
    775
    Thanked 687 Times in 372 Posts
    1e9 is a constant of type double; arithmetic operators promote the narrower operand to the wider type before the calculation, so tv_nsec is converted to double and the division is done in floating point.
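
    A quick standalone illustration of that point (not from GLZA, just the usual arithmetic conversions in C): dividing by the double constant 1e9 keeps the fractional part, while dividing by an integer constant would truncate it.
    Code:
    #define _POSIX_C_SOURCE 199309L

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec result = { .tv_sec = 3, .tv_nsec = 57000 };

        /* tv_nsec (an integer type) is promoted to double because 1e9 is double,
           so the division happens in floating point and keeps the fraction. */
        double with_double_divisor = result.tv_sec + result.tv_nsec / 1e9;

        /* With an integer divisor both operands are integers, the division
           truncates to 0, and the fractional seconds are silently lost. */
        double with_int_divisor = result.tv_sec + result.tv_nsec / 1000000000;

        printf("double divisor: %f\n", with_double_divisor); /* 3.000057 */
        printf("int divisor:    %f\n", with_int_divisor);    /* 3.000000 */
        return 0;
    }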

  5. Thanks:

    Kennon Conrad (27th September 2017)

  6. #844
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    GLZA v0.9.2

    I changed a few things for the "fast" mode (-f) and got faster compression than I ever expected. The printing of elapsed time has been fixed too.

    GLZA c -f enwik8 enwik8.glza
    Code:
    Compressed 100000000 bytes -> 20722473 bytes (1.6578 bpB) in 24.142 seconds.
    GLZA c -f enwik9 enwik9.glza
    Code:
    Compressed 1000000000 bytes -> 167188835 bytes (1.3375 bpB) in 556.622 seconds.
    GLZA c -f Silesia.tar Silesia.tar.glza
    Code:
    Compressed 211957760 bytes -> 51559957 bytes (1.9460 bpB) in 132.188 seconds.
    Attached Files

  7. Thanks (3):

    algorithm (9th October 2017),Bulat Ziganshin (10th October 2017),Lucas (9th October 2017)

  8. #845
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    100
    Thanks
    37
    Thanked 28 Times in 19 Posts
    Maybe silesia.tar has a low compression ratio because GLZA is not good for non-uniform files? Does it remove symbols from the dictionary?

  9. #846
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by algorithm View Post
    Maybe silesia.tar has a low compression ratio because GLZA is not good for non-uniform files? Does it remove symbols from the dictionary?
    Yes, GLZA is not as effective for non-uniform files as for uniform files. It seems to also generally not be as effective as LZMA for files that are not text-like. GLZA does remove low frequency symbols from the dictionary but this is likely an area that could be improved. Here's a plot of the dictionary size vs. file position for Silesia.tar. In total, there are 1,525,516 dictionary entries, but most come and go:

    [Attached image: Silesia.png (dictionary size vs. file position for Silesia.tar)]

  10. #847
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    GLZA v0.10

    Here is GLZA version 0.10. The main goal for this version was to simplify the code and improve decompression speed.

    Functionally, the "-f" (faster compression) behavior is now the default, and the "-x" flag (extreme compression) can be used to get the previous default behavior: slower, but usually better compression.

    The convoluted multi-MTF-queue scheme that could hold up to 1152 symbols has been replaced by a single queue holding up to 256 symbols. To reduce the number of symbol moves between the MTF queue and the dictionary, a new signal type has been added that tells the decoder whether or not to put an MTF-eligible symbol into the queue. So now symbols never fall out of the MTF queue without being used, and the queue shuffling associated with queue misses is gone. The side benefit is that the queue can be smaller, which means less overhead for queue hits and also lower average queue positions. The complication is that the code now has to figure out which symbols to put in the queue. This is not so easy; I think the code could be better, but it's the best I could do for now.
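
    For readers who haven't seen the structure, here is a minimal sketch of a bounded move-to-front queue of the general kind described above. It is only an illustration (names and layout are made up), not GLZA's actual queue code, and it omits the encoder-side decision of which symbols to queue.
    Code:
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MTF_QUEUE_SIZE 256

    typedef struct {
        uint32_t symbols[MTF_QUEUE_SIZE]; /* most recently used symbol at index 0 */
        int count;                        /* number of symbols currently queued */
    } MtfQueue;

    /* Put a newly eligible symbol at the front of the queue. If the queue is
       full, the oldest entry falls off the back here; in GLZA the encoder
       instead signals up front which symbols to queue, so nothing is lost. */
    void mtf_push(MtfQueue *q, uint32_t symbol) {
        int n = (q->count < MTF_QUEUE_SIZE) ? q->count : MTF_QUEUE_SIZE - 1;
        memmove(&q->symbols[1], &q->symbols[0], (size_t) n * sizeof(uint32_t));
        q->symbols[0] = symbol;
        if (q->count < MTF_QUEUE_SIZE)
            q->count++;
    }

    /* On a queue hit, the coder transmits 'position'; the symbol at that
       position moves back to the front, which keeps recent symbols cheap. */
    uint32_t mtf_hit(MtfQueue *q, int position) {
        uint32_t symbol = q->symbols[position];
        memmove(&q->symbols[1], &q->symbols[0], (size_t) position * sizeof(uint32_t));
        q->symbols[0] = symbol;
        return symbol;
    }

    int main(void) {
        MtfQueue q = { {0}, 0 };
        mtf_push(&q, 42);
        mtf_push(&q, 7);              /* queue is now [7, 42] */
        uint32_t s = mtf_hit(&q, 1);  /* hit on 42, queue becomes [42, 7] */
        printf("hit symbol %u, new front %u\n", (unsigned) s, (unsigned) q.symbols[0]);
        return 0;
    }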

    There were quite a few changes to simplify the code, too many to remember them all, and a few bug fixes including one that may have affected Gonzalo earlier. The overall results are that compression and decoding are (a little) faster and encoding is slower. Test results for enwik9 show a decompression speed improvement of about 15% and a compression improvement of almost 1 MB, putting it at #20 on LTCB, something that has been a goal of mine for quite a while:

    enwik9:
    Code:
    GLZA c enwik9 enwik9.glza:  167,832,309 bytes in 594 seconds, 7452 MB RAM.
    GLZA d enwik9.glza enwik9:  12.1 seconds, 355 MB RAM.
    Code:
    GLZA c -x -p3 enwik9 enwik9.glza:  163,768,203 bytes in 8184 seconds, 8205 MB RAM
    GLZA d enwik9.glza enwik9:  12.2 seconds, 429 MB RAM.
    enwik8:
    Code:
    GLZA c enwik8 enwik8.glza:  20,753,713 bytes in 25.9 seconds, 952 MB RAM.
    GLZA d enwik8.glza enwik8:  1.35 seconds, 47 MB RAM.
    Code:
    GLZA c -x -p3 enwik8 enwik8.glza:  20,356,097 bytes in 420 seconds, 2140 MB RAM
    GLZA d enwik8.glza enwik8:  1.40 seconds, 57 MB RAM.
    A .zip file of the source code is 69,980 bytes.

    If anyone has any problems, please let me know.
    Attached Files

  11. Thanks (2):

    Bulat Ziganshin (2nd January 2018),JamesB (2nd January 2018)

  12. #848
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    GLZA v0.10.1

    I didn't realize until after the latest release that one of the 4 byte fields in the dictionary entry structure wasn't really necessary so I took it out and got about 2% faster decompression and 3 - 4% less RAM usage when decompressing enwik9 (I did not rerun the compression from v0.10):

    enwik9:
    Code:
    GLZA c enwik9 enwik9.glza:  167,832,309 bytes in 594 seconds, 7452 MB RAM.
    GLZA d enwik9.glza enwik9:  11.9 seconds, 344 MB RAM.
    Code:
    GLZA c -x -p3 enwik9 enwik9.glza:  163,768,203 bytes in 8184 seconds, 8205 MB RAM
    GLZA d enwik9.glza enwik9:  11.9 seconds, 413 MB RAM.
    enwik8:
    Code:
    GLZA c enwik8 enwik8.glza:  20,753,713 bytes in 25.9 seconds, 952 MB RAM.
    GLZA d enwik8.glza enwik8:  1.30 seconds, 45 MB RAM.
    Code:
    GLZA c -x -p3 enwik8 enwik8.glza:  20,356,097 bytes in 420 seconds, 2140 MB RAM
    GLZA d enwik8.glza enwik8:  1.36 seconds, 54 MB RAM.
    A .zip file of the source code is now 69,935 bytes.
    Attached Files

  13. Thanks (5):

    Bulat Ziganshin (6th January 2018),Gonzalo (2nd July 2018),load (16th May 2018),Matt Mahoney (9th August 2018),Stephan Busch (9th January 2018)

  14. #849
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Here is a new version of GLZA, version 0.11. It's the same basic idea but there are lots of changes, including the following highlights:

    GLZAcompress:
    Added an order 1 trailing-character/leading-character deduplication candidate scoring formula. This can be used by providing a "-o1.0" command line option. Without this option, it will use the order 0 score formula; values between 0.0 and 1.0 will use a blend of the two candidate score formulas. Using the -o option forces extreme (-x) compression mode and slows down compression, maybe 50% on average. Overall, I'd say order 1 and order 0 typically provide similar compression ratios: order 0 usually creates a smaller grammar, while order 1 creates a grammar that is slightly more predictable with an order 1 model. On the files I tested, -o0.5 worked a little better on average than -o0.0 or -o1.0.
    Increased the maximum input file size from 2 GB to almost 4 GB. I didn't test this much at all. User beware: it may not work, and you had better have a lot of RAM (up to 64 GB) and time, especially if you are using the -x or -o options.

    GLZAencode:
    Added additional models and blending of model predictions. It's probably not the best implementation but does generally improve the compression ratio a little.
    Simplified the selective MTF queue selection code and made it generally more effective.
    For capital encoded files, changed the MTF queue to three queues: one for symbols that start with a space character, one for symbols that start with an alphabetic (a - z) character, and one for the rest of the symbols. A field was added to indicate which queue the next symbol is coming out of (unless the most recent trailing character was a capital encode character, in which case it goes straight to the a - z queue).
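
    A rough sketch of the queue selection just described (illustrative only; the function and names are not GLZA's actual code):
    Code:
    /* Illustrative only: choose one of the three MTF queues from the first
       character of the (capital-encoded) symbol string. If the previous symbol
       ended in a capital-encode character, the next symbol is known to start
       with a lowercase letter, so no queue-selection field needs to be sent. */
    enum MtfQueueId { QUEUE_SPACE, QUEUE_ALPHA, QUEUE_OTHER };

    enum MtfQueueId select_mtf_queue(const unsigned char *symbol_string,
                                     int prev_ended_in_cap_encode) {
        if (prev_ended_in_cap_encode)
            return QUEUE_ALPHA;
        unsigned char first = symbol_string[0];
        if (first == ' ')
            return QUEUE_SPACE;
        if (first >= 'a' && first <= 'z')
            return QUEUE_ALPHA;
        return QUEUE_OTHER;
    }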

    GLZAdecode:
    Modified the code to support GLZAencode changes.
    Changed the dictionary structure from a set of indexes into an array of dictionary structures per starting symbol/code length combination to a set of arrays of dictionary structures per starting symbol/code length combination. Now the code has to move dictionary structure data into and out of the mtf queue but it doesn't use as much RAM and the dictionary data can be accessed more directly.

    Results for enwik8/9 are below, with options -o0.6, -p4, -r20000 giving the best compression ratios of the combinations I tested. The "-x -p3" option is shown for comparison to the previous version.

    enwik9:
    Code:
    GLZA c -o0.6 -p4 -r20000 enwik9 enwik9.glza:  161,966,886 bytes, 17,031 seconds, 13,071 MB RAM
    GLZA d enwik9.glza enwik9:  9.9 seconds, 384 MB RAM
    Code:
    GLZA c -x -p3 enwik9 enwik9.glza:  162,899,155 bytes, 8219 seconds, 8220 MB RAM
    GLZA d enwik9.glza enwik9:  9.9 seconds, 383 MB RAM
    Code:
    GLZA c enwik9 enwik9.glza:  166,844,914 bytes, 593 seconds, 7463 MB RAM
    GLZA d enwik9.glza enwik9:  10.1 seconds, 327 MB RAM
    enwik8:
    Code:
    GLZA c -o0.6 -p4 -r20000 enwik8 enwik8.glza: 20,114,673 bytes, 782 seconds, 2347 MB RAM
    GLZA d enwik8.glza enwik8: 1.24 seconds, 50 MB RAM
    Code:
    GLZA c -x -p3 enwik8 enwik8.glza: 20,247,308 bytes, 423 seconds, 2142 MB RAM
    GLZA d enwik8.glza enwik8: 1.23 seconds, 50 MB RAM
    Code:
    GLZA c enwik8 enwik8.glza: 20,636,657 bytes, 25.4 seconds, 954 MB RAM
    GLZA d enwik8.glza enwik8: 1.22 seconds, 43 MB RAM
    A .zip file of the source code is 75,527 bytes.
    Attached Files

  15. Thanks (2):

    algorithm (7th November 2018),Mike (7th November 2018)

  16. #850
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,915
    Thanks
    291
    Thanked 1,273 Times in 720 Posts
    Getting heap corruption crash when trying to encode this file, zero output.
    "GLZA.exe c 00270.6 1"
    Attached Files

  17. #851
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    GLZA v0.11.1

    Quote Originally Posted by Shelwien View Post
    Getting heap corruption crash when trying to encode this file, zero output.
    "GLZA.exe c 00270.6 1"
    Thank you for the error report. There was a bug in the encoder.
    Attached Files

  18. Thanks:

    Shelwien (24th December 2019)

  19. #852
    Member
    Join Date
    Mar 2018
    Location
    Somewhere over the Rainbow
    Posts
    3
    Thanks
    1
    Thanked 1 Time in 1 Post

    GLZA error

    Getting an error:

    $ ./glza-v0.11.1/GLZA c 1_42_polytope_7-cube.svg 1_42_polytope_7-cube.svg.glza
    Applying 60 byte delta transformation
    cap encoded: 0, UTF8 compliant 0
    Found 29344554 symbols
    double free or corruption (out)43887 grammar rules
    Aborted

    Regards
    Dot

    If it's of any use:
    Debian 10 Testing amd64
    gcc (Debian 9.2.1-21) 9.2.1 20191130
    OR clang version 9.0.0-4 (both produce the same error)
    libc6 2.29-3
    Attached Files

  20. #853
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,915
    Thanks
    291
    Thanked 1,273 Times in 720 Posts
    I tested this file with windows exe from GLZA_v0.11.1.zip - it crashes:
    Code:
    Z:\008>GLZA.exe c 1_42_polytope_7-cube.svg 1b
    Applying 60 byte delta transformation
    cap encoded: 0, UTF8 compliant 0
    Found 29344554 symbols
    PASS 23: grammar size 1487309, 44198 grammar rules
    "GLZA.exe c -x 1_42_polytope_7-cube.svg 1" works though.

  21. #854
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    GLZA v0.11.2

    Thank you for the error report. The default fast compression option had a bug in the list management when removing overlapping deduplication candidates. This is fixed now; the compressed file is 1,144,182 bytes, 0.3119 bpB, and using the -x option gives 0.2860 bpB.
    Attached Files

  22. Thanks (3):

    dot (29th December 2019),Mike (29th December 2019),Shelwien (30th December 2019)

  23. #855
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    Kennon, very nice compressor! Wow!! So can you explain to me very clear and in few words exactly (mostly) how it works so that I can replicate your algorithm. I read the pdf... I see you are looking for the best longest phrases like Byte Pair Encoding when you mention "954 characters. It appears 1694 times in the encoded file". What encoded file? You mean wiki9 don't you? Or by encoded you mean the tree holding wiki9 lol?

    1) So you build a tree, a trie ? (Of all of wiki9). See figure_1 below.
    2) Are you finding the common "codes" in wiki9 by a form of Byte Pair Encoding ? Ex. "[but afterwards][ the][ cat][ was in the ][box]".
    3) And are you replacing these windows (where they sit in wiki9, like Huffman does) with their own "huffman-like" code ? Ex. "[10100000][1001][0100][101110011][010]".
    4) Can you explain the "best code" discovery again? So you look for what in the tree/trie? A node that extends very long and has few children for miles? Or also is frequent? Or is near the Root?

    Figure_1 (tree):
    //////////a<
    ////a<
    //////////b<
    S<
    //////////a<
    ​////b<
    //////////b<

  24. #856
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    Can anyone answer these questions?

  25. #857
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,915
    Thanks
    291
    Thanked 1,273 Times in 720 Posts
    Just read the source. Most people won't explain things until you understand.

  26. #858
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    I'm learning so much from yous though. I've almost crossed no man's land. (I tried reading the material.)

  27. #859
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Self_Recursive_Data View Post
    Kennon, very nice compressor! Wow!! So can you explain to me very clear and in few words exactly (mostly) how it works so that I can replicate your algorithm. I read the pdf... I see you are looking for the best longest phrases like Byte Pair Encoding when you mention "954 characters. It appears 1694 times in the encoded file". What encoded file? You mean wiki9 don't you? Or by encoded you mean the tree holding wiki9 lol?

    1) So you build a tree, a trie ? (Of all of wiki9). See figure_1 below.
    2) Are you finding the common "codes" in wiki9 by a form of Byte Pair Encoding ? Ex. "[but afterwards][ the][ cat][ was in the ][box]".
    3) And are you replacing these windows (where they sit in wiki9, like Huffman does) with their own "huffman-like" code ? Ex. "[10100000][1001][0100][101110011][010]".
    4) Can you explain the "best code" discovery again? So you look for what in the tree/trie? A node that extends very long and has few children for miles? Or also is frequent? Or is near the Root?

    Figure_1 (tree):
    //////////a<
    ////a<
    //////////b<
    S<
    //////////a<
    ​////b<
    //////////b<
    Hello Self_Recursive_Data,

    GLZA has preprocessing that can include a delta transformation and capital encoding, the grammar creation can make initial passes to deduplicate words, and the encoder can use MTF where it finds it to be beneficial. I will ignore those for the purpose of this discussion. The main algorithm creates a straight-line grammar and then uses arithmetic coding to send the grammar. Most of these features can be disabled with command line options. Note: Tree used Huffman coding; it was renamed GLZA when the encoding was changed to arithmetic coding to get slightly better compression ratios.

    Byte Pair Encoding looks for the most common pairs. GLZA (like Matthias Gallé's IRR) looks for the "best" recurring strings of any length. Yes, the reference to the repeating string of 954 characters that appears 1694 times is regarding enwik9.

    1. Yes, GLZA builds trees (tries?). Specifically, they are suffix trees. The grammar creation in GLZA is recursive so it builds one tree per pass, determines which grammar rules to create, and substitutes the left side grammar rule symbol for each instance of the right hand side string. GLZA with the -x option will build the entire tree if there is enough memory, otherwise it will stop adding to the tree when it runs out of allocated memory (command line option).

    2. GLZA's grammar creation algorithm is somewhat similar to byte pair encoding, but GLZA can create grammar rules for strings of arbitrary length instead of only pairs. So for BPE to create a grammar rule for " cat", it would have to first create grammar rules for " c" and "at", then create another rule to combine those two (other buildups possible). On the other hand, GLZA would more likely create a grammar rule for " cat" first, then later create grammar rules for " c" and "at" and substitute those rules into the grammar rule for " cat".
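
    To make the contrast concrete, here is a minimal sketch of a single byte pair encoding pass (find the most frequent adjacent pair and replace it with a fresh symbol). It only illustrates the pair-at-a-time limitation described above; it is standard BPE, not GLZA code, and the brute-force pair counting is for brevity.
    Code:
    #include <stdio.h>
    #include <string.h>

    #define MAX_SYMBOLS 256  /* initial alphabet: raw byte values 0..255 */

    /* One BPE pass: find the most frequent adjacent pair in 'data' (length n),
       replace every occurrence with 'new_symbol', and return the new length. */
    int bpe_pass(int *data, int n, int new_symbol) {
        static int counts[MAX_SYMBOLS * 2][MAX_SYMBOLS * 2];
        memset(counts, 0, sizeof(counts));
        for (int i = 0; i + 1 < n; i++)
            counts[data[i]][data[i + 1]]++;

        int best_a = -1, best_b = -1, best_count = 1;
        for (int a = 0; a < new_symbol; a++)
            for (int b = 0; b < new_symbol; b++)
                if (counts[a][b] > best_count) {
                    best_count = counts[a][b];
                    best_a = a;
                    best_b = b;
                }
        if (best_a < 0)
            return n;  /* no pair occurs more than once */

        int out = 0;
        for (int i = 0; i < n; ) {
            if (i + 1 < n && data[i] == best_a && data[i + 1] == best_b) {
                data[out++] = new_symbol;  /* rule: new_symbol -> best_a best_b */
                i += 2;
            } else {
                data[out++] = data[i++];
            }
        }
        return out;
    }

    int main(void) {
        const char *text = "the cat sat on the mat";
        int data[64], n = 0;
        for (const char *p = text; *p; p++)
            data[n++] = (unsigned char) *p;

        /* To build a rule for " cat", BPE needs several passes (" c", "at", ...),
           whereas GLZA could create a rule for the whole string in one step. */
        n = bpe_pass(data, n, 256);  /* first new symbol gets id 256 */
        printf("symbols after one BPE pass: %d\n", n);
        return 0;
    }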

    3. Each grammar rule gets its own symbol. As mentioned earlier, the symbols are arithmetically encoded. Huffman coding could work, but arithmetic encoding provides a slightly better compression ratio.

    4. After the tree is built, GLZA traverses the entire tree looking for the "most profitable" strings for which to create grammar rules. Finding the truly most profitable grammar rules is a difficult problem, probably not solvable exactly, so GLZA does the best it can based on testing of what seems to work. At the heart of the decision process, GLZA by default uses a blend of two methods, with some adjustments for leading spaces and capital encode symbols (the blend is a command line option). One method basically determines the ratio of the reduction in order 0 entropy to the order 0 entropy of the unmodified strings (which may include grammar rule symbols) it is evaluating, and blends that value with the overall reduction in order 0 entropy. The other determines the ratio of the reduction in order 1 entropy (using the first and last characters of the strings represented by the grammar symbols for any non-terminal grammar symbols) to the order 1 entropy of the strings it is evaluating, and blends that with the reduction in order 1 entropy. For reference, IRR chooses one grammar rule per pass, the one that provides the most order 0 entropy reduction. In practice, GLZA's algorithm tends to favor common strings that are longer and less frequent than those IRR favors. That being said, I feel strongly that there are better algorithms for choosing the best strings awaiting discovery. I suspect the best ones might be based more on analyzing the structure of the suffix tree than on calculations of the entropy reduction.
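
    GLZA's actual scoring is the entropy-reduction blend described above. As a much simpler stand-in, here is the naive "symbols saved" profit estimate often used when reasoning about grammar rules; it is not GLZA's formula, but it shows why a long string with a moderate count (such as the 954-character string that appears 1694 times) can be worth a great deal. The second example uses made-up numbers purely for comparison.
    Code:
    #include <stdio.h>

    /* Naive profit estimate for creating a grammar rule for a string that is
       'length' symbols long and occurs 'count' times: each occurrence shrinks
       to one symbol, and the rule's right-hand side must be stored once.
       GLZA's real scoring instead estimates order 0 / order 1 entropy reduction. */
    long rule_profit_symbols(long length, long count) {
        long saved_in_text = count * (length - 1);
        long rule_overhead = length;  /* the right-hand side, stored once */
        return saved_in_text - rule_overhead;
    }

    int main(void) {
        printf("954 chars x 1694 uses: %ld symbols saved\n",
               rule_profit_symbols(954, 1694));
        printf("4 chars x 50000 uses:  %ld symbols saved\n",
               rule_profit_symbols(4, 50000));
        return 0;
    }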

    I hope this helps. Please let me know if anything was not clear or if you have additional questions.

  28. Thanks:

    Self_Recursive_Data (18th January 2020)

  29. #860
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    Yes I understand the arithmetic coding it in/error correction, I made a compressor so far that makes 100MB into 40MB.

    Ok I think I understand your algorithm before even reading your post above. I'm going to give it a shot; tell me if anything is wrong. Your suffix tree digests all of wiki9 and makes it into a tree. Now you are looking for the longest, most recurring string, the "best one". Once you find it, you set it aside and give it the smallest code, if you get me, and maybe even delete it from the tree. Now, you look for the next best string. Repeat until all text/branches are erased from the tree. Wow, this is a very simple cheat which makes 100MB into 18MB losslessly. So it's just looking for the best ""Huffman Codes"" so to speak, to get the most bits for the littlest cost.

    Will my tree need to be 1,000 branches long to store duplicates that happen to be 1,000 chars long? Will every child leaf have 1,000 parents? How do I fit all that into RAM?
    Last edited by Self_Recursive_Data; 19th January 2020 at 02:10.

  30. #861
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    See above post.


    I'm really interested. Can we chat on Facebook or Gmail Chat in real time so I can understand your algorithm much better and explain it to others and simplify it. What's your gmail?


    I see you posted top 100K wiki9 symbols in the Tree (see below), but you said the duplicates were 500-5000 letters long, what are these short strings below?:

    10%): ", he became the"
    84575 95 20 (0.000110%): " serotonin"
    84576 95 20 (0.000110%): " on [[1 Cjuly]]"
    84577 95 20 (0.000110%): " and the main"
    84578 95 20 (0.000110%): "/archives/"
    84579 95 20 (0.000110%): ".

    <!-- Cunsourced image removed: [[Cimage:C"
    84580 95 20 (0.000110%): "dew"
    84581 95 20 (0.000110%): "uber"
    84582 95 20 (0.000110%): ", see also"
    84583 95 20 (0.000110%): " Cfallon"
    84584 95 20 (0.000110%): " was paid"
    84585 95 20 (0.000110%): " [[Cdwight Cd. Ceisenhower]]"
    84586 95 20 (0.000110%): " he had made"
    84587 95 20 (0.000110%): " Cpalatinate"
    84588 95 20 (0.000110%): "nautilus"
    84589 95 20 (0.000110%): "k'"
    Last edited by Self_Recursive_Data; 19th January 2020 at 17:20.

  31. #862
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Self_Recursive_Data View Post
    Will my tree need to be 1,000 branches long to store duplicates that happen to be 1,000 chars long? Will every child leaf have 1,000 parents? How do I fit all that into RAM?
    No. A suffix tree for a string of length N will never have more than 2N nodes. Some of the edges could have 1000 symbols, though. Suffix trees do consume a lot of RAM.
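
    To make that concrete, here is a sketch of the node layout a compressed suffix tree typically uses (illustrative only, not GLZA's actual structures): each edge stores just a (start, length) reference into the original text, so a 1,000-character edge costs the same as a 1-character one, and the RAM goes into the per-node bookkeeping.
    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of a compressed suffix tree node. The edge label leading into the
       node is not stored as text; it is just a window into the input buffer,
       so even a 1000-character edge costs two integers. */
    typedef struct SuffixNode {
        uint32_t edge_start;    /* offset of the edge label in the input */
        uint32_t edge_length;   /* length of the edge label */
        uint32_t first_child;   /* index of first child node, 0 if leaf */
        uint32_t next_sibling;  /* index of next sibling, 0 if none */
        uint32_t suffix_link;   /* used during Ukkonen-style construction */
        uint32_t count;         /* occurrences; useful when scoring strings */
    } SuffixNode;

    int main(void) {
        /* A suffix tree over N symbols has at most about 2N nodes, so the
           worst-case memory is roughly 2 * N * sizeof(SuffixNode). */
        uint64_t n = 1000000000ULL;  /* an enwik9-sized input */
        printf("node size: %zu bytes\n", sizeof(SuffixNode));
        printf("worst case for %llu symbols: ~%llu GB\n",
               (unsigned long long) n,
               (unsigned long long) (2 * n * sizeof(SuffixNode) >> 30));
        return 0;
    }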


    Quote Originally Posted by Self_Recursive_Data View Post
    I'm really interested. Can we chat on Facebook or Gmail Chat in real time so I can understand your algorithm much better and explain it to others and simplify it. What's your gmail?
    I do not use Facebook or Gmail.

    Quote Originally Posted by Self_Recursive_Data View Post
    I see you posted top 100K wiki9 symbols in the Tree (see below), but you said the duplicates were 500-5000 letters long, what are these short strings below?:

    10%): ", he became the"
    84575 95 20 (0.000110%): " serotonin"
    84576 95 20 (0.000110%): " on [[1 Cjuly]]"
    84577 95 20 (0.000110%): " and the main"
    84578 95 20 (0.000110%): "/archives/"
    84579 95 20 (0.000110%): ".

    <!-- Cunsourced image removed: [[Cimage:C"
    84580 95 20 (0.000110%): "dew"
    84581 95 20 (0.000110%): "uber"
    84582 95 20 (0.000110%): ", see also"
    84583 95 20 (0.000110%): " Cfallon"
    84584 95 20 (0.000110%): " was paid"
    84585 95 20 (0.000110%): " [[Cdwight Cd. Ceisenhower]]"
    84586 95 20 (0.000110%): " he had made"
    84587 95 20 (0.000110%): " Cpalatinate"
    84588 95 20 (0.000110%): "nautilus"
    84589 95 20 (0.000110%): "k'"
    There are some very long common strings in enwik9, but not all common strings are that long. I am not sure which post this came from, but it is a partial listing of the symbols and the (capital encoded) strings they represent. These are the 84575th - 84589th most common symbols in the final grammar (tied): each symbol appears 95 times in the final grammar, is put into the 20 bit code length dictionary bin, and has a global probability of 0.000110%.

  32. #863
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    Do you have a preferred real time chat?


  33. #864
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    ​I ran GLZA on wiki8 and took 15mins.
    Can you send me all duplicate strings in wiki8 it finds from greatest to least? To be clear I expect to see it start with those big 800 letter long strings ex.:



    {| border="1" align=right cellpadding="4" cellspacing="0" style="margin: 1em 1em 1em 1em; background: #f9f9f9; border: 1px #aaa solid

    A symmetric game is a game where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them

    An important subset of sequential games consists of games of perfect information. A game is one of perfect information

    Economists have used game theory to analyze a wide array of economic

    Alternatively, some authors claim that
    Last edited by Self_Recursive_Data; 19th January 2020 at 21:46.

  34. #865
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    I'm making my own and will compare results very soon. Though for wiki9 you only show one string in your pdf, and wiki9 will be too big to work with anyway; hence I'm wondering what strings you collected for wiki8.

  35. #866
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Self_Recursive_Data View Post
    ​I ran GLZA on wiki8 and took 15mins.
    Can you send me all duplicate strings in wiki8 it finds from greatest to least? To be clear I expect to see it start with those big 800 letter long strings ex.:



    {| border="1" align=right cellpadding="4" cellspacing="0" style="margin: 1em 1em 1em 1em; background: #f9f9f9; border: 1px #aaa solid

    A symmetric game is a game where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them

    An important subset of sequential games consists of games of perfect information. A game is one of perfect information

    Economists have used game theory to analyze a wide array of economic

    Alternatively, some authors claim that
    It depends on what command line options are used. Please read the readme.txt file in the release. It sounds like you should turn off the initial pass word deduplication with command line option -w0, use the -x option if you are not already doing so, and use the -v2 command line option to get it to output the dictionary in the order it was created and redirect the standard output to a file for viewing.

  36. #867
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    -x makes it "stop working" and isn't in the readme.txt. What's -x do?

    How did you do arithmetic coding with such big vocab? I know you look at front of codes ex. [a]aaaaaaaaaaa but then how do you force it to choose the right one still...
    Last edited by Self_Recursive_Data; 20th January 2020 at 20:41.

  37. #868
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Self_Recursive_Data View Post
    -x makes it "stop working" and isn't in the readme.txt. What's -x do?

    How did you do arithmetic coding with such big vocab? I know you look at front of codes ex. [a]aaaaaaaaaaa but then how do you force it to choose the right one still...

    You must not be using the latest version, GLZA v0.11.2. -x enables extreme parsing. This is necessary if you want the program to build the entire suffix tree in the first pass. This was the only mode before the -x option was added to allow the default compression mode to be faster.

    The dictionary codes are transmitted in three parts. First, a code for the first character is sent using an order 1 model, then the dictionary "Huffman" code length is sent, then the index of the entry is sent, all using arithmetic coding.

  38. #869
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    So you predict one of the millions of strings based on 1 char prior to it?

    Don't you have millions of ranges to jump into in your arithmetic encoding?

  39. #870
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Self_Recursive_Data View Post
    So you predict one of the millions of strings based on 1 char prior to it?
    Lol, yes. Somebody please correct me if I am wrong, but as far as I know that is state of the art for a compressor that uses grammar rules. Using higher order symbols or rules on either side would provide better compression ratios, but require more memory and be slower. With GLZA and enwik9 and similar textish files it is not uncommon for the next character to be a space roughly 50% of the time.

    Quote Originally Posted by Self_Recursive_Data View Post
    Don't you have millions of ranges to jump into in your arithmetic encoding?
    For each starting character, there can be up to 4,096 bins for dictionary codes and each bin can hold up to 32,768 dictionary entries. So a divide by up to 2^12 in one stage and a divide by up to 2^14 in the last stage.
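
    A toy illustration of this staged coding (not GLZA's coder; the probabilities are made up): each stage narrows the current interval using a model that only has to cover that stage's choices (256 first characters, then at most 4,096 bins, then at most 32,768 indexes), and the total code length is just the sum of the per-stage costs.
    Code:
    #include <stdio.h>
    #include <math.h>

    /* Narrow [low, high) to the sub-interval for a choice whose cumulative
       probability range within the current stage is [cum_low, cum_high). */
    void narrow(double *low, double *high, double cum_low, double cum_high) {
        double width = *high - *low;
        *high = *low + width * cum_high;
        *low  = *low + width * cum_low;
    }

    int main(void) {
        double low = 0.0, high = 1.0;

        /* Stage 1: first character, order 1 model (made-up probability 0.20). */
        narrow(&low, &high, 0.10, 0.30);
        /* Stage 2: code length bin, up to 4096 bins (made-up probability 0.05). */
        narrow(&low, &high, 0.40, 0.45);
        /* Stage 3: index within the bin; this bin happens to hold 1024 entries
           (made up), coded uniformly. */
        narrow(&low, &high, 0.0, 1.0 / 1024.0);

        double bits = -log2(high - low);
        printf("interval width %.3e -> %.2f bits for this dictionary symbol\n",
               high - low, bits);
        return 0;
    }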
