Thread: Tree alpha v0.1 download

  1. #871
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    https://paste.ee/p/qQroG
    I implemented a 'best string finder', I think. It stores the candidate sequences in a tree with counts, so it can efficiently find each string and its value. I left out the refining of the sorted list, but it's good enough to present. The value formula: savings = the string's count * the number of letters in the string. Cost = the letters in the string + count * 2, i.e. storing the string once plus referring to it with a Huffman code each time (2 bytes (16 bits) if there are 64,000 strings; this currently has to be adjusted manually). With cost and savings (both are printed, see code) you divide the savings by the cost to get the bitsPerByte ratio. The code then sorts the strings from greatest to least value. So the first one below is huge: from 1 million bytes fed in, its ratio is 21.5! You save (or 'store') 5445 bytes for just ~253 bytes.

    updated example:
    ['[[Structure of Atlas Shrugged|section', 5624, 75, 74.98666666666666]
    ['Structure of Atlas Shrugged|section', 5600, 75, 74.66666666666667]
    ['[Structure of Atlas Shrugged|section', 5472, 74, 73.94594594594595]
    ['tructure of Atlas Shrugged|section', 5440, 74, 73.51351351351352]
    ['ructure of Atlas Shrugged|section', 5280, 73, 72.32876712328768]
    ['ucture of Atlas Shrugged|section', 5120, 72, 71.11111111111111]
    ['cture of Atlas Shrugged|section', 4960, 71, 69.85915492957747]
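The value formula described in this post can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code behind the paste.ee link: the names `build_tree`, `walk`, `score`, `best_strings`, `max_len` and `code_bytes` are made up here, and overlapping occurrences are counted as-is. The cost column in the example rows above does not follow the `count * 2` rule from the post (they seem to come from an updated version), so this sketch is not expected to reproduce those numbers.

```python
def build_tree(data, max_len):
    """Insert every substring of up to max_len characters into a counted tree."""
    root = {"count": 0, "kids": {}}
    for start in range(len(data)):
        node = root
        for ch in data[start:start + max_len]:
            node = node["kids"].setdefault(ch, {"count": 0, "kids": {}})
            node["count"] += 1  # one more occurrence of the string ending here
    return root


def walk(node, prefix=""):
    """Yield (string, count) for every branch of the tree."""
    for ch, kid in node["kids"].items():
        string = prefix + ch
        yield string, kid["count"]
        yield from walk(kid, string)


def score(string, count, code_bytes=2):
    """Return (savings, cost, value) using the formula from the post."""
    savings = count * len(string)            # bytes covered by the string
    cost = len(string) + count * code_bytes  # store it once + one code per use
    return savings, cost, savings / cost


def best_strings(data, max_len=8, top=5):
    """Rank repeated strings from greatest to least value."""
    rows = [[s, *score(s, c)]
            for s, c in walk(build_tree(data, max_len)) if c > 1]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top]


for row in best_strings("the cat and the dog and the bird"):
    print(row)  # [string, savings, cost, value]
```

The tree grows with the input size times `max_len`, so a sketch like this is only practical for small inputs or short maximum lengths.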



    Kennon's attempt to find the 'best strings' that are long and frequent is the same thing as Shelwien's Green algorithm (https://encode.su/threads/541-Simple...xt-mixing-demo), which learns most of its context strings online. Notice that as Shelwien's tree/lists grow after eating what it generates, the branches get longer and carry more stats: yup, long frequent strings. So when it steps over 16 bytes, it basically sends them to the tree and finds the match, and instead of predicting whole strings it predicts just the next byte. Both algorithms get the wiki8 100MB down to ~21MB. Shelwien's can store 16-byte contexts (which costs 2GB of RAM), so I think it can do ~150 byte masks? That's enough to cover/detect most duplicates. For longer dups you can easily find a way to add them.
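To make the "predict just the next byte from the matched context" idea concrete, here is a rough Python sketch. It is not Green's actual implementation: Green mixes the predictions of several orders, while this toy simply takes the longest context it has seen. The class name `ContextModel`, the `max_order` parameter and the `train` helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class ContextModel:
    """Toy order-N model: counts which byte followed each context."""

    def __init__(self, max_order=4):
        self.max_order = max_order
        self.stats = defaultdict(Counter)  # context bytes -> next-byte counts

    def update(self, history, next_byte):
        """Record next_byte after every suffix of history up to max_order."""
        for order in range(1, min(self.max_order, len(history)) + 1):
            self.stats[history[-order:]][next_byte] += 1

    def predict(self, history):
        """Return the most frequent byte after the longest known context."""
        for order in range(min(self.max_order, len(history)), 0, -1):
            context = history[-order:]
            if context in self.stats:
                return self.stats[context].most_common(1)[0][0]
        return None  # nothing seen yet for any suffix


def train(model, data):
    """Learn online: update the stats one byte at a time."""
    for i in range(1, len(data)):
        model.update(data[max(0, i - model.max_order):i], data[i])


model = ContextModel(max_order=4)
train(model, b"abcabcabc")
print(chr(model.predict(b"ab")))
```

A real compressor would turn the counts into probabilities and feed them to an arithmetic coder instead of returning a single guess.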
    Last edited by Self_Recursive_Data; 23rd January 2020 at 18:16.

  2. #872
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    I updated the link above. This version gives the strings, but some of them overlap... At least it is half-baked.

  3. #873
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    142
    Thanks
    12
    Thanked 2 Times in 2 Posts
    I think your way is better; only read on if you're really interested.
    ~~~~~~~~~~~~~~
    Here are the full details of what I did to get the strings:
    https://ibb.co/R20hcTg

    So for every branch you get all overlay branches based on counts. You'd take the top-sorted (best) string, delete it from the input file, and then run the algorithm all over again, but of course that would take too long.
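The "take the best string, delete it, run again" loop could look roughly like the Python sketch below. It is a brute-force illustration that shows why the approach is slow, not the method from the linked image: the names `best_string`, `greedy_dictionary`, `max_len`, `rounds` and the `"\x00"` placeholder are assumptions, and the input is assumed not to contain `"\x00"` itself.

```python
from collections import Counter

PLACEHOLDER = "\x00"  # marks removed strings so the pieces don't join up


def best_string(data, max_len=12, code_bytes=2):
    """Return the repeated string with the highest savings/cost value."""
    counts = Counter(
        data[i:i + n]
        for i in range(len(data))
        for n in range(2, max_len + 1)
        if i + n <= len(data) and PLACEHOLDER not in data[i:i + n]
    )
    best, best_value = None, 1.0  # a value <= 1 saves nothing
    for string, count in counts.items():
        value = (count * len(string)) / (len(string) + count * code_bytes)
        if value > best_value:
            best, best_value = string, value
    return best


def greedy_dictionary(data, rounds=3):
    """Repeatedly pick the best string and remove it from the data."""
    chosen = []
    for _ in range(rounds):
        string = best_string(data)
        if string is None:
            break
        chosen.append(string)
        data = data.replace(string, PLACEHOLDER)  # recount from scratch next round
    return chosen, data


print(greedy_dictionary("hello world hello world hello world "))
```

Every round recounts all substrings of the remaining data, which is exactly the cost that makes the full refinement impractical on large files.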

  4. #874
    Member
    Join Date
    Jan 2020
    Location
    Chagrin Falls, OH
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    GLZA is segfaulting trying to compress this csv file with "-x" or "-p0.0": https://thepiratebay.org/torrent/664...d_(cables.csv)

    compressing the file without "-x" or "-p0.0" seems to work.

  5. #875
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    697
    Thanks
    154
    Thanked 185 Times in 109 Posts
    Quote: Originally posted by hotaru
    GLZA is segfaulting trying to compress this csv file with "-x" or "-p0.0": https://thepiratebay.org/torrent/664...d_(cables.csv)

    compressing the file without "-x" or "-p0.0" seems to work.
    I would be happy to investigate the problem if you will put the file on a site that does not maliciously attack my computer.

  6. #876
    Member
    Join Date
    Jan 2020
    Location
    Chagrin Falls, OH
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote: Originally posted by Kennon Conrad
    I would be happy to investigate the problem if you will put the file on a site that does not maliciously attack my computer.
    sorry, I forgot how bad the pirate bay is without adblock.

    does this work? https://thinkindifferent.net/cables.csv.xz

  7. #877
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    697
    Thanks
    154
    Thanked 185 Times in 109 Posts
    Quote: Originally posted by hotaru
    Yes, thank you, that worked great. I am running the compressor now. Since it is such a large file, it may take a few days to duplicate and fix the problem(s). We'll see.

  8. #878
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    697
    Thanks
    154
    Thanked 185 Times in 109 Posts

    GLZA v0.11.3

    There was a problem with managing grammar creation when the number of grammar rules was about to exceed the sizes of the internal structures. I fixed that and also increased the maximum structure sizes.

    Code:
    timer64 GLZA c -x cables.csv cables.glza
    
    Compressed 1730507223 bytes -> 283953657 bytes (1.3127 bpB) in 37434.599 seconds.
    
    Kernel  Time =  3851.234 =   10%
    User    Time =218303.593 =  583%
    Process Time =222154.828 =  593%    Virtual  Memory =  14495 MB
    Global  Time = 37434.601 =  100%    Physical Memory =  12303 MB
    Code:
    timer64 GLZA d cables.glza cables.csv
    
    Decompressed 283953657 bytes -> 1730507223 bytes (1.3127 bpB) in 15.480 seconds.
    
    Kernel  Time =     0.734 =    4%
    User    Time =    29.750 =  192%
    Process Time =    30.484 =  196%    Virtual  Memory =    684 MB
    Global  Time =    15.485 =  100%    Physical Memory =    627 MB
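As a quick check of the figures reported above, the bits-per-byte value follows directly from the two file sizes. The helper name `bits_per_byte` is made up for this small Python sketch; the sizes are the ones printed by GLZA.

```python
def bits_per_byte(original_bytes, compressed_bytes):
    """Compressed size in bits divided by the original size in bytes."""
    return compressed_bytes * 8 / original_bytes


original = 1730507223    # cables.csv
compressed = 283953657   # cables.glza

print(f"{bits_per_byte(original, compressed):.4f} bpB")
print(f"compression ratio {original / compressed:.2f}:1")
```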

  9. #879
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    697
    Thanks
    154
    Thanked 185 Times in 109 Posts

    GLZA v0.11.4

    I found a few problems while developing a .dll for the GDC competition. There were a couple of memory initialization problems, a memory leak, a model initialization problem, and a problem with files that ended with a capital locked string.

    Also, I am including a second .zip file that contains GLZA.dll, a .dll based on GLZA v0.11.4 that is compatible with the GDC dll requirements.

