# Activity Stream

• Today, 20:38
I'm learning so much from yous though. I've almost crossed no man's land. (I tried reading the material.)
857 replies | 408524 view(s)
• Today, 20:35
btw, I suggest you to also look into BCM. it applies CM after BWT, so the model used is really simple - bitwise based on last 16 bits or so. It may be even simpler than green
51 replies | 1434 view(s)
• Today, 20:28
If I feed Green "hello world and hello again", how many windows are placed on it and where and how long is each window? For example, If I have 3 models that have their own position/offset, and all 3 windows are 4 letters long of view, and are are closest to front, we get: "hello world and helloi]n]" or more intuitively (if I separate their overlayer): "hello world and hello a" "hello world and hello n" "hello world and helloin" We have 3 different (because offset) models looking at the most recent text. Each window only views 4 letters because the tree would die hard if we went for view of 16 letters. Especially if have 3 windows each 16 letters wide lol. So where are you getting this order-16 from? It can't be 16. Do you mean chunks at a time, just like I showed here? Which we then mix together of course. Your formula is different from my formula? https://ibb.co/xHyhnmk Can you tweak my image so I understand your formula? It doesn't make sense at all in many ways... w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total ---------------------------- P = Sum * o.freq/o.total, j=0..maxorder ] / Sum ] P0=w1*0.35+w2*0.20+w3*0.35; P0=w1*0.20+w2*0.35+w3*0.20; P0=w1*0.05+w2*0.05+w3*0.02; P0=w1*0.02+w2*0.02+w3*0.05; sum = P0+P0+P0+P0; P=P0/sum; P=P0/sum; P=P0/sum; P=P0/sum;
51 replies | 1434 view(s)
• Today, 20:22
Just read the source. Most people won't explain things until you understand.
857 replies | 408524 view(s)
• Today, 20:01
857 replies | 408524 view(s)
• Today, 18:21
this is Crush v1.6 with small compression ratio improvement from crush v1.5 ​
8 replies | 614 view(s)
• Today, 17:29
> What does P0 mean? Just an intermediate result. We need to normalize the distribution, since sum of probabilities should be 1. P is the final distribution used for coding. > What is the constant? From what? A letter? Its these constants (config.inc in green_v3a.rar): uint coef = { 27584, 4, 208, 0, 15592, 32832, 4, 1, 59299, 86318, 7, 0, 26624, 65536, 3, 5, 8178, 38914, 0, 32, 7520, 56802, 0, 58, 7104, 51840, 0, 73, 7425, 47736, 0, 248, 8192, 41088, 0, 76, 7236, 32264, 0, 17, 1280, 34881, 0, 162, 4096, 40960, 0, 152, 9380, 24634, 0, 45, 4096, 25120, 0, 13, 4096, 24576, 0, 14, 4096, 41088, 0, 15, 4096, 49469, 0, 16 }; They're for transforming n (number of different symbols that appeared in context), total (number of occurrences of context), and order (j here) into a weight value. This is the formula in the source: uint w = coef/(1+coef+total/(coef+1)); // mixer weight As you can see, a context with (n==1), ie only one symbol, gets a higher weight, since its likely to continue like that. > What is w1*0.35 mean and what is result for just that? What is w1? I changed 35% to 0.35 because I normally work with probabilities in 0..1 range. And we can start by normalizing weights (w1'=w1/(w1+w2+w3) etc) and probability distributions, then w1*P1 would be subrange of arithmetic code assigned to symbol 'e' of the first submodel. We don't need to know which model seemed better to encoder, so we have to merge same-symbol intervals of different submodels. > It'll just add to 100% and after we /sum will say 'e' is ex. 50% of it, which is what we knew. This normalization step here also normalizes weights, so it has to be there anyway. Also computer programs don't really calculate anything precisely. So we can get slightly more or less than 100% due to rounding of intermediate values etc. > My understanding is still each model has its %s for the probability for each letter, > then add each model m+m+m, then divide by 3. help... That's what it does, just that "add each model then divide by 3" means w1=1/3; w2=1/3; w3=1/3; and its not the best weight function. > Your algorithm Green is really fast and 'good start' for me and I want to know how it works exactly. > What are you storing? There's this: byte* inpbuf = new byte; if( inpbuf==0 ) return 1; // unpacked data buffer uint* l2 = new uint; if( l2==0 ) return 1; // o3+ link buffer static qword freq; static uint low; static uint o0; bzero( o0 ); static uint o1; bzero( o1 ); // 256k static uint o2; bzero( o2 ); // 64M static uint p2; bzero( p2 ); // +64M, last occurrence pointers So it stores symbol counts up to order2 as is, then builds linked lists (via l2/p2 arrays, which store location of each previous occurrence of each order3 context). > I see you say bruteforce, what? Because unlike normal CM/PPM, it builds the distributions for o3+ dynamically, by visiting up to DEPTH (=8192) recent context occurrences. > If we take a piece of text "hello world and hello again" how many > letters far back does it consider after trained offline and is running now? Well, it would go back once for "hel".."hello " contexts, I don't see any other o3+ matches in your example. > 16 orders? How many context models? 1 length of 16? No way. 40 models? 17 = order0 .. order16 > Do you use a dict or tree? What is it doing that I'm not. "green.cpp" uses a data structure usually known as "hash chains" in LZ, although without a hash function. "tree.cpp" implements the same model with an actual tree. > Here is my formula/understanding. > Can you tweak this image to Green's formula? > I will grasp it if it's visual. https://ibb.co/xHyhnmk I made a log of distributions during green's processing of " the cat was running down the street" (I added 3 spaces at start, because green just writes first 3 symbols as is without modelling) See it in attached file.
51 replies | 1434 view(s)
• Today, 10:35
Kennon, very nice compressor! Wow!! So can you explain to me very clear and in few words exactly (mostly) how it works so that I can replicate your algorithm. I read the pdf... I see you are looking for the best longest phrases like Byte Pair Encoding when you mention "954 characters. It appears 1694 times in the encoded file". What encoded file? You mean wiki9 don't you? Or by encoded you mean the tree holding wiki9 lol? 1) So you build a tree, a trie ? (Of all of wiki9). See figure_1 below. 2) Are you finding the common "codes" in wiki9 by a form of Byte Pair Encoding ? Ex. "". 3) And are you replacing these windows (where they sit in wiki9, like Huffman does) with their own "huffman-like" code ? Ex. "". 4) Can you explain the "best code" discovery again? So you look for what in the tree/trie? A node that extends very long and has few children for miles? Or also is frequent? Or is near the Root? Figure_1 (tree): //////////a< ////a< //////////b< S< //////////a< ​////b< //////////b<
857 replies | 408524 view(s)
• Today, 07:33
See above. Here is my formula/understanding. Can you tweak this image to Green's formula? I will grasp it if it's visual. https://ibb.co/xHyhnmk See how simple that is? This allows the correct letter prediction to be suggested based on the witnesses you have and their position. I see the full formula is: w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total ---------------------------- P = Sum * o.freq/o.total, j=0..maxorder ] / Sum ] As for Green's model(s?) themselves, the context matches - are they stored in a tree, a raw dict (no data-structure), ? And that's it for Green right? It just looks at the past 16 letters and does the formula? How in the world do you store order16? You need multiple models to chop it up though....:-(
51 replies | 1434 view(s)
• Today, 07:12
What does P0 mean? What is the constant? From what? A letter? What is the count... What is w1*0.35 mean and what is result for just that? What is w1? What is the weight for weight1? Why is "sum = P0+P0+P0+P0;" combining the % we, have, already. It'll just add to 100% and after we /sum will say 'e' is ex. 50% of it, which is what we knew. My understanding is still each model has its %s for the probability for each letter, then add each model m+m+m, then divide by 3. help... Your algorithm Green is really fast and 'good start' for me and I want to know how it works exactly. What are you storing? I see you say bruteforce, what? If we take a piece of text "hello world and hello again" how many letters far back does it consider after trained offline and is running now? 16 orders? How many context models? 1 length of 16? No way. 40 models? Do you use a dict or tree? What is it doing that I'm not.
51 replies | 1434 view(s)
• Today, 06:45
Sportman replied to a thread Zstandard in Data Compression
Cyan and Terrelln thanks for picking up and fixing this issue so fast, can't wait till the official release to see if there are more files with till 10% storage and till 8% compression speed savings in big data production environments with zstd ultra modes.
394 replies | 122113 view(s)
• Today, 06:29
> Why, are, you, adding model1+total/model3? C's are constants, not models. > and what is "green"? Its in this thread that I linked before: https://encode.su/threads/541-Simple-bytewise-context-mixing-demo > Please show me result after adding these weights togther > e.g the averaging of them (an example, not has not be precise): It would be something like P0=w1*0.35+w2*0.20+w3*0.35; P0=w1*0.20+w2*0.35+w3*0.20; P0=w1*0.05+w2*0.05+w3*0.02; P0=w1*0.02+w2*0.02+w3*0.05; sum = P0+P0+P0+P0; P=P0/sum; P=P0/sum; P=P0/sum; P=P0/sum; where w's are computed from available information with a weight function like the one you complained about. There's no good way to compute optimal weights just from a few probability distributions - some history is necessary. Of course, perfect weights in this case would maximize the probability of current symbol: 'e': w1=1; w2=0; w3=0; 'h': w1=0; w2=1; w3=0; 'z': w1=1: w2=0; w3=0; '&': w1=0; w2=0; w3=1;
51 replies | 1434 view(s)
• Today, 05:50
> Am I missing anything.... What/where is the Weight Averaging? How is it done (use my example above). >> Mostly by using different weights, based on history of previous predictions. >> For example, "green" uses weight function like this: C1/(C2+total/C3). >> Where Ci are tuned constants, total is the number of context's occurrences, and "o" is context's order. Whoa you lost me with "C1/(C2+total/C3)". You have 3 context models where each has its own set of probabilities for a, b, c, d, etc. We get their %s by totaling the counts for a given set. We then add sets together and divide them too. Why, are, you, adding model1+total/model3? And not dong this for model1? This isn't clear. To me the meaning of "C1/(C2+total/C3)" means: model1 35%=e, 20%=h, 5%=z, 2%=& divided by ( model2 20%=e, 35%=h, 5%=z, 2%=& +300% (total) divided by: model3 5%=e, 20%=h, 2%=z, 5%=& ) me=confused and what is "green"? Please show me result after adding these weights togther e.g the averaging of them (an example, not has not be precise): > "35%=e, 20%=h, 5%=z, 2%=&" > "20%=e, 35%=h, 5%=z, 2%=&" > "35%=e, 20%=h, 2%=z, 5%=&"
51 replies | 1434 view(s)
• Today, 05:42
terrelln replied to a thread Zstandard in Data Compression
Thanks for the benchmarks sportman! The multithreading issue is fixed in the latest dev branch, and we now mention --single-thread in the help.
394 replies | 122113 view(s)
• Today, 05:17
> "35%=e, 20%=h, 5%=z, 2%=&" > "20%=e, 35%=h, 5%=z, 2%=&" > "35%=e, 20%=h, 2%=z, 5%=&" > So I simply add them (smash the 3 rows together into 1 row) and then divide each % by 3. Well, it would be linear mixing with weight = 1/3 for each model. Also you can skip "divide by 3", but probabilities in the coding distribution should add up to 100%. > Am I missing anything.... What/where is the Weight Averaging? How is it done (use my example above). Mostly by using different weights, based on history of previous predictions. For example, "green" uses weight function like this: C1/(C2+total/C3). Where Ci are tuned constants, total is the number of context's occurrences, and "o" is context's order. Ash07 above basically uses the same method, so there's at least some potential in it. Without SSE its result is 19,492,857 though. However bitwise approach is easier and provides better compression in the end.
51 replies | 1434 view(s)
• Today, 04:47
Say I get 3 models telling me: Model_1 "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=e, 20%=h, 5%=z, 2%=&" ​Model_2 "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=h, 20%=e, 5%=z, 2%=&" ​Model_3 "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=e, 20%=z, 5%=&, 2%=z" So I simply add them (smash the 3 rows together into 1 row) and then divide each % by 3. Am I missing anything.... :-) What/where is the Weight Averaging? How is it done (use my example above).
51 replies | 1434 view(s)
• Today, 04:31
I tried increasing the number of iterations for method#2 to 10M, it improved, while #4 got worse because of different seed... >a.exe method 1: 0.06072729 method 2: 0.08386047 method 3: 0.08643947 method 4: 0.05860876 method 5: 0.06026703 >a.exe method 1: 0.06072729 method 2: 0.07329438 method 3: 0.08221427 method 4: 0.05916388 method 5: 0.06026703
4 replies | 204 view(s)
• Today, 04:11
> Wow 18.5MB, you made that algorithm? Yes, in 2003, its open-source: http://ctxmodel.net/files/ASH/ > Now when I mix the models, do I do weight averaging before, > after, or both before and after mixing models? https://en.wikipedia.org/wiki/Context_mixing I'm not sure what you're asking... you need a single probability to encode a bit, but there can be multiple layers of mixing. Mixer basically is a context model itself - it takes context (submodel predictions plus known data) and produces one prediction. For example, I tend to chain binary mixers instead of using one n-ary like paq.
51 replies | 1434 view(s)
• Today, 04:06
I wrote a quick benchmark to compare some approaches: https://gist.github.com/byronknoll/3bafb846d2760ab6081a37ac700c0618 My original idea of sampling uniform random numbers did not perform well. The best performing technique so far uses numbers sampled from a Bates distribution.
4 replies | 204 view(s)
• Today, 03:45
Wow 18.5MB, you made that algorithm? Now when I mix the models, do I do weight averaging before, after, or both before and after mixing models?
51 replies | 1434 view(s)
• Today, 03:14
> Shelwien, what's your wiki8 record? I never really tried to set one on purpose... Tried running my relevant coders and its like this: 98,056,487 enwik_text3a // https://encode.su/threads/2590-Some-perl-scripts-for-enwik8-parsing 59,878,185 enwik_text3a.drt // DRT.exe enwik_text3a enwik_text3a.drt 20,185,928 enwik_text3a.drt.bwtmix // BWTmix.exe c600 enwik_text3a.drt enwik_text3a.drt.bwtmix 20,013,580 enwik_text3a.drt.pmd32 // pmd.exe c enwik_text3a.drt enwik_text3a.drt.pmd32 32 1400 (ppmd_sh9) 19,418,811 enwik_text3a.drt.CMo8 // CMo8.exe c12 enwik_text3a.drt enwik_text3a.drt.CMo8 18,510,302 enwik_text3a.drt.ash07 // ash07.exe /o256 /m4000 /s255 enwik_text3a.drt enwik_text3a.drt.ash07 I could probably reach 17 by combining CMo8 with mod_ppmd and mod_SSE - CMo8 doesn't have an SSE stage atm, its a simple logistic mix of o0..o8 to test new counters. If you count paq, then I have this: https://encode.su/threads/2515-mod_ppmd mod_ppmd is probably still a part of "phda" - it was there last time I checked. > How do you store all these oh too many ex. order-4 contexts? A tree? A dict? > Or an all-connected FeedForward architecture? Can you draw a pic of it in full? CMs mostly use hashtables: https://en.wikipedia.org/wiki/Hash_table Although ppmd/ppmonstr/durilca do use a tree. Here's an example with a linked list: https://encode.su/threads/541-Simple-bytewise-context-mixing-demo > A tree can appear fast/low memory but in the end it is not going to be the best. > We want less nodes and less searching. Yeah, hashtable is a much better solution for compression (where it can be used), because its not necessary to avoid collisions too much etc. In a compressor it can be used as a lossy data structure, so its much more efficient than a tree - can use all the memory for counters, while for a tree we need to keep lots of pointers too.
51 replies | 1434 view(s)
• Today, 02:53
Sportman replied to a thread Zstandard in Data Compression
2,337,506,087 bytes, 2,596.459 sec. - 13.159 sec., zstd -18 --single-thread (v1.4.4) 2,299,225,559 bytes, 3,196.329 sec. - 13.559 sec., zstd -19 --single-thread (v1.4.4) 2,197,171,322 bytes, 3,947.477 sec. - 14.471 sec., zstd -20 --ultra --single-thread (v1.4.4)
394 replies | 122113 view(s)
• Today, 02:40
Cyan replied to a thread Zstandard in Data Compression
Thanks @Sportman for exposing this use case thanks to your new benchmark ! It made it possible to identify the issue, and @terrelln basically nailed it in a new patch. This issue will be fixed in the next release (v1.4.5).
394 replies | 122113 view(s)
• Today, 00:26
Shelwien, what's your wiki8 record? How do you store all these oh too many ex. order-4 contexts? A tree? A dict? Or an all-connected FeedForward architecture? Can you draw a pic of it in full? A tree can appear fast/low memory but in the end it is not going to be the best. We want less nodes and less searching.
51 replies | 1434 view(s)
• Yesterday, 23:21
Here's my coder with lzma counters: http://nishi.dreamhosters.com/u/c0lit_v0.7z 61251024 2.063s 2.156s // c0lit_cl90.exe 61251024 2.265s 2.297s // c0lit_gc82.exe 61251024 2.109s 2.156s // c0lit_ic19.exe 61251024 2.172s 2.344s // c0lit_cl90_fil.exe 61251024 2.094s 2.265s // c0lit_gc82_fil.exe 61251024 2.218s 2.312s // c0lit_ic19_fil.exe 61251024 1.750s 1.844s // c0lit_cl90.exe 61251024 1.844s 2.000s // c0lit_cl90_fil.exe 61251024 1.891s 1.969s // c0lit_gc82.exe 61251024 1.782s 1.937s // c0lit_gc82_fil.exe 61251024 1.781s 1.829s // c0lit_ic19.exe 61251024 1.875s 1.969s // c0lit_ic19_fil.exe E MB/s size ratio% D MB/s function 62.39 61250808 61.25% 56.88 11-rcs o0 simple enwik8 61250808 1.602s 1.758s // trc.exe recalculated from speed 61250820 1.953s 1.891s // trc.exe -11 "enwik8" 1 (gcc82, modified test.bat) Its timed with file i/o, so I wonder how it actually compares...
21 replies | 1682 view(s)
• Yesterday, 21:37
Sportman replied to a thread Zstandard in Data Compression
Confirmed: 2,136,340,302 bytes, 4,604.696 sec. - 15.121 sec., zstd -21 --ultra --single-thread (v1.4.4)
394 replies | 122113 view(s)
• Yesterday, 21:18
Larger renorm (ex. 48 bit) will make substantial changes necessary in the coding. A priori it will not make the RC faster. The speed issue of a RC is the sequential processing in the decoding. You can use interleaving to speed up rANS. Actually I've not tried interleaving (2 streams) with the Turbo-Range-Coder. Wonder if it's possible and how to rearrange the output of the RC, so to make it possible to use interleaving with one stream at decoding. The 64 bit RC is reading and writing in 32 bits units only. I've added some logic to use the renorm only when necessary in the o0 simple coder (option 11). enwik9 - skylake i6700 3.4GHz E MB/s size ratio% D MB/s function RC_IO RC_BITS 56.34 620093192 62.01% 47.56 11-rcs o0 simple 32 15 4 Renorm/Byte 54.58 620093192 62.01% 47.10 11-rcs o0 simple 32 15 8 Renorm/Byte (previous version) 50.70 620717608 62.07% 42.58 11-rcs o0 simple 16 12 2 Renorm/Byte I've added the type __uint128_t to rcrange_t to support 128 bit RC, without changing the coding. It's too slow as expected. Output (RC_IO) can be 8,16,32 or 64 bit. The fastest RC is the default 64 bit RC (RC_SIZE = 64) and with 32 bit i/o (RC_IO = 32).
21 replies | 1682 view(s)
• Yesterday, 21:10
The final version of BCM should be given a special name like Gigawatts v1.21 (If David Scott adds to it making it bijective, that version could be called "Great Scott!") lol just a few nomenclature ideas to go with your avatar. ;-) Nice work btw.
113 replies | 55159 view(s)
• Yesterday, 19:53
Sportman replied to a thread Zstandard in Data Compression
Confirmed: 2,364,340,233 bytes, 5,667.053 sec. - 15.339 sec., zstd -22 --ultra -T1 (v1.4.4) 2,080,479,075 bytes, 5,257.974 sec. - 15.484 sec., zstd -22 --ultra --single-thread (v1.4.4) Zstd v1.4.4 windows binary help (-h) do not mention argument --single-tread only argument T: -T# : spawns # compression threads (default: 1, 0==# cores)
394 replies | 122113 view(s)
• Yesterday, 19:26
> All artificial generated data isn't a bad idea because Matt did it > once and someone found a way to preprocess it for better compression. Artifical data is bad because its artifical. It would work if we can reproduce the properties of the data which we are specifically targeting (eg. BWT output), but would prevent discovering new patterns (which is good if its fit for practical applications). When writing the data generator its easy to miss something, which can be exploited in the contest (eg. there're PRNGs in actual compiler standard libs which produce "random numbers" compressible by paq). > I think its still the way to go for an objective competition. I also considered it, but don't see how to make it useful and unexploitable. >> I'd rather have preprocessors for relevant data (eg. exes) than for something artifically generated for the contest. > Would hate this. I love the idea behind Paq8 but all the custom man made models for all sorts of cases Its a good way to create actual competition, with multiple people making improvements. paq8 is too complicated and "solid" (not modular), so imho its effectively stopping CM development since its release. paq8 lacks many things (explicit parameters for component tuning, precision, support for enc/dec asymmetry), but a new CM with new features can't compete with paq8 in benchmarks, because first it needs to have all the same submodels (match, sparse, periodic, 2D, indirect), and there's no analysis of submodel contribution, which would explain to a new developer what has to be integrated to compete with paq8, and why. Btw, there I only listed "pure" universal models, which make sense just based on theory. > (at least for me) go against the idea of general artificial intelligence. This contest certainly won't be about AI. The (potential) sponsor is storage-related. Also I much prefer standalone preprocessors to a combination of complex features which combine to make use of some pattern in the data. For example, consider rep-matches in lzma (and clones) - they were (likely) introduced simply because somebody noticed that LZ distances are repeated sometimes. But then lzma's optimal parser has heuristics for checking multi-token patterns like match-literal-rep_match, which effectively let it choose longer "fuzzy matches" over simple nearest repeated substrings. > If the competition will be a game of hand crafted preprocessing and a bunch of custom man made models I would lose interest. We'll see I guess. I still think that making a good exe filter for blockwise zstd (for example) would be more interesting than tweaking paq8. If its not for you, Hutter's contest is still active. > It would be more interesting to allow the compressor to build/develop/find (unsupervised training) > a custom model that acts like a preprocessor or custom model for the data and see this a separate step besides encoding/decoding. In theory I agree, and that was supposedly the original Hutter's (and Mahoney's) intention for the contest. But in practice I don't believe in fully learning english from enwik, even if we'd get humans to work on it. Some aspects are discoverable (like words/spaces/punctuation, or parts of morphology; maybe synonyms), but actual understanding is only possible with external information (eg. pictures with objects known from external information), and some parts are impossible without supervision (rhymes, rhytm and other stuff related to phonetics; detailed morphology; idioms) or at least specially prepared descriptions fed to AI. Of course, some kinds of data analysis are still possible - for example, atm paq8 doesn't have parsing optimization. But I can't add it to paq (it requires backtracking etc) and if I'd make a new CM with it, it would have worse compression than paq, and thus there won't be any interest. Btw, I'd like your feedback on this project: https://encode.su/threads/3072-controlled-regexp-tools-for-HP
17 replies | 645 view(s)
• Yesterday, 18:11
1. 32-bit i/o in x64 rc is a good speed optimization. In fact, I have an old 32-bit coder like that which used FP registers for rc state, so this method should be relatively portable. But I wonder if more than 32-bit would make sense too? I tried testing it with my rc, and it somehow made decoding speed worse: 45411520 6.203s 6.890s // 32-bit renorm/LE 45729030 6.156s 6.969s // 48-bit renorm/BE 45729030 6.187s 7.078s // 48-bit renorm/LE In theory, even 56-bit renorm is possible, eg. with counters like in lzma. 2. Its possible to make a binary rc with 128-bit range. It would require more multiplications, but would also allow 64-bit renorm and/or only checking for renorm once per 8 bits. I wonder if it would work as speed optimization.
21 replies | 1682 view(s)
• Yesterday, 17:53
I think that's a Baconian fallacy? All artificial generated data isn't a bad idea because Matt did it once and someone found a way to preprocess it for better compression. I think its still the way to go for an objective competition. Would hate this. I love the idea behind Paq8 but all the custom man made models for all sorts of cases (at least for me) go against the idea of general artificial intelligence. If the competition will be a game of hand crafted preprocessing and a bunch of custom man made models I would lose interest. This would quickly lead to abuse of the rule I am afraid. It would be more interesting to allow the compressor to build/develop/find (unsupervised training) a custom model that acts like a preprocessor or custom model for the data and see this a separate step besides encoding/decoding.
17 replies | 645 view(s)
• Yesterday, 17:23
> I'm not sure what to think. If things should be practical we get just another dictionary coder, > written in c(++) and fine tuned for the testdata. PPM/fast_CM/BWT have faster encoding and better compression than LZ. And yes, I don't intend to support python/java/C#/JS - these developers can find their own solution for trans-compiling to C/C++, since there're no high-quality compilers for anything else. Also modern LZs have much higher algorithm complexity than anything else, and its still possible to improve their parsing optimization and entropy coding. It may be also possible to speed-optimize paq-like CMs. > What I would suggest is artificially generated data. That was attempted before: http://mattmahoney.net/dc/uiq/ And failed fast enough, after a method of preprocessing was discovered. I'd rather have preprocessors for relevant data (eg. exes) than for something artifically generated for the contest. > The contest creater/holder keeps the private set for the final scores. I do intend to keep the test corpus private. > But I would love to see a second competition where purely the size is the objective. I have some hope that Hutter Prize would be rebooted for that. > Why not a memory constraint of 16gb and a time constraint of 24 hours for this competition and lets see what the competitors create? I agree on 16gb, but smaller time limit imho helps competitors. Otherwise it could easily become a competition in having access to more hardware to run some cmix/nncp hybrid. I'm actually more interested in taking off the decoder size limit somehow, to enable natural language parsing etc. But that would rely too much on having unique test data, which nobody can guess.
17 replies | 645 view(s)
• Yesterday, 17:05
I love the idea. Maybe even smaller blocks like 32 byte or variable length? Just as in a real database? This way you can't just use zstd to win. You could use one or more open source databases as input.
17 replies | 645 view(s)
• Yesterday, 16:44
> Maybe change the objective, but still stay with enwik8? Enwik8 compression is basically only useful for enwik8 compression. Enwik file contains too much unique stuff like wiki markup, large tables in mixed html/wiki markup, bibliographies, reference lists, language lists. So some preprocessing tricks can significantly change the results even if we'd run a blockwise random-access benchmark on enwik. Which could be ok normally, but enwik tricks are not relevant for anything else. > Hutter Prize limits the memory and time and rewards highest compression ratio. > Maybe instead we should seek fastest compressors that operate within some memory limit and compress enwik8 to e.g. under 18,000,000 bytes. Under _19,000,000_ are basically only coders with WRT preprocessing and slow CM/PPM+SSE. I guess its possible to setup rules such that durilca would win, but none of paq derivatives, but it would keep most of drawbacks of HP contest (like having no applications, encouraging information gathering rather than programming, etc). I'm thinking about focusing on compression of small blocks (4k..1M) for random access, since 1) Its an important (basically only) use-case in practical applications - filesystems, databases, storage, even games. But "compression scene" has very limited resources for it, 90% of LTCB entries are unfit, while a number of remaining ones require refactoring and new features (eg. external dictionary support) to be competitive. 2) Benchmark order significantly changes in this case - many popular choices would become useless (eg. BSC has worse compression and is slower than ppmd), so it should be interesting. Well, in practical applications zstd would currently win (compression ratio + decoding speed + external dictionary support), but better ratios and faster encoding are possible, so there's stuff to work on. 3) Its possible to make progress by tweaking existing codecs or preprocessors, which should let people participate in the contest without having to invest months of work.
17 replies | 645 view(s)
• Yesterday, 16:33
I'm not sure what to think. If things should be practical we get just another dictionary coder, written in c(++) and fine tuned for the testdata. What I would suggest is artificially generated data. A collection of files with different (1kb, 1mb, 1gb) sizes and different probability distributions (easy, moderate, hard) on bit and byte level. Let's say 18 files in total. A public and a private set. People are allowed to train their algorithm on the public set. The contest creater/holder keeps the private set for the final scores. We could use a widely accepted tool like 7zip on default settings to set a ceiling size for every file. Competitors have to write a program that is able to compress equal or better than default 7zip and are judged by the sum of encoding and decoding time. I would also suggest to use a modern memory constraint like 8gb or something. But I would love to see a second competition where purely the size is the objective. From the past we learned that time or memory constraints diminish over the years and that innovative but impractical algorithms from the 80th's and the 90th's are just fine and mainstream now. Why not a memory constraint of 16gb and a time constraint of 24 hours for this competition and lets see what the competitors create?
17 replies | 645 view(s)
• Yesterday, 12:43
enwik8 would take too long, but here's book1: source o1rc paq8px qlfc BOOK1.bwt 768,775 208,199 208,397 212,100 BOOK1.bwth 768,771 226,555 220,485 233,853 BOOK1.bwtl 768,771 228,076 221,892 235,693 BOOK1h.bwt 6,150,172 220,097 231,029 219,453 BOOK1l.bwt 6,150,172 221,339 232,752 220,643 "h" is bit7 to bit0 order, "l" is bit0 to bit7. As expected, it makes compression worse, because byte alignment is important for text (you'd have more precision in predicting bit7=0 from bitpos%8 than from prefix context; also there're letter cases.) Also you have to understand that BWT is very limited, its not fit for breaking compression ratio records. Basically, BWT suffix array is a data structure that simplifies access to prefix order* contexts... at the cost of hiding everything else.
51 replies | 1434 view(s)
• Yesterday, 12:16
Piglet replied to a thread Zstandard in Data Compression
It would be nice to have a switch/settings that runs without compromise. Of course the runtime would be long, but it would provide the smallest possible size. It would be interesting to know where zstd ends.
394 replies | 122113 view(s)
• Yesterday, 12:15
Bulat Ziganshin replied to a thread Zstandard in Data Compression
Note that m/t also affects level 21
394 replies | 122113 view(s)
• Yesterday, 11:36
D:\>bcm -9 wikipedia-en-html.tar Compressing wikipedia-en-html.tar: 223674511360 -> 7800893626 in 27559.1 sec Tested on HDD though. 208 GB of Wikipedia slimmed down to 7 GB... :_coffee:
113 replies | 55159 view(s)
• Yesterday, 06:22
Has anyone tried BWT on the, full wiki8, on its binary? Or is it way too slow? It'd look like this as binary: 101000111010\$ 01000111010\$1 1000111010\$10 000111010\$101 00111010\$1010 0111010\$10100 111010\$101000 11010\$1010001 1010\$10100011 010\$101000111 10\$1010001110 0\$10100011101 \$101000111010
51 replies | 1434 view(s)
• Yesterday, 05:43
terrelln replied to a thread Zstandard in Data Compression
Running zstd in single threaded mode gets the expected compression ratio. Time to figure out what’s going wrong! zstd --single-thread --ultra -22 enwik10 -o /dev/null enwik10 : 20.80% (10000000000 => 2080479075 bytes, /dev/null)
394 replies | 122113 view(s)
• Yesterday, 05:36
terrelln replied to a thread Zstandard in Data Compression
> Hi Cyan, I am actually seeing this issue on my small files corpus too. I've analyzed your file. Sadly the issue isn't the same as the enwik10 regression. For your file size level 19 has target_length=256 and 22 has target_length=999. When zstd finds a match of length > target_length it stops its analysis and just takes the match. In your file, near the beginning (position 511) there is a match of length 261. This causes zstd to emit its sequences and update the stats for level 19. But level 21 keeps going for 4KB before it updates its stats. That means that level 19 gets more accurate stats sooner, so it is able to make better decisions in its parsing. This is a weakness in the zstd parser which could be improved. To help combat this we make 2 passes on the first block. Without doing 2 passes the difference between the 2 levels is 59 bytes instead of 26 bytes. This weakness should only cause small swings in the compressed size of the file, not the 300MB difference we're seeing on enwik10.
394 replies | 122113 view(s)
• Yesterday, 05:26
Maybe change the objective, but still stay with enwik8? Hutter Prize limits the memory and time and rewards highest compression ratio. Maybe instead we should seek fastest compressors that operate within some memory limit and compress enwik8 to e.g. under 18,000,000 bytes.
17 replies | 645 view(s)
• Yesterday, 02:25
What I know: In the lossless audio codec MPEG-4 ALS an algorithm named Block Gilbert Moore Coding (BGMC) is used to code residuals. There is a PDF document that provides some information about it.
4 replies | 306 view(s)
• 16th January 2020, 23:03
Sportman replied to a thread Zstandard in Data Compression
I tried every 1GB enwik10 language separate but all as expected: zstd -21, zstd -22, enwik10 language 239,381,519 bytes, 233,554,032 bytes, en 236,339,891 bytes, 230,156,981 bytes, de 218,140,700 bytes, 212,437,303 bytes, fr 166,287,286 bytes, 161,822,491 bytes, ru 221,889,335 bytes, 216,165,123 bytes, es 212,661,141 bytes, 207,155,966 bytes, ja 202,773,602 bytes, 198,084,760 bytes, it 186,791,399 bytes, 180,868,511 bytes, pl 253,935,594 bytes, 247,350,494 bytes, zh 198,331,259 bytes, 193,563,157 bytes, pt ---------------------, --------------------- 2,136,531,726 bytes, 2,081,158,818 bytes
394 replies | 122113 view(s)
• 16th January 2020, 22:22
Please provide more details! I have no issues with large files and the CRUSH. If it's *nix compile try to use fopen64() or something. :-)
51 replies | 38412 view(s)
• 16th January 2020, 21:26
That may be ok, although I've been thinking to try something more relevant - saved pages from web (html/css/js etc) or strings extracted from exes. (I recently analyzed a 12G windows VM image and 60% of it was exe/dll/sys, although half of them are duplicates; it also contains all the usual recompression targets, but the recompression gains would be too little and slow comparing to exe optimization). But the main question is how to setup benchmark metric and payouts so that it would allow eg. something like ppmd to compete.
17 replies | 645 view(s)
• 16th January 2020, 21:05
> Sorry do you mean the latest HP entry? or the latest phda9 ? latest HP entry. > The latest HP entry uses less than 1 GiB of RAM, and a 176 MiB scratch file. 5G is reported on LTCB. Or is it asymmetric and encoder uses 5G while decoder fits in 1G? Then its my mistake. I presumed that it would fit under time constraint even when working with swap on 1G of physical memory. > Really? Have you ever tried? Tried this now - it didn't run because of too old glibc. I'd try again tomorrow whether it would work by patching glibc version in unpacked decoder. Also I remember some archive which still had windows decoder, but it didn't decode. And I doubt that you would be able to compile that decoder from source now and get it to decode the entry archive. Its not like cmix etc don't have the same problem, but in this case it makes contest entries hard to verify, since math functions (log etc) would likely work differently in different glibc versions and/or on different linux distributions. ------------- Anyway, I presume that you don't want to open-source phda. It may be your right (or not, since paq and cmix are GPL), and in any case nobody can force you. But can you please tell that to contest organizers, so that they can reboot the contest and let others participate? Because competing with you in current contest (if its even possible) requires for everybody to stop posting any CM improvements for years.
83 replies | 27188 view(s)
• 16th January 2020, 20:45
Here is a task suggestion: compressing public domain books from Project Gutenberg.
17 replies | 645 view(s)
• 16th January 2020, 20:05
​Sorry for not elaborating the sizes in this test. The goal of this benchmark is to compare the speed of different CDF decoding with small alphabets and a clean static coding only, without the overhead of adaptive predictors and contexts. Small alphabets are the preferred use cases (image,video,...) for CDFs. Without writing a dedicated benchmark, we simply ignore the high nibble and encode only the low nibble of each byte. Thus the sizes shown are meaningless here. But the relative speed is correct and the benchmark shows for ex. that bitwise decoding is faster than CDF decoding even for static distributions. Predictors for adaptive CDF coding are more complex and slow. In other words decoding 4 bits is faster than decoding a single nibble with CDFs This is a little surprising given that nowadays CDF coding has become a fashion in entropy coding. This test is only for benchmarking speed with small alphabets. The CDFs are calculated for the whole file before coding/decoding.
21 replies | 1682 view(s)
• 16th January 2020, 19:03
introspec replied to a thread Zstandard in Data Compression
Hi Cyan, I am actually seeing this issue on my small files corpus too. Attached to this post is an example of file for which: prof4d-summertime_2_(2012).mg2 (uncompressed) 18,688 prof4d-summertime_2_(2012).mg2.zstd19 (compressed using zstd -19) 2,744 prof4d-summertime_2_(2012).mg2.zstd22 (compressed using zstd --ultra -22) 2,770 (This file contains pixelart for ZX Spectrum.) I hope it is small enough for you to be able to analyse what the issue is.
394 replies | 122113 view(s)
• 16th January 2020, 18:50
Sorry do you mean the latest HP entry? or the latest phda9 ? The latest HP entry uses less than 1 GiB of RAM, and a 176 MiB scratch file. The latest phda9 uses ~6.02 GiB and nothing like "swap to disk". Really? Have you ever tried?
83 replies | 27188 view(s)
• 16th January 2020, 18:19
BCM v1.40 has been released: + Updated the license - now it's Public Domain / MIT (libdivsufsort) + More convenient block size setup (-1..-9) https://github.com/encode84/bcm :_superman2: Please check for compatibility with older versions!
113 replies | 55159 view(s)
• 16th January 2020, 17:56
improve compression ratio from crush v1.0 size 279491430 bytes to 257210635 bytes (7.97%)
8 replies | 614 view(s)
• 16th January 2020, 17:32
Those sizes all look like basic order-1 entropy, not order-0. Is that simply from being nibble based and thus implicitly having partial order-1 by using the first 4 bits as context for the next 4 bits? I'm still surprised if so that it can match o1 so closely. \$ entropy16 < ~/scratch/data/enwik9 len = 1000000000 bytes, entropy8 = 644561309.084184, 5.156490 bits per byte len = 1000000000 bytes, entropy8o1 = 486874433.871637, 3.894995 bits per byte What block size is it using too? Also is it genuinely block based with the ability to decode from any block, or are the frequency tables encoded relative to the previous frequency table? (That can help improve compression especially with small blocks, but obviously removes random access by block.)
21 replies | 1682 view(s)
• 16th January 2020, 15:42
> If you want to compare general purpose code and not simply who has the best > set of recognition scripts and filters then perhaps that should be part of > the submission requirements. Unfortunately its not that simple to choose. In practical applications the preprocessing would be very important. Also I doubt that I would have a budget that could motivate a development of something like rz or oodle. So most tasks where we could actually expect competition and practical useful improvements would be actually preprocessing. > Are the tools expected to be open source? I think not - sources would be harder to deal with (correctly compile to satisfy the author, etc), though I'd accept sources too. But the main format would be a dynamically-loaded binary library (.dll/.so) for my benchmark tool. > Perhaps this can also be adding (rotating) every byte with a fixed value or > a fixed XOR of low bits. I can use this for some part of benchmark. This could be useful for basic match coding evaluation. I don't expect people to easily beat ppmd or zstd in their categories, though. Also this "alphabet reordering" idea is not very secure. For example, for jpegs it would be possible to detect and recover original alphabet based on huffman/quantization tables. > The (specialist) SequenceSqueeze contest did this, with a framework for > participants to submit AWS images (a bit complex sadly) which the organiser > then ran with their own attached private S3 bucket containing the test data. I'd try to not overcomplicate it :) There'd be some troubles about windows/linux split though - I'm thinking to implement the same baseline solution for both and compare against it.
17 replies | 645 view(s)
• 16th January 2020, 15:12
Turbo-Range-Coder update : Benchmark with small alphabet (16 symbols) - option "-t" ./turborc -e40,41,42,43,44 enwik9 -t ​ E MB/s size ratio% D MB/s function 76.10 475135200 47.51% 69.45 40-rc4sa adaptive bitwise o0 simple 92.05 500000004 50.00% 71.47 41-rc4s static bitwise o0 simple 365.82 482541082 48.25% 80.32 42-rccdfsb static search 366.24 482541082 48.25% 57.45 43-rccdfsv static division 366.33 482541082 48.25% 68.98 44-rccdfst static division free (reciprocal multiplication) Decoding: - reciprocal multiplication is using 640k lookup table for 32 bit division - The division free w/ lut is faster than hardware division on skylake - static symbol search (42) faster than division (43) or reciprocal multiplication (44) - adaptive Turbo-Range-Coder (40) nearly as fast as static coder (41) - adaptive bitwise o0 is faster than adaptive coder with CDFs (entry not in result table)
21 replies | 1682 view(s)
• 16th January 2020, 14:14
File for conversion:
7 replies | 831 view(s)
• 16th January 2020, 13:16
7 replies | 831 view(s)
• 16th January 2020, 12:55
If you want to compare general purpose code and not simply who has the best set of recognition scripts and filters then perhaps that should be part of the submission requirements. No explicit recognition of data types with repacking; just basic modelling. Are the tools expected to be open source? (In which case people can validate that if they wish.) Perhaps this can also be adding (rotating) every byte with a fixed value or a fixed XOR of low bits. It may slightly harm some multi-byte delta techniques, but it'd utterly cripple the custom file format detection. Edit: eg \$ cat /usr/share/dict/words | zstd -19|wc -c 216749 \$ cat /usr/share/dict/words | perl -e '\$/=undef;print map{chr(ord(\$_)^1)}split("",<>)'| zstd -19|wc -c 216736 \$ cat /usr/share/dict/words | perl -e '\$/=undef;print map{chr((ord(\$_)+1)&255)}split("",<>)'| zstd -19|wc -c 216752 \$ cat /usr/share/dict/words | xz -9|wc -c 196680 \$ cat /usr/share/dict/words | perl -e '\$/=undef;print map{chr(ord(\$_)^1)}split("",<>)'| xz -9 |wc -c 196632 \$ cat /usr/share/dict/words | perl -e '\$/=undef;print map{chr((ord(\$_)+1)&255)}split("",<>)'|xz -9|wc -c 197200 Eg I tried paq8px183 on a local jpg file, before and after XOR 1 and it changed from 63629 bytes (very fast) to 83961 bytes (very slow) indicating it'd broken the file type detection model. Xz on both files gave identical sizes (85512), due to it being already compressed data (86815 bytes). Basic transforms, with minimal impact on text compression. I like the idea of a public training set and a private test set to ensure real world applicability and to reduce over-fitting. The (specialist) SequenceSqueeze contest did this, with a framework for participants to submit AWS images (a bit complex sadly) which the organiser then ran with their own attached private S3 bucket containing the test data.
17 replies | 645 view(s)
• 16th January 2020, 12:22
Probably, it is simple to insert file write statements in the FLAC file output functions. Write the integer values into a raw 32 bits format for better testing. Possible methods for encoding with entropy coding: 1 - Encode the rice coding output with a bitwise range coder like Turbo-Range-Coder instead of bit I/O. You can use (length limited) rice coding for the quotient and code the remainder with bit i/o. 2 - Encode the residuals with variable length coding using a range coder, Huffman or ANS. - Gamma Coding with RC: You can test this with the gamma coding in Turbo-Range-Coder ​ by writing the integer values into a raw 32 bits file. - Golomb/Rice Coding with RC: Encode the quotient with RC like in RC gamma coding and encode the remainder with bit i/o - Limited Length Golomb/Rice Coding: use the same golomb parameters like in FLAC. Encode the length limited part (ex. 8 bits) with a fixed length entropy coder (RC, Huffman or ANS) and the remainder with bit i/o
4 replies | 306 view(s)
• 16th January 2020, 09:38
Residuals are typically from Laplace distribution centered in zero: rho(x) = exp(-|x|/b) / 2b Here is penalty of using Golomb comparing to accurate entropy coding, strongly depends on width b: The difficulty is choosing this b based on context, also adapt such model - here are some techniques: https://arxiv.org/pdf/1906.03238 In short, MLE estimation is just: b = average of |x| For context dependence you need to predict b e.g. using some linear combination of functions of context, or neural networks to minimize: (b(context) - |x|)^2 Simplest adaptivity is replacing this average with exponential moving average- use update step: b <- mu * b + (1-mu) |x| In practice you can have prepared entropy coder for some discretization of b, encoder and decoder need to choose the same. Here was supposed to be discussion about it: https://encode.su/threads/3124-LOCO-I-(JPEG-LS)-uses-365-parameters-to-choose-width-any-alternatves
4 replies | 306 view(s)
• 16th January 2020, 03:09
I guess we need to work hard and post enough improvements for paq, so that Alex could make enough money to get bored of it and open-source phda. Also the recent entry uses 5G of memory (with swap to disk) and an archive for it basically has to be encoded on the same machine (because of floats): http://mattmahoney.net/dc/text.html#1165 So its only possible to compete using the same tricks (its fairly similar to 5x difference in window size). I wonder why they can't do the sane thing and just reboot the contest with different sample data... say, same 100M of enwik text, but all markup, bibliography, references filtered out. And with higher memory limit, since 1G for 100M file means that only paq is applicable (because it basically stores statistics in compressed format (hashtable/FSM)).
83 replies | 27188 view(s)
• 16th January 2020, 02:49
New rule published on 2019-12-03 (and cosmetic changes on 2020-01-15):
83 replies | 27188 view(s)
• 16th January 2020, 02:27
You can compare FLAC compression with results of other lossless audio codecs which use arithmetic coding, like Monkey's Audio or optimfrog (though the latter likely has a better predictor too). Based on squeezechart wav stats: SAC 21324487+65919345+20771857+22759276+25340833+52216710 = 208332508 OFR 21394272+65905363+20880497+23179339+25262790+52433236 = 209055497 MAC 21773012+66797828+21499324+23246616+25823788+52855336 = 211995904 FLAC 22369503+69547135+22708163+24927481+26791804+54394889 = 220738975 (1-211995904/220738975)*100 = 3.96% (1-208332508/220738975)*100 = 5.62% You can expect about 4% compression improvement over FLAC by using better entropy coding, and another 1.5% by using too much computing time. In any case, residuals for testing should be easy to acquire since flac is open-source.
4 replies | 306 view(s)
• 16th January 2020, 02:02
I've been wanting to work on a pet project to basically re-implement FLAC (lossless audio codec). It seems like at a very high level there is a predictor, then a coding applied to compress the residuals (error) of the predictor. While FLAC uses Rice coding to compress these, I noticed that the format reserves a space for extra types of codings. So I was thinking to make this more interesting, then maybe we can compress these codes with Huffman, arithmetic coding, etc. and see what that looks like. Recently I have been made aware of various ANS implementations (FSE, etc.) and their success - maybe even more interesting to implement, and could maybe allow for better compression ratios. However, I'm wondering if there exists some sort of corpus for FLAC residual bitstreams or if the probability distribution of these residuals (for most music/audio files) is known to any extent, which would inform whether or not ANS/arithmetic coding would bring any compression ratio benefits in the first place. According to the FLAC format these numbers are typically small, which is why Rice is efficient, but I can't seem to find anything about their distributions.
4 replies | 306 view(s)
• 16th January 2020, 01:49
Cyan replied to a thread Zstandard in Data Compression
Thanks Sportman ! we'll have to look into that. Since it doesn't show up on enwik9, I would expect some components of enwik10 to show this impact more than others. Let's zoom on a practical case to analyze.
394 replies | 122113 view(s)
• 16th January 2020, 01:44
You can find the parameters in readme.txt here: https://bellard.org/nncp/nncp-2019-11-16-win64.zip And NNCP is intended to be used with its own preprocessor, without preprocessing its result is much worse (and slower).
120 replies | 17025 view(s)
• 16th January 2020, 00:30
My quick test (i5-9600K @ 4.9 GHz): C:\nncp>timer nncp -T 6 c enwik8 enwik8.nncp Timer 3.01 Copyright (c) 2002-2003 Igor Pavlov 2003-07-10 cell=LSTM-C n_layer=4 hidden_size=352 batch_size=16 time_steps=20 seg_len=20 n_symb=256 ln=1 fc=1 sgd_opt=adam lr=4.000e-003 beta1=0.000000 beta2=0.999900 eps=1.000e-005 n_params=6.76M n_params_nie=5.32M mem=99.4MB SIZE locBPS BPS/5M BPS kS/s LR 99960688 1.329 1.401 1.524 7.09 4.00e-003 bps=1.524 (19048454 bytes), 7.09 kS/s Kernel Time = 1444.515 = 00:24:04.515 = 10% User Time = 38092.578 = 10:34:52.578 = 270% Process Time = 39537.093 = 10:58:57.093 = 280% Global Time = 14099.468 = 03:54:59.468 = 100% I have no clue what kind of parameters should I use. And why the dictionary preprocessing again at the LTCB listed?!
120 replies | 17025 view(s)
More Activity