Activity Stream

  • Self_Recursive_Data's Avatar
    Today, 21:54
    I ran GLZA on wiki8 and it took 15 minutes. Can you send me all the duplicate strings it finds in wiki8, ordered from greatest to least? To be clear, I expect to see e.g.:
    {| border="1" align=right cellpadding="4" cellspacing="0" style="margin: 1em 1em 1em 1em; background: #f9f9f9; border: 1px #aaa solid
    A symmetric game is a game where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them
    An important subset of sequential games consists of games of perfect information. A game is one of perfect information
    Economists have used game theory to analyze a wide array of economic
    Alternatively, some authors claim that
    863 replies | 408929 view(s)
  • dnd's Avatar
    Today, 21:35
    Turbo-Range-Coder update:
    - complete bwt post coder/decoder with qlfc + TurboRC in 70 lines of code
    - using simple gamma coding + TurboRC
    - decoding 2 times faster than bsc's post coder qlfc,1 at nearly the same compression ratio
    - beats bwtmix (context mixing) on enwik9
    TurboBench Compression Benchmark - Skylake 3.4 GHz
    File: enwik9bwt
    C Size       ratio%  C MB/s  D MB/s  Name
    163,924,735  16.4    28      27      bscqlfc 2
    164,998,959  16.5    55      61      bscqlfc 1
    167,395,956  16.7    82      120     TurboRC 25
    167,861,671  16.8                    bwtmix
    22 replies | 1712 view(s)
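    For reference, a minimal sketch of Elias gamma coding, the "simple gamma coding" the update above mentions; the bit-list representation here is only illustrative and is not TurboRC's actual bit I/O.

    #include <cstdint>
    #include <vector>

    // Elias gamma code for n >= 1: k zero bits followed by the (k+1)-bit
    // binary form of n, where k = floor(log2(n)). Bits are collected in a
    // vector here purely for illustration.
    static void gamma_encode(uint32_t n, std::vector<int>& bits) {
        int k = 31 - __builtin_clz(n);          // position of the highest set bit
        for (int i = 0; i < k; i++) bits.push_back(0);
        for (int i = k; i >= 0; i--) bits.push_back((n >> i) & 1);
    }

    static uint32_t gamma_decode(const std::vector<int>& bits, size_t& pos) {
        int k = 0;
        while (bits[pos] == 0) { k++; pos++; }  // count leading zeros (assumes a well-formed stream)
        uint32_t n = 0;
        for (int i = 0; i <= k; i++) n = (n << 1) | bits[pos++];
        return n;
    }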
  • dnd's Avatar
    Today, 21:27
    Try restarting your system.
    37 replies | 23815 view(s)
  • Sportman's Avatar
    Today, 21:16
    There is 32 GB of memory, with 30+ GB of it available.
    37 replies | 23815 view(s)
  • Self_Recursive_Data's Avatar
    Today, 21:07
    Do you have a preferred real-time chat?
    863 replies | 408929 view(s)
  • Kennon Conrad's Avatar
    Today, 21:02
    No. A suffix tree for a string of length N will never have more than 2N nodes. Some of the edges could have 1000 symbols, though. Suffix trees do consume a lot of RAM. I do not use Facebook or Gmail. There are some very long common strings in enwik9, but not all common strings are that long. I am not sure which post this came from, but it is a partial listing of the symbols and the (capital-encoded) strings they represent. These are the 84575th - 84589th most common symbols in the final grammar (tied), each symbol appearing 95 times in the final grammar; they are put into the 20-bit code length dictionary bin, each having a global probability of 0.000110%.
    863 replies | 408929 view(s)
  • pacalovasjurijus's Avatar
    Today, 20:34
    Our rotation algorithm is working. I have already compressed a .zip file by seven bytes.
    236 replies | 87609 view(s)
  • Shelwien's Avatar
    Today, 20:13
    I always wanted to turn paq models into coroutines - so that a model would look like a normal parser/coder with branches and rc calls in multiple places, and the main engine would just sync multiple such coders. Another point is memory allocation: dynamic memory allocation is quite slow, and is usually totally unnecessary - for example, I managed to port the paq jpeg model (as jojpeg) to a standalone codec with static allocation. And here I can see that each vmi() does its own alloc.
    23 replies | 4438 view(s)
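    A toy contrast of the two allocation styles discussed above, assuming a single hypothetical order-1 counter table; this is not jojpeg's or paq8pxv's actual code.

    #include <cstdint>
    #include <cstring>

    enum { O1_CTX = 1 << 16 };   // hypothetical table size for an order-1 model

    // Dynamic style: each component allocates (and frees) its own table at runtime.
    struct DynModel {
        uint16_t* p;
        DynModel()  : p(new uint16_t[O1_CTX]) { std::memset(p, 0, O1_CTX * sizeof(uint16_t)); }
        ~DynModel() { delete[] p; }
    };

    // Static style: one fixed buffer, zero-initialized by the loader, no allocator calls at all.
    static uint16_t g_o1[O1_CTX];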
  • kaitz's Avatar
    Today, 19:45
    paq8pxv_v16
    - update jpeg model
    - add new component DHS (used by 4-bit image and jpeg)
    - small changes
    - reduce warnings at compile time
    - make jit memory execute/read only
    - fix text transform in vm
    - add static bounds check on src->asm compile
    - add runtime bounds check
    - fix arm detection
    - enable arm,dec in main conf
    - remove mxs function, make mixer into one layer only
    - fix SIMD mixer errors
    - add 4-bit bmp detection
    - add 4-bit image compression
    - update git readme
    23 replies | 4438 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 16:42
    crush v1.7 beats bcrush on the osdb file too.
    15 replies | 768 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 16:23
    this is the result for enwik8
    15 replies | 768 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 16:14
    Very small improvements, but compression and decompression speed is very fast. Decompression time for the webster file is only 1.17 s. Is it still on the Pareto frontier? :)
    15 replies | 768 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 15:53
    I haven't tested it with the original crush format, so I don't know whether it is compatible or not. The bcrush webpage shows results for the Silesia benchmark. Does it use the --optimal flag? If yes, then there is a bug in bcrush, because crush v1.6 can beat bcrush on the webster file.
    15 replies | 768 view(s)
  • jibz's Avatar
    Today, 15:44
    jibz replied to a thread Crush v1.3 in Data Compression
    Is your version compatible with the original crush format? bcrush with --optimal flag was supposed to be bit-optimal for that format, so if you are beating it with the same data format there must be a bug in bcrush.
    15 replies | 768 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 15:02
    Compared to bcrush or crush v1.1? Which small file? crush v1.6 beats bcrush on the webster file (Silesia benchmark).
    15 replies | 768 view(s)
  • xezz's Avatar
    Today, 14:59
    xezz replied to a thread Crush v1.3 in Data Compression
    small file compression is worse.
    15 replies | 768 view(s)
  • CompressMaster's Avatar
    Today, 14:44
    That's the problem. Lzturbo 1.2 probably needs more RAM than you have available. Enable the pagefile and see if it helps. Or upgrade your RAM. But compression will be slooooow, though.
    37 replies | 23815 view(s)
  • Self_Recursive_Data's Avatar
    Today, 05:53
    See the above post. I'm really interested. Can we chat on Facebook or Gmail Chat in real time, so I can understand your algorithm much better, explain it to others, and simplify it? What's your Gmail? I see you posted the top 100K wiki9 symbols in the Tree (see below), but you said the duplicates were 500-5000 letters long, so what are these short strings below?
    10%): ", he became the"
    84575 95 20 (0.000110%): " serotonin"
    84576 95 20 (0.000110%): " on ]"
    84577 95 20 (0.000110%): " and the main"
    84578 95 20 (0.000110%): "/archives/"
    84579 95 20 (0.000110%): ". <!-- Cunsourced image removed: ]"
    84586 95 20 (0.000110%): " he had made"
    84587 95 20 (0.000110%): " Cpalatinate"
    84588 95 20 (0.000110%): "nautilus"
    84589 95 20 (0.000110%): "k'"
    863 replies | 408929 view(s)
  • Sportman's Avatar
    Today, 04:19
    I tried Lzturbo 1.2 with enwik10, but the decompressed file is only 1,073,741,824 bytes.
    227,112,649 bytes, 2,342.967 sec. - 10.514 sec., lzturbo -49 -p1 (v1.2)
    During compression I see a row: VirtualAlloc failed
    No other errors. The OS page file is disabled.
    37 replies | 23815 view(s)
  • Self_Recursive_Data's Avatar
    Today, 02:50
    You missed my upper text (I added a few words at the end to be clearer): "So just to be on the same page, my order-2 above uses only 1 mask, size of 2 bytes. And if I want to replicate your Green project, I'll need 16 masks, each a byte longer than the last? So one mask for order0, one mask for order1, another for order2, order3, order4 ... order16."
    So here's my understanding now: the input is "the cat ran down the road", and 3 context masks on it are "the cat ran and". We only have counts so far, from counters.
    'Down' thinks the final word (where 'and' is) is probably 'and', because it has many counts from the counter: and=345, but=22, if=5.
    'The' thinks the final word (where 'and' is) is probably 'but', because it has many counts from the counter: but=180, and=160, if=30.
    'Road' thinks the final word (where 'and' is) is probably 'and', because it has many counts from the counter: and=222, but=50, if=3.
    Now we multiply each count. So do I multiply Road's 'and' by The's 'but'? Or The's 'and'? Or by 1? Or by Road's 'but'? What is multiplied by what? Be clear here.
    Now with my three rows, I add them:
    and=345, but=22, if=5
    + but=180, and=160, if=30
    + and=222, but=50, if=3
    = and=727, but=252, if=38
    Now for each of the 3 numbers (counts), I divide them by 3 to get: and=242.3, but=84, if=12.6.
    Try to be clearer in the demo example about what is what and where, and use clear, unambiguous terms. Don't use any math/references etc.
    57 replies | 1630 view(s)
  • Shelwien's Avatar
    Today, 02:22
    >> So you can keep the same picture, just instead of dividing by 2, you can multiply the first distribution
    >> with w1=119/(119+1125), and second with w2=1125/(119+1125).
    > "multiply the first distribution with"...
    > do you mean the red circles or blue boxes I drew here?
    Blue boxes. You add the distributions, then divide by 2. And I'm suggesting to multiply them by w1 and w2 above, then add. It's really the same; your version just corresponds to w1=1/2, w2=1/2.
    > So you want me to add together each ex. e+e, d+d, etc. Then divide e by that total?
    No.
    >> w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total
    > First row below: What is C+1??? What is B?
    Nothing, just some empirically tuned coefs. /total is there because green actually mixes symbol counts, not probabilities. So /total turns a "frequency table" into a probability distribution. The initial weight function was just w=1/(1+total/(order+1)), which could be w=(order+1) for a normalized probability distribution. And the +1s are a workaround to avoid division by zero. Then I tried running an optimizer to find optimal values for the coefs, and that was the result. Keep in mind that "green" is a tutorial project; there was no goal to discover the best weighting function for bytewise CM. For example, ASH is also a bytewise CM, but to compute the weights it uses some extra statistics stored in its suffix-tree's context nodes - the probability of "escape" (when the current symbol never occurred in this context, so we have to use a lower order for estimation), and the probability of this context being optimal (when it has given the highest probability to the current symbol among all orders).
    > Why is B added to Total if Total is Total?
    Because it improves compression.
    > Why does B & C do their stuff first and A does it to the creation i.e. A/(B/C)?
    Mostly no reason, but doing the multiplication first requires higher arithmetic precision, which is slower.
    57 replies | 1630 view(s)
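    A minimal sketch of the weighted mix described above, reusing the example weights 119 and 1125 from the post; the symbol counts are invented and this is not green's code.

    #include <cstdio>

    int main() {
        // Two frequency tables over a 4-symbol alphabet (made-up counts).
        double f1[4] = { 30, 10, 5, 2 };   // e.g. order-0 counts, weight 119
        double f2[4] = { 1, 1, 0, 0 };     // e.g. order-1 counts, weight 1125
        double w1 = 119.0 / (119 + 1125), w2 = 1125.0 / (119 + 1125);

        double t1 = 0, t2 = 0;
        for (int i = 0; i < 4; i++) { t1 += f1[i]; t2 += f2[i]; }

        // Mix normalized distributions: P[i] = w1*P1[i] + w2*P2[i].
        // With w1 = w2 = 1/2 this reduces to "add the distributions, divide by 2".
        double P[4], sum = 0;
        for (int i = 0; i < 4; i++) { P[i] = w1 * f1[i] / t1 + w2 * f2[i] / t2; sum += P[i]; }
        for (int i = 0; i < 4; i++) printf("P[%d] = %.4f\n", i, P[i] / sum);
        return 0;
    }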
  • Self_Recursive_Data's Avatar
    Today, 00:59
    Yes, I understand the arithmetic coding / error correction part; I made a compressor so far that turns 100MB into 40MB. OK, I think I understand your algorithm before even reading your post above. I'm going to give it a shot - tell me if anything is wrong. Your suffix tree digests all of wiki9 and makes it into a tree. Now you are looking for the longest, most recurring string, the "best" one. Once you find it, you set it aside and give it the smallest code, if you get me, and maybe even delete it from the tree. Now you look for the next best string. Repeat until all text/branches are erased from the tree. Wow, this is a very simple cheat which makes 100MB into 18MB losslessly. So it's just looking for the best ""Huffman codes"" so to speak, to get the most bits for the littlest cost to say it. Will my tree need to be 1,000 branches long to store duplicates that happen to be 1,000 chars long? Will every child leaf have 1,000 parents? How do I fit all that into RAM?
    863 replies | 408929 view(s)
  • Self_Recursive_Data's Avatar
    Today, 00:48
    So just to be on the same page, my order-2 above uses only 1 mask, size of 2 bytes. And if I want to replicate your Green project, I'll need 16 masks, each a byte longer than the last? So one mask for order0, order1, order2, order3, order4 ... order16.
    > So you can keep the same picture, just instead of dividing by 2, you can multiply the first distribution with w1=119/(119+1125), and second with w2=1125/(119+1125).
    "multiply the first distribution with"... do you mean the red circles or blue boxes I drew here? https://ibb.co/yWRH5R0
    So you want me to add together each ex. e+e, d+d, etc. Then divide e by that total?
    First row below: What is C+1??? What is B? Why is B added to Total if Total is Total? Why does B & C do their stuff first and A does it to the creation i.e. A/(B/C)?
    "
    w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total
    ----------------------------
    P[c] = Sum[ w[j]*o[j].freq[c]/o[j].total, j=0..maxorder ] / Sum[ w[j], j=0..maxorder ]
    "
    57 replies | 1630 view(s)
  • Mauro Vezzosi's Avatar
    Today, 00:39
    I don't know if this helps. Method 5, distribution(m, d).
    m     d     x10000      x10000000
    0.50  0.09  0.05216995  0.04713555
    0.50  0.12  0.05303528  0.04617647
    0.53  0.09  0.05139838  0.04581179
    0.54  0.12  0.05273559  0.04489511
    5 replies | 266 view(s)
  • Shelwien's Avatar
    Yesterday, 23:01
    > Is mask0 no size just global freq, mask1 1 letter long,
    > mask2 2 letters long, and so on, like the 16 masks below?
    Yes, like that. It's the main type of context for most PPM/CM/BWT/LZ compressors (usually the only one); sparse contexts and other types just help to improve compression a little.
    > But isn't a 16 letter long mask/context model way too big to store?
    > The end of the tree/dict would have so many unique branches, millions.
    It would only be a problem if you'd try using all 2^16 sparse masks at once, or something like that (we had a guy who seemingly looped through all n-grams, then tried to find them in a file - of course it would be slow). It doesn't really matter when you work only with substrings that actually appear in the data. For example, an unlimited-order suffix tree of all substrings that appear in a file can be built in linear time (O(filesize)).
    > So I really want to see you tweak the image so I can use the better formula and understand it too.
    I still don't understand.
    P=0; P=0; P=0; P=0; P=0; P=0; P=0; P=0; P=0; P=0; total=47; n=15; order=0; w=119;
    P=1; P=1; total=2; n=2; order=1; w=1125;
    RC intervals for encoding 'e':
    p=1072; p=239; p=120; p=120; p=1721; p=120; p=358; p=120; p=477; p=358; p=358; p=358; p=953; p=1245; p=239; total=8099
    So you can keep the same picture, just instead of dividing by 2, you can multiply the first distribution with w1=119/(119+1125), and the second with w2=1125/(119+1125). I don't see how it helps. And I don't know how to visualize the weight function, since it's adaptive (as are the context statistics). I can do something like the attached image, but I really doubt that you'd appreciate that.
    57 replies | 1630 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 22:44
    My hired programmer and I made the order-2 model yesterday. It made the 100MB wiki8 compress into exactly 40,572,450 bytes. It took exactly 12 hours lol in Python (don't hit me with a stick, hehe). The dictionary (I included it in the 40MB) was 2,069,481 bytes. The decompressor was 4,910 bytes (also included in the 40MB). Code is attached. You can try it on the small input I uploaded if you want to. https://paste.ee/p/Cd7Va
    So I know now that with online learning and multiple context models I can get 100MB down from 40MB to 38MB with no dict_tree, and also down to ~30MB or maybe even 28MB, just like Green did when I tested it last night. So I'm happy I understand it a lot. I see above you say Green does orders 0 through 16. So it uses 16 masks on the previous text? OK. How long is each mask? Is mask0 no size, just global freq, mask1 1 letter long, mask2 2 letters long, and so on, like the 16 masks below?
    "hello world and hello again"
    "hello world and hello agai"
    "hello world and hello aga"
    "hello world and hello ag"
    "hello world and hello a"
    "hello world and hello "
    "hello world and hello"
    "hello world and hell"
    "hello world and hel"
    and so on up to 16 masks
    Then you mix these 16 models/judges together. But isn't a 16 letter long mask/context model way too big to store? The end of the tree/dict would have so many unique branches, millions.
    With the image I drew, I did leave out converting the counter counts in the upper section into %s, but I understand that part. So in the image, I can easily understand what each mask thinks about the counters/%_probabilities of the letter to encode (the red 'e' at the end - I can see what the 2 bubbles are thinking it is). I can easily see that they then add together the %s in the bottom section and the predictions grow in length / change. And finally we divide those sticks by 2 since we had 2 models/judges weighing in (mixed). So I really want to see you tweak the image so I can use the better formula and understand it too.
    57 replies | 1630 view(s)
  • Kennon Conrad's Avatar
    Yesterday, 22:33
    Hello Self_Recursive_Data,
    GLZA has preprocessing that can include a delta transformation and capital encoding, the grammar creation can make initial passes to deduplicate words, and the encoder can use MTF where it finds it to be beneficial. I will ignore those for the purpose of this discussion. The main algorithm creates a straight-line grammar and then uses arithmetic coding to send the grammar. Most of these features can be disabled with command line options. Note: Tree used Huffman coding; it was renamed GLZA when the encoding was changed to arithmetic to obtain slightly better compression ratios. Byte Pair Encoding looks for the most common pairs. GLZA (like Michael Galle's IRR) looks for the "best" recurring strings of any length. Yes, the reference to the repeating string of 954 characters that appears 1694 times is regarding enwik9.
    1. Yes, GLZA builds trees (tries?). Specifically, they are suffix trees. The grammar creation in GLZA is recursive, so it builds one tree per pass, determines which grammar rules to create, and substitutes the left side grammar rule symbol for each instance of the right hand side string. GLZA with the -x option will build the entire tree if there is enough memory; otherwise it will stop adding to the tree when it runs out of allocated memory (command line option).
    2. GLZA's grammar creation algorithm is somewhat similar to byte pair encoding, but GLZA can create grammar rules for strings of arbitrary length instead of only pairs. So for BPE to create a grammar rule for " cat", it would have to first create grammar rules for " c" and "at", then create another rule to combine those two (other buildups possible). On the other hand, GLZA would more likely create a grammar rule for " cat" first, then later create grammar rules for " c" and "at" and substitute those rules into the grammar rule for " cat".
    3. Each grammar rule gets its own symbol. As mentioned earlier, the symbols are arithmetically encoded. Huffman encoding could work, but arithmetic encoding provides a slightly better compression ratio.
    4. After the tree is built, GLZA traverses the entire tree looking for the "most profitable" strings for which to create grammar rules. Finding the "most profitable" grammar rules to create is a difficult problem, probably not solvable, so GLZA does the best it can based on testing of what seems to work. At the heart of the decision process, GLZA uses a blend of two methods (with some adjustments for leading spaces and capital encode symbols) by default (command line option). One method basically determines the ratio of the reduction in order 0 entropy to the order 0 entropy of the unmodified strings (that may include grammar rule symbols) it is evaluating, and blends that value with the overall reduction in order 0 entropy. The other determines the ratio of the reduction in order 1 entropy (using the first and last characters in the strings represented by the grammar symbols for any non-terminal grammar symbols) to the order 1 entropy of the strings it is evaluating, and blends that with the reduction in the order 1 entropy. For reference, IRR chooses one grammar rule per pass, the one that provides the most order 0 entropy reduction. In practice, GLZA's algorithm tends to favor common strings that are longer and less frequent than IRR's choices. That being said, I feel strongly that there are better algorithms for choosing the best strings that are awaiting discovery.
    I suspect the best ones might be based more on analyzing the structure of the suffix tree rather than on calculations of the entropy reduction. I hope this helps. Please let me know if anything was not clear or if you have additional questions.
    863 replies | 408929 view(s)
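    A toy sketch of the rule-substitution step described above (replacing a recurring string of arbitrary length with a fresh symbol); the string to replace is hard-coded here, whereas GLZA would pick it with its profitability heuristics.

    #include <cstdio>
    #include <string>

    // Replace every occurrence of 'rhs' in 'text' with the single-character
    // nonterminal 'sym', i.e. create the grammar rule  sym -> rhs.
    static std::string substitute(std::string text, const std::string& rhs, char sym) {
        for (size_t p = text.find(rhs); p != std::string::npos; p = text.find(rhs, p + 1))
            text.replace(p, rhs.size(), 1, sym);
        return text;
    }

    int main() {
        std::string s = "the cat sat, the cat ran, the cat hid";
        s = substitute(s, "the cat ", '\x01');   // rule: 0x01 -> "the cat "
        printf("%zu bytes left after one rule\n", s.size());
        return 0;
    }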
  • Shelwien's Avatar
    Yesterday, 22:09
    > If I feed Green "hello world and hello again", how many windows are placed on it
    > and where and how long is each window?
    "Window" in data compression is a term normally used for the whole volume of the previously processed data which can be directly referenced/indexed. File size can be pretty large (potentially infinite) and computers have limited resources, so compression algorithms usually can't reference information which is not in the window. For example, the window size of deflate is 32kb (deflate is the main compression method in .zip archives). The thing you actually talk about here is usually called a "context mask", or just a context. http://mattmahoney.net/dc/dce.html#Section_422 (Ideally read the whole book).
    green with default parameters works with order0 to order16 contexts. You can change this by changing the value of the "MAXORDER" constant in the source. order0 means no context, order1 means one previous symbol as the context, order16 means 16 previous symbols.
    > "hello world and hello a"
    This is a normal order 4.
    > "hello world and hello n"
    > "hello world and helloin"
    This is called a "sparse context", to be specific "sparse order4 with gap 2". It's useful for some kinds of binary data (eg. executables), but gives a very minor gain on text compression.
    > We have 3 different (because offset) models looking at the most recent text.
    > Each window only views 4 letters because the tree would die hard if we went for view of 16 letters.
    Any modern programming language has built-in support for "maps" or "dictionaries" that are essentially hashtables. So it should be easy to experiment with any contexts, although the default hashtable implementations may be inefficient.
    > It can't be 16. Do you mean chunks at a time, just like I showed here?
    > Which we then mix together of course.
    green, for each symbol of the input file, finds the occurrence counts of symbols following the occurrences of strings formed by N=0..16 previous symbols. Counts for order0..2 are kept in simple arrays, while for order3+ it follows the linked list and counts the symbols anew. The MAXORDER setting thus doesn't matter much for green, since the total number of occurrences of the current order3 context is usually low enough.
    > Can you tweak my image so I understand your formula? It doesn't make sense at all in many ways...
    I don't understand what you want to visualize. If it's really necessary, you can draw the distributions and weights from green_cat.txt that I posted.
    1. I hope you understand how arithmetic coding works. It allocates subranges within a range of numbers, with each subrange size corresponding to a symbol's probability. So given a number, we can find the subrange it belongs to, and thus which symbol the number encodes. Then the first symbol's subrange can be split further, etc. ad infinitum.
    2. For a memoryless source that generates random symbols according to a given probability distribution, arithmetic coding can provide perfect compression (exactly to the Shannon entropy of the data).
    3. Imagine that we have M such sources with different distributions, then add another source with an alphabet of size M, which generates random choices of primary sources. So then we have data generated like this:
    i = S2.get_symbol(); symbol = S1[i].get_symbol(); put( symbol );
    And then we want to compress and decompress this sequence of symbols, but we don't want to know the actual models which produced them, just the symbols.
    Now, the probability that source i would generate symbol c is W[i] * Pi[c], where W is the probability distribution of source S2, and Pi is the probability distribution of source S1[i]. But since we don't need the i values, we can actually merge subranges corresponding to the same symbols from different S1 sources:
    P[c] = Sum[ W[i] * Pi[c], i=1..M ];
    So that's how linear mixing works.
    4. The relation of this to order-M modelling of text is direct: the 2-stage symbol generator above selects a random source, but we don't know which one if only symbols are encoded; while in text a symbol can be a part of an n-gram, n-grams of different sizes have different statistics, but we don't know with certainty which n-gram a given symbol belongs to. We can only compute a probability distribution - which is what the mixing weights are.
    5. So a CM weight function is just a model; there's no one perfect choice for it. Assigning equal weights to all submodels (which you suggest) is also a choice, but it's usually not a good choice for compression improvement. Green's weight function is an empirical formula that works well enough for plaintext data, but it's not very compatible with binary files.
    57 replies | 1630 view(s)
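    A small sketch of the "context mask" terminology from the reply above: an order-N context is just the last N symbols, and a sparse context skips a few of them. The helper names and the exact gap convention are assumptions of mine, not green's.

    #include <string>

    // Order-N context: the N symbols immediately preceding position 'pos'.
    static std::string order_ctx(const std::string& s, size_t pos, size_t n) {
        size_t start = pos >= n ? pos - n : 0;
        return s.substr(start, pos - start);
    }

    // "Sparse order-4 with gap 2": 4 context symbols, but skipping the 2 symbols
    // right before 'pos' (useful for some binary data, minor gain on text).
    static std::string sparse_ctx(const std::string& s, size_t pos, size_t n, size_t gap) {
        return order_ctx(s, pos >= gap ? pos - gap : 0, n);
    }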
  • Self_Recursive_Data's Avatar
    Yesterday, 20:38
    I'm learning so much from yous though. I've almost crossed no man's land. (I tried reading the material.)
    863 replies | 408929 view(s)
  • Bulat Ziganshin's Avatar
    Yesterday, 20:35
    Btw, I suggest you also look into BCM. It applies CM after BWT, so the model used is really simple - bitwise, based on the last 16 bits or so. It may be even simpler than green.
    57 replies | 1630 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 20:28
    If I feed Green "hello world and hello again", how many windows are placed on it, and where, and how long is each window? For example, if I have 3 models that have their own position/offset, and all 3 windows are 4 letters long of view, and all are closest to the front, we get: "hello world and helloi]n]", or more intuitively (if I separate their overlap):
    "hello world and hello a"
    "hello world and hello n"
    "hello world and helloin"
    We have 3 different (because of offset) models looking at the most recent text. Each window only views 4 letters because the tree would die hard if we went for a view of 16 letters. Especially if we have 3 windows each 16 letters wide lol. So where are you getting this order-16 from? It can't be 16. Do you mean chunks at a time, just like I showed here? Which we then mix together of course.
    Your formula is different from my formula? https://ibb.co/xHyhnmk Can you tweak my image so I understand your formula? It doesn't make sense at all in many ways...
    w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total
    ----------------------------
    P[c] = Sum[ w[j]*o[j].freq[c]/o[j].total, j=0..maxorder ] / Sum[ w[j], j=0..maxorder ]
    P0[0]=w1*0.35+w2*0.20+w3*0.35;
    P0[1]=w1*0.20+w2*0.35+w3*0.20;
    P0[2]=w1*0.05+w2*0.05+w3*0.02;
    P0[3]=w1*0.02+w2*0.02+w3*0.05;
    sum = P0[0]+P0[1]+P0[2]+P0[3];
    P[0]=P0[0]/sum; P[1]=P0[1]/sum; P[2]=P0[2]/sum; P[3]=P0[3]/sum;
    57 replies | 1630 view(s)
  • Shelwien's Avatar
    Yesterday, 20:22
    Just read the source. Most people won't explain things until you understand.
    863 replies | 408929 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 20:01
    Can anyone answer these questions?
    863 replies | 408929 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 18:21
    This is Crush v1.6 with a small compression ratio improvement over crush v1.5.
    15 replies | 768 view(s)
  • Shelwien's Avatar
    Yesterday, 17:29
    > What does P0 mean?
    Just an intermediate result. We need to normalize the distribution, since the sum of probabilities should be 1. P is the final distribution used for coding.
    > What is the constant? From what? A letter?
    It's these constants (config.inc in green_v3a.rar):
    uint coef = {
      27584, 4, 208, 0,
      15592, 32832, 4, 1,
      59299, 86318, 7, 0,
      26624, 65536, 3, 5,
      8178, 38914, 0, 32,
      7520, 56802, 0, 58,
      7104, 51840, 0, 73,
      7425, 47736, 0, 248,
      8192, 41088, 0, 76,
      7236, 32264, 0, 17,
      1280, 34881, 0, 162,
      4096, 40960, 0, 152,
      9380, 24634, 0, 45,
      4096, 25120, 0, 13,
      4096, 24576, 0, 14,
      4096, 41088, 0, 15,
      4096, 49469, 0, 16
    };
    They're for transforming n (the number of different symbols that appeared in the context), total (the number of occurrences of the context), and order (j here) into a weight value. This is the formula in the source:
    uint w = coef/(1+coef+total/(coef+1)); // mixer weight
    As you can see, a context with (n==1), ie only one symbol, gets a higher weight, since it's likely to continue like that.
    > What is w1*0.35 mean and what is result for just that? What is w1?
    I changed 35% to 0.35 because I normally work with probabilities in the 0..1 range. And we can start by normalizing the weights (w1'=w1/(w1+w2+w3) etc) and the probability distributions; then w1*P1 would be the subrange of the arithmetic code assigned to symbol 'e' by the first submodel. We don't need to know which model seemed better to the encoder, so we have to merge same-symbol intervals of different submodels.
    > It'll just add to 100% and after we /sum will say 'e' is ex. 50% of it, which is what we knew.
    This normalization step also normalizes the weights, so it has to be there anyway. Also, computer programs don't really calculate anything precisely, so we can get slightly more or less than 100% due to rounding of intermediate values etc.
    > My understanding is still each model has its %s for the probability for each letter,
    > then add each model m+m+m, then divide by 3. help...
    That's what it does, just that "add each model then divide by 3" means w1=1/3; w2=1/3; w3=1/3; and it's not the best weight function.
    > Your algorithm Green is really fast and 'good start' for me and I want to know how it works exactly.
    > What are you storing?
    There's this:
    byte* inpbuf = new byte; if( inpbuf==0 ) return 1; // unpacked data buffer
    uint* l2 = new uint; if( l2==0 ) return 1; // o3+ link buffer
    static qword freq;
    static uint low;
    static uint o0; bzero( o0 );
    static uint o1; bzero( o1 ); // 256k
    static uint o2; bzero( o2 ); // 64M
    static uint p2; bzero( p2 ); // +64M, last occurrence pointers
    So it stores symbol counts up to order2 as is, then builds linked lists (via the l2/p2 arrays, which store the location of each previous occurrence of each order3 context).
    > I see you say bruteforce, what?
    Because unlike a normal CM/PPM, it builds the distributions for o3+ dynamically, by visiting up to DEPTH (=8192) recent context occurrences.
    > If we take a piece of text "hello world and hello again" how many
    > letters far back does it consider after trained offline and is running now?
    Well, it would go back once for the "hel".."hello " contexts; I don't see any other o3+ matches in your example.
    > 16 orders? How many context models? 1 length of 16? No way. 40 models?
    17 = order0 .. order16
    > Do you use a dict or tree? What is it doing that I'm not.
    "green.cpp" uses a data structure usually known as "hash chains" in LZ, although without a hash function. "tree.cpp" implements the same model with an actual tree.
    > Here is my formula/understanding.
    > Can you tweak this image to Green's formula?
    > I will grasp it if it's visual. https://ibb.co/xHyhnmk
    I made a log of the distributions during green's processing of "   the cat was running down the street" (I added 3 spaces at the start, because green just writes the first 3 symbols as-is, without modelling). See it in the attached file.
    57 replies | 1630 view(s)
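    A simplified restatement of the weight-and-mix logic quoted above, assuming per-order frequency tables are already gathered; the coefficient layout here is a placeholder and not the exact config.inc arrangement.

    #include <cstdint>

    // Per-order statistics: symbol counts and their total for the current context.
    struct OrderStats { uint32_t freq[256]; uint32_t total; };

    // Weight function in the shape quoted above, w = A/(B + total/(C+1)),
    // with per-order coefficients A, B, C (placeholder values supplied by the caller).
    static double weight(const OrderStats& o, double A, double B, double C) {
        return A / (B + o.total / (C + 1));
    }

    // Linear mix of counts into one probability distribution:
    // P[c] = Sum[ w[j]*freq_j[c]/total_j ] / Sum[ w[j] ]
    static void mix(const OrderStats* o, const double (*coef)[3], int orders, double P[256]) {
        double wsum = 0;
        for (int c = 0; c < 256; c++) P[c] = 0;
        for (int j = 0; j < orders; j++) {
            if (o[j].total == 0) continue;                   // context never seen
            double w = weight(o[j], coef[j][0], coef[j][1], coef[j][2]);
            wsum += w;
            for (int c = 0; c < 256; c++) P[c] += w * o[j].freq[c] / o[j].total;
        }
        if (wsum > 0) for (int c = 0; c < 256; c++) P[c] /= wsum;
    }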
  • Self_Recursive_Data's Avatar
    Yesterday, 10:35
    Kennon, very nice compressor! Wow!! So can you explain to me very clearly and in few words exactly (mostly) how it works, so that I can replicate your algorithm? I read the pdf... I see you are looking for the best longest phrases, like Byte Pair Encoding, when you mention "954 characters. It appears 1694 times in the encoded file". What encoded file? You mean wiki9, don't you? Or by encoded do you mean the tree holding wiki9 lol?
    1) So you build a tree, a trie? (Of all of wiki9.) See figure_1 below.
    2) Are you finding the common "codes" in wiki9 by a form of Byte Pair Encoding? Ex. "".
    3) And are you replacing these windows (where they sit in wiki9, like Huffman does) with their own "huffman-like" code? Ex. "".
    4) Can you explain the "best code" discovery again? So you look for what in the tree/trie? A node that extends very long and has few children for miles? Or also is frequent? Or is near the root?
    Figure_1 (tree):
    //////////a<
    ////a<
    //////////b<
    S<
    //////////a<
    ////b<
    //////////b<
    863 replies | 408929 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 07:33
    See above. Here is my formula/understanding. Can you tweak this image to Green's formula? I will grasp it if it's visual. https://ibb.co/xHyhnmk See how simple that is? This allows the correct letter prediction to be suggested based on the witnesses you have and their position. I see the full formula is:
    w = A/(B+total/(C+1)) = A*(C+1)/(B*(C+1)+total) ~= W/total
    ----------------------------
    P[c] = Sum[ w[j]*o[j].freq[c]/o[j].total, j=0..maxorder ] / Sum[ w[j], j=0..maxorder ]
    As for Green's model(s?) themselves, the context matches - are they stored in a tree, a raw dict (no data structure)? And that's it for Green, right? It just looks at the past 16 letters and does the formula? How in the world do you store order16? You need multiple models to chop it up though.... :-(
    57 replies | 1630 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 07:12
    What does P0 mean? What is the constant? From what? A letter? What is the count... What does w1*0.35 mean, and what is the result for just that? What is w1? What is the weight for weight1? Why is "sum = P0[0]+P0[1]+P0[2]+P0[3];" combining the %s we already have? It'll just add to 100%, and after we /sum it will say 'e' is ex. 50% of it, which is what we knew. My understanding is still: each model has its %s for the probability of each letter, then add each model m+m+m, then divide by 3. help...
    Your algorithm Green is really fast and a 'good start' for me, and I want to know how it works exactly. What are you storing? I see you say bruteforce, what? If we take a piece of text "hello world and hello again", how many letters far back does it consider after it is trained offline and is running now? 16 orders? How many context models? 1 of length 16? No way. 40 models? Do you use a dict or tree? What is it doing that I'm not?
    57 replies | 1630 view(s)
  • Sportman's Avatar
    Yesterday, 06:45
    Sportman replied to a thread Zstandard in Data Compression
    Cyan and Terrelln, thanks for picking up and fixing this issue so fast. Can't wait for the official release to see if there are more files with up to 10% storage and up to 8% compression speed savings in big data production environments with zstd ultra modes.
    394 replies | 122250 view(s)
  • Shelwien's Avatar
    Yesterday, 06:29
    > Why, are, you, adding model1+total/model3?
    C's are constants, not models.
    > and what is "green"?
    It's in this thread that I linked before: https://encode.su/threads/541-Simple-bytewise-context-mixing-demo
    > Please show me the result after adding these weights together,
    > e.g. the averaging of them (an example, it does not have to be precise):
    It would be something like
    P0[0]=w1*0.35+w2*0.20+w3*0.35;
    P0[1]=w1*0.20+w2*0.35+w3*0.20;
    P0[2]=w1*0.05+w2*0.05+w3*0.02;
    P0[3]=w1*0.02+w2*0.02+w3*0.05;
    sum = P0[0]+P0[1]+P0[2]+P0[3];
    P[0]=P0[0]/sum; P[1]=P0[1]/sum; P[2]=P0[2]/sum; P[3]=P0[3]/sum;
    where the w's are computed from the available information with a weight function like the one you complained about. There's no good way to compute optimal weights just from a few probability distributions - some history is necessary. Of course, perfect weights in this case would maximize the probability of the current symbol:
    'e': w1=1; w2=0; w3=0;
    'h': w1=0; w2=1; w3=0;
    'z': w1=1; w2=0; w3=0;
    '&': w1=0; w2=0; w3=1;
    57 replies | 1630 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 05:50
    > Am I missing anything.... What/where is the Weight Averaging? How is it done (use my example above).
    >> Mostly by using different weights, based on the history of previous predictions.
    >> For example, "green" uses a weight function like this: C1/(C2+total/C3).
    >> Where Ci are tuned constants, total is the number of the context's occurrences, and "o" is the context's order.
    Whoa, you lost me with "C1/(C2+total/C3)". You have 3 context models where each has its own set of probabilities for a, b, c, d, etc. We get their %s by totaling the counts for a given set. We then add the sets together and divide them too. Why, are, you, adding model1+total/model3? And not doing this for model1? This isn't clear. To me the meaning of "C1/(C2+total/C3)" is:
    model1 35%=e, 20%=h, 5%=z, 2%=&
    divided by (
    model2 20%=e, 35%=h, 5%=z, 2%=&
    + 300% (total)
    divided by:
    model3 5%=e, 20%=h, 2%=z, 5%=&
    )
    me=confused
    and what is "green"?
    Please show me the result after adding these weights together, e.g. the averaging of them (an example, it does not have to be precise):
    > "35%=e, 20%=h, 5%=z, 2%=&"
    > "20%=e, 35%=h, 5%=z, 2%=&"
    > "35%=e, 20%=h, 2%=z, 5%=&"
    57 replies | 1630 view(s)
  • terrelln's Avatar
    Yesterday, 05:42
    terrelln replied to a thread Zstandard in Data Compression
    Thanks for the benchmarks sportman! The multithreading issue is fixed in the latest dev branch, and we now mention --single-thread in the help.
    394 replies | 122250 view(s)
  • Shelwien's Avatar
    Yesterday, 05:17
    > "35%=e, 20%=h, 5%=z, 2%=&" > "20%=e, 35%=h, 5%=z, 2%=&" > "35%=e, 20%=h, 2%=z, 5%=&" > So I simply add them (smash the 3 rows together into 1 row) and then divide each % by 3. Well, it would be linear mixing with weight = 1/3 for each model. Also you can skip "divide by 3", but probabilities in the coding distribution should add up to 100%. > Am I missing anything.... What/where is the Weight Averaging? How is it done (use my example above). Mostly by using different weights, based on history of previous predictions. For example, "green" uses weight function like this: C1/(C2+total/C3). Where Ci are tuned constants, total is the number of context's occurrences, and "o" is context's order. Ash07 above basically uses the same method, so there's at least some potential in it. Without SSE its result is 19,492,857 though. However bitwise approach is easier and provides better compression in the end.
    57 replies | 1630 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 04:47
    Say I get 3 models telling me:
    Model_1: "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=e, 20%=h, 5%=z, 2%=&"
    Model_2: "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=h, 20%=e, 5%=z, 2%=&"
    Model_3: "I'm a 4-letter context over here, and I think the Prediction Letter is probably 35%=e, 20%=z, 5%=&, 2%=z"
    So I simply add them (smash the 3 rows together into 1 row) and then divide each % by 3. Am I missing anything.... :-) What/where is the Weight Averaging? How is it done (use my example above)?
    57 replies | 1630 view(s)
  • Shelwien's Avatar
    Yesterday, 04:31
    I tried increasing the number of iterations for method #2 to 10M; it improved, while #4 got worse because of a different seed...
    >a.exe
    method 1: 0.06072729
    method 2: 0.08386047
    method 3: 0.08643947
    method 4: 0.05860876
    method 5: 0.06026703
    >a.exe
    method 1: 0.06072729
    method 2: 0.07329438
    method 3: 0.08221427
    method 4: 0.05916388
    method 5: 0.06026703
    5 replies | 266 view(s)
  • Shelwien's Avatar
    Yesterday, 04:11
    > Wow 18.5MB, you made that algorithm?
    Yes, in 2003; it's open-source: http://ctxmodel.net/files/ASH/
    > Now when I mix the models, do I do weight averaging before,
    > after, or both before and after mixing models?
    https://en.wikipedia.org/wiki/Context_mixing
    I'm not sure what you're asking... you need a single probability to encode a bit, but there can be multiple layers of mixing. A mixer basically is a context model itself - it takes a context (submodel predictions plus known data) and produces one prediction. For example, I tend to chain binary mixers instead of using one n-ary mixer like paq.
    57 replies | 1630 view(s)
  • byronknoll's Avatar
    Yesterday, 04:06
    I wrote a quick benchmark to compare some approaches: https://gist.github.com/byronknoll/3bafb846d2760ab6081a37ac700c0618 My original idea of sampling uniform random numbers did not perform well. The best performing technique so far uses numbers sampled from a Bates distribution.
    5 replies | 266 view(s)
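    For reference, a Bates(n) sample is simply the mean of n independent uniform(0,1) draws; a minimal sketch of the sampling step (my own, not the code from the linked gist):

    #include <random>

    // Mean of n uniform(0,1) samples: Bates distribution on [0,1].
    // The variance shrinks as 1/(12n), so larger n concentrates mass around 0.5.
    static double bates(std::mt19937& rng, int n) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double s = 0;
        for (int i = 0; i < n; i++) s += u(rng);
        return s / n;
    }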
  • Self_Recursive_Data's Avatar
    Yesterday, 03:45
    Wow 18.5MB, you made that algorithm? Now when I mix the models, do I do weight averaging before, after, or both before and after mixing models?
    57 replies | 1630 view(s)
  • Shelwien's Avatar
    Yesterday, 03:14
    > Shelwien, what's your wiki8 record?
    I never really tried to set one on purpose... Tried running my relevant coders and it's like this:
    98,056,487 enwik_text3a            // https://encode.su/threads/2590-Some-perl-scripts-for-enwik8-parsing
    59,878,185 enwik_text3a.drt        // DRT.exe enwik_text3a enwik_text3a.drt
    20,185,928 enwik_text3a.drt.bwtmix // BWTmix.exe c600 enwik_text3a.drt enwik_text3a.drt.bwtmix
    20,013,580 enwik_text3a.drt.pmd32  // pmd.exe c enwik_text3a.drt enwik_text3a.drt.pmd32 32 1400 (ppmd_sh9)
    19,418,811 enwik_text3a.drt.CMo8   // CMo8.exe c12 enwik_text3a.drt enwik_text3a.drt.CMo8
    18,510,302 enwik_text3a.drt.ash07  // ash07.exe /o256 /m4000 /s255 enwik_text3a.drt enwik_text3a.drt.ash07
    I could probably reach 17 by combining CMo8 with mod_ppmd and mod_SSE - CMo8 doesn't have an SSE stage atm; it's a simple logistic mix of o0..o8 to test new counters. If you count paq, then I have this: https://encode.su/threads/2515-mod_ppmd
    mod_ppmd is probably still a part of "phda" - it was there last time I checked.
    > How do you store all these oh too many ex. order-4 contexts? A tree? A dict?
    > Or an all-connected FeedForward architecture? Can you draw a pic of it in full?
    CMs mostly use hashtables: https://en.wikipedia.org/wiki/Hash_table
    Although ppmd/ppmonstr/durilca do use a tree. Here's an example with a linked list: https://encode.su/threads/541-Simple-bytewise-context-mixing-demo
    > A tree can appear fast/low memory but in the end it is not going to be the best.
    > We want less nodes and less searching.
    Yeah, a hashtable is a much better solution for compression (where it can be used), because it's not necessary to avoid collisions too much etc. In a compressor it can be used as a lossy data structure, so it's much more efficient than a tree - it can use all the memory for counters, while for a tree we need to keep lots of pointers too.
    57 replies | 1630 view(s)
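    A minimal sketch of the lossy hashtable idea above: counters are indexed directly by a hash of the context and collisions are simply tolerated, so all memory goes into counters rather than tree pointers. The hash and table layout are arbitrary choices of mine, not any particular CM's.

    #include <cstdint>
    #include <vector>

    struct LossyCounters {
        std::vector<uint16_t> n0, n1;          // bit counts per hashed context
        uint32_t mask;
        explicit LossyCounters(uint32_t bits)
            : n0(1u << bits, 0), n1(1u << bits, 0), mask((1u << bits) - 1) {}

        static uint32_t hash(uint64_t ctx) {   // arbitrary multiplicative hash
            return uint32_t((ctx * 0x9E3779B97F4A7C15ull) >> 40);
        }
        void update(uint64_t ctx, int bit) {   // collisions just share the slot
            uint32_t i = hash(ctx) & mask;     // (counter overflow handling omitted)
            (bit ? n1[i] : n0[i])++;
        }
        double p1(uint64_t ctx) const {        // estimated probability of a 1 bit
            uint32_t i = hash(ctx) & mask;
            return (n1[i] + 1.0) / (n0[i] + n1[i] + 2.0);
        }
    };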
  • Sportman's Avatar
    Yesterday, 02:53
    Sportman replied to a thread Zstandard in Data Compression
    2,337,506,087 bytes, 2,596.459 sec. - 13.159 sec., zstd -18 --single-thread (v1.4.4)
    2,299,225,559 bytes, 3,196.329 sec. - 13.559 sec., zstd -19 --single-thread (v1.4.4)
    2,197,171,322 bytes, 3,947.477 sec. - 14.471 sec., zstd -20 --ultra --single-thread (v1.4.4)
    394 replies | 122250 view(s)
  • Cyan's Avatar
    Yesterday, 02:40
    Cyan replied to a thread Zstandard in Data Compression
    Thanks @Sportman for exposing this use case thanks to your new benchmark ! It made it possible to identify the issue, and @terrelln basically nailed it in a new patch. This issue will be fixed in the next release (v1.4.5).
    394 replies | 122250 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 00:26
    Shelwien, what's your wiki8 record? How do you store all these oh too many ex. order-4 contexts? A tree? A dict? Or an all-connected FeedForward architecture? Can you draw a pic of it in full? A tree can appear fast/low memory but in the end it is not going to be the best. We want less nodes and less searching.
    57 replies | 1630 view(s)
  • Shelwien's Avatar
    17th January 2020, 23:21
    Here's my coder with lzma counters: http://nishi.dreamhosters.com/u/c0lit_v0.7z
    61251024 2.063s 2.156s // c0lit_cl90.exe
    61251024 2.265s 2.297s // c0lit_gc82.exe
    61251024 2.109s 2.156s // c0lit_ic19.exe
    61251024 2.172s 2.344s // c0lit_cl90_fil.exe
    61251024 2.094s 2.265s // c0lit_gc82_fil.exe
    61251024 2.218s 2.312s // c0lit_ic19_fil.exe
    61251024 1.750s 1.844s // c0lit_cl90.exe
    61251024 1.844s 2.000s // c0lit_cl90_fil.exe
    61251024 1.891s 1.969s // c0lit_gc82.exe
    61251024 1.782s 1.937s // c0lit_gc82_fil.exe
    61251024 1.781s 1.829s // c0lit_ic19.exe
    61251024 1.875s 1.969s // c0lit_ic19_fil.exe
    E MB/s size ratio% D MB/s function
    62.39 61250808 61.25% 56.88 11-rcs o0 simple enwik8
    61250808 1.602s 1.758s // trc.exe recalculated from speed
    61250820 1.953s 1.891s // trc.exe -11 "enwik8" 1 (gcc82, modified test.bat)
    It's timed with file i/o, so I wonder how it actually compares...
    22 replies | 1712 view(s)
  • Sportman's Avatar
    17th January 2020, 21:37
    Sportman replied to a thread Zstandard in Data Compression
    Confirmed: 2,136,340,302 bytes, 4,604.696 sec. - 15.121 sec., zstd -21 --ultra --single-thread (v1.4.4)
    394 replies | 122250 view(s)
  • dnd's Avatar
    17th January 2020, 21:18
    A larger renorm (ex. 48 bit) will make substantial changes necessary in the coding. A priori it will not make the RC faster. The speed issue of an RC is the sequential processing in the decoding. You can use interleaving to speed up rANS. Actually I've not tried interleaving (2 streams) with the Turbo-Range-Coder. I wonder if it's possible, and how to rearrange the output of the RC so as to make it possible to use interleaving with one stream at decoding. The 64 bit RC is reading and writing in 32 bit units only. I've added some logic to use the renorm only when necessary in the o0 simple coder (option 11).
    enwik9 - skylake i6700 3.4GHz
    E MB/s  size       ratio%  D MB/s  function          RC_IO  RC_BITS
    56.34   620093192  62.01%  47.56   11-rcs o0 simple  32     15       4 Renorm/Byte
    54.58   620093192  62.01%  47.10   11-rcs o0 simple  32     15       8 Renorm/Byte (previous version)
    50.70   620717608  62.07%  42.58   11-rcs o0 simple  16     12       2 Renorm/Byte
    I've added the type __uint128_t to rcrange_t to support a 128 bit RC, without changing the coding. It's too slow, as expected. Output (RC_IO) can be 8, 16, 32 or 64 bit. The fastest RC is the default 64 bit RC (RC_SIZE = 64) with 32 bit i/o (RC_IO = 32).
    22 replies | 1712 view(s)
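    For illustration of the renormalization idea being discussed, here is a classic byte-wise carryless range coder (Subbotin style); TurboRC's 64-bit state with 32-bit output units is organized differently, so this is only a sketch of the general mechanism.

    #include <cstdint>
    #include <vector>

    struct RangeEncoder {
        uint32_t low = 0, range = 0xFFFFFFFFu;
        std::vector<uint8_t> out;
        static const uint32_t TOP = 1u << 24, BOT = 1u << 16;

        void encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
            range /= totFreq;
            low  += cumFreq * range;
            range *= freq;
            // Renormalize: either the top byte of 'low' is settled, or the
            // range got too small and is clamped to avoid carry handling.
            while ((low ^ (low + range)) < TOP ||
                   (range < BOT && ((range = (0u - low) & (BOT - 1)), true))) {
                out.push_back(uint8_t(low >> 24));
                low <<= 8;
                range <<= 8;
            }
        }
        void flush() {                       // emit the remaining state bytes
            for (int i = 0; i < 4; i++) { out.push_back(uint8_t(low >> 24)); low <<= 8; }
        }
    };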
  • JamesWasil's Avatar
    17th January 2020, 21:10
    The final version of BCM should be given a special name like Gigawatts v1.21 (If David Scott adds to it making it bijective, that version could be called "Great Scott!") lol just a few nomenclature ideas to go with your avatar. ;-) Nice work btw.
    113 replies | 55185 view(s)
  • Sportman's Avatar
    17th January 2020, 19:53
    Sportman replied to a thread Zstandard in Data Compression
    Confirmed:
    2,364,340,233 bytes, 5,667.053 sec. - 15.339 sec., zstd -22 --ultra -T1 (v1.4.4)
    2,080,479,075 bytes, 5,257.974 sec. - 15.484 sec., zstd -22 --ultra --single-thread (v1.4.4)
    The zstd v1.4.4 Windows binary help (-h) does not mention the argument --single-thread, only the argument -T:
    -T# : spawns # compression threads (default: 1, 0==# cores)
    394 replies | 122250 view(s)
  • Shelwien's Avatar
    17th January 2020, 19:26
    > Artificially generated data isn't a bad idea just because Matt did it
    > once and someone found a way to preprocess it for better compression.
    Artificial data is bad because it's artificial. It would work if we can reproduce the properties of the data which we are specifically targeting (eg. BWT output), but it would prevent discovering new patterns (which is good if it's fit for practical applications). When writing the data generator it's easy to miss something, which can be exploited in the contest (eg. there're PRNGs in actual compiler standard libs which produce "random numbers" compressible by paq).
    > I think it's still the way to go for an objective competition.
    I also considered it, but don't see how to make it useful and unexploitable.
    >> I'd rather have preprocessors for relevant data (eg. exes) than for something artificially generated for the contest.
    > Would hate this. I love the idea behind Paq8, but all the custom man-made models for all sorts of cases
    It's a good way to create actual competition, with multiple people making improvements. paq8 is too complicated and "solid" (not modular), so imho it has effectively stopped CM development since its release. paq8 lacks many things (explicit parameters for component tuning, precision, support for enc/dec asymmetry), but a new CM with new features can't compete with paq8 in benchmarks, because first it needs to have all the same submodels (match, sparse, periodic, 2D, indirect), and there's no analysis of submodel contribution which would explain to a new developer what has to be integrated to compete with paq8, and why. Btw, there I only listed "pure" universal models, which make sense just based on theory.
    > (at least for me) go against the idea of general artificial intelligence.
    This contest certainly won't be about AI. The (potential) sponsor is storage-related. Also, I much prefer standalone preprocessors to a combination of complex features which combine to make use of some pattern in the data. For example, consider rep-matches in lzma (and clones) - they were (likely) introduced simply because somebody noticed that LZ distances are repeated sometimes. But then lzma's optimal parser has heuristics for checking multi-token patterns like match-literal-rep_match, which effectively let it choose longer "fuzzy matches" over simple nearest repeated substrings.
    > If the competition will be a game of hand-crafted preprocessing and a bunch of custom man-made models, I would lose interest.
    We'll see, I guess. I still think that making a good exe filter for blockwise zstd (for example) would be more interesting than tweaking paq8. If it's not for you, Hutter's contest is still active.
    > It would be more interesting to allow the compressor to build/develop/find (unsupervised training)
    > a custom model that acts like a preprocessor or custom model for the data, and see this as a separate step besides encoding/decoding.
    In theory I agree, and that was supposedly the original Hutter's (and Mahoney's) intention for the contest. But in practice I don't believe in fully learning English from enwik, even if we'd get humans to work on it. Some aspects are discoverable (like words/spaces/punctuation, or parts of morphology; maybe synonyms), but actual understanding is only possible with external information (eg. pictures with objects known from external information), and some parts are impossible without supervision (rhymes, rhythm and other stuff related to phonetics; detailed morphology; idioms), or at least specially prepared descriptions fed to the AI.
    Of course, some kinds of data analysis are still possible - for example, atm paq8 doesn't have parsing optimization. But I can't add it to paq (it requires backtracking etc), and if I'd make a new CM with it, it would have worse compression than paq, and thus there won't be any interest. Btw, I'd like your feedback on this project: https://encode.su/threads/3072-controlled-regexp-tools-for-HP
    17 replies | 673 view(s)
  • Shelwien's Avatar
    17th January 2020, 18:11
    1. 32-bit i/o in an x64 rc is a good speed optimization. In fact, I have an old 32-bit coder like that which used FP registers for the rc state, so this method should be relatively portable. But I wonder if more than 32 bits would make sense too? I tried testing it with my rc, and it somehow made decoding speed worse:
    45411520 6.203s 6.890s // 32-bit renorm/LE
    45729030 6.156s 6.969s // 48-bit renorm/BE
    45729030 6.187s 7.078s // 48-bit renorm/LE
    In theory, even 56-bit renorm is possible, eg. with counters like in lzma.
    2. It's possible to make a binary rc with a 128-bit range. It would require more multiplications, but would also allow 64-bit renorm and/or only checking for renorm once per 8 bits. I wonder if it would work as a speed optimization.
    22 replies | 1712 view(s)
  • Kaw's Avatar
    17th January 2020, 17:53
    I think that's a Baconian fallacy? Artificially generated data isn't a bad idea just because Matt did it once and someone found a way to preprocess it for better compression. I think it's still the way to go for an objective competition.
    Would hate this. I love the idea behind Paq8, but all the custom man-made models for all sorts of cases (at least for me) go against the idea of general artificial intelligence. If the competition will be a game of hand-crafted preprocessing and a bunch of custom man-made models, I would lose interest.
    This would quickly lead to abuse of the rule, I am afraid. It would be more interesting to allow the compressor to build/develop/find (unsupervised training) a custom model that acts like a preprocessor or custom model for the data, and see this as a separate step besides encoding/decoding.
    17 replies | 673 view(s)
  • Shelwien's Avatar
    17th January 2020, 17:23
    > I'm not sure what to think. If things should be practical, we get just another dictionary coder,
    > written in C(++) and fine-tuned for the test data.
    PPM/fast_CM/BWT have faster encoding and better compression than LZ. And yes, I don't intend to support python/java/C#/JS - these developers can find their own solution for trans-compiling to C/C++, since there're no high-quality compilers for anything else. Also, modern LZs have much higher algorithmic complexity than anything else, and it's still possible to improve their parsing optimization and entropy coding. It may also be possible to speed-optimize paq-like CMs.
    > What I would suggest is artificially generated data.
    That was attempted before: http://mattmahoney.net/dc/uiq/
    And it failed fast enough, after a method of preprocessing was discovered. I'd rather have preprocessors for relevant data (eg. exes) than for something artificially generated for the contest.
    > The contest creator/holder keeps the private set for the final scores.
    I do intend to keep the test corpus private.
    > But I would love to see a second competition where purely the size is the objective.
    I have some hope that the Hutter Prize would be rebooted for that.
    > Why not a memory constraint of 16gb and a time constraint of 24 hours for this competition, and let's see what the competitors create?
    I agree on 16gb, but a smaller time limit imho helps competitors. Otherwise it could easily become a competition in having access to more hardware to run some cmix/nncp hybrid. I'm actually more interested in taking off the decoder size limit somehow, to enable natural language parsing etc. But that would rely too much on having unique test data which nobody can guess.
    17 replies | 673 view(s)
  • Kaw's Avatar
    17th January 2020, 17:05
    I love the idea. Maybe even smaller blocks, like 32 bytes, or variable length? Just as in a real database? This way you can't just use zstd to win. You could use one or more open-source databases as input.
    17 replies | 673 view(s)
  • Shelwien's Avatar
    17th January 2020, 16:44
    > Maybe change the objective, but still stay with enwik8?
    Enwik8 compression is basically only useful for enwik8 compression. The enwik file contains too much unique stuff like wiki markup, large tables in mixed html/wiki markup, bibliographies, reference lists, language lists. So some preprocessing tricks can significantly change the results even if we'd run a blockwise random-access benchmark on enwik. Which could be ok normally, but enwik tricks are not relevant for anything else.
    > Hutter Prize limits the memory and time and rewards highest compression ratio.
    > Maybe instead we should seek fastest compressors that operate within some memory limit and compress enwik8 to e.g. under 18,000,000 bytes.
    Under _19,000,000_ there are basically only coders with WRT preprocessing and slow CM/PPM+SSE. I guess it's possible to set up rules such that durilca would win, but none of the paq derivatives; however it would keep most of the drawbacks of the HP contest (like having no applications, encouraging information gathering rather than programming, etc).
    I'm thinking about focusing on compression of small blocks (4k..1M) for random access, since
    1) It's an important (basically the only) use-case in practical applications - filesystems, databases, storage, even games. But the "compression scene" has very limited resources for it: 90% of LTCB entries are unfit, while a number of the remaining ones require refactoring and new features (eg. external dictionary support) to be competitive.
    2) The benchmark order changes significantly in this case - many popular choices would become useless (eg. BSC has worse compression and is slower than ppmd), so it should be interesting. Well, in practical applications zstd would currently win (compression ratio + decoding speed + external dictionary support), but better ratios and faster encoding are possible, so there's stuff to work on.
    3) It's possible to make progress by tweaking existing codecs or preprocessors, which should let people participate in the contest without having to invest months of work.
    17 replies | 673 view(s)
  • Kaw's Avatar
    17th January 2020, 16:33
    I'm not sure what to think. If things should be practical, we get just another dictionary coder, written in C(++) and fine-tuned for the test data. What I would suggest is artificially generated data: a collection of files with different sizes (1kb, 1mb, 1gb) and different probability distributions (easy, moderate, hard) on the bit and byte level. Let's say 18 files in total, a public and a private set. People are allowed to train their algorithm on the public set. The contest creator/holder keeps the private set for the final scores. We could use a widely accepted tool like 7zip on default settings to set a ceiling size for every file. Competitors have to write a program that is able to compress equal to or better than default 7zip, and are judged by the sum of encoding and decoding time. I would also suggest using a modern memory constraint like 8gb or something.
    But I would love to see a second competition where purely the size is the objective. From the past we learned that time or memory constraints diminish over the years, and that innovative but impractical algorithms from the '80s and '90s are just fine and mainstream now. Why not a memory constraint of 16gb and a time constraint of 24 hours for this competition, and let's see what the competitors create?
    17 replies | 673 view(s)
  • Shelwien's Avatar
    17th January 2020, 12:43
    enwik8 would take too long, but here's book1:
    source      size       o1rc     paq8px   qlfc
    BOOK1.bwt   768,775    208,199  208,397  212,100
    BOOK1.bwth  768,771    226,555  220,485  233,853
    BOOK1.bwtl  768,771    228,076  221,892  235,693
    BOOK1h.bwt  6,150,172  220,097  231,029  219,453
    BOOK1l.bwt  6,150,172  221,339  232,752  220,643
    "h" is bit7-to-bit0 order, "l" is bit0-to-bit7. As expected, it makes compression worse, because byte alignment is important for text (you'd have more precision in predicting bit7=0 from bitpos%8 than from a prefix context; also there're letter cases.) Also you have to understand that BWT is very limited; it's not fit for breaking compression ratio records. Basically, a BWT suffix array is a data structure that simplifies access to prefix order* contexts... at the cost of hiding everything else.
    57 replies | 1630 view(s)
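    A naive sketch of the forward BWT for reference (sort all rotations, keep the last column); real implementations build a suffix array in roughly linear time instead of comparing rotations like this.

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    // Returns the last column of the sorted rotation matrix plus the index of
    // the original string among the rotations (needed for inversion).
    static std::pair<std::string, size_t> bwt_naive(const std::string& s) {
        size_t n = s.size();
        std::vector<size_t> rot(n);
        std::iota(rot.begin(), rot.end(), 0);
        std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
            for (size_t i = 0; i < n; i++) {                 // compare rotations
                char ca = s[(a + i) % n], cb = s[(b + i) % n];
                if (ca != cb) return ca < cb;
            }
            return false;
        });
        std::string last(n, 0);
        size_t primary = 0;
        for (size_t i = 0; i < n; i++) {
            last[i] = s[(rot[i] + n - 1) % n];               // char preceding each rotation
            if (rot[i] == 0) primary = i;
        }
        return { last, primary };
    }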
  • Piglet's Avatar
    17th January 2020, 12:16
    Piglet replied to a thread Zstandard in Data Compression
    It would be nice to have a switch/setting that runs without compromise. Of course the runtime would be long, but it would provide the smallest possible size. It would be interesting to know where zstd ends.
    394 replies | 122250 view(s)
  • Bulat Ziganshin's Avatar
    17th January 2020, 12:15
    Bulat Ziganshin replied to a thread Zstandard in Data Compression
    Note that m/t also affects level 21
    394 replies | 122250 view(s)
  • encode's Avatar
    17th January 2020, 11:36
    D:\>bcm -9 wikipedia-en-html.tar
    Compressing wikipedia-en-html.tar:
    223674511360 -> 7800893626 in 27559.1 sec
    Tested on HDD though. 208 GB of Wikipedia slimmed down to 7 GB... :_coffee:
    113 replies | 55185 view(s)
  • Self_Recursive_Data's Avatar
    17th January 2020, 06:22
    Has anyone tried BWT on the full wiki8, on its binary? Or is it way too slow? It'd look like this as binary:
    101000111010$
    01000111010$1
    1000111010$10
    000111010$101
    00111010$1010
    0111010$10100
    111010$101000
    11010$1010001
    1010$10100011
    010$101000111
    10$1010001110
    0$10100011101
    $101000111010
    57 replies | 1630 view(s)
  • terrelln's Avatar
    17th January 2020, 05:43
    terrelln replied to a thread Zstandard in Data Compression
    Running zstd in single threaded mode gets the expected compression ratio. Time to figure out what's going wrong!
    zstd --single-thread --ultra -22 enwik10 -o /dev/null
    enwik10 : 20.80% (10000000000 => 2080479075 bytes, /dev/null)
    394 replies | 122250 view(s)