Activity Stream

  • nikkho's Avatar
    Today, 14:28
    nikkho replied to a thread FileOptimizer in Data Compression
    Thank you. Gifsicle was my mistake, so it is fixed. Regarding pngquant and jpeg-recompress, they do not distribute 32-bit binaries, so there is not much I can do. https://sourceforge.net/p/nikkhokkho/code/1546/
    661 replies | 191792 view(s)
  • nikkho's Avatar
    Today, 14:19
    nikkho replied to a thread FileOptimizer in Data Compression
    Thanks. The CHM file has not been updated in months. I will try to fix it for the next release.
    661 replies | 191792 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 13:39
    That's strange, because when I decompress the output there is no error like file corruption or anything of the kind. Have you tested v1.3, please?
    30 replies | 1052 view(s)
  • cade's Avatar
    Today, 13:32
    Found and fixed a bug in how I slide bt4 when it crosses 2 GB (2 * window size + max match length + a good peek buffer for optimal parsing); I will test enwik10 to completion, and with just bt4 + memcmp, and then upload a fixed version. This doesn't happen for window bits <= 29 (a sketch of the overflow pattern follows this post). For quick verification of other files, you can use the t command; it will show the expected CRC32 vs. the CRC32 of the result unpacked in memory. If there aren't any issues after that, I'll clean up and release the source into the public domain.
    9 replies | 314 view(s)
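    A minimal sketch of the overflow pattern cade describes above; the names and margin constants are assumptions, not NLZM's actual code. The point is that with 30 window bits, 2 * window size + margins no longer fits in a signed 32-bit integer:

      #include <cstdint>

      // Hypothetical slide logic for a bt4-style match finder. With
      // windowBits = 30, 2 * windowSize + maxMatchLen + peekBuffer exceeds
      // INT32_MAX, so the limit must be formed in 64 bits; for
      // windowBits <= 29 a 32-bit sum still fits, matching the report above.
      void SlideWindow(uint32_t windowBits, int64_t &basePos) {
          const int64_t windowSize  = int64_t(1) << windowBits;
          const int64_t maxMatchLen = 1 << 16;   // assumed margin
          const int64_t peekBuffer  = 1 << 20;   // assumed margin
          // Buggy form: int32_t limit = 2*int32_t(windowSize) + ... (wraps)
          const int64_t limit = 2 * windowSize + maxMatchLen + peekBuffer;
          if (basePos >= limit)
              basePos -= windowSize;  // rebase stored match-finder positions
      }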
  • Jyrki Alakuijala's Avatar
    Today, 13:32
    Such a large benchmark comes with inconveniences in use and distribution, so it should carry some benefit, too. Consider quantifying this, for example by showing how much the ranking of compression algorithms changes if you take only 100 or 500 MB of each corpus vs. the full 1 GB.
    30 replies | 1052 view(s)
  • pacalovasjurijus's Avatar
    Today, 13:15
    My rotation algorithm; here is the file before and after:
    54 replies | 5012 view(s)
  • Sportman's Avatar
    Today, 12:29
    Yes, no error and the length is OK, but the decompressed output content is not OK. For example:
    original: |last=Stubblefield|first=Phillip G. |
    Crush 1.9 output: |last=Stubblt;ref |first=Phillip G. |
    30 replies | 1052 view(s)
  • dnd's Avatar
    Today, 10:06
    TurboBench Compression Benchmark uses large-window brotli by default for quality levels 10 and 11 only. For lower levels there is little benefit in compression ratio, but decompression can be significantly slower. @Shelwien, @bwt: is it possible to delete the duplicated list? It's really confusing when people copy large posts when replying.
    30 replies | 1052 view(s)
  • encode's Avatar
    Today, 07:56
    BTW, I'm working on BCM v2.00! Currently rebuilding/retuning the CM part from scratch. It will feature:
    + More precise range encoder
    + Higher compression
    + Cumulative CRC, now stored per block :_superman2:
    114 replies | 55289 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 05:12
    What do you mean by compare fail? There is no error while I compress and decompress it.
    30 replies | 1052 view(s)
  • michael maniscalco's Avatar
    Today, 04:57
    Exactly. Some algorithms benefit more from MT than others, and even some implementations of the same algorithm scale better than others. In the case of M99 you get basically the same compression ratio and the same memory requirements no matter how many cores you use, but the throughput increases dramatically. You can't say the same of other BWT implementations. Good luck using multiple cores along with 1 GB block sizes for BSC, for instance. You'll be in swap-ville in no time! ;) So it all depends on what the test is intended to gauge, I guess.
    30 replies | 1052 view(s)
  • Sportman's Avatar
    Today, 04:49
    To have a fair single-core comparison. I had planned to do a multi-core benchmark, but for that I need to pump the liquid-cooling pump speed and radiator fan speed up from inaudible to very loud, and many programs already failed with the single-core benchmark and this big file.
    30 replies | 1052 view(s)
  • Sportman's Avatar
    Today, 04:20
    Compare fail.
    30 replies | 1052 view(s)
  • Shelwien's Avatar
    Today, 02:30
    Maybe make another benchmark with MT? Half of the listed programs actually support MT.
    30 replies | 1052 view(s)
  • michael maniscalco's Avatar
    Today, 02:07
    I'm curious as to why you disabled multithreading. That's where most of the underlying algorithms really shine. - Michael
    30 replies | 1052 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 01:57
    Crush v1.9 using the -9 option on enwik10:
    10000000000 -> 2498911772 in 13323.99 s
    2498911772 -> 10000000000 in 351.06 s
    30 replies | 1052 view(s)
  • Sportman's Avatar
    Today, 01:11
    Yes! -b1048576 -t1 - 2,599,535,211 bytes:
    437.143 sec. - 275.518 sec., m99_clang90_k8
    433.678 sec. - 315.293 sec., m99_gcc82_k8
    446.727 sec. - 275.459 sec., m99_ic19_SSE2
    30 replies | 1052 view(s)
  • Sportman's Avatar
    Today, 01:06
    Indeed, added.
    30 replies | 1052 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 22:34
    So in my tree, I should increment any node I pass by 1. Then when I search for my 17 orders, I instantly get the next-layer predictions with counts, like this: https://ibb.co/HpYf8TQ ? I think I got it now, checking... every node has counts. Of course we store letters crammed into many nodes until we need to split at some point.
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    Yesterday, 22:21
    > Take a look again at my image, the letter in the input 's' is the prediction,
    > these are at end of the tree, the leafs.

    Yes, but normally we'd walk through the tree from root to leaf. The tree is a data structure for the hierarchical arrangement of elements.

    > The letters prior to 's' are the context, they are stored in the tree,
    > you can find it backwards and then grab predictions & counts.

    I don't understand why you would want to find the context from the leaf node, nor why you would box together the leaves from all context branches. The context is already known; you just need to find the corresponding symbol table. The dumb way to do that is to start from the tree root and choose a branch corresponding to the n'th symbol of the context on the n'th tree level. But the tree would store the pointer to the context node of 'cats' in the symbol descriptor for 's' of context 'cat'. So while finding the counter for symbol 's' in context 'cat', we'd also find the context node for the next symbol's context, 'cats'.

    > So you mean each node in Tree stores predictions+counts, and branch children? ....
    > Your tree is a tree made of contexts, which layer/s holds the predictions??
    > Which layer/s hold counts?

    Actually my tree (the one in tree.cpp) only has context counts. So using it, I need to access all children (only direct children) of a context to build its probability distribution. Usually symbol counters would instead be stored in the symbol table, along with the child pointers; that's faster, but also uses more memory.

    > If your tree is made of layers, which layers hold the counts for range encoding? Last layer? All layers?

    I'd say all "layers", but it's not "made of layers" (there's no easy way to access all symbols on a "layer"). My tree is made of context descriptors, which I call context nodes. Each context node contains a counter (the number of context instances seen before) and a symbol table with the addresses of the context nodes of one-symbol-longer contexts (a sketch follows this post). "tree.cpp" would visit all of these child contexts (up to 256 per context) to collect the counts for the probability distribution. The "counts for range encoding" are only computed (mixed) dynamically; they're not stored anywhere.
    88 replies | 2434 view(s)
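    A minimal sketch of the context-node layout Shelwien describes above; field and function names are assumptions, not tree.cpp's actual structures:

      #include <cstdint>
      #include <map>

      // Each context node stores its instance count and, per symbol, a count
      // plus a pointer to the context node of the one-symbol-longer context.
      struct ContextNode {
          uint32_t total;              // times this context was seen
          struct SymbolEntry {
              uint32_t count;          // occurrences of symbol in this context
              ContextNode *child;      // node for context + symbol
          };
          std::map<uint8_t, SymbolEntry> symbols;  // sparse symbol table
      };

      // Finding the counter for 's' in context "cat" also yields the node for
      // "cats", so the next symbol's context is reached without a root walk.
      ContextNode *Step(ContextNode *ctx, uint8_t sym) {
          ContextNode::SymbolEntry &e = ctx->symbols[sym];
          e.count++;
          ctx->total++;
          if (!e.child) e.child = new ContextNode{};
          return e.child;
      }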
  • Self_Recursive_Data's Avatar
    Yesterday, 22:01
    If your tree is made of layers, which layers hold the counts for range encoding? Last layer? All layers?
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 21:42
    Take a look again at my image: the letter in the input, 's', is the prediction; these are at the end of the tree, the leafs. The letters prior to 's' are the context; they are stored in the tree, and you can find it backwards and then grab predictions & counts. So you mean each node in the tree stores predictions+counts, and branch children? .... Your tree is a tree made of contexts; which layer(s) hold the predictions? Which layer(s) hold counts?
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    Yesterday, 21:37
    I don't understand your text very well, and understand your pictures even less.

    > Is your tree backwards like mine, ?

    green.cpp doesn't use any trees. tree.cpp has a tree with a symbol table in each node and up to 256 branches, which lead to current_context_string+branch_symbol context nodes.
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 21:28
    Can you see my tree image above? How should I draw it.... And where do I go, stop, and gather in it? Explain using terms like 'go to next layer', 'all or only 1 branch', 'grab counts', or 'skip layers using a pointer'.... Is your tree backwards like mine?
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    Yesterday, 21:25
    > What does "timer.inc" do in Green's folder? Time for what?

    It's used in this line:
    if( CheckTimer(500) ) printstats(i,inplen);
    to print stats during processing only twice per second (500 ms). You can remove "#include timer.inc", StartTimer() and the line above - it would still work, just without progress info (a sketch of such a helper follows this post).

    > Or are predictions kept in last layer leafs?

    green.cpp/tree.cpp use predictions from all context orders _at once_. They take probability distributions from all contexts and mix them. There's a PPM algorithm which first attempts to use only the distribution from the "last layer leaf"; that is not the algorithm used in green/tree.

    > so if you see a short order-4 you'll need to go to all children...32,000 sometimes.

    It happens in "green" - it would visit all order3 instances and collect stats for order3+ contexts. But currently it only visits the 8192 most recent instances (the DEPTH constant). And in "tree" this problem doesn't exist at all - it just works with up to 256 symbols in each context.
    88 replies | 2434 view(s)
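    A guess at what such a throttled-progress helper could look like; timer.inc's real implementation may differ:

      #include <chrono>

      // Returns true at most once per 'ms' milliseconds, so the caller can
      // print progress (e.g. printstats(i, inplen)) without flooding stdout.
      bool CheckTimer(int ms) {
          using clock = std::chrono::steady_clock;
          static clock::time_point last = clock::now();
          clock::time_point now = clock::now();
          if (now - last >= std::chrono::milliseconds(ms)) {
              last = now;
              return true;   // enough time has passed: print this time
          }
          return false;      // too soon: skip printing
      }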
  • Self_Recursive_Data's Avatar
    Yesterday, 21:24
    As you can see, the backwards tree looks most useful: it can back off from order-16 to lower orders without leaving the prior zone/context frontier. All predictions are on the other side, in the end leafs, with the counts there only, so if you see a short order-4 you'll need to go to all children... 32,000 sometimes. Help... maybe I got it wrong.
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 21:13
    What does "timer.inc" do in Green's folder? Time for what? Let's focus on a tree version here; So you start at the root and usually see 256 paths, right. Then you pick one, and usually see ex. 100 paths. It's getting thinner. It's 16 deep say. Are any of these early branch layer2/3 nodes where the predictions/counts are found? Or are predictions kept in last layer leafs? See my diagram below. https://ibb.co/nn4BJhw
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    Yesterday, 20:37
    https://github.com/michaelmaniscalco/m99
    http://nishi.dreamhosters.com/u/m99_v0.7z
    M99 with the source modified to compile on Windows. Seems to work.
    30 replies | 1052 view(s)
  • Shelwien's Avatar
    Yesterday, 20:00
    > Wait, if Green does order-16, why do you need order-15, 14, 13, etc?

    Because order-16 statistics only exist if you had at least one previous occurrence of the same 16-byte string. And even that is only enough if the current symbol already appeared in that context.

    > Mixing is for non-overlaps I think.
    > You just go down as far as a match you can and then choose a range.
    > One tree/list, grows as learns more.

    There's a PPM algorithm which starts from the max available context, then goes to lower orders until it is able to encode the current symbol: https://en.wikipedia.org/wiki/Prediction_by_partial_matching
    But PPM can be seen as an implicit CM with chained binary mixing and (1 - escape probability) as the context's weight (a sketch of this view follows this post).

    > Or are we mixing because of info uncertainty,
    > need a little stats of low order-2, order-6, order-4..as Matt said.

    There are cases where a higher-order context provides a worse prediction (e.g. suffixes in a sorted dictionary).

    > Say I pre-compute/store orders 0-2.
    > Now i have to run down tree to all children from a layer 3 node?
    > That's ~20,000 leafs.

    Normally we already had the order-(N-1) context for the previous symbol, and found the descriptor of the previous symbol in it. So we can store a pointer to the current order-N context (previous order-(N-1) + previous symbol) and find it just by reading a pointer. Otherwise we can start from the tree's root, find the 1st symbol, walk to the branch node, find the 2nd symbol... until we reach the descriptor of the symbol which is the last symbol of the context string. The number of children etc. only matters if you want to encode the tree instead of mixing the probability distributions for sequential coding of the symbols. That's possible too; I just posted an old coder which does that: https://encode.su/threads/3321-Bulat-s-CM-based-on-freqtable-encoding
    But we don't have an implementation with state-of-the-art compression for this approach (though it should be possible; some BWT coders are related).
    88 replies | 2434 view(s)
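    Not code from green or any PPM implementation - just a toy numeric illustration of the "PPM as implicit CM" view above, where each order's distribution gets weight (1 - escape probability) and the remainder is chained down to lower orders:

      // pSym[o]: probability of the current symbol in the order-o context,
      // pEsc[o]: escape probability of that context.
      double MixedP(const double pSym[], const double pEsc[], int maxOrder) {
          double p = 0.0, carry = 1.0;
          for (int o = maxOrder; o >= 0; o--) {
              p += carry * (1.0 - pEsc[o]) * pSym[o];  // weight of order o
              carry *= pEsc[o];        // weight passed down to lower orders
          }
          return p + carry / 256.0;    // order-(-1): uniform over 256 symbols
      }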
  • kaitz's Avatar
    Yesterday, 19:59
    Update to v16:
    - fixed JIT
    - updated main compression, now a bit better compression
    Binary only, source on GitHub. Only the pxv_VM version can decompress an archive produced by this binary; otherwise c/decompression is compatible. I think this adds no significant speedup, then again I can be wrong.
    Added bmp24 filter.
    25 replies | 4659 view(s)
  • Shelwien's Avatar
    Yesterday, 19:33
    https://freearc.dreamhosters.com/CM_v0.7z

    Order 8, count 10, encode on
    Parse Text: 100000000 chars, 43364400 contexts, 26173920 nodes, 1737681572 memory
    Cutoff Tree: 1566541 contexts, 2240084 nodes, 88664828 memory
    Encode Tree: 0(0) 0(1); 0i+205c=205; 5919956 bytes
    Encode Text: 111746826 codes, 19588443 empty codes, 37422165 null hints; 11746826 escapes, 12377166 first escapes, 27836343 bytes
    o8 - 27,836,358

    Order 8, count 10, encode on
    Decode Tree
    Decode Text: 100000000 chars
    a1fa5ffddb56f4953e226637dabbb36a *enwik8
    a1fa5ffddb56f4953e226637dabbb36a *unp
    0 replies | 109 view(s)
  • Shelwien's Avatar
    Yesterday, 18:31
    https://encode.su/threads/3315-enwik10-benchmark-results?p=63314&viewfull=1#post63314
    9 replies | 314 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 18:15
    I updated the link above; this one gives the strings, but there are some overlappers... But at least it is half-baked.
    871 replies | 409519 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 18:04
    I think I might face a speed issue. Say I pre-compute/store orders 0-2. Now I have to run down the tree to all children from a layer-3 node? That's ~20,000 leafs.
    88 replies | 2434 view(s)
  • cade's Avatar
    Yesterday, 18:02
    Thank you both for the thorough testing, very helpful. Found the problem: the step size for the incompressible check was getting too large and overflowed. It is now limited so it cannot overflow; tested, and it should work now. Updated the link in the first post. Do you have a link where I can download enwik10 compressed, please?
    9 replies | 314 view(s)
  • xezz's Avatar
    Yesterday, 17:33
    xezz replied to a thread Crush v1.3 in Data Compression
    Maybe a bad offset; it is larger than the current position.
    22 replies | 1286 view(s)
  • Sportman's Avatar
    Yesterday, 17:26
    enwik10:
    2,882,390,913 bytes, 760.516 sec. - 299.918 sec., nlzm -window:30 cf (compare ok)
    2,399,702,034 bytes, 3,394.103 sec. - 2,732.918 sec., nlzm -window:30 c (compare failed)
    9 replies | 314 view(s)
  • Mauro Vezzosi's Avatar
    Yesterday, 16:26
    > Could you share that file with me please

    Sorry, I assumed you knew; it's in the Maximum Compression corpus, downloadable here: https://www.maximumcompression.com/data/files/pdf-test.zip

    > is there a line number and error message?

    No, Windows goes into APPCRASH.

    Window bits: 16
    Calculating crc... 84E9D112
    State: 11 KB
    HR2: 12 KB
    HT3: 128 KB
    BT4: 1313 KB
    Window size: 133 KB
    Parser: 759 KB
    Working...

    The compressed file size is 425,984 (instead of > 3 MB).

    > It works if you don't specify window size?

    Yes, as it does with -window from 17 to 30.

    > Do you have this crash with other files too?

    I tested the 10 files of the Maximum Compression corpus with -window from 16 to 30: only -window:16 with FlashMX.pdf crashes.
    9 replies | 314 view(s)
  • cade's Avatar
    Yesterday, 14:35
    Could you share that file with me please? Is there a line number and error message? I'll test and fix. It works if you don't specify a window size? Do you have this crash with other files too?
    9 replies | 314 view(s)
  • Sportman's Avatar
    Yesterday, 12:02
    Is there a Windows binary? I could only find an old 32-bit m99 binary (3/26/2007), but that one crashes at start.
    30 replies | 1052 view(s)
  • Mauro Vezzosi's Avatar
    Yesterday, 11:57
    Sorry, you are correct: I wrote "-window:bits=...", not "-window:...". Now only "nlzm -window:16 c FlashMX.pdf FlashMX.pdf.nlzm" crashes.
    9 replies | 314 view(s)
  • Sportman's Avatar
    Yesterday, 11:52
    Added. Up to -q 3 no effect, -q 4 a negative effect, -q 5 to -q 9 a small positive effect, -q 10 and -q 11 a big positive effect.
    30 replies | 1052 view(s)
  • cade's Avatar
    Yesterday, 11:42
    Oops, you're correct, typo in the help message. Done and updated. The correct syntax is: nlzm -window:27 c enwik8.txt out8 or nlzm c
    9 replies | 314 view(s)
  • Mauro Vezzosi's Avatar
    Yesterday, 10:42
    "nlzm c FlashMX.pdf FlashMX.pdf.nlzm" is ok, "nlzm -window:bits=... c FlashMX.pdf FlashMX.pdf.nlzm" crash for any window:bits (even for the default 20). ​We should write " c ", not "c ", right?
    9 replies | 314 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 06:18
    Wait, if Green does order-16, why do you need order-15, 14, 13, etc? Mixing is for non-overlaps, I think. You just go down as far a match as you can and then choose a range. One tree/list, growing as it learns more. Or are we mixing because of info uncertainty, needing a few stats from low order-2, order-6, order-4... as Matt said?
    88 replies | 2434 view(s)
  • cade's Avatar
    Yesterday, 06:00
    This is NLZM (not-really-LZMA LZM), a general compressor with semi-optimal parsing, LZ (up to a 1 GB window), o0 + o1 CM, and simple probability counters for decisions. I started writing an optimal LZ + ari, then switched decisions to a separate probability counter (i.e. PPM escape vs. encode decision) instead of a single alphabet for literals + log2 lengths. For distances >= 4 bits, the lower 4 bits are also entropy coded. The current state is determined by the two most recent decisions (literal vs. match), and there is a simple rep0. I'll clean up the source (a jumble of files reused by different experiments), check/fix any bugs, then make it open source. The attached exe is 64-bit, compiled with clang++ 9.0.0 on Windows 8.1.

    Results (c to compress, cf to compress fast, d to decompress, t to test in-memory decompression and CRC32):
    nlzm -window:27 c enwik8.txt out8: 100000000 -> 25116875 in 102.080 sec
    nlzm -window:30 c enwik9 out9: 1000000000 -> 202561110 in 10467.483 sec
    nlzm -window:27 c enwik9 out9_27: 1000000000 -> 211524600 in 1427.124 sec
    nlzm t out8: 25116875 in 1.637 sec
    nlzm t out9_27: 211524600 in 15.529 sec
    nlzm -window:27 cf enwik8.txt out8f: 100000000 -> 33460723 in 9.730 sec
    nlzm -window:30 cf enwik9 out9f: 1000000000 -> 293022504 in 89.565 sec

    Semi-optimal parsing is done by finding all matches (exhaustively) for the block size, then finding (but not updating search for) matches for block size + nice length, then tracing the minimum cost from block size + nice length back to the block start. Not terminating matches at the block size allows some (better) parsing decisions when matches could cross block boundaries. The block start is always aligned to the block size so that the match finder is updated with every position. When a decision table is calculated for a block, the current costs of encoding (literal cost, match cost, etc.) are estimated (log2 bps, ~1-3% error compared to the final compressed file size with 8k blocks) and used for path finding.

    There are two optimizations for semi-optimal parsing:
    - In the match finder (forward pass), if a sufficiently long match is found, the next position will try to continue that match or reduce its length by one, and this ends when the length is less than an escape threshold, to not spend extra time in the binary tree match finder.
    - In the path finder (reverse pass), a maximum of 8 different lengths forward (a step size of (max length - min length) / 8) are tested, to not waste time exhausting all the options when a few represent most of the optimization.

    To speed up passing incompressible data, if no matches are found for N bytes, then every 2nd byte is tested, then every 4th, 8th, until another match is found and the step size is decreased again (a sketch of this follows the post).

    Speed drops off rapidly with larger dictionary sizes and semi-optimal parsing due to the exhaustive match finder. Most of my target files are plain-text natural-language texts and structured binary data (sensor data, rows and hierarchies of floats, doubles, ints, repetitive strings and labels) between 300 KB and 200 MB, compressed once and decompressed often. The default window size is 1 MB, and default memory usage during decompression is not significantly more, because my target environment only has ~4-10 MB RAM available for decompression and ~20-32 MB for compression. Memory for compression is 12N bytes (optimal) + IO overhead, or 17 MB (fast) + N + IO overhead.

    Disclaimer: Not significantly optimized for speed. Does not compress all strings of N bytes to N - 1 bytes, nor N / 2 bytes. Did not inspire Star Wars. Limited colour scheme options. If disassembled, please provide source with optimization :D and improvements.
    9 replies | 314 view(s)
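    A sketch of the incompressible-data skip described in the post above; the names and the cap value are assumptions, not NLZM source (the overflow cade later fixed was in the uncapped growth of this step):

      #include <cstdint>

      // Double the probe stride while nothing matches, capped so it cannot
      // overflow, and shrink back to byte-by-byte once a match appears.
      int64_t NextProbePos(int64_t pos, int64_t &step, bool matched) {
          const int64_t kMaxStep = 1 << 12;   // assumed cap
          if (matched)
              step = 1;                       // back to testing every byte
          else if (step < kMaxStep)
              step *= 2;                      // every 2nd, 4th, 8th byte...
          return pos + step;
      }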
  • michael maniscalco's Avatar
    Yesterday, 05:58
    I tarred the 10 files together, so there's a bit more overhead in my test. M99 is still a good competitor for compression vs. speed as well. :cool:

    m99 e enwik10 enwik10.m99 -b1000000000
    compressed: 10000015360 -> 1720236193 bytes. ratio = 17.2023%
    Elapsed time: 396.497 seconds : 24.0525 MB/sec
    m99 d enwik10.m99 enwik10
    Elapsed time: 236.818 seconds

    lscpu:
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 2
    Core(s) per socket: 6
    Socket(s): 1
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 158
    Model name: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    30 replies | 1052 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 04:51
    Let me make sure the code works first... wait, something seems off. EDIT: Updated the code.
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    Yesterday, 04:28
    https://paste.ee/p/qQroG
    I implemented a 'best string finder', I think. It stores the possible sequences in a tree with counts, so it can efficiently find each string and its value. I left out the refining of the sorted list, but it's good enough to present. The value formula: a string's count * the letters in the string gives the savings; the letters in the string + the count * 2 gives the cost to store the string/say it using Huffman (2 bytes (16 bits) per occurrence if there are 64,000 strings; currently manual adjusting is required). With the cost and savings (they are printed, see code) you divide the savings by the cost to get the ratio (a worked example follows this post). The code then sorts them from greatest to least value. The first one below is huge: from 1 million bytes fed in, its ratio is 21.5! You save/'store' 5445 bytes for just ~253 bytes.
    Updated example: Kennon's attempt to find the 'best strings' that are long and frequent is the same thing as Shelwien's Green algorithm (https://encode.su/threads/541-Simple...xt-mixing-demo); Shelwien's learns most of its context strings online. Notice that as Shelwien's tree/lists grow after eating what it generates, they get longer branches with more stats on them - yup, long frequent strings - so when it steps on 16 bytes it basically sends them to the tree and finds the match, and instead of predicting whole strings it predicts just the next byte. Both algorithms get wiki8 (100 MB) to ~21 MB. Shelwien's can store 16-byte-long contexts (costs ~2 GB RAM), so I think it can do ~150-byte masks? Enough to cover/detect most duplicates. For longer dups you can easily find a way to add them.
    871 replies | 409519 view(s)
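    A quick check of the value formula above, with L and C back-solved from the quoted 5445 / ~253 / 21.5 figures (the actual string and counts are not in the post, so these numbers are illustrative):

      #include <cstdio>

      int main() {
          // Assumed reading: a string of length L seen C times saves C*L
          // bytes and costs L bytes to store plus a 2-byte code per use.
          int L = 55, C = 99;
          int savings = C * L;                               // 5445 bytes
          int cost    = L + 2 * C;                           // 253 bytes
          printf("ratio = %.2f\n", double(savings) / cost);  // prints 21.52
          return 0;
      }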
  • Shelwien's Avatar
    Yesterday, 01:54
    If you mean new symbols, there's a so-called order-(-1), which contains all symbols with equal probability. In green it's represented by this (distribution init before mixing):
    for( j=0; j<CNUM; j++ ) freq[j]=1; // order-(-1)
    So for coding, every symbol would have a non-zero probability, although it's much less than 1%.
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    Yesterday, 01:45
    I made and uploaded enwik10, please check: https://mega.nz/#!9f4DxYJR!wgX4bEvR96sPCcVLWu05TUA3WmgM9ZIQ9jyTo1QJ0Hg
    Size: 1,872,327,733 // 7z a -mx=9 -mf=off -ms=1t -m0=lzma:d=1536m:lc=8:lp=0:pb=0:fb=273 enwik10.7z enwik10
    30 replies | 1052 view(s)
  • Self_Recursive_Data's Avatar
    22nd January 2020, 21:44
    Last question. For information uncertainty, when you have not seen the whole dataset (doing Online Learning), you can go to a node and find the prediction has not been seen there even once. Do you fall back an order? Or do you automatically give everything e.g. a 1% range slice?
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    22nd January 2020, 21:07
    > Now the 17 models/masks, I imagine you store them in 1 tree for tree version?

    Either way is possible. For example, it may be easier to look up each context in a hashtable independently. In tree.cpp's case, it has a context node with a table of pointers (per symbol), which point either to the text buffer (if the symbol only occurred once in this context) or to the node of the higher-order context.

    > Which way do you get the leaf predictions for order-2?

    tree.cpp uses the same context nodes for everything, including order0. But normally it's easier to keep up to order2 in static tables (it's just 64Mb; see the arithmetic after this post).
    88 replies | 2434 view(s)
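    The "64Mb" figure checks out if the order-2 table is read as 256^2 contexts x 256 next symbols x 4-byte counters - one plausible layout, not necessarily tree.cpp's:

      #include <cstdint>

      // Static count tables for orders 0..2; the order-2 table dominates:
      // 256*256 contexts * 256 symbols * 4 bytes = 2^26 bytes = 64 MiB.
      static uint32_t o0[256];            //   1 KiB: count per symbol
      static uint32_t o1[256][256];       // 256 KiB: [prev byte][symbol]
      static uint32_t o2[256][256][256];  //  64 MiB: [2-byte ctx][symbol]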
  • Self_Recursive_Data's Avatar
    22nd January 2020, 20:44
    Now, the 17 models/masks - I imagine you store them in 1 tree for the tree version, instead of 17 dedicated trees? So at layer 5 for order-5 you have the predictions? Are these predictions "layer 6"? Or their own leaf stash inside layer 5? Which way do you get the leaf predictions for order-2? https://ibb.co/GsnNYn2
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    22nd January 2020, 20:32
    > they'd be perfectly mixed (by just adding each 3 & dividing by 3)

    There's nothing perfect about contexts having equal weight. For example, an order5 context "perfe" would predict the next symbol to be 'c' with almost 100% probability, while "erfe" can already be a part of "interference" and some other words. In such a case it's best for a weight function to assign a high weight to order5 and much lower weights to the lower orders.
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    22nd January 2020, 18:56
    Let me explain then. A context model of order-3 has seen every offset of 3 letters in the file (or some of the file, if Online Learning is implemented). So it has higher counts/probabilities for its candidate predictions 1/0 (or a, b, c.....z....`, !, ", $). Most of the time its best prediction is right and is given a smaller code by the AC. If we want to mix order-3, order-4 and order-5, the three models may differ in relativity: model 1 may think a=50 counts, b=25 counts, while model 2 may think a=480, b=200. Clearly they are similar, but they are scaled differently, or else they'd be perfectly mixed (by just adding all 3 & dividing by 3) and be like a standalone model with no mixing. If normalization/re-scaling is the issue, this is simple to fix, I guess. But if not just this is the issue, then there are other magical mysteries. Possibly the models themselves are only sort of perfect if doing Online Learning, are scaled differently, and each model is like a flavor, where a higher-order/more recent model is more accurate/overpowering while having fewer stats seen. Any more differences between 2 given models with varying size/position (ex. order-3 and order-4)?
    88 replies | 2434 view(s)
  • suryakandau@yahoo.co.id's Avatar
    22nd January 2020, 17:22
    This is the Silesia benchmark result using Crush v1.9; now it can beat the nci file besides the osdb and webster files. BTW @xezz, what is meant by "file corrupted: s=-2032"?
    22 replies | 1286 view(s)
  • dnd's Avatar
    22nd January 2020, 16:08
    Turbo-Range-Coder update:
    - added CDF interleaved range coder + improved symbol search
    Note that all the CDF static coders here can be used with adaptive coding by connecting a CDF predictor. For large alphabets (ex. 256), there is the possibility of adding fast table-based decoding with "division" or "division lut" (instead of the symbol search), but this will not work with adaptive coding. For large alphabets with static or semi-static (periodic rescale) coding, the compression ratio will be so much worse that it is better to use huffman coding or, better yet, block-based ANS.

    TurboBench Compression Benchmark - Skylake 3.4 GHz
    ./turborc -e40,41,42,43,44,45 -t enwik9
     E MB/s       size  ratio%  D MB/s  function
      82.18  475135200  47.51%   69.17  40-rc4s    bitwise adaptive o0 simple
      92.18  500000004  50.00%   72.74  41-rc4s    bitwise static o0 simple
     425.47  478499608  47.85%   85.62  42-rccdfsb CDF static search
     425.47  478499608  47.85%   50.93  43-rccdfsv CDF static division
     358.50  482541082  48.25%   70.01  44-rccdfst CDF static division lut
     491.94  478499612  47.85%  100.68  45-rccdfsb CDF static search interleaved

    ./turborc -e40,41,42,43,44,45 -t enwik9bwt
     E MB/s       size  ratio%  D MB/s  function
     138.29  156109312  15.61%  122.36  40-rc4s    bitwise adaptive o0 simple
     139.34  500000008  50.00%  127.62  41-rc4s    bitwise static o0 simple
     457.02  478499608  47.85%  132.32  42-rccdfsb CDF static search
     456.71  478499608  47.85%   69.24  43-rccdfsv CDF static division
     420.97  483079172  48.31%  119.03  44-rccdfst CDF static division lut
     548.13  478499616  47.85%  171.40  45-rccdfsb CDF static search interleaved

    In this speed benchmark, only the low nibbles of each byte are encoded/decoded.
    23 replies | 1921 view(s)
  • AcidBurn's Avatar
    22nd January 2020, 13:55
    AcidBurn replied to a thread FileOptimizer in Data Compression
    nikkho, in FileOptimizer 14.10.2534 the Plugins32 folder contains 64-bit applications: jpeg-recompress.exe, pngquant.exe, gifsicle.exe. Please fix this.
    661 replies | 191792 view(s)
  • Shelwien's Avatar
    22nd January 2020, 09:34
    > Can you explain why the mixing formula 'add and divide' is not as good?

    There's no specific reason. It's possible to generate a file where it would be better than green's weighting function. Normalized weights are the probabilities of each context predicting the highest probability for the current symbol.

    21,094,300 // w=coef/(1+coef+total/(coef+1))
    22,664,336 // w=8192/total (probability distributions with equal weight)
    61,654,200 // w=1 (symbol counts with equal weight)

    Ideally we need to build a proper model of prediction successes (a sketch of the heuristic weighting follows this post). It's possible, but the program for that would be much more complicated than "green", since it would require storing extra statistics for the weight model, working with probabilities in the fixed-point/logistic domain, etc. That kind of thing is also much easier to do for a bitwise model than a bytewise one.
    88 replies | 2434 view(s)
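    A rough sketch of how such a heuristic weighting could be applied when mixing per-order distributions; the weight formula is the one quoted above, everything else (names, calling convention) is assumed:

      // mixed[] must be zeroed by the caller; dist[o][] are normalized
      // per-order distributions, total[o] the order-o context's occurrence
      // count. More occurrences -> lower weight, on the presumption that
      // relevant higher-order contexts exist.
      void MixOrders(double mixed[256], const double dist[][256],
                     const double total[], const double coef[], int maxOrder) {
          double wsum = 0;
          for (int o = 0; o <= maxOrder; o++) {
              if (total[o] == 0) continue;  // context never seen: skip
              double w = coef[o] / (1 + coef[o] + total[o] / (coef[o] + 1));
              for (int s = 0; s < 256; s++) mixed[s] += w * dist[o][s];
              wsum += w;
          }
          for (int s = 0; s < 256; s++) mixed[s] /= wsum;  // renormalize
      }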
  • Self_Recursive_Data's Avatar
    22nd January 2020, 08:49
    I added one question while you were typing. Lastly, let's try something new: can you explain why the mixing formula 'add and divide' is not as good? For example, are the 17 models not normalized as well, and is that the major cause? Try to flesh this out for me in plain English, explaining the issues/idea and what is being mixed together. Use analogies. The more context the better; I will grasp it solidly, no doubt.
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    22nd January 2020, 08:48
    > "no dynamic mixing or adaptive counters are used there." > From: > https://encode.su/threads/541-Simple...xt-mixing-demo Well, the meaning here is different. "Adaptive counter" is something like this: https://github.com/Shelwien/mod_CM/blob/master/CM/counter.inc While green just uses simple numbers of occurrences. Same with mixing - good mixer adapts to actual performance of the submodels, and green just uses a heuristic weight function which assigns lower weights to contexts with more occurrences (presuming that there'd be more relevant higher-order contexts). Green's model overall is adaptive (its certainly not static), just that there're better models. > I'm assuming you mean the 16 models have their own counters. > And order0-2 are static, order3-16 updates as decompression runs? They are all updated. Statistics are not stored in compressed file, its all adaptive. > am I supposed to store in ex. a tree every 16 bytes the mask moves along? Not quite like that, but with a tree, yes, you'd need to create a new context node for most of symbols of the file. On other hand, green.cpp only allocates 5*filesize for file data and order3 chains. If you're concerned about memory usage, there're all kinds of tricks For example, bwmonstr manages to compress files with only 1.25*filesize memory usage: http://mattmahoney.net/dc/text.html#1605 > That's 100,000,000 steps the 16-byte-mask makes. 16*100,000,000=~1,000,000,000 nodes! Yes, and? There're 1,073,741,824 bytes in a gigabyte, and most computers now have 4GB+ memory? Though its certainly a good reason to not use python for data compression. > A billion...Am I supposed to erase some if not enough statistics are saw in 10KB? There're all kinds of workarounds for that. 1) "Sliding window" - only keep recent N bytes indexed, remove contexts that go out of window. 2) Restart (ppmd -r0) - once memory pool fills up, reset the model and continue with empty tree. 3) Restart and retrain (ash /w) - reset and reindex N% of known data 4) Tree pruning (ppmd -r1) - delete less probable contexts to free memory. 5) Freezing (ppmd -r2) - just stop adding new contexts when there's no more memory There're also simply different data structures like BWT Suffix Array, hashtables, bloom filters... In any case, for enwik8 it should be possible to fit in 1GB with a more compact tree. The tree in tree.cpp just uses 32-bit fields for every symbol, but we could reduce the precision and go with 16-bit or even 8-bit (like paq) counters.
    88 replies | 2434 view(s)
  • Jarek's Avatar
    22nd January 2020, 08:07
    Quantum mechanics is equivalent to Feynman path ensembles {gamma(time)}, but there are also the basic ensembles considered in condensed matter, e.g. the Ising model, which assume a Boltzmann sequence ensemble, e.g. of spins {gamma(x)}: https://en.wikipedia.org/wiki/Ising_model
    The latter ensemble can be seen as MERW: https://en.wikipedia.org/wiki/Maximal_entropy_random_walk
    These two ensembles are different ("Wick rotated"), but have many common features, like the localization property (standard diffusion says the stationary probability distribution is rho=1, while QM and MERW say rho~sin^2). As MERW has the Born rule (rho = psi^2), it seems we can e.g. violate Bell inequalities with the Ising model: https://physics.stackexchange.com/questions/524856/violation-of-bell-like-inequalities-with-spatial-boltzmann-path-ensemble-ising
    The construction of the Bell violation can be extended further to quantum-like computers realized as Ising models, e.g. by somehow printing on a surface the conditions for solving a given problem through the Ising model's Boltzmann distribution among possible sequences. Such Wick-rotated quantum gates seem a bit weaker computationally, but the spatial realization allows fixing amplitudes from both directions, left and right, which seems(?) to allow quickly solving NP-complete problems like 3-SAT (end of https://arxiv.org/pdf/1912.13300 )... Stack: https://physics.stackexchange.com/questions/526439/wick-rotated-quantum-computers-e-g-to-be-realized-with-ising-like-systems
    37 replies | 32160 view(s)
  • Self_Recursive_Data's Avatar
    22nd January 2020, 08:03
    That clears more up, thanks.
    "no dynamic mixing or adaptive counters are used there."
    From: https://encode.su/threads/541-Simple-bytewise-context-mixing-demo
    I'm assuming you mean the 16 models have their own counters, and order0-2 are static while order3-16 update as decompression runs?
    As it trains itself on wiki8 with its order-16 mask, am I supposed to store in e.g. a tree every 16 bytes the mask moves along? That's 100,000,000 steps the 16-byte mask makes. 16*100,000,000 = ~1,000,000,000 nodes! A billion... Am I supposed to erase some if not enough statistics are seen in 10KB? Not often do the same 16 letters/bytes appear in wiki8... So the leafs in the tree are ~a billion...
    88 replies | 2434 view(s)
  • Shelwien's Avatar
    22nd January 2020, 07:35
    > 1) You said Green isn't adaptive,
    > but in another post say it dynamically grows depth
    > to order3+ 'as' it regenerates text
    > (Online Learning).

    I don't remember saying these things, and google doesn't either.
    a) green.cpp doesn't use dynamic memory allocation, but it uses different probability distributions on each context occurrence (and each byte of input data), so it's certainly adaptive. I also explicitly said that its weighting function is adaptive, so I can't easily visualize it.
    b) "DEPTH" is a constant in green.cpp which determines the max number of most recent order3 context occurrences it can visit to rebuild the order3+ probability distributions.
    c) from the first byte actually processed by the model (green.cpp copies the first 3 bytes as is), green.cpp does access symbol counts for all contexts from 0 to MAXORDER, although contexts with total=0 are skipped in mixing.

    > 2) Does Green predict the next byte or next bit?

    green.cpp always works with byte distributions and encodes/decodes whole bytes. But it is possible to convert it to a bitwise coder using some parts from mod_ppmd.

    > 3) So to predict the next byte/bit your algorithm takes the previous 16 masks/contexts,
    > finds them in tree/ linked lists as EXACT MATCHES, then does a nice mixing formula and Arithmetic Coding?

    Kind of, except it has static tables for order0..2 symbol counts.

    > 4) By the time Green finishes the 100MB input, how many letters are stored in Tree?
    > 16*100,000,000? I see my RAM goes up by ~2GB. I don't understand "Log 2N"
    > so I won't understand if you use those terms.

    If you mean tree.cpp, it creates about 97M context nodes for enwik8. They are variable-size though (min 12 bytes, plus 4 bytes per each new symbol in a context), which is why it hits 2G. Unfortunately it's not x64-aware, but with order15 it fits in 2G and compresses enwik8 to 21,094,300 (and decompresses it correctly). (Like "tree.exe 15 enwik8 output".)
    88 replies | 2434 view(s)
  • Self_Recursive_Data's Avatar
    22nd January 2020, 02:43
    Trying to re-implement Green. Please answer the below with care:
    1) You said Green isn't adaptive, but in another post you say it dynamically grows depth to order3+ 'as' it regenerates text (Online Learning).
    2) Does Green predict the next byte or the next bit?
    3) So to predict the next byte/bit, your algorithm takes the previous 16 masks/contexts, finds them in the tree/linked lists as EXACT MATCHES, then does a nice mixing formula and Arithmetic Coding?
    4) By the time Green finishes the 100MB input, how many letters are stored in the Tree? 16*100,000,000? I see my RAM goes up by ~2GB. I don't understand "Log 2N", so I won't understand if you use those terms.
    88 replies | 2434 view(s)
  • Jyrki Alakuijala's Avatar
    22nd January 2020, 01:17
    Sad to see no large-window results for brotli. People get confused when small-window (practical) brotli is compared with huge window sizes in other algorithms. Try --large_window 30 (an example invocation follows this post).
    30 replies | 1052 view(s)
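    For reference, the flag is passed like this on the brotli command line; the decoder has to be told about the extended window too, since large-window streams are a format extension (spelling per the brotli CLI - double-check against your build):

      brotli -q 11 --large_window=30 -o enwik10.br enwik10
      brotli -d --large_window=30 -o enwik10.out enwik10.br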
  • Cristo's Avatar
    21st January 2020, 20:23
    That's some interesting info for me there, and yes, there was a miscalculation; that places the expected output at this point at ~40Mb. I will try to work it a bit more from here. Thanks.
    7 replies | 256 view(s)
  • Shelwien's Avatar
    21st January 2020, 19:51
    >> I tried to demonstrate that the permutation index is too large and mostly incompressible

    > As I estimated, the perm number should be around 13M! ~= 87 Mb of digits,
    > that means using 10 out of 256 symbols in a byte, which we can code in binary as 3,5Mb

    Ah, so that's where you made a mistake:
    https://www.wolframalpha.com/input/?i=IntegerPart%5BLog%5B256.%2C13000000%21%5D%5D
    "87 Mb of digits" is actually about right:
    https://www.wolframalpha.com/input/?i=IntegerPart%5BLog%5B10.%2C13000000%21%5D%5D
    But a decimal digit requires ~0.415 of a byte to encode, so 3.5Mb binary is too little (see the check after this post).

    > Ideally it should already be calculated as a 3.5Mb-precision binary number, but I can't implement that at my level of math/programming.

    Well, it's not a problem in C/C++; I already did it in stegdict, and also, kind of, here: https://encode.su/threads/3122-dec2bin-converter-for-Nelson-s-million-digits-file

    > Could you define DRT dict?

    It should be there... read the readme though. http://nishi.dreamhosters.com/u/lpaq9m.rar
    Or you can use cmix -s, it's similar: http://www.byronknoll.com/cmix.html
    7 replies | 256 view(s)
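    A quick numeric check of both estimates via lgamma (which gives ln(n!) without overflow): it reproduces the ~87M decimal digits and shows the binary size is ~36 MB, an order of magnitude above 3.5 MB:

      #include <cmath>
      #include <cstdio>

      int main() {
          double n = 13000000.0;
          double ln_fact = std::lgamma(n + 1);  // ln(13000000!)
          printf("decimal digits: %.0f\n", ln_fact / std::log(10.0));   // ~86.8M
          printf("binary bytes:   %.0f\n", ln_fact / std::log(256.0));  // ~36.1M
          return 0;
      }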