Activity Stream

  • Jyrki Alakuijala's Avatar
    Today, 07:53
    Perhaps you are more focused on aesthetics and elegance than efficiency. Efficiency is something that can be measured in a benchmark, not by reasoning. As an example, when I played with dictionary generation (both zstd --train and shared brotli), occasionally I found that taking 10 random samples of the data and picking the best sample as a dictionary turned out more efficient than running either of the more expensive dictionary extraction algorithms. Other times, concatenating 10 random samples was a decent strategy. Thorough thinking, logic and beauty do not necessarily 'win' the dictionary efficiency game. Depending on how well the integration of a shared dictionary has been done, different 'rotting' times can be observed. SDCH dictionaries were rotting within 3-6 months into being mostly useless or already slightly harmful; with brotli dictionaries we barely see rot at all. Zstd dictionary use -- while less efficient than shared-brotli-style dictionary coding -- also likely rots much slower than SDCH dictionaries. This is because SDCH used to mix the entropy originating from dictionary use with the literals in the data, and then hope that a general purpose compressor could make sense out of this bit porridge. IMHO, we could come up with a unified way to do shared dictionaries and use it across the simple compressors (like zstd and brotli).
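    A minimal sketch of the "best of N random samples" strategy described above, assuming the stable single-shot zstd dictionary API (ZSTD_compress_usingDict); sample loading, error handling and the concatenation variant are left out, and the function names are illustrative:
        #include <zstd.h>
        #include <cstdint>
        #include <string>
        #include <vector>

        // Total compressed size of a held-out test set when 'dict' is used as the
        // shared dictionary. Smaller is better.
        size_t totalCompressedSize(ZSTD_CCtx* cctx, const std::vector<std::string>& testSet,
                                   const std::string& dict, int level) {
            size_t total = 0;
            std::vector<char> dst;
            for (const std::string& f : testSet) {
                dst.resize(ZSTD_compressBound(f.size()));
                size_t r = ZSTD_compress_usingDict(cctx, dst.data(), dst.size(),
                                                   f.data(), f.size(),
                                                   dict.data(), dict.size(), level);
                if (!ZSTD_isError(r)) total += r;
            }
            return total;
        }

        // Pick whichever candidate sample compresses the test set best when used
        // directly as the dictionary - the cheap alternative to a trainer.
        std::string pickBestSampleDictionary(const std::vector<std::string>& candidates,
                                             const std::vector<std::string>& testSet,
                                             int level = 19) {
            ZSTD_CCtx* cctx = ZSTD_createCCtx();
            size_t best = SIZE_MAX;
            std::string bestDict;
            for (const std::string& cand : candidates) {   // e.g. 10 random samples
                size_t sz = totalCompressedSize(cctx, testSet, cand, level);
                if (sz < best) { best = sz; bestDict = cand; }
            }
            ZSTD_freeCCtx(cctx);
            return bestDict;
        }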
    399 replies | 124548 view(s)
  • xinix's Avatar
    Today, 07:24
    xinix replied to a thread Paq8pxd dict in Data Compression
    I can compile it. So, can you tell me which part of the code to change? _____ Perhaps we can simplify the task. pxd_v79 does not preprocess my file: segmentation outputs bintext, and then it does not preprocess. If it can ignore that and apply the preprocessing anyway, then I won't need an external dictionary. Thanks!
    770 replies | 289349 view(s)
  • xinix's Avatar
    Today, 07:20
    You must not use "C9" and you must not use "C"; you must use lowercase "c". And do not try to use phda9dec.exe on Windows - phda9 only works under Linux. A file created under Linux will not be unpacked under Windows. Also, because phda is based on floats, it is only possible to pack and unpack on the same processor/Linux setup.
    90 replies | 30901 view(s)
  • Alexander Rhatushnyak's Avatar
    Today, 07:14
    C9 is only for enwik9. C is for other files, but they can't be as big as enwik9. There's something in readme.txt about sizes. Sorry, it looks like this year I won't have more than 5 minutes per month for this.
    90 replies | 30901 view(s)
  • ShihYat's Avatar
    Today, 06:16
    Same problem here as above. When I use 'C' on other large enwik-like text files, it returns a segmentation fault at varying percentages; likewise, using 'C' instead of 'C9' on enwik9 faults at 91%.
    90 replies | 30901 view(s)
  • byronknoll's Avatar
    Today, 00:11
    byronknoll replied to a thread cmix in Data Compression
    I didn't use your binary. Here are some suggestions that might help:
    - change the compiler flag from -Ofast to -O3. The binary will be slower, but it might fix the issue you are seeing.
    - upgrade your compiler to a more recent version.
    - change to a different compiler - I recommend clang.
    449 replies | 110319 view(s)
  • LucaBiondi's Avatar
    Yesterday, 17:01
    LucaBiondi replied to a thread Paq8pxd dict in Data Compression
    Hi Guys! I hope you are all fine! I have two similar XML files. I found that one is detected as default while the other is detected as text:
     0 |default | 1 | 14404734
    18 |text    | 1 | 11449237
    Why? What should I verify inside my files? This is the log:
    C:\Compression\paq8pxd>paq8pxd_v76_AVX2 -x12 testset_xml c:\compression\Xml_testset\*.xml
    Slow mode
    FileDisk: unable to open file (No such file or directory)
    Creating archive testset_xml.paq8pxd76 with 2 file(s)...
    File list (76 bytes)
    Compressed from 76 to 58 bytes.
    1/2 Filename: c:/compression/Xml_testset/12_A_20060313194711.xml (11449237 bytes)
    Block segmentation:
    0 | text | 11449237
    2/2 Filename: c:/compression/Xml_testset/AS_B_20150226093210.xml (14404734 bytes)
    Block segmentation:
    0 | default | 14404734
    Segment data size: 27 bytes
    TN |Type name |Count |Total size
    -----------------------------------------
     0 |default   |    1 | 14404734
    18 |text      |    1 | 11449237
    -----------------------------------------
    Total level 0      |    2 | 25853971
    default stream(0). Total 14404734
    bigtext wrt stream(10). Total 8167896
    Stream(0) compressed from 14404734 to 133667 bytes
    WRT dict count 811 words.
    WRT dict online.
    Stream(10) compressed from 8167896 to 284267 bytes
    Segment data compressed from 27 to 18 bytes
    Total 25853971 bytes compressed to 418059 bytes.
    Time 4365.32 sec, used 15766 MB (3646968591 bytes) of memory
    Thank you!!!!
    Luca
    770 replies | 289349 view(s)
  • Marco_B's Avatar
    Yesterday, 17:00
    > You say "We update it before to extract any information", but normally we have to decode a symbol
    > first to update anything.
    Yes, I do it to take the posterior estimation into account; the check of the reversibility of the matrix in the secondary prediction should assure that the decoder can stay in lockstep. Obviously it is possible that I am disregarding something.
    > In any case, SEE/SSE is about using a probability as context.
    I will try to re-read with this in mind.
    189 replies | 8825 view(s)
  • xezz's Avatar
    Yesterday, 16:04
    Removed cidx. Speed and ratio are sometimes better than v4.
    13 replies | 27444 view(s)
  • SolidComp's Avatar
    Yesterday, 16:01
    SolidComp replied to a thread Zstandard in Data Compression
    Jyrki, everything you said is correct, but you're missing something. One of my core ideas here is a prospective dictionary. Your dictionary is retrospective — it was generated from old or existing HTML source. As a result, it doesn't support modern and near-future HTML source very well. A prospective dictionary has two advantages:
    1. Efficient encoding of modern and foreseeable near-future syntax.
    2. The ability to shape and influence syntax conventions as publishers optimize for the dictionary – a feedback loop.
    In reality, a prospective dictionary of the sort I'm advocating would be a combination of prospective and retrospective. The tests you're asking for are nearly impossible, since by definition a prospective dictionary isn't based on existing source files, but rather is intended to standardize and shape future source files. The kinds of syntax a good prospective dictionary could include are things like:
    <meta name="viewport" content="width=device-width"> (an increasingly common bit of code)
    <link rel="dns-prefetch" href="https://
    <link rel="preconnect" href="https://
    <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://
    These are all nice and chunky strings, and they all represent syntax that could be standardized for the next decade or so. If they were present in the brotli or Zstd dictionary, publishers would then optimize their source to include these strings in this exact standardized form. What I mean is having the rel before the href and so forth. Note that these are just a few examples. A dictionary built this way would be a lot more efficient than the status quo.
    399 replies | 124548 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 14:45
    My assumption is that you compiled the source code on Linux... The question is why the same source code, compiled with a different compiler, can cause the decompressed file to be corrupted :confused:
    449 replies | 110319 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 11:59
    Did you use my binary compiled with Dev-C++?
    449 replies | 110319 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 11:25
    How do I use Colab?
    449 replies | 110319 view(s)
  • Sportman's Avatar
    Yesterday, 09:43
    Sportman replied to a thread 2019-nCoV in The Off-Topic Lounge
    Only 3 months ago: https://promedmail.org/promed-post/?id=20191230.6864153 https://www.scmp.com/news/china/politics/article/3044050/mystery-illness-hits-chinas-wuhan-city-nearly-30-hospitalised https://www.scmp.com/news/china/politics/article/3044207/china-shuts-seafood-market-linked-mystery-viral-pneumonia
    35 replies | 2060 view(s)
  • byronknoll's Avatar
    Yesterday, 08:53
    byronknoll replied to a thread cmix in Data Compression
    It works for me. Here is a Colab where you can see it compress+decompress to the same md5: https://colab.research.google.com/drive/1G4JU8QRwKZGh5YfbbheX-ZSjMgF29Ci7
    449 replies | 110319 view(s)
  • kaitz's Avatar
    Yesterday, 02:29
    kaitz replied to a thread Paq8pxd dict in Data Compression
    You can't. If my brain breaks, then :D So in a day...
    770 replies | 289349 view(s)
  • suryakandau@yahoo.co.id's Avatar
    2nd April 2020, 22:59
    I just set paq8 and paq8hp to level 6 in cmix17. I do not use a dictionary when compressing and decompressing. Here is wrtpre.cpp.
    449 replies | 110319 view(s)
  • Darek's Avatar
    2nd April 2020, 22:47
    Darek replied to a thread Paq8pxd dict in Data Compression
    @Kaitz - this version looks like a good step forward. When will it be out?
    770 replies | 289349 view(s)
  • Shelwien's Avatar
    2nd April 2020, 19:50
    21,819,822 enwik8.grn3-world95
    21,857,586 enwik8.grn4-book1
    21,377,801 enwik8.grn4-world95
    189 replies | 8825 view(s)
  • byronknoll's Avatar
    2nd April 2020, 19:35
    byronknoll replied to a thread cmix in Data Compression
    I might be able to help debug this. Did you make any modifications to cmix v17? Did you use a dictionary when compressing and decompressing? Can you post a copy of wrtpre.cpp so I can try reproducing the problem on my computer?
    449 replies | 110319 view(s)
  • xinix's Avatar
    2nd April 2020, 19:22
    xinix replied to a thread Paq8pxd dict in Data Compression
    Hi kaitz, how do I run v79 with an external dictionary? I want to try not a dynamic dictionary, but my own english.dic. Thanks
    770 replies | 289349 view(s)
  • kaitz's Avatar
    2nd April 2020, 18:19
    kaitz replied to a thread Paq8pxd dict in Data Compression
    Wiki parsing compared to v79: used some of the provided enwik-specific transforms. Savings when compressing enwik8 with -s8:
    1. Removing some html entities, converting ids and timestamps: saves 16kb
    2. Converting #xxx; to utf8: saves 3kb (grouping them made things worse, as they were added to the dict)
    3. Reordering [[lang:xxxx to the end of the file: saves 19kb
    4. The rest are small fixes
    In total about 39kb. (-s15 only 26kb)
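    As an illustration of the kind of transform in point 2, here is a hypothetical sketch (not the paq8pxd code) that expands decimal numeric character references of the form &#NNN; into raw UTF-8 bytes; a real transform also has to be exactly reversible so the decoder can restore the original text:
        #include <string>

        // Append the UTF-8 encoding of code point 'cp' to 'out'.
        static void appendUtf8(std::string& out, unsigned cp) {
            if (cp < 0x80) out += char(cp);
            else if (cp < 0x800) {
                out += char(0xC0 | (cp >> 6));
                out += char(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += char(0xE0 | (cp >> 12));
                out += char(0x80 | ((cp >> 6) & 0x3F));
                out += char(0x80 | (cp & 0x3F));
            } else {
                out += char(0xF0 | (cp >> 18));
                out += char(0x80 | ((cp >> 12) & 0x3F));
                out += char(0x80 | ((cp >> 6) & 0x3F));
                out += char(0x80 | (cp & 0x3F));
            }
        }

        // Replace "&#NNN;" sequences with the corresponding UTF-8 bytes,
        // leaving all other input untouched.
        std::string expandEntities(const std::string& in) {
            std::string out;
            for (size_t i = 0; i < in.size(); ) {
                if (in.compare(i, 2, "&#") == 0) {
                    size_t j = i + 2;
                    unsigned cp = 0;
                    bool digits = false;
                    while (j < in.size() && in[j] >= '0' && in[j] <= '9' && j - i < 9) {
                        cp = cp * 10 + unsigned(in[j] - '0');
                        digits = true;
                        ++j;
                    }
                    if (digits && j < in.size() && in[j] == ';' && cp <= 0x10FFFF) {
                        appendUtf8(out, cp);
                        i = j + 1;
                        continue;
                    }
                }
                out += in[i++];
            }
            return out;
        }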
    770 replies | 289349 view(s)
  • Shelwien's Avatar
    2nd April 2020, 17:39
    It looks like a history context rather than SSE. You say "We update it before to extract any information", but normally we have to decode a symbol first to update anything. Still, something like what you describe can happen if we store information about the last symbol seen in a given context. Then we can use this previously seen symbol as context for symbol counts. In any case, SEE/SSE is about using a probability as context. History contexts are related (for example, FSM counter states are quantizations of history, but are quite different from probabilities), but are not compatible with interpolation and other SSE-specific tricks.
    189 replies | 8825 view(s)
  • Marco_B's Avatar
    2nd April 2020, 17:02
    I am sorry to slip into my own example of (presumably) an SSE, but I am trying to understand; please tell me if what follows is a correct pattern for it. Let the alphabet consist of three symbols, and for the moment leave aside the sharing of statistics between different contexts. At some point in the input stream we have the context "a" with the following symbol "b"; for this context the statistics, in counting form, for the primary prediction are
    a b c
    5 7 9
    We update it before extracting any information and obtain
    a b c
    5 8 9
    which still gives the wrong symbol "c". We pass "c" to the error correction matrix of the secondary prediction for the context "a", where each element of a row shows how many times the row symbol has changed into the column label. The matrix has the form
      a b c
    a 3 4 5
    b 4 3 2
    c 1 2 1
    and we can see that "c" translates into the matching symbol "b". Now it is necessary to check the reversibility of the matrix, i.e. that among the symbols listed in the rows only one leads to "b", otherwise we have to code an escape symbol. In the present example that is the case. Then we update the matrix to
      a b c
    a 3 4 5
    b 4 3 2
    c 1 3 1
    and the decoding process will be able to keep all the table statistics synchronized.
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    2nd April 2020, 13:16
    @Romul: You're right. For example, lossless audio compressors deal with it - it's mostly incompressible, but SAC still compresses better than coders that store the low bits without modelling. (It's likely not quite "white" noise, though.) However, there's a difference between digitized analog "noise" and originally digital random data which technically fits the definition of "white noise". Only the latter falls under the "counting argument" (lossless compression transforms each data instance i of N bits into a unique string of Mi bits; if we say that compression always reduces the size of the data by at least 1 bit, it means that at most 2^N-1 compressed strings of fewer than N bits would have to be decompressed into 2^N _unique_ data instances, which is impossible. Thus for some i values Mi>=N is true, so some data instances can't be compressed).
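    The counting argument in the parentheses can be written out explicitly (a standard pigeonhole statement, using the Mi and N from the post above):
        A lossless compressor is an injective map $C : \{0,1\}^N \to \{0,1\}^{*}$.
        If every $N$-bit input were mapped to at most $N-1$ bits, the $2^N$ inputs
        would have to land among only
        \[
            \sum_{k=0}^{N-1} 2^k \;=\; 2^N - 1 \;<\; 2^N
        \]
        distinct compressed strings, so two inputs would share an output and
        decoding would be ambiguous. Hence $M_i \ge N$ for at least one instance $i$.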
    41 replies | 1782 view(s)
  • Romul's Avatar
    2nd April 2020, 08:53
    My idea is that the so-called "white noise" is actually not so random. At least this applies to discrete white noise. There are patterns in it. https://en.wikipedia.org/wiki/White_noise To understand all this better, I have a catastrophic lack of time. PS: I write through an online translator, so my text may not look very correct.
    41 replies | 1782 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 04:25
    Ok I'll check back later to see the score then.
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    2nd April 2020, 03:28
    > Do you have the score for Green+SSE -vs- enwiki8 (or BOOK1)?
    Best book1 result is 209211 (with new book1 coefs), or 212690 with old world95 coefs (default). Just started enwik8 compression, but it will take time. But with any extra context the compression can be improved further:
    flag = rc.rc_BProcess( SSE], freq], t, flag ); // 209211
    flag = rc.rc_BProcess( SSE*2+(j==0)], freq], t, flag ); // 209040
    flag = rc.rc_BProcess( SSE*4+(j<3?j:3)], freq], t, flag ); // 208938
    Here (j==0) means that this symbol has sort rank 0, i.e. has max probability in this distribution. cidx is the symbol code with rank j.
    > So the secondary prediction is made from the counts
    > we have thought 'a' was coming next but 'b' actually came next.....
    > it's sorta like an error correction?
    Just a binary decomposition. We can arrange the symbols into some kind of binary tree, then compute the probabilities of the branch choices (from the sum of counts of the symbols each branch leads to), and apply SSE to these.
    > The bit that is revealed is what updates the secondary table?
    Yes.
    189 replies | 8825 view(s)
  • suryakandau@yahoo.co.id's Avatar
    2nd April 2020, 03:20
    I have tested cmix17 on wrtpre.cpp and the hash value after decompression does not match the original file. Why? This is the hash value of wrtpre.cpp: 3FE3BD3E77A2A34869EC12FD77491EF9D0192BFA. I attach the source code and the binary of cmix17, which I compiled using Dev-C++.
    449 replies | 110319 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 03:09
    Do you have the score for Green+SSE -vs- enwiki8 (or BOOK1)? So the secondary prediction is made from the counts we have thought 'a' was coming next but 'b' actually came next.....it's sorta like an error correction? The bit that is revealed is what updates the secondary table? > Btw, in fact, there's a compression improvement even if SSE is only applied to the probability of only one (most probable) symbol. ok, hmm
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    2nd April 2020, 02:59
    > But if SSE is about taking the prediction as context,
    > then the new prediction must be a full letter.....it can't use ex. a 6 bit (flag)...
    The new prediction is a probability. A probability of what is for you to decide. For example, it can be the probability of the (symbol=='a') flag, or of the (symbol<'c') flag.
    > If the primary model is (i.e. context|prediction) ex. abc.....
    Here "prediction" is a probability distribution - a table of 256 probabilities. And we don't have a good way to use it as SSE context as a whole. But we can split it into multiple individual predictions for symbols, like {'a':0.33, not-'a':0.67}, without taking other symbols into account. And like this we can collect secondary statistics of 'a' appearing in the context of the primary prediction {'a':0.33}. Btw, in fact, there's a compression improvement even if SSE is only applied to the probability of just one (the most probable) symbol.
    > then secondary model (SSE) is prediction|? ...
    prediction|prediction2
    > where do you get the prediction (the ?) from? I'm so so so lost....
    From the secondary statistics in the context of the first prediction. It also helps if the secondary statistics are initialized to map the primary prediction to itself before updates.
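    A minimal sketch of such a secondary table, assuming a bitwise setup with 16-bit fixed-point probabilities; the class name, bin count and update rate are illustrative, not taken from green or paq. It starts as an identity map, interpolates between the two nearest probability bins, and pulls both bins toward each coded bit:
        #include <cstdint>
        #include <vector>

        // Interpolated SSE: a secondary model indexed by (extra context, quantized
        // primary probability). Initially it returns the primary prediction
        // unchanged; it then learns corrections from the coded bits.
        class SSE {
            int nbins;
            std::vector<uint16_t> t;   // secondary p(bit==1), 16-bit fixed point
            int lo = 0, hi = 0, w = 0; // bins and weight used by the last predict()
        public:
            SSE(int contexts, int bins = 33) : nbins(bins), t(size_t(contexts) * bins) {
                for (int c = 0; c < contexts; ++c)
                    for (int i = 0; i < bins; ++i)          // identity map at start
                        t[size_t(c) * bins + i] = uint16_t(i * 65535 / (bins - 1));
            }
            // p1 = primary probability of bit==1 in [0,65535]; returns refined p1
            uint32_t predict(int ctx, uint32_t p1) {
                uint32_t pos = p1 * uint32_t(nbins - 1);    // fractional bin position
                lo = ctx * nbins + int(pos >> 16);
                hi = lo + 1;
                w  = int(pos & 0xFFFF);                     // interpolation weight
                return (t[lo] * uint32_t(65536 - w) + t[hi] * uint32_t(w)) >> 16;
            }
            // after coding the bit, pull both interpolated bins toward the outcome
            void update(int bit, int rate = 5) {
                int target = bit ? 65535 : 0;
                t[lo] = uint16_t(t[lo] + (target - int(t[lo])) / (1 << rate));
                t[hi] = uint16_t(t[hi] + (target - int(t[hi])) / (1 << rate));
            }
        };
    Typical use per coded bit would be: p = sse.predict(ctx, p); code the bit with p; sse.update(bit).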
    189 replies | 8825 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 02:37
    But if SSE is about taking the prediction as context, then the new prediction must be a full letter.....it can't use ex. a 6 bit (flag)... If the primary model is (i.e. context|prediction) ex. abc.....then secondary model (SSE) is prediction|? ...where do you get the prediction (the ?) from? I'm so so so lost.... I can only read Python, I can't read your code. I'll test it though.
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    2nd April 2020, 02:12
    It's not about the actual data bits in the file at all, though it's quite OK to directly encode data bits, like paq does (in which case SSE can also be directly applied to bit probabilities). It's also possible to dynamically build any kind of bitcode for the symbols, compute bit probabilities from the original byte distribution, and use these for SSE. I posted a green version with SSE: https://github.com/Shelwien/green/blob/master/green.cpp#L209
    189 replies | 8825 view(s)
  • Cyan's Avatar
    2nd April 2020, 01:48
    Cyan replied to a thread Zstandard in Data Compression
    > Zstd has an excellent dictionary feature that could be leveraged to create a web standard static dictionary.
    We do agree. We measured fairly substantial compression ratio improvements by using this technique and applying it to general websites, achieving a much better compression ratio than any other technique available today, using a small set of static dictionaries. (Gains are even better when using site-dedicated dictionaries, but that's a different topic.)
    > Anyone know if a web dictionary is already being worked on?
    Investing time in this topic only makes sense if at least one significant browser manufacturer is willing to ship it. This is a very short list. Hence we discussed the topic with Mozilla, even going as far as inviting them to control the decision process regarding the dictionaries' content. One can follow it here: https://github.com/mozilla/standards-positions/issues/105 Since this strategy failed, maybe working in the other direction would make better sense: produce an initial set of static dictionaries, publish them as candidates, demonstrate the benefits with clear measurements, then try to get browsers on board, possibly creating a fork as a substantial demo. The main issue is that it's a lot of work. Who has enough time to invest in this direction, knowing that even if the benefits are clearly and unquestionably established, one should be prepared to see this work dismissed with hand-waved comments because "not invented here"? So I guess the team investing time on this topic had better have strong relations with the relevant parties, such as Mozilla, Chrome and of course the Web Committee.
    399 replies | 124548 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 01:45
    So you look out for a secret code flag in the Arithmetic Code bin file ex. 1010000101111010 000 ?
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    2nd April 2020, 01:34
    > Oh ya, if the tree doesn't see the whole file first, then it has uncertainty.
    There's some uncertainty even if you have symbol counts for the whole file. Real files are not stationary; there are always some local deviations.
    > But the way secondary modelling gets extra stats is still unclear to me.
    > We take a set of primary prediction a, b, c: 33%, 10%, 57%, as context,
    > and gives us a new prediction: _?_ What does it predict?
    There's no way to implement secondary estimation with the whole byte distribution as context - preserving just the symbol ranks would require log2(256!)=1684 bits of information. So SSE is always applied after a binary decomposition:
    a: 33 -> 1
    b: 10 -> 01
    c: 57 -> 001
    first flag (symbol=='a'):  p = 33/(33+10+57) = 0.33
    second flag (symbol=='b'): p = 10/(10+57) = 0.149
    third flag (symbol=='c'):  p = 1, no need to actually encode
    So we'd look up SSE and see the secondary statistics for this case, then encode the (symbol=='a') flag using the SSE prediction, then update the SSE stats with the actual flag value.
    > What is the observation? The predicted symbol that is revealed once transmitted?
    Yes, but a bit/flag rather than a whole symbol.
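    For concreteness, a tiny sketch of the arithmetic above: each flag's probability is the candidate symbol's count divided by the counts still in play (purely illustrative):
        #include <cstdio>

        int main() {
            // a, b, c counts from the example above, tried in that order
            int counts[3] = {33, 10, 57};
            int rest = counts[0] + counts[1] + counts[2];
            for (int k = 0; k < 3; ++k) {
                double p = double(counts[k]) / rest;   // p(flag k fires)
                std::printf("flag %d: p = %d/%d = %.3f\n", k + 1, counts[k], rest, p);
                rest -= counts[k];                     // condition on "not this symbol"
            }
            return 0;
        }
    This prints 0.330, 0.149 and 1.000, matching the three flag probabilities above.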
    189 replies | 8825 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 01:15
    > That's not true, count 0 doesn't mean Oh ya, if the tree doesn't see the whole file first, then it has uncertainty. But the way secondary modelling gets extra stats is still unclear to me. We take a set of primary prediction a, b, c: 33%, 10%, 57%, as context, and gives us a new prediction: _?_ What does it predict? What is the observation? The predicted symbol that is revealed once transmitted?
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    1st April 2020, 23:16
    > SEE was "originally" made for
    > 1) escape probabilities and for
    Yes, since the escape is frequently encoded multiple times per symbol in PPM, so its handling is more important than that of any other symbol... and there's no perfect mathematical model for it. Also, SEE/SSE makes it possible to use extra contexts, unrelated to the normal prefix order-N contexts. For example, ppmd uses the escape flag from the previous symbol as SEE context.
    > SSE was made for using the prediction as input to get another prediction.
    Well, ppmd uses a linear scan to find a symbol anyway, so it's obvious that it's possible to encode "escape from symbol" flags for each symbol, rather than once per context. Then we can apply the SEE model to that. But the secondary estimation idea is helpful in any context really. Just that it's impractical to feed it more than 2 probabilities at once, so some kind of binary decomposition has to be implemented to use SSE with non-binary statistics.
    > But I do wonder about SSE still.
    > What is the theory/reason it works is unclear.
    There are two main reasons:
    1) Persistent skews in the primary estimation (n0/(n0+n1) isn't good for approximating a binomial distribution)
    2) Adding extra contexts to the prediction.
    It's also possible to take predictions from two submodels at once and feed them to a 2d SSE, thus making a binary mixer equivalent.
    > I don't see how you can improve the unimprovable stats, they speak themselves...
    > 20% time c is saw, 40% b....so b get 40% smaller code given...
    That's not true, count 0 doesn't mean that a symbol won't appear, and 100% c doesn't mean that the next symbol would also be c. There's a difference between prior and posterior probability estimations.
    189 replies | 8825 view(s)
  • Jyrki Alakuijala's Avatar
    1st April 2020, 20:13
    I'm not aware of a single HTML, CSS, JS or SVG file where Zstd compresses more than brotli. Even if brotli's internal static dictionary is disabled, brotli still wins these tests by 5 % in density. The Zstd dictionaries can be used with brotli, too. Shared brotli supports a more flexible custom dictionary format than either Zstd or normal brotli, where the word transforms are available. It is better than any dictionary I have seen so far. While people can read the dictionary and speculate, no one has actually made an experiment that shows better performance with a similarly sized dictionary on a general-purpose web benchmark. I suspect it is very difficult to do and certainly impossible with a simple LZ custom dictionary such as zstd's. Brotli's word-based dictionary saves about 2 bits per word reference due to the more compact representation from addressing whole words, instead of being able to address positions between words, combinations of consecutive words, or combinations of word fragments. We can achieve 50 % more compression for specific use cases, but the web hasn't changed so much that the brotli dictionary would need an update. Actually, the brotli dictionary is well suited for compressing human communications from 200 years ago and will likely work in another 200 years. https://datatracker.ietf.org/doc/draft-vandevenne-shared-brotli-format/
    399 replies | 124548 view(s)
  • encode's Avatar
    1st April 2020, 17:43
    I've done some experiments with LZ77 offset (distance) context coding. Decided to share the results + it's my first decent and publicly available "LZ77+ARI+Optimal Parsing" compressor. The previous program was LZPM v0.16 (back in 2008) - not the right or competitive implementation. For me, as a data compression author, it is a must have. Quick results on "fp.log":
    Original -> 20,617,071 bytes
    NLZM v1.02 -window:27 -> 858,733 bytes
    Tornado v0.6 -16 -> 807,692 bytes
    ZSTD v1.4.4 -19 -> 805,203 bytes
    LZMA v19.00 -fb273 -> 707,646 bytes
    MComp v2.00 -mlzx -> 696,338 bytes
    LZPM v0.17 -> 685,952 bytes
    :_coffee:
    2 replies | 411 view(s)
  • SolidComp's Avatar
    1st April 2020, 16:46
    What is it for? Is this a reference implementation, a design exercise?
    2 replies | 411 view(s)
  • SolidComp's Avatar
    1st April 2020, 16:17
    SolidComp replied to a thread Zstandard in Data Compression
    I'm seeing rumblings about browser support for Zstd. What's the status? There's an opportunity here to achieve much better compression of HTML, CSS, JS, and SVG content. I mean better than brotli and of course gzip. The opportunity lies in building a great dictionary. None of the benchmarks I've seen for Zstd employ a dictionary. Zstd has an excellent dictionary feature that could be leveraged to create a web standard static dictionary. Brotli has a static dictionary, but it isn't a great one. The strings are too short, it doesn't support modern HTML and JS features/keywords (because it was generated off of old HTML files), and it has a lot of strange entries, like "pittsburgh" (but not, for example, "Los Angeles", which is much more common) and "CIA World Factbook", which is an extremely rare string. The biggest opportunity with Zstd is to create a static dictionary that would standardize certain conventions and strings in HTML, CSS, and JS source (especially HTML). It could be developed in conjunction with a web minification standard, which we've long needed. A dictionary could then guide and optimize content creation, CMSes, and so forth. For example, it could standardize strings like the beginning of an HTML file as
    <!DOCTYPE html><html lang=
    CMSes and minifiers could then implement the standardized strings and minification conventions. There could be hundreds of such standardized strings... Anyone know if a web dictionary is already being worked on?
    399 replies | 124548 view(s)
  • Self_Recursive_Data's Avatar
    1st April 2020, 14:43
    So my understanding is: SEE was "originally" made for 1) escape probabilities and for 2) finding shared context i.e. abc & wxy both see A follow, so you get extra statistics for free by using wxy. SSE was made for using the prediction as input to get another prediction. If that's the case, I get it all then. But I do wonder about SSE still. What is the theory/reason it works is unclear. For normal primary stats the stats themselves are the % of time to give small codes correctly, I don't see how you can improve the unimprovable stats, they speak themselves...20% time c is saw, 40% b....so b get 40% smaller code given...
    189 replies | 8825 view(s)
  • suryakandau@yahoo.co.id's Avatar
    1st April 2020, 14:18
    cmixHP3 needs only <=2 GB RAM and ~100 hours. 1000000000 bytes -> 148153579 bytes in 377474.20 s. cross entropy: 1.185
    16 replies | 1250 view(s)
  • compgt's Avatar
    1st April 2020, 13:17
    Since everyone has all the time in the world, with these lockdowns and home quarantines, has anyone actually implemented RandomDataCompressor (RDC) #1 and #2, hmmm? ;)
    7 replies | 362 view(s)
  • uhost's Avatar
    1st April 2020, 06:58
    I think the following example can address your doubt:
    4000 => 000096
    Decoding input: 0000 96 = output: 4000
    Decoding input: 00000 96 = output: 8096
    Decoding input: 000 96 = output: 1952
    Decoding input: 00 96 = output: 928
    Decoding input: 0 96 = output: 416
    Decoding input: 96 = output: 160
    1) :D 96 has duplication but is not the same in my algorithm, so the left-side zeros are important or valuable information for 96.
    2) If you want to avoid the left-side zeros, some rule will help: all left-zero numbers are indicated -] and equivalent to a +] value, like -6 0 6. E.g. 0003-]==19+] (real value of 0003=61], 020-]==52+] (real value of 020=108]. (All negative values have equivalent positive values; it is less than the real value of the negative.)
    3) The first division step does not need this information because this algorithm is only part of my project. It will reduce 1 or more bits per division step, although some problems have occurred that I still have to fix. Data compression is complicated. This method is not finished for compression.
    41 replies | 1782 view(s)
  • Shelwien's Avatar
    1st April 2020, 03:01
    https://encode.su/threads/541-Simple-bytewise-context-mixing-demo?p=64454&viewfull=1#post64454
    > So SSE, how does it work... if I have 17 models to mix,
    > do I use a short model ex. order3 and use it as a goat for the rest?
    No. You mix the predictions from these 17 models, then pass them to SSE, then use the SSE prediction for coding. It's also possible to have multiple SSE-like stages (paq does).
    > How does this improve compression. Can you explain it clearly?
    1) By sharing statistics for new contexts - a prediction of a context with just {'a':1} can be improved a lot with additional statistics for all such cases.
    2) By compensating for some persistent prediction errors.
    > Why "secondary"....
    Because it's a second model which uses the prediction of the first model as another context.
    > where do I pull it from, when, and what do I merge it to...
    There are multiple open-source implementations at this point. http://mattmahoney.net/dc/dce.html#Section_433
    > I'm lost...can you elaborate enough context to walk me through this?
    It's just a secondary statistical model, same as the first one, but able to use the primary prediction as an extra context. I have now added a ppmonstr-like SSE implementation to green. But it's possible to improve its compression by adding more contexts.
    > Is SSE (or is it SEE) just about updating the table error after prediction? Isn't that backpropagation?
    It would be backpropagation if you tried to modify weights to get a specific SSE output. Which is not really helpful for a CM (I tried), so it's not what normally happens. Yes, it's "about updating the table after prediction". Same as normal contextual statistics.
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    1st April 2020, 02:23
    https://github.com/Shelwien/green
    http://ctxmodel.net/files/green/green_v4_SSE.rar
    v4_SSE tuned to:  book1   world95
    book1             209211  212690
    world95           479273  462406
    - updated rangecoder to sh_v2f (saved 5 bytes)
    - basic interpolated SSE with unary symbol coding
    - bugfix for text buffer access
    - updated book1 parameter profile
    13 replies | 27444 view(s)
  • schnaader's Avatar
    1st April 2020, 02:09
    Let me address my concerns with your method:
    1. In your example, you're reducing numbers from the "input" range 3000-4000 to numbers in the "output" range 1-20. Doing this for more than 20 numbers in the input range would produce duplicate outputs, so decoding won't work anymore.
    2. Since you're reducing the numbers in multiple steps (e.g. 6 steps for 3055 => 1042 => 1006 => 18 => 14 => 2), you have additional information ("6 steps") to encode to get back from 2 to 3055. If you don't encode this additional information, the decoder doesn't know if it should stop at 14, 18, 1006, 1042 or 3055. Another way would be to encode the range 3000-4000 and stop at the first number in that range, but again, this is additional information that has to be encoded. So it looks like a 12-bit to 2-bit reduction, but the additional information will increase the 2-bit result.
    3. If "have different angles & generate different master number" means that you can encode the same number in different ways, this is additional information that the decoder has to know, too.
    So that's why I don't think that this statement holds:
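    One way to make points 1 and 2 quantitative (a generic counting bound, assuming the 12-bit inputs are uniformly distributed):
        Any lossless code for values uniform over $2^{12}$ possibilities needs at
        least $12$ bits per value on average, so a $2$-bit "result" leaves
        \[
            \log_2 2^{12} - \log_2 2^{2} = 10 \text{ bits}
        \]
        per value that must reappear elsewhere: in the step count, the stopping
        range, or whatever other side information the decoder needs.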
    41 replies | 1782 view(s)
  • Sportman's Avatar
    1st April 2020, 00:05
    Sportman replied to a thread 2019-nCoV in The Off-Topic Lounge
    There is a cure when needed (given by medical professional): https://techstartups.com/2020/03/28/dr-vladimir-zelenko-now-treated-699-coronavirus-patients-100-success-using-hydroxychloroquine-sulfate-zinc-z-pak-update/ https://techstartups.com/2020/03/31/dr-vladimir-zelenko-provides-important-update-three-drug-regimen-hydroxychloroquine-sulfate-zinc-azithromycin-z-pak-used-effectively-treat-699-coronavirus-patients-100-su/
    35 replies | 2060 view(s)
  • Self_Recursive_Data's Avatar
    31st March 2020, 22:02
    Is SSE (or is it SEE) just about updating the table error after prediction? Isn't that backpropagation?
    189 replies | 8825 view(s)
  • compgt's Avatar
    31st March 2020, 21:19
    Noted. I don't have a list. I think of notable figures here to send to. Now that you've informed me, I should not be sending you messages anymore in the future. Those messages were sent to you in private. But it's my Sent folder; it should not have been deleted by Shelwien. He implies he can do anything because he's the one paying for this forum; what if I keep telling that I co-designed this encode.su GUI in the 1970s Cold War when I was a very talented child? I was enthused when I saw encode.ru again. You, JamesB, probably knew me only in the late 80s. Yes, I was surprised by the "brain tumor" tag on Sami and Dec. 2012.
    307 replies | 316182 view(s)
  • encode's Avatar
    31st March 2020, 18:48
    Hi all! :_hi2: Please welcome my (new) LZ77 compressor! It's basically a baseline LZ77 - no buffered (rep) offsets, no literal masking, no exe filters, no whistles and bells. On some files it can be very efficient, on others - not. On ENWIKs it's not the best performer, but on files like fp.log or english.dic LZPM is pretty good (for pure LZ77). Window size = block size = 128 MB. Enjoy the new release! :_superman2:
    2 replies | 411 view(s)
  • Darek's Avatar
    31st March 2020, 16:15
    Darek replied to a thread Paq8pxd dict in Data Compression
    My score for enwik9: 125'475'149 - enwik9_1423 -x15 by Paq8pxd_v79_AVX2 => a very small difference compared to plain enwik9. I wonder if there are any differences between the AVX2 and SSE4 versions.
    770 replies | 289349 view(s)
  • JamesB's Avatar
    31st March 2020, 16:04
    We're not here to steal your ideas or claim credit for your actions. I feel sorry for you as I'd guess there is some mental health issue here, which I wouldn't wish on my worst enemy. I'm not being mean and there is no stigma associated IMO, but please do consider getting help. However please stop sending me messages. I don't feel that request is in any way an infringement on your rights - yes it's *your* sent folder, but it's also *my* inbox and just as I can opt out of all sorts of sources of spam I am exercising my right to do it here too. I request that you take me off your list.
    307 replies | 316182 view(s)
  • maadjordan's Avatar
    31st March 2020, 12:58
    maadjordan replied to a thread WinRAR in Data Compression
    How can 7-zip plugins be added to the 7-zip dll file to allow WinRAR to unpack with such plugins? https://encode.su/threads/3177-7-zip-plugins?highlight=7zip
    180 replies | 125959 view(s)
  • A_Better_Vice's Avatar
    31st March 2020, 11:40
    I have not attempted to re-compress the output files as I do not expect any serious gains. I can give it a try for amusement and maybe it will hit 1% smaller, if not negative gains. I did however compress encrypted files from AxCrypt, again 30% smaller. Only more recently, in the last few weeks or so, have I had a serious gain in speed, where it went from originally weeks to days to hours to seconds using various techniques. In looking through my code I see that I should easily still be able to chop off some overhead and make it run even faster by removing calls that do not contribute anything. Lots of calls all over the place, since I was trying so many different techniques until I landed on the MUCH more productive technique in terms of speed. Within a few days I should get it to ~1kB per second vs 1kb per second in a non-parallel, non-Linux, non-CUDA environment, not from genius re-tweaking or gen two but just from cleaning the crap up and gutting calls that have no positive impact on this technique that works. Off to bed soon. My best work is from 11pm to 4am, but at around 5am I can start making serious mistakes... I will be posting some specific decompression speeds as well soon, and official benchmarks, I guess more than likely on my website. Sent out some feelers to some companies and a few only accept emails from private domains, not coldmail or warmmail or even hotmail... So time to fire up my website. So much to do, but happy times are here again :) Hope all is well with everyone. How is Ukraine? My mother-in-law is from there.
    8 replies | 492 view(s)
  • xinix's Avatar
    31st March 2020, 09:30
    xinix replied to a thread Paq8pxd dict in Data Compression
    i7-3820 4GHz, Mem 64gb
    ____
    125.997.424 bytes - enwik9 pxd_v79_SSE4 -s15 (128167 sec)
    125.532.596 bytes - enwik9 pxd_v79_SSE4 -x15 (160497 sec)
    770 replies | 289349 view(s)
  • snowcat's Avatar
    31st March 2020, 06:58
    snowcat replied to a thread 2019-nCoV in The Off-Topic Lounge
    I'm Vietnamese and currently my country has only 204 cases, with 55 cases fully recovered and 0 deaths so far... Not sure if I should panic that much, but I am panicking af right now, especially when I know how bad it could be. Look at what it did to Italy and the USA...
    35 replies | 2060 view(s)
  • uhost's Avatar
    31st March 2020, 06:30
    count = how many divisions are taken
    eg: 3055
    First div result = 1041
    After 3 or 4 steps we have 17 or 15
    (0) == == (1) ==== (2) === (3) == (4) = (5)
    3055 => 1041 => 1007 => 17 => 15 => 1
    1 has different angles & generates a different master number. For example: 01, 001, 0001 - these values are equal but the position of the 1 is different. This method is a complete success, but I am trying a more effective new method; its encoder is completed but decoding is somewhat complicated. It can do 3055 => 7 with angles (position) 7, 281474970863668 => 32751 => 14 with angles (position) 7, 281474970863667 => 32767 => 15 with angles (position) 7. It takes 1 or at most 3 steps. If you do not understand my explanation please forgive me; when I complete this method I will explain how it works.
    41 replies | 1782 view(s)
  • uhost's Avatar
    31st March 2020, 05:56
    :)
    41 replies | 1782 view(s)
  • Self_Recursive_Data's Avatar
    31st March 2020, 04:00
    So SSE, how does it work... if I have 17 models to mix, do I use a short model ex. order3 and use it as a goat for the rest? How does this improve compression. Can you explain it clearly? Why "secondary"....where do I pull it from, when, and what do I merge it to... I'm lost...can you elaborate enough context to walk me through this? You can barely find it on the internet....can you write it yourself below, so we can spread the instructions on the internet? Just rewrite it below clearly. We seem to lack new full explanations.....
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    31st March 2020, 03:36
    > Green VS enwiki8. If I remember correctly, Green achieved approx. 21,800,000 bytes.
    Well, existing implementations of bytewise CM/PPM with SSE (ppmonstr, durilca, ash) seem to reach 18,0xx,xxx only with preprocessing (DRT or similar) - their results are around 19,0xx,xxx normally.
    > And I'm wondering if you can add *both* SSE & SEE in the same algorithm?
    Yes, it's exactly what ppmonstr/durilca do. But you don't need to encode escapes in a CM; green doesn't even keep escape statistics.
    > Using the 21.8MB mark, how much would SEE bring it down by? 1MB improvement?
    You'd need a PPM for that, but yes.
    > And what about SSE? 0.34MB improvement? This question will let me know how useful SSE vs SEE is.
    I repeat, there's no one specific SSE implementation. PAQ has a bitwise SSE, while in theory it's also possible to precompute the whole byte probability distribution, and with it implement unary SSE like in ppmonstr. SSE is just a secondary model. It's unknown what kind of model it would be, so it's hard to predict its performance. A wrong SSE implementation can make compression worse quite easily, too.
    189 replies | 8825 view(s)
  • Self_Recursive_Data's Avatar
    31st March 2020, 03:17
    > How much is "21.8MB" in bytes? 22,858,956? Green VS enwiki8. If I remember correctly, Green achieved approx. 21,800,000 bytes. And I'm wondering if you can add *both* SSE & SEE in the same algorithm? Using the 21.8MB mark, how much would SEE bring it down by? 1MB improvement? And what about SSE? 0.34MB improvement? This question will let me know how useful SSE vs SEE is.
    189 replies | 8825 view(s)
  • Shelwien's Avatar
    31st March 2020, 02:53
    > If we add SEE to Green, and Green already compresses enwiki8 to 21.8MB, How much is "21.8MB" in bytes? 22,858,956? > how much would it improve this? 19.6MB? SEE only makes sense in PPM - it implies a completely different weighting method. http://mattmahoney.net/dc/text.html#1839 ppmd J -m256 -o10 -r1 21,388,296 // Uses SEE ppmd_sh9 16 1863 20,784,856 ppmonstr J -m1863 -o16 19,040,451 // Uses SSE > Now, say we add SSE, too. How much would it improve this "19.6MB"? 18MB? Maybe, depends on specific implementation. SSE is not a fixed function - its any statistical model that uses primary prediction as context. > But isn't this the same thing as Local Text for giving more weight > to Models that have been predicting great recently? > I.e. adaptive Global Model weight favoring models based on error. > Or, is SSE similar but meant for specific context? Normally SSE is orthogonal to weights and primary contexts. It takes already mixed primary predictions and refines them using statistics in context of these predictions (and possibly some other).
    189 replies | 8825 view(s)
  • Self_Recursive_Data's Avatar
    31st March 2020, 01:51
    If we add SEE to Green, and Green already compresses enwiki8 to 21.8MB, how much would it improve this? 19.6MB? Now, say we add SSE, too. How much would it improve this "19.6MB"? 18MB? I have another question about SSE. It updates a table based on prediction error, to refine the prediction, basically. But isn't this the same thing as Local Text for giving more weight to Models that have been predicting great recently? I.e. adaptive Global Model weight favoring models based on error. Or, is SSE similar but meant for specific context?
    189 replies | 8825 view(s)
  • Darek's Avatar
    31st March 2020, 00:32
    Darek replied to a thread Paq8pxd dict in Data Compression
    enwik8/9 update:
    16'339'122 - enwik8 -s8 by Paq8pxd_v74_AVX2
    15'993'409 - enwik8 -s15 by Paq8pxd_v74_AVX2
    16'279'540 - enwik8 -x8 by Paq8pxd_v74_AVX2
    15'928'916 - enwik8 -x15 by Paq8pxd_v74_AVX2
    15'880'133 - enwik8.drt -x15 by Paq8pxd_v74_AVX2
    125'752'479 - enwik9_1423 -x15 by Paq8pxd_v74_AVX2 - best overall score for the paq8pxd series
    16'291'281 - enwik8 -s8 by Paq8pxd_v78_AVX2
    15'941'450 - enwik8 -s15 by Paq8pxd_v78_AVX2
    16'231'687 - enwik8 -x8 by Paq8pxd_v78_AVX2
    15'877'659 - enwik8 -x15 by Paq8pxd_v78_AVX2
    15'852'312 - enwik8.drt -x15 by Paq8pxd_v78_AVX2
    125'942'438 - enwik9 -x15 by Paq8pxd_v78_AVX2 - plain enwik9 file, tested by Kaitz
    125'797'519 - enwik9_1423 -x15 by Paq8pxd_v78_AVX2
    16'272'537 - enwik8 -s8 by Paq8pxd_v79_AVX2
    15'925'621 - enwik8 -s15 by Paq8pxd_v79_AVX2 - tested by Sportman
    16'214'034 - enwik8 -x8 by Paq8pxd_v79_AVX2
    15'862'122 - enwik8 -x15 by Paq8pxd_v79_AVX2 - tested by Sportman - best score without the DRT preprocessor for the paq8pxd series
    15'843'925 - enwik8.drt -x15 by Paq8pxd_v79_AVX2 - best overall score for the paq8pxd series
    125'67x'xxx - (estimated) - enwik9_1423 -x15 by Paq8pxd_v79_AVX2 - we'll see...
    770 replies | 289349 view(s)
  • schnaader's Avatar
    31st March 2020, 00:25
    I don't understand this "count" step. What do I have to count? If I count zeros and ones for example, I get 2 (zeros) and 10 (ones), not 3 and 17. Also, first you wrote 3055 => 1041 => 1007 => 17, the second one looks like 3055 => 113 => 15, which one is correct? Last question: Why can't 17 and 19 be reduced further?
    41 replies | 1782 view(s)
  • uhost's Avatar
    30th March 2020, 23:08
    After 3 conversions: 3055 => 1041 => 1007 => 17, and 3053 => 1043 => 1005 => 19. This method can provide decimal-free conversion: 3055 = 101111101111, count = 11(3) + 10001(17) = 1110001. If you want to reduce it again: 113 = 1110001, 113 => 15 == 1111.
    41 replies | 1782 view(s)
  • uhost's Avatar
    30th March 2020, 22:41
    :)
    41 replies | 1782 view(s)