Activity Stream

  • Jyrki Alakuijala's Avatar
    Today, 21:43
I have to say that I am proud of the JPEG XL team delivering such coding speeds and the strong guarantees on quality above 0.5 bpp (1:50 compression).
    102 replies | 24401 view(s)
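For a rough sense of where the 1:50 figure comes from (assuming a plain 24-bit RGB source, which the post does not state explicitly):

$$\frac{24\ \text{bpp}}{0.5\ \text{bpp}} = 48 \approx 50 \quad\Rightarrow\quad \text{about } 1:50\ \text{compression.}$$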
  • Darek's Avatar
    Today, 20:12
    Darek replied to a thread Paq8pxd dict in Data Compression
Is it only a fix?
    790 replies | 289706 view(s)
  • kaitz's Avatar
    Today, 19:01
    kaitz replied to a thread Paq8pxd dict in Data Compression
    No need to test this one.
    790 replies | 289706 view(s)
  • Darek's Avatar
    Today, 18:49
    Darek replied to a thread Paq8pxd dict in Data Compression
    Damn, Kaitz is too fast... I'm still testing v80 ;)
    790 replies | 289706 view(s)
  • moisesmcardona's Avatar
    Today, 18:32
    v81 is on GitHub: https://github.com/kaitz/paq8pxd/releases/tag/v81
    790 replies | 289706 view(s)
  • Darek's Avatar
    Today, 17:30
    Darek replied to a thread Paq8pxd dict in Data Compression
@CompressMaster - first attempt at tarball compression:
10'335'343 - score of compressing particular files, each with one option => paq8pxe v1 gc82
10'116'571 - best score of solid archive, revisited by paq8pxe v1 gc82; I've faced some issues with paq8px v184 and v185 to test...
10'122'251 - tarball file compressed by paq8pxe v1 gc82 - slightly worse than solid compression for the same version.
I wonder if file order could be important for the tar file. Question - is it possible to put particular files into the tarball in my own order?
    790 replies | 289706 view(s)
  • schnaader's Avatar
    Today, 17:22
    schnaader replied to a thread Paq8pxd dict in Data Compression
    The second number in each line is a 32 bit datatype display error. 33310 MiB*1024*1024 modulo 2^32 = 568,328,192 bytes
    790 replies | 289706 view(s)
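A self-contained illustration of that truncation (just a demonstration of the arithmetic, not paq8pxd's actual code):

#include <cstdint>
#include <cstdio>

int main() {
    // 33310 MiB expressed in bytes does not fit in 32 bits.
    uint64_t used = 33310ull * 1024 * 1024;        // 34,928,066,560 bytes
    // Narrowing to a 32-bit counter keeps only the value modulo 2^32.
    uint32_t shown = static_cast<uint32_t>(used);  // 568,328,192 bytes
    std::printf("full: %llu, truncated: %u\n",
                (unsigned long long)used, shown);
    return 0;
}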
  • LucaBiondi's Avatar
    Today, 16:03
    LucaBiondi replied to a thread Paq8pxd dict in Data Compression
    Thank you!
    790 replies | 289706 view(s)
  • Sportman's Avatar
    Today, 15:07
    Sportman replied to a thread Paq8pxd dict in Data Compression
    I use only 32GB and no problems so far.
    790 replies | 289706 view(s)
  • LucaBiondi's Avatar
    Today, 14:49
    LucaBiondi replied to a thread Paq8pxd dict in Data Compression
Thank you! ...so I need at least 48 GB
    790 replies | 289706 view(s)
  • Darek's Avatar
    Today, 14:46
    Darek replied to a thread Paq8pxd dict in Data Compression
Looks like even the times are smaller...!
    790 replies | 289706 view(s)
  • Sportman's Avatar
    Today, 14:05
    Sportman replied to a thread Paq8pxd dict in Data Compression
    "used 33310 MB (568349169 bytes) of memory" -x15 "used 33310 MB (568349201 bytes) of memory" -x15 -w
    790 replies | 289706 view(s)
  • Sportman's Avatar
    Today, 14:04
    Sportman replied to a thread Paq8pxd dict in Data Compression
    enwik8: 15,835,340 bytes, 7,418.998 sec., paq8pxd_v80_avx2 -x15 -w
    790 replies | 289706 view(s)
  • kaitz's Avatar
    Today, 13:52
    kaitz replied to a thread Paq8pxd dict in Data Compression
enwik8: 16222997 bytes, -s8 -w, Paq8pxd_v80_AVX2. Decompression identical.
    790 replies | 289706 view(s)
  • LucaBiondi's Avatar
    Today, 13:21
    LucaBiondi replied to a thread Paq8pxd dict in Data Compression
Hi Sportman, how much memory do you need to use the -x15 option? I am upgrading some servers... Luca
    790 replies | 289706 view(s)
  • Darek's Avatar
    Today, 13:14
    Darek replied to a thread paq8px in Data Compression
There is something wrong with using "solid" compression with paq8px v185: up to paq8pxe v1 gc82 it works fine, but starting from v185 compression scores are much worse (2.5x higher) than previous versions. Maybe it could be helpful that paq8px v184 runs quite ok, but for some files (I.EXE) there is a crash with "encodeExe read error".
    1841 replies | 525604 view(s)
  • Sportman's Avatar
    Today, 11:57
    Sportman replied to a thread Paq8pxd dict in Data Compression
    enwik8: 15,861,418 bytes, 7,564.860 sec., paq8pxd_v80_avx2 -x15
    790 replies | 289706 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 06:53
@Shelwien, I use this script to compile paq8px v182 and it works, but for paq8pxd v69 it does not work. Why?
    1841 replies | 525604 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 04:05
How do I compile paq8pxd using MinGW? Could you give me the script to compile it? Thank you.
    1841 replies | 525604 view(s)
  • schnaader's Avatar
    Yesterday, 23:10
    Thanks for the info, but please don't forget the units ;) The paper says "throughput in megapixels/second". I was confused at first because I read it as "seconds".
    102 replies | 24401 view(s)
  • Darek's Avatar
    Yesterday, 23:07
    Darek replied to a thread Paq8pxd dict in Data Compression
@CompressMaster - hmmm, I've never thought of doing that. Really. But it looks like quite a nice idea to try! Thanks. At this moment my records are:
10'335'343 - score of compressing particular files, each with one option => paq8pxe v1 gc82
10'196'351 - summary of best scores for all files => various compressors - paq8px, paq8pxd and cmix actually
10'116'576 - best score of solid archive => paq8px v183fix1 -9t - it takes advantage of two similar files (D.TGA and E.TIF) which are different formats but the same images.
I'll test the best versions of decent compressors to check if a tarball file gets a better score.
    790 replies | 289706 view(s)
  • CompressMaster's Avatar
    Yesterday, 22:03
    @Darek, have you ever tried to TAR files first? Maybe then you could achieve better results...
    790 replies | 289706 view(s)
  • Jarek's Avatar
    Yesterday, 21:31
Benchmarking JPEG XL image compression (1 April 2020): https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11353/113530X/Benchmarking-JPEG-XL-image-compression/10.1117/12.2556264.full?SSO=1
E.g. speed:
Codec | Encode | Decode
JPEG XL (N=4) | 49.753 | 132.424
JPEG (libjpeg) | 9.013 | 11.133
JPEG (libjpeg-turbo) | 48.811 | 107.981
HEVC-HM-YUV444 | 0.014 | 5.257
HEVC-x265-YUV444 | 1.031 | 14.037
HEVC-x265-YUV444 (N=4) | 3.691 | 14.100
HEVC-x265-YUV444 (N=8) | 6.345 | 13.471
    102 replies | 24401 view(s)
  • Darek's Avatar
    Yesterday, 17:25
    Darek replied to a thread Paq8pxd dict in Data Compression
Scores of my testset by paq8pxd v80. Mixed results => exe files got some improvements; K.WAD and L.PAK got some losses...
    790 replies | 289706 view(s)
  • Trench's Avatar
    Yesterday, 17:06
    Trench replied to a thread 2019-nCoV in The Off-Topic Lounge
On the other hand, truth only comes when everything is evaluated, even if right or wrong. Over a trillion for the corona stimulus they voted on, and the people get around 200 billion (1/10). What about the majority of the trillion? Oh, that goes to organizations like arts, public broadcasting, etc. - in other words, mostly unrelated things.
Remember the coronavirus of 2012 that the media said would be a pandemic? More died from the common flu, yet it didn't destroy the world economy, and more people will get sick and die from other things than from the virus. Reports say people with low vitamin D levels get it seriously, and also people who have genetically low vitamin C.
"The microbe is nothing. The terrain is everything" — Claude Bernard 1813-1878 (widely regarded to be the father of modern physiology). "The primary cause of disease is in us, always in us" — Professor Pierre Antoine Bechamp, 1883.
How does one get bad terrain? Environment: from the AIR, WATER, FOOD. What does the air have, and why do wildfires go more out of control? Why are toxic metals in more people and increasing every year, like aluminum, mercury, etc.? Too much chromium, whose side effect is lung scarring (breathing issues)? What happens when you put metal in the microwave? Bad idea. Or putting your head outside the microwave - bad, but not constant (one punch). How would the body react when something else gives off those waves nonstop, like sitting/sleeping next to a router/modem (never-ending soft taps), and a person has metals in them, as most do? Why do people with health issues get problems with more radiation, just like the people that die from corona?
Even if you feel good and are not sick, it's those people that spread the "virus", as the media says, since they don't know they have it yet. If that is true - since who knows - then you are to blame. As for the CDC and WHO saying masks are useless but that it can spread through talking - well, that is another oxymoron thing to say; they are not stupid, so what's the deal? Either they lie or they are incompetent, and either way why does anyone listen to anything a liar/idiot says? The odd thing is there are reports that some people are getting it without being in contact with anyone or anything for over 20 days. Very strange things. If it walks like a duck, talks like a duck, looks like a duck, it must be a dog? WHAT?
    36 replies | 2105 view(s)
  • moisesmcardona's Avatar
    Yesterday, 16:00
    moisesmcardona replied to a thread paq8px in Data Compression
    It seems that some files are getting stuck extracting at 100% when the file has been compressed using pre-training text model. Using the refactored v185.
    1841 replies | 525604 view(s)
  • Marco_B's Avatar
    Yesterday, 15:33
Let me specify better how the decoder may work. The distribution passed to the arithmetic/range coder is the one for the alphabet specified by the row indicated by I (see again #187) in the matrix of the secondary prediction, augmented by an escape symbol. The frequency for a certain symbol is calculated by summing the counts in its column and dividing by the sum of all the other columns. The decoder receives the code and infers the corresponding symbol, then it updates the table of the primary prediction; finally, the symbol which emerges from there is used as the J-index to update the matrix of the secondary prediction. In this way it seems to me that the check of the reversibility of the matrix is not even necessary.
    190 replies | 8940 view(s)
  • kaitz's Avatar
    Yesterday, 11:44
    kaitz replied to a thread Paq8pxd dict in Data Compression
paq8pxd_v80
- Small changes in wordModel
- Add second option -w for direct input of wikipedia dumps
- WIT data type (option -w) for wikipedia: no detection, malformed input gets transform fail -> subtract ID, convert timestamp, convert html entities (also to UTF8) | extract article header ns/id -> contributor, place after data | extract langs at the end of article, place after data
- Move online wrt out from wordModel
Compressing enwik8: paq8pxd_v80 -s8 -w enwik8
Without -w no wiki-specific processing is done. This will work on enwik9 and squeezechart files. Currently it fails on enwik10; I think it's the second file inside.
    790 replies | 289706 view(s)
  • kaitz's Avatar
    Yesterday, 11:36
    kaitz replied to a thread Paq8pxd dict in Data Compression
Maybe open it in notepad++ and see if it is in fact text. You need to change wrtpre.cpp. If it reports bintext then it's text mixed with binary data. And wrt will probably bloat the hell out of this file.
    790 replies | 289706 view(s)
  • pklat's Avatar
    Yesterday, 09:31
    pklat replied to a thread Zstandard in Data Compression
You might then move to some entirely new format, more efficient than HTML, and compress that. Does brotli pack each file separately? If so, huge gains could be made if all the text files for a page were solid-packed into one. IIRC, time is mostly wasted on fetching a multitude of small files.
    400 replies | 124693 view(s)
  • Jyrki Alakuijala's Avatar
    Yesterday, 07:53
Perhaps you are more focused on aesthetics and elegance than efficiency. Efficiency is something that can be measured in a benchmark, not by reasoning. As an example, when I played with dictionary generation (both zstd --train and shared brotli), occasionally I found that taking 10 random samples of the data and using the best sample as a dictionary turned out more efficient than running either of the more expensive dictionary extraction algorithms. Other times concatenating 10 random samples was a decent strategy. Thorough thinking, logic and beauty do not necessarily 'win' the dictionary efficiency game.
Depending on how well the integration of a shared dictionary has been done, different 'rotting' times can be observed. SDCH dictionaries were rotting every 3-6 months into being mostly useless or already slightly harmful; with brotli dictionaries we barely see rot at all. Zstd dictionary use -- while less efficient than shared-brotli-style shared dictionary coding -- also likely rots much slower than SDCH dictionaries did. This is because SDCH used to mix the entropy originating from the dictionary use with the literals in the data, and then hope that a general purpose compressor can make sense out of this bit porridge.
IMHO, we could come up with a unified way to do shared dictionaries and use it across the simple compressors (like zstd and brotli).
    400 replies | 124693 view(s)
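A rough sketch of the "keep the best of N random samples" idea, using zstd's one-shot dictionary API (ZSTD_compress_usingDict accepts any byte buffer, including a raw content dictionary); the surrounding function names are made up, and real code would measure on a held-out corpus and handle errors properly:

#include <zstd.h>
#include <cstdint>
#include <string>
#include <vector>

// Compressed size of `data` when `dict` is used as a raw content dictionary.
static size_t costWithDict(ZSTD_CCtx* cctx, const std::string& data,
                           const std::string& dict, int level) {
    std::vector<char> out(ZSTD_compressBound(data.size()));
    size_t r = ZSTD_compress_usingDict(cctx, out.data(), out.size(),
                                       data.data(), data.size(),
                                       dict.data(), dict.size(), level);
    return ZSTD_isError(r) ? SIZE_MAX : r;
}

// Try each candidate sample as a dictionary and keep the one that
// compresses the test corpus best.
std::string pickBestSampleDict(const std::vector<std::string>& samples,
                               const std::string& corpus, int level = 19) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t best = SIZE_MAX;
    std::string bestDict;
    for (const std::string& s : samples) {
        size_t c = costWithDict(cctx, corpus, s, level);
        if (c < best) { best = c; bestDict = s; }
    }
    ZSTD_freeCCtx(cctx);
    return bestDict;
}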
  • xinix's Avatar
    Yesterday, 07:24
    xinix replied to a thread Paq8pxd dict in Data Compression
I can compile, so can you tell me which part of the code to change?
_____
Perhaps this simplifies the task: pxd_v79 does not preprocess my file. Segmentation outputs bintext and it does not preprocess. If it could ignore that and apply preprocessing anyway, then I won't need an external dictionary. Thanks!
    790 replies | 289706 view(s)
  • xinix's Avatar
    Yesterday, 07:20
    You must not use "C9" and you must not use "C" You must use lowercase "c" And do not try to use phda9dec.exe windows phda9 only works under linux The created file in linux will not be unpacked under windows + due to the fact that phda is based on floats, it is possible to pack and unpack only on the same processor\linux.
    90 replies | 31015 view(s)
  • Alexander Rhatushnyak's Avatar
    Yesterday, 07:14
C9 is only for enwik9. C is for other files, but they can't be as big as enwik9. There's something in readme.txt about sizes. Sorry, it looks like this year I won't have more than 5 minutes per month for this.
    90 replies | 31015 view(s)
  • ShihYat's Avatar
    Yesterday, 06:16
Same problem here as above. When I use 'C' on other large enwik text files, it returns a segmentation fault at different percentages; likewise, using 'C' instead of 'C9' on enwik9 faults at 91%.
    90 replies | 31015 view(s)
  • byronknoll's Avatar
    Yesterday, 00:11
    byronknoll replied to a thread cmix in Data Compression
I didn't use your binary. Here are some suggestions that might help:
- change the compiler flag from -Ofast to -O3. The binary will be slower, but might fix the issue you are seeing.
- upgrade your compiler to a more recent version.
- change to a different compiler - I recommend clang.
    449 replies | 110390 view(s)
  • LucaBiondi's Avatar
    3rd April 2020, 17:01
    LucaBiondi replied to a thread Paq8pxd dict in Data Compression
Hi Guys! I hope you are all fine!
I have two similar XML files. I found that one is detected as default while the other is detected as text:
0 |default | 1 | 14404734
18 |text | 1 | 11449237
Why? What should I verify inside my files?
This is the log:
C:\Compression\paq8pxd>paq8pxd_v76_AVX2 -x12 testset_xml c:\compression\Xml_testset\*.xml
Slow mode
FileDisk: unable to open file (No such file or directory)
Creating archive testset_xml.paq8pxd76 with 2 file(s)...
File list (76 bytes)
Compressed from 76 to 58 bytes.
1/2 Filename: c:/compression/Xml_testset/12_A_20060313194711.xml (11449237 bytes)
Block segmentation:
0 | text | 11449237
2/2 Filename: c:/compression/Xml_testset/AS_B_20150226093210.xml (14404734 bytes)
Block segmentation:
0 | default | 14404734
Segment data size: 27 bytes
TN |Type name |Count |Total size
-----------------------------------------
0 |default | 1 | 14404734
18 |text | 1 | 11449237
-----------------------------------------
Total level 0 | 2 | 25853971
default stream(0). Total 14404734
bigtext wrt stream(10). Total 8167896
Stream(0) compressed from 14404734 to 133667 bytes
WRT dict count 811 words.
WRT dict online.
Stream(10) compressed from 8167896 to 284267 bytes
Segment data compressed from 27 to 18 bytes
Total 25853971 bytes compressed to 418059 bytes.
Time 4365.32 sec, used 15766 MB (3646968591 bytes) of memory
Thank you!!!!
Luca
    790 replies | 289706 view(s)
  • Marco_B's Avatar
    3rd April 2020, 17:00
    > You say "We update it before to extract any information", but normally we have to decode a symbol > first to update anything. Yes, I do it to take into account for the posterior estimation, the check of the reversibility of the matrix in the secondary predictiom should assure that the decoder can walk in lock-in, obviously it is possible I disredgard something. > In any case, SEE/SSE is about using a probability as context. I try to re-read with this in mind.
    190 replies | 8940 view(s)
  • xezz's Avatar
    3rd April 2020, 16:04
Removed cidx. Speed and ratio are sometimes better than v4.
    13 replies | 27466 view(s)
  • SolidComp's Avatar
    3rd April 2020, 16:01
    SolidComp replied to a thread Zstandard in Data Compression
Jyrki, everything you said is correct, but you're missing something. One of my core ideas here is a prospective dictionary. Your dictionary is retrospective — it was generated based on old or existing HTML source. As a result, it doesn't support modern and near future HTML source very well. A prospective dictionary has two advantages:
1. Efficient encoding of modern and foreseeable near future syntax.
2. The ability to shape and influence syntax conventions as publishers optimize for the dictionary – a feedback loop.
In reality a prospective dictionary of the sort I'm advocating would be a combination of prospective and retrospective. The tests you're asking for are nearly impossible since by definition a prospective dictionary isn't based on existing source files, but rather is intended to standardize and shape future source files. The kind of syntax a good prospective dictionary could include are things like:
<meta name="viewport" content="width=device-width"> (an increasingly common bit of code)
<link rel="dns-prefetch" href="https://
<link rel="preconnect" href="https://
<link crossorigin="anonymous" media="all" rel="stylesheet" href="https://
These are all nice and chunky strings, and they all represent syntax that could be standardized for the next decade or so. If they were present in the brotli or Zstd dictionary, publishers would then optimize their source to include these strings in this exact standardized form. What I mean is having the rel before the href and so forth. Note that these are just a few examples. A dictionary built this way would be a lot more efficient than the status quo.
    400 replies | 124693 view(s)
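If anyone wants to measure this, the strings from the post can be concatenated into a raw content dictionary and fed straight to zstd (a hand-rolled sketch, not an endorsed workflow; whether it helps on real pages has to be benchmarked, as discussed above):

#include <zstd.h>
#include <string>
#include <utility>
#include <vector>

// Concatenate the proposed boilerplate into a raw content dictionary.
// Putting the most common strings last keeps them at the cheapest offsets
// (an assumption worth verifying by measurement).
std::string buildProspectiveDict() {
    const char* snippets[] = {
        "<link crossorigin=\"anonymous\" media=\"all\" rel=\"stylesheet\" href=\"https://",
        "<link rel=\"dns-prefetch\" href=\"https://",
        "<link rel=\"preconnect\" href=\"https://",
        "<meta name=\"viewport\" content=\"width=device-width\">",
    };
    std::string dict;
    for (const char* s : snippets) dict += s;
    return dict;
}

// Compressed sizes of one HTML page without and with the dictionary.
std::pair<size_t, size_t> compareSizes(const std::string& html, int level = 19) {
    std::vector<char> out(ZSTD_compressBound(html.size()));
    size_t plain = ZSTD_compress(out.data(), out.size(),
                                 html.data(), html.size(), level);
    std::string dict = buildProspectiveDict();
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t withDict = ZSTD_compress_usingDict(cctx, out.data(), out.size(),
                                              html.data(), html.size(),
                                              dict.data(), dict.size(), level);
    ZSTD_freeCCtx(cctx);
    return {plain, withDict};
}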
  • suryakandau@yahoo.co.id's Avatar
    3rd April 2020, 14:45
My assumption is that you have compiled the source code on Linux... The question is why the same source code compiled with a different compiler can cause the decompressed file to be corrupted :confused:
    449 replies | 110390 view(s)
  • suryakandau@yahoo.co.id's Avatar
    3rd April 2020, 11:59
Do you use my binary compiled with Dev-C++?
    449 replies | 110390 view(s)
  • suryakandau@yahoo.co.id's Avatar
    3rd April 2020, 11:25
How do I use Colab?
    449 replies | 110390 view(s)
  • Sportman's Avatar
    3rd April 2020, 09:43
    Sportman replied to a thread 2019-nCoV in The Off-Topic Lounge
    Only 3 months ago: https://promedmail.org/promed-post/?id=20191230.6864153 https://www.scmp.com/news/china/politics/article/3044050/mystery-illness-hits-chinas-wuhan-city-nearly-30-hospitalised https://www.scmp.com/news/china/politics/article/3044207/china-shuts-seafood-market-linked-mystery-viral-pneumonia
    36 replies | 2105 view(s)
  • byronknoll's Avatar
    3rd April 2020, 08:53
    byronknoll replied to a thread cmix in Data Compression
    It works for me. Here is a Colab where you can see it compress+decompress to the same md5: https://colab.research.google.com/drive/1G4JU8QRwKZGh5YfbbheX-ZSjMgF29Ci7
    449 replies | 110390 view(s)
  • kaitz's Avatar
    3rd April 2020, 02:29
    kaitz replied to a thread Paq8pxd dict in Data Compression
You can't.
If my brain breaks then :D So in a day...
    790 replies | 289706 view(s)
  • suryakandau@yahoo.co.id's Avatar
    2nd April 2020, 22:59
I just set paq8 and paq8hp to level 6 in cmix17. I do not use a dictionary when compressing or decompressing. Here is wrtpre.cpp.
    449 replies | 110390 view(s)
  • Darek's Avatar
    2nd April 2020, 22:47
    Darek replied to a thread Paq8pxd dict in Data Compression
@Kaitz - this version looks like a good step forward. When will it be out?
    790 replies | 289706 view(s)
  • Shelwien's Avatar
    2nd April 2020, 19:50
21,819,822 enwik8.grn3-world95
21,857,586 enwik8.grn4-book1
21,377,801 enwik8.grn4-world95
    190 replies | 8940 view(s)
  • byronknoll's Avatar
    2nd April 2020, 19:35
    byronknoll replied to a thread cmix in Data Compression
    I might be able to help debug this. Did you make any modifications to cmix v17? Did you use a dictionary when compressing and decompressing? Can you post a copy of wrtpre.cpp so I can try reproducing the problem on my computer?
    449 replies | 110390 view(s)
  • xinix's Avatar
    2nd April 2020, 19:22
    xinix replied to a thread Paq8pxd dict in Data Compression
Hi kaitz,
How do I run v79 with an external dictionary? I want to try not a dynamic dictionary, but my english.dic. Thanks!
    790 replies | 289706 view(s)
  • kaitz's Avatar
    2nd April 2020, 18:19
    kaitz replied to a thread Paq8pxd dict in Data Compression
Parsing wiki, compared to v79. Used some provided enwik-specific transforms. Savings on compressing enwik8 -s8:
1. Removing some html entities, converting id, timestamps saves 16kb
2. Converting &#xxx; to utf8 saves 3kb (grouping them made things worse, as they were added to dict)
3. Reordering [[lang:xxxx to end of file saves 19kb
4. Rest small fixes
Total about 39kb. (-s15 only 26kb)
    790 replies | 289706 view(s)
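This is not kaitz's wrtpre/WIT code; just a rough sketch of what transform 2 means conceptually, i.e. replacing a decimal character reference with the UTF-8 bytes of its code point (a real preprocessing transform must also be exactly reversible, which is ignored here):

#include <cctype>
#include <cstdint>
#include <string>

// Append code point cp to out as UTF-8 (1..4 bytes).
static void putUtf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}

// Replace decimal numeric references like "&#1084;" with UTF-8 bytes;
// everything else is copied through unchanged.
std::string decodeNumericEntities(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            size_t j = i + 2;
            uint32_t cp = 0;
            while (j < in.size() && std::isdigit((unsigned char)in[j]))
                cp = cp * 10 + uint32_t(in[j++] - '0');
            if (j > i + 2 && j < in.size() && in[j] == ';') {
                putUtf8(out, cp);   // decoded reference
                i = j + 1;
                continue;
            }
        }
        out += in[i++];
    }
    return out;
}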
  • Shelwien's Avatar
    2nd April 2020, 17:39
    It looks like a history context rather than SSE. You say "We update it before to extract any information", but normally we have to decode a symbol first to update anything. Still, something like you describe can happen if we'd store information about last symbol seen in a given context. Then we can use this previously seen symbol as context for symbol counts. In any case, SEE/SSE is about using a probability as context. History contexts are related (for example FSM counter states are quantizations of history, but are quite different from probabilities), but are not compatible with interpolation and other SSE-specific tricks.
    190 replies | 8940 view(s)
  • Marco_B's Avatar
    2nd April 2020, 17:02
I am sorry to slip into my own example of a (presumable) SSE, but I am trying to understand; please tell me if what follows is a correct pattern for it. Let an alphabet be of three symbols, and for the moment let us leave aside the sharing of statistics of different contexts. At some point in the input stream we have the context "a" with the following symbol "b"; for this context the statistics, presented in counting form, for the primary prediction are
a b c
5 7 9
We update it before extracting any information and we obtain
a b c
5 8 9
which still gives the wrong symbol "c". We pass "c" to the error correction matrix of the secondary prediction for the context "a", where each element of a row shows how many times the symbol has changed into the column label. It has the form
J\I  a b c
 a   5 4 3
 b   2 3 4
 c   1 2 1
and we can see that "c" translates into the matching symbol "b". Now it is necessary to check the reversibility of the matrix, i.e. that among the symbols listed in the rows only one leads to "b", otherwise we have to code an escape symbol. In the present example, that is the case. Then we update the matrix into
J\I  a b c
 a   5 4 3
 b   2 3 4
 c   1 3 1
and the decoding process will be able to keep all the table statistics synchronized.
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    2nd April 2020, 13:16
@Romul: You're right. For example, lossless audio compressors deal with it - it's mostly incompressible, but SAC still compresses better than coders that store the low bits without modelling. (It's likely not quite "white" noise, though.) However, there's a difference between digitized analog "noise" and originally digital random data which technically fits the definition of "white noise". Only the latter falls under the "counting argument" (lossless compression transforms each data instance i of N bits into a unique string of Mi bits; if we say that compression always reduces the size of data by at least 1 bit, it means that 2^(N-1) compressed strings have to be decompressed to 2^N _unique_ data instances, which is impossible. Thus for some i values Mi>=N is true, so some data instances can't be compressed).
    41 replies | 1811 view(s)
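The parenthetical argument, written out in the standard pigeonhole form:

$$\#\{\text{strings shorter than } N \text{ bits}\} = \sum_{k=0}^{N-1} 2^k = 2^N - 1 < 2^N,$$

so no injective (losslessly decodable) mapping can send all $2^N$ inputs of length $N$ to strictly shorter outputs; for at least one input $i$, $M_i \ge N$.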
  • Romul's Avatar
    2nd April 2020, 08:53
My idea is that the so-called "white noise" is actually not so random. At least this applies to discrete white noise. There are patterns in it. https://en.wikipedia.org/wiki/White_noise
To understand all this better, there is a catastrophic lack of time.
PS: I write through an online translator, so my text may not look very correct.
    41 replies | 1811 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 04:25
    Ok I'll check back later to see the score then.
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    2nd April 2020, 03:28
> Do you have the score for Green+SSE -vs- enwik8 (or BOOK1)?
Best book1 result is 209211 (with new book1 coefs), or 212690 with old world95 coefs (default). Just started enwik8 compression, but it would take time.
But with any extra context the compression can be improved further:
flag = rc.rc_BProcess( SSE], freq], t, flag ); // 209211
flag = rc.rc_BProcess( SSE*2+(j==0)], freq], t, flag ); // 209040
flag = rc.rc_BProcess( SSE*4+(j<3?j:3)], freq], t, flag ); // 208938
Here (j==0) means that this symbol has sort rank 0, i.e. has max probability in this distribution. cidx is the symbol code with rank j.
> So the secondary prediction is made from the counts
> we have thought 'a' was coming next but 'b' actually came next.....
> it's sorta like an error correction?
Just a binary decomposition. We can arrange symbols into some kind of binary tree, then compute probabilities of branch choices (from the sum of counts of the symbols each branch leads to), and apply SSE to these.
> The bit that is revealed is what updates the secondary table?
Yes.
    190 replies | 8940 view(s)
  • suryakandau@yahoo.co.id's Avatar
    2nd April 2020, 03:20
I have tested cmix17 on wrtpre.cpp and the hash value after decompression does not match the original file. Why??? This is the hash value of wrtpre.cpp: 3FE3BD3E77A2A34869EC12FD77491EF9D0192BFA. I attach the source code and the binary of cmix17, compiled using Dev-C++.
    449 replies | 110390 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 03:09
Do you have the score for Green+SSE -vs- enwik8 (or BOOK1)?
So the secondary prediction is made from the counts: we thought 'a' was coming next but 'b' actually came next..... it's sorta like an error correction? The bit that is revealed is what updates the secondary table?
> Btw, in fact, there's a compression improvement even if SSE is only applied to the probability of only one (most probable) symbol.
Ok, hmm.
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    2nd April 2020, 02:59
> But if SSE is about taking the prediction as context,
> then the new prediction must be a full letter.....it can't use ex. a 6 bit (flag)...
The new prediction is a probability. A probability of what is for you to decide. For example, it can be the probability of the (symbol=='a') flag, or of the (symbol<'c') flag.
> If the primary model is (i.e. context|prediction) ex. abc.....
Here "prediction" is a probability distribution - a table of 256 probabilities. And we don't have a good way to use it as an SSE context as a whole. But we can split it into multiple individual predictions for symbols, like {'a':0.33, not-'a':0.67}, without taking into account other symbols. And like this we can collect secondary statistics of 'a' appearing in the context of the primary prediction {'a':0.33}.
Btw, in fact, there's a compression improvement even if SSE is applied to the probability of only one (most probable) symbol.
> then secondary model (SSE) is prediction|? ... prediction|prediction2
> where do you get the prediction (the ?) from? I'm so so so lost....
From secondary statistics in the context of the first prediction. It also helps if the secondary statistics are initialized to map the primary prediction to itself before updates.
    190 replies | 8940 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 02:37
    But if SSE is about taking the prediction as context, then the new prediction must be a full letter.....it can't use ex. a 6 bit (flag)... If the primary model is (i.e. context|prediction) ex. abc.....then secondary model (SSE) is prediction|? ...where do you get the prediction (the ?) from? I'm so so so lost.... I can only read Python, I can't read your code. I'll test it though.
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    2nd April 2020, 02:12
It's not about the actual data bits in the file at all, though it's quite ok to directly encode data bits, like paq does (in which case SSE can also be directly applied to bit probabilities). It's also possible to dynamically build any kind of bitcode for symbols, compute bit probabilities from the original byte distribution, and use these for SSE. I posted a green version with SSE: https://github.com/Shelwien/green/blob/master/green.cpp#L209
    190 replies | 8940 view(s)
  • Cyan's Avatar
    2nd April 2020, 01:48
    Cyan replied to a thread Zstandard in Data Compression
> Zstd has an excellent dictionary feature that could be leverage to create a web standard static dictionary.
We do agree. We measured fairly substantial compression ratio improvements by using this technique and applying it to general websites, achieving much better compression ratios than any other technique available today, using a small set of static dictionaries. (Gains are even better when using site-dedicated dictionaries, but that's a different topic.)
> Anyone know if a web dictionary is already being worked on?
Investing time on this topic only makes sense if at least one significant browser manufacturer is willing to ship it. This is a very short list. Hence we discussed the topic with Mozilla, even going as far as inviting them to control the decision process regarding the dictionaries' content. One can follow it here: https://github.com/mozilla/standards-positions/issues/105
Since this strategy failed, maybe working in the other direction would make better sense: produce an initial set of static dictionaries, publish them as candidates, demonstrate the benefits with clear measurements, then try to get browsers on board, possibly create a fork as a substantial demo.
The main issue is that it's a lot of work. Who has enough time to invest in this direction, knowing that even if the benefits are clearly and unquestionably established, one should be prepared to see this work dismissed with hand-waved comments because it was "not invented here"? So I guess the team investing time on this topic had better have strong relations with relevant parties, such as Mozilla, Chrome and of course the Web Committee.
    400 replies | 124693 view(s)
  • Self_Recursive_Data's Avatar
    2nd April 2020, 01:45
    So you look out for a secret code flag in the Arithmetic Code bin file ex. 1010000101111010 000 ?
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    2nd April 2020, 01:34
> Oh ya, if the tree doesn't see the whole file first, then it has uncertainty.
There's some uncertainty even if you have symbol counts for the whole file. Real files are not stationary, there're always some local deviations.
> But the way secondary modelling gets extra stats is still unclear to me.
> We take a set of primary predictions a, b, c: 33%, 10%, 57%, as context,
> and it gives us a new prediction: _?_ What does it predict?
There's no way to implement secondary estimation with the whole byte distribution as context - preserving just the symbol ranks requires log2(256!)=1684 bits of information. So SSE is always applied after binary decomposition:
a: 33 -> 1
b: 10 -> 01
c: 57 -> 001
           ^-- p=1, no need to actually encode
          ^--- p=10/(10+57)=0.149
         ^---- p=33/(33+10+57)=0.33
So we'd look up SSE and see secondary statistics for this case, then encode the (symbol=='a') flag using the SSE prediction, then update the SSE stats with the actual flag value.
> What is the observation? The predicted symbol that is revealed once transmitted?
Yes, but a bit/flag rather than a whole symbol.
    190 replies | 8940 view(s)
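A toy sketch of that flow (this is not the code in the linked green.cpp; the table size, fixed-point scale and update rate are made-up illustration values):

#include <cstdint>
#include <vector>

// Toy secondary estimation (SSE) table: quantized primary probability ->
// adaptive secondary probability. 12-bit fixed point, 4096 == 1.0.
struct SSE {
    static const int kBuckets = 64;
    std::vector<uint16_t> p2;
    SSE() : p2(kBuckets) {
        // Initialize so the secondary prediction maps the primary
        // prediction to (roughly) itself, as suggested above.
        for (int i = 0; i < kBuckets; i++)
            p2[i] = uint16_t(i * (4096 / kBuckets) + (4096 / kBuckets) / 2);
    }
    int bucket(uint32_t p1) const {
        int b = int(p1 * kBuckets >> 12);
        return b < kBuckets ? b : kBuckets - 1;
    }
    uint32_t predict(uint32_t p1) const { return p2[bucket(p1)]; }
    void update(uint32_t p1, int flag) {
        // Move the secondary estimate toward the observed flag (0 or 1).
        uint16_t& p = p2[bucket(p1)];
        int d = (flag << 12) - int(p);
        p = uint16_t(int(p) + d / 32);
    }
};

// One step of the binary decomposition from the post: the (symbol=='a') flag.
// The primary probability comes from the counts, e.g. 33/(33+10+57).
uint32_t codeFlag(SSE& sse, uint32_t count, uint32_t totalLeft, int flag) {
    uint32_t p1 = (count << 12) / totalLeft;   // primary estimate
    uint32_t p  = sse.predict(p1);             // secondary estimate
    // ...here an arithmetic coder would encode `flag` with probability p...
    sse.update(p1, flag);
    return p;
}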
  • Self_Recursive_Data's Avatar
    2nd April 2020, 01:15
> That's not true, count 0 doesn't mean
Oh ya, if the tree doesn't see the whole file first, then it has uncertainty.
But the way secondary modelling gets extra stats is still unclear to me. We take a set of primary predictions a, b, c: 33%, 10%, 57%, as context, and it gives us a new prediction: _?_ What does it predict? What is the observation? The predicted symbol that is revealed once transmitted?
    190 replies | 8940 view(s)
  • Shelwien's Avatar
    1st April 2020, 23:16
    > SEE was "originally" made for > 1) escape probabilities and for Yes, since escape is frequently encoded multiple times per symbol in PPM, so its handling is more important than any other symbol... and there's no perfect mathematical model for it. Also SEE/SSE makes it possible to use extra contexts, unrelated to normal prefix order-N contexts. For example, ppmd uses escape flag from previous symbol as SEE context. > SSE was made for using the prediction as input to get another prediction. Well, ppmd uses a linear scan to find a symbol anyway, so its obvious that its possible to encode "escape from symbol" flags for each symbol, rather than once per context. Then we can apply SEE model to that. But secondary estimation idea is helpful in any contexts really. Just that its impractical to feed it more than 2 probabilities at once, so some kind of binary decomposition has to be implemented to use SSE with non-binary statistics. > But I do wonder about SSE still. > What is the theory/reason it works is unclear. There're two main reasons: 1) Persistent skews in primary estimation (n0/(n0+n1) isn't good for approximating binomial distribution) 2) Adding extra contexts to prediction. Its also possible to take predictions from two submodels at once and feed them to 2d SSE, thus making a binary mixer equivalent. > I don't see how you can improve the unimprovable stats, they speak themselves... > 20% time c is saw, 40% b....so b get 40% smaller code given... That's not true, count 0 doesn't mean that a symbol won't appear, and 100% c doesn't mean that next symbol would be also c. There's a difference between prior and posterior probability estimations.
    190 replies | 8940 view(s)
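A sketch of the "2d SSE as a binary mixer" remark: index a secondary table by two quantized probabilities coming from two submodels (again, made-up names and constants, not code from any of the programs discussed):

#include <cstdint>
#include <vector>

// 2D secondary estimation: two quantized input probabilities select a cell
// holding an adaptive output probability (12-bit fixed point, 4096 == 1.0).
struct SSE2D {
    static const int kQ = 33;                 // quantization levels per input
    std::vector<uint16_t> cell;
    SSE2D() : cell(kQ * kQ) {
        // Start each cell from the average of its two input probabilities.
        for (int i = 0; i < kQ; i++)
            for (int j = 0; j < kQ; j++)
                cell[i * kQ + j] = uint16_t(((i + j) * 4096) / (2 * (kQ - 1)));
    }
    static int q(uint32_t p) { return int(p * (kQ - 1) >> 12); }
    uint32_t predict(uint32_t pA, uint32_t pB) const {
        return cell[q(pA) * kQ + q(pB)];
    }
    void update(uint32_t pA, uint32_t pB, int bit) {
        uint16_t& c = cell[q(pA) * kQ + q(pB)];
        int d = (bit << 12) - int(c);
        c = uint16_t(int(c) + d / 32);        // adapt toward the observed bit
    }
};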
  • Jyrki Alakuijala's Avatar
    1st April 2020, 20:13
I'm not aware of a single file of HTML, CSS, JS or SVG where Zstd compresses more than brotli. Even if brotli's internal static dictionary is disabled, brotli still wins these tests by 5% in density.
The Zstd dictionaries can be used with brotli, too. Shared brotli supports a more flexible custom dictionary format than either Zstd or normal Brotli, where the word transforms are available. It is better than any dictionary I have seen so far.
While people can read the dictionary and speculate, no one has actually made an experiment that shows better performance with a similarly sized dictionary on a general purpose web benchmark. I suspect it is very difficult to do, and certainly impossible with a simple LZ custom dictionary such as zstd's. Brotli's word based dictionary saves about 2 bits per word reference due to its more compact representation: it addresses whole words instead of being able to address in between words, combinations of consecutive words, or combinations of word fragments.
We can achieve 50% more compression for specific use cases, but the web didn't change so much that the brotli dictionary would need an update. Actually, the brotli dictionary is well suited for compressing human communications from 200 years ago and will likely work in another 200 years.
https://datatracker.ietf.org/doc/draft-vandevenne-shared-brotli-format/
    400 replies | 124693 view(s)
  • encode's Avatar
    1st April 2020, 17:43
I've done some experiments with LZ77 offset (distance) context coding. Decided to share the results + it's my first decent and publicly available "LZ77+ARI+Optimal Parsing" compressor. The previous program was LZPM v0.16 (back in 2008) - not the right or competitive implementation. For me, as a data compression author, it is a must have.
Quick results on "fp.log":
Original -> 20,617,071 bytes
NLZM v1.02 -window:27 -> 858,733 bytes
Tornado v0.6 -16 -> 807,692 bytes
ZSTD v1.4.4 -19 -> 805,203 bytes
LZMA v19.00 -fb273 -> 707,646 bytes
MComp v2.00 -mlzx -> 696,338 bytes
LZPM v0.17 -> 685,952 bytes
:_coffee:
    2 replies | 463 view(s)