Activity Stream

  • Sportman's Avatar
    Today, 01:42
    Sportman replied to a thread Brotli in Data Compression
    Input: 550.134.615 bytes, data
    Output brotli v1.0.9:
    102.266.376 bytes, 0.859 sec., -q 0
    102.266.376 bytes, 0.875 sec., -q 0 --large_window=30
    94.709.758 bytes, 1.062 sec., -q 1 --large_window=30
    94.709.758 bytes, 1.079 sec., -q 1
    88.676.457 bytes, 2.047 sec., -q 2 --large_window=30
    88.676.457 bytes, 2.062 sec., -q 2
    86.346.994 bytes, 3.094 sec., -q 3 --large_window=30
    86.520.452 bytes, 2.485 sec., -q 3
    83.379.252 bytes, 3.828 sec., -q 4
    83.240.067 bytes, 4.375 sec., -q 4 --large_window=30
    74.331.012 bytes, 7.313 sec., -q 5
    73.662.493 bytes, 8.750 sec., -q 5 --large_window=30
    73.638.704 bytes, 9.922 sec., -q 6
    73.165.996 bytes, 15.484 sec., -q 7
    73.013.647 bytes, 11.390 sec., -q 6 --large_window=30
    72.818.316 bytes, 27.063 sec., -q 8
    72.569.338 bytes, 16.953 sec., -q 7 --large_window=30
    72.528.178 bytes, 44.781 sec., -q 9
    72.275.093 bytes, 28.485 sec., -q 8 --large_window=30
    71.948.380 bytes, 56.546 sec., -q 9 --large_window=30
    62.045.211 bytes, 4:03.359 sec., -q 10
    60.195.693 bytes, 4:47.812 sec., -q 10 --large_window=30
    59.622.346 bytes, 10:28.891 sec., -q 11
    57.740.770 bytes, 11:08.765 sec., -q 11 --large_window=30
    278 replies | 88557 view(s)
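The -q N and --large_window=30 options in the benchmark above map directly onto the encoder's quality and window-log parameters. Below is a minimal sketch of the corresponding one-shot library call (assuming libbrotlienc and its public BrotliEncoderCompress API; this is not Sportman's benchmark harness):

    /* One-shot brotli compression with explicit quality/window (sketch). */
    #include <brotli/encode.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int compress_buffer(const uint8_t* in, size_t in_size, int quality, int lgwin) {
        size_t out_cap = BrotliEncoderMaxCompressedSize(in_size);
        uint8_t* out = malloc(out_cap);
        if (!out) return 1;
        size_t out_size = out_cap;
        /* quality = the CLI's -q (0..11); lgwin = log2 of the sliding window.
           The CLI's --large_window=30 corresponds to the large-window extension,
           which allows window logs beyond the default 24-bit limit. */
        if (!BrotliEncoderCompress(quality, lgwin, BROTLI_MODE_GENERIC,
                                   in_size, in, &out_size, out)) {
            free(out);
            return 1;
        }
        printf("%zu -> %zu bytes (q=%d, lgwin=%d)\n", in_size, out_size, quality, lgwin);
        free(out);
        return 0;
    }

Quality controls only the encoder's search effort, which is consistent with the steep jump in times from -q 10 upward in the results above.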
  • Sportman's Avatar
    Today, 01:41
    Sportman replied to a thread Zstandard in Data Compression
    DEL.
    465 replies | 148103 view(s)
  • Sportman's Avatar
    Today, 01:22
    Sportman replied to a thread Brotli in Data Compression
    Input: 984.801.591 bytes - data
    Output brotli v1.0.9:
    175.701.536 bytes, 1.453 sec., -q 0
    175.701.536 bytes, 1.453 sec., -q 0 --large_window=30
    161.499.206 bytes, 1.843 sec., -q 1
    161.499.206 bytes, 1.860 sec., -q 1 --large_window=30
    142.364.130 bytes, 3.468 sec., -q 2 --large_window=30
    142.364.130 bytes, 3.469 sec., -q 2
    139.425.020 bytes, 4.219 sec., -q 3
    139.410.464 bytes, 5.188 sec., -q 3 --large_window=30
    139.134.076 bytes, 7.859 sec., -q 4 --large_window=30
    139.127.955 bytes, 6.844 sec., -q 4
    123.909.231 bytes, 12.828 sec., -q 5
    123.759.363 bytes, 15.328 sec., -q 5 --large_window=30
    123.290.351 bytes, 17.813 sec., -q 6
    123.160.409 bytes, 20.594 sec., -q 6 --large_window=30
    122.860.446 bytes, 29.172 sec., -q 7
    122.741.906 bytes, 32.407 sec., -q 7 --large_window=30
    122.500.887 bytes, 49.781 sec., -q 8
    122.399.956 bytes, 58.157 sec., -q 8 --large_window=30
    122.192.524 bytes, 1:14.125 sec., -q 9
    122.077.056 bytes, 2:01.484 sec., -q 9 --large_window=30
    106.141.053 bytes, 8:33.797 sec., -q 10
    104.599.029 bytes, 10:13.969 sec., -q 10 --large_window=30
    102.694.304 bytes, 21:40.672 sec., -q 11
    100.920.256 bytes, 23:22.500 sec., -q 11 --large_window=30
    278 replies | 88557 view(s)
  • JamesB's Avatar
    Yesterday, 18:55
    This neatly demonstrates the crux of the problem with the patent system. From the article: "And therein lies the rub. The Federal Circuit essentially expected a written document describing a wire attached to a hearing aid and ending with a plug with multiple prongs. What are the chances that a written reference will say this? What scientist would waste time writing an article entitled “The Benefits of Multi-Pronged Plugs for Hearing Aids”? What publisher would ever find such an article worthy of its journals? What reader would subscribe to a journal that published such banality? Yet such an article, such a simplistic, uninspiring, obvious article, is exactly what the Federal Circuit would demand to prove this claim obvious." OK, I'm not a lawyer (and the author of the article, I think, is), but it's dumb and IMO goes against the original intention of making inventiveness one of the requirements for granting a patent. I'd argue that there is a myriad of prior-art cases of known electrical devices using plugs, and I'm pretty sure MANY of them are written down or pictorially shown. The notion of putting one on a hearing aid is just an application of standard practice. Maybe it's *novel* if it's never been written down before in conjunction with that specific device, but it is not *inventive*. Anyone, even a child, let alone someone versed in that particular field, could figure that one out. It sounds to me like the Federal Circuit is basically just lawyers protecting and justifying their own turf. The entire thing stinks :(
    19 replies | 2287 view(s)
  • JamesB's Avatar
    Yesterday, 17:32
    So, this thread started off discussing the StorLeap patent (still pending), which is here: https://patents.google.com/patent/GB2538218A/en?oq=GB2538218+ Claim 4 in that states: 4. A method according to claim 1, wherein the said probability distribution table is extended with a special code to form an extended probability distribution table, wherein the said special code is used to mark a temporary switch to a secondary probability distribution table for encoding transitions or symbols in the said data stream that cannot be encoded using said probability distribution table. Now I'm no lawyer, but "switch to a secondary probability distribution table" sounds very much like "switching between different static probability models on a fragment-by-fragment basis for different fragments of symbols". Yes, there are caveats and nuances, but it's tenuous at best. I didn't believe the StorLeap patent was valid (and made filings against it based on my prior work, IIRC), but I can't see the Microsoft one flying at all.
    130 replies | 105764 view(s)
  • LawCounsels's Avatar
    Yesterday, 17:28
    Start a GoFundMe / CrowdJustice encode.su group project for a rANS defense fund. Myself + others here can help oversee it until law firms can be appointed.
    130 replies | 105764 view(s)
  • JamesB's Avatar
    Yesterday, 17:27
    Edit nm - old thread :-) Ignore me.
    130 replies | 105764 view(s)
  • Jarek's Avatar
    Yesterday, 16:25
    Indeed, these are just written-up basic possibilities and variations - from a company which seems not to have any ANS-based compressors ... and which will be forbidden for 20 years if this patent is granted. While people usually use rANS with more complex models, I don't know if anybody just switches between static models ... as for adaptiveness, LZNA does it in a more sophisticated way. And it seems to cite only patents - not noticing dozens of compressors and articles. But defending against it might be quite difficult and requires lawyers ... I don't have the resources for that. The hope is e.g. in the JPEG XL team - this patent would make it much more difficult to build hardware implementations in particular. ps. Just found some reddit: https://www.reddit.com/r/programming/comments/lzi2vt/after_being_defended_from_google_now_microsoft/
    130 replies | 105764 view(s)
  • Shelwien's Avatar
    Yesterday, 15:51
    Shelwien replied to a thread Brotli in Data Compression
    Here I just did (mingw gcc10.2/SSE4)
    278 replies | 88557 view(s)
  • JamesB's Avatar
    Yesterday, 15:28
    The problem is one of inventiveness. There's just no check for this any more, it seems. Is switching between different static probability models in an entropy encoder a new thing? Most definitely not. Has anyone explicitly done it yet with rANS as the entropy encoder of choice? Maybe, maybe not. *But it's not an invention*! It's the totally bonkers approach of "if I'm the first to write it down then I'm the inventor", without any validation of novelty or obviousness to those skilled in the art. It's like someone invented a new composite material, let's call it bolix, and then instantly people patent chairs made of bolix, tables made of bolix, the application of bolix in car dashboards, bolix-coated umbrella stands, etc. Sure, it's maybe new, but every case is just something where someone has used, say, aluminium, wood, paper, or whatever, with bolix swapped in. Blatantly obvious... indeed "patently obvious". Here we just get a bun fight over everything we've done before using arithmetic coding, only with *ANS swapped in instead. Grrr, rant over. I'm with Jarek on this though, it's utterly demoralising and depressing for inventors to try and fight the patents. I tried with some of my work and it's simply whack-a-mole. When you see companies getting 10s of millions of pounds in venture capital and then spewing 100s of patents left, right and centre, you know you'll never keep up. It's foolish to try. Instead let other companies have that battle. Someone with lots of cash will want to use rANS in a similar style at some point, and then they can prove that this patent isn't novel. Meanwhile the rest of us just suffer.
    130 replies | 105764 view(s)
  • Sportman's Avatar
    Yesterday, 15:13
    Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical: https://arxiv.org/abs/2103.03443 https://arxiv.org/pdf/2103.03443.pdf Code: https://github.com/FPSG-UIUC/lotr News: https://www.theregister.com/2021/03/08/intel_ring_flaw/
    32 replies | 6938 view(s)
  • Shelwien's Avatar
    Yesterday, 15:12
    > What's a "stream"? Is it a thread?
    A set of serially dependent instructions (usually multiple such sets are interleaved). In the compression case it means that we can generate instructions for encoding/decoding of multiple independent data streams, and a _single_ CPU core would be able to process them in parallel (with many restrictions, but still). This is actually the main cause of the difference in speed between "new-generation codecs" like zstd, oodle, brotli and older codecs.
    > Does it matter that the 7820X has four memory channels? Well, are you using all four?
    Well, I had to write this benchmark to actually find out, and apparently it's more complicated than that.
    > This reminds me, have you looked into what Apple is doing with the memory management on their new M1 chip?
    I don't have access to any M1, so no. If you mean "unified memory" - that's just using the same memory both for CPU and GPU. Normally data has to be explicitly copied to GPU memory, which is of course much slower. Latency also seems relatively low, like a PC's L1 until 8MB, then 3x faster than PC RAM: https://images.anandtech.com/doci/16252/latency-m1.png (https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested). https://lemire.me/blog/2021/01/06/memory-access-on-the-apple-m1-processor/
    > shouldn't be that big to enable a 15 Watt chip to beat a 65 Watt,
    Well, anandtech measured 31W at full load.
    > I read maybe on Charles' blog that these systems have more latency than PC DDR4 setups.
    Something like this? https://cbloomrants.blogspot.com/2017/03/kraken-perf-with-simultaneous-threaded.html Those are times though, not cycles.
    > Should they have more parallelism?
    They might... but they tend to have less cache compared to the PC stack of L1-L2-L3.
    > I never understood why graphics RAM had more latency,
    The RAM itself should be somewhat faster than DDR4, but there are lots of other parts that affect the latency, like bus design and the memory controller (and its clock).
    3 replies | 197 view(s)
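To make the "independent streams on one core" point above concrete, here is a toy C sketch (purely illustrative, not code from zstd/oodle/brotli): each stream is its own serial dependency chain, so interleaving two of them in one loop lets an out-of-order core overlap their work.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-stream state for some bytewise decoder. */
    typedef struct { const uint8_t* in; uint32_t state; } stream_t;

    static inline uint8_t step(stream_t* s) {
        /* one serially dependent step: the next step needs the updated state */
        uint8_t b = *s->in++;
        s->state = s->state * 0x9E3779B1u + b;     /* stand-in for a real decode step */
        return (uint8_t)(b ^ (s->state >> 24));
    }

    /* Two independent streams interleaved in one loop: the CPU can run
       step(&a) and step(&b) concurrently because their dependency chains
       never touch, even though this is a single thread on a single core. */
    void decode2(stream_t* a, stream_t* b, uint8_t* out_a, uint8_t* out_b, size_t n) {
        for (size_t i = 0; i < n; i++) {
            out_a[i] = step(a);
            out_b[i] = step(b);
        }
    }

Shelwien's benchmark further down in this stream (the s0..s3 loop) measures how many such chains a single core can actually sustain before memory becomes the bottleneck.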
  • schnaader's Avatar
    Yesterday, 11:59
    The concept of what3words offers a possible solution. They convert locations of size 3 meter x 3 meter to a 3-word combination "address". This way, the address is human-readable, can be memorized more easily and is less error-prone because of the redundancy. The disadvantage for your use case might be that these addresses are not unique, e.g. a building can have multiple addresses depending on which location you choose (e.g. different entrances). Then again, the main principle can be applied to other hashing solutions as well, giving slightly longer word codes instead of cryptic alphanumeric ones.
    6 replies | 122 view(s)
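The core trick can be sketched as a mixed-radix encoding of a grid-cell index over a fixed wordlist; a toy C version follows (this is not what3words' actual mapping, which is proprietary and deliberately shuffled, and the wordlist and its size here are assumptions):

    #include <stdint.h>

    #define WORDS 4096                       /* hypothetical wordlist size; 4096^3 ~ 6.9e10 cells */
    extern const char* wordlist[WORDS];      /* assumed to be supplied elsewhere */

    /* cell index -> three words (least significant "digit" first) */
    void cell_to_words(uint64_t cell, const char* out[3]) {
        out[0] = wordlist[cell % WORDS];  cell /= WORDS;
        out[1] = wordlist[cell % WORDS];  cell /= WORDS;
        out[2] = wordlist[cell % WORDS];
    }

    /* three word indices -> cell index (the inverse mapping) */
    uint64_t words_to_cell(uint32_t i0, uint32_t i1, uint32_t i2) {
        return (uint64_t)i0 + (uint64_t)i1 * WORDS + (uint64_t)i2 * (uint64_t)WORDS * WORDS;
    }

The redundancy schnaader mentions comes from the wordlist itself: most typos produce a string that is not on the list at all, so they are detected rather than silently accepted.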
  • Jon Sneyers's Avatar
    Yesterday, 09:46
    Well yes, it's not about rANS itself, but of course you typically want to do rANS with some kind of context modelling, and it does look like they're claiming they invented that. At least that's how I understand claim 1: "do rANS but with a context model". There should be examples of prior art for this. The pik code that was published in 2017-2018 for example, before it became part of jxl.
    130 replies | 105764 view(s)
  • SolidComp's Avatar
    Yesterday, 06:54
    What's a "stream"? Is it a thread? Does it matter that the 7820X has four memory channels? Well, are you using all four? I assume the U has only two since it's a laptop chip. This reminds me, have you looked into what Apple is doing with the memory management on their new M1 chip? I read some mention of it being novel or groundbreaking, that because of their approach they're able to get crazy performance from just 8 GB of RAM. I assume it's a SoC since the RAM is LPDDR4X or something. Do they have less latency with their setup than a normal x86-64 PC? I was amazed by that Rocket Lake preview on AnandTech. That thing consumes so much power for what it does. I'm very confused by the M1's performance given that it's only 15 watts or so. 5nm's advantage over 14nm shouldn't be that big to enable a 15 Watt chip to beat a 65 Watt, so there must be more going on. What about the new consoles with the unified or shared GDDR6? I read maybe on Charles' blog that these systems have more latency than PC DDR4 setups. Should they have more parallelism? I never understood why graphics RAM had more latency, and it suggests that integrated GPUs like the Intel Xe should have a latency advantage if they have any RAM like the Iris Pros did.
    3 replies | 197 view(s)
  • Trench's Avatar
    Yesterday, 05:14
    xinix: The file is big enough. I understand your points, but you seem to not understand mine. But it's OK. Gotty: I stated a simple fact with a question, which you verified and answered, so OK. I do not program anymore and have mostly forgotten it, and don't care to do it much, but I understand how it works and know how to edit some code. I get annoyed trying to find the errors (or typos) in code even in basic things like Excel, so no thanks. LOL. I am just not into the details, which restrict, and the other issues that come with them. I also do not state a lot of things, to let people think about it or believe what they like. Giving answers does not help bring innovation. Kind of like the saying "if you give a man a fish you feed him for a day; if you teach him how to fish you feed him for a lifetime." A skill is like a double-edged sword: math and programming are beneficial but can also restrict, in other ways, how one goes after a solution. If you ask various professionals how to target a problem, they will go at it from their own experience or profession. Even with a math background, every mathematician can give a different answer at times; no one is alike, even with the lotto example. As I stated other times, a programmer is like a translator, and a translator usually does not create top-selling books but translates them into other languages, while a novelist does not translate the book into other languages but relies on other professions. Many programmers have an ego problem and think they are creative, when their job is not so much artistic creation as it is restrictive, having to follow rules and be precise. Like in all professions, people do not have the time or ability to be a master of everything. The brain does not work like that, and you should know the odds of that, being in math. I am not saying I am right or you are wrong; I am saying step back and try it another way, as a reminder at times. I understand the math of your game, but you put more effort into explaining it than I did, and yes, it is a simple calculation no matter what amount is used. I did state an error before, saying 50% where it should be 33%. I guess no one understood it enough to correct it, or maybe you did before with your other example. But while with a small example 33% may work as shown, when dealing with a real-world example 50% seems to work better. What you stated is fine and proper with metadata etc., but sometimes you have to start fresh and impose restrictions to force yourself to find something else. You can't find something new by doing the same methods, especially when we agree that the method has reached a limit. There is a reason it's called thinking outside the box: you have to look elsewhere to solve something than the conventional way, which is easier said than done, and sometimes people need a reminder. Sometimes just ignore everything and start anew, with a new way of seeing things, to find the solution.
    22 replies | 718 view(s)
  • SolidComp's Avatar
    Yesterday, 04:33
    Gotty, have you thought of ways to make the addressing system more robust and reliable? For example, there should be ways to design address formats, street names, and numbering schemes to make it less error-prone. There may not be a lot we can do about root street names, since governments and developers want to be able to name streets after people and events and plants or whatever, and those strings will be arbitrary with respect to robustness. But the numbering scheme can be designed. The way addresses are conveyed, or even thought of, can be designed and normalized. We recently got mail from a prisoner at some California prison. It was addressed to Canada, which is still a different country than the US. The only thing our address had in common with the Canadian address was the three-digit house number. That was incredibly strange. 1B/1b should be ruled out as a possible building number by any decent system, since its fragility is obvious. The letter B is excluded in all the character sets I've designed, because of its similarity to 8 and 6 and 13. My typical set consists of about 20 or so characters: A C D E F H K L M N P Q R T X Y 6 7 8 9 That would be for a strictly all-caps system, so L and N are okay (n is too close to h and r). It also depends on whether it's something we expect people to ever write by hand. With addresses, that still seems likely to happen sometimes, so we'd probably need to tighten it up. There's also the issue of typical font designs, and machine readability. The letter J, for example, is surprisingly timid in most fonts, with just a tiny, subtle descender. There are lots of interesting systems that prune a few of the ambiguous characters, like the Base58 that Satoshi Nakamoto developed for Bitcoin addresses.
    6 replies | 122 view(s)
  • Gotty's Avatar
    Yesterday, 02:38
    Aha, I see now. Thank you very much for the explanation! >>On address quality, yes that's a big issue with commercial mailers. Let's ask someone who is an insider in the postal system: how big is this issue? You are lucky ;-) I'm such an insider (not for the US postal system, however). Addressing quality is terrible. I mean: TERRIBLE. A true fact: you can misspell a city's name in 23 different ways. A few of its misspellings make it difficult to distinguish it from another city. And that's only the city name. You don't want to know about streets, street numbers and so on. It's very, very difficult to enforce canonical addresses. ((At this moment you can't.)) It's not easy to imagine the depth of the problem until you actually see the data. I mean all the data... I saw it. A couple of times I needed to find how many packages or pieces of mail were delivered to specific addresses. Well, I'm experienced and rather good at data cleansing and lookup. But these "projects" were not very successful. Want to see my findings? The data was in such bad shape and the results were so uncertain that I didn't give any numbers. My "report" was simply this: "At this moment it's not possible to determine such metrics". I had to deliver approximate numbers and I couldn't. After working through the data manually I'm confident that there is no machine learning approach that can assist you. A personal example: Just a couple of weeks ago we expected, but didn't receive, a package from a friend of ours. The correct house number is 1B (or 1b), but the handwriting was probably not too clear, so the delivery man probably read it as 18 (or 16). "Address unknown" - and ... back to sender. We had already received packages and mail from the same sender. Maybe there was a replacement delivery man that day? How many misdelivered packages are there? You don't wanna know ;-) Or do you? Guess, and then see here. The first thing is to fix the addressing system. Without that, any (other) effort is meaningless. I'm very sorry to say that.
    6 replies | 122 view(s)
  • SolidComp's Avatar
    Yesterday, 01:24
    Hi, I have a couple of purposes. One is for social science and psychology research where I'm looking at ways to tokenize and obscure personally identifiable information (PII). There are different layers to that, and tokenizing would be one. My second purpose is to develop a system for tokenizing addresses in general as a privacy-enhancing measure, after seeing so many hacks in the US (mostly by the Chinese military) that obtained huge troves of personal data on millions of Americans, all of which is being combined and mined. The basic insight here is that all these different businesses, doctors, etc. don't actually need our address – they just need to be able to communicate with us, including sending us paper mail. I was fascinated by credit card tokenization, and it seems like we should be able to tokenize pretty much anything, like addresses, phone numbers, etc. The basic framework would be that mailers and e-commerce merchants like Amazon would just have a token for your address, like V84DXA, along with your postal code since that is necessary for pre-sorting and pricing shipments, and the token would be converted to your actual address by the shipping carrier through a secure process. They would then print the necessary address info on the package or envelope – whatever info the delivery drivers or mail carriers needed for the "last mile" delivery (probably just the street address, not city or state, since the postal code would already be there, it would be pre-sorted for the drivers down to street level, and they'd want to save on ink). Yes, I need uniqueness, since the tokens have to be convertible back to (unique) addresses, and this would only work if they were unique. I don't understand your questions about characters vs. bytes. What's the difference? Can't all hashes be expressed in both forms? You mean the textual form of the hash? It has to have a printable text form, a token. Well, maybe not necessarily for the general address tokenization system, since they could just print a linear or 2D barcode on the letter or package, and that could encode anything. They're already used to printing barcodes anyway. The only point where they need any text is for the last-mile carrier to know the specific building number, etc. I've created a few different character sets for the text form of the tokens or hash, mainly to eliminate ambiguous characters. A Base32 equivalent can work, similar to Crockford's or z-base-32. Or a hexadecimal set, but composed of different characters than 0-F. "Ambiguity" operates at different levels of analysis and in different contexts, like something designed for completely naive users who don't know that certain letters or numbers are excluded, or for machine readability, etc. But my assumption is that we need a text form of the tokens, whether they are technically hashes or not. On address quality, yes, that's a big issue with commercial mailers. Addresses would be converted to canonical form first, and then tokenized based on that form. At least that's what I expect, and what I would do in my use cases. But the system might be made robust to any input. And maybe hashing would be a good way to check an address for accuracy, validity, or canonicality... In the US we use these standards and the USPS Publication 28. It's basically all caps, a certain line order, no punctuation in most cases, and a standard set of abbreviations for directions and street prefixes and suffixes, like N for North, ST for Street, HWY for Highway, etc.
    It's common for businesses and merchants to automatically present the canonical form of one's address to you for approval after you submit a form with your address in non-canonical form. The general addressing system would look something like this:
    DAVID BROWN
    8422 E MERRIWEATHER DR
    DALLAS, TX 75211-5517
    Converted to:
    DAVID BROWN
    75211 V84DXA
    6 replies | 122 view(s)
  • Sportman's Avatar
    7th March 2021, 22:43
    Sportman replied to a thread Brotli in Data Compression
    Did somebody compile a Brotli v1.0.9 Windows executable? https://github.com/google/brotli/releases
    278 replies | 88557 view(s)
  • Sportman's Avatar
    7th March 2021, 20:45
    Sportman replied to a thread Zstandard in Data Compression
    Input: 550.134.615 bytes, data
    Output zstd v1.4.9:
    92.590.444 bytes, 0.843 sec. - 0.500 sec., zstd -1 --ultra -T1
    90.892.415 bytes, 0.985 sec. - 0.531 sec., zstd -2 --ultra -T1
    90.695.852 bytes, 1.187 sec. - 0.516 sec., zstd -1 --ultra --long -T1
    88.947.977 bytes, 1.328 sec. - 0.516 sec., zstd -2 --ultra --long -T1
    86.822.223 bytes, 1.218 sec. - 0.500 sec., zstd -3 --ultra -T1
    86.817.226 bytes, 1.344 sec. - 0.515 sec., zstd -4 --ultra -T1
    85.401.646 bytes, 1.625 sec. - 0.516 sec., zstd -3 --ultra --long -T1
    85.380.924 bytes, 1.641 sec. - 0.515 sec., zstd -4 --ultra --long -T1
    83.538.083 bytes, 2.156 sec. - 0.500 sec., zstd -5 --ultra -T1
    82.246.447 bytes, 2.312 sec. - 0.516 sec., zstd -5 --ultra --long -T1
    80.272.993 bytes, 2.375 sec. - 0.485 sec., zstd -6 --ultra -T1
    79.188.211 bytes, 2.593 sec. - 0.500 sec., zstd -6 --ultra --long -T1
    77.888.214 bytes, 3.063 sec. - 0.468 sec., zstd -7 --ultra -T1
    77.683.693 bytes, 3.781 sec. - 0.469 sec., zstd -8 --ultra -T1
    77.033.654 bytes, 3.235 sec. - 0.484 sec., zstd -7 --ultra --long -T1
    76.982.033 bytes, 4.422 sec. - 0.453 sec., zstd -9 --ultra -T1
    76.830.435 bytes, 3.922 sec. - 0.484 sec., zstd -8 --ultra --long -T1
    76.733.141 bytes, 4.828 sec. - 0.453 sec., zstd -10 --ultra -T1
    76.604.831 bytes, 5.672 sec. - 0.468 sec., zstd -11 --ultra -T1
    76.234.213 bytes, 4.532 sec. - 0.484 sec., zstd -9 --ultra --long -T1
    76.171.937 bytes, 7.266 sec. - 0.453 sec., zstd -12 --ultra -T1
    76.040.602 bytes, 4.968 sec. - 0.485 sec., zstd -10 --ultra --long -T1
    75.911.832 bytes, 6.015 sec. - 0.484 sec., zstd -11 --ultra --long -T1
    75.764.674 bytes, 17.235 sec. - 0.468 sec., zstd -13 --ultra -T1
    75.606.371 bytes, 18.485 sec. - 0.453 sec., zstd -14 --ultra -T1
    75.536.752 bytes, 7.546 sec. - 0.500 sec., zstd -12 --ultra --long -T1
    75.308.212 bytes, 22.891 sec. - 0.453 sec., zstd -15 --ultra -T1
    75.195.609 bytes, 16.782 sec. - 0.469 sec., zstd -13 --ultra --long -T1
    75.025.088 bytes, 18.344 sec. - 0.500 sec., zstd -14 --ultra --long -T1
    74.714.546 bytes, 22.922 sec. - 0.500 sec., zstd -15 --ultra --long -T1
    69.794.235 bytes, 33.797 sec. - 0.468 sec., zstd -16 --ultra -T1
    68.674.180 bytes, 35.172 sec. - 0.500 sec., zstd -16 --ultra --long -T1
    64.483.007 bytes, 39.593 sec. - 0.485 sec., zstd -17 --ultra -T1
    64.078.317 bytes, 54.672 sec. - 0.500 sec., zstd -18 --ultra -T1
    63.592.913 bytes, 40.828 sec. - 0.500 sec., zstd -17 --ultra --long -T1
    63.399.155 bytes, 1:22.890 sec. - 0.485 sec., zstd -19 --ultra -T1
    63.188.456 bytes, 56.125 sec. - 0.516 sec., zstd -18 --ultra --long -T1
    62.692.187 bytes, 1:28.328 sec. - 0.516 sec., zstd -20 --ultra -T1
    62.361.863 bytes, 1:27.547 sec. - 0.531 sec., zstd -19 --ultra --long -T1
    62.095.898 bytes, 1:43.766 sec. - 0.547 sec., zstd -21 --ultra -T1
    62.078.942 bytes, 1:32.157 sec. - 0.546 sec., zstd -20 --ultra --long -T1
    61.752.973 bytes, 1:51.047 sec. - 0.547 sec., zstd -21 --ultra --long -T1
    61.480.950 bytes, 2:12.687 sec. - 0.547 sec., zstd -22 --ultra -T1
    61.480.950 bytes, 2:12.844 sec. - 0.547 sec., zstd -22 --ultra --long -T1
    465 replies | 148103 view(s)
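For reference, the command lines in the benchmark above correspond to a handful of parameters in zstd's advanced API; a minimal sketch (assuming zstd >= 1.4.0 and libzstd; not Sportman's actual benchmark setup):

    #include <zstd.h>
    #include <stdlib.h>

    /* Compress `src` roughly the way `zstd -<level> --ultra [--long]` does.
       Returns the compressed size (check with ZSTD_isError in real code);
       the caller owns *dst_out. */
    static size_t compress_like_cli(const void* src, size_t srcSize,
                                    void** dst_out, int level, int long_mode) {
        ZSTD_CCtx* cctx = ZSTD_createCCtx();
        size_t cap = ZSTD_compressBound(srcSize);
        void* dst = malloc(cap);
        /* -N sets the compression level; --ultra merely unlocks levels above 19 on the CLI */
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
        /* --long enables long-distance matching with a larger window */
        if (long_mode)
            ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
        size_t csize = ZSTD_compress2(cctx, dst, cap, src, srcSize);
        ZSTD_freeCCtx(cctx);
        *dst_out = dst;
        return csize;
    }

The second, roughly constant ~0.5 sec. figure in each row above is presumably the decompression time, which is why it barely moves with the level: the level and --long mostly change encoder-side search effort and window size.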
  • Gotty's Avatar
    7th March 2021, 20:08
    Gotty replied to a thread cmix in Data Compression
    It looks like you have an older cmix source. Your error message indicates that you have a source where the ByteModel is in use. But Byron removed the ByteModel a year ago. Please download the current source: https://github.com/byronknoll/cmix
    453 replies | 126281 view(s)
  • innar's Avatar
    7th March 2021, 19:51
    Finally... Ubuntu 18.04.5 LTS @ Intel Xeon CPU E5-2690 v3 @ 2.60GHz, 512GB RAM
    ./cmix -c coronavirus.fasta.un-wrapped coronavirus.fasta.un-wrapped.cmix
    1317937667 bytes -> 866379 bytes in 1995375.27 s. cross entropy: 0.005
    78 replies | 5821 view(s)
  • Gribok's Avatar
    7th March 2021, 19:29
    Asking for help. I do not have access to the latest AMD (Zen 2/3) or ARM64 (Apple M1) CPUs, so I would appreciate it if folks could help and benchmark libsais vs divsufsort on these CPU microarchitectures.
    7 replies | 888 view(s)
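A rough single-file timing harness for the comparison Gribok is asking for might look like the sketch below. The entry points are written from memory and should be treated as assumptions - check divsufsort.h and libsais.h for your versions (newer libsais releases add an extra `freq` output argument).

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <divsufsort.h>
    #include <libsais.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char** argv) {
        (void)argc;
        /* error handling omitted for brevity: argv[1] is the input file */
        FILE* f = fopen(argv[1], "rb");
        fseek(f, 0, SEEK_END); long n = ftell(f); fseek(f, 0, SEEK_SET);
        uint8_t* T  = malloc((size_t)n);
        int32_t* SA = malloc(sizeof(int32_t) * (size_t)n);
        fread(T, 1, (size_t)n, f); fclose(f);

        double t0 = now_sec();
        divsufsort(T, SA, (int32_t)n);   /* reference suffix array construction */
        double t1 = now_sec();
        libsais(T, SA, (int32_t)n, 0);   /* 0 = no extra free space (assumed signature) */
        double t2 = now_sec();

        printf("n=%ld  divsufsort: %.3f s  libsais: %.3f s\n", n, t1 - t0, t2 - t1);
        return 0;
    }

Running it on a few large, varied inputs and reporting both times (plus the CPU model) would already answer the question.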
  • suryakandau@yahoo.co.id's Avatar
    7th March 2021, 18:37
    there is an error while compiling cmix v18 on my laptop
    453 replies | 126281 view(s)
  • Jarek's Avatar
    7th March 2021, 17:47
    Jyrki, the situation here is more complex - this patent covers natural modifications which, while maybe not very practical for software implementations, might bring improvements for hardware. These basic variations would probably have been explored if AV1 had remained with rANS; maybe you have some prior art to submit there from that development. Otherwise, it seems there will be a 20-year minefield preventing anyone from going beyond the basic approach ... e.g. for hardware optimizations of JPEG XL implementations.
    19 replies | 2287 view(s)
  • Gotty's Avatar
    7th March 2021, 16:54
    Gotty replied to a thread cmix in Data Compression
    Here it is.
    453 replies | 126281 view(s)
  • Jyrki Alakuijala's Avatar
    7th March 2021, 16:51
    How Amazon got a patent on white-background photography
    19 replies | 2287 view(s)
  • suryakandau@yahoo.co.id's Avatar
    7th March 2021, 16:36
    @shelwien how do I compile cmix v18 using a batch script? Thank you
    453 replies | 126281 view(s)
  • choochootrain's Avatar
    7th March 2021, 16:31
    choochootrain replied to a thread Zstandard in Data Compression
    The file inside zstd-v1.4.9-win64.zip is another zip, it just doesn't have an extension
    465 replies | 148103 view(s)
  • Gotty's Avatar
    7th March 2021, 15:17
    It looks like the patent is not patenting rANS itself. I believe rANS is safe. Reading the patent text, they could have chosen any entropy coding mechanism, like arithmetic coding; they just happened to choose rANS. ((I'm not defending the patent. I also believe that patenting is silly.)) See these quotes: ((The patent is all about these 5 "innovations" (a better word would be "enhancements").)) See? They are not "reinventing" rANS but extending rANS so that it is capable of the above desired features. It's not really about the entropy coding itself, it's more about modelling.
    130 replies | 105764 view(s)
  • mpais's Avatar
    7th March 2021, 13:50
    Here's an interesting article that provides useful hints for this challenge: Learning the language of viral evolution and escape
    78 replies | 5821 view(s)
  • necros's Avatar
    7th March 2021, 13:43
    So flexigif is still not safe to use right?
    42 replies | 7727 view(s)
  • Jarek's Avatar
    7th March 2021, 12:21
    Microsoft has filed a large, general patent for rANS in 2019: https://patents.google.com/patent/US20200413106A1 Please check whether it interferes with JPEG XL; if so, help prevent it.
    19 replies | 2287 view(s)
  • Sportman's Avatar
    7th March 2021, 06:46
    Sportman replied to a thread Zstandard in Data Compression
    I used WinRAR to unzip it, and it only showed a single file without a file extension instead of files and folders. I thought it was a Linux file or a corrupt archive, so I tried to click it, and it behaved as a folder, showing the files and folders that would normally appear at the root of the archive, and I could extract zstd.exe (extracting all files and folders from that file-that-acts-as-a-folder gave no error).
    465 replies | 148103 view(s)
  • SolidComp's Avatar
    7th March 2021, 04:00
    This is a depressing reminder of how terrible our legal system is in the US and in similar countries. This patent system is insanely bad. We let people just think stuff up and write it down, and BOOM, patent. Then they become patent predators, née trolls, and stalk and shake down random humans/businesses for tons of money when they deserve nothing. With these corporations patenting these algorithms and vague ideas and abstractions, we obviously need an epistemic and legal framework that we don't have. If there's a reasonable framework for software patents, I'm not sure if anyone has developed it, and we don't employ it. Nothing should be patentable unless it's in a product offered in the marketplace or used to competitive advantage internally. No more of this nonsense of just thinking stuff up and writing it down. Obviously something already invented by others should not be patentable, so these Google and Microsoft patents should not be possible. The fact that they're happening means we don't have the right framework, training, or domain experts handling these applications. I think there are lots of potential solutions involving crowdsourcing patent approvals. Relying on slow and non-expert government bureaucrats is ridiculous. There are some interesting possibilities that would provide financial rewards to people for examining patent applications tied to the quality of the patent, which could be measured by various means over time. We could even have a betting market on patents, designed in lots of different ways. Right now patents are a huge energy, time, and financial drain. It's such a distraction from productive work to have people at Google writing these ridiculous patents, lawyers filing them, government employees reading them, original inventors needing to challenge them, lawyers challenging them, judges ruling on them, lawyers trying to trick jurors about them, jurors trying to judge them, etc. Our jury system is so embarrassing and stupid. Americans have become too distracted by stupid ideologies, and are missing how deeply flawed our legal system is across the board, and how much it costs us in injustice and dollars. It's become a lottery for a lot of them, but with much better odds than actual lotteries.
    130 replies | 105764 view(s)
  • SolidComp's Avatar
    7th March 2021, 03:31
    SolidComp replied to a thread Zstandard in Data Compression
    I don't understand what you're saying, but the Win64 zip file seems to be corrupt. It just unzips into a weird blob that doesn't do anything. Is that what you're referring to?
    465 replies | 148103 view(s)
  • Shelwien's Avatar
    7th March 2021, 03:28
    Anandtech posted a nice picture about cache latency. For comparison, reg-reg multiplication has a 3clk latency.
    3 replies | 197 view(s)
  • Sportman's Avatar
    7th March 2021, 00:26
    Sportman replied to a thread Zstandard in Data Compression
    It's as if a single extensionless file contains the root folder content, instead of the content being directly at the root of the archive, as with the 1.4.5-1.4.8 win64 zip files.
    465 replies | 148103 view(s)
  • Sportman's Avatar
    7th March 2021, 00:18
    Sportman replied to a thread Zstandard in Data Compression
    Input: 984.801.591 bytes - data
    Output zstd v1.4.8:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.904.965 bytes, 3.375 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.708.829 bytes, 3.547 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.141 sec., zstd -4 --ultra -T1
    142.896.262 bytes, 4.063 sec., zstd -3 --ultra --long -T1
    142.869.673 bytes, 4.125 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.437 sec., zstd -5 --ultra -T1
    137.671.146 bytes, 5.187 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.735 sec., zstd -6 --ultra -T1
    134.660.677 bytes, 5.625 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.552.776 bytes, 7.125 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.547 sec., zstd -8 --ultra -T1
    129.090.886 bytes, 8.328 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.687 sec., zstd -9 --ultra -T1
    128.210.617 bytes, 9.485 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.328 sec., zstd -10 --ultra -T1
    128.056.308 bytes, 10.312 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.812 sec., zstd -11 --ultra -T1
    127.940.684 bytes, 12.109 sec., zstd -11 --ultra --long -T1
    127.445.750 bytes, 14.828 sec., zstd -12 --ultra --long -T1
    127.422.579 bytes, 12.656 sec., zstd -12 --ultra -T1
    126.935.545 bytes, 28.125 sec., zstd -13 --ultra --long -T1
    126.802.723 bytes, 30.359 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.766 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.734 sec., zstd -14 --ultra -T1
    126.499.068 bytes, 37.922 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.860 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.344 sec., zstd -16 --ultra -T1
    117.643.674 bytes, 1:22.531 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.234 sec., zstd -17 --ultra -T1
    112.773.080 bytes, 1:33.485 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:11.281 sec., zstd -18 --ultra -T1
    111.880.990 bytes, 2:16.656 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:46.016 sec., zstd -19 --ultra -T1
    110.962.579 bytes, 2:57.359 sec., zstd -19 --ultra --long -T1
    111.041.788 bytes, 2:55.297 sec., zstd -20 --ultra -T1
    110.644.739 bytes, 3:04.047 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:25.172 sec., zstd -21 --ultra -T1
    110.188.109 bytes, 3:39.516 sec., zstd -21 --ultra --long -T1
    109.903.935 bytes, 3:52.750 sec., zstd -22 --ultra -T1
    109.903.935 bytes, 3:52.719 sec., zstd -22 --ultra --long -T1
    Output zstd v1.4.9:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.930.990 bytes, 1.969 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.731.879 bytes, 2.141 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.156 sec., zstd -4 --ultra -T1
    142.942.060 bytes, 2.656 sec., zstd -3 --ultra --long -T1
    142.912.214 bytes, 2.687 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.453 sec., zstd -5 --ultra -T1
    137.698.236 bytes, 3.781 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.781 sec., zstd -6 --ultra -T1
    134.685.348 bytes, 4.188 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.541.732 bytes, 5.687 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.563 sec., zstd -8 --ultra -T1
    129.079.778 bytes, 6.890 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.703 sec., zstd -9 --ultra -T1
    128.185.107 bytes, 8.016 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.344 sec., zstd -10 --ultra -T1
    128.030.897 bytes, 8.860 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.859 sec., zstd -11 --ultra -T1
    127.915.469 bytes, 10.719 sec., zstd -11 --ultra --long -T1
    127.422.579 bytes, 12.703 sec., zstd -12 --ultra -T1
    127.415.784 bytes, 13.422 sec., zstd -12 --ultra --long -T1
    126.907.957 bytes, 26.703 sec., zstd -13 --ultra --long -T1
    126.775.263 bytes, 28.954 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.750 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.735 sec., zstd -14 --ultra -T1
    126.467.698 bytes, 36.609 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.812 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.329 sec., zstd -16 --ultra -T1
    117.640.741 bytes, 1:21.078 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.187 sec., zstd -17 --ultra -T1
    112.767.845 bytes, 1:32.015 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:10.438 sec., zstd -18 --ultra -T1
    111.878.283 bytes, 2:14.078 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:45.672 sec., zstd -19 --ultra -T1
    111.041.788 bytes, 2:55.078 sec., zstd -20 --ultra -T1
    110.958.922 bytes, 2:55.203 sec., zstd -19 --ultra --long -T1
    110.691.009 bytes, 3:01.454 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:24.843 sec., zstd -21 --ultra -T1
    110.190.299 bytes, 3:36.609 sec., zstd -21 --ultra --long -T1
    109.900.896 bytes, 3:49.813 sec., zstd -22 --ultra -T1
    109.900.896 bytes, 3:49.844 sec., zstd -22 --ultra --long -T1
    465 replies | 148103 view(s)
  • mitiko's Avatar
    7th March 2021, 00:04
    Oh, that type of a transform... Well, static dictionaries support all types of patterns (like the ones described in the post) and I'm planning on doing this in one way or another. What they're doing with LZ77 + static dictionaries is pretty ingenious actually! Wait, that's actually pretty clever for a fast encoder! Anyway, that's not the point of BWD. I'm trying to preserve the context of the text, the information it holds. After LZ77 and such pattern-y words, the rest of the compression pipeline loses context and it will predict worse. But if we instead create a whole new context based on a pattern, it will compress better and the original stream will be more compressible as well. Patterns are great, they just seem a bit too deterministic right now - for every pattern/dictionary transform that you can imagine, you need to write its own ranking function and extraction function and then handle its serialization to the dictionary, etc. It'd be amazing if we found a way to get the system to generate patterns, or find them in the dictionary after the words have been chosen. For example, if the dictionary is calculated to be something like:
    "the next"
    "the following"
    "the lion"
    "the fox"
    maybe some independent system can see the pattern and generate the word "the (*)", which will then match all of the above and also split the stream into 2 - the main stream + the stream of the pattern. The stream of the pattern would then look something like "next;following;lion;next;fox;fox;fox;next...", which is very compressible. In practice this would suit some grammar rules well (essentially making the grammar information that text holds known, so zero bytes to transmit), like "(*)ing" or "(*)ed". With patterns, we get into this whole idea of context splitting and start to wonder whether this is possible in the general case. I have some notes, if you'd like to look at them.
    17 replies | 898 view(s)
  • Jarek's Avatar
    6th March 2021, 23:26
    Thanks, but I am not accepting that the only "award" for ANS is getting to defend it for the rest of my life from successive megacorporations. Corporations using it: instead of deploying your own "defensive" legal mines, please just prevent this minefield from being built. Especially since I haven't heard of any ANS-based compressor from Microsoft (?)
    130 replies | 105764 view(s)
  • SolidComp's Avatar
    6th March 2021, 22:19
    Brotli has a seemingly novel set of 120 dictionary transforms. You can read a bit about them in this recent Cloudflare post.
    17 replies | 898 view(s)
  • JamesWasil's Avatar
    6th March 2021, 19:10
    Nope. Only semantic rewording of your work, and fabricated "big tech innovations" for Microsoft to compete with Google via bogus patent claims on what was intended to be free with due credit. This is why we can't have nice things, and why development and certain types of information sharing and discoveries are no longer offered in good faith like they were before. Corporations, by majority, have ruined good-faith and academic contributions for the sake of exclusive profiteering, even against the inventor's wishes. It says a lot about the companies and the state of the world when they have to rip off the author, take credit and make money on work released freely, rather than developing their own UNIQUE methods, even if those would yield nearly the same performance or results. Cookie-cutter world now. Question: Should Jarek make a global GoFundMe for legal help to fight this, so that lawyers whom Google, Microsoft, and other big tech don't yet have in their pocket might actually help? I would assume that if Marcus Hutter could throw 50k at the Hutter Prize, he and others might be willing to donate to a Jarek rANS/tANS defense fund?
    130 replies | 105764 view(s)
  • mitiko's Avatar
    6th March 2021, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12373 view(s)
  • mitiko's Avatar
    6th March 2021, 17:13
    Thanks for looking through the code. I knew I had to eventually remove LINQ, but I don't think I'm using it in any very critical places. Right now, most of the work seems to be in counting the words and ranking. Thanks for the heads-up on foreach loops, I didn't know they were more expensive than for's. The architecture is pretty OK for now, it's like LINQ but async (using IAsyncEnumerable). I've thought it might be slowing me down but it's actually not. It processes data in blocks and I'm testing it for optimality with the block size being the full file size, so all the data only travels the pipeline once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast. What the rank (as per the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n=word size, s=occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way. It will prefer:
    "the" -> rank=2000
    "rabbit" -> rank=175
    "lion" -> rank=105
    "and" -> rank=70
    "fox" -> rank=70
    total = 2420
    The ranks here are only from the top of my head. The non-greedy variation you have in mind looks something like:
    "the fox and the rabbit and the lion" -> rank=1225
    "the" -> rank=2000
    total = 3225
    I have actually thought a lot about this and it's not a problem of the ranking function itself, at least not its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree. The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:
    A -- AT
    T -- A
    C -- T
    G -- C
         G
    Then we can do parsing before each new word calculation and we won't have the constraint of words not spanning multiple blocks. I've excluded this as a good idea because 1) we have to do parsing at each word and only re-parsing doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically, ranking becomes at most x times slower if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree compared to a more spanning tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it. We need a more spanning tree. So maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well. A couple of assumptions: Right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can re-rank only some.
    If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once for similar words. What we can do, for sure, is re-count only some words. I'm working on that still. I have a draft, I just have to fix some minor indexing issues. The reason I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique used is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree. For sure, re-ranking and re-counting + the fragmentation fix will create a great compressor with acceptable compression times. But I'm looking to improve BWD in the direction of optimality at the cost of compression time, rather than trading memory for time. When I'm done, I'll definitely try to implement one with this stack!
    17 replies | 898 view(s)
  • CompressMaster's Avatar
    6th March 2021, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
    Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was only recently patched in Win10 had existed since Windows 3.11.
    192 replies | 137311 view(s)
  • Jarek's Avatar
    6th March 2021, 15:40
    I was just pointed to a 2019 Microsoft patent for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find anything novel here?
    130 replies | 105764 view(s)
  • LucaBiondi's Avatar
    6th March 2021, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM and often that is not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 625938 view(s)
  • Gotty's Avatar
    6th March 2021, 13:11
    Gotty replied to a thread paq8px in Data Compression
    Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and for that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to see better gains. That costs time and memory. The gain is usually not too great, as the existing models already cover the possibilities quite well. We really do sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 625938 view(s)
  • hexagone's Avatar
    6th March 2021, 07:21
    hexagone replied to a thread paq8px in Data Compression
    You mean 'accurate' I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. 3% improvement for a x3 compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 625938 view(s)
  • Gotty's Avatar
    6th March 2021, 04:53
    I have only two points (for performance-critical parts of your code):
    - use for loops instead of foreach loops;
    - avoid using LINQ - convert it to normal loops.
    (These modifications will eventually be needed anyway when you want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (only had a peek). Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than 1 occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that takes the occurrence count into account too. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:
    the fox and the rabbit and the lion the chicken
    the fox and the rabbit and the lion the lion
    the fox and the rabbit and the lion the horse
    the fox and the rabbit and the lion the penguin
    the zebra
    the cat
    the worm
    ... (many more) ...
    the dog
    The weights:
    "the": (n-1)*s = 2*1000 = 2000
    "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105
    So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to replace the 4 long substrings first and then replace those "the"s. What do you think? I wonder how much difference this metric makes, and whether it is better or worse.
    17 replies | 898 view(s)
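For concreteness, the two rankings being compared above can be written down in a few lines. This is an illustrative C sketch of the formulas only, not mitiko's actual C# BWD code (the struct and helper names are made up):

    #include <stddef.h>

    /* A candidate dictionary word with its (non-overlapping) occurrence count. */
    typedef struct { const char* word; size_t len; size_t count; } cand_t;

    /* bytes saved if every occurrence is replaced by a 1-byte dictionary index:
       the "(n-1)*s" metric, ignoring the cost of storing the word itself */
    static long rank_savings(const cand_t* c) {
        return (long)(c->len - 1) * (long)c->count;
    }

    /* Gotty's alternative: rank purely by length, keep a word while it still repeats */
    static long rank_length(const cand_t* c) {
        return c->count > 1 ? (long)c->len : 0;
    }

    /* greedy selection: pick the highest-ranked candidate, replace it in the
       block, recount the remaining candidates, and repeat */
    static size_t pick_best(const cand_t* cands, size_t n, long (*rank)(const cand_t*)) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (rank(&cands[i]) > rank(&cands[best])) best = i;
        return best;
    }

Gotty's point is that the length-only rank never needs recomputing - after each round you only check that a candidate still occurs more than once - whereas (n-1)*s has to be re-evaluated because replacements change the occurrence counts.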
  • JamesWasil's Avatar
    6th March 2021, 03:04
    JamesWasil replied to a thread WinRAR in Data Compression
    This may have been missed or not seen a year or two ago, but found it interesting that a vulnerability like this existed for almost 20 years in the wild. Technically not WinRAR's fault, but support for WinACE and external formats was why and how the exploit came to be: https://www.theverge.com/2019/2/21/18234448/winrar-winace-19-year-old-vulnerability-patched-version-5-70-beta-1
    192 replies | 137311 view(s)
  • Alexander Rhatushnyak's Avatar
    6th March 2021, 02:19
    I won't submit anything either this year or next year. No later than late 2022, someone else will win a prize. Maybe a simplified version of cmix or nncp, maybe indeed something like paq8px with a better NLP part.
    78 replies | 12373 view(s)
  • Gotty's Avatar
    6th March 2021, 00:26
    Gotty replied to a thread paq8px in Data Compression
    It looks like the timing is not reliable - look at the last 2 rows (Paq8px_v201 -10 / Paq8px_v201 -11). In the xml column they are the same (to the hundredth), which is probably an error. Ignoring this column, the ratio still varies between 1.29 and 4.29. That's a very large spread. Luca was probably running multiple instances at the same time or ran other tasks besides paq8px. So it's best to ignore the timings. They are not really useful. If you still would like to compare timings, you may need to clean the results by scaling the outliers, but most importantly compare results obtained with the same command line options. You may compare versions along the compression level (8 to 8, 9 to 9, etc.), and especially whether lstm was used or not. I'd say the "slowdown" from v95 to v201 is probably about 3x. Yes, it's still a lot. But for a 3-4% compression gain it's worth it.
    2337 replies | 625938 view(s)
  • Scope's Avatar
    5th March 2021, 22:13
    https://gitlab.com/wg1/jpeg-xl/-/commit/65825c8629f181dbe8bdd432b15576735af29701
    19 replies | 2287 view(s)
  • Fallon's Avatar
    5th March 2021, 15:39
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR - What's new in the latest version https://www.rarlab.com/download.htm Version 6.01 beta 1
    192 replies | 137311 view(s)
  • Shelwien's Avatar
    5th March 2021, 14:07
    https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors
    I made a benchmark which times something like this loop:
        #define rndup(seed) (seed = 214013 * seed + 2531011)
        s0=1*seedinc; s1=2*seedinc; s2=3*seedinc; s3=4*seedinc;
        for( i=0; i<numreads; i++ ) {
          s0 += buf; s1 += buf; s2 += buf; s3 += buf;
          rndup(s0);rndup(s1);rndup(s2);rndup(s3);
        }
    for number of streams 1..64 (loop code is generated by a script) and tested it on two CPUs: i7-7820X @ 4.5Ghz and i7-8650U @ 2.5Ghz (see pictures). For 7820 there's no clear result, just that timings converge after ~16 streams and there's no further speed improvement (per stream). For 8650 it seems like there're ~16 streams for L3 and 6-8 for RAM, there's a visible speed difference after that.
    3 replies | 197 view(s)
  • Gotty's Avatar
    5th March 2021, 12:29
    First of all, your use case is not clear to me: what is the purpose of hashing in your case? Do you need uniqueness? If yes, why? Do you need characters in the hash? Or are bytes also OK?
    >> to convert a street address
    Addresses are terrible regarding their quality. What are you planning to do with addresses that are written differently but represent the same actual address? Do you plan a preprocessing stage to clean the addresses up?
    6 replies | 122 view(s)
  • Shelwien's Avatar
    5th March 2021, 06:32
    > So for example, what's the best way to convert a street address to a short fixed-length token?
    Some cryptohash (md5, sha1, sha256, ...). Faster hashes might work too, but are less reliable.
    > Say a six-character string.
    If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of a base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z
    > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
    There's obviously no perfect solution, but one idea is to modify the actual strings (eg. add spaces) until a unique hash value is found.
    > Unless... there's some way to guide the hash so that there are no collisions
    Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
    6 replies | 122 view(s)
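A sketch of the "cryptohash, then re-encode into a restricted readable alphabet" route Shelwien describes, using OpenSSL's SHA256 and Crockford's Base32 alphabet as the example character set (6 symbols carry only 30 bits, so as discussed this does not remove collisions over millions of addresses):

    #include <openssl/sha.h>
    #include <stdint.h>
    #include <string.h>

    /* Crockford's Base32 alphabet: no I, L, O, U (reduces misreading). */
    static const char ALPHABET[] = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

    /* Derive a 6-character token from a canonicalized address string. */
    void address_token(const char* canonical_address, char out[7]) {
        unsigned char md[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char*)canonical_address,
               strlen(canonical_address), md);
        /* pack the first 5 digest bytes (40 bits), then emit 6 x 5-bit symbols */
        uint64_t v = 0;
        for (int i = 0; i < 5; i++) v = (v << 8) | md[i];
        for (int i = 0; i < 6; i++)
            out[i] = ALPHABET[(v >> (35 - 5 * i)) & 31];
        out[6] = '\0';
    }

The input should be the canonicalized form of the address (see the Publication 28 discussion above), otherwise two spellings of the same address produce different tokens.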
  • SolidComp's Avatar
    5th March 2021, 03:29
    Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes. For a country with a max of 10 million possible and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set covers 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers), and actually even more than that since the input address is variable length. If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses. Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks.
    6 replies | 122 view(s)
  • LucaBiondi's Avatar
    5th March 2021, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me @Darek please! Luca
    2337 replies | 625938 view(s)
  • Darek's Avatar
    5th March 2021, 00:04
    Darek replied to a thread paq8px in Data Compression
    @LucaBiondi - maybe I'm not a master of Excel, but I could help you make some changes
    2337 replies | 625938 view(s)
  • hexagone's Avatar
    4th March 2021, 23:45
    hexagone replied to a thread paq8px in Data Compression
    This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement cost x5 to x15 in compression time.
    2337 replies | 625938 view(s)
  • Hacker's Avatar
    4th March 2021, 23:19
    Yup, I am aware of JPEG XL; I hope for some wider adoption, as I do time and time again with each hopeful JPEG successor.
    32 replies | 2456 view(s)
  • Jarek's Avatar
    4th March 2021, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    465 replies | 148103 view(s)
  • SolidComp's Avatar
    4th March 2021, 06:15
    It needs a much higher quality display to be competitive with the major players in this space. It also seems to be missing basic biometrics, like an IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover - that's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent; I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 127 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
    Done! I hope you enjoy the graphs of size vs. time
    2337 replies | 625938 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the excel itself (in a zip or 7zip).
    2337 replies | 625938 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of compressing my dataset starting from PAQ8PX_V95(!). I have also plotted size / time for each data type. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help to plot the data in a better way.. thank you, Luca
    2337 replies | 625938 view(s)