Activity Stream

  • schnaader's Avatar
    Today, 11:59
    The concept of what3words offers a possible solution. They convert locations of size 3 m x 3 m to a three-word "address". This way the address is human-readable, easier to memorize, and less error-prone because of the redundancy. The disadvantage for your use case might be that these addresses are not unique, e.g. a building can have multiple addresses depending on which location you choose (e.g. different entrances). Then again, the main principle can be applied to other hashing solutions as well, giving slightly longer word codes instead of cryptic alphanumeric ones.
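    A minimal sketch of that principle (not what3words' actual algorithm), assuming a tiny illustrative wordlist: a numeric token or hash value is re-encoded as words instead of alphanumeric symbols, trading length for readability and redundancy.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>
    
    // Render a numeric hash/token value as a sequence of words instead of
    // alphanumeric characters. The 8-entry wordlist here is a placeholder;
    // a real system would use a list of e.g. 4096 words (12 bits per word).
    std::string toWordCode(uint64_t value, int words) {
        static const std::vector<std::string> wordlist = {
            "apple", "bridge", "cloud", "delta", "ember", "forest", "grape", "harbor"
        };
        const uint64_t base = wordlist.size();
        std::string code;
        for (int i = 0; i < words; i++) {
            if (i) code += '.';
            code += wordlist[value % base];
            value /= base;
        }
        return code;
    }
    
    int main() {
        // 0x1D4F2A is an arbitrary example token value.
        std::cout << toWordCode(0x1D4F2A, 3) << "\n";  // prints "cloud.forest.ember"
        return 0;
    }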
    6 replies | 101 view(s)
  • Jon Sneyers's Avatar
    Today, 09:46
    Well yes, it's not about rANS itself, but of course you typically want to do rANS with some kind of context modelling, and it does look like they're claiming they invented that. At least that's how I understand claim 1: "do rANS but with a context model". There should be examples of prior art for this, e.g. the pik code that was published in 2017-2018, before it became part of jxl.
    125 replies | 104592 view(s)
  • SolidComp's Avatar
    Today, 06:54
    What's a "stream"? Is it a thread? Does it matter that the 7820X has four memory channels? Well, are you using all four? I assume the U has only two since it's a laptop chip. This reminds me, have you looked into what Apple is doing with the memory management on their new M1 chip? I read some mention of it being novel or groundbreaking, that because of their approach they're able to get crazy performance from just 8 GB of RAM. I assume it's a SoC since the RAM is LPDDR4X or something. Do they have less latency with their setup than a normal x86-64 PC? I was amazed by that Rocket Lake preview on AnandTech. That thing consumes so much power for what it does. I'm very confused by the M1's performance given that it's only 15 watts or so. 5nm's advantage over 14nm shouldn't be that big to enable a 15 Watt chip to beat a 65 Watt, so there must be more going on. What about the new consoles with the unified or shared GDDR6? I read maybe on Charles' blog that these systems have more latency than PC DDR4 setups. Should they have more parallelism? I never understood why graphics RAM had more latency, and it suggests that integrated GPUs like the Intel Xe should have a latency advantage if they have any RAM like the Iris Pros did.
    2 replies | 158 view(s)
  • Trench's Avatar
    Today, 05:14
    xinix: The file is big enough. I understand your points, but you seem to not understand mine. But it's OK.
    Gotty: I stated a simple fact with a question, which you verified and answered, so OK. I do not program anymore; I have mostly forgotten it and don't care to do much of it, but I understand how it works and know how to edit some code. I get annoyed trying to find the errors (or typos) in code even in basic things like Excel, so no thanks. LOL. I am just not into the details, which restrict, and the other issues that come with them. I also deliberately do not state a lot of things, to let people think about it or believe what they like. Giving answers does not help bring innovation. Kind of like the saying: "Give a man a fish and you feed him for a day; teach him how to fish and you feed him for a lifetime."
    A skill is like a double-edged sword. Math and programming are beneficial, but they can also restrict how you go after the solution. If you ask various professionals how to attack a problem, they will come at it from their own experience or profession. Even with a math background, every mathematician can give a different answer at times; no one is alike, even with the lotto example. As I stated other times, a programmer is like a translator: a translator usually does not create top-selling books but translates them to other languages, and a novelist does not translate his book into other languages but relies on other professionals. Many programmers have an ego problem and think their job is creative, when it is not so much artistic creation as restriction, having to follow rules and be precise. Like in every profession, people do not have the time or ability to be a master of everything. The brain does not work like that, and you should know the odds of that, being into math. I am not saying I am right or you are wrong; I am saying step back and try it another way, as a reminder at times.
    I understand the math of your game, but you put more effort into explaining it than I did, and yes, it is a simple calculation no matter what amount is used. I did state an error before, saying 50% when it should be 33%. I guess no one understood it enough to correct it, or maybe you did before with your other example. When dealing with a small example, 33% may work as shown, but when dealing with a real-world example, 50% seems to work better.
    What you stated is fine and proper with metadata etc., but sometimes you have to start fresh and put restrictions on yourself to force yourself to find something else. You can't find something new by using the same methods, especially when we agree that those methods have reached a limit. There is a reason it's called thinking outside the box: you have to look elsewhere to solve something, rather than the conventional way, which is easier said than done, and sometimes people need a reminder. Sometimes just ignore everything and start fresh with a new way of seeing things to find the solution.
    22 replies | 686 view(s)
  • SolidComp's Avatar
    Today, 04:33
    Gotty, have you thought of ways to make the addressing system more robust and reliable? For example, there should be ways to design address formats, street names, and numbering schemes to make them less error-prone. There may not be a lot we can do about root street names, since governments and developers want to be able to name streets after people and events and plants or whatever, and those strings will be arbitrary with respect to robustness. But the numbering scheme can be designed. The way addresses are conveyed, or even thought of, can be designed and normalized.
    We recently got mail from a prisoner at some California prison. It was addressed to Canada, which is still a different country from the US. The only thing our address had in common with the Canadian address was the three-digit house number. That was incredibly strange.
    1B/1b should be ruled out as a possible building number by any decent system, since its fragility is obvious. The letter B is excluded in all the character sets I've designed, because of its similarity to 8 and 6 and 13. My typical set consists of about 20 or so characters:
    A C D E F H K L M N P Q R T X Y 6 7 8 9
    That would be for a strictly all-caps system, so L and N are okay (n is too close to h and r). It also depends on whether it's something we expect people to ever write by hand. With addresses, that still seems likely to happen sometimes, so we'd probably need to tighten it up. There's also the issue of typical font designs, and machine readability. The letter J, for example, is surprisingly timid in most fonts, with just a tiny, subtle descender. There are lots of interesting systems that prune a few of the ambiguous characters, like the Base58 that Satoshi Nakamoto developed for Bitcoin addresses.
    6 replies | 101 view(s)
  • Gotty's Avatar
    Today, 02:38
    Aha, I see now. Thank you very much for the explanation!
    >>On address quality, yes that's a big issue with commercial mailers.
    Let's ask someone who is an insider in the postal system: how big is this issue? You are lucky ;-) I'm such an insider (not for the US postal system, however). Addressing quality is terrible. I mean: TERRIBLE. A true fact: you can misspell a certain city's name in 23 different ways. A few of its misspellings make it difficult to distinguish from another city. And that's only the city name. You don't want to know about streets, street numbers and so on. It's very, very difficult to enforce canonical addresses. ((At this moment you can't.)) It's not easy to imagine the depth of the problem until you actually see the data. I mean all the data... I saw it.
    A couple of times I needed to find out how many packages or mail items were delivered to specific addresses. Well, I'm experienced and rather good at data cleansing and lookup. But these "projects" were not very successful. Want to see my findings? The data was in such bad shape and the results were so uncertain that I didn't give any numbers. My "report" was simply this: "At this moment it's not possible to determine such metrics". I had to deliver approximate numbers and I couldn't. After working through the data manually I'm confident that there is no machine learning approach that can assist you.
    A personal example: just a couple of weeks ago we expected, but didn't receive, a package from a friend of ours. The correct house number is 1B (or 1b), but the handwriting was probably not too clear, so the delivery man probably read it as 18 (or 16). "Address unknown" - and ... back to sender. We had already received packages and mail from the same sender. Maybe there was a replacement delivery man that day?
    How many misdelivered packages are there? You don't wanna know ;-) Or do you? Guess, and then see here. The first thing is to fix the addressing system. Without that, any (other) effort is meaningless. I'm very sorry to say that.
    6 replies | 101 view(s)
  • SolidComp's Avatar
    Today, 01:24
    Hi, I have a couple of purposes. One is for social science and psychology research, where I'm looking at ways to tokenize and obscure personally identifiable information (PII). There are different layers to that, and tokenizing would be one. My second purpose is to develop a system for tokenizing addresses in general as a privacy-enhancing measure, after seeing so many hacks in the US (mostly by the Chinese military) that obtained huge troves of personal data on millions of Americans, all of which is being combined and mined. The basic insight here is that all these different businesses, doctors, etc. don't actually need our address – they just need to be able to communicate with us, including sending us paper mail. I was fascinated by credit card tokenization, and it seems like we should be able to tokenize pretty much anything, like addresses, phone numbers, etc.
    The basic framework would be that mailers and e-commerce merchants like Amazon would just have a token for your address, like V84DXA, along with your postal code since that is necessary for pre-sorting and pricing shipments, and the token would be converted to your actual address by the shipping carrier through a secure process. They would then print the necessary address info on the package or envelope – whatever info the delivery drivers or mail carriers needed for the "last mile" delivery (probably just the street address, not city or state, since the postal code would already be there, it would be pre-sorted for the drivers down to street level, and they'd want to save on ink).
    Yes, I need uniqueness, since the tokens have to be convertible back to (unique) addresses, and this would only work if they were unique.
    I don't understand your questions about characters vs. bytes. What's the difference? Can't all hashes be expressed in both forms? You mean the textual form of the hash? It has to have a printable text form, a token. Well, maybe not necessarily for the general address tokenization system, since they could just print a linear or 2D barcode on the letter or package, and that could encode anything. They're already used to printing barcodes anyway. The only point where they need any text is for the last-mile carrier to know the specific building number, etc. I've created a few different character sets for the text form of the tokens or hashes, mainly to eliminate ambiguous characters. A Base32 equivalent can work, similar to Crockford's or z-base-32. Or a hexadecimal set, but composed of different characters than 0-F. "Ambiguity" operates at different levels of analysis and in different contexts, like something designed for completely naive users who don't know that certain letters or numbers are excluded, or for machine readability, etc. But my assumption is that we need a text form of the tokens, whether they are technically hashes or not.
    On address quality, yes, that's a big issue with commercial mailers. Addresses would be converted to canonical form first, and then tokenized based on that form. At least that's what I expect, and what I would do in my use cases. But the system might be robust to any input. And maybe hashing would be a good way to check an address for accuracy, validity, or canonicality... In the US we use these standards and the USPS Publication 28. It's basically all caps, a certain line order, no punctuation in most cases, and a standard set of abbreviations for directions and street prefixes and suffixes, like N for North, ST for Street, HWY for Highway, etc.
    It's common for businesses and merchants to automatically present the canonical form of your address to you for approval after you submit a form with your address in non-canonical form. The general addressing system would look something like this:
    DAVID BROWN
    8422 E MERRIWEATHER DR
    DALLAS, TX 75211-5517
    Converted to:
    DAVID BROWN
    75211 V84DXA
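    A minimal sketch of how the carrier-side mapping could work, assuming tokens are randomly assigned from a restricted alphabet (rather than hashed) and resolved through a private lookup table; TokenVault and makeToken are hypothetical names, not part of any existing system.
    #include <iostream>
    #include <random>
    #include <string>
    #include <unordered_map>
    
    // Hypothetical "vault" mapping tokens to canonical addresses.
    // Only the shipping carrier would hold this table; merchants see only the token.
    class TokenVault {
    public:
        // Assign a fresh unique 6-character token to a canonical address.
        std::string makeToken(const std::string& canonicalAddress) {
            std::string token;
            do {
                token = randomToken(6);
            } while (map_.count(token));          // retry on the (rare) collision
            map_[token] = canonicalAddress;
            return token;
        }
        // Carrier-side resolution of a token back to the delivery address.
        const std::string& resolve(const std::string& token) const {
            return map_.at(token);
        }
    private:
        std::string randomToken(int len) {
            // Restricted alphabet with visually ambiguous characters removed,
            // roughly the 20-character set discussed above.
            static constexpr char kAlphabet[] = "ACDEFHKLMNPQRTXY6789";
            std::uniform_int_distribution<int> pick(0, sizeof(kAlphabet) - 2);
            std::string s;
            for (int i = 0; i < len; i++) s += kAlphabet[pick(rng_)];
            return s;
        }
        std::mt19937 rng_{std::random_device{}()};
        std::unordered_map<std::string, std::string> map_;
    };
    
    int main() {
        TokenVault vault;
        std::string t = vault.makeToken("DAVID BROWN, 8422 E MERRIWEATHER DR, DALLAS TX 75211-5517");
        std::cout << "75211 " << t << "\n";     // what the merchant prints
        std::cout << vault.resolve(t) << "\n";  // what the carrier recovers
        return 0;
    }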
    6 replies | 101 view(s)
  • Sportman's Avatar
    Yesterday, 22:43
    Sportman replied to a thread Brotli in Data Compression
    Did somebody compile a Brotli v1.0.9 Windows executable? https://github.com/google/brotli/releases
    275 replies | 88374 view(s)
  • Sportman's Avatar
    Yesterday, 20:45
    Sportman replied to a thread Zstandard in Data Compression
    Input: 550.134.615 bytes, data
    Output zstd v1.4.9:
    92.590.444 bytes, 0.843 sec. - 0.500 sec., zstd -1 --ultra -T1
    90.892.415 bytes, 0.985 sec. - 0.531 sec., zstd -2 --ultra -T1
    90.695.852 bytes, 1.187 sec. - 0.516 sec., zstd -1 --ultra --long -T1
    88.947.977 bytes, 1.328 sec. - 0.516 sec., zstd -2 --ultra --long -T1
    86.822.223 bytes, 1.218 sec. - 0.500 sec., zstd -3 --ultra -T1
    86.817.226 bytes, 1.344 sec. - 0.515 sec., zstd -4 --ultra -T1
    85.401.646 bytes, 1.625 sec. - 0.516 sec., zstd -3 --ultra --long -T1
    85.380.924 bytes, 1.641 sec. - 0.515 sec., zstd -4 --ultra --long -T1
    83.538.083 bytes, 2.156 sec. - 0.500 sec., zstd -5 --ultra -T1
    82.246.447 bytes, 2.312 sec. - 0.516 sec., zstd -5 --ultra --long -T1
    80.272.993 bytes, 2.375 sec. - 0.485 sec., zstd -6 --ultra -T1
    79.188.211 bytes, 2.593 sec. - 0.500 sec., zstd -6 --ultra --long -T1
    77.888.214 bytes, 3.063 sec. - 0.468 sec., zstd -7 --ultra -T1
    77.683.693 bytes, 3.781 sec. - 0.469 sec., zstd -8 --ultra -T1
    77.033.654 bytes, 3.235 sec. - 0.484 sec., zstd -7 --ultra --long -T1
    76.982.033 bytes, 4.422 sec. - 0.453 sec., zstd -9 --ultra -T1
    76.830.435 bytes, 3.922 sec. - 0.484 sec., zstd -8 --ultra --long -T1
    76.733.141 bytes, 4.828 sec. - 0.453 sec., zstd -10 --ultra -T1
    76.604.831 bytes, 5.672 sec. - 0.468 sec., zstd -11 --ultra -T1
    76.234.213 bytes, 4.532 sec. - 0.484 sec., zstd -9 --ultra --long -T1
    76.171.937 bytes, 7.266 sec. - 0.453 sec., zstd -12 --ultra -T1
    76.040.602 bytes, 4.968 sec. - 0.485 sec., zstd -10 --ultra --long -T1
    75.911.832 bytes, 6.015 sec. - 0.484 sec., zstd -11 --ultra --long -T1
    75.764.674 bytes, 17.235 sec. - 0.468 sec., zstd -13 --ultra -T1
    75.606.371 bytes, 18.485 sec. - 0.453 sec., zstd -14 --ultra -T1
    75.536.752 bytes, 7.546 sec. - 0.500 sec., zstd -12 --ultra --long -T1
    75.308.212 bytes, 22.891 sec. - 0.453 sec., zstd -15 --ultra -T1
    75.195.609 bytes, 16.782 sec. - 0.469 sec., zstd -13 --ultra --long -T1
    75.025.088 bytes, 18.344 sec. - 0.500 sec., zstd -14 --ultra --long -T1
    74.714.546 bytes, 22.922 sec. - 0.500 sec., zstd -15 --ultra --long -T1
    69.794.235 bytes, 33.797 sec. - 0.468 sec., zstd -16 --ultra -T1
    68.674.180 bytes, 35.172 sec. - 0.500 sec., zstd -16 --ultra --long -T1
    64.483.007 bytes, 39.593 sec. - 0.485 sec., zstd -17 --ultra -T1
    64.078.317 bytes, 54.672 sec. - 0.500 sec., zstd -18 --ultra -T1
    63.592.913 bytes, 40.828 sec. - 0.500 sec., zstd -17 --ultra --long -T1
    63.399.155 bytes, 1:22.890 sec. - 0.485 sec., zstd -19 --ultra -T1
    63.188.456 bytes, 56.125 sec. - 0.516 sec., zstd -18 --ultra --long -T1
    62.692.187 bytes, 1:28.328 sec. - 0.516 sec., zstd -20 --ultra -T1
    62.361.863 bytes, 1:27.547 sec. - 0.531 sec., zstd -19 --ultra --long -T1
    62.095.898 bytes, 1:43.766 sec. - 0.547 sec., zstd -21 --ultra -T1
    62.078.942 bytes, 1:32.157 sec. - 0.546 sec., zstd -20 --ultra --long -T1
    61.752.973 bytes, 1:51.047 sec. - 0.547 sec., zstd -21 --ultra --long -T1
    61.480.950 bytes, 2:12.687 sec. - 0.547 sec., zstd -22 --ultra -T1
    61.480.950 bytes, 2:12.844 sec. - 0.547 sec., zstd -22 --ultra --long -T1
    464 replies | 147985 view(s)
  • Gotty's Avatar
    Yesterday, 20:08
    Gotty replied to a thread cmix in Data Compression
    It looks like you have an older cmix source. Your error message indicates that you have a source where the ByteModel is in use. But Byron removed the ByteModel a year ago. Please download the current source: https://github.com/byronknoll/cmix
    453 replies | 126220 view(s)
  • innar's Avatar
    Yesterday, 19:51
    Finally...
    Ubuntu 18.04.5 LTS @ Intel Xeon CPU E5-2690 v3 @ 2.60GHz, 512GB RAM
    ./cmix -c coronavirus.fasta.un-wrapped coronavirus.fasta.un-wrapped.cmix
    1317937667 bytes -> 866379 bytes in 1995375.27 s. cross entropy: 0.005
    78 replies | 5761 view(s)
  • Gribok's Avatar
    Yesterday, 19:29
    Asking for help. I do not have access to the latest AMD (Zen 2/3) or ARM64 (Apple M1) CPUs, so I would appreciate it if folks could help and benchmark libsais vs divsufsort on these CPU microarchitectures.
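    Not a definitive harness, just a sketch of such a benchmark, assuming the divsufsort() entry point from divsufsort.h and a libsais() entry point taking (text, SA, n, extra_space, freq) as in recent libsais headers; the libsais signature has changed between revisions, so check the header you actually build against.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    
    #include <divsufsort.h>   // divsufsort(const sauchar_t*, saidx_t*, saidx_t)
    #include "libsais.h"      // assumed: libsais(const uint8_t*, int32_t*, int32_t, int32_t, int32_t*)
    
    // Time one suffix-array construction and return seconds.
    template <typename F>
    static double timeIt(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }
    
    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        std::FILE* fp = std::fopen(argv[1], "rb");
        if (!fp) { std::perror("fopen"); return 1; }
        std::vector<uint8_t> text;
        uint8_t buf[1 << 16];
        for (size_t got; (got = std::fread(buf, 1, sizeof(buf), fp)) > 0; )
            text.insert(text.end(), buf, buf + got);
        std::fclose(fp);
    
        const int32_t n = (int32_t)text.size();
        std::vector<int32_t> sa(text.size());
    
        double tD = timeIt([&] { divsufsort(text.data(), sa.data(), n); });
        double tL = timeIt([&] { libsais(text.data(), sa.data(), n, 0, nullptr); });
        std::printf("divsufsort: %.3f s, libsais: %.3f s (n = %d)\n", tD, tL, n);
        return 0;
    }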
    7 replies | 784 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 18:37
    there is an error while compiling cmix v18 on my laptop
    453 replies | 126220 view(s)
  • Jarek's Avatar
    Yesterday, 17:47
    Jyrki, the situation here is more complex - this patent covers natural modifications which, while maybe not very practical for software implementations, might bring improvements for hardware. These basic variations would probably have been explored if AV1 had remained with rANS; maybe you have some prior art to submit there from this development. Otherwise, it seems there will be a 20-year minefield preventing anyone from going beyond the basic approach ... e.g. for hardware optimizations of JPEG XL implementations.
    18 replies | 2197 view(s)
  • Gotty's Avatar
    Yesterday, 16:54
    Gotty replied to a thread cmix in Data Compression
    Here it is.
    453 replies | 126220 view(s)
  • Jyrki Alakuijala's Avatar
    Yesterday, 16:51
    How Amazon got a patent on white-background photography
    18 replies | 2197 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Yesterday, 16:36
    @Shelwien how do I compile cmix v18 using a batch script? Thank you
    453 replies | 126220 view(s)
  • choochootrain's Avatar
    Yesterday, 16:31
    choochootrain replied to a thread Zstandard in Data Compression
    The file inside zstd-v1.4.9-win64.zip is another zip, it just doesn't have an extension
    464 replies | 147985 view(s)
  • Gotty's Avatar
    Yesterday, 15:17
    It looks like the patent is not patenting rANS itself. I believe rANS is safe. Reading the patent text, they could have chosen any entropy coding mechanism, like arithmetic coding; they just happened to choose rANS. ((I'm not defending the patent. I also believe that patenting is silly.)) See these quotes: ((The patent is all about these 5 "innovations" (a better word would be "enhancements").)) See? They are not "reinventing" rANS but extending rANS so that it is capable of the above desired features. It's not really about the entropy coding itself, it's more about modelling.
    125 replies | 104592 view(s)
  • mpais's Avatar
    Yesterday, 13:50
    Here's an interesting article that provides useful hints for this challenge: Learning the language of viral evolution and escape
    78 replies | 5761 view(s)
  • necros's Avatar
    Yesterday, 13:43
    So flexigif is still not safe to use right?
    42 replies | 7703 view(s)
  • Jarek's Avatar
    Yesterday, 12:21
    Microsoft filed a large general patent for rANS in 2019: https://patents.google.com/patent/US20200413106A1 Please check whether it interferes with JPEG XL, and if so, help prevent it.
    18 replies | 2197 view(s)
  • Sportman's Avatar
    Yesterday, 06:46
    Sportman replied to a thread Zstandard in Data Compression
    I used WinRAR to unzip it, and it only showed a single file without a file extension instead of files and folders. I thought it was a Linux file or a corrupt archive, so I clicked it, and it behaved as a folder, showing the files and folders that are normally shown at the root of the archive, and I could extract zstd.exe (extracting all files and folders from that file-that-acts-as-a-folder gave no error).
    464 replies | 147985 view(s)
  • SolidComp's Avatar
    Yesterday, 04:00
    This is a depressing reminder of how terrible our legal system is in the US and in similar countries. This patent system is insanely bad. We let people just think stuff up and write it down, and BOOM, patent. Then they become patent predators, née trolls, and stalk and shake down random humans/businesses for tons of money when they deserve nothing. With these corporations patenting these algorithms and vague ideas and abstractions, we obviously need an epistemic and legal framework that we don't have. If there's a reasonable framework for software patents, I'm not sure anyone has developed it, and we certainly don't employ it.
    Nothing should be patentable unless it's in a product offered in the marketplace or used to competitive advantage internally. No more of this nonsense of just thinking stuff up and writing it down. Obviously something already invented by others should not be patentable, so these Google and Microsoft patents should not be possible. The fact that they're happening means we don't have the right framework, training, or domain experts handling these applications.
    I think there are lots of potential solutions involving crowdsourcing patent approvals. Relying on slow and non-expert government bureaucrats is ridiculous. There are some interesting possibilities that would provide financial rewards to people for examining patent applications, tied to the quality of the patent, which could be measured by various means over time. We could even have a betting market on patents, designed in lots of different ways.
    Right now patents are a huge energy, time, and financial drain. It's such a distraction from productive work to have people at Google writing these ridiculous patents, lawyers filing them, government employees reading them, original inventors needing to challenge them, lawyers challenging them, judges ruling on them, lawyers trying to trick jurors about them, jurors trying to judge them, etc. Our jury system is so embarrassing and stupid. Americans have become too distracted by stupid ideologies, and are missing how deeply flawed our legal system is across the board, and how much it costs us in injustice and dollars. It's become a lottery for a lot of people, but with much better odds than actual lotteries.
    125 replies | 104592 view(s)
  • SolidComp's Avatar
    Yesterday, 03:31
    SolidComp replied to a thread Zstandard in Data Compression
    I don't understand what you're saying, but the Win64 zip file seems to be corrupt. It just unzips into a weird blob that doesn't do anything. Is that what you're referring to?
    464 replies | 147985 view(s)
  • Shelwien's Avatar
    Yesterday, 03:28
    Anandtech posted a nice picture about cache latency. For comparison, reg-reg multiplication has a 3clk latency.
    2 replies | 158 view(s)
  • Sportman's Avatar
    Yesterday, 00:26
    Sportman replied to a thread Zstandard in Data Compression
    It's as if a file carries the root folder content inside it, instead of the root folder content being directly at the archive root as with the 1.4.5-1.4.8 win64 zip files.
    464 replies | 147985 view(s)
  • Sportman's Avatar
    Yesterday, 00:18
    Sportman replied to a thread Zstandard in Data Compression
    Input: 984.801.591 bytes - data
    Output zstd v1.4.8:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.904.965 bytes, 3.375 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.708.829 bytes, 3.547 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.141 sec., zstd -4 --ultra -T1
    142.896.262 bytes, 4.063 sec., zstd -3 --ultra --long -T1
    142.869.673 bytes, 4.125 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.437 sec., zstd -5 --ultra -T1
    137.671.146 bytes, 5.187 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.735 sec., zstd -6 --ultra -T1
    134.660.677 bytes, 5.625 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.552.776 bytes, 7.125 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.547 sec., zstd -8 --ultra -T1
    129.090.886 bytes, 8.328 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.687 sec., zstd -9 --ultra -T1
    128.210.617 bytes, 9.485 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.328 sec., zstd -10 --ultra -T1
    128.056.308 bytes, 10.312 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.812 sec., zstd -11 --ultra -T1
    127.940.684 bytes, 12.109 sec., zstd -11 --ultra --long -T1
    127.445.750 bytes, 14.828 sec., zstd -12 --ultra --long -T1
    127.422.579 bytes, 12.656 sec., zstd -12 --ultra -T1
    126.935.545 bytes, 28.125 sec., zstd -13 --ultra --long -T1
    126.802.723 bytes, 30.359 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.766 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.734 sec., zstd -14 --ultra -T1
    126.499.068 bytes, 37.922 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.860 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.344 sec., zstd -16 --ultra -T1
    117.643.674 bytes, 1:22.531 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.234 sec., zstd -17 --ultra -T1
    112.773.080 bytes, 1:33.485 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:11.281 sec., zstd -18 --ultra -T1
    111.880.990 bytes, 2:16.656 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:46.016 sec., zstd -19 --ultra -T1
    110.962.579 bytes, 2:57.359 sec., zstd -19 --ultra --long -T1
    111.041.788 bytes, 2:55.297 sec., zstd -20 --ultra -T1
    110.644.739 bytes, 3:04.047 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:25.172 sec., zstd -21 --ultra -T1
    110.188.109 bytes, 3:39.516 sec., zstd -21 --ultra --long -T1
    109.903.935 bytes, 3:52.750 sec., zstd -22 --ultra -T1
    109.903.935 bytes, 3:52.719 sec., zstd -22 --ultra --long -T1
    Output zstd v1.4.9:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.930.990 bytes, 1.969 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.731.879 bytes, 2.141 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.156 sec., zstd -4 --ultra -T1
    142.942.060 bytes, 2.656 sec., zstd -3 --ultra --long -T1
    142.912.214 bytes, 2.687 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.453 sec., zstd -5 --ultra -T1
    137.698.236 bytes, 3.781 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.781 sec., zstd -6 --ultra -T1
    134.685.348 bytes, 4.188 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.541.732 bytes, 5.687 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.563 sec., zstd -8 --ultra -T1
    129.079.778 bytes, 6.890 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.703 sec., zstd -9 --ultra -T1
    128.185.107 bytes, 8.016 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.344 sec., zstd -10 --ultra -T1
    128.030.897 bytes, 8.860 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.859 sec., zstd -11 --ultra -T1
    127.915.469 bytes, 10.719 sec., zstd -11 --ultra --long -T1
    127.422.579 bytes, 12.703 sec., zstd -12 --ultra -T1
    127.415.784 bytes, 13.422 sec., zstd -12 --ultra --long -T1
    126.907.957 bytes, 26.703 sec., zstd -13 --ultra --long -T1
    126.775.263 bytes, 28.954 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.750 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.735 sec., zstd -14 --ultra -T1
    126.467.698 bytes, 36.609 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.812 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.329 sec., zstd -16 --ultra -T1
    117.640.741 bytes, 1:21.078 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.187 sec., zstd -17 --ultra -T1
    112.767.845 bytes, 1:32.015 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:10.438 sec., zstd -18 --ultra -T1
    111.878.283 bytes, 2:14.078 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:45.672 sec., zstd -19 --ultra -T1
    111.041.788 bytes, 2:55.078 sec., zstd -20 --ultra -T1
    110.958.922 bytes, 2:55.203 sec., zstd -19 --ultra --long -T1
    110.691.009 bytes, 3:01.454 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:24.843 sec., zstd -21 --ultra -T1
    110.190.299 bytes, 3:36.609 sec., zstd -21 --ultra --long -T1
    109.900.896 bytes, 3:49.813 sec., zstd -22 --ultra -T1
    109.900.896 bytes, 3:49.844 sec., zstd -22 --ultra --long -T1
    464 replies | 147985 view(s)
  • mitiko's Avatar
    Yesterday, 00:04
    Oh, that type of a transform... Well, static dictionaries support all types of patterns (like the ones described in the post) and I'm planning on doing this in one way or another. What they're doing with LZ77 + static dictionaries is pretty ingenious actually! Wait, that's actually pretty clever for a fast encoder!
    Anyways, that's not the point of BWD. I'm trying to preserve the context of the text, the information it holds. After LZ77 and such pattern-y words, the rest of the compression pipeline loses context and it will predict worse. But if we instead create a whole new context based on a pattern, it will compress better and the original stream will be more compressible as well.
    Patterns are great, they just seem a bit too deterministic right now - for every pattern/dictionary transform that you can imagine, you need to write its own ranking function and extraction function and then handle its serialization to the dictionary, etc. It'd be amazing if we found a way to get the system to generate patterns, or find them in the dictionary after the words have been chosen. For example, if the dictionary is calculated to be something like:
    "the next"
    "the following"
    "the lion"
    "the fox"
    maybe some independent system can see the pattern and generate the word "the (*)", which will then match all of the above and also split the stream into two - the main stream plus the stream of the pattern. The stream of the pattern would then look something like "next;following;lion;next;fox;fox;fox;next...", which is very compressible. In practice this would suit some grammar rules well (essentially making the grammar information that text holds known, so zero bytes to transmit), like "(*)ing" or "(*)ed". See the sketch below.
    With patterns, we get into this whole idea of context splitting and start to think whether this is possible in the general case. I have some notes, if you'd like to look at them.
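    A minimal sketch of that stream-splitting idea, assuming a single hard-coded pattern "the (*)": matches are replaced by a placeholder symbol in the main stream, and the captured parts go to a side stream. All names here are illustrative only, not part of BWD.
    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>
    
    // Split one pattern out of the text: the main stream keeps a placeholder
    // token, the side stream collects what the wildcard matched.
    struct SplitResult {
        std::string mainStream;               // text with matches replaced by <P0>
        std::vector<std::string> sideStream;  // captured wildcard parts, in order
    };
    
    SplitResult splitPattern(const std::string& text) {
        // Hard-coded pattern "the (*)" for illustration; a real system would
        // discover such patterns from the chosen dictionary words.
        std::regex pat(R"(\bthe (\w+))");
        SplitResult r;
        auto begin = std::sregex_iterator(text.begin(), text.end(), pat);
        auto end = std::sregex_iterator();
        size_t last = 0;
        for (auto it = begin; it != end; ++it) {
            r.mainStream += text.substr(last, it->position() - last);
            r.mainStream += "<P0>";
            r.sideStream.push_back((*it)[1].str());
            last = it->position() + it->length();
        }
        r.mainStream += text.substr(last);
        return r;
    }
    
    int main() {
        SplitResult r = splitPattern("the fox and the rabbit saw the lion");
        std::cout << r.mainStream << "\n";                    // "<P0> and <P0> saw <P0>"
        for (auto& w : r.sideStream) std::cout << w << ";";   // "fox;rabbit;lion;"
        std::cout << "\n";
        return 0;
    }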
    17 replies | 872 view(s)
  • Jarek's Avatar
    6th March 2021, 23:26
    Thanks, but I am not taking this as the only "award" for ANS: being able to defend it for the rest of my life from successive megacorporations. Corporations using it: instead of deploying your own "defensive" legal mines, please just prevent this minefield from being built. Especially since I haven't heard of any ANS-based compressor from Microsoft (?)
    125 replies | 104592 view(s)
  • SolidComp's Avatar
    6th March 2021, 22:19
    Brotli has a seemingly novel set of 120 dictionary transforms. You can read a bit about them in this recent Cloudflare post.
    17 replies | 872 view(s)
  • JamesWasil's Avatar
    6th March 2021, 19:10
    Nope. Only semantic rewording of your work and fabrication of "big tech innovations" for Microsoft to compete with Google via bogus patent claims on what was intended to be free with due credit. This is why we can't have nice things, and why development and certain types of information sharing and discoveries are no longer offered in good faith like they were before. Corporations, by majority, have ruined good-faith and academic contributions for the sake of exclusive profiteering, even against the inventor's wishes.
    It says a lot about these companies and the state of the world when they have to rip off the author, take credit, and make money on work released freely, rather than developing their own UNIQUE methods, even if those would yield nearly the same performance or results. Cookie-cutter world now.
    Question: Should Jarek make a global GoFundMe for legal help to fight this, so that lawyers whom Google, Microsoft, and other big tech don't yet have in their pocket might actually help? I would assume that if Marcus Hutter could throw 50k at the Hutter Prize, he and others would be willing to donate to the Jarek rANS/tANS defense fund?
    125 replies | 104592 view(s)
  • mitiko's Avatar
    6th March 2021, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12336 view(s)
  • mitiko's Avatar
    6th March 2021, 17:13
    Thanks for looking through the code. I knew I had to eventually remove LINQ, but I don't think I'm using it in any very critical places. Right now, most of the work seems to be in counting the words and ranking. Thanks for the heads-up on foreach loops, I didn't know they were more expensive than for's.
    The architecture is pretty OK for now; it's like LINQ but async (using IAsyncEnumerable). I've thought it might be slowing me down but it's actually not. It's processing data in blocks and I'm testing it for optimality with the block size being the full file size, so all the data only travels the pipeline once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.
    What the rank (as per the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n = word size, s = occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way. It will prefer:
    "the" -> rank=2000
    "rabbit" -> rank=175
    "lion" -> rank=105
    "and" -> rank=70
    "fox" -> rank=70
    total = 2420
    The ranks here are only from the top of my head. The non-greedy variation you have in mind looks something like:
    "the fox and the rabbit and the lion" -> rank=1225
    "the" -> rank=2000
    total = 3225
    I have actually thought a lot about this, and it's not a problem of the ranking function itself, at least not of its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree.
    The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:
    A -- AT
    T -- A
    C -- T
    G -- C
         G
    Then we can do parsing before each new word calculation and we won't have the constraint of words not spanning over multiple blocks. I've excluded this as a good idea because 1) we have to do parsing at each word and only re-parsing doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch, we'd only have to calculate the ranks of the next layer of words. Basically ranking becomes at most x times slower, if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree, compared to a more spanning tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it. We need a more spanning tree. So, maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.
    A couple of assumptions: right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can re-rank only some. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once per group of similar words. What we can do, for sure, is only re-count some words. I'm still working on that. I have a draft, I just have to fix some minor indexing issues.
    The reason I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique used is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree. For sure, re-ranking and re-counting + the fragmentation fix will create a great compressor with acceptable compression times. But I'm looking to improve BWD in the direction of optimality at the cost of compression time, rather than leveraging memory for time. When I'm done, I'll definitely try to implement one with this stack!
    17 replies | 872 view(s)
  • CompressMaster's Avatar
    6th March 2021, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
    Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was only recently patched in Win10 had existed since Windows 3.11.
    192 replies | 137293 view(s)
  • Jarek's Avatar
    6th March 2021, 15:40
    I was just pointed to a 2019 Microsoft patent for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find anything novel here?
    125 replies | 104592 view(s)
  • LucaBiondi's Avatar
    6th March 2021, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM and often that is not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 625657 view(s)
  • Gotty's Avatar
    6th March 2021, 13:11
    Gotty replied to a thread paq8px in Data Compression
    Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and for that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to see better gains. That costs time and memory. The gain is usually not too great, as the existing models cover the possibilities quite well. We really sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 625657 view(s)
  • hexagone's Avatar
    6th March 2021, 07:21
    hexagone replied to a thread paq8px in Data Compression
    You mean 'accurate', I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. A 3% improvement for a 3x compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 625657 view(s)
  • Gotty's Avatar
    6th March 2021, 04:53
    I have only two points (for performance-critical parts of your code):
    - use for loops instead of foreach loops;
    - avoid using LINQ - convert it to normal loops.
    (These modifications will eventually be needed anyway when you really want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (only had a peek).
    Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than one occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that also takes the occurrence count into account. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:
    the fox and the rabbit and the lion the chicken
    the fox and the rabbit and the lion the lion
    the fox and the rabbit and the lion the horse
    the fox and the rabbit and the lion the penguin
    the zebra
    the cat
    the worm
    ... (many more) ...
    the dog
    The weights:
    "the": (n-1)*s = 2*1000 = 2000
    "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105
    So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to just replace the 4 long substrings first and then replace those "the"'s. What do you think? I wonder how big the difference is with this metric and whether it is better or worse.
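    A small sketch comparing the two metrics on word candidates, assuming simple (word, count) statistics; rankBySavings is the (n-1)*s metric and rankByLength is the size-only metric suggested above.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>
    
    struct Candidate {
        std::string word;
        int64_t count;   // occurrences in the current block
    };
    
    // (n-1)*s: bytes saved if every occurrence is replaced by a 1-byte code.
    int64_t rankBySavings(const Candidate& c) {
        return (int64_t)(c.word.size() - 1) * c.count;
    }
    
    // Size-only metric: stable across rounds; a word stays eligible as long as
    // it still occurs more than once.
    int64_t rankByLength(const Candidate& c) {
        return (int64_t)c.word.size();
    }
    
    int main() {
        std::vector<Candidate> cands = {
            {"the", 1000},
            {"the fox and the rabbit and the lion", 4},
        };
        auto best = [&](auto rank) {
            return *std::max_element(cands.begin(), cands.end(),
                [&](const Candidate& a, const Candidate& b) { return rank(a) < rank(b); });
        };
        std::cout << "savings metric picks: " << best(rankBySavings).word << "\n";  // "the"
        std::cout << "length metric picks:  " << best(rankByLength).word << "\n";   // the long phrase
        return 0;
    }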
    17 replies | 872 view(s)
  • JamesWasil's Avatar
    6th March 2021, 03:04
    JamesWasil replied to a thread WinRAR in Data Compression
    This may have been missed or not seen a year or two ago, but I found it interesting that a vulnerability like this existed in the wild for almost 20 years. Technically not WinRAR's fault, but support for WinACE and external formats is why and how the exploit came to be: https://www.theverge.com/2019/2/21/18234448/winrar-winace-19-year-old-vulnerability-patched-version-5-70-beta-1
    192 replies | 137293 view(s)
  • Alexander Rhatushnyak's Avatar
    6th March 2021, 02:19
    I won't submit anything either this year or next year. No later than late 2022, someone else will win a prize. Maybe a simplified version of cmix or nncp, maybe indeed something like paq8px with a better NLP part.
    78 replies | 12336 view(s)
  • Gotty's Avatar
    6th March 2021, 00:26
    Gotty replied to a thread paq8px in Data Compression
    It looks like the timing is not reliable - look at the last 2 rows (paq8px_v201 -10 / paq8px_v201 -11). In the xml column they are the same (to the hundredth), which is probably an error. Ignoring this column, the ratio still varies between 1.29 and 4.29. That's a very large spread. Luca was probably running multiple instances at the same time or running other tasks besides paq8px. So it's best to ignore the timings. They are not really useful. If you still would like to compare timings, you may need to clean the results by scaling the outliers, but most importantly compare results obtained with the same command line options. You may compare versions along the compression level (8 to 8, 9 to 9, etc.), and especially whether lstm was used or not. I'd say the "slowdown" from v95 to v201 is probably about 3x. Yes, it's still a lot. But for a 3-4% compression gain it's worth it.
    2337 replies | 625657 view(s)
  • Scope's Avatar
    5th March 2021, 22:13
    https://gitlab.com/wg1/jpeg-xl/-/commit/65825c8629f181dbe8bdd432b15576735af29701
    18 replies | 2197 view(s)
  • Fallon's Avatar
    5th March 2021, 15:39
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR - What's new in the latest version https://www.rarlab.com/download.htm Version 6.01 beta 1
    192 replies | 137293 view(s)
  • Shelwien's Avatar
    5th March 2021, 14:07
    https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors
    I made a benchmark which times something like this loop:
    #define rndup(seed) (seed = 214013 * seed + 2531011)
    s0=1*seedinc; s1=2*seedinc; s2=3*seedinc; s3=4*seedinc;
    for( i=0; i<numreads; i++ ) {
      s0 += buf; s1 += buf; s2 += buf; s3 += buf;
      rndup(s0); rndup(s1); rndup(s2); rndup(s3);
    }
    for number of streams 1..64 (the loop code is generated by a script) and tested it on two CPUs: i7-7820X @ 4.5GHz and i7-8650U @ 2.5GHz (see pictures). For the 7820 there's no clear result, just that timings converge after ~16 streams and there's no further speed improvement (per stream). For the 8650 it seems like there are ~16 streams for L3 and 6-8 for RAM; there's a visible speed difference after that.
    2 replies | 158 view(s)
  • Gotty's Avatar
    5th March 2021, 12:29
    First of all, your use case is not clear to me: what is the purpose of hashing in your case? Do you need uniqueness? If yes, why? Do you need characters in the hash? Or are bytes also OK?
    >>to convert a street address
    Addresses are terrible regarding their quality. What are you planning to do with addresses that are written differently but represent the same actual address? Do you plan a preprocessing stage to clean the addresses up?
    6 replies | 101 view(s)
  • Shelwien's Avatar
    5th March 2021, 06:32
    > So for example, what's the best way to convert a street address to a short fixed-length token?
    Some cryptohash (md5, sha1, sha256, ...). Faster hashes might work too, but are less reliable.
    > Say a six-character string.
    If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of a base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z
    > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
    There's obviously no perfect solution, but one idea is to modify the actual strings (e.g. add spaces) until a unique hash value is found.
    > Unless... there's some way to guide the hash so that there are no collisions
    Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
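    A minimal sketch of the "first 6 symbols of a restricted-charset code of the hash" idea. std::hash is used here only as a stand-in so the example stays self-contained; a real tokenizer would use a cryptographic hash (SHA-256 or similar) instead, exactly as suggested above.
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    
    // Map a hash value to a short token over a restricted, unambiguous alphabet.
    // CAUTION: std::hash is NOT cryptographic and is only a placeholder here.
    std::string addressToken(const std::string& canonicalAddress, int len = 6) {
        static const std::string alphabet = "ACDEFHKLMNPQRTXY6789";  // 20 symbols
        uint64_t h = std::hash<std::string>{}(canonicalAddress);
        std::string token;
        for (int i = 0; i < len; i++) {
            token += alphabet[h % alphabet.size()];
            h /= alphabet.size();
        }
        return token;
    }
    
    int main() {
        std::cout << addressToken("8422 E MERRIWEATHER DR DALLAS TX 75211") << "\n";
        return 0;
    }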
    6 replies | 101 view(s)
  • SolidComp's Avatar
    5th March 2021, 03:29
    Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes.
    For a country with a max of 10 million possible and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set covers 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers). It's actually even more than that, since the input address is variable length. If every hash has to be unique, six characters can't do it, even though it gives us more than enough range for 10 million addresses.
    Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks.
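    A rough birthday-bound estimate shows why collisions are unavoidable without some coordination, assuming 10 million addresses hashed uniformly into the 30^6 = 729 million possible 6-character tokens:
    expected collisions ≈ n*(n-1)/(2*m) = 10^7 * (10^7 - 1) / (2 * 729,000,000) ≈ 68,600
    So purely hashing to 6 characters will essentially always produce collisions; guaranteed uniqueness needs either assignment (a lookup table) or some collision-resolution step.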
    6 replies | 101 view(s)
  • LucaBiondi's Avatar
    5th March 2021, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me @Darek please! Luca
    2337 replies | 625657 view(s)
  • Darek's Avatar
    5th March 2021, 00:04
    Darek replied to a thread paq8px in Data Compression
    @LucaBiondi - maybe I'm not the master of Excel, but I could help you to make some changes
    2337 replies | 625657 view(s)
  • hexagone's Avatar
    4th March 2021, 23:45
    hexagone replied to a thread paq8px in Data Compression
    This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement costs 5x to 15x in compression time.
    2337 replies | 625657 view(s)
  • Hacker's Avatar
    4th March 2021, 23:19
    Yup, I am aware of JPEG XL, I hope for some wider adoption, as time and time again with each hopeful JPEG successor.
    32 replies | 2446 view(s)
  • Jarek's Avatar
    4th March 2021, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    464 replies | 147985 view(s)
  • SolidComp's Avatar
    4th March 2021, 06:15
    It needs a much higher quality display to be competitive with major competitors in this space. Also it seems to be missing basic biometrics, like IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover. That's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent and I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 125 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
    Done! I hope you enjoy the graphs of size vs. time
    2337 replies | 625657 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the excel itself (in a zip or 7zip).
    2337 replies | 625657 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of the compression of my dataset starting from PAQ8PX_V95(!). I have also plotted size vs. time for each datatype. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help to plot the data in a better way.. Thank you, Luca
    2337 replies | 625657 view(s)
  • Darek's Avatar
    3rd March 2021, 11:16
    Darek replied to a thread paq8px in Data Compression
    No, those were different moments. It's probably something with my laptop. I suppose it could be an issue of low space on disk C (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking the system up, paq8px quits without any message. Now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 it's about 4 days of time which must not be disturbed.
    2337 replies | 625657 view(s)
  • Shelwien's Avatar
    3rd March 2021, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k".
    500000*(1-116673681/122000000) = 21829
    (122000000-116673681)/5000/365 = 2.92
    122MB is the estimated paq8px result; cmix and nncp won't fit due to the time limit.
    78 replies | 12336 view(s)
  • mitiko's Avatar
    2nd March 2021, 23:40
    Uhm, Wikipedia says it's a general-purpose LZ77 paired with Huffman coding and second-order context modelling, but looking at the code it also includes a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask.
    Meanwhile, I made the counting linear but it's way slower. I tried some performance analysis tools but got nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bitvector (which I implemented myself) is inefficient. I'm considering some DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix array searches, and I'm still debating if it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    17 replies | 872 view(s)
  • SolidComp's Avatar
    2nd March 2021, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    17 replies | 872 view(s)
  • byronknoll's Avatar
    2nd March 2021, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    78 replies | 12336 view(s)
  • Gotty's Avatar
    2nd March 2021, 14:03
    Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory"), that it's usually AB and rarely CDEFGH, you can go for AB first. But in that case the symbols are not equiprobable, so it's not random. And we are going for random content, so you have no extra information about what the result usually is (what distribution the symbols usually follow).
    If we play the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimal number of questions is always log2(n). Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to ask 4096x8 questions. You can't do it in less.
    Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try to find patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! Since it's a random file, we know that there are no patterns there. But paq does not know that. So it tries. When it believes it has found some theory (that some symbols are more probable than others), it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. Like trying to guess the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and to look for the best theory, it soon gives up trying to solve the 8-symbol game in 2 questions - it sticks with 3 questions. And so it reduces its loss. This is the reason why there is always a small loss: because it always takes a bit more risk at the beginning, those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
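    To make the counting argument explicit, a short worked example in the same notation as the game above:
    2 questions -> at most 2^2 = 4 distinguishable outcomes, fewer than the 8 symbols
    3 questions -> 2^3 = 8 outcomes, exactly enough, so log2(8) = 3 is optimal
    k random bits -> 2^k equally likely messages -> no questioning strategy can average fewer than k questions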
    22 replies | 686 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:49
    xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people make claims or statements and they don't fully understand the subject (yet), but it's OK. I sense that he's open and he tries to understand it more deeply, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or a solid background, at least he can get a sense of what randomness means.
    You know how I started in ~2007? Funny: I tried to see if I could compress the already-compressed paq8hp result smaller. I could easy-peasy win the contest. Fortunately I have a math and programming background, and I quickly realized that it wouldn't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). So then I deeply understood the real meaning of randomness. And I can tell: it has its beauty. Whatever you do, however you slice it, you just can't make them smaller. The result will always have the same size (in the optimal case). So I tried it, and have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little bit of that experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    22 replies | 686 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:21
    I'm trying to understand what's happening. So, we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages switch between significant savings and "a little smaller". At this point this is what I get: no breakthroughs, the previous attempts were false alarms. This time you can probably save a couple of bytes on not-completely-random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think that's wise, and I respect it.
    66 replies | 2238 view(s)
  • xinix's Avatar
    2nd March 2021, 06:54
    Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote to you that your "random" 4 KB file is far too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead - the archive format header and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him. Why are you blind? You've been told twice that PAQ already compresses 2 characters! (Reminder: Gotty, don't worry, I'm fine!) Well, PAQ already does what you need it to do: it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to a bit view because PAQ does that itself! Just read: https://en.wikipedia.org/wiki/PAQ http://mattmahoney.net/dc/paq1.pdf You have a lot of time to think about it :) ______ Tell me, do you program? Have you actually learned any programming language?
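    To illustrate the "at least 1 megabyte" point, here is a rough Python sketch (my own, using zlib as a stand-in for whatever archiver is being tested): on incompressible data the container overhead is a visible fraction of a 4 KB file but practically disappears next to 1 MB.

        # Compress random (incompressible) data of two sizes and show how much
        # of the size increase is just format overhead.
        import os, zlib

        for size in (4 * 1024, 1024 * 1024):
            data = os.urandom(size)
            packed = zlib.compress(data, 9)
            overhead = len(packed) - size
            print(f"{size:>8} bytes -> {len(packed):>8} bytes "
                  f"(+{overhead} bytes, {100 * overhead / size:.3f}% overhead)")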
    22 replies | 686 view(s)
  • Trench's Avatar
    2nd March 2021, 05:05
    I rushed my last reply, which was sloppy. You explained well how conversion loses to compression, but you also answered the next question by stating that nobody deals with compressing 2 digits since there is no need for it. By that logic there is no need for me to answer the puzzle either, but it was a challenge for others to learn from. Again, if everyone does the same thing, nothing new gets done and no different perspective is gained. Your reply is right from a conventional standpoint. But why haven't you second-guessed yourself, since "it's not needed," as you say? How do you know that you know it all? Based on experience? A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious. Random to you is not random to me. People with super memory have a structure for remembering "random" numbers or phrases through association; the data may be random to others but not to them. The computer does not know what random is unless a person programs in what is or is not random, which you agree with. So to say it's impossible because the data is random is like saying it's impossible because it is being ignored. Ignoring something does not make it true; it is only a temporary limitation of the person. And that person puts the limitation into the code, which makes the code limited too; then it becomes a standard, and then people stop questioning it, since who will argue with the official stance? That does not seem healthy for innovation, since it puts up boundaries and discourages people. So when so many experts say you cannot compress random data and that the limit has been reached, anyone new feels the experts must know best. That is not healthy for any field of work, and that is what I am trying to get at. Words make or break things and have restricted a lot of innovation. Even dictionaries have changed the meanings of words over time; something as simple as "leisure" now means relaxing in your free time, but much further back it meant using free time to learn, as a quick example. I am not here to change anyone's mind; I am here to offer another perspective. When everyone sees a playing card from one side it may look wide, but from another side it is thin. No one has to agree, but just like your tests, it's a way to practice thinking, and thinking differently helps. On a side note, Tesla liked the number 3 and said the secret to everything is frequency. When he said everything, did he mean everything? As for the ABCDEFGH game, which I did not have time to read last time: 8 letters seems too small to answer in under 3 questions. 8 = you choose 4, 4 = 2, 2 = 1. The selection is too small; other methods would be needed to get it in 2. Sure, your problem can be solved in 2 questions, but the odds of being wrong are greater. LOL, but if you only had 2 questions, you would have to come up with a theory for why you choose what you choose and understand the context of where the data comes from. Not as practical, but it's a good way to think things through. As for the 256 game, which was answered last time: it might be good to take a guess in the first part and eliminate 2/3. I was trying to take that path rather than 1/2, which is the sure and steady way. It has a probability of being wrong and delaying you, but when you are right it has a probability of being fast, which was the purpose of the game. If I were right every time, guessing and eliminating 2/3 would take about 6 turns; if wrong every time and I stuck with it, about 13 turns at worst; and being wrong about half the time it would take around 8 turns as well. So 5 at best (not likely) and 13 at worst. That is how I came to my answer, but I have thought it through now to explain the numbers better. Unless your percentages are different? So you are right, but I think I gave a better answer in speed, which was a risk. You try it with 2/3 eliminated and tell me how fast you get it: the answer is 3, and the second one is 200. Use the same method; which part you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said I was in a rush and did not have time to think it through. Lotto odds also have a pattern which most don't know, but it looks "random" to most. The point is using probability to get better odds. Just like if you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should be near 50% heads or tails. Sure, it can be heads 100 times, but the odds against that grow, even though every flip always has a 50% chance of being heads, whether it has already come up heads 4 times in a row or 8. There are other probability factors in play that most people ignore because of what they see at face value; everyone is tricked, even most mathematicians. You say "Also don't expect that one day we will invent a smart method to compress random data." You already do :) A file is a file; it is the program dealing with it that determines whether it is too "random" to handle. So calling a file random when you know the code is not fair. It is just like a lock: it all looks random, but not to the key you have. All locks are the same; it's the key that determines whether it is in order or not. If anything, the compression program is the "random" one, relying on predetermined patterns. Since some programs compress the same file better than others, the file is not "random"; what is random is how much it gets compressed. It is not the file's fault, it is the program's. Or, to be fair, it is both, and if it does not work the program is not compatible. It is not the 4 KB file's fault that the compression programs do not do a good job - which, as you said, is because there is "no need" to compress 2 characters, which is the programmers' preconceived notion. ;) That is what I am trying to say: everyone is stuck to some degree doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon and throwing more memory and processing power at it. It's like throwing more money at the problem and needing a bigger processor and more memory. Again, most won't agree; I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL, this reply is long enough; I will read the other game another time. You can try my method, if it applies, and see if that works. YEP ;) So try to beat 4.0k.
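    For what it's worth, here is a quick Monte Carlo sketch in Python (my own numbers, not Trench's or Gotty's) comparing the two strategies for the 1..256 game: always halving versus always asking about one third of the remaining range. The gamble is occasionally faster, but it costs more questions on average, which is exactly the "risk" being discussed.

        # Simulate yes/no guessing over 256 equally likely values.
        import random

        def questions(n, fraction):
            # Ask "is it in the first `fraction` of the remaining range?" until one value is left.
            target = random.randrange(n)
            low, remaining, asked = 0, n, 0
            while remaining > 1:
                cut = max(1, int(remaining * fraction))
                asked += 1
                if target < low + cut:
                    remaining = cut
                else:
                    low += cut
                    remaining -= cut
            return asked

        random.seed(1)
        for frac, name in ((0.5, "halving"), (1 / 3, "eliminate-2/3 gamble")):
            trials = [questions(256, frac) for _ in range(100_000)]
            print(f"{name:>22}: avg {sum(trials) / len(trials):.2f}, "
                  f"best {min(trials)}, worst {max(trials)}")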
    22 replies | 686 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:34
    Celebrating: for the first time ever, random files compressed a little smaller and reconstructed exactly the same (no further details will be provided until a separate encoder and decoder are done). Subject to eventual final confirmation with a separate encoder and decoder over 2 clean PCs.
    66 replies | 2238 view(s)
  • Gotty's Avatar
    2nd March 2021, 02:26
    Random files? You mean: more than one? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double-check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2238 view(s)