Activity Stream

  • Sportman's Avatar
    Today, 00:26
    Sportman replied to a thread Zstandard in Data Compression
    It's that the file names include a root folder, so the archive contents sit inside that folder instead of directly in the root as with the 1.4.5-1.4.8 win64 zip files.
    460 replies | 147596 view(s)
  • Sportman's Avatar
    Today, 00:18
    Sportman replied to a thread Zstandard in Data Compression
    Input: 984.801.591 bytes - data

    Output zstd v1.4.8:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.904.965 bytes, 3.375 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.708.829 bytes, 3.547 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.141 sec., zstd -4 --ultra -T1
    142.896.262 bytes, 4.063 sec., zstd -3 --ultra --long -T1
    142.869.673 bytes, 4.125 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.437 sec., zstd -5 --ultra -T1
    137.671.146 bytes, 5.187 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.735 sec., zstd -6 --ultra -T1
    134.660.677 bytes, 5.625 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.552.776 bytes, 7.125 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.547 sec., zstd -8 --ultra -T1
    129.090.886 bytes, 8.328 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.687 sec., zstd -9 --ultra -T1
    128.210.617 bytes, 9.485 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.328 sec., zstd -10 --ultra -T1
    128.056.308 bytes, 10.312 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.812 sec., zstd -11 --ultra -T1
    127.940.684 bytes, 12.109 sec., zstd -11 --ultra --long -T1
    127.445.750 bytes, 14.828 sec., zstd -12 --ultra --long -T1
    127.422.579 bytes, 12.656 sec., zstd -12 --ultra -T1
    126.935.545 bytes, 28.125 sec., zstd -13 --ultra --long -T1
    126.802.723 bytes, 30.359 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.766 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.734 sec., zstd -14 --ultra -T1
    126.499.068 bytes, 37.922 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.860 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.344 sec., zstd -16 --ultra -T1
    117.643.674 bytes, 1:22.531 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.234 sec., zstd -17 --ultra -T1
    112.773.080 bytes, 1:33.485 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:11.281 sec., zstd -18 --ultra -T1
    111.880.990 bytes, 2:16.656 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:46.016 sec., zstd -19 --ultra -T1
    110.962.579 bytes, 2:57.359 sec., zstd -19 --ultra --long -T1
    111.041.788 bytes, 2:55.297 sec., zstd -20 --ultra -T1
    110.644.739 bytes, 3:04.047 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:25.172 sec., zstd -21 --ultra -T1
    110.188.109 bytes, 3:39.516 sec., zstd -21 --ultra --long -T1
    109.903.935 bytes, 3:52.750 sec., zstd -22 --ultra -T1
    109.903.935 bytes, 3:52.719 sec., zstd -22 --ultra --long -T1

    Output zstd v1.4.9:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.930.990 bytes, 1.969 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.731.879 bytes, 2.141 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.156 sec., zstd -4 --ultra -T1
    142.942.060 bytes, 2.656 sec., zstd -3 --ultra --long -T1
    142.912.214 bytes, 2.687 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.453 sec., zstd -5 --ultra -T1
    137.698.236 bytes, 3.781 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.781 sec., zstd -6 --ultra -T1
    134.685.348 bytes, 4.188 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.541.732 bytes, 5.687 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.563 sec., zstd -8 --ultra -T1
    129.079.778 bytes, 6.890 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.703 sec., zstd -9 --ultra -T1
    128.185.107 bytes, 8.016 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.344 sec., zstd -10 --ultra -T1
    128.030.897 bytes, 8.860 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.859 sec., zstd -11 --ultra -T1
    127.915.469 bytes, 10.719 sec., zstd -11 --ultra --long -T1
    127.422.579 bytes, 12.703 sec., zstd -12 --ultra -T1
    127.415.784 bytes, 13.422 sec., zstd -12 --ultra --long -T1
    126.907.957 bytes, 26.703 sec., zstd -13 --ultra --long -T1
    126.775.263 bytes, 28.954 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.750 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.735 sec., zstd -14 --ultra -T1
    126.467.698 bytes, 36.609 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.812 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.329 sec., zstd -16 --ultra -T1
    117.640.741 bytes, 1:21.078 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.187 sec., zstd -17 --ultra -T1
    112.767.845 bytes, 1:32.015 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:10.438 sec., zstd -18 --ultra -T1
    111.878.283 bytes, 2:14.078 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:45.672 sec., zstd -19 --ultra -T1
    111.041.788 bytes, 2:55.078 sec., zstd -20 --ultra -T1
    110.958.922 bytes, 2:55.203 sec., zstd -19 --ultra --long -T1
    110.691.009 bytes, 3:01.454 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:24.843 sec., zstd -21 --ultra -T1
    110.190.299 bytes, 3:36.609 sec., zstd -21 --ultra --long -T1
    109.900.896 bytes, 3:49.813 sec., zstd -22 --ultra -T1
    109.900.896 bytes, 3:49.844 sec., zstd -22 --ultra --long -T1
    460 replies | 147596 view(s)
  • mitiko's Avatar
    Today, 00:04
    Oh, that type of transform... Well, static dictionaries support all types of patterns (like the ones described in the post) and I'm planning on doing this in one way or another. What they're doing with LZ77 + static dictionaries is pretty ingenious actually! Wait, that's actually pretty clever for a fast encoder! Anyway, that's not the point of BWD. I'm trying to preserve the context of the text, the information it holds. After LZ77 and such pattern-y words, the rest of the compression pipeline loses context and it will predict worse. But if we instead create a whole new context based on a pattern, it will compress better and the original stream will be more compressible as well. Patterns are great, they just seem a bit too deterministic right now - for every pattern/dictionary transform that you can imagine, you need to write its own ranking function and extraction function and then handle its serialization to the dictionary, etc. It'd be amazing if we found a way to get the system to generate patterns, or find them in the dictionary after the words have been chosen. For example, if the dictionary is calculated to be something like "the next", "the following", "the lion", "the fox", maybe some independent system can see the pattern and generate the word "the (*)", which will then match all of the above and also split the stream into two - the main stream plus the stream of the pattern. The stream of the pattern would then look something like "next;following;lion;next;fox;fox;fox;next...", which is very compressible. In practice this would suit some grammar rules well (essentially making the grammar information that text holds known, so zero bytes to transmit), like "(*)ing" or "(*)ed". With patterns, we get into this whole idea of context splitting and start to think about whether this is possible in the general case. I have some notes, if you'd like to look at them.
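    A minimal sketch of that idea (my own toy code, not BWD; the names and the single hard-coded pattern are made up): a wildcard entry "the (*)" matches the dictionary words above and sends the variable part into its own, highly repetitive stream, while the main stream keeps one stable symbol.

    using System;
    using System.Collections.Generic;

    // Toy "context splitting" with one wildcard pattern: tokens matching "the (*)"
    // become a single pattern symbol in the main stream, and the variable parts go
    // into a separate stream like "next;following;lion;next;fox;fox;fox;next".
    class PatternSplitDemo
    {
        static void Main()
        {
            string[] words = { "the next", "the following", "the lion", "the next",
                               "the fox", "the fox", "the fox", "the next" };

            const string prefix = "the ";            // fixed part of the pattern "the (*)"
            var mainStream = new List<string>();     // what the rest of the pipeline sees
            var patternStream = new List<string>();  // variable parts, very compressible

            foreach (string w in words)
            {
                if (w.StartsWith(prefix, StringComparison.Ordinal))
                {
                    mainStream.Add("the (*)");
                    patternStream.Add(w.Substring(prefix.Length));
                }
                else
                {
                    mainStream.Add(w);
                }
            }

            Console.WriteLine(string.Join(" ", mainStream));
            Console.WriteLine(string.Join(";", patternStream));
        }
    }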
    17 replies | 793 view(s)
  • Jarek's Avatar
    Yesterday, 23:26
    Thanks, but I am not accepting that my only "award" for ANS is having to defend it for the rest of my life from successive megacorporations. Corporations using it: instead of deploying your own "defensive" legal mines, please just prevent this minefield from being built. Especially since I haven't heard of any ANS-based compressor from Microsoft (?)
    122 replies | 91254 view(s)
  • SolidComp's Avatar
    Yesterday, 22:19
    Brotli has a seemingly novel set of 120 dictionary transforms. You can read a bit about them in this recent Cloudflare post.
    17 replies | 793 view(s)
  • JamesWasil's Avatar
    Yesterday, 19:10
    Nope. Only semantic rewording of your work and fabricated big-tech innovations, for Microsoft to compete with Google via bogus patent claims on what was intended to be free with due credit. This is why we can't have nice things, and why development and certain types of information sharing and discovery are no longer offered in good faith like they were before. Corporations, by majority, have ruined good-faith and academic contributions for the sake of exclusive profiteering, even against the inventor's wishes. It says a lot about the companies and the state of the world when they have to rip off the author, take credit and make money on work released freely, rather than developing their own UNIQUE methods, even if those would yield nearly the same performance or results. It's a cookie-cutter world now. Question: Should Jarek make a global GoFundMe for legal help to fight this, so that the lawyers whom Google, Microsoft, and other big tech don't yet have in their pocket might actually help? I would assume that if Marcus Hutter could throw 50k at the Hutter Prize, he and others might be willing to donate to the Jarek rANS/tANS defense fund?
    122 replies | 91254 view(s)
  • mitiko's Avatar
    Yesterday, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12265 view(s)
  • mitiko's Avatar
    Yesterday, 17:13
    Thanks for looking through the code. I knew I had to eventually remove LINQ, but I don't think I'm using it in any very critical places. Right now, most of the work seems to be in counting the words and ranking. Thanks for the heads-up about foreach loops, I didn't know they were more expensive than for loops.
    The architecture is pretty OK for now, it's like LINQ but async (using IAsyncEnumerable). I thought it might be slowing me down, but it's actually not. It processes data in blocks and I'm testing it for optimality with the block size set to the full file size, so all the data only travels the pipeline once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.
    What the rank (as per the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n = word size, s = occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way. It will prefer:
    "the" -> rank=2000
    "rabbit" -> rank=175
    "lion" -> rank=105
    "and" -> rank=70
    "fox" -> rank=70
    total = 2420
    The ranks here are only off the top of my head. The non-greedy variation you have in mind looks something like:
    "the fox and the rabbit and the lion" -> rank=1225
    "the" -> rank=2000
    total = 3225
    I have actually thought a lot about this and it's not a problem of the ranking function itself, at least not its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree.
    The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:
    A -- AT
    T -- A
    C -- T
    G -- C
         G
    Then we can do parsing before each new word calculation and we won't have the constraint of words not spanning over multiple blocks. I've ruled this out as a good idea, because 1) we have to do parsing for each word and re-parsing alone doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically ranking becomes at most x times slower, if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree, compared to a more spanning tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it. We need a more spanning tree. So maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.
    A couple of assumptions: Right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can re-rank only some. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once per group of similar words. What we can do, for sure, is re-count only some words. I'm still working on that. I have a draft, I just have to fix some minor indexing issues.
    The reason I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique used is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen and it's closer to the actual optimal order of the dictionary. Especially with the look-ahead tree. For sure, re-ranking and re-counting plus the fragmentation fix will create a great compressor with acceptable compression times. But I'm looking to improve BWD in the direction of optimality at the cost of compression time, rather than leveraging memory for time. When I'm done, I'll definitely try to implement one with this stack!
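    To make the (n-1)*s ranking concrete, here is a small standalone sketch (my own illustration, not the actual BWD code; the word counting is deliberately naive): with a 256-entry dictionary each reference costs about one byte, so replacing c occurrences of an l-byte word saves roughly (l-1)*c bytes, and the greedy step just takes the highest-ranked candidate.

    using System;
    using System.Collections.Generic;

    // Naive greedy ranking sketch: rank(word) = (length - 1) * count.
    static class NaiveRankDemo
    {
        static long Rank(string word, string text)
        {
            long count = 0;
            for (int i = text.IndexOf(word, StringComparison.Ordinal); i >= 0;
                 i = text.IndexOf(word, i + 1, StringComparison.Ordinal))
                count++;                                  // occurrences in the block
            return (long)(word.Length - 1) * count;       // ~bytes saved with a 1-byte index
        }

        static void Main()
        {
            string text = "the fox and the rabbit and the lion";
            string[] candidates = { "the", "fox", "and", "rabbit", "lion",
                                    "the fox and the rabbit and the lion" };

            string best = null;
            long bestRank = long.MinValue;
            foreach (string w in candidates)              // greedy: take the top-ranked word
            {
                long r = Rank(w, text);
                Console.WriteLine($"{w} -> rank={r}");
                if (r > bestRank) { bestRank = r; best = w; }
            }
            Console.WriteLine($"greedy pick: {best}");
        }
    }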
    17 replies | 793 view(s)
  • CompressMaster's Avatar
    Yesterday, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
    Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was only recently patched in Win10 had existed since Windows 3.11..
    192 replies | 137083 view(s)
  • Jarek's Avatar
    Yesterday, 15:40
    I was just pointed to a 2019 Microsoft patent for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find anything novel here?
    122 replies | 91254 view(s)
  • LucaBiondi's Avatar
    Yesterday, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM and often that is not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 13:11
    Gotty replied to a thread paq8px in Data Compression
    Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and for that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to see better gains. That costs time and memory. The gain is usually not too great, as the existing models already cover the possibilities quite well. We really do sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 625128 view(s)
  • hexagone's Avatar
    Yesterday, 07:21
    hexagone replied to a thread paq8px in Data Compression
    You mean 'accurate', I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. A 3% improvement for 3x the compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 04:53
    I have only two points (for performance-critical parts of your code):
    - use for loops instead of foreach loops;
    - avoid using LINQ - convert it to normal loops.
    (These modifications will eventually be needed anyway when you really want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (I only had a peek).
    Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than 1 occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that takes the occurrence count into account too. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:
    the fox and the rabbit and the lion the chicken
    the fox and the rabbit and the lion the lion
    the fox and the rabbit and the lion the horse
    the fox and the rabbit and the lion the penguin
    the zebra
    the cat
    the worm
    ... (many more) ...
    the dog
    The weights:
    "the": (n-1)*s = 2*1000 = 2000
    "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105
    So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to just replace the 4 long substrings first and then replace those "the"s. What do you think? I wonder how much difference this metric makes, and whether it's better or worse.
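    A quick toy comparison of the two metrics (my own sketch with made-up counts, not paq or BWD code): ranking by (n-1)*s picks "the" first, while ranking by word size alone picks the long phrase first, so its occurrences would be replaced before "the" can pollute them.

    using System;
    using System.Collections.Generic;

    // Compare which word each metric would replace first.
    class MetricDemo
    {
        static void Main()
        {
            var counts = new Dictionary<string, int>
            {
                ["the"] = 1000,                                  // short, very frequent
                ["the fox and the rabbit and the lion"] = 4      // long, rare (made-up counts)
            };

            string byOccurrence = null, bySize = null;
            long bestOcc = -1, bestLen = -1;
            foreach (var kv in counts)
            {
                long occRank = (long)(kv.Key.Length - 1) * kv.Value;  // (n-1)*s
                if (occRank > bestOcc) { bestOcc = occRank; byOccurrence = kv.Key; }
                if (kv.Key.Length > bestLen) { bestLen = kv.Key.Length; bySize = kv.Key; }
            }

            Console.WriteLine($"(n-1)*s picks first:   {byOccurrence}");  // "the" (2000 vs 136)
            Console.WriteLine($"size-only picks first: {bySize}");        // the long phrase
        }
    }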
    17 replies | 793 view(s)
  • JamesWasil's Avatar
    Yesterday, 03:04
    JamesWasil replied to a thread WinRAR in Data Compression
    This may have been missed or not seen a year or two ago, but I found it interesting that a vulnerability like this existed for almost 20 years in the wild. Technically not WinRAR's fault, but support for WinACE and external formats was why and how the exploit came to be: https://www.theverge.com/2019/2/21/18234448/winrar-winace-19-year-old-vulnerability-patched-version-5-70-beta-1
    192 replies | 137083 view(s)
  • Alexander Rhatushnyak's Avatar
    Yesterday, 02:19
    I won't submit anything either this year or next year. No later than late 2022, someone else will win a prize. Maybe a simplified version of cmix or nncp, maybe indeed something like paq8px with a better NLP part.
    78 replies | 12265 view(s)
  • Gotty's Avatar
    Yesterday, 00:26
    Gotty replied to a thread paq8px in Data Compression
    It looks like the timing is not reliable - look at the last 2 rows (Paq8px_v201 -10 / Paq8px_v201 -11). In the xml column they are the same (to the hundredth), which is probably an error. Ignoring this column, the ratio still varies between 1.29 and 4.29. That's a very large spread. Luca was probably running multiple instances at the same time or running other tasks besides paq8px. So it's best to ignore the timings. They are not really useful. If you still would like to compare timings, you may need to clean the results by scaling the outliers, but most importantly compare results obtained with the same command line options. You may compare versions at the same compression level (8 to 8, 9 to 9, etc.), and especially by whether lstm was used or not. I'd say the "slowdown" from v95 to v201 is probably about 3x. Yes, it's still a lot. But for a 3-4% compression gain it's worth it.
    2337 replies | 625128 view(s)
  • Scope's Avatar
    5th March 2021, 22:13
    https://gitlab.com/wg1/jpeg-xl/-/commit/65825c8629f181dbe8bdd432b15576735af29701
    15 replies | 2018 view(s)
  • Fallon's Avatar
    5th March 2021, 15:39
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR - What's new in the latest version https://www.rarlab.com/download.htm Version 6.01 beta 1
    192 replies | 137083 view(s)
  • Shelwien's Avatar
    5th March 2021, 14:07
    https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors
    I made a benchmark which times something like this loop:

    #define rndup(seed) (seed = 214013 * seed + 2531011)
    s0=1*seedinc; s1=2*seedinc; s2=3*seedinc; s3=4*seedinc;
    for( i=0; i<numreads; i++ ) {
      s0 += buf; s1 += buf; s2 += buf; s3 += buf;
      rndup(s0); rndup(s1); rndup(s2); rndup(s3);
    }

    for a number of streams from 1..64 (the loop code is generated by a script) and tested it on two CPUs: i7-7820X @ 4.5GHz and i7-8650U @ 2.5GHz (see pictures). For the 7820X there's no clear result, just that timings converge after ~16 streams and there's no further speed improvement (per stream). For the 8650U it seems like there are ~16 streams for L3 and 6-8 for RAM; there's a visible speed difference after that.
    0 replies | 92 view(s)
  • Gotty's Avatar
    5th March 2021, 12:29
    First of all, your use case is not clear to me: what is the purpose of hashing in your case? Do you need uniqueness? If yes, why? Do you need characters in the hash? Or are bytes also OK?
    >> to convert a street address
    Addresses are terrible regarding their quality. What are you planning to do with addresses that are written differently but represent the same actual address? Do you plan a preprocessing stage to clean the addresses up?
    2 replies | 60 view(s)
  • Shelwien's Avatar
    5th March 2021, 06:32
    > So for example, what's the best way to convert a street address to a short fixed-length token?
    Some cryptohash (md5, sha1, sha256, ...). Faster hashes might work too, but are less reliable.
    > Say a six-character string.
    If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of the base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z
    > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
    There's obviously no perfect solution, but one idea is to modify the actual strings (e.g. add spaces) until a unique hash value is found.
    > Unless... there's some way to guide the hash so that there are no collisions
    Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
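    A minimal C# sketch of the above (my own illustration; the helper names are hypothetical, and note that Base64 output includes '+' and '/', so the character set may need adjusting): hash with SHA-256 and keep either the first 6 Base64 characters or the first 6 raw bytes.

    using System;
    using System.Security.Cryptography;
    using System.Text;

    static class AddressToken
    {
        // 6-character restricted-charset token taken from a SHA-256 hash.
        public static string SixCharToken(string address)
        {
            using var sha = SHA256.Create();
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(address));
            return Convert.ToBase64String(hash).Substring(0, 6);
        }

        // Binary variant: just the first 6 bytes of the hash.
        public static byte[] SixByteToken(string address)
        {
            using var sha = SHA256.Create();
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(address));
            byte[] token = new byte[6];
            Array.Copy(hash, token, 6);
            return token;
        }
    }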
    2 replies | 60 view(s)
  • SolidComp's Avatar
    5th March 2021, 03:29
    Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes. For a country with a max of 10 million possible current and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set covers 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers). It's actually close to the factorial of that number since the input address is variable length. If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses. Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks.
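    A back-of-the-envelope check of the collision question (my own sketch, standard birthday-bound arithmetic, not anything from the thread): with n = 10 million addresses hashed into N = 30^6 = 729 million 6-character codes, the expected number of colliding pairs is roughly n(n-1)/(2N), i.e. tens of thousands, so collisions are unavoidable without some assignment scheme.

    using System;

    class CollisionEstimate
    {
        static void Main()
        {
            double n = 10_000_000;             // addresses
            double N = Math.Pow(30, 6);        // 729 million possible 6-char tokens
            double expectedPairs = n * (n - 1) / (2 * N);
            Console.WriteLine(expectedPairs);  // ~68,600 expected colliding pairs
        }
    }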
    2 replies | 60 view(s)
  • LucaBiondi's Avatar
    5th March 2021, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me @Darek please! Luca
    2337 replies | 625128 view(s)
  • Darek's Avatar
    5th March 2021, 00:04
    Darek replied to a thread paq8px in Data Compression
    @LucaBiondi - maybe I'm not a master of Excel, but I could help you make some changes.
    2337 replies | 625128 view(s)
  • hexagone's Avatar
    4th March 2021, 23:45
    hexagone replied to a thread paq8px in Data Compression
    This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement cost x5 to x15 in compression time.
    2337 replies | 625128 view(s)
  • Hacker's Avatar
    4th March 2021, 23:19
    Yup, I am aware of JPEG XL. I hope for some wider adoption, as I do time and time again with each hopeful JPEG successor.
    32 replies | 2394 view(s)
  • Jarek's Avatar
    4th March 2021, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    460 replies | 147596 view(s)
  • SolidComp's Avatar
    4th March 2021, 06:15
    It needs a much higher quality display to be competitive with the major competitors in this space. It also seems to be missing basic biometrics, like an IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover. That's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent - I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 117 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
    Done! I hope you enjoy the graphs of size vs. time.
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the excel itself (in a zip or 7zip).
    2337 replies | 625128 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of compressing my dataset, starting from PAQ8PX_V95(!). I have also plotted size/time for each data type. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help to plot the data in a better way.. Thank you, Luca
    2337 replies | 625128 view(s)
  • Darek's Avatar
    3rd March 2021, 11:16
    Darek replied to a thread paq8px in Data Compression
    No, those were different moments. It's probably something with my laptop. I suppose it could be an issue of low space on disk C (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking the system up, paq8px quits without any message. For now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 that's about 4 days of time which must not be disturbed.
    2337 replies | 625128 view(s)
  • Shelwien's Avatar
    3rd March 2021, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k".
    500000*(1-116673681/122000000) = 21829
    (122000000-116673681)/5000/365 = 2.92
    122MB is the estimated paq8px result, cmix and nncp won't fit due to time limit.
    78 replies | 12265 view(s)
  • mitiko's Avatar
    2nd March 2021, 23:40
    Uhm, Wikipedia says it's a general-purpose LZ77 paired with Huffman coding and second-order context modeling, but looking at the code it also includes a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask. Meanwhile, I made the counting linear but it's way slower. I tried some performance analyzing tools but nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bitvector (which I implemented myself) is inefficient. I'm considering some DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix array searches, and I'm still debating if it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    17 replies | 793 view(s)
  • SolidComp's Avatar
    2nd March 2021, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    17 replies | 793 view(s)
  • byronknoll's Avatar
    2nd March 2021, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    78 replies | 12265 view(s)
  • Gotty's Avatar
    2nd March 2021, 14:03
    Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory") - that it's usually AB, and rarely CDEFGH - you can go for AB first. But in this case the symbols are not equiprobable, so it's not random. And we are going for random content, so you don't have any extra information about what the result usually is (what distribution the symbols usually follow). If we play the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimum is always log2(n). Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to do it in 4096x8 questions. You can't do it in less. Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try to find patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! The file being random, we know that there are no patterns there. But paq does not know that. So it tries. When it believes it has found some theory (that some symbols are more probable than others) it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. Like trying to guess the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and to look for the best theory, it will soon give up trying to solve the 8-symbol game in 2 questions - it will stick with 3 questions. And so it reduces its loss. So this is the reason why there is always a small loss: because it always takes a bit more risk at the beginning, so those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
    21 replies | 616 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:49
    xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people make claims or statements without fully understanding the subject (yet), but it's OK. I sense that he's open and he tries to understand it more deeply, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or a solid background, at least he can get a sense of what randomness means. You know how I started in ~2007? Funny: I tried to see if I could compress the already compressed paq8hp result smaller. I could easy-peasy win the contest. Fortunately I have a math and programming background, and I quickly realized that it won't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). So then I deeply understood the real meaning of randomness. And I can tell you: it has its beauty. Whatever you do, however you slice it, you just can't make them smaller. The result will always have the same size (in the optimal case). So I tried it, and I have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    21 replies | 616 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:21
    I'm trying to understand what's happening. So, we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages switch between significant savings and "a little smaller". At this point this is what I get: no breakthroughs, the previous attempts were false alarms. This time you can probably save a couple of bytes with not-completely-random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think it's wise, and I respect that.
    66 replies | 2210 view(s)
  • xinix's Avatar
    2nd March 2021, 06:54
    Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote to you that your "random" 4 KB file is much too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead, the archive format header and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him. Why are you blind? You've been told 2 times that PAQ already compresses 2 characters! (Reminder: Gotty, don't worry, I'm fine!) Well, PAQ already does what you need it to do, it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to a bit view because PAQ does it itself! Just read:
    https://en.wikipedia.org/wiki/PAQ
    http://mattmahoney.net/dc/paq1.pdf
    You have a lot of time to think about it :)
    ______
    Tell me, do you program? Have you really learned any programming language?
    21 replies | 616 view(s)
  • Trench's Avatar
    2nd March 2021, 05:05
    I rushed my last reply, which was sloppy. You said it well about how conversion beats compression, but you also answered the question afterwards by stating that no one deals with compressing 2 digits since there is no need for it. Again, there is no need for me to answer the puzzle either, but it was a challenge for others to learn from. And again, if everyone does the same thing, nothing new is being done to get a different perspective. Your reply is right from a conventional standpoint. But why have you not second-guessed yourself? Since it's not needed, as you say. How do you know that you know it all? Based on experience. A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious.
    Random to you is not random to me. People with super memory have a structure for remembering random numbers or phrases through association. They may be random for others, but not for a few. The computer does not know what random is unless a person puts in what is or is not random, which you agree with. So to say it's impossible because it is random is like saying it is impossible because it is being ignored. Ignoring something does not make it truly so, only temporarily, within the limitation of the person. And that person puts it in the code, which also makes it limited. Then it becomes a standard, and then people ignore it, since who will argue with the official stance? That does not seem healthy for innovation, since it puts up boundaries and discourages. So when everyone says you cannot compress random data, it means it has reached its limit for so many experts, and anyone new feels the experts know best. That is not healthy for any field of work, which is what I am trying to get at. Words make or break things, and that has restricted a lot of innovation. Even dictionaries have changed the meaning of words over time. Even something as simple as the word "leisure" now means to relax with free time, but it used to mean using free time to learn, if we go much further back, as a quick example.
    I am not here to change anyone's mind, I am here to offer another perspective. When everyone sees a playing card from one side it may look wide, but from another it is thin. No one has to agree, but just like your tests, it's a way to get used to thinking, and thinking differently, to help things. On a side note, Tesla liked the number 3 and he said the secret to everything is frequency. When he said everything, did he mean everything?
    As for the ABCDEFGH game, which I did not have time to read last time: the 8 letters seem too small to answer in under 3. 8 = you choose 4, 4 = 2, 2 = 1. The selection is too small to deal with; it needs other methods to get 2. Sure, your problem can be solved with 2, but again the odds of being wrong are greater. LOL. But if you only had 2 choices, then you have to come up with a theory on why you choose, and understand the context of where it is coming from. Not as practical, but it's a good way to think things through.
    As for the 256, which was answered last time: it might be good to take a guess in the first part to eliminate 2/3. I was trying to take that path rather than 1/2, which is the sure and steady way to get a shorter answer. It has a probability of being wrong and delaying more, but when right it has a probability of being fast, which was the purpose of the game. If I was right all the time, guessing and eliminating 2/3 would be 6 turns; if wrong and I stuck to it, then 13 turns at worst. To be wrong every time the probability would be 50%, so within that 50% it would take 8 turns as well. So 5 at best (not likely), 13 at worst. That is how I came to my answer, but I've thought it through now to understand it better and explain the numbers. Unless your percentage is different? So you are right, but I think I gave a better answer in speed, which was a risk. You try it with 2/3 eliminated and tell me how fast you get it. The answer is 3, and the 2nd one is 200. Use the same method; which order you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said I was in a rush and did not have time to think it through.
    Lotto odds also have a pattern which most don't know, but it looks "random" to most. The thing is probability, to get better odds. Just like if you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should be near 50% heads or tails. Sure, it can be heads 100 times, but the odds have increased, even though every flip always has a 50% chance of being heads, whether it has been heads 4 times in a row or 8 times. There are other probability factors in play that most ignore because of what they see at face value, which tricks everyone, even most mathematicians.
    You say "Also don't expect that one day we will invent a smart method to compress random data." You already do. :) A file is a file. The program that deals with it determines whether it is too "random" for it to handle. So to call a random file not random when you know the code is not fair. It is just like a lock: it is all random, but not to the key you have. All locks are the same; it's the key that determines whether it is in order or not. If anything, the compression program is the "random" one for using predetermined patterns. Since some programs compress better than others on the same file, that is not "random"; what is random is how much it gets compressed. It is not the file's fault, it is the program's. Or, to be fair, it is both, and if it does not work, the program is not compatible. It is not the 4K file's fault that the compression programs do not do a good job - which, as you said, there is no need to compress 2 characters, and that is based on programmers' preconceived notions. ;) Which is what I am trying to say: everyone is stuck to some degree doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon and putting more memory and processing power into doing so. It's like throwing more money at the problem and needing a bigger processor and more memory. Again, most won't agree, and I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL. This reply is long enough; I will read the other game another time. You can try my method, if it applies, and see if that works. YEP ;) So try to beat 4.0k.
    21 replies | 616 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:34
    Celebrating: for the first time ever, random files compressed a little smaller and reconstructed exactly the same (no further details will be provided until the separate encoder and decoder are done). Subject to eventual final confirmation with a separate encoder and decoder on 2 clean PCs.
    66 replies | 2210 view(s)
  • Gotty's Avatar
    2nd March 2021, 02:26
    Random files? You mean: more than one? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double-check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2210 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:03
    Sportman's provided random files (please leave any further requests for technical details for when the separate encoder and decoder are done. None will be responded to at this time).
    66 replies | 2210 view(s)
  • Gotty's Avatar
    2nd March 2021, 01:46
    Are these inputs random files?
    66 replies | 2210 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 01:34
    Hi, I have made good progress: the compressor coding is done and I have now made the decompressor much simpler. At the moment the decompressor works directly from the compressed internal variables and decompresses to exactly the same data. The only work left is to separate it into an encoder and a decoder. But do not underestimate the amount of 'mundane' tedious work involved. It will likely take a week to be able to do as requested. But we can and should already be celebrating. (It can now handle input of any size, too.) LawCounsels
    66 replies | 2210 view(s)
  • Gotty's Avatar
    1st March 2021, 20:39
    Gotty replied to a thread paq8px in Data Compression
    Does it happen around the same %? How does it crash? Blue screen?
    2337 replies | 625128 view(s)
  • Shelwien's Avatar
    1st March 2021, 02:58
    > am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them?
    In a similar case in PPM (called masked contexts), symbols seen in higher-order contexts before the escape are excluded from probability estimation. But in most CM implementations, submodels (based on contexts of different orders in this case) are handled independently, so multiple contexts can provide predictions based on the same data. Overall the first approach is better, but it is also harder to implement in a CM, and CM provides better compression than PPM even without it.
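    A toy sketch of that exclusion idea (my own illustration with made-up counts, not Shelwien's code or any real PPM/CM implementation): the order-2 estimate masks out the symbols the order-3 context has already predicted, so only the new information remains.

    using System;
    using System.Collections.Generic;

    class ExclusionDemo
    {
        // Probabilities from counts, skipping symbols already covered by a higher order.
        static Dictionary<string, double> Estimate(Dictionary<string, int> counts, ISet<string> masked)
        {
            int total = 0;
            foreach (var kv in counts)
                if (!masked.Contains(kv.Key)) total += kv.Value;

            var probs = new Dictionary<string, double>();
            foreach (var kv in counts)
                if (!masked.Contains(kv.Key)) probs[kv.Key] = (double)kv.Value / total;
            return probs;
        }

        static void Main()
        {
            var order3 = new Dictionary<string, int> { ["ate"] = 2 };                  // "my dog >"
            var order2 = new Dictionary<string, int> { ["ate"] = 2, ["jumped"] = 1 };  // "dog >"

            var masked = new HashSet<string>(order3.Keys);   // exclude what order-3 already predicts
            foreach (var kv in Estimate(order2, masked))
                Console.WriteLine($"{kv.Key}: {kv.Value}");  // jumped: 1
        }
    }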
    196 replies | 17169 view(s)
  • Gotty's Avatar
    1st March 2021, 01:25
    I agree: unfamiliar. Actually there is a better word. It is called "unpredictable". See the wikipedia definition of randomness: "randomness is the apparent lack of pattern or predictability in events". See? "Apparent lack". It means that the file may have a pattern, you just don't find it or see it. From your viewpoint then the file is practically random. There is no way you can compress it. I also agree: "all files are random", but you need to add "... until you find some useful patterns". Otherwise this statement in itself is not valid.
    21 replies | 616 view(s)
  • Gotty's Avatar
    1st March 2021, 01:11
    Let's try another (different) approach for the second game. Let's say that you record the answers to my questions with ones ("yes") and zeroes ("no"). Here are my questions, and your answers:
    1. Is x >= 128? No
    2. Is x >= 64? No
    3. Is x >= 32? No
    4. Is x >= 16? No
    5. Is x >= 8? Yes (we are in 8..15)
    6. Is x >= 12? Yes (we are in 12..15)
    7. Is x >= 14? No (we are in 12..13)
    8. Is x >= 13? Yes
    Then it is 13. What did you record? No-no-no-no-yes-yes-no-yes -> 00001101. This looks like 8 bits. In other words, a byte... Hmmm... Let's look up this number in decimal. It's 13. Wow! Try it with other numbers. The recorded yes-no answers will always match the number's binary representation. Isn't that cool? Your yes-no answers and the binary format will be the same when you follow the above question structure. If the number is between 1-65536, that's 16 yes-no questions. Can you calculate the required number of questions somehow? Yes. The needed number of questions is always log2(n), where n is the number of possible symbols (the range). So log2(256) = 8 and log2(65536) = 16.
    Now think with me. A bit more concentration... Let's do the opposite. You thought of 8 bits (8 numbers, each either 0 or 1). Let's stay with the same example. Let them be 00001101. How can I find out which 8 bits you have? I simply need to ask for the bits one after the other:
    Is the 1st bit 1? No
    Is the 2nd bit 1? No
    Is the 3rd bit 1? No
    Is the 4th bit 1? No
    Is the 5th bit 1? Yes
    Is the 6th bit 1? Yes
    Is the 7th bit 1? No
    Is the 8th bit 1? Yes
    Notice that again it's No-no-no-no-yes-yes-no-yes. It matches the bits (00001101) exactly, again. And that represents 13. You needed 8 yes-no questions, in other words: 8 bits.
    It's even more amazing when you have probabilities for the symbols. (Actually this is the universal case; it covers the above yes-no questions as well.) You can ask questions like this (guessing 13 from 1..256 again):
    Is it 1? (p=1/256) No
    Is it 2? (p=1/255) No
    Is it 3? (p=1/254) No
    Is it 4? (p=1/253) No
    Is it 5? (p=1/252) No
    Is it 6? (p=1/251) No
    Is it 7? (p=1/250) No
    Is it 8? (p=1/249) No
    Is it 9? (p=1/248) No
    Is it 10? (p=1/247) No
    Is it 11? (p=1/246) No
    Is it 12? (p=1/245) No
    Is it 13? (p=1/244) Yes
    Got it: it's 13. In order to find out the information content, you'll need to sum {-log2(1-p) for the "No"s and -log2(p) for the "Yes"es} over the above answers. Here is the result for each question:
    Is it 1? (p=1/256) No 0.005646563
    Is it 2? (p=1/255) No 0.00566875
    Is it 3? (p=1/254) No 0.005691112
    Is it 4? (p=1/253) No 0.005713651
    Is it 5? (p=1/252) No 0.00573637
    Is it 6? (p=1/251) No 0.005759269
    Is it 7? (p=1/250) No 0.005782353
    Is it 8? (p=1/249) No 0.005805622
    Is it 9? (p=1/248) No 0.005829079
    Is it 10? (p=1/247) No 0.005852726
    Is it 11? (p=1/246) No 0.005876566
    Is it 12? (p=1/245) No 0.005900601
    Is it 13? (p=1/244) Yes 7.930737338
    Sum those numbers and be amazed. It's again 8. (That's 13 questions with skewed probabilities, carrying 8 bits of information content.) You can also start from the top: Is it 256? 255? ... 13?. Even though you'd ask many more questions, the answer will still be 8. Try it. No matter what order you ask in, no matter how you slice up your range, it will always be 8 bits. See? The method or the number of symbols used is irrelevant: whether it's 8 bits or 1 byte (base 2 or base 256), it's still 256 symbols, and you always need 8 bits to represent a number from this range.
    Actually, whatever format you store anything in, it's always the same: the information content is how many yes-no questions you need to figure out the full content. This is the number of bits. You cannot do it in less - you cannot compress these 8 bits of information to be less. If you understand all the above (even partially) you will have a better grasp: no matter how you transform your random files, no matter what approach you use, no matter how many symbols you use to represent your data, the information content will always be the same. (4K in your case.)
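    A quick check of that sum (my own sketch, nothing more): adding -log2(1-p) for every "No" and -log2(p) for the final "Yes" when guessing 13 out of 1..256 one value at a time gives exactly 8 bits.

    using System;

    class InfoContentDemo
    {
        static void Main()
        {
            double bits = 0;
            for (int k = 1; k <= 13; k++)
            {
                double p = 1.0 / (257 - k);                  // 1/256, 1/255, ..., 1/244
                bits += (k == 13) ? -Math.Log(p, 2)          // the final "Yes"
                                  : -Math.Log(1 - p, 2);     // the twelve "No"s
            }
            Console.WriteLine(bits);                         // 8 (up to floating-point rounding)
        }
    }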
    21 replies | 616 view(s)
  • Self_Recursive_Data's Avatar
    1st March 2021, 00:26
    Does the following help prediction? How, if so? I tried it once but it didn't work. If we take 2 sets/orders of predictions to blend:
    SET1
    my dog>ate
    my dog>ate
    +
    SET2
    dog>ate
    dog>ate
    dog>jumped
    , am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them? They are in the same location in the dataset, because order3 is found in order2 but not always the other way around. The dataset looks like (with to emphasize the matches): "], ], ". As we can see here, some of the order3 is order2.
    196 replies | 17169 view(s)
  • Gotty's Avatar
    1st March 2021, 00:17
    Not bad! So you figured out that you need to do halving to find the number. The correct answer is 8 actually, but your approach is good, so accepted ;-).
    21 replies | 616 view(s)
  • Gotty's Avatar
    1st March 2021, 00:16
    Stuck = no improvements. But we do have improvements in the last 10 years in every field (image, video, general), both practical and experimental. Are you following the news? :-) As time progresses, the leaps will be smaller and smaller - we are slowly but surely approaching the theoretical compression limit that is reachable in reasonable time. So don't expect anything "extraordinary". Also don't expect that one day we will invent a smart method to compress random data. I think that would count as a big leap, right? 2-digit ASCII compression would be everywhere if that were a common format for sharing data. But no one shares data like that. So it's not in focus at all. You convert your binary file to a 2-digit file when you want to visualize the bits. Other than that it doesn't seem to be a useful format. It's just a waste of space. Do you use or see 2-digit files? Look, xinix also immediately converted it to binary. Why? That's the natural and compact container for such content.
    21 replies | 616 view(s)
  • Trench's Avatar
    28th February 2021, 23:11
    ---- OK, I just saw post #8. As explained in another post, all files are random: you can take any file and it will look random to you at face value. The program does not see it as random since it looks for certain patterns, and those patterns become "familiar" enough to compress it. Even with a dictionary size of 128 it still has limitations. If the dictionary size were 4, it would miss out on the other 124, which you would call random when it is obviously not. It is just not familiar to the program, in terms of what to use. The computer does not know the difference between 1234 and 1z$T. You see the difference, but you are not doing it manually with your own dictionary in your head, and the computer does not unless you tell it. For people who talk technically, "random" should not be in the definition, but "unfamiliar" should be. And I stick by that, even though I can slip up and say it at times, being influenced by others. LOL
    Anyway, you are not stuck? So you made leaps and bounds on compression, or did we all hit a wall for the past 10 years in seeing major improvements to the standard model of compression? Some programs have improved a bit but are not practical, which is why they are not in general use. You helped explain the comprehension barrier, but the issue still remains that I do not see any effort to improve 2-digit compression since everyone is focused on the standard, and that is something worth focusing on. People can disagree or disregard it, but it's a new door to explore. You notice how a computer has memory? Well, it tried to copy the mind, and it is hard to keep track when memory is full of other things. LOL. So I lost track of the 33,420 info. lol. I disagree that both are not random, and I don't like that word. You say I cannot beat 4.0 KB - the question is, will you try to?
    From doubling the numbers 1 2 4 8, 16 32 64, 128 256:
    1. Is it between 1-128? (2/3 off) Yes
    2. Is it 8 to 32? (2/3 off) Yes
    3. Is it 8-16? Yes (8 9 10, 11 12 13, 14 15 16)
    4. Is it 11-13? Yes
    5. Is it 11-12? No
    6. The answer is 13.
    Which would mean 6 questions, maybe with some errors. It was the first thing that came to mind, so I could be wrong, but I did it fast. Faster but inaccurate to a smaller degree. If you had said 7, I would have lost. LOL. Will reply to the rest later.
    21 replies | 616 view(s)
  • hexagone's Avatar
    28th February 2021, 23:02
    Code is here: https://github.com/mathieuchartier/mcm/blob/master/Dict.hpp & https://github.com/mathieuchartier/mcm/blob/master/WordCounter.hpp Look at CodeWordGeneratorFast (Dict.hpp), especially WordCount::CompareSavings & Savings (WordCounter.hpp) for the ranking logic.
    17 replies | 793 view(s)
  • Shelwien's Avatar
    28th February 2021, 22:43
    There was this: https://encode.su/threads/482-Bit-guessing-game Humans are actually pretty bad at generating random numbers.
    21 replies | 616 view(s)
  • mitiko's Avatar
    28th February 2021, 22:40
    Thanks for the tests, but it's not quite optimized yet. I might have forgotten to bump up the block size, so it has done 10 blocks to compress enwik8, and it needs 10 dictionaries instead of 1. The LZ probably only generates overhead as well. I'm rewriting the ranking loop now, so it can do ranking with states and also save some CPU on hashtable lookups. I'm pretty busy with school but I'll try to drop it this week. mcm seems interesting - do you know what it uses as a ranking function? I'm a little bit lost in the code.
    17 replies | 793 view(s)
  • Darek's Avatar
    28th February 2021, 22:36
    Darek replied to a thread Paq8sk in Data Compression
    OK, I'll try. Somehow I've had some issues with my laptop recently - I've started enwik9 five times with paq8px v201 - and even after 3-4 days my computer crashes...
    228 replies | 23706 view(s)
  • Gotty's Avatar
    28th February 2021, 21:47
    Second game: Let's say that you thought of a number between 1-256. I'd like to figure out that number. It could be any number in that range. Let's say it is 13. What do you think how many yes-no questions do I need to find it out? I encourage you to really try to find out the answer. It's free tutoring :-) Take advantage of the opportunity.
    21 replies | 616 view(s)
  • Gotty's Avatar
    28th February 2021, 21:39
    Please follow the game here: Think of a letter between A-H. That is: A, B, C, D, E, F, G, H (8 different symbols). Let me find out which it is. How many yes-no questions do I need to ask in the optimal case (= fewest possible questions)? Let's say you think of "F". My approach:
    - Is it between A-D? - No.
    - (Then it is between "E"-"H". Let's halve that range.) Is it between "E"-"F"? - Yes.
    - (I have two choices now: either "E" or "F".) Is it "E"? - No.
    - I got it! It's "F"!
    How many questions did I need? 3. Could I solve it in fewer questions? 2 questions? Is there a smarter way?
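    The halving strategy written out as a tiny sketch (my own illustration, nothing more): guessing one of the 8 symbols by repeatedly asking "is it in the lower half?" always takes log2(8) = 3 yes/no questions, whichever symbol was chosen.

    using System;

    class GuessDemo
    {
        static void Main()
        {
            char secret = 'F';
            char lo = 'A', hi = 'H';
            int questions = 0;
            while (lo < hi)
            {
                char mid = (char)((lo + hi) / 2);     // e.g. first question: A-D vs E-H
                questions++;
                if (secret <= mid) hi = mid;          // "yes, it's in the lower half"
                else lo = (char)(mid + 1);            // "no, it's in the upper half"
            }
            Console.WriteLine($"{lo} found in {questions} questions");  // F found in 3 questions
        }
    }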
    21 replies | 616 view(s)
  • hexagone's Avatar
    28th February 2021, 21:31
    You should look at the dict encoding in https://github.com/mathieuchartier/mcm. It dynamically creates a dictionary of ranked words.
    mcm_sk64.exe -filter=dict -store enwik8
    Compressing to enwik8.mcm mode=store mem=6
    Analyzing 97656KB , 46591KB/s
    text : 1(95.3674MB)
    Analyzing took 2.098s
    Compressed metadata 14 -> 18
    Compressing text stream size=100,000,000
    Constructed dict words=32+11904+33948=45884 save=6956257+30898956+3574176=41429389 extra=0 time=0.064s
    Dictionary words=45884 size=376.186KB
    97656KB -> 59990KB 74774KB/s ratio: 0.61431
    Compressed 100,000,000 -> 61,430,639 in 1.38400s
    Compressed 100,000,000->61,430,671 in 3.544s bpc=4.914
    Avg rate: 35.440 ns/B
    17 replies | 793 view(s)
  • Gotty's Avatar
    28th February 2021, 21:25
    I don't feel that I'm stuck. I'm posting improvements quite regularly. Why would you say that we are stuck? Data compression is a technical challenge - not a philosophical one. So we address the issue from a technical point of view. If you think there is a better approach... show us. But it should work ;-) Your issue is that you would like to understand why compressing a 2-digit file is worse than simply converting it to binary. This is what you asked. This is what I explained. Regarding what the issue is - I think we are on the same page. No, it's not. The 33,421 bytes could be compressed to 4-5K, it's clearly not random. The 4096 bytes are random. From a (general) compression point of view these two files are totally different. In your head (and in my head) the binary file is equivalent to the 2-digit file (and it's absolutely true from our viewpoint), but they are not equal for the compression engines. Please read my explanation above again. It looks like my explanation didn't get through. I tried my best to make it clear - it looks like I was not very successful. Please read my posts again and tell me where it is difficult to understand. I'm ready to try to explain it better. No, they will never beat conversion. The entropy is 4096 bytes in each of your cases. That's how low you can go, no matter what. Regarding their information content, they are equal - you are right! But "technically" they are different: the 2-digit file needs to be actually compressed to get to (nearly) 4K, while the 4K file is already 4K - nothing to do from a compression point of view. You cannot beat 4K. That's the limit. What do you mean by "conventional excuses"? Is it that you hear similar explanations again and again that random content is not compressible? Because it's not. You experienced it yourself. If you read most of the threads in the "Random Compression" subforum, you'll notice that most people try converting some random content to different formats, hoping that it will become compressible. Like yourself. It's the most natural approach, it's true. So everyone is doing it. But everyone fails. Do you think that there is some law behind it? There is. You still have the same information content (4K) in any format you convert the file to. If you convert it to any format that is actually compressible (like the 2-digit format), you just give file compressors a hard time, because they need to do actual compression to go down to near 4K. What they see is not a random file - the 2-digit format is not random. You mean 256 symbols and 2 symbols (not digits). Your intuition is correct: it's easier to work with 2 symbols than 256 symbols. This is how the paq* family works. But it does not mean that a 2-symbol file is more compressible than a 256-symbol file. From an information content point of view they are equal. It's surprising, I know. In my next post I'll give you an example, maybe it helps you grasp what information content is.
    21 replies | 616 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:19
    I'm just curious how much Silesia can be compressed using the new mixer context set in each predictor...
    228 replies | 23706 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:09
    The base of paq8sk is paq8pxd. @darek, could you test it on Silesia using the same parameters as you use when testing paq8pxd?
    228 replies | 23706 view(s)
  • xinix's Avatar
    28th February 2021, 20:24
    Does compression seem to be enabled now? I would like to test only the preprocessor.
    enwik8 lzma - 23.6 MB (24,750,737 bytes)
    enwik8.bwd lzma - 34.7 MB (36,438,877 bytes)
    ___
    enwik8 PAQ fp8 - 18.6 MB (19,566,727 bytes)
    enwik8.bwd PAQ fp8 - 29.9 MB (31,357,466 bytes)
    17 replies | 793 view(s)
  • xinix's Avatar
    28th February 2021, 19:37
    I tested this on enwik8: 95.3 MB (100,000,000 bytes)
    Processing lasted 5 hours on AMD Ryzen 5 3600X
    enwik8.bwd: 83.7 MB (87,829,234 bytes)
    17 replies | 793 view(s)
  • Darek's Avatar
    28th February 2021, 18:46
    Darek replied to a thread Paq8sk in Data Compression
    @suryakandau - which version is the base of paq8sk48 - paq8px or paq8pxd? OK, I know. paq8pxd is the base.
    228 replies | 23706 view(s)
  • Trench's Avatar
    28th February 2021, 18:26
    4096 in its converted state - yes, I agree. But the unconverted 33,420-byte file is also a mess, and compression fails to beat conversion. It does get much better results without the newlines. ;) But compression still loses to conversion from what I see, which is the main issue of this post: from 31.9 KB, compressed with bzip2 4.29, lzma2 4.11, p12 4.39, and converted 4.00. I assume better results will show when the file is much bigger, since metadata is stored by the compression programs, but the issue is: can it beat conversion? Maybe not. But it should be technically equal. So overall all your comments are right, but again that's not the issue. Can you beat 4.0 KB in theory, without conventional excuses? You cannot compress 256 digits as easily as 2 digits. Sure, 256 is faster, but 2 is more orderly. Take food for example: try chewing an entire apple in your mouth versus a fraction of it. The person with a fraction of an apple will do better. The topic was not about compressing files from binary format but binary digits. Everyone in this forum thinks too technically, which seems like being stuck in a Chinese finger trap, which is not helping matters and is progressing nowhere fast. But if everyone feels I am wrong... OK. Thanks again.
    21 replies | 616 view(s)
  • Gotty's Avatar
    28th February 2021, 17:09
    I'm not sure what you mean. You don't like the word "Random" when used for incompressible data?
    21 replies | 616 view(s)