Activity Stream

  • Sportman's Avatar
    Today, 00:26
    Sportman replied to a thread Zstandard in Data Compression
    It's that the file names contain the root folder in their path, instead of the content sitting directly in the root folder as with the 1.4.5-1.4.8 win64 zip files.
    460 replies | 147596 view(s)
  • Sportman's Avatar
    Today, 00:18
    Sportman replied to a thread Zstandard in Data Compression
    Input: 984.801.591 bytes - data

    Output zstd v1.4.8:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.904.965 bytes, 3.375 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.708.829 bytes, 3.547 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.141 sec., zstd -4 --ultra -T1
    142.896.262 bytes, 4.063 sec., zstd -3 --ultra --long -T1
    142.869.673 bytes, 4.125 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.437 sec., zstd -5 --ultra -T1
    137.671.146 bytes, 5.187 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.735 sec., zstd -6 --ultra -T1
    134.660.677 bytes, 5.625 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.552.776 bytes, 7.125 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.547 sec., zstd -8 --ultra -T1
    129.090.886 bytes, 8.328 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.687 sec., zstd -9 --ultra -T1
    128.210.617 bytes, 9.485 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.328 sec., zstd -10 --ultra -T1
    128.056.308 bytes, 10.312 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.812 sec., zstd -11 --ultra -T1
    127.940.684 bytes, 12.109 sec., zstd -11 --ultra --long -T1
    127.445.750 bytes, 14.828 sec., zstd -12 --ultra --long -T1
    127.422.579 bytes, 12.656 sec., zstd -12 --ultra -T1
    126.935.545 bytes, 28.125 sec., zstd -13 --ultra --long -T1
    126.802.723 bytes, 30.359 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.766 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.734 sec., zstd -14 --ultra -T1
    126.499.068 bytes, 37.922 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.860 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.344 sec., zstd -16 --ultra -T1
    117.643.674 bytes, 1:22.531 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.234 sec., zstd -17 --ultra -T1
    112.773.080 bytes, 1:33.485 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:11.281 sec., zstd -18 --ultra -T1
    111.880.990 bytes, 2:16.656 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:46.016 sec., zstd -19 --ultra -T1
    110.962.579 bytes, 2:57.359 sec., zstd -19 --ultra --long -T1
    111.041.788 bytes, 2:55.297 sec., zstd -20 --ultra -T1
    110.644.739 bytes, 3:04.047 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:25.172 sec., zstd -21 --ultra -T1
    110.188.109 bytes, 3:39.516 sec., zstd -21 --ultra --long -T1
    109.903.935 bytes, 3:52.750 sec., zstd -22 --ultra -T1
    109.903.935 bytes, 3:52.719 sec., zstd -22 --ultra --long -T1

    Output zstd v1.4.9:
    151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
    149.930.990 bytes, 1.969 sec., zstd -1 --ultra --long -T1
    147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
    146.731.879 bytes, 2.141 sec., zstd -2 --ultra --long -T1
    143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
    143.965.769 bytes, 2.156 sec., zstd -4 --ultra -T1
    142.942.060 bytes, 2.656 sec., zstd -3 --ultra --long -T1
    142.912.214 bytes, 2.687 sec., zstd -4 --ultra --long -T1
    138.447.020 bytes, 3.453 sec., zstd -5 --ultra -T1
    137.698.236 bytes, 3.781 sec., zstd -5 --ultra --long -T1
    135.307.599 bytes, 3.781 sec., zstd -6 --ultra -T1
    134.685.348 bytes, 4.188 sec., zstd -6 --ultra --long -T1
    129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
    129.541.732 bytes, 5.687 sec., zstd -7 --ultra --long -T1
    129.262.251 bytes, 6.563 sec., zstd -8 --ultra -T1
    129.079.778 bytes, 6.890 sec., zstd -8 --ultra --long -T1
    128.271.371 bytes, 7.703 sec., zstd -9 --ultra -T1
    128.185.107 bytes, 8.016 sec., zstd -9 --ultra --long -T1
    128.087.378 bytes, 8.344 sec., zstd -10 --ultra -T1
    128.030.897 bytes, 8.860 sec., zstd -10 --ultra --long -T1
    127.999.182 bytes, 9.859 sec., zstd -11 --ultra -T1
    127.915.469 bytes, 10.719 sec., zstd -11 --ultra --long -T1
    127.422.579 bytes, 12.703 sec., zstd -12 --ultra -T1
    127.415.784 bytes, 13.422 sec., zstd -12 --ultra --long -T1
    126.907.957 bytes, 26.703 sec., zstd -13 --ultra --long -T1
    126.775.263 bytes, 28.954 sec., zstd -14 --ultra --long -T1
    126.768.260 bytes, 26.750 sec., zstd -13 --ultra -T1
    126.659.580 bytes, 28.735 sec., zstd -14 --ultra -T1
    126.467.698 bytes, 36.609 sec., zstd -15 --ultra --long -T1
    126.327.122 bytes, 35.812 sec., zstd -15 --ultra -T1
    118.142.065 bytes, 1:18.329 sec., zstd -16 --ultra -T1
    117.640.741 bytes, 1:21.078 sec., zstd -16 --ultra --long -T1
    113.112.473 bytes, 1:29.187 sec., zstd -17 --ultra -T1
    112.767.845 bytes, 1:32.015 sec., zstd -17 --ultra --long -T1
    112.255.514 bytes, 2:10.438 sec., zstd -18 --ultra -T1
    111.878.283 bytes, 2:14.078 sec., zstd -18 --ultra --long -T1
    111.517.628 bytes, 2:45.672 sec., zstd -19 --ultra -T1
    111.041.788 bytes, 2:55.078 sec., zstd -20 --ultra -T1
    110.958.922 bytes, 2:55.203 sec., zstd -19 --ultra --long -T1
    110.691.009 bytes, 3:01.454 sec., zstd -20 --ultra --long -T1
    110.431.137 bytes, 3:24.843 sec., zstd -21 --ultra -T1
    110.190.299 bytes, 3:36.609 sec., zstd -21 --ultra --long -T1
    109.900.896 bytes, 3:49.813 sec., zstd -22 --ultra -T1
    109.900.896 bytes, 3:49.844 sec., zstd -22 --ultra --long -T1
    460 replies | 147596 view(s)
  • mitiko's Avatar
    Today, 00:04
    Oh, that type of a transform... Well, static dictionaries support all types of patterns (like the ones described in the post) and I'm planning on doing this in one way or another. What they're doing with LZ77 + static dictionaries is pretty ingenious actually! Wait, that's actually pretty clever for a fast encoder!

    Anyways, that's not the point of BWD. I'm trying to preserve the context of the text, the information it holds. After LZ77 and such pattern-y words, the rest of the compression pipeline loses context and it will predict worse. But if we instead create a whole new context based on a pattern, it will compress better and the original stream will be more compressible as well.

    Patterns are great, they just seem a bit too deterministic right now - for every pattern/dictionary transform that you can imagine, you need to write its own ranking function and extraction function and then handle its serialization to the dictionary, etc. It'd be amazing if we found a way to get the system to generate patterns, or find them in the dictionary after the words have been chosen. For example, if the dictionary is calculated to be something like:

    "the next"
    "the following"
    "the lion"
    "the fox"

    Maybe some independent system can see the pattern and generate the word "the (*)", which will then match all of the above and also split the stream into 2 - the main stream + the stream of the pattern. The stream of the pattern would then look something like "next;following;lion;next;fox;fox;fox;next...", which is very compressible (a rough sketch of this splitting is shown after this post). In practice this would suit some grammar rules well, like "(*)ing" or "(*)ed" (essentially making the grammar information that the text holds known, so zero bytes to transmit).

    With patterns, we get into this whole idea of context splitting and start to think about whether this is possible in the general case. I have some notes, if you'd like to look at them.
    17 replies | 793 view(s)
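A minimal sketch of the stream-splitting idea above, using a toy regex matcher; the sample text, the placeholder token `<THE_*>`, and the class name are made up for illustration and are not BWD's actual machinery:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PatternSplitDemo
{
    static void Main()
    {
        // Toy input; "the (*)" stands for a generated pattern word in the dictionary.
        string text = "the next day the following week the lion saw the fox";

        // Main stream: every match collapsed to one placeholder token.
        // Pattern stream: the captured variable parts, ';'-separated.
        var captured = new List<string>();
        string mainStream = Regex.Replace(text, @"\bthe (\w+)", m =>
        {
            captured.Add(m.Groups[1].Value);
            return "<THE_*>";
        });
        string patternStream = string.Join(";", captured);

        Console.WriteLine(mainStream);    // <THE_*> day <THE_*> week <THE_*> saw <THE_*>
        Console.WriteLine(patternStream); // next;following;lion;fox
    }
}
```

The pattern stream is highly repetitive on real text, which is what makes the split attractive for the later modeling stages.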
  • Jarek's Avatar
    Yesterday, 23:26
    Thanks, but I am not accepting that the only "award" for ANS is getting to defend it for the rest of my life from successive megacorporations. Corporations using it: instead of deploying your own "defensive" legal mines, please just prevent this minefield from being built. Especially since I haven't heard of any ANS-based compressor from Microsoft (?)
    122 replies | 91254 view(s)
  • SolidComp's Avatar
    Yesterday, 22:19
    Brotli has a seemingly novel set of 120 dictionary transforms. You can read a bit about them in this recent Cloudflare post.
    17 replies | 793 view(s)
  • JamesWasil's Avatar
    Yesterday, 19:10
    Nope. Only semantic rewording of your work and fabricated big-tech "innovations" for Microsoft to compete with Google via bogus patent claims on what was intended to be free with due credit. This is why we can't have nice things, and why development and certain types of information sharing and discoveries are no longer offered in good faith like they were before. Corporations, by majority, have ruined good faith and academic contributions for the sake of exclusive profiteering, even against the inventor's wishes. It says a lot about the companies and the state of the world when they have to rip off the author, take credit and make money on work released freely, rather than developing their own UNIQUE methods, even if those would yield nearly the same performance or results. Cookie-cutter world now.

    Question: Should Jarek make a global GoFundMe for legal help to fight this, so that lawyers whom Google, Microsoft, and other big tech don't yet have in their pocket might actually help? If Marcus Hutter could throw 50k at the Hutter Prize, would he and others be willing to donate to a Jarek rANS/tANS defense fund?
    122 replies | 91254 view(s)
  • mitiko's Avatar
    Yesterday, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12265 view(s)
  • mitiko's Avatar
    Yesterday, 17:13
    Thanks for looking through the code. I knew I had to eventually remove LINQ, but I don't think I'm using it in very critical places. Right now, most of the work seems to be in counting the words and ranking. Thanks for the heads-up about foreach loops, I didn't know they were more expensive than for loops.

    The architecture is pretty OK for now, it's like LINQ but async (using IAsyncEnumerable). I've thought it might be slowing me down but it's actually not. It's processing data in blocks, and I'm testing it for optimality with the block size being the full file size, so all the data only travels the pipeline once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.

    What the rank (as per the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n = word size, s = occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. (A rough sketch of this naive ranking is shown after this post.)

    The whole ranking system is greedy in this way. It will prefer:

    "the" -> rank=2000
    "rabbit" -> rank=175
    "lion" -> rank=105
    "and" -> rank=70
    "fox" -> rank=70
    total = 2420

    (The ranks here are just off the top of my head.) The non-greedy variation you have in mind looks something like:

    "the fox and the rabbit and the lion" -> rank=1225
    "the" -> rank=2000
    total = 3225

    I have actually thought a lot about this, and it's not a problem of the ranking function itself, at least not its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree.

    The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:

    A -- AT
    T -- A
    C -- T
    G -- C
         G

    Then we can do parsing before each new word calculation, and we won't have the constraint of words not spanning over multiple blocks. I've excluded this as a good idea because 1) we have to do parsing at each word, and only re-parsing doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big.

    My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically ranking becomes at most x times slower if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree, compared to a more spanning tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it; we need a more spanning tree. So maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.

    A couple of assumptions: Right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can re-rank only some. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once per set of similar words. What we can do, for sure, is re-count only some words. I'm still working on that; I have a draft, I just have to fix some minor indexing issues.

    The reason why I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree. For sure, re-ranking and re-counting + the fragmentation fix will create a great compressor with acceptable compression times. But I'm rather looking to improve BWD in the direction of optimality at the cost of compression time than to leverage memory for time. When I'm done, I'll definitely try to implement one with this stack!
    17 replies | 793 view(s)
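A minimal sketch of the naive greedy ranking discussed above, assuming candidate words have already been counted; `RankWord` and `PickBestWord` are illustrative names, not BWD's actual API:

```csharp
using System;
using System.Collections.Generic;

static class NaiveBwdRanking
{
    // Naive rank from the post above: savings are roughly (length - 1) * count,
    // i.e. each occurrence shrinks to a single dictionary symbol.
    static long RankWord(string word, int count) => (long)(word.Length - 1) * count;

    // Greedily pick the highest-ranked candidate; no look-ahead tree here.
    static KeyValuePair<string, int> PickBestWord(Dictionary<string, int> counts)
    {
        KeyValuePair<string, int> best = default;
        long bestRank = long.MinValue;
        foreach (var kv in counts)
        {
            long rank = RankWord(kv.Key, kv.Value);
            if (rank > bestRank) { bestRank = rank; best = kv; }
        }
        return best;
    }

    static void Main()
    {
        // Counts loosely based on Gotty's example: a very frequent short word
        // versus a long but rarer phrase.
        var counts = new Dictionary<string, int>
        {
            ["the"] = 1000,
            ["the fox and the rabbit and the lion"] = 35,
        };
        var best = PickBestWord(counts);
        Console.WriteLine($"greedy pick: \"{best.Key}\", rank {RankWord(best.Key, best.Value)}");
        // Prints "the" (rank 2000): the greedy choice that then fragments the longer phrase.
    }
}
```

A look-ahead tree, as described in the post, would replace `PickBestWord` with a search over the top few candidates at each step and keep the branch with the best total rank.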
  • CompressMaster's Avatar
    Yesterday, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
    Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was only recently patched in Win10 had existed since Windows 3.11.
    192 replies | 137083 view(s)
  • Jarek's Avatar
    Yesterday, 15:40
    I was just pointed to a 2019 Microsoft patent for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find anything novel here?
    122 replies | 91254 view(s)
  • LucaBiondi's Avatar
    Yesterday, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM, and often that's not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 13:11
    Gotty replied to a thread paq8px in Data Compression
    Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and behind that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to achieve better gains. That costs time and memory. The gain is usually not too great, as the existing models already cover the possibilities quite well. We really sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 625128 view(s)
  • hexagone's Avatar
    Yesterday, 07:21
    hexagone replied to a thread paq8px in Data Compression
    You mean 'accurate', I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. A 3% improvement for a 3x compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 04:53
    I have only two points (for performance-critical parts of your code):
    - use for loops instead of foreach loops;
    - avoid using LINQ - convert it to normal loops.
    (These modifications will eventually be needed anyway when you really want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (only had a peek). (A small sketch of this kind of LINQ-to-loop conversion is shown after this post.)

    Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than 1 occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that takes the occurrence count into account too. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:

    the fox and the rabbit and the lion the chicken
    the fox and the rabbit and the lion the lion
    the fox and the rabbit and the lion the horse
    the fox and the rabbit and the lion the penguin
    the zebra
    the cat
    the worm
    ... (many more) ...
    the dog

    The weights:
    "the": (n-1)*s = 2*1000 = 2000
    "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105

    So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to replace the 4 long substrings first and only then replace the "the"s. What do you think? I wonder how big the difference with this metric is, and whether it's better or worse.
    17 replies | 793 view(s)
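A small sketch of the LINQ-to-loop conversion suggested above, on a made-up word-counting helper rather than BWD's actual code; the loop version produces the same result while avoiding the extra allocations LINQ typically makes in a hot path:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqVsLoop
{
    // LINQ version: concise, but allocates groupings and delegates on every call.
    static Dictionary<string, int> CountWordsLinq(string[] words) =>
        words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());

    // Plain-loop version: same result, no LINQ, index-based for loop.
    static Dictionary<string, int> CountWordsLoop(string[] words)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i < words.Length; i++)
        {
            if (counts.TryGetValue(words[i], out int c)) counts[words[i]] = c + 1;
            else counts[words[i]] = 1;
        }
        return counts;
    }

    static void Main()
    {
        var words = "the fox and the rabbit and the lion".Split(' ');
        Console.WriteLine(CountWordsLinq(words)["the"]); // 3
        Console.WriteLine(CountWordsLoop(words)["the"]); // 3
    }
}
```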