Activity Stream

  • Gotty's Avatar
    Today, 16:21
    Additional explanation: Unfortunately, most compressors will try to find longer patterns in the ASCII files. And indeed there are patterns, and they seem to be useful. The compressors think it's a text file and that looking for string matches is a good strategy. Unfortunately it's not. The best strategy would be to simply convert the file back to binary, but examining this possibility is not programmed into them. So they go for the string matches...
    Why is looking for patterns not a good strategy in this case? (A simplified example follows.) Imagine that you are a compressor and you are given the binary file (4096 bytes). There are *very* short matches (max. 2 bytes only), and you'll see that these short matches won't help compressing the file. So you won't use them. Good. Or you can use them, but it still won't help compression.
    Now you are given the 2-digit ASCII file. And you'll see that you have, for example, quite many 8-character matches (since the bits are bytes now). "Oh, that's very good," you say. "Instead of storing those 8 bytes I will just encode the match position and length and I'm good - encoding the position and length is cheaper (let's say it's 2 bytes). So I win 6 bytes." And this is where you were led astray. In reality you encoded 1 byte of information in 2 bytes. That's a huge loss. And you thought you did well...
    So an additional problem is that when seeing the ASCII files, many compressors are led astray. They think finding string matches is a good strategy. They will not find, or not easily find, the "optimal" strategy (which is an order-0 model with equiprobable symbols).
    6 replies | 119 view(s)
  • Gotty's Avatar
    Today, 15:04
    And here is the answer: The original binary file (4096 bytes) is a random file; you can't compress it further. The entropy of the file is 4096 bytes. Any solid compressor will be around that 4096 bytes, plus of course some bytes for the filename, file size and some more structural info. The winners here are the compressors that realize that the file is incompressible and add the smallest possible metadata to it.
    On the other hand, an ASCII file consisting of 2 digits (representing bits) or 16 hexadecimal characters (representing nibbles) is not random. They need to be compressed down to 4096 bytes (from 32768 bytes (when bits) and 8192 bytes (when nibbles)). Their file size (32768 and 8192 bytes) is far from the entropy (4096 bytes). We actually need compression now, so file compressors will do their magic: they need to do prediction, create a dictionary, etc. So actual compression takes place. They need to get the file size down from 32768/8192 to 4096.
    So even if the content represents a random file, the actual content in these ASCII files is not random: the first bit of every byte, for example, is always zero (it being an ASCII file). In the case of a 2-digit file the ideal probability of a '0' character is 50%, the ideal probability of a '1' is 50%, and the ideal probability of any of the remaining 254 characters is 0%. (That's why it's important to remove those newlines from xorig.txt - otherwise a compressor needs to deal with them.) A compressor needs to store all this information (counts, probabilities or anything that helps reconstruct the ASCII file) somehow in compressed format, in addition to the actual entropy of the original file (which is 4096 bytes). That's why you experienced that the results in this case are farther from the ideal 4096 bytes.
    The winning compressor will be the one that realizes that the best model is an order-0 model with equiprobable symbols. The sooner a compressor gets there, the better its result will be - meaning: the closer it will get to the desired 4096 bytes. So always keep a (random) file in its original form; translate the file to 2-digit or 16-char format only for visualization, and don't try compressing it.
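    To make the "convert it back to binary first" advice concrete, here is a minimal C sketch (file names are hypothetical; it assumes the input contains only '0'/'1' characters plus optional newlines) that packs 8 ASCII digits into one byte, so a 32768-character 2-digit file becomes the 4096-byte original again:
    #include <stdio.h>
    /* Pack ASCII '0'/'1' digits (e.g. xorig.txt) into raw bytes: 8 characters -> 1 byte. */
    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s bits.txt out.bin\n", argv[0]); return 1; }
        FILE *in = fopen(argv[1], "rb"), *out = fopen(argv[2], "wb");
        if (!in || !out) { perror("fopen"); return 1; }
        int c, nbits = 0;
        unsigned char byte = 0;
        while ((c = fgetc(in)) != EOF) {
            if (c != '0' && c != '1') continue;                      /* skip newlines etc. */
            byte = (unsigned char)((byte << 1) | (c - '0'));
            if (++nbits == 8) { fputc(byte, out); byte = 0; nbits = 0; }
        }
        if (nbits) fputc((unsigned char)(byte << (8 - nbits)), out); /* pad the last byte */
        fclose(in); fclose(out);
        return 0;
    }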
    6 replies | 119 view(s)
  • xinix's Avatar
    Today, 14:12
    Thank you! Can this be used as a preprocessor?
    5 replies | 177 view(s)
  • mitiko's Avatar
    Today, 12:58
    Of course. There are lots of good ideas, differing in various ways. What I meant by "Since the creation of the LZ77 algorithm..." is that most programs out there use LZ77/78. They dominate the space for dictionary transforms. I'm only trying to explain my motivation behind the idea; I might be wrong in a lot of my statements, but it's intuition that drives me. I didn't know there were that many papers on the topic; I couldn't find many myself. I especially like section 4.8.1 of Przemysław Skibiński's PhD thesis, as it's the closest to what I'm trying to do. It makes me very happy that we've independently come to the same ranking equation. But I've improved upon it and I'm trying to develop the idea for patterns as well. I'm trying to get down to some reasonable time to compress larger files, so that it becomes feasible to use it in the real world.
    Here's an exe, although I'm not sure all of the files are needed. Usage is:
    BWDPerf.Release.exe -c fileName.txt -> creates fileName.txt.bwd
    BWDPerf.Release.exe -d fileName.txt.bwd -> creates decompressed
    I've hardcoded a max word size of 32 and a dictionary size of 256 for now.
    5 replies | 177 view(s)
  • Mauro Vezzosi's Avatar
    Today, 12:32
    A choice I made implicitly, which is not clearly visible in the source, is this: Suppose we have the following input data: abcdefghi, and at each position the following matching strings in the dictionary: abc bcde cde d efg fghi gh. There are 2 solutions with two-step-lookahead: (a)(bcde)(fghi) and (ab)(cde)(fghi). For simplicity I chose the first one encountered, but we could choose based on the length or age of the strings (bcde) and (cde), the number of children they have, the subsequent data, ... More generally, whenever we have a choice we have to ask ourselves which one to choose.
    41 replies | 7390 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 09:41
    Paq8sk48
    - improve the compression ratio for each data type by adding some mixer context sets in each predictor (DEFAULT, JPEG, EXE, image 24 bit, image 8 bit, DECA, textwrt)
    - improve the image 8 bit model
    c.tif (Darek corpus):
    paq8sk48 -s8: Total 896094 bytes compressed to 270510 bytes. Time 166.60 sec, used 2317 MB (2430078529 bytes) of memory
    paq8px201 -8: 271186 bytes. Time 76.12 sec, used 2479 MB (2599526258 bytes) of memory
    paq8pxd101 -s8: Total 896094 bytes compressed to 271564 bytes. Time 64.56 sec, used 2080 MB (2181876707 bytes) of memory
    224 replies | 22859 view(s)
  • Trench's Avatar
    Today, 06:14
    Thanks Gotty. I guess it comes down to terminology, since I mean one thing and others understand it completely differently. Which is why you asked your question despite me thinking I had explained it - but not well enough, obviously, so sorry for that. The definition of binary is "consisting of, indicating, or involving two", and my file has only 2 characters, so I consider it binary. :) I am trying to speak in dictionary terms rather than technical terms; I see a lot of misuse of words, and I guess you guys feel I misuse them too. LOL
    As for the issues: the newlines, as you state, I thought I had removed, but the programs are finicky. You are correct that it needs 2 more bits, which 00 would do, but again it was rough estimating and I lost track, since I was going off of the hex 4096, which is an even number, so I used it for orderly purposes. I messed up with the other file, but I relied mostly on % for measurement rather than KB amounts to see the difference. But again the point is not the accuracy of the file so much as the issue of 2 digits being compressed. As small as the file was, it was still big enough to show the issue in other ways.
    SO... the question is, as stated: a file with 2 digits ("binary" by definition) cannot be compressed as well as converting it first. It does not have the 256 characters to make it harder to compress, despite being "not random". It seems it's easier to convert the file than to compress the file, and no file compressor can do as good a job compressing a file with 2 digits as converting it first does. Or maybe I do not know which one is good and maybe someone else does? If so, which one? Taking the 2 digits of 1 and 0, putting them through the converter, pasting the hex characters into a hex editor and then saving that as a regular ASCII file works better. I cannot copy the ASCII directly since it oddly does not copy exactly. Doing that gives a better result than compressing the file directly - or do the conversion first and then compress. Maybe you know that, maybe you don't, but I'm pointing it out and asking why.
    It does not matter whether the file has recognized patterns or unrecognized patterns, which many like to call "random". Random to everyone, but not truly random, so by definition both compressible and incompressible files are "random". It is just that the program knows which patterns to use over a broad spectrum of files, which works better for some than others. Even an MP3 playing the same patterns cannot be compressed - not random in how it sounds, but "random" in the ASCII because it is unfamiliar to whoever wrote the algorithm. So I disagree with this forum's use of the term "random" and prefer "unfamiliar" or "unfamiliar patterns" as a more accurate description. The more people use proper terms, the more it can help move things forward. I try not to give long explanations, but sometimes it seems necessary to clarify.
    6 replies | 119 view(s)
  • hexagone's Avatar
    Today, 05:01
    And https://encode.su/threads/3240-Text-ish-preprocessors?highlight=text-ish
    5 replies | 177 view(s)
  • Gotty's Avatar
    Today, 04:50
    On your github page, we find:
    >> Since the creation of the LZ77 algorithm by A. Lempel and J. Ziv there hasn’t really been any new foundationally different approach to replacing strings of symbols with new symbols.
    I'm not sure what you mean by that. What counts as fundamentally different? If we count your approach as fundamentally different, then I believe there are many more. Let's see. Please have a look at the following quote from Dresdenboy: You asked for ideas. So the best thing would be to check out those links for ideas, and also:
    https://github.com/antirez/smaz
    https://encode.su/threads/1935-krc-keyword-recursive-compressor
    https://encode.su/threads/542-Kwc-–-very-simple-keyword-compressor
    https://encode.su/threads/1909-Tree-alpha-v0-1-download
    https://encode.su/threads/1874-Alba
    https://encode.su/threads/1301-Small-dictionary-prepreprocessing-for-text-files (wbpe)
    5 replies | 177 view(s)
  • Gotty's Avatar
    Today, 03:51
    No, xinix converted your xorig.txt (which is not binary, but bits represented by ASCII digits) to hexadecimal (nibbles represented in ASCII). So the size is doubled. (For the first four characters it is: "1001" -> "31303031".) You said "hex" = 4096 bytes. Aha, then that is "binary", not "hex". It looks like you misunderstood what xinix said, because your "binary" and "hex" are different from his. But xinix also missed the correct term: xorig.txt is not "binary". OK, it is a textual representation of bits, but still it's textual... binary is the 4096-byte file. I think you mean "binary", not "hex". Correct?
    @Trench, the xorig.txt format is not "compatible" with its binary representation. It has newlines - please remove them. Also, 4096 bytes converted to "bits in ASCII" would be 32768 bits, but you have only 32766 bits in xorig.txt. 2 bits are missing. To be fully compatible you'll need to add 2 more ASCII bits.
    @Trench, I don't understand your posts. What exactly is your question or problem?
    6 replies | 119 view(s)
  • xinix's Avatar
    Today, 00:01
    Hi! Can you post the EXE file for testing?
    5 replies | 177 view(s)
  • mitiko's Avatar
    Yesterday, 22:56
    Dictionary transforms are quite useful, but do we take advantage of all they can offer when optimizing for size? LZ77/78 techniques are widely known, and optimal parsing is an easy O(n) minimal path search in a DAWG (directed acyclic weighted/word graph). In practice, optimal LZ77 algorithms don't do optimal parsing because of the big offsets it can generate. LZ algorithms also lose context when converting words to (offset, count) pairs; these pairs are harder to predict for entropy coders.
    I've been working on BWD - the pretentious name stands for Best Word Dictionary. It's a semi-adaptive dictionary, which means it computes the best/optimal dictionary (or tries to get some suboptimal approximation) for the given data and adds it to the stream. This makes decompression faster than LZ77/78 methods, as the whole dictionary can sit in memory, but compression becomes very costly. The main advantages are: optimality, easier-to-predict symbols, and a slight improvement in decompression speed. The disadvantages are clear: having to add the dictionary to the compressed file, and slow compression.
    This is still quite experimental and I do plan on rewriting it in C++ (it's in C# right now), for which I'll probably need your guys' help later on. I've also considered starting a new set of paq8 compressors, but that seems like getting too deep into the woods for an idea I haven't developed fully yet.
    Someone requested test results - compression takes about 45s for enwik5 and decompression is at around 20ms. You can find the exact results in https://github.com/Mitiko/BWDPerf/issues/5 I'm still making frequent changes and I'm not spending too much time testing. It generates a new sequence of symbols (representing the indexes into the dictionary) - the length is 50257 and the entropy is 5.79166; the dictionary uncompressed is 34042 bytes. This means that after an entropy coder it can go to about 325'113 bytes. (This is for enwik5 with m=32 and dictionarySize=256)
    Stay tuned, I'm working on a very good optimization that should take compression time down a lot. I'll try to release more information soon on how everything works, the structure of the compressed file and some of my notes. You can read more on how the algorithm works here: https://mitiko.github.io/BWDPerf/ This is the github repo: https://github.com/Mitiko/BWDPerf/ I'll be glad to answer any and all questions here or with the new github feature of discussions: https://github.com/Mitiko/BWDPerf/discussions
    Note this project is also a pipeline for compression, which I built with the idea of being able to switch between algorithms fast. If you'd like to contribute any of your experiments, transforms or other general well-known compressors to it, to create a library for easy benchmarking, that'd be cool. I know Bulat Ziganshin made something similar, so I'm not too invested in this idea. Any cool new ideas, either on how to optimize parsing, counting words or how to model the new symbols, are always appreciated.
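    For reference, the kind of order-0 entropy estimate quoted above (symbol count times average bits per symbol, on top of the raw dictionary) can be computed with a short generic sketch like the one below; it illustrates the estimate only and is not BWD's actual code (compile with -lm):
    #include <math.h>
    #include <stdio.h>
    #include <string.h>
    /* Order-0 entropy of a byte stream in bits: sum over symbols of count * -log2(count/n).
       This is the lower bound an order-0 entropy coder can reach on these symbols. */
    double order0_bits(const unsigned char *buf, size_t n)
    {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++) count[buf[i]]++;
        double bits = 0.0;
        for (int s = 0; s < 256; s++)
            if (count[s])
                bits += (double)count[s] * -log2((double)count[s] / (double)n);
        return bits;
    }
    int main(void)
    {
        const unsigned char demo[] = "abracadabra abracadabra";
        size_t n = strlen((const char *)demo);
        double bits = order0_bits(demo, n);
        printf("%.2f bits total, %.4f bits/symbol\n", bits, bits / (double)n);
        return 0;
    }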
    5 replies | 177 view(s)
  • Trench's Avatar
    Yesterday, 18:23
    I don't understand how you got your numbers, since when I converted the binary (32.6 KB (33,420 bytes)) to hex I got 4.00 KB (4,096 bytes). I forgot to upload another file, which I just did here, to give a completely random one that also shows similar results, though not as good. I probably messed up a small bit and made it more disorganized, but in general you see the issues. As stated, 7zip sometimes did not do as well as another program; I used P12 from http://mattmahoney.net/dc/p12.exe
    File size 32.6k: when I tried 7zip I got 82.27% = 5.78k; with p12 I got 85.86% = 4.61k; converting it to hex I got 87.73% = 4.0k, which wins directly. And the hex file cannot be compressed further, which gets worse results at 86.53% = 4.39k.
    With the more orderly one which I uploaded (32.0 KB (32,864 bytes)): 7zip = 89.16% = 3.47k, p12 = 87.97% = 3.85k (oddly worse), and hex = 87.47% = 4.01k, and the hex file compressed is 90.53% = 3.03k. Hex wins indirectly.
    So it seems converting it to binary to compress gets around a 1% better result, while the more random one gets less than 1% - around a 3% difference. Maybe bigger files will get better results. I used this site to convert the binary: https://www.rapidtables.com/convert/number/ascii-hex-bin-dec-converter.html and took the hex and put it in a hex editor, since other odd things happen otherwise. But it gets annoying to have to manually select which type of compression method is best with trial and error rather than the program selecting the best one.
    6 replies | 119 view(s)
  • Mauro Vezzosi's Avatar
    Yesterday, 17:28
    What did you two mean by "full optimal parsing" in a basic LZW as in flexiGIF? IMHO, the one-step-lookahead LZW of flexiGIF is already the best it can be, and we don't need to look any further for a lower code cost. We could try to optimize the construction of the dictionary, but it seems difficult; however we can do something simpler and still quite effective (more on .Z than on .GIF), and to do so I will use two-step lookahead. Basically, if possible, we choose the next string so that the second ends where the third begins (so the second string gets longer); the next string is immediately chosen if it is already advantageous.
    Suppose we have the following input data: abcdefghi, and at each position the following matching strings in the dictionary: abc bcde cdef d efg fghi gh.
    Greedy: (abc)(d)(efg) - Strings (abc) and (d) grow, output 3 codes for 7 symbols.
    One-step-lookahead: (ab)(cde)(fghi) - Strings (abc) and (cdef) don't grow, output 3 codes for 9 symbols.
    Two-step-lookahead: (a)(bcde)(fghi) - String (bcde) grows (because it is a full-length string) and (abc) does not, output 3 codes for 9 symbols.
    We can go through more steps to improve the accuracy of the parsing. I quickly edited flexiGIF and I attach the source with the changes (they are in a single block delimited by "//§ Begin" and "//§ End"). I leave it to Stephan to decide if, which variant and how to definitively implement two-step lookahead.
    Original .Z files are compressed with ncompress 5.0 (site, github, releases). Variant 0 is standard flexiGIF 2018.11a. Variants 1..4 are flexiGIF 2018.11a with my two-step lookahead; they are always better than or equal to Variant 0. Variant 1 is the simplest and fastest. Variant 2 is what I think it should be. Variant 2a is Variant 2 with -a divided by 10 (min -a1). Variants 3 and 4 have minor differences. Greedy drops -p; sometimes it is better than flexible parsing.
    Original Variant 0 Variant 1 Variant 2 Variant 2a Variant 3 Variant 4 Greedy Options File
    339.011 339.150 338.893 338.886 338.886 338.886 338.893 340.023 -p -a=1 200px-Rotating_earth_(large)-p-2018_10a.gif
    55.799 55.799 55.793 55.793 55.793 55.793 55.793 55.869 -p -a=1 220px-Sunflower_as_gif_websafe-p-2018_10a.gif
    280 280 280 280 280 280 280 281 -p -a=1 Icons-mini-file_acrobat-a1-2018_10a.gif
    167.333 167.410 167.291 167.298 167.298 167.298 167.293 167.529 -p -a=1 skates-p-m1-2018_10a.gif
    52.663 52.722 52.388 52.389 52.389 52.391 52.407 53.430 -p -a=1 SmallFullColourGIF-p-m1-2018_10a.gif
    615.086 615.361 614.645 614.646 614.646 614.648 614.666 617.132 Total
    95 95 95 95 95 95 95 95 -p -Z -a=1 ENWIK2.Z
    530 509 504 504 504 504 504 530 -p -Z -a=1 ENWIK3.Z
    5.401 5.320 5.242 5.242 5.242 5.242 5.242 5.401 -p -Z -a=1 ENWIK4.Z
    46.355 46.485 45.525 45.510 45.510 45.510 45.506 46.355 -p -Z -a=10 ENWIK5.Z
    442.297 450.769 440.173 440.053 439.433 440.053 439.921 441.483 -p -Z -a=100 ENWIK6.Z
    4.578.745 4.607.141 4.514.015 4.514.039 4.506.557 4.514.071 4.514.181 4.541.851 -p -Z -a=1000 ENWIK7.Z
    46.247.947 45.915.615 44.938.673 44.936.761 44.803.785 44.936.713 44.939.553 45.130.643 -p -Z -a=10000 ENWIK8.Z
    51.321.370 51.025.934 49.944.227 49.942.204 49.801.126 49.942.188 49.945.002 50.166.358 Total
    2.407.918 2.329.939 2.312.506 2.312.156 2.308.735 2.312.172 2.312.602 2.329.660 -p -Z -a=1000 AcroRd32.exe.Z
    1.517.475 1.476.017 1.461.401 1.461.194 1.458.239 1.461.258 1.461.400 1.472.982 -p -Z -a=1000 english.dic.Z
    2.699.855 4.336.651 3.628.969 3.628.901 3.534.005 3.630.245 3.630.425 2.684.421 -p -Z -a=1000 FP.LOG.Z
    2.922.165 2.792.591 2.781.381 2.781.253 2.779.381 2.781.253 2.781.429 2.804.053 -p -Z -a=1000 MSO97.DLL.Z
    1.553.729 1.889.773 1.758.357 1.758.085 1.747.052 1.758.229 1.758.517 1.491.035 -p -Z -a=1000 ohs.doc.Z
    1.444.893 1.418.481 1.391.773 1.389.819 1.388.155 1.389.807 1.391.609 1.377.631 -p -Z -a=1000 rafale.bmp.Z
    1.456.931 1.491.769 1.412.083 1.411.773 1.402.315 1.411.893 1.412.105 1.408.439 -p -Z -a=1000 vcfiu.hlp.Z
    1.133.483 1.198.564 1.140.473 1.140.189 1.136.877 1.140.239 1.140.529 1.170.537 -p -Z -a=1000 world95.txt.Z
    15.136.449 16.933.785 15.886.943 15.883.370 15.754.759 15.885.096 15.888.616 14.738.758 Total
    23.514.759 23.068.375 22.031.991 22.029.031 21.893.328 22.030.971 22.034.373 20.887.657 -p -Z -a=1000 MaxCompr.tar.Z (10 files)
    199.765 194.884 194.041 194.041 194.039 194.041 194.041 196.327 -p -Z -a=10 flexiGIF.2018.11a.exe.Z
    12.088 11.999 11.785 11.786 11.786 11.786 11.785 12.088 -p -Z -a=10 flexiGIF.cpp.Z
    5.357 5.326 5.278 5.278 5.278 5.278 5.278 5.357 -p -Z -a=1 readme.Z (readme of flexiGIF)
    217.210 212.209 211.104 211.105 211.103 211.105 211.104 213.772 Total
    90.804.874 91.855.664 88.688.910 88.680.356 88.274.962 88.684.008 88.693.761 86.623.677 Total
    41 replies | 7390 view(s)
  • Mauro Vezzosi's Avatar
    Yesterday, 17:21
    I have verified that the decompression creates a file identical to the original coronavirus.fasta.
    76 replies | 5186 view(s)
  • Darek's Avatar
    Yesterday, 16:24
    Darek replied to a thread paq8px in Data Compression
    enwik scores for paq8px v201:
    15'896'588 - enwik8 -12leta by Paq8px_v189, change: -0,07%
    15'490'302 - enwik8.drt -12leta by Paq8px_v189, change: -0,10%
    121'056'858 - enwik9_1423.drt -12leta by Paq8px_v189, change: -2,99%
    15'884'947 - enwik8 -12lreta by Paq8px_v193, change: -0,02%
    15'476'230 - enwik8.drt -12lreta by Paq8px_v193, change: -0,02%
    126'066'739 - enwik9_1423 -12lreta by Paq8px_v193, change: -0,09%
    121'067'259 - enwik9_1423.drt -12lreta by Paq8px_v193, change: 0,08%
    15'863'690 - enwik8 -12lreta by Paq8px_v201, change: -0,23% - time to compress: 45'986,20s
    15'462'431 - enwik8.drt -12lreta by Paq8px_v201, change: -0,12% - best score for paq8px series - time to compress: 30'951,71s
    120'921'555 - enwik9_1423.drt -12lreta by Paq8px_v201, change: -0,13% - best score for paq8px series - time to compress: 406'614,43s
    2325 replies | 622335 view(s)
  • xinix's Avatar
    Yesterday, 09:20
    I converted your file back to its original form. This file is random; it cannot be compressed!
    __
    You're going in the wrong direction. You don't need to convert the file to view it as 0s and 1s; you're trying to compress "binary code", but if you tried to understand compression you would know that PAQ already compresses binary bits - not bytes, but the very bits you're trying to show us.
    __
    If you still don't get it, let me improve the compression ratio of your file right now! I took your "binary" xorig.txt and converted it to HEX ASCII, and now the file is 66,840 bytes! But the compression has improved! xorig "binary" is only 85% compression; xorig HEX ASCII gets a compression ratio of as much as 92%!!! Did you see that?
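    To spell out the arithmetic behind that last jab (using the sizes quoted in this thread, so treat the figures as approximate): the 0/1 text file is 33,420 bytes and compresses at 85%, i.e. to roughly 33,420 x 0.15 = ~5,013 bytes, while the hex-ASCII version is 66,840 bytes and compresses at 92%, i.e. to roughly 66,840 x 0.08 = ~5,347 bytes. The "better" percentage still yields the larger compressed file, which is why a ratio measured against an inflated representation is misleading.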
    6 replies | 119 view(s)
  • Trench's Avatar
    Yesterday, 08:22
    I asked a while ago about compressing a binary file and got some decent programs that can compress, but they do poorly on binary compared to just putting the binary into hex/ASCII and then compressing it, which beats plain binary compression by around 15% in what I tested. Other programs like 7zip are not as good, even with LZMA2 being better than BZip2 when dealing with binary. Maybe my tests were not done properly; some might feel the file is not big enough (32 KB), or that I did not use the proper configurations. This was just a quick test. Maybe more focus is needed on binary compression? Bad enough that other compression formats, as stated before, like PNG, are off by over 50%.
    6 replies | 119 view(s)
  • Trench's Avatar
    Yesterday, 07:39
    As expected in the poll: 2 people would take the 1 to 100 million, while 2 say "no comment", which means they either want to be nice and not say, or don't know what they would do when it comes to it, which usually means they would also choose the money. Also, out of the 212 (or let's say a fraction of that, about 25 people) who saw this post, the rest had no comment to state the obvious truth. Silence speaks volumes.
    Does the poll have a right answer? What if you were on the other end and you had a multi-million dollar idea - what would you want the other person to choose if you relied on them to be your lawyer, your programmer, etc.? You would probably want them to choose 0. Shouldn't one decide on (1) the golden rule of "do to others what you want others to do to you", or do people prefer the other (2) golden rule of "whoever has the gold makes the rules"? That is what some of those examples were, which can sure seem to buy people's mentality as well in how they are viewed. Even if you know this and change your mind, it just shows the mentality of most. If you are willing to screw over 1-2 people to benefit your own life, the other person will think the same way, and if everyone screws over everyone else then that is 7+ billion people willing to screw you, or who already have, one way or another. A downward spiral where the end result is everyone gets hurt equally - kind of like communism, everyone suffers equally.
    Freedom and responsibility go together. If you want freedom to do what you want without responsibility, then more rules will be imposed by force, which means less freedom, in order to secure others' freedom, which costs them a bit of freedom. The more security you get, the less freedom, which is what a prison is. As one of the US founders, Benjamin Franklin, roughly said: compromise freedom for more security and you will get neither and deserve neither. It seems people cannot handle truth or freedom. Give a child the freedom to do anything and they will be a slave to their addictions; for an adult to act the same seems just as childish. Well, that's that.
    Mega: that is not the point of the topic, but an example of dishonesty - if you feel everyone is honest, then oh well. But to clarify on your points:
    1. Tesla worked with Westinghouse and was pressured by Morgan to be sued, and decided to give up the patent even though Westinghouse/Tesla knew they could win in court; Morgan wanted to sue knowing it would cost them legal fees and take years to settle. Later on it was hard to get financial backers, since his biggest backer, Astor, died on the Titanic, owned by Morgan, who did not go on the ship and had insurance. Maybe coincidence.
    2. Apple's Steve Jobs said good artists copy and great artists steal, and also blamed Gates for taking the UI from them. Apple would have been out of business without government support.
    3. MS was sued for antitrust. Did you know Gates was the first person to make a computer virus, yet is pushing to stop viruses with his organisation, made after he got sued, to help his image? As Gates said in public years ago, we need good vaccines to reduce the population. LOL
    4. Xerox - it depends; they had their info free for development and improvement but did not get the credit they should have, which would have been the honorable thing, while others say there was a patent but no real database, so it took a long time to get one, which was eventually accepted in 1991. Laws are in place because people are not honorable.
    5. FB got sued and had to pay hundreds of millions.
    You can look it all up. But to think that billion-dollar companies do things honorably to get that wealth is naive. There is a lot of info from court records, admitted by the people that did it. Think what you will.
    2 replies | 265 view(s)
  • Gonzalo's Avatar
    Yesterday, 00:47
    There is a great replacement for JPG right now. It's called jpeg-xl. There's a lot of chat about it on this forum. About future compatibility? Yes and no. wimlib, tar and fxz are open source. Srep and fazip too, but I haven't had much luck compiling them, and they're abandonware, so not great.
    31 replies | 1954 view(s)
  • Hacker's Avatar
    26th February 2021, 14:52
    Ah, found the executables, I was blind, sorry.
    31 replies | 1954 view(s)
  • Cyan's Avatar
    26th February 2021, 02:41
    Yes, totally. I wouldn't bother with that kind of optimization.
    Yes. The underlying concept is that the baseline LZ4 implementation in lz4.c can be made malloc-less. The only need is some workspace for the LZ4 compression context, and even that one can be allocated on the stack, or allocated externally. It's also possible to redirect the few LZ4_malloc() invocations to externally defined functions: https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L190
    Yes, that's correct.
    Yes. All that matters is that the state is correctly initialized at least once. There are many ways to do that: LZ4_compress_fast_extState() is one of them, LZ4_initStream() is another one, and memset() on the area should also work fine. Finally, creating the state with LZ4_createStream() guarantees that it's correctly initialized from the get-go.
    I don't remember the exact details. To begin with, initialization is very fast, and it only makes sense to skip it for tiny inputs. Moreover, I believe that skipping initialization can result in a subtle impact on branch prediction later on, resulting in slower compression speed. I _believe_ this issue has been mitigated in the latest versions, but don't remember for sure. Really, this would deserve to be benchmarked, to see if there is any benefit, or detriment, in avoiding initialization for larger inputs.
    I don't see that comment. The compression state LZ4_stream_t should be valid for both single-shot and streaming compression. What matters is to not forget to fast-reset it before starting a new stream. For single-shot, it's not necessary because the fast-reset is "embedded" into the one-shot compression function.
    Yes, it is. It's generally better (i.e. faster) to ensure that the new block and its history are contiguous (and in the right order). Otherwise, decompression will work, but the critical loop becomes more complex, due to the need to determine in which buffer the copy pointer must start, to control overflow conditions, etc. So, basically, try to keep things contiguous and speed should be optimal.
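    As a concrete illustration of the malloc-less pattern discussed above, here is a minimal C sketch (assuming lz4.h/liblz4 are available; buffer sizes and error handling are simplified) that allocates the compression state once and reuses it for several one-shot compressions via LZ4_compress_fast_extState; it is an outline of the approach, not production code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "lz4.h"
    int main(void)
    {
        /* Allocate the compression state once; LZ4_sizeofState() reports its size. */
        void *state = malloc((size_t)LZ4_sizeofState());
        if (!state) return 1;
        const char *inputs[] = { "first block of data", "second block of data" };
        for (int i = 0; i < 2; i++) {
            int srcSize = (int)strlen(inputs[i]);
            int dstCap  = LZ4_compressBound(srcSize);
            char *dst   = malloc((size_t)dstCap);
            if (!dst) { free(state); return 1; }
            /* One-shot compression with an externally provided state: the library does not
               have to allocate the context itself (the reset is embedded in the call). */
            int csize = LZ4_compress_fast_extState(state, inputs[i], dst,
                                                   srcSize, dstCap, 1 /* acceleration */);
            printf("block %d: %d -> %d bytes\n", i, srcSize, csize);
            /* Round-trip check with the stateless safe decoder. */
            char back[64];
            int dsize = LZ4_decompress_safe(dst, back, csize, (int)sizeof(back));
            printf("decoded %d bytes, match=%d\n", dsize,
                   dsize == srcSize && memcmp(back, inputs[i], (size_t)srcSize) == 0);
            free(dst);
        }
        free(state);
        return 0;
    }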
    1 replies | 351 view(s)
  • macarena's Avatar
    25th February 2021, 21:43
    Hello, most recent papers on image compression use the Bjontegaard metric to report average bitrate savings or PSNR/SSIM gains (BD-BR, BD-PSNR etc.). It works by finding the average difference between RD curves. Here's the original document: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc
    I am a little confused by the doc. The bitrate they show in the graphs at the end of the document seems to be in bits/sec. At least it is not bits per pixel (bpp), which is commonly used in RD curves. As per this site https://www.intopix.com/blogs/post/How-to-define-the-compression-rate-according-to-bpp-or-bps, converting from bpp to bits/s would require knowing the fps, which might not be known(?). I want to know if it really matters whether the bitrate is in bpp or bits/s, or does the metric give correct values no matter which one is used? Here's a Matlab implementation that seems to be recommended by JPEG: https://fr.mathworks.com/matlabcentral/fileexchange/27798-bjontegaard-metric . I ran a few experiments and it seems bpp gives plausible results, though a confirmation would be nice. Thanks!
    0 replies | 133 view(s)
  • JamesWasil's Avatar
    25th February 2021, 20:58
    I've wanted this with 286 and 386 laptops for decades. Do you feel it will live up to the promises from the press release? https://arstechnica.com/gadgets/2021/02/framework-startup-designed-a-thin-modular-repairable-13-inch-laptop/
    0 replies | 68 view(s)
  • Jarek's Avatar
    25th February 2021, 15:12
    The first was uABS, for which we indeed start by postulating the symbol spread - the decoder; then, a bit surprisingly, it turns out you can derive a formula for the encoder ... but in rANS both are quite similar.
    4 replies | 340 view(s)
  • mitiko's Avatar
    25th February 2021, 14:37
    Yeah, that makes sense. I can see how the traversal of this tree is avoided with range coding. I was trying to find some deterministic approach by using a table (just visualizing how states are linked to each other), but I'm now realizing we have to brute-force our way to find the checksum if we process the data this way. There really can't be any configuration that works as FIFO if the states that the encoder and decoder visit are the same. I can justify this with an example:
    Abstractly, say we want to encode "abc".
    x0 --(a)--> x1 --(b)--> x2 --(c)--> x3
    We start off from a known initial state x0 and work our way forward. The information each state holds: x0 - no information; x1 - "a"; x2 - "ab"; x3 - "abc".
    When we decode we have:
    x3 --(a)--> x4 --(b)--> x5 --(c)--> x6
    x3 - "abc"; x4 - "bc"; x5 - "c"; x6 - no information.
    There's no reason for x1=x5 or x2=x4. I haven't found a solution but I'm working on it. Basically I have to define new C(x,s) and D(x). Jarek, when working on ANS, did you discover the decoder or the encoder first? (In the paper it seems it was the decoder.)
    4 replies | 340 view(s)
  • Jarek's Avatar
    25th February 2021, 09:09
    I have searched for FIFO ANS but without success - see e.g. Section 3.7 of https://arxiv.org/pdf/1311.2540.pdf In practice e.g. in JPEG XL the chosen state (initial of encoder, final of decoder) is used as checksum, alternatively we can put some information there to compensate the cost. Another such open problem is extension of uABS approach to larger alphabet ...
    4 replies | 340 view(s)
  • Shelwien's Avatar
    25th February 2021, 04:09
    > we can't generally compress encrypted data?
    More like we can assume that any compressibility means a reduction of encryption strength. But practically, processing speed is a significant factor, so somewhat compressible encrypted data is not rare.
    > extract useful data from an encrypted dataset without decrypting the dataset
    One example is encrypting a data block using its hash as a key (such keys are then separately stored and encrypted using the main key). In this case it's possible to apply dedup to encrypted data, since the same blocks would be encrypted with the same keys and produce the same encrypted data.
    > the possibility of compressing encrypted data
    Anything useful for compression also provides corresponding attack vectors, so it's not really a good idea. From a security p.o.v. it's much better to compress the data first, then encrypt it.
    1 replies | 184 view(s)
  • Shelwien's Avatar
    25th February 2021, 03:44
    RC fills a symbol's interval with the intervals of the next symbols. ANS fills it with previously encoded symbols. The arithmetic operations are actually pretty similar (taking into account that CPU division produces both quotient and remainder at once), so we could say that RC is the FIFO ANS. https://encode.su/threads/3542?p=68051&pp=1 Also, all good compressed formats need blocks anyway, so ANS being LIFO is usually not a problem.
    4 replies | 340 view(s)
  • Gribok's Avatar
    25th February 2021, 02:02
    I have no plans on updating bsc.
    6 replies | 524 view(s)
  • mitiko's Avatar
    25th February 2021, 00:35
    Is there a way to make ANS FIFO? I'm assuming it's either impossible or no one has tried it, since we've been stuck with a LIFO fast coder since 2014. If so, what's the reason making it impossible?
    ANS basically maps numbers to bits, and a bigger state means more bits. This is why the state is increasing as we encode. Normalization lets us get by without infinite precision. Can't we just reverse this process? Instead of starting at a low state and increasing as we go, start at some high state and decrease it. Normalization would make sure we don't get to negative states - it would increase the state. In ANS we're asking the question "What is the next state that encodes all the bits we have and one more bit?". Now we can ask "What is the closest previous state that encodes the information in the current state, preceded by the new bit?"
    I'm imagining some hypothetical scheme (for the binary alphabet): Let's say we want to encode 00110100.
    The first symbol is 0. x_0 = 16 encodes 000100. The last symbol of this sequence is 0 and it matches the end of this code, so we don't switch states.
    The next symbol we read is 0. The closest smaller state that matches is x_1 = 13, which encodes 00100.
    The next symbol is 1. The closest smaller state that matches 001 is x_2 = 5, which encodes 001.
    The next symbol is 1. But we can't encode anything - our state is too low. So we normalize: we output some bits and go to, let's say, state x_3 = 40, which encodes 0010010001001. This matches our symbol 1.
    We read the next symbol to be 0. x_4 = 38 encodes 0100100101 and matches "01". And so on...
    That's the basic idea. I'm trying to understand more deeply how ANS works and how exactly it relates to range coding. Notice the string encoded by our state is the reverse of what we read. We're still reading in the correct direction, but the state encodes in the opposite one - we're adding data to the most significant bits of the state, not the least significant. That's just me brainstorming. Has anyone else tried this? I would find great pleasure in a proof that it doesn't work as well. Thanks!
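    For anyone following along, here is a deliberately stripped-down rANS sketch in C (tiny alphabet, 64-bit state, no renormalization, purely illustrative) that makes the LIFO property visible: C(x,s) is applied to the symbols in reverse order, and D(x) then recovers them in forward order while shrinking the state back to its start value.
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    /* Toy rANS over 4 symbols, total frequency M = 16. No renormalization:
       the 64-bit state simply grows, which is fine for a short message. */
    #define M 16u
    static const uint32_t freq[4] = { 8, 4, 2, 2 };   /* f_s */
    static const uint32_t cum [4] = { 0, 8, 12, 14 }; /* c_s = sum of earlier f_s */
    static uint64_t C(uint64_t x, int s)              /* push one symbol onto state x */
    {
        return (x / freq[s]) * M + (x % freq[s]) + cum[s];
    }
    static uint64_t D(uint64_t x, int *s)             /* pop one symbol, return previous state */
    {
        uint32_t slot = (uint32_t)(x % M);
        int sym = 0;
        while (!(slot >= cum[sym] && slot < cum[sym] + freq[sym])) sym++;
        *s = sym;
        return freq[sym] * (x / M) + slot - cum[sym];
    }
    int main(void)
    {
        const int msg[] = { 0, 1, 0, 2, 0, 1, 0, 3, 1, 0, 2, 0 };
        const int n = (int)(sizeof(msg) / sizeof(msg[0]));
        uint64_t x = 1;                               /* arbitrary non-zero start state */
        for (int i = n - 1; i >= 0; i--)              /* LIFO: encode in reverse order */
            x = C(x, msg[i]);
        printf("final state: %llu\n", (unsigned long long)x);
        int out[16];
        for (int i = 0; i < n; i++)                   /* decode forward */
            x = D(x, &out[i]);
        printf("roundtrip ok: %d, state back to %llu\n",
               memcmp(out, msg, sizeof(msg)) == 0, (unsigned long long)x);
        return 0;
    }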
    4 replies | 340 view(s)
  • Bulat Ziganshin's Avatar
    24th February 2021, 22:55
    I started to work on codec bindings for the new CELS framework with what I thought was the simplest one - LZ4. But it turned out to be more complex work than I expected. My main source of information is the lz4 manual, the library source code and examples. I have not found any tutorial-style guides for using the library, and the documentation provided is centered around individual functions rather than usage scenarios. My goal is to avoid mallocs inside the library (providing all memory buffers myself), and to reuse compression contexts between operations in order to maximize performance. I will put my questions in bold, mixed with commentary describing my setup.
    Q1: It seems that the decompression context is so small and quickly initialized that keeping and reusing it between operations is meaningless?
    For compression of a single buffer, I use the sequence `malloc(LZ4_sizeofState()) + LZ4_compress_fast_extState() + free()`.
    Q2: Is LZ4_compress_fast_extState guaranteed to never call malloc inside (in the current 1.9.3 version)?
    It seems that the next step to improve performance is to have malloc + LZ4_compress_fast_extState + LZ4_compress_fast_extState_fastReset (many times) + free?
    Q3: Am I correct that I should call LZ4_compress_fast_extState with a freshly allocated state object, but in the following operations using the same state object I can use LZ4_compress_fast_extState_fastReset to improve performance?
    Q4: In lz4.c:819 I see that the fast reset is actually used only for input blocks smaller than 4 KB. Why isn't it used for larger blocks?
    Now, going to stream compression.
    Q5: The LZ4_resetStream_fast docs state that despite having the same size, a context used for single-shot compression shouldn't be reused for stream compression and vice versa. Is this true or is it my misunderstanding?
    The proper approach to maximize speed by reusing the context with streaming compression looks like:
    1. malloc + LZ4_initStream
    2. call LZ4_compress_fast_continue multiple times, processing a single stream
    3. LZ4_resetStream_fast
    4. call LZ4_compress_fast_continue multiple times, processing a single stream
    repeat steps 3 and 4 again for each subsequent stream, and finally
    5. free the memory
    Q6: Is this the correct approach to ensure max speed?
    Q7: The LZ4_decompress_safe_continue documentation mentions three decompression scenarios, and LZ4_compress_fast_continue mentions various scenarios as well. Does any of them have a speed advantage? In particular, I have implemented double-buffering with adjacent buffers (as implemented in blockStreaming_doubleBuffer.c, although this contradicts Note 3 on LZ4_compress_fast_continue) - is this correct, and can I improve the speed by using another approach to compression/decompression? By default I use two buffers of >=64 KB, but I'm also interested in the answer for smaller buffers.
    PS: at the end of the day, it seems that the internal LZ4 API is already well thought out for my needs, not so much the manual.
    1 replies | 351 view(s)
  • Gonzalo's Avatar
    24th February 2021, 22:53
    Will you be updating your BSC with this new library?
    6 replies | 524 view(s)
  • algorithm's Avatar
    24th February 2021, 22:41
    Also, my experience after my submission to GDCC: for inverse BWT, if you do it in parallel like in Lucas's iBWT, the bottleneck is the memory-level parallelism of the CPU (that is, the number of concurrent memory cache misses). Intel Skylake has around 10 and Zen 2 around 20. (I have not tested on Zen, but it is likely faster.) Also you need huge pages, because otherwise TLB misses will be the bottleneck. But note my block sorting was not exactly BWT, but something a bit strange that used the cache better (though with a compression ratio penalty).
    6 replies | 524 view(s)
  • SolidComp's Avatar
    24th February 2021, 22:32
    Hi all – I just realized that I don't know anything substantial about encryption. Am I correct in assuming that we can't generally compress encrypted data? I assume that effective encryption produces random-looking data to the naive sentient being, therefore the sequence of steps has to be compression first, then encryption, then decryption, then decompression right? I've heard a bit about homomorphic encryption, which I think means a type of encryption that enables one to extract useful data from an encrypted dataset without decrypting the dataset. At that level of description, this can mean a few different things. Homomorphic encryption might mean that data in its informative form exists at different levels of abstraction or something, some of which is not actually encrypted (?), or there might be something about the nature of the search algorithms – how you find what you're looking for without decrypting... Or it could be much more interesting than this. But the general thrust of homomorphic encryption hints at the possibility of compressing encrypted data, if the surfaced information was somehow useful to a compressor, could be acted on by a compressor, but I don't know enough about it. Has anyone worked on the theory around this? Thanks.
    1 replies | 184 view(s)
  • Lucas's Avatar
    24th February 2021, 21:11
    I know sais isn't a new algorithm, but being able to execute a majority of it out-of-order is an impressive optimization nonetheless.
    6 replies | 524 view(s)
  • Gribok's Avatar
    24th February 2021, 20:39
    libsais is not based on net new ideas (we are still using the 10+ years old sais algorithm). I think what changed is the hardware profile. CPU frequencies have stalled, but CPU caches and RAM keep getting bigger and faster. So something which was not practical 10 years ago becomes practical now. DDR5 is also expected to land later this year, so I think it could be even better.
    6 replies | 524 view(s)
  • Lucas's Avatar
    24th February 2021, 19:23
    It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field. Brilliant work! This certainly seems like it will become a new standard.
    6 replies | 524 view(s)
  • Bulat Ziganshin's Avatar
    24th February 2021, 18:07
    In 2017, I managed to implement the CELS framework and wrote quite a long documentation for it, with two parts - the first one for codec developers, and another one for application developers leveraging CELS to access all those codecs. But then the work stalled, and it remained unfinished. Now I have continued my work on this framework. Since the initial design is pretty well documented, I invite you to read the docs and give your opinions on the design.
    To quickly summarize the idea behind it: it provides to external (DLL) codecs the same API as the one provided to codecs inside FreeArc. It's more powerful and much easier to use than the 7-zip codec API, so I hope that other archivers will employ it, and developers of compression libraries (or 3rd-party developers) can make these libraries available for all these archivers by implementing a pretty simple API - instead of spending precious time on implementing a CLI time and again. The repository is https://github.com/Bulat-Ziganshin/CELS
    13 replies | 5253 view(s)
  • Gribok's Avatar
    24th February 2021, 07:31
    libsais is my new library for fast (see Benchmarks on GitHub) linear-time suffix array and Burrows-Wheeler transform construction based on induced sorting (the same algorithm as in sais-lite by Yuta Mori). The algorithm runs in linear time (and outperforms divsufsort) using typically only ~12KB of extra memory (with 2n bytes as the absolute worst-case extra working space). Source code and benchmarks are available at: https://github.com/IlyaGrebnov/libsais
    6 replies | 524 view(s)
  • mpais's Avatar
    23rd February 2021, 19:47
    Sure, but no one helped either. We've discussed this before, writing a full blown archiver as ambitious as that, on my free time, is delusional. There's a reason most of us here write single file compressors. An archiver needs to implement a lot more functionality (adding content to an existing archive, deleting content from an archive, partial extraction, etc) and we wanted to do it in an OS and ISA independent way, i.e., it should just work out of the box on Windows, Linux, MacOS, etc, and on x86, ARM, MIPS, RISC-V, etc. That is way above my skill level. Anyway, that is off-topic and I don't want to derail the thread. Gotty is doing a fine work on paq8gen, the paq8 legacy is in good hands.
    67 replies | 3922 view(s)
  • Shelwien's Avatar
    23rd February 2021, 18:42
    C# is troublesome, so can you provide some test results? Using some popular dataset like http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia ? Also, maybe move this to a new thread? This thread is about GLZA really.
    881 replies | 459572 view(s)
  • mitiko's Avatar
    23rd February 2021, 18:04
    I've been working on an optimal dictionary algorithm for a bit now, and I didn't even know this thread existed, so I'm happy to share some knowledge. The easiest way to search for strings and assign rankings is to construct the suffix array. This way all possible words to choose from are sorted and appear adjacently in the suffix array. (It's also not that costly to compute the SA - you can even optimize prefix-doubling implementations by stopping early and constructing an array of the suffixes only up to a size m - the max word size.) I've also done the math for the exact cost savings of choosing a word (given that an entropy coder is used after the dictionary transform).
    A cool direction to look into is constructing a tree of possibilities when choosing a word - so that your algorithm isn't greedy in choosing the best-ranked word. Sometimes you should choose a word with a smaller ranking and see the improvements in the next iteration. There's also a cool idea of choosing a word which is a pattern - for example "*ing". Now when the encoder has to parse "doing" it will emit the pattern as a word and then fill in the remaining characters, in this case "do". Patterns allow you to match more words, but at the cost of having more information to encode at each location. The hope is that the extracted bonus characters form a different context and can be encoded separately from the rest of the file with better models.
    My code: https://github.com/Mitiko/BWDPerf (The documentation is a bit outdated now, so don't look too much into whatever pretentious stuff I've written.)
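    As a tiny illustration of the "candidate words sit next to each other in the suffix array" observation, here is a naive C sketch (O(n^2 log n) construction via qsort, nothing like a real prefix-doubling or SA-IS implementation):
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    /* Naive suffix array: sort suffix start positions by suffix text. All occurrences
       of any word then occupy a contiguous range of ranks, which is what makes
       counting/ranking candidate dictionary words cheap. */
    static const char *text;
    static int cmp_suffix(const void *a, const void *b)
    {
        return strcmp(text + *(const int *)a, text + *(const int *)b);
    }
    int main(void)
    {
        text = "to be or not to be";
        int n = (int)strlen(text);
        int *sa = malloc(sizeof(int) * (size_t)n);
        for (int i = 0; i < n; i++) sa[i] = i;
        qsort(sa, (size_t)n, sizeof(int), cmp_suffix);
        /* The occurrences of "be" (or any other word) show up at adjacent ranks. */
        for (int i = 0; i < n; i++)
            if (strncmp(text + sa[i], "be", 2) == 0)
                printf("rank %d, position %d: \"%s\"\n", i, sa[i], text + sa[i]);
        free(sa);
        return 0;
    }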
    881 replies | 459572 view(s)
  • spaceship9876's Avatar
    23rd February 2021, 15:40
    I'm not sure if this is relevant or not: https://jpeg.org/items/20201226_jpeg_dna_2nd_workshop_announcement.html
    67 replies | 3922 view(s)
  • Mega's Avatar
    23rd February 2021, 15:06
    Nikola Tesla worked for the Rothschilds with the understanding that they owned most of his work, contingent on them funding it. While Apple may well have been screwed over by Microsoft, Xerox was not screwed over by either, and had no patents nor expectations on what they made before Apple or Microsoft. Facebook was not stolen, but planned. It is the current generation of the CIA project Lifelock, which was immediately transitioned to "Facebook" in 2004. Only the initial look of the front end and a few basic functions were in question by the Winklevoss twins; the rest was, and still partly is, classified as Lifelock. It would seem people's understanding of events is rather skewed, aka "mainstreamed", so as not to know this?
    2 replies | 265 view(s)
  • hexagone's Avatar
    23rd February 2021, 04:33
    Here: https://encode.su/threads/2083-Kanzi-Java-Go-and-C-compressors?highlight=kanzi To be clear, it is a compressor, not an archiver.
    31 replies | 1954 view(s)
  • Hacker's Avatar
    23rd February 2021, 03:30
    data man, No executable though?
    31 replies | 1954 view(s)
  • Hacker's Avatar
    23rd February 2021, 03:26
    Gonzalo, Thanks, I'll give it a chance. Let me rephrase it: until they are more widely supported (e.g. in image viewers or in cameras), they are nice for experimenting, but we should have had a JPG successor for at least 20 years now and we still don't. Apple did something with HEIC but that's it, unfortunately. OK, sounds good enough to give it a try as well. Any thoughts about future compatibility? Will I be able to open the archives in ten years? Are the involved programs standalone downloadable exes or are they some sort of internal part of Windows 10?
    31 replies | 1954 view(s)
  • Hacker's Avatar
    23rd February 2021, 03:21
    fcorbelli, Ah, true, I forgot about the case when one already possesses the necessary knowledge.
    31 replies | 1954 view(s)
  • innar's Avatar
    23rd February 2021, 02:10
    mpais, I have validated the submission and updated the web page. Thanks again! My testing computer was probably a bit slower: 1036 seconds. Nothing compared to the ongoing test with cmix. I think lossy compression should not be acceptable, especially in a case where lossless compression outperforms it! Those correlations you found, and which you think Gotty is going to catch as well, resulted in a shorter representation - they have to be meaningful. I don't have an answer yet for how to translate that back to domain knowledge, but I will continue thinking about it and talking to people.
    76 replies | 5186 view(s)
  • Gonzalo's Avatar
    22nd February 2021, 21:24
    You want to work on Fairytale? No-one is stopping you, to be fair ;) Maybe there aren't a lot of people on board just yet but you know that can change in a heartbeat. I know I for one would test the hell out of the thing. Maybe I'm not a developer, but that hasn't stopped me from collaborating on lots of projects. Just look at all the bugs I've found in precomp, for instance...
    67 replies | 3922 view(s)
  • mpais's Avatar
    22nd February 2021, 20:35
    As I said in the challenge thread, I expect paq8gen to get down to 57x.xxx, at least, so no pressure :D The thing is, I don't think anyone in that field would even use it. Maybe innar, Kirr and JamesB have a different opinion on that, but it seems people in that field just cook up their own flawed compressors (see the benchmark that Kirr put together, especially his comments on each compressor) as long as it suits their particular workflow. For archiving and transmitting the large datasets they use, paq8gen is just too slow. And judging by innar's comments, it seems the field is not convinced of the use of compression algorithms as a way to find similarities between sequences. That's why I don't see the point in continuing working on LILY for this either. I'd rather work on something fun and useful, work is boring enough as it is, which is why I proposed fairytale in the first place, it would have kept me busy writing open-source versions of LILY, EMMA, LEA, etc. Alas, that ship has sailed, so maybe it's time for a new hobby.
    67 replies | 3922 view(s)
  • mpais's Avatar
    22nd February 2021, 20:03
    Well, I wasn't really paying much attention, so I just used the time from the last run (638s) to see how much slower the decompressor was, but I was browsing and doing other stuff at the same time. I've now run a "clean" test and it seems I grossly underestimated the difference, this run took just 528s (~2.4MB/s), so the size-optimized decompressor is a whopping 57% slower. But hey, it saves a few KB :_rofl2: As for GeCo3, the result listed is on the original fasta file. That file contains 22.020.510 "new line" bytes (ASCII #10), so even if it was stripping all of them and all the sequence names, it's still coming up short by about 14MB of what must undoubtedly be sequence data. Now, I'm far from an expert, but how can you have "lossy" compression on the actual sequence data and deem that acceptable? It's not like these are images and the appreciation of the importance of the lost details is subjective to every viewer.
    76 replies | 5186 view(s)
  • Gonzalo's Avatar
    22nd February 2021, 18:32
    Wow, I had completely forgotten about that one. Not exactly what I had in mind, but the idea is there. As a side effect, it allows for chunk reordering and that gives a surprising boost in ratio, even for smaller datasets. Of course it's not production-ready - more like a proof of concept. The binary is extremely slow, even for the copy operations, like restoring the original file (just a concatenation here; tested). The native version compiled with clang crashes very often, and gcc won't even compile it. Anyway, I guess a newer version of paq* could be used to make something like this. The ideal situation would be to start working on Fairytale, which was supposed to do all this and more. But personally, I'm finishing a bootcamp on full-stack web development (JavaScript), so maybe after that I could finally get myself to learn some C++. It's not that different, fortunately. I'll just have to deal with data types and memory management, I guess.
    2 replies | 287 view(s)
  • data man's Avatar
    22nd February 2021, 03:27
    https://github.com/flanglet/kanzi-cpp is very active.
    31 replies | 1954 view(s)
  • Shelwien's Avatar
    22nd February 2021, 03:02
    https://encode.su/threads/1971-Detect-And-Segment-TAR-By-Headers?p=38810&viewfull=1#post38810 ? Nanozip and RZ (and some other of Christian's codecs) seem to have that kind of detection, but its not open-source.
    2 replies | 287 view(s)
  • Shelwien's Avatar
    22nd February 2021, 02:53
    The difference between bytewise and bitwise model is that after occurrence of some symbol, a bytewise model increases the probability of only this symbol, while decreasing all others. While bitwise model also increases the probability of some groups of other symbols with a matching prefix. So we can view the alphabet permutation as an approximation of non-binary APM. This would mean that optimal permutations can easily be different not only for different files, but even for different contexts. (Which can be even practically implemented by building symbol probability distributions from binary counters separately for each context, then converting these distributions to a common binary tree for mixing and coding.) Btw, alphabet permutation is not limited to permutations of 0..3 even for target files with only 4 symbols. A code like {0001,0010,0100,1000} would have a different behavior than any of 0..3 permutations.
    67 replies | 3922 view(s)
  • fcorbelli's Avatar
    22nd February 2021, 00:51
    I will try
    82 replies | 5018 view(s)
  • Gotty's Avatar
    21st February 2021, 23:33
    Back to the ACGT-alphabet-reordering. These are the results from the small DNA corpus: there is no clear winner for which reordering is the "general best". There are some green areas though. Meanwhile I found the optimal alphabet transform for the sars-cov-2 challenge: it's GCAT (meaning: 'G'->0; 'C'->1; 'A'->2; 'T'->3). It is followed closely by GCTA, CGAT and CGTA. Hmm... So G likes to be together with C, and A with T. Looks like we've got something - see: https://en.wikipedia.org/wiki/Complementarity_(molecular_biology) But it does not really match the results from the small DNA corpus. The difference in compression is significant. Who's got some insight?
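    For clarity, the GCAT transform mentioned above is just a byte remapping applied to the sequence data before modelling; a minimal C sketch (mapping hard-coded, everything else left to the compressor):
    #include <stdio.h>
    #include <string.h>
    /* Alphabet transform: 'G'->0, 'C'->1, 'A'->2, 'T'->3; other bytes pass through
       unchanged in this sketch. */
    static unsigned char remap(unsigned char c)
    {
        switch (c) {
            case 'G': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'T': return 3;
            default:  return c;
        }
    }
    int main(void)
    {
        const char *seq = "GATTACA";
        for (size_t i = 0; i < strlen(seq); i++)
            printf("%c -> %d\n", seq[i], (int)remap((unsigned char)seq[i]));
        return 0;
    }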
    67 replies | 3922 view(s)
  • Gotty's Avatar
    21st February 2021, 23:19
    Oh, I thought it could be more. OK, I'm glad then!
    67 replies | 3922 view(s)
  • Gotty's Avatar
    21st February 2021, 23:17
    It's funny indeed. LILY and paq8gen are always head to head. So following that pattern, now paq8gen must go under 600K. OK. I don't know how on earth that's gonna happen yet. Feel free to add your magic there. ;-) I'm not that serious about squeezing the last bits from the exe. (You can see it's not my priority.) Pushing compression further is what I'm into. When enhancing paq8gen I'm also checking my changes against two other corpora: what I call the small DNA corpus (https://encode.su/threads/2105-DNA-Corpus) and the large one: https://tinyurl.com/DNAcorpus (referenced from GeCo3: https://github.com/cobilab/geco3). So for me it's a threefold challenge. Paq8gen is far from a proper genomic sequence compressor yet: it should support a line-unwrapping transform and reorder sequence names and sequences automatically, for example. Or look for palindromes... Or anything I haven't even heard of ;-) If you've got ideas and have some time - feel free to join and let's make it a good entropy estimator in the genomic-compression field.
    67 replies | 3922 view(s)
  • suryakandau@yahoo.co.id's Avatar
    21st February 2021, 23:11
    I don't have a personal issue with mpais, I just see that the result of paq8gen is better than LILY's.
    67 replies | 3922 view(s)
  • Gotty's Avatar
    21st February 2021, 22:56
    Thank you, Surya, it's very kind of you! You have to know that paq8gen is a community effort - it's not "mine". I'm standing on the shoulders of those who have laid the foundation with hard, dedicated work or supported paq8* with ideas or testing. Please don't root against LILY. Let's be fair with one another. mpais did an excellent job with LILY - now under 600K. It's clear that you are rooting against LILY for personal reasons. You have to know that paq8px and also paq8gen would not be as strong as they are today without mpais. The models and methods that help paq8gen to be "good" all come from paq8px, and guess who helped tremendously in enhancing those models? It's mpais. Fun fact: the strength of paq8gen comes mostly from the MatchModel (which is based on the MatchModel in EMMA). So any success of paq8gen is also a success for mpais.
    67 replies | 3922 view(s)
  • Gonzalo's Avatar
    21st February 2021, 22:32
    Fazip uses TTA, but it doesn't work for compound content, like a .wav somewhere inside a TAR archive. Precomp was supposed to include WavPack like a year ago, but nothing has happened yet and it doesn't seem like it's going to. So, do you know of something like this, that detects and compresses audio chunks and makes a copy of the rest of the data? Thanks in advance!
    2 replies | 287 view(s)
  • Shelwien's Avatar
    21st February 2021, 21:50
    Single-file zstd -1
    Compiling: g++ -O3 czstd.cpp -o czstd
    Usage: czstd c/d input output
    82 replies | 5018 view(s)
  • hexagone's Avatar
    21st February 2021, 21:47
    It is comparing apples to oranges and it is just too easy for the reader to miss the note. At first glance it looks like this compressor is better than other (lossless) compressors. You should really have a dedicated table for lossy compressors.
    76 replies | 5186 view(s)
  • spaceship9876's Avatar
    21st February 2021, 18:58
    When will you be releasing v0.12?
    82 replies | 5018 view(s)
  • innar's Avatar
    21st February 2021, 18:55
    Thank you! Just out of curiosity - what was the compression time?
    * The Cilibrasi/Vitanyi paper was added (later) for context and to build the argument that compression is useful in this case. AFAIK their paper is still in the review process for a journal, indicating that it is not obvious in that community that compression is useful for more purposes than 'saving storage'.
    * GeCo3 - I noticed it not being lossless only later, since I did not get that initially from the original paper. My bad and my inconsistency. But I thought there was value in not just deleting the row, but marking it somehow. Especially since the author was kind enough to tune and play with the parameters to achieve the best result.
    When testing with other sequence compressors, I keep realizing the same thing Kirr was indicating earlier in this thread - most of the algorithms in sequence compression are 'broken' in some sense, mostly lossy. When reaching out to authors, I have gotten the answer that lossless compression-decompression has not been the goal, since the goal is to compress sequences (?!). If your lossless compression algorithm is better than another, lossy algorithm, I think that makes it especially good.
    76 replies | 5186 view(s)
  • mpais's Avatar
    21st February 2021, 18:37
    From SARS-CoV-2 Coronavirus Data Compression Benchmark thread: Reordering even helped LZMA get close to the paq8l result on the original fasta file. Also, it's funny seeing how you find and tackle the same things as I do, like the low precision problem (you really went scorched earth on that one, 28 bits of precision? :D). The way you solved it was my plan B, in case the way I solved it in LILY wouldn't work. If you're serious about the sars-cov-2 benchmark, why not just create a new branch of paq8gen on the repo dedicated to it, instead of polluting the main version with a bunch of #defines?
    67 replies | 3922 view(s)
  • mpais's Avatar
    21st February 2021, 18:20
    Gotty seems to have made some tests and his data seems to confirm precision problems in cmix. I honestly haven't tried anything with cmix because it's simply unusable, it's far too slow. It shouldn't be a surprise though that it doesn't do very well on what seems like "textual data", since it was continuously tuned over the years for the LTCB. Comparing it to paq8gen would make more sense, since they're much closer in architecture, and paq8gen is progressing nicely. It's quite impressive that it's been keeping up with LILY without any special model.
    As for LILY, it's just about 1000 lines of badly hacked together code, much simpler than either of those 2, as implied by the difference in speed. And since it's modelling exactly the correlations I'm choosing, I can tell exactly what they are. From the moment I decided to have a go at this, I've been focusing on a theory (call it a hunch, an intuition, if you will) and have been trying to see if I could model it. Now that I've finally found what I was doing wrong, it was quite simple to get a big gain (2 lines of code gave 55.000 bytes of gain). If anything, I'd be tempted to just start over, because I think I can get pretty much the same result with half the complexity.
    I must say however that it's frustrating, not being an expert, to not know if the correlations I found, which accurately predict the differences in the sequences, have any actual meaning in the grand scheme of things. They seem to validate what I thought, but I can't shake the feeling there's a higher-order structure at play here that I'm missing. I'm sure Gotty will find them too, and paq8gen will probably go at least as low as 57x.xxx bytes.
    76 replies | 5186 view(s)