Activity Stream

Filter by: Last 7 Days
  • LucaBiondi's Avatar
    Today, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of compressing my dataset, starting from PAQ8PX_V95(!). I have also plotted size / time for each data type. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help plot the data in a better way. Thank you, Luca
    2328 replies | 623464 view(s)
  • Darek's Avatar
    Today, 11:16
    Darek replied to a thread paq8px in Data Compression
    No, they were different moments. It's probably something with my laptop. I suspect it could be an issue of low space on drive C (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking it up, paq8px quits without any message. For now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 that's about 4 days which must not be disturbed.
    2328 replies | 623464 view(s)
  • Shelwien's Avatar
    Today, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k".
    500000 * (1 - 116673681/122000000) = 21829
    (122000000 - 116673681) / 5000 / 365 = 2.92
    122 MB is the estimated paq8px result; cmix and nncp won't fit due to the time limit.
    76 replies | 11923 view(s)
  • mitiko's Avatar
    Yesterday, 23:40
    Uhm, Wikipedia says it's a general-purpose LZ77 paired with Huffman coding and second-order context modelling, but looking at the code it also includes a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask. Meanwhile, I made the counting linear, but it's way slower. I tried some performance-analysis tools but found nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bit vector (which I implemented myself) is inefficient. I'm considering a DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix-array searches, and I'm still debating whether it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    13 replies | 539 view(s)
  • SolidComp's Avatar
    Yesterday, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    13 replies | 539 view(s)
  • byronknoll's Avatar
    Yesterday, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    76 replies | 11923 view(s)
  • Gotty's Avatar
    Yesterday, 14:03
    Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory") - that it's usually AB and rarely CDEFGH - you can go for AB first. But in that case the symbols are not equiprobable, so it's not random. And we are going for random content, so you have no extra information about what the result usually is (what distribution the symbols usually follow). (A small simulation of the safe vs. risky strategy follows below this post.)
    If we play the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimal number of questions is always log2(n). Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to ask 4096x8 questions. You can't do it in less.
    Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try to find patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! Since the file is random, we know there are no patterns there. But paq does not know that. So it tries. When it believes it has found some theory (that some symbols are more probable than others) it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. It's like trying to win the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and to look for the best theory, it will soon give up trying to solve the 8-symbol game in 2 questions - it will stick with 3 questions. And so it reduces its loss. This is the reason why there is always a small loss: the compressor always takes a slightly riskier path at the beginning, so those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
    21 replies | 449 view(s)
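    A minimal sketch (Python) of the safe vs. risky idea above, assuming 8 equiprobable symbols; the "risky" strategy (ask about a 2-letter subset first, then bisect what remains) is only an illustration, not anyone's actual method:
    Code:
    import math

    symbols = list("ABCDEFGH")

    def bisect_questions(target, pool):
        # count yes/no questions needed to pin down `target` by repeatedly halving `pool`
        lo, hi, q = 0, len(pool) - 1, 0
        while lo < hi:
            mid = (lo + hi) // 2
            q += 1
            if pool.index(target) <= mid:
                hi = mid
            else:
                lo = mid + 1
        return q

    def risky_questions(target):
        # gamble: first ask "is it A or B?", then bisect whatever range remains
        pool = ["A", "B"] if target in ("A", "B") else list("CDEFGH")
        return 1 + bisect_questions(target, pool)

    safe  = sum(bisect_questions(s, symbols) for s in symbols) / 8
    risky = sum(risky_questions(s) for s in symbols) / 8
    print(safe, risky, math.log2(8))   # 3.0  3.25  3.0 -- the gamble loses on average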
  • Gotty's Avatar
    Yesterday, 13:49
    xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people make claims or statements without fully understanding the subject (yet), but it's OK. I sense that he's open and he's trying to understand it more deeply, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or a solid background he can at least sense what randomness means.
    You know how I started in ~2007? Funny: I tried to see if I could compress the already compressed paq8hp result smaller. I could have won the contest easy-peasy. Fortunately I have a math and programming background, and I quickly realized that it won't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). That's when I deeply understood the real meaning of randomness. And I can tell you: it has its beauty. Whatever you do, however you slice it, you just can't make these files smaller. The result will always have the same size (in the optimal case). So I tried it, and I have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little bit of that experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    21 replies | 449 view(s)
  • Gotty's Avatar
    Yesterday, 13:21
    I'm trying to understand what's happening. So we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages keep switching between "significant savings" and "a little smaller". At this point this is what I get: no breakthroughs; the previous attempts were false alarms. This time you can probably save a couple of bytes with not-completely-random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think that's wise, and I respect it.
    66 replies | 2023 view(s)
  • xinix's Avatar
    Yesterday, 06:54
    Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote to you that your "random" 4 KB file is too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead: the archive format header and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him.
    Why are you blind? You've been told 2 times that PAQ already compresses those 2 characters! (Reminder: Gotty, don't worry, I'm fine!) PAQ already does what you need it to do - it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to a bit view because PAQ does that itself! Just read:
    https://en.wikipedia.org/wiki/PAQ
    http://mattmahoney.net/dc/paq1.pdf
    You have a lot of time to think about it :)
    ______
    Tell me, do you program? Have you really learned any programming language?
    21 replies | 449 view(s)
  • Trench's Avatar
    Yesterday, 05:05
    I rushed my last reply, which was sloppy. You explained well how conversion beats compression, but you also answered the question by stating that no one deals with compressing 2 digits since there is no need for it. By the same logic there was no need for me to answer the puzzle, but it was a challenge for others to learn from. If everyone does the same thing, nothing new gets done and no different perspective is gained. Your reply is right from a conventional standpoint. But why have you not second-guessed yourself? Since it's "not needed", as you say - how do you know that you know it all? Based on experience? A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious.
    Random to you is not random to me. People with super memory have a structure for remembering "random" numbers or phrases through association. They may be random to others but not to a few. The computer does not know what random is unless a person defines what is or is not random, which you agree with. So to say it's impossible because it's random is like saying it's impossible because it's being ignored. Ignoring something does not make it true; it's a temporary limitation of the person. And that person puts the limitation into the code, which makes the code limited too. Then it becomes a standard, and people stop questioning it, since who will argue with the official stance? That does not seem healthy for innovation, since it sets boundaries and discourages. So when so many experts say you cannot compress random data because it has reached its limit, anyone new feels the experts know best. That is not healthy for any field of work, which is what I am trying to get at. Words make or break things, and they have restricted a lot of innovation. Even dictionaries have changed the meanings of words over time. Something as simple as the word "leisure" now means relaxing in free time, but much further back it meant using free time to learn, as a quick example. I am not here to change anyone's mind; I am here to offer another perspective. When everyone sees a playing card from one side it may look wide, but from another it is thin. No one has to agree, but just like your tests, it's a way to get used to thinking, and thinking differently, to help things. On a side note, Tesla liked the number 3 and said the secret to everything is frequency. When he said everything, did he mean everything?
    As for the ABCDEFGH game, which I did not have time to read properly last time: 8 letters seems too small to answer in under 3. 8 - you choose 4, 4 - 2, 2 - 1. The selection is too small, so other methods would be needed to get it in 2. Sure, your problem can be solved with 2, but the odds of being wrong are greater. LOL. But if you only had 2 choices, you would have to come up with a theory on why you chose them and understand the context of where they come from - not as practical, but it's a good way to think things through.
    As for the 256 game, which was answered last time: it might be good to guess in the first part to eliminate 2/3. I was trying to take that path rather than the sure and steady 1/2 path, which gets a shorter answer. It has a probability of being wrong, which delays things, but when right it has a probability of being fast, which was the purpose of the game. If I was right every time, guessing and eliminating 2/3 would take 6 turns; if wrong and I stuck to it, 13 turns at worst; to be wrong every time the probability would be 50%, so within that 50% it would take about 8 turns as well.
    So 5 at best, not likely, and 13 at worst. That is how I came to my answer, but I have now thought it through to explain the numbers better. Unless your percentages are different? So you are right, but I think I gave a better answer in speed, which was a risk. You try it at 2/3 off and tell me how fast you get it. The answer is 3, and the second one is 200. Use the same method; which part you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said I was in a rush and did not have time to think it through.
    Lotto odds also have a pattern which most don't know, but it looks "random" to most. The point is using probability to get better odds. Just like if you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should be near 50% heads or tails. Sure, it can be heads 100 times, but the odds against that increase, even though every single flip is always a 50% chance of heads whether it has been heads 4 times in a row or 8 - there are other probability factors at play that most ignore because of what they see at face value, which tricks everyone, even most mathematicians.
    You say "Also don't expect that one day we will invent a smart method to compress random data." You already do. :) A file is a file; the program that deals with it determines whether it is too "random" for it to handle. So to call a file "random" when you know the code is not fair. It is just like a lock: it all looks random, but not to the key you have. All locks are the same; it's the key that determines whether it is in order or not. If anything, the compression program is "random" in using predetermined patterns. Since some programs compress the same file better than others, the file is not "random", yet how much it is compressed is. It is not the file's fault, it is the program's. Or, to be fair, it is both, and if it does not work, the program is not compatible. It is not the 4K file's fault that the compression programs do not do a good job; as you said, there's no need to compress 2 characters - but that is based on programmers' preconceived notions. ;) That is what I am trying to say: everyone is stuck to some degree doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon by throwing more memory and processing power at it. Again, most won't agree, and I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL. This reply is long enough; I will read the other game another time. You can try my method if it applies and see if that works. YEP ;) So try to beat 4.0k.
    21 replies | 449 view(s)
  • LawCounsels's Avatar
    Yesterday, 02:34
    Celebrating: for the first time ever, random files compressed a little smaller and reconstructed exactly the same (no further details will be provided until the separate encoder and decoder are done). Subject to eventual final confirmation with a separate encoder and decoder on 2 clean PCs.
    66 replies | 2023 view(s)
  • Gotty's Avatar
    Yesterday, 02:26
    Random files? You mean: more? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double-check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2023 view(s)
  • LawCounsels's Avatar
    Yesterday, 02:03
    Sportman's provided random files (please leave any further requests for technical details for when the separate encoder and decoder are done; none will be responded to at this time).
    66 replies | 2023 view(s)
  • Gotty's Avatar
    Yesterday, 01:46
    Are these inputs random files?
    66 replies | 2023 view(s)
  • LawCounsels's Avatar
    Yesterday, 01:34
    Hi, I have made good progress: coding of the compressor is done, and I have now greatly simplified the decompressor. At the moment the decompressor works directly from the compressed internal variables and decompresses to exactly the same data. The only remaining work is to separate it into an encoder and a decoder. But do not underestimate the amount of 'mundane' tedious work involved. It will likely take a week to be able to do as requested. But we can and should already be celebrating. (It can now handle input of any size, too.) LawCounsels
    66 replies | 2023 view(s)
  • Gotty's Avatar
    1st March 2021, 20:39
    Gotty replied to a thread paq8px in Data Compression
    Does it happen around the same %? How does it crash? Blue screen?
    2328 replies | 623464 view(s)
  • Shelwien's Avatar
    1st March 2021, 02:58
    > am I supposed to remove/ sanction the 2 'dog>ate' in set #2 before blending them?
    In a similar case in PPM (called masked contexts), symbols seen in higher-order contexts before the escape are excluded from probability estimation. But in most CM implementations, submodels (based on contexts of different orders in this case) are handled independently, so multiple contexts can provide predictions based on the same data. Overall the first approach is better, but it is also harder to implement in a CM, and a CM provides better compression than PPM even without it. (A toy sketch of CM-style mixing follows below this post.)
    196 replies | 16936 view(s)
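    Not paq's actual mixer, just a toy sketch (Python) of textbook logistic mixing in a CM-style coder, to illustrate submodels being handled independently: each context order supplies its own bit probability and the mixer combines them in the logistic domain. The probabilities and weights below are made up for illustration:
    Code:
    import math

    def stretch(p): return math.log(p / (1 - p))
    def squash(x):  return 1 / (1 + math.exp(-x))

    def mix(probs, weights):
        # combine per-submodel bit probabilities in the logistic domain
        return squash(sum(w * stretch(p) for p, w in zip(probs, weights)))

    p_order2, p_order3 = 0.70, 0.90   # hypothetical predictions for "next bit = 1"
    weights = [0.4, 0.6]              # in a real CM these are adapted online to reduce coding cost
    print(round(mix([p_order2, p_order3], weights), 3))   # ~0.84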
  • Gotty's Avatar
    1st March 2021, 01:25
    I agree: unfamiliar. Actually there is a better word. It is called "unpredictable". See the wikipedia definition of randomness: "randomness is the apparent lack of pattern or predictability in events". See? "Apparent lack". It means that the file may have a pattern, you just don't find it or see it. From your viewpoint then the file is practically random. There is no way you can compress it. I also agree: "all files are random", but you need to add "... until you find some useful patterns". Otherwise this statement in itself is not valid.
    21 replies | 449 view(s)
  • Gotty's Avatar
    1st March 2021, 01:11
    Let's do another (different) approach for the second game. Let's say that you record the answers to my questions with ones ("yes") and zeroes ("no"). Here are my questions and your answers:
    1. Is x >= 128? No
    2. Is x >= 64? No
    3. Is x >= 32? No
    4. Is x >= 16? No
    5. Is x >= 8? Yes (we are in 8..15)
    6. Is x >= 12? Yes (we are in 12..15)
    7. Is x >= 14? No (we are in 12..13)
    8. Is x >= 13? Yes
    Then it is 13. What did you record? No-no-no-no-yes-yes-no-yes -> 00001101. This looks like 8 bits. In other words, a byte... Hmmm... Let's look up this number in decimal. It's 13. Wow! Try it with other numbers. The recorded yes-no answers will always match the number's binary representation. Isn't that cool? Your yes-no answers and the binary format will be the same when you follow the above question structure. If the number is between 1-65536, that's 16 yes-no questions. Can you calculate the required number of questions somehow? Yes: the needed number of questions is always log2(n), where n is the number of possible symbols (the range). So log2(256) = 8 and log2(65536) = 16.
    Now think with me. A bit more concentration... Let's do the opposite. You thought of 8 bits (8 numbers, each either 0 or 1). Let's stay with the same example: let them be 00001101. How can I find out which 8 bits you have? I simply ask about the bits one after the other:
    Is the 1st bit 1? No
    Is the 2nd bit 1? No
    Is the 3rd bit 1? No
    Is the 4th bit 1? No
    Is the 5th bit 1? Yes
    Is the 6th bit 1? Yes
    Is the 7th bit 1? No
    Is the 8th bit 1? Yes
    Notice that again it's No-no-no-no-yes-yes-no-yes. It matches the bits (00001101) exactly, again. And that represents 13. You needed 8 yes-no questions, in other words: 8 bits.
    It's even more amazing when you have probabilities for the symbols. ((Actually this is the universal case; it covers the above yes-no questions as well.)) You can have questions like this (guessing 13 again from 1..256):
    Is it 1? (p=1/256) No
    Is it 2? (p=1/255) No
    Is it 3? (p=1/254) No
    Is it 4? (p=1/253) No
    Is it 5? (p=1/252) No
    Is it 6? (p=1/251) No
    Is it 7? (p=1/250) No
    Is it 8? (p=1/249) No
    Is it 9? (p=1/248) No
    Is it 10? (p=1/247) No
    Is it 11? (p=1/246) No
    Is it 12? (p=1/245) No
    Is it 13? (p=1/244) Yes
    Got it: it's 13. To find the information content, you need to sum {-log2(1-p) for the "No"s and -log2(p) for the "Yes"es} over the answers. Here is the result for each question:
    Is it 1? (p=1/256) No 0.005646563
    Is it 2? (p=1/255) No 0.00566875
    Is it 3? (p=1/254) No 0.005691112
    Is it 4? (p=1/253) No 0.005713651
    Is it 5? (p=1/252) No 0.00573637
    Is it 6? (p=1/251) No 0.005759269
    Is it 7? (p=1/250) No 0.005782353
    Is it 8? (p=1/249) No 0.005805622
    Is it 9? (p=1/248) No 0.005829079
    Is it 10? (p=1/247) No 0.005852726
    Is it 11? (p=1/246) No 0.005876566
    Is it 12? (p=1/245) No 0.005900601
    Is it 13? (p=1/244) Yes 7.930737338
    Sum those numbers and be amazed: it's again 8. (It's 13 questions with skewed probabilities, carrying 8 bits of information content.) You can also start from the top: Is it 256? 255? ... 13? Even though you would need many more questions, the answer would still be 8. Try it. No matter what order you ask in, no matter how you slice up your range, it will always be 8 bits. (A small script reproducing this sum follows below this post.)
    See? The method or the number of symbols used is irrelevant: whether it's 8 bits or 1 byte (base 2 or base 256), it's still 256 symbols, and you always need 8 bits to represent a number from this range. Actually, whatever format you store anything in, it's always the same: the information content is how many yes-no questions you need to figure out the full content. This is the number of bits. You cannot do it in less - you cannot compress these 8 bits of information into less.
    If you understand the above (even partially) you will have a better grasp: no matter how you transform your random files, no matter what approach you use, no matter how many symbols you use to represent your data, the information content will always be the same. (4K in your case.)
    21 replies | 449 view(s)
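    A small script (Python) reproducing the sum above: asking "Is it 1?", "Is it 2?", ... with the listed probabilities and adding -log2(1-p) for each "no" and -log2(p) for the final "yes" gives exactly log2(256) = 8 bits:
    Code:
    import math

    N, target = 256, 13
    bits, remaining = 0.0, N
    for guess in range(1, target + 1):
        p = 1.0 / remaining              # "Is it `guess`?" with `remaining` candidates left
        if guess == target:
            bits += -math.log2(p)        # the single "yes": log2(244) ~ 7.93
        else:
            bits += -math.log2(1 - p)    # each "no" carries a little information too
            remaining -= 1

    print(bits, math.log2(N))            # ~8.0  8.0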
  • Self_Recursive_Data's Avatar
    1st March 2021, 00:26
    Does the following help prediction? How, if so? I tried it once but it didn't work. If we take 2 sets/orders of predictions to blend: SET1 my dog>ate my dog>ate + SET2 dog>ate dog>ate dog>jumped, am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them? They are at the same location in the dataset, because order-3 is found in order-2 but not always the other way around. The dataset looks like (with to emphasize the matches): "], ], ". As we can see here, some of the order-3 is order-2.
    196 replies | 16936 view(s)
  • Gotty's Avatar
    1st March 2021, 00:17
    Not bad! So you figured out that you need to do halving to find the number. The correct answer is 8 actually, but your approach is good, so accepted ;-).
    21 replies | 449 view(s)
  • Gotty's Avatar
    1st March 2021, 00:16
    Stuck = no improvements. But we do have improvements over the last 10 years in every field (image, video, general), both practical and experimental. Are you following the news? :-) As time progresses, the leaps will get smaller and smaller - we are slowly but surely approaching the theoretical compression limit that is reachable in reasonable time. So don't expect anything "extraordinary". Also don't expect that one day we will invent a smart method to compress random data. I think that would count as a big leap, right? 2-digit ascii compression would be everywhere if that were a common format for sharing data. But no one shares data like that. So it's not in focus at all. You convert your binary file to a 2-digit file when you want to visualize the bits. Other than that it doesn't seem to be a useful format. It's just a waste of space. Do you use or see 2-digit files? Look, xinix also immediately converted it to binary. Why? That's the natural and compact container for such content.
    21 replies | 449 view(s)
  • Trench's Avatar
    28th February 2021, 23:11
    ---- OK, I just saw post #8. As explained in another post, all files are random: you can take any file and it will look random to you at face value. The program does not see it as random, since it looks for certain patterns. And those patterns become "familiar", enabling it to compress. Even with a dictionary size of 128 it still has limitations; if the dictionary size were 4 it would miss out on the other 124, which you would call random when it is obviously not. It is just not familiar to the program, in terms of what to use. The computer does not know the difference between 1234 and 1z$T. You see the difference, but you are not doing it manually with your own dictionary in your head; the computer does not see it unless you tell it. For people who talk technically, "random" should not be in the definition - "unfamiliar" should be. And I stick by that, even though I can slip up and say it at times, being influenced by others. LOL
    Anyway, you are not stuck? So you made leaps and bounds on compression, or did we all hit a wall for the past 10 years in terms of major improvements for a standard model of compression? Some programs have improved by a bit, but are not practical, which is why they are not in general use. You helped explain the comprehension barrier, but the issue still remains that I do not see any effort to improve 2-digit compression, since everyone is focused on the standard, and it is something worth paying attention to. People can disagree or disregard it, but it's a new door to explore.
    You notice how a computer has memory? Well, it tried to copy the mind, and it is hard to keep track when memory is full of other things. LOL So I lost track of the 33,420 info. lol I disagree that both are not random, and I don't like that word.
    You say you cannot beat 4.0 KB; the question is whether you will try to.
    From doubling the numbers 1 2 4 8, 16 32 64, 128 256:
    1. Is it between 1-128? (2/3 off) Yes
    2. Is it 8 to 32? (2/3 off) Yes
    3. Is it 8-16? Yes (8 9 10, 11 12 13, 14 15 16)
    4. Is it 11-13? Yes
    5. Is it 11-12? No
    6. The answer is 13.
    That would mean 6 questions, maybe with some errors. It was the first thing that came to mind, so I could be wrong, but I did it fast. Faster but inaccurate to a smaller degree. If you had said 7 I would have lost. LOL Will reply to the rest later.
    21 replies | 449 view(s)
  • hexagone's Avatar
    28th February 2021, 23:02
    Code is here: https://github.com/mathieuchartier/mcm/blob/master/Dict.hpp & https://github.com/mathieuchartier/mcm/blob/master/WordCounter.hpp Look at CodeWordGeneratorFast (Dict.hpp), especially WordCount::CompareSavings & Savings (WordCounter.hpp) for the ranking logic.
    13 replies | 539 view(s)
  • Shelwien's Avatar
    28th February 2021, 22:43
    There was this: https://encode.su/threads/482-Bit-guessing-game Humans are actually pretty bad at generating random numbers.
    21 replies | 449 view(s)
  • mitiko's Avatar
    28th February 2021, 22:40
    Thanks for the tests, but it's not quite optimized yet. I might have forgotten to bump up the block size, so it has done 10 blocks to compress enwik8 and it needs 10 dictionaries instead of 1. The LZ probably only generates overhead as well. I'm rewriting the ranking loop now, so it can do ranking with states and also save some CPU on hashtable lookups. I'm pretty busy with school but I'll try to get it out this week. mcm seems interesting - do you know what it uses as a ranking function? I'm a little bit lost in the code.
    13 replies | 539 view(s)
  • Darek's Avatar
    28th February 2021, 22:36
    Darek replied to a thread Paq8sk in Data Compression
    OK, I'll try. Somehow I've had some issues with my laptop recently - I've started enwik9 with paq8px v201 five times, and even after 3-4 days my computer crashes...
    228 replies | 23402 view(s)
  • Gotty's Avatar
    28th February 2021, 21:47
    Second game: let's say that you thought of a number between 1-256. I'd like to figure out that number. It could be any number in that range. Let's say it is 13. How many yes-no questions do you think I need to find it out? I encourage you to really try to find the answer. It's free tutoring :-) Take advantage of the opportunity.
    21 replies | 449 view(s)
  • Gotty's Avatar
    28th February 2021, 21:39
    Please follow the game here: think of a letter between A-H. That is: A,B,C,D,E,F,G,H (8 different symbols). Let me find out which it is. How many yes-no questions do I need to ask you in the optimal case (= fewest possible questions)? Let's say you think of "F". My approach:
    - Is it between A-D? - No.
    - (Then it is between "E"-"H". Let's halve that range.) Is it between "E"-"F"? - Yes.
    - (I have two choices now: either "E" or "F".) Is it "E"? - No.
    - I got it! It's "F"!
    How many questions did I need? 3. Could I have solved it in fewer questions? 2 questions? Is there a smarter way?
    21 replies | 449 view(s)
  • hexagone's Avatar
    28th February 2021, 21:31
    You should look at the dict encoding in https://github.com/mathieuchartier/mcm. It dynamically creates a dictionary of ranked words.
    mcm_sk64.exe -filter=dict -store enwik8
    Compressing to enwik8.mcm mode=store mem=6
    Analyzing 97656KB , 46591KB/s
    text : 1(95.3674MB)
    Analyzing took 2.098s
    Compressed metadata 14 -> 18
    Compressing text stream size=100,000,000
    Constructed dict words=32+11904+33948=45884 save=6956257+30898956+3574176=41429389 extra=0 time=0.064s
    Dictionary words=45884 size=376.186KB
    97656KB -> 59990KB 74774KB/s ratio: 0.61431
    Compressed 100,000,000 -> 61,430,639 in 1.38400s
    Compressed 100,000,000 -> 61,430,671 in 3.544s bpc=4.914
    Avg rate: 35.440 ns/B
    13 replies | 539 view(s)
  • Gotty's Avatar
    28th February 2021, 21:25
    I don't feel that I'm stuck. I'm posting improvements quite regularly. Why would you say that we are stuck? Data compression is a technical challenge - not a philosophical one. So we address the issue from a technical point of view. If you think there is a better approach... show us. But it should work ;-)
    Your issue is that you would like to understand why compressing a 2-digit file is worse than simply converting it to binary. This is what you asked. This is what I explained. Regarding what the issue is - I think we are on the same page.
    No, it's not. The 33,421 bytes could be compressed to 4-5K; it's clearly not random. The 4096 bytes are random. From a (general) compression point of view these two files are totally different. In your head (and in my head) the binary file is equivalent to the 2-digit file (and that's absolutely true from our viewpoint), but they are not equal for the compression engines. Please read my explanation above again. It looks like it didn't get through. I tried my best to make it clear - it looks like I was not very successful. Please read my posts again and tell me where it is difficult to understand. I'm ready to try to explain it better.
    No, they will never beat conversion. The entropy is 4096 bytes in each of your cases. That's how low you can go, no matter what. Regarding their information content, they are equal - you are right! But "technically" they are different: the 2-digit file needs to be actually compressed to get to (nearly) 4K, while the 4K file is already 4K - nothing to do from a compression point of view. You cannot beat 4K. That's the limit. (A tiny sketch of the conversion itself follows below this post.)
    What do you mean by "conventional excuses"? Is it that you hear similar explanations again and again, that random content is not compressible? Because it's not. You experienced it yourself. If you read most of the threads in the "Random Compression" subforum, you'll notice that most people try converting some random content to different formats, hoping that it will become compressible. Like yourself. It's the most natural approach, it's true. So everyone is doing it. But everyone fails. Do you think that there is some law behind it? There is. You still have the same information content (4K) in any format you convert the file to. If you convert it to any format that is actually compressible (like the 2-digit format), you just give file compressors a hard time, because they need to do actual compression to get down to near 4K. What they see is not a random file - the 2-digit format is not random.
    You mean 256 symbols and 2 symbols (not digits). Your intuition is correct: it's easier to work with 2 symbols than with 256 symbols. This is how the paq* family works. But it does not mean that a 2-symbol file is more compressible than a 256-symbol file. From an information content point of view they are equal. It's surprising, I know. In my next post I'll give you an example; maybe it helps you grasp what information content is.
    21 replies | 449 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:19
    I'm just curious how much Silesia can be compressed using the new mixer context set in each predictor...
    228 replies | 23402 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:09
    The base of paq8sk is paq8pxd. @darek, could you test it on Silesia using the same parameters as you use when testing paq8pxd?
    228 replies | 23402 view(s)
  • xinix's Avatar
    28th February 2021, 20:24
    Does compression seem to be enabled now? I would like to test only the preprocessor.
    enwik8 lzma - 23.6 MB (24,750,737 bytes)
    enwik8.bwd lzma - 34.7 MB (36,438,877 bytes)
    ___
    enwik8 PAQ fp8 - 18.6 MB (19,566,727 bytes)
    enwik8.bwd PAQ fp8 - 29.9 MB (31,357,466 bytes)
    13 replies | 539 view(s)
  • xinix's Avatar
    28th February 2021, 19:37
    I tested this on enwik8: 95.3 MB (100,000,000 bytes).
    Processing lasted 5 hours on an AMD Ryzen 5 3600X.
    enwik8.bwd: 83.7 MB (87,829,234 bytes)
    13 replies | 539 view(s)
  • Darek's Avatar
    28th February 2021, 18:46
    Darek replied to a thread Paq8sk in Data Compression
    @suryakandau - which version is the base of paq8sk48 - paq8px or paq8pxd? OK, I know: paq8pxd is the base.
    228 replies | 23402 view(s)
  • Trench's Avatar
    28th February 2021, 18:26
    4096 in its converted state - yes, I agree. But the unconverted 33,420-byte file is also a mess, and yet compression fails to beat conversion. It does get much better results without the newlines. ;) But compression still loses to conversion from what I see, which is the main issue of this post: from 31.9, it compressed with bzip2 to 4.29, lzma2 to 4.11, p12 to 4.39, and converted to 4.00. I assume better results will show when the file is much bigger, since metadata is stored by the compression programs, but the issue is: can it beat conversion? Maybe not. But it should be technically equal. So overall all your comments are right, but again that's not the issue. Can you beat 4.0 KB in theory, without conventional excuses?
    You cannot compress 256 digits as easily as 2 digits. Sure, 256 is faster, but 2 is more orderly. Take food for example: try chewing an entire apple in your mouth versus a fraction of it - the person with a fraction of an apple will do better. The topic was not about compressing files from binary format but about binary digits. Everyone in this forum thinks too technically, which seems to be stuck in the Chinese finger trap, which is not helping matters and is progressing nowhere fast. But if everyone feels I am wrong... OK. Thanks again.
    21 replies | 449 view(s)
  • Gotty's Avatar
    28th February 2021, 17:09
    I'm not sure what you mean. You don't like the word "Random" when used for incompressible data?
    21 replies | 449 view(s)
  • mitiko's Avatar
    28th February 2021, 16:52
    Yes, it's a transform. The new ranking function is based on it being followed by entropy coding.
    13 replies | 539 view(s)
  • Gotty's Avatar
    28th February 2021, 16:21
    Additional explanation: unfortunately most compressors will try finding longer patterns in the ascii files. And indeed there are patterns, and they seem to be useful. The compressors think it's a text file and that looking for string matches is a good strategy. Unfortunately it's not. The best strategy would be to simply convert the file to binary. But examining this possibility is not programmed into them. So they go for the string matches...
    Why is looking for patterns not a good strategy in this case? (A simplified example follows.) Imagine that you are a compressor and you are given the binary file (4096 bytes). There are *very* short matches (max. 2 bytes only) and you'll see that these short matches won't help compress the file. So you won't use them. Good. Or you can use them, but it still won't help compression. Now you are given the 2-digit ascii file. And you'll see that you have, for example, quite a few 8-character matches (since the bits are bytes now). "Oh, that's very good," you say. "Instead of storing those 8 bytes I will just encode the match position and length and I'm good - encoding the position and length is cheaper (let's say it's 2 bytes). So I win 6 bytes." And this is where you were led astray. In reality you encoded 1 byte of information in 2 bytes. That's a huge loss. And you thought you did well...
    So an additional problem is that when seeing the ascii files, many compressors are led astray. They think finding string matches is a good strategy. They will not find, or not easily find, the "optimal" strategy (which is an order-0 model with equiprobable symbols). (A quick demonstration with an LZ-based compressor follows below this post.)
    21 replies | 449 view(s)
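    A quick way to see this effect (Python, using zlib as a stand-in LZ77+Huffman compressor; exact sizes vary a little from run to run): the 4096 random bytes come out essentially stored, while the ascii '0'/'1' version has to be genuinely compressed and still cannot get below 4096 bytes:
    Code:
    import os, zlib

    raw = os.urandom(4096)                                  # stand-in for the random file
    ascii_bits = "".join(f"{b:08b}" for b in raw).encode()  # the 32768-char "2-digit" version

    print(len(zlib.compress(raw, 9)))         # ~4100: incompressible, only container overhead is added
    print(len(zlib.compress(ascii_bits, 9)))  # above 4096: the 8x larger file must be "compressed" back down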
  • Gotty's Avatar
    28th February 2021, 15:04
    And here is the answer: the original binary file (4096 bytes) is a random file; you can't compress it further. The entropy of the file is 4096 bytes. Any solid compressor will be around those 4096 bytes, plus of course some extra bytes for the filename, file size and some more structural info. The winning compressors here are the ones that realize that the file is incompressible and add the smallest possible metadata to it.
    On the other hand, an ascii file consisting of 2 digits (representing bits) or 16 hexadecimal characters (representing nibbles) is not random. They need to be compressed down to 4096 bytes (from 32768 bytes (bits) and 8192 bytes (nibbles)). Their file sizes (32768 and 8192 bytes) are far from the entropy (4096 bytes). We actually need compression now, so file compressors will do their magic: they need to do prediction, create a dictionary, etc. So actual compression takes place. They need to get the file size down from 32768/8192 to 4096. Even if the content represents a random file, the actual content of these ascii files is not random: the first bit of every byte, for example, is always zero (it being an ascii file). In the case of a 2-digit file, the ideal probability of a '0' character is 50%, the ideal probability of a '1' is 50%, and the ideal probability of any of the remaining 254 characters is 0%. (That's why it's important to remove those newlines from xorig.txt - otherwise a compressor needs to deal with them.) A compressor needs to store all this information (counts, probabilities or anything that helps reconstruct the ascii file) somehow in compressed form, in addition to the actual entropy of the original file (which is 4096 bytes). That's why you experienced that the results in this case are farther from the ideal 4096 bytes. (A small order-0 entropy check follows below this post.)
    The winning compressor will be the one that realizes that the best model is an order-0 model with equiprobable symbols. The sooner a compressor gets there, the better its result will be - meaning the closer it will get to the desired 4096 bytes. So always keep a (random) file in its original form; translate the file to 2-digit or 16-char format only for visualization, and don't try compressing that.
    21 replies | 449 view(s)
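    A small check of the order-0 argument (Python; the random data is generated on the fly just for illustration): equiprobable '0'/'1' ascii characters cost 1 bit each under an order-0 model, so the 32768-character file carries about 4096 bytes of information - exactly the size of the binary file it represents:
    Code:
    import math, os
    from collections import Counter

    def order0_entropy_bits(data: bytes) -> float:
        # Shannon order-0 entropy of the byte string, in bits
        n = len(data)
        return -sum(c * math.log2(c / n) for c in Counter(data).values())

    raw = os.urandom(4096)                                  # stand-in for the 4096-byte random file
    ascii_bits = "".join(f"{b:08b}" for b in raw).encode()  # its "2-digit" ascii representation

    print(len(ascii_bits))                      # 32768 bytes on disk
    print(order0_entropy_bits(ascii_bits) / 8)  # ~4096 bytes: the best an order-0 coder can approach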
  • xinix's Avatar
    28th February 2021, 14:12
    Thank you! Can this be used as a preprocessor?
    13 replies | 539 view(s)
  • mitiko's Avatar
    28th February 2021, 12:58
    Of course. There are lots of good ideas, differing in various ways. What I meant by "Since the creation of the LZ77 algorithm..." is that most programs out there use LZ77/78; they dominate the space of dictionary transforms. I'm only trying to explain my motivation behind the idea. I might be wrong in a lot of my statements, but it's intuition that drives me. I didn't know there were that many papers on the topic; I couldn't find many myself. I especially like Przemysław Skibiński's PhD thesis, section 4.8.1, as it's the closest to what I'm trying to do. It makes me very happy that we've independently come to the same ranking equation. But I've improved upon it and I'm trying to develop the idea for patterns as well. I'm trying to get compression of larger files under some reasonable time, so that it becomes feasible to use it in the real world.
    Here's an exe, although I'm not sure all of the files are needed. Usage is:
    BWDPerf.Release.exe -c fileName.txt -> creates fileName.txt.bwd
    BWDPerf.Release.exe -d fileName.txt.bwd -> creates decompressed
    I've hardcoded a max word size of 32 and a dictionary size of 256 for now.
    13 replies | 539 view(s)
  • Mauro Vezzosi's Avatar
    28th February 2021, 12:32
    A choice that I have made implicitly, and that is not clearly visible in the source, is this. Suppose we have the following input data: abcdefghi, and at each position the following matching strings in the dictionary: abc, bcde, cde, d, efg, fghi, gh. There are 2 solutions with two-step lookahead: (a)(bcde)(fghi) and (ab)(cde)(fghi). For simplicity I chose the first one encountered, but we could choose based on the length or age of the strings (bcde) and (cde), the number of children they have, the subsequent data, ... More generally, whenever we have a choice, we have to ask ourselves which one to pick.
    41 replies | 7558 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 09:41
    Paq8sk48:
    - improved compression ratio for each data type by adding some mixer context sets in each predictor (DEFAULT, JPEG, EXE, image 24-bit, image 8-bit, DECA, textwrt)
    - improved image 8-bit model
    c.tif (Darek corpus):
    paq8sk48 -s8: Total 896094 bytes compressed to 270510 bytes. Time 166.60 sec, used 2317 MB (2430078529 bytes) of memory
    paq8px201 -8: 271186 bytes. Time 76.12 sec, used 2479 MB (2599526258 bytes) of memory
    paq8pxd101 -s8: Total 896094 bytes compressed to 271564 bytes. Time 64.56 sec, used 2080 MB (2181876707 bytes) of memory
    228 replies | 23402 view(s)
  • Trench's Avatar
    28th February 2021, 06:14
    Thanks Gotty. I guess it comes down to terminology, since I mean one thing and others understand it completely differently. Which is why you asked your question even though I thought I had explained it - but not well enough, obviously, so sorry for that. Definition of binary: "consisting of, indicating, or involving two." My file has only 2 symbols, so it can be considered binary. :) I am trying to speak in dictionary terms rather than technical terms; I see a lot of misuse of words, and I guess you guys feel I misuse them. LOL
    As for the issues: yes, the newlines, as you state, which I thought I had removed - programs are finicky. You are correct that it needs 2 more bits, and 00 would do, but again it was rough estimating, and I lost track since I was going off the hex 4096, which is an even number I used for orderly purposes. But I messed up the other file and relied mostly on percentages for measurement rather than KB amounts to see the difference. Again, the point is not the accuracy of the file so much as the issue of 2 digits being compressed. As small as the file was, it was still big enough to show the issue in other ways.
    SO... the question, as stated, is that a file with 2 digits ("binary" by definition) cannot be compressed as well as simply converting it. It does not have the 256 characters that make a file harder to compress, despite being "not random". It seems easier to convert the file than to compress it, and no file compressor does as good a job on a file with 2 digits as converting does. Or maybe I just do not know which one is good - maybe someone else does? If so, which one? Taking the 2 digits, 1 and 0, putting them through the converter, pasting the hex characters into a hex editor and then saving that as a regular ascii file works better. I cannot copy the ascii directly since, oddly, it does not copy exactly. Again, that gives better results than compressing the file, whether you do only that or do it first and then compress. Maybe you know that, maybe you don't, but I'm pointing it out and asking why.
    No matter whether the file has recognized patterns or unrecognized patterns - which many like to call "random" - it's random to everyone but not truly random, so by that definition both compressible and incompressible files are "random". It is just that the program knows which patterns to use across a broad spectrum of files, and that works better for some files than for others. Even an mp3 playing the same patterns cannot be compressed: it is not "random" in how it sounds, but is treated as random in the ascii because it is unfamiliar to whoever wrote the algorithm. So I disagree with this forum's use of the term "random" and prefer "unfamiliar" or "unfamiliar patterns" as a more accurate description. The more people use proper terms, the more it can help move things forward. I try not to give long explanations, but sometimes it seems necessary to clarify.
    21 replies | 449 view(s)
  • hexagone's Avatar
    28th February 2021, 05:01
    And https://encode.su/threads/3240-Text-ish-preprocessors?highlight=text-ish
    13 replies | 539 view(s)
  • Gotty's Avatar
    28th February 2021, 04:50
    On your github page, we find:
    >> Since the creation of the LZ77 algorithm by A. Lempel and J. Ziv there hasn't really been any new foundationally different approach to replacing strings of symbols with new symbols.
    I'm not sure what you mean by that. What counts as fundamentally different? If we count your approach as fundamentally different, then I believe there are many more. Let's see. Please have a look at the following quote from Dresdenboy:
    You asked for ideas. So the best thing would be to check out those links for ideas, and also:
    https://github.com/antirez/smaz
    https://encode.su/threads/1935-krc-keyword-recursive-compressor
    https://encode.su/threads/542-Kwc-–-very-simple-keyword-compressor
    https://encode.su/threads/1909-Tree-alpha-v0-1-download
    https://encode.su/threads/1874-Alba
    https://encode.su/threads/1301-Small-dictionary-prepreprocessing-for-text-files (wbpe)
    13 replies | 539 view(s)
  • Gotty's Avatar
    28th February 2021, 03:51
    No, xinix converted your xorig.txt (which is not binary, but bits represented by ascii digits) to hexadecimal (nibbles represented in ascii). So the size doubled. (For the first 5 characters it is: "10011" -> "3130303131".)
    You said "hex" = 4096 bytes. Aha, then that is "binary", not "hex". It looks like you misunderstood what xinix said, because your "binary" and "hex" are different from his. But xinix also missed the correct term: xorig.txt is not "binary". OK, it's a textual representation of bits, but still it's textual... binary is the 4096-byte file. I think you mean "binary", not "hex". Correct?
    @Trench, the xorig.txt format is not "compatible" with its binary representation. It has newlines - please remove them. Also, 4096 bytes converted to "bits in ascii" would be 32768 bits, but you have only 32766 bits in xorig.txt. 2 bits are missing. To be fully compatible you'll need to add 2 more ascii bits.
    @Trench, I don't understand your posts. What exactly is your question or problem?
    21 replies | 449 view(s)
  • xinix's Avatar
    28th February 2021, 00:01
    Hi! Can you post the EXE file for testing?
    13 replies | 539 view(s)
  • mitiko's Avatar
    27th February 2021, 22:56
    Dictionary transforms are quite useful, but do we take advantage of all they can offer when optimizing for size? LZ77/78 techniques are widely known, and optimal parsing is an easy O(n) minimal-path search in a DAWG (directed acyclic weighted/word graph). In practice, optimal LZ77 algorithms don't do optimal parsing because of the big offsets it can generate. Also, LZ algorithms lose context when converting words to (offset, count) pairs. These pairs are harder to predict by entropy coders.
    I've been working on BWD - the pretentious name stands for Best Word Dictionary. It's a semi-adaptive dictionary, which means it computes the best/optimal dictionary (or tries to get some suboptimal approximation) for the given data and adds it to the stream. This makes decompression faster than LZ77/78 methods, as the whole dictionary can sit in memory, but compression becomes very costly. The main advantages are: optimality, easier-to-predict symbols, and a slight improvement in decompression speed. The disadvantages are clear: the dictionary has to be added to the compressed file, and compression is slow. (A toy sketch of the generic word-ranking idea follows below this post.)
    This is still quite experimental and I do plan on rewriting it in C++ (it's in C# right now), for which I'll probably need your guys' help later on. I've also considered starting a new set of paq8 compressors, but that seems like getting too deep into the woods for an idea I haven't developed fully yet.
    Someone requested test results - compression takes about 45 s for enwik5 and decompression is at around 20 ms. You can find the exact results in https://github.com/Mitiko/BWDPerf/issues/5 I'm still making frequent changes and I'm not spending too much time testing. It generates a new sequence of symbols (representing the indexes into the dictionary) - the length is 50257 and the entropy is 5.79166; the dictionary uncompressed is 34042 bytes. This means that after an entropy coder it can go to about 325'113 bytes. (This is for enwik5 with m=32 and dictionarySize=256.)
    Stay tuned, I'm working on a very good optimization that should take compression time down a lot. I'll try to release more information soon on how everything works, the structure of the compressed file and some of my notes. You can read more on how the algorithm works here: https://mitiko.github.io/BWDPerf/ This is the github repo: https://github.com/Mitiko/BWDPerf/ I'll be glad to answer any and all questions here or with the new github feature of discussions: https://github.com/Mitiko/BWDPerf/discussions
    Note this project is also a pipeline for compression, which I built with the idea of being able to switch between algorithms quickly. If you'd like to contribute any of your experiments, transforms or other general well-known compressors to it, to create a library for easy benchmarking, that'd be cool. I know Bulat Ziganshin made something similar, so I'm not too invested in this idea. Any cool new ideas, either on how to optimize parsing, counting words, or how to model the new symbols, are always appreciated.
    13 replies | 539 view(s)
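    Not BWD's actual ranking (the equation isn't published in this thread), but a toy sketch (Python) of the generic "savings" idea behind semi-adaptive dictionary builders: rank each candidate word by roughly how many symbols replacing all its occurrences would save, minus the cost of storing it once in the dictionary. All names and the formula are illustrative assumptions:
    Code:
    import re
    from collections import Counter

    def rank_words(text: str, dict_size: int = 256):
        # toy savings(w) = count(w) * (len(w) - 1) - len(w)
        # (each occurrence collapses to one symbol; the word itself is stored once)
        counts = Counter(re.findall(r"\w+", text))
        savings = lambda kv: kv[1] * (len(kv[0]) - 1) - len(kv[0])
        return sorted(counts.items(), key=savings, reverse=True)[:dict_size]

    sample = "the quick brown fox jumps over the lazy dog " * 100
    print(rank_words(sample)[:5])   # most "profitable" words first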
  • Trench's Avatar
    27th February 2021, 18:23
    I don't understand how you got your numbers, since when I converted the binary file (32.6 KB (33,420 bytes)) to hex I got 4.00 KB (4,096 bytes). I forgot to upload another file, which I have just done here, to give a completely random one, which also shows similar results, although not as good. I probably messed up a small bit and made it more disorganized, but in general you can see the issue. As stated, yes, 7zip sometimes did not do as well as another program; I used p12 from http://mattmahoney.net/dc/p12.exe
    File size 32.6k: with 7zip I get 82.27% = 5.78k; with p12 I get 85.86% = 4.61k; converting it to hex I get 87.73% = 4.0k, which wins directly. And the hex file cannot be compressed further - that gets worse results, 86.53% = 4.39k.
    With the more orderly file I uploaded (32.0 KB (32,864 bytes)): 7zip = 89.16% = 3.47k; p12 = 87.97% = 3.85k, oddly worse; hex = 87.47% = 4.01k; and the hex file compressed comes to 90.53% = 3.03k. Hex wins indirectly.
    So it seems converting it to binary before compressing gets around a 1% better result, while the more random one gets less than 1% - around a 3% difference. Maybe bigger files will get better results. I used this site to convert the binary: https://www.rapidtables.com/convert/number/ascii-hex-bin-dec-converter.html and took the hex and put it in a hex editor, since otherwise odd things happen. But it gets annoying to have to manually select which type of compression method is best by trial and error, rather than the program selecting the best one.
    21 replies | 449 view(s)
  • Mauro Vezzosi's Avatar
    27th February 2021, 17:28
    What did you two mean by "full optimal parsing" in a basic LZW as in flexiGIF? IMHO, the one-step-lookahead LZW of flexiGIF is already the best it can be, and we don't need to look any further for a lower code cost. We could try to optimize the construction of the dictionary, but that seems difficult; however, we can do something simpler and still quite effective (more on .Z than on .GIF), and to do so I will use two-step lookahead. Basically, if possible, we choose the next string so that the second ends where the third begins (so the second string gets longer); the next string is chosen immediately if it is already advantageous. (A small sketch of the lookahead idea follows below this post.)
    Suppose we have the following input data: abcdefghi, and at each position the following matching strings in the dictionary: abc, bcde, cdef, d, efg, fghi, gh.
    Greedy: (abc)(d)(efg) - strings (abc) and (d) grow, output 3 codes for 7 symbols.
    One-step lookahead: (ab)(cde)(fghi) - strings (abc) and (cdef) don't grow, output 3 codes for 9 symbols.
    Two-step lookahead: (a)(bcde)(fghi) - string (bcde) grows (because it is a full-length string) and (abc) does not, output 3 codes for 9 symbols.
    We can go through more steps to improve the accuracy of the parsing. I quickly edited flexiGIF and I attach the source with the changes (they are in a single block delimited by "//§ Begin" and "//§ End"). I leave it to Stephan to decide if, which variant, and how to definitively implement two-step lookahead.
    Original .Z files are compressed with ncompress 5.0 (site, github, releases).
    Variant 0 is standard flexiGIF 2018.11a.
    Variants 1..4 are flexiGIF 2018.11a with my two-step lookahead; they are always better than or equal to Variant 0.
    Variant 1 is the simplest and fastest.
    Variant 2 is what I think it should be.
    Variant 2a is Variant 2 with -a divided by 10 (min -a1).
    Variants 3 and 4 have minor differences.
    Greedy drops -p; sometimes it is better than flexible parsing.
Original  Variant 0  Variant 1  Variant 2  Variant 2a  Variant 3  Variant 4  Greedy  Options  File
339.011 339.150 338.893 338.886 338.886 338.886 338.893 340.023  -p -a=1  200px-Rotating_earth_(large)-p-2018_10a.gif
55.799 55.799 55.793 55.793 55.793 55.793 55.793 55.869  -p -a=1  220px-Sunflower_as_gif_websafe-p-2018_10a.gif
280 280 280 280 280 280 280 281  -p -a=1  Icons-mini-file_acrobat-a1-2018_10a.gif
167.333 167.410 167.291 167.298 167.298 167.298 167.293 167.529  -p -a=1  skates-p-m1-2018_10a.gif
52.663 52.722 52.388 52.389 52.389 52.391 52.407 53.430  -p -a=1  SmallFullColourGIF-p-m1-2018_10a.gif
615.086 615.361 614.645 614.646 614.646 614.648 614.666 617.132  Total
95 95 95 95 95 95 95 95  -p -Z -a=1  ENWIK2.Z
530 509 504 504 504 504 504 530  -p -Z -a=1  ENWIK3.Z
5.401 5.320 5.242 5.242 5.242 5.242 5.242 5.401  -p -Z -a=1  ENWIK4.Z
46.355 46.485 45.525 45.510 45.510 45.510 45.506 46.355  -p -Z -a=10  ENWIK5.Z
442.297 450.769 440.173 440.053 439.433 440.053 439.921 441.483  -p -Z -a=100  ENWIK6.Z
4.578.745 4.607.141 4.514.015 4.514.039 4.506.557 4.514.071 4.514.181 4.541.851  -p -Z -a=1000  ENWIK7.Z
46.247.947 45.915.615 44.938.673 44.936.761 44.803.785 44.936.713 44.939.553 45.130.643  -p -Z -a=10000  ENWIK8.Z
51.321.370 51.025.934 49.944.227 49.942.204 49.801.126 49.942.188 49.945.002 50.166.358  Total
2.407.918 2.329.939 2.312.506 2.312.156 2.308.735 2.312.172 2.312.602 2.329.660  -p -Z -a=1000  AcroRd32.exe.Z
1.517.475 1.476.017 1.461.401 1.461.194 1.458.239 1.461.258 1.461.400 1.472.982  -p -Z -a=1000  english.dic.Z
2.699.855 4.336.651 3.628.969 3.628.901 3.534.005 3.630.245 3.630.425 2.684.421  -p -Z -a=1000  FP.LOG.Z
2.922.165 2.792.591 2.781.381 2.781.253 2.779.381 2.781.253 2.781.429 2.804.053  -p -Z -a=1000  MSO97.DLL.Z
1.553.729 1.889.773 1.758.357 1.758.085 1.747.052 1.758.229 1.758.517 1.491.035  -p -Z -a=1000  ohs.doc.Z
1.444.893 1.418.481 1.391.773 1.389.819 1.388.155 1.389.807 1.391.609 1.377.631  -p -Z -a=1000  rafale.bmp.Z
1.456.931 1.491.769 1.412.083 1.411.773 1.402.315 1.411.893 1.412.105 1.408.439  -p -Z -a=1000  vcfiu.hlp.Z
1.133.483 1.198.564 1.140.473 1.140.189 1.136.877 1.140.239 1.140.529 1.170.537  -p -Z -a=1000  world95.txt.Z
15.136.449 16.933.785 15.886.943 15.883.370 15.754.759 15.885.096 15.888.616 14.738.758  Total
23.514.759 23.068.375 22.031.991 22.029.031 21.893.328 22.030.971 22.034.373 20.887.657  -p -Z -a=1000  MaxCompr.tar.Z (10 files)
199.765 194.884 194.041 194.041 194.039 194.041 194.041 196.327  -p -Z -a=10  flexiGIF.2018.11a.exe.Z
12.088 11.999 11.785 11.786 11.786 11.786 11.785 12.088  -p -Z -a=10  flexiGIF.cpp.Z
5.357 5.326 5.278 5.278 5.278 5.278 5.278 5.357  -p -Z -a=1  readme.Z (readme of flexiGIF)
217.210 212.209 211.104 211.105 211.103 211.105 211.104 213.772  Total
90.804.874 91.855.664 88.688.910 88.680.356 88.274.962 88.684.008 88.693.761 86.623.677  Total
    41 replies | 7558 view(s)
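    A minimal sketch (Python) of the limited-lookahead idea on Mauro's example above - not flexiGIF's exact heuristic: at each position, choose the first token so that the next few tokens together cover as many input symbols as possible. depth=1 reduces to greedy parsing, depth=3 corresponds roughly to two-step lookahead; the `matches` table lists the candidate match lengths available at each position of "abcdefghi":
    Code:
    def best_coverage(pos, depth, matches, n):
        # most symbols coverable from `pos` using at most `depth` tokens
        if depth == 0 or pos >= n:
            return 0
        best = 0
        for length in matches.get(pos, []) + [1]:          # [1] = emit a single literal
            best = max(best, length + best_coverage(pos + length, depth - 1, matches, n))
        return best

    def parse(data, matches, depth):
        pos, tokens, n = 0, [], len(data)
        while pos < n:
            choice = max(matches.get(pos, []) + [1],
                         key=lambda L: L + best_coverage(pos + L, depth - 1, matches, n))
            tokens.append(data[pos:pos + choice])
            pos += choice
        return tokens

    data = "abcdefghi"
    # dictionary strings abc, bcde, cdef, d, efg, fghi, gh -> match lengths by start position
    matches = {0: [3], 1: [4], 2: [4], 3: [1], 4: [3], 5: [4], 6: [2]}
    print(parse(data, matches, depth=1))  # greedy:         ['abc', 'd', 'efg', 'h', 'i']
    print(parse(data, matches, depth=3))  # with lookahead: ['a', 'bcde', 'fghi']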
  • Mauro Vezzosi's Avatar
    27th February 2021, 17:21
    I have verified that the decompression creates a file identical to the original coronavirus.fasta.
    76 replies | 5401 view(s)
  • Darek's Avatar
    27th February 2021, 16:24
    Darek replied to a thread paq8px in Data Compression
    enwik scores for paq8px v201:
    15'896'588 - enwik8 -12leta by Paq8px_v189, change: -0,07%
    15'490'302 - enwik8.drt -12leta by Paq8px_v189, change: -0,10%
    121'056'858 - enwik9_1423.drt -12leta by Paq8px_v189, change: -2,99%
    15'884'947 - enwik8 -12lreta by Paq8px_v193, change: -0,02%
    15'476'230 - enwik8.drt -12lreta by Paq8px_v193, change: -0,02%
    126'066'739 - enwik9_1423 -12lreta by Paq8px_v193, change: -0,09%
    121'067'259 - enwik9_1423.drt -12lreta by Paq8px_v193, change: 0,08%
    15'863'690 - enwik8 -12lreta by Paq8px_v201, change: -0,23% - time to compress: 45'986,20s
    15'462'431 - enwik8.drt -12lreta by Paq8px_v201, change: -0,12% - best score for paq8px series - time to compress: 30'951,71s
    120'921'555 - enwik9_1423.drt -12lreta by Paq8px_v201, change: -0,13% - best score for paq8px series - time to compress: 406'614,43s
    2328 replies | 623464 view(s)
  • xinix's Avatar
    27th February 2021, 09:20
    I converted your file back to its original form. This file is random; it cannot be compressed!

    You're going in the wrong direction: you don't need to convert the file to a 0/1 text view. You're trying to compress binary code, but if you tried to understand compression you would know that PAQ already compresses binary bits - not bytes, but exactly the bits you're trying to show us.

    If you still don't get it, let me "improve" the compression ratio of your file right now. I took your binary xorig.txt and converted it to HEX ASCII, and now the file is 66,840 bytes - yet the compression ratio has improved! The binary xorig compresses by only about 85%, while the HEX ASCII version compresses by as much as 92%! Did you see that? The ratio only looks better because the input was inflated first; the compressed result is no smaller.
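    To make that point concrete, here is a minimal stand-alone sketch (my own illustration, not xinix's actual test; the 32 KB size, the buffer names and the choice of zlib are my assumptions) that compresses pseudo-random bytes once raw and once after hex-ASCII expansion, then prints both ratios. The hex version roughly doubles the input, so its "ratio" looks close to 50% even though the compressed output is about the same size as before. Build with -lz.

        /* Sketch only: raw random bytes vs. their hex-ASCII expansion under zlib. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <zlib.h>

        int main(void) {
            const size_t N = 32 * 1024;
            static const char *digits = "0123456789ABCDEF";
            unsigned char *raw = malloc(N);
            unsigned char *hex = malloc(2 * N);            /* hex-ASCII expansion */
            for (size_t i = 0; i < N; i++) {
                raw[i] = (unsigned char)(rand() & 0xFF);   /* stand-in for random data */
                hex[2 * i]     = (unsigned char)digits[raw[i] >> 4];
                hex[2 * i + 1] = (unsigned char)digits[raw[i] & 0x0F];
            }

            uLongf rawOut = compressBound((uLong)N);
            uLongf hexOut = compressBound((uLong)(2 * N));
            unsigned char *rawDst = malloc(rawOut), *hexDst = malloc(hexOut);
            if (compress(rawDst, &rawOut, raw, (uLong)N) != Z_OK) return 1;        /* ~incompressible */
            if (compress(hexDst, &hexOut, hex, (uLong)(2 * N)) != Z_OK) return 1;  /* ~4 bits per hex char */

            printf("raw: %zu -> %lu bytes (ratio %.1f%%)\n", N, rawOut, 100.0 * rawOut / N);
            printf("hex: %zu -> %lu bytes (ratio %.1f%%)\n", 2 * N, hexOut, 100.0 * hexOut / (2 * N));
            free(raw); free(hex); free(rawDst); free(hexDst);
            return 0;
        }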
    21 replies | 449 view(s)
  • Trench's Avatar
    27th February 2021, 08:22
    I asked a while ago about compressing binary files and got some decent programs suggested, but they do worse on the raw binary than on the same data converted to Hex/ASCII and then compressed, which beat plain binary compression by around 15% in what I tested. Other programs like 7-Zip are not as good either, even though its LZMA2 handles binary better than Bzip2. Maybe my tests were not done properly; some might feel the file is not big enough (32 KB), or that I didn't use the proper settings. This was just a quick test. Maybe more focus is needed on binary compression? Bad enough that other compressed formats, as stated before, like PNG, are off by over 50%.
    21 replies | 449 view(s)
  • Trench's Avatar
    27th February 2021, 07:39
    As expected in the poll: 2 people would take the 1 to 100 million, while 2 say "no comment", which means they either want to be nice and not say, or don't know what they would do when it comes to it, which usually means they would also choose the money. Also, out of the 212(?) who saw this post, or let's say a fraction of that, about 25 people, none had a comment to state the obvious truth. Silence speaks volumes.
    Does the poll have a right answer? What if you were on the other end and you had a multi-million dollar idea: what would you want the other person to choose, if you relied on them as your lawyer, programmer, etc.? You would probably want them to choose 0. Shouldn't one decide on the... 1) golden rule of "do to others what you want others to do to you"? Or maybe people prefer the other 2) golden rule of "whoever has the gold makes the rules", which is what some of those examples were, and which can certainly seem to buy people's mentality as well in how they are viewed. Even if you know this and change your mind, it just shows the mentality of most. If you are willing to screw over 1-2 people to benefit your own life, the other person will think the same way, and if everyone screws over everyone else, then that is 7+ billion people willing to screw you, or who have already done so one way or another. A downward spiral where the end result is everyone gets hurt equally, kind of like communism, where everyone suffers equally.
    Freedom and responsibility go together. If you want the freedom to do what you want without responsibility, then more rules will be imposed by force, which means less freedom, in order to help secure others' freedom, which costs them a bit of theirs. The more security you get, the less freedom, which is what a prison is. As one of the US founders, Benjamin Franklin, roughly said: those who compromise freedom for more security will get neither and deserve neither. It seems people cannot handle truth or freedom. Give a child the freedom to do anything and they will be a slave to their addictions, but for an adult to act the same seems just as childish. Well, that's that.
    mega: That is not the point of the topic, but an example of dishonesty; if you feel everyone is honest, then oh well. But to clarify on your points:
    1. Tesla worked with Westinghouse and was pressured by Morgan to be sued, and decided to give up the patent even though Westinghouse/Tesla knew they could win in court; Morgan wanted to sue knowing it would cost them legal fees and take years to settle. Later on it was hard to get financial backers, since his biggest backer, Astor, died on the Titanic, owned by Morgan, who did not go on the ship and had insurance. Maybe a coincidence.
    2. Apple's Steve Jobs said "good artists copy and great artists steal", and also blamed Gates for taking the UI from them. Apple would have been out of business without government support.
    3. MS was sued for antitrust. Did you know Gates was the first person to make a computer virus, yet is pushing to stop viruses with his organisation, created after he got sued, to help his image? As Gates said in public years ago, we need good vaccines to reduce the population. LOL
    4. Xerox: it depends. They made their info freely available for development and improvement but did not get the credit they should have, which would have been the honorable thing; others say there was a patent but no real database, so it took a long time to get one, which was eventually accepted in 1991. Laws are in place because people are not honorable.
    5. FB got sued and had to pay hundreds of millions.
    You can look it all up. But to think that billion-dollar companies do things honorably to get that wealth is naive. There is a lot of info in court records, admitted by the people who did it. Think what you will.
    2 replies | 291 view(s)
  • Gonzalo's Avatar
    27th February 2021, 00:47
    There is a great replacement for jpg right now. It's called jpeg-xl. There's a lot of chat about it on this forum. About future compatibility? Yes and no. wimlib. tar and fxz are open source. Srep and fazip too but I haven't had much luck compiling them. And they're abandonware, so not great.
    31 replies | 2144 view(s)
  • Hacker's Avatar
    26th February 2021, 14:52
    Ah, found the executables, I was blind, sorry.
    31 replies | 2144 view(s)
  • Cyan's Avatar
    26th February 2021, 02:41
    Yes, totally. I wouldn't bother with that kind of optimization. Yes. The underlying concept is that the baseline LZ4 implementation in lz4.c can be made malloc-less. The only need is some workspace for the LZ4 compression context, and even that one can be allocated on the stack, or allocated externally. It's also possible to redirect the few LZ4_malloc() invocations to externally defined functions: https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L190 Yes, that's correct. Yes. All that matters is that the state is correctly initialized at least once. There are many ways to do that; LZ4_compress_fast_extState() is one of them, LZ4_initStream() is another one, and memset()-ing the area should also work fine. Finally, creating the state with LZ4_createStream() guarantees that it's correctly initialized from the get-go. I don't remember the exact details. To begin with, initialization is very fast, and it only makes sense to skip it for tiny inputs. Moreover, I believe that skipping initialization can result in a subtle impact on branch prediction later on, resulting in slower compression speed. I _believe_ this issue has been mitigated in the latest versions, but I don't remember for sure. Really, this would deserve to be benchmarked, to see whether there is any benefit, or detriment, in avoiding initialization for larger inputs. I don't see that comment. The compression state LZ4_stream_t should be valid for both single-shot and streaming compression. What matters is to not forget to fast-reset it before starting a new stream. For single-shot compression it's not necessary, because the fast-reset is "embedded" into the one-shot compression function. Yes, it is. It's generally better (i.e. faster) to ensure that the new block and its history are contiguous (and in the right order). Otherwise, decompression will still work, but the critical loop becomes more complex, due to the need to determine in which buffer the copy pointer must start, to control overflow conditions, etc. So, basically, try to keep the blocks contiguous and speed should be optimal.
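    For anyone following along, here is a minimal sketch of the malloc-less pattern as I read Cyan's answer, using only functions from the public lz4.h header (LZ4_initStream, LZ4_compress_fast_extState, LZ4_decompress_safe, LZ4_compressBound); the buffer sizes and the sample input are mine, not part of the thread. Build with -llz4.

        /* Sketch: stack-allocated LZ4 compression state, no malloc anywhere. */
        #include <stdio.h>
        #include <string.h>
        #include <lz4.h>

        int main(void) {
            const char src[] = "yes yes yes yes yes yes yes yes yes yes yes yes";
            char dst[256];                          /* comfortably >= LZ4_compressBound(sizeof src) */

            LZ4_stream_t state;                     /* caller-provided workspace, lives on the stack */
            LZ4_initStream(&state, sizeof(state));  /* explicit init; extState would also reset it   */

            int csize = LZ4_compress_fast_extState(&state, src, dst,
                                                   (int)sizeof(src), (int)sizeof(dst),
                                                   1 /* acceleration */);
            if (csize <= 0) { fprintf(stderr, "compression failed\n"); return 1; }

            char back[sizeof(src)];
            int dsize = LZ4_decompress_safe(dst, back, csize, (int)sizeof(back));
            printf("in=%zu out=%d roundtrip=%s\n", sizeof(src), csize,
                   (dsize == (int)sizeof(src) && memcmp(src, back, sizeof(src)) == 0) ? "ok" : "FAIL");
            return 0;
        }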
    1 replies | 385 view(s)
  • macarena's Avatar
    25th February 2021, 21:43
    Hello, most recent papers on image compression use the Bjontegaard metric to report average bitrate savings or PSNR/SSIM gains (BD-BR, BD-PSNR, etc.). It works by finding the average difference between RD curves. Here's the original document: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc I am a little confused by the doc. The bitrate they show in the graphs at the end of the document seems to be in bits/sec; at least it is not bits per pixel (bpp), which is commonly used in RD curves. As per this site https://www.intopix.com/blogs/post/How-to-define-the-compression-rate-according-to-bpp-or-bps, converting from bpp to bits/s would require knowing the fps, which might not be known(?). I want to know whether it really matters if the bitrate is in bpp or bits/s, or does the metric give correct values no matter which one uses? Here's a Matlab implementation that seems to be recommended by JPEG: https://fr.mathworks.com/matlabcentral/fileexchange/27798-bjontegaard-metric . I ran a few experiments and bpp seems to give plausible results, though a confirmation would be nice. Thanks!
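    A sketch of why the unit should not matter, assuming the usual BD-BR definition (fit $\log_{10} R$ as a cubic polynomial of the distortion $D$ and average the gap between the two fitted curves over the overlapping interval); this is my own reasoning, not something stated in VCEG-M33 itself:

    $$\mathrm{BD\text{-}BR} \;=\; 10^{\frac{1}{D_2-D_1}\int_{D_1}^{D_2}\left(\log_{10} R_B(D)-\log_{10} R_A(D)\right)\,dD}\;-\;1 \quad (\times 100\ \text{for a percentage})$$

    Switching from bpp to bits/s multiplies every rate by the same constant $c = \text{pixels per frame}\times\text{fps}$, and $\log_{10}(cR_B)-\log_{10}(cR_A)=\log_{10}R_B-\log_{10}R_A$, so the BD-BR value is unchanged (the constant is absorbed by the constant term of each fit). For BD-PSNR the constant shifts both fitted curves and the common integration bounds along the $\log_{10}R$ axis by the same $\log_{10}c$, so the average PSNR gap is also unchanged, provided the same unit is used for both codecs being compared.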
    0 replies | 157 view(s)
  • JamesWasil's Avatar
    25th February 2021, 20:58
    I've wanted this with 286 and 386 laptops for decades. Do you feel it will live up to the promises from the press release? https://arstechnica.com/gadgets/2021/02/framework-startup-designed-a-thin-modular-repairable-13-inch-laptop/
    0 replies | 76 view(s)
  • Jarek's Avatar
    25th February 2021, 15:12
    The first was uABS, for which we indeed start by postulating the symbol spread - the decoder; then, a bit surprisingly, it turns out you can derive a formula for the encoder ... but in rANS both are quite similar.
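    For readers of the thread, roughly the uABS formulas as I recall them from the paper (my paraphrase; here $p=\Pr(s=1)$ and $q=1-p$). The postulated spread, i.e. the decoder, is

    $$s(x)=\lceil (x+1)p\rceil-\lceil xp\rceil,\qquad x_1=\lceil xp\rceil,\qquad x_0=x-\lceil xp\rceil,$$

    and the encoder derived from it is

    $$C(1,x)=\lfloor x/p\rfloor,\qquad C(0,x)=\left\lceil\frac{x+1}{q}\right\rceil-1.$$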
    4 replies | 385 view(s)
  • mitiko's Avatar
    25th February 2021, 14:37
    Yeah, that makes sense. I can see how the traversal of this tree is avoided with range coding. I was trying to find some deterministic approach by using a table (just visualizing how states are linked to each other), but I'm now realizing we have to brute-force our way to find the checksum if we process the data this way. There really can't be any configuration that works as FIFO if the states that the encoder and decoder visit are the same. I can justify this by an example: abstractly, say we want to encode "abc".
    Encoding: x0 --(a)--> x1 --(b)--> x2 --(c)--> x3
    We start off from a known initial state x0 and work our way forward. What information each state holds: x0 - no information; x1 - "a"; x2 - "ab"; x3 - "abc".
    Decoding: x3 --(a)--> x4 --(b)--> x5 --(c)--> x6
    x3 - "abc"; x4 - "bc"; x5 - "c"; x6 - no information.
    There's no reason for x1=x5 or x2=x4. I haven't found a solution but I'm working on it. Basically I have to define new C(x,s) and D(x). Jarek, when working on ANS, did you discover the decoder or the encoder first? (In the paper it seems it was the decoder.)
    4 replies | 385 view(s)
  • Jarek's Avatar
    25th February 2021, 09:09
    I have searched for a FIFO ANS but without success - see e.g. Section 3.7 of https://arxiv.org/pdf/1311.2540.pdf In practice, e.g. in JPEG XL, the chosen state (initial for the encoder, final for the decoder) is used as a checksum; alternatively, we can put some information there to compensate for the cost. Another such open problem is the extension of the uABS approach to a larger alphabet ...
    4 replies | 385 view(s)
  • Shelwien's Avatar
    25th February 2021, 04:09
    > we can't generally compress encrypted data?
    More like we can assume that any compressibility means a reduction of encryption strength. But in practice processing speed is a significant factor, so somewhat-compressible encrypted data is not rare.
    > extract useful data from an encrypted dataset without decrypting the dataset
    One example is encrypting a data block using its hash as the key (such keys are then separately stored and encrypted using the main key). In this case it's possible to apply dedup to the encrypted data, since identical blocks would be encrypted with identical keys and produce identical encrypted data.
    > the possibility of compressing encrypted data
    Anything useful for compression also provides corresponding attack vectors, so it's not really a good idea. From a security p.o.v. it's much better to compress the data first, then encrypt it.
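    A toy sketch of that dedup-friendly idea, purely my own illustration (not Shelwien's scheme or any real system): toy_hash and toy_encrypt are made-up stand-ins for a real hash and cipher (think SHA-256 + AES in practice). Because each block's key is derived from the block itself, identical plaintext blocks produce identical ciphertext blocks and can be deduplicated without decryption.

        /* NOT real crypto - illustration of convergent-encryption-style dedup only. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        static uint64_t toy_hash(const uint8_t *p, size_t n) {      /* FNV-1a, toy only */
            uint64_t h = 1469598103934665603ULL;
            for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
            return h;
        }

        static void toy_encrypt(const uint8_t *in, uint8_t *out, size_t n, uint64_t key) {
            uint64_t x = key;                                       /* XOR keystream, toy only */
            for (size_t i = 0; i < n; i++) {
                x = x * 6364136223846793005ULL + 1442695040888963407ULL;
                out[i] = in[i] ^ (uint8_t)(x >> 56);
            }
        }

        int main(void) {
            uint8_t a[16] = "same same block", b[16] = "same same block", c[16] = "different block";
            uint8_t ea[16], eb[16], ec[16];

            toy_encrypt(a, ea, 16, toy_hash(a, 16));   /* key = hash of the block itself */
            toy_encrypt(b, eb, 16, toy_hash(b, 16));
            toy_encrypt(c, ec, 16, toy_hash(c, 16));

            /* identical plaintext -> identical ciphertext -> dedup works on encrypted data */
            printf("a vs b encrypted: %s\n", memcmp(ea, eb, 16) == 0 ? "dedupable" : "different");
            printf("a vs c encrypted: %s\n", memcmp(ea, ec, 16) == 0 ? "dedupable" : "different");
            return 0;
        }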
    1 replies | 204 view(s)
  • Shelwien's Avatar
    25th February 2021, 03:44
    RC fills a symbol interval with the intervals of the next symbols. ANS fills it with previously encoded symbols. The arithmetic operations are actually pretty similar (taking into account that CPU division produces both quotient and remainder at once), so we could say that RC is the FIFO ANS. https://encode.su/threads/3542?p=68051&pp=1 Also, all good compressed formats need blocks anyway, so ANS being LIFO is usually not a problem.
    4 replies | 385 view(s)
  • Gribok's Avatar
    25th February 2021, 02:02
    I have no plans on updating bsc.
    6 replies | 586 view(s)