Activity Stream

  • Gotty's Avatar
    Today, 10:04
    @urntme I implemented your algorithm with 3-byte codewords and automatic space insertion and built an optimal dictionary per file to find out the best compression theoretically possible. So - I don't have a general dictionary. The aim is to find the theoretical limit per file. - It is not optimized for speed - readability is more important at this stage. - Speed: it encodes a 100 MB file in 1 sec, and decodes it in half a sec (tested with text8). It encodes/decodes anything small in way less than a millisecond. A tweet message would be a couple of microseconds. - However, compression-wise it cannot get better than 50%. That seems to be the average best case (average theoretical limit) - even when using an optimal dictionary. Unfortunately, when there's a word or chunk that is not in the dictionary it seriously bloats the file. Any quote, comma, bracket, full stop or typo (or an unknown word) is a serious show stopper, and with such things in the content to be compressed the algo loses quite often (i.e. size > 100%). So its performance (compression ratio and speed) in the optimal case is similar to an order-0 model (like a speed-optimized fpaq variant), but the latter is 1) not limited to files containing English text only, and 2) uses way less memory (does not need a dictionary). Without sacrificing speed for some cleverness, it does not look promising so far. Would you consider refining it? You've got plenty of ideas from us above. I could go ahead and refine it myself, but you are the author, it's your child. You will also need to update your pdf file. Also include the details we were discussing above when we were trying to understand your idea better. You now know what information we were missing. You will also need to include (describe) an algorithm for how you would build a dictionary for the general case. It's your turn ;-) (A rough sketch of this kind of fixed-codeword setup is given below.)
    25 replies | 839 view(s)
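    For illustration, a minimal Python sketch of the fixed-codeword setup discussed in this post: a per-file dictionary (not stored), 3-byte codewords, and a space re-inserted after every word on decode. It is an assumption-laden toy, not Gotty's actual test code.

        # Minimal sketch: fixed 3-byte codewords per word, per-file dictionary,
        # automatic space insertion on decode. Illustration only.

        def build_dictionary(text):
            # One entry per distinct token; the dictionary is per file and not stored.
            dictionary = {}
            for w in text.split(" "):
                if w not in dictionary:
                    dictionary[w] = len(dictionary)      # index becomes the 3-byte codeword
            assert len(dictionary) <= 1 << 24, "3-byte codewords allow at most 2^24 entries"
            return dictionary

        def encode(text, dictionary):
            out = bytearray()
            for w in text.split(" "):
                out += dictionary[w].to_bytes(3, "big")  # every word costs exactly 3 bytes
            return bytes(out)

        def decode(data, dictionary):
            inverse = {v: k for k, v in dictionary.items()}
            words = [inverse[int.from_bytes(data[i:i+3], "big")]
                     for i in range(0, len(data), 3)]
            return " ".join(words)                       # the deleted spaces come back here

        if __name__ == "__main__":
            sample = "the quick brown fox jumps over the lazy dog"
            d = build_dictionary(sample)
            packed = encode(sample, d)
            print(len(sample), "->", len(packed), "bytes")   # 43 -> 27 bytes in this toy example
            assert decode(packed, d) == sample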
  • Trench's Avatar
    Today, 06:28
    "A dead man's switch (see alternative names) is a switch that is designed to be activated or deactivated if the human operator becomes incapacitated, such as through death, loss of consciousness, or being bodily removed from control. Originally applied to switches on a vehicle or machine, it has since come to be used to describe other intangible uses like in computer software." (weakpedia.. i mean wikipedia) To all newcomers to the forum if you say you have something that can do what no other person figured out a solution for try to have a back up just in case. Maybe you are afraid your underwear for a mask is not enough to protect you, or you might get a impossible computer virus in the brain which a computer virus can't cross into other species but ok. How about giving your info to someone else as a last resort and what better place than someone in this forum... I assume. Despite they didn't figure it out yet or never will it's not their fault they are taught to think the way they do, since they are they and you are you. Despite this message is slightly amusing to lighten the mood it should be taken seriously... If you care. Who you choose is up to you. Usually the serious ones have their email to send info... (to members. What you don't? Fine you get nothing and like it.) Or you can send it to multiple people you know but you should also send at least to another person you don't know after a certain time frame. Not to some homeless stranger on the street they don't count. Also don't be naive to think certain people will take good care of the info. Plenty of naive people you don't have to be one too. Or to someone with a above average IQ... well good luck on that. Here is one site that might help but you can use your own if you like. deadmansswitch.net Even some emails can send thing at certain time frames but I would not trust the security on those depending on the site since even some cloud storage sites or emails can access and erase content they do not find acceptable or even erase your account if not used after a while. Maybe some people below can give advise... or maybe not. Or just give up since no one deserves it... but then again you deserve nothing others have given you too then. :P You are not allowed to see this. Fine, say hello if you did.
    0 replies | 17 view(s)
  • Trench's Avatar
    Today, 05:03
    LOL Svenben, your very amusing opinion is noted. :) Whatever makes you feel better to cope with the fear you have. Put on a mask, with goggles and an oxygen tank as well if it makes you feel comfortable. Here is a basic way to find truth: listen to all sides and expand on them. As much of a reason as you have to think x is good, have an equal amount to see how x is bad. I know it's hard since many are conditioned to be lazy and do what they know. I only stated the sources from news sites stating these facts. If you can't read between the lines, oh well, but I don't expect the average intellect person to get things. It's like the joke of 6 ft apart, which you take to heart, but particles are in the air for hours and people pass through them, and you get 10000 various viruses a day, yet fine. It is obvious you don't know what a virus is, and many confuse it with bacteria, but then again most don't know what a bacteria is either. Most people know nothing compared to the vast knowledge out there, and it is amusing what they think. Which is why your comment is very funny; it reminds me of a child that thinks it knows enough after obtaining some info. You ever hear of the phrase conflict of interest? It is not admissible as evidence no matter where you go, unless it's a banana republic. There are more sources from research and doctors, but people want to believe what they want to make them feel comfortable. Don't make the mistake again of assuming you know more than the other, which you did now, or of believing an unelected bureaucrat that could not make it in the private sector properly, which is what you side with. Think wisely. The good news is so many people are dying less from so many other diseases, it's a miracle. LOL
    4 replies | 345 view(s)
  • fabiorug's Avatar
    Yesterday, 14:53
    In my opinion, only at -q 93 -s 3 -d 0.83 -s 7 does the JPEG XL build of 07 September 2020 have higher quality than other codecs such as the NHW codec; for lower qualities, the artifacts of NHW could look good and you can save space. With this re-encoding I got 300724. But obviously, starting from a PNG, normal JPEG XL such as -d 1 looks way better. For encoding beards etc. that need neatness, it is probably not the best codec.
    106 replies | 8521 view(s)
  • Dresdenboy's Avatar
    Yesterday, 13:15
    I thought about eventually creating some simple prototype (with a small number of lines of code), so we could test it. That sounds plausible to me. Detecting the best fitting dictionary sounds interesting. With splitting them, there are many interesting ways to optimize loading them.
    25 replies | 839 view(s)
  • urntme's Avatar
    Yesterday, 10:04
    This is a really cool idea. Thanks for sharing. That is true. You make some really interesting points. Thank you for your comments. However, what I'm thinking now is a compromise of sorts. We use variable encoding, however, we fix the size of the codes to bytes instead of bits. That is, we use 1 - 3 bytes for each code, so codes like spaces and dots would get 1 byte, however, words like table or manager would be three bytes. We get speed because we are not working at the bit level, and we save space because a word costs at most 3 bytes of data. This compromise would result in great speed, however, we would lose some of our compression ratio and dictionary size. But, we would also have no formatting errors in the process. About your point of this being costly to load the data for instant messaging; I was thinking that as the user types the message, the program in the background finds out which dictionaries it would need to compress the data and loads only those dictionaries while the message is being typed. As you said, we could have multiple indexed dictionaries present and have multiple sets of small dictionaries. I don't think this has been done before. It would provide speed, and it would be efficient. Thoughts?
    25 replies | 839 view(s)
  • Dresdenboy's Avatar
    Yesterday, 08:11
    It will likely be fast. Thanks to a fixed size index there are some optimization opportunities. If you're using variable length encoding, they will be a bit different. But for example (keep in mind my cache remark about the processor processing this data) you could fill your 3 index bytes a bit differently (creating groups of dictionaries). If the dictionary is sorted by typical frequency of the words in texts, the most common ones would group together (low index). If the dictionary is split into 256 dictionaries of 64K entries each (with many of them not used), the first byte of your index could just select the dictionary and the next 2 bytes are an index into it (see the sketch below). Or you use 64K 256-word dictionaries or alphabets. This way there would be a much more efficient use of memory while compressing/decompressing. But instant messaging has a different meaning - and very small text sizes. SMAZ is aimed at this task. But just imagine what happens: a ~320 byte message every n minutes (or seconds at best). The processor's cores did a lot of other stuff already in the meantime, clearing the dictionaries from the cache. Reloading 2+ MB of data each time to compress ~200 bytes on average is costly. This might be the case for scientific or technical texts.
    25 replies | 839 view(s)
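    A small sketch of the index-splitting idea above, assuming a frequency-sorted dictionary: the first byte selects one of 256 sub-dictionaries of up to 64K entries, the remaining two bytes are the offset inside it, so a short message usually touches only a couple of (cache-friendly) groups.

        # Sketch: split a flat 24-bit word index into (group, offset) so that only
        # the sub-dictionaries actually referenced by a message need to be resident.

        def split_index(word_index):
            # word_index = position in one big frequency-sorted dictionary
            group  = word_index >> 16          # first byte: which 64K-entry sub-dictionary
            offset = word_index & 0xFFFF       # next two bytes: position inside that group
            return group, offset

        def join_index(group, offset):
            return (group << 16) | offset

        def groups_needed(word_indices):
            # While a message is being typed, this tells which sub-dictionaries to load.
            return sorted({idx >> 16 for idx in word_indices})

        if __name__ == "__main__":
            # Hypothetical frequency-sorted indices for the words of a short message.
            message_indices = [3, 17, 90_000, 42, 90_001]
            print(groups_needed(message_indices))      # [0, 1] -> only 2 of 256 groups needed
            g, o = split_index(90_000)
            assert join_index(g, o) == 90_000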
  • urntme's Avatar
    Yesterday, 05:19
    The hope is that these would be stored in the dictionary. But if they're not, they would bloat the file. This would also cause bloat of the file. Well, I was thinking it would handle it similar to how compilers handle it. The first one would not use a space on the right and the second one would not use a space on the left. You bring up interesting points. But the easiest answer in all these cases is that it would bloat the file. To be honest I don't really have a concrete speed in mind. I just thought it would be a simple/fast encoding method. I suppose this would depend on the exact application in mind. I was actually thinking it could be used in instant messaging apps such as Twitter. Twitter would be a perfect example. I suppose for that the much simpler method would be just to use variable codes, a static dictionary, and encode all the spaces and all special characters in order of probability. But such a thing probably already exists. Due to the instant nature of these instant messaging apps, you want something light and quick to encode them. My method would be a good teaching method I feel. To learn how compression could work. It's simple to learn and simple to implement. So it could probably be used for that. Also, as you mentioned, it might be useful if it is specialized for coding applications such as HTML/JAVA etc. ------------------- Yes. You are probably right on this. Error correction could be a real problem if it is used in a generalized case. I just need to find an application where all the words in the dictionary have a somewhat equal probability of occurring. That would be an advantage for this. Plus the speed would also be an advantage if it's required somewhere. Hmmmm. Dunno. I need to think about this.
    25 replies | 839 view(s)
  • SvenBent's Avatar
    17th September 2020, 22:20
    I agree with Gotty on many points. Once you have to add special flags for things that are not just words but also punctuation, indents, quotes, acronyms, capitalization etc. etc., you realize that adding one extra flag to deal with shorter symbols for words really does not add a lot of extra complexity, as you have already done this several times over just for "error correction" in the code. The code is only simple/fast because it completely ignores a big chunk of its task.
    25 replies | 839 view(s)
  • Shelwien's Avatar
    17th September 2020, 18:23
    http://mattmahoney.net/dc/dce.html#Section_525 Basically, use the offset within a hashtable cell instead of the global distance (a rough sketch of the idea is given below). The same has to be done during decoding, so decoding is slower. But compression is better. It's something like an intermediate step from LZ to PPM. Christian's RZ is ROLZ.
    36 replies | 1712 view(s)
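    A heavily reduced sketch of the ROLZ idea mentioned above: the match is referenced by its slot among the few positions previously seen under the current context, not by a global distance. Context size, slot count and the heuristics here are arbitrary assumptions; real ROLZ coders such as Christian's RZ are far more elaborate.

        # Reduced-offset (ROLZ-style) match referencing: the "offset" is an index
        # into the short list of previous positions that shared the current context.
        # The decoder must maintain the same per-context lists, hence slower decoding.

        from collections import defaultdict

        MAX_SLOTS = 16   # positions remembered per context

        def rolz_matches(data, min_match=3):
            """Yield (pos, slot, length) for matches found under a 2-byte context."""
            table = defaultdict(list)            # context -> most recent positions
            i = 2
            while i < len(data):
                ctx = data[i-2:i]
                best = None
                for slot, cand in enumerate(table[ctx]):
                    length = 0
                    while (i + length < len(data) and cand + length < i
                           and data[cand + length] == data[i + length]):
                        length += 1
                    if length >= min_match and (best is None or length > best[1]):
                        best = (slot, length)    # a small slot number replaces a big distance
                table[ctx].insert(0, i)
                del table[ctx][MAX_SLOTS:]
                if best:
                    yield i, best[0], best[1]
                    i += best[1]                 # simplification: no table updates inside the match
                else:
                    i += 1

        if __name__ == "__main__":
            text = b"the cat sat on the mat, the cat sat on the hat"
            for pos, slot, length in rolz_matches(text):
                print(f"pos={pos:2d} slot={slot} len={length}")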
  • Gotty's Avatar
    17th September 2020, 17:09
    Back to your initial request: your algorithm is specialized in compressing only (human-written) pure ASCII text very fast. Since we already know that it would not be able to compress (it would even bloat) any other content, it has a very limited use case. Unfortunately I don't know any field that would need something like that. If the algorithm were specialized for HTML or JavaScript or JSON compression...
    25 replies | 839 view(s)
  • Gotty's Avatar
    17th September 2020, 16:59
    I don't think variable length codes would be that much slower. But compression would be significantly better. What is your goal speed-wise? Do you have a concrete one? Would you want to gain some milliseconds and compress much worse, or compress much better and lose some milliseconds? I believe simplicity is also not a problem. The algorithm is very, very simple in either case.
    25 replies | 839 view(s)
  • Dresdenboy's Avatar
    17th September 2020, 16:50
    English Wikipedia removed the article because it wasn't citing well known sources. But this was years ago and it might be put up again as there are more references now.
    36 replies | 1712 view(s)
  • Gotty's Avatar
    17th September 2020, 16:40
    Ah, my bad, sorry. I made a fast calculation and didn't think it through. It's 6%. Fixed my post. I see. How do you encode and decode "O'Brian", "well-known", "higher/lower" etc.? You can't delete and insert a space in these cases. How do you plan on encoding multiple spaces? They are used for indentation in many clear text files. I'm also interested in how you know where to insert or delete a space when there are (double or single) quotes. Example: 'Your paper titled "Disha Format" is an interesting one.' Which of the spaces are you going to remove (and add back during decompression) in this case? The ones adjacent to the quotes are ambiguously either left or right of the word. The same is true for parentheses, brackets and braces, but by establishing a rule (no space after "(", "[", "{") it's possible to hack most occurrences. But since open and end quotes are similar, it's not really possible to establish such replacement rules that would meet your goal of having max speed. (See the sketch below.)
    25 replies | 839 view(s)
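    To make the quote problem concrete, here is a toy decoder-side rule set, assuming the encoder deleted one space adjacent to each punctuation mark: one-sided characters like brackets and commas can be handled with a fixed rule, but a straight double quote gives no local clue whether the missing space sits on its left or its right.

        # Toy space re-insertion after word-level decoding. Fixed rules work for
        # one-sided punctuation; straight quotes are ambiguous because the same
        # character both opens and closes.

        NO_SPACE_BEFORE = set(".,;:!?)]}")   # glue these to the previous word
        NO_SPACE_AFTER  = set("([{")         # glue the next word to these

        def rejoin(tokens):
            out = []
            for tok in tokens:
                if out and tok not in NO_SPACE_BEFORE and out[-1] not in NO_SPACE_AFTER:
                    out.append(" ")
                out.append(tok)
            return "".join(out)

        if __name__ == "__main__":
            ok = ["Your", "paper", "(", "the", "draft", ")", "is", "interesting", "."]
            print(rejoin(ok))         # -> Your paper (the draft) is interesting.

            # No purely local rule exists for '"': it may need the space on its
            # left (opening quote) or on its right (closing quote).
            ambiguous = ["Your", "paper", "titled", '"', "Disha", "Format", '"', "is", "interesting", "."]
            print(rejoin(ambiguous))  # spaces around the quotes come out wrong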
  • lz77's Avatar
    17th September 2020, 16:14
    Regarding ROLZ, I have only seen the idea at ru.wikipedia.org/wiki/ROLZ.
    36 replies | 1712 view(s)
  • anormal's Avatar
    17th September 2020, 13:47
    I remember some guy did a bat file which uses every combination of 7z compression params to test a file. I asked the guy behind 7z-STD (great fork) if this could be implemented, just for fun :D (regardless of the time/space needed to test).
    7 replies | 800 view(s)
  • urntme's Avatar
    17th September 2020, 10:27
    First of all thank you all for your comments and your interest. Below are my replies. ---------- Wow. Thank you for sharing the link. It's quite an interesting concept. Plus it's got the code, so I can learn the code from it too. It's a really cool link. Thank you for sharing. The 3 spaces are deleted and not stored. I mentioned this too in the algorithm. If you store the spaces you would actually gain bytes. This is because the average space taken by the uncompressed txt file is 4.7 bytes per word and 1 byte per space. Since every word accompanies a space, this is around 5.7 bytes. However, we use 3 bytes per word and delete the space. As a rule, when the file is being output again, you insert a space after every word. If you store the space after every word, we would need 3 bytes per word + 3 bytes per space, which would mean we would use 6 bytes per word + space combo. So we would actually gain size (see the accounting sketch below). A line break is given 3 bytes, however, but that should be rare. Going by your results, I'm actually surprised that you gained fewer bytes than expected. You only saw a gain of about 2%, which is surprisingly less than expected. But I believe you are right. If you use a variable length code word for each word, you might save more space, so in that way you could say it is sub-optimal if your goal is only space savings. That's not the point. The point is we save time and overhead costs this way. The advantage is speed rather than the most optimal compression ratio. Hmmm. That's quite an interesting text file. Half the words is "the"!? What kind of text is that. Lol. It's funny. The reason we are not using variable length coding is because we want simplicity and speed. However, you are probably right. If you used variable length coding, then you could possibly get more compression out of it. Undeniably. Especially for this particular text file you tested it on. First of all thank you for testing the concept. It's great. But I would like to kindly request you to post the results after testing it without storing the space because, as I mentioned above, it basically inflates the size. Also please do mention the time it took to process it. ------------- Well, the method is slightly different than that. We are using fixed 3 byte codes for each word. 2 byte codes have been suggested by others and me, but it depends on the test file and the algorithm. This method gives great speed and is simpler to implement. Yes. You are right. This would be one way of going about it. Thank you for sharing. I have suggested that the start-capitalized words and the fully capitalized words would be stored as separate words in the dictionary. Since we are using a fixed set of bytes, there is enough room in the dictionary for them. -------------- Wow. That's quite a story. Thank you for sharing it. The ideas you present are quite interesting. Thank you for sharing them. ---------- Those are quite interesting ideas. Thank you for sharing them. The story is also quite interesting. Thank you for sharing. Yes. Who knows what will happen. Thank you for your comments and your encouragement. Thank you for sharing. --------- Thank you all for your comments and messages. Thank you all for the discussion.
    25 replies | 839 view(s)
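    A short back-of-the-envelope helper for the accounting above, using the assumptions stated in the post (about 4.7 letters per word, one space per word, 3-byte codewords, the space deleted on encode and re-inserted on decode); the figures are illustrative, not measurements.

        # Accounting for the "delete the space, re-insert it on decode" rule.

        AVG_WORD_LEN   = 4.7   # average letters per English word (assumption from the post)
        CODEWORD_BYTES = 3

        def bytes_per_word(store_space_as_codeword):
            plain = AVG_WORD_LEN + 1                       # word + one following space
            if store_space_as_codeword:
                coded = CODEWORD_BYTES * 2                 # word codeword + space codeword
            else:
                coded = CODEWORD_BYTES                     # space is implicit, re-added on decode
            return plain, coded

        if __name__ == "__main__":
            for flag in (True, False):
                plain, coded = bytes_per_word(flag)
                label = "space stored  " if flag else "space implicit"
                print(f"{label}: {plain:.1f} -> {coded} bytes per word "
                      f"({100 * (1 - coded / plain):.0f}% saved)")
            # -> space stored  : 5.7 -> 6 bytes per word (-5% saved), i.e. expansion
            # -> space implicit: 5.7 -> 3 bytes per word (47% saved)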
  • pklat's Avatar
    17th September 2020, 09:11
    pklat replied to a thread GPU compression in Data Compression
    I doubt it, at least for decompression. IIRC, there are already specialized chips for AI. They are probably faster and more efficient than GPU. Anyway old CPUs can do anything just much slower.
    1 replies | 133 view(s)
  • JamesWasil's Avatar
    17th September 2020, 07:04
    Do you think that if GPU compressors become more useful that it will change things drastically to where people who use computers and server systems without a specific GPU will not be able to use them? Or will they have to emulate the hardware which might still work, but defeats the purpose of using the GPU? Will they be locked out of compression or the ability to decompress things done with a specific GPU or other hardware if they do not have the equivalent hardware to match it in the future? After seeing TensorFlow, I wondered about this.
    1 replies | 133 view(s)
  • JamesWasil's Avatar
    17th September 2020, 06:37
    As an example of the 2 word field only from the past, if the ASCII symbol A represented the word "Hi" on a text database: If you read a two letter word like "Hi" which could be uppercase or lowercase as 16 bits, the output would be: 0001 (4 bits) + A (8 bits) = "HI" (all characters uppercase, without suffixes) 0100 (4 bits) + A (8 bits) = "hi" 01010 (5 Bits) + A (8 bits) = "hi-" 01011 (5 bits) + A (8 bits) = "hi " 01100 (5 bits) + A (8 bits) = "hi!" 01101 (5 bits) + A (8 bits) = "hi?" 01110 (5 bits) + A (8 bits) = "hi," 01111 (5 bits) + A (8 bits) = "hi (CR/LF)" 001 (3 bits) = If the next 3 text letters were unknown symbols/letters, they were output as literals and were basically 9 bit codes rather than 8. 00001 (5 bits) = If the word was 3 bytes or larger or a prefix of another word like "Hip" that existed on a dictionary a level up, step up to the next dictionary size and search for a word or data set that matches and use flag bits after this code for them instead of staying on the 2 text symbol level. 00000 (5 bits) = Use a text-based LZ buffer that matches 2 text symbols from the last 128 characters that were read and output even if literal and no match was found. For all symbols that were 4 bits or larger starting with 0100, the first bit could be set to 0 or 1 or inverted, where 0 meant it was lowercase, and 1 meant the first letter was uppercase. If 1 was read first, it was automatically known to be a 4 bit code or larger. If a 0 was read, it could be a variable bit code from 3 bits to 6 bits. It could size up or size down on the dictionary dynamically based on the prefix codes, and go from entire sentences to 2 letter words and then back up to 5 to 10 letter words and sentences later based on the codes read ahead of time. For 2 letter words it only saved about 3 or 4 bits, but for 3 letter words and larger it saved a significant amount, and often things that PKZIP and other compressors weren't seeing as text patterns because they were designed to look for matches differently and not as words and English analyzed. The LZ buffer helped a lot too whenever there weren't matches, because it checked that first and avoided a lot of 9 bit literal output codes by not having to switch to that and read/expand the next 3 symbols by 1 bit each if they weren't seen anywhere on the current dictionary level or directory. There were other modifications and changes made to this a year or two later that were ahead of the above layout that I worked on when I turned 18 that made it even smaller, but by then I remember being cheesed about the fact that 7zip had just been released and was getting the text compression smaller than the above method was, even with all the neat additions to make it versatile. I decided that it was a lot of CPU time and file cost to keep doing it with marginal gains and differences between 7zip and LZMA large windows, and abandoned development on it after that. It could have been improved further though. By then it was using branch trees and other logic that sped it up enough to be useful and practical, but I moved on to other things by then. Maybe you can use a similar strategy to build and improve upon that for text compression with Disha? I know it's very old and the example is more of a basic layout from one of the first methods I used for it, but perhaps you could extend it and adapt it to your method if you decide to use static 3 byte codes still and make it useful to you? 
The nice thing about it too is that it can be made fast even if it uses larger dictionaries, with the right logic to move and look up quickly. That, and that it requires no statistical preprocessing and you can do it on-the-fly with network or file reads. There are built-in statistics to the English language that make this possible and favorable already. Back then the internet was new to the public and I was developing on my own as I went without references for anything, only comp.compression on USENET if I was lucky. But now, you could network with members here on this forum and elsewhere to really make a great text compressor with that and the ideas from the many users and contributions made here by others. If you think Disha can really be a strong text compressor, consider these ideas and keep at it. It may be with the right changes and planning. :) (A rough sketch of the 2-letter-word layer described above is given below.)
    25 replies | 839 view(s)
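    As a rough reconstruction of the 2-letter-word layer listed above (assumptions: MSB-first bit packing, the dictionary byte follows the prefix bits, a toy dictionary; this is not the original MS-DOS code), the layout could be sketched like this:

        # Rough reconstruction of the 2-letter-word codes described above.
        # Illustration of the code layout only.

        TWO_LETTER_DICT = {"hi": ord("A"), "to": ord("B"), "of": ord("C")}  # toy dictionary

        PREFIX = {                 # prefix bits from the table in the post above
            "UPPER": "0001",       # whole word uppercase, no suffix
            "lower": "0100",       # whole word lowercase, no suffix
            "-":     "01010",
            " ":     "01011",
            "!":     "01100",
            "?":     "01101",
            ",":     "01110",
            "\r\n":  "01111",
        }

        def encode_two_letter(word, suffix=None):
            """Return the bit string for one dictionary word plus an optional suffix."""
            index = TWO_LETTER_DICT[word.lower()]
            if suffix is not None:
                prefix = PREFIX[suffix]
            else:
                prefix = PREFIX["UPPER"] if word.isupper() else PREFIX["lower"]
            return prefix + format(index, "08b")

        def pack(bits):
            bits += "0" * (-len(bits) % 8)                 # pad the final byte
            return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

        if __name__ == "__main__":
            stream = encode_two_letter("HI") + encode_two_letter("hi", " ") + encode_two_letter("hi", "!")
            print(stream, "->", len(pack(stream)), "bytes")   # 3 words in 5 bytes vs 8 bytes raw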
  • SvenBent's Avatar
    17th September 2020, 06:01
    Your deductions are laughably ignorant and based on not knowing how to draw a proper scientific conclusion. You are pretty much just embarrassing yourself by showing how easily you are fooled by not understanding statistics. I've only checked 2 of your sources and they do not prove what you think they prove. That's the end of my time to waste on this kind of low mental rambling.
    4 replies | 345 view(s)
  • JamesWasil's Avatar
    17th September 2020, 05:41
    25 years ago, when I was about 16 turning 17, I wrote what might be considered a variant of "Disha". It started out with 1 byte codes for the smallest words of the English language, but used an extendable prefix bit that was variable length, similar to a huffman code in a way, that determined the size of the bytes that represented the encoded word. The dictionary was divided to word sizes and output code sizes where each one had a separate directory (this was done initially in MSDOS on a 286). If the prefix bits grew to certain lengths, the directory would switch to larger databases for words, phrases, and entire sentences that were commonly used. If there were partial matches, there were indicator bit flags for larger words to where each letter could use a 1 to 2 bit uppercase or lowercase bit flag to make sure that having one letter capitalized or misspelled was still compressed if it was a heuristic match. While it did work and saved a vast amount of space for text files (more than Pkzip at the time which I was pleased with), it was horribly slow unless you used a RAM drive, and back then PCs were lucky to have 4MB, 8MB, or 16MB at most to work with commonly. Disha sounds like a 3 byte fixed version attempting to account for all word sizes that it encounters, but as SvenBent pointed out, that usually doesn't save much and can actually expand it due to the fact that words like "to" and "or" will be used where one byte will actually be gained rather than compressed, while other words that are common like "the" "and" or "how" will break even without a flag bit or any modifier and not give compression, but will get larger slowly if flag bits and descriptors are used. The majority of your text messages with 3 byte output codes will have to be at least 4 letters and larger for you to gain 1 byte or more per output for your compression sequences. Statistically, there are too many short words used more frequently and long words less frequently for text compression to benefit from that unless you use something dynamic like I was doing as a teenager experimenting with text compression, or you use another method that lets you adjust to variable sizes where you can still have gains with 2 and 3 letter words any other way. You could try to leverage the fact that most words have a SPACE or ASCII 32 suffix at the end of each word and represent that with an extended flag bit to save the difference and try to break even or compress more that way. Maybe extend it to other common symbols like commas, periods, question marks, exclamation marks, and CR+LF and save the difference on the bits to represent that since prefix bits will let you read and account for that before it tries to work with an output code even if you keep it static at 3 bytes.
    25 replies | 839 view(s)
  • SvenBent's Avatar
    17th September 2020, 01:48
    I might have missed something, but is this not just the same as dictionary preprocessing of text? Replacing all "known" words with a simple value before modeling is applied? This "compression" also uses a fixed cost for everything even though occurrences differ, so we are wasting a lot of bits compared to doing a more cost/occurrence analysis. Instead of 3 bytes per word I would do something like this: a 1-bit flag for long or short words. If the 1st bit is set, it means we only read that one byte, but this leaves 7 bits for 128 low cost words (they only cost 1 byte). If the 1st bit is not set, we read the next 2 bytes as well; this leaves us with 23 bits for 8M full cost words, still more than plenty for the document's claim of "171476 words". This would save 2 bytes on the 128 most frequent words (see the sketch below). How are we dealing with capitalized words? Start capitalization, full capitalization?
    25 replies | 839 view(s)
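    A tiny sketch of the split described above, under the assumption that the top bit of the first byte is the flag: 128 one-byte codes for the most frequent words, 23 payload bits (about 8.3 million codes) for everything else.

        # 1-bit short/long split: top bit set -> one-byte code for one of the 128
        # most frequent words; top bit clear -> 3-byte code with 23 payload bits.

        def encode_word(rank):
            """rank = position in a frequency-sorted dictionary (0 = most frequent)."""
            if rank < 128:
                return bytes([0x80 | rank])                # 1 byte: flag=1 + 7-bit rank
            assert rank < (1 << 23), "23 bits of long-word space exhausted"
            return rank.to_bytes(3, "big")                 # 3 bytes: flag=0 + 23-bit rank

        def decode_word(buf, pos):
            """Return (rank, next_pos)."""
            if buf[pos] & 0x80:
                return buf[pos] & 0x7F, pos + 1
            return int.from_bytes(buf[pos:pos+3], "big"), pos + 3

        if __name__ == "__main__":
            stream = b"".join(encode_word(r) for r in [2, 500_000, 127, 130])
            pos, ranks = 0, []
            while pos < len(stream):
                rank, pos = decode_word(stream, pos)
                ranks.append(rank)
            assert ranks == [2, 500_000, 127, 130]
            print(len(stream), "bytes")    # 1 + 3 + 1 + 3 = 8 bytes for four words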
  • Gotty's Avatar
    16th September 2020, 23:46
    >>How do you construct the dictionary? Is it constructed during compression or do you use a fixed/predetermined one? >Yes. We use a fixed/predetermined one. Then this is one of the most similar algorithms to your idea: https://github.com/antirez/smaz It uses a static "fragment" dictionary for ascii texts (see the dictionary here: https://github.com/antirez/smaz/blob/master/smaz.c). Not full words as in your case. That's a difference. >>you probably forgot the full stop + the size of the dictionary. >No I believe I mentioned the full stop. It would take 3 bytes for registering the full stop. So it would gain here. Oh, OK. Then what happens with the 3 spaces? I feel that using a fixed-codeword approach for content that follows Zipf's law (English text) is (very) suboptimal. For such content variable length codewords are much better. Let's see. Let's do an experiment. Take a large ascii text file (like the 100 MB text8 from here: http://mattmahoney.net/dc/textdata.html) and let's grab all the words from it. A sample from the file: anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers The file contains 17 million words (253854 distinct), all lowercase and only a single space in between the words (that's in favor of your algorithm). 6% of the words are "the" (it occurs 106396 times), and 0.7% of the words occur just once (118519 of them). This distribution reflects Zipf's law. When you use 3 bytes for every word, then all the "the" words will be encoded to 3 x 106396 bytes. And you didn't gain anything (3-char word to 3-byte codeword). Why not use a single byte for the most frequent words? And gain 2 x 106396 bytes in the case of "the" alone. Don't you think that using variable length encoding would be the way to go? (A small comparison script is sketched below.) I used your compression idea and actually tried compressing text8, although I'm not very certain yet how you deal with spaces between the words. I know from your comments above that a full stop is 3 bytes and any individual character is 3 bytes. So I supposed that you would encode spaces also as 3-byte codewords. Is that correct? If it is indeed correct, then encoding the 100'000'000 byte text8, which contains only words and a single character (space) in between, your algorithm would encode it to 102'031'242 bytes. If your dictionary did not contain all the words from text8, then the result would be worse of course.
    25 replies | 839 view(s)
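    A small script in the spirit of the experiment above, runnable on any plain ASCII text file such as text8 (the exact counts quoted in the post are Gotty's, not reproduced here). It only counts sizes: a real 1/2/3-byte scheme would need flag bits to stay decodable, as discussed elsewhere in this thread.

        # Compare fixed 3-byte codewords against a frequency-ranked 1/2/3-byte
        # scheme on a plain ASCII text file. Spaces are treated as implicit (one
        # per word). Size accounting only, not a decodable bitstream.

        import sys
        from collections import Counter

        def encoded_sizes(words):
            counts = Counter(words)
            ranked = [w for w, _ in counts.most_common()]   # rank 0 = most frequent ("the")

            fixed = 3 * len(words)                          # every word costs 3 bytes

            cost = {}
            for rank, w in enumerate(ranked):               # 1 byte for the top 256 words, then 2, then 3
                cost[w] = 1 if rank < 256 else 2 if rank < 256 + 65536 else 3
            variable = sum(cost[w] * counts[w] for w in counts)

            return fixed, variable, len(words), len(counts)

        if __name__ == "__main__":
            path = sys.argv[1] if len(sys.argv) > 1 else "text8"
            with open(path, "r", encoding="ascii") as f:
                words = f.read().split()
            fixed, variable, total, distinct = encoded_sizes(words)
            print(f"{total} words, {distinct} distinct")
            print(f"fixed 3-byte codewords : {fixed} bytes")
            print(f"ranked 1/2/3-byte codes: {variable} bytes")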
  • Raphael Canut's Avatar
    16th September 2020, 23:21
    Hello, For those interested, I have made a new version of NHW. This one has more precision and a little more neatness. I still find that NHW has a very good neatness -which I find visually more pleasant than AVIF and x265 for example-, and NHW is still extremely fast to encode/decode, faster than JPEG! There are still many improvements that can be done; the first one is to adapt this proof-of-concept to any arbitrary image size, and for that I wanted to let you know (again...) that I am searching for a sponsor, but it's extremely difficult to "exist/survive" outside of AOM and MPEG. I haven't touched the entropy coding schemes, and so we can still save 2.5KB on average per .nhw file. I'll also try to document the Chroma from Luma technique, as it'll save additional bits on chroma compression. More on this release at: http://nhwcodec.blogspot.com/ Cheers, Raphael
    188 replies | 22134 view(s)
  • fcorbelli's Avatar
    16th September 2020, 17:21
    fcorbelli replied to a thread zpaq updates in Data Compression
    I am not the developer of ZPAQ; that is Matt Mahoney. I do not know if the GPU can be used; ZPAQ uses a rather "unique" approach (maybe even "really weird"). Why do you think that the encryption is outdated? PS: the latest release of my little software is always at http://www.francocorbelli.it/pakka.exe
    2548 replies | 1104479 view(s)
  • Scope's Avatar
    16th September 2020, 17:20
    Yes, and at all other speeds https://slow.pics/c/pRmeSxZ0
    106 replies | 8521 view(s)
  • Shelwien's Avatar
    16th September 2020, 16:34
    > parsing is not suitable, Maybe offline optimization of a heuristic parser? I mean, see what slow parsing optimizer outputs and how to predict its behavior from hash matches and input data? Eg. LZMA a0 is an example of that, and even a1 is not 100% bruteforce. > I'm using 5 hash tables instead... If you need fast encoding anyway, maybe try ROLZ?
    36 replies | 1712 view(s)
  • Jyrki Alakuijala's Avatar
    16th September 2020, 16:30
    I see it, too. Do you see it if you use the fast mode (no -s parameter) with -d 1 ? I'll take this as a priority and try to fix this soon. Thank you!
    106 replies | 8521 view(s)
  • lz77's Avatar
    16th September 2020, 15:10
    It's not low bytes, it's 10000 9-bits offsets (from offset slot № 1). For Rapid compression (40 sec. for compress/decompress for 1Gb, 1 sec. == 1 Mb) parsing is not suitable, I'm using 5 hash tables instead...
    36 replies | 1712 view(s)
  • Shelwien's Avatar
    16th September 2020, 14:40
    Well, low bytes of these offsets (if you mean that, since they don't fit in a byte) do seem pretty random. CM (paq8px,nzcc) does compress them to 9950 or so, but certainly not plain "ARI/tANS". The usual trick is parsing optimization though - there're usually multiple candidates for matches (multiple instances of a word etc), so you can choose the one that is most compressible in context of others.
    36 replies | 1712 view(s)
  • urntme's Avatar
    16th September 2020, 14:37
    Thank you all for your comments. It is very insightful. ------------ Thank you so much for sharing these papers. These are really interesting reads. I like the fact that you took care to put in papers that I could download and that weren't behind a paywall. Thank you so much for that thought. Interesting thesis. I took a quick look at it. He mentions quite the few common methods that are in use. Interesting paper. They use an indexed dictionary here. This is quite a strange paper for me. I didn't quite understand it as completely as I wished. This one was the most interesting paper for me. To be honest, I would need to spend more time with all the papers to understand all the nuances of it. But thank you so much for all the papers. Oh? I suspected that but I wasn't sure. It's really great to have the experts here. This is a really cool forum for Data Compression. Very active and thriving! It's great! :) That is quite true. I love exploring and learning from new ideas and new perspectives. Learning about different ways to think about the same thing at multiple levels is a really interesting adventure. That's great. I believe research should always be about asking the questions we don't know the ideas for. Otherwise why do it? Oh? That's an interesting piece of information. Yes. I guess that would be one way to do it. This was quite an interesting read. What I also found interesting is that he also provides all the coding for his statements. That's really cool to learn from. Yes. This is also an interesting way to do it. Yes. You're absolutely right about that. Oh, is that so? I'll take a look at it sometime. Thank you so much for the link. Yes. Thank you for that. You are absolutely right, we just have to find our place in this world. However I also believe that not everything has been discovered. The knowledge out there is a vast ocean and we know so so very little. Yes. That's true. There's always a lot to learn by experimenting and just trying new ideas. Life is a constant learning process I feel. ------------- Phew. That is a lot of information you've given me. It took me a long time to process all of that. Also, it was a long post. Thank you so much for dedicating the time to post this long post. I am really learning. ------------- Yes. Perhaps. ------------- Thank you all for your comments.
    25 replies | 839 view(s)
  • urntme's Avatar
    16th September 2020, 14:03
    First of all, thank you for your comment. It was really interesting and insightful. Yes. You could be right about that. I was only looking at enterprises where there are a lot of files. For example, in places where they store a lot of files of the same type, each file could save a byte or two for very little overhead cost; if the number of files is large enough, then the space savings could be very significant, and this for very little overhead. You bring up an interesting point about clusters. Certainly saving a byte or two wouldn't matter if that cluster wasn't usable; however, imagine a scenario where a file is only a byte or two over and would require an entire extra cluster just because of those one or two bytes. Then by saving those two bytes, one could save the entire cluster. This might happen rarely, certainly; however, at scale, the savings could be significant at very little overhead cost. This might make it worth it for some enterprises. Furthermore, saving space on a server is not really where the money is saved, since storage costs are quite low. They save money through bandwidth costs at scale. Imagine a million people accessing the same file, which is fairly common these days on many social media platforms; the savings then are worth it, I feel. Moreover, there are many places where they want to save as much space as they can. This could provide a means to do that there. The advantage this algorithm offers is that it is file type agnostic; also, it can compress a file that has already been compressed to its limits, at very little/almost negligible overhead cost. So I feel there could be uses for it. I'm confused about what you mean when you say that this information would be lost. Why would the information be lost while sent over a network? This is something I don't quite understand. Also, if it's being compressed, then this should be the final stage; after all, it can compress even compressed files. The file type doesn't matter here. Yes. That is one way to implement it. You say that this might cause problems when processing the packets; however, I wonder why you say that. What problem would there be in handling additional protocol numbers? Surely it's not a fixed list. They could add protocol numbers at any time. Why would the routers or firewalls react to that? Please elaborate. Certainly, both ends of the node need to be compatible enough to know that they are doing this, but this can be done in a slow and phased manner. If implemented correctly, this could work well. Perhaps not implemented throughout the entire internet at once, but slowly, over time. All networks evolve after all. And in fact, as the traffic and our consumption of data increase, the more savings this would provide. I believe this is a feature the future internet and all networks should have as these systems evolve. Moreover, the most savings are achieved by using this algorithm at all layers. High traffic servers could definitely get a lot of use out of this. That is an interesting comparison. Thank you for your comment.
    6 replies | 333 view(s)
  • e8c's Avatar
    16th September 2020, 13:07
    A few words about the time limits for T2. I like how the WebP team defines its "speed modes": not "fast" and "slow", but "capture mode" and "delivery/storage mode". Suppose a codec is needed for frame-by-frame lossless compression of video captured from the screen while gaming. The gamer has an 8-core CPU, of which only 2 cores can be given to real-time compression. 1080@60 video is 375 MB/s, so the minimum single-threaded throughput of the codec = 200 MB/s. At the same time, the winner in the "T2, Rapid" category could be an asymmetric codec with an "enc / dec" ratio of "36 s / 4 s" == "28 MB/s / 250 MB/s", which is useless for the use case described above.
    2 replies | 88 view(s)
  • lz77's Avatar
    16th September 2020, 12:38
    Hm, I tried to compress offsets.txt from my archive above with lzpm & lzturbo -32 (-39). Both refused to compress and just added their own header to the file... lzpm: 10000 bytes -> 10636 bytes lzturbo: 10000 bytes -> 10033 bytes. But where is ARI/tANS?
    36 replies | 1712 view(s)
  • algorithm's Avatar
    16th September 2020, 10:14
    @MS1 Can we use huge pages for our submissions? I have created a submission whose performance heavily depends on huge pages: 8 sec vs 13 sec. Normally the kernel does it automatically using Transparent Huge Pages, but there is no guarantee. After a fresh reboot with a lot of free memory, Transparent Huge Pages work well, but after memory gets fragmented or filled up, the kernel can't use Transparent Huge Pages. Using normal huge pages can guarantee the use of 2MB pages but needs some simple extra commands (for Linux):
        echo 512 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages   # allocate 512 * 2MB
        time ./app options                                                           # run submission
        echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages     # release memory
    So I ask if we can request a certain amount of huge pages for each submission, such that compression speed will be consistent.
    91 replies | 7327 view(s)
  • necros's Avatar
    16th September 2020, 10:06
    Can someone share a binary of TensorFlow or DeepZip?
    7 replies | 1068 view(s)
  • pklat's Avatar
    16th September 2020, 09:54
    Hmm, perhaps one day accurate 'google translation' could preprocess text into something that can be compressed more easily.
    25 replies | 839 view(s)
  • Dresdenboy's Avatar
    16th September 2020, 07:49
    There are practical problems with using filenames, as they'd just save a little space while being stored on the same kind of filesystem all the time. If the files are being sent over a network (e.g. from a NAS) or being compressed and decompressed, this information would get lost - unless it's implemented in the OS itself. But I'm sure that engineers in the past kept an eye on not wasting too many bytes anywhere. In the end, there are sector sizes (e.g. 512 bytes or 4096 bytes) and block cluster sizes (e.g. 32 KB). If this method doesn't reduce the filesize enough to save one block or cluster, there is no use. Even Windows' internal (and transparent) file compression methods (XPRESS4K, XPRESS8K, XPRESS16K and LZX) wouldn't be used for compressing a file if no block or cluster is freed. There is no use in this while just adding overhead in processing the files. I hope you didn't mean to replace the protocol number completely, as it's being used somewhere, e.g. in routers or firewalls. But if you'd use the unused protocol list entries by mirroring existing protocol numbers, you might transmit one bit in the header. This is very similar to phased-in code or flat binary code as called by Charles Bloom, but inversely adding one bit of information instead of removing one redundant bit of information. See http://cbloomrants.blogspot.com/2013/04/04-30-13-packing-values-in-bits-flat.html. Applying this idea to the protocol codes would mean: blindly mapping protocols 0 - 108 to 144 - 252 (not looking for the most used ones) could transmit one bit: if the received protocol no. is in [0, 108], the bit is '0' and the original protocol no. is the same; if it is in [144, 252], the bit is '1' and the original protocol no. is the received no. minus 144 (see the sketch below). But in the end this might cause problems in processing the packets. This also reminds me a bit of hiding information in images etc. (obfuscation).
    6 replies | 333 view(s)
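    A toy sketch of the mirrored-range idea (the 144-252 unassigned range and the 0-108 mapping are taken from the posts above; as noted, real routers and firewalls inspect this field, so this illustrates the information-theoretic point, not a practical transport):

        # Smuggle one bit through the IPv4 protocol-number field by mirroring
        # protocol numbers 0-108 into the unassigned range 144-252.

        OFFSET = 144          # start of the unassigned range used as the mirror

        def embed_bit(protocol, bit):
            assert 0 <= protocol <= 108, "only protocols 0-108 have a mirror slot"
            return protocol + OFFSET if bit else protocol

        def extract_bit(field):
            if field >= OFFSET:
                return field - OFFSET, 1      # mirrored value: hidden bit is 1
            return field, 0                   # original value: hidden bit is 0

        if __name__ == "__main__":
            for proto, bit in [(6, 0), (6, 1), (17, 1)]:      # 6 = TCP, 17 = UDP
                field = embed_bit(proto, bit)
                assert extract_bit(field) == (proto, bit)
                print(f"proto {proto}, hidden bit {bit} -> field value {field}")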
  • Dresdenboy's Avatar
    16th September 2020, 07:28
    I didn't exactly hear about that 3 byte dictionary index encoding for compression, but with some digging it might show up. I'm back to more intensively (as time permits) experimenting with compression since 2016 and I still find new concepts on my research sessions, sometimes dating back to the 80's or 90's. E.g. some kind of ANS-like encoding in the early 90's by John Boss. So you might find the exact same solution maybe in 2 years. The more exotic, the more it might take. But I found some interesting concepts of doing word based compression, just check following links: Przemysław Skibiński's PhD thesis: http://pskibinski.pl/papers/PhD_thesis.pdf (contains a lot about word processing and word based compression techniques) Paper "Revisiting dictionary-based compression": https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.4520&rep=rep1&type=pdf Paper "A Novel Approach to Compress Centralized Text Data using Indexed Dictionary": https://arxiv.org/ftp/arxiv/papers/1512/1512.07153.pdf Paper "Fast Text Compression Using Multiple Static Dictionaries": https://www.researchgate.net/publication/47690252_Fast_Text_Compression_Using_Multiple_Static_Dictionaries (PDF available) Paper "WordCode using WordTrie": https://www.sciencedirect.com/science/article/pii/S1319157818313417 (PDF available, mentions multiple other methods incl. one which encodes English words using a 19 bit integer) Or check Matt Mahoney's Data Compression Explained page: Matt and other well known compression experts are active on this very forum, BTW. :) While researching something it is necessary to look at existing solutions and compare them to your own. I literally spent tens of hours over the last months just to look for existing solutions to some very specific problem. This is both interesting and also might trigger some interesting ideas and improvements for your own idea. And for many researchers it's always an inherent motivation (let's call it "fun" ;)) to try new ideas. I also saw, what you answered in another posting above. Of course removing 8b means a lot. But there are ways to optimize the 3 byte alignment handling for more speed. A lot of experts here know ways to do this, down to the assembler level. Independent from its performance impact, it also causes bloat. As you wrote (and also mentioned by me) there are just a few thousand words which are most commonly used. Hence my idea of using UTF-8 (and I found references to UTF-8 usage in my linked papers above). This could also be done by using a variable length byte encoding (e.g. use 1 bit in a byte to denote, if length is 1 byte or > 1 bytes, then use similar bits or already a 2 bit count value in the next bytes). You might check Charles Bloom's articles here (with 3 other parts linked there): http://cbloomrants.blogspot.com/2012/09/09-04-12-encoding-values-in-bytes-part-4.html If such a block has been found, a length encoded in a range of the dictionary would be enough. So this turns into 1 encoded length (using your 3 byte value) followed by raw bytes. If there are more, there will just be a sequence of +raw bytes and a non max length block. No error throwing needed. This is an escape mechanism. I think this is the point where one finds out the most by just experimenting with the ideas on real data and gathering statistics. You might also discuss it in Google groups (former usenet), group name is: "comp.compression". You are right about finding a niche. I also found an interesting niche with others (tiny decompressors). 
The big areas are already covered. But it might also just be fun to play around with such algorithms. There is a lot one could learn by doing this.
    25 replies | 839 view(s)
  • urntme's Avatar
    16th September 2020, 05:44
    Hello. Thank you all for your comments. The following are my replies. -------- Thank you so much for your comment. Yes, you are probably right. I am thinking about trying to figure out how to code this thing. Starting from an open source file is a good idea and is probably a good starting point. Let me see how to go about it. ------------- Yes. We use a fixed/predetermined one. No, I believe I mentioned the full stop. It would take 3 bytes for registering the full stop. So it would gain here. For dealing with the cases where the first letter is uppercase or the whole word is in capitals, we would store all these words in the dictionary as well. There's enough room for 3 different versions of the same word. For cases where the word is not recognized, let me give you an example of how it would be handled. So let's say the unrecognized word is "Harky". Then there would be a huge gain in size here. The word would be stored in the file as 3-byte versions of each character from their ASCII values. So instead of "Harky" taking 5 bytes in ASCII, it would take 3 bytes per character + a stop character, so: 5 x 3 bytes + 3 bytes for the stop = 18 bytes for this one word (see the cost sketch below). But this should happen very very rarely. Yes. You are probably right. But if the dictionary is not built for that particular purpose, then it just bloats up the entire file and the whole thing is pointless. So I suppose throwing up an error here might not be all that bad a thing. But you're probably right. I suppose it depends on how it is designed. ------------------ Yes. You are probably right. Perhaps 2 bytes per word is sufficient. That would give us a dictionary of 65k words, which is not bad considering that the majority of the frequently used words are only about 3000 or so. Also, as mentioned previously by Dresdenboy, it would help on machines with 2-byte alignment. The dictionary is not stored in the file. It is predetermined and stored in the compressor itself. -------------- I hope that clarifies the questions asked. I hope the discussion continues. It is quite insightful. Thank you all once again for all your comments.
    25 replies | 839 view(s)
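    A quick cost calculation for the escape path described above, using the assumptions from the "Harky" example (3 bytes per literal character plus a 3-byte stop code):

        # Cost of escaping a word that is missing from the dictionary: each
        # character becomes a 3-byte codeword, plus a 3-byte stop marker.

        CODEWORD_BYTES = 3

        def escaped_size(word):
            return CODEWORD_BYTES * len(word) + CODEWORD_BYTES   # literals + stop code

        if __name__ == "__main__":
            for w in ["Harky", "a", "encyclopaedia"]:
                print(f"{w!r}: {len(w)} bytes raw -> {escaped_size(w)} bytes escaped "
                      f"({escaped_size(w) / len(w):.1f}x)")
            # 'Harky': 5 bytes raw -> 18 bytes escaped (3.6x)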
  • urntme's Avatar
    16th September 2020, 05:12
    First of all, thank you for your comment. I tried to make the document as readable as possible, but perhaps I haven't done the best job of it. I hope my following statements clear up some of your questions. Yes. You are right. One cannot compress random data. But this compresses random data. I'm sorry for being confusing, but it just depends on your perspective here. It all depends on whether you consider what we are compressing as "random data" or not. Yes. You have probably got it. We are moving bytes of a file and "changing" the extensions. We are not adding extensions to the existing name. There is a difference here. Okay, so here I must say that there is a distinct difference in how "overheads", or in this case the "file name", are handled and how the "data" is handled. A data packet or file can be of any size in a range, but the overhead is always of a fixed size. Whenever you need to open/transmit a file or data packet, you need to send this overhead and you cannot reduce the size of the overhead. So any amount of information you add to the overhead does not increase the overall size of the file or data packet, but since the overhead can only be of a fixed size, you obviously have a limit on how much information can be added to the overhead. In this case we are changing the file extension, which is a part of the overhead. Different operating systems handle file extensions differently. Some allocate space for the file extension separately and some allocate space for the entire file name as one chunk. But no matter what the file name is, and we cannot change the file name because we don't want to change a user's data, we can change the file extension. This is because a lot of file extensions go unused. These are essentially unused states in the file that we are using to reduce the size of the overall data. This makes more sense if you see this through the IPv4 protocol. The IPv4 protocol header contains a "Protocol" field, which is the protocol number. You can check the assigned protocol numbers from this link: https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml The protocol numbers from 144 to 252 are left unassigned and are not used in any way. These are basically like file extensions for the protocols being transmitted. So we use the unused protocol numbers to compress a part of the data packet being sent. These are the unused states of the file overhead in this case. Now if you want to transmit a protocol, you need to transmit the protocol number. The system doesn't care what the number is, so we use the unused protocol numbers to transfer a part of the file and save us some space. I hope all this makes sense. But I think you've probably already got the essence of it. We do transfer the bytes from the data packet to the file name in a way, and this does save us some space. I hope this clears up any doubts and confusion about what I am saying here. Thank you for your comments. I hope the discussion continues.
    6 replies | 333 view(s)
  • hexagone's Avatar
    16th September 2020, 02:06
    Dictionary based compression of text is nothing new, so it is not a bad idea, but you need to start thinking about the details. E.g.: is 3 bytes per word the best distribution in general? How is the dictionary stored? Etc...
    25 replies | 839 view(s)
  • Gotty's Avatar
    16th September 2020, 00:37
    Hm. I don't understand this document. I'll grab only two sentences that caught my attention. >>The data compression algorithm talked about in this paper can be thought of as a random compression algorithm, since it compresses random data in a random manner Oops. What do you mean? You probably know that you cannot compress random data. >>Normally compressed files are converted to a single file extension, for example ".zip" but in our case, the more extensions or suffixes we have, the more compression we can achieve since that essentially corresponds to more states available. Are you removing bytes from the file content and adding more and more extensions to the name of the file at the same time? If that is the case you are just moving information from one place to another. How does it save space then?
    6 replies | 333 view(s)
  • Gotty's Avatar
    16th September 2020, 00:17
    Thank you for the example! I have some missing pieces: How do you construct the dictionary? Is it constructed during compression or do you use a fixed/predetermined one? If you construct it during compression, is it a 1-pass approach (your dictionary grows as you process more and more words) or is it a 2-pass approach: in the first pass you construct the dictionary (collect all the words), then you do the encoding (compression)? If it's a fixed one, then how do you deal with words not present in the dictionary and how exactly do you encode the special characters? How many possible special characters are there? 256-26=230? Or 256-52=204? If it's not a fixed one, i.e. you construct an optimal dictionary for every file, then how do you encode (= how do you store) the dictionary? In your example the gain is 4 bytes (I believe it's actually 3 bytes: you probably forgot the full stop) + the size of the dictionary. And the latter would probably nullify any gains for smaller files. How do you deal with words having an uppercase first letter? CamelCase? ALL CAPITALS? I'd say it's not advisable to use the file extension for determining what the content might be. For example README, READ.ME may contain proper English text without the "expected" file extension. The opposite is also true - but rarer.
    25 replies | 839 view(s)
  • e8c's Avatar
    15th September 2020, 22:23
    Fix (in attachment). A shame, "a grand slipped away".
    2 replies | 88 view(s)
  • fabiorug's Avatar
    15th September 2020, 21:00
    ...ah, he corrected it. There is a new JPEG 2000 encoder: https://github.com/GrokImageCompression/grok/releases/tag/v7.6.0 C:\Users\User\Documents\grok\grok-v7.6.0-windows-x64\bin\grk_compress.exe -i C:\Users\User\Documents\Jpeg4\b.png -o C:\Users\User\Documents\Jpeg4\o.jp2 -q 30,41,42 This codec is awesome; it also falsifies colors, but that helps the images look less plastic and it saves bits - it does the opposite of AVIF. I also tried JPEG XR
    106 replies | 8521 view(s)
  • Scope's Avatar
    15th September 2020, 20:35
    I've compared about a thousand photos with settings in VarDCT mode from -d1 to -d4 and so far I haven't noticed any excessive sharpness, but sometimes I notice some visual color differences, for example in this photo on the barbell plate https://slow.pics/c/0nQeUsBl Even on -d 1, I chose AVIF for comparison and the difference is not noticeable: 2cq8e7bpiib41.d1.jxl (-d 1 -s 8) - 1 239 881 2cq8e7bpiib41.avif (-s 0 --max 32) - 1 165 848
    106 replies | 8521 view(s)
  • carlosnewmusic's Avatar
    15th September 2020, 20:11
    Don't you have a GitHub or other public repository to make it easier to develop and update the application? It's a bit outdated; the GPU could also be used for the calculations needed for file compression. Also, please update the encryption, as the current one is outdated.
    2548 replies | 1104479 view(s)
  • pklat's Avatar
    15th September 2020, 18:16
    edit:ignore
    106 replies | 8521 view(s)
  • Lucas's Avatar
    15th September 2020, 16:28
    You should try to beat LZ4, which is completely byte-aligned. Make a working prototype before announcing results; otherwise there's no proof for the claims. After reading your paper it just looks like a word replacement transform. Look at XWRT or WBPE for reference: they are open source and might help you implement a working version of Disha.
    25 replies | 839 view(s)
  • urntme's Avatar
    15th September 2020, 16:09
    Just a quick point: working on bytes instead of bits is a lot simpler and faster on most systems. This is also an advantage for this algorithm.
    25 replies | 839 view(s)
  • urntme's Avatar
    15th September 2020, 14:07
    First of all, I would like to thank you both for your comments. I am very grateful to receive feedback. All your comments are very insightful and interesting.
    A simple example would be: let's say a text file has the following sentence: "Here comes the cow." ASCII would need 1 byte per character here. So: Here - 4 characters x 1 byte; comes - 5 characters x 1 byte; the - 3 characters x 1 byte; cow - 3 characters x 1 byte; . - 1 byte; 3 spaces - 3 x 1 byte. So, 19 bytes in total. The algorithm would need 3 bytes x 4 words = 12 bytes + 3 bytes for the special character, so 15 bytes. In this case we would save 1 - (15/19) ≈ 21%.
    Yes, you are right. If a non-ASCII/non-English file is used, it would first throw an error based on the file extension. If the file extension is .txt, it would most likely increase the file size, because none of the words would be recognized by the dictionary and all the words or all the data would be treated as "special characters" or "special cases." This would make the file size huge.
    Yes, you are probably right about this. Let me try to find a way to code it. As I am not good with coding, this is a problem. But you are probably right. I thought making the claims based on theory would be enough. But probably not. Let me try to figure out a way to do this. Yes. Thank you for your comments.
    The core idea of this concept is to make a fast compressor using fixed-size binary codes for words instead of using them for characters. The fixed size enables a vast dictionary, because with variable-length coding the dictionary has to be smaller. A vast dictionary helps avoid using "special characters" at all. Furthermore, you can make commonly used sentences symbols as well. Also, if all your words have roughly equal probability, with only minimal variations, then it makes sense to use fixed code sizes for them. Now, in the paper I stated that it would be used for ASCII/English word files, but this can be configured for other languages as well. Variable-length binary coding is good for compression only as long as the probabilities are skewed and the number of codes needed is in an acceptable range. By not using a variable-length code, we enable a vast dictionary to be used, and this enables high speed as well. These are my thoughts. Please let me know yours.
    --------
    Has it really been tried before? Could you give me a link to it? I couldn't find where it is being used. Yes, you are probably right on this. Interesting point. Yes, you are probably right about this. But a smaller byte chunk for each word means a smaller dictionary as well. I suppose it depends on the application and where it would be used. I suppose one could use them. I suppose that's one way of handling them. We could specify a limit on the length of these special-character runs and allocate some space in the dictionary to handle them. Anything more than the limit, say 1000 consecutive special characters, would throw an error in that case. If it breaches that limit, one could simply encode them as three-byte character mappings of their ASCII values. Since there is a vast dictionary, there are a lot of options for handling the special characters. But I suppose it would depend on the application.
    -------------
    Thank you all for this. Your comments have been really insightful. It feels great to find a forum to discuss these things. But you have made me think: is there a need for this? Where could it find its use? Although, one never knows with technology. There's always a use for something somewhere. One just needs to find its niche.
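    (Illustrative aside: a minimal sketch of the byte counting in the example above, with spaces implied between words; the tokenization is an assumption, not taken from the paper.)
        # Reproduces the byte counts from the example above.
        sentence = "Here comes the cow."
        ascii_size = len(sentence)                  # 1 byte per character = 19 bytes

        words = ["Here", "comes", "the", "cow"]     # 3 bytes per word
        specials = ["."]                            # 3 bytes for the special character
        encoded_size = 3 * len(words) + 3 * len(specials)

        print(ascii_size, encoded_size)             # 19 15
        print(round(100 * (1 - encoded_size / ascii_size)), "% saved")   # ~21%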
    25 replies | 839 view(s)
  • e8c's Avatar
    15th September 2020, 13:41
    T2, Rapid. Modified IZ. I can't manage to finish it myself.
    2 replies | 88 view(s)
  • Shelwien's Avatar
    15th September 2020, 12:42
    @lz77: There's also a 2nd-place prize ($1k). And you get to add the corresponding line to your CV/resume and maybe find some job offers. Also, for the competition itself it's better if more people participate: depending on the results it could be made permanent, like the Hutter Prize.
    91 replies | 7327 view(s)
  • urntme's Avatar
    15th September 2020, 12:41
    Thank you very much for your comment. However, I'm not sure where I've used "phased-in binary codes." As far as my understanding goes, I haven't used them in either of the algorithms. The Disha algorithm uses fixed-size binary codes, which I mentioned in the document as 3 bytes, and does not use any type of variable-length encoding. So I'm confused by what you mean here. Perhaps I'm not getting it. In the data compression algorithm, I use unused states in a file overhead to save a few bytes in the payload. So, again, I'm not using "phased-in binary codes" there. So I'm confused. Could you please clarify?
    6 replies | 333 view(s)
  • Dresdenboy's Avatar
    15th September 2020, 12:04
    Hi! That's an interesting idea. It might have been tried before. Some first thoughts: 3-byte alignment might cost performance on processors that work better with 2^n-byte alignments than with arbitrary ones. The big dictionary will cost a lot of memory and only fit in the slow L3 cache. Since it's byte-aligned, you surely didn't want to use the phased-in binary encoding you also used for your bit recycling. But some words are more common than others: a few hundred words are the most common, and a few thousand cover the vocabulary used by most people. How about using the variable-width byte encoding of Unicode then? (Is that new?) Just encode words as UTF-8 characters (more than a million code points available). There are existing, optimized engines to process them. And about the handling of spaces and special characters: other researchers simply switch between words and non-word strings. These could be encoded using 8- or 16-bit characters depending on frequency. In the end you might just need an escape mechanism, which could include the length of the non-word string and store it as plain bytes.
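    (Illustrative aside: a minimal sketch of the UTF-8 idea above; the frequency ranking and word split are assumptions. Frequent words end up as 1-2 byte sequences, rarer ones as 3-4 bytes; the surrogate range is skipped because UTF-8 cannot encode it.)
        from collections import Counter

        # Map a frequency rank to a valid Unicode code point (skip surrogates).
        def rank_to_codepoint(rank: int) -> int:
            cp = rank + 1                           # avoid U+0000
            return cp + 0x800 if cp >= 0xD800 else cp

        def encode_words(words, ranks) -> bytes:
            return "".join(chr(rank_to_codepoint(ranks[w])) for w in words).encode("utf-8")

        words = "the cat sat on the mat the end".split()
        ranks = {w: r for r, (w, _) in enumerate(Counter(words).most_common())}
        print(encode_words(words, ranks))           # frequent words -> short byte sequences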
    25 replies | 839 view(s)
  • Dresdenboy's Avatar
    15th September 2020, 11:40
    Thanks for posting your ideas. After a first quick skim I didn't get your file compression idea; I'm going to read it more carefully later. Your reduced-state encoding reminds me of "phased-in binary codes", also called "flat codes" by some. You could check this thread: https://encode.su/threads/3419-Are-there-LZW-LZ78-LZC-etc-variants-with-efficient-codeword-encoding?p=65961&viewfull=1#post65961
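    (Illustrative aside, since the term comes up a few times in this thread: a minimal sketch of a phased-in binary / flat code for an alphabet whose size n is not a power of two; this is the generic textbook construction, not code from either paper.)
        # With k = floor(log2(n)) and u = 2^(k+1) - n, the first u symbols get
        # k-bit codes and the remaining n - u symbols get (k+1)-bit codes.
        def phased_in_code(symbol: int, n: int) -> str:
            k = n.bit_length() - 1
            u = (1 << (k + 1)) - n
            if symbol < u:
                return format(symbol, f"0{k}b")
            return format(symbol + u, f"0{k + 1}b")

        print([phased_in_code(s, 5) for s in range(5)])
        # ['00', '01', '10', '110', '111'] -- prefix-free, average length below 3 bits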
    6 replies | 333 view(s)
  • lz77's Avatar
    15th September 2020, 10:56
    Is there any point in submitting compressors that won't win prizes?
    91 replies | 7327 view(s)
  • fabiorug's Avatar
    15th September 2020, 10:45
    -s 4 -d 8.82 + -s 8 -d 1.28: to me it looks more reconstructed than sharpened in the 7 September 2020 build, and it's an example of recompression so I don't care, but I got 36 kB 720p files with no text. S5 88,5 + S5 77,8 is definitely sharpened by a lot; Scope told me it looks far from the original, but that's what I expect from two recompressions. It's good to compress only once. No need to disable parameters; it's better to have more. I used the Discord platform.
    106 replies | 8521 view(s)
  • fabiorug's Avatar
    15th September 2020, 10:33
    I guess many features are disabled in JPEG XL, such as text-efficiency improvements, not because you're focusing on quality but because the bitstream isn't final. Yes, I can use modular mode for text, but it isn't the same as a VarDCT mode tuned for it. But I'm not an expert and we should wait for the final bitstream.
    106 replies | 8521 view(s)
  • Gotty's Avatar
    15th September 2020, 10:10
    Welcome, Urntme! The attached document seems to sketch an idea but does not go into details. Please elaborate on how compression and decompression work with a small concrete example. You seem to target only ASCII files with the English alphabet containing English sentences. If someone used your method in practice, they would certainly try to feed many different files to your algorithm (not just pure ASCII files with pure English sentences). What would happen? An error message? Would it try to compress them anyway? But there are no English words there. So please elaborate on how your method would deal with non-English text files and non-text files. Before being able to publish your idea you will need to actually try it and prove that it works. Without such a proof you cannot make any claims about its compression ratio, speed or simplicity. You will need to run benchmarks and compare your idea with other existing methods. If you would like to turn your idea into a scientific paper and have it published, you will need to fill in such missing pieces.
    25 replies | 839 view(s)
  • Jyrki Alakuijala's Avatar
    15th September 2020, 10:08
    If sharpening happens in VarDCT, I'd love to learn about it and fix it.
    106 replies | 8521 view(s)
  • urntme's Avatar
    15th September 2020, 08:50
    Hello all, This is my second post on this forum; I also posted about the Disha format in a previous thread. This is an algorithm that can save bits on a network with a high amount of traffic. I have attached a paper I wrote about it to this thread. Feedback, thoughts, and comments are all welcome. I hope you find it interesting and that it makes you think. One thing in particular I would be interested in knowing is whether you think there are any avenues that would be interested in publishing this. Best Regards, Urntme.
    6 replies | 333 view(s)
  • urntme's Avatar
    15th September 2020, 08:12
    Hello all, I am a newcomer to this forum. I request you all to please treat me kindly. :) This is my first post on this forum, and I was directed to this place by a few people. Now about the compression algorithm: I have attached a paper I wrote about the algorithm in PDF format to this post. Kindly read the attachment. The algorithm is a text compression algorithm that I came up with. Its features are that it is fast and simple and provides decent compression at a low time/overhead cost. It does not achieve the best possible compression, but it does a good job. I am looking for feedback on this algorithm: where can it be used? What are its possible applications? And I welcome all comments on what you think about the algorithm. Now, since I am code-challenged, as in I am not a good programmer, I would also be interested to know if anyone wants to collaborate with me on this project to write the code for it. Also, do you know any avenues that would be interested in publishing this paper? For example, any journals or publications. So, I welcome any and all feedback. I hope you find the paper interesting and I hope it makes you think a little bit. Thank you all for reading. Best Regards, Urntme. P.S.: Urntme - the username is read as "You aren't me."
    25 replies | 839 view(s)
  • Scope's Avatar
    15th September 2020, 02:59
    In my tests I haven't noticed any examples where VarDCT creates a sharper image than the original (only some JPEGs may seem sharper, but these are probably additional compression losses). The quoted message from fabiorug was mine; it was about using additional filters before encoding that can "improve" the image, and also about an example (not mine) with modular compression that doesn't look better than VarDCT at the same or even bigger size and with noticeable sharpness (especially after multiple lossy re-compressions of the same image). That example no longer exists, but in the image below I tried to simulate a similar result with modular compression, although the difference was softer (because I do not know what exact settings were used): https://slow.pics/c/uBc2kklc
    106 replies | 8521 view(s)
  • Darek's Avatar
    14th September 2020, 21:58
    Darek replied to a thread paq8px in Data Compression
    Some enwik scores (mostly enwik8) for the latest paq8px versions:
    16'190'519 - enwik8 -12 by Paq8px_v187fix2
    16'080'588 - enwik8 -12eta by Paq8px_v187fix2
    15'889'931 - enwik8.drt -12eta by Paq8px_v187fix2
    127'626'051 - enwik9_1423 -12eta by Paq8px_v187fix2
    124'786'260 - enwik9_1423.drt -12eta by Paq8px_v187fix2
    15'900'206 - enwik8 -12leta by Paq8px_v188, change: -1,12%
    15'503'221 - enwik8.drt -12leta by Paq8px_v188, change: -2,43%
    15'907'081 - enwik8 -12leta by Paq8px_v188b, change: 0,04%
    15'505'761 - enwik8.drt -12leta by Paq8px_v188b, change: 0,02%
    15'896'588 - enwik8 -12leta by Paq8px_v189, change: -0,07%
    15'490'302 - enwik8.drt -12leta by Paq8px_v189, change: -0,10%
    121'056'858 - enwik9_1423.drt -12leta by Paq8px_v189, change: -2,99% - best score measured
    15'896'196 - enwik8 -12leta by Paq8px_v190, change: 0,00%
    15'896'126 - enwik8 -12lreta by Paq8px_v190, change: 0,00%
    15'490'045 - enwik8.drt -12leta by Paq8px_v190, change: 0,00%
    15'490'045 - enwik8.drt -12lreta by Paq8px_v190, change: 0,00%
    15'888'954 - enwik8 -12lreta by Paq8px_v191, change: -0,05%
    15'638'056 - enwik8.drt -12lreta by Paq8px_v191, change: 0,96%
    15'898'305 - enwik8 -12lreta by Paq8px_v191a, change: 0,01%
    15'638'254 - enwik8.drt -12lreta by Paq8px_v191a, change: 0,96%
    15'898'544 - enwik8 -12lreta by Paq8px_v192, change: 0,00%
    15'888'545 - enwik8 -12leta by Paq8px_v192, change: -0,06%
    15'638'126 - enwik8.drt -12lreta by Paq8px_v192, change: 0,00%
    15'479'471 - enwik8.drt -12leta by Paq8px_v192, change: -1,02%
    15'884'947 - enwik8 -12lreta by Paq8px_v193, change: -0,02% - best score measured
    15'476'230 - enwik8.drt -12lreta by Paq8px_v193, change: -0,02% - best score measured
    126'150'760 - estimated enwik9_1423 -12lreta by Paq8px_v193, change: -0,02% - estimated best score - time to compress about 6 days
    120'946'885 - estimated enwik9_1423.drt -12lreta by Paq8px_v193, change: -0,02% - estimated best score - time to compress about 4 days
    15'899'080 - enwik8 -12lreta by Paq8px_v193fix2, change: 0,09%
    15'477'066 - enwik8.drt -12lreta by Paq8px_v193fix2, change: 0,01%
    It looks like paq8px v193 got the best scores for enwik8 and probably for enwik9 as well; however, until that test is done, the best score for enwik9 belongs to paq8px v189. It's close to breaking the 120'000'000 barrier, but there is still about 1 MB to gain. :)
    2110 replies | 569457 view(s)
  • fabiorug's Avatar
    14th September 2020, 19:45
    Honestly, this user is very active in JPEG XL; he's called scopeburst or Scope, I honestly don't know. He has also filed issues on GitLab. But I know that optimizing VarDCT for text is low priority for the moment, as far as I understood. The JPEG XL developers use very high-quality images at high bitrates, not at the bitrate produced by -s 4 -d 8.82 + -s 8 -d 1.28. All 3 photos look natural to me; I tested only with a re-encoding that is older than the one I mentioned in the line before. I didn't point out any issue.
    106 replies | 8521 view(s)
  • Jyrki Alakuijala's Avatar
    14th September 2020, 19:37
    Where should I look in the image you shared to find the sharpening?
    106 replies | 8521 view(s)
  • Shelwien's Avatar
    14th September 2020, 13:58
    Rep-codes have some effect on text too, but extra token types also add redundancy, so parsing optimization is basically required for rep-codes to be useful.
    http://nishi.dreamhosters.com/u/lzma_delrep_v0.rar
    http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar
    400,000,000 TS40.txt
    100,238,609 TS40.lzma
    212,626,315 TS40.lzma.rec
    213,279,344 TS40.lzma.rec.dr0 // all rep-codes removed
    100,625,066 TS40.lzma.rec.dr0.lzma
    212,587,654 TS40.lzma.rec.dr1 // rep-codes re-applied with a greedy method
    100,240,681 TS40.lzma.rec.dr1.lzma
    total tokens: 35614133
    token types: 1361 litR0, 101650 rep0, 61267 rep1-3
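    (Illustrative aside: a minimal sketch of what a greedy rep-code assignment could look like. It only shows the last-distances bookkeeping; the actual LZMA token layout and its exact distance-history update rules are not modeled.)
        # Keep the 4 most recent match distances; when a match reuses one of them,
        # emit a short rep index instead of the full distance.
        def assign_rep_codes(match_distances):
            recent, tokens = [], []
            for d in match_distances:
                if d in recent:
                    i = recent.index(d)
                    tokens.append(("rep", i))          # rep0..rep3
                    recent.insert(0, recent.pop(i))    # move the hit to the front
                else:
                    tokens.append(("match", d))
                    recent.insert(0, d)
                    del recent[4:]                     # keep at most 4 distances
            return tokens

        print(assign_rep_codes([100, 7, 100, 100, 7, 4096]))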
    36 replies | 1712 view(s)
More Activity