
Thread: Disha Format - A different Text Compression Algorithm

  1. #1
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts

    Disha Format - A different Text Compression Algorithm

    Hello all,

    I am a newcomer to this forum. I request you all to please treat me kindly.

    So, this is my first post to this forum and I was directed to this place by a few people.

    Now about the Compression Algorithm.

I have attached a paper I wrote about the algorithm in PDF format to this post. Kindly read the attachment.

The algorithm is a text compression algorithm that I came up with. It is fast and simple, and it provides decent compression at a low time/overhead cost. It does not achieve the best possible compression, but it does a good job.

I am looking for feedback on this algorithm. The feedback I am looking for is: Where can it be used? What are its possible applications? And any comments on what you think about the algorithm.

Now, since I am code-challenged (I am not a good programmer), I would also be interested to know if anyone wants to collaborate with me on this project to write the code for it. Also, do you know of any avenues that would be interested in publishing this paper, for example any journals or publications?

    So, I welcome any and all feedback. I hope you find the paper interesting and I hope it makes you think a little bit.

    Thank you all for reading.

    Best Regards,
    Urntme.

    P.S:
    Urntme - The username is read as "You aren't me."
Attached Files

  2. #2
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Welcome, Urntme!

The attached document seems to sketch an idea but does not go into details. Please explain how compression and decompression work with a small concrete example.

You seem to target only ASCII files that contain English sentences written with the English alphabet.
If someone used your method in practice, they would certainly try to feed many different files to your algorithm (not just pure ASCII files with pure English sentences). What would happen? An error message? Would it try to compress them anyway? But there are no English words there.
So please explain how your method would deal with non-English text files and non-text files.

Before being able to publish your idea you will need to actually try it and prove that it works. Without such a proof you cannot make any claims about its compression ratio, speed, or simplicity.
You will need to run benchmarks and compare your idea with other existing methods.

If you would like to turn your idea into a scientific paper and have it published, you will need to fill in these missing pieces.

  3. Thanks:

    urntme (15th September 2020)

  4. #3
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Hi!
    That's an interesting idea. It might have been tried before.
    Some first thoughts:

    3 Byte alignment might cost performance on processors working better on 2^n byte alignments than arbitrary ones.

    The big dictionary will cost a lot of memory and only fit in the slow L3 cache.

Since it's byte-aligned, you surely didn't want to use the phased-in binary encoding you also used for your bit recycling. But some words are more common than others: a few hundred words are the most common, and a few thousand cover the vocabulary used by most people.

    How about using the variable width byte encoding of Unicode then? (Is that new?)
    Just encode words using UTF-8 characters (>2 million available). There are existing, optimized engines to process them.

    And about the handling of spaces and special characters: other researchers simply switch between words and non-word strings. They could be encoded using 8 or 16 bit sized characters depending on frequency. In the end you might just need an escape mechanism, which could include the length of the non-word strings and store them as plain bytes.

  5. Thanks:

    urntme (15th September 2020)

  6. #4
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    First of all, I would like to thank you both for your comments. I am very grateful to receive feedback. All your comments are very insightful and interesting.

The attached document seems to sketch an idea but does not go into details. Please explain how compression and decompression work with a small concrete example.
    So a simple example would be:

    Let's say a text file has the following sentence: "Here comes the cow."

ASCII would need 1 byte per character here. So: Here - 4 characters x 1 byte; comes - 5 characters x 1 byte; the - 3 characters x 1 byte; cow - 3 characters x 1 byte; . - 1 byte; 3 spaces - 3 x 1 byte. So, 19 bytes in total.

The algorithm would need: 3 bytes x 4 words = 12 bytes + 3 bytes for the special character, so 15 bytes. So in this case we would save 1 - (15/19) ≈ 21%.
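
Just to make the arithmetic explicit, here is a tiny Python sketch of the byte counting (only an illustration; the word-to-codeword mapping itself is assumed and not shown):

Code:
sentence = "Here comes the cow."
ascii_size = len(sentence)                       # 19 bytes, 1 byte per character

words = ["Here", "comes", "the", "cow"]          # each word maps to one 3-byte codeword
specials = ["."]                                 # the full stop also gets a 3-byte codeword
disha_size = 3 * len(words) + 3 * len(specials)  # 12 + 3 = 15 bytes; spaces are not stored

print(ascii_size, disha_size, f"{1 - disha_size / ascii_size:.0%}")  # 19 15 21%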

You seem to target only ASCII files that contain English sentences written with the English alphabet.
If someone used your method in practice, they would certainly try to feed many different files to your algorithm (not just pure ASCII files with pure English sentences). What would happen? An error message? Would it try to compress them anyway? But there are no English words there.
So please explain how your method would deal with non-English text files and non-text files.
Yes, you are right. If a non-ASCII/non-English file is used, it would first throw up an error based on the file extension. If the file extension is .txt, it would most likely increase the file size because none of the words would be recognized by the dictionary and all the words or all the data would be treated as "special characters" or "special cases." This would make the file size huge.

Before being able to publish your idea you will need to actually try it and prove that it works. Without such a proof you cannot make any claims about its compression ratio, speed, or simplicity.
You will need to run benchmarks and compare your idea with other existing methods.
Yes, you are probably right about this. Let me try to find a way to code it. As I am not good with coding, this is a problem. But you are probably right. I thought making the claims based on theory alone would be enough. But probably not. Let me try to figure out a way to do this.

If you would like to turn your idea into a scientific paper and have it published, you will need to fill in these missing pieces.
    Yes. Thank you for your comments.

The core idea of this concept is to make a fast compressor by using fixed-size binary codes for words instead of for characters. The fixed size enables a vast dictionary, because with variable-length coding the dictionary has to be smaller. A vast dictionary helps avoid using "special characters" at all. Furthermore, you can make commonly used sentences into symbols as well. Also, if all your words have roughly equal probability, with only minimal variations, then it makes sense to use fixed code sizes for them.
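
Here is a rough Python sketch of the kind of fixed-width codec I have in mind (the tiny dictionary is only a placeholder; the real one would be predetermined and much larger, and unknown words would go through the special-case path instead):

Code:
# Rough sketch only: fixed 3-byte codewords per word, spaces implicit.
WORDS = ["the", "The", "THE", "here", "Here", "comes", "cow", "."]   # placeholder dictionary
CODE = {w: i.to_bytes(3, "big") for i, w in enumerate(WORDS)}
WORD = {v: k for k, v in CODE.items()}

def compress(tokens):
    # Every known token costs exactly 3 bytes; spaces are dropped entirely.
    return b"".join(CODE[t] for t in tokens)

def decompress(data):
    tokens = [WORD[data[i:i + 3]] for i in range(0, len(data), 3)]
    return " ".join(tokens)          # a space is re-inserted after every token

packed = compress(["Here", "comes", "the", "cow", "."])
print(len(packed), decompress(packed))   # 15 bytes; note the space before "." on output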

    Now, in the paper I stated that it would be used for ASCII/English word files but this can be configured for other languages as well.

Variable-length binary coding is good for compression only as long as the probabilities are skewed in some way and the number of codes needed is in an acceptable range.

    By not using a variable length code, we enable a vast dictionary to be used and this enables high speed as well.

    These are my thoughts. Please let me know yours.

    --------

    That's an interesting idea. It might have been tried before.
    Has it really been tried before? Could you give me a link to it? I couldn't find where it is being used.

    3 Byte alignment might cost performance on processors working better on 2^n byte alignments than arbitrary ones.
    Yes. You are probably right on this. Interesting point.

Since it's byte-aligned, you surely didn't want to use the phased-in binary encoding you also used for your bit recycling. But some words are more common than others: a few hundred words are the most common, and a few thousand cover the vocabulary used by most people.
    Yes. You are probably right about this. But a smaller byte chunk for each word means a smaller dictionary as well. I suppose it depends on the application and where it would be used.

    How about using the variable width byte encoding of Unicode then? (Is that new?)
    Just encode words using UTF-8 characters (>2 million available). There are existing, optimized engines to process them.
    I suppose one could use them.

    And about the handling of spaces and special characters: other researchers simply switch between words and non-word strings. They could be encoded using 8 or 16 bit sized characters depending on frequency. In the end you might just need an escape mechanism, which could include the length of the non-word strings and store them as plain bytes.
I suppose that's one way of handling them. We could specify a limit on the length of these runs of special characters and allocate some space in the dictionary to handle them. Anything beyond the limit, say 1000 consecutive special characters, would throw up an error in that case. Or, if it breaches that limit, one could simply encode them as three-byte character mappings of their ASCII values.

    Since there is a vast dictionary, there are a lot of options for handling the special characters. But I suppose it would depend on the application.

    -------------

    Thank you all for this. Your comments have been really insightful. It feels great to find a forum to discuss these things. But you have made me think. Is there a need for this? Where could it find its use? Although, one never knows with technology. There's always a use for something somewhere. One needs to just find their niche.

  7. #5
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
Just a quick point. Working on bytes instead of bits is a lot simpler and faster on most systems. This is also an advantage for this algorithm.

  8. #6
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    You should try to beat LZ4 which is completely byte aligned. Make a working prototype before announcing results, otherwise there's no proof to the claims. After reading your paper it just looks like a word replacement transform, look at XWRT or WBPE for reference as they are open source and might help you implement a working version of Disha.

  9. Thanks:

    urntme (21st September 2020)

  10. #7
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Quote Originally Posted by urntme View Post
    So a simple example would be:

    Let's say a text file has the following sentence: "Here comes the cow."

ASCII would need 1 byte per character here. So: Here - 4 characters x 1 byte; comes - 5 characters x 1 byte; the - 3 characters x 1 byte; cow - 3 characters x 1 byte; . - 1 byte; 3 spaces - 3 x 1 byte. So, 19 bytes in total.

The algorithm would need: 3 bytes x 4 words = 12 bytes + 3 bytes for the special character, so 15 bytes. So in this case we would save 1 - (15/19) ≈ 21%.
    Thank you for the example!

    I have some missing pieces:

    How do you construct the dictionary? Is it constructed during compression or do you use a fixed/predetermined one? If you construct it during compression, is it a 1-pass approach (your dictionary grows as you process more and more words) or is it a 2-pass approach: in the first pass you construct the dictionary (collect all the words) then you do the encoding (compression)?
    If it's a fixed one then how do you deal with words not present in the dictionary and how do you exactly encode the special characters? How many possible special characters are there? 256-26=230? Or 256-52=204?
    If it's not a fixed one i.e. you construct an optimal dictionary for every file then how do you encode (= how do you store) the dictionary? In your example the gain is 4 bytes (I believe it's actually 3 bytes: you probably forgot the full stop) + the size of the dictionary. And the latter would probably nullify any gains for smaller files.
How do you deal with words having uppercase letters? CamelCase? ALL CAPITALS?

    Quote Originally Posted by urntme View Post
it would first throw up an error based on the file extension
    I'd say it's not advisable to use the file extension for determining what the content might be. For example README, READ.ME may contain proper English text without the "expected" file extension. The opposite is also true - but rarer.


  11. Thanks:

    urntme (21st September 2020)

  12. #8
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Disha Format is a completely different compressed text format. What happens in the format is that instead of storing “characters,” “words” are stored.
    Given time, the author is certain it will find its space in the market.
    Dictionary based compression of text is nothing new, so it is not a bad idea but you need to start thinking about the details. EG: is 3 byte per word the best distribution in general ? How is the dictionary stored ? Etc...

  13. Thanks:

    urntme (21st September 2020)

  14. #9
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    Hello. Thank you all for your comments. The following are my replies.

    --------

    You should try to beat LZ4 which is completely byte aligned. Make a working prototype before announcing results, otherwise there's no proof to the claims. After reading your paper it just looks like a word replacement transform, look at XWRT or WBPE for reference as they are open source and might help you implement a working version of Disha.
Thank you so much for your comment. Yes, you are probably right. I am thinking about trying to figure out how to code this thing. Starting from an open source project is a good idea and is probably a good starting point. Let me see how to go about it.

    -------------

    How do you construct the dictionary? Is it constructed during compression or do you use a fixed/predetermined one?
    Yes. We use a fixed/predetermined one.

    you probably forgot the full stop + the size of the dictionary.
    No I believe I mentioned the full stop. It would take 3 bytes for registering the full stop. So it would gain here.

How do you deal with words having uppercase letters? CamelCase? ALL CAPITALS?
For dealing with the cases where the first letter is uppercase or the whole word is in capitals, we would store all these variants in the dictionary as well. There's enough room for 3 different versions of the same word.

    For cases where the word is not recognized let me give you an example of how it would be handled. So let's say the unrecognized word is "Harky"

There would be a large expansion here. The word would be stored in the file as 3-byte versions of each character from their ASCII values. So instead of "Harky" taking 5 bytes in ASCII, it would take 3 bytes per character + a stop character, so: 5 x 3 bytes + 3 bytes for the stop = 18 bytes for this one word. But this should happen very, very rarely.
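
A quick sketch of that cost calculation (just an illustration):

Code:
# Out-of-dictionary fallback as described above: 3 bytes per literal
# character plus a 3-byte stop code.
def escaped_size(word):
    return 3 * len(word) + 3

print(escaped_size("Harky"))   # 18 bytes, versus 5 bytes in plain ASCII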

    I'd say it's not advisable to use the file extension for determining what the content might be. For example README, READ.ME may contain proper English text without the "expected" file extension. The opposite is also true - but rarer.
    Yes. You are probably right. But if the dictionary is not built for that particular purpose, then it just bloats up the entire file and the whole thing is pointless. So I suppose throwing up an error here might not be all that bad a thing. But you're probably right. I suppose it depends on how it is designed.

    ------------------

    Dictionary based compression of text is nothing new, so it is not a bad idea but you need to start thinking about the details. EG: is 3 byte per word the best distribution in general ? How is the dictionary stored ? Etc...
    Yes. You are probably right. Perhaps 2 bytes per word is sufficient. That would give us a dictionary of 65k words, which is not bad considering that the majority of the frequently used words are only about 3000 or so. Also, as mentioned previously by Dresdenboy it would help on machines with 2 byte alignments.

    The dictionary is not stored in the file. It is predetermined and stored in the compressor itself.

    --------------

    I hope that clarifies the questions asked. I hope the discussion continues. It is quite insightful. Thank you all once again for all your comments.

  15. #10
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by urntme View Post
    Has it really been tried before? Could you give me a link to it? I couldn't find where it is being used.
I haven't heard of exactly that 3-byte dictionary-index encoding for compression, but with some digging it might show up. I've been back to experimenting with compression more intensively (as time permits) since 2016 and I still find new concepts in my research sessions, sometimes dating back to the 80's or 90's, e.g. some kind of ANS-like encoding from the early 90's by John Boss. So you might find the exact same solution maybe in 2 years. The more exotic it is, the longer it might take.

    But I found some interesting concepts of doing word based compression, just check following links:
    Przemysław Skibiński's PhD thesis: http://pskibinski.pl/papers/PhD_thesis.pdf (contains a lot about word processing and word based compression techniques)
    Paper "Revisiting dictionary-based compression": https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf
    Paper "A Novel Approach to Compress Centralized Text Data using Indexed Dictionary": https://arxiv.org/ftp/arxiv/papers/1512/1512.07153.pdf
    Paper "Fast Text Compression Using Multiple Static Dictionaries": https://www.researchgate.net/publica...c_Dictionaries (PDF available)
    Paper "WordCode using WordTrie": https://www.sciencedirect.com/scienc...19157818313417 (PDF available, mentions multiple other methods incl. one which encodes English words using a 19 bit integer)

    Or check Matt Mahoney's Data Compression Explained page:
    PASQDA (v1.0-v4.4, Jan. 2005 to Jan. 2006) is a fork by Przemyslaw Skibinski. It adds an external dictionary to replace words in text files with 1 to 3 byte symbols. This technique was used successfully in the Hutter Prize and in the top ranked programs in the large text benchmark. PAQ8A2, PAQ8B, PAQ8C, PAQ8D, PAQ8E, and PAQ8G (Feb. to Mar. 2006) also use this technique, as does PAQAR 4.5 by Alexander Rhatushnyak.
    Matt and other well known compression experts are active on this very forum, BTW.

    While researching something it is necessary to look at existing solutions and compare them to your own. I literally spent tens of hours over the last months just to look for existing solutions to some very specific problem. This is both interesting and also might trigger some interesting ideas and improvements for your own idea.

    And for many researchers it's always an inherent motivation (let's call it "fun" ) to try new ideas.

    Quote Originally Posted by urntme View Post
    Yes. You are probably right about this. But a smaller byte chunk for each word means a smaller dictionary as well. I suppose it depends on the application and where it would be used.
I also saw what you answered in another post above. Of course removing 8 bits means a lot. But there are ways to optimize the 3-byte alignment handling for more speed. A lot of experts here know ways to do this, down to the assembler level. Independent of its performance impact, it also causes bloat. As you wrote (and as I also mentioned), there are just a few thousand words which are most commonly used. Hence my idea of using UTF-8 (and I found references to UTF-8 usage in my linked papers above). This could also be done by using a variable-length byte encoding (e.g. use 1 bit in a byte to denote whether the length is 1 byte or more than 1 byte, then use similar bits or a 2-bit count value in the next bytes).
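
A quick sketch of such a scheme, simplified to a single flag bit and at most two bytes just to illustrate the idea (the real design could use more levels as described above):

Code:
# Simplified illustration: top bit 0 = 1-byte code (128 most frequent words),
# top bit 1 = 2-byte code (32768 further words).
def encode_index(i):
    if i < 128:
        return bytes([i])                       # 0xxxxxxx
    i -= 128
    return bytes([0x80 | (i >> 8), i & 0xFF])   # 1xxxxxxx xxxxxxxx

def decode_index(buf, pos):
    b = buf[pos]
    if b < 0x80:
        return b, pos + 1
    return 128 + (((b & 0x7F) << 8) | buf[pos + 1]), pos + 2

for i in (5, 127, 128, 20000):
    code = encode_index(i)
    assert decode_index(code, 0) == (i, len(code))
    print(i, code.hex())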

    You might check Charles Bloom's articles here (with 3 other parts linked there):
    http://cbloomrants.blogspot.com/2012...es-part-4.html

I suppose that's one way of handling them. We could specify a limit on the length of these runs of special characters and allocate some space in the dictionary to handle them. Anything beyond the limit, say 1000 consecutive special characters, would throw up an error in that case. Or, if it breaches that limit, one could simply encode them as three-byte character mappings of their ASCII values.
    If such a block has been found, a length encoded in a range of the dictionary would be enough. So this turns into 1 encoded length (using your 3 byte value) followed by [length] raw bytes. If there are more, there will just be a sequence of [max length]+raw bytes and a non max length block. No error throwing needed. This is an escape mechanism.
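
Something like this sketch (the constants are placeholders, just to show the mechanism):

Code:
# Escape mechanism sketch: a reserved codeword range encodes "N raw bytes
# follow"; longer runs just become several maximal blocks, no error needed.
ESCAPE_BASE = 0xF00000     # placeholder: reserved range of the 3-byte code space
MAX_RUN = 255              # placeholder per-block limit

def emit_raw(run):
    out = bytearray()
    while run:
        chunk, run = run[:MAX_RUN], run[MAX_RUN:]
        out += (ESCAPE_BASE + len(chunk)).to_bytes(3, "big")   # length codeword
        out += chunk                                           # raw bytes follow as-is
    return bytes(out)

print(emit_raw(b"@#%&" * 100)[:6].hex())   # first length codeword + start of raw data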

    I think this is the point where one finds out the most by just experimenting with the ideas on real data and gathering statistics.

    Thank you all for this. Your comments have been really insightful. It feels great to find a forum to discuss these things. But you have made me think. Is there a need for this? Where could it find its use? Although, one never knows with technology. There's always a use for something somewhere. One needs to just find their niche.
    You might also discuss it in Google groups (former usenet), group name is: "comp.compression".

    You are right about finding a niche. I also found an interesting niche with others (tiny decompressors). The big areas are already covered.

    But it might also just be fun to play around with such algorithms. There is a lot one could learn by doing this.

  16. Thanks:

    urntme (21st September 2020)

  17. #11
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    62
    Thanks
    8
    Thanked 5 Times in 5 Posts
    hmm, perhaps one day accurate 'google translation' could preprocess text to something that can be compressed easier.

  18. Thanks:

    urntme (21st September 2020)

  19. #12
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    Thank you all for your comments. It is very insightful.

    ------------

    But I found some interesting concepts of doing word based compression, just check following links:
    Przemysław Skibiński's PhD thesis: http://pskibinski.pl/papers/PhD_thesis.pdf (contains a lot about word processing and word based compression techniques)
    Paper "Revisiting dictionary-based compression": https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf
    Paper "A Novel Approach to Compress Centralized Text Data using Indexed Dictionary": https://arxiv.org/ftp/arxiv/papers/1512/1512.07153.pdf
    Paper "Fast Text Compression Using Multiple Static Dictionaries": https://www.researchgate.net/publica...c_Dictionaries (PDF available)
    Paper "WordCode using WordTrie": https://www.sciencedirect.com/scienc...19157818313417 (PDF available, mentions multiple other methods incl. one which encodes English words using a 19 bit integer)
    Thank you so much for sharing these papers. These are really interesting reads. I like the fact that you took care to put in papers that I could download and that weren't behind a paywall. Thank you so much for that thought.

    Przemysław Skibiński's PhD thesis: http://pskibinski.pl/papers/PhD_thesis.pdf (contains a lot about word processing and word based compression techniques)
Interesting thesis. I took a quick look at it. He mentions quite a few common methods that are in use.

    Paper "Revisiting dictionary-based compression": https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf
    Interesting paper.

    Paper "A Novel Approach to Compress Centralized Text Data using Indexed Dictionary": https://arxiv.org/ftp/arxiv/papers/1512/1512.07153.pdf
    They use an indexed dictionary here.

    Paper "Fast Text Compression Using Multiple Static Dictionaries": https://www.researchgate.net/publica...c_Dictionaries (PDF available)
    This is quite a strange paper for me. I didn't quite understand it as completely as I wished.

    Paper "WordCode using WordTrie": https://www.sciencedirect.com/scienc...19157818313417 (PDF available, mentions multiple other methods incl. one which encodes English words using a 19 bit integer)
    This one was the most interesting paper for me.

    To be honest, I would need to spend more time with all the papers to understand all the nuances of it. But thank you so much for all the papers.

    Matt and other well known compression experts are active on this very forum, BTW.
    Oh? I suspected that but I wasn't sure. It's really great to have the experts here. This is a really cool forum for Data Compression. Very active and thriving! It's great!

    While researching something it is necessary to look at existing solutions and compare them to your own. I literally spent tens of hours over the last months just to look for existing solutions to some very specific problem. This is both interesting and also might trigger some interesting ideas and improvements for your own idea.
    That is quite true. I love exploring and learning from new ideas and new perspectives. Learning about different ways to think about the same thing at multiple levels is a really interesting adventure.

    And for many researchers it's always an inherent motivation (let's call it "fun" ) to try new ideas.
That's great. I believe research should always be about asking the questions we don't know the answers to. Otherwise why do it?

I also saw what you answered in another post above. Of course removing 8 bits means a lot. But there are ways to optimize the 3-byte alignment handling for more speed. A lot of experts here know ways to do this, down to the assembler level. Independent of its performance impact, it also causes bloat.


    Oh? That's an interesting piece of information.

As you wrote (and as I also mentioned), there are just a few thousand words which are most commonly used. Hence my idea of using UTF-8 (and I found references to UTF-8 usage in my linked papers above). This could also be done by using a variable-length byte encoding (e.g. use 1 bit in a byte to denote whether the length is 1 byte or more than 1 byte, then use similar bits or a 2-bit count value in the next bytes).


    Yes. I guess that would be one way to do it.

    You might check Charles Bloom's articles here (with 3 other parts linked there):
    http://cbloomrants.blogspot.com/2012...es-part-4.html
    This was quite an interesting read. What I also found interesting is that he also provides all the coding for his statements. That's really cool to learn from.

    If such a block has been found, a length encoded in a range of the dictionary would be enough. So this turns into 1 encoded length (using your 3 byte value) followed by [length] raw bytes. If there are more, there will just be a sequence of [max length]+raw bytes and a non max length block. No error throwing needed. This is an escape mechanism.
    Yes. This is also an interesting way to do it.

    I think this is the point where one finds out the most by just experimenting with the ideas on real data and gathering statistics.
    Yes. You're absolutely right about that.

    You might also discuss it in Google groups (former usenet), group name is: "comp.compression".
    Oh, is that so? I'll take a look at it sometime. Thank you so much for the link.

    You are right about finding a niche. I also found an interesting niche with others (tiny decompressors). The big areas are already covered.
    Yes. Thank you for that. You are absolutely right, we just have to find our place in this world. However I also believe that not everything has been discovered. The knowledge out there is a vast ocean and we know so so very little.

    But it might also just be fun to play around with such algorithms. There is a lot one could learn by doing this.
    Yes. That's true. There's always a lot to learn by experimenting and just trying new ideas. Life is a constant learning process I feel.

    -------------

    Phew. That is a lot of information you've given me. It took me a long time to process all of that. Also, it was a long post. Thank you so much for dedicating the time to post this long post. I am really learning.

    -------------

    hmm, perhaps one day accurate 'google translation' could preprocess text to something that can be compressed easier.
    Yes. Perhaps.

    -------------

    Thank you all for your comments.

  20. #13
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    >>How do you construct the dictionary? Is it constructed during compression or do you use a fixed/predetermined one?
    >Yes. We use a fixed/predetermined one.

    Then this is one of the most similar algorithms to your idea: https://github.com/antirez/smaz
    It uses a static "fragment" dictionary for ascii texts (see the dictionary here: https://github.com/antirez/smaz/blob/master/smaz.c). Not full words as in your case. That's a difference.



    >>you probably forgot the full stop + the size of the dictionary.
    >No I believe I mentioned the full stop. It would take 3 bytes for registering the full stop. So it would gain here.

Oh, OK. Then what happens with the 3 spaces?


I feel that using a fixed-codeword approach for content that follows Zipf's law (English text) is (very) suboptimal. For such content variable-length codewords are much better.


    Let's see.


    Let's do an experiment. Take a large ascii text file (like the 100 MB text8 from here: http://mattmahoney.net/dc/textdata.html), let's grab all the words from it. A sample from the file:
    Code:
    anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french 
    revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive 
    label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers
The file contains 17 million words (253854 distinct), all lowercase and with only a single space between the words (that's in favor of your algorithm).
6% of the words are "the" (it occurs 106396 times), and 0.7% of the word occurrences are words that appear just once (118519 of them).
This distribution reflects Zipf's law.

    When you use 3 bytes for every word then all the "the" words will be encoded to 3 x 106396 bytes. And you didn't gain anything (3-char word to 3-byte codeword).
    Why not use a single byte for the most frequent words? And gain 2 x 106396 bytes in the case of "the" alone. Don't you think that using variable length encoding would be the way to go?


I used your compression idea and actually tried compressing text8. Although I'm not very certain yet how you deal with spaces between the words. I know from your comments above that a full stop is 3 bytes and any individual character is 3 bytes. So I supposed that you would encode spaces also as 3-byte codewords. Is that correct?
    If it is indeed correct then encoding the 100'000'000 byte text8 that contains only words and a single character (space) in between, your algorithm would encode it to 102'031'242 bytes.
    When your dictionary would not contain all the words from text8 then the result would be worse of course.
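
For reference, the arithmetic behind that number is simply one 3-byte codeword per word plus one per space; a back-of-the-envelope sketch (assuming text8 sits in the current directory):

Code:
# Size estimate under the scheme discussed: one 3-byte codeword per word
# and one per space. Assumes text8 has been downloaded to the current directory.
data = open("text8", "rb").read()      # 100'000'000 bytes, lowercase words + spaces
n_spaces = data.count(b" ")
n_words = len(data.split())
print(3 * (n_words + n_spaces))        # matches the 102'031'242 figure above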


    Last edited by Gotty; 17th September 2020 at 16:05. Reason: Fixed word counts

  21. Thanks:

    urntme (21st September 2020)

  22. #14
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    925
    Thanks
    57
    Thanked 116 Times in 93 Posts
I might have missed something, but is this not just the same as dictionary preprocessing of text?
Replacing all "known" words with a simple value before modeling is applied?

    This "compression also uses a set value cost for any thing despite occurence being different. so we are wasting a lot of bites compared to do a more cost/occurance analytics

    Instead of 3 buts per word i would do something likes this

    1bit flag for long or short workd
    if 1st bit is set means we only read that one but this leaves 7 bits for 128 lost cost words (they only cost 1 bytes)
    if 1st bit is not set we read the next 2 bytes as well this leaves us with 23 bits for 8Mill full cost words still more than plenty for te document claim of "171476 words "

    This would save 2 bytes on the 128 most occurent words
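
A small sketch of that layout (my quick illustration):

Code:
# 1-bit flag: top bit set = 1-byte code for the 128 most frequent words,
# top bit clear = 3-byte code with a 23-bit index (~8 million words).
def encode(index, frequent):
    if frequent:
        assert index < 128
        return bytes([0x80 | index])        # 1xxxxxxx
    assert index < 1 << 23
    return index.to_bytes(3, "big")         # 0xxxxxxx xxxxxxxx xxxxxxxx

print(encode(5, True).hex(), encode(171476, False).hex())   # 1 byte vs 3 bytes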



How are we dealing with capitalized words? Start capitalization? Full capitalization?

  23. Thanks:

    JamesWasil (17th September 2020)

  24. #15
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    103
    Thanks
    96
    Thanked 18 Times in 17 Posts
    25 years ago, when I was about 16 turning 17, I wrote what might be considered a variant of "Disha".

    It started out with 1 byte codes for the smallest words of the English language, but used an extendable prefix bit that was variable length, similar to a huffman code in a way, that determined the size of the bytes that represented the encoded word.

    The dictionary was divided to word sizes and output code sizes where each one had a separate directory (this was done initially in MSDOS on a 286).

    If the prefix bits grew to certain lengths, the directory would switch to larger databases for words, phrases, and entire sentences that were commonly used. If there were partial matches, there were indicator bit flags for larger words to where each letter could use a 1 to 2 bit uppercase or lowercase bit flag to make sure that having one letter capitalized or misspelled was still compressed if it was a heuristic match.

    While it did work and saved a vast amount of space for text files (more than Pkzip at the time which I was pleased with), it was horribly slow unless you used a RAM drive, and back then PCs were lucky to have 4MB, 8MB, or 16MB at most to work with commonly.

    Disha sounds like a 3 byte fixed version attempting to account for all word sizes that it encounters, but as SvenBent pointed out, that usually doesn't save much and can actually expand it due to the fact that words like "to" and "or" will be used where one byte will actually be gained rather than compressed, while other words that are common like "the" "and" or "how" will break even without a flag bit or any modifier and not give compression, but will get larger slowly if flag bits and descriptors are used.

    The majority of your text messages with 3 byte output codes will have to be at least 4 letters and larger for you to gain 1 byte or more per output for your compression sequences. Statistically, there are too many short words used more frequently and long words less frequently for text compression to benefit from that unless you use something dynamic like I was doing as a teenager experimenting with text compression, or you use another method that lets you adjust to variable sizes where you can still have gains with 2 and 3 letter words any other way.

    You could try to leverage the fact that most words have a SPACE or ASCII 32 suffix at the end of each word and represent that with an extended flag bit to save the difference and try to break even or compress more that way. Maybe extend it to other common symbols like commas, periods, question marks, exclamation marks, and CR+LF and save the difference on the bits to represent that since prefix bits will let you read and account for that before it tries to work with an output code even if you keep it static at 3 bytes.

  25. #16
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    103
    Thanks
    96
    Thanked 18 Times in 17 Posts
    As an example of the 2 word field only from the past, if the ASCII symbol A represented the word "Hi" on a text database:

    If you read a two letter word like "Hi" which could be uppercase or lowercase as 16 bits, the output would be:

    0001 (4 bits) + A (8 bits) = "HI" (all characters uppercase, without suffixes)

    0100 (4 bits) + A (8 bits) = "hi"
    01010 (5 Bits) + A (8 bits) = "hi-"
    01011 (5 bits) + A (8 bits) = "hi "
    01100 (5 bits) + A (8 bits) = "hi!"
    01101 (5 bits) + A (8 bits) = "hi?"
    01110 (5 bits) + A (8 bits) = "hi,"
    01111 (5 bits) + A (8 bits) = "hi (CR/LF)"

    001 (3 bits) = If the next 3 text letters were unknown symbols/letters, they were output as literals and were basically 9 bit codes rather than 8.

    00001 (5 bits) = If the word was 3 bytes or larger or a prefix of another word like "Hip" that existed on a dictionary a level up, step up to the next dictionary size and search for a word or data set that matches and use flag bits after this code for them instead of staying on the 2 text symbol level.

    00000 (5 bits) = Use a text-based LZ buffer that matches 2 text symbols from the last 128 characters that were read and output even if literal and no match was found.

    For all symbols that were 4 bits or larger starting with 0100, the first bit could be set to 0 or 1 or inverted, where 0 meant it was lowercase, and 1 meant the first letter was uppercase. If 1 was read first, it was automatically known to be a 4 bit code or larger. If a 0 was read, it could be a variable bit code from 3 bits to 6 bits.
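
Roughly, the bit costs in that table work out like this (just restating the numbers above, with 'A' standing in for the 8-bit dictionary symbol):

Code:
# Bit cost per entry: prefix bits + 8-bit dictionary symbol, vs. raw ASCII.
PREFIX = {"HI": "0001", "hi": "0100", "hi-": "01010", "hi ": "01011",
          "hi!": "01100", "hi?": "01101", "hi,": "01110"}
for text, prefix in PREFIX.items():
    coded = len(prefix) + 8                  # prefix + symbol 'A'
    print(repr(text), coded, "bits vs", 8 * len(text), "bits raw")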

    It could size up or size down on the dictionary dynamically based on the prefix codes, and go from entire sentences to 2 letter words and then back up to 5 to 10 letter words and sentences later based on the codes read ahead of time.

    For 2 letter words it only saved about 3 or 4 bits, but for 3 letter words and larger it saved a significant amount, and often things that PKZIP and other compressors weren't seeing as text patterns because they were designed to look for matches differently and not as words and English analyzed.

    The LZ buffer helped a lot too whenever there weren't matches, because it checked that first and avoided a lot of 9 bit literal output codes by not having to switch to that and read/expand the next 3 symbols by 1 bit each if they weren't seen anywhere on the current dictionary level or directory.

    There were other modifications and changes made to this a year or two later that were ahead of the above layout that I worked on when I turned 18 that made it even smaller, but by then I remember being cheesed about the fact that 7zip had just been released and was getting the text compression smaller than the above method was, even with all the neat additions to make it versatile.

    I decided that it was a lot of CPU time and file cost to keep doing it with marginal gains and differences between 7zip and LZMA large windows, and abandoned development on it after that. It could have been improved further though. By then it was using branch trees and other logic that sped it up enough to be useful and practical, but I moved on to other things by then.

    Maybe you can use a similar strategy to build and improve upon that for text compression with Disha? I know it's very old and the example is more of a basic layout from one of the first methods I used for it, but perhaps you could extend it and adapt it to your method if you decide to use static 3 byte codes still and make it useful to you?

    The nice thing about it too is that it can be made fast even if it uses larger dictionaries with the right logic to move and look-up quickly. That, and that it requires no statistical preprocessing and you can do it on-the-fly with network or file reads. There are built-in statistics to the English language that make this possible and favorable already.

    Back then the internet was new to the public and I was developing on my own as I went without references for anything, only comp.compression on USENET if I was lucky. But now, you could network with members here on this forum and elsewhere to really make a great text compressor with that and the ideas from the many users and contributions made here by others.

    If you think Disha can really be a strong text compressor, consider these ideas and keep at it. It may be with the right changes and planning.

  26. Thanks:

    urntme (21st September 2020)

  27. #17
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    First of all thank you all for your comments and your interest. Below are my replies.

    ----------

    Then this is one of the most similar algorithms to your idea: https://github.com/antirez/smaz
    It uses a static "fragment" dictionary for ascii texts (see the dictionary here: https://github.com/antirez/smaz/blob/master/smaz.c). Not full words as in your case. That's a difference.
    Wow. Thank you for sharing the link. It's quite an interesting concept. Plus it's got the code, so I can learn the code from it too. It's a really cool link. Thank you for sharing.

Oh, OK. Then what happens with the 3 spaces?

I feel that using a fixed-codeword approach for content that follows Zipf's law (English text) is (very) suboptimal. For such content variable-length codewords are much better.


    Let's see.
The 3 spaces are deleted and not stored. I mentioned this too in the algorithm. If you store the spaces you would actually add bytes.

This is because the average space taken by the uncompressed txt file is 4.7 bytes per word and 1 byte per space.

Since every word accompanies a space, this is around 5.7 bytes. However, we use 3 bytes per word and delete the space. As a rule, when the file is being output again, you insert a space after every word. If you store the space after every word, we would need 3 bytes per word + 3 bytes per space. That would mean 6 bytes per word + space combo, so we would actually lose space by storing them.

A line break does take 3 bytes, but that should be rare.
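
In rough numbers (a quick sketch of the arithmetic above):

Code:
avg_word = 4.7                  # average bytes per word in the plain text file
plain = avg_word + 1            # word + explicit space = 5.7 bytes
without_space = 3               # 3-byte codeword, space re-inserted on output
with_space = 3 + 3              # if every space were stored as its own codeword
print(plain, without_space, with_space)   # 5.7 vs 3 vs 6 bytes per word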

Going by your results, I'm actually surprised that the file grew less than I expected. You only saw an increase of about 2%, which is less than I would have thought.

But I believe you are right. If you used a variable-length codeword for each word, you would save more space, so in that sense you could say it is sub-optimal if your goal is only space savings. But that's not the point. The point is that we save time and overhead costs this way. The advantage is speed rather than the most optimal compression ratio.

    The file contains 253854 distinct words, all lowercase and only a single space in between the words (that's in favor of your algorithm).
    Almost half of the words is "the" (it occurs 106396 times), and almost half of the words occur just once (118519 of them). This distribution reflect Zipf's law.
    Hmmm. That's quite an interesting text file. Half the words is the!? What kind of text is that. Lol. It's funny.

    When you use 3 bytes for every word then all the "the" words will be encoded to 3 x 106396 bytes. And you didn't gain anything (3-char word to 3-byte codeword).
    Why not use a single byte for the most frequent words? And gain 2 x 106396 bytes in the case of "the" alone. Don't you think that using variable length encoding would be the way to go?
    The reason we are not using variable length coding is because we want simplicity and speed.

    However, you are probably right. If you used variable length coding, then you could possibly get more compression out of it. Undeniably. Especially for this particular text file you tested it on.

I used your compression idea and actually tried compressing text8. Although I'm not very certain yet how you deal with spaces between the words. I know from your comments above that a full stop is 3 bytes and any individual character is 3 bytes. So I supposed that you would encode spaces also as 3-byte codewords. Is that correct?
    If it is indeed correct then encoding the 100'000'000 byte text8 that contains only words and a single character (space) in between, your algorithm would encode it to 102'031'242 bytes.
    When your dictionary would not contain all the words from text8 then the result would be worse of course.
First of all, thank you for testing the concept. It's great. But I would kindly request that you post the results after testing it without storing the spaces because, as I mentioned above, storing them basically inflates the size. Also, please do mention the time it took to process it.

    -------------

I might have missed something, but is this not just the same as dictionary preprocessing of text?
Replacing all "known" words with a simple value before modeling is applied?
Well, the method is slightly different from that. We are using fixed 3-byte codes for each word. 2-byte codes have been suggested by others and by me, but it depends on the test file and the algorithm. This method gives great speed and is simpler to implement.

    This "compression also uses a set value cost for any thing despite occurence being different. so we are wasting a lot of bites compared to do a more cost/occurance analytics

    Instead of 3 buts per word i would do something likes this

    1bit flag for long or short workd
    if 1st bit is set means we only read that one but this leaves 7 bits for 128 lost cost words (they only cost 1 bytes)
    if 1st bit is not set we read the next 2 bytes as well this leaves us with 23 bits for 8Mill full cost words still more than plenty for te document claim of "171476 words "

    This would save 2 bytes on the 128 most occurent words
Yes. You are right. This would be one way of going about it. Thank you for sharing.

How are we dealing with capitalized words? Start capitalization? Full capitalization?
I have suggested that start-capitalized words and fully capitalized words would be stored as separate words in the dictionary. Since we are using a fixed number of bytes, there is enough room in the dictionary for them.

    --------------

    25 years ago, when I was about 16 turning 17, I wrote what might be considered a variant of "Disha".

    It started out with 1 byte codes for the smallest words of the English language, but used an extendable prefix bit that was variable length, similar to a huffman code in a way, that determined the size of the bytes that represented the encoded word.

    The dictionary was divided to word sizes and output code sizes where each one had a separate directory (this was done initially in MSDOS on a 286).

    If the prefix bits grew to certain lengths, the directory would switch to larger databases for words, phrases, and entire sentences that were commonly used. If there were partial matches, there were indicator bit flags for larger words to where each letter could use a 1 to 2 bit uppercase or lowercase bit flag to make sure that having one letter capitalized or misspelled was still compressed if it was a heuristic match.

    While it did work and saved a vast amount of space for text files (more than Pkzip at the time which I was pleased with), it was horribly slow unless you used a RAM drive, and back then PCs were lucky to have 4MB, 8MB, or 16MB at most to work with commonly.

    Disha sounds like a 3 byte fixed version attempting to account for all word sizes that it encounters, but as SvenBent pointed out, that usually doesn't save much and can actually expand it due to the fact that words like "to" and "or" will be used where one byte will actually be gained rather than compressed, while other words that are common like "the" "and" or "how" will break even without a flag bit or any modifier and not give compression, but will get larger slowly if flag bits and descriptors are used.

    The majority of your text messages with 3 byte output codes will have to be at least 4 letters and larger for you to gain 1 byte or more per output for your compression sequences. Statistically, there are too many short words used more frequently and long words less frequently for text compression to benefit from that unless you use something dynamic like I was doing as a teenager experimenting with text compression, or you use another method that lets you adjust to variable sizes where you can still have gains with 2 and 3 letter words any other way.

    You could try to leverage the fact that most words have a SPACE or ASCII 32 suffix at the end of each word and represent that with an extended flag bit to save the difference and try to break even or compress more that way. Maybe extend it to other common symbols like commas, periods, question marks, exclamation marks, and CR+LF and save the difference on the bits to represent that since prefix bits will let you read and account for that before it tries to work with an output code even if you keep it static at 3 bytes.
    Wow. That's quite a story. Thank you for sharing it. The ideas you present are quite interesting. Thank you for sharing them.

    ----------

    As an example of the 2 word field only from the past, if the ASCII symbol A represented the word "Hi" on a text database:

    If you read a two letter word like "Hi" which could be uppercase or lowercase as 16 bits, the output would be:

    0001 (4 bits) + A (8 bits) = "HI" (all characters uppercase, without suffixes)

    0100 (4 bits) + A (8 bits) = "hi"
    01010 (5 Bits) + A (8 bits) = "hi-"
    01011 (5 bits) + A (8 bits) = "hi "
    01100 (5 bits) + A (8 bits) = "hi!"
    01101 (5 bits) + A (8 bits) = "hi?"
    01110 (5 bits) + A (8 bits) = "hi,"
    01111 (5 bits) + A (8 bits) = "hi (CR/LF)"

    001 (3 bits) = If the next 3 text letters were unknown symbols/letters, they were output as literals and were basically 9 bit codes rather than 8.

    00001 (5 bits) = If the word was 3 bytes or larger or a prefix of another word like "Hip" that existed on a dictionary a level up, step up to the next dictionary size and search for a word or data set that matches and use flag bits after this code for them instead of staying on the 2 text symbol level.

    00000 (5 bits) = Use a text-based LZ buffer that matches 2 text symbols from the last 128 characters that were read and output even if literal and no match was found.

    For all symbols that were 4 bits or larger starting with 0100, the first bit could be set to 0 or 1 or inverted, where 0 meant it was lowercase, and 1 meant the first letter was uppercase. If 1 was read first, it was automatically known to be a 4 bit code or larger. If a 0 was read, it could be a variable bit code from 3 bits to 6 bits.

    It could size up or size down on the dictionary dynamically based on the prefix codes, and go from entire sentences to 2 letter words and then back up to 5 to 10 letter words and sentences later based on the codes read ahead of time.

    For 2 letter words it only saved about 3 or 4 bits, but for 3 letter words and larger it saved a significant amount, and often things that PKZIP and other compressors weren't seeing as text patterns because they were designed to look for matches differently and not as words and English analyzed.

    The LZ buffer helped a lot too whenever there weren't matches, because it checked that first and avoided a lot of 9 bit literal output codes by not having to switch to that and read/expand the next 3 symbols by 1 bit each if they weren't seen anywhere on the current dictionary level or directory.

    There were other modifications and changes made to this a year or two later that were ahead of the above layout that I worked on when I turned 18 that made it even smaller, but by then I remember being cheesed about the fact that 7zip had just been released and was getting the text compression smaller than the above method was, even with all the neat additions to make it versatile.

    I decided that it was a lot of CPU time and file cost to keep doing it with marginal gains and differences between 7zip and LZMA large windows, and abandoned development on it after that. It could have been improved further though. By then it was using branch trees and other logic that sped it up enough to be useful and practical, but I moved on to other things by then.
    Those are quite interesting ideas. Thank you for sharing them. The story is also quite interesting. Thank you for sharing.

    Maybe you can use a similar strategy to build and improve upon that for text compression with Disha? I know it's very old and the example is more of a basic layout from one of the first methods I used for it, but perhaps you could extend it and adapt it to your method if you decide to use static 3 byte codes still and make it useful to you?

    The nice thing about it too is that it can be made fast even if it uses larger dictionaries with the right logic to move and look-up quickly. That, and that it requires no statistical preprocessing and you can do it on-the-fly with network or file reads. There are built-in statistics to the English language that make this possible and favorable already.

    Back then the internet was new to the public and I was developing on my own as I went without references for anything, only comp.compression on USENET if I was lucky. But now, you could network with members here on this forum and elsewhere to really make a great text compressor with that and the ideas from the many users and contributions made here by others.

    If you think Disha can really be a strong text compressor, consider these ideas and keep at it. It may be with the right changes and planning.
    Yes. Who knows what will happen. Thank you for your comments and your encouragement. Thank you for sharing.

    ---------

    Thank you all for your comments and messages. Thank you all for the discussion.

  28. #18
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Quote Originally Posted by urntme View Post
    Hmmm. That's quite an interesting text file. Half the words is the!? What kind of text is that. Lol. It's funny.
    Ah, my bad, sorry. Made a fast calculation, and didn't think it through. It's 6%. Fixed my post.

    Quote Originally Posted by urntme View Post
The 3 spaces are deleted and not stored. I mentioned this too in the algorithm. If you store the spaces you would actually add bytes.
    [...]
    Since every word accompanies a space, this is around 5.7 bytes. However, we use 3 bytes per word and delete the space. As a rule, when the file is being output again, you insert a space after every word. If you store the space after every word, we would need 3 bytes per word + 3 bytes per space.
    I see. How do you encode and decode "O'Brian", "well-known", "higher/lower" etc.? You can't delete and insert a space in these cases.

How do you plan to encode multiple spaces? They are used for indentation in many plain text files.

I'm also interested in how you know where to insert or delete a space when there are (double or single) quotes. Example:
    'Your paper titled "Disha Format" is an interesting one.'
Which of the spaces are you going to remove (and add back during decompression) in this case? The ones adjacent to the quotes are ambiguously either to the left or to the right of a word.
The same is true for parentheses, brackets, and braces, but by establishing a rule (no space after "(", "[", "{") it's possible to handle most occurrences. But since opening and closing quotes are identical, it's not really possible to establish such replacement rules while meeting your goal of maximum speed.

  29. Thanks:

    urntme (21st September 2020)

  30. #19
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Quote Originally Posted by urntme View Post
    The reason we are not using variable-length coding is that we want simplicity and speed.
    I don't think variable length codes would be that much slower. But compression would be significantly better. What is your goal speed-wise? Do you have a concrete one? Would you want to gain some milliseconds and compress much worse or compress much better and lose some milliseconds?
    I believe simplicity is also not a problem. The algorithm is very simple in either case.

  31. Thanks:

    urntme (21st September 2020)

  32. #20
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Back to your initial request:
    Quote Originally Posted by urntme View Post
    I am looking for feedback on this algorithm. The feedback I am looking for is: Where can it be used? What are its possible applications?
    Your algorithm is specialized in compressing only (human-written) pure ASCII text very fast. Since we already know that it would not be able to compress (it would even bloat) any other content, it has a very limited use case.
    Unfortunately, I don't know of any field that would need something like that.
    If the algorithm were specialized for HTML or JavaScript or JSON compression...

  33. Thanks:

    urntme (21st September 2020)

  34. #21
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    925
    Thanks
    57
    Thanked 116 Times in 93 Posts
    I agree with gotty on many points.

    Once you have to add special flags for things that are not just words (punctuation, indents, quotes, acronyms, capitalization, and so on), you realize that adding one extra flag to allow shorter symbols for words really doesn't add a lot of extra complexity, as you have already done this several times over just for "error correction" in the code.


    The code is only simple/fast because it completely ignores a big chunk of its task.

  35. Thanks:

    urntme (21st September 2020)

  36. #22
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    I see. How do you encode and decode "O'Brian", "well-known", "higher/lower" etc.? You can't delete and insert a space in these cases.
    The hope is that these would be stored in the dictionary. But if they're not, they would bloat the file.

    How do you plan encoding multiple spaces? They are used for indentation in many clear text files.
    This would also bloat the file.

    I'm also interested in how you know where to insert or delete a space when there are (double or single) quotes. Example:
    'Your paper titled "Disha Format" is an interesting one.'


    Well, I was thinking it would handle this similarly to how compilers handle it: the first quote would not use a space on its right, and the second would not use a space on its left.

    Which of the spaces are you going to remove (and add back during decompression) in this case? The ones adjacent to the quotes are ambiguously either to the left or to the right of the word.
    The same is true for parentheses, brackets, and braces, but by establishing a rule (no space after "(", "[", "{") it's possible to handle most occurrences. But since opening and closing quotes look the same, it's not really possible to establish such replacement rules while meeting your goal of maximum speed.
    You bring up interesting points. But the easiest answer in all these cases is that it would bloat the file.

    I don't think variable length codes would be that much slower. But compression would be significantly better. What is your goal speed-wise? Do you have a concrete one?


    To be honest I don't really have a concrete speed in mind. I just thought it would be a simple/fast encoding method.

    Would you want to gain some milliseconds and compress much worse or compress much better and lose some milliseconds?
    I believe simplicity is also not a problem. The algorithm is very simple in either case.
    I suppose this would depend on the exact application in mind.

    Your algorithm is specialized in compressing only (human-written) pure ASCII text very fast. Since we already know that it would not be able to compress (it would even bloat) any other content, it has a very limited use case.
    Unfortunately, I don't know of any field that would need something like that.
    If the algorithm were specialized for HTML or JavaScript or JSON compression...
    I was actually thinking it could be used in instant messaging apps such as Twitter. Twitter would be a perfect example.

    I suppose for that the much simpler method would be just to use variable codes, a static dictionary, and encode all the spaces and all special characters in order of probability. But such a thing probably already exists.

    Due to the instant nature of these instant messaging apps, you want something light and quick to encode them.

    My method would be a good teaching tool, I feel, for learning how compression can work. It's simple to learn and simple to implement, so it could probably be used for that.

    Also, as you mentioned, it might be useful if it is specialized for coding applications such as HTML, JavaScript, etc.

    -------------------

    I agree with gotty on many points.

    Once you have to add special flags for things that are not just words (punctuation, indents, quotes, acronyms, capitalization, and so on), you realize that adding one extra flag to allow shorter symbols for words really doesn't add a lot of extra complexity, as you have already done this several times over just for "error correction" in the code.


    The code is only simple/fast because it completely ignores a big chunk of its task.
    Yes. You are probably right on this. Error correction could be a real problem if it is used in a generalized case.

    I just need to find an application where all the words in the dictionary have a somewhat equal probability of occurring. That would be an advantage for this. Plus the speed would also be an advantage if it's required somewhere.

    Hmmmm. Dunno. I need to think about this.

  37. #23
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by urntme View Post
    To be honest I don't really have a concrete speed in mind. I just thought it would be a simple/fast encoding method.
    [...]
    I was actually thinking it could be used in instant messaging apps such as Twitter. Twitter would be a perfect example.
    [...]
    Due to the instant nature of these instant messaging apps, you want something light and quick to encode them.
    It will likely be fast. Thanks to a fixed size index there are some optimization opportunities. If you're using variable length encoding, they will be a bit different. But for example (keep in mind my cache remark about the processor processing this data) you could fill your 3 index bytes a bit differently (creating groups of dictionaries). If the dictionary is sorted by typical frequency of the words in texts, the most common ones would group together (low index). If the dictionary is split into 256 64k sized dictionaries (with many of them not used), the first byte of your index could just select the dictionary and the next 2 bytes are an index into it. Or you use 64k 256-word dictionaries or alphabets. This way there would be a much more efficient use of memory while compressing/decompressing.
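    A rough sketch of that split (my own illustration; the group sizes and byte order are just assumptions): the first byte of a 3-byte code picks one of up to 256 sub-dictionaries and the remaining two bytes index up to 65,536 words inside it, so with a frequency-sorted master list only the low-numbered groups need to stay hot in cache for typical text.

    Code:
    # Sketch of splitting a frequency-sorted dictionary into up to 256 sub-dictionaries
    # of up to 65,536 words each (illustration only; sizes and byte order are assumptions).

    def make_code(word_rank: int) -> bytes:
        """word_rank is the word's position in the frequency-sorted master list."""
        group = word_rank >> 16            # high byte: which 64k sub-dictionary
        index = word_rank & 0xFFFF         # low two bytes: position inside that sub-dictionary
        return bytes([group, index >> 8, index & 0xFF])

    def split_code(code: bytes) -> tuple:
        group = code[0]
        index = (code[1] << 8) | code[2]
        return group, index

    # The most frequent words all land in group 0, so only that sub-dictionary
    # has to be resident in cache for most of the input.
    print(split_code(make_code(5)))        # (0, 5)
    print(split_code(make_code(70_000)))   # (1, 4464)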

    But instant messaging has a different meaning - and very small text sizes. SMAZ is aimed at this task. But just imagine what happens: A ~320 byte message every n minutes (or seconds at best). The processor's cores did a lot of other stuff already in the meantime, clearing the dictionaries from the cache. Reloading 2+MB of data each time to compress ~200 Bytes on average is costly.

    Quote Originally Posted by urntme View Post
    I just need to find an application where all the words in the dictionary have a somewhat equal probability of occurring. That would be an advantage for this. Plus the speed would also be an advantage if it's required somewhere.
    This might be the case for scientific or technical texts.

  38. Thanks:

    urntme (21st September 2020)

  39. #24
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    It will likely be fast. Thanks to a fixed size index there are some optimization opportunities. If you're using variable length encoding, they will be a bit different. But for example (keep in mind my cache remark about the processor processing this data) you could fill your 3 index bytes a bit differently (creating groups of dictionaries). If the dictionary is sorted by typical frequency of the words in texts, the most common ones would group together (low index). If the dictionary is split into 256 64k sized dictionaries (with many of them not used), the first byte of your index could just select the dictionary and the next 2 bytes are an index into it. Or you use 64k 256-word dictionaries or alphabets. This way there would be a much more efficient use of memory while compressing/decompressing.
    This is a really cool idea. Thanks for sharing.

    But instant messaging has a different meaning - and very small text sizes. SMAZ is aimed at this task. But just imagine what happens: A ~320 byte message every n minutes (or seconds at best). The processor's cores did a lot of other stuff already in the meantime, clearing the dictionaries from the cache. Reloading 2+MB of data each time to compress ~200 Bytes on average is costly.
    That is true. You make some really interesting points. Thank you for your comments.

    However, what I'm thinking now is a compromise of sorts. We use variable-length encoding, but we fix the size of the codes in bytes instead of bits. That is, we use 1-3 bytes for each code, so symbols like spaces and dots would get 1 byte, while words like "table" or "manager" would be 3 bytes. We get speed because we are not working at the bit level, and we still get compression because even the longer words take at most 3 bytes. This compromise would give great speed, though we would lose some compression ratio and dictionary size. But we would also have no formatting errors in the process.

    About your point that loading the data is costly for instant messaging: I was thinking that, as the user types the message, the program in the background works out which dictionaries it would need to compress the data and loads only those while the message is being typed. As you said, we could have multiple indexed dictionaries and multiple sets of small dictionaries.

    I don't think this has been done before. It would provide speed, and it would be efficient.

    Thoughts?

  40. #25
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by urntme View Post
    However, what I'm thinking now is a compromise of sorts. We use variable-length encoding, but we fix the size of the codes in bytes instead of bits. That is, we use 1-3 bytes for each code, so symbols like spaces and dots would get 1 byte, while words like "table" or "manager" would be 3 bytes. We get speed because we are not working at the bit level, and we still get compression because even the longer words take at most 3 bytes. This compromise would give great speed, though we would lose some compression ratio and dictionary size. But we would also have no formatting errors in the process.
    I thought about creating a simple prototype (with a small number of lines of code) eventually, so we could test it.
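    As a first cut, the byte-aligned codes could look something like this (a sketch only: the tier sizes of 192 one-byte and 8,192 two-byte codes are arbitrary assumptions, and a real prototype would still need an escape code for words that are not in the dictionary):

    Code:
    # Sketch of byte-aligned variable-length codes: 1-3 bytes per symbol rank.
    # Tier sizes are arbitrary: ranks 0..191 cost 1 byte, the next 8,192 ranks
    # cost 2 bytes (prefix 0xC0..0xDF), everything else 3 bytes (prefix 0xE0..0xFF).

    ONE, TWO = 192, 8192

    def encode_rank(rank: int) -> bytes:
        if rank < ONE:                       # most frequent symbols: 1 byte
            return bytes([rank])
        rank -= ONE
        if rank < TWO:                       # next tier: 2 bytes
            return bytes([0xC0 | (rank >> 8), rank & 0xFF])
        rank -= TWO                          # everything else: 3 bytes
        return bytes([0xE0 | (rank >> 16), (rank >> 8) & 0xFF, rank & 0xFF])

    def decode_rank(buf: bytes, pos: int):
        """Returns (rank, next_position)."""
        b = buf[pos]
        if b < 0xC0:
            return b, pos + 1
        if b < 0xE0:
            return ONE + (((b & 0x1F) << 8) | buf[pos + 1]), pos + 2
        return ONE + TWO + (((b & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]), pos + 3

    # Round-trip check across the tier boundaries:
    for r in (0, 191, 192, 8383, 8384, 2_000_000):
        assert decode_rank(encode_rank(r), 0)[0] == r
    print("round-trip ok")

    With a frequency-sorted dictionary this keeps the most common words and punctuation at 1-2 bytes while still being a simple byte-level loop.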

    Quote Originally Posted by urntme View Post
    About your point that loading the data is costly for instant messaging: I was thinking that, as the user types the message, the program in the background works out which dictionaries it would need to compress the data and loads only those while the message is being typed. As you said, we could have multiple indexed dictionaries and multiple sets of small dictionaries.

    I don't think this has been done before. It would provide speed, and it would be efficient.
    That sounds plausible to me. Detecting the best-fitting dictionary sounds interesting. With the dictionaries split up, there are many interesting ways to optimize how they are loaded.

  41. Thanks:

    urntme (21st September 2020)

  42. #26
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    @urntme
    I implemented your algorithm with 3-byte codewords and automatic space insertion and built an optimal dictionary per file to find out the best compression theoretically possible. So
    - I don't have a general dictionary. The aim is to find the theoretical limit per file.
    - It is not optimized for speed - readability is more important at this stage.
    - Speed: it encodes a 100 MB file in 1 sec and decodes it in half a sec (tested with text8). It encodes/decodes anything small in way less than a millisecond. A tweet message would take a couple of microseconds.
    - However, compression-wise it cannot get better than 50%. That seems to be the average best case (average theoretical limit), even when using an optimal dictionary. Unfortunately, when there's a word or chunk that is not in the dictionary, it seriously bloats the file. Any quote, comma, bracket, full stop, or typo (or an unknown word) is a serious show stopper, and with such things in the content to be compressed the algo quite often loses (i.e. size > 100%).

    So its performance (compression ratio and speed) in the optimal case is similar to an order-0 model (like a speed-optimized fpaq variant), but the latter 1) is not limited to files containing English text only and 2) uses way less memory (does not need a dictionary).
    Without sacrificing speed for some cleverness, it does not look promising so far. Would you consider refining it? You've got plenty of ideas from us above. I could go ahead and refine it myself, but you are the author, it's your child.
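    For anyone who wants to reproduce the rough numbers, here is a simplified sketch of the experiment (my own illustration, not the attached implementation): build the dictionary from the file itself, charge 3 bytes per space-separated word with the spaces left implicit, and compare against the original size.

    Code:
    # Simplified sketch of the experiment (not the attached implementation):
    # per-file "optimal" dictionary, one fixed 3-byte code per word, spaces implicit.

    def best_case_ratio(text: str) -> float:
        words = text.split(" ")              # spaces are dropped; the decoder re-inserts them
        dictionary = set(words)              # the per-file dictionary
        if len(dictionary) > (1 << 24):
            raise ValueError("more than 2^24 distinct words will not fit into 3-byte codes")
        compressed_size = 3 * len(words)     # 3 bytes per word, dictionary stored separately
        return compressed_size / len(text.encode("utf-8"))

    sample = ("the quick brown fox jumps over the lazy dog " * 1000).strip()
    print(f"{best_case_ratio(sample):.2f}")  # ~0.61 here, because these words are short
    # Typical English averages about 4.7 letters per word plus a space (~5.7 bytes),
    # so the best case is roughly 3 / 5.7 = 0.53, i.e. the ~50% ceiling reported above.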

    You will also need to update your pdf file. Also include the details we were discussing above when we were trying to understand your idea better. You now know what information we were missing. You will also need to include (describe) an algorithm for how you would build a dictionary for the general case.

    It's your turn ;-)
    Attached Files
    Last edited by Gotty; 19th September 2020 at 12:21.

  43. Thanks (2):

    JamesWasil (19th September 2020),urntme (21st September 2020)

  44. #27
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    I implemented your algorithm with 3-byte codewords and automatic space insertion and built an optimal dictionary per file to find out the best compression theoretically possible. So
    - I don't have a general dictionary. The aim is to find the theoretical limit per file.
    - It is not optimized for speed - readability is more important at this stage.
    - Speed: it encodes a 100 MB file in 1 sec and decodes it in half a sec (tested with text8). It encodes/decodes anything small in way less than a millisecond. A tweet message would take a couple of microseconds.
    - However, compression-wise it cannot get better than 50%. That seems to be the average best case (average theoretical limit), even when using an optimal dictionary. Unfortunately, when there's a word or chunk that is not in the dictionary, it seriously bloats the file. Any quote, comma, bracket, full stop, or typo (or an unknown word) is a serious show stopper, and with such things in the content to be compressed the algo quite often loses (i.e. size > 100%).

    So its performance (compression ratio and speed) in the optimal case is similar to an order-0 model (like a speed-optimized fpaq variant), but the latter 1) is not limited to files containing English text only and 2) uses way less memory (does not need a dictionary).
    Without sacrificing speed for some cleverness, it does not look promising so far. Would you consider refining it? You've got plenty of ideas from us above. I could go ahead and refine it myself, but you are the author, it's your child.

    You will also need to update your pdf file. Also include the details we were discussing above when we were trying to understand your idea better. You now know what information we were missing. You will also need to include (describe) an algorithm for how you would build a dictionary for the general case.

    It's your turn ;-)
    This is quite a headturner. Thank you so much for all your efforts.

    I'm going to need some time to think about all of this and process it. Please give me a couple of days to respond.

  45. #28
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    @urntme
    I implemented your algorithm with 3-byte codewords and automatic space insertion and built an optimal dictionary per file to find out the best compression theoretically possible. So
    - I don't have a general dictionary. The aim is to find the theoretical limit per file.
    - It is not optimized for speed - readability is more important at this stage.
    - Speed: it encodes a 100 MB file in 1 sec and decodes it in half a sec (tested with text8). It encodes/decodes anything small in way less than a millisecond. A tweet message would take a couple of microseconds.
    - However, compression-wise it cannot get better than 50%. That seems to be the average best case (average theoretical limit), even when using an optimal dictionary. Unfortunately, when there's a word or chunk that is not in the dictionary, it seriously bloats the file. Any quote, comma, bracket, full stop, or typo (or an unknown word) is a serious show stopper, and with such things in the content to be compressed the algo quite often loses (i.e. size > 100%).


    First of all, thank you for putting the time and effort into implementing this algorithm. It means a lot to me. I actually find this information very good. It means the algorithm works the way it's supposed to. That's great news.

    Now about refinement, let's look into it below:

    So its performance (compression ratio and speed) in the optimal case is similar to an order-0 model (like a speed-optimized fpaq variant), but the latter 1) is not limited to files containing English text only and 2) uses way less memory (does not need a dictionary).
    Without sacrificing speed for some cleverness, it does not look promising so far. Would you consider refining it? You've got plenty of ideas from us above. I could go ahead and refine it myself, but you are the author, it's your child.

    You will also need to update your pdf file. Also include the details we were discussing above when we were trying to understand your idea better. You now know what information we were missing. You will also need to include (describe) an algorithm for how you would build a dictionary for the general case.

    It's your turn ;-)
    I have spent some time thinking about what changes could be made and I wrote up a document on it. I called it Disha Version 2. I have attached the file to this post. Kindly take a look and let me know your thoughts.

    This doesn't have to be the final version of it. It can go through more refinements.

    My goal is to contribute something of worth to the community at the very least. So if by doing some refinements it can be a worthy contribution to the community, then that would be a great thing for me.

    If we can all put our heads together, collaborate, and make something new, interesting, and worthwhile for the community, then I feel that would be a great reward for our efforts in itself. So please let me know if you would be interested in collaborating. If you are, then we can discuss ideas, debate, come to some common ground, and make something useful. I have passion and drive, and I can contribute time to writing a detailed paper on the algorithm we finally decide upon; you all have experience, so I feel it would make a good combination.

    Thank you all for all your time and patience. Take care.
    Attached Files

  46. #29
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Quote Originally Posted by urntme View Post
    If we can all put our heads together, collaborate, and make something new, interesting, and worthwhile for the community, then I feel that would be a great reward for our efforts in itself. So please let me know if you would be interested in collaborating. If you are, then we can discuss ideas, debate, come to some common ground, and make something useful. I have passion and drive, and I can contribute time to writing a detailed paper on the algorithm we finally decide upon; you all have experience, so I feel it would make a good combination.
    From my side, I can see your passion and your will to do something interesting and unique.
    It's a very good start. Passion is an absolute must to reach greatness, but it's not enough.

    I have a strong feeling that before you go on you'll need to attain more knowledge in algorithms and learn some programming. You will need to learn and read more about data compression, too. You will then probably know how to describe an algorithm properly and how to be more specific.
    If you do intend to publish it sometime in the future you'll need to learn some scientific writing. Your document is not there yet. (Probably you know that.)

    Collaboration works only if we are "equal" in some sense. What do you bring to the table? Until now it's only an idea. Not very specific, so I'd even say it's a "vague idea". Please read those links above again. Read the source code of the open-source software that is similar to yours. Learn their advantages and disadvantages. Then you'll know where you would "break into the market".
    In v2 you mention that "The difference between this algorithm and others is that instead of using code words at the bit level, byte level code words are used". Homework: you'll really need to read those links above again.

    But most importantly: if you learned some programming you could give birth to your child. Please don't expect a collaboration where you drop in an idea without specific details and someone else eventually sits down, takes the time, elaborates the missing parts, and creates your software, which would then not really be yours. It may end up very far from your idea. Implementing V1 was (fortunately) almost straightforward. But V2 is not.

    You'll need to keep experimenting, failing, experimenting more, and failing more until you get better and better, and eventually become an expert.
    Then you will be able to create a unique compression algorithm. If you invest your time, I believe you'll be there. Your passion will get you there.
    We are here, we can help you, we can spot missing pieces, and we can point you in different directions, but please don't expect us to elaborate the details of your compression idea; you'll need to do that.

  47. #30
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Disha V2 was written just too quickly. Please spend more time on it; don't rush. Start again.

    - I would like to suggest that you phrase v2 in such a way that it does not reference anything in v1. Even if you are not intending to publish it anymore, you would make it easier for a future reader.
    - Be more specific. V2 is still formulated as an idea and not a concrete algorithm. You will need to write down concretely...

    ...Define exactly what the symbols are. Only characters and words? Or expressions ("United States of America")? Fragments? Combinations of characters? "http://" or ".com"?
    ...How do you handle capitalization?
    ...How the dictionary would be constructed - this one is not straightforward - there are many ways. You'd like to create one or more static dictionaries, right? How exactly? What decides if a word would be a part of one of them? Where would you get your corpus from?
    ...How compression would work - this one is also not straightforward: how do you imagine splitting the input file into its parts (words)? An example:

    When you have found "the" in your input and it's in the dictionary, would you go on processing more characters in the hope that there will be a longer match?
    Let's say you do. And you are lucky: it's actually "there" and it is still in the dictionary. Nice. But wait.
    If you just did that, you will find that the following characters are "scue". Hm, there is no such word. What happened? Ah, it is "www.therescue.com", and you should really have stopped at "the".
    So let's stop early then (after "the"). But now you will find "then", and that ("the" + "n") will be expensive; you should not have stopped this time.
    See the problem? It's not straightforward.

    Should it be greedy, or should it do some backtracking? Should it try multiple ways and decide based on some calculated metric? Splitting a text optimally into parts is computationally expensive, so you would probably need heuristics (except when doing it greedily).
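    A tiny greedy longest-match splitter makes the trade-off visible (an illustration only; the four-word dictionary is made up):

    Code:
    # Greedy longest-match splitting (illustration only; the dictionary is made up).

    DICT = {"the", "then", "there", "rescue"}
    MAX_LEN = max(len(w) for w in DICT)

    def greedy_split(text: str) -> list:
        out, i = [], 0
        while i < len(text):
            for length in range(min(MAX_LEN, len(text) - i), 0, -1):  # longest match first
                if text[i:i + length] in DICT:
                    out.append(text[i:i + length])
                    i += length
                    break
            else:                            # no dictionary match here: emit a 1-char literal
                out.append(text[i])
                i += 1
        return out

    print(greedy_split("therescue"))  # ['there', 's', 'c', 'u', 'e'] - four costly literals
    print(greedy_split("then"))       # ['then']                      - greedy was right here
    # Trying alternative split points fixes "therescue" ("the" + "rescue"),
    # but that backtracking costs time, which is exactly the trade-off above.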

    The current description is too general; there are too many ways to implement it. If you asked 10 programmers to implement it, there would probably be 10 different results. So we need more specific details, please.
