
Thread: A Data Compression Algorithm that uses Bit Recycling

  1. #1
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts

    A Data Compression Algorithm that uses Bit Recycling

    Hello all,

    This is my second post on this forum. I also posted about the Disha format in a previous post.

This algorithm can save bits on a network with a high amount of traffic. I have attached a paper I wrote about it to this thread.

Feedback, thoughts, and comments are all welcome.

I hope you find it interesting and that it makes you think. One thing in particular I would be interested in knowing is whether you think there are any venues that might be interested in publishing this.

    Best Regards,
Urntme.

  2. #2
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
Thanks for posting your ideas. After a first quick skim I didn't get your file compression idea; I'm going to read it more carefully later.

    Your reduced state encoding reminds me of "phased-in binary codes", also called "flat code" by some. You could check this thread here: https://encode.su/threads/3419-Are-t...ll=1#post65961
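For context, here is a minimal sketch of how phased-in (flat) codes work, assuming an alphabet of n symbols where n is not a power of two. The function and the n = 5 example are my own illustration, not something from the attached paper.

Code:
# Phased-in (flat) binary codes: with n symbols and k = floor(log2(n)),
# the first 2**(k+1) - n symbols get k-bit codes and the rest get
# (k+1)-bit codes; the code stays prefix-free and wastes less than one
# bit per symbol compared to using k+1 bits for everything.

def phased_in_code(symbol: int, n: int) -> str:
    k = n.bit_length() - 1          # floor(log2(n))
    short_count = 2 ** (k + 1) - n  # symbols that receive the short code
    if symbol < short_count:
        return format(symbol, f"0{k}b")
    # long codes are shifted past the short block to stay prefix-free
    return format(symbol + short_count, f"0{k + 1}b")

# n = 5: symbols 0-2 -> '00', '01', '10'; symbols 3-4 -> '110', '111'
print([phased_in_code(s, 5) for s in range(5)])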

  3. Thanks:

    urntme (15th September 2020)

  4. #3
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
Thank you very much for your comment. However, I'm not sure where I've used "phased-in binary codes." As far as my understanding goes, I haven't used them in either of the algorithms.

The Disha Algorithm uses fixed-size binary codes, which I mentioned in the document as 3 bytes, and does not use any type of variable-length encoding. So I'm confused by what you mean here. Perhaps I'm not getting it.

In the Data Compression Algorithm, I use unused states in the file overhead to save a few bytes in the payload. So, again, I'm not using "phased-in binary codes" here. So I'm confused.

    Could you please clarify?

  5. #4
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    554
    Thanks
    356
    Thanked 356 Times in 193 Posts
    Hm. I don't understand this document.

    I'll grab only two sentences that caught my attention.

>>The data compression algorithm talked about in this paper can be thought of as a random compression algorithm, since it compresses random data in a random manner
Oops. What do you mean? You probably know that you cannot compress random data.

>>Normally compressed files are converted to a single file extension, for example ".zip", but in our case, the more extensions or suffixes we have, the more compression we can achieve, since that essentially corresponds to more states available.
Are you removing bytes from the file content and adding more and more extensions to the name of the file at the same time? If that is the case, you are just moving information from one place to another. How does it save space then?

  6. #5
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    First of all, thank you for your comment.

I tried to make the document as readable as possible, but perhaps I haven't done the best job of it. I hope the following statements clear up some of your questions.

Oops. What do you mean? You probably know that you cannot compress random data.
Yes. You are right. One cannot compress random data. But this compresses random data. I'm sorry for being confusing, but it just depends on your perspective here. It all depends on whether you consider what we are compressing to be "random data" or not.

Are you removing bytes from the file content and adding more and more extensions to the name of the file at the same time? If that is the case, you are just moving information from one place to another. How does it save space then?
Yes. You have probably got it. We are moving bytes of a file and "changing" the extension. We are not adding extensions to the existing name. There is a difference here.

Okay, so here I must say that there is a distinct difference between how "overheads", or in this case the "file name", are handled and how the "data" is handled. A data packet or file can be of any size within a range, but the overhead is always of a fixed size. Whenever you need to open or transmit a file or data packet, you need to send this overhead, and you cannot reduce its size. So any amount of information you add to the overhead does not increase the overall size of the file or data packet, but since the overhead can only be of a fixed size, you obviously have a limit on how much information can be added to it.

In this case we are changing the file extension, which is a part of the overhead. Different operating systems handle file extensions differently: some allocate space for the extension separately, and some allocate space for the entire file name as one chunk. But no matter what the file name is (and we cannot change the file name itself, because we don't want to alter a user's data), we can change the file extension, because a lot of file extensions go unused. These are essentially unused states in the file overhead that we are using to reduce the size of the overall data.
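To make that concrete, here is a toy sketch of the idea in Python. The ".x00"-style extension names and the byte-to-extension table are invented here purely for illustration; they are not from the paper.

Code:
# Toy model of the extension trick: one otherwise-unused extension per
# possible byte value, so changing the extension can carry one byte
# stripped from the end of the file.

EXT_FOR_BYTE = {b: f".x{b:02x}" for b in range(256)}  # e.g. 0x41 -> ".x41"
BYTE_FOR_EXT = {ext: b for b, ext in EXT_FOR_BYTE.items()}

def shrink(data: bytes) -> tuple:
    """Remove the last byte and encode it in the (changed) extension."""
    return data[:-1], EXT_FOR_BYTE[data[-1]]

def restore(data: bytes, ext: str) -> bytes:
    """Re-append the byte named by the extension."""
    return data + bytes([BYTE_FOR_EXT[ext]])

payload, ext = shrink(b"hello!")
assert restore(payload, ext) == b"hello!"  # round-trips losslessly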

This makes more sense if you look at it through the IPv4 protocol.

[Image: IPv4-Header-1.png, the IPv4 protocol header]

Notice the "Protocol" field: it holds the protocol number. You can check the assigned protocol numbers at this link:

    https://www.iana.org/assignments/pro...-numbers.xhtml

The protocol numbers from 144 to 252 are left unassigned and are not used in any way. These are basically like file extensions for the protocols being transmitted. So we use the unused protocol numbers to compress a part of the data packet being sent. These are the unused states of the overhead in this case.

Now, if you want to transmit a packet, you need to transmit the protocol number. The system doesn't care what the number is, so we use the unused protocol numbers to transfer a part of the file and save ourselves some space.

I hope all this makes sense, but I think you've probably already got the essence of it. We do, in a way, transfer bytes from the data packet to the file name, and this does save us some space.

I hope this clears up any doubts and confusion about what I am saying here.

    Thank you for your comments. I hope the discussion continues.
    Last edited by urntme; 16th September 2020 at 05:53.

  7. #6
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by urntme View Post
Yes. You have probably got it. We are moving bytes of a file and "changing" the extension. We are not adding extensions to the existing name. There is a difference here.

Okay, so here I must say that there is a distinct difference between how "overheads", or in this case the "file name", are handled and how the "data" is handled. A data packet or file can be of any size within a range, but the overhead is always of a fixed size. Whenever you need to open or transmit a file or data packet, you need to send this overhead, and you cannot reduce its size. So any amount of information you add to the overhead does not increase the overall size of the file or data packet, but since the overhead can only be of a fixed size, you obviously have a limit on how much information can be added to it.

In this case we are changing the file extension, which is a part of the overhead. Different operating systems handle file extensions differently: some allocate space for the extension separately, and some allocate space for the entire file name as one chunk. But no matter what the file name is (and we cannot change the file name itself, because we don't want to alter a user's data), we can change the file extension, because a lot of file extensions go unused. These are essentially unused states in the file overhead that we are using to reduce the size of the overall data.
There are practical problems with using filenames, as they'd only save a little space, and only while the files stay on the same kind of filesystem. If the files are sent over a network (e.g. from a NAS) or are compressed and decompressed, this information would get lost, unless it's implemented in the OS itself. But I'm sure that engineers in the past kept an eye on not wasting too many bytes anywhere. In the end, there are sector sizes (e.g. 512 bytes or 4096 bytes) and cluster sizes (e.g. 32 KB). If this method doesn't reduce the file size enough to save one block or cluster, there is no benefit. Even Windows' internal (and transparent) file compression methods (XPRESS4K, XPRESS8K, XPRESS16K and LZX) won't be used to compress a file if no block or cluster is freed; there is no point when it just adds overhead in processing the files.


    Quote Originally Posted by urntme View Post
    The protocol numbers from: 144 - 252 are left unassigned and are not used in any way. These are basically like file extensions for the protocols being transmitted. So we use the unused protocol numbers to compress a part of the data packet being sent. These are the unused states of the file overhead in this case.

    Now if you want to transmit a protocol. You need to transmit the protocol number. The system doesn't care what the number is, so we used the unused protocol numbers to transfer a part of the file and save us some space.
I hope you didn't mean to replace the protocol number completely, as it's being used elsewhere, e.g. in routers or firewalls.

But if you'd use the unused protocol list entries to mirror existing protocol numbers, you might transmit one bit in the header. This is very similar to phased-in code, or flat binary code as Charles Bloom calls it, but inverted: adding one bit of information instead of removing one redundant bit of information. See http://cbloomrants.blogspot.com/2013...bits-flat.html. Applying this idea to the protocol codes would mean: blindly mapping protocols 0 - 108 to 144 - 252 (not looking for the most used ones) could transmit one bit, as sketched after the list below:

• protocol no. is in [0,108]: the bit is '0' and the original protocol no. is unchanged
• protocol no. is in [144,252]: the bit is '1' and the original protocol no. is the protocol no. minus 144
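Here is a minimal sketch of that mapping in Python; this is my own illustration of the scheme just described, not a tested implementation.

Code:
# One-bit "recycling" via the protocol number: the unassigned range
# 144-252 mirrors the assigned range 0-108, so which range the number
# falls in carries one extra payload bit.

OFFSET = 144  # start of the unassigned range that mirrors protocols 0-108

def encode(proto: int, bit: int) -> int:
    """Fold one payload bit into the protocol number (proto must be 0-108)."""
    assert 0 <= proto <= 108
    return proto + OFFSET if bit else proto

def decode(wire_proto: int) -> tuple:
    """Recover (original protocol number, hidden bit)."""
    if wire_proto >= OFFSET:
        return wire_proto - OFFSET, 1
    return wire_proto, 0

assert decode(encode(6, 1)) == (6, 1)    # TCP carrying a hidden '1'
assert decode(encode(17, 0)) == (17, 0)  # UDP carrying a hidden '0'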


    But in the end this might cause problems in processing the packets.

This also reminds me a bit of hiding information in images etc. (steganography).

  8. #7
    Member
    Join Date
    Sep 2020
    Location
    India
    Posts
    18
    Thanks
    18
    Thanked 0 Times in 0 Posts
    First of all thank you for your comment. It was really interesting and insightful.

There are practical problems with using filenames, as they'd only save a little space, and only while the files stay on the same kind of filesystem. If the files are sent over a network (e.g. from a NAS) or are compressed and decompressed, this information would get lost, unless it's implemented in the OS itself. In the end, there are sector sizes (e.g. 512 bytes or 4096 bytes) and cluster sizes (e.g. 32 KB). If this method doesn't reduce the file size enough to save one block or cluster, there is no benefit. Even Windows' internal (and transparent) file compression methods (XPRESS4K, XPRESS8K, XPRESS16K and LZX) won't be used to compress a file if no block or cluster is freed; there is no point when it just adds overhead in processing the files.
Yes. You could be right about that. I was only looking at enterprises where there are a lot of files. For example, in places where they store a lot of files of the same type, each file could save a byte or two for very little overhead cost; if the number of files is large enough, the space savings could be very significant, and all for very little overhead.

You bring up an interesting point about clusters. Certainly, saving a byte or two wouldn't matter if no cluster is actually freed; however, imagine a scenario where a file is only a byte or two past a cluster boundary and would require an entire extra cluster just because of those one or two bytes. Then by saving those two bytes, one could free the entire cluster, as the quick calculation below shows. This might happen rarely, certainly; however, at scale, the savings could be significant at very little overhead cost. This might make it worth it for some enterprises.
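To put rough numbers on that scenario (illustrative sizes only, assuming a 4096-byte cluster):

Code:
import math

CLUSTER = 4096  # assumed cluster size in bytes

def clusters(size: int) -> int:
    """Clusters needed to store a file of the given size."""
    return math.ceil(size / CLUSTER)

# A file one byte past a cluster boundary occupies an almost-empty
# cluster; trimming that single byte frees the whole 4096-byte cluster.
assert clusters(8193) == 3
assert clusters(8192) == 2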

Furthermore, saving space on a server is not really where the money is saved, since storage costs are quite low. The real savings come from bandwidth costs at scale. Imagine a million people accessing the same file, which is fairly common these days on many social media platforms; even two bytes saved per transfer adds up to about 2 MB across a million downloads, and that compounds across a large catalogue. The savings then are worth it, I feel.

Moreover, there are many places where people want to save as much space as they can. This could provide a means to do that.

The advantage this algorithm offers is that it is file-type agnostic; also, it can compress a file that has already been compressed to its limits, at very little, almost negligible, overhead cost. So I feel there could be uses for it.

If the files are sent over a network (e.g. from a NAS) or are compressed and decompressed, this information would get lost, unless it's implemented in the OS itself.
I'm confused about what you mean when you say this information would be lost. Why would the information be lost while being sent over a network? This is something I don't quite understand.

Also, if it's being compressed, then this should be the final stage; after all, it can compress even already-compressed files. The file type doesn't matter here.

I hope you didn't mean to replace the protocol number completely, as it's being used elsewhere, e.g. in routers or firewalls.

But if you'd use the unused protocol list entries to mirror existing protocol numbers, you might transmit one bit in the header. This is very similar to phased-in code, or flat binary code as Charles Bloom calls it, but inverted: adding one bit of information instead of removing one redundant bit of information. See http://cbloomrants.blogspot.com/2013...bits-flat.html. Applying this idea to the protocol codes would mean: blindly mapping protocols 0 - 108 to 144 - 252 (not looking for the most used ones) could transmit one bit:

• protocol no. is in [0,108]: the bit is '0' and the original protocol no. is unchanged
• protocol no. is in [144,252]: the bit is '1' and the original protocol no. is the protocol no. minus 144

But in the end this might cause problems in processing the packets.
    Yes. That is one way to implement it.

You say that this might cause problems when processing the packets, but I wonder why. What problem would there be in handling additional protocol numbers? Surely it's not a fixed list; they could add protocol numbers at any time. Why would routers or firewalls react to that? Please elaborate.

Certainly, both ends of the connection need to be compatible enough to know that they are doing this, but that can be done in a slow and phased manner. Perhaps not implemented throughout the entire internet at once, but slowly, over time. All networks evolve, after all. And in fact, as traffic and our consumption of data increase, the savings this would provide only grow.

    I believe this is a feature the future internet and all networks should have as these systems evolve.

Moreover, the greatest savings are achieved by using this algorithm at all layers.

    High traffic servers could definitely get a lot of use out of this.

This also reminds me a bit of hiding information in images etc. (steganography).
    That is an interesting comparison.

    Thank you for your comment.
    Last edited by urntme; 16th September 2020 at 14:41.

