
Thread: lossless data compression method for all digital data types

  1. #61 rarkyan (Member)
    OK, Mr. Matt. Thank you for the file.
    This is how my compressor idea should work:

    1. Open the file in a hex editor (the compressor should be able to read any file in hex-editor mode).
    2. This is the screenshot of your file in the hex editor:


    3. The program must have a Database ID, or Occurrence ID, ready before it starts to search for occurrences of each of the 256 possible byte values.
    4. Now the program starts searching for every byte occurrence, for example like this:


    5. As we can see, there are occurrences of 00. The program needs to search for all of the byte values; there are 256 of them to search, from 00 to ff.
    6. After searching for the occurrences in one column of n rows, the program writes the Database ID, or Occurrence ID, into the output table. See my earlier output idea.
    7. The key is the Occurrence ID. Once you create the ID, you can handle any other file of the same or smaller size (a short sketch of this idea follows below).
    Last edited by rarkyan; 26th February 2011 at 10:45.
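    A minimal sketch of the per-byte-value occurrence search described in the steps above (a sketch of the idea only, not rarkyan's actual program; the names occurrence_ids and block are made up here). For a block of n bytes it records, for each of the 256 byte values, a position bitmask in the (2^n)-1 range:

        # For a block of n bytes, build a position bitmask ("Occurrence ID") for each of
        # the 256 possible byte values; a set bit k means the value occurs at row k.
        def occurrence_ids(block: bytes) -> dict[int, int]:
            ids = {value: 0 for value in range(256)}
            for row, value in enumerate(block):
                ids[value] |= 1 << row
            return ids

        block = bytes([0x00, 0x01, 0x00, 0x01])    # 4 rows, as in the examples below
        ids = occurrence_ids(block)
        print(format(ids[0x00], "04b"))            # '0101' -> 0x00 occurs in rows 1 and 3
        print(format(ids[0x01], "04b"))            # '1010' -> 0x01 occurs in rows 2 and 4

    The bitmasks alone determine the whole block, but together they take 256 x n bits, which is the size problem raised in the replies that follow.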

  2. #62 rarkyan (Member)
    Quote Originally Posted by GerryB View Post
    Rarkyan, let me summarize your algorithm. You look for very common duplicated bytes like 00 in your file. You then take these out of the file, making it shorter, and you remember a simple bitmask to tell you where those values were taken from. To re-expand, you use the bitmask to place the byte back into place.

    This is a workable algorithm and it can indeed give compression for many files. For example, if half the bytes of a file are 00, you'd make a new file that's only half the size. But you also need to remember that bitmap, which is 1/8 of the file size (one bit per byte.)
    If more than 1/8 (12.5%) of the file's bytes are that one common byte, you end up with a net savings of space.. you have compression! Success!

    The flaw of this algorithm is that it can't compress files which have no common byte that happens more than 12.5% of the time. The 12.5% comes from the size of the bitmap compared to the size of the file itself.
    For example, say you had an 800 byte file, and byte 00 happened a lot.. 10% of the time. So you make your bitmask of 800 bits, which is 100 bytes. You can remove all the 00 bytes from the original file, so it's now shorter, it's 720 bytes.
    But look.. now the 720 byte file still needs the extra 100 byte bitmap.. that's 820 bytes, bigger than the file you started with. So there's no space savings for that example... it gets bigger.

    So your algorithm can work for some files. A very biased file with a very common byte can be compressed with a bitmap "hole" map. But not all files can be compressed this way.

    Now there are many many many ways to improve this and take advantage of biased byte distributions.. when one byte is more likely than another. The first algorithms in compression books deal with efficient ways to handle this problem and can take advantage of even small biases. But none of those methods can compress all files.

    Good for you for being interested in compression.. it's fascinating. You may be interested in text "entropy", the foundation of Information Theory, which, as a broad definition, quantifies how these small biases in "common bytes" can be efficiently exploited.
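    A quick check of the arithmetic in the quoted scheme, as a sketch (compressed_size is a made-up name): remove every copy of one common byte value and keep a one-bit-per-byte bitmap of where it was; this only pays off when that value covers more than 1/8 of the file.

        # Size after removing one common byte value and storing a one-bit-per-position bitmap.
        def compressed_size(n_bytes: int, count_of_common_byte: int) -> int:
            bitmap_bytes = (n_bytes + 7) // 8              # one bit per original byte position
            return (n_bytes - count_of_common_byte) + bitmap_bytes

        print(compressed_size(800, 80))     # 820 > 800: 10% common byte -> expansion
        print(compressed_size(800, 200))    # 700 < 800: 25% common byte -> compression
        # Break-even is at count = n/8, i.e. the 12.5% threshold mentioned above.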
    Thank you, Mr. Gerry. My weakness is truly understanding technical programming terms, but I think I can follow your advice. I usually use the 00 occurrences to demonstrate the pattern. I know 00 is a common byte in every file, so it takes a lot of the space the file is built from. Besides that, there are still 255 other byte values left; I will do the same search for the rest of them. The program must find the occurrences of all 256 byte values, searching them one by one or all 256 at the same time, and then write the Occurrence ID on the output table for each byte value. The program's main weapon is only the Database ID or Occurrence ID; this is the very important part, built from (2^n)-1. All of the occurrence patterns are already recorded in this database, so the program can handle any other byte value whose occurrences in the file are fewer than those of 00. Maybe this is a clearer, simpler example. Create the Occurrence ID first. Say we have 1 column and 4 rows. Using (2^n)-1 it should be:

    Image 1

    Now we have the Occurrence IDs for all possible positions across the 4 rows. Next, search for the occurrences in the file. This is not only for 00; every byte value has the same opportunity to appear in any position. Maybe like this:

    Image 2

    As an example, I search for byte 01, and above are the results. The program already has the Occurrence IDs; now it is time to place the IDs on the output table. Maybe like this:

    Image 3

    You see the ID "7" on the output table comes from the Occurrence ID table (see image 1), because in the search result (image 2) byte 01 appears in the 1st and 4th positions of the rows. The Occurrence ID handles all of that. Next, to the right of ID "7" we see ID "3"; the explanation is the same: byte 01 appears in the 3rd position of the rows, and the Occurrence ID table already records that position. Sometimes we see the same pattern more than once; you can see ID "7" placed 3 times on the output table, because the same pattern happens in different columns (see columns 00, 02, 05 in image 2). And in the last column for byte 01 we see ID "0", because no occurrence of byte 01 was found in column 0f. Repeat the method for every other byte value and place their Occurrence IDs on the table, filling all the "x" marks. Saving the compressed output file using the Occurrence IDs surely shortens any long and random byte positions. Any other advice is fully welcome.
    Last edited by rarkyan; 26th February 2011 at 10:46.
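    A sketch of the output-table construction described above (an illustration only; the images are not visible here, so the exact ID numbering of image 1 is unknown and the ID used below is simply the positions-as-bits number, with 0 meaning "not present in this column"). It also shows the size problem that the next replies point out:

        # Split the file into columns of `rows` bytes; for every column and every byte
        # value, record which rows that value occupies (the output-table idea above).
        def output_table(data: bytes, rows: int = 4) -> list[list[int]]:
            table = []
            for start in range(0, len(data), rows):
                column = data[start:start + rows]
                ids = [0] * 256
                for row, value in enumerate(column):
                    ids[value] |= 1 << row
                table.append(ids)
            return table

        data = bytes(range(16))               # a 16-byte toy "file": 4 columns of 4 rows
        table = output_table(data)
        # Each column needs 256 IDs of 4 bits = 128 bytes per 4 input bytes,
        # i.e. 32 times the input size, which is exactly Matt's point in post #64.
        print(len(table) * 256 * 4 // 8, "table bytes for", len(data), "input bytes")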

  3. #63 rarkyan (Member)
    Creating an Occurrence ID database using (2^75,000,000)-1 and storing the result on a hard disk is a really crazy thing. But I think the crazy result could handle every byte occurrence and return it back to its original place (Mr. Gerry calls this a bitmask) via a hex editor or any similar program. To name the IDs, using short names is an advantage. For example, take IDs of 3 digits and fill the digits using all possible combinations of 0-9 and a-z, or maybe include other symbols: !@#$% etc. I'm still wondering whether it could be done. Phew...

  4. #64 Matt Mahoney (Expert, Melbourne, Florida, USA)
    The file I sent you is 415,241 random bytes. There are no patterns in it. Each byte occurs on average 415,241/256 = about 1622 times. Tell me how you will store the list of occurrences of each byte so your lists take up less space than that. If you use a bitmap, your algorithm will make the file 32 times larger.
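    The 32x figure spelled out, as a small sketch: one bitmap per byte value, one bit per byte position, for all 256 values.

        file_size = 415_241                         # bytes in the random test file
        bitmap_bytes_total = 256 * file_size // 8   # 256 bitmaps of file_size bits each
        print(bitmap_bytes_total)                   # 13,287,712 bytes
        print(bitmap_bytes_total / file_size)       # 32.0 times the original size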

  5. #65 rarkyan (Member)

    Quote Originally Posted by Matt Mahoney View Post
    The file I sent you is 415,241 random bytes. There are no patterns in it. Each byte occurs on average 415,241/256 = about 1622 times. Tell me how you will store the list of occurrences of each byte so your lists take up less space than that. If you use a bitmap, your algorithm will make the file 32 times larger.
    I just returned from another city. I wrote an answer in a PDF file. Just check the link below. Sorry for the bad English translation:


    Enjoy
    Last edited by rarkyan; 26th February 2011 at 10:42.

  6. #66 rarkyan (Member)
    Quote Originally Posted by Matt Mahoney View Post
    Tell me how you will store the list of occurrences of each byte so your lists take up less space than that. If you use a bitmap, your algorithm will make the file 32 times larger.
    The Occurrence ID is just like a complete dictionary for the program to compare against and find the pattern (byte locations). I think it is a really big, super large database. I don't know the best way to create this ID database or where to save it, whether as a bitmap or something else; it is your turn to decide. Since the output (the compressed file) is just a table that contains IDs, the compressed file will not be really large.
    Last edited by rarkyan; 16th December 2010 at 20:41.

  7. #67 rarkyan (Member)
    Guess what? I still want the money.... [T.T]

  8. #68 rarkyan (Member)
    Heeeyyyyyyyyy, let's discuss!!! Where is everyone??? Trying to code my method???

  9. #69 osmanturan (Programmer, Mersin, Turkiye)
    It seems you still didn't get the point. You proposed a method for encoding a file with your bitmask approach. But, in the end, the problem is the same: how could you reduce the total number of bits? Let's examine your example to make it clear:

    You took 4 bytes as an example. But your output table is much bigger. Why? Because each position has 5 states (including blank), each position requires 3 bits: 3x(8 + 3x16) = 168 bits. So your output table is 168/(4x8) = 5.25 times bigger than the input data. If you still insist that "my method definitely works and I want to earn money from it", you should "definitely" show a size reduction on a real file. And, even more important, you have to take your output table size into account when posting your answers.
    BIT Archiver homepage: www.osmanturan.com
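    osmanturan's accounting, restated as a sketch (the cell counts 8 and 3x16 follow his reading of the example table, which is not visible here):

        cells = 8 + 3 * 16          # output-table cells in the example, as counted above
        table_bits = 3 * cells      # 5 states per cell (4 row positions + blank) -> 3 bits
        input_bits = 4 * 8          # the 4-byte example
        print(table_bits, table_bits / input_bits)   # 168 bits, 5.25 times the input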

  10. #70 rarkyan (Member)
    No, I do get the point. I'll try my best to help with data compression. I cut the file into 4 rows only as a simple example. The output table becomes larger than the original file because the IDs use many more bits than the bytes inside the file itself. The compressor will compress any file until it reaches the minimum compression; for my method, the minimum compression is determined by the shortest Occurrence ID. Now, if you could create short IDs for large or many occurrences, you would really get an advantage. This is the example, written as a PDF:

    Output Table - Sample.pdf

    Imagine that the program has already created a very big, super large database of Occurrence IDs. If the program is able to write the Occurrence IDs on the output table, it will shorten any random byte positions. The ID notation on the table already represents every byte position in the original file. On the output table above I use about 15 digits (of notation) for the Occurrence ID; that's just a sample. The size of the output table I created as a PDF is only 142,323 bytes (138 KB); when I create it in Excel it is only 72 KB. They won't be bigger than the original files. The only thing that gets really big is the database of Occurrence IDs. The Occurrence ID database is not the same as the output; it is only a database for all files, a database that contains bitmasks for all byte positions in every column inside the hex editor.
    Last edited by rarkyan; 18th December 2010 at 04:41. Reason: New Attachment

  11. #71 rarkyan (Member)

    Okay, I can't show the real compression; from the beginning I have just shown how the program should work. Now, how about helping each other? May I request something from you, or anyone here who is an expert programmer? I think this is really simple for all of you. Please make a Pattern Generator program, or in my words an Occurrence ID generator. These are the rules:

    1. It must be able to generate the sequence of occurrence patterns.
    2. It uses the (2^n)-1 formula.
    3. I can input how many rows I need, but just for a trial, give me 1-10 rows.
    4. After I click the generate button, I get data that shows all of the occurrence patterns: no repeated patterns, and no repeated IDs for different patterns.
    5. Maybe like this:

    I hope the generator program can simplify my explanation (a small sketch along these lines follows below). I really appreciate your work. Thank you very much.
    Last edited by rarkyan; 26th February 2011 at 10:46.
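    A small sketch of the requested generator (an illustration only; generate_patterns is a made-up name). For n rows it lists all (2^n)-1 non-empty occurrence patterns, each with a distinct ID and no repeated patterns:

        # Pattern i (1 <= i <= 2^n - 1) marks row k as occupied when bit k of i is set,
        # so every ID is unique and every non-empty pattern appears exactly once.
        def generate_patterns(n_rows: int):
            for pattern_id in range(1, 2 ** n_rows):
                yield pattern_id, [k + 1 for k in range(n_rows) if pattern_id & (1 << k)]

        for pattern_id, rows in generate_patterns(4):
            print(pattern_id, rows)    # 1 [1], 2 [2], 3 [1, 2], ... , 15 [1, 2, 3, 4]

    For 10 rows this is only 1023 patterns; the difficulty discussed later in the thread is that a real file needs n in the millions, and 2^n - 1 grows far beyond anything storable.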

  12. #72 Skymmer (Member, Russia)
    It's kind of strange to me that such a "useless-from-the-beginning" topic gets so much attention. I presume there are much more real and worthwhile conversations here.

  13. #73 rarkyan (Member)
    I hope in the end my method isn't useless.

  14. #74 rarkyan (Member)
    Hmmm... looks like no one will fulfill my request.... All right then, I will ask my friend to create the program I need. I'll be back someday.

  15. #75 Piotr Tarsa (Member, Kraków, Poland)
    Good luck then. Hope you'll learn something interesting.

  16. #76 Member (England)
    It would have made sense to get your friend to create it in the first place and then come here to show it working. Best of luck with it, and we look forward to seeing the results.

  17. #77 rarkyan (Member)
    Yo, I'm back. Did someone miss me? I guess not. I want to ask something: is it possible to create a database of 2^75,000,000 occurrence patterns? If it's possible, how long would the process take? How much storage would it take to save the results? Can a supercomputer do that? Thanks.

  18. #78 Piotr Tarsa (Member, Kraków, Poland)
    From what I read, the number of atoms in the Universe is estimated as below 2^300, so it's not even remotely possible.

  19. #79 RichSelian (Member, Shenzhen, China)
    Representing the number 2^75,000,000 requires 75,000,000 bits, or 9375KB. How do you prove that the "compressed data" plus the occurrence table will be smaller than the original data?
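    The scale of that number, as a sketch: writing down the single number 2^75,000,000 already takes about 9 MB, and a database with 2^75,000,000 entries is beyond anything physical.

        import math

        exponent = 75_000_000
        print((exponent + 1 + 7) // 8, "bytes just to write the number 2^75,000,000")
        print(int(exponent * math.log10(2)), "decimal digits")   # about 22.6 million digits
        # A table with 2^75,000,000 entries is another matter: 2^300 already exceeds
        # the estimated number of atoms in the universe, as noted in the next reply.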

  20. #80 rarkyan (Member)
    Quote Originally Posted by Piotr Tarsa View Post
    From what I read, the number of atoms in the Universe is estimated as below 2^300, so it's not even remotely possible.
    So it is still impossible for that. It will be hard, then.

  21. #81 rarkyan (Member)
    Quote Originally Posted by RichSelian View Post
    Representing the number 2^75,000,000 requires 75,000,000 bits, or 9375KB. How do you prove that the "compressed data" plus the occurrence table will be smaller than the original data?
    "In short, the counting argument says that if a lossless compression program
    compresses some files, it must expand others" (http://www.faqs.org/faqs/compression...section-8.html)

    It must expand others. So the compression will be at its maximum peak. I can prove it in theory, not in practice. But the others, even the experts, say it is still impossible. I thought there was nothing impossible; we just haven't found the correct way yet.
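    The counting argument from the quoted FAQ, checked at a small scale as a sketch: there are fewer distinct shorter files than files of any given length, so no lossless compressor can shrink them all.

        n = 10
        files_of_length_n = 2 ** n                       # 1024 distinct n-bit files
        shorter_files = sum(2 ** k for k in range(n))    # 2^0 + ... + 2^(n-1) = 1023
        print(files_of_length_n, shorter_files)
        # At least one n-bit file has no shorter encoding left for it, so a lossless
        # compressor that shrinks some files must expand (or leave unchanged) others.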

  22. #82 RichSelian (Member, Shenzhen, China)
    Quote Originally Posted by rarkyan View Post
    "In short, the counting argument says that if a lossless compression program
    compresses some files, it must expand others" (http://www.faqs.org/faqs/compression...section-8.html)

    It must expand others. So the compression will be at its maximum peak. I can prove it in theory, not in practice. But the others, even the experts, say it is still impossible. I thought there was nothing impossible; we just haven't found the correct way yet.
    It is obvious, by a counting (permutation) argument, that a compression method must expand some inputs, and it seems you understand that THERE MUST BE SOMETHING THAT IS NOT COMPRESSIBLE. So what type of data will you compress? And why?
    Most people here work on compressing so-called "generic data", which has redundancy (like string duplication). For random data, or already-compressed data such as a RAR archive, there is little redundancy in it. How do you compress that data losslessly?

  23. #83 rarkyan (Member)
    What type of data will you compress? -- All data, any type of data (music, documents, video, etc., all of them that exist).
    Why? -- To help with the issues around big data.
    How do you compress that data losslessly? -- The data is very random. I don't focus on redundancy; I focus on patterns. I have a logical concept that would make it possible to compress any kind of random data losslessly. The first thing: it must expand others.

    God compresses a big tree into only a small seed. Why did God create such big things as planets, solar systems, etc.? That is the expansion.

  24. #84 denisr (Member, Moscow)
    I understand what rarkyan describes.
    He is talking about the well-known CPU/memory tradeoff. This method is used in rainbow tables for searching crypto hashes. He is offering to create a very big dictionary (more than 1 GB) of many common patterns (text or binary) and distribute it once with the decompressor, so it will not be necessary to add the dictionary to the compressed file the way current compressors do. So it is all about rainbow tables. You can create a dedicated dictionary for e.g. text/books by scanning for common words and phrases (like Google does during its indexing process).
    But the main purpose of this, I think, is speed-up, as in cryptography. You make a many-GB dictionary of common 4x4/8x8 DCT coefficients, and speed up video/photo decoding by lookup into the corresponding uncompressed array, distributed with the decoder.
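    A toy sketch of the shared, out-of-band dictionary idea described above (not denisr's or rarkyan's actual design; SHARED_DICT, compress and decompress are made-up names). The dictionary ships once with the decoder, so only indexes appear in the compressed stream; chunks that are not in the dictionary get longer, which is where the counting argument shows up again.

        # A dictionary of common 4-byte patterns, distributed once with the decoder.
        # For simplicity the input length is assumed to be a multiple of 4.
        SHARED_DICT = [b"\x00\x00\x00\x00", b"ABCD", b"http", b"\xff\xff\xff\xff"]
        DICT_INDEX = {pattern: i for i, pattern in enumerate(SHARED_DICT)}

        def compress(data: bytes) -> bytes:
            out = bytearray()
            for i in range(0, len(data), 4):
                chunk = data[i:i + 4]
                if chunk in DICT_INDEX:
                    out += bytes([1, DICT_INDEX[chunk]])   # known chunk: 2 bytes instead of 4
                else:
                    out += bytes([0]) + chunk              # unknown chunk: 5 bytes instead of 4
            return bytes(out)

        def decompress(data: bytes) -> bytes:
            out, i = bytearray(), 0
            while i < len(data):
                if data[i] == 1:
                    out += SHARED_DICT[data[i + 1]]
                    i += 2
                else:
                    out += data[i + 1:i + 5]
                    i += 5
            return bytes(out)

        sample = b"ABCDhttpXYZW" + b"\x00" * 4
        assert decompress(compress(sample)) == sample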

  25. #85 Piotr Tarsa (Member, Kraków, Poland)
    Isn't the whole idea of data compression to reduce storage and bandwidth requirements? If we need to transfer a lot of data or store huge dictionaries then we don't have any savings.


    By the way:
    Matt wrote a BARF compressor which works in this way:
    - at compilation it is joined with some files (in BARF's case those files are from the Calgary corpus),
    - when we compress a file from the Calgary corpus with BARF, BARF outputs a file that contains only an identifier of that file in BARF's database,
    - when we compress a file that is not from the Calgary corpus, BARF uses some weak compression method, or cheats by moving bytes from the file into the filename.

    So basically, what BARF does is not very different from what rarkyan wants to achieve. The only difference is that BARF's database is small, but that could easily be changed.
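    A toy in the spirit of the scheme described above, as a sketch (KNOWN_FILES, pack and unpack are made-up names, and the real BARF handles unknown files differently, e.g. with weak compression or filename tricks). Files in the built-in database shrink to a tiny identifier; everything else grows by one byte:

        import hashlib

        # A built-in "corpus": file contents baked into the (de)compressor itself.
        KNOWN_FILES = [b"first corpus file contents", b"second corpus file contents"]
        KNOWN_INDEX = {hashlib.sha256(f).digest(): i for i, f in enumerate(KNOWN_FILES)}

        def pack(data: bytes) -> bytes:
            digest = hashlib.sha256(data).digest()
            if digest in KNOWN_INDEX:
                return bytes([1, KNOWN_INDEX[digest]])   # known file -> 2-byte identifier
            return bytes([0]) + data                     # unknown file -> 1 byte bigger

        def unpack(blob: bytes) -> bytes:
            return KNOWN_FILES[blob[1]] if blob[0] == 1 else blob[1:]

        assert unpack(pack(b"first corpus file contents")) == b"first corpus file contents"
        assert len(pack(b"anything else")) == len(b"anything else") + 1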

  26. #86 RichSelian (Member, Shenzhen, China)
    Quote Originally Posted by rarkyan View Post
    What type of data will you compress? -- All data, any type of data (music, documents, video, etc., all of them that exist).
    Why? -- To help with the issues around big data.
    How do you compress that data losslessly? -- The data is very random. I don't focus on redundancy; I focus on patterns. I have a logical concept that would make it possible to compress any kind of random data losslessly. The first thing: it must expand others.

    God compresses a big tree into only a small seed. Why did God create such big things as planets, solar systems, etc.? That is the expansion.
    I think you should first read Shannon's entropy theory and tell us what's wrong with his theory, before you say your idea can compress all/random data.

  27. #87 rarkyan (Member)
    Nothing is wrong with Shannon's entropy theory. I don't want to blame the theory. I just want to try a different approach that might be able to give a solution, even if it is only a small crack at the problem.

  28. #88 RichSelian (Member, Shenzhen, China)
    Quote Originally Posted by rarkyan View Post
    Nothing is wrong with Shannon's entropy theory. I don't want to blame the theory. I just want to try a different approach that might be able to give a solution, even if it is only a small crack at the problem.
    Shannon's theory is the basic rule of compression, not an optional tool, and "random data is not compressible" is also a rule.
    I find that discussing compression with someone who does not understand Shannon's theory is useless. I will stop here until you have read the theory and found what is wrong with your "new method".

    -- thanks to Google Translation.
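    A sketch of the entropy measurement being pointed to (byte_entropy is a made-up name): uniform random bytes come out near 8 bits per byte, leaving no room for lossless shrinking, while biased data comes out lower.

        import math, os
        from collections import Counter

        # Empirical order-0 Shannon entropy in bits per byte: H = -sum(p * log2(p)).
        def byte_entropy(data: bytes) -> float:
            n = len(data)
            return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

        print(byte_entropy(os.urandom(1_000_000)))    # ~8.0: essentially incompressible
        print(byte_entropy(b"ab" * 8000))             # 1.0: highly compressible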

  29. #89 rarkyan (Member)
    Quote Originally Posted by denisr View Post
    I understand what rarkyan describes.
    He is talking about the well-known CPU/memory tradeoff. This method is used in rainbow tables for searching crypto hashes. He is offering to create a very big dictionary (more than 1 GB) of many common patterns (text or binary) and distribute it once with the decompressor, so it will not be necessary to add the dictionary to the compressed file the way current compressors do. So it is all about rainbow tables. You can create a dedicated dictionary for e.g. text/books by scanning for common words and phrases (like Google does during its indexing process).
    But the main purpose of this, I think, is speed-up, as in cryptography. You make a many-GB dictionary of common 4x4/8x8 DCT coefficients, and speed up video/photo decoding by lookup into the corresponding uncompressed array, distributed with the decoder.
    Wow, that is deep. I don't even understand such technical terms. But at some point I agree that my method uses a dictionary. And yes, the dictionary is not included in the compressed files. Thanks for understanding me.

  30. #90 rarkyan (Member)
    Quote Originally Posted by RichSelian View Post
    Shannon's theory is the basic rule of compression, not an optional tool, and "random data is not compressible" is also a rule.
    I find that discussing compression with someone who does not understand Shannon's theory is useless. I will stop here until you have read the theory and found what is wrong with your "new method".

    -- thanks to Google Translation.
    "random data is not compressible". Sorry, but i do not agree with this statement. I still able to see patterns on every random data. Even for some different files, there will be a redundant. My concepts of dictionary records all of the patterns. Thats why i'm always curious about computer ability to generate 2^75.000.000. Crazy number and results. I just ordered someone to create a generator but it is only 2^10. On my low spec computer the generate process take about 4 seconds to show the results. Anyone here have supercomputer? Could anyone help me to test generate the crazy number?
