Results 1 to 14 of 14

Thread: CRC used to figure out compression algorithm

  1. #1
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts

    CRC used to figure out compression algorithm

    Sorry, messed up the title, it should say MD5, not CRC.
    Hey guys, I'm back after so long, I know you missed me.
    I've been thinking about the problem of knowing what compression algorithm was used on a given archive, and I have an idea on how to discover it.
    Please let me know if this idea is feasible:
    1. First, un-arc the archive.
    2. Generate an MD5 of the folder.
    3. Test many algorithms. Since we know the correct MD5, and the only way to reproduce that MD5 is to use the exact algorithm that was used to compress it,
    we have a program that contains as many different algorithms as possible and we test each one on our uncompressed folder, then check the MD5. If it matches, we have it; if not, we move on to the next one.
    My question is: would this work?
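    Steps 1–3 above, as a rough Python sketch of hashing a whole folder into one fingerprint (this is only an illustration, not anything an existing tool does; files are walked in sorted order so two runs agree):

    ```python
    import hashlib
    import os

    def folder_md5(root):
        """One MD5 "fingerprint" for a whole folder: hash every file's
        contents in a fixed (sorted) order so the result is repeatable."""
        digest = hashlib.md5()
        for dirpath, _dirnames, filenames in sorted(os.walk(root)):
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name), "rb") as f:
                    # Read in chunks so large files don't fill memory.
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
        return digest.hexdigest()
    ```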
    Last edited by Omnikam; 16th September 2015 at 11:13.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    What are the input and output data of your algorithm, exactly? It seems to me that you've mixed them up.

  3. #3
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I'm not sure I'm 100% clear on the problem you're trying to solve, but this may help. Hashes (like MD5) are sort of like fingerprints for files. Instead of comparing two large files byte-by-byte, you can compare their MD5 "fingerprints" -- which is generally faster. Two files with different MD5s are definitely different; two files with identical MD5s are (almost certainly) the same. After comparing their MD5s and finding them equal, you can do a full byte-by-byte comparison just to be sure -- but the MD5 check will almost never give false results, so it would be reasonable to omit this.

    Since MD5 works on files and not folders, you'd have to use something (like zip or tar) to convert the folders into files before you could generate their MD5 hashes.
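    The fingerprint comparison can be sketched like this (a minimal Python sketch; the file paths are placeholders):

    ```python
    import hashlib

    def md5_of(path):
        """MD5 of a file, read in chunks so large files don't fill memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def probably_identical(a, b):
        """Different MD5s: definitely different files.
        Equal MD5s: almost certainly identical (an optional byte-by-byte
        check could follow, but in practice it is rarely needed)."""
        return md5_of(a) == md5_of(b)
    ```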
    Last edited by nburns; 17th September 2015 at 03:11.

  4. #4
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have a bat file that will generate an MD5 for every file in a folder. The idea is to use the MD5 as the known positive.
    So you create new archives using different algorithms, unpack them, then take an MD5 of the files and compare. If the MD5s match, you have the right compression; if not, you try another method. This continues until the MD5s match. I guess it acts like a brute-force method, with the added bonus that you know what the right MD5 is.
    I based this idea on the formula a + b = c, where c is the MD5, b is the files, and a is the compression. The idea was to use what is known to work out what is not.
    I know there are many, many different compression algorithms, and also many settings, so I expected this would take a very long testing time.
    I hope I made more sense?
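    The test loop described above can be sketched in Python, using three standard-library codecs as stand-ins (a real tool would need far more formats than these, so this is only a toy):

    ```python
    import bz2
    import hashlib
    import lzma
    import zlib

    # Candidate decompressors from the standard library; stand-ins for
    # the much larger set of real archive formats.
    DECOMPRESSORS = {
        "zlib": zlib.decompress,
        "bz2": bz2.decompress,
        "lzma": lzma.decompress,
    }

    def identify(blob, known_md5):
        """Return the name of the first decompressor whose output matches
        the known MD5 of the original data, or None if none match."""
        for name, decompress in DECOMPRESSORS.items():
            try:
                out = decompress(blob)
            except Exception:
                continue  # wrong format: most decompressors just raise an error
            if hashlib.md5(out).hexdigest() == known_md5:
                return name
        return None
    ```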

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    can you just say what the input and output of your algorithm are?

  6. #6
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    can you just say what the input and output of your algorithm are?
    Unfortunately no. I don't know any compression algorithms; this was just an idea I had after reading an article on encryption, specifically the encryption WinRAR uses. I was trying to figure out how it could be broken; so far only brute force is possible. My thoughts just led me to think about the problem in a simple manner, as simple as A + B = C. 256-bit encryption is still unbreakable, and my goal was to think about how one would go about breaking in. I thought the problem is that there are just too many ways to arrive at a number, for example:
    10 could be 9+1, 8+2, 7+3, 6+4, etc.
    So the real problem is that you have no idea which is the right one that leads to our 10. But I was thinking about this and thought perhaps we do have clues as to the right way: the first clue is the MD5 of the file itself, its fingerprint, and the second possible clue is the algorithm used to encrypt. Anyway, I was just trying to think about these problems from a very simplistic understanding, the way a child with very little knowledge of a subject might come up with ideas, and perhaps there is no real-life application for them. However, I did wonder if the logic behind my idea was sound, so I attempted to ask a question, perhaps lacking the appropriate language.
    What is my input/output algorithm? I have no idea, lol.
    I hoped that if my logic was correct, it would inspire brighter and more learned minds to solve the problem. Ultimately, the real hidden agenda of my question was breaking 256-bit encryption, starting with the logic needed to break an algorithm used on an archive.

  7. #7
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    920
    Thanks
    57
    Thanked 113 Times in 90 Posts
    I might be misunderstanding, but it seems to me your basic idea is just to brute-force all known algorithms and see if you get the same compressed data; if so, you have found the right compressor. What is the value in this info?
    Is it not blatantly obvious that trying everything and seeing whether you get the same result will tell you you have found the same method? Besides the fact that trying all possible iterations of all possible compression algorithms will probably take you longer than the lifespan of the universe.

    Yes, brute force can solve/break almost anything. It's just a matter of time cost.

  8. #8
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for your replies, guys. I suppose it was just brute-forcing; I certainly wouldn't want anything running till the end of time, not after the 42 fiasco.

  9. #9
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for your replies, guys. I suppose it was just brute-forcing; I certainly wouldn't want anything running till the end of time, not after the 42 fiasco.

  10. #10
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Actually, this is a waste of time. When you try to decompress a file with the wrong program, most of the time the program will crash: most compressors are so poorly designed that they fail on files that were not compressed with them, and who knows what output, if any, you will get. I suppose that if the compression really sucks, it could even get into an infinite loop when you apply the wrong decompression program to the code. Also, if the original file was some sort of text file and you have the MD5 for it, then even if you decompress with the correct program, the MD5 may not match, due to the way text is handled when different computers are used.

  11. #11
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for your response. I agree it turned out to be a waste of time, and I couldn't have attempted it myself; it was just an idea.

  12. #12
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    920
    Thanks
    57
    Thanked 113 Times in 90 Posts
    Sorry for beating a dead horse, but putting aside that this is just a brute-force approach, let's look at whether using the hash for verification to speed things up might actually make everything slower.
    But first let me make sure I understand what it is you want to accelerate by using the hash:

    You start with an unknown archive for which you want to figure out what kind of algorithm was/is used to compress/decompress it.
    1: You decompress the archive (so you must have the decompressor at hand, but its internal mechanics are unknown).
    2: You calculate the MD5 of the decompressed data.
    3: You now try a huge set of decompressors whose internal mechanics you do know.
    4: You calculate the hash of each trial's decompressed data.
    5: You compare the trial hashes against the original decompressed hash from #2.

    Did I get that right?

    The first issue is that different compression techniques might produce the same compressed file, so even if you find the right decompressor, that does not mean you know exactly what the compressor is capable of.
    E.g., 7-Zip, KZIP, and ordinary zip can all make a zip file that decompresses with the same decoder but compresses differently.
    But let's toss that aside and say you are happy just to know the right way to decompress it.

    Here is the silver bullet for why your method is not going to make this faster; on the contrary, it is going to slow it down A LOT.
    You need to decompress the FULL data of a file before you can calculate the hash and compare. That is a huge waste on a file that differs at the first byte.
    Instead, a simple direct compare as you decode would be able to stop at the first sign of a difference.

    So even putting aside all the reasons why it's not a feasible job, the method is still not helpful for the task.
    At best you would hash blocks of the files, if the goal is to reduce I/O from the storage media and/or memory usage.
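    The early-exit compare described above can be sketched in Python (a rough illustration, nothing more): it stops reading as soon as the streams differ, whereas a hash would have to consume both in full.

    ```python
    def streams_equal(f1, f2, chunk_size=65536):
        """Compare two byte streams chunk by chunk, stopping at the first
        difference instead of reading (or hashing) everything."""
        while True:
            a = f1.read(chunk_size)
            b = f2.read(chunk_size)
            if a != b:
                return False  # first mismatch: bail out immediately
            if not a:
                return True   # both streams exhausted at the same point
    ```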
    Last edited by SvenBent; 28th September 2015 at 05:47.

  13. #13
    Member
    Join Date
    Sep 2010
    Location
    Australia
    Posts
    46
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I agree with you 100%.
    My idea was stupid, but I didn't know that at first, which is why I posted it here. I respect the insight of many of this forum's members, and I knew that if I posted my question here I'd find out pretty quickly whether it had merit or was a waste.
    I thank everyone for the time you took to respond, especially considering my question was poorly structured.

  14. #14
    Member
    Join Date
    Feb 2016
    Location
    Canada
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by biject.bwts View Post
    Actually this is a waste of time. When you try to decompress a file most of the time. The program will crash. Since most compressors are so poorly designed they fail to decompress files that were not compressed with them and who knows what you will get if anything for output. I suppose that if the compression really sucks it could even get in an infinite loop if you try the apply the wrong decompression program on the code. Also if original file was some sort of text file and you have the MD5 for it. But you decompress with the correct program the MD5 code may not match due to the way text is done if different computers used.
    I don't think I would agree with this. Yes, different platforms deal with text in different ways, but compression algorithms don't deal with text; they deal with binary data (bytes). The algorithm will decompress the file by using the bytes it sees in the compressed file, and the algorithm itself should deal with those bytes the same way on any platform, unless there's an issue with some platform-specific code, if there is any. The bytes fed through the algorithm should be the same 100% of the time, unless there's an issue with the compressor/decompressor itself. The only issues you might have are when you try to open a decompressed text file that was created on some other platform which writes and interprets text files in a different way, but the decompressed data should be consistent. The compression algorithm doesn't care about things like CR and LF; it cares about the bytes that represent those values. As long as the algorithm deals with bytes, I doubt that even endianness should be a concern. Btw, MD5 doesn't deal with text either; the argument is the same. An MD5 checksum of a text file created on Windows and checked on another platform should be the same hash, because it's an algorithm dealing with the bytes of the file; text interpretation is not relevant to MD5.
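    A tiny illustration of the point (mine, not from the post): the hash tracks only bytes, so identical bytes always agree regardless of platform, while a byte-level difference such as CRLF vs LF line endings changes the hash.

    ```python
    import hashlib

    # MD5 never sees "text", only bytes: the same bytes hash identically
    # everywhere, and any byte difference (such as line endings written
    # by different systems) produces a different hash.
    unix_style = b"line one\nline two\n"
    windows_style = b"line one\r\nline two\r\n"

    fingerprint = hashlib.md5(unix_style).hexdigest()
    assert fingerprint == hashlib.md5(unix_style).hexdigest()
    assert fingerprint != hashlib.md5(windows_style).hexdigest()
    ```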
    Last edited by bitm0de; 29th February 2016 at 02:24.

