
Thread: Detect And Segment TAR By Headers?

  1. #1
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts

    Detect And Segment Files By Headers (Like PAQ8)?

    Hi everyone,

    I'm curious whether there is a command-line tool that can segment a file the way PAQ8 does.

    If you have a large TAR file, PAQ8 will detect data/file types based on headers and use the appropriate filters to compress those segments, correct?

    Is there a command-line tool that can do the same detection but, rather than compress the file, creates a separate file for each specific data type?

    For example: let's say we have a 20 MB file; PAQ8PX may detect the following:

    default
    exe
    default
    jpeg
    exe
    default

    My idea would be for this theoretical program to create 6 files from that 20 MB file based on those detections.

    Does such a thing exist? Is this logical when attempting to improve compression? There are archivers that compress better when given a directory of 100 files as opposed to 1 large TAR file containing those 100 files.

    Obviously a TAR file can be unpacked easily. So my idea is aimed at something like a binary file that has different types of data embedded in it, which, when separated, can assist an archiver in achieving a better compression ratio.

    Any thoughts? I am very inexperienced in data compression, so I apologize for the amateur question.
    Last edited by comp1; 2nd June 2014 at 15:46.

  2. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    If I'm understanding you, it sounds like what you want to do is to untar the file. There's a command line program for that called 'tar'. The tar format has no compression of its own. For that reason, tarfiles are commonly compressed with gzip or bzip2.

    If the file is gzipped or bzipped, you will have to deal with the compression before you can read the tar metadata (in contrast to a zip file). From what I know, the tar format is simple, stable and very well-documented.
    Last edited by nburns; 2nd June 2014 at 08:43.

  3. #3
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nburns View Post
    If I'm understanding you, it sounds like what you want to do is to untar the file. There's a command line program for that called 'tar'. The tar format has no compression of its own. For that reason, tarfiles are commonly compressed with gzip or bzip2.

    If the file is gzipped or bzipped, you will have to deal with the compression before you can read the tar metadata (in contrast to a zip file). From what I know, the tar format is simple, stable and very well-documented.
    Hi,

    No, the title of the thread is completely misleading and I realized it after I posted it.

    Forget the TAR, ZIP, etc. archive. I meant a binary file (let's say a DLL).

    I was asking about a program that would use PAQ8's detection algorithm, which detects headers within a file as it scans it, and then split each detected part of the file off into a separate segment.

    So, if FILE.DLL (20 MB) has the following according to PAQ8 when compressing it:

    default
    jpeg
    exe
    default
    8b-image
    default

    Then, the program I'm looking for should create 6 separate files according to those headers it reads within the FILE.DLL.

    Then an archiver that only identifies one data type per file will give a better compression ratio when given 6 files (one for each data type), because it will not be limited to using only one filter.

    In other words:

    FILE.DLL would have to be treated as either executable, text, audio, etc., when in reality we know (because of PAQ8) that there are multiple data types within FILE.DLL. If we segment it into 6 separate files and feed those 6 as the input, then the archiver can use better filters. So:

    FILE_1: No filter
    FILE_2: jpeg filter
    FILE_3: executable filter
    FILE_4: no filter
    FILE_5: Bitmap filter
    FILE_6: No filter

    And then a better ratio would be achieved since we know that FILE.DLL is not just one giant executable.

    Does a program exist that would segment the file along the lines of what PAQ8's detection algorithm finds?

    BTW: Sorry about the misleading thread title, that was my mistake.
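
    To make the idea concrete, here is a toy sketch of header-based detection (nothing like PAQ8's actual detector): scan the file for JPEG start/end markers and print the byte ranges found. A real tool would do this for many formats and be far more careful about false positives.

    Code:
    // jpegscan.cpp - toy illustration: report byte ranges that look like embedded JPEG streams.
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::printf("usage: jpegscan <file>\n"); return 1; }
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }
        std::vector<unsigned char> b;
        int c;
        while ((c = std::getc(f)) != EOF) b.push_back((unsigned char)c);
        std::fclose(f);
        long start = -1;
        for (size_t i = 0; i + 2 < b.size(); ++i) {
            if (start < 0 && b[i] == 0xFF && b[i+1] == 0xD8 && b[i+2] == 0xFF)
                start = (long)i;                              // SOI marker FF D8 FF opens a stream
            else if (start >= 0 && b[i] == 0xFF && b[i+1] == 0xD9) {
                std::printf("jpeg at offset %ld, length %ld\n", start, (long)i + 2 - start);
                start = -1;                                   // EOI marker FF D9 closes it
            }
        }
        return 0;
    }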

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    If you are talking specifically about DLLs, then something in here may be what you're looking for:

    http://msdn.microsoft.com/en-us/libr...v=vs.100).aspx

    Otherwise, it's probably not something I know much about.

    Edit:
    It sounds like you may be seeking some sort of universal un-archiver. I don't know if that exists, but my guess is that it doesn't. In Linux and BSD OSes, major file formats tend to be well-documented and have standard (free) tools for hacking them, so a universal tool isn't that important.
    Last edited by nburns; 3rd June 2014 at 00:57.

  5. #5
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nburns View Post
    If you are talking specifically about DLLs, then something in here may be what you're looking for:

    http://msdn.microsoft.com/en-us/libr...v=vs.100).aspx

    Otherwise, it's probably not something I know much about.

    Edit:
    It sounds like you may be seeking some sort of universal un-archiver. I don't know if that exists, but my guess is that it doesn't. In Linux and BSD OSes, major file formats tend to be well-documented and have standard (free) tools for hacking them, so a universal tool isn't that important.
    Not really an un-archiver. Let me try to explain again with images to demonstrate:

    Let's say we use AcroRd32.exe as an example:

    [image: PAQ8's detection output for AcroRd32.exe, showing it split into 6 typed segments]
    So what I'm saying is that if a command-line utility could split AcroRd32.exe into 6 files according to the image above, then archivers could compress it better.

    So segment 5 (an 8b-image of 187,500 bytes) would be identified as an image by the archiver, which would then use its native image filter to compress it better. Whereas if the archiver were given AcroRd32.exe as one input file, it wouldn't see the 8b-image inside it, would not apply its image filter, and wouldn't achieve such strong compression.

    So what I'm suggesting is to have a tool that could identify all those 6 segments in AcroRd32.exe (or any file of any extension) and create separate files for each segment (in this case, as seen in the image, 6 files).

    Does anyone else think this would be a good idea?

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,480
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Splitting data streams by type is commonly called segmentation and it's built into many archivers, though standalone segmenters are non-existent, I think. Segmenters work by looking for known headers and/or by probing data. Probing means e.g. looking for frequent x86 instructions, applying image/audio filters and measuring entropy, looking for regular structure (i.e. repeated values at fixed intervals), etc.

    PAQ probably doesn't have a super smart segmentation algorithm. I think it switches to specific modes when it finds a specific header in the stream. Otherwise it is in default mode, which includes modelling the data in various ways and letting the neural networks select the best models on the fly.

    Francesco's archivers 'traditionally' include lots of file type detectors and specific filters/segmenters. He plans to open-source the LZA archiver, which will probably mean he will reveal some of his content detectors.
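
    To make the "probing" part concrete, here is a toy sketch of one such probe (my own illustration, not any archiver's actual code): x86 machine code tends to contain many E8/E9 (CALL/JMP rel32) opcodes whose 32-bit displacement has a high byte of 0x00 or 0xFF, i.e. a short relative jump, so counting those per window gives a crude executable detector. Window size and threshold are arbitrary.

    Code:
    // x86probe.cpp - toy probe: flag 4 KB windows that look like x86 code.
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::printf("usage: x86probe <file>\n"); return 1; }
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }
        std::vector<unsigned char> b;
        int c;
        while ((c = std::getc(f)) != EOF) b.push_back((unsigned char)c);
        std::fclose(f);
        const size_t W = 4096;                   // window size (arbitrary)
        for (size_t w = 0; w + W <= b.size(); w += W) {
            int hits = 0;
            for (size_t i = w; i + 4 < w + W; ++i)
                // E8/E9 opcode followed by a displacement whose top byte is 00 or FF
                if ((b[i] == 0xE8 || b[i] == 0xE9) && (b[i+4] == 0x00 || b[i+4] == 0xFF))
                    ++hits;
            if (hits > 16)                       // threshold is arbitrary
                std::printf("window at %zu looks like x86 code (%d hits)\n", w, hits);
        }
        return 0;
    }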

  7. #7
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Splitting data streams by type is commonly called segmentation and it's built into many archivers, though standalone segmenters are non-existent, I think. Segmenters work by looking for known headers and/or by probing data. Probing means e.g. looking for frequent x86 instructions, applying image/audio filters and measuring entropy, looking for regular structure (i.e. repeated values at fixed intervals), etc.

    PAQ probably doesn't have a super smart segmentation algorithm. I think it switches to specific modes when it finds a specific header in the stream. Otherwise it is in default mode, which includes modelling the data in various ways and letting the neural networks select the best models on the fly.

    Francesco's archivers 'traditionally' include lots of file type detectors and specific filters/segmenters. He plans to open-source the LZA archiver, which will probably mean he will reveal some of his content detectors.
    Ahh ok! Yes that is what I was talking about.

    So, why isn't there a good stand-alone file segmentation tool? I know many archivers do have an internal one but some do not and I think it would be quite beneficial.

    Am I incorrect that it would be useful? Or does its usefulness decrease significantly when one counts the number of archivers that have their own internal segmentation process before compression?

    For me, it would be very useful. Even one as simple/basic as the one in PAQ8 would be enough for me to start with. So long as the headers remained intact in the segmented files, archivers could identify the type of data properly and apply their filters correctly.

    Does anyone else see a use for a stand-alone command line tool for this, or just me?

  8. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by comp1 View Post
    Does anyone else see a use for a stand-alone command line tool for this, or just me?
    I do.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,480
    Thanks
    26
    Thanked 122 Times in 96 Posts
    See here: http://encode.su/threads/1024-Data-s...ll=1#post19742
    It's outdated but should be somewhat useful.

  10. #10
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    I do.
    Glad I'm not the only one. A good example of the use for this is the MaximumCompression.com SFC test of the Microsoft DOC file. PAQ8PX wins big time over any other compressor... Why? Because it properly sees the embedded jpeg inside the document and uses its jpeg filter.

    See here: http://encode.su/threads/1024-Data-s...ll=1#post19742
    It's outdated but should be somewhat useful.
    I've played around with durilca and seg_file... I don't seem to have much luck getting a noticeable compression improvement. seg_file has a very low file-size limit because it puts 7 times the file size into memory for segmentation and it is a 32-bit program. Durilca's -t1 -l combination segments files, but it doesn't seem to have much of an effect on further compression (if any at all, actually).

    A PAQ8-type stand-alone segmentation tool would be better I believe. Going back to my aforementioned ohs.doc file in MaximumCompression.com's SFC test: If we were able to feed the file into the theoretical stand-alone segmentation program then we'd have a segment of the jpeg image, and that would greatly improve compression for programs with jpeg filters that can't see the embedded jpeg in the .doc file.

    Durilca and seg_file were OK back then, but they were not as good at detection and proper segmentation as PAQ8's detection system, and they were also very limited by file size/memory usage.

  11. Thanks:

    Paul W. (4th June 2014)

  12. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,480
    Thanks
    26
    Thanked 122 Times in 96 Posts
    There's precomp http://schnaader.info/precomp.php which handles embedded JPEGs.

    Overall, there are tons of file formats, and looking for all of them by headers would be a very strenuous task. One reason is that some formats have very compact headers which are easy to mistake for something else. Also, you can stumble upon a broken embedded BMP, JPG, EXE or anything else, so you have to be error-resilient; ideally you should detect whether there's corruption, insertion or deletion and try to compensate for that.

    A good segmentation scheme is a moving target: formats keep emerging and evolving, new types of redundancy keep appearing, etc.

    Overall, as of today, highest efficiency is achieved when content creators cooperate with data compression algorithmists, IMO. Until we create a reasonable AI to detect file types automatically and adapt to them, we have to do that more or less manually.

  13. #12
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    There's precomp http://schnaader.info/precomp.php which handles embedded JPEGs.

    Overall, there are tons of file formats, and looking for all of them by headers would be a very strenuous task. One reason is that some formats have very compact headers which are easy to mistake for something else. Also, you can stumble upon a broken embedded BMP, JPG, EXE or anything else, so you have to be error-resilient; ideally you should detect whether there's corruption, insertion or deletion and try to compensate for that.

    A good segmentation scheme is a moving target: formats keep emerging and evolving, new types of redundancy keep appearing, etc.

    Overall, as of today, highest efficiency is achieved when content creators cooperate with data compression algorithmists, IMO. Until we create a reasonable AI to detect file types automatically and adapt to them, we have to do that more or less manually.
    Yes. Everything you are saying makes perfect sense and I am in agreement.

    What I was describing is more of a simple file segmentation tool. A very complex one that picked up every last imaginable data type would be fantastic, but also impossible until AI is perfected (as you stated).

    The reason I mentioned PAQ8's detection/segmentation process is because it does just enough. In other words, a stand-alone tool that could segment exe/dll, jpeg, bitmap and wav data would be enough. Most compressors have filters for those data types and that's usually all.

    The idea was to have a tool that could identify the most common data types like PAQ8 does, not to identify every possible data type (like DNA data, ogg audio, etc. etc.), just the main types that most archivers look for to apply filters for.

    I don't know how difficult it would be to imitate PAQ8's system, but if someone could make something like I've described, I'd be willing to pay them for the work to be done. I'm offering this because, besides Paul W. and me, nobody else has jumped in and expressed their interest.

  14. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by comp1 View Post
    I don't know how difficult it would be to imitate PAQ8's system, but if someone could make something like I've described, I'd be willing to pay them for the work to be done. I'm offering this because, besides Paul W. and me, nobody else has jumped in and expressed their interest.
    Isn't PAQ8's source code GPL'ed?

    It seems like the problem in creating this tool is deciding exactly what it should do. Should it unarchive things perfectly? Probably not. Then what kind of files, exactly, do you get back? How do you losslessly reassemble them?

    On the other hand, it seems to me that 1) identifying unknown file formats is maybe not as hard as you'd think (http://mark0.net/soft-trid-e.html), and 2) in most cases, knowing the exact file format isn't necessary.

  15. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,480
    Thanks
    26
    Thanked 122 Times in 96 Posts
    On the other hand, it seems to me that 1) identifying unknown file formats is maybe not as hard as you'd think (http://mark0.net/soft-trid-e.html), and 2) in most cases, knowing the exact file format isn't necessary.
    1. The tool looks for known file formats :] It depends on how you understand "unknown". It has a database of known formats and their characteristics.
    2. It isn't, but it's helpful. For example, it's easier to extract the image width from a header than to try to guess it from the content (a minimal sketch of this follows below). Knowing the image width is crucial for good image compression.
    +
    3. It's not about identifying files as a whole but about identifying several embedded streams of different types. Identifying embedded streams is harder than guessing a whole file's type.
    Last edited by Piotr Tarsa; 4th June 2014 at 11:32.
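
    A minimal sketch of point 2, assuming a plain Windows BMP with the common BITMAPINFOHEADER layout (width and height sit at fixed little-endian offsets 18 and 22; no validation beyond the "BM" signature):

    Code:
    // bmpwidth.cpp - read image width/height straight from a BMP header.
    // Illustration only; real detectors also validate header size, bit depth, etc.
    #include <cstdio>
    #include <cstdint>

    int main(int argc, char** argv) {
        if (argc < 2) { std::printf("usage: bmpwidth <file.bmp>\n"); return 1; }
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }
        unsigned char h[26];
        if (std::fread(h, 1, 26, f) != 26 || h[0] != 'B' || h[1] != 'M') {
            std::printf("not a BMP\n"); std::fclose(f); return 1;
        }
        int32_t width  = (int32_t)((uint32_t)h[18] | (uint32_t)h[19] << 8 | (uint32_t)h[20] << 16 | (uint32_t)h[21] << 24);
        int32_t height = (int32_t)((uint32_t)h[22] | (uint32_t)h[23] << 8 | (uint32_t)h[24] << 16 | (uint32_t)h[25] << 24);
        std::printf("width=%d height=%d\n", width, height);
        std::fclose(f);
        return 0;
    }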

  16. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    1. The tool looks for known file formats :] It depends on how you understand "unknown". It has a database of known formats and their characteristics.
    I didn't look too closely at it, but it seemed to have a flexible way of being trained to recognize types. Knowing the type seems useless unless you know what to do with files of that type, anyway.

    2. It isn't, but it's helpful. For example, it's easier to extract the image width from a header than to try to guess it from the content. Knowing the image width is crucial for good image compression.
    Ok. But in other cases, you may only want to know if the data is already compressed, so you can skip re-compressing it.

    +
    3. It's not about identifying files as a whole but about identifying several embedded streams of different types. Identifying embedded streams is harder than guessing a whole file's type.
    That's what I meant. In a lot of cases, the embedded stream is just another file, or, at least, it originated from one.

  17. #16
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Isn't PAQ8's source code GPL'ed?
    I am not a programmer so I apologize if the following question is ridiculous. Isn't it then possible to use the detection and segmentation algorithm in PAQ8's source and then add the file creation process to the source? In other words, isn't the most difficult part of detecting data types already taken care of if it can just be copied from, for example, PAQ8PX's source?

    It seems like the problem in creating this tool is deciding exactly what it should do. Should it unarchive things perfectly? Probably not. Then what kind of files, exactly, do you get back? How do you losslessly reassemble them?

    The tool should create separate files for separate data types. The most important part of creating those multiple segment files is that the headers and/or anything else remain intact so that other archivers can properly identify them. So perhaps the easiest thing would be to just split the original input file "as-is" along the boundaries of the different data types.

    So, if PAQ8 says that from byte 331 to 1990 the data is "exe", then a 1659 byte file should be created by splitting off that portion of the input file.
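
    Taking that example literally, the splitting step itself is simple once a detector reports a byte range. A minimal sketch (names and usage made up, e.g. "cut FILE.DLL 331 1659 seg2.exe"):

    Code:
    // cut.cpp - copy bytes [start, start+len) of an input file into a new file.
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc < 5) { std::printf("usage: cut <in> <start> <len> <out>\n"); return 1; }
        std::FILE* in  = std::fopen(argv[1], "rb");
        std::FILE* out = std::fopen(argv[4], "wb");
        if (!in || !out) { std::perror("fopen"); return 1; }
        long start = std::atol(argv[2]), len = std::atol(argv[3]);
        std::fseek(in, start, SEEK_SET);          // jump to the start of the detected segment
        for (long i = 0; i < len; ++i) {
            int c = std::getc(in);
            if (c == EOF) break;                  // range ran past the end of the file
            std::putc(c, out);
        }
        std::fclose(in); std::fclose(out);
        return 0;
    }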

  18. #17
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    886
    Thanks
    52
    Thanked 107 Times in 85 Posts
    Quote Originally Posted by comp1 View Post
    Does anyone else see a use for a stand-alone command line tool for this, or just me?
    i do

  19. Thanks:

    Bulat Ziganshin (6th June 2014)

  20. #18
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    452
    Thanks
    182
    Thanked 271 Times in 149 Posts
    I did something similar. I had to restore lost data from a recovered hdd image (corrupted filesystem). All tools generated hundreds of jpg images, but the files were actually videos from digital cameras.
    In the encoder I used something like this (one way only).
    Code:
    void direct_encode_block(Filetype type, FILE *in, int len, Encoder &en, int s1, int s2, int info=-1, int r=0) {
      printf("\n");
      // store the detected block type and length in the compressed stream
      en.compress(type);
      en.compress(len>>24);
      en.compress(len>>16);
      en.compress(len>>8);
      en.compress(len);
      if (info!=-1) {
        // extra per-type info (e.g. image width) is stored the same way
        en.compress(info>>24);
        en.compress(info>>16);
        en.compress(info>>8);
        en.compress(info);
      }
      if (level>0) printf("Compressing... ");
      static const char* typenames[14]={"default", "jpeg", "hdr",
        "1b-image", "4b-image", "8b-image", "24b-image", "audio", "mp4",
        "exe", "cd", "text", "utf-8", "base64"};
      char b2[32];
      //r--;
      // name each segment file "<block number>.<detected type>", e.g. "3.jpeg"
      sprintf(b2, "%d.%s", r, typenames[type]);
      FILE* dtmp1=fopen(b2, "wb+");
      const int total=s1+len+s2;
      for (int j=s1; j<s1+len; ++j) {
        if (!(j&0xfff)) printStatus(j, total);
        U8 g=getc(in);
        en.compress(g);    // compress the byte as usual...
        fputc(g, dtmp1);   // ...and also write it raw to the segment file
      }
      fclose(dtmp1);
      if (level>0) printf("\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b");
    }
    r is the block number.
    Not perfect, kind of a stupid solution, but it did what was needed.
    KZo


  21. #19
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by SvenBent View Post
    i do
    Ok so now we have 3 interested people. Who can help make such a program? I can't donate programming knowledge but I'd be happy to donate money if someone is willing to undertake this project to compensate them for their time.

  22. #20
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by comp1 View Post
    I am not a programmer so I apologize if the following question is ridiculous. Isn't it then possible to use the detection and segmentation algorithm in PAQ8's source and then add the file creation process to the source? In other words, isn't the most difficult part of detecting data types already taken care of if it can just be copied from, for example, PAQ8PX's source?
    I'd assume that you could take whatever PAQ8 is doing that's helpful. I don't know the answers to all your questions, but if you know exactly what you want and another program already does the individual parts, then it can be done. But the next question is whether it's practical and the result justifies the effort. That I can't answer.

  23. #21
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nburns View Post
    I'd assume that you could take whatever PAQ8 is doing that's helpful. I don't know the answers to all your questions, but if you know exactly what you want and another program already does the individual parts, then it can be done. But the next question is whether it's practical and the result justifies the effort. That I can't answer.
    Right. What I want is the data detection/segmentation algorithm from paq8px in a stand-alone program that outputs the segments into separate files.

    Then perhaps, if it is as useful as I and a few others believe it will be, others can improve upon it.
    Last edited by comp1; 5th June 2014 at 01:17.

  24. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by comp1 View Post
    Right. What I want is the data detection/segmentation algorithm from paq8px in a stand-alone program that outputs the segments into separate files.
    Sorry, my last post was too vague. There are going to be multiple ways of splitting out the files, and I suspect that choosing a way that gets what everyone wants, at an acceptable cost, is what's going to be a hard thing to decide on. The result will be a compromise.

    Edit: here's what I would want the output of the tool to be: I would want a file that lists the types and the offsets of all the internal files that it found. Nothing else. That way, the tool hasn't done anything destructive, and you have all the information to very easily chop up the archive if you want.
    Last edited by nburns; 5th June 2014 at 05:41.
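
    Such a listing could be as simple as one line per detected stream giving offset, length and type, for example (made-up numbers):

    Code:
    offset   length   type
    0        331      default
    331      1659     exe
    1990     84211    jpeg
    86201    120000   default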

  25. #23
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nburns View Post
    Sorry, my last post was too vague. There are going to be multiple ways of splitting out the files, and I suspect that choosing a way that gets what everyone wants, at an acceptable cost, is what's going to be a hard thing to decide on. The result will be a compromise.

    Edit: here's what I would want the output of the tool to be: I would want a file that lists the types and the offsets of all the internal files that it found. Nothing else. That way, the tool hasn't done anything destructive, and you have all the information to very easily chop up the archive if you want.
    I think I understand what you're saying. Identify and report the different data types detected within a file and where in the file each one is (the byte ranges, like PAQ8 displays on the far right), then use a separate file-splitting tool to do the splitting.

    I'm not opposed to that idea. What I was suggesting for now would basically be to take PAQ8PX and add a "-l" switch to it. By that I mean a switch like the -l switch that older versions of DURILCA had, which wrote out the segments it found internally as files.

    So perhaps to start with (and to really test this theory of mine), maybe all that is needed is a PAQ8PX modification as opposed to an entirely new program. Modify it in a way that one can type:

    Code:
    paq8px_v69 -0 -l AcroRd32.exe
    And be given multiple, untouched segments of AcroRd32.exe based upon the data type detection algorithm it already has.

    Then, have a way to join the files again (would "copy /b" work)?
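
    Assuming the segments are raw, in-order slices of the original file, copy /b should indeed rebuild it, e.g. (hypothetical segment names):

    Code:
    copy /b AcroRd32.1+AcroRd32.2+AcroRd32.3+AcroRd32.4+AcroRd32.5+AcroRd32.6 AcroRd32_joined.exe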

    Paul W, SvenBent, are you both still in agreement with me as a starting point? Or have I lost you both and strayed away from what you both were thinking?

  26. #24
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    seg_file's approach was to divide a file into segments with similar order-1 stats, so it's a general approach, while paq8 seems to have a huge amount of specific data detectors.

    also, since it's OSS, seg_file may be compiled as a 64-bit program - attached
    Attached Files
    Last edited by Bulat Ziganshin; 6th June 2014 at 00:28.
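
    For reference, a very rough sketch of the order-1 idea (my own toy version, not seg_file's actual algorithm): build an order-1 histogram (previous byte -> current byte) for adjacent blocks and flag a boundary where the two distributions diverge. Block size and threshold are arbitrary.

    Code:
    // seg1.cpp - toy segmenter: mark places where order-1 byte statistics shift.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // fill a 256x256 order-1 frequency table for one block
    static void stats(const unsigned char* p, size_t n, std::vector<double>& h) {
        std::fill(h.begin(), h.end(), 0.0);
        for (size_t i = 1; i < n; ++i) h[p[i-1] * 256 + p[i]] += 1.0;
        for (double& v : h) v /= (double)(n - 1);   // normalize counts to frequencies
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::printf("usage: seg1 <file>\n"); return 1; }
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }
        std::vector<unsigned char> b;
        int c;
        while ((c = std::getc(f)) != EOF) b.push_back((unsigned char)c);
        std::fclose(f);
        const size_t BLOCK = 16384;                 // block size (arbitrary)
        std::vector<double> prev(65536), cur(65536);
        for (size_t off = BLOCK; off + BLOCK <= b.size(); off += BLOCK) {
            stats(&b[off - BLOCK], BLOCK, prev);    // stats of the block before 'off'
            stats(&b[off],         BLOCK, cur);     // stats of the block after 'off'
            double d = 0.0;                         // L1 distance between the two tables
            for (size_t i = 0; i < 65536; ++i) d += std::fabs(prev[i] - cur[i]);
            if (d > 0.5)                            // threshold is arbitrary
                std::printf("possible boundary near offset %zu (distance %.3f)\n", off, d);
        }
        return 0;
    }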

  27. #25
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    seg_file's approach was to divide a file into segments with similar order-1 stats, so it's a general approach, while paq8 seems to have a huge amount of specific data detectors. also, since it's OSS, seg_file may be compiled as a 64-bit program - attached
    Yes. Seg_file served a purpose. The idea here is the same, but along the lines of segmenting by data type to assist compressors that aren't as sophisticated at identifying inlined/embedded data.

    I contend that PAQ8PX's core compressor isn't as powerful as it appears and that its true brilliance is in its data detection. If paq8px compressed an entire 50 MB file as "default" because no special data filters were used, its ratios would not be nearly as impressive. At the same time, with a program like the one I've described to detect and segment files prior to compression, other compressors' ratios would be more impressive.
    Last edited by comp1; 6th June 2014 at 01:13.

  28. #26
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    For raw data there are some open-source data recovery tools which could be pretty useful here. Foremost, which is public domain, and PhotoRec, which is GPLv2+, come to mind. It's been a while since I used it, but I seem to remember Foremost being pretty modular, with the ability to define additional file types pretty easily... if I were you I would try to find a way to re-use those definitions, probably by copying code from Foremost.

    IIRC executables and DLLs on Windows have a specific format for embedding resources, so it should be pretty straightforward to find the offsets of each resource and deal with them separately. Searching for "portable executable extract resource" should get you on the right track for that. I know there are a bunch of tools to do it, but it's Windows so finding code you can reuse might be tricky. I don't think the tools nburns linked to will be very helpful because you don't just want to extract the content, you want to know where it is so you can treat it differently for compression/decompression... what you want is basically the Windows equivalent of libelf.

    On Linux, it's not actually that common to compile resources into executables. We're starting to see it in some GTK+ and GNOME applications (they're using GResource), but even then it's mostly used for pretty small amounts of data (like GTK+ builder UI definitions)--larger resources like images, sounds, etc. are generally still external files.

  29. #27
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nemequ View Post
    For raw data there are some open-source data recovery tools which could be pretty useful here. Foremost, which is public domain, and PhotoRec, which is GPLv2+, come to mind. It's been a while since I used it, but I seem to remember Foremost being pretty modular, with the ability to define additional file types pretty easily... if I were you I would try to find a way to re-use those definitions, probably by copying code from Foremost.

    IIRC executables and DLLs on Windows have a specific format for embedding resources, so it should be pretty straightforward to find the offsets of each resource and deal with them separately. Searching for "portable executable extract resource" should get you on the right track for that. I know there are a bunch of tools to do it, but it's Windows so finding code you can reuse might be tricky. I don't think the tools nburns linked to will be very helpful because you don't just want to extract the content, you want to know where it is so you can treat it differently for compression/decompression... what you want is basically the Windows equivalent of libelf.

    On Linux, it's not actually that common to compile resources into executables. We're starting to see it in some GTK+ and GNOME applications (they're using GResource), but even then it's mostly used for pretty small amounts of data (like GTK+ builder UI definitions)--larger resources like images, sounds, etc. are generally still external files.
    I appreciate the information. But what a few of us are after is a program specifically designed to find special data inside files that can't simply be unpacked or decompressed.

    If a .doc file has a jpeg image in it, the tool will make a separate file with that jpeg. If an exe has a bmp in it, the program will make a separate file with the bmp data.

    As of now, we don't have a dedicated tool for this.

    Perhaps I should try to get in touch with Jan Ondrus since he is the author of PAQ8PX and that program already has an ideal detection algorithm. NanoZip also does a good job at this.

  30. #28
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by comp1 View Post
    Yes. Seg_file served a purpose. The idea here is the same, but along the lines of segmenting by data type to assist compressors that aren't as sophisticated at identifying inlined/embedded data.

    I contend that PAQ8PX's core compressor isn't as powerful as it appears and that its true brilliance is in its data detection. If paq8px compressed an entire 50 MB file as "default" because no special data filters were used, its ratios would not be nearly as impressive. At the same time, with a program like the one I've described to detect and segment files prior to compression, other compressors' ratios would be more impressive.
    You could test that theory without creating a new tool. Try compressing sets of files both separately and combined into an archive with different compressors. Then you can estimate the value of PAQ8's splitter as distinct from its core compressor (assuming that's a distinction worth making).
    Last edited by nburns; 6th June 2014 at 03:26.

  31. #29
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    330
    Thanks
    190
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by nburns View Post
    You could test that theory without creating a new tool. Try compressing sets of files both separately and combined into an archive with different compressors. Then you can estimate the value of PAQ8's splitter as distinct from its core compressor (assuming that's a distinction worth making).
    I have. This is why I came to the conclusion that a new tool for this is necessary. Here are some results: (shar = Shelwien's program)

    CompressionRatings.com Reference Files =
    Code:
    acrord32.exe
    book1
    E.coli
    enwik6
    flashmx5m.pdf
    fp5m.log
    geo
    lena.ppm
    obj2
    ohs.doc
    penderecki-capriccio.wav
    vcfiu.hlp
    zhwik6
    PAQ8PX -8 = 5,741,205
    PAQ8PX -8 (SHAR) = 5,741,331
    WinRK (PWCM) = 5,951,010
    WinRK (PWCM, SHAR) = 6,515,948

    IMHO the reason PAQ8PX achieves a better ratio is because it detects data inside the files (e.g. the 8b-image in acrord32.exe; the jpeg in ohs.doc) that WinRK does not detect (as illustrated by WinRK's result when compressing the single .shar file).

    I can do more benchmarks if needed or use other archivers. But this demonstrates my idea I believe.

  32. #30
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by comp1 View Post
    I appreciate the information. But what a few of us are after is a program specifically designed to find special data inside files that can't simply be unpacked or decompressed.

    If a .doc file has a jpeg image in it, the tool will make a separate file with that jpeg. If an exe has a bmp in it, the program will make a separate file with the bmp data.

    As of now, we don't have a dedicated tool for this.
    I understand that. I'm saying that it would probably be much easier to re-purpose the source code from something like Foremost than to try to support every format individually... You pass it a blob of data, it carves that data up into the individual files you're interested in. Pass it a .doc file and it will probably detect the JPEG inside, assuming .doc is sensible enough to store the JPEG as-is. Make it recursive so that if it detects a container like a .doc or .exe it will just pass the data back to a new instance, possibly offset a byte or two so it's not just detected as a .doc or .exe, and it should detect whatever is inside.

    There are formats where this will not really work—IIRC ODF and OOXML are basically just zip files, so individual files will not be detectable without special processing. However, AFAIK PDF, PE, ELF, tar, etc. generally store content in ways that could be detected with a generic data carving program instead of having to write special support. Without some sort of generic fallback you would have to write special code for every format, which would be quite horrible—I could probably list a couple dozen off the top of my head.

    So, if I were you, I would start with something like Foremost for generic data, then use libarchive for specific formats. If there are formats libarchive doesn't support which will not work with data carving write libarchive plugins for those formats. That way many more people benefit (basically any software which consumes the libarchive API), and hopefully some of the maintenance cost will be taken up by others.

    If you really don't want to use data carving at all then libarchive is already basically what you want, it just likely doesn't include plugins for all the formats you might want to support.
