
Thread: Smart Binary File Splitter

  1. #1
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    334
    Thanks
    192
    Thanked 57 Times in 41 Posts

    Smart Binary File Splitter

    Hi everyone,

    I was trying a few experiments with a few archivers (mainly PAQ) and I had a thought.

    Many archivers do NOT use multi-core/multi-threading technology, so splitting the TAR (SHAR in my case) file into 4, 6, 8, etc. parts (depending on available threads) is a solution.

    A new problem then comes up: what if the file splitter cuts the file in the middle of an exe, image, or similar data? Then PAQ8 or any other program cannot detect that type of data and therefore cannot use its special filter(s).

    My proposal (if it does not already exist) is for someone to create a win32 command-line file splitter that first scans the file to locate these special types of data. Then, when splitting happens, the splitter will not cut through them.

    In other words, the splitter syntax should be:

    splitter -p8 FILE.EXT

    "-p8" will tell it to split the file into 8 parts (p = parts). Then, after the scan, it would attempt to do the most equal splitting of the file into 8 parts (FILE.EXT.001, FILE.EXT.002, etc.)

    What do you guys think? Is something like this possible? I think it would be very, very useful.
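
    A rough sketch of how the cut points could be chosen once a scan has produced a list of detected streams. The name `choose_cuts` and the `(start, end)` range format are hypothetical, not part of any existing tool, and the detection pass itself is assumed to exist already. Ranges are taken to be half-open and non-overlapping.

```python
def choose_cuts(size, parts, protected):
    """Pick cut offsets near an equal split that never fall inside a stream.

    size      -- total file size in bytes
    parts     -- requested number of parts (the N of a hypothetical -pN)
    protected -- list of (start, end) byte ranges that must stay whole
    """
    cuts = set()
    for i in range(1, parts):
        ideal = size * i // parts          # where an equal split would cut
        cut = ideal
        for start, end in protected:
            if start < ideal < end:        # ideal cut lands inside a stream
                # Snap to the nearer edge of that stream instead.
                cut = start if ideal - start <= end - ideal else end
                break
        if 0 < cut < size:                 # skip cuts that would make an empty part
            cuts.add(cut)
    # Two ideal points may snap to the same edge, so fewer cuts can come back.
    return sorted(cuts)


if __name__ == "__main__":
    print(choose_cuts(100, 4, [(20, 30), (45, 80)]))
```

    Note that when one large stream covers several ideal cut points, the result has fewer parts than requested; a real tool would have to decide whether that is acceptable.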

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    It's not hard to split a file into eight parts. I may post a bijective file slicer or whatever in a few days. You just read with my bijective file I/O, which actually reads the bits in the file, then write them out to the eight files, each of which is one slice. Just look at my bijective arithmetic coder arb255 to see how the file bit I/O works. One note: a file to be sliced ends with the last one bit seen. So if you have only ZEROS written to a split file, it actually is a NULL FILE. When reading any file of real data, the bijective bit I/O will always read the last one as -1 for the file, so there is no problem handling files of all zero bytes. When reading a NULL file while combining, you treat it as a ZERO. You read until the last non-null file returns its last one. If the files are all null, the 8 combine to a single null file.

    So yes, it's possible. I prefer to do it bijectively so there is no wasted space. Is it useful? SOMETIMES.
    chow

  3. #3
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    334
    Thanks
    192
    Thanked 57 Times in 41 Posts
    Quote Originally Posted by biject.bwts View Post
    It's not hard to split a file into eight parts. I may post a bijective file slicer or whatever in a few days. You just read with my bijective file I/O, which actually reads the bits in the file, then write them out to the eight files, each of which is one slice. Just look at my bijective arithmetic coder arb255 to see how the file bit I/O works. One note: a file to be sliced ends with the last one bit seen. So if you have only ZEROS written to a split file, it actually is a NULL FILE. When reading any file of real data, the bijective bit I/O will always read the last one as -1 for the file, so there is no problem handling files of all zero bytes. When reading a NULL file while combining, you treat it as a ZERO. You read until the last non-null file returns its last one. If the files are all null, the 8 combine to a single null file.

    So yes, it's possible. I prefer to do it bijectively so there is no wasted space. Is it useful? SOMETIMES.
    chow
    So you are saying that the program you are talking about will scan the entire file and detect EXE, image, etc. streams and will not fragment those streams when splitting the entire file into the 8 parts?

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,564
    Thanks
    773
    Thanked 687 Times in 372 Posts
    it's a great idea! then we can run shar again to join these parts, so the process will be recursive (and bijective, of course)

  5. #5
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    334
    Thanks
    192
    Thanked 57 Times in 41 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    it's a great idea! then we can run shar again to join these parts, so the process will be recursive (and bijective, of course)
    Exactly...

    Processors are so powerful now that using PAQ8 is almost reasonable. The problem is that normal file splitters do not detect different data streams, so they fragment them; PAQ then cannot detect the streams, and the compression ratio drops.

    Command line file splitting that is data-stream aware will be fantastic!

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    It could be done, but not perfectly. If you have a tar file, then you could parse the headers to find file boundaries. You could also support less widely used formats like shar, qfc, 7zip -m0, zip -0, and probably lots of others. How many do you want to support? And what about expanding compressed archives like ordinary zip?
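
    For the tar case, a minimal sketch using Python's standard `tarfile` module to list where each member starts and where its data ends. The function name `tar_member_offsets` is made up for illustration; it only handles uncompressed tar and says nothing about shar, qfc, or the other formats mentioned.

```python
import tarfile


def tar_member_offsets(fileobj):
    """List (name, header_offset, data_end) for each member of a plain tar.

    header_offset is where the member's 512-byte header begins, so cutting
    the archive at any header_offset keeps every archived file in one piece.
    """
    # Mode "r:" means: read, and refuse anything that is compressed.
    with tarfile.open(fileobj=fileobj, mode="r:") as tf:
        return [(m.name, m.offset, m.offset_data + m.size)
                for m in tf.getmembers()]


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for name, start, end in tar_member_offsets(f):
            print(f"{start:>12} {end:>12} {name}")
```

    The header offsets are natural candidate cut points; whether a stream embedded inside one member is also kept whole is a separate question.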

    JPEG images can be found embedded in lots of other formats like DLL, PDF, DOC, with no general rule. You could parse each file type that might have them, or try your luck with looking for patterns common in JPEG headers and accept some errors. And what about encoded formats like BASE64?
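
    The "try your luck" approach can be sketched as a plain signature scan. This looks only for the JPEG start-of-image marker (bytes FF D8) followed by the FF that begins the next marker; the function name `find_jpeg_candidates` is hypothetical, and as the post says, some hits will be false positives and the end of each image is not determined here.

```python
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus first byte of next marker


def find_jpeg_candidates(data):
    """Return every offset in data where a JPEG stream might begin."""
    offsets = []
    pos = data.find(JPEG_SOI)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(JPEG_SOI, pos + 1)
    return offsets


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(find_jpeg_candidates(f.read()))
```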

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,494
    Thanks
    26
    Thanked 131 Times in 101 Posts
    I think cutting could sometimes improve compression (well, by the counting argument, we can be sure of that). I'm not sure, but IIRC CCM from Christian Martelock detects stream type by data characteristics, not by headers, so CCM in theory shouldn't miss a BMP, WAV or whatever it likes when that BMP, WAV or whatever data type is split.

  8. #8
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    334
    Thanks
    192
    Thanked 57 Times in 41 Posts
    Quote Originally Posted by Matt Mahoney View Post
    It could be done, but not perfectly. If you have a tar file, then you could parse the headers to find file boundaries. You could also support less widely used formats like shar, qfc, 7zip -m0, zip -0, and probably lots of others. How many do you want to support? And what about expanding compressed archives like ordinary zip?

    JPEG images can be found embedded in lots of other formats like DLL, PDF, DOC, with no general rule. You could parse each file type that might have them, or try your luck with looking for patterns common in JPEG headers and accept some errors. And what about encoded formats like BASE64?
    Interesting. Well, shar is the only one I use because it is superior to tar in that the resulting file is always smaller, since it needs less data to make the archive (and it is very fast too!).

    If one win32 command-line executable could do 2 passes on a shar file (1st pass: detect headers of exe, bmp, wav, etc.; 2nd pass: split the file into -pN parts using the header data to keep streams intact), then that could easily be placed into a batch file as a step with precomp, reflate, srep, etc., and that would be perfect and simple.
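
    The second pass could look roughly like the sketch below, which assumes the first pass has already produced a sorted list of cut offsets. The function name `write_parts` is hypothetical; the `.001`, `.002` naming follows the scheme proposed in the first post.

```python
from pathlib import Path


def write_parts(path, cuts, chunk=1 << 20):
    """Copy path into numbered parts, cutting at the given byte offsets.

    cuts must be sorted and lie strictly between 0 and the file size.
    Returns the list of part paths (FILE.EXT.001, FILE.EXT.002, ...).
    """
    path = Path(path)
    bounds = [0, *cuts, path.stat().st_size]
    parts = []
    with path.open("rb") as src:
        for n, (lo, hi) in enumerate(zip(bounds, bounds[1:]), start=1):
            part = path.with_name(f"{path.name}.{n:03d}")
            with part.open("wb") as dst:
                remaining = hi - lo
                while remaining > 0:
                    # Copy in bounded chunks so large files are not held in memory.
                    block = src.read(min(chunk, remaining))
                    if not block:
                        break
                    dst.write(block)
                    remaining -= len(block)
            parts.append(part)
    return parts
```

    Joining the parts back is plain concatenation in numeric order, so no extra metadata is needed.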

    @Piotr Tarsa
    That's good info about CCM. I did not know that.

  9. #9
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I'm not sure, but IIRC CCM from Christian Martelock detects stream type by data characteristics, not by headers, so CCM in theory shouldn't miss a BMP, WAV or whatever it likes when that BMP, WAV or whatever data type is split.
    Yes, this is correct, see this old forum post from Christian:

    CCM detects everything by content, therefore it doesn't need the headers (EXE, RGB24, DELTA, even image width and stuff). Just try it out. Still, if available, EXE headers are used for repositioning the address-transform - IIRC.
    http://schnaader.info
    Damn kids. They're all alike.

  10. #10
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by comp1 View Post
    So you are saying that the program you are talking about will scan the entire file and detect EXE, image, etc. streams and will not fragment those streams when splitting the entire file into the 8 parts?
    NO, all I said was it's easy to bijectively slice any file into 8 files. The rest is something else.

  11. #11
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Simple Bijective File Splitter Code

    This code treats any file as either a single file or as two files combined.

    split filein fileout1 fileout2
    divides the filein into two separate output files.

    unsplit filein1 filein2 fileout
    takes two separate input files and combines into one output file.

    First of all, think of the combined file as the result of a Cantor pairing function, where each file is treated as a number, the null file being zero.

    see http://en.wikipedia.org/wiki/Pairing_function
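
    To make the pairing idea concrete, here is an illustrative sketch that is NOT the attached code. It maps each byte string to a natural number with bijective base-256 numeration (the empty file is zero), then applies the standard Cantor pairing function. All function names are hypothetical, and the attached split/unsplit work on bits rather than on big integers.

```python
from math import isqrt


def bytes_to_nat(data):
    """Bijective base-256: every byte string maps to exactly one natural number."""
    n = 0
    for b in reversed(data):       # first byte is the least significant digit
        n = n * 256 + (b + 1)      # digits run 1..256, so no leading-zero clash
    return n


def nat_to_bytes(n):
    """Inverse of bytes_to_nat."""
    out = bytearray()
    while n > 0:
        n, r = divmod(n - 1, 256)
        out.append(r)
    return bytes(out)


def pair(x, y):
    """Cantor pairing: a bijection from pairs of naturals to naturals."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z):
    """Inverse of pair."""
    w = (isqrt(8 * z + 1) - 1) // 2
    y = z - w * (w + 1) // 2
    return w - y, y


def unsplit(a, b):
    """Combine two byte strings into one, with no length header."""
    return nat_to_bytes(pair(bytes_to_nat(a), bytes_to_nat(b)))


def split(c):
    """Recover the two byte strings that unsplit combined."""
    x, y = unpair(bytes_to_nat(c))
    return nat_to_bytes(x), nat_to_bytes(y)
```

    Every file is a valid output of `unsplit` and every pair of files a valid input, which is what "bijective" means here. The big-integer arithmetic is only practical for tiny inputs, so treat this as an illustration of the idea rather than a usable tool.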

    The attached code was tested on my machine but may not work on yours; it depends a lot on
    how your machine or compiler handles null files.

    Since this simple method is the optimal bijective method for combining two files
    of exactly the same length, it is not such a good method when you have many
    files of vastly different lengths. Please test this, and I will post a simple method
    of bijective joining for files whose lengths differ vastly.

    This was also posted so people can see how easy it is to use bijective bit I/O for files.
    **NOTE** abc is the password for my uploaded 7z file
    Last edited by biject.bwts; 3rd May 2014 at 17:48.
