Over the last few weeks, I have been working on optimizing Compound File Binary Format (CFBF) files: MSI, old DOC/PPT/XLS files, etc. Pretty much what Document Press does: no optimization of file content, just of the container.
tl;dr: I didn’t find an optimal solution. I reached Document Press’ compression rates; sometimes better, sometimes worse. Program & source code at the end of this post. Here’s what I learned:
Previous readings:
- https://encode.su/threads/336-Document-Press-6-012
- https://encode.su/threads/897-Docume...-Version-6-013
- https://encode.su/threads/2864-OLE-O...-for-cmd-tools
1. CFBF is a FAT-like file system.
Most of you will know this, but let’s have a clear foundation: CFBF is not compressed, but it does support features like concurrent editing, transactions, and reverting to earlier snapshots. It’s built on the FAT concept (details in the specification). Compression problems that arise from this:
- leftover data from earlier snapshots
- fragmentation
2. There are two CFBF versions.
Version 3 uses 512-B sectors while version 4 uses 4096-B sectors. Therefore, v4 has an advantage with large files. There is no exact threshold because it depends on the number of files, but the breakeven size is often between 3 and 8 MiB.
V3 files are produced by old MS Office versions and by MSI setups prior to ~2010. MSI setups that have been built after ~2010 are almost exclusively v4 files.
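To tell the two versions apart, it is enough to look at the file header. The following is a minimal sketch, assuming the header layout from the specification (signature at offset 0, major version at offset 0x1A, sector shift at offset 0x1E, all little-endian). The names `CfbfInfo` and `parse_cfbf_header` are made up for this example and are not taken from bestcfbf.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper type for this sketch.
struct CfbfInfo {
    std::uint16_t major_version;  // 3 or 4
    std::uint32_t sector_size;    // 512 for v3, 4096 for v4
};

// Inspects the first 512 bytes of a file and returns the CFBF version and
// sector size, or std::nullopt if the buffer does not look like a CFBF header.
std::optional<CfbfInfo> parse_cfbf_header(const unsigned char* buf, std::size_t len) {
    static const unsigned char kSignature[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    if (len < 512 || std::memcmp(buf, kSignature, sizeof kSignature) != 0) {
        return std::nullopt;
    }
    // Header fields are stored little-endian.
    auto le16 = [buf](std::size_t off) {
        return static_cast<std::uint16_t>(buf[off] | (buf[off + 1] << 8));
    };
    const std::uint16_t major = le16(0x1A);
    const std::uint16_t shift = le16(0x1E);
    // v3 must use 2^9 = 512-B sectors, v4 must use 2^12 = 4096-B sectors.
    const bool valid = (major == 3 && shift == 9) || (major == 4 && shift == 12);
    if (!valid) {
        return std::nullopt;
    }
    return CfbfInfo{major, std::uint32_t{1} << shift};
}
```

Feed it the first 512 bytes of the file; anything that fails the signature or version check is simply rejected.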
3. Document Press never uses the v4 format.
It tries to convert all v4 files to v3. This has huge advantages for small setups (e.g. 72 KiB → 55 KiB) but also great disadvantages for large setups. I have never seen Document Press reduce the size of a v4 CFBF file. It only copies the input because that’s smaller.
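Why v3 wins for small containers and v4 for large ones can be illustrated with a rough size estimate. This is only a sketch under simplifying assumptions: it uses the spec's constants (4096-B mini stream cutoff, 64-B mini sectors, 128-B directory entries, 4-B FAT entries, one sector for the header) and ignores the mini FAT, DIFAT and the FAT entries that describe the FAT itself. `estimate_cfbf_size` is a hypothetical name, and the numbers it produces are not the sizes bestcfbf or Document Press would actually produce.

```cpp
#include <cstdint>
#include <vector>

namespace {
constexpr std::uint64_t kMiniCutoff = 4096;  // streams below this go to the mini stream
constexpr std::uint64_t kMiniSector = 64;    // mini sector size
constexpr std::uint64_t kDirEntry = 128;     // bytes per directory entry
constexpr std::uint64_t kFatEntry = 4;       // bytes per FAT entry

std::uint64_t round_up(std::uint64_t n, std::uint64_t unit) {
    return (n + unit - 1) / unit * unit;
}
}  // namespace

// Rough container size for the given stream sizes and sector size (512 or 4096).
// Assumption: no mini FAT, no DIFAT, no FAT self-reference, no free sectors.
std::uint64_t estimate_cfbf_size(const std::vector<std::uint64_t>& stream_sizes,
                                 std::uint64_t sector_size) {
    std::uint64_t mini = 0;
    std::uint64_t regular = 0;
    for (std::uint64_t s : stream_sizes) {
        if (s < kMiniCutoff) {
            mini += round_up(s, kMiniSector);
        } else {
            regular += round_up(s, sector_size);
        }
    }
    // One directory entry per stream plus the root entry.
    const std::uint64_t directory =
        round_up((stream_sizes.size() + 1) * kDirEntry, sector_size);
    const std::uint64_t payload = round_up(mini, sector_size) + regular + directory;
    // One FAT entry per payload sector.
    const std::uint64_t fat = round_up(payload / sector_size * kFatEntry, sector_size);
    const std::uint64_t header = sector_size;  // 512 B in v3, a full 4096-B sector in v4
    return header + payload + fat;
}
```

With many small streams, the 4096-B rounding dominates and v3 comes out smaller; with a few very large streams, the smaller FAT of v4 wins.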
4. Defragmentation *is* important.
That’s what Document Press’s -opt does. CFBF allocates small files (<4096 B) and large files (≥4096 B) from different streams, but they share the underlying sector system. Writing small files, then large ones, and again small ones will almost certainly lead to fragmentation. Tom Jebo from the Microsoft Open Specification Support Blog (defunct for two weeks now) has written about it in a two-part series (Exploring the Compound File Binary Format, Exploring the Compound File Binary Format (part deux)), but you should not trust the source code because it bloats the resulting file. Document Press got it right as far as I can see.
The true™ way of defragmenting a CFBF file (as far as I tried) is:
- Create the directory and all streams (with empty sizes!).
- Copy small files. Or large files? I don’t know, see below. But keep them separate from each other!
- Properly close all open handles or else there will be scratch space or snapshots left in the file.
5. Then, it gets really fuzzy.
Once you have removed needless snapshots & scratch space from the file, you’ve reached the minimal size. Defragmentation will not reduce size any further, so you have to compare sizes after compression, e.g. with LZMA.
And now you’re left with the dilemma of finding the correct way of sorting files to help compression. I didn’t look into this and neither did Document Press (I bet; I beat it and got beaten by it at random).
6. MSI probably uses a bastardized v4 format.
This is something I didn’t know but would like to have information on.
Even though v4 is defined to have 4096-B sectors, I have seen MSIs missing 7/8 of the last sector, i.e. it’s just 512 B instead of 4096. I don’t know why that is. But I certainly know that it defeats optimization because no matter how hard you defragment, your standards-compliant result will always be 3584 B larger than the fragmented version you started with!
Moreover, I have seen many (but not all!) v4 MSIs being already optimized and defragmented, including some I have generated myself via WiX or via MSI API. Considering the last-sector trick, it’s impossible to beat them. (Meaning Microsoft has put some effort into this after ca. 2010, which is good, isn’t it?)
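The 3584 B figure follows directly from the sector arithmetic: a standards-compliant v4 file is a whole number of 4096-B sectors, so a file whose last sector holds only 512 B is 4096 − 512 = 3584 B short. A small sketch to detect such files (the function name `missing_tail_bytes` is made up for this example):

```cpp
#include <cstdint>

// Returns how many bytes are missing to pad the file to a full sector
// boundary; 0 means the file ends exactly on a sector boundary.
std::uint64_t missing_tail_bytes(std::uint64_t file_size, std::uint64_t sector_size) {
    const std::uint64_t remainder = file_size % sector_size;
    return remainder == 0 ? 0 : sector_size - remainder;
}
```

A nonzero result for a v4 file means that a compliant rewrite has to add that many bytes back before any real savings show up.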
That’s it. You can download my program here (Windows x86-32): https://papas-best.com/downloads/bes...6/bestcfbf.exe
And the C++ source code here: https://papas-best.com/downloads/bes...e/bestcfbf.cpp
If you want to optimize and defrag a CFBF file, run it with
    bestcfbf <in> <out> [-v4]
For normal files, I recommend using it twice (one time with -v4 and one time without) and picking the smaller file. If it’s larger than the input, then you probably have a v4 MSI.
If you prepare the CFBF for compression, also use Document Press on it and select the one that compresses best.
I think the last way to improve anything would be reading and writing individual FAT sectors (which I didn’t, I just used the Shell Lightweight COM API). Igor Pavlov is a skilled programmer and there’s some chance that Document Press already does that.