In Surfer's ISO thread, I speculated about LZX recompression. After a bit of playing with FEPInstall.exe from the ISO and CabArc, I came up with a little proof of concept:
Code:
FEPInstall.exe 19.577.576 bytes
FEPInstall\*.* 31.211.856 bytes, 147 files // Extracted using 7-Zip
FEPInstall.7z_store.srep.7z_max 18.574.538 bytes // Demonstrating we can really save bytes if we have access to the decompressed data
FEPInstall.cab 19.282.819 bytes // compressed using "CabArc -r -m LZX:21 n FEPInstall.cab FEPInstall\*.*"
FEPInstall.cab.srep 19.282.919 bytes // SREP can neither compress this file ...
FEPInstall.exe.srep 19.576.110 bytes // ... nor that one
FEP_concat 38.860.395 bytes // Concatenation of FEPInstall.exe and FEPInstall.cab
FEP_concat.srep 27.347.677 bytes // Only 40% larger than FEPInstall.exe.srep, so we have some matches
This one isn't very good: the +40% from SREP means that we cannot reconstruct the whole original file (if we could, the overhead would be close to 0%). Additionally, we can only save around 5% by working on the decompressed data, so we'd end up with a larger file. But it demonstrates that what CabArc creates is close to the original, even though we used 7-Zip to decompress.
Recursion could help a bit, though: there are many LZX-compressed files inside FEPInstall. Decompressing all of them yields 67,5 MB of data in 5588 files (!), which can be compressed to 15 MB using SREP and 7-Zip Maximum (about 25% smaller than FEPInstall.exe, still not the 40%+ we'd need, but closer).
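For anyone who wants to check the percentages, here is a small Python sketch of the arithmetic, using only the sizes from the listing above. The helper names `pct_larger` and `pct_saved` are made up for this example.

```python
def pct_larger(size: int, baseline: int) -> float:
    """How much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100.0


def pct_saved(size: int, baseline: int) -> float:
    """How much smaller `size` is than `baseline`, in percent."""
    return (baseline - size) / baseline * 100.0


# Sizes in bytes, copied from the listing above.
fep_exe = 19_577_576          # FEPInstall.exe
fep_exe_srep = 19_576_110     # FEPInstall.exe.srep
fep_concat_srep = 27_347_677  # FEP_concat.srep
fep_unpacked = 18_574_538     # FEPInstall.7z_store.srep.7z_max

# Overhead of adding the CabArc output to the original before SREP.
# 0% would mean SREP found the whole recompressed file inside the original.
print(f"concat overhead: {pct_larger(fep_concat_srep, fep_exe_srep):.1f}%")

# Saving from compressing the decompressed data instead of the EXE.
print(f"saving on decompressed data: {pct_saved(fep_unpacked, fep_exe):.1f}%")
```

The first figure is the one that matters for reconstruction: the closer it gets to 0%, the less extra data has to be stored to get back the original stream.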
But here's a better one, a CAB file from inside FEPInstall.exe:
Code:
epploc.cab 31.397 bytes
epploc\*.* 126.656 bytes // this time we can use CabArc.exe for this step
epploc.7z_store.7z_normal 22.772 bytes
epploc.7z_store.paq8o8_3 19.799 bytes
epploc.7z_store.paq8o8_4 18.851 bytes // and we can save up to 40% when using the decompressed data
epploc.cabarc 24.445 bytes // "CabArc -r -m LZX:21 n epploc.cabarc epploc\*.*"
epploc.cabarc.srep 24.489 bytes
epploc.cab.srep 31.441 bytes
epploc_concat 55.842 bytes
epploc_concat.srep 31.494 bytes // Only about 0,2% larger than epploc.cab.srep, so we can reconstruct almost the whole original CAB
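To illustrate what the concatenation test measures, here is a rough Python sketch. It is not SREP and does not reproduce its algorithm: `matched_fraction` and its `block` parameter are made up for this example. It simply counts how much of the original file is covered by byte runs that also occur somewhere in the recompressed file.

```python
def matched_fraction(original: bytes, candidate: bytes, block: int = 32) -> float:
    """Fraction of `original` covered by `block`-byte runs found in `candidate`."""
    if len(original) < block:
        return 0.0
    # Index every block-sized window of the candidate, at any byte offset.
    windows = {candidate[i:i + block]
               for i in range(len(candidate) - block + 1)}
    covered = 0
    pos = 0
    while pos + block <= len(original):
        if original[pos:pos + block] in windows:
            # Whole block also exists in the candidate: count it and skip ahead.
            covered += block
            pos += block
        else:
            # No match here, try the next byte offset.
            pos += 1
    return covered / len(original)
```

Usage would be something like `matched_fraction(open("epploc.cab", "rb").read(), open("epploc.cabarc", "rb").read())`. Since it keeps every window of the candidate in memory, it is only suitable for small files like epploc.cab, and a trailing piece shorter than `block` is never counted, so even identical files can score slightly below 1.0.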
Of course this is very far from being a complete LZX recompressor, but it shows the potential and possibilities. Theoretically, it should also be possible to process CHM files this way, but I had no success with those so far.
The real challenge is to put all this together and to handle the license situation. I'm quite sure CabArc can't be used without Microsoft's permission, and I don't know of any free LZX compression tools or open source libraries (there is a specification of CAB/LZX that could be useful, though). Also, without CabArc, reproducing its original behaviour (and thus the original stream) is very uncertain.