Unfortunately it doesn't work like that, and we're about 50 years past the stage of Huffman coding.
Compression-wise it would _always_ be better to just use arithmetic coding instead of Huffman.
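To illustrate the Huffman point, here is a minimal sketch that builds Huffman code lengths for a skewed 4-symbol distribution and compares the average code length with the Shannon entropy, which is the bound an arithmetic coder can get close to. The probabilities and the helper names (huffman_lengths, entropy_bits) are made up for this example.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return Huffman code lengths (in bits) for the given symbol probabilities."""
    if len(probs) == 1:
        return [1]
    # Heap items: (probability, unique tie-breaker, symbols in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical skewed distribution, chosen only for illustration.
probs = [0.9, 0.05, 0.03, 0.02]
lengths = huffman_lengths(probs)            # [1, 2, 3, 3]
huffman_avg = sum(p * l for p, l in zip(probs, lengths))
entropy = entropy_bits(probs)

print(f"Huffman: {huffman_avg:.3f} bits/symbol, entropy: {entropy:.3f} bits/symbol")
```

Huffman can't spend less than 1 bit on a symbol, so here it averages 1.15 bits/symbol while the entropy is about 0.62 bits/symbol. The gap only closes when all probabilities are powers of 1/2.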
And reducing the size of the input data won't necessarily allow for better compression either.
Consider this example:
Code:
768,771 BOOK1
261,220 BOOK1.7z
216,021 BOOK1.pmd
184,121 BOOK1.paq8px197
643,220 BOOK1_no_space
252,976 BOOK1_no_space.7z
219,554 BOOK1_no_space.pmd
199,364 BOOK1_no_space.paq8px197
BOOK1 is a plaintext English book. I removed the spaces from it and compressed both the original and the no-space version with several compressors.
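The size changes are easier to see as differences. This is just arithmetic on the numbers from the listing above; the dictionary layout and the names are only for this sketch.

```python
# (original size, no-space size) in bytes, copied from the listing
sizes = {
    "uncompressed": (768_771, 643_220),
    "7z": (261_220, 252_976),
    "pmd": (216_021, 219_554),
    "paq8px197": (184_121, 199_364),
}

# Absolute and relative change caused by removing the spaces
deltas = {}
for name, (orig, no_space) in sizes.items():
    diff = no_space - orig
    deltas[name] = diff
    print(f"{name:13s} {diff:+9,d} bytes ({100.0 * diff / orig:+.2f}%)")
```

So the input shrinks by 125,551 bytes, the .7z output shrinks by 8,244, the .pmd output grows by 3,533, and the paq8px197 output grows by 15,243.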
.7z uses LZ77 compression (LZMA2 to be specific), which has no real context model, so it behaves as expected: fewer bytes on input, fewer on output.
But with the simple PPM statistical model in .pmd the output grows instead (+3,533 bytes), and with the complex model in paq8 it grows even more (+15,243 bytes).
This happens because the spaces were actually useful as context, and removing them added noise to the information contained in the file.
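For anyone who wants to try the same experiment without ppmd or paq8, here is a rough sketch: an adaptive order-2 byte model with add-one smoothing, which reports the ideal code length an arithmetic coder would get from its predictions. It is only a toy stand-in, far weaker than the real models, and the function names (adaptive_cost_bits, compare_no_space) are hypothetical.

```python
import math
from collections import defaultdict

def adaptive_cost_bits(data: bytes, order: int = 2) -> float:
    """Ideal code length in bits under an adaptive order-`order` byte model.

    Each byte is predicted from the previous `order` bytes using only the
    counts seen so far, with add-one smoothing over the 256 byte values.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]
        p = (counts[ctx][sym] + 1) / (totals[ctx] + 256)
        bits -= math.log2(p)
        # Update the statistics only after coding the symbol (adaptive model).
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits

def compare_no_space(text: bytes, order: int = 2):
    """Return (bytes, estimated compressed bytes) with and without spaces."""
    stripped = text.replace(b" ", b"")
    return (
        (len(text), adaptive_cost_bits(text, order) / 8),
        (len(stripped), adaptive_cost_bits(stripped, order) / 8),
    )
```

Run compare_no_space on the contents of a large text file and compare the two estimates. What happens with such a small model on a given file is something to measure, not something this sketch guarantees.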
I'm quite sure that something similar would happen if we naively tried to replace word patterns with some codes
(or even just removed the patterns from the main file and put the necessary information into some secondary file).
Removing words from text breaks all kinds of structures, so a statistical model would mispredict the "future" data at the point of replacement,
and would then apply incorrect statistics at all following occurrences of the "breakpoint contexts".
Beneficial replacements are actually possible, but you have to be careful to preserve the structure of the data
(at least the types of structure visible to the compressor's model).
https://encode.su/threads/3072-contr...p-tools-for-HP
I suppose that making a model which takes the "breakpoints" into account and works around them
should also be possible, but it's like deciding to add "breaking encryption" when trying to write a compressor:
sure, this feature can be useful in some cases (including better compression of encrypted files),
but normally it is simply not relevant to the main purpose, and it makes the programming 10x harder.