Rarkyan, let me summarize your algorithm. You look for very common duplicated bytes like 00 in your file. You then take these out of the file, making it shorter, and you remember a simple bitmask to tell you where those values were taken from. To re-expand, you use the bitmask to put those bytes back where they belong.
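In rough Python, the scheme you're describing might look something like this (the function names and the exact bitmask layout are my own guesses, not your code):

    def compress(data: bytes, common: int = 0x00):
        """Strip out every occurrence of `common`, remembering where it was."""
        mask = bytearray((len(data) + 7) // 8)      # one bit per original byte
        kept = bytearray()
        for i, b in enumerate(data):
            if b == common:
                mask[i // 8] |= 1 << (i % 8)        # bit set = "common byte was here"
            else:
                kept.append(b)
        return len(data), bytes(mask), bytes(kept)

    def expand(original_len: int, mask: bytes, kept: bytes, common: int = 0x00) -> bytes:
        """Rebuild the original: re-insert `common` wherever the bitmask says."""
        out, rest = bytearray(), iter(kept)
        for i in range(original_len):
            if mask[i // 8] & (1 << (i % 8)):
                out.append(common)
            else:
                out.append(next(rest))
        return bytes(out)

    data = b"\x00abc\x00\x00def\x00"
    assert expand(*compress(data)) == data          # round-trips correctly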
This is a workable algorithm, and it can indeed give compression for many files. For example, if half the bytes of a file are 00, you'd make a new file that's only half the size. But you also need to remember that bitmap, which is 1/8 of the file size (one bit per byte).
If more than 1/8 (12.5%) of the file's bytes are that one common byte, you end up with a net savings of space... you have compression! Success!
The flaw of this algorithm is that it can't compress files which have no common byte that happens more than 12.5% of the time. The 12.5% comes from the size of the bitmap compared to the size of the file itself.
For example, say you had an 800-byte file, and byte 00 happened a lot... 10% of the time. So you make your bitmask of 800 bits, which is 100 bytes. You can remove all the 00 bytes from the original file, so it's now shorter: 720 bytes.
But look... now the 720-byte file still needs the extra 100-byte bitmap... that's 820 bytes, bigger than the file you started with. So there's no space savings for that example... it gets bigger.
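If you want to sanity-check that arithmetic (and the 12.5% break-even point), a couple of lines will do; frac_common here is just the fraction of bytes equal to the common value:

    def compressed_size(n_bytes, frac_common):
        # bytes left after removing the common ones, plus the 1-bit-per-byte bitmap
        return n_bytes * (1 - frac_common) + n_bytes / 8

    print(compressed_size(800, 0.10))    # 820.0 -> bigger than the original 800
    print(compressed_size(800, 0.125))   # 800.0 -> exactly break-even
    print(compressed_size(800, 0.50))    # 500.0 -> a real saving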
So your algorithm can work for some files. A very biased file with a very common byte can be compressed with a bitmap "hole" map. But not all files can be compressed this way.
Now there are many, many ways to improve this and take advantage of biased byte distributions... when one byte is more likely than another. The first algorithms in compression books deal with efficient ways to handle this problem and can take advantage of even small biases. But none of those methods can compress all files.
Good for you for being interested in compression... it's fascinating. You may be interested in the idea of "entropy", the foundation of Information Theory, which, as a broad definition, quantifies how efficiently these small biases toward "common bytes" can be exploited.
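If you want to experiment, the Shannon entropy of a file's byte frequencies gives the theoretical lower bound (in bits per byte) that those smarter methods approach. A quick sketch, not tied to any particular coder:

    import math
    from collections import Counter

    def entropy_bits_per_byte(data: bytes) -> float:
        # Shannon entropy of the byte distribution: the best average code length
        # (bits per byte) achievable by any coder that treats bytes independently
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    biased = bytes([0] * 900 + list(range(1, 101)))   # 90% zeros
    print(entropy_bits_per_byte(biased))              # roughly 1.1 bits/byte, far below 8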