# Thread: Theory: Huffman overhead modification & special case of "unused code"?

1. ## Theory: Huffman overhead modification & special case of "unused code"?

Hi
One of those: I think I found another coding method while enjoying my vacation.
Just let me know if this method already "has a name", or has some flaw in it.
It's my first post, so please don't kick me in the neck yet.
I'm sorry, I've only been on it since last night, counted in hours, but it seems legit (like usual).
I'm sorry, I could write it 3x longer, but I'm trying to avoid too many questions.
I use 2-bit codes (bit-sets / data-words / view-step) to simplify the examples.
As you'll see, I'm sorry I haven't always made up my mind which terms to use.
Currently I am focusing on almost-uniform (near-random) distributions (e.g. already compressed files).
I'm trying to find and play with non-greedy compression algorithms.
After some puzzling, my method started to look like a special case of Huffman with a few modifications, so I'll describe it starting from Huffman.

Intro:
Regular Huffman adds too much overhead to compress short blocks of data.
The longer the almost-uniform data is, the less probable it is to find any deviation from the average distribution. The goal here is to change the method to allow shorter blocks.
Shorter overhead makes it possible to use a smaller block size, which is good (it increases the probability) for having more blocks where the distribution is non-uniform.
As I understand, to be fair we need 1 bit to indicate that the data is compressed with Huffman coding, instead of hiding the method used in the filename, for example.
With 8 bits of Huffman-tree overhead plus that 1 bit, we'd need to save 9 bits to make it pay off. Using a tree with code lengths 1-2-3-3 on a pretty even distribution (for simplicity), saving 1 bit takes roughly 4 * 2b = 8b of data. So we'd need 8 * (8+1)b = 72b of data to reach CR 100%. (Not the best example, because in practice slightly less data is needed, as the 1st code had a bit higher probability.)

Huffman coding of 2-bit words can produce only 5 types of trees, with code lengths 2 2 2 2, 1 2 3 3, 1 2 2 -, 1 1 - -, 1 - - -, where "-" marks an unused "word", and a "word" is the 2-bit data.

• v0: 2 2 2 2: // 0 overhead, all codes have the same length.
• v1: 1 2 3 3: // 4=2*2b overhead - need to say which codes have len 1b and 2b.
• v2: 1 2 2 -: // 4=2*2b overhead - need to say which codes have len 1b and "-".
• v3: 1 1 - -: // 4=2*2b overhead - need to say which 2 codes have len 1b.
• v4: 1 - - -: // 4=2*2b overhead - need to say which code has len 1b.

To indicate the version we'd need 3 bits, but v4 - where the data would look just like "000000" or "111111" and is thus very rare - can actually also be decoded with tree versions 1, 2 and 3, so we can get rid of it and use 2 bits to indicate the Huffman tree version! Plus the variable extra overhead of the specified data-words: 0, 4, 4, 4b - meaning version 0 has a shorter header/table overhead. If those trees occurred evenly, the average overhead would be (2 + 0 + 3 * (2 + 4)) / 4 = 20 / 4 = 5b. Not sure if this method is widely in use - I didn't find this exact one even in the Canonical Huffman wiki - but that is not my interest here. Indicating all the bit-lengths of the codes, which is pretty optimal already, would cost 4 * 2 = 8b of overhead.
This 5-bit average overhead would guarantee that the code lengths are chosen as optimally (as Huffman can choose them) for all the different data-word probabilities... The above example could give compression already from 8 * (5+1) = 48b of data (the previous block length was +50% of this).
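The header bookkeeping above can be sketched in a few lines. This is my reading of the scheme (the version labels and the 2-bit version ID are taken from the list above; the dict keys are just labels I made up):

```python
# Header overhead per tree version, per my reading of the scheme above:
# a 2-bit version ID plus the listed extra bits for each version.
VERSIONS = {
    "v0 (2 2 2 2)": 0,  # all codes equal length, nothing more to say
    "v1 (1 2 3 3)": 4,  # 2 bits for the 1-bit code + 2 bits for the 2-bit code
    "v2 (1 2 2 -)": 4,  # 2 bits for the 1-bit code + 2 bits for the unused word
    "v3 (1 1 - -)": 4,  # 2 bits each for the two 1-bit codes
}

def header_bits(version: str) -> int:
    """Total header overhead: 2-bit version ID + per-version extras."""
    return 2 + VERSIONS[version]

average = sum(header_bits(v) for v in VERSIONS) / len(VERSIONS)
print(average)  # (2 + 6 + 6 + 6) / 4 = 5.0 bits, matching the text
```

So the 5-bit figure is just the average of one 2-bit header and three 6-bit headers, assuming the four tree shapes occur equally often.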

Special case / modifications to Huf:
Focusing on the special case of data where 1 of the 2^n codes is not used/missing - 1 of 4 in the 2-bit example. There may be some file types with this nature of data. The compression for longer words would be smaller, but I'm looking at patterns at the 2-bit step for now. Huffman would give it the tree "1 2 2 -", which has 2+4b of overhead. If we use the version ID as a 1-bit field - value 1 indicating that the data has THE pattern and was worth compressing, and 0 saying that no compression was done - the overhead goes down to 1+4=5b.
As almost-uniform data would have pretty equal probabilities anyway, it's not really necessary to indicate which exact 1 of the 3 used codes has the highest frequency (and should get the shorter code). Removing this most-popular code from the header, the overhead goes down to 1+2=3b, and the above-mentioned data block can go down to 32 bits.
So we keep only the 1-bit version ID and the "unused code's" 2 bits in the header, while Huffman's 1-bit code length goes automatically (hard-coded, without indicating it) to 1 of the 3 remaining codes. For the "2-bit example", for simplicity and to avoid re-coding all values: the missing code's prefix bit (0 in the case of 01) becomes the 1-bit compressing code, and it goes to the only code that shares this prefix (00 in the case 01 was missing).

• Miss 00 -> magic "0" goes for 01
• Miss 01 -> magic "0" goes for 00
• Miss 10 -> magic "1" goes for 11
• Miss 11 -> magic "1" goes for 10

We could always choose the smallest/first one (00 or 01), but then we'd need to recode so that it stays unique. It may be (depending on the data) a bit more useful to use the 1st one that occurs in the data (hoping that the other codes may have lower probability).
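The four bullet cases above follow one rule - the missing word's first bit becomes the 1-bit code for the other word sharing that prefix. A minimal sketch (function name is mine):

```python
def magic_for_missing(missing: str):
    """Given the unused 2-bit word, return (1-bit magic code, the word it
    replaces): the missing word's first bit becomes the 1-bit code for the
    only other word sharing that prefix."""
    assert missing in ("00", "01", "10", "11")
    prefix = missing[0]
    # The only remaining word with the same first bit as the missing one:
    partner = prefix + ("1" if missing[1] == "0" else "0")
    return prefix, partner

# Reproduces the four bullet cases above:
for m in ("00", "01", "10", "11"):
    print(m, "->", magic_for_missing(m))
```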

• Almost-random data 32 bits: 00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01
• Missing "11"
• "10" will be "1"
• Header: <1 - compressing> <11 - unused>
• Compressed data: <00 1 01 1 00 01 00 00 1 1 01 00 01 1 01 01> // saved 5x 10->1
• Compressed all: <1><11> <00 1 01 1 00 01 00 00 1 1 01 00 01 1 01 01>
• CR (3+32-5)/32 = 30/32, CR 93.8%, but the data is rather short!
• Had 16x 2-bit codes. The chosen code appeared 5 times, which is below the average probability - so the example's CR is not too beautiful.
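The whole block scheme (3-bit header, then partner-word substitution) can be sketched as follows; this is my reading of the example above, with a raw fallback for blocks that use all four words:

```python
def compress_block(symbols):
    """Sketch of the 3-bit-header scheme: 1-bit switch + 2-bit unused word,
    then every occurrence of the partner word shrinks to the 1-bit magic
    code. Falls back to the raw block behind a '0' switch bit when no word
    is unused."""
    used = set(symbols)
    missing = [w for w in ("00", "01", "10", "11") if w not in used]
    if not missing:
        return "0" + "".join(symbols)  # raw block, 1-bit overhead
    m = missing[0]
    magic = m[0]                              # missing word's prefix bit
    partner = m[0] + ("1" if m[1] == "0" else "0")
    body = "".join(magic if s == partner else s for s in symbols)
    return "1" + m + body

data = "00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01".split()
out = compress_block(data)
print(len(out))  # 3-bit header + 32 data bits - 5 saved bits = 30 bits
```

Running it on the example block reproduces the 30-bit result (CR 30/32).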

If data has the "unused code" pattern in more than every 15th 32-bit block, it can be compressed. If the Hamming weight were close to 50% but one code were still missing, the compression would be better (let's not get greedy) - but the goal is to find the un-greedy algorithm that works often, with less compression per block.
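As a back-of-envelope check (my own, not from the thread): how often does a uniformly random block of 16 two-bit words actually leave at least one of the 4 words unused? Inclusion-exclusion gives roughly 4%, i.e. about 1 block in 25 - rarer than the "every 15th block" break-even above, which supports the pessimism about random data:

```python
from math import comb

# P(at least one of the 4 words is unused in n uniform 2-bit symbols),
# by inclusion-exclusion over which words are missing.
n = 16
p = sum((-1) ** (k + 1) * comb(4, k) * ((4 - k) / 4) ** n for k in range(1, 4))
print(round(p, 4))  # ~0.04, i.e. roughly 1 block in 25
```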

Further optimization of overhead for this unused-code case:
The current header has 1b for the compression switch (aka version) and 2b for the unused code. The unused code can itself be detected from the data, if the compressed data starts out uncompressed. Using the same string of 2-bit words "00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01", we can see by the 6th bit already that 00, 10 and 01 are used; since the restriction says 1 code is not in use, the unused code must be 11.
This uses an overhead of just 1 bit per block, and since we save every 6th bit on average (1 of 3 roughly equiprobable codes shrinks from 2 bits to 1), the block must be at least about 6 bits long to break even on otherwise near-uniform data.

• Almost-random data 32 bits: 00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01
• Missing "11", detected after reading bit 5 (6th)
• "10" will be "1"
• Compressed data: <00 10 01 1_ 00 01 00 00 1_ 1_ 01 00 01 1_ 01 01> // saved 4x 10->1
• Compressed all: <1> <00 10 01 1_ 00 01 00 00 1_ 1_ 01 00 01 1_ 01 01>
• CR (1+32-4)/32 = 29/32, CR 90.6%
• Had 16x 2-bit codes. The chosen code appeared 5 times in total, but only 4 of them after detection - below the average probability, so the example's CR is not too beautiful.
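The 1-bit-header variant above can be sketched too: emit words raw until all three used words have been seen, deduce the unused one, then switch to the magic code. This assumes (as the scheme does) that exactly one of the four words never occurs in the block:

```python
def compress_adaptive(symbols):
    """Sketch of the 1-bit-header variant: words go out raw until all three
    used words have appeared, then the unused word is deduced and its
    partner word shrinks to the 1-bit magic code from there on."""
    seen, out, detected = set(), ["1"], None  # leading '1' = compressed flag
    magic = partner = None
    for s in symbols:
        if detected is None:
            out.append(s)                     # still raw: detection phase
            seen.add(s)
            if len(seen) == 3:
                detected = ({"00", "01", "10", "11"} - seen).pop()
                magic = detected[0]
                partner = magic + ("1" if detected[1] == "0" else "0")
        else:
            out.append(magic if s == partner else s)
    return "".join(out)

data = "00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01".split()
print(len(compress_adaptive(data)))  # 1 + 32 - 4 = 29 bits
```

On the example block, detection completes after 3 symbols (6 bits), so only the 4 later occurrences of "10" get compressed - reproducing the 29-bit result.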

Even better compression, and it can be used with much shorter blocks.
This is actually basically what I came up with before I found the relation to Huffman... except the unused code was still used as a code for longer bit-sets (3 vs 2 bits).

Comparison to Huffman:

• Almost-random data 32 bits: 00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01
• 00 - 5 -> 10
• 01 - 6 -> 0
• 10 - 5 -> 11
• 11 - 0 -> __
• Header <2 1 2> - coded bit lengths, 6 bits, depending on the Huffman coding type
• Compressed data: saves 6 * 1b on "01".
• Total: 6 + 32 - 6 = 32.
• CR: 32/32 = 100%
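The Huffman tally above can be checked in a couple of lines (my own arithmetic check of the bullet list, nothing more):

```python
# Cost of plain order-0 Huffman on the same block, as tallied above.
counts = {"00": 5, "01": 6, "10": 5, "11": 0}
code_len = {"00": 2, "01": 1, "10": 2}  # 11 is unused and gets no code
header = 3 * 2                          # three 2-bit code lengths
body = sum(counts[w] * code_len[w] for w in code_len)
print(header + body)  # 6 + 26 = 32 bits -> CR 100%, no gain
```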

What about "coding up" / extending to longer codes via the suffix, instead of shorter prefix coding?
My first thoughts - where I started from.
If one of the 2-bit codes is not in use, we can use it for coding/compressing some 3-bit set down to that 2-bit code. This kind of data doesn't have a uniform structure: for example, if the code "00" is missing, then the 3-bit sets "011", "101" and "111" occur twice as often (2/9 vs 1/9) as the other three (not using overlapped counting; 2-bit step). There's a simple pattern: if the unused code starts with 0 (00 or 01), then 101 and 111 are among the 2/9-probability sets, and if it starts with 1, then 000 and 010 are the good codes to take.
| missing code | 00 | 01 | 10 | 11 |
|---|---|---|---|---|
| 2/9 | 011 | 001 | 000 | 000 |
| 2/9 | 101 | 101 | 010 | 010 |
| 2/9 | 111 | 111 | 110 | 100 |

So the unused 2b code should go to one of those 3 sets with 2/9 probability. Let's just set it to the 1st one.
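The 2/9 figures can be verified by brute force: enumerate all pairs of the three used words (9 equally likely pairs, under the uniform assumption) and look at the 3-bit prefix formed by the first word plus the first bit of the next. A sketch (function name is mine):

```python
from itertools import product
from collections import Counter

def prefix3_probs(missing):
    """Enumerate all pairs of the three used 2-bit words (uniform, 9 pairs)
    and tally the 3-bit prefix: first word + first bit of the next word."""
    used = [w for w in ("00", "01", "10", "11") if w != missing]
    c = Counter(a + b[0] for a, b in product(used, repeat=2))
    return {p: n / 9 for p, n in c.items()}

# E.g. with "00" missing, exactly 011/101/111 come out at probability 2/9:
print(sorted(p for p, pr in prefix3_probs("00").items() if pr == 2 / 9))
```

Running it for each missing word reproduces every column of the table above.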

• Almost-random data 32 bits: 00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01
• Missing "11", detected after reading bit 5 (6th)
• "000" will be "11"
• Compressed data: <00 10 01 10 11 _1 11 _0 10 10 01 00 01 10 01 01> // saved 2x 000->10
• CR (1+32-2)/32 = 31/32, CR 96.9%
• Had 10x 3-bit codes + 2 bits. The chosen code appeared 2 times out of 10, which is below the 2/9 probability - so the example's CR is not too beautiful.

Another idea: if the probabilities (and the sequence) say so, we can adapt by switching the meaning of the "unused code" - not compressing there, but writing the switch out as an event. Real probabilities in this data (after detecting the missing 2b word) - 00 10 01 10 00 01 00 00 10 10 01 00 01 10 01 01:

• 000 2x, 001 1x, 010 3x, 011 1x, 100 3x, 101 2x

• If switching to 010, by not coding the first 000 to 11, then 11 starts meaning 010 (the next code)...
• 00 10 01 10 00 0(SW) 1 00 00 10 10 01 00 01 10 01 01 // not switching also ruins 010, which could almost have been coded at bit 10 (the 11th)
• 00 10 01 10 00 0 1 00 00 10 10 11 _0 01 10 11 _1 // save 2 bits with 1 bit of overhead.
• CR (1+32-2)/32 = 31/32, CR 96.9%
Theoretically this way we could replace the unused code by a longer code.

• But the model must be hard-coded this way; otherwise we'd need to add overhead to choose which longer code to use.
• And this pretty much resembles the regular "replace a longer code with a free short code".
• And the longer the code, the less probably it appears, so the modified Huffman seems to make more sense.
• ...so I postponed this one and continued with HuffMod.

Questions (for you and myself):

• Is this method already described? (I've read quite a bit in the last months, never seen it yet.)
• How often is this "unused code case" in random file?
• What data types have this pattern?
• What are the usages?
• How about applying similar optimizations to other patterns (of short blocks of near-uniform data)?
• Program it and try with longer codes than 2-3b...
• What if more than 1 code is free (happens with longer bit-sets and smaller blocks)?
• Should I stop?
• If described method didn't have a name, is Slyy's Unused Code Kompressor - "SUCK" a good name?

In the next 2 days I guess I'll find some time to implement it in C++ and test it. Curious how much it would increase an already zipped file, even when carefully selecting block sizes that give more patterned blocks, and even after implementing variable-/fixed-size blocks.

Any kind of feedback is appreciated! (Critical preferred.)
Thanks for your time again! (Didn't mean to write that long - I have to learn to compress my text losslessly.)

2. What about Tunstall Codes?

3. Originally Posted by Cyan
Can you point to the similarities? (If you mean it's the same.)
I'm kind of waiting to see if someone has also focused on the "unused codes" - and removed Huffman's overhead?
It's a really specific data structure that this model applies to, so I should probably extend it to allow other types of structures and increase the overhead by a bit or 2.

4. > Is this method already described? (I've read some for now in last months, never seen yet)

Order 0 Huffman coding over a 4 symbol alphabet.

> How often is this "unused code case" in random file?

Not enough to work. You are aware that you can't compress random data, right?

> What data types have this pattern?

Most non-random data. But we already have better ways to compress it, such as using higher order models.

5. Originally Posted by slyy2048
Huffman coding of 2-bit words can produce only 5 types of trees, with code lengths 2 2 2 2, 1 2 3 3, 1 2 2 -, 1 1 - -, 1 - - -, where "-" marks an unused "word", and a "word" is the 2-bit data.

• v0: 2 2 2 2: // 0 overhead, all codes have the same length.
• v1: 1 2 3 3: // 4=2*2b overhead - need to say which codes have len 1b and 2b.
• v2: 1 2 2 -: // 4=2*2b overhead - need to say which codes have len 1b and "-".
• v3: 1 1 - -: // 4=2*2b overhead - need to say which 2 codes have len 1b.
• v4: 1 - - -: // 4=2*2b overhead - need to say which code has len 1b.

To indicate the version we'd need 3 bits, but v4 - where the data would look just like "000000" or "111111" and is thus very rare - can actually also be decoded with tree versions 1, 2 and 3, so we can get rid of it and use 2 bits to indicate the Huffman tree version! Plus the variable extra overhead of the specified data-words: 0, 4, 4, 4b - meaning version 0 has a shorter header/table overhead. If those trees occurred evenly, the average overhead would be (2 + 0 + 3 * (2 + 4)) / 4 = 20 / 4 = 5b. Not sure if this method is widely in use - I didn't find this exact one even in the Canonical Huffman wiki - but that is not my interest here. Indicating all the bit-lengths of the codes, which is pretty optimal already, would cost 4 * 2 = 8b of overhead.
This 5-bit average overhead would guarantee that the code lengths are chosen as optimally (as Huffman can choose them) for all the different data-word probabilities... The above example could give compression already from 8 * (5+1) = 48b of data (the previous block length was +50% of this).

A few corrections. v0, v1 and v2 make sense. v3 would be a 1-bit code and not a 2-bit code, so it's not a point worth discussing. v4 would be a 0-bit code (1 - - - is wrong; you'd use 0 - - -).

Looking at v2 (as an example it could be a tree generated from symbols with frequencies 2, 3, 4), you observe codes of, say, A=0, B=10, C=11. There is no "missing" code, and any overhead is purely down to whether bit lengths 1 and 2 correctly match the observed frequencies of 2/(2+3+4), 3/(2+3+4) and 4/(2+3+4). In this case it's around 2% worse than arithmetic coding or a similar more optimal strategy.
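That figure is easy to check numerically. A sketch (my own check, assigning the 1-bit code to the most frequent symbol, as Huffman would): the average Huffman code length versus the order-0 entropy for frequencies 2, 3, 4.

```python
from math import log2

# Frequencies 2, 3, 4 over a 3-symbol alphabet; Huffman gives lengths
# 2 / 2 / 1, with the 1-bit code on the most frequent symbol.
freqs = {"A": 2, "B": 3, "C": 4}
total = sum(freqs.values())
huff_len = {"A": 2, "B": 2, "C": 1}

avg_huffman = sum(f * huff_len[s] for s, f in freqs.items()) / total
entropy = -sum((f / total) * log2(f / total) for f in freqs.values())
print(round(avg_huffman, 4), round(entropy, 4))    # ~1.5556 vs ~1.5305 bits
print(round(100 * (avg_huffman / entropy - 1), 1)) # ~1.6 % overhead
```

So the whole-bit penalty here comes out at roughly 1.6%, in the same ballpark as the ~2% quoted.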

If you wish to stick purely to Huffman encoding and are suffering inefficiencies due to a small alphabet, then consider joining symbols together before encoding. E.g. a 3-symbol alphabet allows 5-mers (of which there are 3^5 = 243 combinations; under 1 byte) to be merged before entropy encoding. This can help a lot if the frequencies are very skewed and poorly modelled by a whole-bit entropy encoder.
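The symbol-joining idea can be illustrated in a few lines (function names are mine, not a real API): five base-3 symbols packed into one byte value, since 3^5 = 243 <= 256.

```python
def pack5(symbols):
    """Pack five base-3 symbols (values 0..2) into one byte: 3^5 = 243 <= 256."""
    assert len(symbols) == 5 and all(0 <= s <= 2 for s in symbols)
    v = 0
    for s in symbols:
        v = v * 3 + s   # treat the run as a base-3 number
    return v

def unpack5(v):
    """Inverse of pack5: recover the five base-3 digits."""
    out = []
    for _ in range(5):
        v, r = divmod(v, 3)
        out.append(r)
    return out[::-1]

assert unpack5(pack5([0, 2, 1, 1, 2])) == [0, 2, 1, 1, 2]
print(pack5([2, 2, 2, 2, 2]))  # 242, the largest packed value
```

The packed bytes can then be fed to any ordinary byte-oriented entropy coder.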

However you should probably look at fast alternatives instead, such as ANS.

6. There are many, many ways to play the game you're playing. However, in the compression game, when you pick one special case that makes your code look good, there will be more cases that do worse.

If you're looking to compress byte data 2 bits at a time using only a few Huffman-like trees, it's hard to beat bijective adaptive Huffman, since it is fully bijective and there is no wasted space. Your method wastes space, so in general it can't be that good.

I am looking at every possible file. You win on certain special cases, but not on general files.

Example: a case where only two symbols are used. You would produce a table where a 1 is for one symbol and a 0 is used for the other symbol, while the bijective adaptive Huffman would produce a table with a 1 for one symbol and a 01 for the less used symbol; 001 and 000 would be reserved for the never-seen symbols.

So this is a case your method would win. You have more overhead in this case, but after a few symbols are processed your method would give better compression, since it could save up to one bit for every 2 symbols processed.

There are ways to make your method bijective to save more space, but that's something for you to do, not me.

7. Thank you for your warm welcome!
Originally Posted by JamesB
A few corrections....
Thanks! But I have to disagree with the correction. Perhaps I didn't explain it so well: with 2 bits we have a 4-symbol alphabet, and we're looking for the blocks of data which have at least 1 of the 4 letters missing (other blocks will carry a 1-bit indicator that the data is raw/uncompressed). Inside the block, yes, the alphabet is effectively smaller, but we cannot store this property of the block for free (à la switching automagically to 1-bit mode when we have just 2 symbols). So we're probably talking about different things.

v3 (1b, 1b, -, -): 1b is the length of the coded symbol, not the length of the symbol in the full alphabet. With a 2-bit alphabet, when this/the current block has only 2 symbols, we'd still have to store which 2 of the 4 symbols are used or unused, so it can't be 1 bit per symbol.
v4 won't be 0-bit for the same reason.

Yes, if all the blocks had this property of an "unused code" (3-letter, 7-letter etc. alphabet), then this "arithmetic sticking" would be the way to go. I was focusing on the "unused code" pattern of a block, but still on regular almost-uniform data, where perhaps most of the blocks lack this pattern. All this can be coded into a fast algorithm, and the problem is rather that it's too basic: all (or most of) the methods that can be discovered without a computer (on paper) have been discovered already, and unpopular methods are also the bad methods.
Originally Posted by biject.bwts
...Its hard to beat bijective adaptive huffman since it is fully bijective and there is no wasted space. Your method wastes space so in general can't be that good.
...There are ways to make your method bijective to save more space with yours but thats something for you to do not me.
Of course we win on certain cases. It's the reason why compression is possible!
> I am looking at every possible file. However you win on certain special cases, but not for general files.
Then you are looking at random data that cannot be compressed on average - we always have to specialize.
> Example case where only two symbols used. You would produce a table where a 1 is for one symbol
I described the optimization of only the case where 1 code is missing (not 2).
You are talking about my intro to Huffman... "v3: 1 1 - - : // 4=2*2b overhead - need to say which 2 codes have len 1b"
I haven't thought that special case through yet (like I did with the unused code). I would use "0.5 bits in the header" to say that the tree version is such (1 of 4 versions). Because, as I said, I'm focusing on rather uniform data, where the frequencies are even in the file but not in the block, choosing the exact code doesn't really matter, and we still have to spend those 4b of overhead to say which symbols are in this block and which are not; the coding of the used symbols would be hard-coded - no overhead.
If the probabilities vary, I'd use 1 bit to say whether the smaller symbol is used more.
Bijective... I'm still new to this subject; I have to read/play it through before commenting. It wasn't easy to find information about it quickly. But at first glance it seems to be about at least a similar thing. Thanks!

Originally Posted by Matt Mahoney
...Order 0 Huffman coding over a 4 symbol alphabet.
> How often is this "unused code case" in random file?
Not enough to work. You are aware that you can't compress random data, right?

> What data types have this pattern?
Most non-random data. But we already have better ways to compress it, such as using higher order models.
Most non-random data has many patterns.
Most non-random data has this one (like "111111"), yes, but...
Redefining the question: what otherwise-uniform data types have a higher probability of fixed-length blocks having the "unused code" pattern? It doesn't seem to be most of the non-random data. Perhaps something with near-random numbers of base 3. Perhaps any file that reserves 1 "bit-code" as a special character.
If 1 (or a few) of the 2^x values occur in the data with low frequency, they could be "freed up" by coding them as longer values, with this modified Huffman coding applied after that... I think with relatively low frequencies it would still beat Huffman, at least on overhead.

Does order-0 Huffman really use this variable overhead (a shorter header in this unused-code case)? I'm using 1 bit of statistics (description of the Huffman tree) about the data ("is it" or "is it not" the unused-code case) and ignoring the other versions of the Huffman tree. I would apply this model only after analyzing the whole file, when it seems to have a structure suitable for this approach.
By the way, this method can also run many iterations if it pays off, but in an example on already uniform data (like compressed files) I didn't want to show/add the overhead of an iteration count in the header - though it would pay off when compressing text and other sparse data.

And to use all possible trees, we'd need to add 1 more bit to indicate 1 of the 4 possible codings, plus some overhead per case. And of course more bits when looking at words longer than 2 bits, as the number of different tree combinations grows.

RANDOMNESS - This is a more general (off-)topic, but really important, as it is the central part of compression and can't be avoided. In these 2.5 months I've understood that...
We cannot compress ALL random data! Because of the counting argument. Even though it still puzzles many people.
I was lazy and hoping for an answer like: "This pattern happens 5% of the time in random data and it compresses 4%, so over all random data it just expands the file by 1%." I'm not too strong in maths/statistics; I will calculate it later.
ALL THE RANDOM DATA: I like to think of it as all possible combinations of the data, while most of "our" information is a small subset of ALL data.
We use different compressors for different (subsets of ALL) data.
If data is long enough, it is impossible to prove in our limited time that the data is random. (Long enough being something like <1KB for our computers?)
I think there exists a set of most-random data (data that is the least compressible) for a given length - probably many combinations that give the same worst ideal compression.
If the length is 1b, both 0 and 1 are equally the most random combinations (maximum randomness).
The more random the data is, the longer the decompressor/parser/decoder/viewer code has to be.
Because the decompressor takes space, there exists data that is not maximally random but still is incompressible.
So there is a point on the randomness scale below which data becomes compressible (the decompressor becomes smaller than the saving from compression).

About a month ago I was trying to make clear for myself what "the data" is...
Data, or decompressor-program-code-data, is basically the same data - it's all data. Raw uncompressed data is basically compressed with the simplest compressor - its algorithm says to view the data from the start, sequentially, bit by bit (our viewers do it automatically, but they also use some "decompression" code).
It is easy to design a language that takes "data words" as the code words of instructions/commands and generates even longer or irreversible data when running this code (kind of "running the data"), while it is rather complex to design the language so that "running the data" generates shorter and still reversible data.
For example:
The simplest language has the instruction set: 0-Output0, 1-Output1. Let's say this is language version zero. To use this language, we have to write the id of the language at the start of the file - so the file actually becomes longer than the data. This is basically what our file systems do for us (they say: "there is data in memory that can be read without any processing")! It's impossible to read the data with 0 bits of program.
In short: a code can be a symbol of data (description/information about the object) or an instruction (how to generate the information about the object).
File name extensions are used to indicate this "info or instruction" nature of the data (*.exe - data should be run, not read directly). Even processors differentiate between Instruction and Data caches - but it's all just bits, one for executing, the other for direct reading (OT in OT: it would be nice if the programmer could divide the size of the cache; IIRC CUDA supports it). The point is: memory is memory, data is data (description or algorithm).

If it is not proven that the nature of specific data is more random than the "compressibility point on the randomness scale", it may still be compressible.
Finding the algorithm for long data is too complex (not possible) (per Kolmogorov).
I didn't work through all of his work, but I guess finding the "compressibility point on the randomness scale" cannot be calculated either - only approximated/estimated. Perhaps "the C point" is one description of the term "entropy", but it should be on an adjusted percentage scale, where 00000000000000 has 0% randomness and 1011010011011 has 100% randomness.

A 1KB file has 2^8192 combinations. Humankind hasn't even generated all the combinations. Even RSA usually relies on smaller numbers. We're talking about enormous complexity - so much for proving randomness.

Mostly we only care for the decompressor to be short on the x86 architecture, but I think it would be fairer if we counted the transistors needed to decompress the file. RISC vs CISC, or as an extreme example on-chip MPEG: you just need a few instructions in the decoder to decode the data. So even when x86 code uses the same number of bits for coding the ADD and MUL instructions, MUL is actually a bigger command; if we measure transistors used to decompress, then the decompressor with the ADD command is smaller! In the same way, a decompressor that uses less memory is actually smaller, although the x86 machine code may be of equal length. To measure/define randomness, we cannot ignore the "transistor factor". For example: I like to think that "3 9 27 63" is more random than "36 45 54 63", because addition is so much cheaper that even with the greater input width ("2b x 2b => 6b" vs "6b + 4b => 6b"), which multiplies the transistor count several times, it uses fewer transistors than the multiplication pattern. It's probably the same for neurons. It may (or may not) be personal, but I also think the addition pattern is easier to detect: "59 62 65" vs "9 27 63". The same way, it should be cheaper to do 8-bit calculations on a 4-bit CPU.
For 1 bit it's simple: 0 and 1 are both the most random and the least random at the same time. But with 2 bits, 00 and 11 are less random than 01 and 10 (not sure, but I guess switching the state of a bit would need more transistors too...).
If we are talking about good algorithms, we have to take this somewhat into account. So far decompressor programs are rather short compared to the data size, especially for longer files, so the size of the decompressor is not so important (yet). [Has anyone tried to replace some code of PAQ with HW MPEG for the challenges? Technically it should make the decoder program shorter. Idk if it can be tuned to work losslessly, and perhaps its libraries are actually not optimized for size etc...]
I'm a bit afraid to continue my thoughts... Compression ratio, compression time, compression energy (kJ or kWh on your electric power bill, lifetime of battery) - it's all about efficiency... Entropy as it is in the physics of compressed molecules and its relation to data entropy... To measure the absolute efficiency of a HW/SW data compressor, or the entropy of data, we should put a computer in a closed space and measure how much the temperature and pressure (entropy) change in it while decompressing a file. The cooler the better.

EDIT: I think I was mixing up randomness and data complexity - will fix it later. Trying to think out loud...
A long sequence of random (like dice) data can contain all complexities, while most of the subsequences are complex. Only the less complex sequences of random data are compressible. Complex data is not compressible (if it is compressible by 1%, it is 99% complex). For example: the digits of the number PI are not complex at all, because they have a known short description. The task of finding the shortest description still has high complexity; we don't know which long numbers are actually complex and which are not (Kolmogorov?). Complexity of data is tightly related to compressibility (if we compare data of the same fixed length). Short numbers are not compressible, but are not complex either (it takes just a few transistors to generate the data) - something like this: compressibility x data_length = complexity (CR 50% at 20 uncompressed bits (to 10b) has similar complexity to 30b with CR 33% (to 10b)). Complexity of data is similar to entropy, but short data has rather high entropy even though it's not complex? Subsequences of random data have all combinations of complexities with a "random distribution". A random distribution has more complex data than simple data, but the average complexity of random data is bounded by a constant (close to non-compressibility). A single symbol cannot be random; only a sequence of symbols can be random (on average). If data has only complex subsequences, then it's not really random, but it is still not compressible. Entropy is the result of ideal compression. Compression is about finding the MDL - which is basically the same as determining the data's randomness, complexity and entropy (which can only be estimated for longer data, due to the complexity of the search). What about information? Entropy is similar to information's state of density.
If we want to send the information about the new states of 10 switches to be switched ON, we actually have less information in the message (entropy) "ALL ON" or "1111111111", and the "switch box" also loses information volume (but has higher information density and weight??) after this switching (in physics it would be freezing), but it kind of has more "informational energy", meaning that any change in 11111 would cause the information to expand. Lossy compression is kind of like taking the heat out while compressing: if you also transfer or store the lost "heat", it is possible to recover the first state. If we don't store it, the extra "heat" (info) dissipates - the heat goes to some other object while the information is gone forever. I can't really believe the information-at-the-black-hole-horizon theory. Perhaps the heat radiation that moves towards the void at the speed of light is also considered gone forever (because it's impossible to detect its existence after that).

There are 4 types of incompressible data:
* too-random data
* short data (maybe considered a case of random data)
* data from the top compression-challenge winners (near-random, especially when the compressor had a bug, otherwise uniform)
* encrypted data (near-random, especially when the password is forgotten)

I'm not too good at expressing my thoughts about these things "at the pub", so I'm hoping to get some feedback here, to correct my thoughts and go on with the master plan - to come out with my own codec too.

All corrections are welcome! (Sorry, I'm from the North, and welcomes don't always mean warm.)

8. I get jist of what your doing. I don't think its a good idea but so what!

V0 just add a "1" in front of block. Used when no compression occurs with other V's save space. increases length by 1 bit

V1 add "0" in front then 2 bits for the 1 bit code and then 2 bits for the one that is 2 bits long so increases length by 5 bits. This can be used for V4 and when you code the extra 2 bits you can use them to shorten the compressed file more. I leave it to you to see how to save more for the V4 case.

V2 or V3 add "0" in front then add 2 bits for the symbol that goes to 1 bit then repeat the same 2 bits so it can not be a V1 or V3

Special case: the first symbol to be coded for V3 is actually the higher one, so if the symbol in the header is 00 then it cannot be case V3; so map all "0" to 01 for unused, "10" to 10 and "11" to 11. This increases the data, so only 6 or 7 bits are used for this V2 case.

Special case: symbol in the header is 01. This means only 00 is allowed for V3, so repeat the 01; if 00, also one bit. And if V2, code the 2 bits. This means 2 extra bits for these cases, so a total of 7 bits added to the file.

Special case: symbol in the header is 10. If V3, repeat the "10", then add "0" or "1" depending on whether 00 or 01 is the other 1-bit symbol; this is 8 bits of length added to the data block. For V2, just put in the extra two bits, for 7 bits total.

Last case: 11 repeated. Note 3 cases are left for V3 and 3 left for V2, so just assign bits to account for each case:
case 1 10
case 2 01
case 3 110
case 4 111
case 5 001
case 6 000
This makes either 7 or 8 bits total added to the block for the last case.

This is sort of what you are doing, though I think it's in general a poor idea.
So you either add 1 bit of length to each block, or you need to save 6, 7, 8 or 9
bits, depending on the V, to create any sort of compression.

9. slyy2048, instead of writing such long posts describing your ideas, why don't you spend that time writing the code and testing them? I already said you can't compress random data, but maybe you can find out for yourself.

10. .. or invent EmDrive, in the worst case

11. I guess the web-session timeout here is rather short. I took a break, continued writing, started inserting a picture, and it all hung; I lost the text. I'll keep it shorter now....
While also having some problems with the Imigrasi of Indonesia, I don't have as much time as before...

To biject.bwts:
I played through your proposals... I even tried to solve the "game" ("I leave it to you to see how to save more for the V4 case."). I believe it can be done, but it's not the main question at the moment.
I agree with your conclusion. But the text (from the "special case" paragraphs onward) contains more confusions than typos (perhaps one is the reason for the other).

Originally Posted by Bulat Ziganshin
.. or invent EmDrive, in the worst case
Not my thing.
Originally Posted by Matt Mahoney
slyy2048, instead of writing such long posts describing your ideas, why don't you spend that time writing the code and testing them? I already said you can't compress random data, but maybe you can find out for yourself.
If data is random, it cannot be compressed.
But what if the data is uniform, like a zpaq archive? We know it can be compressed (decompressor + another compressor is one example).
ZPAQ is very uniform

TXT

RAR.SFX.EXE

RAR is more uniform than the more tightly compressed PMD.

(We could discuss the method some other time; generally it is a bit-wise analysis showing occurrences of bit-sets viewed with simple rules (not only sequentially with a 1-bit step))

ZPAQ, like other archives, has structure.
I did some tests last night and was surprised how precise a structure it has: the structure of exact non-compressibility... (not a random non-compressibility)

Results:
EDIT: Deleted.... Had a bug in analyzer
Fixed after refactoring:

Start of enwik8.zpaq. Checking 8192 blocks of length N (N = 2...18 B, i.e. 16...144 b). Only 2-bit symbols.
1st row of the screenshot: 8192 * 2 B.
(2nd row of the screenshot: offset 1 bit, starting from the 2nd bit...) Last row: 18 B * 8192 = 147456 B = 1.18 Mb.
Trees

16-bit blocks, viewing 2-bit at a time:
V0: 2845 | (not printing for shorter output)
V1: 2208 {8193, 5055, 2208, 2208} |
V2: 2932 {11648, 7389, 4419, 0} |
V3: 197 {1034, 542, 0, 0} |
V4: 10 {80, 0, 0, 0}
For example, v4 occurred 10 times. The total count of p0's is 80: a 16 b block has 8 2-bit elements, and 8 * 10 = 80.
All this 128 Kb can be shortened by 17 Kb without overhead, which means roughly every 7.5th bit could be spent on overhead. A 16-bit block can carry only ~2 b of overhead (OH) on average, otherwise it won't compress. V0 could have 1 b OH, but the remaining 5347 trees must then have less than 14.4 Kb total OH -> 2.7 b/block (which must include the tree version).
Anyway, the "unused code" method would add 5260 b of OH for the other trees and would need to compress that out of the V2 blocks' 2932 * 16 b = 46.9 Kb, meaning a CR lower than 88% is needed. V2 can save 11648 b with 1-2-2-_ coding if P0 is known, but we cannot even store P0 in the header, as the OH would become 8796 b and the saving only 11648 - 8796 = 2852 b, which is less than the 5260 b OH of the other blocks. So, hoping for uniformity, we could just pick one fixed code, which depends on chance and packs less on average: e.g. codeword "0" would always stand for "00" (whether that happens to be p0, p1, p2 or the missing p3). That gives (11648 + 7389 + 4419 + 0) / 4 = 5864 b (- 5260 = 604 b) of average saving, but we would still need to tell what the 10 and 11 stand for, and 604 b over 2932 blocks = 0.206 b/block (we can't afford even one extra bit, yet we'd need at least 1.5 b to tell which of the 3 symbols is missing)...
This file is one of worst examples to compress anyway :P

Doing further analysis of the analysis...

12. Tested the minimum-overhead method... First 8192 blocks...
On enwik.rar.exe
16b block gave CR 128.3%
144b block gave CR 99.0%
256b block gave CR 99.3%

On enwik8.ZPAQ
16b block gave CR 131.8%
144b block gave CR 100.8%
256b block gave CR 100.4%

Used Excel to post-analyse the results from C++ (manual Huffman for the tree-version codewords).
Looks like this...

This method doesn't use any scanning (adaptation?) to reduce the overhead even more.

EDIT: now tested with the raw enwik8 text file, and the results were even worse. Perhaps the reason is that it isn't a 2-bit file either.
I should extend the algo to be able to test with longer word lengths and try it with 8 bits. But then it will be pretty regular Huffman already.
