# Thread: Average / Arithmetic Mean vs. Harmonic Mean of Compression Ratios

1. ## Average / Arithmetic Mean vs. Harmonic Mean of Compression Ratios

Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)

Are compression ratios somehow an exception to this rule?

I seem to recall reading someplace that they're not, and harmonic means of compression ratios are to be preferred, but I can't remember where. (I noticed that on the Canterbury Corpus website, they report averages and "weighted averages," but the weighted averages are not the harmonic means.)

Or is that only true if you define compression ratio the other way (uncompressed size / compressed size), so that compressing a file to half its size would give you a "compression ratio" of 2? (Which has always seemed more intuitive to me; the usual way seems like an incompression ratio.)

I suspect that's right: we're essentially taking the reciprocal before averaging, and could then take the reciprocal of the result to get the compression ratio in the higher-is-better sense, so in effect we are computing the reciprocal of the harmonic mean. But I don't trust my understanding well enough to be sure.
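To make that concrete, here's a quick numerical check of the relationship I have in mind (the file sizes are just made up):

```python
# Made-up file sizes, just to check the relationship numerically.
uncompressed = [1000, 2000, 4000]
compressed   = [400, 1500, 1000]

# The usual definition: compressed / uncompressed, arithmetic mean.
usual = [c / u for c, u in zip(compressed, uncompressed)]
arith_mean_usual = sum(usual) / len(usual)

# The "higher is better" definition: uncompressed / compressed, harmonic mean.
higher_is_better = [u / c for c, u in zip(compressed, uncompressed)]
harm_mean_hib = len(higher_is_better) / sum(1 / r for r in higher_is_better)

# The arithmetic mean of the usual ratios is the reciprocal of the harmonic
# mean of the inverted (higher-is-better) ratios.
print(arith_mean_usual, 1 / harm_mean_hib)
assert abs(arith_mean_usual - 1 / harm_mean_hib) < 1e-12
```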

Any guidance would be appreciated.

2. Defining compression ratio in the usual way (compressed / uncompressed) is needed to make averages somewhat meaningful.

Consider a corpus with one file which is 100MB of zero bytes, and another file which is 100MB of noise. Typically the corpus can be compressed to about 100MB.
The first file can be compressed to nearly 0%, while the other one cannot be compressed at all (compression ratio 100%). The average is 50%, which in this case equals the compression ratio of the entire corpus (because both files have the same uncompressed size).

If you defined compression ratio the other way around, the first file would get a near-infinite "compression ratio", while the noise gets a ratio of 1, so the average would be near infinite rather than 2.
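A rough sketch of that example in Python (using zlib and 10 MB stand-ins instead of 100 MB, so the exact numbers are only approximate):

```python
import os
import zlib

# Stand-ins for the two files (10 MB instead of 100 MB to keep it quick).
zeros = bytes(10_000_000)        # all zero bytes: compresses to almost nothing
noise = os.urandom(10_000_000)   # random bytes: essentially incompressible

# Per-file compression ratios, compressed / uncompressed.
ratios = [len(zlib.compress(data)) / len(data) for data in (zeros, noise)]
print(ratios)                          # roughly [~0.001, ~1.0]
print(sum(ratios) / len(ratios))       # ~0.5, same as the whole-corpus ratio

# With the inverted definition (uncompressed / compressed) the zero file
# dominates completely: its ratio is ~1000, so the average is ~500, not ~2.
inverted = [1 / r for r in ratios]
print(sum(inverted) / len(inverted))
```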

3. ## Thanks:

Paul W. (17th January 2016)

4. Originally Posted by Paul W. Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)
I like the geometric mean of the ratios better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study

There, you can just look at the graph and see that for about 3 % of the data you can expect worse results, even when the average/median/geometric average etc. are a lot better.
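Roughly, the idea of that kind of plot, as I read it, is to sort the per-file size ratios between the two algorithms and plot them against the fraction of files; a sketch with made-up sizes (not the data from that study):

```python
import matplotlib.pyplot as plt

# Hypothetical per-file compressed sizes for two algorithms; a real study
# would have thousands of files, this is just to show the shape of the plot.
sizes_a = [120, 340, 90, 800, 45, 600, 210]   # algorithm A, bytes
sizes_b = [100, 350, 80, 760, 50, 540, 200]   # algorithm B, bytes

# Per-file ratio A/B, sorted; values above 1.0 are files where A does worse.
ratios = sorted(a / b for a, b in zip(sizes_a, sizes_b))
fractions = [i / (len(ratios) - 1) for i in range(len(ratios))]

plt.plot(fractions, ratios)
plt.axhline(1.0, linestyle="--", color="gray")
plt.xlabel("fraction of files")
plt.ylabel("compressed size A / compressed size B")
plt.show()
```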

5. ## Thanks (3):

Cyan (18th January 2016),Paul W. (18th January 2016),Turtle (18th January 2016)

6. Originally Posted by Paul W. Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead.
I'd say it rather depends on what you want to deduce from the result. I'm not sure either average makes much sense. Suppose we want to form an average over a very disjoint data set of sources A, B and C. I would argue that a sensible way of doing that is to consider a joint source D, namely the concatenation of all three sources. The purpose of an average would then be to give a theoretical estimate or bound on the performance on the concatenated source. Neither the harmonic nor the arithmetic mean does this.

If there is no correlation between the sources, a suitable estimate would rather be to divide the sum of the compressed sizes |c(A)|+|c(B)|+|c(C)| by the size of the concatenated file (|A|+|B|+|C|), where c is the encoder. If all source files are of equal size, this is the arithmetic mean of the per-file ratios; otherwise it is a weighted average. To be precise, the weight of source file #i is the size of that file divided by the average file size in the test corpus, or to put it differently: the contribution of a file to the "compression performance" of the total corpus is proportional to its size - which makes quite some sense.
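A small numerical sketch of that equivalence (file sizes made up): the ratio of the concatenated corpus equals the size-weighted average of the per-file ratios.

```python
# Made-up file sizes for sources A, B, C.
uncompressed = [1_000, 50_000, 9_000]   # |A|, |B|, |C|
compressed   = [400, 30_000, 8_500]     # |c(A)|, |c(B)|, |c(C)|

# Ratio of the concatenated corpus: total compressed / total uncompressed.
corpus_ratio = sum(compressed) / sum(uncompressed)

# Size-weighted average of the per-file ratios.
ratios  = [c / u for c, u in zip(compressed, uncompressed)]
weights = [u / sum(uncompressed) for u in uncompressed]
weighted_avg = sum(w * r for w, r in zip(weights, ratios))

print(corpus_ratio, weighted_avg)   # identical (up to rounding)
assert abs(corpus_ratio - weighted_avg) < 1e-12
```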

7. ## Thanks:

Paul W. (18th January 2016)

8. The proper way to report compression ratio is the way others have done on the same benchmark. The most common way is to add up the compressed sizes of all of the files. If you are creating a new benchmark, then you should select files such that this scoring method makes sense.

9. Originally Posted by Jyrki Alakuijala I like the geometric mean of the ratios better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study

There, you can just look at the graph and see that for about 3 % of the data you can expect worse results, even when the average/median/geometric average etc. are a lot better.
I agree that's a very nice way to show compression ratio differences over a large set of files.

However, something I haven't seen a good way to show is how to compare not just size, but space-speed trade-offs over multiple files.

I think my "speedup" graphs are a pretty nice way to show space-speed on *one* file, but on multiple files they become a mess, so they don't show the way performance can vary across different files in a set.

10. If we need another dimension, maybe time can help. Graph animation?

Tags: arithmetic mean, average, compression ratio, harmonic mean