Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)

Are compression ratios somehow an exception to this rule?

I seem to recall reading someplace that they're not, and that harmonic means of compression ratios are to be preferred, but I can't remember where. (I noticed that the Canterbury Corpus website reports averages and "weighted averages," but the weighted averages are not harmonic means.)

Or is that only true if you define compression ratio the other way (uncompressed size / compressed size), so that compressing a file to half its size would give you a "compression ratio" of 2? (That has always seemed more intuitive to me; the usual way seems like an incompression ratio.)

I suspect that's right: we're essentially taking the reciprocal before averaging, and could then take the reciprocal of the result to get the compression ratio in the higher-is-better sense, so in effect we are computing the reciprocal of the harmonic mean. But I don't trust my understanding well enough to be sure.
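The relationship suspected above can be checked numerically. This is only a sketch: the file sizes are made up for illustration, and the variable names are my own, not anything from a real benchmark.

```python
from statistics import fmean, harmonic_mean

# Hypothetical (compressed, uncompressed) sizes in bytes, for illustration only.
sizes = [(50, 100), (20, 200), (90, 120)]

# Usual definition: compressed / uncompressed (lower is better).
lower_is_better = [c / u for c, u in sizes]
# Inverted definition: uncompressed / compressed (higher is better).
higher_is_better = [u / c for c, u in sizes]

arith = fmean(lower_is_better)
harm = harmonic_mean(higher_is_better)

# The arithmetic mean of c/u should equal the reciprocal of the
# harmonic mean of u/c, up to floating-point rounding.
print(arith, 1 / harm)
```

So, at least for unweighted per-file averages, averaging the usual ratio arithmetically and averaging the inverted ratio harmonically describe the same quantity.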

2. Defining compression ratio in the usual way (compressed / uncompressed) is needed to make averages somewhat meaningful.

Consider a corpus with one file which is 100MB of zero bytes, and another file which is 100MB of noise. Typically the corpus can be compressed to about 100MB.
The first file can be compressed to nearly 0%, while the other one cannot be compressed at all (compression ratio 100%). The average is 50%, which in this case matches the compression ratio of the entire corpus (because both files have the same uncompressed size).
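To make the "same uncompressed size" caveat concrete, here is a small sketch. The helper names are hypothetical, "nearly 0%" is idealised as exactly 0, and the unequal-size case is an assumption added for contrast, not something from the example above.

```python
def mean_ratio(files):
    """Unweighted arithmetic mean of per-file compressed/uncompressed ratios."""
    return sum(c / u for c, u in files) / len(files)

def weighted_mean_ratio(files):
    """Mean of per-file ratios, weighted by uncompressed size."""
    total_u = sum(u for _, u in files)
    return sum((c / u) * u for c, u in files) / total_u

def total_ratio(files):
    """Ratio for the whole corpus: total compressed / total uncompressed."""
    return sum(c for c, _ in files) / sum(u for _, u in files)

# The two-file corpus from the post, as (compressed, uncompressed) in MB.
equal = [(0, 100), (100, 100)]
# Hypothetical variant: the zero-byte file is only 1 MB.
unequal = [(0, 1), (100, 100)]

for name, files in (("equal", equal), ("unequal", unequal)):
    print(name, mean_ratio(files), weighted_mean_ratio(files), total_ratio(files))
```

With equal sizes all three numbers agree at 0.5. In the unequal variant the unweighted mean stays at 0.5 while the size-weighted mean and the whole-corpus ratio both come out near 0.99, so only the size-weighted average tracks the total.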

4. Originally Posted by Paul W. Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)
I like the geometric ratios better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study
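Assuming "geometric ratios" here means the geometric mean of the per-file ratios, one property worth noting is that it does not care which way the ratio is defined. A minimal sketch with made-up file sizes:

```python
from statistics import geometric_mean

# Hypothetical (compressed, uncompressed) sizes in bytes, for illustration only.
sizes = [(50, 100), (20, 200), (90, 120)]

lower_is_better = [c / u for c, u in sizes]   # compressed / uncompressed
higher_is_better = [u / c for c, u in sizes]  # uncompressed / compressed

g_low = geometric_mean(lower_is_better)
g_high = geometric_mean(higher_is_better)

# The two geometric means are reciprocals of each other, so their
# product should be 1 up to floating-point rounding.
print(g_low, g_high, g_low * g_high)
```

The arithmetic and harmonic means do not have this symmetry, which may be part of the appeal.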

6. Originally Posted by Paul W. Compression results for a corpus are often reported as arithmetic means of compression ratios where compression ratio is compressed size / uncompressed size (expressed in percentage or bits-per-byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead.
I'd say it rather depends on what you want to deduce from the result. I'm not sure either of the averages makes much sense. Suppose we have a very disjoint set of sources A, B and C and want to form an average over them. I believe a sensible way of doing that is to consider a joint source D, formed as the concatenation of all three sources. The purpose of an average would then be to give a theoretical estimate or bound on the performance on the concatenated source. Neither average (harmonic or arithmetic) does this.

9. Originally Posted by Jyrki Alakuijala I like the geometric ratios better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study

There, you can just look at the graph and see that for about 3 % of the data you can expect worse results, even when the average/median/geometric average etc. are a lot better.
I agree that's a very nice way to show compression ratio differences over a large set of files.

However, I haven't seen a good way to compare not just size but the space-speed trade-off across multiple files.

