Optimizing color transforms like YCrCb for data compression – how are they chosen?

I have just updated the context dependence for upscaling paper ( https://arxiv.org/abs/2004.03391 : large improvement opportunities over FUIF/JXL "no predictor") with the RGB case, and the standard YCrCb transform has turned out quite terrible (as has PCA):
- orthogonal transforms individually optimized (~L1-PCA) for each image gave on average ~6.3% reduction vs. YCrCb,
- a single orthogonal transform optimized for all images (below) gave a mean ~4.6% reduction:

The transform should align values along the axes so they can be approximated with three independent Laplace distributions (the observed dependencies can be included in the considered width prediction), but YCrCb is far from that (at least for upscaling):

How exactly are color transforms chosen/optimized?
I have optimized for lossless, while YCrCb is said to be perceptual, optimized for lossy - is there any chance to formalize the problem here? (that would e.g. allow optimizing it separately for various region types)
YCrCb is not orthogonal – is there an advantage to considering non-orthogonal color transforms?
Ps. I wasn’t able to find details about XYB from JPEG XL (?)

Maybe this reply will bite me in the butt, but in the past I have read that YCrCb is better for lossy compression because the Y layer carries most of the information, while Cb and Cr can be compressed more aggressively: to the eye they matter less for the overall quality of the image. Also, Cb and Cr are often somewhat correlated with each other.

In audio we see the same idea when stereo is not treated as 2 totally different streams, but as one (mono) stream plus a second part encoding the delta between the two.
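The audio analogy can be made concrete with the reversible integer mid/side transform used by lossless codecs such as FLAC (a sketch, not any codec's actual code):

```python
def to_mid_side(l, r):
    # mid = floor average, side = difference; integer and exactly reversible
    return (l + r) >> 1, l - r

def from_mid_side(mid, side):
    # the low bit dropped from l+r equals the low bit of the side signal
    l = mid + ((side + (side & 1)) >> 1)
    return l, l - side
```

Like a luma-chroma transform, most of the energy lands in the mid channel and the side channel stays small for correlated stereo material.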

Edit: sorry. You are much much more advanced in your analysis of the problem already. Should have read it better.

Kaw,
indeed, YCrCb is motivated by lossy compression; that is probably why it is so terrible for lossless.
So the question is: how exactly are the lossy transforms chosen? What optimization criteria are used?

algorithm,
at least orthonormal transforms can also be used for lossless - from an accuracy/quantization perspective we get a rotated lattice, and usually we can uniquely translate from one lattice to the rotated one.
However, by putting already decoded channels into the context for predicting the remaining channels, we can capture similar dependencies - I have just tested it, getting only ~0.1-0.2 bits/pixel worsening compared to optimally chosen rotations.
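As a rough sketch of what "already decoded channels in the context" can mean (a simple per-image linear predictor, not the paper's actual context model):

```python
import numpy as np

def cross_channel_residual(context, target):
    # Predict a channel linearly from already-decoded channels and
    # return the residual that would actually be entropy coded.
    A = np.column_stack([context, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef, coef
```

If the channels are strongly correlated, the residual has much lower entropy than the raw channel, which is where the dependency is recovered without an explicit rotation.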

Kaw,
indeed, YCrCb is motivated by lossy compression; that is probably why it is so terrible for lossless.
So the question is: how exactly are the lossy transforms chosen? What optimization criteria are used?

I think the chief criterion in a lot of color spaces is perceptual uniformity - changes in any component are perceived as roughly similar changes in color difference; that way, when you minimize loss in the encoded color, you are indirectly applying a perceptual heuristic. Something like CIELAB, or HSV (the darling of computer vision/graphics of yore), is more perceptually uniform than RGB.

For compression, it may make more sense to use decompositions where one component is much more perceptually important (and has to be coded more precisely), and the other components are less perceptually important.

For lossless, I think you wouldn't care about any of these considerations; you simply want to decorrelate the color channels (so one plane is expensive to encode but the others are cheap). I think for lossless, tailoring the color transform per image is probably good for compression - storing the transform coefficients is not that expensive, so even small improvements in the coding can help. How you'd optimize the transform for compressibility is a different story; it seems to me that if it was easy to do efficiently, everyone would be doing it.

But of course there's also a tradeoff between precise perceptual color spaces and practical ones: the better perceptual representations are nonlinear and need higher precision, so compression based on them would be too slow.

Ps. I wasn’t able to find details about XYB from JPEG XL (?)

I'm guilty of not having published it in any form other than three open-sourced implementations (butteraugli, pik and jpeg xl).

Most color work is based on research from a hundred years ago: https://en.wikipedia.org/wiki/CIE_1931_color_space possibly with slight differences. It was initially based on roughly 2 degree discs of constant color and on isosurfaces of color perception experience (a supplementary 10 degree observer was delivered later). However, pixels are about 100x smaller than that, and using CIE 1931 is like modeling mice after knowing elephants very well.

During butteraugli development I looked into the anatomy of the eye and considered where the information is in photographic images.

1. In the fovea the anatomy of the eye is only bichromatic: L- and M-receptors only; the S-receptors are big and sit only around the fovea. This makes sense since the short-wavelength light they detect scatters more. Anatomical information is more reliable than physiological information.

2. Most of the color information stored in a photograph is in the high frequency information. At photographic quality one can consider that more than 95% of the information is in the 0.02 degree data rather than the 2 degree data. The anatomical knowledge about the eye and our own psychovisual testing suggest that the eye is scale dependent, and this invalidates the use of CIE 1931 for modeling colors for image compression. We cannot use a large scale color model to model the fine scale, and the fine scale is all that matters for modern image compression.

3. Signal compression (gamma compression) happens in the photoreceptors (cones). It happens after (or at) the process where spectral sensitivity influences the conversion of light into electricity. To model this, we first need to model the L, M, and S spectral sensitivities in a linear (RGB) color space and then apply a non-linearity. Applying the gamma compression along dimensions other than the L, M and S spectral sensitivities will lead to warped color spaces and has no mathematical possibility of getting color perception right.

YCbCr is just a relic of analog color TV, which used to do something like that, and somehow we kept doing it when going from analog to digital. It's based on the constraints of backwards compatibility with the analog black & white television hardware of the 1940s and 1950s (as well as allowing color TVs to correctly show existing black & white broadcasts, which meant that missing chroma channels had to imply grayscale). Things like chroma subsampling are a relic of the limited bandwidth available for chroma: the broadcasting frequency bands were already assigned, and there wasn't much room left for it.

YCbCr is not suitable for lossless for the obvious reason that it is not reversible: converting 8-bit RGB to 8-bit YCbCr takes you from 16 million different colors to about 4 million. Basically two bits are lost. Roughly speaking, it does little more than convert 8-bit RGB into 7-bit R, 8-bit G, 7-bit B, in a clumsy way that doesn't let you restore G exactly. Of course the luma-chroma-chroma aspect of YCbCr does help with channel decorrelation, but still, it's mostly the bit-depth reduction that helps compression. It's somewhat perceptual (R and B are "less important" than G), but only in a rather crude way.
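The bit-depth loss is easy to check numerically; a rough numpy sketch using the standard full-range BT.601 matrix (the determinant of that matrix is ~0.24, i.e. the ~2 lost bits):

```python
import numpy as np

# Full-range BT.601 RGB -> YCbCr matrix (standard constants).
M = np.array([[ 0.299,     0.587,     0.114   ],
              [-0.168736, -0.331264,  0.5     ],
              [ 0.5,      -0.418688, -0.081312]])

v = np.arange(64)  # a 64x64x64 corner of the RGB cube keeps this fast
rgb = np.stack(np.meshgrid(v, v, v, indexing="ij"), axis=-1).reshape(-1, 3)

# Round to 8-bit YCbCr: many RGB triples collide onto the same output.
ycc = np.rint(rgb @ M.T + [0.0, 128.0, 128.0]).astype(np.int64)
distinct = np.unique(ycc, axis=0).shape[0]

# Invert and round back: many colors do not survive the round trip.
back = np.rint((ycc - [0.0, 128.0, 128.0]) @ np.linalg.inv(M).T).astype(np.int64)
lost = int(np.any(back != rgb, axis=1).sum())
```

Here `distinct` comes out well below the number of input colors, and `lost` counts the colors that cannot be restored exactly.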

Reversible color transforms in integer arithmetic have to be defined carefully - multiplying with some floating point matrix is not going to work. YCoCg is an example of what you can do while staying reversible. You can do some variants of that, but that's about it. Getting some amount of channel decorrelation is the only thing that matters for lossless – perceptual considerations are irrelevant since lossless is lossless anyway.
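For reference, the YCoCg-R lifting steps, which stay reversible in integer arithmetic (a sketch in Python, where `>>` is the required arithmetic/floor shift):

```python
def rgb_to_ycocg_r(r, g, b):
    # Lifting steps: each line is individually invertible.
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    # Undo the lifting steps in reverse order.
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Because every step only adds a (floored) function of other channels, the inverse is exact, unlike a rounded matrix multiply.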

For lossy compression, things of course don't need to be reversible, and decorrelation is still the goal, but perceptual considerations also apply: basically you want any compression artifacts (e.g. due to DCT coefficient quantization) to be maximally uniform perceptually – i.e. the color space itself should be maximally perceptually uniform (after the color transform, the same distance in terms of color coordinates should result in the same perceptual distance in terms of similarity of the corresponding colors). YCbCr applied to sRGB is not very good at that: e.g. all the color banding artifacts you see in dark gradients, especially in video codecs, are caused by the lack of perceptual uniformity of that color space.

XYB is afaik the first serious attempt to use an actual perceptually motivated color space based on recent research in an image codec. It leads to bigger errors if you naively measure errors in terms of RGB PSNR (or YCbCr SSIM for that matter), but less noticeable artifacts.

While I have focused on lossless, adding lossy in practice usually (as PVQ failed) just means uniform quantization: with a lattice of size 1/Q.

The entropy of a Laplace distribution of width b ( https://en.wikipedia.org/wiki/Laplace_distribution ) is lg(2be) bits.
The ML estimator of b is the mean of |x|, leading to choosing the transform O as the minimizer of the entropy estimate: e(x) = sum_{d=1..3} lg(sum_i |x_id|),
and optimizing this over rotations indeed leads to points nicely aligned along the axes (plots above), which can be approximated by 3 Laplace distributions.
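For concreteness, a small numpy sketch of this optimization (greedy sweeps over Givens rotations is my choice here; the paper may use a different optimizer):

```python
import numpy as np

def laplace_bits(x):
    # e(x) = sum_d lg(sum_i |x_id|): entropy estimate (up to constants)
    # assuming each axis follows an independent Laplace distribution.
    return float(np.sum(np.log2(np.abs(x).sum(axis=0))))

def optimize_rotation(x, sweeps=10, n_angles=180):
    # Greedy sweeps over Givens rotations in each coordinate plane;
    # angle 0 is always a candidate, so the cost never increases.
    n, d = x.shape
    O = np.eye(d)
    y = x.astype(float).copy()
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    for _ in range(sweeps):
        for i in range(d):
            for j in range(i + 1, d):
                def pair_cost(a):
                    c, s = np.cos(a), np.sin(a)
                    yi = c * y[:, i] - s * y[:, j]
                    yj = s * y[:, i] + c * y[:, j]
                    return np.log2(np.abs(yi).sum()) + np.log2(np.abs(yj).sum())
                a = min(angles, key=pair_cost)
                c, s = np.cos(a), np.sin(a)
                G = np.eye(d)
                G[i, i] = G[j, j] = c
                G[i, j], G[j, i] = -s, s
                y = y @ G.T  # rotate the data
                O = G @ O    # accumulate the transform, so y == x @ O.T
    return O, y
```

Running it on axis-aligned Laplace data hidden behind a random rotation recovers an O that brings e(x) back down toward the unrotated value.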

For lossy, uniform quantization with lattice size 1/Q leads to entropy ~ lg(2be) + lg(Q) bits.
So to the above e(x) entropy evaluation we just need to add lg(Q1) + lg(Q2) + lg(Q3).
Hence we should still choose the transform/rotation optimizing e(x), which is similar to PCA ... and only choose the quantization constants Q1, Q2, Q3 according to perceptual evaluation.
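A minimal sketch of this rate estimate (function and variable names are mine):

```python
import numpy as np

def rate_bits_per_pixel(x, Q):
    # Per-axis Laplace entropy lg(2 e b_d) plus lg(Q_d) for uniform
    # quantization with lattice size 1/Q_d, summed over the 3 axes.
    b = np.mean(np.abs(x), axis=0)  # ML estimate of the Laplace width
    return float(np.sum(np.log2(2.0 * np.e * b) + np.log2(Q)))
```

Note the additive split: doubling every Q_d costs exactly 3 bits/pixel regardless of the rotation, which is why the rotation and the quantization constants can be discussed separately here.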

Anyway, such a choice shouldn't be made just by human-vision analysis, but also by analysis of a data sample - I believe I could easily get lower bits/pixel with such optimization also for lossy.

Hence we should still choose the transform/rotation optimizing e(x), which is similar to PCA ... and only choose the quantization constants Q1, Q2, Q3 according to perceptual evaluation.

I am not sure that's the case - it can happen that the directions PCA picks are not well aligned with the "perceptual importance" directions, so to maintain perceptual quality you need good precision in all 3 quantized values. As an example, if the directions have the same weight for green, you may be forced to spend more bits on all three of them; or if the directions end up at equal angles from luma - same situation.

I think for lossless it doesn't matter, because your loss according to any metric will be zero - the difficulty is in having an integer transform that's invertible.

For data compression applications, aligning them with a rotation gives a few percent size reduction - thanks to better agreement with 3 independent Laplace distributions (there is a small dependence which can be included in width prediction to get an additional ~0.3 bits/pixel reduction). Agreement with the assumed distribution is crucial for both lossy and lossless: log-likelihood, e.g. from ML estimation, is literally the savings in bits/pixel, and for disagreement we pay the Kullback-Leibler divergence in bits/pixel.

My point is that we should use both simultaneously: optimize according to perceptual criteria, and also use this "dataset axis alignment" for agreement with the assumed distribution ... while it seems the transforms currently in use are optimized only perceptually (?)

To optimize for both simultaneously, the basic approach is just to choose three different quantization coefficients for the three axes, which is nearly optimal from a bits/pixel perspective (as explained in the previous post).

But maybe it would also be worth rotating such axes for perceptual optimization? That would require formalizing such an evaluation ...

Another question is the orthogonality of such a transform - should the three axes be orthogonal?
While it seems natural from an ML perspective (e.g. PCA), it is not true for YCrCb or YCoCg.
But again, optimizing over non-orthogonal transforms needs some formalization of perceptual evaluation ...

To formalize such an evaluation, we could use a distortion metric with weights in perceptually chosen coordinates ...

I don't have any disagreement with the fact that the optimal solution will optimize for both perceptual and coding considerations. It just seemed from your comment that you think a sequential optimization will work - first optimize the directions for coding, then optimize the quantization perceptually. I think the parameters are coupled if you evaluate them with a perceptual metric, so a sequential optimization strategy seems a bit on the greedy side. Perhaps I misunderstood your explanation in the forum; I am reacting to the comments and have not read the paper carefully.

I personally find the use of L1-PCA very appealing; in ML, for the longest time L2/Gaussians have been used not because people think they're accurate but because they are convenient to analyze and compute with. Then people try to find exact solutions to crippled models instead of accepting sub-optimal solutions for a model that reflects reality more closely (that's the math bias toward convergence proofs / theoretical guarantees).

L1-PCA is not exactly what we want here - the lowest bits/pixel - so I have directly optimized:
e(x) = sum_{d=1..3} lg(sum_i |x_id|)
which is literally the sum of entropies of the 3 estimated Laplace distributions: it can be translated into approximate bits/pixel.

For all images it led to this averaged transform:

The first vector (as a row) is a kind of luminosity (Y) and should have higher accuracy; the remaining ones correspond to colors - I would just use finer quantization for the first one and coarser for the remaining two.

But we could also directly optimize both the rotation and some perceptually chosen distortion evaluation instead - I have just written up the theory today and will update the arxiv version in a day or two.

I have updated https://arxiv.org/pdf/2004.03391 with perceptual evaluation, to be combined with this decorrelation (for agreement with 3 nearly independent Laplace distributions) so that the quantization coefficients are optimized automatically.

So there is a separate basis P for perceptual evaluation, e.g. YCrCb. In this basis we define weights d = (d1, d2, d3) for the distortion penalty, e.g. larger for Y, smaller for Cr, Cb.
There is also the transform basis O into the actually encoded channels (preferably decorrelated), with quantization coefficients q = (q1, q2, q3).

This way the perceptual evaluation (distortion) becomes the Frobenius norm: D = |diag(q) O P^T diag(d)|.
The entropy (rate) is H = h(X O^T) - lg(q1 q2 q3) + const bits/pixel.

If P = O (rotations), the perceptual evaluation is defined along the decorrelation axes; then the distortion D is minimized by quantization coefficients
(q1, q2, q3) = (1/d1, 1/d2, 1/d3) times a constant choosing the rate-distortion tradeoff.

For a general perceptual evaluation (P != O), we can minimize the rate H under a constraint of fixed distortion D.
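A quick numeric sanity check of the P = O special case (a sketch with made-up weights; with O orthogonal, diag(q) O O^T diag(d) collapses to diag(q*d), so q = c/d gives equal weighted distortion c on every axis):

```python
import numpy as np

def distortion(q, O, P, d):
    # D = Frobenius norm of diag(q) O P^T diag(d), as defined above.
    return float(np.linalg.norm(np.diag(q) @ O @ P.T @ np.diag(d)))

rng = np.random.default_rng(0)
O, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random rotation
d = np.array([4.0, 1.0, 1.0])  # heavier penalty on the "luma" axis
c = 0.1                        # rate-distortion tradeoff constant
q = c / d                      # the claimed optimum for P = O
```

With these choices the distortion equals c*sqrt(3) exactly, independent of the particular rotation O.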

ps. I didn't use a bold 'H' because it literally crashes the server ("Internal Server Error"; the same happens for italic and underline)

@Jarek: Sorry, but it's not really a crash... The hosting company is experimenting with mod_security rules to block exploits.
Not sure how to deal with it - the vBulletin engine is not very safe, so in some cases it's actually helpful.