# Thread: Improving distribution and quantization for DCT coefficients

1. ## Improving distribution and quantization for DCT coefficients

I am supposed to work as a consultant helping with the design of data compressors, which motivated me to take a closer look and rethink. After directly predicting parameters (e.g. of a Laplace distribution) from context and optimizing the color transform, it is time to take a closer look at DCT coefficients. Some first observations: https://arxiv.org/pdf/2007.12055

1) I have checked the DST and DCT families and DCT-II is indeed the best; attempts at further optimization gave barely a few bits of improvement per 8x8 block,

2) There is "common knowledge" that AC coefficients come from a Laplace distribution, I have taught it to students ... but looking at the data, this is just wrong!
Checking the exponential power distribution (EPD) family rho ~ exp(-|x|^k), we don't get the k=1 of the Laplace distribution, but k~0.5, which is quite different: a more compact body and thicker tails.
Using a Laplace distribution for values that actually come from a k=0.5 EPD wastes ~0.1 bits/value ... which for RGB without subsampling can mean ~0.3 bpp of file reduction.
Here is such an evaluation; at the bottom, perfect agreement would be flat, while Laplace is the orange one:

The question is where the Laplace assumption for AC coefficients is used directly? Those places might have this improvement potential ...
One is the Golomb code, which is "optimal": for some Laplace distributions, among prefix codes ... does anybody still use it in modern image/video compressors?
I remember discussing PVQ with Jean-Marc Valin. I showed how to optimize it for a uniform distribution on a sphere, perfect for a multivariate Gaussian distribution ... but he said the coefficients come from a Laplace distribution, for which the basic PVQ turned out better ... if it is k~1/2 instead, we again should be able to reduce MSE by ~10-20% through such a deformation.
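As a sanity check of point 2, here is a minimal sketch (my own toy code, not from the paper) that fits the EPD shape parameter k by maximum likelihood on synthetic k=0.5 data, and measures the penalty of assuming Laplace (k=1); the sample size and the search grid are illustrative choices:

```python
import numpy as np
from math import gamma, log, log2

# EPD: rho(x) ~ exp(-|x/s|^k); sample k=0.5 data via |X| = G^(1/k), G ~ Gamma(1/k)
rng = np.random.default_rng(0)
k_true, n = 0.5, 200_000
g = rng.gamma(1.0 / k_true, 1.0, n)
x = g ** (1.0 / k_true) * rng.choice([-1.0, 1.0], n)

def epd_bits(x, k):
    """Mean code length (bits/value) under an ML-fitted EPD with fixed shape k."""
    ax = np.abs(x)
    s = (k * np.mean(ax ** k)) ** (1.0 / k)          # ML scale for this k
    return log2(2.0 * s * gamma(1.0 / k) / k) + np.mean((ax / s) ** k) / log(2.0)

ks = np.arange(0.3, 1.51, 0.05)
bits = [epd_bits(x, k) for k in ks]
k_best = ks[int(np.argmin(bits))]
penalty = epd_bits(x, 1.0) - min(bits)               # cost of the Laplace assumption
print(f"fitted k ~ {k_best:.2f}, Laplace wastes ~ {penalty:.3f} bits/value")
```

The penalty comes out close to the ~0.1 bits/value quoted above (for k=0.5 it is ~0.14 bits analytically).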

3) I have also worked on flexible quantization for such densities: automatically getting a better one than uniform while avoiding costly standard techniques like Lloyd-Max.
Are there some more practical techniques?
I have defined a "quantization density" q choosing how dense the local quantization is, a kind of N->infinity limit of the number of quantization regions, to finally be used for finite N.
Taking its inverse CDF on a size-N regular lattice, we get a (nonuniform) quantization, e.g.:

This q/Q quantization density can be optimized for a chosen density; the above is for minimizing MSE distortion, for which, as shown, we should choose
q ~ rho^{1/3}
so we should use e.g. twice denser quantization for 8 times denser regions.
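As an illustrative sketch of this construction (my own toy code): take the normalized cumulative of q ~ rho^{1/3} on a grid, place region boundaries at its inverse CDF on a regular lattice, and compare MSE against a uniform quantizer with the same number of regions, here for a unit Laplace source:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.laplace(0.0, 1.0, 200_000)            # test source: unit Laplace
N = 16                                        # number of quantization regions

grid = np.linspace(-8.0, 8.0, 100_001)
rho = 0.5 * np.exp(-np.abs(grid))             # Laplace density on the grid
q = rho ** (1.0 / 3.0)                        # MSE-optimal quantization density
Q = np.cumsum(q); Q /= Q[-1]                  # normalized cumulative of q

bounds = np.interp(np.arange(1, N) / N, Q, grid)   # inverse CDF on regular lattice
idx = np.searchsorted(bounds, x)
recon = np.array([x[idx == i].mean() for i in range(N)])   # centroid decoder
mse_q = np.mean((x - recon[idx]) ** 2)

ub = np.linspace(-8.0, 8.0, N + 1)[1:-1]      # uniform competitor, same support
ui = np.searchsorted(ub, x)
urecon = np.array([x[ui == i].mean() for i in range(N)])
mse_u = np.mean((x - urecon[ui]) ** 2)
print(f"MSE: q ~ rho^(1/3) quantizer {mse_q:.4f} vs uniform {mse_u:.4f}")
```

The uniform quantizer loses mainly by spending regions on rare tails; as noted below, rate-distortion optimization shifts the answer back toward nearly uniform.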

However, such distortion optimization increases entropy due to a more uniform distribution, which turns out to be a serious problem also affecting e.g. Lloyd-Max.
I have also worked on such rate-distortion optimization, and it leads to nearly uniform quantization - so not so bad after all.

PS. I had written these 3 points before, but couldn't avoid this "internal server error" (is there a way?) ... everything crashed there, and the forum is generally super slow ...

2. ## Thanks:

Jyrki Alakuijala (25th July 2020)

3. I have just updated https://arxiv.org/pdf/2007.12055 by exploiting statistical dependencies inside the 8x8 DCT block. It turns out this allows for huge savings: ~0.5 bits/value (per pixel for grayscale, up to x3 for RGB).
So DCT decorrelates well: additional linear(!) predictions between its coefficients give nearly no benefit.
However, there are also different statistical dependencies between them, like width dependence (between variances): the higher the absolute value of one coefficient, the larger the width of the Laplace distribution we should use for its neighbors.
It is shown in the left plots below: the higher the already decoded |DCT_12|, the larger the Laplace width we should use for DCT_13; exploiting this gives ~0.4 bits/value savings.
The bottom right contains weights for finding the optimal width from coefficients already decoded in zigzag order. It becomes a bit costly, but inexpensive approximations should come close.
If e.g. JPEG recompressors don't use this, expect them to finally break their 20-22% bound by a lot ...
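The width-dependence idea can be sketched on toy data (entirely synthetic; the real weights come from the paper's regressions): a hidden local activity sets the Laplace width of both a decoded neighbor and the current coefficient, and conditioning the width on the neighbor's magnitude directly shows up as bits/value savings:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
s = np.exp(rng.normal(0.0, 1.0, n))       # hidden local "activity" (Laplace width)
c1 = rng.laplace(0.0, s)                  # already decoded neighbor coefficient
c2 = rng.laplace(0.0, s)                  # coefficient to encode next

def laplace_bits(x, b):
    """Mean code length (bits/value) coding x with Laplace width(s) b."""
    return np.mean(np.log2(2.0 * b) + np.abs(x) / (b * np.log(2.0)))

bits_global = laplace_bits(c2, np.abs(c2).mean())      # one global width

nb = 16                                   # context buckets over the neighbor's |c1|
edges = np.quantile(np.abs(c1), np.arange(1, nb) / nb)
bucket = np.searchsorted(edges, np.abs(c1))
b_ctx = np.array([np.abs(c2[bucket == i]).mean() for i in range(nb)])  # ML width per bucket
bits_ctx = laplace_bits(c2, b_ctx[bucket])
print(f"width prediction saves ~ {bits_global - bits_ctx:.2f} bits/value")
```

Even this crude bucketing recovers a large fraction of the savings available from knowing the width exactly; the paper's linear width predictors play the same role with more context.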

4. Updated https://arxiv.org/pdf/2007.12055 with DCT prediction from already decoded neighboring blocks, both to improve compression ratio and to reduce blocking artifacts.
We can use the DCT coefficients (left) of these decoded blocks, or their final decoded values (right). Here are the visualized coefficients for both (and bits/value savings for various approximations); as expected, it is sufficient to focus on the column/row of adjacent pixels:

As a compromise between compression ratio and computational cost, I would perform a 1D DCT of the right-most columns and lowest rows of decoded blocks, to use as features for predictions for new blocks, as there are frequency dependencies (Figure 7).
And generally the mean savings from such value predictions are ~0.05 bits/value, which is tiny compared to the ~0.5 bits/value savings from width prediction (of the Laplace distribution). The latter seems rarely used (LOCO-I ... ?), which would mean a huge savings opportunity.

Is something like above used in practice?
Could anybody summarize methods used in JPEG repackers (PACKjpg, lepton, brunsli)?
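For intuition on why the 1D DCT of boundary rows/columns works as a prediction feature, here is a synthetic sketch (my toy model, not any codec's code): smooth 8x16 images are generated from a decaying global spectrum, the left block is "decoded", and the 1D DCT of its last column predicts the right block's lowest-horizontal-frequency DCT column, frequency by frequency:

```python
import numpy as np

def dct_mat(N):
    """Orthonormal DCT-II matrix."""
    k, i = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    M = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * i + 1) / (2 * N))
    M[0] /= np.sqrt(2.0)
    return M

rng = np.random.default_rng(4)
D8, D16 = dct_mat(8), dct_mat(16)
m = 5_000
u, v = np.meshgrid(np.arange(8), np.arange(16), indexing="ij")
spec = rng.normal(size=(m, 8, 16)) / (1.0 + u + v) ** 2   # decaying spectrum
img = np.einsum("ua,muv,vb->mab", D8, spec, D16)          # smooth 8x16 "images"

f = img[:, :, 7] @ D8.T                   # 1D DCT of left block's last column
C = np.einsum("ua,mab,vb->muv", D8, img[:, :, 8:], D8)    # right block's 2D DCT
t = C[:, :, 0]                            # its lowest horizontal frequency column
r = (f * t).sum(0) / np.sqrt((f * f).sum(0) * (t * t).sum(0))  # per-frequency corr
savings = 0.5 * np.log2(1.0 / (1.0 - r ** 2)).sum()       # Gaussian bits saved
print(f"correlations per vertical frequency: {np.round(r, 2)}")
print(f"~{savings:.1f} bits/block from matched-frequency linear prediction")
```

The correlations are concentrated between matching vertical frequencies, which is exactly the "frequency dependencies" picture: one scalar weight per frequency already captures most of the gain.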

5. ## Thanks:

MegaByte (27th August 2020)

6. Originally Posted by Jarek
... I would perform 1D DCT of these right-most columns and lowest rows of decoded blocks ...

Is something like above used in practice?
Could anybody summarize methods used in JPEG repackers (PACKjpg, lepton, brunsli)?
My understanding is that PackJPG (and possibly Lepton) do what you propose. I don't read other people's source code in compression, so I don't know for sure.

In brunsli I do a different thing -- I extrapolate what the 1D DCT would be at the border of the two blocks. I'm secretly rather proud of this technique as it required a certain level of intuition in this domain. (Secretly, because usually no one wants to discuss this domain :-D). Intuitively this was a better approach for me, is faster to compute and we got better results with it when Zoltan compared the methods in practice (back in 2014 when we were developing that code).

The math is super-cute when you extrapolate the 1D DCT from the 2D DCT. For one side you just sum the coefficients (row- or column-wise depending on where you are extrapolating); for the other side you do an alternating sum.
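This sum / alternating-sum identity is easy to verify numerically: evaluating the DCT-II cosine series at the half-sample positions just outside the block (t = -1/2 and t = N - 1/2) makes every cosine equal 1 or (-1)^k. A small check, using my own orthonormal normalization:

```python
import numpy as np

N = 8
k = np.arange(N)
a = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))   # orthonormal scaling

def dct2(x):
    """Orthonormal DCT-II of an N-vector."""
    n = np.arange(N)
    return a * (np.cos(np.pi * k[:, None] * (2 * n + 1) / (2 * N)) @ x)

def series(X, t):
    """Evaluate the inverse-DCT cosine series at continuous position t."""
    return float(np.sum(a * X * np.cos(np.pi * k * (2 * t + 1) / (2 * N))))

rng = np.random.default_rng(3)
x = rng.normal(size=N)
X = dct2(x)
left = series(X, -0.5)                    # extrapolated value at the left edge
right = series(X, N - 0.5)                # extrapolated value at the right edge
print(np.isclose(left, np.sum(a * X)),                     # plain sum
      np.isclose(right, np.sum(a * X * (-1.0) ** k)))      # alternating sum
# prints: True True
```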

In brunsli we use these extrapolated 1D DCTs as context for some of the 2D DCT coefficients. Using a context is a balance between slowing down decoding and gaining density, so we don't use it for all coefficients.

7. ## Thanks:

Jarek (27th August 2020)

8. Thanks. Indeed, data compression generally (besides e.g. patent minefields) suffers from a lack of documentation. Even when source is available, it is often faster to just work with the data, like the above results from a few hours of just testing basic approaches.

Indeed, going from the 2D to the 1D DCT is nice; however, as you usually have to decode the values anyway, it seems cheaper to calculate the 1D DCT from the decoded values (... there is a subtle difference in errors from quantization).
Interpolating the next row/column is a good idea, but if done with linear combinations, it is already hidden in the linear regressions I have used. I trust data-driven optimizations more than e.g. interpolations (which contain hidden regularity assumptions).

How do you choose width/uncertainty sigma for the distributions?

9. Originally Posted by Jarek
How do you choose width/uncertainty sigma for the distributions?
In brunsli the 1D DCT extrapolation is only used for context modelling. The distribution is measured from the actual DCT values and not modeled.

The actual values are clustered further based on the system that I introduced in WebP lossless and later used in Brotli, too. We used this system in Brunsli and JPEG XL to cluster entropy codes, so that where the context modeling is not effective we don't need to pay the cost of having too many entropy codes.
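The trade-off can be illustrated with a toy calculation (hypothetical histograms, not WebP's actual clustering code): merging contexts with similar symbol distributions costs almost nothing in code length, while merging dissimilar ones is expensive:

```python
import numpy as np

def code_bits(counts, model):
    """Total bits to code `counts` using the distribution implied by `model`."""
    p = model / model.sum()
    return float(-(counts * np.log2(p)).sum())

# hypothetical symbol histograms from four contexts: two "peaked low", two "peaked high"
ctx = np.array([[80, 10, 5, 5], [75, 12, 8, 5],
                [10, 15, 35, 40], [8, 12, 40, 40]], dtype=float)
separate = sum(code_bits(c, c) for c in ctx)            # one entropy code each
pairs = (code_bits(ctx[0] + ctx[1], ctx[0] + ctx[1])    # merge the similar pairs
         + code_bits(ctx[2] + ctx[3], ctx[2] + ctx[3]))
single = code_bits(ctx.sum(0), ctx.sum(0))              # one shared entropy code
print(f"4 codes: {separate:.0f} bits, 2 codes: {pairs:.0f} bits, 1 code: {single:.0f} bits")
```

Here the similar contexts share a code for about a bit of total overhead, while collapsing everything into one code costs far more; real clustering searches such merges systematically against the cost of signaling extra codes.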

10. There are strong statistical dependencies between neighboring values: besides linear value dependencies, variance dependencies are also extremely strong and quite complicated, e.g. for DCT. I am skeptical about exploiting them effectively with clustering-like approaches ... the log-likelihood improvement of a model is literally the mean bits/value savings.

11. Originally Posted by Jarek
There are strong statistical dependencies between neighboring values: besides linear value dependencies, variance dependencies are also extremely strong and quite complicated, e.g. for DCT. I am skeptical about exploiting them effectively with clustering-like approaches ... the log-likelihood improvement of a model is literally the mean bits/value savings.
Right. Clustering is nice even when it doesn't fully exploit the correlations since it can be made relatively fast to decode.

12. Finally added some evaluation with quantization to https://arxiv.org/pdf/2007.12055. The blue improvements on the right use a 1D DCT of the column to the left and the row above (the plots show coefficients: prediction mainly uses the corresponding frequencies, so we can focus on them). Prediction of widths has a similar cost to prediction of values, but gives much larger improvements here.
The red improvements use already decoded values in zigzag order. It is costly and the gain is relatively small, but practical approximations of it should have similar gains.

13. In PIK and JPEG XL we also optimize the (zigzag) order to get the best zero-RLE characteristics. There is also some additional benefit from keeping a decreasing variance, for other reasons in the chosen codings.

14. Sure, the order might improve things a bit, but I am talking about exploiting dependencies inside the block: the tests show that linear predictions from already decoded coefficients in this block can give a few percent improvement, especially width prediction (post #2 here).

15. Prepared slides on my statistical modelling for image compression experience, like the above: https://www.dropbox.com/s/zcvqcy251c...age%20comp.pdf

16. Added CCA ( https://en.wikipedia.org/wiki/Canonical_correlation ) to 2007.12055 - a very nice tool to approximate statistical dependencies of multidimensional variables through a low-dimensional bottleneck for reduced computational cost.

For example, below are features obtained from CCA for DCT prediction: while we previously discussed using a 1D DCT of decoded neighboring pixels for this, here we automatically get something similar to that intuition, further optimized.
Using only 1), we take nearly the mean of the neighboring pixels and add it to the predicted DCTs as in the 8x8 map (which in practice can be reduced), this way getting ~0.14 bpp savings.
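A minimal CCA sketch (my own implementation via whitening plus SVD, on toy data standing in for the two multidimensional variables): both views share a one-dimensional latent factor, and CCA recovers a highly correlated low-dimensional bottleneck:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
z = rng.normal(size=n)                         # shared 1D latent factor
X = np.outer(z, np.linspace(0.5, 1.5, 8)) + rng.normal(size=(n, 8))
Y = np.outer(z, np.linspace(1.5, 0.5, 8)) + rng.normal(size=(n, 8))

def cca(X, Y, d=1):
    """Top-d canonical directions and correlations via whitening + SVD."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    cxx = X.T @ X / len(X); cyy = Y.T @ Y / len(Y); cxy = X.T @ Y / len(X)
    wx = np.linalg.inv(np.linalg.cholesky(cxx))      # whitening transforms
    wy = np.linalg.inv(np.linalg.cholesky(cyy))
    u, s, vt = np.linalg.svd(wx @ cxy @ wy.T)        # SVD of whitened cross-covariance
    return wx.T @ u[:, :d], wy.T @ vt[:d].T, s[:d]

a, b, corr = cca(X, Y)
print(f"top canonical correlation: {corr[0]:.3f}")
```

The returned projections play the role of the learned features above: a few canonical directions replace full multivariate dependence at much lower computational cost.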