Compression of prediction residuals in lossless audio - geopack?!

Hi folks

I'm still experimenting with my lossless audio coder. So far, nothing I've implemented has given substantial gains - still on par with OptimFrog.

I have some problems:

1) "Real" neural networks learn too slowly with gradient descent to be better than classical OLS (which can be solved directly).
Therefore I'm currently implementing the Levenberg-Marquardt algorithm.

2) How to mix several predictions?
We can do classical linear mixing p = a_1*p_1 + ... + a_n*p_n, but that rarely helps, especially if the predictors are strongly correlated.
I've tried a "mixture of experts" approach using a neural network which tries to partition the input space, but that reduces to the problem of clever feature extraction to help the gating network.
Expert weights also sum to one. What happens if a predictor has a (predictable) bias, or we have a predictor that is always worse than the best one but can still improve confidence?
So far, fast mixing is done via L1-LMS (also known as Sign-Sign LMS), which works well and is fast, but ignores the magnitudes of the predictors.
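A minimal sketch of the Sign-Sign LMS update described above (the step size mu, the function names, and the absence of any weight clamping are my assumptions, not Sac's actual code):

```cpp
#include <cmath>
#include <vector>

// sign of a value: -1, 0 or +1
inline int sgn(double v) { return (v > 0) - (v < 0); }

// linear mix of n predictions with current weights
double mix_predict(const std::vector<double>& w, const std::vector<double>& p) {
    double pred = 0.0;
    for (size_t i = 0; i < w.size(); i++) pred += w[i] * p[i];
    return pred;
}

// Sign-Sign LMS: step each weight by a fixed mu in the direction
// sign(error)*sign(prediction) - magnitudes never enter the update.
void ss_lms_update(std::vector<double>& w, const std::vector<double>& p,
                   double target, double mu) {
    double e = target - mix_predict(w, p);
    for (size_t i = 0; i < w.size(); i++)
        w[i] += mu * sgn(e) * sgn(p[i]);
}
```

This makes the update fast and robust (no multiplications in the inner loop, only sign tests), but as noted, it is blind to how wrong each predictor actually was.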

3) Encoding of residuals
I'm not so happy with separating the tasks of prediction and residual encoding, because that implies the error is additive; we lose information from the prediction stage and introduce noise/artifacts.
Currently I'm using a "bitplane coder" where I scan every residual bitplane and use separate (simple paq-like) models for significant (residual > 0) and yet-insignificant residuals.

The big problem with audio is that samples are sparse. The possible sample space is large (16-bit) in comparison to the size of a frame (~10 s), so we are not able to gather enough statistics for each sample value.

What I want to do is mix predictors in the encoding stage. I looked at geopack but have problems understanding the source.
What I would do: if x is the current input, make separate distributions for x-p1, x-p2, ... for every predictor and mix them somehow - is this the way geopack works?

Modelling joint distribution of (context,current) as polynomial

Hi,
Predicting the probability distribution (of the residue) based on a real-valued context is generally a difficult problem, usually requiring some costly and unpredictable iterations like gradient descent ...
However, it becomes really simple, inexpensive and optimal if we estimate this joint probability distribution as a linear combination, e.g. a polynomial (slides, thread):
assuming rho(x) = sum_f a_f * f(x) for an orthonormal basis f, MSE optimization gives each a_f as just the average of f over the sample: a_f = 1/n * sum_x f(x), or adaptively: a_f -> lambda * a_f + (1-lambda) * f(x).
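A toy sketch of this estimation for a single variable on [0,1], using the first three orthonormal Legendre polynomials as the basis (the basis choice and the epsilon clamp are illustrative assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Orthonormal Legendre basis on [0,1]:
// f0 = 1, f1 = sqrt(3)*(2x-1), f2 = sqrt(5)*(6x^2-6x+1).
double basis(int j, double x) {
    switch (j) {
        case 0: return 1.0;
        case 1: return std::sqrt(3.0) * (2.0*x - 1.0);
        case 2: return std::sqrt(5.0) * (6.0*x*x - 6.0*x + 1.0);
    }
    return 0.0;
}

// MSE-optimal coefficients are just sample averages of basis values:
// a_j = (1/n) * sum_x f_j(x).
std::vector<double> fit_density(const std::vector<double>& sample, int m) {
    std::vector<double> a(m, 0.0);
    for (double x : sample)
        for (int j = 0; j < m; j++)
            a[j] += basis(j, x);
    for (int j = 0; j < m; j++) a[j] /= sample.size();
    return a;
}

// Evaluate rho(x) = sum_j a_j * f_j(x); it can go negative,
// so clamp to a small epsilon as suggested in the post.
double density(const std::vector<double>& a, double x) {
    double rho = 0.0;
    for (size_t j = 0; j < a.size(); j++) rho += a[j] * basis(j, x);
    return std::max(rho, 1e-6);
}
```

The adaptive variant would replace the averaging in `fit_density` with the exponential update a_j -> lambda*a_j + (1-lambda)*f_j(x) per sample.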

It works great e.g. for financial time series (economists use really primitive models: the predictor is just a linear combination, plus the residue as a fixed-width Gaussian ... while the data clearly shows a Laplace distribution, as in data compression).
The problem is that it sometimes gives negative densities, which has to be handled - for data compression we can predict bit-by-bit and use e.g. p = max(p, epsilon).
Finally for lossless audio:
- choose some features describing the local past, preferably low-dimensional (e.g. use PCA to reduce the dimension),
- for each bit separately, estimate the joint probability distribution with the context (features, already-decoded older bits) as a polynomial,
- for static coding write the coefficients of this distribution in the header; for adaptive coding start with defaults and modify them on the way,
- substitute the current context into the joint distribution, normalize the resulting prediction to 1, and correct rare negative predictions.
Best,
Jarek

I know it was already posted in your thread, but isn't context-dependent (or feature-dependent) gathering of statistics already the estimation of the multivariate probability distribution you are talking about?
Economists usually do ARMA or ARIMA, which is exactly what we do in lossless audio (with some tricks here and there).

Hey Sebastian,
It is context-dependent, e.g. for the Dow Jones https://arxiv.org/pdf/1807.04119 :
- normalize the variables to nearly uniform on [0,1] (using the CDF of a Laplace distribution),
- look at d successive values: they would be uniform in [0,1]^d if uncorrelated - fitting a polynomial rho(x) = 1 + sum_j a_j * f_j(x), the coefficients a_j describe statistical dependencies, can be cheaply and independently calculated, and have a specific cumulant-like interpretation,
- for probability prediction, substitute the (d-1) previous values (the context) into the estimated d-variable polynomial, getting a single-variable polynomial - after normalization to 1, it is the predicted density of the new value.
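The normalization in the first step could look like this (mu and b would be fitted to the data, e.g. mu ≈ median and b ≈ mean absolute deviation - those estimation choices are mine, not from the paper):

```cpp
#include <cmath>

// Map a roughly Laplace-distributed variable to ~uniform [0,1]
// via the Laplace CDF with location mu and scale b.
double laplace_cdf(double x, double mu, double b) {
    double z = (x - mu) / b;
    return (z < 0) ? 0.5 * std::exp(z) : 1.0 - 0.5 * std::exp(-z);
}
```

After this transform, d successive normalized values live in [0,1]^d, and any deviation from uniformity there is exactly the statistical dependence the polynomial coefficients capture.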

Indeed, I meant primitive ARIMA-like models, which can be greatly improved: instead of assuming e.g. a fixed-width Gaussian for the residue, we can exploit statistical dependencies by modelling the joint distribution.

Thanks. Interpolation between contexts does sound interesting. I'll take a look into it. Maybe this could resolve some of the context-dilution problems I'm seeing.

But don't forget that the ARIMA predictor y = a_1*x_1 + a_2*x_2 + ... is optimal in the L2 sense. We solve the covariance matrix directly, so there is no better predictor if the (additive) noise is assumed to be Gaussian.
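That direct solution is just the normal equations R w = r (R = autocovariance matrix, r = cross-covariance vector); a small Gaussian-elimination sketch, with partial pivoting but without the ridge regularization a real coder would add for near-singular R:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Solve R w = r by Gaussian elimination with partial pivoting.
// R and r are taken by value since elimination destroys them.
std::vector<double> solve_ols(std::vector<std::vector<double>> R,
                              std::vector<double> r) {
    int n = (int)r.size();
    for (int c = 0; c < n; c++) {
        // pick the row with the largest pivot in column c
        int piv = c;
        for (int i = c + 1; i < n; i++)
            if (std::fabs(R[i][c]) > std::fabs(R[piv][c])) piv = i;
        std::swap(R[c], R[piv]);
        std::swap(r[c], r[piv]);
        // eliminate column c below the pivot
        for (int i = c + 1; i < n; i++) {
            double m = R[i][c] / R[c][c];
            for (int j = c; j < n; j++) R[i][j] -= m * R[c][j];
            r[i] -= m * r[c];
        }
    }
    // back-substitution
    std::vector<double> w(n);
    for (int c = n - 1; c >= 0; c--) {
        double s = r[c];
        for (int j = c + 1; j < n; j++) s -= R[c][j] * w[j];
        w[c] = s / R[c][c];
    }
    return w;
}
```

The resulting w is the L2-optimal predictor the post refers to; no iterative search is needed.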

Great, I would gladly discuss it.
Having a large sample (in data compression it is huge), we can get as close to the real joint distribution as needed.
The number of polynomial coefficients grows exponentially: (m+1)^d for degree m and d variables, so the dimension cannot be too large (hence e.g. PCA to reduce it).
But thanks to being really cheap, working with e.g. millions of coefficients is not a problem.

I'm calling this method (the one implemented in geopack) "probability translation".
In theory it's simple and obvious, but research is necessary to find an efficient implementation, because the simple version is slow.
So I'd appreciate any progress with that.

In LPC case, the method is like this:
1. LPC is a linear shift, instead of encoding x[i] directly, we calculate a prediction d and encode x[i]-d.
2. Now, suppose we have simple frequency tables for probability distributions of
y1=half+x[i]-d1 and y2=half+x[i]-d2 (different LPC models),
with update f1[y1]++; f2[y2]++;
3. In this case, f1[half] would correspond to x[i]=d1, so f1[half+d2-d1] would correspond to x[i]=d2, right?
So we can just shift f1 table by (d2-d1) and mix, then encode y2 values with mixed table.
4. (what is implemented in geopack) It's also easy to do linear shifts with a tree.
Suppose we have a binary tree with probabilities of y1's bits (b0..bN; b0 is the MSB).
Then computing a full-value probability is done by walking from root to leaf and multiplying the bit probabilities:
(b0+p0*(1-2*b0)) * ... * (bN+pN*(1-2*bN))
And with the same complexity we can compute the sum over a probability interval [0..a),
by also summing the intermediate results of the bit-probability multiplication.
5. We can compute sum of probabilities of values in range [a..b) as [0..b)-[0..a), with two tree lookups.
6. b0=0 corresponds to y1 being in [0..2^(N-1)),
which in turn corresponds to y2 being in [d2-d1..d2-d1+2^(N-1)).
Thus, we can mix y1.p0 with range sum of [d2-d1..d2-d1+2^(N-1)) in y2 tree.
7. Once b0 is decoded, we can work with [0..2^(N-2)) or [2^(N-1)..2^(N-1)+2^(N-2))

The problem is, for 16-bit values and with M submodels, this
requires (M-1)*2*16 multiplications _per bit_ (and the same number of tree memory lookups, which is likely worse).
Of course, working with frequency tables is much easier
(by computing f[d2-d1+2^(N-1)]-f[d2-d1] etc.),
but bitwise update provides much better predictions in this case,
and implementing it with plain tables would require 64k multiplications.

Alternatively, we could use e.g. nibble trees and SIMD,
and/or variable-length sample codes; I don't know a good practical solution atm.

However, this is also the only method that allows correctly applying LPC to floats (even without mixing),
so having a solution would be really nice.

> I still have a hard time following you, sorry for that.

I mostly tried to explain computing sum of probabilities of a range of symbols,
when a binary tree is used to hold the statistics.

> "So we can just shift f1 table by (d2-d1) and mix, then encode y2 values with mixed table."
> what do you mean by that? What do I gain by that, if e.g. d2 is vastly inferior to d1?

These (d1 and d2) are supposed to be LPC predictions, they're not constants.
So if d2 is always "vastly inferior"... don't use this kind of submodel, I guess.

Same as with usual CM submodels: even mixing the predictions of two instances of the same model,
just with different update rates, is already helpful.

Normally we can't mix statistics from different LPC residuals, but the method I described lets us do that.

For example, in LPC coders we usually subtract the prediction, collect statistics for the residuals, and then encode those same residuals.
But the difference between the original value and the residual is a linear shift; if we shift the residual statistics back, we can
encode the original values using the (shifted) residual statistics, and get the same compression as when directly encoding residuals.
Of course, normally this would be useless, but translating multiple probability distributions to the same "alphabet"
lets us mix them, which improves compression.

Btw, as to sparseness, what about processing a downsampled stream first?
Or I guess a spectral transformation could be useful too - it's possible to do a lossless reversible FFT; I have it.

Your idea sounds interesting, but I'm still missing a weight between the distributions. Isn't there a "general" way to mix multi-alphabet distributions?

Originally Posted by Shelwien

Btw, as to sparseness, what about processing a downsampled stream first?
Or I guess a spectral transformation could be useful too - it's possible to do a lossless reversible FFT; I have it.

I've already experimented with feeding a wavelet-decomposed time series to the predictor; this sometimes helps.
The idea is to build an overcomplete subband tree, use a (general) predictor on each subband and select the sub-tree(s) which minimize coding cost.
In this way we get the optimal subband decomposition.

> Your idea sounds interesting but I'm still missing a weight between the distributions.

I implied using a paq mixer.
As I said, this method is also sometimes necessary without mixing, for example with LPC on floats,
because a float's residual can't be encoded with an integer number of bits without redundancy.
But it can be, if you're encoding the bits of the original float using shifted residual statistics.
This fact is used by geopack, because geo has a weird floating-point format with 18-bit mantissas.

> Isn't there a "general" way to mix multialphabet distributions?

There is a natural extension of the paq mixer to non-binary alphabets; you can even find a description somewhere in toffer's posts.
But in any case, it won't help you mix different LPC models on its own.

I have thought a bit more about this topic. I really would like to use the paq bit-level stuff for obvious reasons (counters, SSE, mixers), and I can't see where the bit-level stuff comes into play in your approach.

Your idea seems to be inspired by TMW - they model the discrete distribution of every predictor (what is the probability of the current symbol falling between x and y under the current predictor)
with something like a t-distribution, mix them (on the distribution level) and encode the range of the current source symbol.
Clever, but we need more flexibility here, and I contend that this would not be better than my residual encoder (which is better than paq8px).

But when we have a probability distribution of every source symbol for every predictor, we can map this back to the bit level (like in your mod_ppmd). After that we can encode the source symbol directly.

So...
Variant 1: Mix on the distribution level and encode a symbol range like in TMW
Variant 2: Mix on the bit level after we have calculated bit probabilities for every predictor

> I really would like to use the paq bit-level stuff for obvious reasons (counters, sse, mixers)
> and i can't see where the bit-level stuff comes into play in your approach.

I described one way to apply linear shifts to probability distributions,
while still using binary trees with bitwise counters.

Once you can compute a probability sum from residual stats
corresponding to each bit of original value,
it becomes possible to mix and map these probabilities in any way you like.

> Your idea seems to be inspired by TMW

No, I thought of it independently; it's an obvious answer to
the question "how do I apply mixing and SSE to multiple sets of LPC stats".

As to TMW (thanks for the link btw), it seems to work with
parametric approximations of the predictor models, rather than
with full stats, like I suggested.

TMW also assumes 8-bit sample precision, which makes things much easier.

> would not be better than my residual encoder (which is better than paq8px).

My idea was to start with a good LPC coder (a normal one, using a single prediction),
then mix multiple instances of it and optimize the update parameters.

When everything is done correctly, the mixing result should never be worse than the
best result from the individual submodels.

> But, when we have a probability distribution of every source symbol for
> every predictor we can map this back to the bit-level (like in your
> mod_ppmd). After that we can encode the source symbol directly.

Yes, but I take this a step further, and propose to use probability-range summing
on the normal representation of the residual statistics, rather than computing
all the probabilities of individual values first.

> Variant 2: Mix on a bit level after we calculated bit probabilities for every predictor

v3: Implement a function which can compute Sum[ p(x=c), c in [a..b) ] = sum[0..b) - sum[0..a)
for the statistics as-is.

For example, suppose we have a binary tree with bit counters,
which is updated like this:

// x = sample, N = number of bits in sample
for( i=0,ctx=1; i<N; i++ ) {
  bit = (x>>(N-1-i))&1;
  counter[ctx].Update( bit, rate_params );
  ctx = ctx*2 + bit;
}

In this case, counter[1].p (presuming it's the probability of bit=0)
would also be the probability of x belonging to the [0..1<<(N-1)) range.

If this specific x is a residual equal to (sample-d1), it means
that its range [0..1<<(N-1)) corresponds to the range [d1..d1+(1<<(N-1))) of original sample values.
But if we can calculate the sum of probabilities over any range, it works out.
With a binary tree, it only needs two normal O(N) passes, to compute the [0..a) and [0..b) sums.
Not sure, but it might be possible to optimize this further, like computing the whole [a..b) in a single pass.

> I'm still experimenting with my lossless audio coder. So far, nothing I've implemented gave substantial gains

And nothing you will try will give a substantial gain. Lossless coding is like that: you try something simple, you get 2:1. You try something very smart, you get 2:1. The problem is that you are mostly compressing noise.

1. Sure, but if a 1% compression improvement on enwik is good, why can't we also improve wav/bmp compression by 1%?

2. The method discussed here ("probability translation") is also applicable to lossy compression.
It becomes possible to use LPC model(s) to process quantized coefs in the unquantized domain,
which is normally inefficient, because the residuals would contain unnecessary information.

Can you maybe provide some example files containing your residual data? Preferably with the block size you intend to use and also including your current compression rate for easier assessment whether we can do better.

Well, we were talking about this: http://nishi.dreamhosters.com/u/FLTgeo3b2.rar
The file "geo" from the Calgary Corpus contains some audio-like data, but in a weird floating-point format with 18-bit mantissas.
So back in 2007 I made this CM codec specifically for it, and even now it compresses the file better than paq versions.

> Can you maybe provide some example files containing your residual data? Preferably with the block size you intend to use and also including your current compression rate for easier assessment whether we can do better.

sure

File: atrain.wav (3,377,110 bytes)
I use a medium-strong predictor, so there are still some correlations left.

The file coefs.zip contains 4 files (ch0_0.res, ch0_1.res, ch1_0.res and ch1_1.res) which are the raw residuals from each channel for each block (blocksize = 10 seconds = 441,000 samples).
The residuals are 16-bit and written raw from high to low byte using interleaved sign coding. To decode a residual:

val = (buf[0]<<8) + buf[1];
if (val&1) val = ((val+1)>>1);
else       val = -(val>>1);
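For reference, a matching forward mapping (my reconstruction from the decode snippet: positive residuals get odd codes, non-positive residuals get even ones):

```cpp
#include <cstdint>

// Encode: val > 0 -> 2*val-1 (odd code), val <= 0 -> -2*val (even code).
uint16_t sign_interleave(int val) {
    return (val > 0) ? uint16_t(2*val - 1) : uint16_t(-2*val);
}

// Decode, same logic as the snippet above.
int sign_deinterleave(uint16_t c) {
    return (c & 1) ? int((c + 1) >> 1) : -int(c >> 1);
}
```

So residuals of small magnitude always map to small codes regardless of sign, which is what makes the high-to-low byte layout compressible.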

Concatenating all files into a single file via "copy /b" gives:

The coder for Sac is simple and uses 4 direct context counters (no state machine or history mapping), mixes the predictions and adjusts them via SSE.
Sac uses the coefs as written to the files; no interchannel correlations or special sign coding are used.

The problem with "probability shift" is that the estimation of the distribution in the predictor domain has to be at least as accurate as the current residual coder. I doubt that's possible.

We can compute 32-bit or even 64-bit (on x64, or x86+SSE) intermediate values; for probabilities of 16-bit values it seemed enough, precision-wise.
I tested it aside from geopack too, and mixing does improve compression compared to a single LPC model.
It's just that I didn't get any interesting results anyway, because I was combining a few simple ad hoc delta models -
hopefully the results will be better once I have a reasonable LPC as a base
(the problem is, it has to be a sequential LPC, not a frame-based one... but maybe I will be able to port MA for this).

Anyway, I think that with bitwise coding (and thus bitwise probability estimation) the probability precision is not a problem.
On the other hand, speed is. A counter tree for 16-bit (or maybe 17-bit, for residuals) values won't fit into L1, and there are lots of multiplications.
I was actually considering an x*y = exp(log(x)+log(y)) approximation (with tables), but there'd be additions too...

> We can compute 32-bit or even 64-bit (on x64, or x86+SSE) intermediate values; for probabilities of 16-bit values it seemed enough, precision-wise.
> I tested it aside from geopack too, and mixing does improve compression compared to a single LPC model.

Well, I was not really talking about precision, but about the local estimation of the current bit being 0 or 1. The current residual coder is really good at it.
Just counting occurrences is not enough.

> It's just that I didn't get any interesting results anyway, because I was combining a few simple ad hoc delta models -
> hopefully the results will be better once I have a reasonable LPC as a base
> (the problem is, it has to be a sequential LPC, not a frame-based one... but maybe I will be able to port MA for this).

Well, just use the OLS class I posted in the paq8px thread as a start. I use the same one in Sac.
Framing is just for convenience and has nothing to do with the prediction model. MA's predictor is easily beaten.

> Well, i was not really talking about precision, but about local estimation
> of the current bit being 0 or 1. The current residual coder is really good at it.

Most normal models should be compatible with probability-range estimation,
like I described. It doesn't have to be a uniform binary tree - sign-log-mantissa
coding is okay too.

An example of a not (easily) compatible scheme would be something like coding the sign
after the mantissa and in the context of the mantissa.

But if your coding scheme uses the same lexicographic order of arithmetic-code intervals as
the original sample values, it's even possible to estimate the shifted bit probability
using an unmodified normal model:
just use it to generate arithmetic codes for the values a and b (the bounds of the range [a..b)).
The difference of these codes is the range probability.

> Just counting occurences is not enough.

Of course, if it were enough, there wouldn't be any speed issues -
I'd just access some cumulative frequency tables with an added shift.

> Well just use the OLS class i posted in the paq8px thread as a start. I use the same one is Sac.

Thanks, I saved a copy for future comparison.
The problem is, I'd not be able to use a float-based model in an archiver, as its coding is not repeatable.

> Framing is just for convience and has nothing to do with the prediction model.

I mean, some coders optimize the LPC coefs for a frame, then store these coefs in the frame header.
This has bad compatibility with my LPC-mixing scheme, because it would be redundant to
store multiple sets of coefs.

> MAs predictor is easily beaten.

I initially proposed to use the OptimFrog dll (for wav coding in our archiver), but actual tests
showed that its speed is impractical at the settings where its compression is better than MA's.
TAK may also be good, but it's not open-source, and heavy vector optimization would make it hard
to learn anything by reverse-engineering.
Practical archivers tend to use WavPack or TTA for audio compression
(they're integer-based, sequential and relatively easy to use; the original MA is kind of 2-pass).
And compared to WavPack, MA is quite good.
I also hope to improve it first - there are plenty of hardcoded parameters which could benefit from
automated optimization.

It's a bitplane coder, so I scan every bitplane from top to bottom. This allows me to construct local probability estimators which look into the "future" by considering partially coded/decoded residuals.
I don't know if it's really better than sequential coding, but it makes your approach difficult imo.

Maybe this paper is a good start for sequentially modelling the (context-dependent) distribution.

> I don't know if it's really better than sequential coding, but it makes your approach difficult imo.

Well, maybe not.
AFAIU these bits (in bitplanes) can still be mapped to ranges of values of the original samples (or residuals of one of the submodels).
Then the difference would be just the order of coding.
Like, you'd have to estimate the range probability [d; 0x8000+d) for each sample,
rather than 15-16 nested intervals per sample before proceeding to the next one.

OptimFrog also uses floating point, afaik. MP4-ALS uses a similar algorithm (RLS+NLMS) but on 64-bit integers - its results are not that good - but that might result from its strange stereo decorrelation
(which is done before the actual prediction).
NLMS code could be borrowed from MA, but doing a low-order OLS in fixed-point integer is hard. MA uses only simple fixed-weight low-order predictors.
Most of the compression of OptimFrog and Sac comes from their low-order OLS. I would even guess that a 32/16 interleaved OLS using my class (same as in Paq):

Code:

Predict channel A using 16 past samples from channel A and 8 from channel B
Predict channel B using 16 past samples from channel B and 8 (including the previous) from channel A

gives better results than MA --high, although there are some files which clearly benefit from high-order (>512) linear predictors.
As test files for good stereo decorrelation I would propose "female_speech.wav" and "male_speech.wav" (rarewares.org). TTA and WavPack have a hard time compressing these two below 1 MB.
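The interleaved context above might look like this as code (the function names, containers and exact lag windows are my reading of the description, not Sac's actual implementation):

```cpp
#include <cstddef>
#include <vector>

// Regressor for channel A at time t: 16 past A samples + 8 past B samples.
std::vector<double> context_for_A(const std::vector<int>& A,
                                  const std::vector<int>& B, size_t t) {
    std::vector<double> ctx;
    for (size_t k = 1; k <= 16; k++) ctx.push_back(A[t - k]);
    for (size_t k = 1; k <= 8;  k++) ctx.push_back(B[t - k]);
    return ctx;  // feature vector for the OLS predictor of A[t]
}

// Regressor for channel B at time t: 16 past B samples + 8 A samples,
// including the same-time A[t], which is already decoded when B is coded.
std::vector<double> context_for_B(const std::vector<int>& A,
                                  const std::vector<int>& B, size_t t) {
    std::vector<double> ctx;
    for (size_t k = 1; k <= 16; k++) ctx.push_back(B[t - k]);
    for (size_t k = 0; k <  8;  k++) ctx.push_back(A[t - k]);
    return ctx;
}
```

The asymmetry (B sees A[t], A does not see B[t]) is what makes the scheme causal and losslessly decodable in A-then-B order.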