What are the best algorithms for Floating point data compression?
Use an adaptive filter to predict the next value and encode the difference with an indirect order 0 context model.
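A minimal sketch of the filter part might look like this (the names and the order-2 sign-sign LMS update are illustrative assumptions; the indirect order-0 context model that would actually code the residuals is not shown):

#include <cstddef>
#include <vector>

// Hypothetical sketch: adaptive order-2 predictor over the float values.
// The residuals it produces would then be fed to an entropy coder driven
// by an indirect order-0 context model (not shown).
std::vector<double> prediction_residuals(const std::vector<float>& v) {
    std::vector<double> res;
    double w1 = 1.0, w2 = 0.0;   // filter weights, adapted per sample
    double x1 = 0.0, x2 = 0.0;   // last two seen values
    const double step = 1e-3;    // sign-sign LMS adaptation rate
    for (float f : v) {
        double pred = w1 * x1 + w2 * x2;   // predicted next value
        double err  = double(f) - pred;    // difference to be encoded
        res.push_back(err);
        // sign-sign update: cheap and numerically tame for a sketch
        double s = (err > 0) - (err < 0);
        w1 += step * s * ((x1 > 0) - (x1 < 0));
        w2 += step * s * ((x2 > 0) - (x2 < 0));
        x2 = x1;
        x1 = f;
    }
    return res;
}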
Well, that would be a function of the statistics of the numbers themselves. However, suppose your
floating point number system allows any 32-bit combination of bits to be a valid floating point number;
then in theory any file made of any combination of 32-bit words is possible.
Given that situation, and knowing that most files are thought of as 8-bit words since that is the byte size, I would
bijectively map the file of 32-bit words to a file of bytes. This means most of the files of 32-bit words would be compressed,
since you skip the byte files of odd length and half of the byte files of even length (those whose length is not a multiple of 4).
Google finds this:
http://www.cs.unc.edu/~isenburg/lcpfpv/
http://www.csl.cornell.edu/~burtscher/research/FPC/
Also, what Matt says, but there are tricky moments, depending on the nature of the specific FP data.
For example, you might have many approximations of rational numbers (1/3, etc.) stored as floats.
Or floats stored as doubles, in which case there would be some unused bits in their mantissa and exponent fields.
Or some values with a fixed _decimal_ format (e.g. prices, where you can't have precision below 1 cent).
So I guess we can say that there's no universal float model, same as there's no universal data model.
One obvious step is mixing of multiple such models, but the LPC approach usually prevents that
(different filter models would gather statistics for different values). Also, LPC is inherently redundant
(normally you won't bother to exclude delta values that would cause float overflows and such),
so, depending on the nature of the data, a direct little-endian-aligned bitwise CM might actually be better than
a filter.
Still, compression-wise, mixing of multiple such models, LPC or others, would be the best.
For that we'd have to remap statistics gathered in the context of specific submodels to a single representation.
(For example, if we feed inverted bits to a submodel, we then have to invert its output probability to be
able to mix it with normal predictions.)
Once I even made a model based on that kind of mixing and remapping: http://nishi.dreamhosters.com/u/FLTgeo3b1.rar
The geo file from the Calgary Corpus has a somewhat unique format, but it's still a kind of float data.
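A rough illustration of that remapping step, assuming a paq-style logistic mixer (all names here are made up for the example):

#include <cmath>

// Map probabilities into the logistic domain, mix, and map back.
double stretch(double p) { return std::log(p / (1.0 - p)); }
double squash(double t)  { return 1.0 / (1.0 + std::exp(-t)); }

// p_a: P(bit==1) from a normal submodel.
// p_b_inverted: P(bit==1) reported by a submodel that was fed inverted bits,
//               so its prediction must be flipped before mixing.
double mix2(double p_a, double p_b_inverted, double w_a, double w_b) {
    double t = w_a * stretch(p_a) + w_b * stretch(1.0 - p_b_inverted);
    return squash(t);
}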
How is Matt's approach different from the FPC paper? It is essentially doing the same thing, as far as I can tell.
And the LCPFPV paper is focused more on geometric data; it may not work with interleaved data. I have to try it.
Capturing fractional or decimal numbers will be very application specific, but that information is lost if you just look at the FP data.
> How is Matt's approach different from the FPC paper?
"indirect order 0 context model".
Also "predict the next value and encode the difference" is not necessarily
the same as LPC - for example, a binary context model can be used as
a predictor.
Matt probably meant LPC though - you can see it in paq's wav model.
> Capturing fractional numbers or decimal numbers will be very
> application specific but that information is lost if you just look
> at the FP data.
It's not really lost... actually, it would be good if we could lose it.
Detection is simple - we just have to convert the given floats into that
custom representation (1/i or i/100) and back, and see whether there's
any difference from the original value.
I guess some small difference (a few LSBs of the mantissa) can be allowed too,
to compensate for different ascii-to-float implementations and such.
In fact, it's easy to detect some of these even visually:
for example 1/3 = 3EAAAAAB, 1/5 = 3E4CCCCD, etc.
These are surprisingly important, e.g. for executable compression, because
modern compilers replace divisions by a constant with multiplications.
But also 0.33 = 3EA8F5C3, which is pretty bad for direct binary compression.
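A brute-force version of that detection might look like this (a sketch; the 1/i and i/100 families and the few-LSB tolerance come from the description above, while the search limits and helper names are assumptions):

#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Distance between two floats measured on their 32-bit patterns.
static int64_t ulp_diff(float a, float b) {
    int32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    return std::llabs((int64_t)ia - (int64_t)ib);
}

// Does x look like 1/i for some small integer i? (tol = allowed mantissa LSBs)
bool looks_like_reciprocal(float x, int64_t tol = 4, int max_i = 1000) {
    for (int i = 2; i <= max_i; ++i)
        if (ulp_diff(x, 1.0f / (float)i) <= tol) return true;
    return false;
}

// Does x look like a fixed 2-decimal value i/100 (e.g. a price in cents)?
bool looks_like_cents(float x, int64_t tol = 4) {
    float rounded = std::round(x * 100.0f) / 100.0f;
    return ulp_diff(x, rounded) <= tol;
}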
Yes, I was thinking of LPC. But it depends on the data obviously. LPC would be appropriate for sensor measurements. But measurements are sometimes quantized so you may want to check if the data can be reduced to a small set of integers first, then use LPC on that.
Otherwise you can use floating point calculations for the LPC but only if you are sure that the decompresser will use exactly equivalent floating point hardware. When encoding the prediction error, first map the predicted and actual values to integers (like *(int64_t*)&x) and then encode the difference so there is no precision loss when you convert back.
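In code, that mapping could look like the following sketch (memcpy is used instead of the pointer cast to keep it well-defined, and unsigned arithmetic keeps the wrap-around exact; names are illustrative):

#include <cstdint>
#include <cstring>

// Reinterpret the IEEE-754 bit pattern of a double as a 64-bit integer.
static uint64_t f2i(double x)  { uint64_t i; std::memcpy(&i, &x, sizeof i); return i; }
static double   i2f(uint64_t i){ double x;   std::memcpy(&x, &i, sizeof x); return x; }

// Encoder: residual between actual and predicted values, computed in bit space,
// so there is no precision loss regardless of the FP hardware.
uint64_t residual(double actual, double predicted) {
    return f2i(actual) - f2i(predicted);      // modular, exact
}

// Decoder: recovers the exact original value from the same prediction.
double reconstruct(double predicted, uint64_t res) {
    return i2f(f2i(predicted) + res);
}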
I did some analysis of GEO in http://mattmahoney.net/dc/dce.html#Section_2
It uses an obsolete IBM floating point format with a base 16 exponent instead of base 2. The data is also quantized to 16 bits.
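For reference, a sketch of decoding that 32-bit IBM hexadecimal format (sign bit, 7-bit excess-64 base-16 exponent, 24-bit fraction); whether GEO needs any further handling beyond this is not covered here:

#include <cmath>
#include <cstdint>

// Convert a 32-bit IBM System/360 hexadecimal float to a native double.
// Layout: 1 sign bit, 7-bit exponent (excess-64, base 16), 24-bit fraction.
double ibm32_to_double(uint32_t w) {
    int    sign = (w >> 31) & 1;
    int    exp  = (w >> 24) & 0x7F;                       // base-16 exponent, excess-64
    double frac = (double)(w & 0x00FFFFFF) / 16777216.0;  // fraction in [0, 1)
    double v    = frac * std::pow(16.0, exp - 64);
    return sign ? -v : v;
}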
> use floating point calculations for the LPC but only if you are sure that
> the decompresser will use exactly equivalent floating point hardware
Unfortunately that doesn't work at all, because of compiler quirks.
I used to write floating-point models before (e.g. Ash is an example) and found that an FP model just has to be symmetric -
i.e. the same program code used for modelling in both encoding and decoding, or decoding would fail sometimes.
Also, such models are not compatible when compiled by different compilers (or even the same one, but with different options).
Yeah, I guess so. You might get different results with x87 and SSE2 even on the same machine. They also have flags to control rounding, underflow, etc.
That's still ok; there are usually compiler options to control such things.
But with floats a-b==0 doesn't mean a==b and so on - common-sense axioms don't hold,
yet compilers apply optimizations as if the precision were perfect.
That's the main problem - I've seen cases where a separately compiled encoder and decoder
matched when I added a debug printf, but decoding failed without it.
On the tangent about using floating point in models:
streflop vs gafferongames
thanx for info, guys)
Possible methods that can be used for floating-point compression:
- Transpose 4xN for float and 8xN for double + lz77 (see the sketch after this list).
Implementation: Floating Point Compression in TurboTranspose (see also error-bound lossy compression).
- Predictor (ex. Finite Context Method) + encoding (ex. "integer compression").
Implementation: Floating Point Compression in TurboPFor.
- When possible, convert all floating point numbers to integers (ex. 1.63 -> 163), then use integer compression.
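A minimal sketch of the 4xN transpose step (the real TurboTranspose implementation is SIMD-optimized; this only shows the byte-plane layout that the following lz77 stage would see):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Split N floats into 4 byte planes: plane k holds byte k of every float.
// Bytes of equal significance are far more repetitive, which is what makes
// the transposed stream LZ-friendly.
std::vector<uint8_t> transpose_4xN(const float* src, size_t n) {
    std::vector<uint8_t> out(n * 4);
    for (size_t i = 0; i < n; ++i) {
        uint8_t b[4];
        std::memcpy(b, &src[i], sizeof b);
        for (size_t k = 0; k < 4; ++k)
            out[k * n + i] = b[k];   // plane k gets byte k of float i
    }
    return out;
}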