Do you know of a compressor or archiver that's content-aware regarding sqlite or other databases?
Just to clarify, I am NOT talking about the VACUUM command or similar, where the output is still a functional database.
https://stackoverflow.com/questions/...qlite-database
I'm aware of some preprocessors for simple DB formats (like .dbf) in storage engines - they try to detect sizes of fields and rows
and reorder them to improve compression.
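To illustrate the idea, here is a toy sketch (my own, not taken from any particular tool): assuming fixed-size records of a known length, the bytes of each field position are grouped together so that similar data ends up adjacent for the compressor; the transform is reversible as long as the record size is stored.

```python
# Toy sketch of a .dbf-style preprocessing step: transpose fixed-size
# records so that corresponding byte positions are grouped together.
# Assumes the record size is known (or detected by some heuristic).

def transpose_records(data: bytes, record_size: int) -> bytes:
    """Group byte i of every record together (column-major reordering)."""
    n_full = len(data) // record_size
    body, tail = data[:n_full * record_size], data[n_full * record_size:]
    columns = bytearray()
    for i in range(record_size):
        columns.extend(body[i::record_size])
    return bytes(columns) + tail  # leftover bytes kept verbatim

def untranspose_records(data: bytes, record_size: int) -> bytes:
    """Inverse transform: restore the original row-major record order."""
    n_full = len(data) // record_size
    body, tail = data[:n_full * record_size], data[n_full * record_size:]
    rows = bytearray(len(body))
    for i in range(record_size):
        rows[i::record_size] = body[i * n_full:(i + 1) * n_full]
    return bytes(rows) + tail
```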
However, for more advanced formats that's not really practical - there it's already a kind of recompression,
and that is usually considered too slow.
Binary DB formats are also rather obscure; they usually have multiple different versions and in many cases are proprietary and undocumented.
So compressing their dumps really makes sense, though it's not really lossless.
I see... Yes, I guess it makes sense to try and compress the dump. Sadly, that's not my use case. I just want to be able to archive a .db file among other files in a program folder for distribution.
I tried the usual asymmetric lossless compressors and even some fast symmetric ones like bsc, bcm, ccm plus the most common preprocessors. The only program that showed a significant improvement over baseline was razor somehow (I consider wimlib+rep+srep+flzma2 my baseline as it usually compresses better and way faster than 7z, rar, FA and similar with full dictionaries).
So now I'm thinking maybe the rolz family is better suited for the task? That or Christian just happened to add some very clever transforms like he usually does...
Try Bulat's delta preprocessor maybe?
Razor has dedup, an exe preprocessor and delta tokens, otherwise it's not really better than lzma.
Yup, one of the first things I did. I tried Bulat's delta but it doesn't help that much.
An important feature of many modern in-memory databases is that data is always compressed, even when the DB system is running.
While a typical SQL dump is row-by-row (which doesn't compress very well), they store data column-wise.
A short blog posting about HANA (SAP's database): https://blogs.sap.com/2018/12/02/hana-data-compression-algorithms-dictionary-encoding/
They just use dictionary encoding, prefix encoding, run-length encoding and cluster encoding.
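As a toy illustration of those encodings (not HANA's actual implementation, just the general idea): a column of repeated values becomes a small value dictionary plus a run-length-encoded stream of indices.

```python
from itertools import groupby

# Toy sketch of column-store encodings: dictionary-encode a column of
# repeated values, then run-length-encode the resulting index stream.

def dictionary_rle_encode(column):
    dictionary = sorted(set(column))                   # distinct values
    index = {v: i for i, v in enumerate(dictionary)}
    ids = [index[v] for v in column]                   # dictionary encoding
    runs = [(k, sum(1 for _ in g)) for k, g in groupby(ids)]  # run-length encoding
    return dictionary, runs

def dictionary_rle_decode(dictionary, runs):
    return [dictionary[k] for k, n in runs for _ in range(n)]

column = ["DE", "DE", "DE", "US", "US", "DE", "FR", "FR"]
dic, runs = dictionary_rle_encode(column)
assert dictionary_rle_decode(dic, runs) == column
```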
You could easily write a script that dumps/exports your SQLite database column-by-column instead of row-by-row and try the usual compression tools.
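For instance, a minimal sketch along these lines (using Python's built-in sqlite3 module; sqlite_master and PRAGMA table_info are standard SQLite, while the output layout is just a naive choice for illustration):

```python
import sqlite3

# Minimal sketch: dump a SQLite database column-by-column instead of
# row-by-row, so that similar values end up adjacent before compression.
# Output format here is deliberately naive (one text file per column).

def dump_columnwise(db_path: str, out_prefix: str) -> None:
    con = sqlite3.connect(db_path)
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        cols = [r[1] for r in con.execute(f'PRAGMA table_info("{table}")')]
        for col in cols:
            rows = con.execute(f'SELECT "{col}" FROM "{table}"')
            with open(f"{out_prefix}.{table}.{col}.txt", "w") as f:
                for (value,) in rows:
                    f.write(repr(value) + "\n")
    con.close()

# dump_columnwise("app.db", "dump")   # then compress the dump.* files solid
```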
Which DB exactly?
For example, in MariaDB you can compress an InnoDB tablespace with lz4, lzo, zlib and others at page level.
Your focus is on SQLite?
Yes, I just thought there was maybe a preprocessor written for databases in general. My focus is on sqlite, but not exclusively.
Thanks for the info, but I'm looking for a stronger compressor, above lzma levels, and solid. The idea is to archive the database offline, it doesn't need to be live.
The compression of a database in use is very different from that of an unused one.
In the first case, for example in MariaDB, the pages, even if compressed, are written entirely to disk.
If a page with 4KB of data is compressed to, let's say, 3KB, then 3KB + 1KB of zeros will be written (at least today).
Clearly, filesystems with sparse file support are preferred, but this leads to the problem of "holes", which can be very numerous.
Therefore it is not that simple to have a system with high efficiency (high compression ratio).
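A back-of-the-envelope sketch of that effect (page and filesystem block sizes are just the numbers from the example above):

```python
# Rough arithmetic for the padding effect above: a compressed page is
# padded with zeros back to the full page size, and a sparse filesystem
# can only reclaim whole trailing blocks.

PAGE_SIZE = 4 * 1024   # logical page size from the example above
FS_BLOCK = 4 * 1024    # typical filesystem block size (assumption)

def on_disk_size(compressed_size: int) -> int:
    """Bytes actually occupied after padding and best-case hole punching."""
    padding = PAGE_SIZE - compressed_size            # zeros written after the data
    reclaimable = (padding // FS_BLOCK) * FS_BLOCK   # only whole blocks can be punched out
    return PAGE_SIZE - reclaimable

print(on_disk_size(3 * 1024))  # 4096 -- the 1KB of zeros still occupies disk space
```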
There are cases of row compression instead of page, but the efficiency is understandably very low.
There is a much higher margin, typically, in indexes.
They, in general, are highly compressible.
But, in common use, they are mainly kept in RAM (except for really large databases), thus mitigating the advantage of compression: they must be accessed at the highest possible speed, rather than with great space efficiency.
SQLite has a very... crude but complex format, owing to its birth as little more than a .dbf file, which then gradually evolved until today
https://www.sqlite.org/fileformat.html
I hardly ever use it, except for WhatsApp chats, and I have not dug deep into the topic of its compression.
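For anyone curious, the fixed 100-byte header described on that page is easy to inspect; a minimal sketch (offsets taken from fileformat.html, the file name is just a placeholder):

```python
import struct

# Minimal sketch: read a few fields from the 100-byte SQLite header
# (offsets per https://www.sqlite.org/fileformat.html).

def read_sqlite_header(path: str) -> dict:
    with open(path, "rb") as f:
        hdr = f.read(100)
    if hdr[:16] != b"SQLite format 3\x00":
        raise ValueError("not a SQLite 3 database")
    page_size = struct.unpack(">H", hdr[16:18])[0]
    if page_size == 1:                                  # special value meaning 65536
        page_size = 65536
    page_count = struct.unpack(">I", hdr[28:32])[0]     # in-header database size
    text_encoding = struct.unpack(">I", hdr[56:60])[0]  # 1=UTF-8, 2/3=UTF-16
    return {"page_size": page_size,
            "page_count": page_count,
            "text_encoding": text_encoding}

# print(read_sqlite_header("some.db"))   # placeholder path
```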
I understand now why razor had such great performance. Maybe it just happened to correctly guess where the indexes were. In the particular DB from my tests, there are also other IDs stored as alphanumeric fields that can be treated as numbers too, and as part of a sequence.
I guess there is a lot of room for improvement here for a theoretical precompressor or a specialized compressor. Databases are usually full of timestamps and dates too. String or VARCHAR fields can be compressed very well if they are handled in the same context, especially those containing names or email addresses. Base64 encoding can be reversed, etc.
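To make that concrete, here is a toy sketch of the kind of reversible per-column transforms such a precompressor could apply (my own illustration, not an existing tool): delta-coding an ID/timestamp column and undoing Base64 so the real compressor sees raw bytes.

```python
import base64

# Toy sketch of reversible per-column transforms a database precompressor
# could apply before handing the data to the actual compressor.

def delta_encode(values):
    """Store differences instead of absolute IDs/timestamps."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def unbase64(field: str) -> bytes:
    """Undo Base64 so the compressor sees the raw 8-bit data."""
    return base64.b64decode(field)

timestamps = [1600000000, 1600000007, 1600000007, 1600000060]
assert delta_decode(delta_encode(timestamps)) == timestamps
assert unbase64("aGVsbG8=") == b"hello"
```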