What are some ways of giving any chunk of data of unknown origin some 'identity', order or classification?
I know of attempts based on:
* Extension - not very useful, since it can be anything at all
* Magic numbers
* Byte counts (the 'strings' tool)
* Metainfo and other markers (Mediainfo)
* Brute-force decompression attempts (precomp, Universal Extractor)
I think a histogram could be used as a quick and dirty classification tool for some kinds of data. For example, here are some representative graphics of the distribution of byte values in WAV and in random-looking data:
What do you think about this? Can it be extended to other kinds of data? Maybe by taking nibbles or other units instead of bytes?
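To make it concrete, here is a rough sketch in Python of the kind of byte histogram I mean (nothing clever, just counting values; the crude '#' plot is only for eyeballing the shape, and the file path on the command line is whatever you want to inspect):

```python
import sys
from collections import Counter

def byte_histogram(path, chunk_size=1 << 20):
    """Count how often each byte value 0..255 occurs in a file."""
    counts = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            counts.update(chunk)          # bytes iterate as ints 0..255
    return [counts.get(b, 0) for b in range(256)]

if __name__ == "__main__":
    hist = byte_histogram(sys.argv[1])
    peak = max(hist) or 1
    for value, count in enumerate(hist):  # crude text "plot" of the histogram
        if count:
            print(f"{value:02X} {'#' * max(1, 60 * count // peak)}")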
But overall what's the point?
For compression improvement it would be nice to be able to detect data structures (binary structures and text forms),
but it's unlikely to be that quick, and at the moment we don't have tools to make use of this info... except for lzma's lp option, maybe.
And sorting by entropy isn't likely to improve compression on its own.
We do have specialized compressors for different kinds of data. For example, a PCM file can be packed with a lossless audio compressor, usually leading to both stronger and faster compression. But we still have to identify it first. Think of a precomp-ed stream, for example, with no file extension, only 128k long because it was inside a squashfs, and maybe even without a proper header for that same reason... There has to be a fast way of telling what it is. Now, the histogram can be built on-the-fly while the stream is being processed, at top speed. Of course, that is only an example. There are countless other situations where a good {parser / recognition tool / whatever the name} can lead to better and/or faster compression.
Detecting incompressible data is very useful for knowing when to drop the heavy algorithms in favor of simpler and faster ones; it also significantly reduces memory usage, which can in turn be spent on improving compression wherever that is possible.
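As a sketch of what I mean (Python; the 7.9 bits/byte cutoff is just a number I picked for illustration, not a tuned value):

```python
import math
from collections import Counter

def order0_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0..8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def looks_incompressible(data: bytes, threshold: float = 7.9) -> bool:
    """A nearly flat histogram (entropy close to 8 bits/byte) suggests random,
    encrypted or already-compressed data, so a fast/store mode is enough."""
    return order0_entropy(data) > threshold
```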
And while entropy similarity probably isn't a good measure in itself, it can be used along with other constraints to group similar files together without having to resort to brute force or trial and error. For example, I suspect very similar images will have very similar entropy, especially compared to other images: think of a few monochromatic backdrops versus a couple of photographs of the same place from a slightly different angle. If we were to group them using other methods, we might have to resort to image recognition (blowing the CPU and memory budget out of the sky), or just sort them by width*height or something else that can put very different images in the middle, or separate two similar images just because they have a different resolution.
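Something like this could be enough for the grouping part (again only a sketch; rounding the entropy to two decimals is an arbitrary way of saying "similar"):

```python
import math
from collections import Counter

def sample_entropy(path, sample_size=1 << 20):
    """Order-0 entropy (bits/byte) of the first part of a file."""
    with open(path, "rb") as f:
        data = f.read(sample_size)
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def order_for_solid_archive(paths):
    """Sort files so that ones with similar byte distributions sit together,
    without any content-aware work like image recognition."""
    return sorted(paths, key=lambda p: (round(sample_entropy(p), 2), p))
```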
I know fv, although I only used the older version by Matt. It's a cool tool, but it is intended to be used by a human expert to analyze an unknown file and devise a custom algorithm for it, not to provide quick and easy discrimination values to a program. And it actually compresses the file several times to get the stats, if I understood its inner workings correctly.
I'll take a peek at segfile...
I just took segfile for a spin. For some reason it fails to identify clearly different data. I tried it on a few .pcf containing only packed jpgs (incompressible) and pdf text and bitmap streams (highly compressible), but it only saves a few files, where it should save at least 50... What kind of statistics does it use? Did you try a simple byte count? If all 00 - FF values have roughly the same count, then it is random, encrypted or compressed. If there are only ASCII letters, then it is text. If both 00 and FF have the highest counts and they get lower towards the inner values, then it is probably some form of raw uncompressed audio.
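Something along these lines is what I have in mind (just a sketch, with made-up thresholds):

```python
from collections import Counter

def classify(data: bytes) -> str:
    """Very rough byte-count heuristic, as described above."""
    if not data:
        return "empty"
    hist = Counter(data)
    n = len(data)
    # Flat histogram: every value present, none much above the average count.
    if len(hist) == 256 and max(hist.values()) < 2.0 * n / 256:
        return "random / encrypted / compressed"
    # Only printable ASCII plus whitespace: probably text.
    if all(0x20 <= b < 0x7F or b in (0x09, 0x0A, 0x0D) for b in hist):
        return "text"
    # U-shaped histogram: 00 and FF dominate, counts fall towards the middle.
    mid = sum(hist.get(b, 0) for b in range(0x70, 0x90)) / 32
    if hist.get(0x00, 0) > 4 * mid and hist.get(0xFF, 0) > 4 * mid:
        return "raw audio / multimedia?"
    return "unknown binary"
```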
Or any other multimedia data - such as geo from the Calgary corpus.
No, not really...
The first image is an average WAV, 16 bits, 44000 Hz. They all look characteristically similar.
The next is geo, and the last is geo with the highest value hidden to make the other ones easier to see.
-------------------------
This shows us something else. If you look at the text files from Calgary, you'll see that the space character is by far the most frequent in all of them. That should be really helpful for discriminating text files.
This is news, paper1, and book1:
Also note that, no matter the content, the byte distribution is pretty much the same for English text (this applies to dickens, bible, really whatever). 'e' and 't' are next in line of frequency after ' ', and so on.
For XML, there are a lot of '<' and '>', in exactly equal quantities. Other file formats have different distributions too, frequently distinguishable with a simple byte count.
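A quick sketch of that kind of test (deliberately loose: space first, 'e' and 't' somewhere near the top; the cutoff of the top 8 bytes is arbitrary):

```python
from collections import Counter

def looks_like_english_text(data: bytes) -> bool:
    """Space should be the most common byte, with 'e' and 't' near the top."""
    if not data:
        return False
    top = [b for b, _ in Counter(data).most_common(8)]
    return top[0] == 0x20 and ord("e") in top and ord("t") in top
```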
I didn't mean this particular file, but the 00/FF property you observed may hold for any MM data, not only audio and bitmaps. geo is just an example of generic MM data.
Overall, we can easily imagine a binary file with lots of 0x20 bytes. It's more reliable to use more advanced approaches to distinguish the data types that are better compressed by specific algorithms.
What I mean is, the histogram shows a byte distribution similar enough among the same kind of data, and different enough between different kinds of data, for it to matter.
All WAV histograms are the same, and clearly different from other files'. You won't see a .wav having those peaks in the middle of the graphic. It's a perfect curve. - Actually, scratch that. There are different files. I guess it's because of different encodings, more channels, etc. Still, the groups can be seen.
And I'm pretty sure any file saved by the same software will have a signature distribution very similar to that of 'geo', just as every English text has the same distribution, or 'curve'.
Content doesn't really matter here, but format does, because we're not looking for patterns, only looking at which bytes are used to store the data. Some DNA formats only use 4 byte values out of the 256 possible. So if you happen across an unknown file containing only ACGT, it's almost certainly a DNA file.
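For example, a tiny sketch of that last check (the 99% cutoff is arbitrary, and real DNA formats also have headers, 'N' bases, etc.):

```python
def looks_like_dna(data: bytes) -> bool:
    """Nearly all bytes drawn from the A/C/G/T alphabet (plus newlines)."""
    if not data:
        return False
    acgt = sum(data.count(b) for b in b"ACGTacgt\n\r")
    return acgt / len(data) > 0.99
```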
Last edited by Gonzalo; 18th December 2019 at 23:05.
Reason: I was wrong about all wav files having the same distribution
What's the point? You should ask "why do these files have so much 00/FF, and which other files may have a similar distribution?" It may just be the result of your limited testset rather than a general rule.
You're right, it may. Or may not. Look at the English text case. I believe that gathering info about a large enough number of files should lead to some insightful generalisations. Unfortunately I don't have the skills to write such a program myself.
At the very least, this curve shows us the file is MM and can therefore benefit from some filtering. If all byte values have roughly the same count, there's no point in trying too hard to compress it, etc.
Back when I was working on audio detection within embedded streams, I found it best to model a histogram of the positions first to detect the number of channels, up to 32 bytes wide. Then I attempt different configurations of audio filter; typically linear prediction was the most accurate at figuring out the sound form. Then, after computing the entropy of the raw audio versus the output of what my code thinks the audio looks like, it picks the smallest-entropy method (raw or linear prediction) and moves on.
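Roughly, a simplified sketch of that flow (not the real detector; a plain per-frame delta stands in for the different audio filter configurations, and the 32-byte stride limit matches the channel-width limit mentioned above):

```python
import math
from collections import Counter

def entropy(data) -> float:
    """Order-0 entropy in bits per symbol."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def delta_residual(data: bytes, stride: int) -> bytes:
    """Simple 'linear prediction': predict each byte from the one a full
    frame (stride bytes) earlier and keep only the difference."""
    return bytes((data[i] - data[i - stride]) & 0xFF
                 for i in range(stride, len(data)))

def detect_frame_width(data: bytes, max_stride: int = 32) -> int:
    """Pick the stride whose residual has the lowest entropy."""
    return min(range(1, max_stride + 1),
               key=lambda s: entropy(delta_residual(data, s)))

def pick_model(data: bytes):
    """Keep whichever representation (raw, or delta at the best stride)
    has the smaller entropy, like the raw-vs-prediction comparison above."""
    stride = detect_frame_width(data)
    residual = delta_residual(data, stride)
    if entropy(residual) < entropy(data):
        return "delta", stride
    return "raw", 1
```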
It's not a challenging field: if you know how the data is stored, you can write some code to detect it. The same goes for Huffman codes; those can be detected with a bit-level suffix array and an LCP test to find common bit-level prefix codes. I don't believe AI is necessary when it comes to rough and fast detectors; it's also more likely to have real-world applications if you can make a good enough heuristic.
[seg_file]
> What kind of statistics does it use? Did you try a simple byte count?
It's basically exactly what it does - it counts byte freqs for blocks,
then clusters these blocks by an entropy estimate based on those freqs.
> but it only saves a few files, where it should be at least 50
It doesn't understand where one jpeg ends and the next one starts when the statistics are the same.
Well, there's a "PRECISION" parameter, but changing it requires recompiling the source.
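For reference, a sketch of that kind of block-entropy segmentation (not seg_file's actual code; the 64 KB block size and the 0.25 bits/byte "precision" threshold are placeholders):

```python
import math
from collections import Counter

def block_entropy(block: bytes) -> float:
    """Order-0 entropy estimate from byte frequencies, in bits per byte."""
    n = len(block)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())

def segment(data: bytes, block_size: int = 64 * 1024, precision: float = 0.25):
    """Cut a new segment whenever the block entropy jumps by more than
    `precision` bits/byte compared to the previous block."""
    segments, start, prev_h = [], 0, None
    for off in range(0, len(data), block_size):
        h = block_entropy(data[off:off + block_size])
        if prev_h is not None and abs(h - prev_h) > precision:
            segments.append((start, off))
            start = off
        prev_h = h
    segments.append((start, len(data)))
    return segments
```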
I changed the PRECISION parameter to several different values, only to gain just a few KB. It was worth a try, though.
These are the sizes after compression:
As for segpaq and paq8px_v183fix1, they didn't find anything. They just output a big 'default' file, I guess because they look for real files with proper headers, and in my test file all jpgs are already converted to pjg. What surprises me is that they didn't find bitmap-like data (there is some) or text streams. They also didn't discriminate between highly and poorly compressible data.
Brotli 11 uses an iterated shortest path algorithm through symbols using all entropy codes in parallel, with an additional cost for switching between codes.
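Not Brotli's actual implementation, but a toy dynamic-programming sketch of that "cheapest path through the symbols, with a cost for switching codes" idea (the codes are assumed to be given as per-symbol bit costs, and the 8-bit switch cost and 16-bit escape cost are invented):

```python
def cheapest_code_assignment(symbols, code_costs, switch_cost=8.0):
    """symbols: a sequence of symbols; code_costs: list of dicts mapping a
    symbol to its bit cost under that entropy code. Returns, for every symbol,
    the index of the code to use, minimizing total bits including switches."""
    if not symbols:
        return []
    INF = float("inf")
    n, k = len(symbols), len(code_costs)
    cost = [[INF] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for j in range(k):
        cost[0][j] = code_costs[j].get(symbols[0], 16.0)  # 16 bits for escapes
    for i in range(1, n):
        for j in range(k):
            # Cheapest way to be using code j at position i.
            best_prev = min(range(k),
                            key=lambda p: cost[i - 1][p]
                            + (0.0 if p == j else switch_cost))
            cost[i][j] = (cost[i - 1][best_prev]
                          + (0.0 if best_prev == j else switch_cost)
                          + code_costs[j].get(symbols[i], 16.0))
            back[i][j] = best_prev
    # Backtrack from the cheapest final state.
    j = min(range(k), key=lambda x: cost[n - 1][x])
    assignment = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        assignment.append(j)
    return assignment[::-1]
```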
Very interesting! I'm just starting to read it, and as far as I can tell, the "Byte Frequency Distribution (BFD)" is exactly what I was talking about in this whole thread. It seems they took it to the next level.
It would be more than great to actually implement it