
Thread: Data recognition / segmentation / grouping?

  1. #1
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts

    Data recognition / segmentation / grouping?

    What are some ways of giving any chunk of data of unknown origin some 'identity', order or classification?

    I know of attempts based on:

    * File extension - not very useful, since it can be anything at all
    * Magic numbers
    * Byte counts (the 'strings' tool)
    * Metadata and other markers (MediaInfo)
    * Brute-force decompression attempts (precomp, Universal Extractor)

    I think a histogram could be used as a quick and dirty classification tool for some kinds of data. For example, here are some representative graphs of the distribution of byte values in WAV and random-looking data:

    [Attached histograms: DeepinScreenshot_20191217200951.png, DeepinScreenshot_20191217201142.png - byte-value distributions of WAV and random-looking data]

    What do you think about this? Can it be extended to other kinds of data? Maybe by taking nibbles or other units instead of bytes?
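    As a minimal sketch of the idea (plain Python 3, standard library only; the file name and sample size are just illustrative), counting bytes or nibbles over a sample of a file could look like this:

    Code:
    from collections import Counter

    def byte_histogram(path, sample_bytes=1 << 20, unit="byte"):
        """Count value frequencies over (a sample of) a file.

        unit == "byte"   -> 256 possible values
        unit == "nibble" -> 16 possible values (high and low 4-bit halves)
        """
        with open(path, "rb") as f:
            data = f.read(sample_bytes)
        if unit == "nibble":
            values = [b >> 4 for b in data] + [b & 0x0F for b in data]
        else:
            values = data  # iterating over bytes yields ints 0..255
        counts = Counter(values)
        total = len(values) or 1
        # Normalized frequencies, so files of different sizes are comparable.
        return {v: c / total for v, c in counts.items()}

    hist = byte_histogram("unknown.bin")  # hypothetical input file
    print(sorted(hist.items(), key=lambda kv: -kv[1])[:10])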

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    For graphs there's this: http://www.fantascienza.net/leonardo...tatistics.html

    For segmentation there's http://ctxmodel.net/files/PPMd/segfile_sh2.rar (works based on simple statistics like what you described)
    But some people also use paq8px parser for this.

    Recently I also use cdm to detect bitcodes.

    But overall, what's the point?
    For compression improvement it would be nice to be able to detect data structures (binary structures and text forms),
    but it's unlikely to be that quick, and atm we don't have tools to make use of this info... except for the lzma lp option maybe.
    And sorting by entropy isn't likely to improve compression on its own.

  3. #3
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    We do have specialized compressors for different kinds of data. For example, a PCM file can be packed with a lossless audio compressor, usually leading to both stronger and faster compression. But we still have to identify it first. Think of a precomp-ed stream, for example, with no file extension, only 128k long because it was inside a squashfs, and maybe even without a proper header for that same reason... There has to be a fast way of telling what it is. Now, the histogram can be built on the fly while the stream is being processed, at top speed. Of course, that is only one example. There are countless other situations where a good {parser / recognition tool / whatever the name} can lead to better and/or faster compression.

    Detecting incompressible data is very useful for knowing when to drop the heavy algorithms in favor of simpler and faster ones; it also significantly reduces memory usage, which can in turn be used to improve compression wherever possible.

    And while similar entropy probably isn't a good measure in itself, it can be used along with other constraints to group similar files together without having to resort to brute force or trial and error. For example, I suspect very similar images will have very similar entropy, especially compared to other images: think of a few monochromatic backdrops versus a couple of photographs of the same place from a slightly different angle. If we were to group them using other methods, we might have to resort to image recognition (blowing the CPU and memory budget out of the sky) or just sort them by width*length or something similar, which can put very different images in the middle, or separate two similar images just because they have a different resolution.

    I know fv, although I only used the older version by Matt. It's a cool tool, but it is intended to be used by a human expert to analyze an unknown file and devise a custom algorithm for it, not to provide quick and easy discrimination values to a program. And it actually compresses the file several times to get the stats, if I understood its inner workings correctly.
    I'll take a peek at segfile...
    Last edited by Gonzalo; 18th December 2019 at 18:08.

  4. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    I just took segfile for a spin. For some reason it fails to separate clearly different data. I tried it on a few .pcf with only packed jpegs (incompressible) and pdf text and bitmap streams (highly compressible), but it only saves a few files, where there should be at least 50... What kind of statistics does it use? Did you try a simple byte count? If all 00 - FF values have roughly the same count, then it is random, encrypted or compressed. If there are only ASCII letters, then it is text. If both 00 and FF have the highest counts and they get lower towards the middle values, then it probably is some form of raw uncompressed audio.
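    Taking the normalized histogram from the sketch in the first post as input, those checks could be written roughly like this (the thresholds - 7.9 bits, 95%, 50% - are guesses for illustration, not values from segfile or any other tool):

    Code:
    import math

    def classify_histogram(freq):  # freq: dict {byte value: relative frequency}
        probs = [freq.get(b, 0.0) for b in range(256)]
        # Order-0 entropy in bits per byte; close to 8 means near-uniform counts.
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        if entropy > 7.9:
            return "random / encrypted / already compressed"
        printable = sum(freq.get(b, 0.0) for b in range(0x20, 0x7F)) \
                  + freq.get(0x09, 0.0) + freq.get(0x0A, 0.0) + freq.get(0x0D, 0.0)
        if printable > 0.95:
            return "text"
        # "U-shaped" check: small-amplitude signed PCM puts most of its high
        # bytes near 0x00 and 0xFF.
        edges = sum(freq.get(b, 0.0) for b in range(0, 16)) \
              + sum(freq.get(b, 0.0) for b in range(240, 256))
        if edges > 0.5:
            return "possibly raw PCM / other multimedia"
        return "unknown binary"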

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    raw uncompressed audio
    or any other multimedia data, such as geo from the Calgary corpus

  6. #6
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    or any other multimedia data - such as geo from calgary corpus
    No, not really...

    [Attached histograms: DeepinScreenshot_20191217200951.png (WAV), DeepinScreenshot_xfdesktop_20191218143726.png (geo), DeepinScreenshot_xfdesktop_20191218144100.png (geo with the highest value hidden)]

    The first image is an average WAV (16-bit, 44000 Hz); they all look characteristically similar.
    The next one is geo, and the last is geo with the highest value hidden, to better appreciate the other ones.

    -------------------------
    This shows us something else. If you look at the text files from Calgary, you'll see that the space character is by far the most used in all of them. That should be really helpful for discriminating text files.

    These are news, paper1, and book1:

    [Attached histograms: DeepinScreenshot_xfdesktop_20191218144525.png (news), DeepinScreenshot_xfdesktop_20191218144829.png (paper1), DeepinScreenshot_xfdesktop_20191218144852.png (book1)]

    Also note that, no matter the content, the byte distribution is pretty much the same for English text (this applies to dickens, the bible, really whatever): 'e' and 't' are next in line of frequency after ' ', and so on.
    For XML there are a lot of '<' and '>', in exactly equal quantity. Other file formats have different distributions too, frequently distinguishable with a simple byte count.
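    One hedged way to turn that "same curve for the same kind of data" observation into an automatic grouping rule is to compare normalized byte-frequency vectors against reference profiles built from known samples. The cosine measure and the 0.9 cut-off below are arbitrary choices for illustration:

    Code:
    import numpy as np

    def freq_vector(freq):
        # freq: dict {byte value: relative frequency}, as in the earlier sketch
        return np.array([freq.get(b, 0.0) for b in range(256)])

    def cosine(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b / (na * nb)) if na and nb else 0.0

    def nearest_profile(freq, profiles):
        """profiles: dict {label: byte-frequency dict} from known sample files."""
        v = freq_vector(freq)
        label, ref = max(profiles.items(),
                         key=lambda kv: cosine(v, freq_vector(kv[1])))
        score = cosine(v, freq_vector(ref))
        return (label, score) if score > 0.9 else ("unknown", score)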

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    I didn't mean this particular file; the 00/FF property you observed may hold for any MM data, not only audio and bitmaps. geo is just an example of generic MM data.

    Overall, we can easily imagine a binary file with lots of 0x20 bytes. It's more reliable to use more advanced approaches to distinguish data types that are better compressible by specific algorithms.

  8. #8
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    What I mean is, the histogram shows a byte distribution that is similar enough within the same kind of data, and different enough between different kinds of data, for it to matter.

    All WAV histograms are the same, and clearly different from other files'. You won't see a .wav with those peaks in the middle of the graph; it's a perfect curve. - Actually, scratch that. There are different-looking files; I guess it's because of different encodings, more channels, etc. Still, the groups can still be seen.


    And I'm pretty sure any file saved by the same software will have a signature distribution very similar to that of 'geo', just as every English text has the same distribution, or 'curve'.

    Content doesn't really matter here, but format does, because we're not looking for patterns, only looking at which bytes are used to store the data. Some DNA formats use only 4 of the 256 possible byte values, so if you happen across an unknown file containing only ACGT, it's pretty surely a DNA file.
    Last edited by Gonzalo; 18th December 2019 at 22:05. Reason: I was wrong about all wav files having the same distribution

  9. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts
    What's the point? You should ask "why do these files have so much 00/FF, and which other files may have a similar distribution?" It may just be a result of your limited test set rather than a general rule.

  10. #10
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    You're right, it may. Or may not. Look at the English text case. I believe that gathering info about a large enough number of files should lead to some insightful generalisations. Unfortunately I don't have the skills to write such a program myself.

    At the very least, this curve shows us the file is MM and can therefore benefit from some filtering. If all byte values have roughly the same count, there's no point in trying too hard to compress it, etc.

  11. #11
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    174
    Thanks
    28
    Thanked 73 Times in 43 Posts
    Back when I was working on audio detection within embedded streams, I found it best to first model a histogram of the positions to detect the number of channels, up to 32 bytes wide. Then I attempt different configurations of audio filters; typically linear prediction was the most accurate at figuring out the sound form. Then, after computing the entropy of the raw audio versus the output of what my code thinks the audio looks like, it picks the smallest-entropy method (raw or linear prediction) and moves on.

    It's not a challenging field: if you know how the data is stored, you can write some code to detect it. The same goes for Huffman codes; those can be detected with a bit-level suffix array and an LCP test to find common bit-level prefix codes. I don't believe AI is necessary when it comes to rough and fast detectors, and it's also more likely to have real-world applications if you can make a good enough heuristic.
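    Not the detector described above, but a toy version of the "raw versus predicted, keep whichever has lower entropy" step could look like this, with a simple delta at each candidate stride standing in for the linear-prediction configurations:

    Code:
    import math
    from collections import Counter

    def order0_entropy(data):
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

    def best_stride(data, max_stride=32):
        """Try a delta filter at every stride (candidate channel width) and
        return (stride, entropy) of the cheapest representation; stride 0
        means "leave the data raw"."""
        best = (0, order0_entropy(data))
        for stride in range(1, max_stride + 1):
            residual = bytes((data[i] - data[i - stride]) & 0xFF
                             for i in range(stride, len(data)))
            h = order0_entropy(residual)
            if h < best[1]:
                best = (stride, h)
        return best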

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    [seg_file]
    > What kind of statistics does it use? Did you try a simple byte count?

    It's basically exactly what it does - it counts byte freqs for blocks,
    then clusters these blocks by an entropy estimation based on those freqs.

    > but it only saves a few files, where it should be at least 50

    It doesn't understand where one jpeg ends and the next one starts when the statistics are the same.
    Well, there's a "PRECISION" parameter, but changing it requires recompiling the source.

    Also here's paq8px segmentation tool which I mentioned: https://encode.su/threads/1971-Detec...s?p=38810&pp=1
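    A minimal sketch of that kind of scheme (per-block byte counts, an order-0 entropy estimate, and a new segment whenever the estimate jumps) might look like the following; the block size and threshold play a role similar to PRECISION but are arbitrary values, not segfile's actual parameters:

    Code:
    import math
    from collections import Counter

    def block_entropy(block):
        counts = Counter(block)
        n = len(block)
        return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

    def segment(data, block_size=64 * 1024, threshold=0.5):
        """Yield (offset, length) runs of roughly uniform entropy."""
        start, prev = 0, None
        for off in range(0, len(data), block_size):
            h = block_entropy(data[off:off + block_size])
            if prev is not None and abs(h - prev) > threshold:
                yield (start, off - start)
                start = off
            prev = h
        if start < len(data):
            yield (start, len(data) - start)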

  13. #13
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    I changed the PRECISION parameter to several different values, but only gained a few kB. It was worth a try, though.
    These are the sizes after compression:


    2.178.125 test2
    2.180.276 test5
    2.180.321 test32
    2.182.450 test64
    2.185.515 test10
    2.186.516 test1
    2.189.143 test //original


    As for segpaq and paq8px_v183fix1: they didn't find anything. They just output one big 'default' file, I guess because they look for real files with proper headers, and in my test file all jpegs are already converted to pjg. What surprises me is that they didn't find the bitmap-like data (there is some) nor the text streams. They also didn't discriminate between highly and poorly compressible data.

  14. #14
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    846
    Thanks
    242
    Thanked 309 Times in 184 Posts
    Brotli 11 uses an iterated shortest path algorithm through symbols using all entropy codes in parallel, with an additional cost for switching between codes.
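    That is not brotli's actual implementation, but the underlying idea can be sketched as a small dynamic program: each block picks one of several entropy codes, and switching codes adds a fixed penalty. Here cost[b][k] and switch_cost are placeholders for real coded sizes, just for illustration:

    Code:
    def best_assignment(cost, switch_cost=30.0):
        """cost[b][k] = bits for block b under entropy code k.
        Returns (total bits, list of chosen code indices)."""
        n_codes = len(cost[0])
        dp = [(cost[0][k], [k]) for k in range(n_codes)]  # (total, path) per code
        for b in range(1, len(cost)):
            new_dp = []
            for k in range(n_codes):
                prev_total, prev_path = min(
                    ((dp[j][0] + (0.0 if j == k else switch_cost), dp[j][1])
                     for j in range(n_codes)),
                    key=lambda t: t[0])
                new_dp.append((prev_total + cost[b][k], prev_path + [k]))
            dp = new_dp
        return min(dp, key=lambda t: t[0])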

  15. #15
    Member
    Join Date
    Feb 2014
    Location
    Belgium
    Posts
    3
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Has anyone used Bayesian analysis for file type detection in the context of compression?
    It seems to have good results.
    E.g. "File-type Detection Using Naïve Bayes and n-gram Analysis".

  16. Thanks:

    Gonzalo (22nd December 2019)

  17. #16
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    238
    Thanked 90 Times in 70 Posts
    Quote Originally Posted by pzo View Post
    Has anyone used Bayesian analysis for file type detection in the context of compression?
    It seems to have good results.
    E.g. "File-type Detection Using Naïve Bayes and n-gram Analysis".
    Very interesting! I'm just starting to read it, and AFAIK the "Byte Frequency Distribution (BFD)" is exactly what I was talking about throughout this whole thread. It seems they took it to the next level.

    It would be more than great to actually implement it.
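    As a rough sketch of the BFD part of that approach (not the paper's implementation, which also uses n-gram analysis), a multinomial naive Bayes over byte counts with Laplace smoothing could look like this; the training histograms and labels would have to come from known sample files:

    Code:
    import numpy as np

    class ByteNaiveBayes:
        def fit(self, histograms, labels):
            """histograms: (n_samples, 256) array of byte counts per file."""
            self.classes = sorted(set(labels))
            X = np.asarray(histograms, dtype=float)
            self.log_theta, self.log_prior = {}, {}
            for c in self.classes:
                rows = X[[lab == c for lab in labels]]
                counts = rows.sum(axis=0) + 1.0        # Laplace smoothing
                self.log_theta[c] = np.log(counts / counts.sum())
                self.log_prior[c] = np.log(len(rows) / len(labels))
            return self

        def predict(self, histogram):
            h = np.asarray(histogram, dtype=float)
            scores = {c: self.log_prior[c] + float(h @ self.log_theta[c])
                      for c in self.classes}
            return max(scores, key=scores.get)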
