
Thread: DataSmoke: detecting text, multimedia and incompressible data

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts

    DataSmoke: detecting text, multimedia and incompressible data

    Since incompressible, text and multimedia files are better compressed with specific algorithms, we need a fast and reliable way to detect such data. I call it data smoking.

    This project will provide various experimental algorithms that can recognize some special datatypes (not necessarily all of them), as well as samples of data that are especially hard to smoke correctly. At the moment it consists of several entropy calculators.

    More info
    Repository (public domain license)
    Executables: version 0.1, version 0.2, version 0.3
    Last edited by Bulat Ziganshin; 12th February 2014 at 17:33.

  2. Thanks (13):

    Bilawal (16th December 2016),Cyan (8th February 2014),encode (9th February 2014),GOZARCK (8th February 2014),ivan2k2 (10th February 2014),jethro (9th February 2014),just a worm (11th February 2014),Matt Mahoney (12th February 2014),Mike (12th February 2014),moinakg (9th February 2014),samsat1024 (8th February 2014),Stephan Busch (8th February 2014),surfersat (10th February 2014)

  3. #2
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    357
    Thanks
    12
    Thanked 36 Times in 30 Posts
    I think this is a good idea, especially DWord over text files and Byte over already compressed ones.

  4. #3
    Member
    Join Date
    May 2013
    Location
    ARGENTINA
    Posts
    54
    Thanks
    62
    Thanked 13 Times in 10 Posts
    In the case of compressed audio files such as *.opus, *.aac, *.mp3 (see what smoke says below), will the compressor skip the file, or does that depend on each specific compression tool?
    Processing standstill_cuando.audio.mp3: 10,794,037 bytes
    ByteSmoker entropy: minimum 98.46%, average 99.42%, maximum 99.77%
    WordSmoker entropy: minimum 96.47%, average 98.47%, maximum 99.11%
    Order1Smoker entropy: minimum 94.48%, average 97.51%, maximum 98.46%
    DWordSmoker entropy: minimum 93.23%, average 95.26%, maximum 95.98%
    Last edited by GOZARCK; 8th February 2014 at 23:26.

  5. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Quote Originally Posted by fcorbelli View Post
    I think this is a good idea, especially for DWord over text file and Byte over already compressed ones
    The whole idea of the library is that we get data of unknown type and analyze it to determine whether it's text, incompressible, and so on. Overall, I think that ByteSmoker should suffice for 95% of cases, failing only on repetitions of random data (such as two copies of the same zip file), and DWordSmoker should suffice for another 95%, failing on things like base64 encoding (random data over a limited charset). Combining them should give us an algorithm that processes 2 GB/s (since ByteSmoker and DWordSmoker employ different CPU resources) and is able to detect almost any compressible data.
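    For readers who want to experiment, here is a minimal sketch (my own C++ illustration, not DataSmoke's actual ByteSmoker code; the function name and normalization are assumptions) of the order-0 byte entropy idea this refers to:
    Code:
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    
    // Order-0 byte entropy as a fraction of the 8 bits/byte maximum (0.0 .. 1.0).
    double byte_entropy_fraction(const uint8_t* buf, size_t n)
    {
        if (n == 0) return 0.0;
        uint64_t count[256] = {0};
        for (size_t i = 0; i < n; ++i)
            ++count[buf[i]];
    
        double bits = 0.0;
        for (int c = 0; c < 256; ++c) {
            if (count[c] == 0) continue;
            double p = double(count[c]) / double(n);
            bits -= p * std::log2(p);    // Shannon entropy in bits per byte
        }
        return bits / 8.0;               // values near 1.0 look incompressible
    }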


    Quote Originally Posted by GOZARCK View Post
    In the case mp3 header, ogg or other audio files compressed *.opus *.aac *.mp3 [smoke says ] the compressor will skip the file? or depends for each specific compression tool?
    It looks rather close to incompressible data (as it should). DataSmoke, in its current state, is just an entropy calculation library. I personally tend to consider as incompressible any data that scores >90% on both ByteSmoker and DWordSmoker. When I talk about multimedia data, I mean uncompressed formats such as BMP files.
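    Expressed as code, that rule of thumb might look like this (a hypothetical helper; the 90% threshold comes from the paragraph above, the names are mine):
    Code:
    // Treat a block as incompressible when both smokers report more than 90%.
    bool looks_incompressible(double byte_entropy_pct, double dword_coverage_pct)
    {
        return byte_entropy_pct > 90.0 && dword_coverage_pct > 90.0;
    }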


    To all: please provide us with samples of compressible data having unusually high "entropy" values, especially in ByteSmoker and DWordSmoker simultaneously.
    Last edited by Bulat Ziganshin; 9th February 2014 at 10:28.

  6. #5
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    191
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Ziganshin,

    I would like to help you, but I have a problem with your "smoke" tool. I am not able to redirect (>>) the output to a file (Win XP SP3, Czech edition). Would it also be possible to put the results on one line?

    I can analyze different files, but results on a single line are more convenient for further processing. I will probably remove the file name and keep only the file extension.

    Sincerely yours,

    FatBit

  7. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    FatBit:
    1. It prints to stderr, so use "smoke file 2>out" for now. I will fix that.
    2. Can you provide a sample of the output you want to see? Or push a patch, since it's OSS after all.

  8. #7
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    191
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Ziganshin,

    thank you for your support. I have to learn more…

    1. "smoke file 2>out" works.
    2. I will probably send files in this format:

    Processing *.cpp: 7,718 bytes
    ByteSmoker entropy: minimum 63.51%, average 63.51%, maximum 63.51%
    WordSmoker entropy: minimum 49.35%, average 49.35%, maximum 49.35%
    Order1Smoker entropy: minimum 35.19%, average 35.19%, maximum 35.19%
    DWordSmoker entropy: minimum 41.46%, average 41.46%, maximum 41.46%

    Maybe it would be better to collect the individual result files into one big result file; for that it would be better to have the results on one single line, sorted by file extension:

    Processing *.cpp: 7,718 bytes ByteSmoker entropy: minimum 63.51%, average 63.51%, maximum 63.51%, WordSmoker entropy: minimum 49.35%, average 49.35%, maximum 49.35%, Order1Smoker entropy: minimum 35.19%, average 35.19%, maximum 35.19%, DWordSmoker entropy: minimum 41.46%, average 41.46%, maximum 41.46%

    Sincerely yours,

    FatBit

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    1. Quoting the updated README:

    - ByteSmoker: Minimum recommended block is 16 KBytes.
    - WordSmoker: Minimum recommended block is 4 MBytes.
    - Order1Smoker: Minimum recommended block is 4 MBytes.
    - DWordSmoker: Minimum recommended block is STEP*FILTER*64 bytes (default STEP=4, FILTER=16, i.e. 4096 bytes).

    2. I don't want to add a sophisticated front-end that collects all those cpp files. Again, it's open source, so maybe someone else will do it.

    3. For me, it's more important to collect various data detection algorithms (and share them with other developers) than to make an end-user tool.
    Last edited by Bulat Ziganshin; 9th February 2014 at 13:26.

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Version 0.2:
    • DWord coverage modified to use STEP=1 (required to detect duplicated compressed data) and HASHSIZE = 2*mb (required for more precise coverage computation; previously it was 88% on random data)
    • print amount of incompressible blocks (entropy or coverage >95%)
    • display results as Markdown-friendly tables
    • calculate min/max entropy only on complete 4 MB blocks, so the average entropy (which still includes the incomplete final block) may now be less than the minimum or greater than the maximum
    • print everything to stdout instead of stderr
    • substantially updated README
    Last edited by Bulat Ziganshin; 9th February 2014 at 21:03.

  11. #10
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    357
    Thanks
    12
    Thanked 36 Times in 30 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    the whole idea of the library is that we get data of unknown type and analyze them to determine whether it's text, incompressible or so.
    I understand, and I'd point out that, given an "unknown" type, this is a typical DB dump report:
    Code:
      ByteSmoker entropy: minimum 49.25%, average 61.00%, maximum 68.38%
      WordSmoker entropy: minimum 34.92%, average 46.97%, maximum 54.56%
      Order1Smoker entropy: minimum 20.60%, average 32.93%, maximum 41.22%
      DWordSmoker entropy: minimum 1.81%, average 5.88%, maximum 11.56% 
    In the sense that you could easily recognize a dump by DWord.
    Last edited by fcorbelli; 9th February 2014 at 21:18.

  12. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    I don't understand you. These stats look pretty typical for texts such as enwik9.

  13. #12
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    357
    Thanks
    12
    Thanked 36 Times in 30 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i don't understand you. these stats look pretty typical for texts such as enwik9
    Exactly.
    In fact an ASCII dump of a database often has less entropy than English text, because some keywords appear very often,
    like
    Code:
    INSERT INTO `docfiles` VALUES
    In other words, you could (in theory!) build a better pre-processor for ASCII dumps than for "straight" English text, for example by replacing standard SQL keywords (INSERT INTO, VALUES and NULL) with minimal symbols.
    Just a suggestion for recognizing an "ASCII dump" type of file, as sketched below.
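    A minimal sketch of that keyword-substitution idea (my own C++ illustration; the escape byte, keyword table and function name are assumptions, and this is not FreeArc's "dict" filter):
    Code:
    #include <string>
    #include <vector>
    
    static const std::vector<std::string> kKeywords = {
        "INSERT INTO ", "VALUES ", "NULL"   // frequent SQL tokens from the dump
    };
    static const char kEscape = 0x01;       // assumed not to occur in the dump text
    
    // Replace each keyword with <escape><index>, shrinking frequent tokens
    // to two bytes before the real compressor sees the data.
    std::string encode_dump(const std::string& in)
    {
        std::string out;
        for (size_t i = 0; i < in.size(); ) {
            bool matched = false;
            for (size_t k = 0; k < kKeywords.size(); ++k) {
                const std::string& kw = kKeywords[k];
                if (in.compare(i, kw.size(), kw) == 0) {
                    out += kEscape;
                    out += char('0' + k);   // keyword index as a single byte
                    i += kw.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) out += in[i++];
        }
        return out;
    }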

    Here is (I hope!) a clear example of such a dump:
    Code:
    NULL,NULL,NULL,NULL,'AB30F62091F82F7FB37D67A33164E83B4953956F',NULL,NULL,NULL,0,0
    ,0,NULL,NULL,NULL,0,0,23039511873408122,NULL),(44216,226801,'PDF','\\\\anna\\sto
    rico\\12\\','00044216_2010_pratica completa scannata.pdf','2010-06-04','PDF',162
    4,'2010-06-04',NULL,0,NULL,0,NULL,NULL,NULL,NULL,NULL,'2B9A80070D989CF5F3A95A793
    EA09DD92168EF02',NULL,NULL,NULL,0,0,0,NULL,NULL,NULL,0,0,23039511873408123,NULL)
    ,(44670,226802,'PDF','\\\\anna\\storico\\12\\','00044670_20100604124509025_prati
    ca completa scannata.pdf','2010-06-04','PDF',1791,'2010-06-04',NULL,0,NULL,0,NUL
    L,NULL,NULL,NULL,NULL,'A59C87096382BC915E882CC408FD3D941AFEEBCD',NULL,NULL,NULL,
    0,0,0,NULL,NULL,NULL,0,0,23039511873408124,NULL),(45012,226803,'PDF','\\\\anna\\
    storico\\12\\','00045012_2010_pratica completa scannata.pdf','2010-06-04','PDF',
    810,'2010-06-04',NULL,0,NULL,0,NULL,NULL,NULL,NULL,NULL,'A8CE20B67CE5047A6CA10C0
    C5BCB1BA5F17598EA',NULL,NULL,NULL,0,0,0,NULL,NULL,NULL,0,0,23039511873408125,NUL
    L),(49573,226804,'SCANSIONE','\\\\anna\\storico\\12\\','00049573__Fwd_ abbina 20
    10 0005 0000025725___segreteria_es_net _al__ _segreteria_esn_net__2010_06
    _04_11_40_40_0359.eml','2010-06-04','eml',433,'2010-06-04',NULL,0,NULL,0,NULL,NU
    LL,NULL,NULL,NULL,'47E2AE0AA834B3117C7F2E464CD92DA38F7B9EB0',NULL,NULL,NULL,0,0,
    0,NULL,NULL,NULL,0,0,23039511873408126,NULL),(44572,226805,'PDF','\\\\anna\\stor
    ico\\12\\','00044572_20100604124717876_pratica completa scannata.pdf','2010-06-0
    4','PDF',937,'2010-06-04',NULL,0,NULL,0,NULL,NULL,NULL,NULL,NULL,'25A24686B5C7E2
    43ED8D05D2B9F74AC018F7F7E4',NULL,NULL,NULL,0,0,0,NULL,NULL,NULL,0,0,230395118734
    08127,NULL),(44949,226806,'PDF','\\\\anna\\storico\\12\\','00044949_2010_pratica
     completa scannata.pdf','2010-06-04','PDF',1387,'2010-06-04',NULL,0,NULL,0,NULL,
    NULL,NULL,NULL,NULL,'0CCFB8B49F793DF805B869DE689A4F57F43F89E0',NULL,NULL,NULL,0,
    0,0,NULL,NULL,NULL,0,0,23039511873408128,NULL),(49266,226807,'CLONATO','\\\\anna
    \\storico\\14\\','atto 00049266  full.doc','2010-06-04','doc',270,'2009-07
    Last edited by fcorbelli; 9th February 2014 at 21:27.

  14. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    1. A DB dump is a very specific datatype.
    2. Dictionary replacement is a well-known text preprocessing technique; it's already implemented in FreeArc as the "dict" filter.
    3. The 4 currently provided smokers just measure entropy and therefore check only for incompressible data. Detection of text, multimedia and other data is in the future plans as a general line of library development.
    Last edited by Bulat Ziganshin; 9th February 2014 at 22:54.

  15. #14
    Member
    Join Date
    May 2013
    Location
    ARGENTINA
    Posts
    54
    Thanks
    62
    Thanked 13 Times in 10 Posts
    This is my test of smoke 0.2.
    1.bat file (I need help with the batch script: I can't exclude "1.bat" itself)
    Code:
    @echo off
    set carp="%~snx0"
    for /R  %%f in (*) do smoke "%carp%" "%%f" >>smoke0.2_test.txt
    pause
    Result_vectors log.zip

    Result_Game log_game.zip

  16. #15
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    357
    Thanks
    12
    Thanked 36 Times in 30 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    1. DB dump is a very specific datatype
    I know, but I work with this all day

    2. dictionary replacement is a well-known text preprocessing technique. it's already implemented in freearc as the "dict" filter
    I know too, but... I tried
    3. the 4 currently provided smokers just measure the entropy and therefore check only for incompressible data. detection of text, multimedia and other data are in future plans as general line of library development
    OK, then the target is to find incompressible data.
    How can I (we) help to improve it?

  17. #16
    Member perovskite's Avatar
    Join Date
    Feb 2014
    Location
    Europe
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Hello Bulat, thanks for your new tool. I tried to contact you about a DataSmoke bug, but it seems you already fixed it before my registration was finished:

    version 0.1
    Code:
    Processing RandomNumbers.dat: 16,384 bytes
      ByteSmoker entropy: minimum 99.86%, average 99.86%, maximum 99.86%
      WordSmoker entropy: minimum 86.00%, average 86.00%, maximum 86.00%
      Order1Smoker entropy: minimum 72.13%, average 72.13%, maximum 72.13%
      DWordSmoker entropy: minimum 100.00%, average 101.56%, maximum 101.56%

    version 0.2
    Code:
       randomnumbers.dat | min % | avg % | max % | incompressible 4MB blocks
    ---------------------|------:|------:|------:|----------------------------
            Byte entropy | 99.86 | 99.86 | 99.86 | 1 of 1
            Word entropy | 86.00 | 86.00 | 86.00 | 0 of 1
    Order-1 byte entropy | 72.13 | 72.13 | 72.13 | 0 of 1
          DWord coverage |100.00 |100.00 |100.00 | 1 of 1
    I used raw data from http://www.random.org/bytes/ to test if the tool itself works correctly when entropy is 100%.

    Quote Originally Posted by fcorbelli View Post
    How can I (we) help to improve it?

    One problem is, I don't know how to look for files where DataSmoke fails (if it does). It would be nice if it could copy files with high entropy to a folder and compress them with 7zip or FreeArc automatically.

    Let's say:

    smoke C:\*
    -> avg % > 85% detected
    copy %%i D:\test\
    arc a %%i
    compare filesize %%i.* with %%i.arc

    Something like this, to automate the process.

    Probably this can be done with a PowerShell script, without the need to change the source code?

    Edit:

    Code:
    facial_animation_file.resource | min % | avg % | max % | incompressible 4MB blocks
    -------------------------------|------:|------:|------:|----------------------------
                      Byte entropy | 95.92 | 95.92 | 95.92 | 1 of 1
                      Word entropy | 93.19 | 93.19 | 93.19 | 0 of 1
              Order-1 byte entropy | 90.46 | 90.46 | 90.46 | 0 of 1
                    DWord coverage | 71.14 | 71.14 | 71.14 | 0 of 1
                    
                    
                    423408 Bytes
                    
    
    facial_animation_file.arc | min % | avg % | max % | incompressible 4MB blocks
    --------------------------|------:|------:|------:|----------------------------
                 Byte entropy | 99.99 | 99.99 | 99.99 | 1 of 1
                 Word entropy | 99.12 | 99.12 | 99.12 | 1 of 1
         Order-1 byte entropy | 98.25 | 98.25 | 98.25 | 1 of 1
               DWord coverage | 99.93 | 99.93 | 99.93 | 1 of 1
    
    
                    354680 Bytes
    md5: 6742a8b2e2a8803089548175da52d3bf
    Last edited by perovskite; 10th February 2014 at 21:16.

  18. #17
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    191
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Ziganshin,

    I tried to "provide you with the samples of compressible data having unusually high entropy values, especially in ByteEntropy and DWordCoverage simultaneously". I used a simple CLI script

    Set FileSizeLimit=4194304
    For /R C:\Data\ %%f In (*.*) Do (If /i %%~zf GEQ %FileSizeLimit% (Smoke "%%f" >> C:\TMP\Smoke.dat))

    to check my own and corporate hard drives; I found ~5k files > 4 MB and "smoked" them with your tool. Results are enclosed (Excel spreadsheet). Maybe some files can be sent; others (private and corporate) I will test according to your requirements and send you the results.

    Please give us rules for how to select suitable files.

    Sincerely yours,

    FatBit
    Attached Files

  19. #18
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    Why do you call the detection of specific data types "smoking"?

  20. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts

  21. #20
    Member ivan2k2's Avatar
    Join Date
    Nov 2012
    Location
    Russia
    Posts
    47
    Thanks
    15
    Thanked 8 Times in 5 Posts
    I've found some strange results with DWord coverage on The Gauntlet Corpus.
    Attached Files

  22. #21
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I tested version 0.2 on 10gb/benchmarks/simple/*, a set of synthetic data files. It crashes on the file c0000, a 1 byte file containing just a 0 byte. On the files b6002, c6001, c6255, d6128 the DWord coverage is given as "-1.#J" instead of a number. Other than that, it correctly detects whether all files are random or not except for w7002 and w8002, which are cases you already know about (random strings longer than 4MB repeated twice). Here is a description of the files.
    Code:
        Size      File   Description                            Par
        --------- -----  -----------                            ---------
        1,000,000 a6000  1 MB of random bytes in range 0..255   1,000,000
        1,000,000 a6001  random bytes in range 0..127             875,000
        1,000,000 a6004  random bytes in range 0..15              500,000
        1,000,000 a6007  random bytes in range 0..1               125,000
    
        1,000,000 b6002  pattern 1,0,1,0...                             0
        1,000,000 b6004  pattern 1,0,0,0,1,0,0,0                        0
        1,000,000 b6010  every 10'th byte is 1, else 0                  0
        1,000,000 b6100  every 100'th byte is 1, else 0                 0
    
                1 c0000  single NUL byte (0x00)                         0
        1,000,000 c6000  1 MB of NUL bytes                              0
        1,000,000 c6001  all bytes are 1 (0x01)                         0
        1,000,000 c6255  all bytes are 255 (0xFF)                       0
    
        1,000,000 d6001  repeating pattern 0,1,2,3,...255,0,1...        0
        1,000,000 d6016  repeating stepping by 16                       0
        1,000,000 d6128  alternating 0,128,0,128...                     0
    
        1,000,000 i6002  0,0,1,1,2,2,...255,255,0,0,1,1...              0
        1,000,000 i6004  0,0,0,0,1,1,1,1,2,2,2,2...                     0
        1,000,000 i6010  bytes repeat 10 times, then increment          0
        1,000,000 i6100  bytes repeat 100 times, then increment         0
    
        1,000,000 l6002  ascending list of 2-byte integers 0,1,2,...    0
        1,000,000 l6003  ascending list of 3-byte integers              0
        1,000,000 l6004  ascending list of 4-byte integers              0
        1,000,000 l6008  ascending list of 8-byte integers              0
    
        1,000,000 m6002  like l but numbers are MSB first (big endian)  0
        1,000,000 m6003  (e.g. 0,0,0, 0,0,1, 0,0,2, ...)                0
        1,000,000 m6004                                                 0
        1,000,000 m6008                                                 0
    
        1,000,000 p6002  random strings of length 2 written twice 500,000
        1,000,000 p6004  random strings of length 4 written twice 500,000
        1,000,000 p6010  length 10                                500,000
        1,000,000 p6100  length 100                               500,000
    
        1,000,000 r6002  randomly chosen bytes written twice      500,000
        1,000,000 r6004  randomly chosen bytes written 4 times    250,000
        1,000,000 r6010  written 10 times                         100,000
        1,000,000 r6100  written 100 times                         10,000
    
        1,000,000 s6002  random bytes alternating with 0 bytes    500,000
        1,000,000 s6004  random,0,0,0,random,0,0,0...             250,000
        1,000,000 s6010  every 10'th byte is random, else 0       100,000
        1,000,000 s6100  every 100'th byte is random, else 0       10,000
    
           10,000 w4002  a 5000 byte random string written twice    5,000
          100,000 w5002  50,000 random bytes, written twice        50,000
        1,000,000 w6002  500,000 random bytes, written twice      500,000
        1,000,000 w6010  100,000 random bytes, written 10 times   100,000
        1,000,000 w6100  10,000 random bytes, written 100 times    10,000
       10,000,000 w7002  5M random bytes, written twice         5,000,000
      100,000,000 w8002  50M random bytes, written twice       50,000,000
    
      151,110,001 total size
    In case you don't want to download the 10GB data set, I attached a zpaq archive containing the files, plus the code (in ZPAQL) that generated the files and a readme that describes the files and some test results with various compressors. The random data is pseudo-random, so the whole data set fits in a 5944 byte zpaq archive.
    Attached Files

  23. Thanks (2):

    avitar (12th February 2014),Bulat Ziganshin (12th February 2014)

  24. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Version 0.3
    Executable (requires SSE 4.2)
    • Added "2-pass DWord/QWord coverage" smokers that do the same as the 1-pass coverage smoker but with automatic selection of the most populated sector
    • The "DWord hash entropy" smoker, which is of course absolutely useless
    • The -bBUFSIZE option selects the size of analyzed blocks: -b64k, -b4m/-b4, -b1g
    • print the numbers of the first 10 incompressible blocks
    • hashing uses the SSE 4.2 crc32c instruction in order to make the hash distribution fairer (see the sketch below)
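    As an illustration of the last point, a dword can be hashed with the crc32c instruction roughly like this (a sketch using the SSE 4.2 intrinsic, not DataSmoke's exact code; hashsize is assumed to be a power of two):
    Code:
    #include <nmmintrin.h>   // _mm_crc32_u32; compile with -msse4.2
    #include <cstdint>
    
    // Map a 32-bit value to a slot of a hash table with 'hashsize' entries.
    inline uint32_t dword_hash(uint32_t dword, uint32_t hashsize)
    {
        uint32_t crc = _mm_crc32_u32(0, dword);  // crc32c mixes the bits well
        return crc & (hashsize - 1);             // hashsize must be a power of two
    }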

  25. Thanks:

    samsat1024 (12th February 2014)

  26. #23
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Bulat,

    I'm just starting to look at your code, and I'm looking at the smoke method for ByteEntropy.

    I notice that you're computing 4 separate 256-element histograms, rather than one, with each histogram recording the counts of the different byte values at different offsets from a 4 byte boundary---i.e., a histogram for bytes whose addresses are multiples of 4, another for bytes whose addresses are one greater than those, another for 2 greater, and another for 3...

    ...but then you just add the corresponding elements of the 4 histograms when you actually compute the entropy.

    Is this just a speed hack, maybe enabling superscalar issue of up to 4 count updates, or what?

    I think it's a good thing to do for a very different reason---if you compute the entropies at different alignments separately, it can tell you a whole lot that helps discriminate between different kinds of data.

    Examples:

    (1) if you're looking at dword-aligned integers and/or pointers or offsets, the entropy will typically be highest in the lowest bytes of each aligned 4, and very low in the high byte.

    (2) For 32-bit Unicode, only the lowest byte will usually have much entropy at all. The upper 3 will all be almost zero.

    (3) if you're looking at 16-bit integers or 16-bit Unicode text for an alphabetic language, the entropy in the high byte of each byte pair will be low, and the entropy in the low bytes will be much higher

    (4) for 16-bit Unicode for alphabetic languages, the entropy in the high bits is usually very near zero

    (5) for 64-bit values, 32-bit alignment regularities will show up more weakly, because you're effectively superimposing the high 32 bits onto the low 32 bits.

    (6) if these alignment regularities show up strongly, but in the wrong offsets---more information in the high bytes of alignment units than the low ones---it's a good guess that you're looking at either (a) a file that has a variable-sized header followed by a repeating pattern that's misaligned by a certain number of bytes, or (b) data that's been stored in the opposite endianness than what you're expecting.

    I think it would be a good idea to break out a separate method for scanning the data and accumulating the arrays of counts, so that different analyses can be done on the separate histograms rather than just adding them up again.

    It may also be a good idea to further unroll the inner loop to do 8 elements at a whack. That would at least make 64-bit alignments pop right out, and make it easier to guess at larger-scale alignments.
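    To make the suggestion concrete, per-alignment order-0 entropies could be collected roughly like this (my own sketch, not DataSmoke's code; the function name is invented):
    Code:
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    
    // One 256-entry histogram per offset within a 4-byte group; report the
    // order-0 entropy (bits/byte) of each offset separately.
    void alignment_entropies(const uint8_t* buf, size_t n, double entropy_bits[4])
    {
        uint64_t count[4][256] = {{0}};
        for (size_t i = 0; i < n; ++i)
            ++count[i & 3][buf[i]];          // histogram selected by offset mod 4
    
        for (int a = 0; a < 4; ++a) {
            uint64_t total = 0;
            for (int c = 0; c < 256; ++c) total += count[a][c];
            double bits = 0.0;
            for (int c = 0; c < 256; ++c) {
                if (!count[a][c]) continue;
                double p = double(count[a][c]) / double(total);
                bits -= p * std::log2(p);
            }
            entropy_bits[a] = bits;          // e.g. near-zero entropy in offsets 1..3
                                             // hints at dword-aligned integers
        }
    }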

  27. #24
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    BTW, I may be misunderstanding the goals of this project. Is it only to ever make a simple 3-way distinction between uncompressed text-like data, other compressible data, and likely incompressible data? (That would only tell you whether to not bother trying to compress, or to use a text-oriented algorithm, or to use some hairier algorithm that you hope will find subtler regularities for itself...)

    Or at the other extreme, is it intended to eventually make a variety of other distinctions such as whether you're looking at Intel-like machine code or bytecodes, 32-bit aligned RISC instructions, RGB image data with 10-bit colors packed in 32-bit pixels, etc.?

    The latter could eventually be used to, e.g., pick a specific ZPAQ model to compress the data effectively.

  28. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    This project is about detecting datatypes for the purposes of compression. In theory it may be anything: text, exe, RGB... But in practice, right now I need a way to detect incompressible data, so these smokers were only about this particular task. In fact, I had this concrete problem, but declared the project goals in a more generalized way so it may be extended in the future.

    But this project failed: I haven't found any algorithm that detects incompressible data more reliably than the standard order-0 model. So I left it as is; not every research project has to succeed.

    FreeArc already has the mmdet library that detects various datatypes; you may look into it.

    The 4 counter tables are an optimization trick that I borrowed from FSE. Probably it reduces the number of memory read/write collisions.

    Using 4-8 counter tables for detection of multi-byte data is a great idea.

  29. Thanks:

    Paul W. (4th March 2014)

  30. #26
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Bulat, in your README, the results for the "binary" file are for an x86 machine code file, right? Do you know if it's all machine code, or whether it has a link table or literals segment in it, etc.?

    Also, the 16-bit WAV file is an interesting case, because ideally a bytewise analysis would make it immediately obvious that it's a 16-bit discretized continuous function, and eminently compressible. It will fool both order 0 and order 1 bytewise entropy, though. I think it would also fool my scheme of computing the order 0 entropies for different alignments separately.

    It would not fool a simple and fast distance transform that keeps track of where each byte value was last seen, and how far back that was in the stream---on almost all of the high bytes of 16-bit values, it'd find the last occurrence of the very same byte value either two or four bytes back in the stream. (Two bytes back if the stereo channels are very similar, four if they're very dissimilar.)

    (To add that, you'd just need to save a pointer to the last occurrence along with each count, subtract that from the current pointer when you update your count, update the corresponding distance histogram parallel to your count histogram, and replace the stored pointer with the current one.) If that distribution is very skewed, i.e., the distance entropy is low, you know you have a lot of small-scale locality in your data. If it's strongly peaked at a few particular distances, that can tell you the likely strides and is a good hint for ZPAQ models, or for other worthwhile "smoking" tests to make further discriminations.
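    A sketch of that distance transform (my own code with invented names; coarse buckets are a simplification of the parallel histogram described above):
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>
    
    // For every byte, measure how far back the same byte value last occurred
    // and histogram those distances (bucket 0 collects anything farther away).
    std::vector<uint64_t> distance_histogram(const uint8_t* buf, size_t n,
                                             size_t max_tracked = 64)
    {
        size_t last_pos[256];
        bool   seen[256] = {false};
        std::vector<uint64_t> hist(max_tracked + 1, 0);
    
        for (size_t i = 0; i < n; ++i) {
            uint8_t b = buf[i];
            if (seen[b]) {
                size_t dist = i - last_pos[b];
                ++hist[dist <= max_tracked ? dist : 0];   // peaks at 2 or 4 would
                                                          // flag 16-bit samples
            }
            last_pos[b] = i;                              // remember last occurrence
            seen[b] = true;
        }
        return hist;
    }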

  31. #27
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Hmmm... I take back what I said about the 16-bit WAV file fooling my scheme with 4 or 8 separate histograms (for different offsets from 4- or 8-byte alignments).

    The WAV file will show significantly less entropy in the even/high bytes than the odd/low bytes, unless it's a very weird audio file with constant volume and a whole ton of bass, i.e., something that uses almost all of the available dynamic range almost all the time. Barring that, very high and very low amplitudes will be less common than middle-range values, and the high bytes will show skew towards values with small absolute values. The low bytes won't. That should make it clear that you're looking at data that are aligned on some multiple of 2-byte boundaries, and probably suggest 4.

  32. #28
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Just a quick test, but it looks like g++ 4.8.1 is able to optimize f() but not g() using SSE2 instructions to add in parallel. (options -O3 -msse2 -S or -O3 -m64 -S). Either function should return the sum of n bytes at p[0..n-1], but g() tries to make the parallelism explicit.

    Code:
    unsigned f(unsigned char* p, unsigned n) {
      unsigned sum=0;
      for (unsigned i=0; i<n; ++i)
        sum+=p[i];
      return sum;
    }
    
    unsigned g(unsigned char* p, unsigned n) {
      unsigned sum[4]={0};
      for (unsigned i=0; i<n; ++i)
        sum[i&3]+=p[i];
      return sum[0]+sum[1]+sum[2]+sum[3];
    }

  33. Thanks:

    Paul W. (4th March 2014)

  34. #29
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Looking at the mmdet source, it seems that

    (1) A multimedia file is assumed to consist mainly of a stream of records, and records consist of uniform arrays of numeric values of the same fixed width ("bitwidth"), e.g., a 16-bit stereo WAV file is mostly a sequence of 32-bit records made of two 16-bit integer fields for the left and right stereo "channels," and a 24-bit color RGB image file might be mostly a sequence of 3 one-byte fields for the R, G, and B "channels",

    (2) the approach is fairly brute force---you can take a block of data and try to chop it up with different numbers of assumed channels, and different field "bitwidths," and see if the separated streams seem more compressible using e.g., order 0 entropy.

    Is that approximately correct? (Is there documentation somewhere besides the code and comments?)

    My impression so far is that you don't handle things where some channels are represented differently from others, e.g., RGB where the green is 10 bits but the R & B are only 8, or luminance + chrominance where those have different resolutions, or 5.1 audio where the low bass channel is lower-resolution than the others.
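    If the summary in (2) is right, the brute-force test could be sketched like this (my own code, not mmdet's; it just sums per-channel order-0 costs for one assumed channel count):
    Code:
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    
    // Split the block into 'channels' interleaved byte streams and return the
    // total order-0 cost in bits; a pronounced minimum over channel counts
    // (1, 2, 3, 4, 8, ...) hints at the record layout.
    double split_cost_bits(const uint8_t* buf, size_t n, size_t channels)
    {
        double total_bits = 0.0;
        for (size_t ch = 0; ch < channels; ++ch) {
            uint64_t count[256] = {0}, len = 0;
            for (size_t i = ch; i < n; i += channels) { ++count[buf[i]]; ++len; }
            for (int c = 0; c < 256; ++c) {
                if (!count[c]) continue;
                double p = double(count[c]) / double(len);
                total_bits -= double(len) * p * std::log2(p);
            }
        }
        return total_bits;
    }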

  35. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Just a quick test, but it looks like g++ 4.8.1 is able to optimize f() but not g() using SSE2 instructions to add in parallel. (options -O3 -msse2 -S or -O3 -m64 -S). Either function should return the sum of n bytes at p[0..n-1], but g() tries to make the parallelism explicit.

    Code:
    unsigned f(unsigned char* p, unsigned n) {
      unsigned sum=0;
      for (unsigned i=0; i<n; ++i)
        sum+=p[i];
      return sum;
    }
    
    unsigned g(unsigned char* p, unsigned n) {
      unsigned sum[4]={0};
      for (unsigned i=0; i<n; ++i)
        sum[i&3]+=p[i];
      return sum[0]+sum[1]+sum[2]+sum[3];
    }
    I see the same thing. However I wouldn't have written g like this. I'd be inclined to do (along with unwritten code to handle the boundary case of not being divisible by 4):

    Code:
    unsigned h(unsigned char* p, unsigned n) {
        unsigned sum[4]={0};
        for (unsigned i=0; i<n; i+=4, p+=4) {
            sum[0]+=p[0];
            sum[1]+=p[1];
            sum[2]+=p[2];
            sum[3]+=p[3];
        }
        return sum[0]+sum[1]+sum[2]+sum[3];
    }
    Gcc 4.8.2 still does better on f than h, but h is much better than g. Conversely the Intel compiler does a significantly better job on h than f, producing a version faster than all. With icpc both f and h are SSE based.
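    For completeness, the "unwritten code to handle the boundary case" could be a small tail loop (a sketch, not from the original post):
    Code:
    unsigned h2(unsigned char* p, unsigned n) {
        unsigned sum[4]={0};
        unsigned i=0;
        for (; i+4<=n; i+=4, p+=4) {
            sum[0]+=p[0];
            sum[1]+=p[1];
            sum[2]+=p[2];
            sum[3]+=p[3];
        }
        unsigned total=sum[0]+sum[1]+sum[2]+sum[3];
        for (; i<n; ++i, ++p)
            total+=p[0];        // leftover 0..3 bytes when n is not divisible by 4
        return total;
    }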

    Edit: clang does better than gcc and prefers f too it seems.
    Last edited by JamesB; 4th March 2014 at 12:07.


