
Thread: Steganography based on dictionary permutations

  1. #1 - Shelwien (Administrator)

    Steganography based on dictionary permutations

    http://nishi.dreamhosters.com/u/stegdict_v0.rar
    https://github.com/Shelwien/stegdict/releases

    The main idea is that dictionary file compression is a big problem for any fast codec.
    So here's a workaround - if we cannot take care of some correlations in the data,
    let's randomize them instead via steganography, and get better overall compression as a result.
    It's the same idea that I proposed for jpeg/mp3 in threads/1405-Steganography-for-universal-recompression-of-lossy-formats

    Usage:
    stegdict c input_dict output_dict payload
    stegdict d input_dict sorted_dict payload.unp

    Note: it expects the dictionary to be the output of "sort -u" basically -
    ascending order without duplicate words.

    Quoting the output of the test scripts:
    Code:
    // Testing with ppmd_sh9 -o12 -m256
    english_SFC_s.pmd:       size=1110534
    english_SFC_steg.pmd:    size=1198488
    testfile_out:            size=754029
    english_SFC_s.dic.paq8p: size=389794 // paq8p -7 for comparison
    english_SFC_steg.pmd-testfile_out = 444459
    english_SFC_steg.pmd-english_SFC_s.pmd = 87954
    
    So using this method, we can compress the dictionary with ppmd to 444459 bytes instead of 1110534,
    although we'd need other files in the archive to use as steganography payload
    Code:
    // testing with lzma 9.20 -lc8 -lp0 -pb0 -fb273 -mc999
    english_SFC_s.lzm:       size=790664
    english_SFC_steg.lzm:    size=1435398
    testfile_out:            size=754029
    english_SFC_steg.lzm-testfile_out = 681369
    english_SFC_steg.lzm-english_SFC_s.lzm = 644734

  2. Thanks (2):

    Mike (5th September 2016),RamiroCruzo (21st June 2017)

  3. #2 - Member (Kraków, Poland)
    Sounds interesting, but I have to admit that I don't understand - could you elaborate?

    Regarding steganography - hiding information in common data - you would like to encode the information as a sequence having some chosen statistical model (that of the common data). It can be done by using an entropy coder with the encoder and decoder swapped; however, it would require that the decoder knows the probability distributions used (the statistical model).
    It turns out this requirement can be dropped - starting with the so-called dirty paper coding or Kuznetsov-Tsybakov problem, which can be generalized to any probability distribution (statistical model) known only to the sender (not the receiver) - thread, recent paper. The cost is increased complexity at the encoder.

    However, your description suggests that you want standard entropy coding, but with reduced knowledge of e.g. the probability distribution (statistical model). It turns out that this can be done too: the encoder doesn't need to know the model (or the decoder's a priori knowledge!), but at an increased computational cost at the decoder and a shift away from Shannon entropy (to some Rényi entropy) - thread.

  4. #3 - Shelwien (Administrator)
    I'm talking about applying steganography to improve compression here, not steganography per se.

    1. Why dictionary?

    - Because sorted wordlists are a common type of data (besides already existing wordlist files,
    including multi-GB password lists, they naturally appear when you use e.g. text preprocessing).
    Also, all coders with practical processing speed (LZ77/PPM/fast CM/BWT) are notably bad at compressing
    this type of data - see http://www.maximumcompression.com/data/dict.php .
    The remaining redundancy is commonly at the level of 100%-300%.

    2. Why steganography?

    - This is a demo of a more general method, which I'm researching mainly to improve the quality
    of lossy format recompression.
    The problem is like this: we have a data type with hard-to-use correlations, eg. sorted wordlists,
    or ex-jpeg bmp, or ex-mp3 wav. How to compress this kind of data?
    (a) Make a new specialized codec which would take into account all the dependencies in the data
    (like eg. paq8 does with wordlists thanks to indirect contexts, column contexts etc).
    (b) Use the dependencies to store other useful data with steganography.
    (Thus we don't need a specialized codec anymore. For example, a jpeg decoder which has access to
    quantization information can be updated to fill the "quantization gaps" in pixels.)

    Of course, (a) is better in theory, because this way it's possible to make a "perfect" model
    without any known redundancy. But it's also much harder to actually implement - for example,
    consider a jpeg recompressor model, which would look for pixel block matches while the actual
    pixels don't exist (they have to be decoded), and any detected correlations in the data have
    to be translated to probabilities of bits of quantized FFT coefficients.

    Compared to that, (b), which is modular and compatible with existing lossless codecs, seems
    much more feasible, and still better than existing solutions (for example, ppmd's dictionary
    compression rate in the stegdict_v0 demo is very close to nanozip -cc and better than zpaq max.cfg).

    3. What actually happens in this demo?

    Step 1. Suppose we have a file "testfile" (a sorted wordlist) with the following contents (15 lines):
    Code:
    "Adobe\nApple\nCisco\nDell\nEMC\nGoogle\nHP\nIBM\nIntel\nM$\nNvidia\nOracle\nQualcomm\nTI\nXerox\n"
    There are 15! permutations of these lines, so by choosing a specific permutation we can
    encode log2(15!)=40.25 bits of data, i.e. 5 bytes.
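
    (A quick arithmetic check of that number - a C++ sketch, not part of stegdict:)
    Code:
    #include <cmath>
    #include <cstdio>
    
    int main() {
        // n distinct lines have n! orderings, so a chosen ordering carries log2(n!) bits;
        // lgamma(16) = ln(15!), so this prints roughly 40.25 bits = 5.03 bytes.
        double bits = std::lgamma(16.0) / std::log(2.0);
        std::printf("log2(15!) = %.2f bits = %.2f bytes\n", bits, bits / 8);
        return 0;
    }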

    Step 2. We have a payload file "payload", containing "Hello world!"

    Step 3. We run the coder: "stegdict.exe c testfile permutation payload"
    and it creates a file "permutation" which contains:
    Code:
    "EMC\nDell\nIBM\nAdobe\nIntel\nXerox\nHP\nTI\nApple\nQualcomm\nCisco\nM$\nOracle\nGoogle\nNvidia\n"
    Step 4. Now run the decoder: "stegdict.exe d permutation sorted payload_out"
    Code:
    payload:     48 65 6C 6C 6F 20 77 6F 72 6C 64 21     Hello world!
    payload_out: 48 65 6C 6C 6F 03                       Hello?
    Step 5. Compression
    Code:
    >lzma.exe e testfile 1     -> 96 bytes
    >lzma.exe e permutation 2  -> 96 bytes
    So without steganography we "compress" 82 bytes of "testfile" to 96 bytes of "1".
    And with it, 82 bytes of "permutation" to 96 bytes of "2", which also carry 5 bytes of payload that we "save".
    So the dictionary in archive with N bytes of some other data would compress to 96+N normally,
    and 96+N-5 using stegdict, which is better.
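
    For illustration, here's a toy C++ version of the whole cycle (my sketch, not stegdict's actual coder - that one uses a 128-bit range coder): a payload number P < n! picks one of the n! orderings of the sorted lines via the factorial number system, and sorting the permuted list back recovers P. With only 40.25 bits of capacity, a 6th payload byte could only be partially recovered, which matches the payload_out dump in Step 4.
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>
    
    // embed: consume P digit by digit (mixed radix n, n-1, ...) to pick lines from the sorted list
    std::vector<std::string> embed(std::vector<std::string> sorted, uint64_t P) {
        std::vector<std::string> out;
        while (!sorted.empty()) {
            uint64_t k = P % sorted.size();
            P /= sorted.size();
            out.push_back(sorted[k]);
            sorted.erase(sorted.begin() + k);
        }
        return out;
    }
    
    // extract: recover P from each line's rank among the not-yet-seen sorted lines
    uint64_t extract(const std::vector<std::string>& permuted) {
        std::vector<std::string> rest = permuted;
        std::sort(rest.begin(), rest.end());
        uint64_t P = 0, base = 1;
        for (const std::string& s : permuted) {
            uint64_t k = std::lower_bound(rest.begin(), rest.end(), s) - rest.begin();
            P += k * base;
            base *= rest.size();
            rest.erase(rest.begin() + k);
        }
        return P;
    }
    
    int main() {
        std::vector<std::string> dict = {"Adobe","Apple","Cisco","Dell","EMC","Google","HP",
                                         "IBM","Intel","M$","Nvidia","Oracle","Qualcomm","TI","Xerox"};
        uint64_t payload = 0x48656C6C6FULL;  // "Hello" as a 40-bit number, fits below 15!
        std::vector<std::string> permuted = embed(dict, payload);
        std::printf("recovered payload: %llX\n", (unsigned long long)extract(permuted));
        return 0;
    }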

  5. Thanks:

    RamiroCruzo (21st June 2017)

  6. #4 - Member (Kraków, Poland)
    Ok, we could hide some information in an ex-jpeg bmp. Naively it would require going through FFT, but it can be avoided by directly adding e.g.
    cos(wx * x) cos(wy * y) * "coefficient containing our information"
    for some coefficients (with the range of values defined by quantization).
    However, such adding is still costly (64 pixel updates per coefficient), so going through FFT might be faster.
    And I don't really understand the motivation: for data compression you would be better off going back to jpeg; for steganography it would be relatively simple to detect if searched for (ok, you could enforce the coefs to have a proper statistical model ... but there would be blocking artifacts - I think it's hard to make it right).

    Here you want to hide some information by choosing the order of an unordered list - indeed lg(n!) bits can be stored in the order of a length-n list.
    However, talking about data compression, it is much simpler to just subtract these lg(n!) bits, and there are a few ways to do it, for example:
    - if this list is dense among the possibilities (the list has length p * "possible values"), you can go through all possibilities with an entropy coder using probability p,
    - sort, then encode the differences (delta coding); the differences have some specific probability distribution,
    - encode a tree: e.g. store the total number of elements; the root stores the number of elements that go left, and so on for the nodes recursively,
    - there is a very nice approximation I know from James Dow Allen (slide 6 here) - you divide the space of possibilities into ranges of identical length, use some way to write the number of elements in each range, then the suffixes in each range (position restricted to this range). One way to encode the "number of elements in a range" is unary representation. A more tricky way is to use 1 trit per range: 0, 1, or "2 or more elements"; in the last case you use the order of the elements in this range to write its size.

  7. #5 - Member (Lisbon)
    The permutation approach seems promising not only for this; I'm working on it as part of a larger project, and it does offer good compression rates if set in the right context.

    There are lots of scenarios to apply it to where it has not been used before. To name a simple one: define rules for the words, like no more than 3 occurrences of the same letter in a row, plus the possible combinations of one letter with the preceding one, and store the number of the combinations/permutations (or a mix of them).

    Then the bits needed to store that are fewer than the bits needed to store the original data - thus compression, because we applied a set of rules to the raw data. The more rules there are, the better the understanding of the data, and the better the understanding, the better the compression.

  8. #6 - Shelwien (Administrator)
    > And I don't really understand the motivation: for data compression you would
    > be better off going back to jpeg,

    Ok, do you know about "recompression"?
    There exist many standard formats with bad integrated compression - sometimes
    it's simply outdated, like zlib, and sometimes it's something else, like jpeg.
    But universal compression methods are not really able to further compress
    such files, because even zlib output looks like random data to them.
    So people started making tools which unpack the original files and reapply
    better compression - e.g. mp3zip/soundslimmer - while still being able to
    losslessly restore the original file, by unpacking it again and reapplying
    the original compression method.
    So repeated compression = recompression.
    Then, most recompression coders only replace the entropy coding in e.g. jpg or mp3,
    while keeping the output of the format's lossy transformation untouched.
    But precomp showed that, at least for lossless compression, it's possible to implement
    full recompression - it simply outputs the data unpacked from deflate streams,
    instead of replacing the entropy coding with a more advanced one, like e.g. packjpg does.

    Then, there are actually quite a few implementations of jpeg recompression already,
    so it's obvious that there are two groups with a significant difference in compression -
    the first simply replaces jpeg's huffman coding with CM+arithmetic
    (wzjpeg/packjpg/pjpg/rejpeg/early paq models), and the second does something else too
    (stuffit/newer paq model by Jan Ondrus). And for paq we luckily have an
    explanation - "The new model improves compression by using decoded pixel
    values of current and adjacent blocks as context."

    So my conjecture from this is that it's possible to further improve the results
    of jpeg recompression if we manage to fully extract the pixels - ideally,
    do it like precomp does with deflate, i.e. take a .jpg file and produce a .bmp + .diff.

    But then there's a problem - lossless image coders which could be used to
    compress that .bmp would not know about the dependencies added by jpeg
    quantization, and thus the compressed size of this ex-jpeg .bmp would be
    larger than the original .jpg, which would make it useless.

    So here I'm suggesting a new solution for this - we would "fill the gaps"
    in that .bmp with steganography payload. Unlike a normal ex-jpeg bmp,
    every bit of it would contain some useful information, and there would be no
    correlations added by the jpeg algorithm. Also, to a lossless bmp compressor,
    this steganography payload would look like the same noise as jpeg quantization
    quirks, so the compressed size of the bmp with payload won't grow too much.
    Of course, the size of the losslessly compressed steg-bmp would still be larger
    than the original .jpg file, but this time it would include additional useful data
    (the payload) which, in practice, can simply be subtracted from the size of
    the compressed .bmp.
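
    To illustrate the "gap filling" part, a C++ sketch of the principle (my illustration, not actual jpeg/recompressor code; it works directly on one dequantized DCT coefficient instead of tracking the intervals through the full decode to pixels): every value inside the re-quantization interval of width q rounds back to the same quantized coefficient, so roughly log2(q) payload bits fit per coefficient without changing the .jpg that gets reconstructed.
    Code:
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    
    // embed: pick a value inside the re-quantization interval according to the payload bits
    int embed(int k, int q, uint32_t payload_bits, int nbits) {
        int lo = k * q - q / 2;   // left edge of the interval that rounds back to k
        int step = q >> nbits;    // split the interval into 2^nbits cells
        return lo + (int)payload_bits * step + step / 2;
    }
    
    // extract: recover the payload bits from the modified value
    uint32_t extract(int value, int k, int q, int nbits) {
        int lo = k * q - q / 2;
        int step = q >> nbits;
        return (uint32_t)((value - lo) / step);
    }
    
    int main() {
        int q = 16, k = 3;        // quantizer step and quantized coefficient (made-up values)
        int nbits = 0;            // capacity of this slot = floor(log2(q)) bits
        while ((2 << nbits) <= q) ++nbits;
        int v = embed(k, q, 0xB, nbits);
        std::printf("stored %d, recovered %X, re-quantizes back to %d\n",
                    v, extract(v, k, q, nbits), (int)std::lround((double)v / q));
        return 0;
    }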

    To provide a comparison, suppose we have a choice between zlib and lzma,
    but it's an implementation of lzma that can only compress two files at once;
    also, speed/memory usage are not relevant, we just want a smaller compressed size.

    In such a case, I'd obviously choose lzma, because its compression is clearly
    better. Sure, it has a problematic interface (in this example), but that can
    be worked around.

    Same here - if we have to store a part of jpeg1's data in jpeg2 with
    steganography, but the overall compression is still noticeably better
    than the best existing jpeg recompressors, then why not do it?

    Also, in fact, it would be possible to implement this scheme within
    a single file too - we'd just have to split the jpeg into two parts,
    where the 2nd part would be stored as payload in the (partial) picture decoded from
    the first part. Conveniently for us, jpeg has blocks, so it's easy to split.
    However, this would require integrating the jpeg<->bmp recompressor, the
    lossless bmp codec, and the bmp steganography within a single program,
    while having these as 3 different programs is more convenient for
    development.

    So, the stegdict utility here is basically a proof-of-concept for this idea.
    It also works with dependencies which universal codecs don't notice
    (the matching prefixes in adjacent lines of a sorted wordlist), makes use
    of them for steganography (by permuting the lines) and improves overall
    compression (ppmd compresses the whole SFC set better with stegdict
    than without).


    > but there would be blocking artifacts - I think it's hard to make it right).

    It won't really matter from the compression PoV.
    But I actually think that it might not be so noticeable with good quality jpegs.
    Anyway, I already have in mind a method which I'd try to use - all jpeg
    transformations are linear ((de)quantization, FFT and color space conversion),
    so imho I'd be able to track value intervals through the jpeg decoding algo,
    and then the gaps in these intervals could be used for steganography.

    Note that it's supposed to be a scheme which affects the bmp only in such a way
    that exactly the same jpg could be reconstructed from the previously extracted bmp,
    so I don't think there'd be many visible artifacts - only noise, if anything.

    Even from the steganography PoV, I think this is still better than methods which
    simply store information in the low bits of pixel components in a bmp, or some such.
    In this case, it would be much harder to tell a steganography-processed picture from simply
    a noisy one.

    > Here you want to hide some information by choosing the order of an unordered
    > list - indeed lg(n!) bits can be stored in the order of a length-n list.

    I don't "want" it - there's a working utility that does that in the first post,
    with source on github and all.

    > However, talking about data compression, it is much simpler to just subtract
    > these lg(n!) bits, and there are a few ways to do it, for example:
    > [...]
    > - sort, then encode the differences (delta coding); the differences have some
    > specific probability distribution,

    Yes, I already did all that in 2003 - the results were pretty similar to current ppmd's, btw:
    http://compression.ru/sh/2003-10-24.htm
    It's in Russian, sorry, but you can press "translate" in a browser; also, the results
    are numbers with comments in English.
    Also, here's a version updated in 2010 to work on modern OS versions - http://nishi.dreamhosters.com/u/dictpack3b.rar

    > - encode a tree: e.g. store the total number of elements; the root stores the
    > number of elements that go left, and so on for the nodes recursively,

    Did that too - results were significantly worse overall (30% or such), because
    I didn't have a compressor optimized for such a data structure.

    That's actually the whole idea of this project - to be able to use existing
    compressors for compression, instead of writing crazily complicated new ones.

    And btw, the ppmd result with optimal order (o6) is 433813.
    With preprocessing (like common suffix/prefix replacement) it could be made even better.
    But the dictionary compression is not really the point here.

    Actual points are:
    1) It's a modular solution, compatible with any universal compressor, not
    a specialized codec for dictionary compression.
    2) There's no dependency on the data being text. It could be any unordered set, e.g. an image palette.
    3) This is a proof-of-concept for more interesting applications.
    4) It can be used for actual steganography too.

    Btw, I also have a steganography solution based on deflate:
    threads/1406-Concealing-data-in-deflate-streams

  9. Thanks (5):

    Gonzalo (4th September 2016),Jarek (4th September 2016),Mike (5th September 2016),RamiroCruzo (21st June 2017),schnaader (4th September 2016)

  10. #7 - Member (Kraków, Poland)
    So from the information theory point of view, removing the redundancy or filling it with some additional data are equivalent in the number of bits - the gain is the same; filling the redundancy is more complex from a practical point of view (you merge information from multiple files).
    Also filling the redundancy well in jpeg is quite complex because of quantization and because you control the coefficients while the restriction is e.g. max 256 values per color of a pixel - doing it optimally seems a tough optimization problem.
    Due to these difficulties, besides being more convenient for applications, repacking should allow you to save more bits than you could practically hide in this redundancy.
    And generally I don't see a reason to go through FFT to pixel values while repacking - the damage was already done in quantization, and the best information is carried in the coefficients, as Fourier is close to Karhunen–Loève transform. The best we can do is to exploit the correlations between them and use an accurate entropy coder.

    Regarding storing unordered lists, it is relatively simple to write these lg(n!) bits in the order, but again it's more complex for applications, and it is not a problem to just subtract these lg(n!) bits.
    These methods allow approaching the optimum - assuming that you use the real probability distribution and an accurate entropy coder.
    For delta coding, if you use a fraction p of the N positions, the delta is given by a geometric distribution:
    Pr(next - prev = x) = p * (1-p)^(x-1)
    For a binary tree, each node knows the total number of leaves of its subtree (say N) and you write how many go left (say K) - then the probability distribution comes from the binomial:
    Pr(K) = binomial(N,K) / 2^N
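
    For a feel of the numbers, a small C++ sketch comparing the ideal cost of the unordered-set encoding with delta coding under the geometric model above (the value-space size N below is a made-up assumption; n = 354950 is the english_SFC_s.dic word count quoted later in the thread):
    Code:
    #include <cmath>
    #include <cstdio>
    
    // lg(x!) via the log-gamma function
    double lg_fact(double x) { return std::lgamma(x + 1.0) / std::log(2.0); }
    
    int main() {
        double n = 354950;          // number of words
        double N = n * 10;          // assumed size of the value space (hypothetical)
        double p = n / N;           // density
    
        double set_bits   = lg_fact(N) - lg_fact(n) - lg_fact(N - n);   // lg C(N,n)
        // expected cost of one gap = entropy of the geometric distribution with parameter p
        double gap_bits   = (-(1 - p) * std::log2(1 - p) - p * std::log2(p)) / p;
        double order_bits = lg_fact(n);                                  // lg(n!)
    
        std::printf("unordered set : %.0f bits\n", set_bits);
        std::printf("delta coding  : %.0f bits (n gaps)\n", n * gap_bits);
        std::printf("order penalty : %.0f bits = lg(n!) if the order were stored too\n", order_bits);
        return 0;
    }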

  11. #8 - Shelwien (Administrator)
    > filling the redundancy is more complex from a practical point of view (you merge
    > information from multiple files).

    You're wrong here. Writing stegdict took me 7 hours.
    Writing a version of ppmd optimized for dictionaries - i.e. not predicting the same prefix
    in the context of the previous word's suffix, right-to-left prediction of the word's suffix, masking
    of symbols which can't appear due to the ascending sort order, etc. - would be much more work,
    lots of different things instead of just one, and it's possible that I wouldn't be able to do it
    in the end, because ppmd parsing is not very convenient for some of these things.

    > Also filling the redundancy well in jpeg is quite complex because of quantization

    I don't see why. I do my steganography via arithmetic decoding - i.e. I build a probability
    distribution where only the available coef values have non-zero probability, and then "decode"
    the value from the payload with a rangecoder.
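
    Roughly like this (a toy C++ sketch of the idea, not stegdict's actual coder, which uses a 128-bit range coder; the probabilities here are dyadic, so the "decoding" degenerates into reading a prefix code and the example stays exact - arbitrary probabilities would need the real range coder):
    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    
    struct Choice { int value; int nbits; uint32_t code; };               // hypothetical "available values"
    static const Choice table[] = { {10, 1, 0}, {20, 2, 2}, {30, 2, 3} }; // codes 0, 10, 11 -> probs 1/2, 1/4, 1/4
    
    // embed: "decode" one allowed value from the payload bitstream
    int embed(const std::vector<int>& bits, size_t& pos) {
        for (const Choice& c : table) {
            uint32_t code = 0;
            for (int i = 0; i < c.nbits; ++i)
                code = (code << 1) | (pos + i < bits.size() ? bits[pos + i] : 0);
            if (code == c.code) { pos += c.nbits; return c.value; }
        }
        return table[0].value;  // unreachable: the code set is a complete prefix code
    }
    
    // extract: "encode" the chosen value back into payload bits
    void extract(int value, std::vector<int>& bits) {
        for (const Choice& c : table)
            if (c.value == value)
                for (int i = c.nbits - 1; i >= 0; --i)
                    bits.push_back((c.code >> i) & 1);
    }
    
    int main() {
        std::vector<int> payload = {1,0,1,1,0,0,1,0};  // bits to hide
        std::vector<int> cover, recovered;
        size_t pos = 0;
        while (pos < payload.size()) cover.push_back(embed(payload, pos));  // pick cover values
        for (int v : cover) extract(v, recovered);                          // recover the payload
        for (size_t i = 0; i < payload.size(); ++i)
            std::printf("%d/%d ", payload[i], recovered[i]);                // pairs must match
        return 0;
    }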

    > Due to these difficulties, besides being more convenient for applications,
    > repacking should allow you to save more bits than you could practically
    > hide in this redundancy.

    It might seem so at first glance, but in practice the results are very different.
    stegdict here allows compressing dictionaries better without having to modify
    the codecs, and stegzip actually showed that we can split a file in two parts:
    book1=book1a+book1b, and with just stegzip and zip compression get a smaller
    file on output, i.e. stegzip(book1a.zip,book1b.zip)<book1.zip.
    (Stegzip uses LZ77 redundancy to store the payload, so it's explainable.)

    In other words, it's exactly what I'm trying to show here - that filling the redundancy
    is commonly much easier than making use of it in a CM model.

    > And generally I don't see a reason to go through FFT to pixel values while
    > repacking - the damage was already done in quantization, the best
    > information is carried in the coefficients, as Fourier is close to
    > Karhunen–Loeve transform. The best we can do is to exploit correlations
    > between them and use an accurate entropy coder.

    Code:
    // no recompression
    A10.jpg           842,468 // source file
    A10.7z            831,920 // 7z a -m0=lzma:lc=8:pb=0 A10 A10.jpg
    A10.rk            812,700 // WinRK 3.1.2 MAX (PWCM)
    
    // recompression working with FFT coefs
    A10.jpg.paq8l     698,509 // old model, which just replaces entropy coding
    A10.lep           688,273 // lepton
    A10.pjg           673,197 // packjpg 2.5
    
    // recompression working with pixels
    A10.sitx          638,544 // stuffit 14
    A10.jpg.paq8p     638,130 // new jpeg model by Jan Ondrus, which uses decoded pixels as context
    A10.jpg.paq8pxd18 636,817
    A10.lea           635,427 // EMMA 0.1.15

    > These methods allow approaching the optimum - assuming that you use the
    > real probability distribution and an accurate entropy coder.

    You're right in the dictionary case, although writing a good dictionary model is
    still much harder work than stegdict.
    And you're right that, overall, it's always better to have a model which would
    directly take into account all the relevant correlations in the data.

    But in the case of lossy formats like jpeg and mp3, I think it would be 100x harder
    to write a model which would fully make use of pixel correlations (including block matches etc.)
    to compute the probability distribution of FFT coef values.

    Meanwhile, lossless bmp/wav coders already exist, and computing pixel intervals for
    ex-jpeg steganography would be relatively easy. In fact, I'd expect it to be
    much harder to reconstruct exactly the same .jpg from a .bmp extracted from it,
    compared to the steganography itself.

  12. Thanks:

    Jarek (5th September 2016)

  13. #9 - Member (Argentina)
    On the GitHub releases, 'wrapper.exe' is compiled for x64 processors, rendering the whole system unusable on 32-bit machines. Please mind releasing it as a 32-bit application if possible so more people can test it.

  14. #10 - Shelwien (Administrator)
    Sorry, but stegdict itself is only compatible with x64 due to its 128-bit rangecoder atm.
    As for wrapper and ppmd_sh - you can get 32-bit versions in /u.

    I have an __int128 simulation library though - maybe I'll try to build stegdict with it for x86.

  15. #11 - Shelwien (Administrator)

  16. Thanks (3):

    Gonzalo (5th September 2016),Mike (5th September 2016),RamiroCruzo (21st June 2017)

  17. #12 - schnaader (Programmer)
    Very interesting concept (and proof-of-concept)!
    A minor remark, though: Validation of the dictionary might be useful because of these two scenarios:

    1. Unsorted list or different sort order
      Example: swap "1080" and "10th" in english_SFC_s.dic
      Code:
      Comparing the extracted files with originals (english_SFC_u.dic, ENGLISH_SFC_S.DIC)
      00000002: 38 74
      00000003: 30 68
      00000007: 74 38
      00000008: 68 30
      
      Comparing outfile3 and WRAPPER.EXE
      0000314B: 12 48
      0000314C: CC 8D
      0000314D: 75 05
      0000314E: C2 6E
      0000314F: AB FA
      00003150: FF 80
      Basically, this is a "shit in, shit out" scenario. If the input dictionary isn't sorted, don't expect it to work. Anyway, since the definition/expectation of sort order can differ in edge cases ("2"/"2nd", "1080"/"10th", sorting Unicode instead of ASCII, ...), validating the input and preventing further processing early in the script (like stegdict returning an error code) might be useful - see the validation sketch after this list.

    2. Duplicates
      Example: clone entries in english_SFC_s.dic ("duplicate" => 3 x "duplicate")
      Code:
      Comparing the extracted files with originals (english_SFC_u.dic, ENGLISH_SFC_S.DIC)
      No differences
      
      Comparing outfile2 and LZMA.EXE
      0000C2E5: 52 75
      0000C2E6: C5 47
      0000C2E7: 90 8B
      0000C2E8: B5 C3
      0000C2E9: C8 C1
      0000C2EA: 0A E0
      This one is slightly different: you won't notice any difference when only comparing the dictionaries. Even worse, there's a 1/(word count!) chance that comparing the payload will also succeed - as it does in this specific case if we only do ("duplicate" => 2 x "duplicate")! So things seem fine with one payload, but the data gets corrupted with another.
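
    A minimal validation sketch along these lines (my suggestion, not part of stegdict; it uses plain byte-wise comparison, i.e. the order "sort" produces without locale settings):
    Code:
    #include <fstream>
    #include <iostream>
    #include <string>
    
    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: checkdict <dictionary>\n"; return 2; }
        std::ifstream in(argv[1], std::ios::binary);
        std::string prev, line;
        long n = 0;
        while (std::getline(in, line)) {
            ++n;
            if (n > 1 && !(prev < line)) {   // catches both wrong order and duplicates
                std::cerr << "not sorted/unique at line " << n << ": \"" << line << "\"\n";
                return 1;                    // error code, so a script can stop early
            }
            prev.swap(line);
        }
        return 0;
    }
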
    http://schnaader.info
    Damn kids. They're all alike.

  18. #13 - schnaader (Programmer)
    Having a second look at the scripts, I wondered about this:

    Code:
    wrapper.exe a testfile  stegdict.exe lzma.exe wrapper.exe english_SFC_s.lzm
    [...]
    wrapper.exe x testfile_out outfile1 outfile2 outfile3
    [...]
    rem fc /b testfile testfile_out
    fc /b outfile1 stegdict.exe
    fc /b outfile2 lzma.exe
    fc /b outfile3 wrapper.exe
    As I understand wrapper.exe, it puts all the files together into one file. In this case, 4 files are combined, but only 3 of them are extracted and compared.

    Extracting outfile4 and comparing it with english_SFC_s.lzm (or uncommenting "rem fc /b testfile testfile_out") leads to:
    Code:
    Comparing testfile and TESTFILE_OUT
    000B816C: 37 38
    testfile is longer than TESTFILE_OUT (1,013,402 bytes instead of 754,029)
    The size of TESTFILE_OUT is used in the gain calculation (lzma: 681194 bytes instead of 790664) - adding the missing 1013402-754029=259373 bytes would lead to a much worse result. I would've expected these files to be exactly the same, but this might also be a misunderstanding of the concept as the "real" payload seems to be outfile1-3 only.
    http://schnaader.info
    Damn kids. They're all alike.

  19. #14 - Shelwien (Administrator)
    1. The first post says: "Note: it expects the dictionary to be the output of "sort -u" basically - ascending order without duplicate words."
    I can include sorting at the encoding stage, but it would make testing slower without much reason.
    Also, I may eventually add support for permutations of repeated words and partially sorted inputs, so I don't see much sense in doing this now.

    2. The wrapper utility is (imho) an interesting thing in itself, because it allows concatenating/splitting files without writing any file sizes into
    the merged file. Instead it uses masking and signatures, kinda like jpeg, but the signature is 4 bytes here, so normally no extra bytes are added
    (see the sketch after this list). This library is actually used in the recent version of reflate.
    I also used it for things like merging all the sources into a single file, then running regexps on the merged file, then splitting them back -
    you can't do that with tar or rar -m0 or even shar output.

    3. The size of testfile_out is the size of the payload that fits in the wordlist - you can independently verify this by comparing the testfile_out size
    with Log[256,354950!] via wolframalpha or something. Its last byte is a bit unstable (due to lack of bits), yeah, but that's it.
    Also, yes, the size of testfile_out is used in the calculation of the compression gain, and that is correct, because it's the original testfile
    that is larger, so if it fit completely, the gain would be better by those 260k, not worse.

    Anyway, I added the 4th file just to fully pad the payload size with random data, because if you just specify a file
    which is much smaller than the payload capacity, then the permutation gets much less random, and compression improves.
    For example, if we remove the .pmd file from the "wrapper a" line, the ppmd result becomes 402,298 [!].
    But if we then use the size of testfile, which is smaller and is the real payload size in this case,
    the result becomes 858,841, which is still smaller than the original, but not so interesting.

    In short, the padding was used just to create a payload file larger than the wordlist capacity,
    because I didn't have a set of test files exactly matching the capacity.

    4. Also, an important thing is this - http://www.wolframalpha.com/input/?i...00000000%7D%5D
    It's a graph that shows that the ratio of payload size to original filesize grows with the wordlist filesize.
    I interpret it as: we'd get better compression with larger wordlists.
    To be specific, for english_SFC_s.dic in the archive the ratio is 754028/3712480=0.203106, i.e. 20%,
    while for a 100G file the ratio would be 40% (and 30% for a 500M file).
    Btw, I tested stegdict with >1G password lists, and it seems to work fairly fast.
    So I'd expect it to let ppmd beat paq at some point :)
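
    Regarding point 2, here's a toy C++ illustration of the signature idea (my own simplified framing, not the actual wrapper format - the signature and marker bytes are made up): files are separated by a 4-byte signature, and any occurrence of the signature inside file data gets "masked" with an extra byte, so splitting needs no stored sizes.
    Code:
    #include <string>
    #include <vector>
    
    static const std::string SIG = "\xF7WRP";  // hypothetical 4-byte signature
    static const char BOUND = '\x00';          // byte after SIG: real file boundary
    static const char MASK  = '\x01';          // byte after SIG: masked data occurrence
    
    std::string join(const std::vector<std::string>& files) {
        std::string out;
        for (const std::string& f : files) {
            for (size_t i = 0; i < f.size(); ) {
                if (f.compare(i, SIG.size(), SIG) == 0) { out += SIG; out += MASK; i += SIG.size(); }
                else out += f[i++];
            }
            out += SIG; out += BOUND;          // boundary, once per file
        }
        return out;
    }
    
    std::vector<std::string> split(const std::string& in) {
        std::vector<std::string> files(1);
        for (size_t i = 0; i < in.size(); ) {
            if (in.compare(i, SIG.size(), SIG) == 0) {   // a well-formed stream always has a tag byte here
                if (in[i + SIG.size()] == BOUND) files.emplace_back();  // start the next file
                else files.back() += SIG;                               // masked: the signature was data
                i += SIG.size() + 1;
            } else files.back() += in[i++];
        }
        files.pop_back();   // the stream ends with a boundary, so the last entry is empty
        return files;
    }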

  20. Thanks:

    Mike (5th September 2016)

  21. #15 - Shelwien (Administrator)
    https://github.com/Shelwien/stegdict/releases
    Code:
    1. Added sorting of input file before encoding, now payload is always restored correctly
    2. Added support for duplicate strings in the dictionary
    3. Cut off last byte from extracted payload - it was usually only partially filled anyway
    4. Added support for fixed-width records (eg. binary files). 
    Now "stegzip c4 input output payload" would encode a binary file with 4-byte records. 
    Note that record width in this case also has to be specified for decoding, like "d4".
    5. Added support for "batch mode" - commandline would only contain a config file and payload file, 
    while "config file" is a list of pairs of input and output dictionary names, eg:
    stegdict c @list payload
    and list is:
    #4:fp_log\fp_tim.bin fp_mut\fp_tim.bin
    fp_log\fp_get.txt fp_mut\fp_get.txt
    "#4:" here means to use fixed record width 4 for this file.
    http://www.maximumcompression.com/data/log.php
    Code:
    Testing compression of the deconstructed fp.log from maximumcompression's SFC
    
    Compressing fp_log
    Running stegdict encoding
    Compressing fp_mut
    Running stegdict decoding
    fp_orig.arc:      size = 342778
    fp_steg.arc:      size = 428678
    payload_u:         size = 96293
    fp_steg.arc-payload_u  = 332385
    fp_steg.arc-fp_orig.arc = 85900
    
    So using this method, we can compress the dictionary with lzma to 332385 bytes instead of 342778
    although we'd need other files in archive to use as steganography payload
    The fp.log thing is imho a good example of how and where sorted dictionaries can appear from seemingly unrelated files:
    http://nishi.dreamhosters.com/u/fplog_demo_0.rar

    The fp.pl script in the archive above splits the 20M fp.log (an apache log file) into several dictionaries (one per type of field in a log line),
    plus the main binary structure with dictionary indexes for each field.
    Code:
    fp.bin     - main log structure, 16 bytes per line (actually 12 with zero padding for lzma)
    fp_ip.bin  - ip addrs in binary form, 4 bytes per record
    fp_tim.bin - unixtimes, 4 bytes per record
    fp_ref.txt - referer strings, LF-terminated
    fp_ua.txt  - user-agent strings, LF-terminated
    fp_url.txt - site paths from http request + http status + page size
    This decomposition already reduces the fp.log size from 20,617,071 to 2,333,154,
    and then stegdict helps a little more.
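
    A minimal C++ sketch of the same decomposition idea (my simplification, not the actual fp.pl: fields are just split on whitespace, and the output names/layout are made up). The indexes refer to the sorted dictionaries, which is exactly what stegdict then permutes and the decoder restores by sorting.
    Code:
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>
    
    int main() {
        std::vector<std::string> lines;                      // buffer the whole log (fine for a sketch)
        for (std::string s; std::getline(std::cin, s); ) lines.push_back(s);
    
        // pass 1: per-column dictionaries; std::map keeps them sorted and unique
        std::vector<std::map<std::string, uint32_t>> dicts;
        for (const std::string& line : lines) {
            std::istringstream ss(line);
            std::string field;
            for (size_t col = 0; ss >> field; ++col) {
                if (col >= dicts.size()) dicts.emplace_back();
                dicts[col][field] = 0;
            }
        }
        for (auto& d : dicts) {                              // assign indexes in sorted order
            uint32_t id = 0;
            for (auto& kv : d) kv.second = id++;
        }
    
        // pass 2: the main structure = one fixed-width index per field
        std::ofstream idx("fp.bin", std::ios::binary);
        for (const std::string& line : lines) {
            std::istringstream ss(line);
            std::string field;
            for (size_t col = 0; ss >> field; ++col) {
                uint32_t id = dicts[col][field];
                idx.write(reinterpret_cast<const char*>(&id), sizeof(id));
            }
        }
        for (size_t col = 0; col < dicts.size(); ++col) {    // dump the sorted, unique dictionaries
            std::ofstream d("fp_field" + std::to_string(col) + ".txt");
            for (const auto& kv : dicts[col]) d << kv.first << '\n';
        }
        return 0;
    }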

  22. Thanks (2):

    inikep (6th February 2017),schnaader (6th September 2016)

  23. #16 - Shelwien (Administrator)
    Here's a practical application of stegdict for program watermarking:
    http://nishi.dreamhosters.com/u/steg...termark_v0.rar

    It means that you can compile a program with some inconspicuous code inside,
    and then that code can be extracted from the .exe and the watermark automatically decoded from it,
    allowing you to e.g. detect the source of a leak by compiling the .exe separately for each user.

    All the temp files are in the archive, but actually testing it requires a C++ compiler (tested with VS and gcc/mingw) and perl:
    Code:
    Extracting constants from md5.inc
    Adding a watermark to constants
    Sorting 64 words. Done
    Compiling the program = test.exe
    Disassembling the program = test.asm
    Extracting the constants from test.asm
    Decoding the watermark from extracted constants
    Sorting 64 words. Done
    Printing the watermark:
    
    
    (c) Eugene Shelwien, 2016

