
Thread: Data Distribution Questions.

  1. #1
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Data Distribution Questions.

    Hi all.

    I'm new to forums, and I currently focus on data transformation and minimising the overhead involved.

    I am not fluent in your technical area, so I have many questions.

    My question today relates to... well, I don't know exactly; here is the situation.

    Consider a .jpg file. If you could request that its data be skewed in some manner, what skew, given your knowledge of compression, would be most useful?

    For example, would you like values <127 to be 4x more common than values >127? Or to have the data limited to < xxx?

    As I don't fully understand your area, I would like some input on what manipulation of the data would give the best result for your chosen method.

    Trib.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    generally speaking, when the output of some compression stage is skewed, that skew is used to further improve compression by applying huffman/ari or other algorithms. the output of a good compression algorithm has a non-skewed symbol distribution

    in particular, the jpg and mp3 algorithms use huffman encoding to make the best use of skewed distributions; although there are jpg/mp3 lossless compressors that make even better use of this info, improving compression by 10-25%
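    A quick illustration of the point above: Huffman coding exploits a skewed distribution by giving frequent symbols shorter codes. The builder and the 4-symbol source below are toy examples made up for illustration, not anything from the thread:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Huffman code length per symbol for a {symbol: count} table."""
    # Heap entries: (total count, tiebreak id, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**t1, **t2}.items()}
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# A skewed toy source: 'a' takes 70% of the file.
data = "a" * 70 + "b" * 15 + "c" * 10 + "d" * 5
freqs = Counter(data)
lengths = huffman_code_lengths(freqs)
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(data)
# 'a' gets a 1-bit code, the rare 'c'/'d' get 3-bit codes;
# the average is 1.45 bits/symbol vs. 2 bits for a fixed-length code.
```

    The more skewed the input, the bigger the win; for a perfectly uniform distribution Huffman degenerates to fixed-length codes, which is exactly why well-compressed output (with its near-uniform distribution) cannot be compressed further this way.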

  3. #3
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It does not have to be any specific format - jpg or whatever.

    The main point is that the file to process is already in a fairly entropy-efficient format. A .jpg, for example, has a ~1 KB header and a ~0.5 KB trailer that are not entropy efficient.

    And for you to process it further, given your knowledge, how would you prefer the data to be transformed (accepting some overhead) to make your encoding more efficient?

    Do you know of a tool I can try that does a good job if you tell it the bits per byte you're looking at?

    For example, to make it easier and faster to code and test, I always output my data as whole bytes, regardless of how many bits I actually use. But most tools just produce a larger result, rather than at least dividing the size by the fraction of each byte I actually use.

    So, for example, if I use just 7 bits per byte, the compressor will not touch or match it; if I had just written it out as 7-bit values manually it would be smaller. In fact the compressor makes the output bigger than the input. That just seems all wrong.

    Trib.
    Last edited by Tribune; 24th June 2008 at 21:30.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    @Tribune:
    I tried to read your posts, but failed to decrypt that.
    Can you post in your native language?

  5. #5
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    OK, I'll try.

    I am looking at changing data, just for the fun of it.

    I have tried several simple things, like splitting the data of a byte into X bits and storing each piece as a whole byte (not efficient, I know).

    So for example, if I create a file using just 7 bits per byte - all values <=127 - most compressors don't seem to produce a file that even takes out the unused bit.

    So for an 80k file where 1 bit is wasted in every byte, I'd expect at most 70k of output from the compressor, but they don't seem to work that way?

    Are there any compressors that let you specify details like the range of values in the file you are compressing?

    Or, if you could have the data pre-processed to aid compression, what would be a few ways the data would best be skewed?

    I hope that makes more sense this time.

    Today I am playing with full bytes, but making values <=127 four times more common in the file than 128-255. Do you know of any compressor that would give better results with that sort of distribution?


    Trib
    Last edited by Tribune; 25th June 2008 at 10:04.
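    For reference, the gain available from that exact 4:1 skew can be computed directly from the order-0 entropy (a quick sketch; the 4:1 weighting over byte values is taken from the post above):

```python
from math import log2

# 256 byte values; each of 0..127 is made 4x as likely as each of 128..255.
weights = [4] * 128 + [1] * 128
total = sum(weights)                           # 640
entropy = -sum(w / total * log2(w / total) for w in weights)
# entropy ≈ 7.72 bits/byte: at order 0 the 4:1 skew only saves about
# 3.5% over the raw 8 bits, so on its own it is fairly weak shaping.
```

    This suggests that concentrating probability mass much more sharply (or restricting the value range outright) buys considerably more than a mild 4:1 weighting.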

  6. #6
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    So for an 80k file where 1 bit is wasted in every byte, I'd expect at most 70k of output from the compressor, but they don't seem to work that way?
    Every simple order-0 arithmetic coder will take care of that. You could try fpaq0*: http://cs.fit.edu/~mmahoney/compression/text.html#5586
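    toffer's point can be checked numerically: the ideal order-0 code length of an 80 KiB file restricted to values 0..127 comes out to exactly 7/8 of the raw size. A sketch (fpaq0 itself will land slightly above this ideal due to adaptation and model overhead):

```python
from collections import Counter
from math import log2

def order0_bits(data: bytes) -> float:
    """Ideal order-0 code length in bits (the bound fpaq0-style coders approach)."""
    n = len(data)
    return -sum(c * log2(c / n) for c in Counter(data).values())

# 80 KiB of bytes that only use the 7-bit range 0..127, uniformly.
data = bytes(range(128)) * 640            # 128 * 640 = 81920 bytes = 80 KiB
bits = order0_bits(data)
kib = bits / 8 / 1024
# Exactly 7 bits/byte -> 70 KiB: the wasted top bit is fully recovered,
# which is just what an order-0 arithmetic coder achieves.
```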

  7. #7
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks, Toffer.

    Sorry, I'm just a noob, slowly learning about this subject.

    Trib
    Last edited by Tribune; 25th June 2008 at 12:51.

  8. #8
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't like to do advertising, but I can recommend two books:

    If you don't have any experience, get "The Data Compression Book" by Mark Nelson. It's a very easy-to-understand introduction, but somewhat antique.

    Another good book is "Data Compression: The Complete Reference" by David Salomon. It's easy to understand too, but covers more topics.

  9. #9
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    OK, cool.

    At the moment I'm investigating ways to efficiently separate the parts of a file, with a view to perhaps making part of the data more compressor-friendly.

    That is why I have so many questions for you experienced/skilled people, relating to what sort of data gives better compression results with your best compression programs.

    Trib

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Yeah, it was better this time.

    1. Most compressors are byte-oriented, so they won't work right
    with non-8-bit codes. Actually they're not very good even at handling
    byte-aligned multi-byte symbols, like in Unicode.
    This more or less has to be explicitly supported by the model.

    2. Masking bits in bytes while leaving the 8-bit alignment
    won't reduce the compressed size by the percentage of masked-out bits.
    That's partly because different bits contribute different amounts
    to the compressed code length, but not only.
    There's also the matter of bitwise coding (most recent CM coders are
    bitwise, due to Matt's influence I guess), where the worst case
    would be setting bit 0 to 0 in all bytes.
    And then there's a contextual problem. A masked bit might allow
    the model to differentiate cases which need different modelling, so
    masking it could lead to visibly worse overall compression.
    Another good example of this is removing the spaces from text -
    the file gets significantly smaller, and remains "perfectly readable",
    but CM compressors would compress it worse than the original file,
    which "contains more information".
    As to LZ methods, these are generally less susceptible to such things.
    E.g. the match codes would mostly remain the same even with masked bit(s),
    because only byte-aligned matches are supported.
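    Point 1 above is why bit-packing is normally done explicitly before handing data to a byte-oriented compressor, rather than expecting the compressor to find sub-byte codes. A minimal sketch of packing 7-bit values tightly (illustrative code, not from the thread):

```python
def pack7(values):
    """Pack 7-bit values (0..127) tightly into bytes, MSB first."""
    acc = nbits = 0
    out = bytearray()
    for v in values:
        assert 0 <= v < 128
        acc = (acc << 7) | v          # append 7 bits to the accumulator
        nbits += 7
        while nbits >= 8:             # emit every completed byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                         # flush remaining bits, zero-padded
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

data = list(range(128)) * 8           # 1024 seven-bit values
packed = pack7(data)
# 1024 values * 7 bits = 7168 bits = 896 bytes: exactly 7/8 of the
# byte-per-value representation, before any compressor is even applied.
```

    Note this trades away the byte alignment Shelwien mentions: after packing, symbol boundaries no longer coincide with byte boundaries, so byte-oriented context models and LZ matchers see a scrambled stream.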

  11. #11
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    OK, I think I understand most of that.

    Toffer's link to fpaq is most useful - it does a very good job. (Could a better solution be written if the bit count were known in advance?)

    Do any compression methods make use of weighted distributions in the data? Like... I'll try to explain.

    Can they assign a more efficient code to one chunk of the data, at the expense of the less frequent data in the stream? So rare values actually incur a penalty to encode, but if the data is shaped correctly the overall result will be positive.

    Perhaps an example will help. This time I am dealing with words (2 bytes), not language :P

    If I only use 25k of the possible 64k variations, and of that 25k some may be used 20x more than others, is there anything in the compressors you know of that would suit that arrangement?

    Again, I don't have personal knowledge of the codecs, so I ask.

    Trib.
    Last edited by Tribune; 25th June 2008 at 17:38.
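    The saving available in that 2-byte scenario can be estimated from the entropy. The post doesn't fix the proportions, so the "hot"/"cold" split below is purely hypothetical, chosen just to make the arithmetic concrete:

```python
from math import log2

# Hypothetical split: of 25,000 used codes out of 65,536 possible,
# 1,000 are "hot" and occur 20x as often as each of the other 24,000.
hot, cold = 1000, 24000
weights = [20] * hot + [1] * cold
total = sum(weights)                            # 44,000
H = -sum(w / total * log2(w / total) for w in weights)
# H ≈ 13.5 bits per word under an ideal order-0 coder,
# vs. 16 bits for storing the raw 2-byte words.
savings = 16 - H
```

    This is exactly the situation an adaptive arithmetic/Huffman stage handles automatically: frequent words get short codes, rare words get long ones (the "penalty" asked about above), and unused words cost nothing.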

  12. #12
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Most top-ranked compression programs do some kind of statistical modeling. You should read about PPM:

    http://en.wikipedia.org/wiki/Predict...rtial_Matching
    http://dogma.net/markn/articles/arith/part2.htm
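    The idea behind PPM - predicting each symbol from its preceding context - can be shown in miniature. The sketch below uses a plain adaptive order-1 model with add-one smoothing, which is far cruder than real PPM (no escape symbols, no blending of context orders), but it shows why context modeling beats order-0 statistics on structured data:

```python
from collections import Counter, defaultdict
from math import log2

def order0_bits(data: bytes) -> float:
    """Ideal order-0 code length in bits."""
    n = len(data)
    return -sum(c * log2(c / n) for c in Counter(data).values())

def order1_bits(data: bytes) -> float:
    """Adaptive order-1 model with add-one smoothing: PPM in miniature."""
    ctx = defaultdict(Counter)            # previous byte -> next-byte counts
    bits, prev = 0.0, None
    for sym in data:
        counts = ctx[prev]
        total = sum(counts.values()) + 256    # add-one over the byte alphabet
        bits += -log2((counts[sym] + 1) / total)
        counts[sym] += 1                  # update the model after coding
        prev = sym
    return bits

data = b"abc" * 3000                      # strongly structured input
o0 = order0_bits(data) / len(data)        # log2(3) ≈ 1.585 bits/byte
o1 = order1_bits(data) / len(data)        # context drops this well below 1 bit/byte
```

    Order 0 sees only "three equally common bytes" and can do no better than log2(3) bits each; the order-1 model learns that each byte determines its successor and approaches zero bits per byte as its counts grow.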

  13. #13
    Member
    Join Date
    Jun 2008
    Posts
    26
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Toffer,

    That is exactly what I need. It makes total sense.

    I fear I have a long way to go before I can even think about actually coding such a thing. Chuckle.

    Have you created any PPM .exe I can test with some 2-byte shaped data, please?

    *Edit*: Ahh, 7-Zip's PPMd - found it; I'm looking at it now.

    Trib.
    Last edited by Tribune; 25th June 2008 at 18:07.

  14. #14
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Sorry, I haven't, and I don't know of anything like that, since data is mostly byte-oriented. You can still try a byte-oriented PPM; it will do a good job. Context mixing can also be modified to work on such data.

