
Thread: Bytewise vs Binary

  1. #1
    Shelwien (Administrator)
    So, I've been experimenting with the design of a new coder,
    and got somewhat confusing results, so comments are quite welcome.

    (results table attached as an image in the original post)

    Well, at least I've got some proof for my claim that a bytewise
    coder is better for text compression, since texts are built from
    symbols, and the symbols' specific binary codes don't mean much.

    But still, my own binary models apparently have better overall
    results - they're only insignificantly worse on texts, but much
    better on binaries... and could probably be improved further by
    separately adjusting the update speed for different bit positions.
    Thus, I can only guess that intersymbol correlations are actually
    not as weak as I thought... meaning that not only the probability
    of the just-encountered symbol has to be incremented - it's useful
    to increment some other probabilities as well, and binary encoding
    does that, although maybe not in an optimal way.

    Also, there was an additional surprise - my PPM-like implementation,
    with inheritance and all (with a special symbol in the alphabet for
    escaping to a lower order), turns out worse than the "dumber"
    alternative, where the escape is coded separately as a binary flag.
    I guess that could be explained by the additional parameters for
    escape encoding, which can be tuned too.

    To sum this up, bytewise coders are very complex and really slow at that,
    and now I've even got (unnecessary) experimental proof that binary
    coders are better.
    But, unfortunately, there are problems.
    1. A context needs significantly more counters on average, and if one
    wants to avoid allocating 255 counters for each context right after its
    first occurrence, the only way is hashing... which isn't convenient for
    lots of tasks.
    2. There's no way around unary coding when SSE is to be used.
    It's pretty obvious that the last symbol encountered in a context has
    to be processed in a special way (binary contexts etc.), and this
    applies to other ranks as well (see the sketch after this list).
    3. With text modelling, there are lots of cases where the alphabet
    should be restricted. For example, consider a custom model for
    preprocessed text, where a reduced charset is guaranteed by the preprocessor.
    Of course, the codes of allowed symbols could be mapped to a 0..n interval,
    but binary encoding still wouldn't be that convenient with n != 2^k,
    and there might be cases where the charset has to be switched dynamically
    while preserving statistics (e.g. a model for a sorted dictionary).
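
    To be concrete about the unary/MtF scheme from point 2, here is a minimal
    sketch - the counter type, update rate and escape handling are made up for
    illustration, it's not the actual coder:
    Code:
    #include <cstdint>
    #include <vector>

    struct BinCounter {                 // toy shift counter, p1 = p/4096
      int p = 2048;
      void update(int bit) { p += ((bit << 12) - p) >> 5; }
    };

    struct ContextNode {
      std::vector<uint8_t>    syms;     // symbols seen in this context, MtF order
      std::vector<BinCounter> flags;    // one "is it this symbol?" flag per rank

      // Collects the unary flags 0,0,...,1 (hit) or all zeros (escape).
      // A real coder would feed each flag through SSE and the range coder here.
      bool encode(uint8_t c, std::vector<int>& bits) {
        for (size_t r = 0; r < syms.size(); ++r) {
          int hit = (syms[r] == c);
          bits.push_back(hit);
          flags[r].update(hit);
          if (hit) {                    // move-to-front on a hit
            syms.erase(syms.begin() + r);
            syms.insert(syms.begin(), c);
            return true;
          }
        }
        syms.insert(syms.begin(), c);   // escape: new symbol enters at rank 0
        flags.emplace_back();
        return false;                   // caller codes the escape separately
      }
    };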

    Well, any ideas on how to solve these contradictions?

  2. #2
    toffer (Programmer)
    Quote Originally Posted by Shelwien
    1. A context needs significantly more counters on average, and if one
    wants to avoid allocating 255 counters for each context right after its
    first occurrence, the only way is hashing... which isn't convenient for
    lots of tasks.
    Why not use some kind of tree? Hook uses some kind of tree to store the DMC state machine, if I remember correctly. And it is fast, too! You can also do a "manual" decomposition (a flat decomposition; this only works for a few symbols under a context):

    If you've got 3 symbols:

    0010110 with pa
    0011000 with pb
    0011101 with pc

    partially coded: 001x....

    you can add their probabilities together, as long as a symbol's code matches the partially coded prefix:

    p for a one bit is: pb+pc

    So if the next bit is a 1, you only use pb and pc to estimate the next probability. This is an idea I had while looking for higher order models for my context mixer. I haven't implemented it yet, though.
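
    To make sure I read this right, here's a tiny self-contained demo of that
    flat decomposition - the three codes and probabilities are the ones from
    the example above, everything else (names, values) is just illustration:
    Code:
    #include <cstdio>

    int main() {
      const int N = 3, BITS = 7;
      const unsigned code[N] = {0b0010110, 0b0011000, 0b0011101};  // a, b, c
      double p[N] = {0.5, 0.3, 0.2};                               // pa, pb, pc
      unsigned target = code[2];                                   // encode 'c'

      bool alive[N] = {true, true, true};          // does the prefix still match?
      for (int i = BITS - 1; i >= 0; --i) {
        double p1 = 0, tot = 0;
        for (int s = 0; s < N; ++s)
          if (alive[s]) {
            tot += p[s];
            if ((code[s] >> i) & 1) p1 += p[s];
          }
        int bit = (target >> i) & 1;
        printf("bit=%d  p(1)=%.3f\n", bit, p1 / tot);  // would go to the coder
        for (int s = 0; s < N; ++s)                    // drop mismatching symbols
          if (alive[s] && (((code[s] >> i) & 1) != (unsigned)bit)) alive[s] = false;
      }
    }
    After the shared 001 prefix, p(1) comes out as pb+pc, exactly as described;
    bits where all remaining codes agree get probability 0 or 1, which a real
    coder would smooth or skip.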

    Quote Originally Posted by Shelwien
    2. There's no way around unary coding when SSE is to be used.
    Doesn't any binary scheme apply here?

    Quote Originally Posted by Shelwien
    It's pretty obvious that the last symbol encountered in a context has
    to be processed in a special way (binary contexts etc.), and this
    applies to other ranks as well.
    Yep. I played with probability adjustments and used some rank-0 contexts, plus a flag for whether the coded symbol matches the rank-0 symbol. The result was a significant compression gain.

    Variable length codes can be used to make a binary representation more or less efficient for alphabets with n!=2^k.

    Well, at least I've got one idea which might lead to something useful.

    BTW: I'm a bit confused about the conflict. A binary representation is equivalent to one using any other base, since every number can be represented either way. It's the reason for Turing machines having binary input, transitions, etc... So why does it show significant differences when compared to bytewise coding?!
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  3. #3
    Shelwien (Administrator)
    >> 1. A context needs significantly more counters on average, and if one
    >> wants to avoid allocating 255 counters for each context right after its
    >> first occurrence, the only way is hashing... which isn't convenient for
    >> lots of tasks.
    >
    > Why not use some kind of tree?

    A tree for a dynamic alphabet requires some pointers.
    Also it's troublesome to balance etc.
    And either way I'd need 8 counters for a single symbol... at least if I tried
    to simulate the binary (or should I say bitwise) model.

    > Hook uses some kind of tree to store the dmc state machine,
    > if i remember correctly. And it is fast, too!

    Well, I'm not currently interested in that.
    It's mostly a matter of personal choice - I want to use the best
    available design for every building block,
    even though it wouldn't necessarily improve overall compression.

    >> 2. There's no way around unary coding when SSE is to be used.
    >
    > Doesn't any binary scheme apply here?

    Of course, but unary decomposition is the most general, and simple at that.

    > BTW: I'm a bit confused about the conflict.

    I'll repeat the conditions:
    1. Symbols in contexts must be sequentially accessible, with probabilities
    and a masking option. Basically, the target is unary coding with MtF ranking
    for a start (it's not the best, but I don't yet know how to implement the
    optimal ranking in realtime).
    2. The size of a context node should reflect the number of symbols encountered
    in the context.
    3. Hashes for statistics storage are not an option, because they have too
    many restrictions. Trees/lists are possible, though.
    4. The compression level should be the same as or better than my bitwise coders.

    > A binary representation is equal to
    > the one using any other base, since every number can be represented like that.
    > It's the reason for Turing machines having binary input, transitions, etc...
    > So why does it show significant differences when compared to bytewise coding?!

    Well, by "binary" I meant specifically bitwise, not just any binary.
    And this applies not to storage details, but to modelling.
    In bytewise model you increase the probability of encountered symbol and
    decrease all others. And if you split the alphabet into two subsets and
    encode the alphabet selector separately from the symbols, the effect already
    would be different (of course, with nonstationary counters) - technically,
    here probabilities of _all_ symbols in selected alphabet would be increased
    comparing to other one. Thus, every alphabet decomposition means a different
    update function.
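
    A toy numeric illustration of that update difference (the 4-symbol alphabet,
    the split and the update rate are all made up):
    Code:
    #include <cstdio>

    int main() {
      // alphabet {A,B | C,D}: one selector counter plus one counter per half
      double sel = 0.5, inA = 0.5, inC = 0.5;  // p(first half), p(A|half1), p(C|half2)
      const double rate = 0.2;                 // toy nonstationary update rate

      auto show = [&](const char* t) {
        printf("%s  A=%.3f B=%.3f C=%.3f D=%.3f\n", t,
               sel * inA, sel * (1 - inA), (1 - sel) * inC, (1 - sel) * (1 - inC));
      };
      show("before");
      sel += rate * (1 - sel);                 // saw 'A': the first half is bumped...
      inA += rate * (1 - inA);                 // ...and 'A' within it
      show("after ");
      // p(B) ends up above p(C) and p(D): the whole selected half gained
      // relative to the other half, which a plain bytewise update wouldn't do.
    }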

    However, I'm not quite sure why the bitwise coder is so much better for
    executable compression while being quite close on texts. It may just be a matter
    of updating the probabilities of symbols other than the encountered one,
    or the nonlinearity of the update of composite symbol probabilities may be significant too.

  4. #4
    toffer (Programmer)
    I can't imagine any other data structures than the ones used by PPM implementations. Don't you already use some kind of suffix tree in ash?

    I wonder if ppmonstr uses unary decomposition?

    With my early CMM implementations I tried a simple fixed variable-length code (of lengths 4, 8 and 12 bits), which boosts compression ratio and speed on redundant files but hurts on others; I was quite surprised that the compression ratio was affected so significantly - which could be a side effect of bitwise modelling. I think unary decomposition of (MtF/symbol) ranks is superior here, since you assign adaptive variable-length codes depending on the context, which are "usually" shorter than 4 bits.
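
    The exact layout of that 4/8/12-bit code isn't given above, so purely for
    illustration, one possible construction over frequency ranks could look like
    this (the split points 14/30 are arbitrary; this is not necessarily how CMM
    assigned its codes):
    Code:
    #include <vector>

    // rank 0 = most frequent symbol; appends the code bits, MSB first.
    void emit_vlc(int rank, std::vector<int>& bits) {
      auto push = [&](unsigned v, int n) {
        for (int k = n - 1; k >= 0; --k) bits.push_back((v >> k) & 1);
      };
      if (rank < 14)      push(rank, 4);                        // 4-bit code
      else if (rank < 30) { push(14, 4); push(rank - 14, 4); }  // escape + 4 = 8 bits
      else                { push(15, 4); push(rank - 30, 8); }  // escape + 8 = 12 bits
    }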

    Quote Originally Posted by Shelwien
    1. Symbols in contexts must be sequentially accessible, with probabilities
    and masking option. Basically, the target is unary coding with MtF ranking
    for a start (it's not the best, but I don't yet know how to implement the
    optimal ranking in realtime).
    Optimal ranking? Do you mean ranking by probability?

    And what about your counters - do you still want to use them? Storing probabilities requires an update of every symbol, compared to "counting" counters like n_symbol+1. And again, how does ppmonstr handle this issue?

  5. #5
    Shelwien (Administrator)
    > I can't imagine any other data structures than the ones used by ppm implementations.

    The problem is the model design, not the data structures.
    For a long time I thought that a combinatoric alphabet model is OK
    (ppmonstr successfully uses it and all), and it really should be
    (and is) optimal on texts from the theoretical point of view.
    But my recent experiments show that even a simple binary decomposition
    performs better than the alphabet model. I guess that's simply compensated
    for by the secondary model in ppmonstr etc. (also slim, epm and ash).
    So now I think that I have to find a good primary model first, before
    moving further.

    > Don't you already use some kind of suffix tree in ash?

    Yes; and I wish to continue using something like that.
    But if the bitwise model really turns out to be optimal, I'd just
    have to switch to a hash, because I don't see any effective
    alternatives now.

    > I wonder if ppmonstr uses unary decomposition?

    Of course it does.
    There's just no way around it in text compression.
    I mean, look at how durilca has beaten paq8 on enwik9,
    despite being 2-3 generations older in the techniques used
    (byte frequencies with blockwise rescaling vs
    paq's states, no interpolation in SSE, it's PPM rather than CM,
    and there are no text specifics in the model).

    > With my early CMM implementations I tried a simple fixed variable-length code
    > (of lengths 4, 8, 12 bits) which boosts compression ratio and speed on redundant
    > files, but hurts on others; I was quite surprised that compression ratio was
    > affected significantly - which could be a side effect of bitwise modelling.

    Of course the decomposition scheme is important - it completely changes the
    model's behavior. You can even visualize this by comparing e.g. paq8's and ppmonstr's
    output when decoding broken text archives - ppmonstr would generate
    random _text_ for much longer than paq8, whose output becomes completely random really fast,
    despite the explicit text support in its model.
    Btw, this basically means that paq8 is redundant.

    > I think unary decomposition of (mtf/symbol) ranks is superior here, since you
    > assign adaptive variable length codes dependent on contexts, which are "usually"
    > shorter than 4 bits.

    I agree, but there's a problem with this.
    A unary model with fixed symbol ranking would be OK in compression, but
    really slow on unexpected data.
    And with dynamic ranking I'd have to recalculate the probabilities every
    time the ranking changes, which is again slow, and also hurts compression
    due to limited precision.
    Also, I don't see any sense in increasing the probabilities of all higher
    ranked symbols except rank 0 and decreasing all lower ranked ones, and that's what
    would happen if I used separate binary counters for each symbol.

    Also, experiments show that it's useful to increase not only the probability
    of the encountered symbol on update, but others as well.
    So I'm thinking about using an order-1 model with the rank-0 symbol as context
    for an update distribution, or something like that.

    >> 1. Symbols in contexts must be sequentially accessible, with probabilities
    >> and masking option. Basically, the target is unary coding with MtF ranking
    >> for start (its not the best, but I don't yet know how to implement the
    >> optimal ranking in realtime).
    >
    > Optimal ranking? Do you mean ranking by probability?

    I mean this: http://www.encode.su/forums/index.ph...um=1&topic=602

    Of course, it's simpler with simple targets like speed optimization by
    guessing right on the first try. In such a case we just need to assign
    rank 0 to the most probable symbol - but most probable overall, not only in
    the current context's statistics. So for best precision we'd have to use
    the real output distribution from the main model (which is kind of a contradiction,
    because the model is supposed to use that ranking to calculate the distribution
    necessary for ranking; so we'd need multiple iterations or something), or
    just some simple and separate ranking model if it's really a speed optimization...
    like a move-to-front model.

    But it gets more complex if the target is compression improvement.
    For example, there's some hard-to-measure redundancy due to calculation
    imprecision, so some specific ranking might give better results.
    Or take the case of real unary coding, with separate counters per
    symbol, where all the probabilities after an update depend on the
    current ranking.

    > And what about your counters, do you still want to use them?

    I'd say I want to, but probably won't be able to, if you mean binary counters.

    > Since storing probabilities requires an update of every symbol compared
    > to "counting" counters, like n_symbol+1.

    A scheme with a separate binary counter per symbol has other issues
    besides poor speed, because it affects compression.
    So for now there's no way around symbol frequencies, though mine
    are fixed-point instead of plain integers and the increment is dynamic.

    Actually, I can upload the coders whose results are in the table,
    if you want to see how it's implemented.

    > And again, how does ppmonstr handle this issue?

    Well, in exactly the same way. I mean, it stores the symbol frequencies
    in the tree and divides them by their sum to get the probability.
    However, its only nonstationarity workaround is blockwise rescaling
    of the frequencies, so it's not as adaptive as modern coders.
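
    As a reference point, a heavily simplified version of that scheme (not
    ppmonstr's actual code, just the frequencies-plus-blockwise-rescaling idea;
    in reality only the seen symbols would be stored in the tree node):
    Code:
    #include <cstdint>

    struct FreqContext {
      uint16_t freq[256] = {};
      uint32_t total = 0;

      double prob(uint8_t c) const {              // add-1 smoothing for unseen symbols
        return (freq[c] + 1.0) / (total + 256.0);
      }
      void update(uint8_t c, uint16_t inc = 4) {  // a dynamic increment would go here
        freq[c] += inc;
        total   += inc;
        if (total >= (1u << 14)) {                // blockwise rescaling: halve everything
          total = 0;
          for (int i = 0; i < 256; ++i) total += (freq[i] -= freq[i] >> 1);
        }
      }
    };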

  6. #6
    Shelwien (Administrator)
    Btw, here's my method of visualizing modelling results
    and comparing models. I hope someone finds it useful.

    So, here we have a couple of different order0 coders:
    http://shelwien.googlepages.com/bb_v...0329210720.rar
    Code:
    book1  wcc386 
    436029 411381 // o0alf: alphabet model 
    438861 401395 // o0bin: binary model
    And this is the said visualization (attached as images in the original post):

    These are colored hex views of some wcc386 blocks.
    Brighter colors mean larger associated values: red for negative,
    green for positive. Thus, the left part shows the
    alf-bin symbol code length differences for the block with the best
    performance of my alphabet model (so red cells correspond to
    bytes where the alphabet model is actually better), and the right part is
    bin-alf for the best block of the binary model (so red is where the
    binary model is better).

    Actually these pages were made using HTML, but it seems that IE6
    doesn't show what I want.
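
    In case anyone wants to reproduce it, here's a rough sketch of how such a
    colored hex view can be generated (the scaling and HTML details here are
    made up, it's not the exact script used for the images above):
    Code:
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // len_a/len_b: per-byte code lengths (in bits) from the two coders.
    void hexview(const std::vector<unsigned char>& data,
                 const std::vector<double>& len_a,
                 const std::vector<double>& len_b) {
      printf("<table style='font-family:monospace'>");
      for (size_t i = 0; i < data.size(); ++i) {
        if (i % 16 == 0) printf("%s<tr>", i ? "</tr>\n" : "");
        double d = len_a[i] - len_b[i];            // positive: coder B was better here
        int v = (int)fmin(255.0, fabs(d) * 64);    // brightness ~ |difference|
        printf("<td style='background:#%02x%02x00'>%02X</td>",
               d < 0 ? v : 0, d > 0 ? v : 0, (unsigned)data[i]);  // red / green channel
      }
      printf("</tr></table>\n");
    }

    int main() {                                   // tiny fake example input
      std::vector<unsigned char> data = {0xB9, 0x03, 0x00, 0x00, 0x00};
      std::vector<double> a = {8, 8, 7, 6, 5}, b = {8, 6, 4, 3, 2};
      hexview(data, a, b);
    }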

  7. #7
    toffer (Programmer)
    Here's a thought which might be the reason for the better binary compression. When playing around with context mixing and variable-length codes, I noticed that the assignment of different codes (of the same length) yields different compression, since - by accident - some bits of the code are used more often.
    Now, the characters used in text are mostly a-z, A-Z, some spaces, etc... That means that most of the bytes' numeric values lie in the same range. For binary files the numeric values are much more widespread.
    It might be possible to compensate for some of the loss by alphabet reordering? Clustering together frequently used symbols (assigning codes based on sorting by frequency).
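
    A minimal sketch of that reordering step (just the remapping, under the
    assumption that a first pass over the data is acceptable):
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Returns map[old byte] = new byte, with more frequent symbols getting
    // numerically smaller (and thus prefix-sharing) codes.
    std::vector<uint8_t> freq_reorder_map(const std::vector<uint8_t>& data) {
      uint64_t freq[256] = {};
      for (uint8_t b : data) freq[b]++;

      std::vector<int> order(256);
      std::iota(order.begin(), order.end(), 0);
      std::stable_sort(order.begin(), order.end(),
                       [&](int a, int b) { return freq[a] > freq[b]; });

      std::vector<uint8_t> map(256);
      for (int newcode = 0; newcode < 256; ++newcode)
        map[order[newcode]] = (uint8_t)newcode;
      return map;                          // apply to the data before the coder
    }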

  8. #8
    donkey7 (Member)
    My explanation for the different compression with a bitwise encoder is simpler:

    Suppose we have the following sequence and we are coding it using order-0:

    01 00 00 00 00 01 01 00 01 01 01 00 00 01 10 11 10 11 10 10 11 11 10 11 01 00 00 01 00 01 00 00 00 01

    So you see there are some symbols starting with 0, then some symbols starting with 1, then again some symbols starting with 0.

    In such a situation a bitwise encoder adapts faster (because when encoding the higher bits we're using fewer contexts) and compresses the higher bits of the symbols more efficiently.

    The above idea can be generalized:

    Suppose we have two types of symbols and they are usually clustered, i.e. there are mostly small runs of symbols of one type followed by runs of the other type.

    One type is 0xxx, the other type is 1xxx. So with the sequence:

    0xxx 0xxx 0xxx 0xxx 0xxx 0xxx 0xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 1xxx 0xxx 0xxx 0xxx 0xxx 0xxx 0xxx 0xxx 0xxx 0xxx

    a bitwise compressor would score much better, not only due to more efficient coding of the first bit, but also because it encodes the lower bits more efficiently: after encoding the first bit, the statistics for each type of symbol are independent.

    To sum up, bitwise encoding is more efficient when symbols with common prefixes are likely to occur near each other (i.e. they form small runs - runs of symbols with the same prefix). Symbols with common prefixes should have a common meaning.

    Even ASCII codes have such properties:
    digits start with 0011
    uppercase letters start with 010
    lowercase letters start with 011

    So with super short sequences like:

    eribuieunsidbunvdjbmnadfgjkhrwpuohkxjhb

    a bytewise encoder will output many escape flags and code many symbols using the pure order -1 model, but a bitwise encoder will soon figure out that every symbol starts with 011, so it will compress those bits well, yielding overall compression much better than the bytewise compressor.
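
    A toy comparison supporting this (my own demo; the bytewise side is just a
    Laplace-smoothed order-0 table rather than a real PPM with escapes, and the
    update rate is arbitrary):
    Code:
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>

    int main() {
      srand(1);
      const int N = 40;                        // short run of random lowercase letters
      double byte_bits = 0, bit_bits = 0;

      int freq[256] = {}, total = 0;           // bytewise order-0 frequencies
      double p1[256];                          // bitwise: p(bit=1) per prefix context
      for (auto& p : p1) p = 0.5;

      for (int i = 0; i < N; ++i) {
        int c = 'a' + rand() % 26;

        byte_bits += -log2((freq[c] + 1.0) / (total + 256.0));
        freq[c]++; total++;

        int ctx = 1;                           // binary prefix context, root = 1
        for (int k = 7; k >= 0; --k) {
          int bit = (c >> k) & 1;
          bit_bits += -log2(bit ? p1[ctx] : 1 - p1[ctx]);
          p1[ctx] += 0.2 * (bit - p1[ctx]);    // fast nonstationary update
          ctx = ctx * 2 + bit;
        }
      }
      // on this kind of input the bitwise total comes out noticeably smaller,
      // mostly because the shared 011 prefix is learned after a few symbols
      printf("bytewise: %.1f bits, bitwise: %.1f bits\n", byte_bits, bit_bits);
    }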


    Imo such binary decomposition helps only with low order contexts, say up to order 1 or order 2, as higher orders are used primarily for stationary data, where a bytewise model should perform better.


    n00b said.



    Added:

    The general rule for me is: the higher the order, the more stationary the model; the lower the order, the more non-stationary the model.

    Bitwise models are the most non-stationary,
    bytewise models are in the middle,
    rank-0 models are the most stationary.

    That explains why my rank-0 model for high order contexts is so efficient in TarsaLZP (the overall efficiency of TarsaLZP is low due to poor lower order coders).

  9. #9
    Shelwien (Administrator)
    There's a mistake above, actually:
    red cells on the left view are for symbols which were
    better encoded by the _binary_ model, same as on the right.

    > Here's a thought, which might be the reason for better
    > binary compression.

    Well, it's not that it's _always_ better. The alphabet model is ~0.1%
    better for text even with brute-force parameter optimization
    (434489 vs 434888 on book1).

    But the bitwise model totally wins in adaptation speed - look
    how it switches from green to red on sequences like
    B9 03 00 00 00 (that's MOV ECX,3 btw) - 00 gets much more
    probable after all the similar bits in 03.
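
    A quick way to see that effect in isolation (my own toy demo, not the coders
    from the table): code B9 03 00 00 00 with a prefix-context bitwise order-0
    model and a Laplace-smoothed bytewise order-0 model, and print the cost of
    each byte - the 00 bytes get cheap much faster in the bitwise model, because
    03 already trained most of the all-zero prefix contexts.
    Code:
    #include <cmath>
    #include <cstdio>

    int main() {
      const unsigned char seq[] = {0xB9, 0x03, 0x00, 0x00, 0x00};
      double p1[256];                          // bitwise: p(bit=1) per prefix context
      for (auto& p : p1) p = 0.5;
      int freq[256] = {}, total = 0;           // bytewise order-0 frequencies

      for (unsigned char c : seq) {
        double bitcost = 0;
        int ctx = 1;
        for (int k = 7; k >= 0; --k) {
          int bit = (c >> k) & 1;
          bitcost += -log2(bit ? p1[ctx] : 1 - p1[ctx]);
          p1[ctx] += 0.2 * (bit - p1[ctx]);
          ctx = ctx * 2 + bit;
        }
        double bytecost = -log2((freq[c] + 1.0) / (total + 256.0));
        freq[c]++; total++;
        printf("%02X: bitwise %.2f bits, bytewise %.2f bits\n",
               (unsigned)c, bitcost, bytecost);
      }
    }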

    > When playing around with context mixing
    > and variable-length codes I noticed that the assignment of
    > different codes (same length) yields different compression,

    Yes; btw "alphabet reordering" is frequently mentioned in
    text preprocessing and BWT-related articles.

    > since - by accident - some bits of the code are used more often.
    > So the characters used in text are mostly a-z, A-Z
    > and some spaces, etc... That means that most of the byte's
    > numeric values are in the same range.

    The funny thing is, that's where the alphabet model wins, because it
    never increases the probability of unused symbols, so unused
    ranges are perfectly excluded, while a basic bitwise model
    would keep significant probabilities e.g. for x3A-x3F
    despite encountering only digits (x30-x39).

    > For binary files the numeric values are much more widespread.

    Yes; that's why faster adaptation wins.

    > It might be possible to compensate some loss by alphabet reordering?

    Well, the bitwise model can be improved by alphabet reordering all right,
    but the alphabet model doesn't depend on symbol codes at all - that's
    actually its benefit, as a bitwise compressor won't compress, say,
    English and Russian texts to the same level, even without language-specific tricks.

    > Clustering together frequently used symbols
    > (assign codes based on sorting by frequency).

    No, that's not how it works, at least not for nonstationary data.
    Correct clustering should be performed by context similarity,
    using a special model for context histories, for example.
    But still, the only sure way is to try merging two nodes and
    see how it affects compression.

  10. #10
    toffer (Programmer)
    Quote Originally Posted by donkey7
    symbols with common prefixes should have common meaning.
    This is achieved by frequency sorting: frequently appearing symbols get the same prefixes, so those bits are used more often (quicker adaptation, in your terms). This "reordering" can be interpreted as grouping together symbols with "similar meaning" on an order-0 basis. Symbol ranking could be interpreted as a higher order realization of this. Altogether, some alphabet reordering should improve compression a bit.
