Results 1 to 6 of 6

Thread: Vectorized rangecoder

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts

    Vectorized rangecoder

    http://nishi.dreamhosters.com/u/vecrc_v0.rar

    I made and uploaded this thing quite a while ago, but I guess
    its still hard to understand what's it about from this archive.

    The idea is kinda obvious - instead of coding all data bits
    with a single rangecoder sequentially, we can use a separate
    rc instance for each bit, and then update all rcs at once in
    a single loop after processing a byte (or maybe some 2^n bytes).

    The benefits of that are:
    1) inlined rc calls (which usually have 2-3 branches inside)
    disappear from the model.
    2) vector instructions can be used for rc update, especially
    for multiplication in it.
    3) with a large enough number of rc instances it starts making
    sense to run the model and rc updates in different threads.

    And that's implemented in vecrc above, where IntelC was able
    to automatically vectorize the rc update loop, including
    multiplication. This coder was tested by attaching to ccm,
    where it demonstrated ~15% speed improvement.

    So, questions.
    Is it clear how it works, or I have to explain the details?
    Do we need a simpler implementation, like based on rc_vX series?

  2. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    862
    Thanks
    457
    Thanked 256 Times in 104 Posts
    Sounds like a clever idea

    > The idea is kinda obvious
    Simple things are best ones.

    Just a few questions :

    > 1) inlined rc calls (which usually have 2-3 branches inside) disappear from the model.
    Not sure to understand, in which respect does it change branch complexity ?

    > IntelC was able to automatically vectorize the rc update loop
    Is IntelC the only compiler able to vectorize loops ?

    > Do we need a simpler implementation
    Well, maybe off-topic : is the use of *.inc file considered usual trick ?


    Regards

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    So you have multiple rangecoder instances with separate streams?

    Is vectorised rangecoder applicable to RINGS (Francesco's file compressor) where symbols are Huffman coded before arithmetic coding?

  4. #4
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Related paper(s): Arithmetic Coding in Parallel (cumulative intervals), Parallel design of arithmetic coding (a no-delay coder)

    @piotr: If you have no "delay" in code-transmittion, all coders behave deterministic (like an orchestra and a specific play) and you can mix the code-streams; you always know when how which bit belongs to which instance of the coder. If you have "delay" in one instance (for example with a underflow-queue) you would need to halt all instances; and that could be untill the end of the input. This is under the assumption you mix the streams. As far as I can see Eugene is using seperate streams.
    Last edited by Ethatron; 14th January 2011 at 23:11.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    @Cyan:

    >> 1) inlined rc calls (which usually have 2-3 branches inside) disappear from the model.
    >
    > Not sure to understand, in which respect does it change branch complexity ?

    1. Branch on bit value is replaced with logic instructions - normally having this
    branch is better (=faster), but not in a vector.
    2. Renorm branch actually also can be removed like that, but the replacement has to
    read the next data byte on every iteration, and appears slower in the end
    (at least for now, while we still don't have a "gather" instruction -
    see http://en.wikipedia.org/wiki/Gather-...or_addressing))
    3. lzma has 15 different rc calls in the main loop, with 2 branches
    each and (manually) inlined - that certainly puts more load on code cache,
    branch prediction unit, and so on.
    Also compilers now can sometimes unroll branches too - something like
    if(...) A(); else B(); C(); ->
    if(...) { A(); C(); } else { B(); C(); }
    Sure that kinda can optimize the code, if C() has common parts with
    both A() and B(), but the code size increases even further, and that's
    rarely good.

    >> IntelC was able to automatically vectorize the rc update loop
    > Is IntelC the only compiler able to vectorize loops ?

    Afaik, yes.
    gcc's been improved a lot in that sense too, but 4.5 still couldn't vectorize that loop.
    But anyway, there're asm and intrinsics for that - its easy to translate manually, when
    structure of C code is already compatible with it.

    > Well, maybe off-topic : is the use of *.inc file considered usual trick ?

    More or less - commonly they're just called .h or .hpp, despite having code inside,
    but I think that's misleading.
    Anyway, its very easy to convert my sources with multiple .incs into a single .cpp -
    just comment out all the #include <...> at the start of .cpp and run gcc -E on it.

    Btw, the point of that is to still have clearly separated design blocks (like classes etc),
    while not having to deal with separately compiled modules which are a bother to maintain,
    and even hurt program's speed (because compiler can do more optimizations within a single
    source than with multiple "object" files).

    @Piotr Tarsa:

    > So you have multiple rangecoder instances with separate streams?

    Yes, in encoder, that appeared faster after some experiments.
    In vecrc_v0, there're 8 output buffers and a tag buffer, and at
    the end of block, the streams from rc outputs are interleaved
    in decoding order. Rc instances in decoder just read from the same stream.
    My previous attempt had small queues and emitted the bytes in
    decoding order right away, but current version is faster.

    > Is vectorised rangecoder applicable to RINGS (Francesco's file
    > compressor) where symbols are Huffman coded before arithmetic
    > coding?

    Well, sure having 8 rc calls per byte is convenient here, because
    it matches with hardware vector size.
    But with 32 or 64 rc instances it should improve the speed of
    variable-length coding too.
    In other words, applying this approach without actual vectorization
    results in the same or a little slower speed (handling multiple rc
    instances is more complicated = slower, but separating rc calls from
    the model still helps, so there might be no speed loss overall).
    But this only means that in unaligned cases we have to accumulate
    enough rc instances to be able to efficiently process them in vectors.

    There're issues though:
    1. There's redundancy, 4-5 bytes per rc instance per block in simple case
    (can be reduced to 1-2 bytes per instance), vecrc_v0 has a trick which makes
    it 4-5 bytes per instance per _file_, but it probably hurts speed.
    This only really matters if we'd try to accumulate some huge vectors,
    like for multithreading.
    2. Although this method is applicable to non-bitwise rangecoders
    (like in ppmd), it likely doesn't make sense there, as much higher
    precision is necessary for bytewise rangecoder, and its already harder to
    vectorize even without that.
    3. Even with bitwise rc, it makes sense to apply common speed optimizations
    first, like leave only one renormalization iteration etc - this puts some
    restrictions on probability precision and adds more redundancy to the code.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    @Ethatron:

    > Related paper(s):
    > Arithmetic Coding in Parallel (cumulative intervals),
    > Parallel design of arithmetic coding (a no-delay coder)

    Found time to read them - both don't consider decoding at all.
    Second one is actually interesting - at least it proposes an approach
    that I didn't consider at all (building a sequentially decodable code
    in parallel), but still far from practical.

    Also I have a completely different coder design for parallel (multithreaded)
    processing - this one is intended specifically for making use of x86
    vector instructions.

    > If you have "delay" in one instance (for example with a
    > underflow-queue) you would need to halt all instances;

    Not necessarily. At least with my approach I'd "reserve" bytes in the buffer,
    and if decoder won't ever need them, they'd just remain as a redundancy.

    Actually I use full rangecoders (with infinite delay) there, and the effect
    is very easy to observe, because bit7 in english text are all 0, so bit7's
    rc instance never emits any bytes.
    So maybe in actual compressor there would be a few bytes of redundancy per
    block, but in this implementation I avoided it by (somewhat) randomizing
    the bit-to-rc_instance mapping, so each rc instance gets roughly the same
    share of each bit position.

    Both this and rc with carries are good for decoding, though - decoder
    just reads an uniform compressed stream without any special cases.

    In fact, I tried it first with a carryless rc (Matt's) and fixed-size
    pointer queues, without separate buffers per rc instance - it was slower.

    > As far as I can see Eugene is using seperate streams.

    They're encoded separately, then interleaved, then decoders read the bytes
    sequentially.
    But in the end, maybe completely independent codes would be the best - its
    almost certain for encoder, but reading from multiple locations in decoder
    might be not so good.

Similar Threads

  1. Simple binary rangecoder demo
    By Shelwien in forum Data Compression
    Replies: 35
    Last Post: 17th June 2019, 16:21

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •