
Thread: Code generation in LZ decoder / Branchless LZ77 Decoder

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Kharkov, Ukraine
    Thanked 1,397 Times in 802 Posts

    Code generation in LZ decoder / Branchless LZ77 Decoder

    There's another post on cbloom's blog:
    which reminded me about the "D2CT" discussion here:
    And this time I even made an actual implementation to test it, but nobody likes it :)

    Clocks are CPU clocks from rdtsc.
    The first result is from 2.cpp, with a plain decoding loop,
    and the second is the result with code generation.
    Note that IntelC replaced the cpy() call with its own memcpy
    implementation, which works faster with SSE2 or whatever it uses,
    but that doesn't really count, because the same tricks can be
    used in generated code too.
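
    For reference, a plain one-pass decoding loop of the kind 2.cpp presumably
    contains might look like this. The token format (distance/length/literal-count
    triples plus a separate literal stream) is a made-up illustration, not the
    actual one from the demo:

    ```cpp
    #include <cstdint>
    #include <cstddef>

    // Hypothetical token: copy `nlit` literals, then `len` bytes
    // from `dist` bytes back in the already-decoded output.
    struct Token { uint32_t dist, len, nlit; };

    size_t lz_decode(const Token* tok, size_t ntok,
                     const uint8_t* lit, uint8_t* out)
    {
        uint8_t* p = out;
        for (size_t i = 0; i < ntok; i++) {
            for (uint32_t j = 0; j < tok[i].nlit; j++)   // literal run
                *p++ = *lit++;
            const uint8_t* src = p - tok[i].dist;        // match source in output
            for (uint32_t j = 0; j < tok[i].len; j++)    // byte copy: overlap-safe
                *p++ = *src++;
        }
        return p - out;
    }
    ```

    Each iteration here reads the token stream, reads the literal stream (or the
    match source), and writes the output - three regions touched per iteration.
    
    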

    [ MSC ]

    [ gcc450 ]

    [ IC111 ]

  2. Thanks:

    encode (31st May 2017)

  3. #2
    Administrator Shelwien's Avatar
    >> I look at my numbers and see that my code generator is faster,
    >> despite all the problems you mentioned.

    > Yes, and I say that this result is completely bogus, because you're
    > not comparing the implementations on equal terms. You can't change
    > two things A and B (copy loop and decoding strategy) at the same
    > time and then conclude that B is faster; all you know is A+B is
    > faster, and A may have a much larger effect than B does.

    Well, I just thought that having a simple demo is better than plain
    speculation. If you don't like it, post a better one.
    My point here is that the redundant implementation, with two processing
    passes instead of one and an extra buffer, still works faster.
    Also, as to your "A+B": my generated code uses simple STOSB and REP MOVSB
    instructions, which is far from optimal. AFAIR, gcc is able to generate
    REP MOVSB in some cases (like when optimizing for size), but it didn't
    here, so presumably its heuristics chose a better implementation.
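
    The reason REP MOVSB (or its C equivalent) is the natural choice here is that
    a forward byte-at-a-time copy handles overlapping LZ matches correctly, which
    memcpy does not guarantee. A minimal sketch of such a cpy():

    ```cpp
    #include <cstdint>
    #include <cstddef>

    // Forward byte-at-a-time copy: the C equivalent of REP MOVSB.
    // For an LZ match with dist < len the source overlaps the destination,
    // and copying in strictly ascending order replicates the pattern,
    // which is exactly the semantics LZ decoding needs.
    // memcpy() makes no such guarantee for overlapping buffers.
    void cpy(uint8_t* dst, const uint8_t* src, size_t len) {
        while (len--) *dst++ = *src++;
    }
    ```

    E.g. with dist = 1 this degenerates into a run fill: copying buf to buf+1
    propagates the first byte across the whole match.
    
    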

    > I don't care about "perfection", but I do care about proper
    > methodology. In particular, if you make any claims and I see a flaw
    > in your experiments, I will point them out, and I won't be subtle
    > about it if I think it's something that should be obvious.

    And that's exactly what I meant by "perfection".
    Any implementation can be discarded as "bogus" unless there's a mathematical
    proof of its perfection - and to compare two different methods, _both_
    implementations would have to be perfect like that.
    And I still believe that approximations do have some meaning - especially
    compared to plain speculation.

    > Another thing is making your results relevant and reproducible. You
    > didn't include your test data file, so I can't re-run your
    > experiments (and figure out if the copy loop makes a difference on
    > my own), and I also can't see what type of data you're testing on -
    > might be 12MB of zeroes for all I know.

    Relevance aside (you can always say that real entropy coding and
    offset/length distributions would make it all different - but I won't care
    while it's just words), you could at least pay more attention: there's
    a data generator included, and a test script which I used to collect
    the stats.

    > Finally - and that's the biggest red flag - you made a lot of claims
    > that are either wrong or just make no sense. Once a certain level of
    > handwaving is reached, I tend to get very skeptical.

    I guess you didn't notice, but I just posted a link to an old topic
    which I considered relevant here ("Branchless LZ77 Decoder").
    So I'd have been satisfied with a reaction like "Funny idea, but not practical",
    but instead you chose to attack my explanation without any real
    counterargument.
    > As Charles pointed out, your comment about L1 ICache vs. DCache
    > makes no sense; your generated code goes through both.

    Whether it makes sense depends on the specific platform.
    And on modern x86, such loop splitting has gained me speed
    quite a few times.
    As to the reasons, I'd explain it like this: sequential writes and reads
    don't stall processing, unless there's mixed read/write access to the
    same buffer, etc. (the CPU is tuned for such access patterns).
    So the cost of splitting a single processing loop into two, with
    an extra intermediate buffer, is usually surprisingly low.
    And in some cases it's even possible to gain speed by doing that -
    when the original complex loop has to wait for some limited resource
    (data in L1, arithmetic units, branch target prefetches, etc.).

    And in the case of a single LZ decoding loop, we have reads from
    two different sources plus writes (i.e. three different memory locations
    are accessed on each iteration), while with the 2-pass method
    we have only two locations in each loop, and the first loop is also
    purely sequential, which improves speed too.
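
    The second pass of such a split can be sketched like this. Pass 1 (not shown)
    parses the compressed stream into a flat record buffer, streaming through just
    two regions sequentially; pass 2 then replays the records into the output.
    The record format here is hypothetical, chosen only to illustrate the access
    pattern:

    ```cpp
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Hypothetical intermediate record produced by pass 1: one literal byte
    // followed by one match, already unpacked from the compressed stream.
    struct Rec { uint32_t dist, len; uint8_t lit; };

    // Pass 2: replay the record buffer into the output. This loop touches
    // only two regions: the record buffer (sequential read) and the output
    // window (sequential write plus the match read, which stays inside it).
    size_t replay(const std::vector<Rec>& recs, uint8_t* out) {
        uint8_t* p = out;
        for (const Rec& r : recs) {
            *p++ = r.lit;                      // literal byte
            const uint8_t* src = p - r.dist;   // match source, inside output
            for (uint32_t j = 0; j < r.len; j++)
                *p++ = *src++;                 // overlap-safe byte copy
        }
        return p - out;
    }
    ```
    
    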

    So I do have this explanation, and an implementation which confirms
    it for me (I wasn't sure about the results until I got them -
    that's why it convinces me). And what do you have to counter it?

