
Thread: Looking for some help for hopefully trivial optimisation

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    458
    Thanked 257 Times in 105 Posts

    Looking for some help for hopefully trivial optimisation

    Hello


    I've been developing my own, pretty simple implementation of LZSS for a very restricted platform (low CPU & memory), for fun, as a time-waster.
    Currently, the results are rather good, so I'm looking into code optimisation.

    While looking at the compression path, one thing strikes me:
    I'm doing a test at every symbol (that means several thousand times) even though it is only potentially useful once.
    I started to wonder if I could get rid of it by thinking about the problem differently, but could not find a solution.

    So let me introduce the issue.

    LZSS is basically LZ77 + literals, the simplest form possible.
    So now, let's imagine I'm ending the compression with a match.
    Therefore, I'm coding a final OFFSET-MATCHLENGTH.

    OK, but maybe this MatchLength goes too far.
    Just imagine the following scenario:
    <-- Source Memory Area --><End><Rest of memory>
    (...) 1 2 3 4 5 6 (...) 1 2 3 4 || 5 6

    Here, the source is a memory area; this is not a file, so there is no reading, and there is no "EOF" marker.

    Therefore, the final matchlength may go too far, simply through bad luck.
    Here, the encoder will believe it has found a 6-symbol match "123456", while in fact there are only 4 symbols left to encode.

    How to avoid that?
    Well, quite simply: each time I've got a matchlength, I calculate how many symbols are left to be encoded, and if MatchLength > SymbolsLeft, then MatchLength = SymbolsLeft.
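
    In code, the test looks roughly like this (just a sketch, with made-up variable names, not my actual code):

    Code:
    #include <stddef.h>

    /* Make sure a match never runs past the end of the data still to be
       encoded; this is the per-match test I would like to get rid of. */
    static size_t clamp_match(size_t match_len, const unsigned char *cur,
                              const unsigned char *src_end)
    {
        size_t symbols_left = (size_t)(src_end - cur); /* symbols not yet encoded */
        return (match_len > symbols_left) ? symbols_left : match_len;
    }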

    OK, so that basically means I'm doing this test for each matchlength, which translates into "for each symbol" since I'm using a lazy matching algorithm.

    This seems quite a waste, as the test is supposed to be useful only once, at the last O-M sequence (note: if the file ends with literals, there is no such issue).

    I've tried to find a way to do this test only once, at the end of the compression, but this I have failed to do.
    I can detect that the last matchlength went too far, but cannot "undo" the last step, because I cannot guess which bits are part of this last Offset/Matchlength (the Offset-Matchlength format is variable, and depends on previous context information).

    Why not store the last O-M sequence before writing it, so that it can be cancelled?
    Well, the cost of this step is no better than testing the matchlength to begin with, so....

    I'm pretty sure that every developer in this forum has already solved this issue, as LZSS is a kind of "beginner's compressor". I just can't figure out how...

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    AFAIU, the question is how to avoid checking each match
    for possible buffer overflow.

    1. You can just hope that the encoder generated a correct
    sequence; this can also be enforced by adding a compressed-data CRC.
    Ad absurdum, you can decode the compressed data once with
    all checks to ensure its validity, and then add a 2-kbit RSA signature
    as a guarantee.

    2. You can limit the maximum match length and allocate
    some additional space after the end of the buffer, such
    that all of that padding can never be overwritten by a
    match starting in the decoded data (see the sketch after
    point 3 below).
    Some architectures might allow setting a memory
    breakpoint or a guard page at the end of the buffer.

    3. You can use a less redundant encoding scheme, which
    won't have any codes for out-of-range values.
    Of course, this is also relevant for relative offsets.
    With this you'd have these checks anyway, though
    maybe implicitly.
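
    For point 2, a rough sketch (MAX_MATCH and the names here are only placeholders, not a known limit of your format):

    Code:
    #include <stdlib.h>

    #define MAX_MATCH 255  /* assumed limit on the match length */

    /* Over-allocate the decode buffer by MAX_MATCH bytes, so that even a
       match which (wrongly) runs past the logical end of the data still
       writes into owned memory. */
    unsigned char *alloc_decode_buffer(size_t decoded_size)
    {
        return malloc(decoded_size + MAX_MATCH);  /* real data + padding */
    }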

    Anyway, there are limited possibilities here.
    Either invalid matches can occur in the compressed
    data, or they cannot. If they are possible, you can check
    for such cases, or not. Without checks, overflows
    are possible, so you either don't care, or ensure that
    they're safe. And for safety, either some amount of spare
    space is required, or some other kind of memory-access
    control has to be used.

  3. #3
    Member mr.spiv's Avatar
    Join Date
    Dec 2008
    Location
    Finland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    Hello

    I'm pretty sure that every developer in this forum has already solved this issue, as LZSS is a kind of "beginner's compressor". I just can't figure out how...
    I recall I allocated some extra buffer space so that I could "safely" search past the file end, and did the MatchLen > SymbolsLeft check once per search sequence. Also, the code had two different search functions: one for the case when we know we cannot reach the EOF, and one for the case when there is a chance of reaching the EOF.
    ..more coffee..

  4. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    458
    Thanked 257 Times in 105 Posts
    Thanks for the answer, Shelwien.

    AFAIU, the question is how to avoid checking each match
    for possible buffer overflow.
    Buffer overflow is indeed a risk during decoding, as I create a memory slot which has exactly the size of the target object, and must fill it entirely and exactly.
    This is a very hard restriction, due to an automatic garbage collector roaming in the background.

    But my question here is about the encoder: I want it to avoid creating a last Offset/Match which would be too long.
    Such a match would trigger the decoder's overflow detector, but that is hardly a consolation: the detector raises an exception to prevent the problem from spreading, but the harm is already done; smooth execution is halted.

    So I wish to keep this mechanism for dealing with damaged files only.

    You can limit the maximum match length and allocate
    some additional space after the end of the buffer, such
    that all of that padding can never be overwritten by a
    match starting in the decoded data.
    Some architectures might allow setting a memory
    breakpoint or a guard page at the end of the buffer.
    Alas, the current format has no limit on the maximum match length.
    But anyway, one of the biases of my implementation is toward maximum decoding speed, so I'm willing to spend whatever it takes in the encoder to avoid any additional test in the decoder.

    You can use a less redundant encoding scheme, which
    won't have any codes for out-of-range values.
    Of course, this is also relevant for relative offsets.
    With this you'd have these checks anyway, though
    maybe implicitly.
    Hmm, this one I really do not understand.
    But I'm very keen on implicit tests.
    They tend to be very interesting performance-wise.
    What do you mean by a "less redundant encoding scheme"?

  5. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    458
    Thanked 257 Times in 105 Posts
    and did the MatchLen > SymbolsLeft check once per search sequence
    That sounds very interesting. What do you mean by "search sequence"?

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    > Alas, the current format has no limit on the maximum match length.

    Well, I'd say that's certainly redundant...
    Just count the matchlengths on some sample data, and
    look at the distribution.

    > But anyway, one of the biases of my implementation is
    > toward maximum decoding speed, so I'm willing to spend
    > whatever it takes in the encoder to avoid any additional
    > test in the decoder.

    Well, I assumed that too, but it appears that you still want
    to improve the encoding speed by removing that extra
    check.

    So, if comparing past the limit is OK by itself,
    then I guess the offset/length pair can be stored
    in some variables on each match encoding. That way you'd
    be able to check only the last match, after the actual
    encoding.
    Also, a terminator sequence can be written after
    the data, but by itself that only reduces the probability
    of a "comparison overflow"; it does not make it zero.
    However, it might be possible to both add a terminator
    and some simple escape-coding (like "\n") for occurrences
    of the terminator sequence in the data (or just find an
    unused sequence in the data). Thus, the comparison would
    never match past the end.
    But I doubt it'd be faster than having that check.
    Well, it might be possible to select some terminator,
    then check whether it occurs in the data, and pick
    a different one if it does.

    In any case, it's better to limit the match length
    (for compression), and that way you'd also be able
    to cut down this maximum comparison length based
    on the size of the remaining space -
    once per match and not per symbol.
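
    A sketch of that idea (MAX_MATCH and all names here are just placeholders): compute the allowed comparison length once per match attempt, so the inner loop needs no per-symbol end-of-buffer test.

    Code:
    #include <stddef.h>

    #define MAX_MATCH 255   /* assumed format cap on the match length */

    /* Clamp the comparison limit once per match attempt, so the
       byte-by-byte comparison loop has no end-of-buffer check. */
    static size_t match_length(const unsigned char *cur, const unsigned char *ref,
                               const unsigned char *src_end)
    {
        size_t limit = (size_t)(src_end - cur);   /* symbols left to encode */
        if (limit > MAX_MATCH)
            limit = MAX_MATCH;                    /* format limit */

        size_t len = 0;
        while (len < limit && cur[len] == ref[len])
            len++;
        return len;
    }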

    > But I'm very keen on implicit tests.
    > They tend to be very interesting performance-wise.
    > What do you mean by a "less redundant encoding scheme"?

    When invalid codes are possible, or there are multiple
    ways of encoding the same thing, it's "redundancy"
    from a compression point of view.
    Here, for example, you could use different coding
    layouts that take into account the maximum possible
    matchlength, or even use different Huffman tables.

  7. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    458
    Thanked 257 Times in 105 Posts
    Thanks Shelwien

    These are very interesting suggestions. I will try out your ideas and see where they lead.

    Thanks again

  8. #8
    Member mr.spiv's Avatar
    Join Date
    Dec 2008
    Location
    Finland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    That sounds very interesting. What do you mean by "search sequence"?
    All the steps needed to find the longest match, including possible lazy evaluation, etc. So you check whether you went past the EOF only once. This might give you a suboptimal result when you actually go beyond the EOF, but as that happens only once, the penalty is meaningless.
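
    Something along these lines (only a sketch: MAX_MATCH and all names are made up, and the source buffer is assumed to be over-allocated by at least MAX_MATCH padding bytes so the search may safely read past the real data):

    Code:
    #include <stddef.h>

    #define MAX_MATCH 255   /* assumed maximum match length */

    typedef struct { size_t offset; size_t len; } match_t;

    /* The search itself runs without any EOF test (reads may spill into
       the padding); the MatchLen > SymbolsLeft check is applied once,
       to the winning match only. */
    static match_t search_sequence(const unsigned char *src, size_t pos,
                                   size_t src_size, size_t window)
    {
        match_t best = { 0, 0 };
        size_t start = (pos > window) ? pos - window : 0;

        for (size_t i = start; i < pos; i++) {          /* brute-force search */
            size_t len = 0;
            while (len < MAX_MATCH && src[i + len] == src[pos + len])
                len++;                                  /* may read padding */
            if (len > best.len) { best.len = len; best.offset = pos - i; }
        }

        size_t symbols_left = src_size - pos;
        if (best.len > symbols_left)                    /* single EOF check */
            best.len = symbols_left;
        return best;
    }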
    ..more coffee..

  9. #9
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    458
    Thanked 257 Times in 105 Posts
    OK, so you mean you're doing the "EOF check" only once per search sequence,
    as opposed to doing it once per match attempt?

    Then the number of "EOF checks" is directly related to the number of search sequences, right?
