Thread: LZSA

  #1 · Member (United Kingdom)

    LZSA

    LZSA is a new data compressor written by Emmanuel Marty, targeted at small, primarily 8-bit platforms and heavily optimized for high decompression speed. It takes its heritage from the LZ4 school of LZ compression, but modifies and extends the format to improve the compression ratio and decompression speed on not particularly capable computers.

    You can download the latest sources and binaries for LZSA from: https://github.com/emmanuel-marty/lzsa

    In fact, LZSA comprises two compressors inside a single executable. The first of them, called "LZSA1", is easier to explain, as it is a very similar compressor to LZ4. The changes include more efficient extended length codes, as well as added support for 8-bit offsets. Together these changes increased the compression ratio by about 6-7%, which is probably not particularly surprising to anyone with experience of working with this type of compressor. However, what hopefully makes this specific format interesting is the process that we went through to arrive at it.

    Essentially, we knew from the start, from our experience of developing a speed-optimized Z80 decompressor for LZ4, that certain decisions in the compression format are awkward to implement on an 8-bit platform. For example, the format half-assumes that the decompressor would control the number of decompressed bytes as its stopping condition. This is clearly good practice from the point of view of dealing with potentially malicious data, but not very practical on 8-bit platforms, where 16-bit operations, even if available, are fairly expensive. Therefore, every practical LZ4 decompressor for Z80 does not check the size of the decompressed data, but instead looks for a zero offset as the end-of-data marker. (This works because 8-bit computers only use LZ4 data with a single frame, and at the end of the last compressed block there is an EndMark, a 32-bit zero, sitting exactly where the last offset would sit.) Actually, even testing that an offset is zero is relatively expensive, because no 8-bitter has a single instruction capable of checking whether a 16-bit number equals zero.
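
    To make this concrete, here is a minimal C sketch of the end-of-data idiom (deliberately simplified: this is not the exact LZ4 or LZSA token layout, and extended lengths are omitted). The decompressor never counts output bytes; it simply stops when it reads a zero offset.

    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    /* Toy byte-token decompressor: high nibble of the token = literal count,
       low nibble + 3 = match length, then a 16-bit little-endian offset.
       A zero offset marks the end of the compressed data. */
    size_t decompress_until_zero_offset(const uint8_t *src, uint8_t *dst)
    {
        uint8_t *out = dst;
        for (;;) {
            uint8_t token = *src++;
            unsigned lits = token >> 4;
            while (lits--) *out++ = *src++;            /* copy literals */
            unsigned offset = src[0] | (src[1] << 8);
            src += 2;
            if (offset == 0)                           /* end-of-data marker */
                break;
            unsigned len = (token & 0x0F) + 3;         /* minimum match length 3 */
            const uint8_t *match = out - offset;
            while (len--) *out++ = *match++;           /* byte copy handles overlap */
        }
        return (size_t)(out - dst);
    }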

    So, to address these kinds of issues, we developed the format in several iterations. After the first version of the compressor was written, we implemented the decompressor on Z80 and looked into ways of simplifying the format in order to simplify and speed up the decompressor. We went through several rounds of this process, with some ideas tried and discarded, and other ideas kept. The result is a format that is fairly decompressor-friendly, at least from the point of view of an 8-bit machine.

    We also spent quite a lot of time working out ways to help the decompressor during compression. Our main vector of attack was to resolve ties - the situations where several combinations of codes are possible - strategically. We attempted several approaches. At some point we had a version of the compressor that literally took into account how long each decompressor branch executes, so that the number of machine cycles needed for decompression was used to resolve the ties. However, we found this very impractical: every decompressor would need a separate function specifying the cost of each combination of codes, and every change to a decompressor would entail changing the compressor and recompressing the data. So we came up with an extremely useful heuristic: minimize the number of compression tokens (this is exactly why we tried to apply it to LZ4 data too, see the LZ4Ultra thread here). The resulting speed-up was within one or two percent of the result obtained using the exact machine cycle costs of decompression.
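
    To illustrate the principle (this is not the actual LZSA parser code), the cost comparison in an optimal parser just needs a secondary key: when two candidate parses produce the same number of compressed bytes, prefer the one with fewer tokens, because each token carries a roughly constant decoding overhead.

    Code:
    /* Hypothetical parse cost for illustration: compare primarily by
       compressed size, and resolve ties by token count. */
    typedef struct {
        unsigned bytes;    /* compressed size of this candidate parse */
        unsigned tokens;   /* number of tokens (literal runs + matches) */
    } ParseCost;
    
    static int cost_is_better(ParseCost a, ParseCost b)
    {
        if (a.bytes != b.bytes)
            return a.bytes < b.bytes;  /* smaller output always wins */
        return a.tokens < b.tokens;    /* tie: fewer tokens decode faster */
    }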

    Even with such extreme optimizations, the added complexity of "LZSA1" was such that we struggled to match the decompression speed of LZ4. Luckily, we quickly noticed that the relative loss in speed scaled as about 4/3. Incidentally, this is also the ratio of minimum match sizes in LZ4 and LZSA. Fundamentally, this happens because the decompression time per byte scales as C1/N + C2, where N is the length of the current match, C1 is the roughly constant decoding time per token, and C2 is the time it takes to copy a byte. C2 is pretty much a CPU-specific constant (assuming a well-optimized decompressor). Our work on the compression format was aimed at reducing C1. However, even if we managed to match LZ4's C1 using various clever tricks in the data format, we could not match LZ4's worst-case per-byte overhead of C1/4, because our shortest match length was 3, giving C1/3.
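
    To see where the 4/3 comes from, here is a toy calculation with made-up constants; only the ratio of the C1/N terms matters:

    Code:
    /* Illustrative numbers only: per-byte decode time t(N) = C1/N + C2.
       For short matches the per-token cost C1 dominates, and the
       worst-case overhead ratio between minimum matches of 3 and 4 is
       (C1/3) / (C1/4) = 4/3, matching the observed relative slowdown. */
    #include <stdio.h>
    
    int main(void)
    {
        const double C1 = 100.0;   /* hypothetical cycles per token */
        const double C2 = 10.0;    /* hypothetical cycles per copied byte */
        for (int n = 3; n <= 5; n++)
            printf("min match %d: worst case %.1f cycles/byte\n", n, C1 / n + C2);
        return 0;
    }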

    This led us to the decision to let the user choose the minimum match length via "-m". Raising it loses some compression ratio, but provides a fairly substantial speed-up in the decompressor. Therefore, a user who needs a particular set of data compressed as well as possible can choose the minimum match length of 3 ("-m3"). On the other hand, a user who prefers higher decompression speed can choose a minimum match length of 4 or even 5 ("-m4" or "-m5"). The resulting combination of format tuning and the compressor helping the decompressor, by reducing the number of tokens and, possibly, increasing the minimum match length, produces results that are clearly on the Pareto frontier for Z80-based compression and decompression:


    [Attached image: encode.png]


    Of course, our combination of skills meant that the Z80 was particularly well catered for. Good for the Z80, but not entirely representative of everything else. Since our compressor was published, new highly optimized 8086/8088 decompressors have been contributed by Jim Leonard and Peter Ferrie. Although the decompression speed of "LZSA1 -m5" with these decompressors is merely similar to LZ4 (on the Z80, "LZSA1 -m5" overtakes LZ4 by about 9%), this still shows that the fundamental design decisions are not completely off.

    The second compression format, "LZSA2", is in many ways a logical extension of the first. To the simple byte-based format of "LZSA1" we added nibbles, to allow finer-grained control of the match lengths and offsets. The token was modified a little to optimize compression on our corpus. Last, but not least, a command for repeated offsets was added. Overall, the resulting compressor is probably best described as a further exploration of LZ5-style compression. In our tests we achieved compression ratios comparable with "ZX7", which was ultimately our target in terms of compression ratio. Compared to "ZX7", "LZSA2" did not fare very well when compressing graphics, which is probably due to the relatively crude variable length codes used in the compressor. However, when dealing with mixed data, less dominated by a large number of small matches, we outperform "ZX7" by quite a big margin, mostly thanks to a larger compression window. These are the results on our corpus of about 2.5Mb of files mostly relevant to the ZX Spectrum (a sketch of the nibble mechanism follows below the table):

    Code:
                         corp   demos   games     gfx   music    text       total
    
    OOZ v7.1 (Kraken)   34211  219830  337485  438657  138516  204963   1,373,662
    LZSA (f2)           38043  229544  352172  497857  138086  225430   1,481,132
    ZX7                 39484  238903  362443  489624  143457  226092   1,500,003
    OOZ v7.1 (Mermaid)  37709  235601  363995  486306  157553  220841   1,502,005
    LZOP v1.3 -9        40347  238797  374751  527285  154278  236488   1,571,946
    LZSA (f1)           41839  238331  375340  530595  152838  247901   1,586,844
    LZ4 v1.9.1 -19      45135  244707  398287  559407  168746  266545   1,682,827
    OOZ v7.1 (Selkie)   45122  246574  398377  559023  167500  271168   1,687,764
    
    Uncompressed        90445  303002  563234  819609  352130  518401   2,646,821
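
    Returning to the nibble mechanism mentioned above: here is a minimal sketch of how 4-bit fields can be pulled lazily from a byte stream, two per byte, so that short lengths and offset extensions cost half a byte each. The actual LZSA2 interleaving details differ, so treat this purely as an illustration of the idea.

    Code:
    #include <stdint.h>
    
    /* Lazily fetch 4-bit fields from the byte stream, two per byte. */
    typedef struct {
        const uint8_t *src;  /* compressed input cursor */
        int have_nibble;     /* is a buffered nibble pending? */
        uint8_t nibble;      /* the buffered low nibble */
    } NibbleReader;
    
    static uint8_t read_nibble(NibbleReader *r)
    {
        if (r->have_nibble) {
            r->have_nibble = 0;
            return r->nibble;    /* second half of the buffered byte */
        }
        uint8_t b = *r->src++;
        r->nibble = b & 0x0F;    /* keep the low nibble for the next call */
        r->have_nibble = 1;
        return b >> 4;           /* hand out the high nibble first */
    }
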
    At this stage of the project, the Z80 decompressors and the 8086/8088 decompressors contributed by Jim Leonard and Peter Ferrie are quite mature. The 6502 decompressors have been partly optimized, but more can probably be done with them. We hope to add decompressors for several other small CPUs soon.

    P.S. We learned very recently that there is another compression scheme also named LZSA: https://cbloomrants.blogspot.com/201...ndex-post.html
    We did not intend to refer to Charles Bloom's work and, in any case, his compressor is unpublished and very different from what we did. We hope that this won't cause any unnecessary confusion.


  #2 · Member (United Kingdom)
    Now I've got a proper place to reply to xcrh's post here: https://encode.su/threads/2333-Turbo...ll=1#post60890

    Quote Originally Posted by xcrh View Post
    What I don't like about it:
    - Obsession with Z80 and other obsolete/toy platforms, ignoring modern viable uses (e.g. larger blocks for boot loaders/OS kernels/etc, modern CPU cores like app/mcu/etc ARMs - comparably sized platforms for present-day uses).
    Well, obsession is one word; a focus on a well-defined set of use cases is another. I agree with you that similar techniques could be applied to other, more modern applications. However, we do not have a specialist in any of these applications amongst us. If people attempt to implement LZSA in such contexts, I am sure the slightly strange features of the format will not overcomplicate the code, because we reduced the amount of computation to the bare minimum as it is. Obviously, the compression ratio would be the same on any platform. The specific position on the Pareto frontier with regard to decompression speed is harder to assess, but keep in mind that this is currently the only open-source compressor augmenting pure byte encoding with some nibbles. Frankly, this niche is mostly unoccupied at present. You may be comparing the compression ratios with something like ZIP, whereas our real target was to match the compression ratio of simple bitstream-based compressors, which we did quite well in my opinion.

    Our real competitors in terms of approach and ratio are the Oodle compressors. Since our work has so far been focussed on 8-bit decompression, we are very likely to lag behind their performance. However, it is probably a good idea to attempt writing an efficient C decompressor for LZSA, especially LZSA2, just to see how competitive we can get in this context. It may prove useful to people.

    Quote Originally Posted by xcrh View Post
    - Very strange decompression termination conditions, likely resulting from the previous oddity. It looks like this thing was never meant to be decompressed in a "safe" way? Like: "I have a 2K buffer for the result, never exceed it, even if the input is random garbage, no matter what". Looking at the specs, it seems the author never considered decompressing in a safe manner into a limited-size buffer that must never be exceeded, even if the input is damaged and the decompressor is fed with (odd/"uncooperative" or even malicious) garbage.
    That was simply done in recognition of the fact that the usual safe decompression is very expensive on an 8-bit platform. The frame format provides all the information needed for safe decompression, so a 16+ bit decompressor can very easily work with the framed data (as opposed to the raw data assumed by every asm decompressor).
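
    For illustration, this is roughly what a "safe" match copy can look like in a frame-aware decompressor on a 16+ bit CPU; a sketch under the assumption that the frame header has already provided the decompressed size, not code taken from an actual LZSA decompressor:

    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    /* Copy a match while validating both the offset and the output bound. */
    static int copy_match_safe(uint8_t **outp, uint8_t *out_end,
                               const uint8_t *out_start,
                               unsigned offset, unsigned len)
    {
        uint8_t *out = *outp;
        if (offset == 0 || (size_t)(out - out_start) < offset)
            return -1;                    /* offset points before the buffer */
        if ((size_t)(out_end - out) < len)
            return -1;                    /* would overrun the output buffer */
        const uint8_t *match = out - offset;
        while (len--) *out++ = *match++;  /* byte copy handles overlap */
        *outp = out;
        return 0;
    }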

    Quote Originally Posted by xcrh View Post
    - Ahem, speaking of that, there are tons of standalone asm stuff, but no trivial, standalone C decompressor; digging down to the decompression routine expressed in C takes all the pains of unwinding LZ4-like "enterprise-grade spaghetti for all occasions". Uh-oh. Of course, perfection is a bitch, but as LZ4 has shown, an algo can enjoy very reasonable speed without resorting to assembly/simd/etc.
    Yes, this is a good point; we will probably look into it at some point.

  #3 · Member (USA)
    I commend you, for I think these types of projects are fascinating. The 8-bit world is thriving, and your new LZSA algorithms satisfy a need and will be used extensively.

    As you mentioned, there are binaries, but I could not find any except the raw .ASM source code. Are there precompiled .EXEs for 8088, and .SYS files and/or raw binary stubs for 6502/Z80 systems, available anywhere? Or any raw binary machine code which could be pasted into these systems to generate the binaries, by saving the raw machine code to disk?

  #4 · Member (United Kingdom)
    Quote Originally Posted by zyzzle View Post
    As you mentioned there are binaries, I could not find any, except the raw .ASM source code. Are there precompiled .EXEs for 8088 and .SYS files and / or raw binary stubs for 6502 / Z80 systems available anywhere. Or any raw binary machine code which can be pasted into these systems to generate the binaries by means of saving the raw machine code to disk?
    Ah, sorry, maybe I was not explicit enough. We are using nearly optimal parsers, so even though they are fairly fast on a PC (except for the default parser in LZSA2 mode), they would probably be impractical on 8- and even 16-bit hardware. We fully intended to exploit the asymmetry in processing power, so that a powerful computer can squeeze a bit more ratio during compression, whereas a weaker platform can still benefit from the added decompression speed.

    We believe that the LZSA1 parser is very close to optimal (the task is simple enough). However, we do not really know how far the LZSA2 parser is from optimal. We only know that our last two or three attempts to increase the ratio produced fairly minuscule dividends. It may mean that we are close to the optimum, or it may mean that our ideas are not of the right kind for the task.

    What would be the accepted "best" way to parse LZ-style data with repeated offsets?

  #5 · Administrator Shelwien (Kharkov, Ukraine)
    > What would be the accepted "best" way to parse LZ-style data with repeated offsets?

    From lzma observation, the main source of match patterns with rep-matches is fuzzy matches, like "<P 5?>\n" in book1.
    So I'd expect an improvement if _all_ potential matches were fed to the parsing optimizer, rather than just the recent ones.
    Presuming that the optimizer would look for rep-suffixes, like lzma's does.

  #6 · Member (United Kingdom)
    LZSA just got its first minor version bump. Over the course of the last few months, Emmanuel Marty implemented a multiple-forward-arrivals parser, which enabled us to get much more benefit from the rep-offset command. Just as Shelwien suggested, the match finder has also been upgraded to generate more suitable matches.
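
    For readers unfamiliar with the technique, here is a toy sketch of the forward-arrivals idea (not Emmanuel Marty's actual implementation): each position keeps several "arrivals", each remembering the cost to reach it and the last offset used on its path, so a later rep-match can extend the arrival whose offset it repeats even when that arrival is not the cheapest one.

    Code:
    #define MAX_ARRIVALS 4         /* arrivals kept per position */
    
    typedef struct {
        unsigned cost;             /* cost to reach this position on this path */
        unsigned rep_offset;       /* last match offset used on this path */
    } Arrival;
    
    /* Record a new way of reaching a position; slots are kept sorted by
       cost (assumed initialized to UINT_MAX), the worst arrival drops off. */
    static void add_arrival(Arrival slots[MAX_ARRIVALS],
                            unsigned cost, unsigned rep_offset)
    {
        for (int i = 0; i < MAX_ARRIVALS; i++) {
            if (cost < slots[i].cost) {
                for (int j = MAX_ARRIVALS - 1; j > i; j--)
                    slots[j] = slots[j - 1];   /* shift worse arrivals down */
                slots[i].cost = cost;
                slots[i].rep_offset = rep_offset;
                return;
            }
        }
        /* more expensive than every kept arrival: discarded */
    }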

    Overall, the progress from ver.1.0.0 to ver.1.1.0 improved the LZSA2 compression ratio by about 1.2%. LZSA1 was already nearly optimal, so it only got a marginal increase in compression ratio. The token counts have also been reduced. New features have been added: support for dictionaries, and the possibility of backward compression. The safe overlap distance computation has been added to the verbose information screen. The decompressors have also been improved; the latest fast Z80 decompressors typically require about 5-6% less time compared to the original release. This is the summary of the compression results on a bunch of "small" files relevant to 8-bit computers:
    Code:
                                   corp   demos   games     gfx   music    text       total
    
    7-zip v.9.20 (ultra,non-solid)33062  209266  320581  434029  123032  193064   1,313,034
    OOZ v.7.1 (Kraken, -9)        34211  219830  337485  438657  138516  204963   1,373,662
    LZOMA v.0.2 (c,7,1E6)         35018  229572  341036  474495  129788  205132   1,415,041
                                  ---------------------------------------------------------
    LZSA v.1.1.0 (f2)             37735  228234  349456  491925  136023  224492   1,467,865
                                  ---------------------------------------------------------
    NRV2b                         36444  237865  357998  492116  137521  213874   1,475,818
    LZSA v.1.0.4 (f2)             38043  229544  352172  497857  138086  225430   1,481,132
    ZX7                           39484  238903  362443  489624  143457  226092   1,500,003
    OOZ v.7.1 (Mermaid, -9)       37709  235601  363995  486306  157553  220841   1,502,005
    LZOP v.1.3 -9                 40347  238797  374751  527285  154278  236488   1,571,946
    bCrush v.0.1.0 (--optimal)    38611  253502  382611  525585  149954  225591   1,575,854
                                  ---------------------------------------------------------
    LZSA v.1.1.0 (f1)             41839  238307  375337  530579  152837  247900   1,586,799
                                  ---------------------------------------------------------
    LZSA v.1.0.4 (f1)             41839  238331  375340  530595  152838  247901   1,586,844
    LZF v.1.03                    42289  247045  390125  539910  161539  242390   1,623,298
    BriefLZ ver.1.2.0             40499  264101  403026  552479  159194  236997   1,656,296
    LZ4 v.1.9.1 (-19, raw)        45135  244707  398287  559407  168746  266545   1,682,827
    OOZ v.7.1 (Selkie, -9)        45122  246574  398377  559023  167500  271168   1,687,764
    Binaries and source code are all available, as usual, from GitHub: https://github.com/emmanuel-marty/lzsa
    Last edited by introspec; 26th September 2019 at 18:43.

  #7 · Member (Slovakia)
    Quote Originally Posted by introspec View Post
    (I am sorry about using an image - the [CODE] tag seems to be unavailable.)
    It is, of course, unavailable in the simple text editor - when posting, you'll need to click on the "Go Advanced" button. This opens the full WYSIWYG editor.
    It works perfectly (at least for me) - see below.

    Code:
    sample code tag test
    Mark all the text that you want to put in the Code window and click on the CODE tag in the advanced editor.

    If you're still facing the same issue, you have probably disabled the WYSIWYG editor in settings - I don't think so, but it could be.


  #8 · Administrator Shelwien (Kharkov, Ukraine)
    Maybe it's possible to implement decoding with LUTs?
    Like https://github.com/Cyan4973/FiniteSt...lib/fse.h#L633 ?
    Then it should be possible to optimize the format within the bounds of the same LUT decoder.
    In theory it could even be helpful to optimize the token layout etc. for each specific data sample.

    As an example, there was recently this LZ format with an adaptive len/dist layout:
    https://encode.su/threads/3147-Rever...ll=1#post60910

  #9 · Member (United Kingdom)
    Quote Originally Posted by Shelwien View Post
    Maybe it's possible to implement decoding with LUTs?
    Like https://github.com/Cyan4973/FiniteSt...lib/fse.h#L633 ?
    Then it should be possible to optimize the format within the bounds of the same LUT decoder.
    In theory it could even be helpful to optimize the token layout etc. for each specific data sample.
    There are several ideas mentioned here, so I'll try to address them one by one.

    1a) LUT-based decompressors are not usually practised on 8-bit computers because of the significant memory requirements. Of course, if I absolutely had to have some kind of extreme decompression speed, it could be done. In fact, I experimented with a LUT approach to make an LZF decompressor for the Z80. In my case the LUT gave me about a 10% increase in decompression speed, but given how unbalanced the PC platform is with regard to branching, I am guessing the benefits on a PC would be more significant. At the same time, my standard fast decompressor for LZF is 86 bytes long, whereas the LUT-based decompressor has to be 256 bytes long plus some extra code, i.e. the size quadrupled. There is very little appetite for such solutions in the mainstream. Moreover, I personally suspect that just unrolling more and more logic of the standard decompressor, letting it grow to 300+ bytes, might deliver more of a speed improvement, at least on the Z80.

    1b) Trixter actually implemented a LUT-based decompressor for the 8088. I do not remember him being particularly excited about its performance in the end (maybe he can say a word or two about it here?), but it serves to show that on 16-bit platforms, where memory is less restricted, this way of thinking becomes a lot more practical.

    2) I am not sure how sensitive the decompression speed would be to variations of the token format, especially since the format is usually designed mainly with ratio in mind. However, what I can say is that any byte-token-based format is very convenient for setting up a LUT, especially if your CPU word size is 16+ bits.
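
    As an example, a 16+ bit decompressor can pre-decode all 256 possible token bytes once, then pick up the fields with plain loads. The field layout below mimics an LZSA1-like O|LLL|MMMM token but is purely illustrative:

    Code:
    #include <stdint.h>
    
    typedef struct {
        uint8_t literal_len;    /* pre-decoded literal run length (0-7)  */
        uint8_t match_len;      /* pre-decoded match length (3-18)       */
        uint8_t long_offset;    /* nonzero if a 16-bit offset follows    */
    } TokenInfo;
    
    static TokenInfo token_lut[256];
    
    static void init_token_lut(void)
    {
        for (int t = 0; t < 256; t++) {
            token_lut[t].long_offset = (uint8_t)((t & 0x80) != 0); /* bit 7    */
            token_lut[t].literal_len = (uint8_t)((t >> 4) & 0x07); /* bits 4-6 */
            token_lut[t].match_len   = (uint8_t)((t & 0x0F) + 3);  /* bits 0-3 */
        }
    }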

    Quote Originally Posted by Shelwien View Post
    As an example, there was recently this LZ format with an adaptive len/dist layout:
    https://encode.su/threads/3147-Rever...ll=1#post60910
    Yes, we do appreciate the value of variable length codes in an LZ77-style encoder. Both LZSA1 and LZSA2, but especially LZSA2, support a range of length and offset codes, see https://github.com/emmanuel-marty/lz...ormat_LZSA2.md
    Last edited by introspec; 26th September 2019 at 21:48. Reason: Corrected LZSS to LZ77 which is more relevant where LZSA is considered


  #10 · Member (USA)
    Quote Originally Posted by introspec View Post
    1b) Trixter actually implemented a LUT-based decompressor for the 8088. I do not remember him being particularly excited about its performance in the end (maybe he can say a word or two about it here?), but it serves to show that on 16-bit platforms, where memory is less restricted, this way of thinking becomes a lot more practical.
    The 8088 has an 8-bit bus and can only fetch a single byte every 4 cycles, so optimizing for the 8088 is largely about minimizing total memory accesses and jumps. The 8086/80186/80286 can fetch 2 bytes in the same 4 cycles, so jump tables become more appealing. I implemented a LUT/jumptable for the very first length decode, which had only 8 possible values, so it was easy to include a LUT in the code for a free win. For later length decodes, some logic is necessary to build values, and the benefit of multiple LUTs diminishes (especially if you have to include a 12-bit or 13-bit table in the code, which takes up valuable memory space on a platform intended to have less than 1MB RAM). On the 8086, the first LUT did provide a speedup, so that version of the code was kept as an option for anyone using anything better than an 8088 CPU.

    For those curious about 808x assembler, the specific jumptable version looks like this (the if-literals-are-0-branch short-circuit is there because the "0 literals" case was common enough in my test corpus that it was more often a win than just falling through to the jumptable):

    Code:
            and     al,070H            ;isolate literals length in token (LLL)
            jz      check_offset_size  ;if LLL=0, we have no literals; goto match
    ; Jump to short copy routine for LLL=1 through 6, need_length_byte for LLL=7
            mov     bx,ax              ;prep for table lookup
            jmp     [cs:copytable+bx]
    The full implementation is available if you can stomach more x86 asm.

