
Thread: HBA

  1. #1
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts

    HBA

    I would normally never publish an unfinished project, but since completing it will still take a *long* time and the results already look really promising for such an early state, I decided to share this preliminary demo for anyone interested. I literally finished the compression stage (read: half-baked, unoptimized, without CRC etc.) only a few minutes ago. The decompression stage is not implemented yet, but since I tested each stage individually on small sets during development, it should work fine once done.

    So why even bother? Well, so far, at least on my few small samples, HBA is beating LZMA on every single one, sometimes badly. For example:
    Code:
    "Mamei64.exe" v0.174, original size 126mb, "7zip d64m mc32 lc4" => 26mb, hba => 11mb.
    Yeah, that's roughly a 57% difference. A bug? I doubt it, but only once decompression is fully implemented will I be able to say for sure. And it goes on: a random .bmp of ~3.2 MB: 7zip => 2.6 MB, HBA => ~800 KB. You can test it yourself, as I attached the exe below. Just keep to smaller files; it won't work (for now) on files that are too big (I did not bother with 64-bit integers yet and will have to change that), and it is anyway intended to be used together with a long-distance dedup stage (which I also plan to write one day). Meanwhile, HBA uses at most a constant buffer of a few MB, no dictionary. And it does not have the linear compression/decompression time of the CM family. Oh, I forgot:
    Code:
    Silesia (211,939,002 bytes) => HBA => 22,664,790 bytes, in 109 seconds at ~1898 KB/s, because BWT is slowing it down badly. Of course, this is raw output without any extra bytes (CRC, header etc.).
    So what is it, really? I got the main idea about a month ago. In general, HBA is a BWT+MTF based compressor with a custom arithmetic coder designed specifically for it, plus special general-purpose data filtering/handling (NOT specialized per data type like WAVs or pictures, as other codecs do). It turned out that without the custom arithmetic coder (in my early tests) it showed no advantage over 7zip/LZMA (in fact it was slightly worse; I almost deleted the whole project), but with it, it made all the difference (though in reverse, the AC on its own is of no use either). So the whole set of ideas has to be applied together. Also, since it only requires a small buffer, it can (in the future) be fully parallelized, ideally on a GPU if I can make it that far.

    The main idea of HBA will remain undisclosed for now; first I need to finish the decompression stage and test everything properly before I can even assess its value.

    Final words:
    - This is NOT an April Fools joke, but again, only when decompression is fully working can we assess 100% functionality. Also, maybe I was lucky in my 3-4 tests and it is mediocre after all.
    - It is currently horribly unoptimized everywhere, and BWT accounts for 90%+ of the slowdown (it is just qsort with memcmp). It only uses a 16 KB buffer at the moment, at ~1.6 MB/s on my 4.2 GHz Haswell (1T). Without BWT I already get about 8 MB/s (1T).
    - The buffer is too small; a bigger buffer should improve compression further.
    - I still have another math idea, based on the current one, that should improve compression further.
    - Decompression is not implemented yet, so don't bother with "hba d file file".
    - To be honest, I myself am skeptical at the moment, as it looks too good (at times) to be true. I don't want to come out looking bad if I discover a different "reality" (during decompression implementation, for example), so let's keep expectations realistic for now. The custom arithmetic coder is the deciding factor here, as all other stages should be 99% OK (I hope).
    - In the future, HBA can be made parallel, most likely GPU capable (fully, I think, not just some stages), significantly quicker and likely with even better compression (through many different improvements). So what you see is merely a first attempt. But I welcome your results on your own data.
    - I need a lot more time.

    EDIT: Please read my next post below. I indeed had bugs in the arithmetic encoder which produced the appearance of a better compression ratio. As it stands, the published HBA doesn't work, so I deleted the file link.
    Last edited by elit; 5th April 2019 at 02:33.

  2. #2
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    409
    Thanks
    0
    Thanked 5 Times in 5 Posts
    While I could believe that 7zip can be beaten by some big margin, I would be taken aback by beating the top-scoring CM compressors on Silesia BY FAR, even though they have precompression support and heavy memory usage.
    It is very likely that you have bugs in there. Otherwise it would be a groundbreaking success.

  3. #3
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    One way to check for an encoder bug is to try to compress incompressible data: if the compressed file is smaller, there is a bug, since there are no patterns in something like sharnd.exe's output. Unfortunately I cannot test since I'm on holiday and only have my phone with me. Sounds like quite an interesting project though; best of luck with it.

  4. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Here are some top-notch compression results for silesia.tar :
    http://mattmahoney.net/dc/silesia.html.
    Useful for comparison.

    I would say, at first look, beating cmix + dedicated preprocessing so badly (>30%)
    with just BWT + MTF, on a limited memory budget, with reasonable compression speed,
    looks too good.

    I understand there is a custom arithmetic stage,
    but describing HBA as a BWT+MTF compressor
    puts it into a certain bracket, and there is only so much a good entropy stage can achieve after that.

    Of course, I can only wish these results to be true;
    it's just that experience teaches caution.
    As you have already pointed out, only after the decompressor is completed will the results prove solid.

  5. #5
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts
    I made a quick in-place decompression for the arithmetic coder ( filters... => data => A_enc => A_dec => out ) so that I could test its output and compare it with the pre-encoder stage (i.e. with the arithmetic coder bypassed completely). You guys were right. I indeed had bugs (wrong data output on decompression), related to the probabilities. I fixed the issue and could then decompress properly, but compression worsened dramatically; for example, Silesia went to ~89 MB, twice as bad as 7zip.

    With that said, there are still some things I want to try that might help. For now, I want to apologize (it really was not a prank) and I will update the first post accordingly.
    Well... I guess that's it. If nothing else, I have learned how to code my own data compressor and things like BWT, MTF and especially an arithmetic coder. This area was a distant unknown to me before, so no regrets.

  6. Thanks:

    schnaader (5th April 2019)

  7. #6
    Member
    Join Date
    Jul 2014
    Location
    Mars
    Posts
    199
    Thanks
    135
    Thanked 13 Times in 12 Posts
    Can you say whether your compressor is intended to outperform 7z in terms of compression ratio or speed? I mean, can it become a general-purpose one? You said a GPU can be utilized; if that's true it would be groundbreaking, because I don't know of any 7z-comparable packer out there that can.

  8. #7
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts
    Quote Originally Posted by necros View Post
    Can you say whether your compressor is intended to outperform 7z in terms of compression ratio or speed? I mean, can it become a general-purpose one? You said a GPU can be utilized; if that's true it would be groundbreaking, because I don't know of any 7z-comparable packer out there that can.
    That was (and in fact still is, I just need a little break to finish something else) my target, as unlikely as it may be. I need both reasonable speed and a reasonable compression ratio for it to be relevant at all, otherwise it will fall into obscurity and nobody will use it. So yes, general purpose, and in a way a new standard to use, on top of significantly lower memory usage than LZMA's.

    For speed, the only obstacle is BWT vs. the buffer-size limitation, though a parallel GPU implementation and further optimization should solve it. I only use qsort right now, which is not very good, and I get ~1.1 MB/s on 1T with a 16 KB buffer. bzip2 can handle an almost 1 MB block on 1T extremely fast (~20 MB/s+) on my rig, and if allowed it would be fine with a much bigger buffer IMO. Around 1.1 MB/s per thread is roughly LZMA (7z; FA is faster) speed on my computer, so imagine the efficiency of bzip2 with 4T+ support and how much bigger a buffer you could allow.

    As for arithmetic coding, I implemented it in its original conception using floating-point math instead of integers, which should again be a great fit for the GPU. And since somebody has already implemented BWT efficiently on a GPU, this should be proof enough by now that you can indeed have a massively parallel, full-task (not just some transform operations) GPU-based compressor, not to mention that BWT is memory friendly, unlike LZMA. That's why I see the future in this, not in LZMA (because of its nonlinear memory scaling) nor in CM (because decompression takes as long as compression).

    For ratio: after packing enough big data sets with srep+lzma, I know that dictionary size matters much less after srep or another long-range dedup, so if, say, a 16 MB BWT buffer can be achieved, this could (in theory) be very comparable, but quicker and more memory efficient. That is, at least right now, my vision. On a long enough timeline I simply don't see LZMA or CM as the future, but that's just my opinion; I see "srep" + something BWT-like (with an arithmetic encoder).
    I imagine a future where you get 100 MB/s+ during compression thanks to combined GPU+CPU utilization (with, perhaps, OpenMP), linear memory requirements and decompression that does not take as long as compression. That is the compressor of the future IMO, and I don't know anything other than BWT that does this reasonably well. But then again, I also know far less than some other members here, so...

  9. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,975
    Thanks
    296
    Thanked 1,302 Times in 739 Posts
    > I only use qsort right now which is not very good

    You can use divsufsort at least.
    https://github.com/y-256/libdivsufsort
    or even msufsort
    https://github.com/michaelmaniscalco/msufsort
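
    Usage is roughly like this (a minimal sketch; check divsufsort.h for the exact types on your build):
    Code:
    #include <cstdio>
    #include <cstring>
    #include <vector>
    #include <divsufsort.h>   // from libdivsufsort

    int main() {
        const char* text = "abracadabra";
        saidx_t n = (saidx_t)std::strlen(text);
        std::vector<saidx_t> sa(n);

        // Builds the suffix array of 'text'; returns 0 on success.
        if (divsufsort((const sauchar_t*)text, sa.data(), n) != 0) return 1;

        for (saidx_t i = 0; i < n; ++i) std::printf("%d ", (int)sa[i]);
        std::printf("\n");
    }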

    > As for arithmetic coding I implemented it in its original concept using floating point math instead of integer,

    Keep in mind that although FP hardware usually follows IEEE standard,
    you'd have to write in GPU asm to keep it portable.
    I mean, there're still multiple GPU vendors.

    > BWT is memory friendly unlike lzma

    That's not really true. A BWT suffix array can just as well be used to implement an LZMA matchfinder.

    > Thats why I see future in this, not lzma

    Unfortunately BWT is very limited.
    Its compression ratio is only good for some types of text (not even all text).
    And for decoding it basically needs at least one uncached memory access per byte, which is bad.

    Suffix array still can be a useful data structure for data indexing,
    but I don't see any applications for straight BWT compression.

    Well, unless you can invent a BWT equivalent that can sort symbols by computed context,
    but otherwise is similar to normal BWT.

    > cm's (because of linear cmp/decmp time).

    Unlike LZ/BWT, CM relies more heavily on arithmetic.
    Stuff like interpolation/extrapolation, nonlinear transformations etc.
    can improve its performance even with the same memory usage.
    And since memory is so slow on modern platforms, it may be faster
    to do 100 multiplications than to wait for an uncached read.

    Even now, for tasks like lossless audio compression, CM can
    provide the best ratio and speed at once.
    And once we have TPU coprocessors in desktop PCs, something like NNCP
    might suddenly become very practical.

    BWT may still have plenty of potential for speed optimization,
    but it has a hard limit on compression ratio for non-text data.

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    For a long time I have had a compression scheme that should be within 5% of BSC but with a speed of 500-1000 MB/s using a GPU. I'm still looking for someone interested in having fun making it. It needs a lot of experiments, hence a lot of time to explore all the possibilities.

  11. #10
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    For a long time I have had a compression scheme that should be within 5% of BSC but with a speed of 500-1000 MB/s using a GPU. I'm still looking for someone interested in having fun making it. It needs a lot of experiments, hence a lot of time to explore all the possibilities.
    I was wondering what was happening with this. Did you want someone else to make it? If so what is currently stopping you?

    I wouldn't mind getting my hands dirty in some BWT and getting back into writing GPU code if you want to work together.

    I haven't been focusing on BWT stuff in a long time since it's not applicable for my work, that and I find the challenge of optimizing LZ more interesting.

  12. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Well, the compression plan is:
    1. Thrust for a quick ST4/ST8 implementation - it should be > 1 GB/s on modern GPUs
    2. Then the Behemoth Rank coder
    3. FSE codec

    Once it works this way, we can go about improving each part and/or moving it to the GPU. The entropy coder is especially interesting to me - it seems that a bzip2-like Huffman coder approach is rather easily implementable on a GPU while giving up only about 1% to the Structured Model, and 3% to the sophisticated BSC codec.

    As to who can make it - I'm trying to do the whole thing myself, but if you and/or someone else are interested, we can try to share responsibilities. Eventually I like the idea of integrating the new features into the established BSC codebase, but I don't insist on that.

  13. #12
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts
    Quote Originally Posted by Shelwien View Post
    >
    Suffix array still can be a useful data structure for data indexing,
    but I don't see any applications for straight BWT compression.

    Well, unless you can invent a BWT equivalent that can sort symbols by computed context,
    but otherwise is similar to normal BWT.
    I am pretty sure it's about speedup, and I am also pretty sure that's a BIG deal for BWT. The suffix array is almost always mentioned/recommended in materials about BWT. If you sort the message's own list of suffixes, the average comparison length you need is about n/2, or 50%, because the median length of all suffixes is exactly half the message size:

    mmmmmmmm$
    mmmmmmm$
    mmmmmm$
    mmmmm$
    mmmm$ << median length
    mmm$
    mm$
    m$
    $

    ^ because in such an implementation any compare function (or memcmp()) will stop at the last character (i.e. no wrapping), meaning you don't need to compare the full message size every time for every index. On top of that, the compare function will never return '=', only '<' or '>', because with a suffix array no two strings can be equal, even if the whole message consists of one repeated literal.
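
    A naive version of this sorting looks roughly like this (a minimal sketch with std::sort and a zero-byte sentinel, assuming the data itself contains no zero bytes; this is not the HBA code):
    Code:
    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    // Naive suffix-array construction: sort suffix start positions with a
    // plain lexicographic compare. Because the block ends with a unique
    // sentinel ('\0' here, like the '$' above), every comparison stops before
    // wrapping and no two suffixes ever compare equal.
    std::vector<int> naive_suffix_array(const std::string& block) {
        std::string s = block;
        s.push_back('\0');                        // unique terminator
        std::vector<int> sa(s.size());
        for (size_t i = 0; i < sa.size(); ++i) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            // strcmp-style compare of the two suffixes; never returns "equal"
            return std::strcmp(s.c_str() + a, s.c_str() + b) < 0;
        });
        return sa;
    }

    int main() {
        std::vector<int> sa = naive_suffix_array("mmmmmmmm");
        for (int p : sa) std::printf("%d ", p);   // sorted suffix start positions
        std::printf("\n");
    }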
    Last edited by elit; 18th May 2019 at 19:21.

  14. #13
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts
    I would like to reveal the ideas that got me into HBA; maybe they will be of use to someone else, or perhaps I can get insight into where I went wrong. First of all, after some research, including on GPUs, I decided to switch to integer arithmetic; I basically got a working standard implementation of the original arithmetic coder concept (Witten et al., CACM 1987). I also properly implemented the decoders for BWT and the arithmetic coder, and these now work correctly, so the encoded sizes I later measured on my set of test files (including tarred Silesia) are now correct (still without headers+CRC, as I have not yet bothered to implement the full file format, but that won't add much to the file size).

    With that being said, the results were rather disappointing (to me, or to my expectations at least), but I know it's because I have not yet explored everything sufficiently, especially the arithmetic coder. But let's talk about that original idea.
    (NOTE: before continuing, it is very possible that any of these ideas already exist and were discovered before. I wouldn't know, but please keep it in mind. I got here on my own and did not "steal" anyone's idea, even if it already exists.)

    So, not counting dedup techniques like LZ, basically the two biggest factors that dictate compression are message size and the variation of the individual unit (i.e. the byte). For bytes, the variation is as high as 256 states, so to compress, say, a single byte, the encoding complexity (for lack of a better word) would be 1*256. So I thought one day: what if I break bytes down into smaller units with less variation? In theory this should give me less complexity, even at the cost of a larger message, but of course only if I can treat each unit as independent. For example, if you break bytes into base 16 and get something like FC F1 CF, that by itself is not very useful; but if I can turn it into, say, F F F C C 1, now that would be something else. And of course we do have a technique to sort symbols and revert them to their original state: it's called BWT. So what would be the complexity of a single byte broken into 2 independent base-16 units?
    2*16 = 32. Note that this is significantly less than the previous 1*256 = 256. We inflated the original message (2x), but each individual unit now has only 16 states. This (in my original theory) should give better sorting (because of less variation through the message) and thus lower entropy (which was true according to my measurements). The message is bigger, but we already know from tools like ztool or precomp that increasing the message size can sometimes be beneficial and yield a smaller final size. This was my general idea.
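
    The split itself is trivial; roughly (a minimal sketch of the idea, not the HBA source):
    Code:
    #include <cstdint>
    #include <vector>

    // Split each byte into two independent base-16 units (high nibble first).
    // The message doubles in length, but each symbol now has only 16 states.
    std::vector<uint8_t> split_to_nibbles(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out;
        out.reserve(in.size() * 2);
        for (uint8_t b : in) {
            out.push_back(b >> 4);       // high nibble, value 0..15
            out.push_back(b & 0x0F);     // low nibble, value 0..15
        }
        return out;
    }

    // Inverse transform: re-join nibble pairs into bytes.
    std::vector<uint8_t> join_nibbles(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out;
        out.reserve(in.size() / 2);
        for (size_t i = 0; i + 1 < in.size(); i += 2)
            out.push_back((uint8_t)((in[i] << 4) | (in[i + 1] & 0x0F)));
        return out;
    }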

    Now, what you do after this - whether you join the units back into bytes (FF FC C1) and then apply MTF, or apply MTF + RLE + arithmetic encoding here already without re-joining - is a matter of choice. I tried all variations, and as you will see later, none gave any benefit over applying BWT+MTF+RLE+ARI directly to bytes.

    Anyway, after I had things working properly, I tried all combinations but was never able to reach even bzip2. For quick tests I used one 3.3 MB file (an exe). For reference, 7zip (mc32:d64m:lc4:32) was able to compress that file to 783 KB (BCJ disabled), and bzip2 (900 KB block) to 995 KB. With HBA I was never able to get under 1.1 MB; in fact it was very close to 1.2 MB. When I worked with base 16, since the data values never exceed 15, everything from 16 to 255 was free for RLE, simply:

    if(lit != 0) output lit; else "count zeroes and if over 15 output up to 255 as a single byte".

    So I had all that room to use, because the data itself never exceeded 15 (stored in uint8_t, in fact). Of course, I tried things with and without RLE, MTF, even BWT, you name it; every omission was worse, all three were useful and helped lower the size significantly. Btw, I also tried base 4 and even base 2; no better results. Then I tried applying BWT+MTF+RLE+ARI directly to the data (bytes, without breaking them up further). To my surprise, I got the best result, 1.1 MB (1,106 KB precisely). While I was happy, I was also disappointed, because there went my idea out the window. And still not even beating bzip2.

    Quick note: RLE on bytes works, for obvious reasons, differently:

    if(lit != 0) output lit; else "count zeroes and: if < 2 just output 0; else output 0 + 0 + count - 2(count up to 257);
    So for every 2x0 there will be 3rd byte representing count, including 0 if was only 2. I found surprisingly this to give me best results and gain can be as much as 9% from RLE.
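
    In code, that byte-wise zero-run RLE is roughly (a sketch of the scheme as described, hypothetical function names, not the actual HBA source):
    Code:
    #include <cstdint>
    #include <vector>

    // Zero-run RLE: literals pass through; a run of n zeroes (2 <= n <= 257)
    // becomes 0, 0, (n - 2); a single zero is emitted as-is.
    std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out;
        for (size_t i = 0; i < in.size(); ) {
            if (in[i] != 0) { out.push_back(in[i++]); continue; }
            size_t run = 0;
            while (i + run < in.size() && in[i + run] == 0 && run < 257) ++run;
            if (run < 2) {
                out.push_back(0);                       // lone zero
            } else {
                out.push_back(0);
                out.push_back(0);
                out.push_back((uint8_t)(run - 2));      // run length 2..257
            }
            i += run;
        }
        return out;
    }

    std::vector<uint8_t> rle_decode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out;
        for (size_t i = 0; i < in.size(); ) {
            if (in[i] != 0) { out.push_back(in[i++]); continue; }
            if (i + 1 < in.size() && in[i + 1] == 0) {  // 0, 0, count-2
                size_t run = 2 + (i + 2 < in.size() ? in[i + 2] : 0);
                out.insert(out.end(), run, 0);
                i += 3;
            } else {                                     // lone zero
                out.push_back(0);
                i += 1;
            }
        }
        return out;
    }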

    So with all this implemented and the buffer size the same as bzip2's, 900 KB (for a fair comparison), I got that file down to 1.1 MB as I said, which is still ~10% worse than bzip2 (not even mentioning 7zip), and Silesia down to 57,936,443 bytes = 55.2 MB, where bzip2 gets 51.7 MB and 7zip 47.2 MB. However, Silesia actually hides a weakness of HBA. I also tried to compress a 5.9 MB BMP image, and while 7zip and bzip2 got it to a similar 117.6 KB and 113.2 KB, HBA only got it to 151.2 KB, which is a *shame*, because this is where it should shine; even bzip2 beats 7zip here! That makes it ~26% worse than bzip2 and shows a clear weakness somewhere...

    Which brings me here. First, if you read all the way through, thank you for your time. Now, I think the HBA weakness is the arithmetic coder, but I would like your opinion. After all the tests I tried, it seems like BWT, MTF and even RLE are "good enough", but you tell me. The arithmetic encoder is the standard original implementation working on bytes: https://web.stanford.edu/class/ee398...ithmCoding.pdf
    The probability model is also the standard one: simply increment the symbol's count and halve all counts when crossing max_frequency. That said, I did try different schemes, for example incrementing zeroes by more than 1 (because of the MTF effect), or starting with uneven probabilities, and so on. Not only did nothing I tried give better compression, it got worse in every single case, even where I was sure it would help. One important observation: this simple model gave me the same level of compression as computing the precise ratio of each symbol by counting over the whole message block, i.e. a non-adaptive (static) probability model.
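
    Roughly, the model looks like this (a simplified sketch, not the exact HBA code; MAX_TOTAL and the names are illustrative):
    Code:
    #include <cstdint>

    // Adaptive order-0 byte model: cumulative frequencies feed the arithmetic
    // coder; after each symbol its count is incremented, and when the total
    // crosses MAX_TOTAL all counts are halved (kept >= 1 so no symbol ever
    // gets probability zero).
    struct AdaptiveModel {
        static const uint32_t MAX_TOTAL = 1 << 16;   // must fit the coder's precision
        uint32_t freq[256];
        uint32_t total;

        AdaptiveModel() : total(256) {
            for (int i = 0; i < 256; ++i) freq[i] = 1;   // start with a flat model
        }

        // Cumulative range [low, high) of 'sym' out of 'total', as the coder needs.
        void range(int sym, uint32_t& low, uint32_t& high) const {
            low = 0;
            for (int i = 0; i < sym; ++i) low += freq[i];
            high = low + freq[sym];
        }

        void update(int sym) {
            freq[sym] += 1;
            total += 1;
            if (total >= MAX_TOTAL) {            // rescale: halve all counts
                total = 0;
                for (int i = 0; i < 256; ++i) {
                    freq[i] = (freq[i] + 1) >> 1;
                    total += freq[i];
                }
            }
        }
    };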

    Since then I have read Mahoney's doc and learned that there are binary arithmetic coders that encode bits (or few-bit patterns) instead of bytes. So I would like to know: does that bring any benefit in compression, or is it just for being universal (non-byte dependent)? Should I expect a big enough difference to justify implementing a bit (or 4-bit pattern?) arithmetic encoder? Also, I am considering adding a quick LZ pass on the input before BWT, what do you think? But then again, bzip2 and others don't have one and still compress better, so I don't think that's it.

    I have spent months getting here and feel stuck. I know this is my very first compressor ever, and not long ago I had no clue about any of these algorithms; it beats zip at least, but my ego is still hurt. I really expected to beat the whole 7zip stack and bring us into a new era of linear memory and GPU acceleration.

  15. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,975
    Thanks
    296
    Thanked 1,302 Times in 739 Posts
    Here're some reference sources:
    http://nishi.dreamhosters.com/u/rc_v5.rar binary adaptive arithmetic coder tuned for BWT data
    http://nishi.dreamhosters.com/u/BWT16.rar simple BWT/unBWT implementation for 8-bit and 16-bit alphabets.
    http://nishi.dreamhosters.com/u/gmtf_v1.rar MTF utility
    http://nishi.dreamhosters.com/u/bsc240_qlfc_v1.rar QLFC postcoder from bsc compressor

    In general:

    1) Sorting with a sub-byte alphabet is a bad idea for byte-aligned data (which means normal text, exes, 8bit+ images etc).
    It can still be good for other types of data (1-bit bitmap images).
    It's a bad idea because such sorting would mix up symbols from contexts with different alignment,
    and these have different probability distributions.
    Basically the same reason why BWT compressors have problems with compression of sorted wordlists.

    2) Compression-wise it's best to directly encode BWT output with a bitwise arithmetic coder (see the sketch after this list).
    MTF and other transforms actually hurt compression, but provide ways to improve coding speed.

    3) LZ preprocessing is possible (it has to be LZ without entropy coding though, like LZ4 maybe),
    and can even be helpful to work around some common BWT issues with very redundant data,
    but like (2) it's expected to hurt compression.
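
    To illustrate the bitwise decomposition in (2): roughly, a byte is coded bit by bit through a tree of adaptive probabilities. This is a minimal sketch (not from rc_v5; the counter update and names are illustrative), and instead of driving a real coder it just accumulates the ideal code length, which is enough to compare modelling choices:
    Code:
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Bitwise model over a byte alphabet: 255 internal nodes of a binary tree,
    // each holding an adaptive probability of the next bit being 1.
    // We sum -log2(p) to estimate the compressed size under this model.
    struct BitTreeModel {
        uint16_t p1[256];                 // node -> probability of bit==1, scale 0..4096
        BitTreeModel() { for (int i = 0; i < 256; ++i) p1[i] = 2048; }

        double code_byte(uint8_t byte) {
            double bits = 0.0;
            int node = 1;                            // root of the bit tree
            for (int i = 7; i >= 0; --i) {
                int bit = (byte >> i) & 1;
                double p = p1[node] / 4096.0;        // P(bit == 1) at this node
                bits += -std::log2(bit ? p : 1.0 - p);
                // simple shift-based adaptive update toward the observed bit
                if (bit) p1[node] += (4096 - p1[node]) >> 5;
                else     p1[node] -=  p1[node] >> 5;
                node = (node << 1) | bit;            // descend: context = bits so far
            }
            return bits;
        }
    };

    int main() {
        BitTreeModel m;
        std::vector<uint8_t> data(100000, 'a');      // stand-in for BWT output
        double bits = 0.0;
        for (uint8_t b : data) bits += m.code_byte(b);
        std::printf("estimated size: %.0f bytes\n", bits / 8.0);
    }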

  16. Thanks:

    elit (26th May 2019)

  17. #15
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    35
    Thanks
    19
    Thanked 16 Times in 7 Posts
    So I tried that binary coder (the first link) on my BWT-only output. Silesia went to 46.4 MB, which actually beats 7zip. On that 3.3 MB exe it slightly beat bzip2 but not 7zip, and on the BMP image it was surprisingly still weak at 126 KB. That said, it does look like the arithmetic encoder/model plays a large role, and as you said, adding MTF hurt compression with it, unlike with my own coder.

    EDIT: and QLFC on my BWT output beat everything including bzip2 on the BMP image; only on the exe was 7zip better.

  18. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,975
    Thanks
    296
    Thanked 1,302 Times in 739 Posts
    BMP is not very compatible with the BWT transform.
    A BMP is kind of a 3D object (pixel structure x width x height), so bytewise BWT doesn't fit it well.
    You can at least split the color planes into separate files.
    Or, ideally, apply some specialized transform to 1D, like this: https://encode.su/threads/2702-Adapt...tion-impl-in-C
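
    The plane split itself is trivial, roughly (a sketch assuming raw 24-bit BGR pixel data with row padding already stripped):
    Code:
    #include <array>
    #include <cstdint>
    #include <vector>

    // De-interleave 24-bit pixel data into three separate planes, so that each
    // plane can be fed to BWT (or written to a separate file) on its own.
    std::array<std::vector<uint8_t>, 3> split_planes(const std::vector<uint8_t>& pix) {
        std::array<std::vector<uint8_t>, 3> planes;
        for (auto& p : planes) p.reserve(pix.size() / 3);
        for (size_t i = 0; i + 2 < pix.size(); i += 3) {
            planes[0].push_back(pix[i]);       // B (24-bit BMP stores pixels as BGR)
            planes[1].push_back(pix[i + 1]);   // G
            planes[2].push_back(pix[i + 2]);   // R
        }
        return planes;
    }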

  19. #17
    Member
    Join Date
    Jul 2014
    Location
    Mars
    Posts
    199
    Thanks
    135
    Thanked 13 Times in 12 Posts
    So what about a time/ratio comparison with 7z?
