
Thread: Suffix Trees, Suffix Arrays, or What?

  1. #1 incblitz (Member, phils)

    Suffix Trees, Suffix Arrays, or What?

    Good day. I'm trying to research what's currently used by the best
    LZ77-based programs for finding dictionary matches.

    I've read about suffix trees (although I'm still confused about how
    they can be implemented) and suffix arrays.

    My question is: which search method is fastest without hurting the
    compression ratio?

    One paper I read showed that suffix arrays changed the ratio somewhat,
    although I don't get why.

  2. #2 Matt Mahoney (Expert, Melbourne, Florida, USA)
    Hash tables, I believe.

  3. #3 incblitz (Member, phils)
    Thanks. Can you point to any research and/or comparison on this?

  4. #4 Shelwien (Administrator, Kharkov, Ukraine)
    > Good day. I'm trying to research what's currently used by the best
    > LZ77-based programs for finding dictionary matches.

    How about first determining which these "best LZ77-based programs" are?

    > I've read about suffix trees (although I'm still confused about how they can be implemented)

    Afaik ppmd is based on a suffix tree -
    http://ctxmodel.net/files/PPMd/ http://compression.ru/ds
    Normally it doesn't make sense for LZ though.
    That is, it can be used, but it is relatively inefficient in its memory usage.

    > and suffix arrays.

    As to suffix arrays, the only practical LZ-like program that I know of is bsdiff:
    http://nishi.dreamhosters.com/u/bsdiff_sh2.rar http://www.daemonology.net/bsdiff/

    > My question is: which search method is fastest without hurting the
    > compression ratio?

    Afaik it's lzma's bt4, or maybe Bulat's ht4, or maybe Cyan's MMC.

    > One paper I read showed that suffix arrays changed the ratio somewhat,
    > although I don't get why.

    Not sure what you mean by "changed the ratio" here, but
    blocksorting methods provide a way to do precise string matching
    with 5N memory usage (N for the data plus 4N for 32-bit suffix
    indices), or even lower if necessary, while practical sequential
    structures (like bt4 or MMC) start at 9N.
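
    For reference, a sketch of how suffix-array matching can work (roughly
    the scheme bsdiff is built on): sort all suffix positions, then
    binary-search to the pattern's lexicographic neighborhood and extend
    its two neighbors. std::sort is used for brevity where bsdiff uses
    qsufsort, and the function names here are made up:

    Code:
    #include <algorithm>
    #include <string>
    #include <vector>

    // Build a suffix array (std::sort for brevity; O(n^2 log n) worst case,
    // real code uses qsufsort/divsufsort).
    std::vector<int> build_sa(const std::string& s) {
        std::vector<int> sa(s.size());
        for (size_t i = 0; i < s.size(); i++) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Longest match of 'pattern' against any position in 's': binary-search
    // to the neighborhood, then extend the two lexicographic neighbors.
    // *match_pos is only valid when the returned length is > 0.
    int longest_match(const std::string& s, const std::vector<int>& sa,
                      const std::string& pattern, int* match_pos) {
        auto it = std::lower_bound(sa.begin(), sa.end(), pattern,
            [&](int suf, const std::string& p) {
                return s.compare(suf, std::string::npos, p) < 0;
            });
        int best = 0;
        auto check = [&](std::vector<int>::const_iterator cand) {
            int suf = *cand, len = 0;
            while (suf + len < (int)s.size() && len < (int)pattern.size()
                   && s[suf + len] == pattern[len]) len++;
            if (len > best) { best = len; *match_pos = suf; }
        };
        if (it != sa.end()) check(it);
        if (it != sa.begin()) check(it - 1);
        return best;
    }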

  5. #5 incblitz (Member, phils)
    Thanks Shelwien, big help.

    >> How about first determining which these "best LZ77-based programs" are?

    Anyway: what are these, then?

    And where can I find any reading material on bt4, ht4, and MMC?

    Thanks again.

    EDIT: I already found MMC here: http://encode.su/threads/1155-A-new-...ll=1#post22901
    Last edited by incblitz; 4th July 2011 at 05:32.

  6. #6 Member (Germany)
    For optimal parsing, hash tables are too slow IMO, so a binary tree + hashing would be better.

  7. #7 Shelwien (Administrator, Kharkov, Ukraine)
    > Anyway: what are these, then?

    I don't know; you didn't say anything about what kind of LZ you want.
    For example, these seem to be the most popular now:
    http://encode.su/threads/1266-In-mem...ll=1#post25401

    Or maybe lzham/lzmh/lzma if you're ok with slower encoding.

    Or maybe ROLZ is ok too, although it's not exactly LZ77.

    > And where can I find any reading material on bt4, ht4, and MMC?

    http://downloads.sourceforge.net/sev...zma920.tar.bz2
    http://freearc.org/download/0.666/Fr...ources.tar.bz2
    http://fastcompression.blogspot.com/

  8. #8 incblitz (Member, phils)
    ROLZ is similar to LZP... not really what I want, although I like the
    concept. Performance-wise, I'll look elsewhere... (I was following
    Malcolm's work before, doing my own private benchmarks back in his
    RKIVE days... before I 'vanished' from this hobby around 2002. I'm
    back after 9 years LOL)

    I'm interested in an LZ implementation that has sufficiently fast
    decompression while going for the best possible compression. I am, of
    course, looking for speedups; that's why I've been rummaging through
    DCC papers, and the topic above is all I found.

  9. #9 Shelwien (Administrator, Kharkov, Ukraine)
    > I'm interested in an LZ implementation that has sufficiently fast
    > decompression while going for the best possible compression.

    ROLZ seems to have more compression potential.
    Also, decoding of ROLZ+bitcode would likely be no slower than
    LZ77+ari (like lzma).

    > I am, of course, looking for speedups;

    Speedups of what?
    The slowest thing in LZ77 (and ROLZ) encoding is parsing optimization,
    but it can be skipped for speed (at the cost of worse compression).

    As to matchfinders, there's likely nothing faster than hash chains.

    Also it seems helpful to separately remove the long matches first
    (see Bulat's rep/srep filters).
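
    For illustration, a minimal sketch of a hash-chain matchfinder (the
    general idea behind ht4-class finders, not Bulat's actual code; names
    and constants here are made up). The MAX_STEPS cap on chain walking is
    exactly what trades compression for speed:

    Code:
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // head[h] is the most recent position with hash h; prev[pos] links each
    // position to the previous one with the same hash, forming a chain.
    struct HashChains {
        static const int HASH_BITS = 17, MAX_STEPS = 64, MIN_MATCH = 4;
        std::vector<int32_t> head, prev;

        explicit HashChains(int n) : head(1 << HASH_BITS, -1), prev(n, -1) {}

        static uint32_t hash4(const uint8_t* p) {   // needs 4 readable bytes
            uint32_t v; memcpy(&v, p, 4);
            return (v * 2654435761u) >> (32 - HASH_BITS);
        }

        // Find the longest match for buf[pos..end), then insert pos.
        int find(const uint8_t* buf, int pos, int end, int* match_pos) {
            uint32_t h = hash4(buf + pos);
            int best = 0, cand = head[h];
            for (int step = 0; cand >= 0 && step < MAX_STEPS; step++) {
                int len = 0;
                while (pos + len < end && buf[cand + len] == buf[pos + len]) len++;
                if (len > best) { best = len; *match_pos = cand; }
                cand = prev[cand];
            }
            prev[pos] = head[h];      // link current position into the chain
            head[h] = pos;
            return best >= MIN_MATCH ? best : 0;
        }
    };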

  10. #10 incblitz (Member, phils)
    ROLZ seems to fare worse in compression ratio than LZMA
    implementations, as seen in this benchmark:

    http://www.maximumcompression.com/data/summary_sf.php

    RE long matches: I've seen a paper where the compression ratio is
    increased by optimizing back after a greedy search (you go back over
    an LZ-processed block, look at the holes, and see if the long matches
    can be replaced by two or more contiguous or nearly contiguous
    matches).

    (Or is this parsing optimization? See the sketch below.)
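
    The textbook form of parsing optimization is a shortest-path /
    dynamic-programming pass over the block. A sketch under assumed flat
    bit costs (a real coder queries its entropy model instead; all names
    here are made up):

    Code:
    #include <algorithm>
    #include <vector>

    struct Match { int pos, len; };   // a candidate match at some position

    // cost[i] = cheapest encoding of the first i bytes; each position
    // relaxes a literal edge and one edge per candidate match.
    std::vector<int> optimal_parse(
            int n, const std::vector<std::vector<Match>>& matches) {
        const int LIT_COST = 9, MATCH_COST = 32;   // assumed flat bit costs
        const int INF = 1 << 30;
        std::vector<int> cost(n + 1, INF), from(n + 1, -1);
        cost[0] = 0;
        for (int i = 0; i < n; i++) {
            if (cost[i] == INF) continue;
            if (cost[i] + LIT_COST < cost[i + 1]) {       // literal edge
                cost[i + 1] = cost[i] + LIT_COST;
                from[i + 1] = i;
            }
            for (const Match& m : matches[i]) {           // match edges
                int j = std::min(n, i + m.len);
                if (cost[i] + MATCH_COST < cost[j]) {
                    cost[j] = cost[i] + MATCH_COST;
                    from[j] = i;
                }
            }
        }
        return from;   // walk back from n to recover the chosen parse
    }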

  11. #11 Member (Kraków, Poland)
    The 7z format supports PPMd compression, and that PPMd mode is used in the SFC test. Take a look at MFC, where 7-Zip is used in LZMA mode only. There RZM beats LZMA by a few megabytes, although lzmarec could probably beat RZM on that dataset.

  12. #12 incblitz (Member, phils)
    I steer away from multiple-file sets when judging an LZ-based program,
    since that benchmark is won by Uharc.

    I had a short email discussion with Uwe Herklotz (the Uharc author)
    way back where he described the concepts he used. If I remember
    correctly, Uharc used LZ77 + order-1 ari + multimedia filters.

    Compression ratios on multiple-file sets are very much affected by the
    number and effectiveness of the models an archiver has.

    All of the above are, of course, just my opinions of the moment.

    So I'd like to see purely the compression ratios of LZ-based programs
    on text sources.

  13. #13 Shelwien (Administrator, Kharkov, Ukraine)
    http://compressionratings.com/sort.cgi?txt1.brief+4n
    http://mattmahoney.net/dc/text.html

    Although it doesn't make sense to compress texts with LZ.

  14. #14 incblitz (Member, phils)
    > Although it doesn't make sense to compress texts with LZ.

    Why so?

  15. #15 Member (Kraków, Poland)
    It always depends on what you mean by LZ. For example, WRT (word replacing transform) can be considered a variant of LZ, and, when used as a preprocessing stage, it often improves compression of textual data. Srep from Bulat is also LZ and is worth using as a preprocessing step.

  16. #16 Shelwien (Administrator, Kharkov, Ukraine)
    Afaik WRT is a specific program, while its main transform is called LIPT - http://academic.research.microsoft.c...on-performance
    And it mainly improves compression for BWT/PPM/CM compressors; in combination with LZ it's not that helpful.
    Note that WRT-the-program actually applies other transforms too (like "capital conversion"), which can't be considered a version of LZ78 (unlike LIPT).

    As to "why so?" - because LZ shows at least 20% worse compression than BWT/CM/PPM on texts, including wrt-processed ones.
    Surely the difference becomes smaller with larger text volumes - for enwik9 lzma is better than some PPMs, but that's mainly due
    to memory usage issues; given enough memory, statistical methods would always compress better.
    There's also the matter of speed, but bsc, ppmd, and ccm all compress a few times faster than lzma,
    and bsc's decoding is only 2x slower, and it's likely possible to improve it further.

    Code:
    c.size    c.time  d.time
    20907118  13.344s 3.656s  bsc.exe e enwik8 1 -b100 -m0f
    24557177  86.969s 1.719s  lzma.exe e enwik8 2 -a1 -d27 -fb273 -mc99999999 -lc8 -lp0 -pb0 -mfbt4 -mt4
    24213805 120.937s 9.344s  plzma.exe c2 enwik8 4  27 99999999 273 8 0 0

  17. #17 incblitz (Member, phils)
    Well, LZ77-based algorithms still decompress fast.

    BWT has the property that it matches on all orders,
    but still, its decompression speed is slower than LZ77's,
    and BWT is not adaptive across blocks.

    PPM is slow and symmetric in compression/decompression speeds.

    + I have some ideas on LZ77 that I'd like to implement

  18. #18 Shelwien (Administrator, Kharkov, Ukraine)
    > Well, LZ77-based algorithms still decompress fast.

    As you can see, bsc's decoding speed is the same as an
    unoptimized lzma decoder's.
    Also, bsc would still have better compression without
    arithmetic coding, so there's quite some room for tweaking.

    > and BWT is not adaptive across blocks.

    BWT coders can still be partially adaptive (especially schemes like m03).

    > PPM is slow and symmetric in compression/decompression speeds.

    PPM can process texts at up to 9-10 MB/s with much better compression than LZ.
    Also, complete symmetry is not necessary. You can find and store the optimal
    order for decoding, among many other options.

    But surely, if we only consider decoding speed, LZ77 is undefeated.

    Afaik decoding speed is overrated, though.
    Not everybody has gigabit internet and an SSD RAID for storage.

    > I have some ideas on LZ77 that I'd like to implement

    Sure, but I'd suggest looking at lzma first.
    There's nothing surprising overall, but the entropy coder choice,
    literal masking, rep-codes, context state machine, and distance alignment
    are all very helpful, and people frequently start with designs
    which can never beat lzma simply because they're too simple.

  19. #19 incblitz (Member, phils)
    Thanks again for the heads-up. I was already starting to study the LZMA
    code before I posted. Just took a break to see what others would say.

    Again, I appreciate all your comments.

  20. #20 Programmer (PL)
    > Afaik WRT is a specific program, while its main transform is called LIPT - http://academic.research.microsoft.c...on-performance

    WRT is a highly improved concept of LIPT and StarNT:
    http://www.ii.uni.wroc.pl/~inikep/pa...gDictCompr.pdf
    All of the above-mentioned algorithms use a static dictionary, but XWRT (which works with all kinds of text files) uses a semi-dynamic or a dynamic approach, and is therefore a little similar to LZ77. A toy sketch of that semi-dynamic idea follows below.
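
    A toy two-pass word-replacement sketch (not the actual (X)WRT
    algorithm; real versions escape bytes in the code range, handle
    capitalization, and rank words by frequency):

    Code:
    #include <cctype>
    #include <map>
    #include <string>
    #include <vector>

    // Pass 1 counts words; pass 2 replaces frequent ones with 1-byte codes
    // 0x80..0xFF (escaping of pre-existing high bytes is omitted here).
    std::string wrt_encode(const std::string& text,
                           std::vector<std::string>* dict) {
        std::map<std::string, int> freq;
        std::string word, out;
        for (char c : text)                        // pass 1: count words
            if (isalpha((unsigned char)c)) word += c;
            else { if (word.size() > 2) freq[word]++; word.clear(); }
        if (word.size() > 2) freq[word]++;

        std::map<std::string, unsigned char> code;
        for (auto& [w, f] : freq)                  // first 128 frequent words
            if (f >= 4 && dict->size() < 128) {
                code[w] = (unsigned char)(0x80 + dict->size());
                dict->push_back(w);
            }

        word.clear();
        auto flush = [&] {
            auto it = code.find(word);
            if (it != code.end()) out += (char)it->second;
            else out += word;
            word.clear();
        };
        for (char c : text)                        // pass 2: replace words
            if (isalpha((unsigned char)c)) word += c;
            else { flush(); out += c; }
        flush();
        return out;   // the decoder maps codes back through 'dict'
    }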

  21. #21 Shelwien (Administrator, Kharkov, Ukraine)
    Yes, but is there any alternative implementation? For LIPT there are 4-5 independent ones.
    Imho it's like how Wikipedia says that lzma is not LZ77 because it's so different.

  22. #22 incblitz (Member, phils)
    Hello again.

    RE LZMA:

    It's taking me a while to figure out the source code.

    Anyway, so far:

    rep codes - mean a previous distance, repeated
    entropy coder - seems to be arithmetic coding, although there's some funny code I still have to decipher
    literal masking - I don't get this?
    distance alignment - haven't found this yet.

  23. #23 Member (Kraków, Poland)
    As to LZMA, Shelwien probably explained that in some other thread.

  24. #24 Shelwien (Administrator, Kharkov, Ukraine)
    > rep codes - mean a previous distance, repeated

    Yes, and lzma also actually takes them into account precisely enough
    in its parsing optimization.

    > entropy coder - seems to be arithmetic coding,

    Yes, something like this - http://encode.su/threads/1153-Simple...angecoder-demo
    It's important, because a bytewise rangecoder would have considerably different properties.

    > although there's some funny code I still have to decipher

    That might be the stuff related to uncompressed "direct" coding -
    there's nothing special really, just coding of bits with probability 1/2.

    > literal masking - I don't get this?

    It's literal coding after a match.
    Such a literal is unlikely to be equal to the next byte of the previously referenced string.

    > distance alignment - haven't found this yet.

    There are actually two types of alignment.
    First, the lp/pb parameters for explicit alignment.
    Second, the low bits of a distance are coded in reverse order
    (but in the context of previously coded bits),
    which improves distance coding for structured data.
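
    For the masked-literal part, a sketch of lzma-style literal decoding,
    following the scheme in Igor Pavlov's reference decoder; RangeDecoder
    and Prob stand for an assumed bitwise adaptive rangecoder interface:

    Code:
    typedef unsigned short Prob;
    struct RangeDecoder { unsigned DecodeBit(Prob* p); };  // assumed interface

    // 'probs' is one 0x300-entry table (selected by the literal context).
    // After a match, bits are decoded with probabilities indexed by the
    // corresponding bit of match_byte (the byte at the same offset in the
    // referenced string); on the first mismatch it falls back to plain coding.
    unsigned decode_literal(RangeDecoder& rc, Prob* probs,
                            bool after_match, unsigned match_byte) {
        unsigned symbol = 1;
        if (after_match) {
            do {
                unsigned match_bit = (match_byte >> 7) & 1;
                match_byte <<= 1;
                unsigned bit =
                    rc.DecodeBit(&probs[((1 + match_bit) << 8) + symbol]);
                symbol = (symbol << 1) | bit;
                if (match_bit != bit) break;   // diverged: stop masking
            } while (symbol < 0x100);
        }
        while (symbol < 0x100)                 // plain bit-tree coding
            symbol = (symbol << 1) | rc.DecodeBit(&probs[symbol]);
        return symbol & 0xFF;
    }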

  25. #25 incblitz (Member, phils)
    I just read the PM... many thanks.

    > Yes, and lzma also actually takes them into account precisely enough
    > in its parsing optimization.

    I'll look into this.

    > It's literal coding after a match.

    Oh yes. One of my ideas was to look at all strings that match the
    previously matched string. If I'm going to code a literal and not a
    match, I will create a new stat table, zeroing the stats for the
    characters that follow each of those matching strings in the dictionary.
    This is a bit slower, but will definitely save bits (fractional or multiple bits).

    (There's one more thing I remember coming up with regarding this
    literal after the matched string. It was years ago [2001?], but all my
    notes are on an old IDE hard drive that I still have no way of
    accessing. I'll probably recall it after some reading, coding, and
    brushing up; I know I'm a bit rusty.)

    I don't expect much savings from that, though.

    > There are actually two types of alignment.

    I'll check these out.

  26. #26 Shelwien (Administrator, Kharkov, Ukraine)
    > Oh yes. One of my ideas was to look at all strings that match the previously matched string.

    That's a straight path to PPM, you know.

    Also, lzma actually still sometimes encodes a masked literal byte after a match -
    for example, when it hits the maximum match length.

    > This is a bit slower, but will definitely save bits (fractional or multiple bits).

    This is really slow (not "a bit" at all), and can only be done with a bytewise rangecoder
    (it would be too slow with a bitwise one) - in fact, I tried a few versions in lzmarec, and
    Igor's method wins in both speed and compression.

    > I don't expect much savings from that, though.

    I just tried disabling masked literal coding and got this:
    Code:
    (24695097/24557177-1)*100 = 0.56% // lzma
    (24315563/24213805-1)*100 = 0.42% // lzmarec
    It's actually pretty hard to improve lzma by that much.

    lzmarec is my improved entropy backend for lzma, from http://encode.su/threads/1231-lzma-recompressor

    Also some other related threads:
    http://encode.su/threads/1297-plzma-codec
    http://encode.su/threads/1288-LZMA-markup-tool
    http://encode.su/threads/1177-lzma-s...cy-measurement

  27. #27 incblitz (Member, phils)
    > That's a straight path to PPM

    Hehe... well, it's an idea I haven't gotten around to testing.

    > when it hits the maximum match length

    Well, it can be a special case. And I think rep code 0 will
    be perfect for this.

    > I just tried disabling masked literal coding and got this:

    I had read that before, but was confused by how it improved the
    ratio.

    > It's actually pretty hard to improve lzma by that much.

    An idea I have which I believe will improve compression is
    shuffling blocks so that blocks with very similar strings
    are grouped together.

    This was done for grouping files together (I first saw it with,
    I believe, Robert Jung's jar, and also with rar), but if I
    can test several heuristics, I might improve compression
    by shuffling the blocks to be compressed themselves.

    I believe this will also work in BWT-based compression,
    and may improve PPM as well (although it may be more
    complicated).
    Last edited by incblitz; 7th July 2011 at 07:23.

  28. #28 Shelwien (Administrator, Kharkov, Ukraine)
    > An idea I have which I believe will improve compression is
    > shuffling blocks so that blocks with very similar strings
    > are grouped together.

    Unfortunately, that's already implemented,
    and it works for lzma as well - see http://www.ctxmodel.net/files/PPMd/segfile_sh2.rar
    bsc also has such a filter, as do some other compressors/archivers.

    I think you can only beat that by making it adaptive and contextual -
    i.e. excluding some areas from the distance set in a given context.

    > I believe this will also work in BWT-based compression, and may
    > improve PPM as well (although it may be more complicated)

    You're a little late with that idea, but I guess you still have a chance
    if you manage to invent a method which can process
    large volumes of data fast enough.

    The main problem is that a data similarity measure has to be based
    on some compression algorithm, but if the chosen algorithm is too simple,
    its effect won't always be helpful.
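
    As a toy illustration of similarity-based block reordering, with a byte
    histogram as a deliberately crude stand-in for a compression-based
    similarity measure (all names here are made up):

    Code:
    #include <array>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    using Hist = std::array<double, 256>;

    Hist histogram(const std::vector<uint8_t>& block) {  // assumes non-empty
        Hist h{};
        for (uint8_t b : block) h[b] += 1.0;
        for (double& x : h) x /= block.size();
        return h;
    }

    double dist(const Hist& a, const Hist& b) {          // L1 distance
        double d = 0;
        for (int i = 0; i < 256; i++) d += std::fabs(a[i] - b[i]);
        return d;
    }

    // Greedy nearest-neighbor chaining: start at block 0, always append the
    // unused block whose histogram is closest to the last one picked.
    std::vector<int> order_blocks(const std::vector<std::vector<uint8_t>>& blocks) {
        std::vector<Hist> h;
        for (auto& b : blocks) h.push_back(histogram(b));
        std::vector<char> used(blocks.size(), 0);
        std::vector<int> order{0};
        used[0] = 1;
        while (order.size() < blocks.size()) {
            int cur = order.back(), best = -1;
            double bd = 1e18;
            for (int j = 0; j < (int)blocks.size(); j++) {
                if (used[j]) continue;
                double d = dist(h[cur], h[j]);
                if (d < bd) { bd = d; best = j; }
            }
            order.push_back(best);
            used[best] = 1;
        }
        return order;   // compress the blocks in this order
    }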

  29. #29 incblitz (Member, phils)
    Well, I have more of these ideas, centering mostly on LZ77. (I have
    several ideas for PPM too, but I'm not as thrilled to implement those
    as the LZ77 ones.)

    Anyway, it's study mode for me for now.

  30. #30 Shelwien (Administrator, Kharkov, Ukraine)
    Some topics for you then:
    1. How to find the optimal parsing with rep-codes?
    (The usual dynamic programming approach doesn't quite work,
    because the "base" match may not be the best one locally.)
    2. How to implement parallel processing for LZ77, especially decoding?
    How many async threads is it possible to create?
    3. What's the best way to encode distances?
    (I managed to improve lzma's distance model a little, but it's still ugly.)
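
    For topic 3, the baseline to beat is lzma's align coding: the low bits
    of a distance are coded lowest-first through a reverse bit-tree, each
    bit in the context of the bits already coded. A sketch, with
    RangeEncoder and Prob standing for an assumed bitwise adaptive
    rangecoder interface:

    Code:
    typedef unsigned short Prob;
    struct RangeEncoder { void EncodeBit(Prob* p, unsigned bit); };  // assumed

    // Reverse bit-tree coding of the low bits of a distance: distances that
    // share low bits (e.g. fields of fixed-size records) share contexts.
    void encode_aligned_bits(RangeEncoder& rc, Prob* probs,
                             unsigned low_bits, int num_bits) {
        unsigned ctx = 1;
        for (int i = 0; i < num_bits; i++) {
            unsigned bit = low_bits & 1;
            low_bits >>= 1;
            rc.EncodeBit(&probs[ctx], bit);
            ctx = (ctx << 1) | bit;   // already-coded bits form the context
        }
    }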


