
Thread: Tree alpha v0.1 download

  1. #121
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    It does not say sibling list insertion is O(1). It says child insertion is O(1). I think they assume readers know a suffix insertion requires the lookup of the parent followed by a child insertion, but could be more explicit about this to avoid confusion.
    I don't think it's that. If insertion requires a lookup, then it's simply incorrect to say insertion is O(1).

    I find the table to be useful the way it is because it shows the performance of each method for the two parts of the suffix insertion. If they didn't show each step, sibling lists and sorted arrays would show the same time order for suffix insertion. By breaking it into the two steps, it's easy to see that sorted arrays are better than sibling lists for lookup but worse for child insertion.
    It might look useful, but it's misleading. It's kind of irrelevant to real-world performance.

    You might be interested to read about burst tries, because they deal with a pretty much identical situation (which is common to all tries). The research paper talks about various alternatives and how they arrived at their solution.

    What I don't like about the table is that it doesn't show the cost of suffix lookups, which I think is different from the cost of finding a child of a character plus inserting the child. At least for sibling lists and sibling trees, there are as many traversals of sibling lists or trees as there are nodes in the match string. For suffix arrays, it seems different since a program would just do one binary search with one or more character comparisons per element searched, with the number of extra character comparisons being equal to something like two times the length of the match. I have never implemented a suffix array, so I'm a little fuzzy on that, but it does seem like a suffix array lookup should be faster in many/most cases.
    The expected look-up performance of a suffix array is pretty straightforward. Suffix trees are not straightforward; for one thing, there are infinitely-many ways to implement them. They both support efficient lookups. If I had to choose one of them to use for looking up suffixes, I'd go with a SA.

    I would not implement the suffix array construction myself. Computing a suffix array is an entire subject unto itself. If your goal is to use a SA, rather than invent a suffix sort, then it's a good idea to use one that already exists.
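    For concreteness, here is a minimal sketch of the binary-search lookup against a plain suffix array that we're talking about (illustrative only, not from any particular library; real code would also compute the matching upper bound and usually lean on an LCP array):
    Code:
    #include <string.h>

    /* Lower-bound binary search over a lexicographically sorted suffix array.
       Roughly O(m log n) character comparisons for a pattern of length m. */
    static size_t sa_lower_bound(const unsigned char *text, size_t n,
                                 const int *sa, const unsigned char *pat, size_t m)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            size_t suf_len = n - (size_t)sa[mid];
            size_t cmp_len = suf_len < m ? suf_len : m;
            int c = memcmp(text + sa[mid], pat, cmp_len);
            if (c < 0 || (c == 0 && suf_len < m))
                lo = mid + 1;        /* this suffix sorts before the pattern */
            else
                hi = mid;            /* this suffix is >= the pattern */
        }
        return lo;                   /* index of the first suffix that could start with pat */
    }
    The caller then checks whether the suffix at the returned index actually begins with the pattern.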
    Last edited by nburns; 24th May 2014 at 22:47.

  2. #122
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I don't think it's that. If insertion requires a lookup, then it's simply incorrect to say insertion is O(1).



    It might look useful, but it's misleading. It's kind of irrelevant to real-world performance.
    Look at bullet #2 above the table, which applies to column #2 the way I read it. I believe they are only talking about insertion of a child on a given character (its parent). Breaking the data up the way they have allows the reader to see the different reasons why building basic suffix trees and building basic suffix arrays are each O(sigma), since both lookup and insertion are needed to build the structure.

    Quote Originally Posted by nburns View Post
    The expected look-up performance of a suffix array is pretty straightforward. Suffix trees are not straightforward; for one thing, there are infinitely-many ways to implement them. They both support efficient lookups. If I had to choose one of them to use for looking up suffixes, I'd go with a SA.
    TreeCompress has an equal number of lookups and insertions. When the tree build is done, the scoring algorithm only traverses the tree one time and is then done with the tree. Clearly an algorithm that had more lookups than insertions would tilt the best approach much more toward any approach besides a suffix tree, but that is not the case here. Also, it is very easy to pass log ratios and counts through an uncompressed tree when scoring the nodes. I don't doubt that a more elegant approach could speed up TreeCompress, but am also not convinced enough to undertake a major rewrite.

    Much of the problem is that the whole approach of looking for good matches of unbounded length in very large files makes the problem explode. I think the largest time benefits can be achieved by smartly building trees (or other structures) of limited scope and gradually expanding the scope until the entire input is included. I have done this to some degree with TreeParagraphs and TreeWords, but there are more ways this can be done. Based on what I know from pre-release experimentation, initially limiting the spatial scope and then gradually expanding it can speed up compression significantly without hurting compression size much.
    Last edited by Kennon Conrad; 24th May 2014 at 23:26.

  3. #123
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Look at bullet #2 above the table, which applies to column #2 the way I read it. I believe they are only talking about insertion of a child on a given character (its parent). Breaking the data up the way they have allows the reader to see the different reasons why building basic suffix trees and building basic suffix arrays are each O(sigma), since both lookup and insertion are needed to build the structure.
    I've said my piece. Take what you want from it.

    TreeCompress has an equal number of lookups and insertions. When the tree build is done, the scoring algorithm only traverses the tree one time and is then done with the tree. Clearly an algorithm that had more lookups than insertions would tilt the best approach much more toward any approach besides a suffix tree, but that is not the case here. Also, it is very easy to pass log ratios and counts through an uncompressed tree when scoring the nodes. I don't doubt that a more elegant approach could speed up TreeCompress, but am also not convinced enough to undertake a major rewrite.
    You typically don't insert anything into a suffix array. A suffix tree builds itself, but a suffix array is the product of a separate sort routine and it's essentially static.

    Much of the problem is that the whole approach of looking for good matches of unbounded length in very large files makes the problem explode. I think the largest time benefits can be achieved by smartly building trees (or other structures) of limited scope and gradually expanding the scope until the entire input is included. I have done this to some degree with TreeParagraphs and TreeWords, but there are more ways this can be done. Based on what I know from pre-release experimentation, initially limiting the spatial scope and then gradually expanding it can speed up compression significantly without hurting compression size much.
    Looking for unbounded matches need not make the problem explode. That's the counter-intuitive result from suffix trees and suffix sorting. Most of the blow up is redundant, and carefully organized algorithms can be linear or essentially linear. That's what's fascinating.

  4. #124
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    So you do one scan of the tree? How many replacements per scan?

    What I'd like to do is isolate the optimization algorithm from the mechanics of the tree. If there's a way to rewrite this in perfectly clear logic that's not obscured by the data structures involved, that would be a fantastic foundation for future work, regardless of whether it's any faster. The greatest aspect of suffix arrays is their pointer-free simplicity.
    Last edited by nburns; 25th May 2014 at 21:13.

  5. #125
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    So you do one scan of the tree? How many replacements per scan?

    What I'd like to do is isolate the optimization algorithm from the mechanics of the tree. If there's a way to rewrite this in perfectly clear logic that's not obscured by the data structures involved, that would be a fantastic foundation for future work, regardless of whether it's any faster. The greatest aspect of suffix arrays is their pointer-free simplicity.
    Yes, just one scan of the tree per build cycle. The code will identify up to 30K potential replacements per major cycle (subtrees being the minor cycles). Because of overlaps, initially, only about 3K or 4K replacements are actually made. The number generally grows as the tree reduces, typically being within a few hundred of 30K by the time half the iterations are done. If I had infinite RAM, I would build the entire tree once and just modify the tree as the new symbols are created. It could be done on my computer by using the hard drive, but that has its own set of problems. I have seriously thought about getting a new computer with a SSD, but haven't taken the plunge yet.

    The candidates are kept in a sorted list as they are identified. The code tries to check for overlaps while identifying candidates, but this is only partially successful. It would be much easier and better if the whole tree was available (I have had working versions that do this on smaller files). Since the code fails to identify some overlaps, at the end, the program builds a prefix tree to find overlapping candidates in one pass over the data (relatively minor details omitted) and "crosses out" lower valued overlapping candidates from the candidate list. After that, another enhanced prefix tree of the candidates is built and the remaining candidates are substituted in one pass over the data.

    So, the basic requirements are as follows:
    1. Identify all recurring suffixes that have at least one variation in the next letter (or a space in TreeCompress).
    2. Score those suffixes using a function of the number of symbols in the file, the order 0 entropy of the suffix, the number of symbols in the suffix, and the number of times the suffix occurs in the file.*
    3/4. Maintain a top N list of the best scoring suffixes.
    4/3. Eliminate overlapping suffixes.
    5. Insert remaining candidates (new symbols) in the data.
    6. Add the definitions for the new symbols to the end of the data.

    *: Distance between matches (RMS?) would be helpful too, but is not implemented in TreeCompress
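    As a rough illustration of step 3/4 only (the structure, the score field, and the eviction policy below are placeholders, not the actual TreeCompress code), keeping the top N candidates amounts to a score-ordered insertion that drops the weakest entry once the list is full:
    Code:
    #define MAX_CANDIDATES 30000     /* "up to 30K potential replacements per major cycle" */

    struct candidate {
        unsigned int start;          /* position of the suffix in the data */
        unsigned int length;         /* number of symbols in the suffix */
        double score;                /* value from the scoring function in step 2 */
    };

    static struct candidate top_list[MAX_CANDIDATES];
    static int top_count = 0;

    /* Insert a scored candidate, keeping top_list sorted by descending score. */
    static void insert_candidate(struct candidate c)
    {
        if (top_count == MAX_CANDIDATES) {
            if (c.score <= top_list[top_count - 1].score)
                return;              /* worse than everything already kept */
            top_count--;             /* evict the current weakest entry */
        }
        int i = top_count;
        while (i > 0 && top_list[i - 1].score < c.score) {
            top_list[i] = top_list[i - 1];   /* shift weaker entries down */
            i--;
        }
        top_list[i] = c;
        top_count++;
    }
    The overlap pass (step 4/3) then crosses lower-valued entries out of this list, as described above.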

    TreeParagraphs and TreeWords are similar, but they just restrict themselves to suffixes that start with and end just prior to certain letters.

    This is probably obvious, but the output still needs a smart encoder to realize good compression ratios. TreeBitEncode is better than a straight Huffman coder, but far from optimum.

    If you have time and energy to rewrite this in human-readable code, let alone perfectly readable form, that's great. I don't know that I would ever get there with my limited time, and I get more excited thinking about improvements to the underlying concepts than about the implementation.
    Last edited by Kennon Conrad; 25th May 2014 at 23:14.

  6. #126
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.5 Download

    TreeBitEncode and TreeDecode have been changed. Symbols that occur 11 - 20 times are now treated similarly to those that occur 2 - 9 times. They have unique definition codes, and each occurrence count has a buffer that tracks the most recent 64 symbols with that number of instances and provides shorter code lengths. Additionally, bit codes of length 11 - 22 now have a one-deep buffer for the last symbol of that code length.
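    For anyone curious, a rough sketch of the "most recent 64 symbols" idea (names and details are illustrative, not the actual TreeBitEncode structures): each occurrence count gets a small buffer, and a symbol found in its buffer can be coded with a short 6-bit index instead of its full code.
    Code:
    #define RECENT_SIZE 64

    /* One buffer per occurrence count; holds the 64 most recently seen
       symbols that have that number of instances. */
    struct recent_buffer {
        unsigned int symbol[RECENT_SIZE];
        int count;                   /* entries filled so far */
        int head;                    /* next slot to overwrite once full */
    };

    /* Return the slot of sym if it is in the buffer, else -1.  A hit can be
       coded as a 6-bit index; a miss falls back to the regular code. */
    static int recent_find(const struct recent_buffer *rb, unsigned int sym)
    {
        for (int i = 0; i < rb->count; i++)
            if (rb->symbol[i] == sym)
                return i;
        return -1;
    }

    static void recent_push(struct recent_buffer *rb, unsigned int sym)
    {
        if (rb->count < RECENT_SIZE) {
            rb->symbol[rb->count++] = sym;
        } else {
            rb->symbol[rb->head] = sym;              /* overwrite the oldest entry */
            rb->head = (rb->head + 1) % RECENT_SIZE;
        }
    }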

    Results on A8-5500 CPU:
    Prior v0.4 release:
    enwik9: 184,312,072 bytes, 68,866 seconds to compress, 22.2 seconds to decompress
    enwik8: 23,178,500 bytes, 3,546 seconds to compress, 2.7 seconds to decompress

    v0.5 release
    enwik9: 181,375,076 bytes, 68,869 seconds to compress, 22.7 seconds to decompress
    enwik8: 23,084,884 bytes, 3,546 seconds to compress, 2.8 seconds to decompress

    compressed TreeDecode.c file size: 8,271 bytes

    This release will move Tree from #45 to #39 on the LTCB site.
    Attached Files

  7. #127
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Do you plan to handle other character sets or binary data?

  8. #128
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    Do you plan to handle other character sets or binary data?
    No, so far Tree only works on UTF-8 compliant text. Removing the UTF-8 compliance requirement is not too hard and I will try to do that. That will also allow the program to run on any file, but the results for binary files would probably not be very good because the recurring sequences will not usually be byte aligned. Adding true binary file support would make compression a lot slower.

  9. #129
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    LZMA uses byte-aligned matches and does reasonably well on executables. BWT coders are byte-based also and perform surprisingly well on bitmaps (i.e. uncompressed images). So binary doesn't mean you have to look for non-byte-aligned matches to achieve good compression. In fact, there have been attempts at unaligned binary modelling, and it turned out worse than byte-aligned. That's what I remember about crook, at least.

    (crook is benchmarked on LTCB)

  10. #130
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    LZMA uses byte-aligned matches and does reasonably well on executables. BWT coders are byte-based also and perform surprisingly well on bitmaps (i.e. uncompressed images). So binary doesn't mean you have to look for non-byte-aligned matches to achieve good compression. In fact, there have been attempts at unaligned binary modelling, and it turned out worse than byte-aligned. That's what I remember about crook, at least.

    (crook is benchmarked on LTCB)
    That makes a lot of sense. I suppose many or most binary files are byte (or word) aligned even though they are not text.

  11. #131
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I think the only data you will see that isn't byte aligned is compressed data. The trend is toward more file formats being text-based, e.g. XML. Many of those formats incorporate their own compression.

    MS Office file formats are XML files inside a zip archive.
    Last edited by nburns; 27th May 2014 at 00:22.

  12. #132
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    The tokens are usually byte-aligned, but they're not always single-byte. So while it's usually pointless to go below byte granularity, there can be some gain from using variable-length tokens. E.g. in the case of x86 executables, one could split the instruction stream into instructions and split each instruction into control bytes and offsets separately. Also don't forget the ultra-old trick of converting relative addresses (which are pretty unique) to absolute addresses (which tend to repeat) - it's the E8/E9 trick.
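    For reference, the core of that transform is tiny; a simplified forward pass (real filters add range checks on the computed target and apply the matching inverse on decode) looks like this:
    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* Convert the 32-bit relative displacements after CALL (0xE8) and JMP (0xE9)
       opcodes into absolute targets, so repeated calls to the same address
       become repeated byte patterns that a match finder can exploit. */
    static void e8e9_forward(uint8_t *buf, size_t n)
    {
        for (size_t i = 0; i + 5 <= n; i++) {
            if (buf[i] == 0xE8 || buf[i] == 0xE9) {
                uint32_t rel = (uint32_t)buf[i + 1]
                             | (uint32_t)buf[i + 2] << 8
                             | (uint32_t)buf[i + 3] << 16
                             | (uint32_t)buf[i + 4] << 24;
                uint32_t target = rel + (uint32_t)(i + 5);   /* next-ip + rel */
                buf[i + 1] = (uint8_t)target;
                buf[i + 2] = (uint8_t)(target >> 8);
                buf[i + 3] = (uint8_t)(target >> 16);
                buf[i + 4] = (uint8_t)(target >> 24);
                i += 4;              /* skip the displacement we just rewrote */
            }
        }
    }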

  13. #133
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    LZMA uses byte-aligned matches and does reasonably well on executables. BWT coders are byte-based also and perform surprisingly well on bitmaps (i.e. uncompressed images). So binary doesn't mean you have to look for non-byte-aligned matches to achieve good compression. In fact, there have been attempts at unaligned binary modelling, and it turned out worse than byte-aligned. That's what I remember about crook, at least.

    (crook is benchmarked on LTCB)
    I tried a bit-wise BWT and it was worse (slightly) than byte-wise. The simple explanation is that data is byte-wise, and by not building this into your compressor, you are forcing it to discover this on its own. Information that it has to discover is information it has to store. A compressor that breaks data into discrete factors, like Tree, ought to be able to discover bytes and deal with them fairly efficiently, but a lot of algorithms will probably make repeated bad choices.

  14. #134
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    The tokens are usually byte-aligned, but they're not always single-byte. So while it's usually pointless to go below byte granularity, there can be some gain from using variable-length tokens. E.g. in the case of x86 executables, one could split the instruction stream into instructions and split each instruction into control bytes and offsets separately. Also don't forget the ultra-old trick of converting relative addresses (which are pretty unique) to absolute addresses (which tend to repeat) - it's the E8/E9 trick.
    Bit fields are pretty common, but I'm not sure you could say that they are not byte-aligned, because they always occur at predictable offsets from byte boundaries.

    Data that isn't byte aligned is going to require pretty slow and complex code to make it byte aligned so the CPU can deal with it. So I don't think it's much of a leap to suppose that there are virtually no formats like that.
    Last edited by nburns; 27th May 2014 at 00:56.

  15. #135
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I meant that just treating executables as a stream of bytes won't be as efficient as trying to extract the different parts of instructions and modelling them separately. And those parts of instructions don't have to be byte-aligned or fit in a single byte. But for LZ-like coders, which I suppose the Tree coder belongs to (as it codes repeated sequences as short codes), it shouldn't matter much, since sub-instruction modelling doesn't fit well with such coding. Tokenizing at instruction boundaries should still be an idea worth exploring, though.

  16. #136
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I meant that just treating executables as a stream of bytes won't be as efficient as trying to extract the different parts of instructions and modelling them separately. And those parts of instructions don't have to be byte-aligned or fit in a single byte.
    True, but I see that as a special case of the larger problem of figuring out what things mean. This applies at every level, not just bits.

    But for LZ-like coders, which I suppose the Tree coder belongs to (as it codes repeated sequences as short codes), it shouldn't matter much, since sub-instruction modelling doesn't fit well with such coding. Tokenizing at instruction boundaries should still be an idea worth exploring, though.
    It's only like LZ in a broad sense. LZ allows references to arbitrary segments of text. This pre-defines what references are possible and what they mean, then writes the text from these references.

  17. #137
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    It's only like LZ in a broad sense. LZ allows references to arbitrary segments of text. This pre-defines what references are possible and what they mean, then writes the text from these references.
    I apologize in advance for possible terminology errors in the following comments.

    Yes, all symbols are pre-determined for the bit encoder, but they are not sent in advance of where they are first used. At the point where the final output is encoded, Tree has essentially become an unbounded look-ahead explicit symbol creator. New symbols are sent explicitly by sending a special symbol followed by a variable number of bits describing the priority or lifespan of the symbol. Due to the extensive number of symbols created, there would be little or no benefit to context matching the next symbol from the previous symbol. However, there are lots of opportunities for improvement by predicting the starting sub-symbols of the next symbol based on the ending sub-symbols of the previous symbol and doing normal predictive matching on new symbol sub-symbols. I'm not sure how well this will work on sub-instructions, but it seems like there is some potential if the sub-instructions look similar.

  18. #138
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    You typically don't insert anything into a suffix array. A suffix tree builds itself, but a suffix array is the product of a separate sort routine and it's essentially static.
    I think the point is that there is a sort with suffix arrays that doesn't exist with suffix trees. As far as I know, every time a new suffix is added, a sort operation is needed. Without enhancements to the suffix array, that could be more than 8 million moves for TreeCompress when a new suffix is added.

  19. #139
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I apologize in advance for possible terminology errors in the following comments.

    Yes, all symbols are pre-determined for the bit encoder, but they are not sent in advance of where they are first used. At the point where the final output is encoded, Tree has essentially become an unbounded look-ahead explicit symbol creator. New symbols are sent explicitly by sending a special symbol followed by a variable number of bits describing the priority or lifespan of the symbol.
    What is an unbounded look-ahead explicit symbol creator? You had mentioned that the new symbols are placed at the end, but I didn't think that made a difference. Are you doing something else that's clever around symbol creation?

    Due to the extensive number of symbols created, there would be little or no benefit to context matching the next symbol from the previous symbol. However, there are lots of opportunities for improvement by predicting the starting sub-symbols of the next symbol based on the ending sub-symbols of the previous symbol and doing normal predictive matching on new symbol sub-symbols. I'm not sure how well this will work on sub-instructions, but it seems like there is some potential if the sub-instructions look similar.
    The more prediction you do, the slower it will make the decoder. Fast decoding is what distinguishes this compressor. Actively predicting symbols will turn it into PPM.

  20. #140
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think the point is that there is a sort with suffix arrays that doesn't exist with suffix trees. As far as I know, every time a new suffix is added, a sort operation is needed. Without enhancements to the suffix array, that could be more than 8 million moves for TreeCompress when a new suffix is added.
    Building a suffix tree is equivalent to sorting. If the nodes are sorted alphabetically, you can read off a suffix array by traversing the tree. Asymptotically, they're the same. Trees are large, complex, pointer-heavy data structures, though, and that's typically a performance killer.

    You are building only partial trees, though, and so I don't know how you determine when to stop building; a SA may give you somewhat less control and flexibility. But it's pretty hard to imagine that that would be enough to tip the scales in favor of the tree, especially given that the tree you are building is likely even slower than a suffix tree.

  21. #141
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    What is an unbounded look-ahead explicit symbol creator? You had mentioned that the new symbols are placed at the end, but I didn't think that made a difference. Are you doing something else that's clever around symbol creation?
    I made the phrase up in an attempt to describe Tree in just a few words. I would say an unbounded look-ahead explicit symbol creator is a compressor that explicitly creates symbols of unbounded length at the first location they are needed. Yes, the new symbols are placed at the end of the data when they are created by TreeCompress. TreeBitEncode embeds each dictionary symbol into the data stream at the first place the symbol is used, using the special symbol I mentioned, along with the usage information. So, in effect Tree explicitly places the tokens that were created with the (almost) unbounded tree at the first location they are used (look-ahead). This allows for more efficient encoding since the creation also serves as the first reference to the new symbol. The extra bits describing each new symbol are used to determine the number of symbols in the new symbol (which can also be new symbols), to initialize the symbol probability, to create a relatively diverse set of small back count buffers to take advantage of spatial relationships with only a small run-time penalty, and to remove many symbols from the library when they will not be used again. While Tree's total number of dictionary entries when compressing enwik9 is a little more than 8.7 million entries, the most that are in the dictionary at any point in the file is slightly less than 4.6 million entries. I don't think what I am doing is particularly clever. I just think people have perhaps not investigated the potential benefits of explicit symbol creation adequately because of the perceived cost of hints.
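    To show just the "definition doubles as first use" part in isolation (a toy model, not TreeDecode; the escape code, the stream layout, and the limits are made up, and the real extra bits also carry the usage information described above):
    Code:
    #include <stdio.h>

    /* Toy model: symbols 0..255 are literal bytes; higher ids are dictionary
       entries defined inline at their first use as (ESCAPE, count, parts...). */
    enum { ESCAPE = 256, MAX_DICT = 1024, MAX_PARTS = 16 };

    static int dict[MAX_DICT][MAX_PARTS];
    static int dict_len[MAX_DICT];
    static int next_id = 257;

    static void emit(int sym)
    {
        if (sym < 256) { putchar(sym); return; }
        for (int i = 0; i < dict_len[sym - 257]; i++)
            emit(dict[sym - 257][i]);        /* recurse into sub-symbols */
    }

    static void decode(const int *stream, int n)
    {
        for (int i = 0; i < n; i++) {
            if (stream[i] == ESCAPE) {       /* new symbol: define and use it */
                int parts = stream[++i];
                int id = next_id++;
                for (int p = 0; p < parts; p++)
                    dict[id - 257][p] = stream[++i];
                dict_len[id - 257] = parts;
                emit(id);                    /* the definition is also its first use */
            } else {
                emit(stream[i]);             /* literal or previously defined symbol */
            }
        }
    }

    int main(void)
    {
        /* "abab" is defined at its first use, then referenced again later. */
        const int stream[] = { ESCAPE, 4, 'a', 'b', 'a', 'b', 'X', 257 };
        decode(stream, 8);                   /* prints "ababXabab" */
        putchar('\n');
        return 0;
    }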

    Quote Originally Posted by nburns View Post
    The more prediction you do, the slower it will make the decoder. Fast decoding is what distinguishes this compressor. Actively predicting symbols will turn it into PPM.
    Yes, that is true. My ultimate goal has been to get the best compression possible. I just haven't made it past the first stages yet. Arithmetic coding and some type of context matching could significantly improve Tree's compression ratios even though they are already not bad. As long as I have a toolset, I would place these features on the back end and keep them optional. I am particularly interested in seeing if I can find fast ways to only select the cream of the crop of context matching so that Tree can create additional points on the Pareto Frontier for decompression speed. While the form of context matching is kind of like PPM, I think it would also be kind of backwards in that it would be predicting existing symbols using lower order symbols that are used to compose the current symbol and the dictionary symbols rather than from multiple previous symbols. I am not really sure how different this would be or even how well it will work.
    Last edited by Kennon Conrad; 27th May 2014 at 10:10.

  22. #142
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Building a suffix tree is equivalent to sorting. If the nodes are sorted alphabetically, you can read off a suffix array by traversing the tree. Asymptotically, they're the same. Trees are large, complex, pointer-heavy data structures, though, and that's typically a performance killer.

    You are building only partial trees, though, and so I don't know how you determine when to stop building; a SA may give you somewhat less control and flexibility. But it's pretty hard to imagine that that would be enough to tip the scales in favor of the tree, especially given that the tree you are building is likely even slower than a suffix tree.
    I guess we'll just have to agree to disagree. I think suffix trees are simple and speed is more important in the current Tree implementation than data structure size. I found building the prefix tree for token substitution to be much harder.

    I don't see much difference between having lots of pointers and lots of array indexes. As long as I'm not updating a bunch of data on each suffix addition, I am happy. With suffix trees, I only need to change one existing pointer and initialize a new node. Unless I am missing something with suffix arrays, I would have to not only modify the appropriate suffix array but also sort the arrays on every insertion, potentially needing to update many array indexes each time a new suffix is added.

    I stop adding nodes for a given input symbol when a leaf is reached. I already provided data that showed there are very few compressible nodes for the vast majority of compression cycles. It seems to me that using compressed trees would hurt performance for the majority of the process since there would be just as many node creations but there would be billions of extra branch splits. If you still feel otherwise, you are welcome to try the modification. It would be educational for at least one of us and I don't necessarily mean for you. I think TreeCompress has an average node creation time of about 0.4 usec per node when building trees while compressing enwik9 on my relatively slow computer.

    I would be shocked if you find a generic suffix tree implementation as fast as TreeCompress. The sibling lists of length up to (dictionary size) will ruin performance relative to Tree's quaternary sibling trees of length up to (log2(dictionary size))/2.
    Last edited by Kennon Conrad; 27th May 2014 at 11:07.

  23. #143
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Maybe a suffix tree is faster than a SA, because Diqs uses a SA, doesn't build a dictionary, and only traverses the enhanced SA one time, but is still slower than Tree.

  24. #144
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by xezz View Post
    Maybe a suffix tree is faster than a SA, because Diqs uses a SA, doesn't build a dictionary, and only traverses the enhanced SA one time, but is still slower than Tree.
    It's been known for a long time that you can use a ST to get a SA. The only reason to write a specialized suffix sort is to be faster.

  25. #145
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I would be shocked if you find a generic suffix tree implementation as fast as TreeCompress. The sibling lists of length up to (dictionary size) will ruin performance relative to Tree's quaternary sibling trees of length up to (log2(dictionary size))/2.
    It would be shocking if you wrote the fastest suffix index in the world on your first try, when people have been working on this for years.

  26. #146
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    It would be shocking if you wrote the fastest suffix index in the world on your first try, when people have been working on this for years.
    Given that I never made such a claim and that writing the fastest suffix index in the world was not one of my goals, I am not sure what your point is.

    Your claim that the tree I am building is slower than a suffix tree was not accurate at all and I didn't want people to be confused by the claim. Limiting the sibling traversals to 11 per character vs. 1000's+ provides massive speed benefits when comparing Tree's structures to a standard suffix tree.

  27. #147
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Given that I never made such a claim and that writing the fastest suffix index in the world was not one of my goals, I am not sure what your point is.

    Your claim that the tree I am building is slower than a suffix tree was not accurate at all and I didn't want people to be confused by the claim. Limiting the sibling traversals to 11 per character vs. 1000's+ provides massive speed benefits when comparing Tree's structures to a standard suffix tree.
    Maybe I'm not understanding you. Data structures aside, what kinds of overlaps do you have to filter out?

  28. #148
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I don't want to beat a dead horse, but I never adequately responded to this. These are meant to be helpful pointers only.

    Quote Originally Posted by Kennon Conrad View Post
    I guess we'll just have to agree to disagree. I think suffix trees are simple and speed is more important in the current Tree implementation than data structure size. I found building the prefix tree for token substitution to be much harder.
    Size and speed are linked. Accessing memory is expensive. L1 cache is faster than L2 cache, which is faster than DRAM. There are tools you can use to profile your code if you want to see how efficiently it uses the caches.

    I don't see much difference between having lots of pointers and lots of array indexes.
    There's not much difference in 32-bit code. If you go to 64-bit, pointers will become twice as large and array indexes can be smaller.

    As long as I'm not updating a bunch of data on each suffix addition, I am happy. With suffix trees, I only need to change one existing pointer and initialize a new node. Unless I am missing something with suffix arrays, I would have to not only modify the appropriate suffix array but also sort the arrays on every insertion, potentially needing to update many array indexes each time a new suffix is added.
    I'm not sure why you anticipate having to do insertions in the suffix array. If you need to do insertions, a suffix array would probably not be the way to go.

    I stop adding nodes for a given input symbol when a leaf is reached. I already provided data that showed there are very few compressible nodes for the vast majority of compression cycles. It seems to me that using compressed trees would hurt performance for the majority of the process since there would be just as many node creations but there would be billions of extra branch splits. If you still feel otherwise, you are welcome to try the modification. It would be educational for at least one of us and I don't necessarily mean for you. I think TreeCompress has an average node creation time of about 0.4 usec per node when building trees while compressing enwik9 on my relatively slow computer.

    I would be shocked if you find a generic suffix tree implementation as fast as TreeCompress. The sibling lists of length up to (dictionary size) will ruin performance relative to Tree's quaternary sibling trees of length up to (log2(dictionary size))/2.
    Here's the tree node structure from v0.2:
    Code:
    static struct string_element {
      unsigned int instances;                  /* occurrence count for this node's string */
      unsigned char *last_match_ptr;           /* where this string was last matched in the data */
      struct string_element *even_sibling_ptr; /* links of the quaternary sibling tree */
      struct string_element *odd_sibling_ptr;
      struct string_element *child_ptr;        /* first child node (next symbol) */
    } *string_elements;
    That would be 5*4 = 20 bytes each. A suffix array uses 4 bytes per suffix. You'd probably need an auxiliary array, which is another 4 bytes per suffix.
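    To put numbers on that, here's a quick sizeof comparison (the index-based variant is only an illustration of the alternative, not something Tree currently uses):
    Code:
    #include <stdio.h>
    #include <stdint.h>

    /* Pointer-based node, as in Tree v0.2: 20 bytes with 32-bit pointers,
       but 40 bytes (after padding) once the pointers become 64-bit. */
    struct node_ptr {
        unsigned int instances;
        unsigned char *last_match_ptr;
        struct node_ptr *even_sibling_ptr;
        struct node_ptr *odd_sibling_ptr;
        struct node_ptr *child_ptr;
    };

    /* Index-based node: 32-bit indexes into one big array stay 4 bytes each
       regardless of the platform's pointer size. */
    struct node_idx {
        uint32_t instances;
        uint32_t last_match_pos;
        uint32_t even_sibling;
        uint32_t odd_sibling;
        uint32_t child;
    };

    int main(void)
    {
        printf("pointer node: %zu bytes, index node: %zu bytes\n",
               sizeof(struct node_ptr), sizeof(struct node_idx));
        return 0;
    }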

  29. #149
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Maybe I'm not understanding you. Data structures aside, what kinds of overlaps do you have to filter out?
    Overlaps don't absolutely have to be filtered out, but it helps compression. An example (probably not real) would be if the program identified "parent" and "rental" as candidates for tokenization. There may be instances of strings that would cause one or both of these strings to not be substitutable. For instance if the input had "parental" in the text, you can't substitute a new symbol for both "parent" and "rental". Tree picks the higher scoring one for tokenization. If the other one still has a good score on the next cycle, it is likely to be tokenized then.
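    To illustrate what I mean by overlap in the simplest terms (this is not how the prefix-tree pass actually does it; the names and the sorted-position assumption are just for the sketch), two candidates collide whenever any of their occurrence intervals intersect, e.g. "parent" at position p and "rental" at position p+2 inside "parental":
    Code:
    #include <stddef.h>

    /* Occurrence positions are assumed sorted ascending; a merge-style scan
       finds a collision in O(na + nb) interval comparisons. */
    static int candidates_overlap(const size_t *a_pos, size_t na, size_t a_len,
                                  const size_t *b_pos, size_t nb, size_t b_len)
    {
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            if (a_pos[i] + a_len <= b_pos[j])
                i++;                 /* a's occurrence ends before b's starts */
            else if (b_pos[j] + b_len <= a_pos[i])
                j++;                 /* b's occurrence ends before a's starts */
            else
                return 1;            /* the intervals intersect: overlap found */
        }
        return 0;
    }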

  30. #150
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Overlaps don't absolutely have to be filtered out, but it helps compression. An example (probably not real) would be if the program identified "parent" and "rental" as candidates for tokenization. There may be instances of strings that would cause one or both of these strings to not be substitutable. For instance if the input had "parental" in the text, you can't substitute a new symbol for both "parent" and "rental". Tree picks the higher scoring one for tokenization. If the other one still has a good score on the next cycle, it is likely to be tokenized then.
    Would it treat parent and rental as overlapping even if parental didn't appear anywhere in the text?


