1. Originally Posted by Paul W.
Kennon,

Thanks very much for that; it's very interesting. I'm still trying to figure out exactly how to interpret it, though.

I'm guessing that's a slightly condensed version of stats for each bucket in your bucketing scheme, so that where you actually have buckets for exact lengths 2-19, you're summarizing those for 2, 3, 4-9, and 10-19. I'm also guessing that 20, 30, etc. mean 20-29, 30-39, etc., and likewise 100, 200, etc. mean 100-199, 200-299. Is that about right?
No, it's just a sampling of the 64 symbol mtf statistics for the specific number of times the string occurs. The full file for 2 - 1000 occurrences is attached. I thought about putting them in bins, but the code changes would have been harder.

Originally Posted by Paul W.
It's hard to read too much into the data, but I think the most important thing is that the average queue position typically gets longer for frequent symbols even though the queue size is going up in relation to the data set size.

I'm not sure what that really implies. You also have other trends:

(1) decreasing misses per bucket as the frequency goes up (after the first two)
(2) increasing specificity of the buckets---at each step up, a given string is competing with many fewer other strings for low MTF positions and short codes. (Hence the tiny q% numbers at the beginning, increasing as the buckets get more specific.)
(3) increasing hit ratio for a 64-MTF-item queue for higher-frequency buckets

The increasing hit ratio may largely be a matter of lowered competition because the successive buckets cover much smaller sets of competing items.
I think a larger value for the average mtf queue hit position makes the mtf queue less effective, because it usually indicates that the dispersion (entropy) of the positions is higher, meaning the average code length needed to indicate the position in the queue will be longer.

I also see the other trends. I think much of (1) or (3) occurs because the queue size is getting larger relative to the data set size and some of (1) occurs because more of the tail of the MTF benefits are captured.

For (2), the number of dictionary entries for each code length bin typically increases by a factor of 2 or a little more for each successive bin until the longest bin is reached. Since the probability step size is also increasing by a factor of 2, the number of dictionary entries tends to approximately quadruple as the number of instances reduces by a factor of 2. So, yes, the bins get a lot more specific as the frequency goes up.

Originally Posted by Paul W.
A more regular bucketing scheme (roughly logarithmic) might give easier-to-interpret numbers. A common one that lends itself to fast implementations is to use powers of 2 interleaved with powers of 2 times 3, e.g., 1, (1.5,) 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96... That's close to a square-root-of-two ratio in bucket ranges. (I and others have used that for "size classes" in memory allocators---you round the size up to the nearest number in that sequence, so you can allocate blocks in a reasonable and reasonably-distributed number of sizes. You can skip 1.5 if it doesn't make sense for your application, hence the parentheses around that.)
Agreed. The 1 deep MTF queues were mostly just an afterthought from implementing the 64 deep queues. I didn't spend much time on it because I felt I wanted smaller code length increments (similar to your suggestion). In order to take full advantage of adding more overhead to symbol creation, the code needs to also be able to send non-integer length codes, for symbols that fall out of the queue. For that reason, I see AC or ANS as a higher priority change. When that is done, I would like to improve the MTF scheme. Even though the current implementation helps compression size significantly, there are additional gains to be had. Making MTF more comprehensive will help and adjusting MTF queue probabilities dynamically based on the hit history can also help.

Originally Posted by Paul W.
I guess that for bucketed MTF decoding to work at all, the encoder must be transmitting at least the bucket number of a repeat reference, and increasing the specificity of the bucketing incurs a direct coding cost.
Yes, each mtf queue has an overall code space that corresponds to an integer code length. The code length is determined from the specificity. The mtf position codes are fixed and are all integer length, which is another area that could be improved. The distribution is less clustered near the front for higher frequencies. Tree uses only 1 extra bit for the first position, 3 extra bits for positions 2 and 3, etc. It would be nice to be able to make the first position 0.9 bits, 1.2 bits, or whatever is appropriate.

Originally Posted by Paul W.
(BTW, maybe none of this is new to you, and you've clearly put a lot of careful thought into what you're doing, but it's a good sanity check for me---I especially want to know if I'm not making sense.)
You are doing fine. I have put a lot of thought into Tree, but the development has been very haphazard. First, I am not a very good programmer. Second, I had virtually no experience with compression or string theory when I started. I only knew what Huffman codes and LZ77 were and wanted to test a thought I had from a computer class 30 years ago (bucket list). I purposely did not educate myself on compression initially because I did not want my thoughts to be biased by current community knowledge. I didn't try to learn about compression until I had my basic thought implemented and found the performance to be interesting, at which point I contacted Dr. Mahoney for advice on what to do. I have learned a lot since then and see a lot of shortcomings in what I did, but have also gotten a lot of useful suggestions from people with far more knowledge about compression than me. When Tree first came out, I suggested it was like a newborn baby. Now I think it is about equivalent to a six year old. It is smart enough to do some things reasonably well, but it could still use a lot of improvements. Since this is just a spare time hobby for me, my ability to quickly implement better code is very limited.

2. ## Thanks:

Paul W. (21st October 2014)

3. Kennon,

Thanks. Now I see what you mean.

If you just look at the entries for relatively high frequencies where there are 64 or fewer total items competing within an MTF list (so that 100 percent of repeats are MTF list hits), the average MTF position for hits is pretty high relative to the total number of competing items---you're having to look through about a third to a half of the MTF list on average before finding the hit. Evidently there isn't a whole lot more simple itemwise (MTF) locality to exploit there.

Nicely done.

4. Originally Posted by Kennon Conrad
I have more experience writing and maintaining 8051 assembly code than I would like to admit. What I really don't miss is 12 clock cycles per instruction. I am also used to designing with lower end processors that can set the data width wider than the address width and didn't realize x64 doesn't have any data models for that. That's too bad because it seems like an I32L64P32 model would work best for applications that don't need more than 4GB of memory.
The primary selling point for 64-bit has always been the ability to address more memory, I think. Access to more memory would offset a constant factor decrease in execution speed for a lot of applications, so it's possible that that's a trade-off that's built in.

5. Originally Posted by nburns
The primary selling point for 64-bit has always been the ability to address more memory, I think. Access to more memory would offset a constant factor decrease in execution speed for a lot of applications, so it's possible that that's a trade-off that's built in.
That makes sense because some applications can't be done with a 32 bit address space. But I imagine there are plenty of applications that could benefit from having a mode that supports 32 bit addresses and 64 bit data.

6. Originally Posted by Kennon Conrad
That makes sense because some applications can't be done with a 32 bit address space. But I imagine there are plenty of applications that could benefit from having a mode that supports 32 bit addresses and 64 bit data.
If that's the case, they probably would have added special instructions for 64-bit-wide arithmetic to the 32-bit instruction set (likely before the full 64-bit switch-over). In fact, I think you can find such instructions sprinkled through the 32-bit instruction set.

7. ## Thanks:

Kennon Conrad (24th October 2014)

8. Originally Posted by nburns
If that's the case, they probably would have added special instructions for 64-bit-wide arithmetic to the 32-bit instruction set (likely before the full 64-bit switch-over). In fact, I think you can find such instructions sprinkled through the 32-bit instruction set.
I didn't find anything relevant at first, but then found out about MMX, SSE, etc. I only wanted 64 bit shifts and MMX seems to have that, although I'm not sure it's supported on all processors. This is worth a longer look after I release the next version of TreeCompress64, which will feature multi-threaded longest common prefix tree building for significantly faster compression.

Do you know if people ever build a tree of suffix arrays? I was thinking that building a prefix tree, but limiting the length of the tree to a small number and then attaching a suffix array to each tree leaf could reduce the sorting a lot.

9. Kennon,

I'm no expert on any of this, and hopefully somebody more knowledgeable will correct me if I'm steering you wrong...

MMX was introduced in 1997 and I think all Intel and AMD processors since shortly thereafter support it, which I'd guess is almost any x86 anybody is still using in 2014. (Over a decade later.)

My impression is that any relatively recent Intel or AMD x86 chip (within about the last 8 or 10 years) will support not only MMX but SSE and SSE2 fairly efficiently.

Even older Intel chips may only support MMX, not SSE, and there's a weirdness about certain old Intel chips (8-10 years old?) that supported MMX more efficiently than SSE (some having twice the throughput for MMX than SSE)... but as I understand it, anything since roughly the Core 2 Duo should do SSE better than MMX, or at least as well. (Unfortunately, Intel's naming of chips is pretty opaque and bogus, e.g., some early things that they called Core were really marketing terms for things built on the old Pentium 4 microarchitecture, not the newer Core microarchitectures, so it's a big mess.)

Basically, I think the bottom line is that if you target SSE or SSE2, you'll do fine except on computers nobody should be using anymore. (Or if they use them, they should expect their mileage to vary by a factor of two.) Many scientific codes insist on at least SSE2 or later.

There's a weirdness with MMX that it conflicts with floating point, using the same register file (or at least the same register numbers) as the old scalar floating point ops---your code needs to be compiled to use either floating point or MMX, not both, or there's a weird cost in saving registers and switching modes when you try to mix FP and MMX, so don't. My guess is that you don't rely on FP, so that shouldn't be a big issue if you want to target MMX.

The upshot so far as I understand it is that if all you need is 64 bit ints (and shifts of them), and don't do many divides, you should be able to get them with either SSE or just MMX support, and it should work fine in 32-bit mode just as well as 64-bit mode. (If you know what compiler target flags to use.)

As I understand it, the bigger performance win of 64-bit mode when you don't actually need to address lots of memory is just that it gives you more general-purpose registers to work with, and saves on register spills. I don't know if that's a significant issue for your code. It may be a win for you to compile 32-bit code with MMX or SSE or SSE2, or it may not, depending on register pressure.

I do NOT understand how to compile programs that detect whether MMX or SSE/SSE2 support is available, and have multiple versions of variant code within an executable, but I know people do that sort of thing... I'm guessing somebody else here can give you some guidance on that.

10. Originally Posted by Paul W.
Basically, I think the bottom line is that if you target SSE or SSE2, you'll do fine except on computers nobody should be using anymore. (Or if they use them, they should expect their mileage to vary by a factor of two.) Many scientific codes insist on at least SSE2 or later.

There's a weirdness with MMX that it conflicts with floating point, using the same register file (or at least the same register numbers) as the old scalar floating point ops---your code needs to be compiled to use either floating point or MMX, not both, or there's a weird cost in saving registers and switching modes when you try to mix FP and MMX, so don't. My guess is that you don't rely on FP, so that shouldn't be a big issue if you want to target MMX.
As far as I can tell, that sounds right. Tree probably wouldn't work very well on many older computers anyway because of its slowness and the amount of RAM needed to avoid disk thrashing. The application I want to try it on does not use floating point, so that's not a problem. It looks like it's worth a try.

It looks like detecting MMX or SSE is pretty easy: http://www.tommesani.com/index.php/c...x-and-sse.html.

11. ## Thanks:

Kennon Conrad (27th October 2014)

12. Kennon:

Tree probably wouldn't work very well on many older computers anyway because of its slowness and the amount of RAM needed to avoid disk thrashing.

I'd been thinking that you'd only do compression on relatively recent machines, but might decompress the data on older, slower machines. (Given the huge asymmetry of compression/decompression times.) I hadn't been thinking about the space requirements for decompression, though, and I guess they're much more symmetrical.

In that case, it seems like SSE2 is likely the way to go. SSE2 was introduced with the Pentium 4, and all 64 bit x86's from Intel and AMD have always had it in both 64- and 32-bit modes. (i.e., for over a decade now.)

(BTW, you can't install 32-bit Windows 8 on an x86 without SSE2... it assumes that anything without SSE2 is just too old and lame.)

13. Originally Posted by Paul W.
Kennon:

Tree probably wouldn't work very well on many older computers anyway because of its slowness and the amount of RAM needed to avoid disk thrashing.

I'd been thinking that you'd only do compression on relatively recent machines, but might decompress the data on older, slower machines. (Given the huge asymmetry of compression/decompression times.) I hadn't been thinking about the space requirements for decompression, though, and I guess they're much more symmetrical.
RAM usage is not symmetrical either. Decompression doesn't require nearly as much RAM as compression since the decoder doesn't need to build suffix trees. You are right in thinking that decompression could be run on older, slower machines (as long as SSE is not used). For decompression, the main consumer of RAM is the decompressed dictionary. I have thought about storing low frequency dictionary entries in their compressed format instead, but haven't done anything there yet.

14. Originally Posted by Kennon Conrad
I didn't find anything relevant at first, but then found out about MMX, SSE, etc. I only wanted 64 bit shifts and MMX seems to have that, although I'm not sure it's supported on all processors. This is worth a longer look after I release the next version of TreeCompress64, which will feature multi-threaded longest common prefix tree building for significantly faster compression.

Do you know if people ever build a tree of suffix arrays? I was thinking that building a prefix tree, but limiting the length of the tree to a small number and then attaching a suffix array to each tree leaf could reduce the sorting a lot.
My access to the web has been hit or miss lately, so that's why it's taken me a while to respond.

I'm certainly not the expert on MMX/SSE around here. I do know that some versions of MMX and SSE are quite old, and so you can pretty much count on them being present even on most AMD CPUs. My understanding is that MMX and SSE aren't free to use, though. Using them opens up a whole new set of performance issues, so it has to be done carefully. Piotr and Bulat would be much more helpful than I would.

As far as trees of suffix arrays: I'm not sure I follow exactly what you have in mind, but trees of suffix arrays doesn't sound like anything I've come across. That doesn't mean it isn't a good idea.

15. Originally Posted by nburns
My access to the web has been hit or miss lately, so that's why it's taken me a while to respond.
I had access problems too, but of a different type. My new computer got fried by a power outage we had a couple of days ago. I made a guess that it was the power supply and managed to replace it myself even though I know next to nothing about computer repair. It works again and hopefully the Corsair supply will do better than the Thermaltake one did.

Originally Posted by nburns
I'm certainly not the expert on MMX/SSE around here. I do know that some versions of MMX and SSE are quite old, and so you can pretty much count on them being present even on most AMD CPUs. My understanding is that MMX and SSE aren't free to use, though. Using them opens up a whole new set of performance issues, so it has to be done carefully. Piotr and Bulat would be much more helpful than I would.
I saw where Matt just turned on a compiler flag, so hopefully it's not hard to try. Maybe it's just me, but it seems like using a 32 bit register when dealing with codes only a little shorter than 32 bits is a pain. Tree does a combination of left shifts and right shifts to get the next value. With a 64 bit register, it could just do left shifts and wouldn't even need to load the next 32 bits unless half the bits are used up.

Originally Posted by nburns
As far as trees of suffix arrays: I'm not sure I follow exactly what you have in mind, but trees of suffix arrays doesn't sound like anything I've come across. That doesn't mean it isn't a good idea.
I was trying to think of a way to support faster sorting of the suffix array. There are probably better, known ways, but I was thinking that putting the suffixes into buckets (with easily calculated size) would reduce the amount of sorting required compared to the simplest algorithm. So I was thinking of using a tree for the first few characters, and then using suffix arrays for the longer portions of the "tree".

16. Kennon,

I don't know if these things are relevant at all but...

Kit & Wilks' "Virtual Corpus" suffix array scheme uses tokenization and assignment of ordered integer codes to tokens to speed up sorting. If the integer ordering corresponds to the input tokens' lexicographic ordering, you can use integer compares to implicitly compare several characters at a time. My impression is that you don't want to commit to a particular tokenization---part of what you're doing with Tree is figuring out how best to chop the input string into token-like substrings. Still, an initial tokenization might be useful purely as an optimization for sorting.

Kärkkäinen's "Suffix Cactus" is a hybrid combining features of suffix trees and suffix arrays, but I don't know if it's the kind of hybrid you're looking for.

Lazy suffix trees also look interesting.

None of those is very recent, and a whole lot of stuff has been done with suffix structures since, for text retrieval and for genetics databases, etc. I wouldn't be surprised if what you're looking for is out there somewhere, but I can't point you to it.

17. Originally Posted by Paul W.
Kärkkäinen's "Suffix Cactus" is a hybrid combining features of suffix trees and suffix arrays, but I don't know if it's the kind of hybrid you're looking for.
You're right. I've come across the suffix cactus before, and the authors describe it as a suffix tree/suffix array hybrid, so maybe it's exactly what Kennon has in mind. This page has a link to a free pdf of the paper: http://citeseerx.ist.psu.edu/viewdoc...10.1.1.42.9224

My assumption was that it wasn't very practical/useful, because I don't think I've ever seen any mention of it aside from the original paper itself popping up in search results. But who knows -- this could be the ideal application it's been waiting for. To gauge its importance, it may be helpful to check out the ways it's been cited in subsequent papers (and the citing papers themselves, which may contain better ideas): http://citeseerx.ist.psu.edu/showcit...10.1.1.42.9224

I wouldn't be surprised if what you're looking for is out there somewhere,
Definitely. The adage that 'he who is ignorant of history is doomed to repeat it' is nowhere more true than in academic research.

Originally Posted by Kennon Conrad
I had access problems too, but of a different type. My new computer got fried by a power outage we had a couple of days ago. I made a guess that it was the power supply and managed to replace it myself even though I know next to nothing about computer repair. It works again and hopefully the Corsair supply will do better than the Thermaltake one did.
I started building my computers from parts 10+ years ago, and the first power supply I got came with the case. I think the power supplies that come with cases aren't very good, because the original one died on me, and I replaced it with a Seasonic from Fry's. I'm using that Seasonic right now as I type this. It's the only part from that past iteration of my computer that I haven't replaced, so that's a pretty good endorsement.

A surge protector is probably a sensible investment, too, in case you're not using one already. Replacing power supplies can get expensive.

I saw where Matt just turned on a compiler flag, so hopefully it's not hard to try. Maybe it's just me, but it seems like using a 32 bit register when dealing with codes only a little shorter than 32 bits is a pain. Tree does a combination of left shifts and right shifts to get the next value. With a 64 bit register, it could just do left shifts and wouldn't even need to load the next 32 bits unless half the bits are used up.

I was trying to think of a way to support faster sorting of the suffix array. There are probably better, known ways, but I was thinking that putting the suffixes into buckets (with easily calculated size) would reduce the amount of sorting required compared to the simplest algorithm. So I was thinking of using a tree for the first few characters, and then using suffix arrays for the longer portions of the "tree".
Most people (myself included) just plug in divsufsort and call it a day. A lot of work has gone into divsufsort and it's very hard to beat. I'm not sure how exactly you're proposing to use suffix arrays and whether you absolutely need to do your own sorting, but if so, you should absolutely review the relevant literature before rolling up your sleeves and writing a suffix sort. There are several papers worth reading, probably starting with Manber and Myers' original SA paper[1]. You could read the source code for divsufsort[2] itself, but it's not simple. I think sais[3] is a simpler and more approachable implementation of the underlying algorithm, and that's what I would look at before divsufsort if I was going to write my own variant.

If (fast) suffix sorting were easy, there would not be so many papers published on the subject. Even assuming you are a genius (it's possible), you might as well take advantage of the work others have done rather than waste time repeating it.

[1] http://citeseerx.ist.psu.edu/viewdoc...0.1.1.420.2338

18. nburns:

Most people (myself included) just plug in divsufsort and call it a day. A lot of work has gone into divsufsort and it's very hard to beat.

That's what I was thinking of doing, but a quick look at libdivsufsort didn't unambiguously tell me it'd do what I want. I saw some phrase about how it worked with a finite alphabet, which made me think it'd work with 10- or 16-bit output from a preprocessor I wanted to make, but when I looked at the headers, it looked like a symbol was assumed to be 8 bits---so maybe "a finite alphabet" means "the particular finite alphabet we expect".

Do you know if you can change the relevant typedef and it will magically work with, say, 10- or 16-bit characters, or is it as hardcoded as I'd expect, and that wouldn't work well, if at all?

I do realize I can translate things to 16-bit character pairs, and carefully ignore a lot of stuff about odd-numbered string lengths and so on, to approximate having 16-bit chars with 8-bit code, but a suffix array library that didn't actually assume 8-bit chars would be very handy, and likely have better complexity, at least insofar as I understand it right now.

19. Originally Posted by Paul W.
nburns:

Most people (myself included) just plug in divsufsort and call it a day. A lot of work has gone into divsufsort and it's very hard to beat.

That's what I was thinking of doing, but a quick look at libdivsufsort didn't unambiguously tell me it'd do what I want. I saw some phrase about how it worked with a finite alphabet, which made me think it'd work with 10- or 16-bit output from a preprocessor I wanted to make, but when I looked at the headers, it looked like a symbol was assumed to be 8 bits---so maybe "a finite alphabet" means "the particular finite alphabet we expect".

Do you know if you can change the relevant typedef and it will magically work with, say, 10- or 16-bit characters, or is it as hardcoded as I'd expect, and that wouldn't work well, if at all?

I do realize I can translate things to 16-bit character pairs, and carefully ignore a lot of stuff about odd-numbered string lengths and so on, to approximate having 16-bit chars with 8-bit code, but a suffix array library that didn't actually assume 8-bit chars would be very handy, and likely have better complexity, at least insofar as I understand it right now.
divsufsort works on 8-bit characters. There are other suffix sort libraries that work on larger integer alphabets. I think this includes/uses one: https://code.google.com/p/esaxx/

FWIW, I'm not 100% sold on the need to suffix sort on larger alphabets. If your symbols are 16-bit ints, then the byte-wise suffix array will contain within it the suffix array for 16-bit ints and you can throw away the unwanted indices. If your input is utf-8, then suffix sorting as bytes is not necessarily a bad idea anyway, as the multi-byte symbols will be handled in a pretty sensible way and the BWT will compress pretty well. But you can judge for yourself.

Edit: To answer your question more directly -- I'm not aware that divsufsort is designed to be easily changed to handle 16-bit characters. 10-bit would truly be magic, because C doesn't have a type for that and it would be difficult at the level of the hardware.

20. Originally Posted by nburns
divsufsort works on 8-bit characters. There are other suffix sort libraries that work on larger integer alphabets. I think this includes/uses one: https://code.google.com/p/esaxx/

FWIW, I'm not 100% sold on the need to suffix sort on larger alphabets. If your symbols are 16-bit ints, then the byte-wise suffix array will contain within it the suffix array for 16-bit ints and you can throw away the unwanted indices. If your input is utf-8, then suffix sorting as bytes is not necessarily a bad idea anyway, as the multi-byte symbols will be handled in a pretty sensible way and the BWT will compress pretty well. But you can judge for yourself.

Edit: To answer your question more directly -- I'm not aware that divsufsort is designed to be easily changed to handle 16-bit characters. 10-bit would truly be magic, because C doesn't have a type for that and it would be difficult at the level of the hardware.
I'm not sold, myself, just unsure. If there's a small constant factor cost to doubling the character size to 16 bits, plus a small logarithmic cost depending on how many bits actually convey real information, say for a 10-bit alphabet, that's great---not worth looking for a large-alphabet suffix array library.

I do know that people have published papers about maintaining the complexity bounds of "good" suffix array sorts in the face of "large" alphabets, which makes me wonder if it's that simple, or if there are gotchas I don't understand---maybe because I don't really understand any of it in any depth :-/.

Ideally I'd just work on my 1K- (or 2K- or 4K-) or 16K-item dictionary preprocessor and feed it to libdivsufsort in some easy way, and it will all work out fine modulo a small constant in space and/or time, but I just don't know. I'm a noob.

21. Kennon,

A while back you offered to post some more data... I've thought of something that would be very interesting to see (to me, anyway)...

If you could run Tree on the English text from enwik9, it would be very interesting to compare the strings & frequencies it selects to the frequencies of word n-grams in corpuses of English. (E.g., the British National Corpus (BNC) and Corpus of Contemporary American English (COCA).)

Matt has posted text8, which corresponds to enwik8, and a Perl script to generate text9 from enwik9, on this page: http://mattmahoney.net/dc/textdata.html

If you could run the script over enwik9, and Tree over the result, that would be awesome. Failing that, results for text8 would be pretty interesting---they'd probably show a lot of the same stuff, with less certainty.

---

I've been looking at frequency-sorted lists of word n-grams, and there are some interesting patterns that have implications for your "top-down" approach (or what I tend to think of as "outside in"; top-down is a different, related dimension.)

One of the things that's interesting is that n-gram frequencies for n above 1 or 2 vary a lot between corpuses, apparently not mostly for the reasons you might immediately think of---e.g., being intentionally based on British English vs. American English.

For example, if you look at word 5-grams for the BNC and COCA, the major differences in the rank ordering of common word sequences aren't mostly about whether they're more likely in British or American English. They have more to do with the proportions of the corpus that draw from written vs. (transcribed) spoken English, or narrative vs. non-narrative English, and stupid things that clearly depend on boilerplate phrases or sentences that are wildly overrepresented in the limited sample, even though the sample is intended to be diverse and representative.

My impressions so far:

(1) most really long strings in most corpuses are very specific boilerplate due to something that's overrepresented in a given corpus (e.g., transcripts from American NPR or the British House of Commons, but you'd see the same kind of thing for boilerplate like GPL or BSD licenses in source code, and all sorts of other things where there's boilerplate---and I see it in your string frequency lists for enwik9, both in terms of boilerplate English text and in terms of boilerplate xml formatting for particular kinds of articles.)
(2) most long-but-not-really-long strings have frequencies that depend a lot on a very few basic things about speech (whether it's descriptive vs. narrative, whether it's impersonal vs. personal, etc., whether it's emotive/evaluative or neutral)
(3) a whole lot of the most frequent longish-but-not-really-long strings (word 5- and 6-grams) are all about preposition-related phrases of a few very stereotyped forms, e.g., "at the noun1 of the noun2" or "in the noun1 of noun2", or very minor variations that affect the number of words by just 1 due to the inclusion or omission of an article ("a" or "the") or a word that varies by basic mode of speech (e.g., "its" vs. "my" vs. "your" or "his" vs. "her" or whatever).

This is particularly interesting to me because I've thought for a long time that an outside-in approach based on corpus n-gram frequencies should allow you to build a pretty good dictionary for English or German or whatever, without explicitly modeling grammar or parts of speech or anything funky like that.

22. Originally Posted by Paul W.
Kennon,

I don't know if these things are relevant at all but...

Kit & Wilks' "Virtual Corpus" suffix array scheme uses tokenization and assignment of ordered integer codes to tokens to speed up sorting. If the integer ordering corresponds to the input tokens' lexicographic ordering, you can use integer compares to implicitly compare several characters at a time. My impression is that you don't want to commit to a particular tokenization---part of what you're doing with Tree is figuring out how best to chop the input string into token-like substrings. Still, an initial tokenization might be useful purely as an optimization for sorting.

Kärkkäinen's "Suffix Cactus" is a hybrid combining features of suffix trees and suffix arrays, but I don't know if it's the kind of hybrid you're looking for.

Lazy suffix trees also look interesting.

None of those is very recent, and a whole lot of stuff has been done with suffix structures since, for text retrieval and for genetics databases, etc. I wouldn't be surprised if what you're looking for is out there somewhere, but I can't point you to it.
The "Suffix Cactus" is interesting. But I'm not sure how useful it is for Tree because it looks like the fastest way to build it is to build a suffix trie first and that method doesn't really reduce memory if I understand it correctly.

I am looking for a way to reduce memory usage but maintain fast building of the structure. The maximum alphabet size should be at least 16 million (24 bits) since Tree can produce very large alphabets for large files (currently almost 8M for enwik9).

23. Originally Posted by nburns
divsufsort works on 8-bit characters. There are other suffix sort libraries that work on larger integer alphabets. I think this includes/uses one: https://code.google.com/p/esaxx/

FWIW, I'm not 100% sold on the need to suffix sort on larger alphabets. If your symbols are 16-bit ints, then the byte-wise suffix array will contain within it the suffix array for 16-bit ints and you can throw away the unwanted indices. If your input is utf-8, then suffix sorting as bytes is not necessarily a bad idea anyway, as the multi-byte symbols will be handled in a pretty sensible way and the BWT will compress pretty well. But you can judge for yourself.
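To illustrate that point with a toy example (hypothetical code, not from any library; it assumes the 16-bit symbols are laid out big-endian so byte order agrees with numeric symbol order): sort the byte suffixes, discard the suffixes that start mid-symbol, and halve the remaining indices. The result matches the symbol-level suffix array.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Byte-suffix comparison for a naive qsort-based suffix sort. */
static const uint8_t *gb;
static int gbn;
static int byte_suf_cmp(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    int m = gbn - (i > j ? i : j);   /* length of the shorter remaining suffix */
    int r = memcmp(gb + i, gb + j, m);
    return r ? r : j - i;            /* if one is a prefix, shorter sorts first */
}

/* Symbol-suffix comparison over the 16-bit array. */
static const uint16_t *gs;
static int gsn;
static int sym_suf_cmp(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    while (i < gsn && j < gsn && gs[i] == gs[j]) { i++; j++; }
    if (i == gsn) return j == gsn ? 0 : -1;
    if (j == gsn) return 1;
    return gs[i] < gs[j] ? -1 : 1;
}

/* Returns 1 if filtering the byte suffix array down to symbol boundaries
   reproduces the symbol-level suffix array for a small example. */
static int filter_demo(void) {
    uint16_t syms[] = { 300, 1, 300, 2, 1 };
    uint8_t bytes[10];
    int bsa[10], ssa[5], filtered[5], i, k = 0;
    for (i = 0; i < 5; i++) {            /* big-endian encoding */
        bytes[2 * i]     = (uint8_t)(syms[i] >> 8);
        bytes[2 * i + 1] = (uint8_t)(syms[i] & 0xff);
    }
    for (i = 0; i < 10; i++) bsa[i] = i;
    for (i = 0; i < 5; i++) ssa[i] = i;
    gb = bytes; gbn = 10;
    qsort(bsa, 10, sizeof(int), byte_suf_cmp);
    gs = syms; gsn = 5;
    qsort(ssa, 5, sizeof(int), sym_suf_cmp);
    /* keep only byte suffixes that start on a symbol boundary */
    for (i = 0; i < 10; i++)
        if (bsa[i] % 2 == 0) filtered[k++] = bsa[i] / 2;
    return memcmp(filtered, ssa, sizeof(ssa)) == 0;
}
```

The big-endian caveat matters: with little-endian layout, byte order and symbol order disagree and the trick breaks.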

Edit: To answer your question more directly -- I'm not aware that divsufsort is designed to be easily changed to handle 16-bit characters. 10-bit would truly be magic, because C doesn't have a type for that and it would be difficult at the level of the hardware.
The more papers I read, the more it becomes clear to me that the fastest way to build a sorted suffix array is to build a suffix tree first. I think the suffix array needs to be sorted to easily extract the LCP information. The algorithm would need to support 24 bit symbols.

My new computer was on a surge protector; in fact, it was on the same one as my old computer and has survived many power outages. My old computer is on its third power supply because the fans on the first one got too noisy for me and the second one was noisy right from the start. Power supplies seem to be the weakest point in PCs by far.

24. Originally Posted by Paul W.
Kennon,

A while back you offered to post some more data... I've thought of something that would be very interesting to see (to me, anyway)...

If you could run Tree on the English text from enwik9, it would be very interesting to compare the strings & frequencies it selects to the frequencies of word n-grams in corpuses of English. (E.g., the British National Corpus (BNC) and Corpus of Contemporary American English (COCA).)

Matt has posted text8, which corresponds to enwik8, and a Perl script to generate text9 from enwik9, on this page: http://mattmahoney.net/dc/textdata.html

If you could run the script over enwik9, and Tree over the result, that would be awesome. Failing that, results for text8 would be pretty interesting---they'd probably show a lot of the same stuff, with less certainty.

---

I've been looking at frequency-sorted lists of word n-grams, and there are some interesting patterns that have implications for your "top-down" approach (or what I tend to think of as "outside in"; top-down is a different, related dimension.)

One of the things that's interesting is that n-gram frequencies for n above 1 or 2 vary a lot between corpuses, apparently not mostly for the reasons you might immediately think of---e.g., being intentionally based on British English vs. American English.

For example, if you look at word 5-grams for the BNC and COCA, the major differences in the rank ordering of common word sequences aren't mostly about whether they're more likely in British or American English. They have more to do with the proportions of the corpus that draw from written vs. (transcribed) spoken English, or narrative vs. non-narrative English, and stupid things that clearly depend on boilerplate phrases or sentences that are wildly overrepresented in the limited sample, even though the sample is intended to be diverse and representative.

My impressions so far:

(1) most really long strings in most corpuses are very specific boilerplate due to something that's overrepresented in a given corpus (e.g., transcripts from American NPR or the British House of Commons, but you'd see the same kind of thing for boilerplate like GPL or BSD licenses in source code, and all sorts of other things where there's boilerplate---and I see it in your string frequency lists for enwik9, both in terms of boilerplate English text and in terms of boilerplate xml formatting for particular kinds of articles.)
(2) most long-but-not-really-long strings have frequencies that depend a lot on a very few basic things about speech (whether it's descriptive vs. narrative, impersonal vs. personal, emotive/evaluative vs. neutral, etc.)
(3) a whole lot of the most frequent longish-but-not-really-long strings (word 5- and 6-grams) are all about preposition-related phrases of a few very stereotyped forms, e.g., "at the noun1 of the noun2" or "in the noun1 of noun2", or very minor variations that affect the number of words by just 1 due to the inclusion or omission of an article ("a" or "the") or a word that varies by basic mode of speech (e.g., "its" vs. "my" vs. "your" or "his" vs. "her" or whatever).

This is particularly interesting to me because I've thought for a long time that an outside-in approach based on corpus n-gram frequencies should allow you to build a pretty good dictionary for English or German or whatever, without explicitly modeling grammar or parts of speech or anything funky like that.
I can do that, but will need several days. My computers are tied up with benchmarking right now. I have looked over the portion of the LTCB site that discusses the text versions of enwik in the past and have wanted to see how Tree would do, so this is interesting to me too. Tree doesn't work strictly on word-grams, but does produce a lot of symbols with multiple whole words, so the results will hopefully still be relevant.

25. Originally Posted by Kennon Conrad
The more papers I read, the more it becomes clear to me that the fastest way to build a sorted suffix array is to build a suffix tree first.
I'm honestly not sure how you got that impression. The papers I've read say the exact opposite. It's been known since the earliest days of suffix arrays (the first mention in the literature AFAIK was the 1995 Manber and Myers paper) that suffix arrays can be generated from suffix trees; however, that method has been considered unsatisfactory due to the excessive memory consumption of STs (memory typically equates to time), and that's the basis for the extensive literature on other ways to generate them (SAs), i.e. pure suffix sorting algorithms.

In my own experience, every time I've experimented with suffix tree implementations they've been dog-slow compared to running divsufsort on the same data to generate the suffix array.

I think the suffix array needs to be sorted to easily extract the LCP information. The algorithm would need to support 24 bit symbols.
By definition, suffix arrays are sorted. If it's not sorted, then you can't call it a suffix array. You could get common prefix length information without necessarily doing a full suffix sort; it just wouldn't be in sorted order. Using a suffix array (sorted) would be the easiest way I could think of to get common prefix information, but perhaps that largely has to do with the fact that suffix sort libraries are already widely available and so you can just drop one in and not have to start coding from scratch.

The algorithm to get LCPs from a suffix array is pretty nifty and elegant. I have it implemented and I could post code if you would like. I could also dig up a reference to a paper.
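For reference, here is a minimal sketch of the standard linear-time approach (Kasai et al.'s algorithm; the naive qsort-based suffix sort is just to keep the demo self-contained, and this is not necessarily the exact code I have):

```c
#include <stdlib.h>
#include <string.h>

/* Kasai et al.'s linear-time LCP construction from a suffix array.
   lcp[k] = longest common prefix of the suffixes at sa[k-1] and sa[k];
   lcp[0] = 0. s must be NUL-terminated. */
static void build_lcp(const char *s, int n, const int *sa, int *lcp) {
    int *rank = malloc(n * sizeof(int));
    int i, h = 0;
    for (i = 0; i < n; i++) rank[sa[i]] = i;
    for (i = 0; i < n; i++) {
        if (rank[i] > 0) {
            int j = sa[rank[i] - 1];
            while (s[i + h] && s[i + h] == s[j + h]) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;   /* key invariant: lcp drops by at most 1 per step */
        } else {
            lcp[0] = 0;
            h = 0;
        }
    }
    free(rank);
}

/* Naive qsort-based suffix sort, only for this small demo. */
static const char *g_s;
static int suf_cmp(const void *a, const void *b) {
    return strcmp(g_s + *(const int *)a, g_s + *(const int *)b);
}

/* Build SA and LCP for "banana" and check against hand-computed values. */
static int lcp_demo(void) {
    static const char s[] = "banana";
    static const int want_sa[6]  = { 5, 3, 1, 0, 4, 2 };
    static const int want_lcp[6] = { 0, 1, 3, 0, 0, 2 };
    int sa[6], lcp[6], i, n = 6;
    for (i = 0; i < n; i++) sa[i] = i;
    g_s = s;
    qsort(sa, n, sizeof(int), suf_cmp);
    build_lcp(s, n, sa, lcp);
    for (i = 0; i < n; i++)
        if (sa[i] != want_sa[i] || lcp[i] != want_lcp[i]) return 0;
    return 1;
}
```

The trick that makes it linear is processing suffixes in text order: when you move from suffix i to suffix i+1, the LCP with the preceding suffix in sorted order can shrink by at most one, so the matching loop never restarts from scratch.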

My new computer was on a surge protector; in fact, it was on the same one as my old computer and has survived many power outages. My old computer is on its third power supply because the fans on the first one got too noisy for me and the second one was noisy right from the start. Power supplies seem to be the weakest point in PCs by far.
I'm not sure if surge protectors last forever. If your power supply died while connected to that surge protector, perhaps you should retire it and buy a new one. I think you could also probably say of power supplies that you get what you pay for, or, at least, some brands are better than others. I've been nothing but pleased with my Seasonic.

26. Originally Posted by nburns
I'm honestly not sure how you got that impression. The papers I've read say the exact opposite. It's been known since the earliest days of suffix arrays (the first mention in the literature AFAIK was the 1995 Manber and Myers paper) that suffix arrays can be generated from suffix trees; however, that method has been considered unsatisfactory due to the excessive memory consumption of STs (memory typically equates to time), and that's the basis for the extensive literature on other ways to generate them (SAs), i.e. pure suffix sorting algorithms.

In my own experience, every time I've experimented with suffix tree implementations they've been dog-slow compared to running divsufsort on the same data to generate the suffix array.
Well, besides the asymptotic time issue (see page 1 of the first link you provided), the same paper includes the following near the top of page two:

"Time could be saved by constructing the suffix tree first, and then building the array....".

Also, the paper on the Suffix Cactus (http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf) has timings for building a suffix tree and a suffix array in Table 1. The tree build times are about 25% - 80% less. I agree that a generic suffix tree is dog-slow for large data. Using sibling trees instead of sibling lists reduces node traversals tremendously. Do you remember my one and only Wikipedia addition? It is still there.

Originally Posted by nburns
The algorithm to get LCPs from a suffix array is pretty nifty and elegant. I have it implemented and I could post code if you would like. I could also dig up a reference to a paper.
I don't need just the LCPs, but figure they would be helpful for calculating the number of instances of all the common prefixes. I have seen some papers and it doesn't look too bad. I have a feeling most existing code would have to be modified to support very large alphabets, and I sometimes find learning code is harder than writing it from scratch. But it would be interesting to see. At some point it would be interesting to try a version that can reasonably fit a structure in memory that can be easily used to extract the common prefix information and that can also be updated instead of rebuilt on each iteration.

Originally Posted by nburns
I'm not sure if surge protectors last forever. If your power supply died while connected to that surge protector, perhaps you should retire it and buy a new one. I think you could also probably say of power supplies that you get what you pay for, or, at least, some brands are better than others. I've been nothing but pleased with my Seasonic.
The surge protector was brand new too.

I expect differences in efficiency and noise for supplies depending on how much I want to spend, but these days none of them should die the first time power goes out. It wasn't a lightning storm and all the other electronic items in my house were fine (as usual), many of which are not on surge protectors.

27. ## Tree v0.14

Tree v0.14 includes a new version of TreeCompress64 that compresses files much faster than TreeCompress or the experimental TreeCompress64 included in v0.13, at the cost of using more CPU cores and more RAM than TreeCompress. As is often the case, this isn’t exactly where I thought I was going, but I am pleasantly surprised by the outcome.

TreeCompress64 is now a multi-threaded single block (iterative symbol ranking) compressor. Up to three helper threads are used to build portions of the suffix trie while the main thread builds a small initial portion and then proceeds to traverse the suffix trie portions and produce a list of the symbols with the best ratings. The symbol rating formula has been adjusted (again) to provide faster compression compared to TreeCompress and better compression than TreeCompress64 v0.13.

The basic symbol rating formula for TreeCompress64 is now:
Rating = ((M * D) – P) * (((D – 1.0)/L)^1.7), where
M = number of string matches (first occurrence does not count)
P = tokenization penalty (default 4.5)
L = symbol length
D = log(M/N) – log(Markov chain probability)
N = number of symbols in data

The 1.0 and 1.7 values were derived empirically based on compression speed and effectiveness.

RAM usage is now a little over 5x the input size for UTF8 compliant files, 8x the input size otherwise. TreeCompress64 uses approximately 5.1 GB to process enwik9, or 3.7 GB if TreeParagraphs and TreeWords are run before TreeCompress64. TreeCompress still uses about 1.7 GB when processing enwik9 either way.

A variety of additional relatively minor optimizations have been made.

Results for enwik8:
Tree v0.14:
TreeCompress64 => 22,124,900 bytes, encode 496 sec., decode 0.77 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 22,229,468 bytes, encode 326 sec., decode 0.80 sec.

Tree v0.13 (for reference):
TreeCompress64 => 22,196,288 bytes, encode 931 sec., decode 0.76 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 22,304,524 bytes, encode 730 sec., decode 0.74 sec.
TreeCompress => 21,976,316 bytes, encode 3916 sec., decode 0.78 sec.
TreeParagraphs + TreeWords + TreeCompress => 22,075,700 bytes, encode 1906 sec., decode 0.76 sec.

Results for enwik9:
Tree v0.14:
TreeCompress64 => 178,839,408 bytes, encode 9364 sec., decode 7.5 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 179,806,072 bytes, encode 6899 sec., decode 7.4 sec.

Tree v0.13 (for reference):
TreeCompress64 => 180,516,660 bytes, encode 30,147 sec., decode 7.6 sec.
TreeParagraphs + TreeWords + TreeCompress64 => 180,941,504 bytes, encode 20,834 sec., decode 7.4 sec.
TreeCompress => 177,340,072 bytes, encode 116,473 sec., decode 7.5 sec.
TreeParagraphs + TreeWords + TreeCompress => 178,774,864 bytes, encode 36,652 sec., decode 7.3 sec.

TreeDecode.zip is 10,525 bytes.

A few other TreeCompress vs TreeCompress64 compressed file size and compression time results:
calgary.tar:
TreeCompress: 863,616 bytes, encode 27 seconds, decode 25 msec
TreeCompress64: 855,816 bytes, encode 9 seconds, decode 25 msec

mozilla:
TreeCompress: 16,742,964 bytes, encode 1084 seconds, decode 39 msec
TreeCompress64: 16,724,964 bytes, encode 274 seconds, decode 38 msec

world95.txt:
TreeCompress: 542,488 bytes, encode 30 seconds, decode 22 msec
TreeCompress64: 533,096 bytes, encode 12 seconds, decode 21 msec

28. Originally Posted by Kennon Conrad
Tree v0.14 includes a new version of TreeCompress64

29. Originally Posted by Sportman
I shouldn't have to do this, I think. Something must be wrong with how I build, so I will need to research it.

30. I might get around to replying to the rest of your post at some point. For now, a couple things:

Originally Posted by Kennon Conrad
Well, besides the time order issue (see page 1 of link #1 you provided), if you look near the top of page two of the paper at the first link you provided, it includes the following near the top of the second page:

"Time could be saved by constructing the suffix tree first, and then building the array....".
Take into account the fact that that was the first ever suffix array paper, published in 1993 (I erroneously wrote 1995 before). The time difference they're referring to is evidently the difference in asymptotic time bounds. The suffix array algorithm they present has a log factor, and the best suffix tree algorithm known at that time (as now) is linear. Since then, linear algorithms have been discovered for suffix array construction, which brings them even.

Also, the paper on the Suffix Cactus (http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf) has timings for building a Tree and an Array in table 1. The tree build times are about 25% - 80% less. I agree that a generic suffix tree is dog-slow for large data.
That paper was published in 1995. Suffix sorting has undergone tremendous refinement since then. Suffix trees haven't changed much.

Using sibling trees instead of sibling lists reduces node traversals tremendously. Do you remember my one and only Wikipedia addition? It is still there.
No comment.

Originally Posted by Kennon Conrad
I shouldn't have to do this I think. It must be something wrong with how I build so I will need to research it.
pthreads are POSIX threads, so I suspect they are not a native part of Windows. One way to avoid packaging a pthread library would probably be to use Windows' native threads instead of pthreads.

31. Originally Posted by nburns
Take into account the fact that that was the first ever suffix array paper, published in 1993 (I erroneously wrote 1995 before). The time difference they're referring to is evidently the difference in asymptotic time bounds. The suffix array algorithm they present has a log factor, and the best suffix tree algorithm known at that time (as now) is linear. Since then, linear algorithms have been discovered for suffix array construction, which brings them even.
Well, considering you are the one who provided the link to the paper and then claimed in your next post to have no idea where my impression came from, I don't have much else to say on that subject.

I made a special version of TreeCompress64 to check Tree's "suffix tree" build time for world95.txt. Tree's structure includes counters for the number of times each common prefix occurs, and the symbols are 32 bits, so the time isn't exactly comparable to those for suffix arrays that only use 8-bit alphabets and don't count instances. It is 0.53 seconds. Considering the extra burden of 32-bit symbols and instance counting, I don't see where that's "dog slow" compared to the fastest SACA benchmark timings.

32. Originally Posted by Sportman