Today, 17:13
Thanks for looking through the code. I knew I'd eventually have to remove LINQ, but I don't think I'm using it in any really critical places. Right now, most of the work seems to be in counting the words and in the ranking.
Thanks for the heads-up about foreach loops; I didn't know they were more expensive than plain for loops.
The architecture is pretty OK for now; it's like LINQ but async (using IAsyncEnumerable). I thought it might be slowing me down, but it's actually not. It processes data in blocks, and I'm testing whether a block size equal to the full file size is optimal, so that all the data travels the pipeline only once.
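Roughly, the pipeline has this shape (just a sketch to show what I mean, not the actual code; the stage names are made up):

```csharp
using System.Collections.Generic;
using System.IO;

static class PipelineSketch
{
    // Read the file in blocks; setting blockSize to the file length gives a single block,
    // so all the data travels the pipeline exactly once.
    static async IAsyncEnumerable<byte[]> ReadBlocksAsync(string path, int blockSize)
    {
        await using var stream = File.OpenRead(path);
        var buffer = new byte[blockSize];
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, blockSize)) > 0)
            yield return buffer[..read];
    }

    // Each stage consumes the previous stage's IAsyncEnumerable, LINQ-style but async.
    static async IAsyncEnumerable<byte[]> EncodeBlocksAsync(IAsyncEnumerable<byte[]> blocks)
    {
        await foreach (var block in blocks)
            yield return block; // placeholder: count words, rank, build the dictionary, encode
    }
}
```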
The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.
What the rank (as of the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word, and the ranking would be analogous to (n-1)*s for n = word size and s = occurrences. (I'd rather work with l and c for length and count.)
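In code the naive rank is just this (a tiny sketch, keeping the 256-entry/8-bit assumption so every code word costs one byte):

```csharp
// Replacing a word of `length` characters with a 1-byte dictionary code
// saves (length - 1) bytes per occurrence, so the total saving is (length - 1) * count.
static long Rank(int length, long count) => (length - 1L) * count;
```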
In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way.
It will prefer:
"the" -> rank=2000
"rabbit" -> rank=175
"lion" -> rank=105
"and" -> rank=70
"fox" -> rank=70
total = 2420
The ranks here are just off the top of my head.
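The greedy loop itself is basically this (a rough sketch with made-up names; it assumes the occurrence counts are already available):

```csharp
using System.Collections.Generic;

static class GreedySketch
{
    // Greedily pick the highest-ranked word, add it to the dictionary, repeat.
    // `counts` maps each candidate word to its occurrence count.
    static List<string> BuildDictionary(Dictionary<string, long> counts, int dictionarySize)
    {
        var dictionary = new List<string>();
        while (dictionary.Count < dictionarySize && counts.Count > 0)
        {
            string best = null;
            long bestRank = long.MinValue;
            foreach (var kv in counts)
            {
                long rank = (kv.Key.Length - 1L) * kv.Value;   // naive (n-1)*s rank
                if (rank > bestRank) { bestRank = rank; best = kv.Key; }
            }
            dictionary.Add(best);
            counts.Remove(best);
            // In the real compressor the remaining counts have to be updated here,
            // because occurrences covered by `best` are no longer available.
        }
        return dictionary;
    }
}
```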
The non-greedy variation you have in mind looks something like:
"the fox and the rabbit and the lion" -> rank=1225
"the" -> rank=2000
total = 3225
I've actually thought a lot about this, and it's not a problem of the ranking function itself, at least not in its primitive variation. No mathematical formula can solve the greediness; there will always be some example where switching the order gives a better result.
It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time), calculate the ranks of future words, and then choose the optimal branch. It becomes a look-ahead tree.
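Something like this, conceptually (a hedged sketch; ICompressorState and its members are hypothetical stand-ins for whatever the real state ends up being):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical view of the compressor state during dictionary building.
interface ICompressorState
{
    IEnumerable<string> TopRankedWords(int count); // the `count` highest-ranked candidates
    long Rank(string word);                        // e.g. the naive (n-1)*s rank
    ICompressorState AfterChoosing(string word);   // state after adding `word` to the dictionary
}

static class LookAheadSketch
{
    // Total rank of the best branch from `state`, looking `depth` words ahead and
    // branching only on the top `width` candidates at each level (a pruned look-ahead tree).
    static long BestBranch(ICompressorState state, int depth, int width)
    {
        if (depth == 0) return 0;
        long best = 0;
        foreach (var word in state.TopRankedWords(width))
            best = Math.Max(best, state.Rank(word) + BestBranch(state.AfterChoosing(word), depth - 1, width));
        return best;
    }
}
```

The word actually emitted would be the root of whichever branch scores best.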
The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:
A -- AT
T -- A
C -- T
G -- C
G
Then we can do a parse before each new word calculation, and we won't have the constraint that words can't span multiple blocks.
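The parse here would be something like a greedy longest-match pass with the current dictionary (just a sketch of the idea, not code from the repo):

```csharp
using System;
using System.Collections.Generic;

static class ReparseSketch
{
    // Greedy longest-match parse: split `text` into current dictionary entries.
    // The dictionary starts as the single-character alphabet, so the fallback always matches.
    static List<string> Parse(string text, ISet<string> dictionary, int maxWordLength)
    {
        var tokens = new List<string>();
        for (int i = 0; i < text.Length; )
        {
            int length = Math.Min(maxWordLength, text.Length - i);
            while (length > 1 && !dictionary.Contains(text.Substring(i, length)))
                length--;                       // try longer entries first, fall back to one character
            tokens.Add(text.Substring(i, length));
            i += length;
        }
        return tokens;
    }
}
```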
I'd excluded this as a good idea, because 1) we have to do a parse at each new word, and only re-parsing doesn't seem viable, and 2) we won't have this problem with a large enough look-ahead tree.
But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that, since we're already calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically, ranking becomes at most x times slower if we prune and only look at the top x words. The tree depth is only a concern for the first word; after that the updates are pretty fast, so a deeper tree is pretty inexpensive compared to a wider, more branching tree.
But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it. We need a wider one.
So maybe you're right: I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.
A couple assumptions:
Right now, with the naive ranking ((n-1)*s), we don't have to rank all the words and can re-rank only some of them. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5 anymore. My current ranking was last updated to use a hashtable for the ranks, so the actual calculation is only done once for similar words.
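The hashtable bit is just memoizing ranks by word, something like (a small sketch, not the actual field names):

```csharp
using System.Collections.Generic;

static class RankCacheSketch
{
    // All ranks are stored, not just a top-5, so any of them can be promoted when counts change.
    static readonly Dictionary<string, long> Ranks = new();

    static long GetRank(string word, long count)
    {
        if (!Ranks.TryGetValue(word, out var rank))
        {
            rank = (word.Length - 1L) * count;  // naive (n-1)*s rank, computed once per distinct word
            Ranks[word] = rank;
        }
        return rank;
    }
}
```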
What we can do for sure is re-count only some words. I'm still working on that; I have a draft, I just have to fix some minor indexing issues.
The reason I haven't done re-ranking is that the optimal ranking algorithm uses state. The technique is similar to morphing/updating the dictionary. This way, the current rank of a word depends on the other words already chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree.
For sure, re-ranking and re-counting plus the fragmentation fix will make a great compressor with acceptable compression times. But I'd rather improve BWD in the direction of optimality at the cost of compression time than trade memory for time.
When I'm done, I'll definitely try to implement one with this stack!