21st February 2021, 18:20
Gotty has run some tests, and his data seems to confirm precision problems in cmix. I honestly haven't tried anything with cmix myself because it's simply unusable; it's far too slow.
It shouldn't be a surprise though that it doesn't do very well on what seems like "textual data", since it was continuously tuned over the years for the LTCB.
Comparing it to paq8gen would make more sense, since they're much closer in architecture, and paq8gen is progressing nicely.
It's quite impressive that it's been keeping up with LILY without any special model.
As for LILY, it's just about 1000 lines of badly hacked together code, much simpler than either of those two, as implied by the difference in speed.
And since it's modelling exactly the correlations I'm choosing, I can tell exactly what they are.
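To make the idea of "modelling exactly the correlations you choose" concrete, here is a generic sketch (purely hypothetical, not LILY's actual code or models): an adaptive order-k context model with Laplace-smoothed counts, where k picks the correlation you decide to exploit. On a sequence with a strong short-range dependency, the right context makes the estimated coded size collapse, while an order-0 model stays near the raw entropy.

```python
from collections import defaultdict
import math

def context_model_bits(seq, k):
    """Estimate the coded size in bits of seq under an adaptive
    order-k context model with Laplace-smoothed symbol counts."""
    alphabet = sorted(set(seq))
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]          # the chosen correlation: previous k symbols
        total = sum(counts[ctx].values()) + len(alphabet)
        p = (counts[ctx][sym] + 1) / total  # smoothed adaptive probability
        bits -= math.log2(p)
        counts[ctx][sym] += 1
    return bits

# Toy periodic sequence: an order-3 context fully determines the next symbol,
# so the order-3 model codes it in far fewer bits than the order-0 model.
seq = "ACGT" * 256
print(context_model_bits(seq, 3) < context_model_bits(seq, 0))
```

The point of the sketch is only that when you hand-pick the context, you know precisely which correlation is being exploited, which is what makes the gains interpretable.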
From the moment I decided to have a go at this, I've been focusing on a theory (call it a hunch, an intuition, if you will) and have been trying to see if I could model it.
Now that I've finally found what I was doing wrong, it was quite simple to get a big gain (2 lines of code gave 55.000 bytes of gain). If anything, I'd be tempted to just start over, because I think I can get pretty much the same result with half the complexity.
I must say, however, that it's frustrating, not being an expert, to not know whether the correlations I found, which accurately predict the differences in the sequences, have any actual meaning in the grand scheme of things. They seem to validate what I thought, but I can't shake the feeling that there's a higher-order structure at play here that I'm missing.
I'm sure Gotty will find them too, and paq8gen will probably go at least as low as 57x.xxx bytes.