As many people here are interested in text compression, I would like to propose a general discussion about its theoretical basis: information theory in linguistics - e.g. how to improve text compression, what the theoretical limits are ... and maybe also other applications of text models, e.g. automatic text recognition, interpretation or generation.
I will start with a few topics I find interesting for discussion, and I am counting on more - thoughts, interesting articles etc. ...
1) Hilberg conjecture - lots of articles can be found here: http://www.ipipan.waw.pl/~ldebowsk/ , here is a review.
I will start with one of its consequences: that the entropy of text grows sub-linearly with the length L, i.e. H(L) \propto L^alpha where alpha < 1.
Let us look at the top of Matt's benchmarks ( http://mattmahoney.net/dc/text.html ): cmix v6 compresses enwik8 to 15738922 bytes, and the 10 times larger enwik9 to 124172611 bytes, which is only 7.89 times more (it would be 10 times if the growth were linear). log_10(7.89) ~ 0.897, which suggests alpha < 0.9.
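Here is a minimal sketch of that estimate in Python (my own illustration, simply using the compressed size as a proxy for H(L) and fitting the exponent from the two data points above):

```python
from math import log

# Compressed sizes (bytes) for cmix v6 from Matt's benchmark page.
enwik8_compressed = 15738922   # input size: 10^8 bytes
enwik9_compressed = 124172611  # input size: 10^9 bytes

# Assuming H(L) ~ c * L^alpha and using compressed size as a proxy for H(L):
# alpha = log(H(L2)/H(L1)) / log(L2/L1)
alpha = log(enwik9_compressed / enwik8_compressed) / log(10**9 / 10**8)
print(f"estimated alpha ~ {alpha:.3f}")   # ~0.897
```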
The author claims that the lower bound is alpha > 0.6.
An interesting and controversial conclusion from such sub-linear behavior is that the entropy rate (number of bits/symbol) asymptotically drops to 0 (is it true? what does it mean?).
The original Hilberg conjecture (1990) stated that the conditional entropy of a letter (how many new bits of information it actually carries), given the previous n letters, decays like n^(beta - 1) with beta ~ 0.5.
In other words, there are strong long-range dependencies in texts - a new letter can be deduced more and more accurately from the preceding text, asymptotically carrying no new information.
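A quick sanity check connecting the two formulations (just summing the assumed power-law decay, nothing deeper): if the n-th letter carries h(n) \propto n^(beta-1) new bits, then the total entropy is H(L) = sum_{n=1..L} h(n) \propto L^beta, so alpha = beta, and the entropy rate H(L)/L \propto L^(beta-1) indeed drops to 0 - which is exactly the controversial consequence mentioned above.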
It can be imagined that we describe some "fixed object/story/world" and succeeding sentences cover some of its parts - and these "description parts" can intersect or repeat, also to remind the reader.
The Santa Fe process in the articles linked above is a nice idealization of such a process of adding successive random pieces of information about something we would like to describe.
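For concreteness, a toy simulation of it (my own sketch, assuming I remember the definition correctly: each symbol X_i = (K_i, Z_{K_i}) states the value of a randomly chosen "fact" about a fixed random world, with the fact index K_i drawn from a power law):

```python
import random

def santa_fe(n, beta=0.5, max_fact=10**5, seed=0):
    # Toy Santa Fe process: X_i = (K_i, Z[K_i]), where Z[k] are fixed fair coin
    # flips (an immutable random "world") and the topic index K_i is drawn from
    # a power law P(K = k) ~ k^(-1/beta), truncated at max_fact for simplicity.
    rng = random.Random(seed)
    weights = [k ** (-1.0 / beta) for k in range(1, max_fact + 1)]
    topics = rng.choices(range(1, max_fact + 1), weights=weights, k=n)
    world = {}  # facts are generated lazily, but never change once stated
    return [(k, world.setdefault(k, rng.randint(0, 1))) for k in topics]

print(santa_fe(20))
```

Facts mentioned for the first time contribute roughly 1 bit each, while repetitions contribute almost nothing - which is the mechanism behind the sub-linear entropy growth.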
This brings up a question about idealized data compression methods which might become available in a few decades (AI etc.).
So I see a perfect video compressor as decomposing the video into 3D objects, learning their required shapes, textures and dynamics from the movie, understanding the lighting of the scene etc. - then rendering its own model of the scene and just encoding the differences.
For text it would be even more difficult - the compressor should understand the text, encode its meaning, and then encode the particular way this meaning was expressed (the given text).
Current text compressors seem far from something like that - especially in using long-range dependencies.
Do you think it is possible to significantly reduce the current alpha ~ 0.9?
2) Statistical modeling of text - here is a good review paper: http://www.cs.cmu.edu/~roni/papers/s...-PROC-0004.pdf
Maybe some of these techniques would be beneficial in data compression?
3) Maximal Entropy Random Walk (MERW) means choosing the random walk on a graph which maximizes entropy production - e.g. for maximizing the capacity of an information channel under some constraints (the most "meaningful" language?).
For example, imagine Fibonacci coding - a sequence of 0s and 1s with the constraint that there are no two successive 1s. So after a 0 we can choose any next bit, but after a 1 we have to use 0.
The question is what frequency of 0 we should choose after a 0 - let's call this probability q \in (0,1).
The standard/naive/generic random walk (GRW) for this graph suggests choosing q = 1/2 to maximize the local entropy production ... however, intuition suggests that we should choose a somewhat larger q, as going to 0 gives us more freedom for the next step - maximizing the global average entropy production instead (MERW).
It turns out that the largest entropy production is for q ~ 0.618 (the inverse golden ratio). In this case we get ~0.694 bits/position, i.e. log_2 of the golden ratio (we would have 1 bit/position without the constraint).
We can easily find such a MERW for any graph, which has recently been used in various problems of network theory: e.g. for vertex centrality measures, link prediction, community detection (graph clustering), alternatives to PageRank, SimRank etc. (see the last slides here).
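For completeness, a minimal numpy sketch (my own, not taken from the papers) of the standard MERW construction from the dominant eigenvalue/eigenvector of the adjacency matrix, checked on the Fibonacci graph above:

```python
import numpy as np

def merw(A):
    """Maximal Entropy Random Walk for a (strongly connected) 0/1 adjacency
    matrix A: transition probabilities are built from the dominant eigenvalue
    lam and eigenvector psi of A as S_ij = A_ij * psi_j / (lam * psi_i).
    The achieved entropy production is log2(lam) bits per step."""
    w, v = np.linalg.eig(A)
    i = np.argmax(w.real)                      # Perron-Frobenius eigenvalue
    lam, psi = w[i].real, np.abs(v[:, i].real) # dominant eigenvector (positive)
    S = A * psi[None, :] / (lam * psi[:, None])
    return S, np.log2(lam)

# Fibonacci-coding graph: state 0 -> {0, 1}, state 1 -> {0}
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
S, rate = merw(A)
print(S)      # S[0,0] ~ 0.618 (the q above), S[0,1] ~ 0.382, S[1,0] = 1
print(rate)   # ~0.694 bits/position = log2(golden ratio)
```

The same function works for any strongly connected adjacency matrix; for large graphs one would use a sparse eigensolver instead of the dense np.linalg.eig.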
The question here is whether we could see language from the MERW perspective - as a way to maximize the capacity of an information channel under some constraints: short-range (e.g. grammatical) and long-range (Hilberg)?