# Thread: Information theory in linguistics, boundaries of text compression?

1. ## Information theory in linguistics, boundaries of text compression?

As many people here are interested in text compression, I would like to propose a general discussion of its theoretical basis: information theory in linguistics - e.g. how to improve text compression, what the theoretical boundaries are ... maybe also other applications of text models, e.g. automatic text recognition, interpretation or generation.
I will start with a few topics I find interesting for discussion, and I am counting on more - thoughts, interesting articles etc. ...

1) Hilberg conjecture - lots of articles can be found here: http://www.ipipan.waw.pl/~ldebowsk/ , here is a review.
I will start with its consequence: that the entropy of text grows sub-linearly with the length L, i.e. H(L) \propto L^alpha where alpha < 1.
Let us look at the top of Matt's benchmarks ( http://mattmahoney.net/dc/text.html ): cmix v6 compresses enwik8 to 15738922 bytes, and the 10 times larger enwik9 to 124172611 bytes, which is only 7.89 times more (it would be 10 if entropy grew linearly). Log_10(7.89) ~ 0.897, which suggests that alpha < 0.9.
The author claims that the lower bound is alpha > 0.6.
An interesting and controversial conclusion from such sub-linear behavior is that the entropy rate (number of bits/symbol) asymptotically drops to 0. (is it true? what does it mean?)
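
The exponent can be read off directly from the two benchmark sizes quoted above; a quick sanity check in Python:

```python
import math

# Compressed sizes quoted above from Matt Mahoney's large text benchmark (cmix v6)
enwik8_c = 15_738_922   # bytes, for 10^8 input bytes
enwik9_c = 124_172_611  # bytes, for 10^9 input bytes

ratio = enwik9_c / enwik8_c          # ~7.89 instead of 10 for linear growth
# If H(L) ~ L^alpha, then H(10L)/H(L) = 10^alpha, hence:
alpha = math.log10(ratio)
print(f"ratio = {ratio:.2f}, alpha = {alpha:.3f}")   # alpha ~ 0.897
```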

The original Hilberg conjecture (1990) stated that the conditional entropy of a letter (how many new bits of information it actually carries), given the previous n letters, decays like n^(beta - 1) for beta ~ 0.5.
In other words, there are strong long-range dependencies in texts - a new letter can be deduced more and more accurately from the previous text, asymptotically carrying no new information.
One can imagine that we describe some fixed "object/story/world" and successive sentences cover some of its parts - and these "description parts" can intersect or repeat, also to remind the reader.
The Santa Fe process in the articles linked above is a nice idealization of such a process of adding successive random pieces of information about something we would like to describe.

This raises a question about idealized data compression methods which might become available in a few decades (AI etc.).
I see a perfect video compressor as decomposing the video into 3D objects, learning their required shapes, textures and dynamics from the movie ... understanding the lighting of the scene etc. - then rendering its own model of the scene and just encoding the differences.
For text it will be even more difficult - the compressor should understand the text, encode its meaning, and then encode the particular way of expressing this meaning (the given text).

Current text compressors seem far from something like that - especially in using long-range dependencies.
Do you think it is possible to significantly reduce the current alpha ~ 0.9?

2) Statistical modeling of text - here is a good review paper: http://www.cs.cmu.edu/~roni/papers/s...-PROC-0004.pdf
Maybe something from there would be beneficial for data compression?

3) Maximal Entropy Random Walk (MERW) means choosing the random walk which maximizes entropy production - e.g. for maximizing the capacity of an information channel under some constraints (most meaningful language?).
For example, imagine Fibonacci coding - a sequence of 0s and 1s with the constraint that there are no two successive 1s. So after a 0 we can choose any next bit, but after a 1 we have to use 0.
The question is what frequency of 0 we should choose after a 0 - let's call this probability q \in (0,1).
The standard/naive/generic random walk for this graph suggests choosing q = 1/2 to maximize local entropy production (GRW) ... however, intuition suggests that we should choose a slightly larger q, as going to 0 gives us more freedom for the next step - maximizing global average entropy production instead (MERW).
It turns out that the largest entropy production is for q ~ 0.618 (the inverse golden ratio). In this case we get ~0.694 bits/position (we would have 1 bit/position without the constraint).
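
These numbers can be reproduced for any constraint graph from the dominant (Perron) eigenvalue and eigenvector of its adjacency matrix; a small self-contained sketch for the Fibonacci graph above:

```python
import math

# Adjacency matrix of the Fibonacci-coding constraint graph:
# after state 0 both bits are allowed, after state 1 only a 0.
M = [[1, 1],
     [1, 0]]

# Power iteration for the dominant (Perron) eigenvalue lam and eigenvector psi.
psi = [1.0, 1.0]
for _ in range(200):
    nxt = [sum(M[i][j] * psi[j] for j in range(2)) for i in range(2)]
    lam = max(nxt)
    psi = [v / lam for v in nxt]

# MERW transition probabilities: P(i -> j) = (M[i][j] / lam) * psi[j] / psi[i]
q = (M[0][0] / lam) * psi[0] / psi[0]   # probability of choosing 0 after a 0
rate = math.log2(lam)                   # entropy production, bits/position

print(f"q = {q:.3f}, rate = {rate:.3f} bits/position")  # q ~ 0.618, rate ~ 0.694
```

Here lam converges to the golden ratio (1+sqrt(5))/2 ~ 1.618, so q = 1/lam ~ 0.618 and the rate is log2 of the golden ratio ~ 0.694 bits, as stated above.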

We can easily find such a MERW for any graph, which has recently been used in various problems of network theory: e.g. for vertex centrality measures, link prediction, community detection (graph clustering), alternatives to PageRank, SimRank etc. (see the last slides here).

The question here is whether we could see language from the MERW perspective - as a way to maximize the capacity of an information channel under some constraints: short-range (e.g. grammatical) and long-range (Hilberg)?

2. ## Thanks (2):

biject.bwts (6th December 2014),Gonzalo (6th December 2014)

3. enwik9 has a large section in the middle of boilerplate text for articles about cities, generated automatically from census data. It is more compressible than the rest of the file. However, when I was doing my dissertation research in 1999 I looked at about 30 language models and made a graph of compression vs. input size which you can see at http://cs.fit.edu/~mmahoney/dissertation/
It does show long range dependencies supporting the Hilberg conjecture, but it is not the straight line the conjecture would suggest. I don't know if there is an upper bound. About 1 GB is all the text you can process in a lifetime. Longer range dependencies would require measuring the effects across collaborative efforts like Wikipedia or the whole internet. It will depend on the degree of interaction between the writers.

Ideal video compression would use AI to convert a movie into a script and compress the text. The decompressor would render the movie in such a way that the viewer wouldn't notice the difference. This is lossy, of course, but in theory you could get fantastic compression ratios. The human brain can only remember a few bits per second.

Lossless compression is limited by the random noise in the low bits of the pixels, which is invisible to the eye. The resulting ratios are far worse than what we can already get even with ancient lossy MPEG-1.

4. ## Thanks (3):

Jarek (7th December 2014),ldebowsk (12th December 2014),Łukasz Dębowski (8th December 2014)

5. It's not just the section in the middle of enwik9 that is more compressible. In general, the first 25% is less compressible than the last 75%.

I don't know the theory or anything about Hilberg's conjecture other than what you just posted. From the perspective of Tree, a decent generalization is that for every 10x increase in data size, the dictionary grows by a factor of about 7x and the number of symbols written grows by a factor of about 7x - 7.5x. The average symbol length increases by about 0.5 bits for each doubling of the dictionary size. So in general, I would expect an increase in input size by a factor of 10 to increase the output file size by a factor of about 10*(0.72 + log2(7)*(0.5 / K)), where K is the average symbol length. The average symbol length for enwik8 is somewhere around 16 bits, so I would expect enwik9 to compress about 19% better. No matter how big the file gets, it always takes more symbols to write the output, but it is true that if those symbols represented an infinite number of input characters, the extra entropy per input character would be zero.
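
Plugging the stated numbers into that formula (a quick check of the 19% figure, with K = 16 bits as estimated above):

```python
import math

K = 16.0                                 # average symbol length for enwik8, bits
factor = 10 * (0.72 + math.log2(7) * (0.5 / K))
improvement = 100 * (1 - factor / 10)    # vs. linear 10x growth
print(f"output grows ~{factor:.2f}x per 10x input, i.e. ~{improvement:.0f}% better")
```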

6. ## Thanks (3):

Jarek (7th December 2014),ldebowsk (12th December 2014),Łukasz Dębowski (8th December 2014)

7. The use of long-range dependencies for sub-linear entropy growth (alpha < 1) is very nontrivial.
For example, there should also be some kind of Hilberg law for video compression: a movie often repeats scenery, faces, clothes ... but for current video compression the size growth with length is probably very close to linear. They have some adaptation mechanisms, e.g. learning the types of textures used, but I would be surprised if they got alpha below 0.99 (has it been tested?)

Current text compressors can see some dependencies: e.g. if two words have appeared together, they are more likely to appear together later.
Let's try to imagine a perfect text compressor for, e.g., a (non-scientific) book, which might suggest currently available ways of improvement.
A book contains some set of characters, places, items - and a dynamically evolving net of relationships between them.
Each character (and the narrator) has their own way of thinking and talking; each person/scenery/item has some set of features: some general ones which could be part of the compressor, and some specific to a given book - which should be learned from the book.
So after building (and encoding) such a dynamical net of relationships (with strengths/frequencies), the compressor should understand the context of each scene, decompose it into parts spoken by a specific person or the narrator - and use a corresponding model.

PAQ-like ultracompressors seem to do something similar: understand the context of each part and use a corresponding statistical model.
Do you think there is still room for a significant improvement of compression ratio - especially using long-range dependencies to significantly reduce the current alpha ~ 0.9?
Also, could PAQ-based statistical models be used for other applications, like speech recognition and interpretation?
Or maybe e.g. for finding unexpected behavior: we have a stream of measurements from monitoring some device and would like to automatically detect when something goes wrong - the measurement stream should no longer fit the statistical model of healthy behavior; the entropy rate should change.
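
The last idea can be sketched very simply: estimate the (order-0) empirical entropy rate over a sliding window and flag a change. A toy illustration, with made-up "healthy" and "faulty" streams:

```python
import math
import random
from collections import Counter

def window_entropy(window):
    """Empirical order-0 entropy of a window, in bits/symbol."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(1)
healthy = [random.choice("01") for _ in range(1000)]                   # ~1 bit/symbol
faulty = ["1" if random.random() < 0.95 else "0" for _ in range(200)]  # ~0.29 bits
stream = healthy + faulty

rates = [window_entropy(stream[i:i + 100]) for i in range(0, len(stream), 100)]
print([round(r, 2) for r in rates])  # entropy rate drops sharply at the fault
```

A real monitor would of course use a richer model than order-0 counts, but the principle - a change in the per-symbol code length signals a change in behavior - is the same.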

8. First, I would like to thank you for your interest in my research concerning Hilberg's conjecture. I have been working on this topic since 2000, when I first found the original Hilberg (1990) paper. Mostly I have worked as a mathematician, trying to understand general properties and constructions of processes with a power-law growth of entropy (the original Hilberg conjecture) or mutual information (the relaxed Hilberg conjecture). In recent years, however, I have become interested in empirical verification of Hilberg's hypothesis and have written a few (more or less naive) papers about that. For a very brief review of seemingly all published and unpublished results concerning this statement, see
http://www.ipipan.waw.pl/~ldebowsk/d...ykuly/qlit.pdf

As I mentioned, there are two versions of Hilberg's conjecture: the original conjecture and the relaxed one, each worth studying in its own right. The relaxed Hilberg conjecture seems pretty uncontroversial since it does not imply a zero entropy rate, whereas the original Hilberg conjecture does. For this reason, for many years I was inclined to believe that only the relaxed Hilberg conjecture may be true. However, when I started to study the maximal repetition in texts empirically, I found that it grows faster than the logarithm of the text length. A possible (but not necessary) explanation for this fact is the original Hilberg conjecture. Hence I started to take this possibility seriously and I am now on my way to constructing some idealized processes that satisfy the original Hilberg conjecture. They seem to have something in common with the idea of memetic cultural evolution, see
http://www.ipipan.waw.pl/~ldebowsk/d...st_repeats.pdf

It is a question for further research whether the original Hilberg conjecture is satisfied by real texts or their collections. There are a few caveats. First, universal compressors such as the Lempel-Ziv code or grammar-based codes seem to perform very poorly on sources satisfying the original Hilberg conjecture, compared with the theoretically optimal Huffman coding (to use which we would have to have access to the real probability distribution of the data). This is a partly proven theoretical result, see
http://www.ipipan.waw.pl/~ldebowsk/d...5_debowski.pdf
For this reason, we have to invent ways of estimating entropy other than universal coding. An interesting method of entropy estimation can be based on subword complexity (a function that tells how many different substrings of a given length are contained in a text). What is interesting is that such entropy estimates are lower bounds (in contrast to the upper bounds given by universal coding). See
http://www.ipipan.waw.pl/~ldebowsk/d...4_debowski.pdf
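
The subword complexity function itself is straightforward to compute for a given text; a minimal sketch (note this is just the counting function, not the full lower-bound estimator from the paper):

```python
def subword_complexity(text: str, k: int) -> int:
    """Number of distinct substrings of length k occurring in text."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})

# A periodic text has very low complexity, a less regular one much higher:
print(subword_complexity("abababababab", 3))  # -> 2 ("aba" and "bab")
print(subword_complexity("abracadabra", 3))   # -> 7
```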

Your further comments and questions are welcome. I count on an interesting discussion.

Regards, Łukasz Dębowski

9. ## Thanks:

Jarek (12th December 2014)

10. Originally Posted by Jarek
Let us look at the top of Matt's benchmarks ( http://mattmahoney.net/dc/text.html ): cmix v6 compresses enwik8 to 15738922 bytes, and the 10 times larger enwik9 to 124172611 bytes, which is only 7.89 times more (it would be 10 if entropy grew linearly). Log_10(7.89) ~ 0.897, which suggests that alpha < 0.9.
The explanation of the big difference in compression ratio between enwik8 and enwik9 lies partially in the file structure.
I made a simple test - I split enwik9 into 10 files of 100MB each and compressed them.
It looks like the file structure isn't uniform and the beginning of the file is less compressible than the other parts - especially the middle parts. Enwik8 is, as I understand, the first 100MB of enwik9, and its compression ratio is worse than that of the other parts and of all compressed parts of enwik9 in total. Scores below:

Compressor - 7zip, Ultra compression, LZMA2, 64MB dictionary, 64-bit word, 4GB block (however, this doesn't matter for the conclusion):

enwik8 24'864'790 bytes

enwik9.001 24'864'798 bytes
enwik9.002 24'721'675 bytes
enwik9.003 24'028'227 bytes
enwik9.004 23'788'276 bytes
enwik9.005 17'755'869 bytes
enwik9.006 12'078'414 bytes
enwik9.007 23'428'093 bytes
enwik9.008 23'119'456 bytes
enwik9.009 21'933'978 bytes
enwik9.010 22'710'466 bytes

Total enwik9 = 218'429'252 bytes = 8.78 times compressed enwik8. For more advanced compressors this ratio could be even lower.

Darek

11. Well, if you want to compare AI vs NI, you should use natural English text and compare compression of a 1 GB corpus against a 1.1 GB corpus.

12. Darek, look at Kennon's post - like yours, but with better resolution.
Besides being inhomogeneous, enwik is not exactly text - much more appropriate for testing the Hilberg conjecture would be literature, and so Łukasz has tested English, German and French books: http://www.ipipan.waw.pl/~ldebowsk/d..._paper_doc.pdf
He gets an exponent of about 0.95, but if I understood properly, he used pure LZ77. Modern ultracompressors should work near 0.9.

Łukaszu, thanks for joining and replying.
I think you claim that the lower bound for this exponent is 0.6 - could you maybe explain in simple words how you obtained it?
As you work on this topic, you probably have some personal belief about where in this [0.6, 0.9] range the exponent really lies?
If you think it is essentially below 0.9, what ways of improving current compressors do you see?

13. ## Thanks:

ldebowsk (12th December 2014)

14. Jarek, you are right concerning an important difference between texts and corpora. Many corpora are randomized collections of texts, which means that long-range correlations are distorted. If we are investigating Hilberg's conjecture, we should do experiments on collections of texts where there is some "narrative coherence" on all length scales. That's why I investigated single books rather than well-established corpora of texts. It's a very good question how to avoid this randomization when trying to go beyond a single text.

As for your second question, why the lower bound of the exponent is 0.6 --- as I said, it has to do with a new kind of entropy estimator, described in the paper mentioned in my previous post. The basic idea is that the lower the entropy, the larger the maximal repetition (the maximal repetition is the length of the longest repeated substring in a text). Very roughly speaking, the entropy of a given block length is the logarithm of the number of different blocks of that length that may appear in the text. Hence if the entropy is low, then the maximal repetition must be high. This can be translated into a lower-bound estimator of entropy based on measuring the maximal repetition (or more exactly the subword complexity, that is, a function that tells how many different substrings of a given length are contained in a text).
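
The maximal repetition itself is easy to compute directly; a minimal sketch, using binary search over lengths (valid because "some substring of length k repeats" is monotone in k):

```python
def maximal_repetition(text: str) -> int:
    """Length of the longest substring that occurs at least twice in text."""
    def has_repeat(k: int) -> bool:
        seen = set()
        for i in range(len(text) - k + 1):
            sub = text[i:i + k]
            if sub in seen:
                return True
            seen.add(sub)
        return False

    lo, hi = 0, max(len(text) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

print(maximal_repetition("to be or not to be"))  # "to be" occurs twice -> 5
```

For real texts one would use a suffix-tree or suffix-array construction instead, which computes this in linear time.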

15. Originally Posted by ldebowsk
The basic idea is that the lower the entropy the larger the maximal repetition is (...)
So imagine we would like to compress the digits of the expansion of pi - it can be generated by a simple program (its Kolmogorov complexity is tiny), so the entropy rate quickly drops to zero...
... while in terms of repetitions, it behaves just like a random sequence.

16. ## Thanks:

ldebowsk (13th December 2014)

17. Jarek, I haven't mentioned that there is a hidden assumption of stationarity in my estimation procedure. The expansion of pi is a tricky counterexample. It has many properties of random noise but it is not typical of stationary probability measures. Thank you for your important remark. Theoretical results concerning compression usually make the assumption of stationarity. Nonstationary measures can be pretty much anything, and if we don't restrict to some subclass of them, we cannot prove anything.

18. ## Thanks:

Jarek (13th December 2014)

19. The stationarity assumption - that the underlying random process is time-invariant - is a very strong one for text.
The simplest counterargument (which you mentioned to me) is that the ratio "density of 'the'"/"density of 'a/an'" grows from the beginning to the end of a book - the reader becomes more and more familiar with the described situation.
This is only a small effect of building a story - which first focuses mainly on describing the reality, as a base for the main plot.

So the first question is whether the lower bound found for stationary sequences is a good representative of real text?
Maybe, as for pi, the real informational content of the meaning of the text (Kolmogorov complexity?) is much smaller?

Another issue regarding counting repetitions is that, while it may be a good way to bound LZ-based compressors, there are also much more sophisticated ones. The PAQ-based cmix chooses the best prediction among more than a thousand different statistical models - does your lower bound apply to compressors other than LZ-based ones?

20. ## Thanks:

ldebowsk (14th December 2014)

21. Stationary models are just a starting point in our understanding of long-range dependencies in texts. I think it is homework we have to do before considering nonstationary models. My lower-bound estimators for entropy do not involve universal coding - it is a very new idea that subword complexity provides a lower bound for the entropy of stationary sequences; the bound is pretty general -- only some skewness of block distributions is required. But yes, real texts are probably nonstationary, some nonstationary effects are clearly visible, and I am currently ruminating on a simplistic model of this nonstationarity. Real text might be generated by a so-called asymptotically mean stationary measure, introduced by Kieffer and Gray -- that is, there might be only some limiting stationary measure derived from the nonstationary probability measure.

So, the Kolmogorov complexity of real text can be much smaller than the compression rate of any effectively computable universal code, but even the Kolmogorov complexity can be much larger than the true entropy of real text if the true probability distribution is incomputable (actually, why should it be computable? -- our imperfect statistical models of language already get very complicated).

22. We have discussed here that, e.g., video compression should be based on trying to understand the scene: decompose it into objects ... and a few days ago claims appeared from V-Nova about their new video codec Perseus, which breaks with the MPEG philosophy, supposedly providing 2-3x better compression than H.265.

"Just as human brains function in a hierarchical manner, by first visually recognizing a human being, then, a face, then a goatee, and then, gray hairs mixed into the goatee, Perseus will provide additional details of data representation to the elementary stream of MPEG-2 or H.264, explained Meardi."
http://www.eetimes.com/document.asp?doc_id=1326323

Do you think this claim is true?
Have you seen a better description of this technology?

23. From the article, "2x - 3x average compression gains, compared with legacy video codecs". If "legacy video codecs" means VP8 or H.264 (the previous generation), there is nothing revolutionary about this (HEVC and potentially Daala can do that). Their website contains no technical information at all, so I would take any performance claim with a gigantic grain of salt.
"Just as human brains function in a hierarchical manner, by first visually recognizing a human being, then, a face, then a goatee, and then, gray hairs mixed into the goatee, Perseus will provide additional details of data representation to the elementary stream of MPEG-2 or H.264, explained Meardi." This is just a description of an embedded bitstream AFAICT. That is a valuable quality to have because the bitstream content can be adapted to the available bandwidth (say, transmit more detail over cable, less over mobile networks), but it is not a new concept (see for example zerotree wavelet coders).

24. There are lots of news stories about it, but all of them are extremely unspecific, e.g. http://www.telecompaper.com/news/v-n...ducts--1074470
"UK video compression technology company V-Nova has released a number of products based on its Perseus technology, which has been demonstrated to offer 2x to 3x average compression gains, at all quality levels, compared to the h.264, HEVC and JPEG2000 standards. V-Nova said the technology shifts the entire video bitrate-quality curve, offering UHD quality at HD bitrates, HD at SD bitrates, and SD video at audio bitrates."
They promise to present something in a few days at the National Association of Broadcasters convention (April 13-16).
There is some discussion here (nothing interesting): http://forum.doom9.org/showthread.php?t=171990

25. "compared to the h.264, HEVC": Given that HEVC offers almost 2x compression ratio compared to H264, this sentence makes no sense to me.

26. http://www.telecompaper.com/news/v-n...ucts--1074470:
Published on April 1st: way to go, marketing people!

27. Something more specific:
"V-Nova Ltd., a leading provider of video compression solutions, and Hitachi Data Systems, will unveil an end-to-end solution for the acquisition and delivery of live UHD content to consumers at the Hitachi booth (SL3910) during NAB Show 2015 (Las Vegas, April 13-16). Visitors will see a complete, deployment-ready UHD ecosystem, combining the latest Hitachi Kokusai 4K camera (SK-UHD4000), Hitachi Data Systems servers and the recently launched PERSEUS® video compression technology.

The new solution reduces UHD distribution bandwidth by more than 50% vs. those promised by HEVC, delivering previously unachievable performance: ca. 6–7 Mbps for UHD movies and 10-13 Mbps for UHD p60 sports. As such, the solution is suitable for today’s ADSL, as well as satellite and cable delivery. (...)"
http://www.streamingmedia.com/PressR...ion_38349.aspx
We will probably see tomorrow ...

28. I couldn't find any comments from the NAB Show, but there is an interesting commentary article, mostly about skepticism towards wide adoption of such a proprietary codec (FreeArc Next discussion...):
http://www.v-net.tv/commentators-out...pression-codec
It also says
"And that brings us to the performance claims. At its launch, V-Nova demonstrated UHD video that had been encoded using PERSEUS being distributed at 8Mbps compared to 21Mbps with HEVC encoded content, for what it says was comparable video quality based on key performance indicators."

As it seems there is some truth to it, maybe let's try to discuss possibilities of improving video compression and speculate how PERSEUS, breaking with the MPEG philosophy, is built.
A perfect video compressor should (?) deconstruct the movie into objects, remember (and update) their textures, then render scenes from them: with stored positions, lighting etc.
I don't think we are close to something like that. However, there are some dominating objects, like person/face, hair, car, house etc. - I think a large ratio improvement could be obtained by localizing and processing such objects separately. Describe a human body as a set of parameters ... a talking face. Try to understand the parts of buildings used, maybe have some database of cars etc.
I think some hints about how PERSEUS is built can be obtained from the demonstration they have chosen (8Mbps vs 21Mbps for HEVC) - does anyone have some information about it?

29. Some new information - http://www.tvnewscheck.com/article/8...-opportunities
"
It tackles image compression differently than other algorithms by imitating some aspects of the human visual system, added Meardi.
“When you look at a person, you first see a head, and then a face and then features, then maybe a beard and then the white whiskers of the beard,” he said.

Perseus uses mathematical transforms that leverage correlations in an image to preserve detail efficiently during compression in a way that mimics vision.
“The important thing is that the mathematics to be able to do this need to be amenable to a hierarchical structure,” he said.
The fundamentals of MPEG compression, such as blocks, frequency transforms and motion vectors “do not scale hierarchically,” he added.
Perseus also employs a holistic treatment of image content, much as human vision does, rather than chopping images into blocks to be processed that later can become visible under certain conditions when decoded, as is the case with MPEG, he said.
The new compression technique is also “massively parallel,” taking advantage of the true processing power that has been available for years, Meardi said."

So it seems it breaks with the standard square blocks?

30. It sounds like Uwe Prochnow's ATVISICAN CODEC from Atvisican AG in 2004, which was supposedly 10-20 times better than MPEG-4 at the time. The algorithm also detected and described separate objects in video, using some Russian military algorithm for that.
http://wikipedia.qwika.com/de2en/Atvisican

He built it into a high-end video player (allcanview ACV-1 SE) and sold it for around 8000-9000 euro (German):
https://web.archive.org/web/20041214...attACV1SE.html
https://web.archive.org/web/20050617...eorecorder.htm
https://web.archive.org/web/20060617...t_ACV-1_SE.pdf

HOMEelectronic 05/2005 ACV-1 SE review (German):
https://web.archive.org/web/20060228...ronic_0505.pdf

News (German):
http://www.loehneysen.de/archiv/2004/acv/story.html

Patent:

Funny:
> I've seen this report on TV. If you ask me, he's a con-artist and the TV
> guys aren't knowledgable enough to really judge it.
http://compgroups.net/comp.compressi...pression/77650

31. ## Thanks:

Jarek (18th April 2015)

32. "Perseus uses mathematical transforms that leverage correlations in an image to preserve detail efficiently during compression in a way that mimics vision." Can they give a fuzzier description than that? Seriously? A DCT-based compressor would match the 'description'. How about some independent testing?

33. "V-Nova advanced video compression technology PERSEUS wins NAB best of show award": http://www.wspa.com/story/28865532/v...-of-show-award
"Wyplay Integrates V-Nova PERSEUS® Video Compression Technology into Set-Top-Box Platform of Major Pay-TV Operator": http://finance.yahoo.com/news/wyplay...080000700.html

34. Finally a review of V-Nova Perseus: http://www.streamingmedia.com/Articl...ticleID=111816
Not so revolutionary (generally a worse ratio than H.265, but more energy-efficient), though it has a nice new approach:
"At a high level, most video codecs encode frames by dividing the frame into blocks and squeezing data from these blocks as necessary to meet the target data rate. This causes the blocky artifacts we’ve all seen on streaming video and fringe cable and satellite channels. Perseus processes each frame at multiple resolutions, encoding the lower levels first and then adding detail as necessary for the higher resolutions. As we’ll see, this schema avoids the blockiness inherent to H.264, MPEG-2, and HEVC. Perseus tends to blur at aggressive encoding configurations."

35. A nice recent open paper about the universality of Zipf's law: that e.g. the k-th most frequent word should have probability approximately proportional to 1/k
http://journals.plos.org/ploscompbio...l.pcbi.1005110
It is probably well seen in LZ-based compressors; I wonder whether it is also true for e.g. binary files?

Zipf's law is also at the heart of ANS: a natural number x (the x-th word in a dictionary) contains lg(x) bits of information, hence has probability approximately proportional to 1/x.
This is used in the ANS philosophy (x -> x/p, or equivalently lg(x) -> lg(x) + lg(1/p), while processing a symbol of probability p), and its states obtain this ~1/x probability distribution.
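
The x -> x/p bookkeeping can be illustrated in a few lines (an idealized sketch of the information accounting only, not a real rANS coder; the probabilities are hypothetical):

```python
import math

x = 1.0
symbols = [0.5, 0.25, 0.125, 0.125]   # hypothetical symbol probabilities
for p in symbols:
    x /= p                            # idealized ANS encoding step: x -> x/p

# lg(x) has grown by exactly the sum of lg(1/p), the Shannon information content
bits = math.log2(x)
print(f"lg(x) = {bits} bits")         # 1 + 2 + 3 + 3 = 9 bits
```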
