
Thread: How much further can the best compression go?

  1. #1
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post

    How much further can the best compression go?

    Hi. I'm interested in finding out whether there is any work on approximating how much further compression can go.

    The two following charts represent all of the compression scores on the Matt Mahoney and squeezechart sites.

    [Attachment: matt.png]

    [Attachment: squeeze.png]

    WARNING 1: the time axes are not to scale as I was unable to accurately incorporate dates. So I just graphed it all as simple line charts.
    WARNING 2: for the reason above, I cannot vouch for the slight (apparent) downturn at the extreme right of the charts.

    Clearly there is some form of asymptotic behaviour. Intuition tells us that there has to be a limit by virtue of the pigeonhole principle. Otherwise anything would compress down to one byte. I'm really not sure what role Kolmogorov complexity (Kc) plays regarding incremental improvements to the current top compression algorithms featured on this site. Respectfully, are the authors here hoping for a revelation into how Kc might become computable? I personally don't think that Kc will help with the current benchmark files. They're ultimately human-generated (names, images, phone numbers etc.). It's hard to imagine how a hundred lines of magical computer code could ever hope to produce the wiki definition of "anarchism" for example.

    So I think we have to limit this question to algorithmic improvements. A file compressed by cmix virtually passes the ent randomness test. This indicates that a cmix file is almost pure entropy. Consequently there doesn't seem to be much room for improvement. cmix and the paq8 compressors produce files that are within 0.1% of the theoretical Shannon sizes when I test them with synthetically generated data sets (using the standard Shannon log formula). My three graphs all look asymptotic.

    Therefore I wager that another 50% reduction in compressed file size is impossible. enwik9 will never reach 58MB. Is there any scientific work, or are there references, that test this assertion?
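
    For reference, here is a minimal sketch of the kind of synthetic check mentioned above: draw i.i.d. symbols from a known distribution, compute the Shannon bound, and compare with what a compressor achieves. Python's lzma stands in for cmix/paq8 here (which get much closer to the bound); everything in it is just an illustration, not output from any particular tool.

    import lzma, math, random

    # Synthetic check: known-entropy i.i.d. data vs. a general-purpose compressor.
    random.seed(0)
    symbols = b"abcd"
    probs = [0.5, 0.25, 0.125, 0.125]
    n = 1_000_000
    data = bytes(random.choices(symbols, weights=probs, k=n))

    shannon = -sum(p * math.log2(p) for p in probs)        # 1.75 bits/symbol here
    actual = 8 * len(lzma.compress(data, preset=9)) / n

    print(f"Shannon bound: {shannon:.4f} bits/symbol")
    print(f"lzma actual  : {actual:.4f} bits/symbol")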

  2. The Following User Says Thank You to ciril_desquirrel For This Useful Post:

    introspec (1st February 2019)

  3. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    It depends on the models we develop. Most high-end codecs predict using a combination of binary, byte-wise, and word-wise contexts; if there's a higher-level context we can take advantage of, such as a topic model or a tonal model, then we can get higher compression. Compression is such a complex problem that there isn't really a mathematical way to prove we've reached a limit.
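
    As a toy illustration (a sketch, not code from any real codec): two adaptive byte predictors with different context scope, combined by a fixed linear mix. Real codecs mix many more models, bit by bit, with adaptive weights.

    import math
    from collections import defaultdict

    def mixed_code_length(data: bytes, w2: float = 0.7) -> float:
        # Adaptive order-0 and order-2 (two-byte context) models, Laplace-smoothed,
        # combined by simple linear mixing; returns the ideal code length in bits/byte.
        o0 = defaultdict(lambda: 1)
        o0_total = 256
        o2 = defaultdict(lambda: defaultdict(lambda: 1))
        o2_total = defaultdict(lambda: 256)
        bits = 0.0
        for i, b in enumerate(data):
            ctx = data[max(0, i - 2):i]
            p = w2 * (o2[ctx][b] / o2_total[ctx]) + (1 - w2) * (o0[b] / o0_total)
            bits -= math.log2(p)
            o0[b] += 1; o0_total += 1
            o2[ctx][b] += 1; o2_total[ctx] += 1
        return bits / len(data)

    # e.g. mixed_code_length(open("book1", "rb").read())   # the path is just an example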

  4. The Following User Says Thank You to Lucas For This Useful Post:

    Gotty (5th February 2019)

  5. #3
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    695
    Thanks
    216
    Thanked 215 Times in 132 Posts
    The big question is whether we are getting close to the theoretical limit: the Kolmogorov complexity, which is unknown but fixed.

    Regarding text, we need to have in mind that entropy growth is sublinear ( https://encode.su/threads/2092-Infro...xt-compression ):
    H(n) ~ n^alpha
    where alpha for text seems to be ~0.9, but might be improved in future compressors; there are suggestions that alpha > 0.6.
    This sublinear behavior means that, at least in theory, asymptotically we need 0 bits/symbol (for n->infinity) - due to long-range correlations in text, it becomes more and more difficult to write something really new ...

    This sublinear entropy growth might also concern other media - for example, in an infinite movie all the textures used would eventually repeat ...
    It raises a practical question: how do we exploit this bits/symbol -> 0 limit?
    Like building a new YouTube video out of all the old ones ... or, more practically: learn the space of possible videos and encode within it - and we have now started going in that direction with machine learning-based image/video compressors.
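
    One rough way to probe this on a real file (a sketch only, with lzma standing in for a strong text compressor and the file name just as an example): compress prefixes of increasing length and fit the slope of log(compressed size) against log(n); the slope plays the role of alpha.

    import lzma, math

    def alpha_estimate(path="enwik8", points=(10**5, 10**6, 10**7)):
        with open(path, "rb") as f:
            data = f.read(max(points))
        sizes = [len(lzma.compress(data[:n], preset=9)) for n in points]
        xs = [math.log(n) for n in points]
        ys = [math.log(s) for s in sizes]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        # least-squares slope of log(size) vs log(n) ~ alpha
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    # print(alpha_estimate())   # a value below 1 would indicate sublinear growth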

  6. The Following User Says Thank You to Jarek For This Useful Post:

    schnaader (31st January 2019)

  7. #4
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    42
    Thanks
    11
    Thanked 17 Times in 12 Posts
    I think the situation is different between lossless and lossy. In theory, for lossy, one can represent the manifold of "naturally occurring video/image/etc." and encode relative to that (that's a way to understand what NN-based compressors are doing - they are learning a compact representation for a high-dimensional manifold embedded in a much higher-dimensional ambient space). It feels (not a very scientific statement) like lossless is different, since it cannot afford to drop (not encode) "noise" that is far from the manifold in question, and noise is by definition not compressible.

  8. The Following User Says Thank You to Stefan Atev For This Useful Post:

    Jarek (31st January 2019)

  9. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    695
    Thanks
    216
    Thanked 215 Times in 132 Posts
    You are right - in multimedia there is nearly always noise that is impossible to predict.
    While we can try to somehow remove it in lossy compression, lossless compression cannot really approach 0 bits/symbol asymptotically.

    ... and there is a similar issue in text compression: the analogue of noise is e.g. the freedom to choose synonyms. We could try to do "lossy text compression" by somehow standardizing expressions to completely eliminate ambiguity ... but otherwise it just cannot asymptotically approach 0 bits/symbol.

    Łukasz Dębowski has a nice idealization ("Santa Fe process") of such an approach to 0 bits/symbol: there is some hidden random infinite sequence of 0s/1s, and we try to compress a sequence of its values at randomly chosen positions.
    So information about this hidden sequence saturates along the compressed sequence - asymptotically leading to 0 bits/symbol.
    This hidden sequence represents the entire knowledge; our succeeding texts, for example, saturate this knowledge ... a few philosophical assumptions and we could approach 0 bits/symbol.
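
    A toy simulation of the saturation idea (a sketch only; it uses uniform positions instead of the power law of the original, and it counts only the hidden-bit part - the positions themselves still cost bits, which is why the full argument needs the extra assumptions above):

    import random

    random.seed(0)
    z = [random.getrandbits(1) for _ in range(100_000)]   # the hidden "knowledge"
    seen, new_bits = set(), 0
    for i in range(1, 1_000_001):
        k = random.randrange(len(z))                      # observe the pair (k, z[k])
        if k not in seen:
            seen.add(k)
            new_bits += 1                                 # z[k] was genuinely new information
        if i in (10_000, 100_000, 1_000_000):
            print(i, f"{new_bits / i:.3f} new hidden bits per symbol")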

  10. #6
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post
    Partial disclosure: My interest in (lossless) compression is for using it as a tool to measure the entropy rate of ergodic/stationary but truly random processes. Not really for text, but I've found that the latest compressors work really well with random data. I don't think that Hilberg's conjecture will hold for a truly random process. It's derived from work with human language, not avalanche diodes where memes/set phrases are not relevant.

    I build true random number generators for my sins. Having developed an entropy source, I sample it and then try to guesstimate the entropy rate using compression. So my cunning plan is to do |cmix(samples)| / n. The purpose of this thread is to determine a safe yet practical value for n. So far I've used n = 2. So for example, I might have an ASCII data set like:-


    3.691406e-01
    -2.636719e-01
    -2.636719e-01
    1.054688e-01
    -1.054688e-01
    4.218750e-01
    -3.691406e-01
    8.437500e-01
    -3.691406e-01
    -5.273438e-01
    -5.273438e-01
    2.109375e-01
    5.800781e-01
    5.273438e-01


    There's no possible underlying model for this behaviour as it's derived from quantum indeterminacy within a hardware thingie. It's truly random, but correlated to some unknown degree. Compressing such a set may result in say 5 bits/sample. Hence I'm relying on the effectiveness of cmix to approximate the true entropy.

    What would I then divide it by to be safe yet practical? Simply dividing by a million may not be clever...
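
    For what it's worth, this is the shape of the estimate I'm computing (a sketch only; lzma stands in for cmix, and the safety factor n is exactly the quantity I'm asking about):

    import lzma

    def conservative_entropy_bits(samples: bytes, n: float = 2.0) -> float:
        # Compressed size in bits, divided by a safety factor n to allow for
        # whatever headroom a better compressor might still find.
        compressed = lzma.compress(samples, preset=9 | lzma.PRESET_EXTREME)
        return 8 * len(compressed) / n

    # with open("samples.txt", "rb") as f:      # example file of newline-separated readings
    #     data = f.read()
    # print(conservative_entropy_bits(data) / data.count(b"\n"), "bits/sample")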

  11. #7
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Jarek View Post
    a few philosophical assumptions and we could approach 0 bits/symbol
    But not with enwik8 or 9, which are of finite length...

  12. #8
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    695
    Thanks
    216
    Thanked 215 Times in 132 Posts
    Ciril, as mentioned, for enwik this limit is unknown but indeed fixed: its Kolmogorov complexity.

    Regarding compression of a PRNG sequence, even the best compressors only focus on relatively simple statistical dependencies; they have no chance of finding the rules governing even the simplest (deterministic) random generators - that requires sophisticated cryptanalysis.

    If you are interested in measuring the entropy rate of a real (e.g. time) series as minus the asymptotic log-likelihood, like in data compression, a very accurate approach is to first normalize the variables to a uniform marginal distribution as in copula theory, then model the joint distribution with polynomials (slides) - some results for log returns of daily stock prices of 29 Dow Jones companies:
    [Attachment: ldKRB3S.png]
    If you are interested in maximizing entropy rate, see https://en.wikipedia.org/wiki/Maxima...py_Random_Walk

  13. #9
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post


    Quote Originally Posted by Jarek View Post
    Regarding compression of a PRNG sequence, even the best compressors only focus on relatively simple statistical dependencies; they have no chance of finding the rules governing even the simplest (deterministic) random generators - that requires sophisticated cryptanalysis.
    I think that you're being overly pessimistic. I've done some testing that suggests these compressors are pretty good. In fact, the linked test approximates entropy to within 0.1% of the true theoretical amount. That's pretty much perfect. Matt's synthetic benchmarking thread also seems to bear this out for my kind of data. Even 7z is only a few tens of percent off the actual value. And the autocorrelation lag is over 70 bytes. Remember that this is not human text, and thus relatively simple in comparison.

    I'm starting to think that for sample data of the type I posted above, no significant improvements can be achieved. That would confirm n = 2 as safe.

  14. #10
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    There is still a lot of headroom for reducing the enwik8/enwik9 size, especially if you allow unrestricted resource usage. However, making progress on the Hutter prize will be a lot more difficult due to the resource restrictions. Over the last few years of working on cmix, I don't really feel like making incremental improvements has become any more difficult. I am attaching a graph of the enwik8+decompressor size for all the previous cmix versions: over the last three years progress has been close to linear. I can't guess where the actual compression limit will be, but I think we are still far away from that since it is still relatively easy to make improvements.
    [Attachment: cmix.png]

  15. The Following 3 Users Say Thank You to byronknoll For This Useful Post:

    ciril_desquirrel (1st February 2019),Cyan (1st February 2019),introspec (1st February 2019)

  16. #11
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    862
    Thanks
    457
    Thanked 256 Times in 104 Posts
    The whole point of Kolmogorov complexity is that it's not possible to prove that the ultimate limit of compression for a file has been reached.
    And basic entropy measurements do not provide a meaningful hint of what this limit could be.

    Let's take an example.
    Suppose the exercise is to compress enwik8, but enwik8 has been encrypted first.
    Encrypted enwik8 has the same size as enwik8 (or even a bit more), and its measured entropy says that it cannot be compressed.
    And trying to compress encrypted enwik8 seems to prove it: no compressor can find any gain.

    Yet, that's not completely true.
    If one knows the encryption algorithm and key, one can decrypt, and then compress, leading to substantial gains.

    That's the essence of Kolmogorov complexity.
    A source may seem impossible to compress, and this can be confirmed by entropy measurements,
    but it just proves that the decoder is unable to interpret that data, not that it's not compressible.

  17. The Following User Says Thank You to Cyan For This Useful Post:

    Gotty (2nd February 2019)

  18. #12
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    521
    Thanks
    198
    Thanked 745 Times in 302 Posts
    Quote Originally Posted by byronknoll View Post
    There is still a lot of headroom for reducing the enwik8/enwik9 size, especially if you allow unrestricted resource usage.
    So what's your opinion on this?

  19. #13
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    381
    Thanks
    261
    Thanked 269 Times in 145 Posts
    Quote Originally Posted by Jarek View Post
    The big question is whether we are getting close to the theoretical limit: the Kolmogorov complexity, which is unknown but fixed.
    I strongly feel that we are far from reaching the Kolmogorov complexity.
    Let me explain.

    Remember the 4KB intros/compos/demos? Participants create (generate) mind blowing videos with sound with just 4KB of code+data (youtube). If you try to compress the resulting video+audio by actually modeling it as video+audio, you are soooooo far from the Kolmogorov complexity (which is less than 4KB).
    I think 4KB intros/compos/demos are the best to illustrate how large the gap could be between the size of a compressed file and the Kolmogorov complexity of that file.

    My enwik8 skeleton compression post demonstrates the same idea: I have a very good "compression ratio" (71:1), but my approach is different: I do not let a general compression algorithm try to guess the structure of the file, I use an algorithm crafted (almost optimally) for that particular file. Do you guys feel the difference?

    So how to get closer to the Kolmogorov complexity? Here is a (non-practical) algorithm:
    Take a file. See it, understand it, let your brain pick up its patterns and finally write an algorithm that creates the file. In order to get the best results - take a decade of vacation ("unrestricted resource usage" in Byron's words). If you did your job right, your result will always be better (sometimes far better) than using any "general" compression algorithm that we use today. No "general" compression algorithm is specialized for this file.

    Quote Originally Posted by ciril_desquirrel View Post
    It's hard to imagine how a hundred lines of magical computer code could ever hope to produce the wiki definition of "anarchism" for example.
    The definition of Kolmogorov Complexity might be misleading. Program = code + data. Thus an upper bound of the Kolmogorov complexity of a file is the size of its compressed format + the size of its decompressor. Not just "a hundred lines of magical computer code" - that's just the algorithm. You will need to include some data as well. You may imagine the "data" part as a tight decision tree that the "a hundred lines of magical computer code" uses.
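
    A crude way to see that upper bound in practice (a sketch, with lzma standing in for a stronger compressor and the paths being just examples):

    import lzma, os

    def kolmogorov_upper_bound(path: str, decompressor_path: str) -> int:
        # Loose upper bound on K(file): compressed size plus the size of the
        # decompressor needed to reproduce it - both parts must be counted.
        with open(path, "rb") as f:
            compressed = lzma.compress(f.read(), preset=9)
        return len(compressed) + os.path.getsize(decompressor_path)

    # print(kolmogorov_upper_bound("enwik8", "/usr/bin/xz"))   # example paths only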

    Quote Originally Posted by Jarek View Post
    at least in theory, asymptotically we need 0 bits/symbol (for n->infinity)
    A small correction: I agree that it is getting closer and closer to 0, but not arbitrarily close (-> "asymptotically" does not sound right). It is approaching the entropy of the source (which is certainly not 0).

    Quote Originally Posted by Cyan View Post
    The whole point of Kolmogorov complexity is that it's not possible to prove that the ultimate limit of compression for a file has been reached.
    And basic entropy measurements do not provide a meaningful hint of what this limit could be.
    [...]
    A source may seem impossible to compress, and this can be confirmed by entropy measurements,
    but it just proves that the decoder is unable to interpret that data, not that it's not compressible.
    Well put!

    Quote Originally Posted by byronknoll View Post
    I can't guess where the actual compression limit will be, but I think we are still far away from that since it is still relatively easy to make improvements.
    Completely agree.
    Last edited by Gotty; 2nd February 2019 at 12:56. Reason: Didn't really shine thru what I ment

  20. #14
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Cyan View Post
    A source may seem impossible to compress, and this can be confirmed by entropy measurements...
    This is the part of interest to me. Surely you can compress right down to the entropy content of the file. Let's leave encrypted files aside. The trick is in measuring the entropy accurately. And if you know how to measure the entropy, you know how to compress the file. Clearly the simplistic Shannon sum of p_i*log(p_i) formula is not practical, as we don't know what i is in the presence of autocorrelation. I've tried using the NIST entropy formulae, but the GitHub implementation is rubbish. So I'm left with strong compression as a measuring device.

    Consider a file like:-


    543394672338654489241727832251630595849012369539256446919966-fixed-text
    043879463663886904088001594305988717988528785835264959701463-fixed-text
    923509952616900740198769908244018336391467850544005481023141-fixed-text
    920844576644976142249204992716223401194333438114010342577840-fixed-text
    93613389247763086796855055127...


    The digits are random, and "-fixed-text" is, well, fixed. Therefore the theoretical Shannon entropy of this file is exactly 2.76827 bits/byte. If you use cmix, you get 2.77107 bits/byte. That's within 0.1% of the best possible result. Ergo for this file, there is hardly any room for improvement.
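
    (For anyone who wants to check that figure - assuming each line is 60 uniformly random decimal digits, the fixed 11-byte suffix "-fixed-text" and a newline, i.e. 72 bytes:)

    import math

    digits_per_line = 60
    bytes_per_line = digits_per_line + len("-fixed-text") + 1    # 72 bytes
    bits_per_line = digits_per_line * math.log2(10)              # only the digits carry entropy
    print(f"{bits_per_line / bytes_per_line:.5f} bits/byte")     # 2.76827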

    So perhaps we can still further shrink the enwik* files, but I don't see how we can shrink this example as there appears to be nowhere left to go...

  21. #15
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by mpais View Post
    So what's your opinion on this?
    The transformer-XL results are really impressive. Someone should convert it into a compression program! I am guessing that by itself it won't be competitive with PAQ8. It is SoTA for language modelling with a train/test split, but it won't do as well with the online training needed for arithmetic coding. However, it would probably be very useful as a model added to PAQ8/cmix.

  22. #16
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    521
    Thanks
    198
    Thanked 745 Times in 302 Posts
    Quote Originally Posted by byronknoll View Post
    The transformer-XL results are really impressive. Someone should convert it into a compression program! I am guessing that by itself it won't be competitive with PAQ8. It is SoTA for language modelling with a train/test split, but it won't do as well with the online training needed for arithmetic coding. However, it would probably be very useful as a model added to PAQ8/cmix.
    I disagree, completely.
    Not only are the results unimpressive, but their claims to state-of-the-art status are ignorant at best, or purposefully deceptive.
    It seems the rage nowadays is to come up with new machine-learning architectures, tweak them, let them train for weeks on bonafide super-computers, and then publish ludicrous claims without any meaning.

    Either the authors (and most in the field) have forgotten about Kolmogorov complexity, which is doubtful given their prestigious credentials, or the (sad) state of academic publishing is such that nowadays if you don't publish something, you're a failure.

    Now, before going any further, let me justify myself with a concrete example.
    I can publish a machine-learning language model right now that, within equivalent restriction parameters as those described in the paper, achieves 0.76 bpc on enwik8, i.e., compresses it to less than 9,500,000 bytes.

    Now back to the paper. The code for it is located here. The pre-trained Tensorflow models are here.
    Notice anything? They can compress enwik8 down to 0.99 bpc... using a model that's over 3GB(!) in size. Impressive? No, not at all.
    Not to mention the fact that it requires powerful GPUs to use the model (let alone train it) and it can still take a couple of hours.

    So I modified paq8px to only use the normal model and the text model. That way on level 9 it uses about the same memory, 3GB.
    I trained it on enwik8 (so that way, to reproduce the result, we too would need to load 3GB of model state data). I didn't even save the mixer weights.
    Compressing enwik8 after this training stage gives my claimed result, 0.76 bpc, in less than 1 hour, single-threaded, on a CPU.

    Sounds ridiculous? Yes, as are their claims. The moment they decided to include results for cmix and yet claim themselves SoTA, they decided that comparing apples to oranges was fair game. What is the benefit of considering these ML models as-is for such tasks? If anything, we should be trying to understand what correlations are being modeled in these networks and devising mathematical frameworks to allow us to efficiently perform that analysis.

    These networks encode the pattern correlations they are trained on in a huge high-dimensionality space; they transform the information they've been given into a mapping into this space, but that gives us no better insight into how we can reduce the dimensionality to something tractable. We know that the space of actually meaningful information is a much smaller subset of the infinite space of possibilities, so we need to focus on understanding how we can devise the tools needed to encode a mapping into that space using as little data as possible.

  23. The Following 6 Users Say Thank You to mpais For This Useful Post:

    kaitz (4th February 2019),Mauro Vezzosi (3rd February 2019),Piotr Tarsa (3rd February 2019),pothos2 (4th February 2019),schnaader (3rd February 2019),Shelwien (11th February 2019)

  24. #17
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I'm not 100% sure what they are doing with transformer-xl, but if they are training their model on the same data that they are then compressing, then that makes no sense at all. If we're allowed to have a 3 GB decompressor then BARF ( http://mattmahoney.net/dc/barf.html ) can compress enwik8 easily to one byte. It would make much more sense to e.g. train on the first 100 MB of enwik9, and then evaluate the compression ratio on the next 100 MB of enwik9. Have they done that?

  25. #18
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    For their enwik8 results, they use the first 90% for training, the next 5% for validation, and the last 5% for testing.

    I tried measuring the performance of cmix v16 using this setup. I posted the results under the "enwik8" section here. After a single pass through the training set, cmix v16 gets 1.1092 bpc on the test set. Their 0.99 bpc is significantly better than what has been achieved using LSTMs on the same benchmark.

  26. The Following User Says Thank You to byronknoll For This Useful Post:

    schnaader (4th February 2019)

  27. #19
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    521
    Thanks
    198
    Thanked 745 Times in 302 Posts
    Quote Originally Posted by byronknoll View Post
    For their enwik8 results, they use the first 90% for training, the next 5% for validation, and the last 5% for testing.
    Yes, that's the usual methodology. But now, that's not what they're stating, is it? Look at table 2, the results for enwik8. They're directly comparing their 1.06 and 0.99 bpc results to cmix v13, listed as 1.23 bpc, which is v13's result for the full enwik8, with no 90% pre-training stage. So, deception or just ignorance?

    Quote Originally Posted by byronknoll View Post
    I tried measuring the performance of cmix v16 using this setup. I posted the results under the "enwik8" section here. After a single pass through the training set, cmix v16 gets 1.1092 bpc on the test set. Their 0.99 bpc is significantly better than what has been achieved using LSTMs on the same benchmark.
    At the cost of making even cmix look fast. Honestly, what is the point? It seems like every day a new paper comes up on arxiv with yet another ridiculously over-the-top network architecture claiming "SoTA" status. It truly seems the old saying has a point: when all you have is a hammer, everything starts looking like a nail. We know that modelling long-term correlations helps, we know that maintaining an internal memory state helps; what we don't know is how to model this efficiently.

    I'd much rather read a paper claiming 1.1 bpc on enwik8 but doing so with 3 or 4 fewer orders of magnitude of resources than the current ML approaches; that would be a lot more interesting. Alas, it seems all the work in the field nowadays is focused on this approach, instead of trying to devise new, efficient ways of achieving similar performance. Are we to resign ourselves to the fact that everything else has already been invented, that no other avenues await, that we must simply proceed on this path? I refuse to believe so, and if anything I think the admittedly poor work I've been doing on paq8 shows that it's much better to improve the way we model the data than to rely on sheer brute force.

  28. #20
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    447
    Thanks
    1
    Thanked 100 Times in 60 Posts
    Quote Originally Posted by Gotty View Post
    So how to get closer to the Kolmogorov complexity? Here is a (non-practical) algorithm:
    Take a file. See it, understand it, let your brain pick up its patterns and finally write an algorithm that creates the file. In order to get the best results - take a decade of vacation ("unrestricted resource usage" in Byron's words). If you did your job right, your result will always be better (sometimes far better) than using any "general" compression algorithm that we use today. No "general" compression algorithm is specialized for this file.
    Oh, I can certainly write one. This is the easiest exercise. Here is its output:

    0

    And here is the decoder: If the input is 0, regenerate "the file". If the input is "1", strip off the "1" bit and move the rest from input to output.

    Quote Originally Posted by Gotty View Post
    The definition of Kolmogorov Complexity might be misleading. Program = code + data. Thus an upper bound of the Kolmogorov complexity of a file is the size of its compressed format + the size of its decompressor. Not just "a hundred lines of magical computer code" - that's just the algorithm. You will need to include some data as well. You may imagine the "data" part as a tight decision tree that the "a hundred lines of magical computer code" uses.
    Well, that isn't all of it. The Kolmogorov complexity is relative to a Turing machine, which may have an internal (finite) state machine according to which it operates. Hence, the Kolmogorov complexity is only relative to the machine it was defined on. (Or, in your example, if the demo was written for i86 or 68K, so to say). Results will be different. Again, for a given job ("play this demo") I can easily find a Turing machine whose program size is minimal ("0") and whose data size is minimal ("") which plays this demo. Because I can hide the complexity in the machine ("hard-wire something that plays the demo").

    So this is not quite meaningful. It becomes meaningful if you consider infinite sequences, and check whether you can generate them with a (finite) Turing machine (low complexity) or no Turing machine at all (a statistical sequence - if you believe that this exists) in which case the data of the program needs to be infinite as well.

  29. #21
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    381
    Thanks
    261
    Thanked 269 Times in 145 Posts
    Quote Originally Posted by Gotty View Post
    you are soooooo far from the Kolmogorov complexity (which is less than 4KB).
    This is an oversimplification. Sorry if I misled anyone (see below).


    Quote Originally Posted by thorfdbg View Post
    The Kolmogorov complexity is relative to a Turing machine, which may have an internal (finite) state machine according to which it operates.
    Quote Originally Posted by thorfdbg View Post
    I can easily find a Turing machine whose program size is minimal ("0") and whose data size is minimal ("") which plays this demo. Because I can hide the complexity in the machine ("hard-wire something that plays the demo").
    I'm afraid you can't cheat like that. If any algorithm is manipulating any data on the Turing Machine you must take it into consideration when determining the Kolmogorov complexity. If your Turing Machine has multiple abstraction layers: from the gates on the silicon chip(s), the memory controller, cache controller, cpu microcode, chipset microcode, bios, operating system kernel, dlls, ... up to your program's code + data - you must consider all of them and include their length in the "description" of your input because they all have a role in reproducing it.

    You can't hide any tricky stuff anywhere. You would just nullify the essence of Kolmogorov complexity.
    At least you have to read the description from the "input tape", output the original file and finally you must halt at the end. So that is at least 3 different instructions. You cannot have 0 instructions.

    Quote Originally Posted by thorfdbg View Post
    It becomes meaningful if you consider infinite sequences.
    See in Solomonoff's paper: "Any finite string of 0’s and 1’s is an acceptable input to M1"; "known finite string of symbols"; or just consider the title of Chaitin: "Gregory J. Chaitin. On the length of programs for computing finite binary sequences."
    It looks like our forefathers found it meaningful for finite strings only. I also find it meaningful for finite strings. It sounds like you mean determining the entropy of a source? Is it correct? Or am I mistaken?
    Last edited by Gotty; 5th February 2019 at 02:19.

  30. #22
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    447
    Thanks
    1
    Thanked 100 Times in 60 Posts
    Quote Originally Posted by Gotty View Post
    I'm afraid you can't cheat like that. If any algorithm is manipulating any data on the Turing Machine you must take it into consideration when determining the Kolmogorov complexity. If your Turing Machine has multiple abstraction layers: from the gates on the silicon chip(s), the memory controller, cache controller, cpu microcode, chipset microcode, bios, operating system kernel, dlls, ... up to your program's code + data - you must consider all of them and include their length in the "description" of your input because they all have a role in reproducing it.
    So, what is the "length of the gates of a microprocessor"? The problem is, you cannot really define that either. The Kolmogorov-complexity is only relative to a given machine, but the important aspect here is that a Turing machine is driven by a *finite* state machine (stress on *finite*). Hence, one cannot point at a particular number sequence and say "the Kolmogorov-complexity of this sequence is 5".

    Quote Originally Posted by Gotty View Post
    It looks like our forefathers found it meaningful for finite strings only. I also find it meaningful for finite strings. It sounds like you mean determining the entropy of a source? Is it correct? Or am I mistaken?
    No, Kolmogorov complexity. It is meaningful. For example, consider the Kolmogorov-complexity of Pi: This is certainly finite - Pi is an "algorithmic number" - so there is a finite state machine and a finite input tape forming a Turing machine with its input that, when run, produces the infinite series of the digits of Pi. How long this input tape is depends on the Turing machine, of course, but we know that a finite number of symbols is sufficient.

    Consider a stochastic series of 0s and 1s: for this, we can probably find a Turing machine (finite state machine) but the input tape of the Turing machine has to be infinite to reproduce the sequence. So there is clearly a difference between the two cases.
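
    To make the Pi example concrete, here is a finite program (Gibbons' unbounded spigot algorithm) whose output is the infinite digit sequence - a sketch in Python rather than a Turing machine, but the point is the same: finite description, infinite output.

    def pi_digits():
        # Gibbons' unbounded spigot: yields decimal digits of pi one at a time.
        q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
        while True:
            if 4 * q + r - t < m * t:
                yield m
                q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
            else:
                q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                    (q * (7 * k + 2) + r * x) // (t * x), x + 2)

    # gen = pi_digits()
    # print([next(gen) for _ in range(10)])   # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]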

  31. #23
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by mpais View Post
    Yes, that's the usual methodology. But now, that's not what they're stating, is it? Look at table 2, the results for enwik8. They're directly comparing their 1.06 and 0.99 bpc results to cmix v13, listed as 1.23 bpc, which is v13's result for the full enwik8, with no 90% pre-training stage. So, deception or just ignorance?
    Yes, the cmix v13 result is not comparable with the other results in that table. I think they just copied that result from other papers (e.g. this) which include cmix results with more explanation. Even the typo in my name has propagated between papers

    Quote Originally Posted by mpais View Post
    At the cost of making even cmix look fast. Honestly, what is the point? It seems like every day a new paper comes up on arxiv with yet another ridiculously over-the-top network architecture claiming "SoTA" status. It truly seems the old saying has a point: when all you have is a hammer, everything starts looking like a nail. We know that modelling long-term correlations helps, we know that maintaining an internal memory state helps; what we don't know is how to model this efficiently.

    I'd much rather read a paper claiming 1.1 bpc on enwik8 but doing so with 3 or 4 fewer orders of magnitude of resources than the current ML approaches; that would be a lot more interesting. Alas, it seems all the work in the field nowadays is focused on this approach, instead of trying to devise new, efficient ways of achieving similar performance. Are we to resign ourselves to the fact that everything else has already been invented, that no other avenues await, that we must simply proceed on this path? I refuse to believe so, and if anything I think the admittedly poor work I've been doing on paq8 shows that it's much better to improve the way we model the data than to rely on sheer brute force.
    For certain applications efficiency is not very important. Language modelling can be used for other tasks like translation, sentiment analysis, text categorization, etc. For some applications there are fewer time constraints so it doesn't matter how long it takes to process the data. The precision/recall of the model may take precedence over the resource cost.

    Some researchers have access to huge amounts of computational resources, so they have somewhat of an unfair advantage. They can achieve SoTA results just by scaling up the size of the model. That is why "# of parameters" gets reported.

    I think transformer-XL is a genuinely innovative architecture. It is substantially different from LSTMs, and I am guessing it will achieve some nice results for data compression.

  32. #24
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    381
    Thanks
    261
    Thanked 269 Times in 145 Posts
    Quote Originally Posted by thorfdbg View Post
    The Kolmogorov-complexity is only relative to a given machine
    Just a small correction: the Kolmogorov complexity is relative to the description Language, not the hardware (machine) that runs it. Nonetheless, I understand what you mean and I agree.

    Quote Originally Posted by thorfdbg View Post
    So, what is the "length of the gates of a microprocessor"? The problem is, you cannot really define that either.
    I will try to. First let's have a summary of the process and our goal (in modern terms):

    (In the following paragraphs: "string" = the file content for which we need to find the Kolmogorov complexity; "machine" = any hardware or software implementation of a Universal Turing Machine, or preferably a modern computer.)

    We will upload a Description file (the "compressed" form of our string) to the memory of a machine, start the machine, and when it finishes processing the Description (i.e. it halts), we expect our original "uncompressed" string to appear starting from a specific memory location.

    The Description is written in a predetermined description Language. The machine has an interpreter that can decode descriptions written in this Language. The description Language is essentially a compression format. The interpreter is essentially the decompressor. Our goal is to find the shortest possible Description.
    This description Language may have any complexity. We may choose it freely but we shall choose wisely (it should have a good modeling power).

    Please note, that the Language is not bound to our specific machine. It will run on any machine that has an interpreter for it.
    Let's write an interpreter for the Language discussed above in java or python or c++ or in any other language that runs on any older or more modern architecture. The exact implementation of the machine and the interpreter does not play a role when calculating the Kolmogorov complexity. Only the Language and the Description does.

    The standard Kolmogorov complexity definition states that it is the shortest length of a Description written in a predetermined Language.
    Thus the description Language must be independent of our string. It must not have references to its size, its characters, its structure or to any concrete frequencies of anything in our string. All such information must be placed in the Description and must be acquired from it during decoding.

    In order to find the shortest possible Description we'll need to use a Language that enables us to include algorithms (models/transformations crafted specifically for our string) in the Description. Let's suppose the string is a text string. When we have a general model for text, it's not useful as it is: it's not yet optimized for the string. For the smallest possible Description we must super-optimize it (and any other useful models) for the string. When we have a smaller text we must have a smaller (simpler) model; when we have a larger text we may have room for a more complex model. We'll need to find the best balance between the model size and the encoded size of the string.

    When encoding "abrakadabra" for instance, we can't have a too sophisticated text model: it may just model the 3 first bits (all 11 characters being lowercase ascii letters). Probably we cannot even model efficiently that "abra" repeats, because that model would be larger than the bits gained by the better model. On the other hand, when encoding for instance dickens/silesia we can design a more sophisticated model. It is advantageous to enhance our text model as long as the size of the encoded string shrinks more compared to the Description growth caused by the included model enhancements.
    When we include such models/transformations in the Description - preferably in a tight format - then we can call them from the interpreter of the Language (inside a virtual machine, of course).
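
    A toy illustration of that balance (a sketch only; the "model" here is just an order-0 symbol table with a hypothetical cost of 16 bits per distinct symbol - real description Languages are far richer, but the balance point behaves the same way):

    import math
    from collections import Counter

    def description_length(s: bytes):
        counts = Counter(s)
        n = len(s)
        data_bits = -sum(c * math.log2(c / n) for c in counts.values())  # ideal order-0 code
        model_bits = 16 * len(counts)                                    # symbol + count, 8 bits each
        return model_bits, data_bits

    for text in (b"abrakadabra", b"abrakadabra" * 1000):
        m, d = description_length(text)
        print(len(text), f"raw={8*len(text)}  model={m}  data={d:.0f}  total={m+d:.0f}")

    For the 11-byte string the model alone already costs more than the raw bytes, while for the long repetition it is negligible - exactly the trade-off described above.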

    Now, back to the problematic definition of the "length of the gates".

    The gates on a chip and especially their connections to each other may be created in a manner that they are not independent of our string. You may easily create a circuit board that when powered, it immediately outputs the full string to the desired memory location. This is the reason we must consider them when determining the Kolmogorov Complexity.

    If the behavior of the gates or their connections do contain information about the size/structure/content of the string then they are considered as part of the Description. The gates are just hardware implementations of some (boolean) logic, the information content related to their behavior is what we need. The question is: which description Language are we talking about in this case? Is it a circuit board design encoded in a predetermined (!) format? If so, that shall be included in the Description. But that is probably too expensive, there may be better Languages describing our string.

    On the other hand, if the behavior of the gates and their connections are independent from our string then they will not be part of the "Description", but they will be part of the "machine" or the "interpreter" and so will be simply excluded from the calculation of the complexity.

    I was not very precise regarding this "inclusion"/"exclusion" in my previous post. Sorry for that. I hope it's better now.

    Quote Originally Posted by thorfdbg View Post
    No, Kolmogorov complexity. It is meaningful.
    I agree: it may be meaningful for infinite strings as well.
    In your post it just sounded like the infinite case is more important than the finite case (and that is where I disagree):

    Quote Originally Posted by thorfdbg View Post
    It becomes meaningful if you consider infinite sequences.
    Last edited by Gotty; 7th February 2019 at 19:03. Reason: clarifications
