
Thread: Need help with an algorithm for generating a list of integer sets

  1. #1
    Member (The Netherlands, joined Apr 2009)

    Need help with an algorithm for generating a list of integer sets

    (English is not my native language, and I don't have a university degree, so please be a little patient if I make some obvious mistakes. In the real world I am just a programmer; I never studied at a university.)

    While fooling around with self-written Turing tarpits, I found that if you randomly generate a valid program, certain output patterns emerge more often than others.
    The most common patterns were:
    0
    1
    0000… (infinite length)
    1111… (infinite length)
    00
    11
    01
    10
    010101… (infinite length)
    101010… (infinite length)
    And so on.

    I foolishly tried to construct a language that would create a unique pattern for every unique program. Unsurprisingly, I could not construct a language with that ability.
    While searching the internet for whether such a language exists, I found the binary lambda calculus (BLC). The current Wikipedia page does not mention it, but the version I read said that BLC, encoded using de Bruijn index notation, was used for research on minimum description length (MDL) and Kolmogorov complexity. But I was not happy with the results. Not only does BLC allow different programs with the same output, the de Bruijn index notation also does not seem to match the idea of MDL very well. (It generates a lot of 'programs' that are longer than their output.)

    I wasn't able to find a better language than BLC for approaching Kolmogorov complexity. (I hope none exists, otherwise my work is for nothing.)
    Anyway... After a while I constructed a set of what I think are called axioms. They probably already exist, but I will write them down anyway.
    1. Any piece of information can be expressed by a string of bits.
    2. Any outcome of an algorithm can be expressed by a string of bits.
    3. So, following 1 and 2, information and algorithms are related. (Not surprising when you read Solomonoff's theory and its relation to Kolmogorov complexity.)
    4. Most constructible binary strings of a certain size can be considered random, because there is no known program that is smaller than the string itself.
    5. It is hard to deliberately construct a binary string of a certain size whose smallest generating program is as large as the string itself. (4 and 5 form a nice paradox.)
    6. Almost all strings found in the real world are constructed by smaller programs.
    7. The perfect language is a language that has the following properties:
    a. Every unique program generates unique output.
    b. The description length is smaller than or equal to the message length.
    c. The programs of the language are indexed/ordered by the complexity of their outcome. In other words, it is not a simple counter (with a counter you can generate any outcome by simply counting through all the options).
    8. It is useful to try to find the perfect language because:
    a. Following axiom 5, most real-world strings have a Kolmogorov complexity smaller than their length.
    b. The best approximation of the perfect language would be the closest thing to approaching Kolmogorov complexity.
    c. The best approximation of the perfect language is the best approximation of (universal) artificial intelligence (following Hutter's description of universal artificial intelligence).
    9. With these axioms in the back of my head, I deduced (and I hope I am right on this) that any piece of information or algorithm output can be expressed as a Boolean truth table with number of inputs = ceiling(log2(length of the information)). The undefined rows of the truth table can be treated as 'don't cares'.
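    As a small illustration of axiom 9, a minimal sketch in Python (my own naming): the number of inputs is just ceiling(log2(length)), and the unused rows are left as don't-cares.

    Code:
    import math

    def truth_table_inputs(message_bits):
        """Number of inputs needed so the table can index every bit of the
        message (axiom 9): ceiling(log2(length))."""
        return math.ceil(math.log2(message_bits))

    # A 12-bit message needs ceil(log2(12)) = 4 inputs, i.e. a 16-row table
    # whose last 4 rows are don't-cares; an 8-bit message needs exactly 3 inputs.
    print(truth_table_inputs(12), truth_table_inputs(8))  # 4 3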

    The next step I took was to find a way to generate every possible truth table for i inputs. For 0 to 2 inputs the Boolean expressions were predictable, but with 3 or more inputs the Boolean expressions needed to iterate through all possible outcomes became unpredictable. In other words, I could not construct a linear algorithm that iterates through all the possible outcomes without repetition.

    I was happy with the results of these experiments. Using Boolean expressions, the simplest expressions indeed generated the simplest truth tables. I don't have any scientific proof, but instinctively and visually it looked pretty good. It also showed that I do not have a formal definition of complexity. Is (A and B) more or less complex than (A or B)? And is (A and B and C) more or less complex than (A and D)?

    I also noticed something funny about the Boolean expressions needed to construct every possible outcome: a lot of truth tables were just rotated versions of other truth tables. That gave me the idea to make the truth tables one step more abstract: truth rings. A truth ring is just a truth table whose beginning and end are connected; information is read by choosing a starting point on the ring and rotating the ring until you have read the information you want. (Using the mental image of a Turing machine tape: the tape has a certain length and its end and beginning are connected, so when you reach the end of the tape you start again at the beginning.)

    The truth rings can be abstracted further into pattern rings, to avoid storing inverted versions of the same truth ring.
    Examples:
    - Pattern ring 1,2,1,3 can be translated to the two truth rings 0110111 and 1001000.
    - 2,3,1,1 can be translated to 0011101 or 1100010.

    Information of length n consisting of the string 01010101... can be reduced to the truth ring '01', which can be reduced to the pattern ring 1,1. To retrieve the information from a pattern ring, the starting point, the length and the 'polarity' of the information must also be stored. In this case there is no need to store both the starting point and the polarity; one of them is sufficient. I don't have hard proof for this, but I am very sure that every string of information can be stored as a pattern ring with a starting point and polarity without breaking axiom 7b. The length is another discussion, but one can say that for normal information stored in a computer, the length is also stored separately.
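    To make this concrete, here is a minimal sketch (Python, my own function names) of encoding a bit string as a pattern ring plus polarity and starting point, and expanding it back:

    Code:
    def to_pattern_ring(bits):
        """Encode a bit string as (pattern ring, polarity, rotation).
        The total length is stored separately, as described above."""
        n = len(bits)
        if len(set(bits)) == 1:                # the 0-input special case: N zeros or ones
            return [n], bits[0], 0
        # reduce a periodic string to its shortest repeating block, so that
        # e.g. '01010101' becomes the prime ring 1,1
        for d in range(1, n + 1):
            if n % d == 0 and bits[:d] * (n // d) == bits:
                bits = bits[:d]
                break
        runs, cur, count = [], bits[0], 0
        for b in bits:
            if b == cur:
                count += 1
            else:
                runs.append(count)
                cur, count = b, 1
        runs.append(count)
        rotation = 0
        if bits[0] == bits[-1]:                # the last run wraps onto the first run of the ring
            rotation = runs.pop()
            runs[0] += rotation
        return runs, bits[0], rotation

    def from_pattern_ring(runs, polarity, rotation=0, length=None):
        """Expand the runs starting with the polarity bit, rotate to the
        starting point, then repeat/truncate to the stored length."""
        bit, parts = polarity, []
        for r in runs:
            parts.append(bit * r)
            bit = '1' if bit == '0' else '0'
        s = ''.join(parts)
        s = s[rotation:] + s[:rotation]
        if length is None:
            length = len(s)
        reps = -(-length // len(s))            # ceiling division
        return (s * reps)[:length]

    print(to_pattern_ring('0110111'))                    # ([1, 2, 1, 3], '0', 0) -- the example above
    print(to_pattern_ring('01010101'))                   # ([1, 1], '0', 0)
    print(from_pattern_ring([1, 1], '0', length=8))      # 01010101
    print(from_pattern_ring(*to_pattern_ring('10011'), length=5) == '10011')  # True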

    Axiom 10: The output of the perfect language consists of pattern rings ordered by increasing (yet undefined) complexity. To avoid repetition, pattern rings must be 'prime': pattern rings that can be constructed by simply repeating a pattern ring of lower complexity are not allowed in the perfect language.

    Creating a program that generates prime pattern rings is not difficult.
    The results:
    For 0 inputs: 0 (special case when the information consists of just N zeros or ones)
    For 1 input:
    1,1
    For 2 inputs:
    2,2
    3,1
    For 3 inputs:
    4,4
    5,3
    6,2
    7,1
    3,1,2,2
    3,2,1,2
    3,2,2,1
    3,3,1,1
    4,1,1,2
    4,1,2,1
    4,2,1,1
    5,1,1,1
    2,1,1,2,1,1
    2,1,2,1,1,1
    2,2,1,1,1,1
    3,1,1,1,1,1
    (For 4 bits of information, the pattern rings are in the attachment.)

    In other words, you need just 20 prime pattern rings to construct every one of the 256 different truth tables for 3 inputs. If the perfect language starts counting at ring 0 and loops through every ring with every possible rotation and polarity, you get 256 different programs, ordered by complexity, for 8 bits of information, or for every small algorithm whose result has 8 bits of complexity.

    But did we win anything? For 8 bits this does not seem to be the case, and it does not look impressive at all. But with axiom 5 in mind, I expect that as the amount of information grows, the delta between the index in the perfect language and the index of the information itself (when using simple counting) will on average increase. In other words, for big pieces of real-world information, the expected program index in the perfect language becomes interestingly smaller than the plain counting index.

    But now I am stuck on generating the prime pattern rings. How do we iterate through the rings in linear time? And is it possible to find an efficient (not NP-hard) algorithm that converts any piece of information or algorithm into a prime pattern ring? Or even better: is it possible to find an algorithm that finds not only the prime pattern ring but also its index in the perfect language?

    Let's first start with finding an algorithm to generate prime pattern rings. Because it is a ring, the starting point does not matter. Also, the complexity order is still not defined very well, so if you find an algorithm that generates prime pattern rings in another sequence, it may be that your algorithm uses a better definition of complexity. I also think that the smallest possible program that generates prime pattern rings is the one closest to the correct definition of complexity!
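    As a first step, here is a brute-force sketch (Python, my own naming; not the linear-time enumeration I am asking for) that treats a prime pattern ring as an aperiodic, non-constant bit ring taken up to rotation and polarity. It reproduces the small cases above: 1 ring of length 2, 2 of length 4 and 16 of length 8, which together with the all-equal special case gives the 20 rings listed.

    Code:
    from itertools import product

    def ring_canonical(bits):
        """Canonical form of a bit string seen as a ring with polarity removed:
        the lexicographically smallest rotation of the string or of its complement."""
        comp = bits.translate(str.maketrans('01', '10'))
        return min(s[i:] + s[:i] for s in (bits, comp) for i in range(len(s)))

    def is_aperiodic(bits):
        """True if the string is not a repetition of a strictly shorter string."""
        n = len(bits)
        return all(bits != bits[d:] + bits[:d] for d in range(1, n) if n % d == 0)

    def runs_on_ring(bits):
        """Run lengths read around the ring (assumes bits is not constant)."""
        n = len(bits)
        start = next(k for k in range(n) if bits[k] != bits[k - 1])  # begin at a run boundary
        bits = bits[start:] + bits[:start]
        runs, i = [], 0
        while i < n:
            j = i
            while j < n and bits[j] == bits[i]:
                j += 1
            runs.append(j - i)
            i = j
        return runs

    def prime_pattern_rings(n):
        """Brute force: one representative per rotation/polarity class of the
        aperiodic, non-constant bit strings of length n."""
        seen, rings = set(), []
        for tup in product('01', repeat=n):
            bits = ''.join(tup)
            if len(set(bits)) == 1 or not is_aperiodic(bits):
                continue
            canon = ring_canonical(bits)
            if canon not in seen:
                seen.add(canon)
                rings.append(runs_on_ring(canon))
        return rings

    for n in (2, 4, 8):
        print(n, len(prime_pattern_rings(n)))
    # prints 2 1, 4 2, 8 16 -- together with the all-equal special case
    # that is the 20 rings listed above for 0 to 3 inputs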
    Attached Files

  2. Thanks:

    Shelwien (16th November 2016)

  3. #2
    Shelwien (Administrator, Kharkov, Ukraine, joined May 2008)
    It's an interesting topic. Matt Mahoney at one point also tried to generate random Turing machines - http://mattmahoney.net/dc/uiq/
    And his "zpaq language" should be relevant as well.

    But I think that we need to start from high-level abstractions here, because for some types of data we actually know how it is generated (executable files, for example).
    So I'd imagine something like writing a few lossless file-format decompilers which can improve compression,
    then trying to generalize the approach and find how these decompilers can be deconstructed into basic components.
    And only once we have the relevant components, we can try enumerating their combinations, or some such.

    I don't think that anything useful can be done by starting from low-level designs like Turing machines - simply
    because we _know_ how to construct at least some of the available file formats, and the operations involved there
    are complex enough that their implementations appearing at random is not really possible.
    But I think that program generation may still be feasible if we use relevant higher-level components.

    As for binary tables, it's an even lower-level representation than Turing machines - there are no loops or memory,
    so at best you can expect to break some hashing/encryption with that approach, but it's less related to compression and more to SAT.

  4. #3
    Sportman (Member, Planet Earth, joined Aug 2008)
    Maybe these guys can help you:
    http://www.cube20.org/

  5. #4
    Member (The Netherlands, joined Apr 2009)
    Shelwien,

    Thank you for your in-depth reply.
    Quote Originally Posted by Shelwien View Post
    It's an interesting topic. Matt Mahoney at one point also tried to generate random Turing machines - http://mattmahoney.net/dc/uiq/
    And his "zpaq language" should be relevant as well.
    But I think that we need to start from high-level abstractions here, because for some types of data we actually know how it is generated (executable files, for example).
    So I'd imagine something like writing a few lossless file-format decompilers which can improve compression,
    then trying to generalize the approach and find how these decompilers can be deconstructed into basic components.
    And only once we have the relevant components, we can try enumerating their combinations, or some such.
    It's a successful practical approach in reality, that's true. But it is not useful when trying to get towards the legendary lower bound, the Kolmogorov complexity. You are basically programming pre-available knowledge about your subject and you (paq-)mix it with probability estimation. Yes, you get very well-working compression algorithms, but in all fairness you should add that knowledge to the estimated complexity of the program/data. For small (<10k) files it shows immediately that this approach is not the way to go.

    (By the way: UIQ is more or less a fancy pseudorandom number generator for generating artificial data to test general-purpose file compression. It does not try to compress or predict anything.)

    But I am not trying to be practical here…

    Quote Originally Posted by Shelwien View Post
    I don't think that anything useful can be done by starting from low-level designs like Turing machines - simply because we _know_ how to construct at least some of the available file formats, and the operations involved there are complex enough that their implementations appearing at random is not really possible.
    But I think that program generation may still be feasible if we use relevant higher-level components.
    I agree. The reality is that iterating through every possible Turing machine, or any other Turing-complete language, will not give useful results in practice. That's the reason they call those machines Turing tarpits: "Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy."

    Even if I am able to construct the 'perfect language', iterating through all the possible options is going to be hard in 2016 for anything above 40 bits of complexity. While it is theoretically possible to generate every piece of information, it won't be easy. And that's an understatement.
    But imagine that I am at least able to make a runnable/executable way of iterating through the programs of the perfect language! That would be a very big improvement, and with quantum computing on the horizon: who knows what is possible when those machines become available?

    Quote Originally Posted by Shelwien View Post
    As for binary tables, it's an even lower-level representation than Turing machines - there are no loops or memory, so at best you can expect to break some hashing/encryption with that approach, but it's less related to compression and more to SAT.
    I do not agree with that. Every problem with a deterministic output can be expressed using truth tables; it's not limited to hashing and/or encryption.
    Imagine that you can map every truth table to a program with a certain index. Then you can construct a Hilbert space where a certain problem with a deterministic output is a point in that space (index mapped to point).

    If you have a non-deterministic problem, you can construct N deterministic (P) problems from it. Take, for example, the famous traveling salesman problem. If you generate every possible output for every possible input for a given number of locations, you get an enormous truth table for each number of cities. For example, 4 cities with floating-point distances ((32 bit * 4 cities) of input plus (2 bit * 4 cities) of city output order) give a 136-bit truth table. Those truth tables can (in theory) be mapped to indices in the enumeration of programs of the perfect language. The indices can be mapped to points in the Hilbert space. From those points you can construct lines, planes and spaces of the desired dimensions. So you can define any problem as a geometry in the Hilbert space constructed from the indices of the programs generated by the perfect language.

    I am aware of the SAT problem, but I don't see the relation with the things I wrote before.
    Quote Originally Posted by Sportman View Post
    Maybe this guys can help you:
    http://www.cube20.org/
    Although interesting, I don't think that the Rubik's Cube problem is related to this problem, other than that it might belong to the same complexity class.

  6. #5
    Shelwien (Administrator, Kharkov, Ukraine, joined May 2008)
    > You are basically programming pre-available knowledge about your subject
    > and you (paq)mix it with probability estimation.

    Yes, and I see it as a good thing - because otherwise we'd have to use bruteforce
    search to find working implementations of these same components in a low-level language.

    I was talking about a different kind of components though - even to find
    a working pi generator by enumerating programs, you'd have to start with
    something higher-level than bit operations (or even cpu instruction set).

    > Yes, you get very well working compression algorithms but in all fairness
    > you should add those knowledge to the estimated complexity of the
    > program/data.

    Sure, but why do you think that turing machines or boolean logic are "more natural"
    than long arithmetics or hashmaps?

    > For small (<10k) files it shows immediately that this approach is not the way to go.

    What kind of files do you want to "compress" this way, though?
    For example, splitting enwik8 to 16-byte chunks for kolmogorov compression
    certainly won't win you the prize - simply because these chunks are not independent.

    > By the way: UIQ is more or less a fancy pseudorandom number generator to
    > generate artificial data for testing general purpose file compression.
    > It does not try to compress or predict anything.

    Yes, but I think we can use it to find programs that generate given short strings
    of data just as well as your implementation?

    > But imagine that I am able to at least make a runnable/executable approach
    > of iterating through the programs from the perfect language!

    I'm also interested in something like that, but my concept of "perfect language"
    is significantly different - it needs higher-level components and has to be
    compatible with heuristic optimization methods (like genetic algorithms).
    That is, for now I focus more on optimizing compression method parameters
    than algorithms themselves, but the distinction is very fuzzy - for example,
    we can implement two versions of some function and then choose one based
    on parameter value, which is controlled by optimizer.
    And that's actually the point where something tricky can be done -
    we can call both versions and then mix their results using a weight parameter,
    instead of choosing one with a flag parameter.
    Turns out that this "fuzzy" approach works much better with optimization,
    because there's a much lower probability of getting stuck in local minima.
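    For illustration, a toy sketch of that weight-vs-flag idea (Python, with made-up model names; not code from an actual compressor):

    Code:
    def mix(f, g, w):
        # blend two candidate implementations with a weight w in [0,1];
        # w = 0.0 or 1.0 recovers the old flag-style choice of one version
        return lambda x: (1.0 - w) * f(x) + w * g(x)

    # e.g. two candidate bit-probability estimates, blended by an
    # optimizer-controlled weight parameter
    model_a = lambda x: 0.4
    model_b = lambda x: 0.6
    blended = mix(model_a, model_b, 0.3)
    print(blended(None))  # 0.46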

    > That would be a very big improvement and with quantum computing on the
    > horizon: who knows what is possible when those get available?

    Afaik quantum computing is simply computing that uses elementary particles
    to store/transmit/transform information. So it's massively parallel and
    has a high probability of errors, which requires special algorithms to compensate.
    I hope that you don't believe in the popular entanglement/teleportation magic hype.

    Wolfram Alpha says that ~2^104 hydrogen atoms fit into 1 m^3, so I'd estimate
    the best quantum PC to be 2^100 times faster than a current PC.
    Which only lets you bruteforce programs with an extra 100 bits of code - not
    much, I think.

    > Every problem with a deterministic output can be expressed using truth tables.

    Sure, there's even a program that does that - http://www.cprover.org/cbmc/

    But with such a representation you won't be able to match longer loops, which
    frequently appear both in data generated by "normal" programs and in
    mathematics (e.g. in the form of sum operators).

    > I am aware of the SAT problem but I don’t see the relation with the things I wrote before?

    I think either of these problems can be translated into the other.
    For example, we can write a generalized program with N instructions
    (implemented as switches on parameter values), translate it with cbmc,
    then with any SAT solver find the parameter values which would give us the expected output.
    Or fail to find them. Then we can increase N.
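    As a toy stand-in for that idea (Python, with made-up operations, and exhaustive search standing in for cbmc plus a SAT solver):

    Code:
    from itertools import product

    # a "generalized program" with N instruction slots: each slot applies one of
    # a few fixed bit operations, selected by a parameter value (toy operations)
    OPS = [lambda v: v ^ 0b1010,
           lambda v: (v << 1) & 0xFF,
           lambda v: v | 0b0001,
           lambda v: v]

    def run(params, x):
        for p in params:
            x = OPS[p](x)
        return x

    def find_params(n_slots, x, expected):
        # exhaustive search over all parameter assignments; a SAT solver (via cbmc)
        # would perform this search symbolically instead of by enumeration
        for params in product(range(len(OPS)), repeat=n_slots):
            if run(params, x) == expected:
                return params
        return None  # caller can retry with a larger n_slots

    print(find_params(2, 0b0011, 0b0110))  # (1, 3): shift-left, then identity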
