# Thread: Need help with an algorithm for generating a list of integer sets

1. ## Need help with an algorithm for generating a list of integer sets

(English is not my native language, and I don't have a university degree, so please be a little patient if I make some obvious mistakes. In the real world I am just a programmer; I never studied at a university.)

While fooling around with self-written Turing tarpits, I noticed that when you randomly generate valid programs, certain output patterns emerge more often than others.
The most common patterns were:
- 0
- 1
- 0000… (infinite length)
- 1111… (infinite length)
- 00
- 11
- 01
- 10
- 010101… (infinite length)
- 101010… (infinite length)
And so on.

I foolishly tried to construct a language that would create a unique pattern for every unique program. Unsurprisingly, I could not construct a language with that ability.
While searching the internet for whether such a language exists, I found the binary lambda calculus (BLC). The current Wikipedia page no longer mentions it, but the version I read said BLC was used for research on minimum description length (MDL) and Kolmogorov complexity when terms are encoded using de Bruijn index notation. But I was not happy with the results. Not only does BLC generate non-unique programs, the de Bruijn index notation does not seem to match the idea of MDL: it generates a lot of 'programs' that are longer than their output.

I wasn't able to find better languages than BLC for approaching Kolmogorov complexity. (I hope they do not exist, or otherwise my work is for nothing.)
Anyway, after a while I constructed a set of, I think they are called, axioms. They probably already exist, but I'll write them down anyway.
1. Any piece of information can be expressed by a string of bits.
2. Any outcome of an algorithm can be expressed by a string of bits.
3. So using 1 and 2, information and algorithms are related. (Not surprising when you read Solomonoff's theory and its relation to Kolmogorov complexity.)
4. Most constructible binary strings of a certain size can be considered random, because there is no known program that is smaller than the string itself.
5. It is hard to deliberately construct a binary string of a certain size for which the smallest program that can generate it is as large as the string itself. (4 and 5 together form a nice paradox.)
6. Almost all strings that are found in the real world are constructed by smaller programs.
7. The perfect language is a language that has the following properties:
a. Every unique program generates unique output.
b. The description length is smaller than or equal to the message length.
c. The programs of the language are indexed/ordered by the complexity of their outcome. In other words, it is not a simple counter (with a simple counter you can generate any outcome just by counting through all the options).
8. It is useful to try to find the perfect language because:
a. Following axiom 6, most real-world strings have a smaller Kolmogorov complexity.
b. The best approximation of the perfect language is the closest we can get to the Kolmogorov complexity.
c. The best approximation of the perfect language is the best approximation of (universal) artificial intelligence (following Hutter's description of universal artificial intelligence).
9. With these axioms in the back of my head I deduced (and I hope I am right on this) that any piece of information or algorithm output can be expressed as a Boolean truth table with number of inputs = ceiling(log2(length of the information)). The undefined parts of the truth table can be marked as 'don't cares'.
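As a small illustration of axiom 9 (a sketch; the function name and the use of `None` for don't cares are my own choices):

```python
import math

def string_to_truth_table(bits: str):
    """Map a bit string onto the output column of a truth table
    with ceiling(log2(length)) inputs (axiom 9).

    Rows beyond the end of the string are 'don't care', marked None."""
    n_inputs = math.ceil(math.log2(len(bits))) if len(bits) > 1 else 0
    rows = 2 ** n_inputs
    table = [None] * rows                 # None == don't care
    for i, b in enumerate(bits):
        table[i] = int(b)
    return n_inputs, table

# A 5-bit string needs ceiling(log2(5)) = 3 inputs, i.e. 8 rows,
# of which the last 3 are don't cares.
n, table = string_to_truth_table("01101")
```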

The next step I took was to find a way to generate every possible truth table for i inputs. For 0 to 2 inputs the Boolean expressions were predictable, but with 3 or more inputs the Boolean expressions needed to iterate through all possible outcomes became unpredictable. In other words, I could not construct a linear algorithm that iterated through all the possible outcomes without repetition.

I was happy with the results of the experiments. Using Boolean expressions, the simplest expressions indeed generated the simplest truth tables. I don't have any scientific proof, but instinctively and visually it looked pretty good. It also showed that I do not have a formal definition of complexity. Is (A and B) more or less complex than (A or B)? And is (A and B and C) more or less complex than (A and D)?

I also noticed something funny about the Boolean expressions needed to construct every possible outcome: a lot of truth tables were just rotated versions of other truth tables. That gave me the idea to make the truth tables one step more abstract: truth rings. A truth ring is just a truth table whose beginning and end are connected; information is read by choosing a starting point on the ring and rotating the ring until you have read the information you want. (Using the mental image of a Turing machine tape: the tape has a certain length and its end is connected to its beginning, so when you reach the end of the tape you start again at the beginning.)
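That observation suggests collapsing all rotations of a truth ring to one canonical representative. A naive O(n²) sketch (the function name is mine; Booth's least-rotation algorithm could do the same in O(n)):

```python
def canonical_rotation(ring: str) -> str:
    """Return the lexicographically smallest rotation of a truth ring,
    so that all rotated versions of a table map to one representative."""
    return min(ring[i:] + ring[:i] for i in range(len(ring)))

# 1110110 is a rotation of 0110111, so both reduce to the same ring.
assert canonical_rotation("1110110") == canonical_rotation("0110111")
```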

The truth rings can be abstracted one step further to pattern rings, to avoid inverted versions of the truth rings.
Examples:
- Pattern ring 1,2,1,3 can be translated to the two truth rings 0110111 and 1001000.
- Pattern ring 2,3,1,1 can be translated to the truth rings 0011101 and 1100010.

Information of length n consisting of the string 01010101… can be reduced to the truth ring '01', which can be reduced to the pattern ring 1,1. To retrieve the information from a pattern ring, the starting point, the length, and the 'polarity' of the information must also be saved. In this case there is no need to store both the starting point and the polarity; one is sufficient. I don't have hard proof for this, but I am fairly sure that every string of information can be saved as a pattern ring with starting point and polarity without breaking axiom 7b. The length is another discussion, but one can say that for normal information stored in a computer, the length is also stored separately.
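A minimal encode/decode sketch of this idea (my own function names; the pattern ring is stored together with the starting point and the polarity, i.e. the bit value of the first run):

```python
def ring_to_pattern(ring: str):
    """Cyclic run lengths of a truth ring, plus the starting point and
    polarity (bit value of the first run) needed to reconstruct it."""
    n = len(ring)
    # Rotate so the ring starts on a run boundary (cyclic comparison).
    start = next((i for i in range(n) if ring[i] != ring[i - 1]), 0)
    r = ring[start:] + ring[:start]
    runs, i = [], 0
    while i < n:
        j = i
        while j < n and r[j] == r[i]:
            j += 1
        runs.append(j - i)
        i = j
    return runs, start, int(r[0])

def pattern_to_ring(runs, start, polarity):
    """Inverse: rebuild the truth ring from pattern ring + metadata."""
    bits, b = "", polarity
    for length in runs:
        bits += str(b) * length
        b ^= 1                      # runs alternate between 0 and 1
    return bits[-start:] + bits[:-start] if start else bits

runs, start, pol = ring_to_pattern("0110111")   # pattern ring 1,2,1,3
assert pattern_to_ring(runs, start, pol) == "0110111"
```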

Axiom 10: The output of the perfect language consists of pattern rings ordered by increasing (yet undefined) complexity. To avoid repetition, pattern rings must be 'prime': pattern rings that can be constructed by simply repeating a pattern ring of lower complexity are not allowed in the perfect language.

Creating a program that generates prime pattern rings is not difficult.
The results:
For 0 inputs: 0 (special case, when the information just consists of N zeros or ones)
For 1 input:
- 1,1
For 2 inputs:
- 2,2
- 3,1
For 3 inputs:
- 4,4
- 5,3
- 6,2
- 7,1
- 3,1,2,2
- 3,2,1,2
- 3,2,2,1
- 3,3,1,1
- 4,1,1,2
- 4,1,2,1
- 4,2,1,1
- 5,1,1,1
- 2,1,1,2,1,1
- 2,1,2,1,1,1
- 2,2,1,1,1,1
- 3,1,1,1,1,1
(For 4 bits information the pattern rings are in the attachment.)

In other words, you need just 20 prime pattern rings to construct all 256 different truth tables for 3 inputs. If the perfect language starts counting at ring 0 and loops through every ring with every possible rotation and polarity, you get 256 different programs, ordered by complexity, for 8 bits of information, i.e. every small algorithm whose result has 8 bits of complexity.
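These rings can be reproduced with a small brute-force sketch (my own naming): enumerate all truth rings of a given length, keep one canonical representative per rotation/polarity class, drop the repeated (non-prime) ones, and read off the run lengths.

```python
from itertools import product

def rotations(s):
    return [s[i:] + s[:i] for i in range(len(s))]

def complement(s):
    return s.translate(str.maketrans("01", "10"))

def is_prime_ring(s):
    # Prime: the truth ring is not a smaller ring simply repeated.
    n = len(s)
    return all(s != s[:d] * (n // d) for d in range(1, n) if n % d == 0)

def prime_pattern_rings(n):
    """One prime pattern ring per class of n-bit truth rings that are
    equivalent under rotation and polarity (inversion)."""
    rings = []
    for bits in product("01", repeat=n):
        s = "".join(bits)
        # Canonical form: smallest among all rotations of s and of its inverse.
        if s != min(rotations(s) + rotations(complement(s))):
            continue
        if "0" not in s or "1" not in s:      # constant ring: the special case
            continue
        if not is_prime_ring(s):
            continue
        # Cyclic run lengths; s is canonical, so it starts on a run boundary.
        runs, i = [], 0
        while i < n:
            j = i
            while j < n and s[j] == s[i]:
                j += 1
            runs.append(j - i)
            i = j
        rings.append(runs)
    return rings

assert len(prime_pattern_rings(2)) == 1       # 1,1
assert len(prime_pattern_rings(4)) == 2       # 3,1 and 2,2
assert len(prime_pattern_rings(8)) == 16      # the 16 rings for 3 inputs
```

The brute force is exponential in the ring length, which is exactly why a direct enumeration algorithm would be interesting.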

But did we gain anything? For 8 bits this does not seem to be the case, and it does not look impressive at all. But with axiom 6 in mind, I expect that as the amount of information increases, the delta between the index in the perfect language and the index of the information itself (obtained by simple counting) will on average increase. In other words, for big pieces of real-world information, the expected average program index in the perfect language grows interestingly more slowly.

But now I am stuck on generating the prime pattern rings. How do we iterate through the rings in linear time? And is it possible to find a polynomial-time algorithm that converts any piece of information or algorithm to a prime pattern ring? Or even better: is it possible to find an algorithm that finds not only the prime pattern ring but also its index in the perfect language?

Let's first start with finding an algorithm to generate prime pattern rings. Because it is a ring, the starting point does not matter. Also, the complexity order is still not defined very well, so if you find an algorithm that generates prime pattern rings in another sequence, it may be that your algorithm embodies a better definition of complexity. I also think the smallest possible program that generates prime pattern rings is the one closest to the correct definition of complexity!

2. ## Thanks:

Shelwien (16th November 2016)

3. It's an interesting topic. Matt Mahoney at one point also tried to generate random Turing machines - http://mattmahoney.net/dc/uiq/
And his "zpaq language" should be relevant as well.

But I think that we need to start from high-level abstractions here, because for some types of data we actually know how it is generated (executable files, for example).
So I'd imagine something like writing a few lossless file-format decompilers which can improve compression,
then trying to generalize the approach and find how these decompilers can be deconstructed into basic components.
And only once we have the relevant components, we can try enumerating their combinations, or some such.

I don't think that anything useful can be done by starting from low-level designs like turing machines - simply
because we _know_ how to construct at least some of available file formats, and operations involved there
are complex enough that their implementations appearing at random is not really possible.
But I think that program generation may be still feasible if we'd use relevant higher-level components.

As to binary tables, it's an even lower-level representation than Turing machines - there are no loops or memory,
so at best you can expect to break some hashing/encryption with that approach, but it's less related to compression and more to SAT.

4. http://www.cube20.org/

5. Shelwien,

Originally Posted by Shelwien
It's an interesting topic. Matt Mahoney at one point also tried to generate random Turing machines - http://mattmahoney.net/dc/uiq/
And also his "zpaq language" should be relevant as well.
But I think that we need to start from high-level abstractions here, because for some types of data we actually know how it is generated (executable files, for example).
So I'd imagine something like writing a few lossless file-format decompilers which can improve compression,
then trying to generalize the approach and find how these decompilers can be deconstructed into basic components.
And only once we have the relevant components, we can try enumerating their combinations, or some such.
It's a successful practical approach in reality, that's true. But it is not useful when working toward the legendary lower bound, the Kolmogorov complexity. You are basically programming pre-available knowledge about your subject and (paq-)mixing it with probability estimation. Yes, you get very well-working compression algorithms, but in all fairness you should add that knowledge to the estimated complexity of the program/data. For small (<10k) files it shows immediately that this approach is not the way to go.

(By the way: UIQ is more or less a fancy pseudorandom number generator to generate artificial data for testing general purpose file compression. It does not try to compress or predict anything.)

But I am not trying to be practical here…

Originally Posted by Shelwien
I don't think that anything useful can be done by starting from low-level designs like turing machines – simply because we _know_ how to construct at least some of available file formats, and operations involved there are complex enough that their implementations appearing at random is not really possible.
But I think that program generation may be still feasible if we'd use relevant higher-level components.
I agree. The reality is that iterating through every possible Turing machine, or any other Turing-complete language, will not produce useful results in practice. That's the reason they call those machines Turing tarpits: "Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy."

Even if I am able to construct the 'perfect language', iterating through all the possible options is going to be hard in 2016 for anything above 40 bits of complexity. While it is theoretically possible to generate every piece of information, it won't be easy. And that's an understatement.
But imagine that I am at least able to make a runnable/executable way of iterating through the programs of the perfect language! That would be a very big improvement, and with quantum computing on the horizon: who knows what is possible when those machines become available?

Originally Posted by Shelwien
As to binary tables, its an even lower level representation than turing machines - there're no loops or memory, so at best you can expect to break some hashing/encryption with that approach, but its less related to compression and more to SAT.
I do not agree with that. Every problem with a deterministic output can be expressed using truth tables; it's not limited to hashing and/or encryption.
Imagine that you can map every truth table to a program with a certain index. Then you can construct a Hilbert space in which a certain problem with a deterministic output is a point in space (index mapped to point).

If you have a non-deterministic problem, you can construct it as an N number of P problems. Take the famous traveling salesman problem: if you generate every possible output for every possible input, for every number of locations, you get an enormous truth table for each number of cities. For example, 4 cities with floating-point distances (32 bits × 4 cities) plus the city output order (2 bits × 4 cities) generate a 136-bit truth table. Those truth tables can (in theory) be mapped to an index in the enumeration of programs of the perfect language. The indices can be mapped to points in the Hilbert space. From those points you can construct lines, planes, and spaces of the desired dimensions. So you can define any problem as a geometry in the Hilbert space constructed from the indices of the programs generated by the perfect language.

I am aware of the SAT problem, but I don't see the relation to the things I wrote before.
Originally Posted by Sportman
http://www.cube20.org/
Although interesting, I don't think the Rubik's cube problem is related to this problem, other than that it might belong to the same complexity class.

> and you (paq)mix it with probability estimation.

Yes, and I see it as a good thing - because otherwise we'd have to use bruteforce
search to find working implementations of these same components in a low-level language.

I was talking about a different kind of components though - even to find
a working pi generator by enumerating programs, you'd have to start with
something higher-level than bit operations (or even cpu instruction set).

> Yes, you get very well working compression algorithms but in all fairness
> you should add those knowledge to the estimated complexity of the
> program/data.

Sure, but why do you think that turing machines or boolean logic are "more natural"
than long arithmetics or hashmaps?

> For small (<10k) files it shows immediately that this approach is not the way to go.

What kind of files do you want to "compress" this way, though?
For example, splitting enwik8 to 16-byte chunks for kolmogorov compression
certainly won't win you the prize - simply because these chunks are not independent.

> By the way: UIQ is more or less a fancy pseudorandom number generator to
> generate artificial data for testing general purpose file compression.
> It does not try to compress or predict anything.

Yes, but I think we can use it to find programs that generate given short strings
of data just as well as your implementation?

> But imagine that I am able to at least make a runnable/executable approach
> of iterating through the programs from the perfect language!

I'm also interested in something like that, but my concept of "perfect language"
is significantly different - it needs higher-level components and has to be
compatible with heuristic optimization methods (like genetic algorithms).
That is, for now I focus more on optimizing compression method parameters
than algorithms themselves, but the distinction is very fuzzy - for example,
we can implement two versions of some function and then choose one based
on parameter value, which is controlled by optimizer.
And that's actually the point where something tricky can be done -
we can call both versions and then mix their results using a weight parameter,
instead of choosing one with a flag parameter.
Turns out that this "fuzzy" approach works much better with optimization,
because there's much less probability of getting stuck in local minima.

> That would be a very big improvement and with quantum computing on the
> horizon: who knows what is possible when those get available?

Afaik quantum computing is simply computing that uses elementary particles
to store/transmit/transform information. So it's massively parallel and
has a high probability of errors, which requires special algorithms to compensate.
I hope that you don't believe in popular entanglement/teleportation magic hype.

Wolframalpha says that ~2^104 hydrogen atoms fit into 1m^3, so I'd estimate
the best quantum PC to be 2^100 times faster than current PC.
Which only lets you bruteforce programs with extra 100 bits of code - not
much I think.

> Every problem with a deterministic output can be expressed using truth tables.

Sure, there's even a program that does that - http://www.cprover.org/cbmc/

But with such representation you won't be able to match longer loops, which
frequently appear both in data generated by "normal" programs, and in
mathematics (eg. in form of sum operators).

> I am aware of the SAT problem but I don’t see the relation with the things I wrote before?

I think either of these problems can be translated into another.
For example, we can write a generalized program with N instructions
(implemented as switches on parameter values), translate it with cbmc,
then with any SAT-solver find parameter values which would give us expected output.
Or fail to find. Then we can increase N.
