# Thread: Discussion on permutation/sort algorithm possibility

1. ## Discussion on permutation/sort algorithm possibility

Theoretically we agree that the permutation can't be calculated directly in a lifetime, but we should be able to use it in some other, indirect way. I feel I must be near something...

I'm giving up for now (or forever) after several years of trying Hutter on and off. I have also been programming less and less, to the point where I can no longer code my ideas; because of that, and because the math involved has become too advanced for my level.

I will paste the raw discarded texts in case they are useful:

---

Numbers will be approximate because I can't remember them exactly. Please indulge me and correct as needed.

-Split by the most frequent character (space). There were something like 13M occurrences in the 100M bytes, so let's say 13M words.

-Create a dictionary with the unique entries (1-2 MB) and the number of occurrences of each word. Dictionary size was on the order of ~>3.5 MB *2

*At this point I have ideas for several dictionary optimizations (DAWG/syntax-related words, tags and markup, punctuation... maybe just use the last paq version for this), but those should only become interesting after confirming a working algorithm.

-The tricky/algorithmic part requires: the factoradic number system/permutations, knowledge of sorting algorithms, finding the n-th permutation, arithmetic(?) coding, lexicographic order...
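As a rough illustration of the split-and-count steps above, a minimal Python sketch (the function name and the toy input are mine, not from the thread):

```python
from collections import Counter

def build_dictionary(text):
    """Split on the most frequent character (space) and count word occurrences."""
    words = text.split(" ")
    counts = Counter(words)                 # unique word -> occurrence count
    order = list(dict.fromkeys(words))      # order of first appearance in the data
    return order, counts

order, counts = build_dictionary("the cat sat on the mat the end")
```

Storing `order` plus `counts` is enough to rebuild the expanded, unordered word array discussed later.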

Let me first explain a discarded idea: calculate the permutation number that takes an ordered string to the desired data.
Dictionary: abcde

So by incrementing the permutation of the string (abc, acb, cab...; in the factoradic system each number represents one ordering of the data) we can reach the desired data, and then we have the data expressed as a number in factoradic. Testing this at my level of knowledge ended up just changing the base from 256 to 13M, so to speak, where I would have expected at least: 13M! ~= 10^87M ~= 87 MB of decimal digits, hoping to rebase them to binary at a ratio of 10/256, i.e. 3.5 MB. I also ran into the impossibility of calculating such a number nowadays, so I tested with chunks sized according to max_time. But this was incremental, and there's a better way.
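For reference, collapsing an ordering into a single factoradic number is usually called ranking via the Lehmer code. A minimal sketch for distinct symbols (function name is mine; quadratic and for toy sizes only):

```python
from math import factorial

def permutation_rank(perm):
    """0-based rank of a permutation of distinct symbols in lexicographic
    order -- its factoradic / Lehmer code collapsed into one integer."""
    remaining = sorted(perm)
    rank = 0
    for x in perm:
        j = remaining.index(x)               # how many unused symbols precede x
        rank += j * factorial(len(remaining) - 1)
        remaining.pop(j)
    return rank
```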

With this I would get close to the 12.5 MB theoretical limit: 3x3.5 MB + 1 MB of program and data.
And I believe, supported by my approximate calculations, that it can be reduced by exchanging computation time for data, perhaps shrinking a bit beyond that, which represents a theoretical limit for an almost-instant (zip-like) extractor.

+Let's combine the last ideas with the dictionary into a pseudo-algorithm:
-Expand each word of the dictionary into an array of 13M entries, each word repeated consecutively n-occurrences times. Dictionary order is by appearance in the original data. So we have an array the same size as the data and equivalent to the original data, but unordered.
-*Let's use a sorting algorithm, some kind of mergesort? (surely for decoding; for encoding I'm not sure it's reversible). A more "incremental/guess" lexicographic permutation would also be possible, but impractically slower.
*Mergesort goes like this: split the data in half, then guess whether to go to the upper side at 75% or down to 25% (similar to arithmetic coding?), so we could encode this as a 1 or a 0, as if saving the path we take.

*a_chunk = n-th permutation of the data such that n = 13M! * 25%
*b_chunk = n-th permutation of the data such that n = 13M! * 75%
*Compare a_chunk and b_chunk with the original data (like comparing string similarity). Set 1 or 0 for upper or lower. Then, mergesort-like, split recursively to find the next n-th permutation.
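For distinct symbols this upper/lower guessing can actually be made exact, because the rank order of permutations is just their lexicographic order, so a plain binary search over ranks emits exactly the 1/0 path bits described above. A sketch (names are mine; toy sizes only, since factorial ranks explode quickly):

```python
from math import factorial

def nth_perm(symbols, n):
    """n-th (0-based) lexicographic permutation of distinct symbols."""
    pool = sorted(symbols)
    out = []
    while pool:
        i, n = divmod(n, factorial(len(pool) - 1))
        out.append(pool.pop(i))
    return out

def find_rank_by_bisection(target):
    """Binary search on the rank: each comparison records one 'upper/lower'
    bit, exactly the mergesort-style guessing described above."""
    target = list(target)
    lo, hi = 0, factorial(len(target)) - 1
    bits = []
    while lo < hi:
        mid = (lo + hi) // 2
        if target <= nth_perm(target, mid):   # ranks follow lexicographic order
            bits.append(0)
            hi = mid
        else:
            bits.append(1)
            lo = mid + 1
    return lo, bits
```

The emitted bit string is just the binary expansion of the rank, which is why this search cannot beat storing the index directly.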

Even if we cannot calculate 13M! itself, we should be able to "spawn" a tree of permutations from the approximate point 13M! * 25% (maybe using factorial approximations: Stirling? Ramanujan?...)
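The size of such an index can be estimated without ever computing 13M!, e.g. via the log-gamma function; a sketch (assuming 13M words, as above; function name is mine):

```python
from math import lgamma, log

def log2_factorial(n):
    """log2(n!) via log-gamma, so n! never has to be computed explicitly."""
    return lgamma(n + 1) / log(2)

bits = log2_factorial(13_000_000)   # roughly 2.9e8 bits, i.e. ~36 MB raw index
```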

The main concept, as in Occam's razor, is that we now have: the data without repetitions (dictionary -> compress via syntax), plus a dictionary of how many times each entry is repeated (digression: I would optimize by sorting on this count so it becomes an ascending or descending number sequence; but then it is less optimizable via syntax, so I will have to compare both approaches), plus the order of the data reduced to just a permutation number, as tiny as possible (compressed arithmetically? It can also be approximated and calculated in parts, so as to exchange CPU time for data: at this point it is just a matter of applying math, and that is where I need your help).

Please share your thoughts on whether this could work, and whether you are interested in pairing up.

---2

Instead of directly storing the "number" of the permutation (87M digits -> ~3.5 MB in bits) we could set rules to calculate it, adjusted to the particular result and constrained by the 10h time limit. Maybe near a (Mersenne?) prime p plus n, or in the form x = huge_prime * n + remainder.

---3

I know that naming permutations seems like a dead end (not enough computation time in a lifetime), but we could exploit the fact that it is indeed a permutation representing an order, without having to calculate it step by step. We can also take advantage of the dictionary "shrinking" once we have consumed all occurrences of a given word, and of the unordered data "shrinking" with each word we decode.

Note that it is not exactly a permutation nor a combination, because we have repeated words, and any of them will fit any of the spots their group fills. So it's more like a ~1M-word dictionary * 13 occurrences each: known keys fitting into 13M unknown doors.
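In other words, this is a multiset permutation, and its index needs log2 of the multinomial coefficient n!/(c1!*c2!*...) bits, computable through log-gamma without any huge factorial; a sketch (function name is mine):

```python
from math import lgamma, log

def log2_multiset_index(counts):
    """Bits needed to index one arrangement of a multiset:
    log2( n! / (c1! * c2! * ...) ), computed via log-gamma."""
    n = sum(counts)
    bits = lgamma(n + 1) / log(2)
    for c in counts:
        bits -= lgamma(c + 1) / log(2)
    return bits
```

Because every count divides the factorial down, repeated words make the index strictly smaller than a plain 13M! rank.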

We could agree that, starting from a dictionary ordered by appearance, we could get to the destination data by sorting it with the appropriate algorithm, e.g. towards descending lexicographic order.

If we started with something similar to mergesort, there would be a tipping point at a given iteration, where our dictionary would be lexicographically before the original data, and on the next iteration it would be lexicographically after it.

Now it is just a matter of adjusting that difference. Instead of splitting at 1/2, we have to find the offset that gives the closest starting string distance in the fewest iterations (there should be multiple acceptable ways/solutions). This could be probed randomly at first, which would guide us to a range, for example between 10-20%, and then we could adjust the rest "manually"... in the future, a neural net would be pluggable here.

We could also tweak this "sorting" with rules such as "no word appears more than 3 consecutive times", skipping a percentage of cases with each rule. With this and the above we aim to bring the initial huge n! down to log n! and so on, into complexity territory that should be computable. A more complex approach would be to fractally spread the given occurrences over the remaining length (like dividing length/n_occs and inserting at each multiple of that step).

Finally we would have something like DNA chains joining together, so to speak: left of the pivot is ordered, right is unordered. On the right we would have chains that are already ordered but of unknown length, mixed with unordered data.

We explore the right side for already-ordered runs of words, during and after the iterations of the tweaked "sorting".

The pivot would be a variable-size pool instead of a point or a char. Here we collect chains that are already ordered, thereby shrinking the unordered part. With one bit we can arrange among them (Lehmer codes?). Once a chain is consecutive to the already-ordered data, we append it to the decoded part and record 1 bit for that move.

At a given iteration count of the algorithm, decoding is done. The first time we can simply compare against the original data or a hash.

---4

This seems a stepping stone toward another approach that we couldn't code, where, using the same dictionaries described, we use a graph-of-nodes approach.
For each word, record the possible next words and their occurrences.
Brute-forcing this also blows up quickly, but we could use path-search algorithms... so we try to order the words according to their appearance in the text, set bits for the deviations, and use syntax rules for how phrases are constructed: 1 article, 1 verb, 1 predicate...
Coding all language variants is an impossible task for now, so we could just use this as an additional tool next to the syntax/tags compression.
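The per-word successor table described above can be sketched in a few lines (names and toy data are mine):

```python
from collections import Counter, defaultdict

def next_word_graph(words):
    """For each word, record the possible next words and how often each follows it."""
    graph = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        graph[a][b] += 1
    return graph

g = next_word_graph(["the", "cat", "the", "dog", "the", "cat"])
```

A path-search over this graph, with bits recorded only where the text deviates from the most likely successor, is the deviation-coding idea above.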

---5

Tags, markup, html_urls and similar code-language rules:
create code functions:

def html(tag, innertext):
    return "<%s>%s</%s>" % (tag, innertext, tag)

punctuation:
replace every mark with an escape char, then apply the permutation+rules to find the right combination of .,;: (!?)
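A minimal sketch of that punctuation step, assuming a single escape character `\x01` and the mark set `.,;:!?` (both choices are mine):

```python
import re

PUNCT = r"[.,;:!?]"

def strip_punctuation(text):
    """Replace every punctuation mark with one escape char and keep the
    removed marks as a separate stream to be encoded on its own."""
    marks = re.findall(PUNCT, text)
    return re.sub(PUNCT, "\x01", text), marks

body, marks = strip_punctuation("Hi, there! Ok?")
```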

---6

A variation of Stirling with e and pi might be possible, with max_size ~= 13M (~number of words in enwik), where e and pi can be calculated rather than stored.
Check also the approximation: x = max_prime * n + modulo
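Stirling's formula indeed needs only pi (and e, hidden inside the n ln n - n term), so neither constant has to be stored; a sketch along those lines (function name is mine; accuracy improves with n):

```python
from math import pi, log

def stirling_ln_factorial(n):
    """Stirling's approximation: ln n! ~ n ln n - n + 0.5 ln(2 pi n).
    The '- n' term is ln((1/e)^n), so e never needs storing either."""
    return n * log(n) - n + 0.5 * log(2 * pi * n)
```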

---7

flame

2. ## Thanks:

xinix (21st January 2020)

3. Code:
```
98,114,172 enwik_text2       // preprocessed enwik8 text without xml headers
98,114,172 enwik_text3       // perl -pe "tr/[\n ]/[ \n]/" <enwik_text2 >enwik_text3
98,114,172 enwik_text3_s     // (sorted 13519824 words) stegdict d enwik_text3 enwik_text3_s payload
34,954,210 payload           // permutation index; there are repeated words, so less than https://www.wolframalpha.com/input/?i=IntegerPart%5BLog%5B256.%2C13519824%21%5D%5D
24,622,621 enwik_text3.7z    // 7z a -mx=9 -md=128m -mt=1 enwik_text3 enwik_text3
 7,190,440 enwik_text3_s.7z  // 7z a -mx=9 -md=128m -mt=1 enwik_text3_s enwik_text3_s
```

Well, it's not how compression works.

4. ## Thanks:

xinix (21st January 2020)

5. Originally Posted by Shelwien
Code:
```
98,114,172 enwik_text2       // preprocessed enwik8 text without xml headers
98,114,172 enwik_text3       // perl -pe "tr/[\n ]/[ \n]/" <enwik_text2 >enwik_text3
98,114,172 enwik_text3_s     // (sorted 13519824 words) stegdict d enwik_text3 enwik_text3_s payload
34,954,210 payload           // permutation index; there are repeated words, so less than https://www.wolframalpha.com/input/?i=IntegerPart%5BLog%5B256.%2C13519824%21%5D%5D
24,622,621 enwik_text3.7z    // 7z a -mx=9 -md=128m -mt=1 enwik_text3 enwik_text3
 7,190,440 enwik_text3_s.7z  // 7z a -mx=9 -md=128m -mt=1 enwik_text3_s enwik_text3_s
```

Well, it's not how compression works.

Could you elaborate/explain your numbers/process a bit more?

The dictionary should contain unique words, not repeated ones. A second dictionary (or a multidimensional one) holds the number of repetitions of each word. We can store just this and re-create the expanded one from it.

"Permutation" here means going word by word through the original enwik and encoding the corresponding dictionary key, with a dictionary of about ~1M words, not 13M.

6. ## Thanks:

xinix (21st January 2020)

Hmm, I see: https://github.com/Shelwien/stegdict

But in my tests, randomizing wouldn't be optimal for compression, even if it shows a little gain in the short term.

8. ## Thanks:

xinix (21st January 2020)

> Could you elaborate/explain your numbers/process a bit more?

stegdict is a somewhat related tool: https://encode.su/threads/2597-Stega...y-permutations https://github.com/Shelwien/stegdict

At least, it is able to split a file to a sorted wordlist and permutation index required to restore original word order.

I tried to demonstrate that the permutation index is too large and mostly incompressible
(which is no wonder, since it's basically an arithmetic code).

> The dictionary should contain unique words, not repeated ones.
> A second dictionary (or a multidimensional one) holds the number of repetitions of each word.
> We can store just this and re-create the expanded one from it.

That on its own won't change anything.
Even if we imagine that enwik_text3_s can be compressed to 0, the index is already too large on its own.

> "Permutation" is word by word of original enwik, encode the dictionary key corresponding in a dict of about ~1M words, not 13M

A permutation (or any transformation, really) can't be the solution to achieving the best enwik compression.
A transformation (e.g. BWT) can make a file easier/faster to compress, but at the same time it becomes harder
to access other features of the data, the ones that are not the focus of the transformation.

Compared to that, a CM (context mixing) compressor is able to combine predictions made from multiple types of patterns at once.

Considering word permutations, you should know that paq-based HP solutions already have a dictionary preprocessing step,
and it's fairly advanced, because the words in the dictionary are sorted by their semantic and morphological properties.
Plus there's a word model in CM part.

There actually may be some reasons to enumerate permutations (as explained in the stegdict thread).
But it only applies to cases where we would use sorted lists of unique words (DRT dictionary doesn't apply!).
So the effect on enwik compression would be fairly minor even in best case.
Realistically though, we also have much less experience with compression of permutations than with "normal" data.

10. ## Thanks:

xinix (21st January 2020)

11. Originally Posted by Shelwien
> Could you elaborate/explain your numbers/process a bit more?

stegdict is a somewhat related tool: https://encode.su/threads/2597-Stega...y-permutations https://github.com/Shelwien/stegdict

At least, it is able to split a file to a sorted wordlist and permutation index required to restore original word order.

I tried to demonstrate that the permutation index is too large and mostly incompressible
(which is no wonder, since it's basically an arithmetic code).

> The dictionary should contain unique words, not repeated ones.
> A second dictionary (or a multidimensional one) holds the number of repetitions of each word.
> We can store just this and re-create the expanded one from it.

That on its own won't change anything.
Even if we imagine that enwik_text3_s can be compressed to 0, the index is already too large on its own.

> "Permutation" is word by word of original enwik, encode the dictionary key corresponding in a dict of about ~1M words, not 13M

A permutation (or any transformation, really) can't be the solution to achieving the best enwik compression.
A transformation (e.g. BWT) can make a file easier/faster to compress, but at the same time it becomes harder
to access other features of the data, the ones that are not the focus of the transformation.

Compared to that, a CM (context mixing) compressor is able to combine predictions made from multiple types of patterns at once.

Considering word permutations, you should know that paq-based HP solutions already have a dictionary preprocessing step,
and it's fairly advanced, because the words in the dictionary are sorted by their semantic and morphological properties.
Plus there's a word model in CM part.

There actually may be some reasons to enumerate permutations (as explained in the stegdict thread).
But it only applies to cases where we would use sorted lists of unique words (DRT dictionary doesn't apply!).
So the effect on enwik compression would be fairly minor even in best case.
Realistically though, we also have much less experience with compression of permutations than with "normal" data.

> I tried to demonstrate that the permutation index is too large and mostly incompressible

As I estimated, the permutation number should be around 13M! ~= 87 MB of decimal digits; that means using 10 out of 256 symbols per byte, which we could code in binary as 3.5 MB.
I agree that those 3.5 MB could hardly be compressed further, but there are still other approaches, as mentioned above. Even reaching only this would be an improvement.
-It takes 6 hours to calculate it with approximations at 13M-digit precision in Python using plain Stirling. This is not as exact as a real calculation and would have to be adjusted or implemented better.
Ideally it should be calculated directly as a binary number at 3.5 MB precision, but I can't implement that at my level of math/programming.
There is code for a poker hold'em engine that does something similar, but I don't understand how. (https://www.codeproject.com/Articles...n-and-Analysis)

> Of course shrinking the dictionary would make a difference, as we store less compressed data... please demonstrate otherwise with facts.
Then we have ~1M words * 3+ chars average size ~= 3.5 MB, plus a dictionary of occurrence counts weighing less than that.
This first part is the one and only place to apply syntax, BWT, dictionary compression and other such techniques.

I say permutation because it is the closest concept, but it is not quite that, since we take advantage of repetitions, which do not exist in the pure traditional definition of a permutation.
I could invent a word for it, but I find that even more confusing: sortation? ordenation?

> A transformation like BWT I would use only to compress the dictionary part, specified as the syntax dict.
About CM I don't know much, but by definition it is similar to the optimization proposed for the syntax dict and could also be used for optimizing. The same goes for semantics, as with the (6) html/markup optimization.
First I would like to confirm the whole algorithm, basically with a compression of around <20 MB.

> Enumerating permutations per se is not the interesting part. We only need to find one, and only in compression, not decompression (so we should try to do this in asymmetric time), and then work with it.
There are algorithms to find the n-th permutation directly, without calculating the previous ones.
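Such a direct n-th permutation (unranking) algorithm can be sketched as follows, for distinct symbols (function name is mine):

```python
from math import factorial

def nth_permutation(symbols, n):
    """Directly construct the n-th (0-based) lexicographic permutation,
    without stepping through the previous ones."""
    pool = sorted(symbols)
    out = []
    while pool:
        # the leading symbol is determined by n // (len(pool)-1)!
        i, n = divmod(n, factorial(len(pool) - 1))
        out.append(pool.pop(i))
    return out
```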

-Could you define "DRT dict"?

12. ## Thanks:

xinix (21st January 2020)

>> I tried to demonstrate that the permutation index is too large and mostly incompressible

> As I estimated, the permutation number should be around 13M! ~= 87 MB of decimal digits;
> that means using 10 out of 256 symbols per byte, which we could code in binary as 3.5 MB.

Ah, so that's where you made a mistake.
https://www.wolframalpha.com/input/?...00000%21%5D%5D
"87 Mb of digits" is actually about right
https://www.wolframalpha.com/input/?...00000%21%5D%5D
But a decimal digit requires ~0.415 of a byte to encode, so 3.5Mb binary is too little.
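That correction is easy to check numerically (87M digits and ~0.415 bytes/digit are the figures from the posts above):

```python
from math import log2

bytes_per_decimal_digit = log2(10) / 8          # ~0.415 bytes per digit
index_mb = 87_000_000 * bytes_per_decimal_digit / 1e6
# roughly 36 MB, nowhere near the 3.5 MB estimate
```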

> Ideally it should be calculated directly as a binary number at 3.5 MB precision, but I can't implement that at my level of math/programming.

Well, it's not a problem in C/C++; I already did it in stegdict.

> Could you define DRT dict?

http://nishi.dreamhosters.com/u/lpaq9m.rar

Or you can use cmix -s, its similar: http://www.byronknoll.com/cmix.html

14. ## Thanks (2):

Cristo (21st January 2020),xinix (21st January 2020)

That's some interesting info for me, and yes, there was a miscalculation; that places the expected output at around ~40 MB at this point.

I will try to work it a bit more from here. Thanks.

16. ## Thanks:

xinix (21st January 2020)

No, the rank of the permutation with duplicates will need only ~21 MB, because of the duplicates.

But yes, it will take unfeasibly long. I am also interested in fast ranking (rank and unrank) of permutations with duplicates.

18. ## Thanks:

Cristo (11th February 2020)

19. Originally Posted by thometal
No, the rank of the permutation with duplicates will need only ~21 MB, because of the duplicates.

But yes, it will take unfeasibly long. I am also interested in fast ranking (rank and unrank) of permutations with duplicates.
Please elaborate on how it can be reduced to 1/4 because of the duplicates. I understand it would be easy to make it 1/2 using reversal. I also have other optimizations in mind.

I can permute what I need for Hutter in just 16 days using Python and the standard npmath lib. It should be feasible within prize time in C with a custom lib/algorithm.
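For what it's worth, ranking with duplicates follows the same pattern as plain Lehmer ranking, except that each step counts the remaining multiset arrangements instead of factorials; a minimal (slow, purely illustrative) sketch, with names of my choosing:

```python
from math import factorial

def arrangements(counts, n):
    """Distinct orderings of a multiset with the given symbol counts."""
    r = factorial(n)
    for c in counts.values():
        r //= factorial(c)
    return r

def multiset_rank(seq):
    """0-based lexicographic rank of seq among all distinct orderings
    of its own multiset of symbols."""
    counts = {}
    for x in seq:
        counts[x] = counts.get(x, 0) + 1
    rank, n = 0, len(seq)
    for x in seq:
        # count arrangements starting with any smaller remaining symbol
        for y in sorted(counts):
            if y >= x:
                break
            counts[y] -= 1
            rank += arrangements(counts, n - 1)
            counts[y] += 1
        counts[x] -= 1
        if counts[x] == 0:
            del counts[x]
        n -= 1
    return rank
```

Unranking runs the same loop in reverse, subtracting arrangement counts until the right leading symbol is found at each position.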

20. Originally Posted by Cristo
Please elaborate on how it can be reduced to 1/4 because of the duplicates. I understand it would be easy to make it 1/2 using reversal. I also have other optimizations in mind.

I can permute what I need for Hutter in just 16 days using Python and the standard npmath lib. It should be feasible within prize time in C with a custom lib/algorithm.
Just Google "rank and unrank permutation with duplicates"

Here is an explanation: