Theoretically we agree that the permutation can't be calculated directly within a lifetime, but we should be able to use it in some other, indirect way. I feel I should be near something...
I'm giving up for now/forever after several years of trying Hutter on and off. I have also been programming less and less, to the point that I can no longer code my ideas, and the math involved has grown too advanced for my level.
I will paste the raw discarded texts below in case they are useful:
---
Numbers will be approximate because I can't remember them exactly. Please indulge me and correct as needed.
-Split by the most frequently occurring char (space). There were something like 13M occurrences in the 100M, so call it 13M words.
-Create a dictionary with the unique entries (1-2 MB) and the number of occurrences of each word. Dictionary size was on the order of ~>3.5 MB *2 (a minimal sketch of these first two steps follows after this list).
*At this point I have ideas for several optimizations of the dictionary (DAWG/syntax-related words, tags and markup, punctuation... maybe just use the last paq version for this), but those should only become interesting after confirming a working algo.
-The algo/tricky part requires: the factoradic system/permutations, knowledge of sorting algorithms, finding the n-th permutation, arithmetic(?) encoding, lexicographic order...
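A minimal sketch of the split-and-count step, assuming plain Python and that the input path ("enwik8" here is only a placeholder name) is a text file read whole into memory:

import collections

def build_dictionary(path):
    # split by the most frequent char (space) and count each word's occurrences
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = f.read().split(" ")
    counts = collections.Counter(words)
    # keep dictionary order as order of first appearance in the data
    unique_in_order = list(dict.fromkeys(words))
    return unique_in_order, counts, len(words)

# unique_in_order, counts, total = build_dictionary("enwik8")
# print(total, "words,", len(counts), "unique")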
Let me first explain a discarded idea: calculate the permutation number that takes an ordered string to the desired data.
Dictionary: abcde
Data expected: cadbe
So by incrementing the permutation of the string abc, acb, cab... (in the factoradic system each number represents one ordering of the data) we can get to the desired data, and then we have the data expressed as a single number in factoradic. Testing this at my level of knowledge ended up being just a change of base from 256 to 13M, so to speak, whereas I would have expected at least 13M! ~= 10^87M, i.e. ~87M decimal digits, hoping to rebase them into binary at a ratio of 10/256, i.e. ~3.5 MB. I also faced the impossibility of calculating such a number nowadays, so I tested with chunks sized according to max_time. But this was incremental and there is a better way.
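For the abcde/cadbe toy example, a minimal sketch of computing that permutation number (the classical Lehmer-code/factoradic ranking, assuming all symbols are distinct):

from math import factorial

def permutation_rank(target, ordered):
    # factoradic digit i = position of target[i] among the still-unused symbols
    remaining = list(ordered)
    rank = 0
    for ch in target:
        j = remaining.index(ch)
        rank += j * factorial(len(remaining) - 1)
        remaining.pop(j)
    return rank

print(permutation_rank("cadbe", "abcde"))   # 50, out of 5! = 120 orderings

Storing such a rank for the full 13M-word data would take on the order of log10(13M!) ~= 87M decimal digits, the figure mentioned above.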
With this approach I would get close to the 12.5 MB theoretical limit: 3*3.5 + 1 MB of program and data.
And I believe, supported by my approximate calculations, that it can be lessened further by exchanging computation time for data, maybe shrinking a bit beyond that limit, which represents the theoretical limit for an almost-instant (zip-like) extractor.
+Let's combine the last ideas with the dictionary into a pseudo algo:
-Expand each word of the dictionary into an array of 13M entries, each word repeated consecutively as many times as it occurs. Dictionary order is by appearance in the original data. So we have an array the same size as the data and equivalent to the original data, but unordered.
-*Let's use a sorting algorithm, something like mergesort? (for decode, sure; for encode, not sure if it's reversible). A more "incremental/guess" lexicographic permutation walk would also be possible but impractically slower.
*Mergesort-like: split the permutation range in half, then guess whether the data lies in the upper side toward 75% or the lower side toward 25% (similar to arithmetic coding?), and code this choice as a 1 or 0, i.e. saving the way we go.
*a_chunk = the n-th permutation of the data such that n = 13M! * 25%
*b_chunk = the n-th permutation of the data such that n = 13M! * 75%
*Compare a_chunk and b_chunk with the original data (like comparing string similarity). Set 1 or 0 for upper or lower. Then, mergesort-like, split recursively to find the next n-th permutation.
Even if we cannot calculate 13M!, we should be able to "spawn" a tree of permutations from the approximated point 13M! * 25% (maybe using factorial approximations like Stirling's or Ramanujan's?...). A toy-scale sketch of this bisection follows.
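A toy-scale sketch of that bisection, assuming the dictionary string is sorted and all symbols are distinct: each comparison of the unranked guess against the original data yields one bit, and after about log2(n!) comparisons the rank of the data is pinned down.

from math import factorial

def nth_permutation(ordered, n):
    # unrank: read n in the factoradic system to pick symbols one by one
    remaining = list(ordered)
    out = []
    for i in range(len(remaining), 0, -1):
        f = factorial(i - 1)
        out.append(remaining.pop(n // f))
        n %= f
    return "".join(out)

def encode_by_bisection(data, ordered):
    lo, hi = 0, factorial(len(ordered)) - 1     # all possible ranks
    bits = []
    while lo < hi:
        mid = (lo + hi) // 2
        guess = nth_permutation(ordered, mid)
        if data > guess:                        # lexicographic comparison
            bits.append(1)                      # original is in the upper side
            lo = mid + 1
        else:
            bits.append(0)                      # original is in the lower side
            hi = mid
    return bits, lo                             # lo == rank of the original data

bits, rank = encode_by_bisection("cadbe", "abcde")
print(bits, rank)                               # 7 bits, rank 50 again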
The main concept, as per Occam's razor, is that we now have: the data without repetitions (the dictionary -> compress it via syntax); the number of times each entry is repeated (the dictionary with occurrence counts; digression: I would sort by this count so it becomes an ascending or descending number sequence, but then it is less optimizable via syntax, so both approaches have to be compared); and the order of the data, now just a permutation number, as tiny as possible (compressed arithmetically? It can also be approximated and calculated in parts so as to exchange CPU time for data: at this point it is just applying math, and that is where I need your help).
Please share your thoughts about the possibility of this working and whether you are interested in pairing.
---2
Instead of storing the permutation "number" directly (87M digits -> ~3.5 MB in bits) we could set rules to calculate it, adjusted to the particular result and constrained by the 10h time limit. Maybe near a (Mersenne?) prime, p + n, or in the form x = huge_prime * n + modulo_rest.
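A minimal sketch of the x = huge_prime * n + modulo_rest form; 2**31 - 1 is just an example Mersenne prime, and whether the pair (n, rest) is ever cheaper to store than x itself is exactly the open question:

def split_by_prime(x, p=2**31 - 1):    # 2^31 - 1 is a Mersenne prime
    n, rest = divmod(x, p)             # so x == p * n + rest
    return n, rest

n, rest = split_by_prime(50)           # the toy rank from above
assert 50 == (2**31 - 1) * n + rest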
---3
I know that naming permutations seems like a dead end (not enough computation time in a lifetime), but we could use the fact that it is indeed a permutation representing an order, without having to calculate it step by step. We can also take advantage of the fact that the dictionary "shrinks" once we have consumed all the occurrences of a given word, and that the unordered data "shrinks" with each word we decode.
Note that it is not exactly a permutation nor a combination, because we have repeated words and any copy of a word will fit any of the spots that its group fills. So it's more like a ~1M-word dictionary * ~13 occurrences each: known keys fitting into 13M unknown doors.
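Because of those repetitions, the number of distinct arrangements is not 13M! but the multinomial 13M! / (c_1! * c_2! * ...), with c_i the occurrence counts of each word; a toy sketch, where log2 of that count is the number of bits actually needed to pin down the ordering:

from math import factorial, log2

def multiset_arrangements(counts):
    # distinct orderings of a multiset with the given occurrence counts
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

counts = [3, 2, 1]                      # e.g. "the" x3, "cat" x2, "sat" x1
m = multiset_arrangements(counts)
print(m, "orderings, ~", round(log2(m), 1), "bits")   # 60 orderings, ~5.9 bits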
We could agree that, starting from a dictionary ordered by appearance, we could reach the destination data by ordering it with the appropriate algorithm, e.g. toward descending lexicographic order.
If we started with something similar to mergesort, there would be a tipping point at a given iteration, when our dict would be lexicographically before the original data, and on the next iteration it would be lexicographically after it.
Now it is just a matter of adjusting that difference. Instead of splitting at 1/2 we have to find the offset that gives the closest starting string distance in the fewest iterations (there should be multiple acceptable ways/solutions). This could be probed randomly at first to guide us to a range, for example between 10-20%, and then we could adjust the rest "manually"... in the future, a neural net would be pluggable here.
We could also tweak this "sorting" with rules like "no word appears more than 3 consecutive times", so as to skip a percentage of cases with each rule. With this and the previous idea we aim to bring the initial huge n! down to something like log n!, and so on, into a complexity realm that should be computable. A more complex approach would be to fractally spread the given occurrences over the remaining length (like dividing length / n_occs and inserting at each multiple; see the sketch just below).
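A minimal sketch of the "spread the occurrences over the remaining length" idea (length / n_occs, insert at each multiple); this is only one guessed reading of that sentence:

def spread_positions(length, n_occs):
    # place n_occs copies evenly: one every length // n_occs slots
    step = length // n_occs
    return [i * step for i in range(n_occs)]

print(spread_positions(13, 3))          # [0, 4, 8] as the initial guess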
Finally we would have something like DNA chains coming together, so to speak: left of the pivot is ordered, right of it is unordered. On the right we would have chains that are already ordered but of unknown length, mixed with unordered data.
We explore the right side for already-ordered sets of words, during and after the iterations of the tweaked "sorting".
The pivot would be a variable-size pool instead of a point or a char. There we collect chains that are already ordered, thus shrinking the unordered part, and with a bit we can arrange them among themselves (Lehmer codes?). Once a chain is consecutive with the already-decoded data we append it and set 1 bit for that move.
At a given iteration number of the algo, the decode is done. The first time we can just compare against the original data or a hash.
---4
This seems a stepping stone toward another approach we couldn't code, where, using the same dictionaries described above, we take a nodes approach.
For each word, store the possible next words and their occurrence counts.
Brute-forcing this also blows up quickly, but we could use path-search algorithms... so we try to order the candidates according to their appearance in the text and set bits for the deviation, while using syntax rules for how to construct phrases: 1 article, 1 verb, 1 predicate...
Coding all language variants is an impossible task for now, so we can just use this as an additional tool on top of the syntax/tags compression; a small sketch of the successor-ranking idea follows.
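A small sketch of that nodes idea: for each word, keep its possible next words ordered by frequency, and store only the rank (the "deviation") of the word that actually follows, so frequent transitions get rank 0 and cost almost nothing. The example text and names are made up:

import collections

def successor_table(words):
    # for each word, its possible next words with occurrence counts
    table = collections.defaultdict(collections.Counter)
    for a, b in zip(words, words[1:]):
        table[a][b] += 1
    return table

def encode_ranks(words, table):
    # per step, the rank of the real next word among its candidates
    ranks = []
    for a, b in zip(words, words[1:]):
        candidates = [w for w, _ in table[a].most_common()]
        ranks.append(candidates.index(b))
    return ranks

words = "the cat sat on the mat the cat ran".split()
print(encode_ranks(words, successor_table(words)))   # mostly zeros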
---5
tags, markup, html_urls and similar code-language rules:
create code functions:
def html(tag, innertext):
    # wrap the inner text in the given tag
    return "<%s>%s</%s>" % (tag, innertext, tag)
punctuation:
replace them all with a special char, then apply the perm+rules to find the combination of .,;: (!?); a small sketch follows.
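A small sketch of that punctuation step: replace every mark with one special placeholder char and keep the real marks as a separate stream (to be reconstructed later by the perm+rules, or stored directly); the placeholder byte is arbitrary:

PUNCT = ".,;:!?"

def strip_punctuation(text, placeholder="\x01"):
    kept, stream = [], []
    for ch in text:
        if ch in PUNCT:
            kept.append(placeholder)    # uniform special char in the main text
            stream.append(ch)           # the real marks, in order
        else:
            kept.append(ch)
    return "".join(kept), "".join(stream)

clean, marks = strip_punctuation("Hi, there! Ok?")
print(repr(clean), repr(marks))         # 'Hi\x01 there\x01 Ok\x01' and ',!?'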
---6
Possibility of a variation of Stirling's approximation with e and pi, max_size ~= 13M (~words in enwik), where e and pi can be calculated, not stored. A quick sketch follows below.
Check also the approximation: x = max_prime * n + modulo
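A quick sketch of sizing 13M! through Stirling's formula (only e and pi are involved, and both can be computed rather than stored), checked against the exact log-gamma value:

import math

def stirling_log10_factorial(n):
    # Stirling: n! ~= sqrt(2*pi*n) * (n/e)^n, taken as a base-10 logarithm
    return 0.5 * math.log10(2 * math.pi * n) + n * math.log10(n / math.e)

n = 13_000_000
approx = stirling_log10_factorial(n)
exact = math.lgamma(n + 1) / math.log(10)
print(round(approx), round(exact))      # ~87M decimal digits either way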
---7
flame