Enter-the-Simplicius, or some draft ideas for a new fast-but-weakly-compressing TEXT [de]compressor: Simplicissimus
Just continuing the paused suggestion from post #21929.
When the burst read is 600+ MB/s I need a weak (at least 50% compression ratio) but ultrafast (not bottlenecking the burst read) ASCII text decompressor, in order to load/traverse gigabytes of texts even more rapidly.
Still refusing to learn the compression craft, I don't feel capable of writing even a simple compressor targeted only at 4+ MB ASCII texts,
but these two little facts cannot stop me from sharing some ideas of mine.
In the near future I intend to pursue/ride the following bullets/thoughts:
▪ The Sliding Window Dictionary approach, i.e. LZ, remains in parallel as a wingman, not a rival, with these settings (similar to Encode's from his LZSS):
Window_size (Match_Offset + Match_Length) = 24 bits i.e. 3 bytes
Dictionary_size (Match_Offset) = 19 bits i.e. 512KB; for reference see post #21822, the very useful tests on OSHO.TXT (206,908,949 bytes) by Encode, stepping through dictionary sizes:
128KB / 66,529,981 nah!
256KB / 61,249,279 nah!
512KB / 56,888,532 a BONBON!
1MB / 55,270,170 nah!
Look-Ahead-Buffer_size (Match_Length) = 5 bits; in my opinion the matches a 6-bit length would add (i.e. 35..66-byte-long matches) are too rare in English texts to be worth the extra bit (an 18-bit offset : 6-bit length vs 19-bit offset : 5-bit length trade-off)
Min_Match_Threshold = 3 chars
Max_Match_Threshold = 31+3 = 34 chars (the 5-bit code 0..31 is biased by 3, since lengths 0, 1, 2 never occur)
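A minimal sketch in C of how such a 24-bit token could be packed and unpacked under the settings above; the exact placement of the offset and length fields within the 3 bytes is my assumption, not a settled part of the design:

#include <stdint.h>

/* Pack a match into a 24-bit token: 19-bit Match_Offset (0..524287) in the high
   bits, 5-bit length code in the low bits. Length code 0..31 maps to real match
   lengths 3..34 (Min_Match_Threshold = 3). Only the low 3 bytes of the returned
   value would be written to the compressed stream. */
static uint32_t pack_token(uint32_t offset, uint32_t length)
{
    return (offset << 5) | (length - 3);
}

static void unpack_token(uint32_t token, uint32_t *offset, uint32_t *length)
{
    *offset = token >> 5;          /* 19 bits -> 512KB dictionary */
    *length = (token & 0x1F) + 3;  /* 5 bits + 3 -> 3..34 chars   */
}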
▪ Format of a .Simplicissimus file:
[Header:]
23 bytes: Filetype: Simplicissimus
256KB: Pointer0000 4 bytes, ..., PointerFFFF 4 bytes
xKB: Match0000Length 1 byte, Match0000String 3..31 bytes, ..., MatchFFFFLength 1 byte, MatchFFFFString 3..31 bytes
where (MinDictionaryLen = 65536*(3+1) = 256KB) <= xKB <= (MaxDictionaryLen = 65536*(31+1) = 2048KB)
[Encoded sequence:]
8 bytes: ID TAG, each bit used as a 0/1 flag for LITERAL/NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR respectively (a big/little-endian issue to be solved here)
Sequence of 64 units, each either 1 byte or 2 bytes: a LITERAL or a NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR
...
8 bytes: ID TAG
Sequence of <=64 units (the final group may be shorter), each either 1 byte or 2 bytes: a LITERAL or a NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR
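To make the layout concrete, here is a minimal decoder sketch in C for one such group; all names (decode_group, dict_ptr, etc.) are mine and hypothetical, and the bit order within the ID TAG is assumed little-endian, which is exactly the endianness issue flagged above:

#include <stdint.h>
#include <string.h>

/* Decode one group: an 8-byte ID TAG followed by up to 64 units.
   dict     - the raw dictionary block from the header (MatchLength byte +
              MatchString per entry);
   dict_ptr - the 65536 4-byte pointers from the header, taken here as
              offsets into dict. */
static void decode_group(const uint8_t **src, uint8_t **dst,
                         const uint8_t *dict, const uint32_t dict_ptr[65536],
                         int units /* <= 64 */)
{
    uint64_t tag;
    memcpy(&tag, *src, 8);                  /* 8-byte ID TAG                  */
    *src += 8;
    for (int i = 0; i < units; i++) {
        if (tag & ((uint64_t)1 << i)) {     /* flag 1: 2-byte pointer number  */
            uint16_t n = (uint16_t)((*src)[0] | ((*src)[1] << 8));
            const uint8_t *e = dict + dict_ptr[n];
            memcpy(*dst, e + 1, e[0]);      /* e[0] = MatchLength (3..31)     */
            *dst += e[0];
            *src += 2;
        } else {                            /* flag 0: one literal byte       */
            *(*dst)++ = *(*src)++;
        }
    }
}

Note that the loop works on whole bytes only; the single bit test per unit is the only bit operation, in line with the "bit operations used only partially" PRO listed further below.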
▪ Some attempts:
11 bytes string#1: nonetheless
11 bytes string#2: Nonetheless
31 bytes string#3: dichlorodiphenyltrichloroethane
129 bytes string#4: Susie grabs her man and puts a grip on his hand as the rain puts a tear in his eye. /A line from 'Hurts - Wonderful Life' lyrics/
185 bytes string#5: Susanna: In the Apocrypha, a captive in Babylon who was falsely accused of adultery and was rescued by Daniel. /The American Heritage Dictionary of the English Language, Fourth Edition/
Assigning pointers is the core (the 'tobeornottobe' trade-off: longer matches are the goal, but they must not be compound, otherwise they rapidly use up the free dictionary entries and thus prevent the basic building bricks, the non-compound matches, from being encoded as well):
Scenario #1:
Shortest encoded sequence for 'nonetheless': 2 bytes (1 pointer)
Longest encoded sequence for 'nonetheless': 11 bytes (11 literals)
Shortest encoded sequence for 'Nonetheless': 2 bytes (1 pointer)
Longest encoded sequence for 'Nonetheless': 11 bytes (11 literals)
In total: 2 dictionary entries used; 2+2 bytes in the encoded stream
Scenario #2:
Pointer already assigned to 'one': 2 bytes (1 pointer)
Pointer already assigned to 'the': 2 bytes (1 pointer)
Pointer already assigned to 'less': 2 bytes (1 pointer)
Effectively encoded sequence for 'nonetheless': 7 bytes (1 literal + 3 pointers = 1 + 3*2 bytes)
Effectively encoded sequence for 'Nonetheless': 7 bytes (1 literal + 3 pointers)
In total: 3 dictionary entries used; 7+7 bytes in the encoded stream
Achieving a (7+7)/(11+11) = 64% ratio (in fact worse, due to the Header & ID TAG overhead) is next to nothing, so the secondary objective is to assign pointers to the LONGest & most FREQUENT non-compound matches.
And not to be forgotten: the wordlist approach is only the first (illustrative) step toward dealing with ALL building blocks (3..31 bytes).
Obviously (as in the next two files) all words are lowercased, i.e. the real number of distinct words is much bigger, which means relying on words alone is a lost cause for now:
206,908,949 byte file OSHO.TXT: word (1..31 chars, lowercased) count 31,957,006, of them 58,893 distinct
223,674,511,360 byte file wikipedia-en-html.tar: word (1..31 chars, lowercased) count 30,974,750,142, of them 12,561,874 distinct
The number of ALL building blocks (3..31 bytes) in a string/file of length STRlen >= 31 bytes is:
NofBB = (STRlen-3+1) + (STRlen-4+1) + ... + (STRlen-31+1) = 29*STRlen - (3+4+...+31) + 29 = 29*STRlen - 464
Here
Min_Match_Threshold = 3 chars
Max_Match_Threshold = 31 chars
which gives:
NofBB = 29*31 - 464 = 435 for string#3
NofBB = 29*129 - 464 = 3,277 for string#4
NofBB = 29*185 - 464 = 4,901 for string#5
NofBB = 29*206,908,949 - 464 = 6,000,359,057 for OSHO.TXT
NofBB = 29*223,674,511,360 - 464 = 6,486,560,828,976 for wikipedia-en-html.tar
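A brute-force self-check of the formula, as a sketch in C (the function name nofbb is mine):

#include <stdio.h>
#include <stdint.h>

/* Count all building blocks (substrings) of length 3..31 in a string of
   STRlen >= 31 bytes by direct summation; must equal 29*STRlen - 464. */
static uint64_t nofbb(uint64_t strlen_bytes)
{
    uint64_t n = 0;
    for (uint64_t len = 3; len <= 31; len++)
        n += strlen_bytes - len + 1;
    return n;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)nofbb(31));              /* 435: string#3        */
    printf("%llu\n", (unsigned long long)nofbb(129));             /* 3,277: string#4      */
    printf("%llu\n", (unsigned long long)nofbb(185));             /* 4,901: string#5      */
    printf("%llu\n", (unsigned long long)nofbb(206908949ULL));    /* OSHO.TXT             */
    printf("%llu\n", (unsigned long long)nofbb(223674511360ULL)); /* wikipedia-en-html.tar */
    return 0;
}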
All of the above was intended also as background for expressing my life-long dream: creating a corpus of multi-million high-quality sentences, each followed by its source, in the manner of string#4 and string#5.
Being a fan of brute-force approaches, a SIMILAR-STATIC-2-bytes-HASH search attracts me with the gravity of hash, ha-ha; in short:
129 bytes string#4 becomes a 27*2 bytes sequence
185 bytes string#5 becomes a 29*2 bytes sequence
where each word (lowercased) is hashed to 2 bytes; in this type of text the wordlist varies from 600,000 to 900,000 words in size, i.e. on average some 9..14 words per 16-bit slot, a collision problem.
This kind of searching will be useful (in the majority of cases) only for words (disregarding all non-alpha chars), especially when combined with wildcards. An example: grip++hand would reveal the needed preposition after 'grip', where '+' stands for a whole-word wildcard, namely: ... a grip on his hand ...
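A minimal sketch of such a 2-byte word hash in C; the idea above only fixes the 16-bit output size, so the choice of FNV-1a folded to 16 bits is my assumption:

#include <stdint.h>
#include <ctype.h>

/* Hash a word to 2 bytes, lowercasing on the fly; with 600,000..900,000
   distinct words and only 65,536 slots, collisions are unavoidable and
   must be resolved against the wordlist. */
static uint16_t word_hash16(const char *word, int len)
{
    uint32_t h = 2166136261u;                      /* FNV-1a offset basis */
    for (int i = 0; i < len; i++) {
        h ^= (uint8_t)tolower((unsigned char)word[i]);
        h *= 16777619u;                            /* FNV-1a prime        */
    }
    return (uint16_t)(h ^ (h >> 16));              /* fold 32 -> 16 bits  */
}

With such a hash, the 27 words of string#4 become the 27*2-byte sequence mentioned above.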
▪ The basis/motto/inspiration is derived from a 2200+ year old book, the Tao Te Ching; many renderings of the same line follow:
If you wish to shrink it, ~ You must certainly stretch it.
What is in the end to be shrunken, ~ Begins by being first stretched out.
If you would have a thing shrink, ~ You must first stretch it.
In order to deplete it, ~ It must be thoroughly extended.
In order to contract it, it is necessary to expand it for the time being.
That which shrinks ~ Must first expand.
In order to fold, one must first unfold.
If it wants to shrink something, it always first expands it.
When Nature is about to withhold a thing it is first sure to increase it.
When one is about to take an inspiration, he is sure to make a (previous) expiration.
To gather ~ you must scatter.
That which shall contract ~ Must have long expanded.
What is to be reduced, ~ Must first be expanded.
In order to dwindle [an enemy], a predator would induce the targeted victim to bloat up his illusory assessment of himself.
In order to shrink it, it must first be stretched out.
In order to reduce it, first expand it.
What is in the end to be shrunk ~ Must first be stretched.
If you would like to gather him in, you must resolve yourself to let him aggrandize himself.
He who is to be made to dwindle (in power) ~ Must first be caused to expand.
When you wish to contract something, ~ you must momentarily expand it.
That which is to be shrunk must first be stretched out.
Before contracting [an object], let it expand.
What is ultimately to be compressed must first be expanded.
In order to reduce something, it must first be expanded.
What is going to be diminished ~ Must first be allowed to inflate.
If you want a thing to contract, ~ You should stretch it first.
To gather things, one must first disseminate them.
To compress something, one must first prop it.
In order to contract a thing, one should surely expand it first.
About to shut it, let it first be opened.
When you want to shrink something, ~ you must always enlarge it.
If it wants to close anything, surely it will first open it.
If you want to contract something, you must first stretch it.
That which is to be condensed must first be dispersed.
Wishing to restrict anything, ~ One must first expand it.
When you want to shrink it, ~ You must first stretch it.
When you want to constrict something, ~ You must first let it expand.
What is to be contracted may need to be expanded.
In order to contract, ~ It is necessary first to expand.
If one wishes to shrink it ~ One must first expand it.
What you want shrunk ~ Must first be allowed to expand.
What is ultimately to be reduced ~ must first be expanded.
He who feels punctured ~ Must once have been a bubble.
Should you want to contain something, ~ you must deliberately let it expand.
That which will be shrunk, ~ Must first be already stretched.
That which is about to contract ~ has surely been expanded.
What you would shorten you should therefore lengthen.
If you wish to contract, you should first expand.
The beginning of contraction necessarily follows the maximum of expansion.
When about to inhale it is certainly necessary to open the mouth.
What you want to compress ~ you must first allow truly to expand.
If there is contraction, then before there was expansion.
Before compressing anything, one should first stretch it.
Wishing to compress something, first stretch it.
If you want to compress something, stretch it first.
To compress something, it is necessary first to expand it.
If you strive to compress, you must stretch greatly.
That which is compressed, expands.
If you wish to compress, ~ You must first stretch.
▪ It will be a Dictionary Method for sure, encoding all Latin-letters words longer than 2 chars (bytes) as 2 bytes;
▪ The Header overhead (23 bytes + a 256KB pointer table + a 256KB..2048KB dictionary) is frightening, but only for sub-4MB files, which is not the target case;
▪ There will be no block size whatsoever;
▪ I don't know what this approach is called;
▪ PROs: bit operations are used only partially; in practice the dictionary stays under 1MB;
▪ CONs: a lot, but irrelevant;
▪ Leprechaun's ability to quickly create the entire dictionary of needed possible matches in memory will be used as a skeleton;
▪ Regarding performance I care about (in that order): decompression speed, followed by compression ratio, and lastly compression speed;
▪ Of course, implementing this kind of static dictionary with Latin-letters words as possible matches is only the first step toward a TOTAL-THOROUGH search for possible matches, executed on all possible string combinations, not mere Latin-letters words;
▪ Due to the huge (trillions) number of possible matches, Simplicissimus will be full-fledged only as a 64-bit console tool;
▪ The name Simplicissimus comes from Grimmelshausen's seventeenth-century German classic, in which a simpleton becomes, after many adventures, an experienced real boy, similar to Collodi's Pinocchio but in a more brutal manner;
▪ It's time to wake up: 48,000MB/s copy performance on a desktop PC is here [AMD Magny-Cours].