
Thread: Dummy Static [Windowless] Dictionary Text Decompressor

  1. #1
    Sanmayce (Member, Sofia)

    Dummy Static [Windowless] Dictionary Text Decompressor

    Enter-the-Simplicius, or some draft ideas for a new fast, low-compression-ratio TEXT [de]compressor: Simplicissimus

    Just continuing the paused suggestion from post #21929.
    When burst-read is 600+[+]MB/s, I need a weak (at least 50% compression ratio) but ultrafast (not bottlenecking the burst-read) ASCII text decompressor, in order to upload/traverse gigabytes of texts even more rapidly.
    Still refusing to learn the compression craft, I don't feel capable of writing even a simple compressor targeted only at 4++MB ASCII texts,
    but these two little facts cannot stop me from sharing some ideas of mine.
    In the near future I intend to pursue/ride the following bullets/thoughts:

    The Sliding Window Dictionary approach, i.e. LZ, remains parallel (a wingman, not a rival) with these settings (similar to Encode's ones from his LZSS):
    Window_size(Match_Offset+Match_Length) = 24bits i.e. 3bytes
    Dictionary_size(Match_Offset) = 19bits i.e. 512KB; for reference (post #21822), the very useful tests on OSHO.TXT (206,908,949 bytes) by Encode step in:
    128KB/66,529,981 nah!
    256KB/61,249,279 nah!
    512KB/56,888,532 a BONBON!
    1MB/55,270,170 nah!
    Look-Ahead-Buffer_size(Match_Length) = 5bits; in my opinion 6bits (i.e. 35-66 bytes long matches) are too rare in English texts to be worth the additional bit in the 18:6 vs 19:5 trade-off
    Min_Match_Threshold = 3 chars
    Max_Match_Threshold = 31+3(0,1,2)=34 chars
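
    To make the 24-bit Match-Pair concrete, here is a minimal pack/unpack sketch for the 19:5 layout above (the function names and the exact bit placement are my assumptions, not a settled format):
    Code:
    #include <stdint.h>

    /* Assumed token layout: bits 23..5 = Match_Offset (19 bits, 0..524287),
       bits 4..0 = Match_Length-3 (5 bits, so real lengths 3..34). */
    static void pack_match(uint8_t out[3], uint32_t offset, uint32_t length)
    {
        uint32_t token = (offset << 5) | (length - 3);
        out[0] = (uint8_t)(token >> 16);
        out[1] = (uint8_t)(token >> 8);
        out[2] = (uint8_t)token;
    }

    static void unpack_match(const uint8_t in[3], uint32_t *offset, uint32_t *length)
    {
        uint32_t token = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
        *offset = token >> 5;
        *length = (token & 31) + 3;
    }
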
    Format of .Simplicissimus file:
    [Header:]
    23bytes: Filetype:Simplicissimus
    256KBytes: Pointer0000 4bytes, ... PointerFFFF 4bytes
    xKBytes: Match0000Length 1byte, Match0000String 3..31bytes, ... MatchFFFFLength 1byte, MatchFFFFString 3..31bytes
    Where (MinDictionaryLen=65536*(3+1)=256KBytes) <= xKBytes <= (MaxDictionaryLen=65536*(31+1)=2048KBytes)
    [Encoded sequence:]
    8bytes: ID TAG, each bit used as a flag, 0/1 for LITERAL/NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR respectively; a big/little endian issue remains to be solved here
    Sequence of 64 units, each either 1byte or 2bytes: LITERAL or NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR
    ...
    8bytes: ID TAG
    Sequence of <=64 units, each either 1byte or 2bytes: LITERAL or NUMBER_of_POINTER_to_MatchLength-MatchString_PAIR
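
    A minimal decode-loop sketch for the format above, under my own assumptions (all names are hypothetical, the ID TAG is assembled LSB-first to dodge the endianness question, and bounds checks are elided):
    Code:
    #include <stdint.h>
    #include <string.h>

    /* dict_ptr[i] = offset of Match-i inside the dictionary blob 'dict':
       1 MatchLength byte followed by 3..31 MatchString bytes (per [Header]). */
    static size_t decode_block(const uint8_t *in, size_t in_len,
                               const uint32_t dict_ptr[65536],
                               const uint8_t *dict, uint8_t *out)
    {
        size_t ip = 0, op = 0;
        while (ip + 8 <= in_len) {
            uint64_t tag = 0;                       /* 8-byte ID TAG, LSB-first */
            for (int b = 0; b < 8; b++) tag |= (uint64_t)in[ip++] << (8 * b);
            for (int u = 0; u < 64 && ip < in_len; u++) {
                if (tag & ((uint64_t)1 << u)) {     /* 2-byte pointer unit */
                    uint16_t num = (uint16_t)(in[ip] | (in[ip + 1] << 8));
                    ip += 2;
                    const uint8_t *m = dict + dict_ptr[num];
                    memcpy(out + op, m + 1, m[0]);  /* m[0] = MatchLength */
                    op += m[0];
                } else {                            /* 1-byte literal unit */
                    out[op++] = in[ip++];
                }
            }
        }
        return op;                                  /* decompressed size */
    }
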
    Some attempts:
    11 bytes string#1: nonetheless
    11 bytes string#2: Nonetheless
    31 bytes string#3: dichlorodiphenyltrichloroethane
    129 bytes string#4: Susie grabs her man and puts a grip on his hand as the rain puts a tear in his eye. /A line from 'Hurts - Wonderful Life' lyrics/
    185 bytes string#5: Susanna: In the Apocrypha, a captive in Babylon who was falsely accused of adultery and was rescued by Daniel. /The American Heritage Dictionary of the English Language, Fourth Edition/

    Assigning pointers is the core (the 'tobeornottobe' trade-off: longer matches are the goal, but they must not be compound, otherwise they rapidly deplete the free dictionary entries, leaving no room for the basic building bricks (non-compound matches) to be encoded as well):
    Scenario #1:
    Shortest encoded sequence for 'nonetheless': 2bytes(1 pointer)
    Longest encoded sequence for 'nonetheless': 11bytes(11 literals)
    Shortest encoded sequence for 'Nonetheless': 2bytes(1 pointer)
    Longest encoded sequence for 'Nonetheless': 11bytes(11 literals)
    In total: 2 dictionary entries used; 2+2 bytes in encoded stream
    Scenario #2:
    Pointer assigned already to 'one': 2bytes(1 pointer)
    Pointer assigned already to 'the': 2bytes(1 pointer)
    Pointer assigned already to 'less': 2bytes(1 pointer)
    Effectively encoded sequence for 'nonetheless': 7bytes(1 literal 3 pointers)
    Effectively encoded sequence for 'Nonetheless': 7bytes(1 literal 3 pointers)
    In total: 3 dictionary entries used; 7+7 bytes in encoded stream
    Achieving a (7+7)/(11+11) = 63.6% ratio (in fact worse, due to the HEADER & ID TAG overhead) is next-to-nothing, so the secondary objective is to assign pointers to the LONGest & most FREQUENT non-compound matches.

    And not to be forgotten: the wordlist approach is only the first (illustrative) step toward dealing with ALL building blocks (3..31 bytes).
    Obviously (as in the next two files) all words are lowercased, i.e. the real number is much bigger, which means relying on words alone is a lost cause for now:
    206,908,949 bytes file: OSHO.TXT with word(1..31chars lowercased each) count: 31,957,006 of them 58,893 distinct
    223,674,511,360 bytes file: wikipedia-en-html.tar with word(1..31chars lowercased each) count: 30,974,750,142 of them 12,561,874 distinct

    Number of ALL building blocks (3..31 bytes) in a string/file with length STRlen = 31[+] bytes is:
    NofBB = (STRlen-3+1)+(STRlen-4+1)+...+(STRlen-31+1) = (31-3+1)*STRlen-(31*32/2-2*3/2)+(31-3+1) = 29*STRlen-464
    Here
    Min_Match_Threshold = 3 chars
    Max_Match_Threshold = 31 chars
    which gives:
    NofBB = 29*31-464 = 435 for string#3
    NofBB = 29*129-464 = 3,277 for string#4
    NofBB = 29*185-464 = 4,901 for string#5
    NofBB = 29*206,908,949-464 = 6,000,359,057 for OSHO.TXT
    NofBB = 29*223,674,511,360-464 = 6,486,560,828,976 for wikipedia-en-html.tar

    All the above was intended also as background to express my life-long dream: creating a corpus of Multi-Million High-Quality-Sentences, each followed by its source, in the manner of string#4 and string#5.
    Being a fan of brute-force approaches, a SIMILAR-STATIC-2-bytes-HASH search attracts me with the gravity of hash, ha-ha; in short:
    129 bytes string#4 becomes a 27*2 bytes sequence
    185 bytes string#5 becomes a 29*2 bytes sequence
    where each word (lowercased) is hashed to 2 bytes; in this type of texts the wordlist varies from 600,000 to 900,000 words in size, i.e. roughly 10-14:1 collisions, a problem.
    This kind of searching will be useful (in the majority of cases) only for words (disregarding all non-alpha chars), especially when combined with wildcards. An example: grip++hand would reveal the needed preposition after 'grip', where '+' stands for a whole-word wildcard, namely: ... a grip on his hand ...
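
    For illustration, a minimal sketch of such a 2-byte word hash (the concrete mixer, FNV-1a folded to 16 bits, is my placeholder choice, not anything prescribed above):
    Code:
    #include <stdint.h>
    #include <stddef.h>
    #include <ctype.h>

    /* Hash a word (lowercased on the fly) to 2 bytes; with 600,000..900,000
       distinct words crammed into 65536 slots, some 10-14 words share each code. */
    static uint16_t word_hash16(const char *word, size_t len)
    {
        uint32_t h = 2166136261u;                    /* FNV-1a offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)tolower((unsigned char)word[i]);
            h *= 16777619u;                          /* FNV prime */
        }
        return (uint16_t)(h ^ (h >> 16));            /* fold 32 -> 16 bits */
    }
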
    The basis/motto/inspiration derives from a 2200+-year-old book (Lao Tzu's Tao Te Ching):
    If you wish to shrink it, ~ You must certainly stretch it.
    What is in the end to be shrunken, ~ Begins by being first stretched out.
    If you would have a thing shrink, ~ You must first stretch it.
    In order to deplete it, ~ It must be thoroughly extended.
    In order to contract it, it is necessary to expand it for the time being.
    That which shrinks ~ Must first expand.
    In order to fold, one must first unfold.
    If it wants to shrink something, it always first expands it.
    When Nature is about to withhold a thing it is first sure to increase it.
    When one is about to take an inspiration, he is sure to make a (previous) expiration.
    To gather ~ you must scatter.
    That which shall contract ~ Must have long expanded.
    What is to be reduced, ~ Must first be expanded.
    In order to dwindle [an enemy], a predator would induce the targeted victim to bloat up his illusory assessment of himself.
    In order to shrink it, it must first be stretched out.
    In order to reduce it, first expand it.
    What is in the end to be shrunk ~ Must first be stretched.
    If you would like to gather him in, you must resolve yourself to let him aggrandize himself.
    He who is to be made to dwindle (in power) ~ Must first be caused to expand.
    When you wish to contract something, ~ you must momentarily expand it.
    That which is to be shrunk must first be stretched out.
    Before contracting [an object], let it expand.
    What is ultimately to be compressed must first be expanded.
    In order to reduce something, it must first be expanded.
    What is going to be diminished ~ Must first be allowed to inflate.
    If you want a thing to contract, ~ You should stretch it first.
    To gather things, one must first disseminate them.
    To compress something, one must first prop it.
    In order to contract a thing, one should surely expand it first.
    About to shut it, let it first be opened.
    When you want to shrink something, ~ you must always enlarge it.
    If it wants to close anything, surely it will first open it.
    If you want to contract something, you must first stretch it.
    That which is to be condensed must first be dispersed.
    Wishing to restrict anything, ~ One must first expand it.
    When you want to shrink it, ~ You must first stretch it.
    When you want to constrict something, ~ You must first let it expand.
    What is to be contracted may need to be expanded.
    In order to contract, ~ It is necessary first to expand.
    If one wishes to shrink it ~ One must first expand it.
    What you want shrunk ~ Must first be allowed to expand.
    What is ultimately to be reduced ~ must first be expanded.
    He who feels punctured ~ Must once have been a bubble.
    Should you want to contain something, ~ you must deliberately let it expand.
    That which will be shrunk, ~ Must first be already stretched.
    That which is about to contract ~ has surely been expanded.
    What you would shorten you should therefore lengthen.
    If you wish to contract, you should first expand.
    The beginning of contraction necessarily follows the maximum of expansion.
    When about to inhale it is certainly necessary to open the mouth.
    What you want to compress ~ you must first allow truly to expand.
    If there is contraction, then before there was expansion.
    Before compressing anything, one should first stretch it.
    Wishing to compress something, first stretch it.
    If you want to compress something, stretch it beforehand.
    To compress something, it is necessary first to expand it.
    If you strive to compress, you must stretch it strongly.
    That which compresses, expands.
    If you want to compress, ~ First you must stretch.
    It will be a Dictionary Method for sure, encoding all Latin-Letters-Words longer than 2 chars (bytes) as 2 bytes;
    The Header overhead is frightening, but only for sub-4MB files, which is not the case here;
    There will be no block size whatsoever;
    I don't know what this approach is called;
    PROs: bit operations are used only partially; the dictionary is under 1MB in practice;
    CONs: a lot, but irrelevant;
    Using Leprechaun's ability to create the entire dictionary of needed possible-matches quickly in memory as a skeleton;
    Regarding performance I care for (in that order): decompression speed, followed by compression ratio, and lastly compression speed;
    Of course, implementing this kind of static dictionary using Latin-Letters-Words as possible-matches is only the first step toward a TOTAL-THOROUGH search for possible-matches, executed on all possible string combinations, not mere Latin-Letters-Words;
    Due to the huge (trillions) number of possible-matches, Simplicissimus will be full-fledged only as a 64bit console tool;
    The name Simplicissimus comes from Grimmelshausen's seventeenth-century German classic, in which a simpleton becomes, after many adventures, an experienced real boy, similar to Collodi's Pinocchio but in a more brutal manner;
    It's time to wake up: 48000MB/s Copy performance on a desktop PC is here [AMD Magny Cours].
    Last edited by Sanmayce; 30th September 2010 at 19:53.

  2. #2
    Sanmayce (Member, Sofia)
    In order to estimate (not calculate) the Max_Match_Threshold value, it is appropriate to examine the 'Repetitive' values (the last column); in this case (Min_Match_Threshold=3) the candidates are 4bits (15+3), 5bits (31+3), 6bits (63+3), and why not 3bits (7+3).

    For example, here are listed the DISTINCT (with overlapping) Building-Blocks 3, 4, 5 chars in size for the string (18 bytes): bxr_bxr_boban_dodo
    3 chars|18-3+1=16 All|13 Distinct|3 Repetitive:
    _bo
    _bx
    _do
    an_
    ban
    bob
    bxr
    dod
    n_d
    oba
    odo
    r_b
    xr_

    4 chars|18-4+1=15 All|13 Distinct|2 Repetitive:
    _bob
    _bxr
    _dod
    an_d
    ban_
    boba
    bxr_
    dodo
    n_do
    oban
    r_bo
    r_bx
    xr_b

    5 chars|18-5+1=14 All|13 Distinct|1 Repetitive:
    _boba
    _bxr_
    _dodo
    an_do
    ban_d
    boban
    bxr_b
    n_dod
    oban_
    r_bob
    r_bxr
    xr_bo
    xr_bx
    The following values are derived from OSHO.TXT with the aid of Building-Blocks_DUMPER.c tool:

    [Table: ALL/DISTINCT/REPETITIVE Building-Block counts for OSHO.TXT; the values appear in the LOG below.] Note: REPETITIVE = ALL - DISTINCT

    The following LOG is used to fill the above table:
    Code:
    C:\WorkTemp\LEPREC~1\VISUAL~1\LEA56D~1>cl /Ox Building-Blocks_DUMPER.c
    Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
    Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.
    
    Building-Blocks_DUMPER.c
    Microsoft (R) Incremental Linker Version 7.10.3077
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    /out:Building-Blocks_DUMPER.exe
    Building-Blocks_DUMPER.obj
    
    C:\WorkTemp\LEPREC~1\VISUAL~1\LEA56D~1>Building-Blocks_DUMPER.exe
    Building-Blocks_DUMPER rev.1, written by Kaze.
    Sorting 206908947 Pointers to Building-Blocks 3 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB003.txt ...
    3|206908949-3+1|46486|206862461
    Sorting 206908946 Pointers to Building-Blocks 4 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB004.txt ...
    4|206908949-4+1|248019|206660927
    Sorting 206908945 Pointers to Building-Blocks 5 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB005.txt ...
    5|206908949-5+1|855682|206053263
    Sorting 206908944 Pointers to Building-Blocks 6 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB006.txt ...
    6|206908949-6+1|2236138|204672806
    Sorting 206908943 Pointers to Building-Blocks 7 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB007.txt ...
    7|206908949-7+1|4803152|202105791
    Sorting 206908942 Pointers to Building-Blocks 8 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB008.txt ...
    8|206908949-8+1|8956496|197952446
    Sorting 206908941 Pointers to Building-Blocks 9 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB009.txt ...
    9|206908949-9+1|15006172|191902769
    Sorting 206908940 Pointers to Building-Blocks 10 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB010.txt ...
    10|206908949-10+1|22992127|183916813
    Sorting 206908939 Pointers to Building-Blocks 11 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB011.txt ...
    11|206908949-11+1|32707519|174201420
    Sorting 206908938 Pointers to Building-Blocks 12 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB012.txt ...
    12|206908949-12+1|43802365|163106573
    Sorting 206908937 Pointers to Building-Blocks 13 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB013.txt ...
    13|206908949-13+1|55802446|151106491
    Sorting 206908936 Pointers to Building-Blocks 14 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB014.txt ...
    14|206908949-14+1|68180356|138728580
    Sorting 206908935 Pointers to Building-Blocks 15 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB015.txt ...
    15|206908949-15+1|80437731|126471204
    Sorting 206908934 Pointers to Building-Blocks 16 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB016.txt ...
    16|206908949-16+1|92201533|114707401
    Sorting 206908933 Pointers to Building-Blocks 17 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB017.txt ...
    17|206908949-17+1|103270390|103638543
    Sorting 206908932 Pointers to Building-Blocks 18 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB018.txt ...
    18|206908949-18+1|113558215|93350717
    Sorting 206908931 Pointers to Building-Blocks 19 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB019.txt ...
    19|206908949-19+1|123049098|83859833
    Sorting 206908930 Pointers to Building-Blocks 20 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB020.txt ...
    20|206908949-20+1|131787864|75121066
    Sorting 206908929 Pointers to Building-Blocks 21 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB021.txt ...
    21|206908949-21+1|139825520|67083409
    Sorting 206908928 Pointers to Building-Blocks 22 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB022.txt ...
    22|206908949-22+1|147188899|59720029
    Sorting 206908927 Pointers to Building-Blocks 23 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB023.txt ...
    23|206908949-23+1|153910191|52998736
    Sorting 206908926 Pointers to Building-Blocks 24 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB024.txt ...
    24|206908949-24+1|160000889|46908037
    Sorting 206908925 Pointers to Building-Blocks 25 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB025.txt ...
    25|206908949-25+1|165466370|41442555
    Sorting 206908924 Pointers to Building-Blocks 26 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB026.txt ...
    26|206908949-26+1|170311355|36597569
    Sorting 206908923 Pointers to Building-Blocks 27 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB027.txt ...
    27|206908949-27+1|174546865|32362058
    Sorting 206908922 Pointers to Building-Blocks 28 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB028.txt ...
    28|206908949-28+1|178205308|28703614
    Sorting 206908921 Pointers to Building-Blocks 29 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB029.txt ...
    29|206908949-29+1|181327130|25581791
    Sorting 206908920 Pointers to Building-Blocks 30 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB030.txt ...
    30|206908949-30+1|183959024|22949896
    Sorting 206908919 Pointers to Building-Blocks 31 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB031.txt ...
    31|206908949-31+1|186161003|20747916
    Sorting 206908918 Pointers to Building-Blocks 32 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB032.txt ...
    32|206908949-32+1|187995939|18912979
    Sorting 206908917 Pointers to Building-Blocks 33 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB033.txt ...
    33|206908949-33+1|189519921|17388996
    Sorting 206908916 Pointers to Building-Blocks 34 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB034.txt ...
    34|206908949-34+1|190780112|16128804
    Sorting 206908915 Pointers to Building-Blocks 35 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB035.txt ...
    35|206908949-35+1|191819039|15089876
    Sorting 206908914 Pointers to Building-Blocks 36 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB036.txt ...
    36|206908949-36+1|192676075|14232839
    Sorting 206908913 Pointers to Building-Blocks 37 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB037.txt ...
    37|206908949-37+1|193385742|13523171
    Sorting 206908912 Pointers to Building-Blocks 38 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB038.txt ...
    38|206908949-38+1|193975536|12933376
    Sorting 206908911 Pointers to Building-Blocks 39 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB039.txt ...
    39|206908949-39+1|194471152|12437759
    Sorting 206908910 Pointers to Building-Blocks 40 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB040.txt ...
    40|206908949-40+1|194890268|12018642
    Sorting 206908909 Pointers to Building-Blocks 41 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB041.txt ...
    41|206908949-41+1|195247898|11661011
    Sorting 206908908 Pointers to Building-Blocks 42 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB042.txt ...
    42|206908949-42+1|195555647|11353261
    Sorting 206908907 Pointers to Building-Blocks 43 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB043.txt ...
    43|206908949-43+1|195822384|11086523
    Sorting 206908906 Pointers to Building-Blocks 44 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB044.txt ...
    44|206908949-44+1|196055530|10853376
    Sorting 206908905 Pointers to Building-Blocks 45 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB045.txt ...
    45|206908949-45+1|196261553|10647352
    Sorting 206908904 Pointers to Building-Blocks 46 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB046.txt ...
    46|206908949-46+1|196445637|10463267
    Sorting 206908903 Pointers to Building-Blocks 47 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB047.txt ...
    47|206908949-47+1|196612126|10296777
    Sorting 206908902 Pointers to Building-Blocks 48 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB048.txt ...
    48|206908949-48+1|196764219|10144683
    Sorting 206908901 Pointers to Building-Blocks 49 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB049.txt ...
    49|206908949-49+1|196904845|10004056
    Sorting 206908900 Pointers to Building-Blocks 50 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB050.txt ...
    50|206908949-50+1|197036167|9872733
    Sorting 206908899 Pointers to Building-Blocks 51 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB051.txt ...
    51|206908949-51+1|197159567|9749332
    Sorting 206908898 Pointers to Building-Blocks 52 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB052.txt ...
    52|206908949-52+1|197276991|9631907
    Sorting 206908897 Pointers to Building-Blocks 53 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB053.txt ...
    53|206908949-53+1|197389631|9519266
    Sorting 206908896 Pointers to Building-Blocks 54 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB054.txt ...
    54|206908949-54+1|197498154|9410742
    Sorting 206908895 Pointers to Building-Blocks 55 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB055.txt ...
    55|206908949-55+1|197603104|9305791
    Sorting 206908894 Pointers to Building-Blocks 56 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB056.txt ...
    56|206908949-56+1|197704831|9204063
    Sorting 206908893 Pointers to Building-Blocks 57 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB057.txt ...
    57|206908949-57+1|197803647|9105246
    Sorting 206908892 Pointers to Building-Blocks 58 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB058.txt ...
    58|206908949-58+1|197899903|9008989
    Sorting 206908891 Pointers to Building-Blocks 59 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB059.txt ...
    59|206908949-59+1|197993797|8915094
    Sorting 206908890 Pointers to Building-Blocks 60 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB060.txt ...
    60|206908949-60+1|198085857|8823033
    Sorting 206908889 Pointers to Building-Blocks 61 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB061.txt ...
    61|206908949-61+1|198177236|8731653
    Sorting 206908888 Pointers to Building-Blocks 62 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB062.txt ...
    62|206908949-62+1|198268384|8640504
    Sorting 206908887 Pointers to Building-Blocks 63 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB063.txt ...
    63|206908949-63+1|198359397|8549490
    Sorting 206908886 Pointers to Building-Blocks 64 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB064.txt ...
    64|206908949-64+1|198450573|8458313
    Sorting 206908885 Pointers to Building-Blocks 65 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB065.txt ...
    65|206908949-65+1|198541382|8367503
    Sorting 206908884 Pointers to Building-Blocks 66 chars in size ...
    Allocated memory for pointers-to-words in MB: 790
    Writing Sorted Building-Blocks to BB066.txt ...
    66|206908949-66+1|198631486|8277398
    
    Building-Blocks_DUMPER total time: 13954860 clocks
    
    C:\WorkTemp\LEPREC~1\VISUAL~1\LEA56D~1>
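
    For the record, the ALL/DISTINCT counting can be done just as the LOG suggests: sort pointers to every Building-Block of size K, then count adjacent distinct runs. A minimal sketch (my reconstruction, not the actual Building-Blocks_DUMPER.c):
    Code:
    #include <stdlib.h>
    #include <string.h>

    static size_t K;                      /* current Building-Block size, 3..66 */

    static int cmp_bb(const void *a, const void *b)
    {
        return memcmp(*(const char **)a, *(const char **)b, K);
    }

    /* Returns DISTINCT; ALL = len-k+1, REPETITIVE = ALL - DISTINCT. */
    static size_t count_distinct(const char *text, size_t len, size_t k)
    {
        size_t all = len - k + 1, distinct = 1;
        const char **p = malloc(all * sizeof *p);
        for (size_t i = 0; i < all; i++) p[i] = text + i;
        K = k;
        qsort(p, all, sizeof *p, cmp_bb);
        for (size_t i = 1; i < all; i++)
            if (memcmp(p[i - 1], p[i], k) != 0) distinct++;
        free(p);
        return distinct;
    }
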
    Looking at the table, maybe the LZ window 19:5 is the best pair for English texts after all.
    Max_Match_Threshold = 18(4bits(15+3)) has 93,350,717 Repetitive(with overlapping) Building-Blocks.
    Max_Match_Threshold = 34(5bits(31+3)) has 16,128,804 Repetitive(with overlapping) Building-Blocks.
    Max_Match_Threshold = 66(6bits(63+3)) has 8,277,398 Repetitive(with overlapping) Building-Blocks.

    I wonder how a 2-byte (13:3) window, i.e. 3..10 Match_Lengths (the purple values in the table), would rank, though.

    I do not understand why the term 'OPTIMAL parsing' is in use when the right term is 'NEAR-OPTIMAL parsing' or its synonym 'STATE-OF-THE-ART parsing'. Speaking of LZ, as far as I understand, 'OPTIMAL parsing' is not yet achieved; if I am mistaken, then I need 'SUPRA-OPTIMAL parsing' to come alive.

    The bottom of my text-processing-dreams stack is:
    - dream: fast ASCII English full-text (26GB (my Low-Quality Corpus) to ???GB in size) searching & analyzing;
    - sub-dream: ultrafast ASCII English text decompressing/traversing;
    - sub-sub-dream: to crunch 4++MB ASCII English text files to the limit while the decompression performance remains intact (as close as possible to the uncompressed approach) or is even boosted compared to pure copy, where the burst-read range is 200MB/s (mainstream SSD) to 26624+MB/s (cached in AMD Magny-Cours' quadruple-channel 32GB SYSTEM RAM).

    I keep my sights on uploading (if not residing in system RAM) my Low-Quality Corpus in a few seconds, ready for further brute-force processing. Many 'PRO' applications, by using all kinds of indexes (and thus eliminating the time/resource-consuming fulltext traversing), give 'results' in extremely short periods of time (less than a thousandth of a second), BUT at the cost of limiting/sacrificing what I need the most: total freedom in choosing fragment[s], or simply said: full control over the texts.

    Imagine a 'well'-written tool relying on some nifty compression scheme that dumbly chokes the speed (on computers with I/O speed >[>] decompression speed) while trying to boost the upload (in fact bottlenecking it, merely catching up with a pace faster than its own).
    I mean: a well-written uploader mustn't be unaware of whether the situation calls for decompression or not.

    As things currently go, the sub-netbooks, or so-called handhelds or tablets, are in my opinion the future hardware platform (regarding availability at any moment, i.e. their dimensions) for executing my ultimate text-dream: real-time sentence suggesting, which would help to combine words into phrases/sentences properly. That's why I am so fond of fast LZ decompressors with relatively low (not low at all, in my view) compression ratios.

    The roadmap is clear: from a weak uni-core CPU to a powerful multi-core CPU, of course with 16GB system RAM as a minimum. I think the worst drawback (the cause of slow-running tools) of nowadays' PCs is the minimalistic size of system RAM, not so much the GHz/number-of-cores demands. I would prefer a tablet PC (1GHz single-core CPU with single-channel 32GB system RAM) over a desktop PC (4GHz 12-core CPU with triple-channel 24GB system RAM), if only such tablets could quickly hit the mainstream; somewhere I have read about plans to manufacture some 2 billion of them (sadly with far less memory than my 'greedy' specification) by 2014.

    I don't have such a device but I am very interested in how fast these tiny gadgets operate.
    Last edited by Sanmayce; 5th October 2010 at 21:31.

  3. #3
    Sanmayce (Member, Sofia)
    An idea, borrowed from Encode's remark on NTFS, pushed further:
    a possible way to improve even more the effectiveness of Match-Pair choosing (floating/DYNAMIC Offset-Length splits within a STATIC-length window):
    just use 2 bits (instead of 1) as the flag:
    00 Literal
    01 Match-Pair 18:6 or 12:4
    10 Match-Pair 19:5 or 13:3
    11 Match-Pair 20:4 or 14:2
    grouped into one QWORD (8 bytes, i.e. one long long variable) housing 32 units; the next variant might be good:
    00 Literal
    01 Match-Pair 13:3 0+3..7+3
    10 Match-Pair 19:5 0+6..31+6
    11 Match-Pair 14:2 0+(7+3)+1..3+(7+3)+1

    And a little bit more stretched in a window-windowless hybrid:
    Using PointerNumber (2 bytes) for 65536 MatchStrings (x..255+(x) bytes), preceded by 65536 MatchLengths (1 byte each), forms the cap of the entire LZ compressed file.

    Min_Match_Threshold_Window = 5 chars i.e. 3->5 at lowest boundary
    Min_Match_Threshold_Windowless = 3 chars i.e. 2 bytes used to encode 3 bytes at lowest boundary
    Max_Match_Threshold_Windowless = 5 chars i.e. 2 bytes used to encode 5 bytes at biggest boundary

    00 Literal
    01 Match-Pair 18:6 utilizing long(31+(5)+1..63+(31+(5)+1)chars) Matches
    10 Match-Pair 19:5 utilizing medium(5..31+(5)chars) Matches
    11 PointerNumber utilizing short(3..5chars) Matches
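
    A sketch of peeling those 2-bit flags off one QWORD housing 32 units (LSB-first order is my assumption; here the units are only tallied and the actual decoders are omitted):
    Code:
    #include <stdint.h>

    /* Tally the unit kinds in one QWORD of flags: 32 units, 2 bits each.
       counts[] must be zeroed by the caller.
       counts[0]: Literals (1 byte each)
       counts[1]: long Match-Pairs   18:6
       counts[2]: medium Match-Pairs 19:5
       counts[3]: short matches via 2-byte PointerNumber into the CAP */
    static void classify_qword(uint64_t tag, unsigned counts[4])
    {
        for (int u = 0; u < 32; u++)
            counts[(tag >> (2 * u)) & 3]++;
    }
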

    Note1: I use the 'Match_Threshold' value as the lowest (inclusive) bound for MatchLengths to be encoded; maybe I use the term incorrectly, so it is more proper to add a supplemental definition: 'matches below this length are not encoded'. I don't know.
    Note2: The idea is to provide (from the CAP) a MatchString encoded in 2 bytes (not 3), i.e. when (if) MatchLengths overlap, the shorter sequence is used.
    Note3: Some key things I don't see clearly yet, so when two modes of operation are confronted (here: expressing 1000-1 meaningless things vs expressing nothing), I prefer the former, because it forms a strong base and inevitably expands the field of view; after all, 1>0. It is said: the smart guys learn from the errors of others, while the stupid guys learn from their own. My mode of communication encompasses both approaches: I hope others can benefit from my own errors/tries.

    I continue to think of LZSS and my dummy windowless decoder in a complementary way i.e. using common/similar schemes.

    Yesterday, as I was doing some slack digging on the INTERNET for LZ, some new and (for me) valuable pages popped up:

    ~ 'Speeding Up Pattern Matching by Text Compression', a PDF paper by some Japanese researchers from the Kyushu Institute of Technology. Thanks a lot, I am very excited. A quotation:
    "In this paper, we bring out a potential advantage of BPE compression.
    We show that it is very suitable from a practical view point of com-
    pressed pattern matching, where the goal is to find a pattern directly in
    compressed text without decompressing it explicitly. We compare run-
    ning times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-
    Welch compressed files, and (3) original text files, in various situations.
    Experimental results show that pattern matching in BPE compressed
    text is even faster than matching in the original text. Thus the BPE
    compression reduces not only the disk space but also the searching time."

    Link: http://www.i.kyushu-u.ac.jp/~takeda/papers/CIAC2000.pdf

    ~ LZ77/LZSS and derivatives area from Mark Nelson, very comprehensive and well described. Thank you.
    The 'Flexible Parsing (FP) - The Optimal Parsing for Dictionary Based Compression' topic looks interesting (still unread by me, though).
    The 'BPE - Speeding Up String Matching by Text Compression' topic says: "This paper describes a few searching algorithms to be used on compressed files. An AC type algorithm for searching in Huffman encoded files, a KMP type algorithm for LZ derivatives and a BM type algorithm for searching in files compressed by Byte-Pair-Encoding (BPE). BPE gave results upto 3 times faster than those achieved by agrep.".
    The 'Michael Dipperstein's LZSS Code Page' topic says: "Michael Dipperstein describes his personal quest for understanding and implementation of LZSS coding. Full source included.". Bravo-bravissimo!
    The 'McKee's Directed Acyclic Graph Compression' topic has a very tempting description; I will take a look for sure.
    The 'LZ77 the basics of compression (2nd ed.)' topic from Arturo Campos.
    The 'Flexible parsing' topic from Arturo Campos.
    The 'An Optimizing Hybrid LZ77 RLE Data Compression Program' topic from Pasi Ojala, sounds intriguing.
    Link: http://compression-links.info

    Thanks again to all above sharers.

    The LOG below shows: Canonical Huffman codes applied on single chars for OSHO.TXT, Traditional Huffman codes for the same file, and Michael Dipperstein's ANSI C implementation of Huffman coding and decoding applied on OSHO.TXT:
    Code:
    D:\_KAZE_NEW\tests\Dipperstein>dir
     Volume in drive D is H320_Vol5
     Volume Serial Number is 0CB3-C881
    
     Directory of D:\_KAZE_NEW\tests\Dipperstein
    
    10/07/2010  07:20 AM    <DIR>          .
    10/07/2010  07:20 AM    <DIR>          ..
    08/26/2007  02:20 PM            29,163 bitarray.c
    08/26/2007  02:20 PM             3,785 bitarray.h
    01/24/2008  11:03 PM            35,017 bitfile.c
    01/24/2008  11:03 PM             4,929 bitfile.h
    06/08/2008  02:09 PM            22,354 chuffman.c
    09/19/2007  08:20 PM            35,147 COPYING
    09/19/2007  08:20 PM             7,639 COPYING.LESSER
    10/05/2010  09:58 PM            53,807 huffman-0.81.zip
    09/19/2007  08:30 PM            19,190 huffman.c
    09/19/2007  08:30 PM             2,549 huffman.h
    09/19/2007  08:30 PM            11,396 huflocal.c
    09/19/2007  08:30 PM             4,954 huflocal.h
    09/19/2007  08:30 PM             2,037 Makefile
    09/04/2007  06:10 AM             8,705 optlist.c
    09/04/2007  06:10 AM             2,943 optlist.h
    07/23/2010  06:03 AM       206,908,949 OSHO.TXT
    06/08/2008  02:13 PM             4,869 README
    09/19/2007  08:30 PM             8,592 sample.c
                  18 File(s)    207,166,025 bytes
                   2 Dir(s)   4,841,537,536 bytes free
    
    D:\_KAZE_NEW\tests\Dipperstein>cl /Ox sample.c huffman.c huflocal.c optlist.c bi
    tarray.c bitfile.c chuffman.c
    Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
    Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.
    
    sample.c
    huffman.c
    d:\_KAZE_NEW\tests\Dipperstein\huflocal.h(101) : warning C4005: 'max' : macro re
    definition
            C:\Program Files\Microsoft Visual C++ Toolkit 2003\include\stdlib.h(424)
     : see previous definition of 'max'
    huflocal.c
    d:\_KAZE_NEW\tests\Dipperstein\huflocal.h(101) : warning C4005: 'max' : macro re
    definition
            C:\Program Files\Microsoft Visual C++ Toolkit 2003\include\stdlib.h(424)
     : see previous definition of 'max'
    optlist.c
    optlist.c(98) : warning C4028: formal parameter 1 different from declaration
    bitarray.c
    bitfile.c
    chuffman.c
    d:\_KAZE_NEW\tests\Dipperstein\huflocal.h(101) : warning C4005: 'max' : macro re
    definition
            C:\Program Files\Microsoft Visual C++ Toolkit 2003\include\stdlib.h(424)
     : see previous definition of 'max'
    Generating Code...
    Microsoft (R) Incremental Linker Version 7.10.3077
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    /out:sample.exe
    sample.obj
    huffman.obj
    huflocal.obj
    optlist.obj
    bitarray.obj
    bitfile.obj
    chuffman.obj
    
    D:\_KAZE_NEW\tests\Dipperstein>sample.exe -?
    Usage: sample <options>
    
    options:
      -C : Encode/Decode using a canonical code.
      -c : Encode input file to output file.
      -d : Decode input file to output file.
      -t : Generate code tree for input file to output file.
      -i<filename> : Name of input file.
      -o<filename> : Name of output file.
      -h|?  : Print out command line options.
    
    Default: huffman -t -ostdout
    
    D:\_KAZE_NEW\tests\Dipperstein>sample.exe -C -t -i OSHO.TXT -o OSHO.TXT.Canonical_Huffman_codes
    
    D:\_KAZE_NEW\tests\Dipperstein>sample.exe -c -t -i OSHO.TXT -o OSHO.TXT.Traditional_Huffman_codes
    
    D:\_KAZE_NEW\tests\Dipperstein>type OSHO.TXT.Canonical_Huffman_codes
    Char  CodeLen  Encoding
    ----- -------- ----------------
    0x0A  06       001111
    0x0C  12       000000001001
    0x0D  06       001110
    0x20  02       11
    0x21  11       00000001101
    0x22  09       000010001
    0x23  16       0000000000000001
    0x24  19       0000000000000000111
    0x25  22       0000000000000000000111
    0x26  19       0000000000000000110
    0x27  09       000010000
    0x28  15       000000000001001
    0x29  15       000000000001000
    0x2A  15       000000000000111
    0x2B  21       000000000000000001001
    0x2C  07       0001101
    0x2D  08       00001111
    0x2E  07       0001100
    0x2F  11       00000001100
    0x30  11       00000001011
    0x31  11       00000001010
    0x32  12       000000001000
    0x33  15       000000000000110
    0x34  12       000000000111
    0x35  14       00000000000101
    0x36  15       000000000000101
    0x37  12       000000000110
    0x38  12       000000000101
    0x39  11       00000001001
    0x3A  10       0000010011
    0x3B  10       0000010010
    0x3C  26       00000000000000000000000001
    0x3D  21       000000000000000001000
    0x3E  22       0000000000000000000110
    0x3F  10       0000010001
    0x40  22       0000000000000000000101
    0x41  08       00001110
    0x42  10       0000010000
    0x43  10       0000001111
    0x44  09       000001111
    0x45  08       00001101
    0x46  10       0000001110
    0x47  10       0000001101
    0x48  09       000001110
    0x49  08       00001100
    0x4A  11       00000001000
    0x4B  12       000000000100
    0x4C  10       0000001100
    0x4D  10       0000001011
    0x4E  09       000001101
    0x4F  08       00001011
    0x50  11       00000000111
    0x51  12       000000000011
    0x52  09       000001100
    0x53  09       000001011
    0x54  08       00001010
    0x55  10       0000001010
    0x56  11       00000000110
    0x57  10       0000001001
    0x58  15       000000000000100
    0x59  09       000001010
    0x5A  13       0000000000011
    0x5B  15       000000000000011
    0x5C  22       0000000000000000000100
    0x5D  15       000000000000010
    0x5E  24       000000000000000000001001
    0x5F  20       00000000000000000101
    0x60  15       000000000000001
    0x61  04       1011
    0x62  07       0001011
    0x63  06       001101
    0x64  06       001100
    0x65  04       1010
    0x66  07       0001010
    0x67  06       001011
    0x68  05       01011
    0x69  04       1001
    0x6A  10       0000001000
    0x6B  08       00001001
    0x6C  05       01010
    0x6D  06       001010
    0x6E  04       1000
    0x6F  04       0111
    0x70  07       0001001
    0x71  11       00000000101
    0x72  05       01001
    0x73  05       01000
    0x74  04       0110
    0x75  06       001001
    0x76  07       0001000
    0x77  06       001000
    0x78  10       0000000111
    0x79  06       000111
    0x7A  12       000000000010
    0x7B  22       0000000000000000000011
    0x7C  21       000000000000000000111
    0x7D  21       000000000000000000110
    0x7E  19       0000000000000000101
    0x80  19       0000000000000000100
    0x84  25       0000000000000000000001011
    0x89  26       00000000000000000000000000
    0x93  25       0000000000000000000001010
    0x98  21       000000000000000000101
    0x99  21       000000000000000000100
    0x9A  25       0000000000000000000001001
    0xA2  24       000000000000000000001000
    0xA9  25       0000000000000000000001000
    0xAD  24       000000000000000000000111
    0xAE  25       0000000000000000000000111
    0xB8  25       0000000000000000000000110
    0xB9  25       0000000000000000000000101
    0xBB  25       0000000000000000000000100
    0xBF  25       0000000000000000000000011
    0xC2  24       000000000000000000000110
    0xC3  23       00000000000000000000101
    0xE2  19       0000000000000000011
    0xEF  25       0000000000000000000000010
    EOF   25       0000000000000000000000001
    
    D:\_KAZE_NEW\tests\Dipperstein>type OSHO.TXT.Traditional_Huffman_codes
    Char  Count      Encoding
    ----- ---------- ----------------
    0x69     9253057 0000
    0x6E     9625004 0001
    0x44      286170 001000000
    0x59      295487 001000001
    0x4F      593094 00100001
    0x56       69907 00100010000
    0x4A       77198 00100010001
    0x71       77514 00100010010
    0x5A       16620 0010001001100
    0x35        8697 00100010011010
    0x33        5718 001000100110110
    0x28        6007 001000100110111
    0x0C       41782 001000100111
    0x52      327520 001000101
    0x45      639322 00100011
    0x0A     2459508 001001
    0x0D     2459508 001010
    0x77     2551146 001011
    0x61     9916090 0011
    0x6C     5052621 01000
    0x67     2593560 010010
    0x55      165521 0100110000
    0x21       80208 01001100010
    0x2F       86343 01001100011
    0x22      344773 010011001
    0x41      698718 01001101
    0x76     1414420 0100111
    0x6F    10759258 0101
    0x74    11473635 0110
    0x2D      709624 01110000
    0x27      349530 011100010
    0x3F      180884 0111000110
    0x46      181560 0111000111
    0x6A      190940 0111001000
    0x47      194006 0111001001
    0x39       95624 01110010100
    0x34       46935 011100101010
    0x51       51495 011100101011
    0x78      197870 0111001011
    0x6B      827155 01110011
    0x63     3026018 011101
    0x72     6543701 01111
    0x20    58731439 10
    0x6D     3044191 110000
    0x79     3576531 110001
    0x68     7432356 11001
    0x3B      200353 1101000000
    0x3A      203570 1101000001
    0x30      102146 11010000100
    0x38       53307 110100001010
    0x32       54408 110100001011
    0x31      108176 11010000110
    0x37       54685 110100001110
    0x29        6013 110100001111000
    0x36        6058 110100001111001
    0x2A        6175 110100001111010
    0x7E          80 1101000011110110000
    0x80          81 1101000011110110001
    0xE2          83 1101000011110110010
    0x7B          10 1101000011110110011000
    0xA9           1 1101000011110110011001000
    0xB9           1 1101000011110110011001001
    0xAD           3 110100001111011001100101
    0xC2           3 110100001111011001100110
    EOF            1 1101000011110110011001110
    0x84           2 1101000011110110011001111
    0x2B          25 110100001111011001101
    0x5F          53 11010000111101100111
    0x26         106 1101000011110110100
    0x24         109 1101000011110110101
    0x3E          12 1101000011110110110000
    0x40          14 1101000011110110110001
    0x7C          29 110100001111011011001
    0x7D          29 110100001111011011010
    0xC3           7 11010000111101101101100
    0xA2           4 110100001111011011011010
    0x93           2 1101000011110110110110110
    0x9A           2 1101000011110110110110111
    0x5C          17 1101000011110110110111
    0x98          35 110100001111011011100
    0x3D          36 110100001111011011101
    0xAE           2 1101000011110110111100000
    0xB8           2 1101000011110110111100001
    0xBB           2 1101000011110110111100010
    0xBF           2 1101000011110110111100011
    0xEF           2 1101000011110110111100100
    0x3C           1 11010000111101101111001010
    0x89           1 11010000111101101111001011
    0x5E           5 110100001111011011110011
    0x25          20 1101000011110110111101
    0x99          40 110100001111011011111
    0x23        5358 1101000011110111
    0x58        7462 110100001111100
    0x5B        7936 110100001111101
    0x5D        7954 110100001111110
    0x60        8267 110100001111111
    0x4E      443953 110100010
    0x48      472318 110100011
    0x2C     1904475 1101001
    0x54      945085 11010100
    0x53      480032 110101010
    0x4C      235870 1101010110
    0x50      119582 11010101110
    0x7A       59278 110101011110
    0x4B       66879 110101011111
    0x49     1013693 11010110
    0x43      249894 1101011100
    0x4D      263934 1101011101
    0x42      270332 1101011110
    0x57      274987 1101011111
    0x73     8498025 11011
    0x65    16164977 1110
    0x70     2138218 1111000
    0x62     2154304 1111001
    0x75     4325811 111101
    0x66     2253517 1111100
    0x2E     2359285 1111101
    0x64     4623546 111111
    
    D:\_KAZE_NEW\tests\Dipperstein>sample.exe -c -i OSHO.TXT -o OSHO.TXT.Huffman
    
    D:\_KAZE_NEW\tests\Dipperstein>dir OSHO.TXT*
     Volume in drive D is H320_Vol5
     Volume Serial Number is 0CB3-C881
    
     Directory of D:\_KAZE_NEW\tests\Dipperstein
    
    07/23/2010  06:03 AM       206,908,949 OSHO.TXT
    10/07/2010  07:21 AM             3,634 OSHO.TXT.Canonical_Huffman_codes
    10/07/2010  07:30 AM       111,924,905 OSHO.TXT.Huffman
    10/07/2010  07:22 AM             3,874 OSHO.TXT.Traditional_Huffman_codes
                   4 File(s)    318,841,362 bytes
                   0 Dir(s)   4,729,495,552 bytes free
    
    D:\_KAZE_NEW\tests\Dipperstein>
    Add-on:
    Link: http://www.i.kyushu-u.ac.jp/~takeda/papers
    Super, a bunch of precious PDF materials, just what I like/want to read the most:
    Speeding Up String Pattern Matching by Text Compression: The Dawn of a New Era
    Link: www.i.kyushu-u.ac.jp/~takeda/papers/IPSJ40.pdf

    A quotation from CPM2000.pdf:
    "E. S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates proposed a compression scheme that uses a word-based Huffman encoding with a byte-oriented code, which allows a search twice faster than agrep. However, the compression method is not applicable to such texts as genomic sequence data since they cannot be segmented into words."

    No problem about the genomic type of text; it is hard to find someone who cares less than me about such a "disadvantage".
    Wow, my dummy approach crashed in the garage even before being started. Obviously what I need is a yet-unknown search technique applied on Huffman.

    Huffmaning "words" i.e. DISTINCT-Building-Blocks instead of single chars is what I find very appropriate as a possible CAP(Huffman tree) i.e. static dictionary. It is quite promising in my case(4++MB ASCII English texts full of words and phrases).
    The first 1-2-3-4-5 rows of table from the previous post come in handy too, that is for "words" with length 3/4/5/6 chars the respective "alphabet" is 46486/248019/855682/2236138. The existence of so many REPETITIVE-Building-Blocks guarantees that the worst case scenario is discarded at once i.e. maximum-unbalanced-binary-tree(linked-list) where encoding would be horrible: some 46486/248019/855682/2236138 bits, on the contrary: situation is just beautiful and gets even beautifulER: 46486 maybe with 16-bit, 248019 maybe with 18-bit, 855682 maybe with 20-bit ... I don't have any experience on Huffman therefore only future tests will confirm its usefulness here, but it is quite obvious.
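
    One encouraging detail: canonical codes are fully determined by the code lengths alone, so a word-alphabet CAP would only need to store one length per "word". A minimal sketch of canonical code assignment from lengths sorted ascending (the convention where the shortest code is all zeros; the Dipperstein table in the previous post uses the mirror-image convention):
    Code:
    #include <stdint.h>

    /* len[0..n-1] must be sorted ascending; code[i] receives the canonical
       code of symbol i, stored right-justified in a 32-bit word. */
    static void assign_canonical(const uint8_t *len, uint32_t *code, int n)
    {
        uint32_t next = 0;
        for (int i = 0; i < n; i++) {
            if (i > 0) next = (next + 1) << (len[i] - len[i - 1]);
            code[i] = next;
        }
    }
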

    Does anyone know where BPE-BM implementations with C source can be found? It would be 'a tear from the moon', as Sinead O'Connor sings, if they are all closed-source.

    Yes, I have a new sub-dream: to empower Simplicissimus with the BPE-BM & word-based Huffman techniques.
    Last edited by Sanmayce; 16th October 2010 at 18:10.

  4. #4
    Sanmayce (Member, Sofia)

    Simpleton becomes Tripleton

    The goal: to upload faster and perform boosted search (Boyer-Moore-Horspool being my search function of choice).
    One approach (I still don't understand it) is word-based Huffman decoding followed by the Aho-Corasick (AC) pattern matching algorithm.
    Although Huffman+automaton might be great, I must feint (for now) their usage in order to gain even more speed.
    My approach is simpler and hopefully not slower; it remains to be seen.
    Instead of using single chars as alphabet symbols, triplets (3 chars) replace them (EDIT: what a shameful blunder of mine; I was fooled/blinded into thinking such an illusion is possible, sorry).
    Does anybody know whether an implementation exists combining a scasw/scasd/scasq search for the beginning/ending of the pattern (2+/4+/8+ in length) with BMH?

    In order to upload (decompress) and traverse (full-text search) at objects-blurred speed, I am forced to sacrifice compression ratio brutally, thus giving the CAP (the new alphabet) a double role (1st: in compression, i.e. 3-to-2 substitution; 2nd: as a BMH booster, i.e. a bigger alphabet).
    The only limitation: the new alphabet must not exceed 16 bits in size; here, for OSHO.TXT, 65536 slots house the 46486 symbols/triads unproblematically. This decides how big a chunk of text can be compressed as one independent decompress/search unit.

    Format of .Simplicissimus-Tripleton file:

    [Header:]
    33bytes: Filetype:Simplicissimus-Tripleton
    4bytes: Size of uncompressed file
    3*64KBytes: triplet0000, ... tripletFFFF
    [Encoded sequence:]
    Tripleton_CODE, ... Tripleton_CODE, Literal(Remainder 3- bytes)

    The compression ratio is awful, the worst ever; numerically: (((33+4+3*65536)+(206,908,949\3)*2+(206,908,949 MOD 3))/206,908,949)*100% = 66.8%.
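
    A minimal decode sketch for the Tripleton format above (names are mine; the 16-bit Tripleton_CODEs are assumed little-endian):
    Code:
    #include <stdint.h>
    #include <string.h>

    /* cap = the 3*64KB triplet table from the header; every 2-byte code expands
       to exactly 3 bytes, then the 0..2-byte literal remainder is copied. */
    static size_t tripleton_decode(const uint8_t *in, size_t in_len,
                                   const uint8_t cap[3 * 65536],
                                   uint8_t *out, size_t out_len)
    {
        size_t ip = 0, op = 0;
        while (op + 3 <= out_len && ip + 2 <= in_len) {
            uint16_t code = (uint16_t)(in[ip] | (in[ip + 1] << 8));
            ip += 2;
            memcpy(out + op, cap + 3 * (size_t)code, 3);
            op += 3;
        }
        while (op < out_len && ip < in_len) out[op++] = in[ip++];
        return op;                                   /* bytes produced */
    }
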

    A quotation:
    "The bad-character shift used in the Boyer-Moore algorithm (see chapter Boyer-Moore algorithm) is not very efficient for small alphabets, but when the alphabet is large compared with the length of the pattern, as it is often the case with the ASCII table and ordinary searches made under a text editor, it becomes very useful.
    Using it alone produces a very efficient algorithm in practice. Horspool proposed to use only the bad-character shift of the rightmost character of the window to compute the shifts in the Boyer-Moore algorithm."

    Link: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

    Edit: I realized what a stupid statement I made:
    Due to bigger and bigger strides (bm_bc[ch]*DISTINCT_Building_Block_Length, see the Boyer-Moore-Horspool code below), and as the diversity (the number of DISTINCT-Building-Blocks, i.e. the number of symbols in the alphabet) increases, I expect the speed performance of BMH to gradually reach sub-fantastic levels somewhere around 5-chars-in-length DISTINCT-Building-Blocks, which is 855,682*5bytes=4,278,410bytes (4MB), purely CPU cache SIZE/speed dependent.
    Sorry, I tend to generate such wrong castings from time to time; it's a result of a completely superficial mode of sketching (thinking without any bias).

    Code:
    #define ASIZE 256
    // Stupid, Stupid: #define ASIZE 46486 // for triplets or 248019/855682/2236138 for quadruplets/quintuplets/sextuplets respectively
    
    else
    {
    458    /* Preprocessing */
    459    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    460    for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
    461
    462    /* Searching */
    463    //lastch=pbPattern[cbPattern-1];
    464    //firstch=pbPattern[0];
    465    i=0;
    466    while (i <= cbTarget-cbPattern) {
    467       ch=pbTarget[i+cbPattern-1];
    468       //if (ch ==lastch)
    469          //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
    470          //if (ch == lastch && pbTarget[i] == firstch && memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  // Kaze: The idea(to prevent execution of slower 'memcmp') is borrowed from Karp-Rabin i.e. to perform a slower check only when the target "looks like".
    471          if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0])
    472             {
    473         count = countSTATIC;
    474         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) {
    475               count--;
    476         }
    477         if ( count == 0) Railgunhits++; //return(pbTarget+i);
    478         }
    479       i+=bm_bc[ch];
    480    }
    481    return(NULL);
    }
    
    $L1709:
    ; Line 460
        mov    ebx, DWORD PTR _pbPattern$[esp+1052]
        mov    ecx, 256                ; 00000100H
        lea    edi, DWORD PTR _bm_bc$[esp+1056]
        rep stosd
        lea    edi, DWORD PTR [eax-1]
        xor    ecx, ecx
        test    edi, edi
        jbe    SHORT $L1730
        lea    esi, DWORD PTR [eax-1]
        npad    3
    $L1728:
        movsx    ebp, BYTE PTR [ecx+ebx]
        inc    ecx
        mov    DWORD PTR _bm_bc$[esp+ebp*4+1056], esi
        dec    esi
        cmp    ecx, edi
        jb    SHORT $L1728
        mov    ebp, DWORD PTR _pbTarget$[esp+1052]
    $L1730:
    ; Line 465
        xor    ecx, ecx
    ; Line 466
        sub    edx, eax
    ; Line 481
        mov    DWORD PTR $T2053[esp+1056], edx
        npad    3
    $L1732:
        movsx    esi, BYTE PTR [ebx+eax-1]
        lea    edx, DWORD PTR [eax+ebp-1]
        movzx    edx, BYTE PTR [edx+ecx]
        cmp    edx, esi
        mov    DWORD PTR tv356[esp+1056], edx
        jne    SHORT $L1740
        mov    dl, BYTE PTR [ecx+ebp]
        cmp    dl, BYTE PTR [ebx]
        jne    SHORT $L2056
        mov    ebp, DWORD PTR _countSTATIC$[esp+1056]
        test    ebp, ebp
        je    SHORT $L2054
        mov    edx, DWORD PTR _pbTarget$[esp+1052]
        lea    edi, DWORD PTR [ebx+1]
        lea    esi, DWORD PTR [ecx+edx+1]
    $L1738:
        mov    dl, BYTE PTR [edi]
        cmp    dl, BYTE PTR [esi]
        jne    SHORT $L1739
        dec    ebp
        inc    esi
        inc    edi
        test    ebp, ebp
        jne    SHORT $L1738
    ; Line 477
        jmp    SHORT $L2054
    $L1739:
        test    ebp, ebp
        jne    SHORT $L2056
    $L2054:
        inc    DWORD PTR _Railgunhits
    $L2056:
        mov    ebp, DWORD PTR _pbTarget$[esp+1052]
    $L1740:
    ; Line 479
        mov    edx, DWORD PTR tv356[esp+1056]
        mov    esi, DWORD PTR _bm_bc$[esp+edx*4+1056]
        mov    edx, DWORD PTR $T2053[esp+1056]
        add    ecx, esi
        cmp    ecx, edx
        jbe    SHORT $L1732
        pop    edi
        pop    esi
        pop    ebx
    $L2063:
    ; Line 481
        xor    eax, eax
        pop    ebp
    Such an opportunity must be exploited; after all, huge (alphabet-size-dependent) skips constitute the intricate beauty of the BMH variant, its strongest feature.

    As for the Fibonacci sequence - Huffman's nightmare - here is a peek into the worst-case scenario:
    I created one of the ugliest possible inputs (file ASCII_F.BIN, 165,580,139 bytes with only 38 DISTINCT chars) for a Huffman binary tree, just to see the longest code (38 bits): '00000000000000000000000000000000000001'. The reason Fibonacci counts are the worst case: the sum of the first k Fibonacci numbers is F(k+2)-1, so at every step Huffman's algorithm merges the accumulated subtree with the next-rarest symbol and the tree degenerates into a path; counting the EOF symbol there are 39 leaves, hence the deepest codes are 38 bits:

    Char: A encountered Fibonacci #2: 1
    Char: B encountered Fibonacci #3: 1
    Char: C encountered Fibonacci #4: 2
    Char: D encountered Fibonacci #5: 3
    Char: E encountered Fibonacci #6: 5
    Char: F encountered Fibonacci #7: 8
    Char: G encountered Fibonacci #8: 13
    Char: H encountered Fibonacci #9: 21
    Char: I encountered Fibonacci #10: 34
    Char: J encountered Fibonacci #11: 55
    Char: K encountered Fibonacci #12: 89
    Char: L encountered Fibonacci #13: 144
    Char: M encountered Fibonacci #14: 233
    Char: N encountered Fibonacci #15: 377
    Char: O encountered Fibonacci #16: 610
    Char: P encountered Fibonacci #17: 987
    Char: Q encountered Fibonacci #18: 1597
    Char: R encountered Fibonacci #19: 2584
    Char: S encountered Fibonacci #20: 4181
    Char: T encountered Fibonacci #21: 6765
    Char: U encountered Fibonacci #22: 10946
    Char: V encountered Fibonacci #23: 17711
    Char: W encountered Fibonacci #24: 28657
    Char: X encountered Fibonacci #25: 46368
    Char: Y encountered Fibonacci #26: 75025
    Char: Z encountered Fibonacci #27: 121393
    Char: [ encountered Fibonacci #28: 196418
    Char: \ encountered Fibonacci #29: 317811
    Char: ] encountered Fibonacci #30: 514229
    Char: ^ encountered Fibonacci #31: 832040
    Char: _ encountered Fibonacci #32: 1346269
    Char: ' encountered Fibonacci #33: 2178309
    Char: a encountered Fibonacci #34: 3524578
    Char: b encountered Fibonacci #35: 5702887
    Char: c encountered Fibonacci #36: 9227465
    Char: d encountered Fibonacci #37: 14930352
    Char: e encountered Fibonacci #38: 24157817
    Char: f encountered Fibonacci #39: 39088169


    Code:
    D:\_KAZE_~1\tests>dir ASCII_F.BIN
    
    10/09/2010  05:09 AM       165,580,139 ASCII_F.BIN
    
    D:\_KAZE_~1\tests>sample.exe -C -t -i ASCII_F.BIN
    Char  CodeLen  Encoding
    ----- -------- ----------------
    0x41  38       00000000000000000000000000000000000001
    0x42  37       0000000000000000000000000000000000001
    0x43  36       000000000000000000000000000000000001
    0x44  35       00000000000000000000000000000000001
    0x45  34       0000000000000000000000000000000001
    0x46  33       000000000000000000000000000000001
    0x47  32       00000000000000000000000000000001
    0x48  31       0000000000000000000000000000001
    0x49  30       000000000000000000000000000001
    0x4A  29       00000000000000000000000000001
    0x4B  28       0000000000000000000000000001
    0x4C  27       000000000000000000000000001
    0x4D  26       00000000000000000000000001
    0x4E  25       0000000000000000000000001
    0x4F  24       000000000000000000000001
    0x50  23       00000000000000000000001
    0x51  22       0000000000000000000001
    0x52  21       000000000000000000001
    0x53  20       00000000000000000001
    0x54  19       0000000000000000001
    0x55  18       000000000000000001
    0x56  17       00000000000000001
    0x57  16       0000000000000001
    0x58  15       000000000000001
    0x59  14       00000000000001
    0x5A  13       0000000000001
    0x5B  12       000000000001
    0x5C  11       00000000001
    0x5D  10       0000000001
    0x5E  09       000000001
    0x5F  08       00000001
    0x60  07       0000001
    0x61  06       000001
    0x62  05       00001
    0x63  04       0001
    0x64  03       001
    0x65  02       01
    0x66  01       1
    EOF   38       00000000000000000000000000000000000000
    
    D:\_KAZE_~1\tests>
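    Here is a minimal sketch of how such a file can be produced (my own reconstruction, not the author's generator - the exact byte count of ASCII_F.BIN may well come out differently): emit the 38 consecutive chars 0x41..0x66, the k-th one Fib(k) times:
    Code:
    #include <stdio.h>
    
    int main(void)
    {
        FILE *f = fopen("ASCII_F.BIN", "wb");
        if (!f) { perror("ASCII_F.BIN"); return 1; }
        unsigned long prev = 0, cur = 1;       /* Fibonacci counts: 1, 1, 2, 3, 5, ... */
        for (int k = 0; k < 38; k++) {         /* 38 DISTINCT chars: 'A'..'f' */
            for (unsigned long i = 0; i < cur; i++) putc('A' + k, f);
            unsigned long next = prev + cur;   /* advance the Fibonacci pair */
            prev = cur; cur = next;
        }
        fclose(f);
        return 0;
    }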
    Not by the way: Simplicissimus-Quadrupleton will be the fastest searcher (especially when utilizing SCASD) and the fastest-but-not-best single-thread text decompressor, with no bit operations whatsoever, operating on DWORDs. I mean, the decompression payoff would be badly hurt by the 50+% compression ratio, which reminds me of a funny pun: 'Sometimes everything is not enough'.
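    As a rough illustration of the searcher side (a sketch of mine with hypothetical names, not the author's code): finding which code a given 4-byte quadruplet maps to amounts to scanning the alphabet table one DWORD at a time - exactly the access pattern the x86 REP SCASD instruction implements:
    Code:
    #include <stdint.h>
    #include <stddef.h>
    
    /* Returns the code of 'quad' within the 'n'-entry quadruplet table,
       or -1 when absent; the plain DWORD scan is a natural REP SCASD candidate. */
    long find_quadruplet(const uint32_t *alphabet, size_t n, uint32_t quad)
    {
        for (size_t i = 0; i < n; i++)
            if (alphabet[i] == quad) return (long)i;
        return -1;
    }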

    Grmbl, I could not find a way to boost BMH by using DISTINCT Building-Blocks, other than by involving Assembler, but that case is different (less interesting). My clacking (the alphabet blunder) led to a heavy catastrophe. I knew something was wrong but proceeded, because sometimes such a manner leads to new viewpoints; not this time.

    A little test with Dipperstein's LZSS (hash finder):

    A little test with a 20+ years old plain LZSS C etude by Mr. Okumura:
    Code:
    /* LZSS encoder-decoder  (c) Haruhiko Okumura */
    
    #include <stdio.h>
    #include <stdlib.h>
    
    #define EI 11  /* typically 10..13 */
    #define EJ  4  /* typically 4..5 */
    #define P   1  /* If match length <= P then output one character */
    #define N (1 << EI)  /* buffer size */
    #define F ((1 << EJ) + P)  /* lookahead buffer size */
    
    int bit_buffer = 0, bit_mask = 128;
    unsigned long codecount = 0, textcount = 0;
    unsigned char buffer[N * 2];
    FILE *infile, *outfile;
    
    void error(void)
    {
        printf("Output error\n");  exit(1);
    }
    
    void putbit1(void)
    {
        bit_buffer |= bit_mask;
        if ((bit_mask >>= 1) == 0) {
            if (fputc(bit_buffer, outfile) == EOF) error();
            bit_buffer = 0;  bit_mask = 128;  codecount++;
        }
    }
    
    void putbit0(void)
    {
        if ((bit_mask >>= 1) == 0) {
            if (fputc(bit_buffer, outfile) == EOF) error();
            bit_buffer = 0;  bit_mask = 128;  codecount++;
        }
    }
    
    void flush_bit_buffer(void)
    {
        if (bit_mask != 128) {
            if (fputc(bit_buffer, outfile) == EOF) error();
            codecount++;
        }
    }
    
    void output1(int c)
    {
        int mask;
        
        putbit1();
        mask = 256;
        while (mask >>= 1) {
            if (c & mask) putbit1();
            else putbit0();
        }
    }
    
    void output2(int x, int y)
    {
        int mask;
        
        putbit0();
        mask = N;
        while (mask >>= 1) {
            if (x & mask) putbit1();
            else putbit0();
        }
        mask = (1 << EJ);
        while (mask >>= 1) {
            if (y & mask) putbit1();
            else putbit0();
        }
    }
    
    void encode(void)
    {
        int i, j, f1, x, y, r, s, bufferend, c;
        
        for (i = 0; i < N - F; i++) buffer[i] = ' ';
        for (i = N - F; i < N * 2; i++) {
            if ((c = fgetc(infile)) == EOF) break;
            buffer[i] = c;  textcount++;
        }
        bufferend = i;  r = N - F;  s = 0;
        while (r < bufferend) {
            f1 = (F <= bufferend - r) ? F : bufferend - r;
            x = 0;  y = 1;  c = buffer[r];
            for (i = r - 1; i >= s; i--)
                if (buffer[i] == c) {
                    for (j = 1; j < f1; j++)
                        if (buffer[i + j] != buffer[r + j]) break;
                    if (j > y) {
                        x = i;  y = j;
                    }
                }
            if (y <= P) output1(c);
            else output2(x & (N - 1), y - 2);
            r += y;  s += y;
            if (r >= N * 2 - F) {
                for (i = 0; i < N; i++) buffer[i] = buffer[i + N];
                bufferend -= N;  r -= N;  s -= N;
                while (bufferend < N * 2) {
                    if ((c = fgetc(infile)) == EOF) break;
                    buffer[bufferend++] = c;  textcount++;
                }
            }
        }
        flush_bit_buffer();
        printf("text:  %ld bytes\n", textcount);
        printf("code:  %ld bytes (%ld%%)\n",
            codecount, (codecount * 100) / textcount);
    }
    
    int getbit(int n) /* get n bits */
    {
        int i, x;
        static int buf, mask = 0;
        
        x = 0;
        for (i = 0; i < n; i++) {
            if (mask == 0) {
                if ((buf = fgetc(infile)) == EOF) return EOF;
                mask = 128;
            }
            x <<= 1;
            if (buf & mask) x++;
            mask >>= 1;
        }
        return x;
    }
    
    void decode(void)
    {
        int i, j, k, r, c;
        
        for (i = 0; i < N - F; i++) buffer[i] = ' ';
        r = N - F;
        while ((c = getbit(1)) != EOF) {
            if (c) {
                if ((c = getbit(8)) == EOF) break;
                fputc(c, outfile);
                buffer[r++] = c;  r &= (N - 1);
            } else {
                if ((i = getbit(EI)) == EOF) break;
                if ((j = getbit(EJ)) == EOF) break;
                for (k = 0; k <= j + 1; k++) {
                    c = buffer[(i + k) & (N - 1)];
                    fputc(c, outfile);
                    buffer[r++] = c;  r &= (N - 1);
                }
            }
        }
    }
    
    int main(int argc, char *argv[])
    {
        int enc;
        char *s;
        
        if (argc != 4) {
            printf("Usage: lzss e/d infile outfile\n\te = encode\td = decode\n");
            return 1;
        }
        s = argv[1];
        if (s[1] == 0 && (*s == 'd' || *s == 'D' || *s == 'e' || *s == 'E'))
            enc = (*s == 'e' || *s == 'E');
        else {
            printf("? %s\n", s);  return 1;
        }
        if ((infile  = fopen(argv[2], "rb")) == NULL) {
            printf("? %s\n", argv[2]);  return 1;
        }
        if ((outfile = fopen(argv[3], "wb")) == NULL) {
            printf("? %s\n", argv[3]);  return 1;
        }
        if (enc) encode();  else decode();
        fclose(infile);  fclose(outfile);
        return 0;
    }
    Another article grabbed my attention:

    Fast String Searching With Suffix Trees
    by Mark Nelson
    Dr. Dobb's Journal
    August, 1996


    The intuitive solution

    Since the database that you are testing against is invariant, preprocessing it to simplify the search seems like a good idea. One preprocessing approach is to build a search trie. For searching through input text, a straightforward approach to a search trie yields a thing called a suffix trie. (The suffix trie is just one step away from my final destination, the suffix tree.) A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. The word 'suffix' is used in this case to refer to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome).

    Figure 1
    The Suffix Trie Representing "BANANAS"

    Figure 1 shows a Suffix trie for the word BANANAS. There are two important facts to note about this trie. First, starting at the root node, each of the suffixes of BANANAS is found in the trie, starting with BANANAS, ANANAS, NANAS, and finishing up with a solitary S. Second, because of this organization, you can search for any substring of the word by starting at the root and following matches down the tree until exhausted.

    The second point is what makes the suffix trie such a nice construct. If you have an input text of length N, and a search string of length M, a traditional brute-force search will take as many as N*M character comparisons to complete. Optimized searching techniques, such as the Boyer-Moore algorithm, can guarantee searches that require no more than M+N comparisons, with even better average performance. But the suffix trie demolishes this performance by requiring just M character comparisons, regardless of the length of the text being searched!

    Remarkable as this might seem, it means I could determine if the word BANANAS was in the collected works of William Shakespeare by performing just seven character comparisons! Of course, there is just one little catch: the time needed to construct the trie.

    The reason you don't hear much about the use of suffix tries is the simple fact that constructing one requires O(N^2) time and space. This quadratic performance rules out the use of suffix tries where they are needed most: to search through long blocks of data.

    Under the spreading suffix tree

    A reasonable way past this dilemma was proposed by Edward McCreight in 1976, when he published his paper on what came to be known as the suffix tree. The suffix tree for a given block of data retains the same topology as the suffix trie, but it eliminates nodes that have only a single descendant. This process, known as path compression, means that individual edges in the tree now may represent sequences of text instead of single characters.

    Figure 2
    A suffix tree representing BANANAS

    Figure 2 shows what the suffix trie from Figure 1 looks like when converted to a suffix tree. You can see that the tree still has the same general shape, just far fewer nodes. By eliminating every node with just a single descendant, the count is reduced from 23 to 11.

    In fact, the reduction in the number of nodes is such that the time and space requirements for constructing a suffix tree are reduced from O(N^2) to O(N). In the worst case, a suffix tree can be built with a maximum of 2N nodes, where N is the length of the input text. So for a one-time investment proportional to the length of the input text, we can create a tree that turbocharges our string searches.

    Link: http://marknelson.us/1996/08/01/suffix-trees/
    Needless to say, I go [for more] bananas having read the above excerpt. I don't want to think how far from such an approach I was; suffix tries/trees must be explored, no doubt.
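    To make the excerpt concrete, here is a minimal sketch of mine (not from the article): a naive suffix trie over 'BANANAS', built in the O(N^2) time/space the article warns about, then searched with exactly M character comparisons:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define ALPHA 256
    
    typedef struct Node {
        struct Node *next[ALPHA];   /* one branch per possible character */
    } Node;
    
    /* Insert every suffix of 'text' - the O(N^2) part. */
    static Node *build_suffix_trie(const char *text)
    {
        Node *root = calloc(1, sizeof(Node));
        size_t n = strlen(text);
        for (size_t i = 0; i < n; i++) {
            Node *cur = root;
            for (size_t j = i; j < n; j++) {
                unsigned char c = (unsigned char)text[j];
                if (cur->next[c] == NULL) cur->next[c] = calloc(1, sizeof(Node));
                cur = cur->next[c];
            }
        }
        return root;   /* nodes intentionally leaked for brevity */
    }
    
    /* Substring test in strlen(pattern) comparisons, as the excerpt claims. */
    static int contains(const Node *root, const char *pattern)
    {
        for (; *pattern; pattern++) {
            root = root->next[(unsigned char)*pattern];
            if (root == NULL) return 0;
        }
        return 1;
    }
    
    int main(void)
    {
        Node *root = build_suffix_trie("BANANAS");
        printf("NANA: %d\n", contains(root, "NANA"));       /* 1: found */
        printf("BANDANA: %d\n", contains(root, "BANDANA")); /* 0: absent */
        return 0;
    }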
    Last edited by Sanmayce; 19th October 2010 at 19:59. Reason: Nasty stupid statement found

  5. #5
    Member Sanmayce's Avatar
    Join Date
    Apr 2010
    Location
    Sofia
    Posts
    57
    Thanks
    0
    Thanked 0 Times in 0 Posts
    8th dan Simpleton comes silently, opening the way to an unexploited instruction - scasq

    Format of .Simplicissimus-Quadrupleton file:

    [Header:]
    36bytes: Filetype:Simplicissimus-Quadrupleton
    4bytes: Size of uncompressed file
    4bytes: Alphabet size in symbols equal to (size in bytes)>>2
    4*1MBytes max: quadruplet00000, ... quadrupletFFFFF {up to 16(either 1 or 2 or 4 or 16) volumes, each up to 4*64KBytes: quadruplet0000, ... quadrupletFFFF, i.e. up to 16*65536=1,048,576 symbols or 4*16*65536 bytes}
    [Encoded sequence, for example, when 2*65536 < Alphabet-size <= 4*65536:]
    1(2bytes) AlphabetSelector_TAG, 8(8*2bytes) Quadrupleton_CODEs, ... 0..1(2bytes) AlphabetSelector_TAG, 0..8(8*2bytes) Quadrupleton_CODEs, Literal(Remainder 0..3 bytes)

    Since quadruplets number 248,019 for OSHO.TXT, 4 (2-bit) volumes will be used, 4*65536=262,144>248,019, which means the AlphabetSelector_TAG (16 bits) describes 8 units/codes.

    Of course volumes-of-alphabets are unwanted baggage/brakes, so volumes-of-files (one alphabet only, i.e. no TAG) must take over when the primary concern is speed.
    The compression ratio is not so awful (header plus quadruplet table, plus one 2-byte TAG per 8 codes, plus one 2-byte code per quadruplet, plus the 0..3 remainder literal bytes):
    ( (36+4+4+4*248,019) + ceiling((206,908,949\4)/8)*2 + (206,908,949\4)*2 + (206,908,949 MOD 4) ) / 206,908,949 * 100% =
    ( ( 36+4+4+(4*248,019) + ((6,465,904+1)*2) + ((51,727,237)*2) + (1) ) / 206,908,949 ) * 100% = 117,378,405/206,908,949 * 100% = 56.7%

    TAG Needed-Bits(Number-of-Alphabet-Volumes) are to be determined automatically:
    Code:
    if (DISTINCT_BuildingBlocks > (65536<<0))
    {
        AlphabetVolumes_Number = 2;  // 1 bit used: 16 units described in TAG
        if (DISTINCT_BuildingBlocks > (65536<<1))
        {
            AlphabetVolumes_Number = 4;  // 2 bits used: 8 units described in TAG
            if (DISTINCT_BuildingBlocks > (65536<<2))
            {
                AlphabetVolumes_Number = 16;  // 4 bits used: 4 units described in TAG
                if (DISTINCT_BuildingBlocks > (65536<<4)) {printf("Simplicissimus-Quadrupleton: Still unable to make volumes-of-files, alphabet table exceeds 1,048,576 symbols.\n"); exit(1);}
            }
        }
    }
    else {AlphabetVolumes_Number = 1;}  // <= 65536 i.e. no TAG
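    And here is how the decode inner loop for the 4-volume case (2 bits per unit, 8 units per TAG) might look - a minimal sketch of mine under assumed names ('alphabet' is the in-memory quadruplet table, the TAG bit order is my guess, and the trailing 0..3 literal bytes are left out). The only bit work is extracting the 2-bit volume selector from the byte-aligned TAG; everything else stays on byte/DWORD boundaries:
    Code:
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    
    /* Decodes 'codes_total' units; returns the number of bytes written. */
    size_t decode_quadrupleton4(const uint8_t *in, uint8_t *out,
                                size_t codes_total, const uint32_t *alphabet)
    {
        uint8_t *dst = out;
        size_t done = 0;
        while (done < codes_total) {
            uint16_t tag;
            memcpy(&tag, in, 2); in += 2;               /* AlphabetSelector_TAG */
            size_t batch = (codes_total - done < 8) ? codes_total - done : 8;
            for (size_t u = 0; u < batch; u++) {
                uint16_t code;
                memcpy(&code, in, 2); in += 2;          /* Quadrupleton_CODE */
                uint32_t volume = (tag >> (2 * u)) & 3; /* 2-bit volume selector */
                uint32_t quad = alphabet[((size_t)volume << 16) | code];
                memcpy(dst, &quad, 4); dst += 4;        /* emit one 4-byte quadruplet */
            }
            done += batch;
        }
        return (size_t)(dst - out);
    }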
    Of course plan B only aggravates the heavy situation: it uses 8,956,496 DISTINCT_BuildingBlocks of size 8 chars.

    Format of .Simplicissimus-Octupleton file:

    [Header:]
    34bytes: Filetype:Simplicissimus-Octupleton
    4bytes: Size of uncompressed file
    4bytes: Alphabet size in symbols equal to (size in bytes)>>3
    8*16MBytes max: octuplet000000, ... octupletFFFFFF
    [Encoded sequence, for example, when 16*65536 < Alphabet-size <= 16*16*65536:]
    1(2bytes) AlphabetSelector_TAG, 2(2*2bytes) Octupleton_CODEs, ... 0..1(2bytes) AlphabetSelector_TAG, 0..2(2*2bytes) Octupleton_CODEs, Literal(Remainder 0..7 bytes)

    Since octuplets are 8,956,496 for OSHO.TXT: 16*16(4+4bits) volumes will be used, 16*16*65536=16,777,216>8,956,496, which means AlphabetSelector_TAG(16bits) describes 2 units/codes.
    The compression ratio is preposterous:
    ( (34+4+4+8*8,956,496) + ceiling((206,908,949\8)/2)*2 + (206,908,949\8)*2 + (206,908,949 MOD 8) ) / 206,908,949 * 100% =
    ( ( 34+4+4+(8*8,956,496) + ((12,931,809)*2) + ((25,863,618)*2) + (5) ) / 206,908,949 ) * 100% = 149,242,869/206,908,949 * 100% = 72.1%
    The quick conclusion: huge (128++MB) files must be used. By chunking the original into 8-char blocks (a 24-bit alphabet) the primary goal is to speed up the search; the nasty outcome is a compression close to expansion. Where is the Guinness World Records agent?

    NO bit operations at all; I wonder how Yorkfield (12MB L2 cache) will handle the decompression.

    206,908,949 OSHO.TXT
    149,242,869 OSHO.TXT.Simplicissimus-Octupleton
    117,378,405 OSHO.TXT.Simplicissimus-Quadrupleton
    111,924,905 OSHO.TXT.Dipperstein's-Huffman [Version 0.81]
    101,903,677 OSHO.TXT.LZ4
    090,662,607 OSHO.TXT.lzt_-10 [lzturbo 0.95]
    079,161,940 OSHO.TXT.qp_-L3K512 [qpress 1.0]
    062,620,125 OSHO.TXT.ULZ [ULZ v0.02]
    061,439,540 OSHO.TXT.lzss [lzss v0.01]
    051,909,111 OSHO.TXT.lzt_-19 [lzturbo 0.95]

    Once revision 1 is completed (32-bit EXEs from both the Intel and Microsoft compilers), I will post the source code and a benchmark table.

    I can't imagine (for now) a faster text decompression scheme, can you?

    Let's hear what a smart girl says about the right approach:

    What are you waiting for?
    Nobody's gonna show you how
    Why wait for someone else
    To do what you can do right now?

    Got no boundaries and no limits
    If there's excitement, put me in it
    If it's against the law, arrest me
    If you can handle it, undress me

    Don't stop me now, don't need to catch my breath
    I can go on and on and on
    When the lights go down and there's no one left
    I can go on and on and on

    Give it to me, yeah
    No one's gonna show me how
    Give it to me, yeah
    No one's gonna stop me now

    They say that a good thing never lasts
    And that it has to fall
    Those are the people that did not
    Amount to much at all

    Give me the base line and I'll shake it
    Give me a record and I'll break it
    There's no beginning and no ending
    Give me a chance to go and I'll take it

    Don't stop me now, don't need to catch my breath
    I can go on and on and on
    When the lights go down and there's no one left
    I can go on and on and on

    Give it to me, yeah
    No one's gonna show me how
    Give it to me, yeah
    No one's gonna stop me now

    Pharrell:
    Watch this

    Get stupid, get stupid, get stupid, don't stop it (what?)
    Get stupid, get stupid, get stupid, don't stop it (what?)
    Get stupid, get stupid, get stupid, don't stop it (what?)
    Get stupid, get stupid, get stupid, don't stop it

    Get stupid, get stupid, get stupid, don't stop it
    (to the left, to the right, to the left, to the right)
    Get stupid, get stupid, get stupid, don't stop it
    (to the left, to the right, to the left, to the right)
    Get stupid, get stupid, get stupid, don't stop it
    (to the left, left, right, right, left, left, right, right)
    Get stupid stupid stupid stupid stupid stupid stupid...
    (left, left, right, right, left, left, right, right)

    Don't stop me now, don't need to catch my breath
    I can go on and on and on
    When the lights go down and there's no one left
    I can go on and on and on

    Give it to me, yeah
    No one's gonna show me how
    Give it to me, yeah
    No one's gonna stop me now

    You're only here to win
    Get what they say?
    You're only here to win
    Get what they do?
    They'd do it too
    If they were you
    You've done it all before
    It ain't nothing new

    Give it to me, yeah
    No one's gonna show me how
    Give it to me, yeah
    No one's gonna stop me now
    Give it to me, yeah
    No one's gonna show me how
    Give it to me, yeah
    No one's gonna stop me now

    Link: http://www.youtube.com/watch?v=aQRLSBUNupg
    Madonna is just great! If only she were in compression software development, he-he. I salute all developers with this masterpiece of a video clip.
    Last edited by Sanmayce; 16th October 2010 at 18:12. Reason: Added C.R. for Octupleton & fixed errors
