
Thread: controlled regexp tools for HP

#1 Shelwien (Administrator)

    I finally made the toolkit which was mentioned in my old post here:
    http://nishi.dreamhosters.com/u/contrepl_v1.rar

    MM> Instead, the context model should develop lexical, semantic, and grammar models,

    Btw, my approach to HP is a multi-pass scheme which can potentially
    take these into account.
    The idea is that instead of compressing all the data sequentially,
    we can move some information from the input data to the archive iteratively,
    with custom models for specific cases.
    It's supposed to be something like a controlled regexp engine:
    we write a replacement regexp and its inverse, e.g. s/\n/ /
    and s/ /\n/, and compress the flags which tell the decoder where
    to apply the inverse regexp for lossless restoration of the data.
    Obviously the flags are only generated for positions where
    the inverse regexp matches at all, i.e. for all spaces in the
    processed text in the example above.
    Benefits are:
    - reduced memory usage
    - bi-directional contexts
    - separately tuned models
    - easy incremental development
    As to grammar models, the simplest workaround here would be
    to replace, e.g., all verbs with "verb" and all nouns with "noun",
    etc.; we'd end up with plain sentence structure.
    Though of course it would also be useful to generate additional
    input streams (also by processing the original input with regexps).
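
    Before the results, here's a minimal C++ sketch of the s/\n/ / flag
    mechanics described above (illustration only - the actual tools are
    the perl scripts plus a CM coder from the archive):

    Code:
    // Forward: s/\n/ /. A flag is emitted for every position where the
    // inverse pattern (a space) matches in the transformed text.
    #include <cassert>
    #include <string>
    #include <vector>
    
    void forward(const std::string& in, std::string& out, std::vector<bool>& flg) {
        out.clear(); flg.clear();
        for (char c : in) {
            if (c == '\n' || c == ' ') flg.push_back(c == '\n'); // inverse matches here
            out += (c == '\n') ? ' ' : c;
        }
    }
    
    // Inverse: s/ /\n/, applied only where the flag allows it.
    std::string backward(const std::string& out, const std::vector<bool>& flg) {
        std::string in; size_t f = 0;
        for (char c : out)
            in += (c == ' ' && flg[f++]) ? '\n' : c;
        return in;
    }
    
    int main() {
        std::string out; std::vector<bool> flg;
        forward("one two\nthree four\n", out, flg);
        assert(out == "one two three four ");                   // patched text
        assert(backward(out, flg) == "one two\nthree four\n");  // lossless restore
    }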
    Current result:

    Code:
                      c.bat  c0.bat
    book1.out         768771 768771 // patched book1
    book1.out_ari     197926 197926 // compressed patched book1
    book1.flg         142173 142173 // uncompressed flags
    book1.flg_ari       5676   8163 // compressed flags
    book1.ari         205688 205688 // original book1 (compressed)
    book1.out+flg_ari 203610 206097 // patched book1 with restore flags (compressed)
    The difference here is that the "c.bat" script compresses the flags
    using the actual data (book1.out here) as context,
    while "c0.bat" compresses them as an independent file
    (the best paq8 results there are also around 70xx).

    Don't test this on enwik as-is, since the perl scripts take too much time
    when there are many matches, and LF-to-space hurts compression on
    enwik anyway, since it's not plaintext.

    There's another example in http://nishi.dreamhosters.com/u/contrepl_v0.rar
    ("don't" and "do not" - see book1.cfg) that should work on enwik8,
    but might not improve compression.

    So now we need good examples of replacement regexps for enwik8
    which would let us incrementally extract information from enwik into flag streams,
    in such a way that after replacement enwik8 compresses better,
    and the compressed flags take less space than the gain.
    As the book1 example above shows, this is possible.

    Meanwhile I'll continue improving the flag coder -
    atm it just uses the same o5 model from mod_CM,
    so higher orders, right-side context and SSE
    still have to be added.

    Of course it would also make sense to implement the actual replacements
    in C++ (for speed), but that can wait until we have more examples.


#2 Shelwien (Administrator)
    Well, here's a simple example with enwik8:
    s/anything/nothing/
    Code:
                         c.bat   c0.bat     .cfg
    book1.out         99999325 99999325    [^a-zA-Z]  
    book1.out_ari     23477126 23477126    [^a-zA-Z]  
    book1.flg             1490     1490    any(thing) 
    book1.flg_ari          176      196    no\${1}    
    book1.ari         23477349 23477349    any\${1}   
    book1.out+flg_ari 23477310 23477330    no(thing)
    So, we saved 39 bytes with that - note that it's a reversible replacement, which is the point.
    (The file names in the table are hardcoded by the test scripts; the input here is enwik8, not book1.)

    Btw, the idea comes from the paq8hp DRT dictionary:
    Code:
    ...
    why
    how
    everything
    nothing
    something
    anything
    ordinary
    binary
    primary
    secondary
    ...


#3 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/contrepl_v1a.rar

    Patched up the perl scripts to support multiple-choice replacements.
    s/([Dd])on't/${1}o not/ is still ok I guess, but anything beyond that gets too complicated too quickly.
    Also added c2/d2/test2 scripts for faster testing using ppmd_sh... although its reaction to replacements is different.

    Code:
    (ari=CM_o5)          c.bat   c0.bat     .cfg
    book1.out         99997960 99997960   [^a-zA-Z]              
    book1.out_ari     23476855 23476855   [^a-zA-Z]              
    book1.flg             1605     1605   everything|||Everything
    book1.flg_ari          194      203   nothing|||Nothing      
    book1.ari         23477349 23477349
    book1.out+flg_ari 23477057 23477066
    
    (ari=pmd)            c.bat   c0.bat     .cfg
    book1.out         99997960 99997960   [^a-zA-Z]              
    book1.out_ari     20784704 20784704   [^a-zA-Z]              
    book1.flg             1605     1605   everything|||Everything
    book1.flg_ari          194      254   nothing|||Nothing      
    book1.ari         20784856 20784856
    book1.out+flg_ari 20784906 20784966


#4 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/contrepl_v1b.rar

    Added a script to generate a config based on these:
    https://en.wikipedia.org/wiki/Wikipe...h_contractions

    Code:
    100,000,000 16,913,266 = paq8px178 -7t
    100,004,854 16,912,702 = 16909975+2727 .out + .flg (lowercase contractions)
    100,007,707 16,912,274 = 16908616+3658 .out2 + .flg2 (all contractions)
    Somehow it actually works better with paq8px?

    Because of these new configs, though, I had to rewrite the whole flag-generation algorithm.
    Apparently there are cases like "you would have", which can be turned both into "you'd have"
    and into "you would've", so a regexp that only scans for matches (and doesn't do any replacements)
    works differently from one that actually replaces: the scanner sees both matches,
    while after a replacement the other, overlapping match disappears.
    But the problem is: to do correct replacements we need the flags, and to decompress the flags
    we need the location log... so I had to realign the flags to the "incorrect" location log.
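
    Here's a tiny illustration of the mismatch (a made-up C++ snippet, just to show the effect):

    Code:
    // Scan-only pass: each backward pattern is tested independently,
    // so the overlapping candidates are BOTH seen in "you would have".
    #include <cstdio>
    #include <string>
    
    int main() {
        std::string text = "you would have";
        printf("%zu %zu\n", text.find("you would"),    // 0: candidate for "you'd"
                            text.find("would have"));  // 4: candidate for "would've"
        // Replacement pass: after actually doing you would -> you'd,
        // the second, overlapping match is gone.
        std::string after = "you'd have";
        printf("%d\n", after.find("would have") == std::string::npos); // prints 1
    }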

    Update:
    Code:
    [The|the -> A|a]
    98,330,050 - book1.out
     1,093,887 - book1.flg
        94,701 - book1.flg_ari
    23,370,489 - book1.out_ari
    23,477,349 - book1.ari
    23,465,198 - book1.out+flg_ari
    
    98,330,050 book1.out3
    16,838,911 book1.out3.paq8px178
     1,093,887 book1.flg              
       136,735 book1.flgc (bits packed to bytes)
       105,914 book1.flg.paq8px178    
       105,197 book1.flgc.paq8px178   
        91,883 flg in original context
    16,913,266 paq8px178 -7t
    16,930,794 = 16838911+91883
    For CM_o5 it's a win, but for paq8px somehow it isn't?

#5 Member
    This approach reminds me of the type of transforms Alex added in the preprocessing stage of phda. It could be a promising direction for further compression improvements, but at least for me it is difficult to come up with many transforms that actually improve compression. The 6 transforms that Alex added were specific to enwik8 and don't generalize well to other files. I guess I am more interested in ideas that work for more general language modelling.

    I like the idea of expanding contractions. The WRT used in cmix/phda doesn't match contractions, so I feel like there is room for improvement there. I am investigating expanding contractions within WRT without the need for an additional flag file. One approach is to use a special byte to indicate an expanded contraction (and I have some other approaches in mind too). Anyway, I would encourage you to continue looking for transform ideas. Hopefully some of the ideas can combine well with WRT.

#6 Shelwien (Administrator)
    > This approach reminds me of the type of transforms Alex added in the preprocessing stage of phda.

    I guess you missed this too?
    https://encode.su/threads/2590-Some-...enwik8-parsing

    Yes, the idea was originally an extension of old-style preprocessing tricks
    from the first BWT compressors, like "space stuffing" etc.
    https://www.researchgate.net/profile...ompression.pdf

    But there's a significant difference: my approach here supports lossy replacements.
    For example, why|||Why -> how|||How produces flags like this:
    00000000001001111110000000111001110100000000001110 10000000001100001001010000101
    00010100110011101110111011000111001000100101111110 00000000000001000000001110000
    01011001000010010111110000001101110000101110100010 10011001100101111011100011010
    00010100000001000001100011110010000000000101011010 000000001011010000101

    "0" here means a match for "how", and "1" means that it has to be replaced back to "why"
    to restore the input file.

    > It could be a promising direction for further compression improvements, but
    > at least for me it is difficult to come up with many transforms that
    > actually improve compression.

    Actually, most lossy replacements like the above do technically improve
    the compression of the text file.

    For example, the why->how transform improves paq8px book1 result from 179,988 to 179,939.

    Currently it's possible to make things worse due to technical details of the paq models -
    removing all spaces, for example, would break the "wordmodel".
    But most lossy replacements which don't break the file structure
    would indeed improve compression - simply because we're removing
    information from the file.

    That's the whole point though.
    The script here moves information from the text file to the flag files, incrementally.
    So we can individually train the models for flags (in context of their locations in text).

    In theory, we can reduce the file to nothing with these replacements,
    and thus fully compress it as a sequence of flag streams.

    Unfortunately I don't know how to modify paq to encode flags in context of main data,
    and the model in my coder is much simpler.

    > The 6 transforms that Alex added were specific to enwik8 and don't generalize well to other files.

    The transforms like &amp; -> & can actually be applied to any html/xml.

    > I guess I am more interested in ideas that work for more general language modelling.

    That's the point here. The regexp approach may seem simple,
    but we can eg. replace synonyms (and thus merge their contexts and statistics),
    or replace all nouns with "thing" and all verbs with "do" and
    get to sentence structure modelling.

    > I like the idea of expanding contractions. The WRT used in cmix/phda don't
    > match contractions, so I feel like there is room for improvement there.

    I thought so too, but with the list I used, the difference is just 4-5k on enwik8 with paq8px.
    I think the reason is the number of contractions still remaining, like "ed"->"'d",
    and also words like "muad'dib".

    > Anyway, I would encourage you to continue looking for transform ideas.
    > Hopefully some of the ideas can combine well with WRT.

    There are much better targets in enwik texts, like wiki links ([[...]] and [[...|...]]),
    but these are trickier on the regexp side, and the latter would require handling
    a phrase stream rather than flags.

    There are also some utf8 ideas, but it appears very hard to deal with utf8 in perl.

#7 Member
    Quote Originally Posted by Shelwien View Post
    So we can individually train the models for flags (in context of their locations in text).
    Can you expand a bit on how you get context for the flags? I guess the naive approach for compressing the flags is to just append the flag file to the end of the original file (i.e. don't reset all the models after the original file, just move on to compressing the flag file).

    Quote Originally Posted by Shelwien View Post
    That's the point here. The regexp approach may seem simple,
    but we can eg. replace synonyms (and thus merge their contexts and statistics),
    or replace all nouns with "thing" and all verbs with "do" and
    get to sentence structure modelling.
    Replacing synonyms makes sense to me. I am quite interested in those results if you can get it working at scale. I guess using a thesaurus? WordNet might also be useful.

    I am skeptical that replacing nouns/verbs would help. I think separating out sentence structure from the actual words will hurt compression, because both work together as useful context for language modeling. But maybe I am just confused about how to do the contextual flag modeling.

#8 Member
    KZo



#9 Shelwien (Administrator)
    > Can you expand a bit on how you get context for the flags?

    repl.pl performs the forward transformation (e.g. s/the/a/),
    then the backward transformation (s/a/the/) while comparing the output with the original file;
    the results are the log file (currently a bitmap of match locations in the transformed file)
    and the flag file (whether the backward replacement is applicable at each match).

    A specialized CM coder then takes the data file, the log and flag files,
    and encodes the flags in the context of the data locations marked in the log.

    Currently it uses the same order-5 model as in mod_CM
    (still, its parameters can be tuned for each specific flag type).
    But there's a potentially very interesting option of
    using right-side context from the match locations.

    For example, there's a high probability of "The" being followed
    by a capital letter. Conversely, we should be able to predict
    that an "A" followed by a capital letter likely has to be replaced back.

    > I guess the naive approach for compressing the flags is to just append the
    > flag file to the end of the original file

    I already confirmed that it's not enough.
    Even my simple o5 model compresses the flag information 10% better than paq8px.

    > Replacing synonyms makes sense to me. I am quite interested in those
    > results if you can get it working at scale. I guess using a thesaurus?
    > WordNet might also be useful.

    I think the DRT dictionary is a good source for that kind of example.
    I mean, it's already sorted to cluster together words that are
    likely to appear in the same context.

    Code:
    ...
    must
    can
    cannot
    could
    would
    should
    may
    might
    ...
    > I am skeptical that replacing nouns/verbs would help.
    > I think separating out sentence structure from the actual words will hurt compression,
    > because both work together as useful context for language modeling.
    > But maybe I am just confused about how to do the contextual flag modeling.

    Well, in theory we could also replace all letters with "" (the empty string) -
    the same compression would then have to be applied to the side stream,
    but overall compression wouldn't change.

    It's basically the same in the case of replacing all words with a single regexp.

    I was talking about incremental replacement of words, one by one (or close to that).

#10 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/contrepl_v1c.rar

    I made a new toolkit for string replacement, now in C++ (sources included).
    It's actually based on the constexpr crc32 from https://encode.su/threads/3077-C-constexpr-experiments .
    The idea is that rolling crc32s are updated and the current crcs are compared with precalculated ones.
    Of course, this also means that all the tools have to be recompiled for each new config (a set of string pairs for now),
    but that's what I intended from the start, for speed.
    As it is, my implementation seems to be more than 2x faster than perl even for simple s/string1/string2/g replacements,
    and the regexps I actually used for this task are crazy slow in perl - they can sometimes take hours.
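
    For reference, the matching idea in a rough sketch - here with a
    Rabin-Karp-style multiplicative hash instead of the rolling crc32,
    which behaves the same way (window hash updated per byte, compared
    with precalculated string hashes):

    Code:
    // Rolling-hash scan (sketch): keep a hash of the last N bytes, compare
    // against the precalculated hash of each search string. The real tool
    // rolls crc32s precalculated via constexpr; this uses a multiplicative
    // hash for brevity.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <string>
    
    const uint64_t B = 0x100000001B3ull;                 // hash multiplier
    
    uint64_t hash_of(const char* s, size_t n) {
        uint64_t h = 0;
        for (size_t i = 0; i < n; i++) h = h * B + (uint8_t)s[i];
        return h;
    }
    
    int main() {
        const char* needle = "anything";
        size_t n = strlen(needle);
        uint64_t target = hash_of(needle, n);            // precalculated per config
        uint64_t Bn = 1;                                 // B^n, to drop the oldest byte
        for (size_t i = 0; i < n; i++) Bn *= B;
    
        std::string text = "nothing matches, but anything does";
        uint64_t h = 0;
        for (size_t i = 0; i < text.size(); i++) {
            h = h * B + (uint8_t)text[i];                // roll in the new byte
            if (i >= n) h -= Bn * (uint8_t)text[i - n];  // roll out the oldest byte
            if (i + 1 >= n && h == target &&             // verify on hash hit
                !memcmp(text.data() + i + 1 - n, needle, n))
                printf("match at %zu\n", i + 1 - n);
        }
    }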

    Btw, another problem this solved is utf8. Enwik8 includes a lot of it, from words in various languages to extended punctuation.
    Somehow perl treats string literals in the script, file data and regexp captures differently, and I wasn't able to sync them for \x80-\xFF symbols.

    Anyway, now there are two files that define the forward and backward replacements:

    replace.inc:
    M_Replace( "\xE2\x80\x93", "--" )
    M_Replace( "\xE2\x80\x94", "==" )
    M_Replace( "\xE2\x80\x98", "``" )
    M_Replace( "\xE2\x80\x99", "''" )
    M_Replace( "\xE2\x80\x9C", "<\x22" )
    M_Replace( "\xE2\x80\x9D", "\x22>" )


    replaceU.inc:
    M_Replace( "--", "\xE2\x80\x93" )
    M_Replace( "==", "\xE2\x80\x94" )
    M_Replace( "``", "\xE2\x80\x98" )
    M_Replace( "''", "\xE2\x80\x99" )
    M_Replace( "<\x22", "\xE2\x80\x9C" )
    M_Replace( "\x22>", "\xE2\x80\x9D" )


    And there's also a script, _incgen\incgen.pl, which can generate these .inc files from an old-style .cfg file.

    Tools description:
    Code:
    // apply "replace.inc" to a file
    repl.exe   book1 1out
    
    // apply "replace.inc" to a file and produce a map of positions where strings were replaced
    mapfwd.exe book1 1fwd
    
    // apply "replaceU.inc" to a file and produce a map of positions where strings were matched
    mapbwd.exe 1out 1bwd
    
    // diff the above two maps; produce flags to mark backward replacements that are actually necessary
    mapdif.exe 1fwd 1bwd 1flg
    
    // apply "replaceU.inc" to a file where flags allow it
    replrest.exe 1out 1bwd 1flg 1rst
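    
    // round-trip check (assumed usage, not from the package docs): 1rst must be byte-identical to book1
    fc /b book1 1rst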
    As to results, that utf8 punctuation gives:
    Code:
    16,913,266 paq8px178 -7t
    16,907,820 1out.paq8px178 (old config version with single-byte replacements)
    16,912,897 1out2.paq8px178 (new config posted here; looks like it's worse, took too long to test)


