Thread: Text strings coding chemical structures

  1. #1
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts

    Text strings coding chemical structures

    Dear colleagues,

    I would like to discuss a specific kind of "text" compression: text strings that encode chemical structures. I assume the best method is dictionary-based (GLZA, XWRT). I need very fast decompression; compression time hardly matters because precomputation is possible. Line counts range up to 1E+6 - 1E+12.

    Thank you in advance for your suggestions, FatBit

    PS: More details in the attached PDF.
    Attached Files

  2. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    766
    Thanks
    217
    Thanked 285 Times in 167 Posts
    Quote Originally Posted by FatBit View Post
    Line counts range up to 1E+6 - 1E+12.
    I don't understand this statement.

    Start the project by creating and sharing an example corpus that should be compressed, as well as the results of a simple compression benchmarking program on this corpus.
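
    A minimal sketch of such a benchmarking program, in Python and using only standard-library codecs (zlib, bz2, lzma) as stand-ins for the candidate compressors, which in practice would be run as external programs:

    Code:
    import bz2, lzma, sys, time, zlib

    def bench(name, compress, decompress, data):
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        assert restored == data, "round trip failed"
        print(f"{name:8s} {len(packed):12,d} bytes ({100.0 * len(packed) / len(data):5.1f}%)"
              f"  comp {t1 - t0:6.2f}s  decomp {t2 - t1:6.2f}s")

    if __name__ == "__main__":
        data = open(sys.argv[1], "rb").read()          # e.g. one of the .smi corpus files
        print(f"uncompressed {len(data):,d} bytes")
        bench("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress, data)
        bench("bz2-9",  lambda d: bz2.compress(d, 9),  bz2.decompress,  data)
        bench("xz-6",   lambda d: lzma.compress(d, preset=6), lzma.decompress, data)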

  3. #3
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    As recommended, I prepared a small benchmark. More details are in the attached PDF and spreadsheet.

    Best regards, FatBit
    Attached Files

  4. #4
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    silesia/nci contains chemical formulas in SMILES format, along with other stuff. The best compressor for it is cmix v8. http://mattmahoney.net/dc/silesia.html

  5. Thanks:

    Paul W. (29th November 2015)

  6. #5
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    Here is a link to download the data (available until the 4th of December, max. 30 downloads) for your own benchmarking/testing. Thank you.

    http://www.uschovna.cz/zasilka/GWE3D49BAMBIEPN7-8RT/

    Best regards, FatBit

  7. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Thank you. This is interesting to me and I will look at the data when I have computer access.

  8. #7
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    766
    Thanks
    217
    Thanked 285 Times in 167 Posts
    Now that you have the benchmark, run TurboBench http://encode.su/threads/2333-TurboB...(incl-LzTurbo) on this data and post the results here.

  9. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Some selected results for silesia/nci (sizes in thousands of bytes):

    cmix versions 5-8: 838 (2.5%)
    lzham 1,503 (4.5%)
    GLZA 0.2: 1,617 (4.8%)
    ppmd -256 -o16 -r1: 1,834 (5.5%)
    uncompressed: 33,553 (100%)

    Given that you are interested in fast decompression, GLZA and lzham look like by far the best among these four. They are both about an order of magnitude faster at decompressing than a high-order PPM, with lzham even faster than GLZA. (cmix 8 is several orders of magnitude slower.)

    lzham decompresses texty things even faster than GLZA, and while it gets a slightly better ratio for this file, it usually doesn't for text, so it may not for your data. (Like LZMA, it's usually not the best for text, but is usually a bit better than GLZA for binary files.) An earlier version of GLZA (Tree) was faster than LZMA, but didn't get as good a ratio as GLZA. GLZA is usually better on larger files than this one, but I expect it will get better on small files as Kennon refines it.

    If you're worried about decompression memory use, GLZA is usually significantly better than lzham, and both are probably plenty fast, depending on exactly what you're doing. (And PLZMA usually gets a somewhat better ratio on text than lzham, but is significantly slower, while still being faster than PPM.)

    Things get more complicated when you consider preprocessors ahead of something like PPM. They can improve PPM ratios while letting you use a considerably lower-order PPM that's faster. (And I'd guess they significantly reduce decompression memory requirements too.) So depending on just how much you care about speed, that might be an option.
    Last edited by Paul W.; 20th November 2015 at 22:46.

  10. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    FatBit,

    I just looked at the largest of those files (11.smi) and it looks to me like the kind of data that LZMA is particularly good at compressing---it's got a lot of short-to-medium-length strings that repeat at short match distances, a lot of occurrences of a very few characters, etc.

    That file appears to be more or less sorted, with a lot of repeats of initial substrings from line to line. It's also got a lot of repeats of other substrings from line to line. Is that typical of the data you want to compress?

    I don't think GLZA is particularly good for that kind of data at this point---it builds its grammar based on global frequencies, and that may interfere with exploiting short-scale repeats as it should. (It does exploit local repeats of strings to some extent, but only strings that are in its dictionary, and they're chosen based on global frequencies.)

    The same problem may occur with a dictionary-building preprocessor of the usual kind oriented toward natural-language text (maybe with markup). If it's picking strings based on global frequencies, it may chunk things too coarsely for the subsequent compressor to be able to see the short strings that repeat at short distances.

  11. #10
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    First, I would like to thank you for your replies and comments. Compression of the structure strings is a sub-problem of a project that deals with estimating physico-chemical properties using group-contribution methods. The size can reach up to ~1E12 lines, one line per molecule.
    I assumed that a dictionary-based approach is best, because all strings consist of "base fragments": for substance names e.g. *methyl*, *ethyl*, *ane, *ene, *ine, *nol, * acid etc., and for the other description methods (SMILES, InChI) it is almost the same; the difference is only in the "coded view" of these text fragments (e.g. ethanol = CCO).
    More complex molecules only "repeat" these fragments. I thought that if we precompute e.g. 1E6 lines (molecules), it would be enough to build the "dictionary", because the rest of the molecules are only different combinations of the same fragments.
    I do not have easy access to an x64 platform, so please be a bit patient. I will try to test the suggested programs and to repeat the computations with sorted .smi files. There are also files 12.smi (~2 GB) and 13.smi (~20 GB); I will try to process them as well.
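
    As an illustration only, here is a minimal sketch of the precomputed-dictionary idea using Python's zlib preset dictionary (zdict); the training lines and the test molecule are made up, and zdict is limited to 32 KB, so this shows the principle rather than the 1E6-line scale:

    Code:
    import zlib

    def make_dictionary(training_lines, limit=32 * 1024):
        # Naive choice: concatenate training lines; a real tool would pick
        # frequent fragments (e.g. via a suffix array or a GLZA-style grammar).
        blob = "".join(training_lines).encode()
        return blob[-limit:]                    # zlib favours the tail of zdict

    def compress_with_dict(text, zdict):
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_DEFAULT_STRATEGY, zdict)
        return c.compress(text.encode()) + c.flush()

    def decompress_with_dict(blob, zdict):
        d = zlib.decompressobj(15, zdict)
        return (d.decompress(blob) + d.flush()).decode()

    if __name__ == "__main__":
        training = ["CCO\n", "CCN\n", "c1ccccc1O\n", "CC(=O)O\n"]   # stand-in molecules
        zdict = make_dictionary(training)
        record = "CC(=O)Oc1ccccc1C(=O)O\n"      # aspirin-like SMILES reusing fragments
        packed = compress_with_dict(record, zdict)
        assert decompress_with_dict(packed, zdict) == record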

    Best regards, FatBit

  12. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Fat Bit,

    I tried the largest file (11.smi) with GLZA 0.3, LZMA, and PPMd. (The latter two using modes of 7zip.)

    GLZA gets a ratio a little under 14.8 percent, or a little over a bit per character. That would seem good, but...

    PPMd gets anywhere from 9 to 16 percent, for the modes I tried, with smallish models working better than larger ones. "Normal" mode with a 16MB "dictionary" size does better than "Ultra" mode with a 1536MB "dictionary" size. I assume that means model size, and would guess that the bigger you make it the higher order the highest-order model is, and the slower the statistics decay. (For natural-language text, bigger models usually work better. For stuff with strong short-scale locality effects, they may work worse because they're working mostly with stale statistics for too-long contexts.)

    LZMA gets a little over 9 percent, with mode (Normal vs. Ultra) and dictionary size (default vs. largest) making little difference. I didn't try other modes.

    I suspect those results are mostly for the reasons I gave above---you need an algorithm that's good at recognizing short strings that repeat at very short distances, and can entropy code a small alphabet.

    BTW, I think that one thing that LZMA should be good at is recognizing strings that repeat very soon with one or two characters changed, like a molecular structure with an atom of one element replaced by an atom of a chemically similar element. It can take advantage of repeating LZ77 match distances (offsets), but it can only remember about 4 match distances in its MTF queue of recent offsets.

    cmix would be worth a try, for experimental purposes. If it compresses substantially better than LZMA, it may be worth figuring out why, and how to get the relevant effect with modifications to another algorithm that decompresses faster. (Maybe with a simple preprocessor of some unusual sort.)

  13. #12
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by FatBit View Post
    Dear colleagues,

    I assumed that a dictionary-based approach is best, because all strings consist of "base fragments": for substance names e.g. *methyl*, *ethyl*, *ane, *ene, *ine, *nol, * acid etc., and for the other description methods (SMILES, InChI) it is almost the same; the difference is only in the "coded view" of these text fragments (e.g. ethanol = CCO).
    More complex molecules only "repeat" these fragments. I thought that if we precompute e.g. 1E6 lines (molecules), it would be enough to build the "dictionary", because the rest of the molecules are only different combinations of the same fragments.
    If you know the vocabulary of repeating fragments, you may want to put spaces between the occurrences of fragments within a line. That may help PPM, and would probably help GLZA or XWRT a lot. (I'm not positive, but my impression is that XWRT takes spaces as word/nonword separators. GLZA uses them heuristically to reduce proliferation of grammar productions, preferring strings with spaces on the left over strings with spaces on the right.)
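
    A minimal sketch of such a fragment-separating preprocessor, in Python with a made-up fragment vocabulary; the transform is reversible by simply deleting the spaces, provided the original data contains none:

    Code:
    import re

    # Made-up fragment vocabulary for illustration only; a real list would come
    # from the known chemical "base fragments" in the data.
    FRAGMENTS = ["c1ccccc1", "C(=O)O", "CC", "CO", "N", "O"]
    _pattern = re.compile("|".join(re.escape(f)
                                   for f in sorted(FRAGMENTS, key=len, reverse=True)))

    def add_spaces(line):
        # Wrap every recognized fragment in spaces, then collapse runs of spaces.
        spaced = _pattern.sub(lambda m: " " + m.group(0) + " ", line)
        return re.sub(r" +", " ", spaced).strip()

    def remove_spaces(line):
        # Reversible as long as the original data never contains spaces.
        return line.replace(" ", "")

    if __name__ == "__main__":
        original = "CC(=O)Oc1ccccc1C(=O)O"      # aspirin-like SMILES
        spaced = add_spaces(original)           # fragments now space-separated
        assert remove_spaces(spaced) == original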

    You may want to add a space at the end of each line, before the carriage return, to make it easier to recognize the carriage return as a special token. For PPM of order n, you may want to end the line with n-1 repetitions of some character that doesn't occur otherwise, so a line like

    H2NCOOH<CR>

    would end up as something like

    HCH COOH ZZZZ<CR>

    (I'm not trying to make any chemical sense here, because I don't know chemistry---"HCH" and "COOH" are just supposed to be the kind of molecular component that occurs in lots of places, and Z is any character not in your input alphabet. <CR> is the carriage return.)

    The idea with the Zs is that PPM of order 5 would rapidly learn that...

    Once it's seen a space it might see a Z.
    Once it's seen a space and a Z it will always see another Z
    Once it's seen a space and two Zs it will always see another Z
    Once it's seen a space and three Zs it will always see another Z
    Once it's seen a space and four Z's it will always see a <CR>

    As it sees more occurrences of the " ZZZZ<CR>" sequence, the probabilities associated with those things will converge toward 1, and the number of bits to encode the characters approaches zero as closely as the resolution of the arithmetic coder allows. (And the probability at the highest model order is the only one that really counts, after the first time---it will always make the same prediction, and it will always be right, so it won't be falling back to the lower orders.)

    If you don't get the number of Z's exactly right for the highest order of your PPM, it may not work quite as well, but it should work reasonably well. It will learn most of what it needs to learn---that anything it's seen ending in some Zs predicts that it will see either another Z or a <CR>.

    The ZZZZ<CR> sequence should also help PPM predict that the next thing it sees (the initial substring of the line) will be the same as last time (the initial substring of the previous line). If you have at least enough Z's, it will look for the things it's seen lately in that highest-order context, and the previous line's initial string will be there.
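
    A minimal sketch of this reversible line-end padding, assuming an order-5 model (hence four pad characters after the trailing space) and a pad byte that never occurs in the data:

    Code:
    # 'Z' stands in for any byte that never occurs in the data; ORDER is the
    # PPM model order, so ORDER-1 pad characters follow the trailing space.
    PAD, ORDER = "Z", 5

    def pad_lines(text):
        assert PAD not in text, "pad character must not occur in the data"
        return "".join(line + " " + PAD * (ORDER - 1) + "\n"
                       for line in text.splitlines())

    def unpad_lines(text):
        # Strip the trailing " ZZZZ" (one space plus ORDER-1 pad characters).
        return "".join(line[:-ORDER] + "\n" for line in text.splitlines())

    if __name__ == "__main__":
        sample = "CCO\nCCN\nc1ccccc1O\n"
        assert unpad_lines(pad_lines(sample)) == sample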

    This is a case where symbol ranking may work better than something that attempts to do actual statistics---symbol ranking predicts the most recently-seen things most strongly, and for this kind of data, the one most recently-seen thing IS what's most likely to repeat right away, by far. (On the other hand, the other things are much less likely to repeat soon, because a lexicographic sort will make them occur in a cycle.)

    Are your data always sorted? Is it some kind of left-to-right lexicographic sort?

    If so, it sounds like the abstract structure of your data is actually a sparse multidimensional array, and you might be able to exploit that with some sort of tree compression algorithm.

  14. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Hi Fat Bit,

    Well, the doctors did not kill me so I was finally able to download the files, look at them, and do a little testing.

    I think GLZA is not so good for these files at this time because it is not able to adequately exploit local matches as much as some other programs and does not have sufficiently high order modeling on the back end. I tried to test CMIX, but found neither my own build nor the last one posted on here would run on my PC. For 11.smi, PLZMA64 with the lzmarec model tuned to enwik8 did give some decent results with fairly fast decompression. 11.smi compressed to 20,069,767 bytes or 7.3% of the original file size and decoded in 8.9 seconds. I noticed you got a little better compression with XWRT v3.4. Does v3.4 of XWRT use PPMd for the backend? Is it fast enough for your purposes? I suspect you get good results with XWRT more because of the backend than the dictionary preprocessor, but would like to understand this better.

  15. #14
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    TurboBench Benchmark - File 11.smi

    Code:
         C Size  ratio%     C MB/s     D MB/s   Name             (bold = pareto) MB=1.000.0000
       17978415     6.5       0.32       0.29   zpaq 5          
       21874731     8.0       1.40     179.63   lzma 9          
       23949824     8.7       0.95     194.86   lzturbo 49      
       27765063    10.1       1.15    1094.70   lzturbo 39      
       28173635    10.2       0.26     517.97   brotli 11       
       28306270    10.3       1.35     397.35   lzham 4         
       33013572    12.0       0.11     519.41   zopfli          
       34093986    12.4       4.22     926.81   zstd 20         
       41662642    15.2       9.23      15.38   bsc 2           
       54801989    19.9       1.20    1634.19   lzturbo 29      
       58391509    21.2       1.47    2997.58   lzturbo 19      
      274870873   100.0    7609.19    8506.28   memcpy          
    If you preprocess your files, you will get different results.
    When your files are sorted or partially sorted, you can use "incremental encoding" (or "front coding"),
    and possibly bigram encoding or grammar compression for the suffixes.
    See also: A Fully Reversible Data Transform Technique Enhancing Data Compression of SMILES Data
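    For illustration, a minimal sketch of front coding for sorted lines: each record stores the length of the prefix shared with the previous line plus the remaining suffix, and the suffix stream is what a second-stage compressor (bigram or grammar based) would then see.

    Code:
    def common_prefix_len(a, b):
        n = min(len(a), len(b))
        i = 0
        while i < n and a[i] == b[i]:
            i += 1
        return i

    def front_encode(lines):
        prev, out = "", []
        for line in lines:
            k = common_prefix_len(prev, line)
            out.append((k, line[k:]))           # (shared prefix length, suffix)
            prev = line
        return out

    def front_decode(records):
        prev, out = "", []
        for k, suffix in records:
            line = prev[:k] + suffix
            out.append(line)
            prev = line
        return out

    if __name__ == "__main__":
        lines = sorted(["CCO", "CCOC", "CCOCC", "CCN", "CC(=O)O"])
        assert front_decode(front_encode(lines)) == lines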
    Last edited by dnd; 28th November 2015 at 18:14. Reason: Added zpaq,bsc

  16. #15
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Does v3.4 of XWRT use PPMd for the backend? Is it fast enough for your purposes? I suspect you get good results with XWRT more because of the backend than the dictionary preprocessor, but would like to understand this better.
    Here are some answers to my questions:

    XWRT 3.4 uses a variety of backends, depending on the compression level. Levels 10 - 15 use lpaq6, which is about 15 times slower at decompressing 11.smi than PLZMA64. So it appears PLZMA64 offers the best compression ratio among the LZ-class encoders. For lzham, with options -d29 -x, I get a 26,757,742 byte file (9.7%) that decompresses in 1.03 seconds (less if I/O is ignored, as TurboBench does), which is also on the Pareto frontier among open-source and controversy-free compressors.

  17. Thanks:

    Paul W. (29th November 2015)

  18. #16
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear all,

    Thank you for your activity. You can imagine my hobby project as a database of chemical substances, where one substance is one record and one record consists of many fields, such as SubstanceNameIUPAC, SubstanceNameACS, SMILES, InChI etc.

    I am now moving from the very first, naive ideas toward better, real knowledge. With millions of substances the files reach GB sizes, hence the idea to store the names in compressed form. I assumed that for relatively big SMILES files the compression ratio would be roughly constant, but it is not.

    Inspired by Dr. Mahoney's compression benchmark, I am preparing (admittedly slowly) a larger benchmark. I also plan to communicate more with some developers of dictionary-based compressors to try to reach better results. If it would be useful, I can also share 12.smi (~2 GB uncompressed) and 13.smi (~20 GB uncompressed).

    Please be patient; progress is slow and small, but there is progress.

    Best regards, FatBit
    Last edited by FatBit; 29th November 2015 at 10:30. Reason: Better typo

  19. #17
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    FatBit,

    Are these files real data files for some particular purpose, or set of purposes? Or are they things you've put together that you expect to resemble the real data, more or less?

    The roughly 2GB file would be interesting to me, if the data are real.

    I'm still wondering whether you expect the data to be lexicographically sorted, or almost-sorted, or something else. (And whether they'd sometimes be sorted on one field, and sometimes another.) That has a big impact on what a good compressor must be able to detect/exploit.

  20. #18
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Paul W.,

    I assume the data files are real. Better said, I used them exactly as I downloaded them: no Unix => Windows line-end conversion, no line sorting. Really as downloaded.

    Best regards, FatBit

    PS: In the evening I will post a link to download the other files.
    Last edited by FatBit; 30th November 2015 at 08:24. Reason: Better typo

  21. #19
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear all,

    as promised, here is a link to download the compressed 01-13.smi files (1.7 GB 7z file), without any changes, exactly as downloaded:

    www.uschovna.cz/zasilka/GZZUIB2UNXD5CKL7-JZR

    The attachment lists the source URLs I used for the downloads.

    Best regards, FatBit
    Attached Files

  22. #20
    Member
    Join Date
    Nov 2015
    Location
    France
    Posts
    7
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I haven't looked at the data.

    If I'm not mistaken (my knowledge of both chemistry and compression is limited), there are various properties that could be exploited:
    - for conditional symbol probabilities, valence may help, as well as "termination of a branch" of a molecule (e.g. CH3 can only terminate a branch, not appear inside a molecule)
    - a dictionary-based approach; also, some patterns occur more frequently (alkanes/alkenes/alcohols often have a higher probability of C/H/O occurrences or of repeating structures)
    - some combinations are not possible (e.g. CH5)

    Basically, these are predictions that could be exploited to produce a tuned compression scheme.

  23. #21
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    I have prepared some data files for interested people:

    #.smi as downloaded
    #+crlf.smi as downloaded + Unix => Windows line-end conversion by the unix2dos routine
    #+crlf-sorted-duplicity_included.smi as above + sorted by the Windows built-in sort routine (duplicates included)
    #+crlf-sorted-duplicty_removed.smi as above + sorted by the cmsort routine (duplicates removed)
    #+crlf-duplicity.smi - the duplicates excluded by the previous sort

    Together it is ~5.8 GB as 7zip (95 GB uncompressed). Give me a hint and I will prepare a download link.
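
    For reference, the same variants can be reproduced with a short Python sketch (equivalent to the unix2dos + sort + deduplication steps above; file names are illustrative and everything is held in memory, so it only suits the smaller files):

    Code:
    # Sketch: reproduce the file variants described above from one raw .smi file:
    # Unix -> Windows line ends, a sorted copy with duplicates kept, and a
    # sorted copy with duplicates removed.  Paths and suffixes are illustrative.
    def prepare_variants(src="11.smi"):
        with open(src, newline="") as f:                 # keep original line ends
            lines = [l.rstrip("\r\n") for l in f]

        with open(src.replace(".smi", "+crlf.smi"), "w", newline="") as f:
            f.writelines(l + "\r\n" for l in lines)      # unix2dos equivalent

        with open(src.replace(".smi", "+crlf-sorted-duplicity_included.smi"),
                  "w", newline="") as f:
            f.writelines(l + "\r\n" for l in sorted(lines))

        with open(src.replace(".smi", "+crlf-sorted-duplicity_removed.smi"),
                  "w", newline="") as f:
            f.writelines(l + "\r\n" for l in sorted(set(lines)))

    if __name__ == "__main__":
        prepare_variants()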

    Best regards,

    FatBit

  24. #22
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    190
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    Here is the package URL www.uschovna.cz/en/zasilka/HML3R3BW4KPTAC79-HY2 with the above-mentioned files (~5.8 GB) for your own tests.

    Best regards + nice packing,

    FatBit
