
Thread: DNA Corpus

  #1 - Member (Bothell, Washington, USA)

    DNA Corpus

    This DNA corpus is referenced in several grammar compression papers, but it is hard to find. Even when found, each file needs to be decompressed (ever so slightly) with gzip and then run through dnau, for which I could only find source code. It consists of the following 11 files: chmpxx, chntxx, hehcmv, humdyst, humghcs, humhbb, humhdab, humprtb, mpomtcg, mtpacga, and vaccg. Here it is in a single .zip file.
    Attached Files


  #2 - Member (Bothell, Washington, USA)
    Apparently some of these DNA files are given different names by different users. From what I can tell, the following names are equivalent:

    HUMDYST = HUMDYSTROP
    HUMGHCS = HUMGHCSA
    MTPACGA = MIPACGA = PANMTPACGA
    CHMPXX = MPOCPCG

    More details are here: http://article.sapub.org/10.5923.j.b...04.html#Sec3.1, including the following note:

    Most DNA compression algorithms use the standard benchmark data in [23]. These standard sequences come from a variety of sources and include the complete genomes of two mitochondria (MPOMTCG, PANMTPACGA "also called MIPACGA"), two chloroplasts (CHNTXX and CHMPXX "also called MPOCPCG"), five sequences from humans (HUMDYSTROP, HUMGHCSA, HUMHBB, HUMHDABCD and HUMHPRTB), and finally the complete genomes of two viruses (VACCG and HEHCMVCG "also called HS5HCMVCG").


  #3 - Member (USA)
    Here are the results for cmix v17:

    chmpxx: 121024 bytes -> 27299 bytes (cross entropy: 1.805)
    chntxx: 155844 bytes -> 37349 bytes (cross entropy: 1.917)
    hehcmv: 229354 bytes -> 54522 bytes (cross entropy: 1.902)
    humdyst: 38770 bytes -> 9295 bytes (cross entropy: 1.918)
    humghcs: 66495 bytes -> 9692 bytes (cross entropy: 1.166)
    humhbb: 73308 bytes -> 16742 bytes (cross entropy: 1.827)
    humhdab: 58864 bytes -> 12954 bytes (cross entropy: 1.761)
    humprtb: 56737 bytes -> 12780 bytes (cross entropy: 1.802)
    mpomtcg: 186609 bytes -> 44616 bytes (cross entropy: 1.913)
    mtpacga: 100314 bytes -> 23113 bytes (cross entropy: 1.843)
    vaccg: 191737 bytes -> 44153 bytes (cross entropy: 1.842)
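For reference, the "cross entropy" figures in the table above are just compressed bits per original byte, which can be checked directly from the two sizes:

```python
def cross_entropy(original_bytes: int, compressed_bytes: int) -> float:
    """Bits of compressed output per byte of original input."""
    return 8 * compressed_bytes / original_bytes

# Spot-check against two rows of the table above.
print(round(cross_entropy(121024, 27299), 3))  # chmpxx  -> 1.805
print(round(cross_entropy(66495, 9692), 3))    # humghcs -> 1.166
```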

  #4 - schnaader (Programmer, Hessen, Germany)
    The ZIP file happens to be a good test file for recompression:

    Code:
    DNA.zip				368.673 Bytes
    Precomp 0.4.6			368.147 Bytes, 53 s, 0/11 ZIP
    Precomp 0.4.7			369.735 Bytes, 6 s, 11/11 ZIP, 54.018 Bytes total zlib reconstruction data
    Precomp 0.4.7			365.507 Bytes, 3 s, 7/7 ZIP (ignoring some streams, "-i0 -i95667 -i203983 -i232726")
    reflate				426.218 Bytes, 3 s (7-Zip DLL from here)
    reflate				387.298 Bytes (7-Zip DLL, "a -m0=reflate:x6 -m1=lzma:x9")
    decompressed + Precomp 0.4.7	312.994 Bytes, 3 s
    
    cmix v17			292.515 Bytes (from previous post for reference, each file compressed separately)
    Last edited by schnaader; 28th March 2019 at 13:26. Reason: Added reflate result by Shelwien

  #5 - Shelwien (Administrator, Kharkov, Ukraine)
    The reflate result is actually a bit better:
    Code:
    Z:\042>7z.exe a -m0=reflate:x6 -m1=lzma:x9 1.pa DNA.zip
    
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    
    Scanning the drive:
    1 file, 368673 bytes (361 KiB)
    
    Creating archive: 1.pa
    
    Items to compress: 1
    
    
    Files read from disk: 1
    Archive size: 387298 bytes (379 KiB)
    Everything is Ok
    It can also be improved further by using a different codec for compression (plzma or ppmd or BWT+qlfc),
    but it's true that the reflate diff size here is 20k larger than preflate's.


  #6 - Shelwien (Administrator, Kharkov, Ukraine)
    Code:
                    CMo8   cmix
    chmpxx:  121024 27368  27299*
    chntxx:  155844 37333* 37349
    hehcmv:  229354 55395  54522*
    humdyst:  38770  9254*  9295
    humghcs:  66495 12362   9692*
    humhbb:   73308 17015  16742*
    humhdab:  58864 13252  12954*
    humprtb:  56737 13045  12780*
    mpomtcg: 186609 45334  44616*
    mtpacga: 100314 23211* 23113
    vaccg:   191737 44205  44153*
    It doesn't seem like there's any interesting structure there, since the CMo8 results are pretty similar, and its model is very simple - no sparse contexts or any other tricks.
    Except for that one file (humghcs), though.


  #7 - Member (USA)
    I have attached an image of the results of some DNA compression programs on this corpus (showing cross entropy, aka bits-per-byte). Results are from this paper. The DNALight results are from 2008 - I am not sure what state-of-the-art is now. I am surprised cmix results are so much worse. I don't know much about DNA compression - why are the cmix results so bad? The task doesn't seem that different from language modeling (i.e. looking for some repeating patterns in a one dimensional sequence). Can someone provide some insight into how DNA compression programs do significantly better than CM/PPM style compressors?
    Attached image: dna.png (table of DNA compressor results, in bits-per-byte, from the referenced paper)

  #8 - Shelwien (Administrator, Kharkov, Ukraine)
    > I have attached an image of the results of some DNA compression programs on
    > this corpus (showing cross entropy, aka bits-per-byte).

    So the best vaccg result is 191737*1.7542/8 = 42043 bytes, while cmix's is 44153.
    Not _that_ bad, I think?
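Spelling out the arithmetic above: bits-per-byte times file size, divided by 8, gives the implied compressed size.

```python
# Best reported vaccg entropy (1.7542 bpb) implies this compressed size
# for the 191737-byte input.
implied = 191737 * 1.7542 / 8
print(round(implied))          # 42043
print(44153 - round(implied))  # cmix trails by 2110 bytes
```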

    > I don't know much about DNA compression - why are the cmix results so bad?
    > The task doesn't seem that different from language modeling (i.e. looking
    > for some repeating patterns in a one dimensional sequence).

    It's actually quite different; you'd need a specialized model plus parsing optimization.
    There's some redundancy even just from encoding 8 bits per symbol
    with only 4 distinct byte values.
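A minimal illustration of that baseline redundancy: with only A/C/G/T stored one per byte, 2 bits per base suffice, so a trivial 4-bases-per-byte packing already reaches 2.0 bpb before any modeling. A sketch:

```python
# With A/C/G/T stored one per byte, 6 of every 8 bits are redundant
# before any modeling: 2 bits per base suffice.
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pack(seq: str) -> bytes:
    """Pack up to 4 bases into each output byte (illustration only:
    a real packer would also record the sequence length)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)

print(len(pack('ACGT' * 8)))  # 32 bases -> 8 bytes
```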

    > Can someone provide some insight into how DNA compression programs do
    > significantly better than CM/PPM style compressors?

    JamesB explained some of it here:
    https://encode.su/threads/3080-3-fai...ll=1#post59423

    Also, Osman attempted to use some chemical/molecular properties
    to restrict the nucleotide choices... DNA data is actually not text,
    but a text description of a 3D object, so some patterns just can't physically exist.
    To be specific, Osman replaced patterns with vectors, summed these vectors,
    and posted some interesting graphs as results - with rather long linear
    sequences. I don't know where it ended, though.

    Quote Originally Posted by osmanturan
    you may want to consider DNA's features: long matches, complementary matches, reversed matches
    (palindromes), mutated gapped contexts (insertion, deletion, etc.).
    There are also 2 features which are a bit hard to implement - unwrapped
    phase linearity and codon redundancy.

    codon redundancy:
    a group of 3 bases is called a codon in the DNA world, and 3 bases allow 64 possible patterns,
    but in real life there are actually only 20-24 possible amino acids pointed to by those patterns,
    so more than one codon can point to a single amino acid.
    So there is a known redundancy, and it can be handled with context clustering.

    unwrapped phase linearity:
    I was trying to use piece-wise linear regression to fit this
    linearity... as a first step toward a linear predictor, I tried a delta
    filter and got very few distinct values. Imagine, there are only 4
    phase angles (usually, of course), so the next phase angle can be known for
    consecutive nucleotides by using a state machine... I was trying to use it
    as an extra context...
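The codon-redundancy idea in the quote can be sketched as follows, using a partial, hand-picked codon table (a real model would use the full 64-entry genetic code and would need to know the reading frame; frame 0 is assumed here):

```python
# "Context clustering" over synonymous codons: several codons encode the
# same amino acid, so a model can use the amino acid as a coarser context
# than the raw codon. Partial table only, for illustration.
SYNONYMS = {
    'Leu': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'],
    'Ser': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
    'Met': ['ATG'],  # unique codon (also the usual start codon)
}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMS.items() for c in codons}

def clustered_context(seq: str, frame: int = 0) -> list:
    """Map each in-frame codon to its amino-acid cluster ('?' if unknown)."""
    return [CODON_TO_AA.get(seq[i:i + 3], '?')
            for i in range(frame, len(seq) - 2, 3)]

# Two different Leu codons collapse into the same coarse context:
print(clustered_context('TTACTGATG'))  # ['Leu', 'Leu', 'Met']
```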
    Updated stats:
    Code:
                    cmix    CMo8'  DNALi
    chmpxx:  121024 27299*  27334  24832
    chntxx:  155844 37349   37312* 31112
    hehcmv:  229354 54522*  55309  52513
    humdyst:  38770  9295    9244*  9161
    humghcs:  66495  9692*  12129   8082
    humhbb:   73308 16742*  16956  15959
    humhdab:  58864 12954*  13232  12192
    humprtb:  56737 12780*  13001  12253
    mpomtcg: 186609 44616*  45210  43493
    mtpacga: 100314 23113*  23178  23124
    vaccg:   191737 44153*  44186  42043
    I have attached the version of the files with ACTG converted to \x00-\x03 codes.
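The transform itself is trivial; a sketch (the actual byte assignment used in the attached files may differ):

```python
# Map the four nucleotide letters to \x00-\x03 and back (assumed mapping).
ENC = bytes.maketrans(b'ACTG', bytes([0, 1, 2, 3]))
DEC = bytes.maketrans(bytes([0, 1, 2, 3]), b'ACTG')

data = b'ACTGGA'
packed = data.translate(ENC)
assert packed.translate(DEC) == data  # round-trips losslessly
print(list(packed))  # [0, 1, 2, 3, 3, 0]
```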
    Attached Files


  #9 - Mauro Vezzosi (Member, Italy)
    cmix: v17.
    CMo8': 2019/03/29.
    cmv: 0.2.0, best of -m0,0,0x7fededff (-m0,0,>), -m0,0,0x7fede51e.
    paq8px_v178: best of -7, -7a, -7f, -7af, -9, -9a, -9f, -9af.
    paq8pxd64 (SSE2 build using IC19 from Shelwien): best of -s7, -s9.
    DNALi: DNALight (https://encode.su/threads/2105-DNA-C...ll=1#post59591).
    DNA compr.: best of DNA compressors (https://encode.su/threads/2105-DNA-C...ll=1#post59591).
    Code:
                         cmix   CMo8'     cmv   paq8px  paq8pxd  |   DNALi  DNA compr.
    chmpxx :   121024   27299   27334   27240*   27420    27550  |   24832       24832
    chntxx :   155844   37349   37312*  37336    37417    37541  |   31112       31112
    hehcmv :   229354   54522*  55309   54970    54644    54923  |   52513       52513
    humdyst:    38770    9295    9244*   9287     9286     9380  |    9161        9161
    humghcs:    66495    9692   12129    8078*    9840     9834  |    8082        7896
    humhbb :    73308   16742   16956   16442*   16802    16902  |   15959       15959
    humhdab:    58864   12954   13232   12461*   13058    13156  |   12192       12192
    humprtb:    56737   12780   13001   12414*   12847    12943  |   12253       12198
    mpomtcg:   186609   44616   45210   44458*   44743    44891  |   43493       43493
    mtpacga:   100314   23113   23178   23050*   23150    23263  |   23124       23122
    vaccg  :   191737   44153   44186   44152*   44186    44317  |   42043       42043
    Total  :  1279056  292515  297091  289888*  293393   294700  |  274764      274521
    Last edited by Mauro Vezzosi; 30th March 2019 at 11:35. Reason: Fixed "*" in vaccg


  #10 - Shelwien (Administrator, Kharkov, Ukraine)
    @Mauro Vezzosi: Can you tell which submodel has the most effect on humghcs? Certainly not a normal prefix model.

    As for CMo8', it's still a simple logistic mix of o0..o8, but with parameters tuned to this whole set.
    The current version is a little better (296,976), but it doesn't seem to have any further potential.

  #11 - Member (USA)
    Here are some additional cmix results:

    Code:
                     cmix   .bin   backwards
    chmpxx  121024   27299  27204  27283 
    chntxx  155844   37349  37316  37352 
    hehcmv  229354   54522  54463  54578 
    humdyst 38770    9295   9289   9298  
    humghcs 66495    9692   9617   9756  
    humhbb  73308    16742  16746  16760 
    humhdab 58864    12954  12946  12957 
    humprtb 56737    12780  12784  12835 
    mpomtcg 186609   44616  44578  44617 
    mtpacga 100314   23113  23096  23118 
    vaccg   191737   44153  44131  44155 
    Total   1279056  292515 292170 292709
    cmix: cmix on original data
    .bin: cmix on the data Shelwien posted here.
    backwards: cmix on the original data in reversed order

  #12 - Mauro Vezzosi (Member, Italy)
    Quote Originally Posted by Shelwien View Post
    @Mauro Vezzosi: Can you tell which submodel has most effect on humghcs? Certainly not a normal prefix model.
    I would say that Test 3 is the most important.

    Test 1.
    To check the effect of each model I ran the "optimizer": it changes one option of the model at a time and keeps it if better; I found that -m2,0,0x73a169fe compresses to 7945.
    Code:
    0.3.0 alpha 1\cmv a -ax "-m0,0,>"
    
    L: Loop number.
    M: Method/Model/Mixer identifier.
    O: Method/Model/Mixer option.
    P: Parameters to pass to the CLI -m switch to set Method/Models/Mixers.
    CS: Compressed size without header.
    B: "*" if better than previous.
    BS: Current best compressed size.
    BP: Current best Parameters.
    
    L  M  O               P         CS B         BS              BP Note
    0/ 0/ 0  0,0,0x7fededff       8060 *       8060  0,0,0x7fededff Compression using initial options -m0,0,> .
    1/ 0/ 1  1,0,0x7fededff       8052 *       8052  1,0,0x7fededff
    1/ 0/ 2  2,0,0x7fededff       7980 *       7980  2,0,0x7fededff -80 bytes (8060 - 7980), enables maximum number of order-N models: order 0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 16.
    1/ 2/ 1  2,1,0x7fededff       7980 -       7980  2,0,0x7fededff
    1/ 2/ 2  2,2,0x7fededff       7980 -       7980  2,0,0x7fededff
    1/ 2/ 3  2,3,0x7fededff       7980 -       7980  2,0,0x7fededff
    1/29/ 0  2,0,0x5fededff       8065 -       7980  2,0,0x7fededff +85 bytes, disables the addition of the SSE of the input values in the final mixer.
    1/28/ 0  2,0,0x6fededff       8009 -       7980  2,0,0x7fededff
    1/13/ 0  2,0,0x7fedcdff       7981 -       7980  2,0,0x7fededff
    1/ 0/ 0  2,0,0x7fededfe       7980 -       7980  2,0,0x7fededff
    1/ 1/ 0  2,0,0x7fededf1       8549 -       7980  2,0,0x7fededff +569 bytes, disables the sparse match model (SMM), SMM seems quite important.
                                                                    In CMV, if the sparse match model doesn't match, it returns a sparse ~order-2 prediction.
    1/ 1/ 1  2,0,0x7fededf3       8354 -       7980  2,0,0x7fededff +374 bytes, enables SMM level 1 instead of level 7
    1/ 1/ 2  2,0,0x7fededf5       8090 -       7980  2,0,0x7fededff +110 bytes, enables SMM level 2 instead of level 7
    1/ 1/ 3  2,0,0x7fededf7       8014 -       7980  2,0,0x7fededff + 34 bytes, enables SMM level 3 instead of level 7
    1/ 1/ 4  2,0,0x7fededf9       7994 -       7980  2,0,0x7fededff + 14 bytes, enables SMM level 4 instead of level 7
    1/ 1/ 5  2,0,0x7fededfb       7984 -       7980  2,0,0x7fededff +  4 bytes, enables SMM level 5 instead of level 7
    1/ 1/ 6  2,0,0x7fededfd       7983 -       7980  2,0,0x7fededff +  3 bytes, enables SMM level 6 instead of level 7
    1/ 4/ 0  2,0,0x7fededef       7994 -       7980  2,0,0x7fededff
    1/ 9/ 0  2,0,0x7fede9ff       7981 -       7980  2,0,0x7fededff
    1/ 9/ 1  2,0,0x7fedebff       7980 -       7980  2,0,0x7fededff
    1/12/ 1  2,0,0x7fedfdff       7981 -       7980  2,0,0x7fededff
    1/14/ 0  2,0,0x7fedadff       7983 -       7980  2,0,0x7fededff
    1/15/ 0  2,0,0x7fec6dff       8067 -       7980  2,0,0x7fededff +77 bytes, disables some sparse and masked models (FCM).
    1/15/ 1  2,0,0x7fecedff       8068 -       7980  2,0,0x7fededff +88 bytes, enables FCM level 1 instead of level 3.
    1/15/ 2  2,0,0x7fed6dff       7979 *       7979  2,0,0x7fed6dff - 1 bytes, enables FCM level 2 instead of level 3.
    1/23/ 0  2,0,0x7f6d6dff       8000 -       7979  2,0,0x7fed6dff
    1/24/ 0  2,0,0x7ced6dff       7980 -       7979  2,0,0x7fed6dff
    1/24/ 1  2,0,0x7ded6dff       7979 -       7979  2,0,0x7fed6dff
    1/24/ 2  2,0,0x7eed6dff       7979 -       7979  2,0,0x7fed6dff
    1/26/ 0  2,0,0x7bed6dff       7975 *       7975  2,0,0x7bed6dff
    1/27/ 0  2,0,0x73ed6dff       7974 *       7974  2,0,0x73ed6dff
    1/30/ 0  2,0,0x33ed6dff       7978 -       7974  2,0,0x73ed6dff
    1/ 5/ 0  2,0,0x73ed6ddf       7974 -       7974  2,0,0x73ed6dff
    1/ 6/ 0  2,0,0x73ed6dbf       7975 -       7974  2,0,0x73ed6dff
    1/ 7/ 0  2,0,0x73ed6d7f       7975 -       7974  2,0,0x73ed6dff
    1/ 8/ 0  2,0,0x73ed6cff       7974 -       7974  2,0,0x73ed6dff
    1/11/ 0  2,0,0x73ed65ff       7974 -       7974  2,0,0x73ed6dff
    1/17/ 0  2,0,0x73e96dff       7966 *       7966  2,0,0x73e96dff
    1/17/ 1  2,0,0x73eb6dff       7968 -       7966  2,0,0x73e96dff
    1/19/ 0  2,0,0x73e16dff       7966 -       7966  2,0,0x73e96dff
    1/20/ 0  2,0,0x73c96dff      10184 -       7966  2,0,0x73e96dff +2218 bytes, disables the final mixers stage (FMS).
    1/20/ 1  2,0,0x73d96dff       8237 -       7966  2,0,0x73e96dff + 271 bytes, enables FMS level 1 (just 1 mixer) instead of level 2 (6->3->1 mixers with context from most complex to the simplest, no SSE).
    1/22/ 0  2,0,0x73a96dff       7930 *       7930  2,0,0x73a96dff
    ...loops by testing the model options again until nothing better is found...
    Test 2.
    Compression of the first N KiB of humghcs.
    Code:
     cmv paq8px First N KiB of humghcs
     269    268 humghcs-01K
     524    521 humghcs-02K
     774    770 humghcs-03K
    1011   1011 humghcs-04K
    1251   1258 humghcs-05K
    1498   1506 humghcs-06K
    1727   1750 humghcs-07K
    1842   1900 humghcs-08K
    1908   1994 humghcs-09K
    2047   2163 humghcs-10K
    2252   2376 humghcs-11K
    2369   2550 humghcs-12K
    2450   2691 humghcs-13K
    2545   2864 humghcs-14K
    2778   3104 humghcs-15K
    3026   3353 humghcs-16K
    3265   3599 humghcs-17K
    3501   3845 humghcs-18K
    3714   4075 humghcs-19K
    3948   4318 humghcs-20K
    4162   4552 humghcs-21K
    4381   4788 humghcs-22K
    4627   5034 humghcs-23K
    4771   5218 humghcs-24K
    4832   5345 humghcs-25K
    4912   5473 humghcs-26K
    4974   5593 humghcs-27K
    5024   5689 humghcs-28K
    5075   5768 humghcs-29K
    5159   5907 humghcs-30K
    5211   6014 humghcs-31K
    5263   6109 humghcs-32K
    5323   6224 humghcs-33K
    5396   6342 humghcs-34K
    5485   6467 humghcs-35K
    5569   6603 humghcs-36K
    5634   6689 humghcs-37K
    5711   6793 humghcs-38K
    5760   6890 humghcs-39K
    5829   7000 humghcs-40K
    5891   7098 humghcs-41K
    5953   7205 humghcs-42K
    6012   7315 humghcs-43K
    6096   7453 humghcs-44K
    6144   7535 humghcs-45K
    6230   7664 humghcs-46K
    6268   7736 humghcs-47K
    6318   7800 humghcs-48K
    6389   7885 humghcs-49K
    6407   7904 humghcs-50K
    6449   7973 humghcs-51K
    6481   8021 humghcs-52K
    6531   8107 humghcs-53K
    6564   8159 humghcs-54K
    6621   8274 humghcs-55K
    6668   8349 humghcs-56K
    6765   8480 humghcs-57K
    6859   8626 humghcs-58K
    6916   8720 humghcs-59K
    6953   8775 humghcs-60K
    7064   8925 humghcs-61K
    7307   9170 humghcs-62K
    7554   9418 humghcs-63K
    7759   9636 humghcs-64K
    8078   9840 humghcs
    Test 3.
    Test the sparse match model by turning off most of the other models and mixers (1 final mixer).
    Code:
    Explanation for "step" and "initial gap".
    
    16013 Level 0: sparse match model disabled.
     9360 Level 1: step 1, initial gap 1, 2, 3, 4.
     9136 Level 2: step 2, 3, 4, 5, initial gap 0.
     8566 Level 3: step 1, initial gap 1, 2, 3, 4 + step 2, 3, 4, 5, initial gap 0, prediction 0, 1, step - 1.
    
    Explanation for "prediction", e.g. step 4, initial gap 0, prediction 0, 1, 3:
    step 4, initial gap 0 use this context "c": ...x...x...xy
    when a match is found, the bytes predicted "p" are
    prediction 0: cp
    prediction 1: c.p
    prediction 3: c...p
    Too weird? :-)
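The strided context described above can be sketched as follows (a hypothetical helper, not CMV's actual code): with step s and initial gap g, the context samples the bytes at pos-1-g, pos-1-g-s, pos-1-g-2s, ...

```python
# Minimal sketch of a strided ("sparse") context: e.g. step 4, gap 0
# over context "...x...x...xy" samples the marked x positions.
def sparse_context(buf: bytes, pos: int, step: int, gap: int, order: int) -> tuple:
    idx = pos - 1 - gap
    ctx = []
    for _ in range(order):
        if idx < 0:
            break
        ctx.append(buf[idx])
        idx -= step
    return tuple(ctx)

buf = bytes(range(16))                   # positions 0..15 hold their own index
print(sparse_context(buf, 13, 4, 0, 3))  # (12, 8, 4)
```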


  #13 - Shelwien (Administrator, Kharkov, Ukraine)
    I did expect the sparse model to be the cause, but it's helpful that you explained how yours works.

    I wonder if we'd see something by logging symbol probabilities and generating a visualization
    like here: http://nishi.dreamhosters.com/u/lzma_markup_v0.rar

    DNALight results are still significantly better...

  #14 - Member (USA)
    I made some progress on this corpus:

    Code:
                    cmix   cmix-2 cmv    DNALi 
    chmpxx  121024  27299  24850  27240  24832 
    chntxx  155844  37349  31284  37336  31112 
    hehcmv  229354  54522  51499  54970  52513 
    humdyst 38770   9295   9286   9287   9161  
    humghcs 66495   9692   9723   8078   8082  
    humhbb  73308   16742  16656  16442  15959 
    humhdab 58864   12954  12773  12461  12192 
    humprtb 56737   12780  12632  12414  12253 
    mpomtcg 186609  44616  44104  44458  43493 
    mtpacga 100314  23113  23118  23050  23124 
    vaccg   191737  44153  42065  44152  42043 
    Total   1279056 292515 277990 289888 274764
    cmix-2 contains the following hack: every 2000 bytes, I disable the arithmetic coder and replay the previous 2000 bytes in reverse-complement order. There are "palindromic sequences" in the data. The hack that I added injects some fake input history (reverse complement) into cmix at the cost of 2x compression time. I plan to try creating a separate DNA model in cmix to model these sequences in a more principled way.
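The replay hack described above can be sketched like this, with `model.code` and `model.update` as hypothetical stand-ins for cmix's coding and history-only update paths:

```python
# After coding each block normally, replay that block in reverse-complement
# order into the model's history only (the arithmetic coder stays disabled
# for the replayed bytes), so palindromic sequences become visible matches.
COMPLEMENT = bytes.maketrans(b'ACGT', b'TGCA')

def reverse_complement(block: bytes) -> bytes:
    return block[::-1].translate(COMPLEMENT)

def compress_with_replay(data: bytes, model, block_size: int = 2000):
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        for b in block:
            model.code(b)    # normal arithmetic coding path (hypothetical API)
        for b in reverse_complement(block):
            model.update(b)  # fake history: model update only, no bits coded

assert reverse_complement(b'AACGT') == b'ACGTT'
assert reverse_complement(b'ACGT') == b'ACGT'  # a reverse-complement palindrome
```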


  #15 - Mauro Vezzosi (Member, Italy)
    cmv-2: as cmix-2 with cmv 0.3.0a1 optimized options (optimized options and results are in https://encode.su/threads/2284-CMV?p...ll=1#post54665 day 2019/04/05).
    Code:
                     cmix   cmix-2 cmv    cmv-2  DNALi
    chmpxx   121024  27299  24850  27240  24865  24832
    chntxx   155844  37349  31284  37336  31329  31112
    hehcmv   229354  54522  51499  54970  51956  52513
    humdyst   38770   9295   9286   9287   9264   9161
    humghcs   66495   9692   9723   8078   7987   8082
    humhbb    73308  16742  16656  16442  16212  15959
    humhdab   58864  12954  12773  12461  12229  12192
    humprtb   56737  12780  12632  12414  12211  12253
    mpomtcg  186609  44616  44104  44458  43721  43493
    mtpacga  100314  23113  23118  23050  22984  23124
    vaccg    191737  44153  42065  44152  42080  42043
    Total   1279056 292515 277990 289888 274838 274764
    Updated E.coli results with cmv-2.


  #16 - Shelwien (Administrator, Kharkov, Ukraine)
    100 bytes left to beat DNALight!

    Btw, how about inserting not only reverse-complement, but also reverse and complement separately (all combinations)?

    Also, Mauro, can you upload the compressed files for cmv-2? I'd test them with cdm. I got suspicious after seeing nncp output.

  #17 - Mauro Vezzosi (Member, Italy)
    Here it is.
    I mistakenly overwrote chmpxx_pal.cmv with a new test applying reverse-complement order every 4096 bytes instead of 2000: 24849 bytes.
    What is cdm?
    Attached Files

  #18 - Shelwien (Administrator, Kharkov, Ukraine)
    Code:
    275,528 dna.cmv           // concatenation of all files
    275,665 dna.cmv.paq8px178 // paq8px -7
    275,604 dna.cmv.cdm       // cdm5 "direct test"
    So I guess CMV is okay.

    cdm is basically a universal recompressor for bitcodes: https://encode.su/threads/2742-Compressed-data-model
    But sometimes it can compress the result of arithmetic coding too - when the model relies on SSE/APM too much or something. That seems to be the case with nncp.

  #19 - Mauro Vezzosi (Member, Italy)
    > 275,528 dna.cmv // concatenation of all files
    My total is 274838: I gave you chntxx with reverse-complement order applied every 128 bytes instead of 2000, where the compression was worse.
    128 is worse at least on chmpxx and chntxx; 4096 is better at least on chmpxx.
    I assumed that 128 should be better because it updates the models more frequently, but instead it seems to hurt compression: why? Does it break too many contexts?
    Maybe I'll try to: compress a block A, feed the compressor with the reverse-complement order of block A, feed the last N (~4?) bytes of block A (to recreate a minimal context), continue with block B, ...
    I can also add: ..., feed the last N (~4?) bytes of the reverse-complement order of block A, feed the compressor with the reverse-complement order of block B, ...

    > how about inserting not only reverse-complement, but also reverse and complement separately (all combinations)?
    IIRC I already tested that as a single transformation (no improvements found); I'll try more combinations.

    > So I guess CMV is okay.
    I've already tried to re-compress .cmv and I've never had a smaller size.

    > 100 bytes left to beat DNALight!
    DNALight is still better in 7 files out of 11.

  #20 - Shelwien (Administrator, Kharkov, Ukraine)
    > I assumed that 128 should be better because it updates the models more frequently, instead it seems to hurt the compression: why? Does it break too many contexts?

    Another possibility is the average match distance being much longer than 128.

    > I've already tried to re-compress .cmv and I've never had a smaller size.

    You can test very redundant files, like 10M of zeros, or repeated patterns.
    cdm actually can compress quite a few compressed formats - basically all the fast ones.
    For ones with arithmetic coding the cause can be redundant block headers,
    slightly skewed byte probabilities (e.g. because of a carryless rc), or too low a maximum probability
    (which leads to redundant code being generated for long chunks of predictable data).

  #21 - Mauro Vezzosi (Member, Italy)
    Tests for:
    - compress a block A;
    - feed the compressor with reverse-complement order of block A;
    - feed last N bytes of block A;
    - continue with block B, ...

    cmv-2a: block size 2000 (like cmv-2), feed last 16 bytes.
    cmv-2b: block size 4096, feed last 16 bytes.
    Code:
                     cmix   cmix-2 cmv    cmv-2 cmv-2a cmv-2b  DNALi
    chmpxx   121024  27299  24850  27240  24865  24851  24834  24832
    chntxx   155844  37349  31284  37336  31329  31281  31241  31112
    hehcmv   229354  54522  51499  54970  51956  51921  51877  52513
    humdyst   38770   9295   9286   9287   9264   9264   9261   9161
    humghcs   66495   9692   9723   8078   7987   7903   7930   8082
    humhbb    73308  16742  16656  16442  16212  16197  16189  15959
    humhdab   58864  12954  12773  12461  12229  12216  12207  12192
    humprtb   56737  12780  12632  12414  12211  12204  12194  12253
    mpomtcg  186609  44616  44104  44458  43721  43724  43701  43493
    mtpacga  100314  23113  23118  23050  22984  22995  22985  23124
    vaccg    191737  44153  42065  44152  42080  42055  42019  42043
    Total   1279056 292515 277990 289888 274838 274611 274438 274764
    Block size 4096 is still better than 2000.
    Quote Originally Posted by Shelwien View Post
    how about inserting not only reverse-complement, but also reverse and complement separately (all combinations)?
    I did a few tests and they hurt a little, except in 2-3 cases.
    Last edited by Mauro Vezzosi; 10th April 2019 at 08:54.

