
Thread: Tree alpha v0.1 download

  1. #571
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Hmmm... that last brain dump was fairly redundant with what I have said before...

    Kennon, let me know if the elaborations made anything clearer the second time around.
    Yes, it all helps. I often have to read new information several times to get it, so having it stated in different ways can help. I think the main thing I was missing is that DNA strings are really long. Even though you essentially said it, I didn't realize many of Tree's long/high-value matches are probably from the same original string, except one or both have mutated. It also possibly sheds light on why lzham does much better than Tree on certain binary files. It also gives me a better idea why a splay tree might be helpful (or at least I think it does).

  2. #572
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Interesting... that's less than 25 percent worse than DNACompress, and much better than any of the other algorithms in the table you posted in post #555. (bzip2 is second-best at 1.73) For that file, lzham mostly closes the gap between general-purpose and domain-specific DNA sequence compression.

    Have you tried it on the other files?
    I should have added Tree's results to the table. It would have been #2 at 1.51.

    Here are results on all five files I posted earlier for lzham, plzma and paq8o10t, plus the published DNACompress results. Not necessarily the best choices, but they are what I have handy. If Matt wants to recommend a different version of paq, I could try that.

    HUMDYST: lzham 2.26, plzma 2.15, paq 1.93, DNACompress 1.91
    HUMGHCS: lzham 1.26, plzma 1.12, paq 1.20, DNACompress 1.03
    MIPACGA: lzham 2.13, plzma 2.08, paq 1.85, DNACompress 1.86
    MPOCPCG/CHMPXX: lzham 2.12, plzma 2.08, paq 1.82, DNACompress 1.67
    VACCG: lzham 2.08, plzma 2.05, paq 1.85, DNACompress 1.76

    Not too surprisingly, paq8o10t did pretty well. What surprises me is how close it gets to DNACompress even though it's not specialized.

    Professor Ladner mentioned DNASequitur uses palindromes when I visited him. My impression is that he was disappointed with the amount of compression gained. Looking through the results in the dnasequitur paper, it looks like they got around 5% - 10% better total, but attribute some of that to longer matches.
    Last edited by Kennon Conrad; 20th January 2015 at 08:07.

  3. #573
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I added the results for Tree and for bzip2 and arith from the paper, for ease of reference:

    Code:
    HUMDYST: bzip2 2.18, tree 2.18, lzham 2.26, plzma 2.15, arith 1.95, paq 1.93, DNACompress 1.91
    HUMGHCS: bzip2 1.73, tree 1.51, lzham 1.26, plzma 1.12, arith 2.00, paq 1.20, DNACompress 1.03
    MIPACGA: bzip2 2.12, tree 2.11, lzham 2.13, plzma 2.08, arith 1.88, paq 1.85, DNACompress 1.86
    CHMPXX:: bzip2 2.12, tree 2.11, lzham 2.12, plzma 2.08, arith 1.87, paq 1.82, DNACompress 1.67
    VACCG::: bzip2 2.09, tree 2.10, lzham 2.08, plzma 2.05, arith 1.92, paq 1.85, DNACompress 1.76
    The extra colons after the shorter names are just for alignment (the CODE environment wasn't preserving my spaces in preview, for some reason).

  4. #574
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    458
    Thanks
    143
    Thanked 158 Times in 106 Posts
    The key difference between LZMA, lzham and dedicated DNA compressors is the strand. DNA comes in two strands which are complements of each other, i.e. a tr/ACGT/TGCA/ (plus a reversal). Generally, when things get flipped you only need to compare against the reverse strand:

    Code:
    jkb@seq3a[scratch01/jkb] echo CATTG | rev | tr ACGT TGCA
    CAATG
    This has the benefit of doubling up your history for matching strings in an LZ algorithm, at the cost of an extra bit to say which strand.
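    To make that concrete, here is a minimal C sketch of the transform an LZ matcher would apply so it can also search the complement strand (the function names are mine, not from any of the compressors discussed); the matcher would index both the original buffer and the transformed copy, and emit one extra strand bit per match:

    Code:
    #include <stddef.h>

    /* Complement one base: A<->T, C<->G; anything else (N etc.) passes through. */
    static unsigned char comp(unsigned char b)
    {
        switch (b) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return b;
        }
    }

    /* Write the reverse complement of src[0..len-1] into dst[0..len-1]. */
    static void reverse_complement(const unsigned char *src, unsigned char *dst, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            dst[i] = comp(src[len - 1 - i]);
    }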


    On the other points of statistical analysis, try something like Fickett's method of coding likelihood. This is deceptively simple (and for sure not the best out there), but easy. It comes from the knowledge of how DNA is converted to proteins. DNA is broken down into triplets, named codons, matched by the tRNAs and utilised by the ribosome. There are 64 (4^3) possible such triplets, matching to 20 amino acids and a stop codon. This translation table is the Genetic Code:

    [Image: genetic_code.gif (the standard genetic code table: 64 codons mapping to 20 amino acids plus stop)]

    It's obvious that with 64->21 there is redundancy, as can be seen above. Some amino acids (W) have just 1 triplet while others (S, L, etc.) have 6. It's also obvious they form groups. This tends to lead to situations like GCU, GCC, GCA and GCG (NB: U in RNA = T in DNA) all being Alanine, so GC* is the same thing. Combined with natural selection and random mutations, this means there is selective pressure against changing the first two bases (G and C in the GC* example) which usually does not exist for the last base (U, C, A, G). Coupling that knowledge with overall base-by-base metrics (rather than triplet-by-triplet) such as GC content yields some interesting models to compare "coding DNA" vs "non-coding DNA". For example:

    [Image: spin_plot_p.unix.gif (coding-likelihood plot; stop codons shown as red bars)]

    In that plot the stop codons are also shown as red bars. The ribosome tends to decode a gene one codon at a time until it finds a stop codon (there are 3 in the standard genetic code), so we can also detect "open reading frames" by looking for gaps where there are no stop codons.

    The above is an exceptionally simple and effective example; things are far muddier in eukaryotes because genes are split into multiple exons and introns, but the methods still work.

    I've no idea how much of a statistical bias it actually has though and whether it is sufficient for teasing out significantly better compression.
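    As a toy illustration of the stop-codon idea (this is not Fickett's statistic and not any existing tool; the stop codons and frame parameter follow the standard genetic code, everything else is made up for the sketch), one can scan a reading frame and report the longest stop-free run, which is a crude open-reading-frame detector:

    Code:
    #include <string.h>

    /* Nonzero if the three bases at p form a stop codon (standard code, DNA letters). */
    static int is_stop(const char *p)
    {
        return memcmp(p, "TAA", 3) == 0 ||
               memcmp(p, "TAG", 3) == 0 ||
               memcmp(p, "TGA", 3) == 0;
    }

    /* Longest run of stop-free codons in reading frame 0, 1 or 2: a crude
       "open reading frame" detector over a raw ACGT string of length len. */
    static long longest_orf(const char *seq, long len, int frame)
    {
        long best = 0, run = 0, i;
        for (i = frame; i + 3 <= len; i += 3) {
            if (is_stop(seq + i)) {
                if (run > best) best = run;
                run = 0;
            } else {
                run++;
            }
        }
        return run > best ? run : best;   /* length in codons */
    }
    A coding-likelihood score in the spirit of Fickett's would additionally weigh per-codon-position base frequencies, i.e. the bias against changing the first two bases described above.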

  5. The Following User Says Thank You to JamesB For This Useful Post:

    Paul W. (20th January 2015)

  6. #575
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    JamesB,

    Do regions of a gene that don't code for a protein sequence have a triplet token structure, too, just with different statistics, or are they just sequences of base pairs, not chunked in threes?

    For regions that code protein sequences, is it easy to find the triplet boundaries?

    I've been calling the part of the gene that gets transcribed to RNA the "coding region" of the gene, but maybe that's the wrong terminology---is it only the stretches that get translated to protein sequences that I should call "coding regions"? What do you call the whole right-hand side?

    I take it your second picture is plotting two different measures (in scans through three genes): (1) occurrences of stop triplets and (2) something like a statistic saying whether the triplets statistically resemble typical protein-coding triplets.

    So what the plots are showing is that where you don't have stops, and do have a high statistical resemblance to protein sequences, you've probably got a stretch of DNA that does in fact code for a protein sequence. The two measures help you find the boundaries of protein-coding segments, which have different statistical properties, so you could compress them with a specific model.

    Any ideas about other known kinds of regions within genes, like control regions (with activation, repression, and promoter sites) or different kinds of introns? I'd guess geneticists have tools for helping find those things too, but I don't know.

  7. #576
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    458
    Thanks
    143
    Thanked 158 Times in 106 Posts
    The regions of a gene that aren't part of the coding sequences (exons = coding, introns = the bits between the exons, but maybe containing regulatory signatures) don't have to be a multiple of 3 in length. However, it is likely that the statistics of introns are still not quite the same as the statistics of the wider inter-genic regions. As for the definition of a coding region, I am not sure I know the correct answer either. It's rather muddled in places, no doubt.

    Yes, the second plot shows stop codons (red bars) and a coding likelihood test (the wavy line). The test may not be Fickett's, but perhaps one of the more complex ones. It demonstrates the basic concept though (rather well in this case).

    There is a whole myriad of tools out there for finding regulatory regions, tRNAs, various repeat families, a multitude of small RNA classifications, and so on. Whether or not any of these are prevalent enough to help compression is an open question.

  8. The Following User Says Thank You to JamesB For This Useful Post:

    Paul W. (20th January 2015)

  9. #577
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think the main thing I was missing is that DNA strings are really long.
    Yes. The average gene is something like 25,000 base pairs long. Some are only a few hundred (but that's still pretty long by normal standards), and others are hundreds of thousands. A few are over a million.

    When you start duplicating and reduplicating strings like that, and then doing random local edits, you tend to get a lot of substrings that still match, may themselves be hundreds of characters long, and tend to occur in the same order at nearly the same offsets.

    Most of this isn't going to show up in small files that only give the sequence for one gene or a few genes around a particular locus of interest. It's going to show up in sequences for whole chromosomes (duplicates tend to get scattered around a chromosome) or a whole genome. (Genes make their way onto different chromosomes as well, and IIRC whole chromosomes were duplicated a very long time ago.)

    At a smaller scale, a gene that codes for proteins doesn't really code the whole protein sequence in one long string. That string is interspersed with non-coding stretches called "introns". Introns are weird and poorly understood, and I don't know how much introns tend to resemble each other. Some introns are widely believed to be "selfish DNA" fragments that copy themselves and insert themselves in various places, so I'd guess that there'd tend to be copies of those that behave similarly to retroviruses for compression purposes. They're typically much shorter than normal genes, or retroviruses, but still pretty long. That may account for some of the longish repeats (100-400 base pairs?) I can see in some of the fv plots, even in a small file, but I don't know.

    It also gives me a better idea why a splay tree might be helpful (or at least I think it does).
    For a gene that's been duplicated once and then accumulated mutations, you'll tend to get matches back to the first copy when you hit the second copy, with random long matches alternating with short mismatches.

    A splay tree offset coder should be good for that, because the splay tree will tend to shorten paths to recently-used parts of the tree, like the last place you got one of those matches. If you keep going to the same part of the tree, the paths and resulting codes will get very short, not just for the offset you searched for, but its near neighbors as well.

    Even better, a splay tree can do this simultaneously in several different hot regions of the tree. That will be important when you have several copies of a gene, each randomly mutated in different places. You'll get some matches to substrings in the most-recently-seen copy, others to substrings in the next-most-recently-seen copy, etc. Since you tend to access the strings in each copy sequentially (modulo random mismatches), a splay tree should work very well. Splay trees rock for sequential- or nearly-sequential accesses to keys, even if you have several regions of your key space being sequentially accessed.

    Doug Jones at Iowa gives away a splay prefix tree coder that can handle short integer keys, so you could have a sliding window of up to 64K strings and plug it in to splay-prefix-code the offsets. I think that would be a very interesting thing to do for text, especially boilerplate strings that recur in the same order in multiple articles in enwik9. I'm not sure how much hacking it would be to generalize it to long keys, so that the window could cover a whole chromosome or genome. (It builds a whole fully-populated tree covering the whole keyspace at init time, so you pay space proportional to the whole keyspace whether you need it or not. Building the tree as you go may be better for large and sparsely-used keyspaces.)

    Here's Doug's page with the splay coder & decoder: http://homepage.cs.uiowa.edu/~jones/compress/
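    For anyone who hasn't looked at splay trees, the whole trick is in the splay step below. This is the standard top-down splay over integer keys, written from the textbook description (it is not Doug's coder and isn't wired to any entropy coder): every access rotates the accessed key to the root, so keys in a recently-used neighborhood end up with short root paths, and a prefix code derived from those paths gets short in exactly the hot regions described above.

    Code:
    typedef struct Node {
        long key;                  /* e.g. a match offset */
        struct Node *left, *right;
    } Node;

    /* Standard top-down splay: returns the new root, which holds 'key' if it is
       present, otherwise a neighboring key.  Recently accessed keys (and their
       neighbors) end up near the root, so their paths, and any path-derived
       codes, are short. */
    static Node *splay(Node *t, long key)
    {
        Node N, *l, *r, *y;
        if (t == NULL) return NULL;
        N.left = N.right = NULL;
        l = r = &N;
        for (;;) {
            if (key < t->key) {
                if (t->left == NULL) break;
                if (key < t->left->key) {          /* zig-zig: rotate right */
                    y = t->left; t->left = y->right; y->right = t; t = y;
                    if (t->left == NULL) break;
                }
                r->left = t; r = t; t = t->left;   /* link right */
            } else if (key > t->key) {
                if (t->right == NULL) break;
                if (key > t->right->key) {         /* zag-zag: rotate left */
                    y = t->right; t->right = y->left; y->left = t; t = y;
                    if (t->right == NULL) break;
                }
                l->right = t; l = t; t = t->right; /* link left */
            } else {
                break;
            }
        }
        l->right = t->left;                        /* reassemble */
        r->left = t->right;
        t->left = N.right;
        t->right = N.left;
        return t;
    }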

    --

    I've been scratching my head over this fv picture, of the humhdab sequence, which is less than 60Kbp long:

    [Image: humhdab_fv.png (fv match plot of the humhdab sequence)]

    It mostly looks like the usual ergodic Markov noise, but with a funny stridey feature on the very left end, and large cluster features repeating four times in the right two-thirds of the picture.

    Each pixel column here corresponds to about 115 bytes of input.

    The feature on the very left, about a third of the way up the scale (at 40), shows 8-byte matches (black) at just under and just over 40, over a span of about four pixels, or a few hundred bytes.

    In the right two-thirds of the picture, you have some sort of alternation going on between scatters of black marks (8-byte matches) at offsets scattered around randomly, mostly between 200 and 2000, recurring at an interval of about 10,000 bytes of input and lasting 2 or 3 thousand bytes. In between those vertically scattered features, you get some more black marks scattered at higher offsets, around 2000 to 5000.

    The black marks seem to be short horizontal lines, one to four pixels wide, suggesting that you've got approximate string matches of around 200 bytes, but those approximate strings are repeating at somewhat randomized intervals, as though those strings were repeating in a roughly loop-like fashion but with a locally permuted order. [but see NOTE below]

    I have no idea yet what any of this really means. It looks like nested loops, with the inner one somewhat randomized. You have an outer cyclic pattern in the data repeating four or five times, and within each iteration of that cycle (about 10K), you have a phase of about 2K or so where things tend not to repeat right away (within 200 bytes), but then do repeat at about 600 bytes plus or minus a couple of hundred.

    It looks weirdly like the initial part of each 10K phase is a 2K-or-so phase where you have several iterations of a 200-byte structure, and nothing in each of those structures matches anything in itself, but they match things in the previous structure at random, like an array of permutations of the same data.

    [NOTE: One tricky thing about fv is that the pixel color only tells you whether there were ANY matches of a given length in the stretch of input and range of offset distances corresponding to that pixel. A single 8-byte match is enough to make the pixel go black, in this example. When you see two or three consecutive pixels go black, it suggests you've got a long string that mostly matches, but you might only have two or three matches, one per pixel, or anything in between... So I need to zoom in and see what's really going on at a finer granularity. (I guess Matt's original color scheme would be better for this example... I'll revert the color scheme so any mix of blue (8-byte) and green (4-byte) matches within a pixel will make it come out blue-green.)]
    Last edited by Paul W.; 21st January 2015 at 18:30.

  10. #578
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Kennon,

    If you get serious about big DNA sequence files, this site (UCSC Genome Bioinformatics) has got them:

    http://genome.ucsc.edu/index.html

    It looks like they've at least got files for the entire known sequences of chromosomes, which could be concatenated to give you basically the whole nuclear genome. (They've got mitochondrial DNA sequences too.)

    I think the .fa suffix means FASTA format, which is an extended version of the usual GCTA letters. Where a letter isn't known exactly, it's replaced with another letter to signify the ambiguity. Some letter means "could be G or A", another means "could be A or T", etc. (I think the funny letters are very rare, especially in the "finished" sequence files, except at the very beginnings and ends of some chromosomes.)

    They also give away a bunch of tools, and annotation files that mark and describe regions of gene sequences.

  11. #579
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    [Update:

    I looked up a description of humhdabcd a.k.a. humhdab, and it's a human gene sequence, but comprises three cosmids. Cosmids, it turns out, are viral DNA that was originally bacterial DNA. As I understand it, these cosmids come from a phage virus that snarfed a plasmid from a bacterium it infected. (Plasmids are loops of DNA that bacteria trade around, often coding for some handy thing like resistance to some antibiotic. Bacteria use plasmid exchange to swap genetic material in sort of the way we use sex to come up with new combinations of genes.)

    Now I'm wondering why there are at least four iterations of a large, regular structure, if the sequence is mostly three cosmids. And I still have no idea why DNA would look like that in fv, viral-bacterial-human or otherwise. ]

  12. #580
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    Kennon,

    If you get serious about big DNA sequence files, this site's (UCSC Genome Bioinformatics) has got them:

    http://genome.ucsc.edu/index.html

    It looks like they've at least got files for the entire known sequences of chromosomes, which could be concatenated to give you basically the whole nuclear genome. (They've got mitochondrial DNA sequences too.)

    I think the .fa suffix means FASTA format, which is an extended version of the usual GCTA letters. Where a letter isn't known exactly, it's replaced with another letter to signify the ambiguity. Some letter means "could be G or A", another means "could be A or T", etc. (I think the funny letters are very rare, especially in the "finished" sequence files, except at the very beginnings and ends of some chromosomes.)

    They also give away a bunch of tools, and annotation files that mark and describe regions of gene sequences.
    Yes, I downloaded the mouse genome from that site, but later saw it was FASTA format and bagged it for now. NCBI (http://www.ncbi.nlm.nih.gov/) also has a lot of information, but I'm not even sure what I am looking at much of the time. It's an interesting subject, but very specialized, so I probably won't go much further. I am trying to better classify the types of data Tree is really bad at relative to LZMA. I thought DNA might be one of them, but Tree actually did okay there. The worst performance I have seen is on the mri and x-ray files from the Silesia Corpus. I think mostly Tree needs to recognize 16-bit data, and then a simple delta filter could help a lot. A double-symbol delta filter wouldn't be bad for enwik9 either, since it could be used for alphabet listings. Tree is not effective at compressing extended language alphabet listings, especially when the alphabet list only appears once and some of the UTF-8 symbols are not used anywhere else in the file.

  13. #581
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, I downloaded the mouse genome from that site, but later saw it was FASTA format and bagged it for now.
    FASTA is pretty simple. A file can contain one or more sequences. Each sequence starts with a header line that starts with the ">" character, maybe followed by comment lines starting with ";".

    The sequence itself is just a string of letters like GCTA for a DNA sequence, broken into lines of no more than 120 characters (and usually no more than 80).

    So to convert a FASTA file to a raw sequence of DNA letters (or whatever), just loop to read one line at a time, and if it doesn't start with ">" or ";", write it out without the trailing line break. That's all you need to do.

    That will throw away the header & comment lines, concatenate the sequence of sequences into one long sequence, and concatenate the lines of each sequence together, so that you have one big string of DNA letters. (Like the little files in the DNA corpus.)

    A DNA sequence file like that may contain a few letters besides GCTA (e.g., N for any of GCTA). Just leave those in. They're standard now, and they're rare in finished sequences (like 1 in 10,000 characters), so they don't matter much.
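    That recipe as a minimal C filter (my sketch, reading stdin and writing stdout; it assumes FASTA's short lines, so a fixed buffer is fine):

    Code:
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[4096];
        while (fgets(line, sizeof line, stdin) != NULL) {
            size_t n;
            if (line[0] == '>' || line[0] == ';')
                continue;                        /* drop header/comment lines */
            n = strlen(line);
            while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
                n--;                             /* strip the trailing line break */
            fwrite(line, 1, n, stdout);          /* concatenate the sequence */
        }
        return 0;
    }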

  14. #582
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    FASTA is pretty simple. A file can contain one or more sequences. Each sequence starts with a header line that starts with the ">" character, maybe followed by comment lines starting with ";".

    The sequence itself is just a string of letters like GCTA for a DNA sequence, broken into lines of no more than 120 characters (and usually no more than 80).

    So to convert a FASTA file to a raw sequence of DNA letters (or whatever), just loop to read one line at a time, and if it doesn't start with ">" or ";", write it out without the trailing line break. That's all you need to do.

    That will throw away the header & comment lines, concatenate the sequence of sequences into one long sequence, and concatenate the lines of each sequence together, so that you have one big string of DNA letters. (Like the little files in the DNA corpus.)

    A DNA sequence file like that may contain a few letters besides GCTA (e.g., N for any of GCTA). Just leave those in. They're standard now, and they're rare in finished sequences (like 1 in 10,000 characters), so they don't matter much.
    Sorry, I forgot. The file is actually .2bit format and I got stuck looking for a .2bit to fasta converter that works.

  15. #583
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Sorry, I forgot. The file is actually .2bit format and I got stuck looking for a .2bit to fasta converter that works.
    In their downloads area they have a lot of stuff like the whole human genome, chromosome-by-chromosome, in .fa.gz files, which I assume is just gzipped FASTA.

    If my understanding is correct, there's a whole lot of data available in FASTA format, which is trivial to convert to plain letter sequences, without dealing with .2bit (which I don't understand yet).

    BTW, they have .fa.gz files for the numbered chromosomes (like Chr17.fa.gz), the X and Y chromosomes, and the mitochondrial "chromosome" (ChrM.fa.gz), which isn't really a chromosome at all. (Mitochondria are really symbiotic bacteria that have lived in all of our cells since almost forever; you only inherit them from your mother, because they're there in egg cells, and don't participate in the normal sexual reassortment of chromosomes.)

  16. #584
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Paul W. View Post
    In their downloads area they have a lot of stuff like the whole human genome, chromosome-by-chromosome, in .fa.gz files, which I assume is just gzipped FASTA.

    If my understanding is correct, there's a whole lot of data available in FASTA format, which is trivial to convert to plain letter sequences, without dealing with .2bit (which I don't understand yet).

    BTW, they have .fa.gz files for the numbered chromosomes (like Chr17.fa.gz), the X and Y chromosomes, and the mitochondrial "chromosome" (ChrM.fa.gz), which isn't really a chromosome at all. (Mitochondria are really symbiotic bacteria that have lived in all of our cells since almost forever; you only inherit them from your mother, because they're there in egg cells, and don't participate in the normal sexual reassortment of chromosomes.)
    Yes, it can be overwhelming for somebody who doesn't know much about biology. Tree often isn't very effective with binaryish data, DNA included. If I want to get serious about DNA compression, I think context modeling is the way to go. It shouldn't be too expensive to implement a fairly high order character model since the alphabet is so small. It might be good to tokenize the longest matches, but it seems like that would depend on the performance of the context model. Tokenization is too expensive to provide much profit on most of these files since they tend to not be very compressible. It sounds like an optimum context model would need to produce predictions based on the reverse-ordered symbols as well as the normal order. I wonder if anyone has tried to create a version of paq for DNA.
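    For scale, here is a toy sketch of why a high-order model is cheap with this alphabet (my illustration only, not a design for Tree): with 2 bits per base, an order-8 context is a 16-bit index, so a full table of counts is only about half a megabyte.

    Code:
    #include <stdint.h>

    #define ORDER    8
    #define CTX_BITS (2 * ORDER)                    /* 16 bits -> 65,536 contexts */
    #define CTX_MASK ((1u << CTX_BITS) - 1)

    static uint16_t counts[1u << CTX_BITS][4];      /* counts[context][next base] */

    static int base2bits(int c)                     /* A=0, C=1, G=2, T=3, else -1 */
    {
        switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; case 'T': return 3;
        default:  return -1;
        }
    }

    /* Feed one base: returns the count of this base in this context before the
       update, which is what an arithmetic coder would turn into a probability. */
    static unsigned model_update(uint32_t *ctx, int c)
    {
        int b = base2bits(c);
        unsigned seen;
        if (b < 0) return 0;                        /* handle N etc. elsewhere */
        seen = counts[*ctx][b];
        if (counts[*ctx][b] < 0xFFFF) counts[*ctx][b]++;
        *ctx = ((*ctx << 2) | (unsigned)b) & CTX_MASK;
        return seen;
    }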

  17. #585
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Compression is very poor on data of a few KB. Versions older than v0.11 are not so bad.

  18. #586
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    The compression is too bad for several KB data. Older than v0.11 is not so bad.
    Can you provide a sample file or describe the file? I see only very minor differences on the files I have tried.

  19. #587
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    OK.
    Attached Files

  20. The Following User Says Thank You to xezz For This Useful Post:

    Kennon Conrad (23rd January 2015)

  21. #588
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    OK.
    Xezz, there is a bug in encoding and decoding that can cause failure on relatively small files like these. It is in the handling of the code bins that replaced the code tree starting in v0.11. Unfortunately it is not a simple fix, so if you want one you will have to wait for the next version, which will also include some mtf improvements for small files like these, probably in one or two weeks.

  22. #589
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.19

    Tree v0.19 has a fix for the problem xezz posted. TreeCompress64 was modified to slightly favor matches starting immediately after a special capital encode symbol and ending with a special capital encode symbol when compressing capital encoded input. TreeBitEncode and TreeDecode have slightly better code length calculations for symbols that fall out of the MTF queues, and a couple of relatively minor inefficiencies have been eliminated.

    Compressed file sizes for fields.c:
    Tree v0.10: 3484 bytes
    Tree v0.11 - v0.18: fails
    Tree v0.19: 3192 bytes
    Tree64 v0.19: 3188 bytes

    Compressed file sizes for xargs.1:
    Tree v0.10: 1980 bytes
    Tree v0.11 - v0.18: fails
    Tree v0.19: 1740 bytes
    Tree64 v0.19: 1724 bytes

    Results for enwik9:
    TreeCompress: 173,210,648 bytes in 119,742 seconds, decodes in 7.1 seconds, 1692 MB RAM
    TreeCompress64: 173,803,292 bytes in 4702 seconds, decodes in 7.2 seconds, 6023 MB RAM
    TreeDecode.c in a .zip file is 14,625 bytes

    Results for enwik8:
    TreeCompress: 21,497,672 bytes in 3937 seconds, decodes in 0.72 seconds
    TreeCompress64: 21,547,924 bytes in 293 seconds, decodes in 0.72 seconds, 1388 MB RAM

    Results for the Calgary Corpus, again from the Sequitur paper (http://www.cs.waikato.ac.nz/~ihw/pap...W-Compress.pdf), but this time with Tree added:

    [Image: CalgaryResults.jpg (Calgary Corpus results from the Sequitur paper, with Tree added)]

    A funny thing happened. I opened BIB to see what the data is because I wanted to understand why Tree doesn't do better. I went a couple of pages down and saw the name of one of my former professors on one of the articles (not the professor I mentioned earlier). The world seems to be getting smaller all the time.....
    Attached Files

  23. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Mike (4th February 2015),xezz (4th February 2015)

  24. #590
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Is usage the same as in older versions?
    Decoding always fails.
    My procedure:
    TreeCompress in out -> TreeBitEncode in out -> TreeDecode in out

  25. #591
    Expert
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  26. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (6th February 2015)

  27. #592
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    Is usage the same as older?
    Decoding always fail.
    My Procedure:
    TreeCompress in out -> TreeBitEncode in out -> TreeDecode in out
    Usage is the same as before. You need to run TreePreEncode before the others. Alternatively, you can use TreeComp.bat ("TreeComp in") and it will run all four steps.

  28. #593
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Tree v0.19
    enwik8:
    21,497,664 bytes, 3759.977 sec. - 0.645 sec., TreeComp
    21,547,916 bytes, 268.118 sec. - 0.655 sec., TreeComp64
    21,570,772 bytes, 1883.272 sec. - 0.630 sec., TreeCompUTF8
    21,585,096 bytes, 292.053 sec. - 0.642 sec., TreeComp64UTF8
    Last edited by Sportman; 6th February 2015 at 12:14.

  29. The Following User Says Thank You to Sportman For This Useful Post:

    Kennon Conrad (6th February 2015)

  30. #594
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    enwik8:
    21,497,664 bytes, 3759.977 sec. - 0.645 sec., TreeComp
    21,547,916 bytes, 268.118 sec. - 0.655 sec., TreeComp64
    21,570,772 bytes, 1883.272 sec. - 0.630 sec., TreeCompUTF8
    21,585,096 bytes, 292.053 sec. - 0.642 sec., TreeComp64UTF8
    I think the slight differences in compressed file size are due to differences in floating point math results on different processors. I'm not sure there is anything I can do except not use floating point instructions.

  31. #595
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think the slight differences in compressed file size are due to differences in floating point math results on different processors. I'm not sure there is anything I can do except not use floating point instructions.
    You mean your cost function is affected by floating-point errors? That could be technically harmless, but I'd probably want to track it down and make it deterministic. You probably don't really need floating point, and might be better served by removing it.

  32. #596
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, the cost function is affected by differences in floating point calculations. The same thing could be done with integer math, but I think the calculation would take a lot more CPU cycles. If close is good enough, maybe a lookup table with interpolation would be adequate. Ultimately though, a cost function based on changes in LCP tree internal node and leaf counts may be better.
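    A sketch of the lookup-table-with-interpolation idea (hypothetical, not Tree's cost code): tabulate log2 of the mantissa once, then do all per-symbol cost math in integer fixed point, which is bit-identical on every CPU. One caveat: building the table with libm's log2 could in principle still differ by a final bit between C libraries, so a strictly deterministic build would ship the table as constants.

    Code:
    #include <stdint.h>
    #include <math.h>

    #define FRAC_BITS 16                       /* cost in 16.16 fixed point */
    #define TBL_BITS   8                       /* 257-entry table over [1, 2] */

    static uint32_t log2_tbl[(1 << TBL_BITS) + 1];

    static void init_log2_tbl(void)            /* once at startup */
    {
        int i;
        for (i = 0; i <= (1 << TBL_BITS); i++)
            log2_tbl[i] = (uint32_t)(log2(1.0 + (double)i / (1 << TBL_BITS))
                                     * (1 << FRAC_BITS) + 0.5);
    }

    /* log2(x) for x >= 1, in 16.16 fixed point, using only integer math. */
    static uint32_t fixed_log2(uint32_t x)
    {
        int msb = 31;
        uint32_t m, idx, frac, lo, hi;
        while (!(x & (1u << msb))) msb--;       /* integer part of the log */
        /* mantissa scaled so that 2^24 <= m < 2^25 (TBL_BITS + 16 = 24) */
        m    = (msb >= 24) ? (x >> (msb - 24)) : (x << (24 - msb));
        idx  = (m >> 16) - (1u << TBL_BITS);    /* top TBL_BITS mantissa bits */
        frac = m & 0xFFFF;                      /* remainder, for interpolation */
        lo   = log2_tbl[idx];
        hi   = log2_tbl[idx + 1];
        return ((uint32_t)msb << FRAC_BITS)
             + lo + (uint32_t)(((uint64_t)(hi - lo) * frac) >> 16);
    }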

  33. #597
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Kennon, have you tried Tree on the hq19 (whole human genome in FASTA format) file from Matt's 10GB Compression Benchmark corpus? (It's about 3GB.)

    I think that might show some interesting things that looking at little files of particular loci wouldn't. In particular, I'd expect Tree or LZMA to be able to find a lot more longish repeats, from retroviruses and other genes that have been copied widely within a chromosome, or across chromosomes. (IIRC, some of our chromosomes are basically copies of other whole chromosomes, from major copying malfunctions WAY back in evolution, with lots of subsequent mutations and duplications of various sorts---tiny point mutations, larger insertions and deletions, retroviruses, jumping genes going way the hell away from where they started, etc. I know many of our most important genes (within the Hox complex) are basically mutated copies of each other.)

    I haven't looked at the file in detail to see how much of it is metadata and how to strip it out to look at the raw sequence per se, but I think the method I gave in the previous post should work---the metadata should be really easy to strip out with a simple script, just deleting comment-like lines that start with either of the two special start characters, leaving the raw FASTA sequence. That will leave you with more than the four GCTA letters, because of ambiguities encoded by using other characters to encode things like G-or-C, but that's what the "real world data" look like, and the extra letters should be left in for compression tests. (Assuming exactly four letters is sorta cheating anyway, for most analyses.)

    Even comparing general-purpose compressors on the unmodified file (with metadata left in) would be interesting. That's real-world data too, and some compressors will win mostly by compressing the metadata well and not doing horribly on the actual sequences, which are mostly four-letter noise-like sequences at any small scale. (Comparing results for with- and without-metadata would be more revealing, of course.)

    (If you've tried to do this and run into problems, let me know. I'll eventually get around to doing this myself, and if there are gotchas, I want to know about them.)

  34. #598
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    BTW if you just run Tree on the hq19 file, as is, and post the top strings, I'll have a look at them, and that should be informative about metadata vs. sequence frequencies, funny strings that imply how to strip out the metadata, and a few other things. (Like artifacts of how the sequences are concatenated, so that my simple metadata-stripping algorithm needs to be slightly modified to account for that.)

    I'd expect that to be informative in the same way as looking at the top strings for enwik9---it'll show what's common due to metadata/markup vs. template-filling or whatever, vs. what's common due to normal "textual" regularities (in this case, actual gene sequence repeats).

    Hopefully I can come up with a little Perl script for turning hq19 into a good benchmark file, like Matt's/ours for stripping out most of the markup in enwik9.
    Last edited by Paul W.; 20th February 2015 at 17:46.

  35. #599
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    No, I haven't tried hq19. I don't think it will work properly for several reasons, the first of which is that I haven't figured out how to read the file size of files over 2 GB. To perform effectively on DNA, the entire data LCP structure generally needs to be built. Since Tree uses a tree instead of a suffix array, memory usage will blow up.
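    On the 2 GB issue: that limit is just the 32-bit ftell/fseek interface, and the usual workaround is the 64-bit variants. A sketch (assuming MSVC or a POSIX-style compiler; I don't know Tree's build setup, so treat the wrapper macro names as placeholders):

    Code:
    #define _FILE_OFFSET_BITS 64      /* must precede the includes on Linux */
    #include <stdio.h>

    #ifdef _MSC_VER
    #  define fseek64(f, off, wh) _fseeki64((f), (off), (wh))
    #  define ftell64(f)          _ftelli64(f)
    #else
    #  define fseek64(f, off, wh) fseeko((f), (off), (wh))
    #  define ftell64(f)          ftello(f)
    #endif

    long long file_size(const char *path)
    {
        FILE *f = fopen(path, "rb");
        long long size;
        if (f == NULL) return -1;
        if (fseek64(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
        size = ftell64(f);            /* stays correct past 2 GB */
        fclose(f);
        return size;
    }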

  36. #600
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I generated new plots of the symbol type probabilities. When I last did this, the "MTF queue" for symbols that appear more than 20 times was not very good (1 deep for each symbol length) and didn't use an ergodicity bit. This time I broke the symbols into two categories, one for symbols written at the top level and the other for symbols written while writing new symbols. I made two plots, one that is very smoothed out (100K symbols per output) and another that is not quite as smoothed out (10K symbols per output). There are nine plots: one for the ratio of symbols written that were used at level 0, four for the ratio of symbols from each category at level 0, and four for the ratio of symbols from each category at level >= 1 (writing subsymbols of new symbols). I think what surprised me the most is that at the beginning new symbols are more likely to occur when writing at level 0 than when writing the subsymbols of new symbols, but by the end new symbols are more likely to occur when writing subsymbols of new symbols than when writing at level 0.

    [Image: Tree64SymbolTypes_enwik9_100K.jpg (symbol type ratios for enwik9, smoothed over 100K symbols)]

    [Image: Tree64SymbolTypes_enwik9_10K.jpg (symbol type ratios for enwik9, smoothed over 10K symbols)]

    Not surprisingly, this supports the notion that a hierarchical grammar compressor could benefit from using separate symbol type models for level 0 and other levels. I imagine the characteristics of the symbols are different too, but have no evidence (yet).
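    A minimal sketch of what separate symbol-type models per level could look like (the category names and the scaling are hypothetical, not Tree's actual model): keep one adaptive count table for level 0 and one for all deeper levels, and select by level when coding.

    Code:
    #include <stdint.h>

    enum { TYPE_NEW, TYPE_MTF, TYPE_DICT, NUM_TYPES };   /* hypothetical categories */

    typedef struct {
        uint32_t count[NUM_TYPES];
        uint32_t total;
    } TypeModel;

    static TypeModel models[2];          /* [0] = level 0, [1] = all deeper levels */

    /* Probability of 'type' scaled to 1<<16, as an entropy coder would use it,
       followed by the adaptive update.  Laplace smoothing keeps it nonzero. */
    static uint32_t type_prob_and_update(int level, int type)
    {
        TypeModel *m = &models[level == 0 ? 0 : 1];
        uint32_t p = (uint32_t)(((uint64_t)(m->count[type] + 1) << 16)
                                / (m->total + NUM_TYPES));
        m->count[type]++;
        m->total++;
        return p;
    }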
    Last edited by Kennon Conrad; 23rd February 2015 at 04:39.

  37. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (22nd February 2015)
