
Originally Posted by
PSHUFB
I didn't follow the above. In particular, I didn't follow the reasoning from "it's quite easy to beat it on compression speed for these files" to "so some asymmetry has been accepted for a long time". Clearly, gzip is asymmetric, and also there are faster compression and decompression (and both) algorithms, but I didn't understand the connection.
Reading that back I don't understand what I was trying to say either!
Anyway, historically BAM has been the accepted standard format for aligned data for a considerable time. It has strong asymmetry between encode and decode speeds, but because of its "mix it all together and then gzip" approach its compression is weak. I guess what I meant is that we can beat it on both compression and decompression speed, and asymmetry has been deemed acceptable for a long time, so we shouldn't be too worried about it.
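To make the "mix it all together and then gzip" point a bit more concrete, here's a rough sketch (plain Python, with invented name/sequence/quality fields rather than real BAM records) comparing one gzip stream of interleaved records against one gzip stream per field type. On synthetic data like this the split streams usually come out noticeably smaller, because each stream gets a homogeneous alphabet and its own statistics:
Code:
# Rough sketch: one gzip stream of interleaved records vs one gzip stream per field.
# The records are invented name/sequence/quality stand-ins, not real BAM parsing.
import gzip, random

random.seed(0)
N = 10_000
names = [f"read{i:07d}" for i in range(N)]
seqs  = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(N)]
quals = ["".join(chr(33 + random.randint(20, 40)) for _ in range(100)) for _ in range(N)]

# BAM-style: every field of every record interleaved, then compressed as one blob.
mixed = "".join(n + s + q for n, s, q in zip(names, seqs, quals)).encode()

# Column-style: each field type compressed as its own stream (roughly what the
# column-oriented formats do).
split = [gzip.compress("".join(col).encode()) for col in (names, seqs, quals)]

print("interleaved:", len(gzip.compress(mixed)), "bytes")
print("per-column :", sum(len(s) for s in split), "bytes")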
Regarding your other point, "low coverage" relates to the DNA sequences placed against the genome. It's like random sampling: take enough samples and you'll gradually observe most of the sample space, possibly with considerable redundancy, although the nature of random sampling means there will always be gaps. Low coverage usually refers to the average redundancy ("depth"); a 5x sample, say, has the genome sequenced 5 times over on average. It's not uniform, and there will be many holes where no data covers a particular region. Hence people generally prefer high depth (say 40x), to get higher consensus accuracy and to reduce the number of uncovered holes. This obviously changes the compression statistics somewhat too, as it implies more redundancy, but it is data-type dependent. The DNA sequence is redundant, but the identifiers and quality values (one quality value per "base" / nucleotide) are not.
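If you want a feel for how depth relates to the holes, here's a toy simulation (pure Python; the genome size and read length are arbitrary numbers of my choosing, not from any real data set). It drops fixed-length reads at random positions and reports what fraction of the genome is left uncovered, alongside the exp(-depth) you'd expect from purely random (Poisson-style) sampling:
Code:
# Toy simulation of sequencing coverage: fixed-length reads dropped at random
# positions on a genome. Genome size and read length are arbitrary choices.
import math, random

random.seed(0)
GENOME = 1_000_000   # toy genome length (bases)
READLEN = 100        # read length (bases)

def uncovered_fraction(depth):
    """Fraction of the genome hit by no read at the given average depth."""
    nreads = int(depth * GENOME / READLEN)
    covered = bytearray(GENOME)
    for _ in range(nreads):
        start = random.randrange(GENOME - READLEN + 1)
        covered[start:start + READLEN] = b"\x01" * READLEN
    return 1 - sum(covered) / GENOME

for depth in (1, 5, 40):
    # For purely random sampling the expected uncovered fraction is exp(-depth).
    print(f"{depth:2d}x  simulated {uncovered_fraction(depth):.2e}  "
          f"expected {math.exp(-depth):.2e}")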
See for example some of the graphical visualisations of these aligned sequences at
https://ics.hutton.ac.uk/tablet/tablet-screenshots/ or https://sourceforge.net/projects/sta...urce=directory
Edit: putting it another way, if you want a representative subset of a high-coverage data set, you need to pick a few regions and slice through it (all the data for chr1:10,000,000 to chr1:11,000,000, for example) so that you have data at the same depth / coverage as the full set without having a huge file. Maybe just pick an entire chromosome, like chr22, which is only a few % of the total genome. There are various tools that can do this subsetting, including samtools (SAM, BAM, CRAM) and bcftools (VCF, BCF). They permit this over ftp and http too, so you don't need to download the entire data set.
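If you'd rather do that slicing programmatically than via the command line, a minimal sketch with the pysam library looks like the following (the file names and region are placeholders of mine, and it assumes the input file has an index); it's essentially what samtools view with a region argument does for you:
Code:
# Minimal region-subsetting sketch with pysam (paths and region are placeholders).
# Requires an indexed BAM; pysam wraps the same htslib code that samtools uses.
import pysam

with pysam.AlignmentFile("sample.bam", "rb") as infile, \
     pysam.AlignmentFile("chr22_slice.bam", "wb", template=infile) as outfile:
    # fetch() uses the index to pull only the reads overlapping the region.
    for read in infile.fetch("chr22", 16_000_000, 17_000_000):
        outfile.write(read)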
Edit 2: Albeit slow, an example of viewing an aligned data file (a CRAM in this case) directly over ftp:
samtools tview -p chr22:16388319 ftp://ftp-trace.ncbi.nih.gov/1000gen..._coverage.cram
Code:
16388321 16388331 16388341 16388351 16388361 16388371 16388381 16388391 16388401 16388411 16388421
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGCAGACACATGTGTGCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTGAAAAGGAGGCCAG
agcagac GCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTG ggccag
AGCAGACACA gcagtgatggataatgtcaggacgtgtgtgcagatggttggagggggctggggccagcagagtcgcgtgtattcagagatgagttccctg
AGCAGACACAT GCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGGTCACTG
AGCAGACACATGTGTGCAGTGATGGCTACTG GATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTGAAAAGGAGGCCAG
agcagacacatgtgtgcagtgatggctactgtca tgcttggagggggctggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
AGCAGACACATGTGTGCAGTGGTGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGA cagagatgagttcactgaaaaggaggccag
AGCAGACACATGTGTGCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTC gttcactgaaaaggaggccag
agcagacacatgtgtgcagagatggctactgtcaggaggtctgtgcagatgcttggagggggctggggccagcagagtcgggtggattca
agcagacacatgtgtgcagtgatggctactgtcaggaggtctgtgcagatgcttggagggggctggggccagcagagtcgggtggattca
ggctggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
tggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
agagtcgggtggattcagagatgagttcactgaaaaggaggccag
And similarly with one of the joint-called VCF files:
Code:
@ seq3a[/tmp]; bcftools view -H ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:16390000-16393000|cut -c 1-250
22 16390164 rs545334900 C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=21032;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0.001;SAS_AF=0;AA=c|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0
22 16390218 rs563622392 G T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=21303;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=g|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0
22 16390331 rs575263612 G A 100 PASS AC=15;AF=0.00299521;AN=5008;NS=2504;DP=27596;EAS_AF=0.001;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0089;SAS_AF=0.002;AA=g|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
22 16390504 rs542582108 C A 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=27647;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=c|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0
22 16391873 rs539358425 CAA C 100 PASS AC=11;AF=0.00219649;AN=5008;NS=2504;DP=21663;EAS_AF=0.002;AMR_AF=0;AFR_AF=0.0038;EUR_AF=0.002;SAS_AF=0.002;VT=INDEL GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
22 16391910 rs539916033 C CA 100 PASS AC=40;AF=0.00798722;AN=5008;NS=2504;DP=23249;EAS_AF=0.0089;AMR_AF=0;AFR_AF=0.0098;EUR_AF=0;SAS_AF=0.0184;AA=?|AAAAAAAAA|AAAAAAAAAA|unsure;VT=INDEL GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|
22 16391918 rs561239096 A T 100 PASS AC=23;AF=0.00459265;AN=5008;NS=2504;DP=20930;EAS_AF=0;AMR_AF=0.0029;AFR_AF=0.0083;EUR_AF=0.002;SAS_AF=0.0082;AA=a|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|
22 16392063 rs375187140 T G 100 PASS AC=17;AF=0.00339457;AN=5008;NS=2504;DP=13848;EAS_AF=0.003;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0133;AA=t|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
22 16392087 rs369384048 G A 100 PASS AC=33;AF=0.00658946;AN=5008;NS=2504;DP=9786;EAS_AF=0.0109;AMR_AF=0.0101;AFR_AF=0.0015;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|
22 16392597 rs565129926 G C 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=27908;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=g|||;VT=SNP GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0
Those lines are actually really long; around 10KB each. They list the genotypes per sample, and since this is a joint call across all the 1000 Genomes samples (far more than 1000 of them!) it's *hideously* inefficiently encoded. The format really needs dumping and starting again from scratch IMO.
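As a back-of-envelope check on that 10KB figure (assuming one tab plus a "0|0" genotype per sample, as in the lines above):
Code:
# Back-of-envelope: size of the per-sample genotype columns in the lines above.
samples = 2504               # NS=2504 in the INFO fields shown
per_sample = len("\t0|0")    # a tab separator plus a phased hom-ref genotype
print(samples * per_sample)  # 10016 bytes of near-identical text per record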