Some musings, to explain how my own code works and the sizes it achieves.
Enabling libbsc support in my experimental CRAM branch I get this:
Code:
-rw-r--r-- 1 jkb jkb 973413 Jan 3 14:24 _.cram
(without BSC enabled it was 1005359 bytes)
jkb@Orion:~/covid19/io_lib$ ./progs/cram_size _.cram
Block CORE , total size 0
Block content_id 11, total size 6476 g n RN
Block content_id 12, total size 31 8 QS
Block content_id 13, total size 199 g 0 8 IN
Block content_id 14, total size 86408 l SC
Block content_id 15, total size 68 14 BF
Block content_id 17, total size 22206 g B AP
Block content_id 19, total size 13 8 MQ
Block content_id 25, total size 30340 b B RL
Block content_id 26, total size 31119 g l FN
Block content_id 27, total size 56404 3 B FC
Block content_id 28, total size 590966 l B FP
Block content_id 29, total size 1264 g 4 DL
Block content_id 30, total size 14821 0 8 B BA
Block content_id 31, total size 93584 1 B BS
Block content_id 33, total size 69 14 RI
Block content_id 37, total size 1219 4
Block content_id 42, total size 36170 g l B BB
The various codec symbols there are "n"=name tokeniser, "B"=Bsc, "g"=Gzip (deflate), "l"=LZMA, and 0-9=various rANS codecs. There can be multiple codecs per line because the file contains discrete slices of data, and each slice may choose a different codec.
The two-letter codes indicate the type of data being encoded. In order: Read Name, Quality Score, INserted bases, BAM Flag, Alignment Position, Map Quality, Read Length, Feature Number, Feature Code, Feature Position, Deletion Length, BAse, Base Substitution code, Reference ID, and Base Block (multi-base edits).
The sequence is stored as a delta to the reference, with each sequence aligning at a specific position (AP) plus a CRAM interpretation of the SAM/BAM "CIGAR" string. E.g. in SAM (textual) we may see:
CIGAR 7909S14032M3D3449M5D4384M, meaning 7909 soft-clipped (unaligned) bases, 14032 matching (NB "match" can be match or mismatch, so it refers to the reference and query sequence marching together in sync), a 3 base-pair deletion (3D), 3449 matching, a 5 base-pair deletion, and finally 4384 matching. These get turned into features, each with a code (mismatch, deletion, insertion, soft-clip, etc.) in "FC" and a position in the sequence (relative to the last feature) in "FP". These FP values appear to be the bulk of the file contents. Bsc does a better job here than vanilla CRAM, which is limited to bzip2 - that was 713xxx in size IIRC.
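To make that CIGAR-to-feature translation concrete, here's a minimal sketch. It is not the actual io_lib encoder - the relative-position rule and the printed output just follow the description above:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *p = "7909S14032M3D3449M5D4384M";
    int qpos = 0;   /* current position within the query (read) */
    int last = 0;   /* query position of the previous feature */

    while (*p) {
        char *end;
        long len = strtol(p, &end, 10);
        char op = *end;
        p = end + 1;

        switch (op) {
        case 'M':   /* match/mismatch: ref and query advance in sync */
            qpos += len;
            break;
        case 'S':   /* soft-clip: the bases go to the SC data series */
        case 'I':   /* insertion: the bases go to the IN data series */
            printf("FC=%c FP=%d len=%ld\n", op, qpos - last, len);
            last = qpos;
            qpos += len;
            break;
        case 'D':   /* deletion: the length goes to the DL data series */
            printf("FC=%c FP=%d len=%ld\n", op, qpos - last, len);
            last = qpos;
            break;  /* deletions consume no query bases */
        }
    }
    return 0;
}
On the example CIGAR this emits FC=S FP=0, FC=D FP=14032 and FC=D FP=3449, i.e. each FP is the distance walked since the previous feature.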
The deletion lengths are stored in the "DL" data series (tiny). Base differences in M-regions are either a substitution code (ACGTN -> ACGTN as a numeric "BS") or, if the difference is something else, stored verbatim in "BA". The BS field is the next largest. Finally, "SC" is the soft-clipped bases which weren't aligned against the genome.
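For what a BS code looks like, here's a hedged sketch. In real CRAM the substitution matrix is parameterised and stored in the container header, so the fixed ACGTN priority ordering below is only an assumption for illustration:
Code:
#include <stdio.h>
#include <string.h>

static const char BASES[] = "ACGTN";

static int base_index(char b) {
    const char *s = strchr(BASES, b);
    return s ? (int)(s - BASES) : 4;   /* A=0 C=1 G=2 T=3, other=N=4 */
}

/* Code 0..3 = which of the four non-reference bases the read showed,
 * counting through ACGTN order with the reference base skipped. */
static int bs_code(char ref, char read) {
    int r = base_index(ref), q = base_index(read), code = 0;
    for (int i = 0; i < 5; i++) {
        if (i == r) continue;   /* skip the reference base itself */
        if (i == q) return code;
        code++;
    }
    return -1;                  /* read == ref: not a substitution */
}

int main(void) {
    printf("ref=A read=G -> BS=%d\n", bs_code('A', 'G'));   /* BS=1 */
    printf("ref=C read=T -> BS=%d\n", bs_code('C', 'T'));   /* BS=2 */
    return 0;
}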
Obvious things that can be improved are:
- SC: I'd expect all of this data to be aligned properly, but the sequence aligner I used (bwa) split things up into multiple records, which gave a high proportion of clipped bases. Changing the aligner or its parameters would probably shrink the file by another 50Kb or so (guesswork).
- FP: this may be best done via a "tab stop" type of idea (see the first sketch after this list). Many of the differences are at consistent genome locations, while others are either random de novo mutations or sequencing errors.
- FP: again, maybe we'd want to produce several reference sequences instead of just one, encompassing the most abundant variants (see the second sketch below). "RI" is currently constant, but it would then become an indicator of which reference copy was aligned against. That means we could collapse many of the commonly occurring mutations into a single number representing the virus lineage; what's left after that is simply random differences. This is comparable to what I mentioned above about maintaining a lineage tree and encoding to a model of the likelihood of being in each lineage based on what's gone before.
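To make the tab-stop idea concrete, here's a hypothetical sketch. The table of common mutation sites and the index/literal escape scheme are invented for illustration, not anything io_lib does today:
Code:
#include <stdio.h>

#define NSTOPS 4
static const int tab_stops[NSTOPS] = { 241, 3037, 14408, 23403 }; /* example sites */

/* Encode one feature position: return >=0 for a tab-stop index,
 * or -1 meaning "a literal position follows". */
static int encode_fp(int pos, int *literal) {
    for (int i = 0; i < NSTOPS; i++)
        if (tab_stops[i] == pos)
            return i;
    *literal = pos;
    return -1;
}

int main(void) {
    int positions[] = { 241, 14408, 17858, 23403 };   /* made-up feature sites */
    for (int i = 0; i < 4; i++) {
        int lit, sym = encode_fp(positions[i], &lit);
        if (sym >= 0)
            printf("pos %-5d -> stop #%d\n", positions[i], sym);
        else
            printf("pos %-5d -> literal %d\n", positions[i], lit);
    }
    return 0;
}
Sites hitting the table shrink to a tiny, highly skewed symbol alphabet; only the de novo mutations and sequencing errors pay the cost of a full literal.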
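And a similarly hypothetical sketch of the multi-reference idea: pick whichever reference copy explains most of a sequence's differences, store that choice in RI, and only the residual differences remain to encode. All positions and lineages below are made up, and for brevity it ignores mutations a reference has that the read lacks (which would also show as differences):
Code:
#include <stdio.h>

#define NREFS  3
#define MAXMUT 8

/* Mutation sets baked into each candidate reference, terminated by -1. */
static const int ref_muts[NREFS][MAXMUT] = {
    { -1 },                          /* ref 0: the original reference  */
    { 241, 3037, 23403, -1 },        /* ref 1: a common lineage        */
    { 241, 3037, 14408, 23403, -1 }  /* ref 2: a deeper lineage        */
};

static int has_mut(const int *set, int pos) {
    for (int i = 0; set[i] >= 0; i++)
        if (set[i] == pos) return 1;
    return 0;
}

int main(void) {
    /* Differences this read shows against the original reference. */
    int read_muts[] = { 241, 3037, 14408, 23403, 27000, -1 };

    int best = 0, best_resid = 1 << 30;
    for (int r = 0; r < NREFS; r++) {
        int resid = 0;
        for (int i = 0; read_muts[i] >= 0; i++)
            if (!has_mut(ref_muts[r], read_muts[i]))
                resid++;    /* mutation not explained by this reference */
        if (resid < best_resid) { best_resid = resid; best = r; }
    }
    printf("RI=%d, residual differences to encode: %d\n", best, best_resid);
    return 0;
}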
I should probably try and package this up as an entry, but the decoder is large as it's a general purpose tool. The gzipped binary is 340Kb, and it needs another 9.5Kb for the gzipped reference genome. Anyway, this list is mainly to encourage people to think on it more.
Edit: make that 921815 bytes (with bsc). I tweaked the parameters a bit.