Results 1 to 6 of 6

Thread: why Compression of DNA sequences is a very challenging task?

  1. #1
    Join Date
    Aug 2016
    Thanked 0 Times in 0 Posts

    why Compression of DNA sequences is a very challenging task?

    why the compression of DNA sequence is a difficult for compression algorithms?
    although DNA is composed of four bases (A, T, G, C), and can be coded using two bits per base , The human genome contains around 3 billion characters .although these software are designed for text compression.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Kharkov, Ukraine
    Thanked 1,363 Times in 779 Posts
    The problem is that even when its this kind of DNA data (and its not always this - the previous dna compression competition used a different encoding),
    its not just random independent symbols which can be packed to 2 bits each and that's all - it can be compressed much further than that.
    For one, there're some common strings which appear frequently enough, so plain LZ would compress it better than just 2-bit packed coding.
    But then, there're lots of other types of dependencies which don't normally appear in data compression - like palindromes,
    complementary strings, imprecise matches, and even some restrictions appearing from the molecular structure of DNA sequence.

    So I'd say its hard mainly because you have to know a lot about its properties to compress it well.
    Also some things like imprecise matches are just hard to do on their own.

  3. Thanks:

    omran farhat (4th September 2016)

  4. #3
    Join Date
    Dec 2011
    Cambridge, UK
    Thanked 186 Times in 127 Posts
    Several small things, but none of them insurmountable.

    Firstly, DNA consists (mostly) of just 4 symbols, meaning that bit-based encoding like Huffman is just never going to beat 2 bits per base at raw entropy encoding and in practice doesn't get that far due to often having an EOF symbol. Trivial workaround is to pack multiple bases per byte - eg triplets (permitting ACGTN). Try gzip on a dna sequence and you'll see it's really weak due to the small alphabet penalties of its huffman implementation.

    Secondly, if you want LZ style matching then DNA is comparable to a corrupted archive. You'll be able to deduplicate a lot, but there are various imperfect repeats too, sometimes not just substitutions but small insertions and deletions. While this could be coded as a whole series of separate exact repeats (eg lots of LZ matches), it's inefficient as the distances are highly correlated (so maybe you need to use a more complex LZ with recent distance codes). Alternatively you can do a full smith-waterman style dynamic programming matrix to identify the repeats and the optimal edits from one to the other, permitting a richer LZ scheme.

    Then just to make life awkward, matches can occur on either strand (so AT<>TA and CG<>GC plus reverse order).

    Imagine trying to describe the duplications in this sequence:

    Each dot on that graph is a short match between substrings of a pair of DNA sequences, but it could equally be between a DNA sequence and itself (making it perfectly symmetrical along the leading diagonal). The dots combine to make lines where the matches are long, either matching the leading diagonal for forward strand match or at right angles for a reverse strand match. The small repeat many times over gives rise to an approximate grid of short matches, but they're not quite all identical.

    Frankly though it's probably not worth doing anything clever. Basic LZ (albeit both strands) coupled with 2 bit encoding and an exception matrix for the non ACGT symbols would get you 99% of the way there I think.

    PS. Using dot plots on other data can be a nice way to visualise the repeat structure. Even just turning data into DNA via 2-bit encoding and using dotter (or similar) can shine a light on the internal data structure.

  5. Thanks:

    omran farhat (5th September 2016)

  6. #4
    Join Date
    Jun 2015
    Thanked 347 Times in 219 Posts
    Quote Originally Posted by omran farhat View Post
    why the compression of DNA sequence is a difficult for compression algorithms?
    Because of mutations, length optimization by evolution, and sex (and other exchange of dna) the dna has increased entropy, and is difficult to compress.

    Some 3–10th order context modeling seems to help a tiny bit. Non-scientific personal theory: I believe a longer context model works better than shorter because the protein coming from the synthesis would have a self-collision or collision with the machinery building proteins with some of the sequences, and could not be completed or would be much slower to be completed. Such protein synthesis would seem harmful for most life, and thus is less common.

  7. #5
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Melbourne, Florida, USA
    Thanked 798 Times in 489 Posts
    The best you can compress any string is to find the shortest program that outputs it. The program that wrote the human genome is called evolution. It is not a complex algorithm, but it required 10^48 DNA base copy operations on 10^38 bits of memory, consuming 280 terawatts of power for 3.5 billion years.

    I hope that including the human genome as 30% of the 10 GB benchmark will spur research in genomics in the same way that the large text benchmark spurs research in AI.

  8. Thanks:

    omran farhat (26th October 2016)

  9. #6
    Member CompressMaster's Avatar
    Join Date
    Jun 2018
    Lovinobana, Slovakia
    Thanked 17 Times in 17 Posts
    btw, lossy DNA compression are called "mutations" such as cystic fibrosis, epidermolysis bullosa (extreme skin fragility), progeria (extremely fast aging), cancer and so on. The only hope is gene editing tool CRISPR/Cas9. For example, one patient with EB was fully cured:

    Recessive dystrophic epidermolysis bullosa, an aggressive subtype of epidermolysis bullosa, is a rare disease resulting in severe skin fragility, characterized by the continuous formation of erosions and blisters on the skin and internal mucous membranes, as well as fibrosis and diverse complications such as pseudo-syndactyly (fusion of the fingers) and an elevated risk of developing metastatic squamous cell carcinoma. Dealing with this disease represents a challenge for health professionals and a great effort on the part of patients and their families.

    This disease, of a genetic origin, is caused by mutations in the COL7A1 gene, which codifies for type VII collagen (C7), a protein essential for dermo-epidermal adhesion.

    Gene editing with CRISPR/Cas9

    The authors of this study have applied the gene-editing tool CRISPR/Cas9, which in this case is employed to safely and accurately eliminate from the stem cells of the skin of patients the exon 80 of the COL7A1 gene, which contains the pathogenic mutation. This leads to the production, from the edited cells, of a functional C7 variant.
    Click image for larger version. 

Name:	b4757bc22acf7b987f030ed891e97a74.gif 
Views:	93 
Size:	6.5 KB 
ID:	7817
    cystic fibrosis
    Last edited by CompressMaster; 27th July 2020 at 00:24.
    Please hit the "THANKS" button under my post if its useful for you.

Similar Threads

  1. DNA Corpus
    By Kennon Conrad in forum Download Area
    Replies: 20
    Last Post: 10th April 2019, 09:13
  2. DNA storage
    By Shelwien in forum The Off-Topic Lounge
    Replies: 3
    Last Post: 25th January 2013, 23:42
  3. Finding most frequency sequences in data?
    By RichSelian in forum Data Compression
    Replies: 5
    Last Post: 21st September 2012, 04:29
  4. Another? DNA contest
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 8th February 2012, 17:17

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts