Hello Experts
My fellow researchers and I have developed a few compression algorithms (aiming for good speed and the highest possible compression ratio), most of them specifically for raw DNA sequences, and they achieved very good speed and a good compression ratio. One of them is AMC (Matrix Compression); we started modifying it and did some paper work before translating it into C/C++ code. We modified it to work with arbitrary binary data (not only raw DNA sequence files). With DNA data we had many options, because the alphabet is limited to the characters 'a', 'c', 'g', and 't' (or their capitalized versions).
How does AMC 3.0 work?
AMC has two filtering schemes (Scheme H and Scheme V: Horizontal [ROWs] and Vertical [COLUMNs]). It filters a matrix and picks the scheme that can eliminate more places in the matrix, then keeps track of the places to be filled later (when decompression takes place) in an unsigned __int64, where each bit indicates a row or column: when set to 1, the position is to be filled with the popular character we removed; when 0, it is filled with the remaining uneliminatable bytes (we call this JUNK). We repeat this process until the eliminated popular character plus the offsets (sizeof unsigned __int64) is no longer less than 16. The input is a 4096-byte block (treated as a 64x64 unsigned __int8 matrix).
(The Popular Character (PC) is the character that occurs most frequently in the matrix.)
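To make the AMC idea concrete, here is a minimal sketch in C (our reading of the scheme, not the actual AMC code): find the popular character in a matrix, then mark the rows made entirely of PC as eliminatable. The 4x4 size and the function names are illustrative; the real block is 64x64 and considers columns as well as rows.

```c
#define N 4  /* the real AMC block is 64x64; 4x4 keeps the demo short */

/* Return the most frequent byte value (the Popular Character). */
unsigned char popular_char(unsigned char m[N][N]) {
    int count[256] = {0};
    int best = 0;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            count[m[r][c]]++;
    for (int v = 1; v < 256; v++)
        if (count[v] > count[best]) best = v;
    return (unsigned char)best;
}

/* Each set bit in the returned mask marks a row made entirely of pc;
   such rows can be dropped and refilled with pc at decompression time. */
unsigned row_mask(unsigned char m[N][N], unsigned char pc) {
    unsigned mask = 0;
    for (int r = 0; r < N; r++) {
        int all_pc = 1;
        for (int c = 0; c < N; c++)
            if (m[r][c] != pc) { all_pc = 0; break; }
        if (all_pc) mask |= 1u << r;
    }
    return mask;
}
```

The same mask-building loop, run over columns instead of rows, would give the Scheme V side of the comparison.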
Now we have stepped back and realized that eliminating similar bits across N bytes (comparing each byte vertically with the previous one) would be a better idea. BUT the question was that the input might be highly compressed data that does not provide enough matches, so we decided to transform the bits in 6 different ways, and we use 3 bits in the TFS block to store the type of filter used.
Code:
bits Filter
---------------------------
000 No Filter
001 Toggle
010 Flip (Flip all 8 bits)
100 Swap (Actually swap each 2 bits in a byte)
101 Toggle + Flip
110 Toggle + Swap
011 Flip + Swap
111 Unused
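As a concrete reference, the three base filters could look like this in C. Toggle and Flip follow the worked examples later in this post (Toggle inverts all bits, Flip reverses the bit order); Swap is our reading of "swap each 2 bits in a byte" (exchange adjacent bit pairs), since the post shows no Swap example.

```c
#include <stdint.h>

/* Toggle (code 001): invert every bit; Toggle(00000011) = 11111100,
   matching the sample data below. */
uint8_t filter_toggle(uint8_t b) { return (uint8_t)~b; }

/* Flip (code 010): reverse the bit order; Flip(00101111) = 11110100,
   matching the sample data below. */
uint8_t filter_flip(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (uint8_t)((r << 1) | ((b >> i) & 1));
    return r;
}

/* Swap (code 100): exchange each adjacent pair of bits -- our reading
   of "swap each 2 bits in a byte"; the post shows no Swap example. */
uint8_t filter_swap(uint8_t b) {
    return (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
}
```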
The very first thing we do is check the number of set bits in the first byte; if fewer than 4 bits are set, we Toggle the first byte and add 001 to the TFS block.
We call these filters, and each filter occupies 3 bits in a separate block called the TFS block. We apply each filter one by one to the current byte and compare the result with the FIRST byte, which normalizes each byte to match the previous byte in at least 4 bit positions. Once the (VERTICAL) matches drop below 4, we BREAK, store the data (NOT the TFS block), and treat the next incoming byte as a new first byte. A sample of the data:
Code:
1. 1 1 1 1 1 1 1 1
2. 1 0 1 1 1 1 0 0
3. 0 0 1 0 1 1 1 1
Flip 11110100 (4 matches with previous)
4. 0 0 0 0 0 0 1 1
Toggle - 11111100 (5 matches)
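The per-byte filter search in this example could be sketched in C as follows. Matches are counted as the popcount of the XNOR of two bytes, and every usable TFS code from the table is tried against a reference byte; the helper names and the exact selection rule (best match, ties kept at the lower code) are our assumptions, not the authors' code.

```c
#include <stdint.h>

static uint8_t toggle(uint8_t b) { return (uint8_t)~b; }

static uint8_t flip(uint8_t b) {            /* reverse bit order */
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (uint8_t)((r << 1) | ((b >> i) & 1));
    return r;
}

static uint8_t swap2(uint8_t b) {           /* swap adjacent bit pairs */
    return (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
}

/* Matching bit positions between two bytes = popcount of the XNOR. */
int matches(uint8_t a, uint8_t b) {
    uint8_t same = (uint8_t)~(a ^ b);
    int n = 0;
    for (; same; same >>= 1)
        n += same & 1;
    return n;
}

/* Apply one of the 7 usable TFS codes from the table (111 is unused). */
static uint8_t apply_code(uint8_t b, int code) {
    switch (code) {
    case 0: return b;                   /* 000 No Filter     */
    case 1: return toggle(b);           /* 001 Toggle        */
    case 2: return flip(b);             /* 010 Flip          */
    case 4: return swap2(b);            /* 100 Swap          */
    case 5: return flip(toggle(b));     /* 101 Toggle + Flip */
    case 6: return swap2(toggle(b));    /* 110 Toggle + Swap */
    case 3: return swap2(flip(b));      /* 011 Flip + Swap   */
    default: return b;
    }
}

/* Try every usable code and return the one whose transform of `cur`
   matches `ref` on the most bit positions; the transformed byte goes
   to *out. Ties keep the lowest code, so No Filter wins a tie. */
int best_filter(uint8_t ref, uint8_t cur, uint8_t *out) {
    int best_code = 0, best_n = -1;
    *out = cur;
    for (int code = 0; code < 7; code++) {
        uint8_t t = apply_code(cur, code);
        int n = matches(ref, t);
        if (n > best_n) { best_n = n; best_code = code; *out = t; }
    }
    return best_code;
}
```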
Finally this makes:
Code:
1 1 1 1 1 1 1 1
1 0 1 1 1 1 0 0
1 1 1 1 0 1 0 0
1 1 1 1 1 1 0 0
etc. If you view this as a matrix, there are some columns that can be eliminated, and their offsets can be preserved in just a single byte. We call it the OFFSETS block.
OFFSETS: 1 0 1 1 0 1 0 0 (where 1 means the column was chopped off, and 0 means the data is considered JUNK)
Code:
1 1 1 1 1 1 1 1
1 0 1 1 1 1 0 0
1 1 1 1 0 1 0 0
1 1 1 1 1 1 0 0
After eliminating the said columns, the remaining bits are:
Code:
- 1 - - 1 - 1 1
- 0 - - 1 - 0 0
- 1 - - 0 - 0 0
- 1 - - 1 - 0 0
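A hedged sketch, in C, of how the OFFSETS byte for rows like these could be computed (a column is chopped off when it holds the same bit in every row; the function name is illustrative):

```c
#include <stdint.h>

/* Build the OFFSETS byte for a group of n filtered rows: bit c is 1 when
   column c holds the same value in every row (so it can be chopped off),
   0 when the column must go to the JUNK block. Bit 7 = leftmost column. */
uint8_t make_offsets(const uint8_t *rows, int n) {
    uint8_t off = 0;
    for (int c = 7; c >= 0; c--) {
        int first = (rows[0] >> c) & 1;
        int uniform = 1;
        for (int r = 1; r < n; r++)
            if (((rows[r] >> c) & 1) != first) { uniform = 0; break; }
        if (uniform)
            off |= (uint8_t)(1u << c);
    }
    return off;
}
```

Running this on the four example rows (11111111, 10111100, 11110100, 11111100) yields 10110100, matching the OFFSETS byte above.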
Eventually the remains, which I name JUNK:
Code:
1111
0100
1000
1100 => 1111010010001100 == (2 bytes)
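Packing the JUNK bits of this example could be sketched in C like this (illustrative, not the authors' code):

```c
#include <stdint.h>

/* Pack the JUNK bits: for every row, take the columns whose OFFSETS bit
   is 0 (left to right) and append them MSB-first into `out`.
   Returns the number of junk bits written; `out` must be large enough. */
int pack_junk(const uint8_t *rows, int n, uint8_t offsets, uint8_t *out) {
    int bits = 0;
    for (int r = 0; r < n; r++) {
        for (int c = 7; c >= 0; c--) {
            if ((offsets >> c) & 1)
                continue;                   /* column was eliminated */
            if ((rows[r] >> c) & 1)
                out[bits / 8] |= (uint8_t)(0x80u >> (bits % 8));
            else
                out[bits / 8] &= (uint8_t)~(0x80u >> (bits % 8));
            bits++;
        }
    }
    return bits;
}
```

On the example rows with OFFSETS 10110100 this produces the 16 bits 1111010010001100, i.e. the two JUNK bytes above.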
So finally the structure of an encoded block is:
Code:
Block bits bytes Value Description
---------------------------------------------------------------------------------------
SIZE 8 1 uc 4 Total bytes encoded
OFFSETS 8 1 10110100 Offsets of eliminated
JUNK 16 2 1111010010001100 Uneliminatable columns' data
The TFS block is a global block and is written at the end of the file once encoding has completed.
In this 4-byte encoding case, attaching the TFS unfortunately adds 6 more bits (storable in 8) = 1 byte, so the final data after encoding 4 bytes is now 5 bytes. When such a situation is encountered, the 4-byte block is considered incompressible.
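For the worked example, serializing the block itself is straightforward; this C sketch (names illustrative) writes SIZE, OFFSETS, and JUNK, and shows that the block alone breaks even at 4 bytes out for 4 bytes in, before the TFS bits are charged:

```c
#include <stdint.h>

/* Serialize one encoded block: SIZE, OFFSETS, then the JUNK bytes.
   Returns the number of bytes written. The TFS bits live in a separate
   global block, so they are not charged here. */
int write_block(uint8_t *dst, uint8_t size, uint8_t offsets,
                const uint8_t *junk, int junk_len) {
    int n = 0;
    dst[n++] = size;     /* total input bytes encoded (4 in the example) */
    dst[n++] = offsets;  /* 10110100 in the example                      */
    for (int i = 0; i < junk_len; i++)
        dst[n++] = junk[i];
    return n;
}
```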
BUT at some positions, when compressing our test files of:
- raw DNA sequences
- Windows BMP files
- some already compressed JPEG files
- PAQ compressed files
we found there are many matches that can form a favorable sequence (after the filters).
What my fellow researchers and I are seeking are suggestions.
What we have already tried:
1. Keeping the TFS filter entries to 2 bits only; but then it accommodates only 4 types, of which the first must remain No-Filter, so in practice only 3 two-bit combinations remain (01, 10, and 11). With these we cannot broaden the filter set to include combinations of two filters, such as Toggle + Flip, Toggle + Swap, and Flip + Swap.
2. Keeping yet another superior block of bits that indicates whether a filter was applied (0 = NO, 1 = YES); with this method we could save a lot in the TFS block by eliminating the No-Filter entries.
3. We also tried TYPES OF BLOCKS, meaning we classify by the maximum number of filters (M) applied in a block of N bytes, where
if M is 0 then type is 00
if M <= 3 then type is 01
if M > 3 then type is 10
But we still cannot eliminate the No-Filter entries here. We have not tried procedures 2 and 3 together, but I thought we should consult the experts here, who can give us some great suggestions that will ultimately result in a new compressor with a great compression ratio.
We have named it FBEM; it will be released under the GNU GPL once finalized and tested.
Your reviews and suggestions are most welcome. Our joint effort can result in something really good and fast. Constructive criticism is especially welcome; it will make us work better.
regards
Ali...