
19th April 2010, 15:52
#1
Programmer
Compression via substring enumeration
Hi!
During a break i found some interesting compression method published within the DCC2010:
http://w3.ift.ulaval.ca/~dadub100/files/DCC10.pdf
It seems to be capable of using prefix and suffix correlations by processing the data nonsequentially via the enumeration of all bitwise substrings. Using some logical relations and enumerating the strings starting from an empty string e, successively appending bits to the left AND right it obtains quite some compression.


19th April 2010, 21:01
#2
Member
To bad they did not complete there example.
I think they missed the boat
Take their string as an example D = 010001 do standard BWT
assume no two rows the same since D in paper is defined
such that D of 010101 or 001001 and etc and not valid
000101
001010
010001
010100
100010
101000
you can transmit the info in first column by 4,2 number of zeroes and ones
then start next column look at rows that start with 0 only tranmit 2 number of zero bounded by 1 to 3
actually to match them the string D in paper not made of repeated cycles using this info its the bound is 1 to 3
if the value 0 and 4 are not possible in full wrapped BWT and D not made of repeated strings.
but all that really means in fully wrapped BWT which few use is that no two rows are the same.
that forces the rest of column no more numbers needed
start next column look at rows starting from top defined by 2 characters in order 00 01 10 no 11 in this case
you transmit nothing since 0 bounded by 1 to 1 note that forces a 1 in next row postion
note two rows of 000 not posible since lenght 6 and that would mean 6 zeros
001 and 001 not posible since that would imply D 001001 meaning D made of a repeated string
you transmit nothing for nex two rows of 01 since the first 2 columns tell you 11 does not occur
so 0 follows each 01.
going to rows 10 there is only a 1 and 0 left but its in sorted order so nothing transmitted.
going to rows definded by 3 character 000 there is 1 row that is 001 so tansmit nothing its a forced zero
since it can't be 0 a since 11 not allowed and they would have filled the last two postions
looking at row 001 again transmit nothing since 11 not allowed.
there are two rows 010 and a row 100 and 101 but only one 1 left in column
1011 not allowed 2 ones 1001 not alowed since row would be 100100 violates D being a mulitiple
so the only one left goes to 0101 so rest forst to zero and since sorted 0100 is first whole
column is defined.
since the first 4 coulmns now defined nothing is need for rest since its defined by previous
0001 0 since a 1 would make 11 not allowed
0010 1 since a 0 would force 001001 not allowed
0100 0 since a 1 would force 010010 reapeat of 010
0101 0 since only 2 ones in row
1000 1 since zero would force 100001 a 11 in wrap around
1010 0 since only two ones in row.
likewise for last column the two ones per row does it.
last column forced since fist 5 charcters unique each row in fact if futher coumuns nothing is tranmited its done
except for transmiiting the index since this a a plain BWT you would reduce this to
4,2,2 bounded 1 to 3
I gues if using pure BWT I would transmit
4,2,2(1 of 3 automatic),2(0 to 6 automatic)
I think this is simalar to the bottom line but they mever finished there example. wht didn't they complets it?
Again I don't trust papers that don't have source code Who knows what they really got.


20th April 2010, 03:20
#3
Administrator
1. The title is misleading  imho enumeration methods work by enumerating
the objects and encoding the index of specific object, not by counting the
objects and encoding the statistics :)
2. "Given a particular L, the description consists in indicating the number of
occurrences of each possible Lbit substring of D."
Sounds like a common static CM idea... there're Bulat's and Shkarin's programs
that do that (also my ctx :).
The bitwise approach _is_ new though, but that's only because nobody
considers it normally  for same reasons as with bitwise BWT :)
3. Their result for book1 is ~218138, "PPM" is ~230631 and "BWT" is ~239279
(approximated from bpc)
4. Still, thanks for posting the link, its interesting :)


20th April 2010, 23:26
#4
Expert
Very interesting. Thanks for posting.
To summarize: build a suffix tree like with BWT, but instead of transmitting the BWT, transmit the tree itself. The tree is bitwise, not bytewise. The transmission of node counts uses constraints from suffixes already sent to narrow the possibilities.
This opens many possibilities besides the method described in the paper. The tree need not be bitwise. There are many other possibilities for modeling. Node counts can be predicted using escape modeling, adaptive context mixing, SSE, etc.
It seems the approach has the same limitation as BWT as being poorly suited for nonstationary sources (tar files, sorted lists of words) and for binary data with sparse contexts such as images or x86). A bitwise tree would also use a lot of memory. Losing byte boundaries can't help with modeling either.
The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.
Too bad there is no software released. I suppose they have some work to do to make it competitive. While I was looking for it I came across another paper by the same authors where they use the redundancy for coding matches in LZ77 to transmit extra data for free using a method they call "bit recycling". http://w3.ift.ulaval.ca/~dadub100/files/ISITA06.pdf
I suppose ROLZ is another way to achieve this.


21st April 2010, 04:07
#5
Expert
> The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.
On second thought that wouldn't work. 00000 has a prime length and it repeats. Also adding an opposite bit to 00100100 would make it repeat too.


21st April 2010, 04:16
#6
Administrator

Posting Permissions
 You may not post new threads
 You may not post replies
 You may not post attachments
 You may not edit your posts

Forum Rules