# Thread: Compression via substring enumeration

1. ## Compression via substring enumeration

Hi!

During a break i found some interesting compression method published within the DCC2010:

It seems to be capable of using prefix and suffix correlations by processing the data non-sequentially via the enumeration of all bitwise substrings. Using some logical relations and enumerating the strings starting from an empty string e, successively appending bits to the left AND right it obtains quite some compression.

2. To bad they did not complete there example.
I think they missed the boat
Take their string as an example D = 010001 do standard BWT
assume no two rows the same since D in paper is defined
such that D of 010101 or 001001 and etc and not valid
000101
001010
010001
010100
100010
101000
you can transmit the info in first column by 4,2 number of zeroes and ones
then start next column look at rows that start with 0 only tranmit 2 number of zero bounded by 1 to 3
actually to match them the string D in paper not made of repeated cycles using this info its the bound is 1 to 3
if the value 0 and 4 are not possible in full wrapped BWT and D not made of repeated strings.
but all that really means in fully wrapped BWT which few use is that no two rows are the same.
that forces the rest of column no more numbers needed

start next column look at rows starting from top defined by 2 characters in order 00 01 10 no 11 in this case
you transmit nothing since 0 bounded by 1 to 1 note that forces a 1 in next row postion
note two rows of 000 not posible since lenght 6 and that would mean 6 zeros
001 and 001 not posible since that would imply D 001001 meaning D made of a repeated string
you transmit nothing for nex two rows of 01 since the first 2 columns tell you 11 does not occur
so 0 follows each 01.
going to rows 10 there is only a 1 and 0 left but its in sorted order so nothing transmitted.

going to rows definded by 3 character 000 there is 1 row that is 001 so tansmit nothing its a forced zero
since it can't be 0 a since 11 not allowed and they would have filled the last two postions
looking at row 001 again transmit nothing since 11 not allowed.
there are two rows 010 and a row 100 and 101 but only one 1 left in column
1011 not allowed 2 ones 1001 not alowed since row would be 100100 violates D being a mulitiple
so the only one left goes to 0101 so rest forst to zero and since sorted 0100 is first whole
column is defined.

since the first 4 coulmns now defined nothing is need for rest since its defined by previous
0001 0 since a 1 would make 11 not allowed
0010 1 since a 0 would force 001001 not allowed
0100 0 since a 1 would force 010010 reapeat of 010
0101 0 since only 2 ones in row
1000 1 since zero would force 100001 a 11 in wrap around
1010 0 since only two ones in row.
likewise for last column the two ones per row does it.

last column forced since fist 5 charcters unique each row in fact if futher coumuns nothing is tranmited its done

except for transmiiting the index since this a a plain BWT you would reduce this to
4,2,2 bounded 1 to 3
I gues if using pure BWT I would transmit
4,2,2(1 of 3 automatic),2(0 to 6 automatic)
I think this is simalar to the bottom line but they mever finished there example. wht didn't they complets it?
Again I don't trust papers that don't have source code Who knows what they really got.

3. 1. The title is misleading - imho enumeration methods work by enumerating
the objects and encoding the index of specific object, not by counting the
objects and encoding the statistics :)

2. "Given a particular L, the description consists in indicating the number of
occurrences of each possible L-bit substring of D."

Sounds like a common static CM idea... there're Bulat's and Shkarin's programs
that do that (also my ctx :).

The bitwise approach _is_ new though, but that's only because nobody
considers it normally - for same reasons as with bitwise BWT :)

3. Their result for book1 is ~218138, "PPM" is ~230631 and "BWT" is ~239279
(approximated from bpc)

4. Still, thanks for posting the link, its interesting :)

4. Very interesting. Thanks for posting.

To summarize: build a suffix tree like with BWT, but instead of transmitting the BWT, transmit the tree itself. The tree is bitwise, not bytewise. The transmission of node counts uses constraints from suffixes already sent to narrow the possibilities.

This opens many possibilities besides the method described in the paper. The tree need not be bitwise. There are many other possibilities for modeling. Node counts can be predicted using escape modeling, adaptive context mixing, SSE, etc.

It seems the approach has the same limitation as BWT as being poorly suited for nonstationary sources (tar files, sorted lists of words) and for binary data with sparse contexts such as images or x86). A bitwise tree would also use a lot of memory. Losing byte boundaries can't help with modeling either.

The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.

Too bad there is no software released. I suppose they have some work to do to make it competitive. While I was looking for it I came across another paper by the same authors where they use the redundancy for coding matches in LZ77 to transmit extra data for free using a method they call "bit recycling". http://w3.ift.ulaval.ca/~dadub100/files/ISITA06.pdf

I suppose ROLZ is another way to achieve this.

5. > The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.

On second thought that wouldn't work. 00000 has a prime length and it repeats. Also adding an opposite bit to 00100100 would make it repeat too.

6. I can't imagine this on non-bitwise tree. I think the |alphabet|=2 is used in a manner like the Intrerval in arithmetic coding to recognise (L+1) bit enumerations.

7. Originally Posted by Shelwien
I guess this builds a suffix tree and transmits it to the decoder, then compresses with a static model. It is hard to tell with no documentation and very few comments (in Russian). Anyway neither .exe works in Vista:
Code:
```C:\tmp>CM-nt-watcom.exe
The instruction at 0x0006fe84 referenced memory at 0x0081cab0.
The memory could not be written.

C:\tmp>CM-dos4g.exe
Stub exec failed:
dos4gw.exe
No such file or directory```
So I tried compiling in g++. I had to remove flushall() from main(). Dunno if I broke something. I told it to use 12 MB memory for the tree, a block size of 4 MB, and order 3. If you don't give it enough memory for the tree then it fails.

Code:
```C:\tmp>cm -m12000000 -t4000000 -o3 a x calgary.tar

Order 3, count 10, encode on
Parse  Text: 3152896 chars, 169701 contexts, 341759 nodes, 11586780 memory
Cutoff Tree: 11333 contexts, 43253 nodes, 1182384 memory
Encode Tree: 1(0) 1(1); 0i+256c=256; 81560 bytes
Encode Text: 3441071 codes, 77735 empty codes, 315975 null hints;
288175 escapes, 255682 first escapes, 1053908 bytes

C:\tmp>cm x x x1

Order 3, count 10, encode on
Assertion failed: size<=maxsize, file cm.cpp, line 2147

This application has requested the Runtime to terminate it in an unusual way.
So anyway I tried some other variations, and gave up.

8. cm -m12000000 -t4000000 -o3 x x x1

9. i don't know what's the suffix tree. CM works as follows:
1. builds exhasutive PPM tree of fixed order (-o param)
2. removes from tree all elements with counts less than cutoff value (-c param), adding their counts to parent elements
3. encodes tree, using itself
4. encodes text using the tree

exhaustivness of tree built at the first stage is reason why it needs so much memory

10. OK thanks. I got it to work on calgary.tar. I tried enwik8 and it crashed during decompression. (This program has stopped working...)

Code:
```C:\tmp>cm -m1000000000 -t100000000 -o4 a x \res\enwik8

Order 4, count 10, encode on
Parse  Text: 100000000 chars, 1471707 contexts, 3529664 nodes, 111801048 memory
Cutoff Tree: 170764 contexts, 657523 nodes, 17931852 memory
Encode Tree: 0(0) 0(1); 0i+205c=205; 1232941 bytes
Encode Text: 103465485 codes, 3590136 empty codes, 2852286 null hints;
3465485 escapes, 3054678 first escapes, 27492041 bytes

C:\tmp>cm -m1000000000 -t100000000 -o4 x x enwik8

Order 4, count 10, encode on
Decode Tree
Decode Text:    4%```
If it's possible and you can tell me the best options, I will add to LTCB.

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•