Results 1 to 11 of 11

Thread: Compression via substring enumeration

  1. #1
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Compression via substring enumeration

    Hi!

    During a break i found some interesting compression method published within the DCC2010:

    http://w3.ift.ulaval.ca/~dadub100/files/DCC10.pdf

    It seems to be capable of using prefix and suffix correlations by processing the data non-sequentially via the enumeration of all bitwise substrings. Using some logical relations and enumerating the strings starting from an empty string e, successively appending bits to the left AND right it obtains quite some compression.

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    To bad they did not complete there example.
    I think they missed the boat
    Take their string as an example D = 010001 do standard BWT
    assume no two rows the same since D in paper is defined
    such that D of 010101 or 001001 and etc and not valid
    000101
    001010
    010001
    010100
    100010
    101000
    you can transmit the info in first column by 4,2 number of zeroes and ones
    then start next column look at rows that start with 0 only tranmit 2 number of zero bounded by 1 to 3
    actually to match them the string D in paper not made of repeated cycles using this info its the bound is 1 to 3
    if the value 0 and 4 are not possible in full wrapped BWT and D not made of repeated strings.
    but all that really means in fully wrapped BWT which few use is that no two rows are the same.
    that forces the rest of column no more numbers needed

    start next column look at rows starting from top defined by 2 characters in order 00 01 10 no 11 in this case
    you transmit nothing since 0 bounded by 1 to 1 note that forces a 1 in next row postion
    note two rows of 000 not posible since lenght 6 and that would mean 6 zeros
    001 and 001 not posible since that would imply D 001001 meaning D made of a repeated string
    you transmit nothing for nex two rows of 01 since the first 2 columns tell you 11 does not occur
    so 0 follows each 01.
    going to rows 10 there is only a 1 and 0 left but its in sorted order so nothing transmitted.

    going to rows definded by 3 character 000 there is 1 row that is 001 so tansmit nothing its a forced zero
    since it can't be 0 a since 11 not allowed and they would have filled the last two postions
    looking at row 001 again transmit nothing since 11 not allowed.
    there are two rows 010 and a row 100 and 101 but only one 1 left in column
    1011 not allowed 2 ones 1001 not alowed since row would be 100100 violates D being a mulitiple
    so the only one left goes to 0101 so rest forst to zero and since sorted 0100 is first whole
    column is defined.

    since the first 4 coulmns now defined nothing is need for rest since its defined by previous
    0001 0 since a 1 would make 11 not allowed
    0010 1 since a 0 would force 001001 not allowed
    0100 0 since a 1 would force 010010 reapeat of 010
    0101 0 since only 2 ones in row
    1000 1 since zero would force 100001 a 11 in wrap around
    1010 0 since only two ones in row.
    likewise for last column the two ones per row does it.


    last column forced since fist 5 charcters unique each row in fact if futher coumuns nothing is tranmited its done

    except for transmiiting the index since this a a plain BWT you would reduce this to
    4,2,2 bounded 1 to 3
    I gues if using pure BWT I would transmit
    4,2,2(1 of 3 automatic),2(0 to 6 automatic)
    I think this is simalar to the bottom line but they mever finished there example. wht didn't they complets it?
    Again I don't trust papers that don't have source code Who knows what they really got.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    1. The title is misleading - imho enumeration methods work by enumerating
    the objects and encoding the index of specific object, not by counting the
    objects and encoding the statistics :)

    2. "Given a particular L, the description consists in indicating the number of
    occurrences of each possible L-bit substring of D."

    Sounds like a common static CM idea... there're Bulat's and Shkarin's programs
    that do that (also my ctx :).

    The bitwise approach _is_ new though, but that's only because nobody
    considers it normally - for same reasons as with bitwise BWT :)

    3. Their result for book1 is ~218138, "PPM" is ~230631 and "BWT" is ~239279
    (approximated from bpc)

    4. Still, thanks for posting the link, its interesting :)

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Very interesting. Thanks for posting.

    To summarize: build a suffix tree like with BWT, but instead of transmitting the BWT, transmit the tree itself. The tree is bitwise, not bytewise. The transmission of node counts uses constraints from suffixes already sent to narrow the possibilities.

    This opens many possibilities besides the method described in the paper. The tree need not be bitwise. There are many other possibilities for modeling. Node counts can be predicted using escape modeling, adaptive context mixing, SSE, etc.

    It seems the approach has the same limitation as BWT as being poorly suited for nonstationary sources (tar files, sorted lists of words) and for binary data with sparse contexts such as images or x86). A bitwise tree would also use a lot of memory. Losing byte boundaries can't help with modeling either.

    The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.

    Too bad there is no software released. I suppose they have some work to do to make it competitive. While I was looking for it I came across another paper by the same authors where they use the redundancy for coding matches in LZ77 to transmit extra data for free using a method they call "bit recycling". http://w3.ift.ulaval.ca/~dadub100/files/ISITA06.pdf

    I suppose ROLZ is another way to achieve this.

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    > The algorithm has to check that the data does not repeat. A simple way to do that would be to use a block size that is a prime number, or to append one extra bit that is different from the first bit.

    On second thought that wouldn't work. 00000 has a prime length and it repeats. Also adding an opposite bit to 00100100 would make it repeat too.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts

  7. #7
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I can't imagine this on non-bitwise tree. I think the |alphabet|=2 is used in a manner like the Intrerval in arithmetic coding to recognise (L+1) bit enumerations.

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Shelwien View Post
    I guess this builds a suffix tree and transmits it to the decoder, then compresses with a static model. It is hard to tell with no documentation and very few comments (in Russian). Anyway neither .exe works in Vista:
    Code:
    C:\tmp>CM-nt-watcom.exe
    The instruction at 0x0006fe84 referenced memory at 0x0081cab0.
    The memory could not be written.
    
    C:\tmp>CM-dos4g.exe
    Stub exec failed:
    dos4gw.exe
    No such file or directory
    So I tried compiling in g++. I had to remove flushall() from main(). Dunno if I broke something. I told it to use 12 MB memory for the tree, a block size of 4 MB, and order 3. If you don't give it enough memory for the tree then it fails.

    Code:
    C:\tmp>cm -m12000000 -t4000000 -o3 a x calgary.tar
    
    Order 3, count 10, encode on
    Parse  Text: 3152896 chars, 169701 contexts, 341759 nodes, 11586780 memory
    Cutoff Tree: 11333 contexts, 43253 nodes, 1182384 memory
    Encode Tree: 1(0) 1(1); 0i+256c=256; 81560 bytes
    Encode Text: 3441071 codes, 77735 empty codes, 315975 null hints;
                    288175 escapes, 255682 first escapes, 1053908 bytes
    
    C:\tmp>cm x x x1
    
    Order 3, count 10, encode on
    Assertion failed: size<=maxsize, file cm.cpp, line 2147
    
    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    So anyway I tried some other variations, and gave up.

  9. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    cm -m12000000 -t4000000 -o3 x x x1

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    i don't know what's the suffix tree. CM works as follows:
    1. builds exhasutive PPM tree of fixed order (-o param)
    2. removes from tree all elements with counts less than cutoff value (-c param), adding their counts to parent elements
    3. encodes tree, using itself
    4. encodes text using the tree

    exhaustivness of tree built at the first stage is reason why it needs so much memory

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    OK thanks. I got it to work on calgary.tar. I tried enwik8 and it crashed during decompression. (This program has stopped working...)

    Code:
    C:\tmp>cm -m1000000000 -t100000000 -o4 a x \res\enwik8
    
    Order 4, count 10, encode on
    Parse  Text: 100000000 chars, 1471707 contexts, 3529664 nodes, 111801048 memory
    Cutoff Tree: 170764 contexts, 657523 nodes, 17931852 memory
    Encode Tree: 0(0) 0(1); 0i+205c=205; 1232941 bytes
    Encode Text: 103465485 codes, 3590136 empty codes, 2852286 null hints;
                    3465485 escapes, 3054678 first escapes, 27492041 bytes
    
    C:\tmp>cm -m1000000000 -t100000000 -o4 x x enwik8
    
    Order 4, count 10, encode on
    Decode Tree
    Decode Text:    4%
    If it's possible and you can tell me the best options, I will add to LTCB.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •