Results 1 to 4 of 4

Thread: BWT with compressed input data

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts

    BWT with compressed input data

    http://ctxmodel.net/files/rcbwt_v0.rar
    Contains two proof-of-concept implementations:
    - "not working" uses a static o0 rangecoder for storing input data
    in memory and compares encoded substrings without decoding.
    It actually produces something very similar to BWT on output, but
    regrettably the rangecoder doesn't _always_ preserve the lexicographic
    order of strings, thus "not working".
    - "working" uses a static o0 bit-aligned rangecoder, with ~485k
    compressed size for book1.

    http://ctxmodel.net/files/rcbwt_v1.rar
    A new complete rewrite, with an order1 optimal ordered bitcode
    (381.5k on book1) for input data storage.
    Really produces a reversible BWT on output... for small enough files.
    Also chunked processing is implemented, so memory usage except for
    compressed input data is static.

    Code:
    Input file size            = 768771 (book1)
    Memory usage:
     - o1 static bitcode model = 34473984
     - clustering statistics   = 4194304
     - BWT index table         = 65536
     - source data buffer      = 381510
     TOTAL                     = 39115334
    
    Input file size            = 1000000 (enwik6)
    Memory usage:
     - o1 static bitcode model = 34473984
     - clustering statistics   = 4194304
     - BWT index table         = 65536
     - source data buffer      = 529780
     TOTAL                     = 39263604
    Unfortunately some "features" still have to be fixed:
    - "static bitcode model" is too large (mainly due to decoding lookup table)
    - Longer files (eg. enwik8) cause a 16bit code overflow
    - The case with too many strings in a chunk is not handled yet (infinite loop)

    Still, I think its interesting enough as it is.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Well, I couldn't stop ;)

    http://ctxmodel.net/files/rcbwt_v2.rar
    - Added frequency rounding to 14 bits (to avoid bitcode overflows)
    - Chunk size is now dynamic

    Code:
    Input file size            = 100000000 (enwik8)
    Memory usage:
     - o1 static bitcode model = 34736128
     - clustering statistics   = 4194304
     - BWT index table         = 5652984
     - source data buffer      = 53711205
     TOTAL                     = 98294621
    
    Input file size            = 1000000000 (enwik9)
    Memory usage:
     - o1 static bitcode model = 34736128
     - clustering statistics   = 4194304
     - BWT index table         = 68491184
     - source data buffer      = 538268254
     TOTAL                     = 645689870
    And here, with enwik9, we finally have an example of BWT with 0.646N memory usage ;)
    It still works though, and I don't think I have anything to verify that its correct.
    Would later try to compress the output with o2rc and see what'd happen.

  3. #3
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    That is pretty impressive - Very good work!

    What about transform speeds compared to non-precompressed input?

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Well, I just tested on enwik8 and my dummy 5N BWT
    (from http://ctxmodel.net/files/BWT_reorder_v2.rar)
    took 89.766s, while rcBWT_v2 took 269.75s.
    But its unoptimized and even does some unnecessary stuff
    (like saving bitcoded and decoded data) and I'm going to
    add multithreading too, so it might even become "faster"
    than 5N one with 4 cores.

Similar Threads

  1. Matlab Contest: Compressed Sensing
    By russelms in forum Data Compression
    Replies: 1
    Last Post: 28th April 2010, 22:48
  2. BWT - how to use?
    By m^2 in forum Data Compression
    Replies: 29
    Last Post: 6th November 2008, 02:01
  3. BWT/ST compression
    By encode in forum Data Compression
    Replies: 60
    Last Post: 25th June 2008, 07:21
  4. Replies: 3
    Last Post: 10th November 2007, 21:32
  5. Replies: 12
    Last Post: 30th June 2007, 16:49

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •