Results 1 to 20 of 20

Thread: Ordered bitcodes experiments

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts

    Ordered bitcodes experiments

    I need a bitcode which would preserve the lexicographic
    order of encoded strings.
    Initially I intended to use a rangecoder instead, but it
    seems that it would be highly impractical due to precision loss
    issues.

    So here I tried a few ideas on how to produce such ordered bitcodes,
    but arithmetic code for book1 is around 435k, and static huffman
    is 438374 bytes, and the best result i've got for now is 464252,
    so any suggestions on how to get better results are welcome.

    http://ctxmodel.net/files/bitcode_v0.rar

    Code:
             SF              aligned ari      huffman
    00 - . - <0000>          <0000000>        <00000>
    0A - . - <00010>         <0000001>        <00001>
    1A - . - <00011>         <000001>         <0001>
    20 -   - <0010>          <0001>           <001>
    21 - ! - <00110000>      <0010>           <01000000>
    22 - " - <001100010>     <001100000>      <010000010>
    26 - & - <001100011>     <001100001>      <010000011>
    27 - ' - <00110010>      <00110001>       <01000010>
    28 - ( - <001100110>     <0011001100010>  <01000011000>
    29 - ) - <0011001110>    <00110011000110> <01000011001>
    2A - * - <0011001111>    <00110011000111> <0100001101>
    2B - + - <0011010>       <00110011001>    <010000111>
    2C - , - <0011011>       <0011010>        <010001>
    2D - - - <0011100>       <0011011>        <0100100>
    2E - . - <00111010>      <0011100>        <0100101>
    30 - 0 - <001110110>     <001110100>      <0100110000000>
    31 - 1 - <0011101110>    <001110101000>   <0100110000001>
    32 - 2 - <00111011110>   <001110101001>   <01001100000100>
    33 - 3 - <00111011111>   <0011101010100>  <01001100000101>
    34 - 4 - <00111100000>   <0011101010101>  <0100110000011>
    35 - 5 - <001111000010>  <0011101010110>  <01001100001000>
    36 - 6 - <001111000011>  <0011101010111>  <010011000010010>
    37 - 7 - <001111000100>  <00111010110000> <010011000010011>
    38 - 8 - <0011110001010> <00111010110001> <01001100001010>
    39 - 9 - <0011110001011> <0011101011001>  <01001100001011>
    3A - : - <00111100011>   <001110101101>   <010011000011>
    3B - ; - <0011110010>    <00111010111>    <01001100010>
    3C - < - <00111100110>   <0011101100>     <0100110001100>
    3D - = - <001111001110>  <001110110100>   <0100110001101>
    3E - > - <001111001111>  <001110110101>   <010011000111>
    3F - ? - <001111010>     <0011101110>     <01001100100>
    41 - A - <0011110110>    <0011101111>     <01001100101>
    42 - B - <0011110111>    <001111000>      <01001100110>
    43 - C - <0011111000>    <00111100100>    <010011001110>
    44 - D - <00111110010>   <00111100101>    <0100110011110>
    45 - E - <00111110011>   <00111100110>    <0100110011111>
    46 - F - <00111110100>   <00111100111>    <01001101000>
    47 - G - <00111110101>   <0011110100>     <01001101001>
    48 - H - <0011111011>    <0011110101>     <0100110101>
    49 - I - <001111110>     <001111011>      <010011011>
    4A - J - <0011111110>    <001111100>      <010011100000>
    4B - K - <0011111111>    <001111101000>   <010011100001>
    4C - L - <0100000>       <001111101001>   <01001110001>
    4D - M - <01000010>      <00111110110>    <01001110010>
    4E - N - <01000011>      <00111110111>    <01001110011>
    4F - O - <0100010>       <0011111100>     <01001110100>
    50 - P - <01000110>      <0011111101>     <010011101010>
    51 - Q - <010001110>     <001111111000>   <0100111010110>
    52 - R - <010001111>     <001111111001>   <0100111010111>
    53 - S - <010010>        <00111111101>    <0100111011>
    54 - T - <0100110>       <010000000>      <010011110>
    55 - U - <010011100>     <010000001000>   <010011111000>
    56 - V - <010011101>     <010000001001>   <010011111001>
    57 - W - <010011110>     <01000000101>    <01001111101>
    58 - X - <0100111110>    <01000000110>    <01001111110>
    59 - Y - <0100111111>    <01000000111>    <01001111111>
    61 - a - <01010>         <01001>          <01010>
    62 - b - <010110>        <010100>         <0101100>
    63 - c - <010111>        <010101>         <0101101>
    64 - d - <0110>          <01011>          <010111>
    65 - e - <01110>         <0110>           <0110>
    66 - f - <01111>         <011110>         <011100>
    67 - g - <1000>          <011111>         <011101>
    68 - h - <1001>          <1000>           <01111>
    69 - i - <10100>         <10010>          <1000>
    6A - j - <101010>        <100110100>      <1001000>
    6B - k - <101011>        <100110101>      <1001001>
    6C - l - <101100>        <100111>         <100101>
    6D - m - <101101>        <10100>          <10011>
    6E - n - <10111>         <10101>          <1010>
    6F - o - <1100>          <1011>           <1011>
    70 - p - <110100>        <1100010>        <110000>
    71 - q - <110101>        <1100011>        <110001>
    72 - r - <11011>         <11001>          <11001>
    73 - s - <1110>          <1101>           <1101>
    74 - t - <11110>         <1110>           <1110>
    75 - u - <1111100>       <111100>         <11110>
    76 - v - <1111101>       <111101>         <1111100>
    77 - w - <1111110>       <1111100>        <1111101>
    78 - x - <11111110>      <1111101>        <1111110>
    79 - y - <111111110>     <1111110>        <11111110>
    7A - z - <111111111>     <1111111>        <11111111>
             490730 bytes    485015 bytes     464252 bytes

  2. #2
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    555
    Thanks
    208
    Thanked 193 Times in 90 Posts
    Perhaps this helps in analyzing the codes - I changed the printf in DumpCode to this:

    Code:
    printf( "%06d - %06d - %02X - %c - <", freq0[i] * code[i].l, freq0[i], i, cmask(i) );
    This adds two columns - one is stating how much bits a symbol took in the result file, the other shows the frequency of that code. Then I used "sort /R log-huf > log-huf-sorted" to get an overview over the symbols wasting most space. Of course this doesn't help much as we can't directly change one code without changing several of the others, but now you can see where the "bad codes" are - for example w and y that occur quite often (over 10000 times), but have 7 and 8 bit codes. However, it seems this isn't avoidable as w and y are some of the last characters and will have to start with a run of 1's.

    Code:
    376653 - 125551 - 20 -   - <001>
    289724 - 072431 - 65 - e - <0110>
    239180 - 047836 - 61 - a - <01010>
    200108 - 050027 - 74 - t - <1110>
    187805 - 037561 - 68 - h - <01111>
    179180 - 044795 - 6F - o - <1011>
    164445 - 032889 - 72 - r - <11001>
    163676 - 040919 - 6E - n - <1010>
    159738 - 026623 - 64 - d - <010111>
    148028 - 037007 - 69 - i - <1000>
    147152 - 036788 - 73 - s - <1101>
    138468 - 023078 - 6C - l - <100101>
    098497 - 014071 - 77 - w - <1111101>
    095888 - 011986 - 79 - y - <11111110>
    088795 - 012685 - 63 - c - <0101101>
    083110 - 016622 - 0A - . - <00001>
    080155 - 016031 - 75 - u - <11110>
    073818 - 012303 - 67 - g - <011101>
    073422 - 012237 - 66 - f - <011100>
    070220 - 014044 - 6D - m - <10011>
    063924 - 009132 - 62 - b - <0101100>
    061776 - 010296 - 2C - , - <010001>
    055992 - 009332 - 70 - p - <110000>
    051760 - 006470 - 27 - ' - <01000010>
    050190 - 007170 - 2E - . - <0100101>
    037674 - 005382 - 76 - v - <1111100>
    034958 - 004994 - 6B - k - <1001001>
    027685 - 003955 - 2D - - - <0100100>
    026091 - 002899 - 49 - I - <010011011>
    022212 - 002468 - 22 - " - <010000010>
    017694 - 001966 - 54 - T - <010011110>
    016093 - 001463 - 42 - B - <01001100110>
    010637 - 000967 - 41 - A - <01001100101>
    009770 - 000977 - 48 - H - <0100110101>
    009416 - 000856 - 4F - O - <01001110100>
    008500 - 000850 - 53 - S - <0100111011>
    008382 - 000762 - 3B - ; - <01001100010>
    008349 - 000759 - 3F - ? - <01001100100>
    008316 - 000693 - 50 - P - <010011101010>
    008283 - 000753 - 57 - W - <01001111101>
    006960 - 000580 - 43 - C - <010011001110>
    006656 - 000832 - 21 - ! - <01000000>
    006474 - 000498 - 3C - < - <0100110001100>
    006325 - 000575 - 47 - G - <01001101001>
    006219 - 000691 - 2B - + - <010000111>
    006215 - 000565 - 4D - M - <01001110010>
    006027 - 000861 - 78 - x - <1111110>
    005976 - 000498 - 3E - > - <010011000111>
    005772 - 000444 - 45 - E - <0100110011111>
    005522 - 000502 - 4E - N - <01001110011>
    004576 - 000416 - 59 - Y - <01001111111>
    004543 - 000413 - 4C - L - <01001110001>
    004543 - 000413 - 46 - F - <01001101000>
    003497 - 000269 - 44 - D - <0100110011110>
    003276 - 000468 - 6A - j - <1001000>
    003185 - 000245 - 52 - R - <0100111010111>
    003120 - 000520 - 71 - q - <110001>
    003120 - 000240 - 31 - 1 - <0100110000001>
    003036 - 000253 - 4A - J - <010011100000>
    002640 - 000220 - 3A - : - <010011000011>
    002590 - 000185 - 32 - 2 - <01001100000100>
    002576 - 000184 - 33 - 3 - <01001100000101>
    002112 - 000264 - 7A - z - <11111111>
    001963 - 000151 - 34 - 4 - <0100110000011>
    001344 - 000096 - 35 - 5 - <01001100001000>
    001305 - 000087 - 36 - 6 - <010011000010010>
    001275 - 000085 - 37 - 7 - <010011000010011>
    001274 - 000098 - 30 - 0 - <0100110000000>
    001236 - 000103 - 55 - U - <010011111000>
    001190 - 000085 - 38 - 8 - <01001100001010>
    001148 - 000082 - 39 - 9 - <01001100001011>
    000768 - 000064 - 56 - V - <010011111001>
    000540 - 000045 - 4B - K - <010011100001>
    000473 - 000043 - 28 - ( - <01000011000>
    000440 - 000040 - 29 - ) - <01000011001>
    000182 - 000014 - 51 - Q - <0100111010110>
    000065 - 000005 - 3D - = - <0100110001101>
    000055 - 000005 - 58 - X - <01001111110>
    000010 - 000001 - 2A - * - <0100001101>
    000009 - 000001 - 26 - & - <010000011>
    000005 - 000001 - 00 - . - <00000>
    000004 - 000001 - 1A - . - <0001>
    Last edited by schnaader; 28th May 2009 at 12:51.
    http://schnaader.info
    Damn kids. They're all alike.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    For reference, here's a real huffman code for the same file (book1).
    The problem is that I want to preserve the original alphabet order
    of the symbols - its intended for computing a BWT using a compressed input.

    Code:
     "e"  - 000
     "s"  - 0010
     "i"  - 0011
     "h"  - 0100
     "n"  - 0101
     ","  - 011000
     """  - 01100100
     "+"  - 0110010100
     "P"  - 0110010101
     "B"  - 011001011
     "v"  - 0110011
     "l"  - 01101
     "o"  - 0111
     "a"  - 1000
     "y"  - 100100
     "f"  - 100101
     "g"  - 100110
     "I"  - 10011100
     "W"  - 1001110100
     "3"  - 100111010100
     "("  - 10011101010100
     "K"  - 10011101010101
     "5"  - 1001110101011
     "2"  - 100111010110
     "0"  - 1001110101110
     "U"  - 1001110101111
     "?"  - 1001110110
     ";"  - 1001110111
     "'"  - 1001111
     "t"  - 1010
     "d"  - 10110
     "c"  - 101110
     "m"  - 101111
     " "  - 110
     "w"  - 111000
     "F"  - 11100100000
     "L"  - 11100100001
     "!"  - 1110010001
     "S"  - 1110010010
     "O"  - 1110010011
     "Y"  - 11100101000
     "E"  - 11100101001
     "x"  - 1110010101
     ":"  - 111001011000
     "1"  - 111001011001
     "j"  - 11100101101
     "A"  - 1110010111
     "."  - 1110011
     "u"  - 111010
    $000A - 111011
     "r"  - 11110
     "T"  - 111110000
     "H"  - 1111100010
     "<"  - 11111000110
     ">"  - 11111000111
     "-"  - 11111001
     "b"  - 1111101
     "p"  - 1111110
     "R"  - 111111100000
     "J"  - 111111100001
     "N"  - 11111110001
     "q"  - 11111110010
     "z"  - 111111100110
     "D"  - 111111100111
     "M"  - 11111110100
     "G"  - 11111110101
     "C"  - 11111110110
     "V"  - 11111110111000
     "X"  - 11111110111001000
    $0000 - 11111110111001001000
    $001A - 11111110111001001001
     "&"  - 11111110111001001010
     "*"  - 11111110111001001011
     "="  - 111111101110010011
     "Q"  - 1111111011100101
     ")"  - 111111101110011
     "4"  - 1111111011101
     "9"  - 11111110111100
     "7"  - 11111110111101
     "8"  - 11111110111110
     "6"  - 11111110111111
     "k"  - 11111111

  4. #4
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    555
    Thanks
    208
    Thanked 193 Times in 90 Posts
    I wonder if a brute force attempt would be possible here - at least the tree structure is almost given because of the lexicographic order constraint - there can only be some "shifts".
    http://schnaader.info
    Damn kids. They're all alike.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Btw, one thing that helped was that frequency rounding.
    With precise original freqs it was ~466k.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    As to bruteforce, here I tried some - ran my optimizer with
    a table of freq increments:

    Code:
             huffman           huf+bruteforce
    00 - . - <00000>           <00000>
    0A - . - <00001>           <00001>
    1A - . - <0001>            <0001>
    20 -   - <001>             <001>
    21 - ! - <01000000>        <01000000>
    22 - " - <010000010>       <010000010>
    26 - & - <010000011>       <010000011>
    27 - ' - <01000010>        <01000010>
    28 - ( - <01000011000>     <0100001100>
    29 - ) - <01000011001>     <01000011010>
    2A - * - <0100001101>      <01000011011>
    2B - + - <010000111>       <010000111>
    2C - , - <010001>          <0100010>
    2D - - - <0100100>         <0100011>
    2E - . - <0100101>         <0100100>
    30 - 0 - <0100110000000>   <010010100000>
    31 - 1 - <0100110000001>   <010010100001>
    32 - 2 - <01001100000100>  <010010100010>
    33 - 3 - <01001100000101>  <010010100011>
    34 - 4 - <0100110000011>   <010010100100>
    35 - 5 - <01001100001000>  <0100101001010>
    36 - 6 - <010011000010010> <0100101001011>
    37 - 7 - <010011000010011> <01001010011000>
    38 - 8 - <01001100001010>  <01001010011001>
    39 - 9 - <01001100001011>  <0100101001101>
    3A - : - <010011000011>    <010010100111>
    3B - ; - <01001100010>     <0100101010>
    3C - < - <0100110001100>   <010010101100>
    3D - = - <0100110001101>   <010010101101>
    3E - > - <010011000111>    <01001010111>
    3F - ? - <01001100100>     <0100101100>
    41 - A - <01001100101>     <0100101101>
    42 - B - <01001100110>     <010010111>
    43 - C - <010011001110>    <0100110000>
    44 - D - <0100110011110>   <01001100010>
    45 - E - <0100110011111>   <01001100011>
    46 - F - <01001101000>     <01001100100>
    47 - G - <01001101001>     <01001100101>
    48 - H - <0100110101>      <0100110011>
    49 - I - <010011011>       <01001101>
    4A - J - <010011100000>    <0100111000000>
    4B - K - <010011100001>    <0100111000001>
    4C - L - <01001110001>     <010011100001>
    4D - M - <01001110010>     <01001110001>
    4E - N - <01001110011>     <01001110010>
    4F - O - <01001110100>     <01001110011>
    50 - P - <010011101010>    <01001110100>
    51 - Q - <0100111010110>   <010011101010>
    52 - R - <0100111010111>   <010011101011>
    53 - S - <0100111011>      <0100111011>
    54 - T - <010011110>       <010011110>
    55 - U - <010011111000>    <010011111000>
    56 - V - <010011111001>    <010011111001>
    57 - W - <01001111101>     <01001111101>
    58 - X - <01001111110>     <01001111110>
    59 - Y - <01001111111>     <01001111111>
    61 - a - <01010>           <01010>
    62 - b - <0101100>         <0101100>
    63 - c - <0101101>         <0101101>
    64 - d - <010111>          <010111>
    65 - e - <0110>            <0110>
    66 - f - <011100>          <011100>
    67 - g - <011101>          <011101>
    68 - h - <01111>           <01111>
    69 - i - <1000>            <1000>
    6A - j - <1001000>         <1001000>
    6B - k - <1001001>         <1001001>
    6C - l - <100101>          <100101>
    6D - m - <10011>           <10011>
    6E - n - <1010>            <1010>
    6F - o - <1011>            <1011>
    70 - p - <110000>          <110000>
    71 - q - <110001>          <110001>
    72 - r - <11001>           <11001>
    73 - s - <1101>            <1101>
    74 - t - <1110>            <1110>
    75 - u - <11110>           <111100>
    76 - v - <1111100>         <111101>
    77 - w - <1111101>         <111110>
    78 - x - <1111110>         <1111110>
    79 - y - <11111110>        <11111110>
    7A - z - <11111111>        <11111111>
             464252 bytes      463380 bytes

  7. #7
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    555
    Thanks
    208
    Thanked 193 Times in 90 Posts
    Oops, the brute force attempt I wanted to do is unusable... anyway, here is it:

    We want to distribute n = 82 codes to a binary tree. To get this, we recursively distribute (1 .. n - 1) codes to the left sub-tree and the remaining (1 .. n - 1) codes to the right sub-tree.

    This gives correct results, unfortunately we'll get 2^80 possible trees with a speed of ~7000 trees/s for my PC (800 MHz)... even with a fast PC we can't brute force all of them...

    If anyone is interested, here's the source code for a brute force on book1. Symbol counts for book1 are hard-coded and the program will spit out codes as soon as the result size is smaller than 464252 (which could take some time, the first tree has codes (0, 10, 110, 1110...) and result size 40,553,606 bytes).

    I'll try to optimize this a bit by only trying codes shorter than 16 (?) bits, but I doubt it will get fast enough.
    Attached Files Attached Files
    Last edited by schnaader; 28th May 2009 at 14:47.
    http://schnaader.info
    Damn kids. They're all alike.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Wonder if anybody can find this article:
    http://citeseerx.ist.psu.edu/showcit...A4?cid=1286484

  9. #9
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    Optimal size for book1 is 461084 bytes.
    You can easily compute also optimal bitcode (see test.cpp)
    Attached Files Attached Files

  10. #10
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    And here is the optimal code:
    Code:
    code(0)=0000
    code(1)=00010
    code(2)=00011
    code(3)=001
    code(4)=01000000
    code(5)=010000010
    code(6)=010000011
    code(7)=01000010
    code(8)=0100001100
    code(9)=01000011010
    code(10)=01000011011
    code(11)=010000111
    code(12)=0100010
    code(13)=0100011
    code(14)=0100100
    code(15)=010010100000
    code(16)=010010100001
    code(17)=010010100010
    code(18)=010010100011
    code(19)=0100101001000
    code(20)=0100101001001
    code(21)=0100101001010
    code(22)=0100101001011
    code(23)=0100101001100
    code(24)=0100101001101
    code(25)=010010100111
    code(26)=0100101010
    code(27)=01001010110
    code(28)=010010101110
    code(29)=010010101111
    code(30)=0100101100
    code(31)=0100101101
    code(32)=010010111
    code(33)=0100110000
    code(34)=01001100010
    code(35)=01001100011
    code(36)=01001100100
    code(37)=01001100101
    code(38)=0100110011
    code(39)=01001101
    code(40)=0100111000000
    code(41)=0100111000001
    code(42)=010011100001
    code(43)=01001110001
    code(44)=01001110010
    code(45)=01001110011
    code(46)=01001110100
    code(47)=010011101010
    code(48)=010011101011
    code(49)=0100111011
    code(50)=010011110
    code(51)=010011111000
    code(52)=010011111001
    code(53)=01001111101
    code(54)=01001111110
    code(55)=01001111111
    code(56)=0101
    code(57)=011000
    code(58)=011001
    code(59)=01101
    code(60)=0111
    code(61)=100000
    code(62)=100001
    code(63)=10001
    code(64)=10010
    code(65)=1001100
    code(66)=1001101
    code(67)=100111
    code(68)=10100
    code(69)=10101
    code(70)=1011
    code(71)=110000
    code(72)=110001
    code(73)=11001
    code(74)=1101
    code(75)=1110
    code(76)=111100
    code(77)=111101
    code(78)=111110
    code(79)=1111110
    code(80)=11111110
    code(81)=11111111
    result: 3688668 bits = 461084 bytes
    Attached Files Attached Files

  11. #11
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Seems like you can get it here but not for free?!?
    I think you found this too.

    http://ieeexplore.ieee.org/Xplore/lo...hDecision=-203

    You could find some of the related documents interesting. They have references on the top places to the one you try to find.

    http://scholar.google.com/scholar?q=...59&as_yhi=1959
    Last edited by Simon Berger; 28th May 2009 at 16:27.

  12. #12
    Member
    Join Date
    May 2008
    Location
    Antwerp , country:Belgium , W.Europe
    Posts
    487
    Thanks
    1
    Thanked 3 Times in 3 Posts
    Very interesting !
    How far do we get now by feeding this "optimal code" to let's say BCM v0.7 ?
    Can I consider BWMonstr. as beaten on book1 ?

    Could it be that BWMonstr also uses this method ?
    Last edited by pat357; 28th May 2009 at 17:34.

  13. #13
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Shelwien View Post
    For reference, here's a real huffman code for the same file (book1).
    The problem is that I want to preserve the original alphabet order
    of the symbols - its intended for computing a BWT using a compressed input.
    pretend a occurs .25 of time b occurs .5 and c .25

    the lexical order wanted is a b c

    but that's not possible you only have

    b as largest or smallest and a c can be slected so any order

    example

    a 00
    c 01
    b 1

    since the normal order not best for BWT any way why
    do you need it to match if doing a binary bWT?

  14. #14
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The reason you want the symbols in sorted order is so you don't need to store the permutation table as well, which would take something like 8*256 bits? (Or a less if you do some math to encode a permutation and not a list)

    If that's true, then you may consider limited permutations... basically use your symbols in order, but use a small exception table , maybe just 32 bits to allow you to reorder the "troublesome" symbols in an encoding.

    So if "y" is especially poorly encoded, you may try specifying y to go somewhere else in order, use say 16 bits to list which index it's moved to (or swapped with?), then recode and see if your recoded version has better behavior.

    It might not, but this gives you the flexibility to at least try to avoid especially poor encodings.


    Another idea would be to use random permutations for order. Use just 4 bits at the start as a seed to make a random order for the whole symbol list. Generate encodings for all 16 possible orders and use the order that has the best encoding. Some may be better than others, and the extra bits may give you the flexibility to avoid the cases where bad luck gave you an especially bad encoding.

  15. #15
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I suspect it's of no use, but hey, maybe somebody will find it interesting anyways.

    http://www.colorforth.com/chars.html

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    @Jan Ondrus:
    Thanks, that's exactly what I wanted.
    Really one of these "why didn't i think about that myself?!" things
    Bruteforcing such a code is pretty simple actually, as optimal
    codes for alphabet subsets are independent, and there's a very
    limited number of these subsets due to order preserving requirement,
    and knowing the optimal codelengths for all subsets of the current
    interval, we can find its optimal codelength too, by testing all
    the possible subdivisions and calculating the codelengths with added
    prefix bits.
    Thus a really nice and simple example of "dynamic programming".

    http://ctxmodel.net/files/bitcode_v1.rar
    (now has all 4 codes included, with a real test of encoding and decoding)

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    @pat357:

    > Very interesting !
    > How far do we get now by feeding this "optimal code" to let's say BCM v0.7 ?

    Nowhere?
    Well, compressing unaligned bitcode with a bytewise BWT is a fairly bad idea,
    and while my scheme might allow for practical GPGPU and under-1N BWT implementations,
    its still supposed to produce a normal BWT on output, which would be compressed
    with a usual ratio by BCM or anything.

    > Can I consider BWMonstr. as beaten on book1 ?

    Err... Not that I have such a goal.

    > Could it be that BWMonstr also uses this method ?

    My guess is that BWMonstr uses a m03-style coding scheme,
    and thats not really a BWT imho - more like a static CM
    (the kind that compresses statistics for a block of data),
    which uses BWT output as a structure for efficient context lookups.
    And if i'm right, beating BWMonstr on book1 might be not
    that hard, as it might be a matter of attaching a slow CM
    to m03.
    But considering something like enwik9, it might be much more
    troublesome, as programming all the stuff with low memory
    usage is a lot of work.


    @biject.bwts:

    > pretend a occurs .25 of time b occurs .5 and c .25
    > the lexical order wanted is a b c
    > but that's not possible

    Why, it is possible, and we have have a few examples
    of such codes posted in this thread.
    Of course, such ordered codes would commonly provide
    worse compression than Huffman's, but who cares
    if it serves the purpose.

    > since the normal order not best for BWT any way why
    > do you need it to match if doing a binary bWT?

    1. With ordered codes I'd be able to use any specific
    alphabet order for BWT, and there're some additional
    bonuses like easily adding control codes and stuff
    to the alphabet without using existing symbols as
    control codes.

    2. I'm trying to calculate the BWT using compressed data,
    so comparisons of compressed substrings have to produce
    the same results as uncompressed.
    In fact, I have a working implementation of that online
    for some time, http://ctxmodel.net/files/rcbwt_v0.rar
    But its inoptimal in many ways, so I'm trying to write
    a better one from scratch, before moving further.


    @GerryB:
    > The reason you want the symbols in sorted order is so you
    > don't need to store the permutation table as well, which
    > would take something like 8*256 bits? (Or a less if you do
    > some math to encode a permutation and not a list)

    Its not, as explained above.

    Btw, I just checked and some of my coders compress
    whole frequency table of book1 to 145 bytes and the length
    table (which is sufficient to reconstruct a prefix bitcode)
    to 47 bytes.

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    And here's an order1 test:
    http://ctxmodel.net/files/bitcode_v2.rar
    381254 on book1, while a plain o1 ari gives 346xxx.
    But I guess its still a worthy improvement comparing to 461k.

    There're some problems though:

    1. Generating 82 codes with 256^3 loops each takes a
    considerable amount of time even on Q6600, so guess
    I'd have to eventually optimize the implementation.

    2. The code (one suggested by Jan Ondrus) is calculated
    using full freqs, which gives slightly better results, but also
    like that codes longer than 16 bits are possible, which is
    inconvenient (and would cause an error in current version).

    But well, I'd better first try to design a BWT based on this.

  19. #19
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    If I understand you correctly you are not going to do a
    binary BWT since the results would differ greatly and
    you would not be able to convert this to the normal
    BWT in an easy way?
    When doing the sorting you have to know the
    start postion of each symbol. That might be a lot of
    extra space or time. Why not just use all symbols the
    same length based on the power of 2. By that I mean
    if you have 10 symbols you use 4 bits per symbol.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    > If I understand you correctly you are not going to do a
    > binary BWT since the results would differ greatly and
    > you would not be able to convert this to the normal
    > BWT in an easy way?

    Yes. But not only that.
    1. There're large files which would not normally fit in memory.
    2. There're architectures without sophisticated memory caching
    (like GPUs) where it would be faster to read less data.
    3. It partially solves some common BWT problems with data
    redundancy, as redundant parts are better compressed.
    Also things like radix sorting become much more efficient.

    > When doing the sorting you have to know the start postion
    > of each symbol. That might be a lot of extra space or time.

    Not really, as usual 32-bit pointers are enough for files
    up to 512M in size (compressed!).
    But even 12-byte indexes into arithmetic code are quite
    usable in fact, as we don't really have to maintain the
    index for all the substrings at once.

    > Why not just use all symbols the same length based on the
    > power of 2. By that I mean if you have 10 symbols you use
    > 4 bits per symbol.

    With order0 it just won't work as eg. enwik has more than
    128 symbols in alphabet.
    And with higher orders there won't be much difference
    in decoder implementations, except for much higher
    redundancy.

    So its the reverse.
    I'd really like to use a rangecoder there - but currently
    I don't see a way to get the same code values with different
    initial range. Well, except for bit-aligned rc, but that's
    much less efficient than the optimal ordered bitcode
    which I use now. And rc with infinite-precision intervals,
    which would be rather hard to design, and still not any practical.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •