Results 1 to 1 of 1

Thread: Is any one interested in at least bijective endds for file compression

Threaded View

Previous Post Previous Post   Next Post Next Post
  1. #1
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Thanked 14 Times in 10 Posts

    Lightbulb Is any one interested in at least bijective endds for file compression

    I have talked with another person and I will likely post in this thread about bijective coding. I know that its to much to hope that people will make the Whole approach to file compression as bijective. But I would like to discuss how to make a standard Entropy coder bijective.
    The first one to discuss is how does one make a huffman compress with a fixed table map its tokens which are of various length to a byte type of file that is of 8 bit bytes.
    Actually with the same ease you could map to any infinite structure of data blocks even those of varying number of bits but lets kill the byte case first.

    I notice that many entropy compressors which is also the last stage in many popular file compressors have difficulty in meshing there endings with the byte structure of data files.
    Most people some to think you have to output extra stuff such as a symbol count of some type or reserve space in the Huffman table ( if using huffman ) for some sort of EOF symbol. Also space waste in last byte since the EOF symbol not only wasteful but will likely not fill the last byte without waste itself. Also it wastes space when not used in the rest of the file since it robs space in the output file itself.

    Simple Example: Say your compressing some of 4 symbols A B C D
    supose A occurs half the time and B occurs 1/4 the time and C and D each occur 1/8 the time in the files of interest.

    It would be nice to use a fixed table for these files so the tables don't have to be in the code.
    A = 1
    B = 01
    C = 001
    D = 000

    However how would one end the file. Or better yet how does one map bijectively this file set to the byte file set.

    Most users would say lets create a number field in front that tell how many huffman
    tokens are used. Well that waste space for the number. Some smarter people may notice that since bytes are 8 bits and since the longest token is 3 bits. One could write a small number field and only output the number for last tokens in part of last byte written. Others Might say lets creat a huffman table for 5 symbols. Lets look at that
    A = 1
    B = 01
    C = 001
    D = 0001
    E = 0000 this is otptimal huffman table for the probabilites mentioned you
    assume that the E the EOF is less likely than the other symbols.

    Well you run the huffman coder using above table and when at last word you could cleverly just output enough Zeros to fill last byte. If the last symbol ends on the lucky last bit of a byte you don't even need to write the EOF symbol since you could pretend the FILE EOF is actually the huffman EOF in this special lucky alignment case.
    Well you still lose a lot of space. Since we know every 1/8 time a D appears and we should optimally write 3 bits out but now we are writting 4 so we lost big time again for a long file.

    Here is a bijective solution for this case part one
    use the correct table
    A = 1
    B = 01
    C = 001
    D = 000 /* if at least one B or C follows somewhere in rest of string to end */
    D = 0001 /* if only zero or more A tokens follow till end of huffman string */

    write this to a byte file when at end of file keep writing to at last byte is filled
    notice this last byte will never be all zeroes. (note if using huffman tokens that
    are not all zeros if they are last truncate the last bytes at this point so no last byte of zero.

    This can't be bijective since we don't allow for the zero byte at end. The above is close to bijective. In fact its at most one byte off.

    part two process

    looks at the byte just written if the last byte is not 10000000 you are done.

    if the next to last byte is not a 00000000 or 10000000 your done
    if its a 10000000 loop to read next most last byte
    now it must be a byte of 00000000 all zeros.
    so drop the last byte written your done its bijective

    It's that simple (look a proof read this a dozen times but I am sure my English sucks)

  2. Thanks:

    R2F6K (7th February 2014)

Similar Threads

  1. PerfectCompress, a new file compression software.
    By moisesmcardona in forum Data Compression
    Replies: 148
    Last Post: 21st May 2018, 04:16
    By biject.bwts in forum Data Compression
    Replies: 4
    Last Post: 13th February 2012, 02:36
  3. Compression test file generator
    By Matt Mahoney in forum Data Compression
    Replies: 3
    Last Post: 26th June 2011, 22:28
  4. Interested in Google-Wave?
    By Vacon in forum The Off-Topic Lounge
    Replies: 2
    Last Post: 29th November 2009, 20:11
  5. my file compression considerations
    By JB_ in forum Data Compression
    Replies: 2
    Last Post: 5th May 2008, 20:47

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts