
Thread: No more encoding-overhead in Run Length Encoding? Read about Mespotine-RLE here... :)

#1 - mespotine (Member, joined Jan 2015, Germany)


    Hi everyone.

    This is my first post in this forum as I need some help from you.

    After one long year, I have finally finished my paper of a rework of Run Length Encoding.
    Mespotine-RLE-basic, which is the first RLE-algorithm, that produces no useless overhead in the encoding itself anymore. The only overhead produced is a header of only 32 bytes(!).
    That means: no more RLE-encoded-data twice the size of the original, as Mespotine-RLE-basic only compresses the run-values, that are compressible in the first place.
    So that means: either MeRLE compresses, or it’s just, as I said, 32 Bytes bigger.

    You can find the paper at: http://mespotine.de/zugehoert/wp-con...1-2015.pdf.pdf
    or at http://arxiv.org/abs/1501.05542 or at https://archive.org/details/MeoMespo...ic20012015.pdf

    This is a version 0.9 prerelease and I'm open to your questions, suggestions, comments and bug reports, which I can address in version 1.0, so don't hesitate if you have something to comment.
    The algorithm is licensed under a Creative Commons 3.0 cc-by-sa/de license, meaning you can alter and use it, even for commercial purposes, as long as you include my name and release your alterations under a similar license.
    Consider it a gift to the data-compression community, which has taught me a lot about data compression (what it is and what it isn't).
    Cheers and have fun with it
    Meo
    trss.mespotine.de

#2 - biject.bwts (Member, joined Jun 2008, Texas)
    Sorry to inform you, but bijective Run Length Encoders have been around for more than a decade. A bijective coder does not produce any waste. What this means is that any file, from one byte on up, can be viewed either as an RLE-encoded file or as a data file. Does your attempt satisfy this or not?

#3 - biject.bwts (Member, joined Jun 2008, Texas)
    I did a quick look at your method; it's not bijective, so I would not call it a reduced-overhead method. Second, if the file is long enough, every character may appear in runs from one up to a number greater than one. Here is a counterexample.

    Suppose you have ABBAAB.
    Old RLE, which you compare against: 1A2B2A1B, 8 characters = 64 bits.
    Your way: comp_bit_list 11.
    Your RLE: A1B2A2B1, also 8 characters = 64 bits, plus 2 bits of overhead for the comp_bit_list (plus the uncounted overhead of joining the two lists).

    One of the interesting things about compression is that, it seems, one can always create data that makes one's own program look good, but how does it do on other data sets? In my example, old RLE looks better than yours, though I am not fond of it.

#4 - m^2 (Member, joined Sep 2008, Ślůnsk, PL)
    You're not granted copyright for an algorithm, only for an implementation. For this reason your license does apply to the paper, but only to the paper.

#5 - jibz (Member, joined Jan 2015, Denmark)
    From a cursory look at the paper, it appears your method requires two passes over the data. That could be an issue in some uses.

    Like biject.bwts says, you can usually come up with bad cases for an algorithm. An example of something it does not appear to handle so well is adapting to local changes in the compressibility of a value. Take for instance data consisting of 100 repetitions of 'A', followed by 100 repetitions of 'BA'. The 'A' would get a score of -2 and not be compressed by your method (300 bytes), whereas PackBits would compress the run and copy the rest verbatim (203 bytes).

    Speed-wise, there is the overhead of the table lookups to see whether values are compressed or not.

Thanks: Cyan (30th January 2015)

#6 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    Any software or test results?

#7 - mespotine (Member, joined Jan 2015, Germany)
    Thanks for your replies.

    @biject.bwts
    I'll quickly explain the way ABBAAB would be encoded with my method.

    First, we look at the data. Neither A nor B would be compressible with RLE (1A 2B 2A 1B = 64 bits).
    Therefore the Comp_Bit_List would be 00 (the first 0 for A, the second one for B).
    The encoding now does the following:
    we read one character and check whether the accompanying entry in the CompBitList is set to 1 (the character would be compressible) or set to 0 (the character would not be compressible).

    We read A. The first entry in the CompBitList is 0, therefore we do NOT encode A, but store it as it is. -> A
    We read B. The second CompBitList entry is 0, therefore we do not encode the B, but store it as it is. -> B
    Read B, CompBitList entry is 0 -> B
    Read A, CompBitList entry is 0 -> A
    Read A, CompBitList entry is 0 -> A
    Read B, CompBitList entry is 0 -> B

    The result would be: CompBitList: 00, "encoded" data: ABBAAB

    When decompressing, you do the same with the encoded data:
    first read the CompBitList: 00.
    We read A. The first entry in the CompBitList is 0, therefore the A is NOT encoded, but stored as it is. -> A
    We read B. The second entry in the CompBitList is 0, therefore the B is NOT encoded, but stored as it is. -> B
    Read B, CompBitList entry is 0 -> B
    Read A, CompBitList entry is 0 -> A
    Read A, CompBitList entry is 0 -> A
    Read B, CompBitList entry is 0 -> B

    The decoded data would be ABBAAB.

    What my method basically does is use a simple logic (with the help of the Comp_Bit_List) to decide whether encoding a specific character would result in any compression at all.
    If encoding all the A's would result in compression (of at least one byte), the accompanying entry in the Comp_Bit_List is set to 1; otherwise, it is set to 0.
    Therefore, the worst-case scenario is no encoding at all, leading to a Comp_Bit_List of (when working with 8-bit ASCII) 256 bits = 32 bytes, the only overhead produced.
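
    A minimal sketch of that analysis pass, as an illustration of the description above (not the author's code; long runs over 255, which need extra run-code bytes, are ignored here):

    Code:
    // Sketch of the Comp_Bit_List analysis pass (illustration only):
    // a run of n equal bytes, written as <char><count>, turns n bytes
    // into 2, so it scores n-2; a run value is marked compressible
    // (bit = 1) only if its total score is positive.
    #include <stdio.h>
    
    int main(void) {
      const unsigned char data[] = "ABBAAB";   // example from this post
      long score[256] = {0};
      int seen[256] = {0};
      size_t i = 0;
      while (data[i]) {
        size_t n = 1;
        while (data[i + n] == data[i]) ++n;    // measure the current run
        score[data[i]] += (long)n - 2;         // n bytes would become 2
        seen[data[i]] = 1;
        i += n;
      }
      for (int c = 0; c < 256; ++c)            // print the Comp_Bit_List bits
        if (seen[c])
          printf("bit for %c: %d (score %ld)\n", c, score[c] > 0 ? 1 : 0, score[c]);
      return 0;
    }
    For ABBAAB this prints a score of -1 for both A and B, i.e. the Comp_Bit_List 00 derived above.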

    Thanks for pointing out bijective encoding to me. Some other ideas I'm toying around with seem to work with that kind of idea. It'll help me a lot

    @m^2
    I'm not sure if I understand what you meant. Do you mean my wording in the license is wrong? How could I have written it better?

    @jibz
    The first pass is just an analytical pass, just like building the Huffman tree in Huffman encoding. The second pass is the actual encoding. I have included in the appendices some examples that are worst cases for Mespotine-RLE, where RLE itself produces better results, but only smaller by the size of the CompBitList (the MeRLE counterpart to what the Huffman tree is for Huffman).
    So the difference to all RLE methods I have seen so far is that I analyze the data first, while PackBits or Tsukiyama just encode away.
    Though I have to admit, the analytical pass does mean more computing time required, compared to Tsukiyama, PackBits or classic RLE.

    @MattMahoney
    Some test results are in the appendix of the paper; they reflect best and worst cases of the method.
    I'm going to write an actual experimental implementation later, but wanted to ask for some feedback on the method itself first, as at some point you risk not seeing the forest for the trees when developing such stuff for so long.



    Again, thanks for your helpful replies
    Last edited by mespotine; 31st January 2015 at 21:06. Reason: Now I'm done :)

#8 - SvenBent (Member, joined Sep 2007, Denmark)
    Quick question: how would it work with ABABAAAABBBB?

#9 - mespotine (Member, joined Jan 2015, Germany)
    I have worked out another example to clear up some confusion. This should explain how it basically works.
    It contains most of the occurring cases, with the exception of long runs, which are explained in detail in the paper.

    What MeRLE-basic basically does is turn RLE off for run values that wouldn't be compressible using RLE,
    and turn it on for run values that would compress. To decide which is which, it uses the
    CompBitList (which we need to create), in which we can see whether encoding a certain run value would result in
    compression (accompanying entry set to 1) or not (accompanying entry set to 0). By that we lose the
    inefficiencies that would result in overhead that makes the file bigger. With MeRLE-basic, the biggest the output can become is
    the original file + the CompBitList (with all entries set to 0), resulting in the best worst case
    of all RLE variants I have come across as of yet, as the encoding itself(!) NEVER creates files bigger than the original,
    with only the small CompBitList resulting in an always-predictable growth of the file (with 8-bit ASCII: 32 bytes max overhead).
    (An example of a worst-case scenario is at the end of this post...)


    In the following example, you can see what that means.

    ACBBCCCCA = 72 bits

    Standard RLE: 1A 1C 2B 4C 1A = 80 bits, ratio: 111.11%
    Tsukiyama: A C BB0 CC2 A = 72 bits, ratio: 100%
    PackBits: 3ACBB -4C 0A = 72 bits, ratio: 100%

    Mespotine-RLE-basic:
    Comp_Bit_List: 001
    encoded: A C1 BB C4 A = 64 bits + Comp_Bit_List (3 bits here) = 67 bits
    ratio: 93.06%


    In detail:
    The data to encode is: A C BB CCCC A

    Analysing the data & creating the CompBitList:

    - Is the run value A compressible using RLE?
    A -> 1A, A -> 1A
    Encoding the A's would make them bigger by two bytes, therefore A is not compressible;
    the first entry in the CompBitList is set to 0.

    - Is the run value B compressible using RLE?
    BB -> 2B
    Encoding the B's would result in no compression (2 bytes stay 2 bytes), therefore B is not compressible;
    the second entry in the CompBitList is set to 0.

    - Is the run value C compressible using RLE?
    C -> 1C, CCCC -> 4C
    Encoding the C's would make them smaller by one byte, therefore C is compressible;
    the third entry in the CompBitList is set to 1.

    - The resulting Comp_Bit_List is 001.


    Encoding:
    - We read the first character, A.
    The first entry of the CompBitList is set to 0, which means: DON'T encode with RLE.
    A -> A

    - We read the character C.
    The third entry of the CompBitList is set to 1, which means: encode with RLE.
    C -> C1

    - We read the character B.
    The second entry of the CompBitList is set to 0, which means: DON'T encode with RLE.
    B -> B

    - We read the character B.
    The second entry of the CompBitList is set to 0, which means: DON'T encode with RLE.
    B -> B

    - We read the character C.
    The third entry of the CompBitList is set to 1, which means: encode with RLE.
    CCCC -> C4

    - We read the character A.
    The first entry of the CompBitList is set to 0, which means: DON'T encode with RLE.
    A -> A

    The resulting data, including the CompBitList (in this example 3 bits), is: 001 A C1 B B C4 A (5 bits smaller than the original data)






    Decoding:
    Read the CompBitList (in this example 3 bits): 001

    - We read the first character, A.
    The first entry of the CompBitList is set to 0, therefore NO(!) decoding using RLE.
    A -> A

    - We read the character C.
    The third entry of the CompBitList is set to 1, therefore we read the next
    character as the run: 1.
    C1 -> C

    - We read the character B.
    The second entry of the CompBitList is set to 0, therefore NO(!) decoding using RLE.
    B -> B

    - We read the character B.
    The second entry of the CompBitList is set to 0, therefore NO(!) decoding using RLE.
    B -> B

    - We read the character C.
    The third entry of the CompBitList is set to 1, therefore we read the next
    character as the run: 4.
    C4 -> CCCC

    - We read the character A.
    The first entry of the CompBitList is set to 0, therefore NO(!) decoding using RLE.
    A -> A

    The decoded data is: A C BB CCCC A


    In accordance with the prophecy






    Worst-case-scenario example

    Let's take the following example: a file in which the string ABB appears 100,000 times,
    resulting in a file of 300,000 bytes.
    Standard RLE:
    100,000 times 1A2B, resulting in a file of 400,000 bytes.
    Tsukiyama:
    100,000 times ABB0, resulting in a file of 400,000 bytes.
    PackBits:
    we need to include a run byte after every run of 127 unencodable characters,
    to signal a new unencodable run (300,000/127 = 2,363 additional run bytes),
    therefore the resulting file would be 302,363 bytes.
    Mespotine-RLE-basic:
    with MeRLE-basic, it would be a CompBitList of 32 bytes = 256 bits (all set to 0) plus
    the original, unencoded file (as neither the A nor the B would result in
    compression in such a file and they therefore stay unencoded),
    therefore the resulting file would be 300,032 bytes!

    So we can see: if no compression is possible, MeRLE-basic has only the overhead of the CompBitList
    itself, nothing more, resulting in more efficiency in worst-case scenarios, especially on bigger files.

#10 - Piotr Tarsa (Member, joined Jun 2009, Kraków, Poland)
    We can always reduce the maximum overhead of any compression scheme to 1 bit if we use that bit to store whether the content is compressed or not.
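
    For reference, a minimal sketch of that framing (the flag is rounded up to a whole byte here for alignment; compress() is a dummy stand-in for any codec):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    // Dummy stand-in for a real codec: pretends it cannot shrink src.
    size_t compress(const unsigned char* src, size_t n,
                    unsigned char* dst, size_t cap) {
      (void)src; (void)dst; (void)cap;
      return n;  // "no gain"
    }
    
    // Emit a flag, then either the compressed or the raw data,
    // whichever is smaller. Worst-case overhead: the flag itself.
    size_t wrap(const unsigned char* src, size_t n,
                unsigned char* dst, size_t cap) {
      size_t m = compress(src, n, dst + 1, cap - 1);
      if (m < n) { dst[0] = 1; return m + 1; }  // flag 1: compressed
      dst[0] = 0;                               // flag 0: stored as-is
      memcpy(dst + 1, src, n);
      return n + 1;
    }
    
    int main(void) {
      unsigned char out[16];
      size_t m = wrap((const unsigned char*)"ABBAAB", 6, out, sizeof out);
      printf("%zu bytes written, flag %d\n", m, out[0]);  // 7 bytes, flag 0
      return 0;
    }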

#11 - mespotine (Member, joined Jan 2015, Germany)
    @SvenBent

    For a detailed description of how to encode, I suggest you read my post above. It is also important to note that it gets rid of the overhead of incompressible run values, not of the single-character runs of compressible run values! That means there are cases where other RLE methods are better than MeRLE, even standard RLE (I suggest the appendix in my paper, where I have listed such examples). All other cases that produce overhead with standard RLE, Tsukiyama and PackBits are more efficiently encoded with MeRLE, as only run values that produce compression are encoded. If not, they are not encoded at all. That means only the CompBitList itself produces overhead, no matter how big the original data was (unlike other RLE methods, where the encoding itself can produce an enormous amount of overhead).

    Your example is one that is better encoded with other RLE encoders. I do not claim MeRLE-basic is the best RLE encoder available, but it improves heavily on worst-case scenarios and on cases where other RLE methods fail completely.

#12 - mespotine (Member, joined Jan 2015, Germany)
    @Piotr Tarsa
    But then you lose the possibility of compressing the parts that ARE compressible by the algorithm.

    Let's concentrate on standard RLE in the following example: AAAABBCBBCBC = 12 chars.
    Encoded, it would be 4A 2B 1C 2B 1C 1B 1C = 14 chars, meaning it is not compressible with standard RLE.
    If, because of that, you turn off encoding entirely, you can't benefit from the fact that the A IS compressible. I turn RLE on for A, while for all the others RLE is turned off.

    By that: A4BBCBBCBC = 10 chars + 3 bits of the bit list (which remembers whether RLE for a certain character is turned off or not).

    That's the improvement over other RLE methods: I keep the compressible run values encoded, while the others are not.

    And yes, such a thing could be implemented in other methods too, but they would need completely different strategies to do so successfully (in Huffman it's hard without doing it the way you described... turning it on and off for the whole file).

    I hope I got your point right....
    Last edited by mespotine; 1st February 2015 at 17:20.

#13 - m^2 (Member, joined Sep 2008, Ślůnsk, PL)
    mespotine,
    it's not about wording but about how copyright law works. Your wording would be OK if what you stated were allowed, but it's not, on the principle that copyright is granted not for ideas (algorithms) but for expressions (implementations of algorithms).

#14 - SvenBent (Member, joined Sep 2007, Denmark)
    @mespotine your own example provided me the requested info, ty.
    But it looks to me like you need to make a two-pass encoding of the data: one pass to figure out whether something is compressible or not, and a second to compress it.

#15 - SvenBent (Member, joined Sep 2007, Denmark)
    I messed with a different RLE approach for dithered images, where the first bit of the run length indicates whether a single byte or a 2-byte pair is to be repeated.

    So a dithered image section of ABABABABABABABAB would easily be compressed to [136] [A] [B]:
    136 = 128 (most significant bit set to 1) + 8 (the count)
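
    A sketch of the decode side of that scheme (one reading of the description above; the count byte's top bit selects single-byte vs. pair runs):

    Code:
    #include <stdio.h>
    
    // Decode pair-flag RLE: the top bit of the count byte selects
    // repeating one byte (0) or a two-byte pair (1).
    void decode_pair_rle(const unsigned char* s, size_t n, FILE* out) {
      size_t i = 0;
      while (i + 1 < n) {
        int count = s[i] & 0x7F;
        if (s[i] & 0x80) {                    // [1ccccccc] a b : pair run
          for (int k = 0; k < count; ++k) {
            putc(s[i + 1], out);
            putc(s[i + 2], out);
          }
          i += 3;
        } else {                              // [0ccccccc] a : single-byte run
          for (int k = 0; k < count; ++k) putc(s[i + 1], out);
          i += 2;
        }
      }
    }
    
    int main(void) {
      const unsigned char enc[] = {136, 'A', 'B'};   // 136 = 128 + 8
      decode_pair_rle(enc, sizeof enc, stdout);      // prints ABABABABABABABAB
      putc('\n', stdout);
      return 0;
    }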

#16 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    I wrote a quick program to calculate compressed file sizes using Mespotine RLE. Some results:

    Code:
    calgary\BIB:
      111261 -> 111293
    
    calgary\BOOK1:
      768771 -> 768803
    
    calgary\BOOK2:
      610856 -> 610888
    
    calgary\GEO:
      102400 -> 102432
    
    calgary\NEWS:
      Char * (42) saves 47 bytes
      Char - (45) saves 1545 bytes
      Char = (61) saves 191 bytes
      Char ~ (126) saves 243 bytes
      377109 -> 375115
    
    calgary\OBJ1:
      Char   (0) saves 2905 bytes
      21504 -> 18631
    
    calgary\OBJ2:
      246814 -> 246846
    
    calgary\PAPER1:
      53161 -> 53193
    
    calgary\PAPER2:
      82199 -> 82231
    
    calgary\PIC:
      Char   (0) saves 411805 bytes
      Char * (255) saves 468 bytes
      513216 -> 100975
    
    calgary\PROGC:
      39611 -> 39643
    
    calgary\PROGL:
      Char ; (59) saves 2445 bytes
      71646 -> 69233
    
    calgary\PROGP:
      Char   (32) saves 2166 bytes
      49379 -> 47245
    
    calgary\TRANS:
      Char   (0) saves 1427 bytes
      Char   (8) saves 105 bytes
      Char ? (63) saves 2 bytes
      Char _ (95) saves 14 bytes
      93695 -> 92179
    
    silesia\dickens:
      10192446 -> 10192478
    
    silesia\mozilla:
      Char   (0) saves 3380768 bytes
      51220480 -> 47839744
    
    silesia\mr:
      Char   (0) saves 2725 bytes
      9970564 -> 9967871
    
    silesia\nci:
      Char   (32) saves 1344299 bytes
      Char $ (36) saves 17928 bytes
      33553445 -> 32191250
    
    silesia\ooffice:
      Char   (0) saves 7603 bytes
      Char É (144) saves 98521 bytes
      Char ╠ (204) saves 13059 bytes
      6152192 -> 6033041
    
    silesia\osdb:
      10085684 -> 10085716
    
    silesia\reymont:
      6627202 -> 6627234
    
    silesia\samba:
      Char   (0) saves 1027833 bytes
      Char * (42) saves 471387 bytes
      21606400 -> 20107212
    
    silesia\sao:
      7251944 -> 7251976
    
    silesia\webster:
      41458703 -> 41458735
    
    silesia\x-ray:
      8474240 -> 8474272
    
    silesia\xml:
      Char   (0) saves 20512 bytes
      5345280 -> 5324800
    
    enwik8:
      100000000 -> 100000032
    
    enwik9:
      Char ' (39) saves 77014 bytes
      1000000000 -> 999923018
    
    maxcomp\a10.jpg:
      842468 -> 842500
    
    maxcomp\acrord32.exe:
      Char   (0) saves 115140 bytes
      Char * (255) saves 10639 bytes
      3870784 -> 3745037
    
    maxcomp\english.dic:
      4067439 -> 4067471
    
    maxcomp\FlashMX.pdf:
      4526946 -> 4526978
    
    maxcomp\fp.log:
      20617071 -> 20617103
    
    maxcomp\mso97.dll:
      Char   (0) saves 5362 bytes
      3782416 -> 3777086
    
    maxcomp\ohs.doc:
      Char   (0) saves 104861 bytes
      4168192 -> 4063363
    
    maxcomp\rafale.bmp:
      4149414 -> 4149446
    
    maxcomp\vcfiu.hlp:
      Char Σ (228) saves 60 bytes
      Char ± (241) saves 16 bytes
      4121418 -> 4121374
    
    maxcomp\world95.txt:
      Char _ (95) saves 19179 bytes
      2988578 -> 2969431
    Here is the program. Run it with filename arguments.

    Code:
    // Estimate compression using Mespotine RLE encoding
    // Usage: mespotine files...
    // Written by Matt Mahoney. Public domain.
    
    #include <stdio.h>
    
    int main(int argc, char** argv) {
      for (int i=1; i<argc; ++i) {
    
        // Open file
        FILE* in=fopen(argv[i], "rb");
        if (!in) {
          perror(argv[i]);
          continue;
        }
    
        // Count runs
        int c, c1=-1;  // current and last char
        long long t[256]={0};  // byte -> savings
        long long n=0;         // input size
        long long run=0;       // current run length
        while ((c=getc(in))!=EOF) {
          ++n;
          if (c==c1) t[c] += ++run%255!=254;
          else --t[c], run=0;
          c1=c;
        }
    
        // Show compressed size
        long long newsize=n+32;
        printf("%s:\n", argv[i]);
        for (int i=0; i<256; ++i) {
          if (t[i]>0) {
            printf("  Char %c (%d) saves %1.0f bytes\n", i<32?' ':i, i, t[i]+.0);
            newsize-=t[i];
          }
        }
        printf("  %1.0f -> %1.0f\n\n", n+.0, newsize+.0);
        fclose(in);
      }
      return 0;
    }
    Last edited by Matt Mahoney; 4th February 2015 at 18:36.

Thanks (2): jibz (5th February 2015), PSHUFB (26th February 2015)

#17 - Gonzalo (Member, joined Aug 2014, Argentina)
    Great!
    Edge case: 1 GB zero-filled file >>

    Code:
    gonzalo@gonzalo-M861G:/media/gonzalo/Sistema NT1$ mespotine-rle zero-filled.bin
    zero-filled.bin:
      Char   (0) saves 1069531070 bytes
      1073741824 -> 4210786
    BTW: Can you edit the demo to accept data from stdin? It is impossible to measure such a fast algorithm because of HD limits... Even better, if it is not asking too much... can you make it write data to disk/stdout? I'd like to see how it can be used as a preprocessor. Thanks in advance!
    Last edited by Gonzalo; 4th February 2015 at 22:03.

#18 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    Try replacing getc(in) with getchar(), skip the fopen() code, and remove the outer for-loop. That will only work on Linux (on Windows, stdin is not in binary mode by default).
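
    A minimal stdin variant of the estimator loop along those lines (a sketch, using the corrected run scoring that the thread settles on below):

    Code:
    // Estimate Mespotine RLE output size, reading from stdin.
    // Usage: mespotine < file
    #include <stdio.h>
    
    int main(void) {
      int c, c1 = -1;
      long long t[256] = {0}, n = 0, run = 0;
      while ((c = getchar()) != EOF) {   // was: getc(in)
        ++n;
        if (c == c1) t[c] += ++run % 255 != 0;  // run of n scores n-2 overall
        else --t[c], run = 0;
        c1 = c;
      }
      long long newsize = n + 32;        // 32-byte Comp_Bit_List header
      for (int i = 0; i < 256; ++i)
        if (t[i] > 0) newsize -= t[i];
      printf("%lld -> %lld\n", n, newsize);
      return 0;
    }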

#19 - Gonzalo (Member, joined Aug 2014, Argentina)
    Thanks! I've managed to get it eating from stdin, but I don't yet get how to write the actual RLE output, not just the statistics...
    I tried putchar() instead of printf, but if that is right, what should I put between the ()? I just put everything, and the output goes from the very original input file to a complete 1-256 list over and over again. I can't figure out which variable stores the compressed output, if there is one... Sorry for the newbie question.
    Last edited by Gonzalo; 5th February 2015 at 05:14.

#20 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    It's not a compressor. It only calculates the output size. To compress, you would make a second pass through the file and RLE encode the bytes for which t[c] > 0. The RLE encoding is the byte followed by the run length. For runs of 1..255 you would use bytes 0..254. For longer runs you use byte 255 to mean a run of 255 followed by another run code. You would also need to write a decompressor, of course.

    I didn't bother to write a compressor because the savings are very small and there are better algorithms like LZ77 already written. RLE is just LZ77 restricted to offsets of 1.
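
    A sketch of just that run-code convention (an illustration; the full compressor Matt posts below uses the same loop):

    Code:
    #include <stdio.h>
    
    // Emit one run: the byte, then a 255-code per full 255 of the run,
    // then the remaining count minus one (runs 1..255 -> bytes 0..254).
    void put_run(int ch, long long run, FILE* out) {
      putc(ch, out);
      for (; run > 255; run -= 255) putc(255, out);  // full 255, more follows
      putc((int)(run - 1), out);                     // final code, 0..254
    }
    
    int main(void) {
      put_run('A', 300, stdout);  // writes 'A', 255, 44  (255 + 45 = 300)
      return 0;
    }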

Thanks: Gonzalo (5th February 2015)

#21 - Gonzalo (Member, joined Aug 2014, Argentina)
    Quote (Originally Posted by Matt Mahoney):
    It's not a compressor. It only calculates the output size. To compress, you would make a second pass through the file and RLE encode the bytes for which t[c] > 0. The RLE encoding is the byte followed by the run length. For runs of 1..255 you would use bytes 0..254. For longer runs you use byte 255 to mean a run of 255 followed by another run code. You would also need to write a decompressor, of course.

    I didn't bother to write a compressor because the savings are very small and there are better algorithms like LZ77 already written. RLE is just LZ77 restricted to offsets of 1.
    Sounds fair... Anyhow, as this is the simplest compressor I can think of, I will try to get it working as a first step into real-world packers.

#22 - Kennon Conrad (Member, joined Jan 2014, Bothell, Washington, USA)
    Dr. Mahoney,

    I hate to mention it since you did a lot of testing, but I think the code has problems here:

    Code:
          if (c==c1) t[c] += ++run%255!=254;
          else --t[c], run=0;
    I think t[c] should be incremented when a ++run%255 != 0 test passes and decremented otherwise, instead of not changing it when it is 254. I also think the --t[c] should be --t[c1], but then you would also have to handle c1 being a negative index the first time through the loop. Perhaps I misunderstand the code, but that's what I see.
    Last edited by Kennon Conrad; 5th February 2015 at 13:50.

#23 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    I think you're right about the first bug. About the second, a run of 1 should have a score of -1. In general, a run of n should score n-2 for n up to 255.

#24 - Kennon Conrad (Member, joined Jan 2014, Bothell, Washington, USA)
    Quote (Originally Posted by Matt Mahoney):
    I think you're right about the first bug. About the second, a run of 1 should have a score of -1. In general, a run of n should score n-2 for n up to 255.
    I missed the part about sending long runs. So when ++run%256 == 255 (a run of 256), a byte is not saved for that character, and run should be set to 0, because you don't save a byte once per 255 characters in the run. For the second problem, I think I was wrong.
    Last edited by Kennon Conrad; 5th February 2015 at 20:32.

#25 - Gonzalo (Member, joined Aug 2014, Argentina)
    I saw something strange while running the test against ~6.7 GB of data.
    The test set is a shar archive containing a LiberKey installation.

    Code:
      Char   (0) saves 231572627 bytes
      Char ╠ (204) saves 23288754 bytes
      7180648320 -> 6925786971
    I wonder if it is true that only two characters made gains in such a big file...

#26 - Matt Mahoney (Expert, joined May 2008, Melbourne, Florida, USA)
    Here is a compressor.

    Code:
    // Mespotine RLE compressor
    // Usage: mespotine c|d input output
    // Written by Matt Mahoney. Public domain.
    
    #include <stdio.h>
    
    int main(int argc, char** argv) {
    
      // Check args
      if (argc!=4)
        return printf("To compress|decompress: mespotine c|d input output\n"), 1;
    
      // Open files
      FILE* in=fopen(argv[2], "rb");
      if (!in) return perror(argv[2]), 1;
      FILE* out=fopen(argv[3], "wb");
      if (!out) return perror(argv[3]), 1;
    
      int c, c1=-1;  // current and last char
      long long t[256]={0};  // byte -> savings
      long long run=0;  // current run length
    
      // Compress
      if (argv[1][0]=='c') {
    
        // Pass 1: determine which chars will compress
        while ((c=getc(in))!=EOF) {
          if (c==c1) t[c] += ++run%255!=0;
          else --t[c], run=0;
          c1=c;
        }
    
        // Write bitmapped header (1 = compressible)
        for (int i=0; i<32; ++i) {
          c=0;
          for (int j=0; j<8; ++j) c+=(t[i*8+j]>0)<<j;
          putc(c, out);
        }
    
        // Pass 2: RLE encode
        rewind(in);
        c=c1=EOF;
        run=0;
        do {
          c=getc(in);
          if (c==c1)
            ++run;
          else if (run>0 && t[c1]>0) {
            putc(c1, out);
            for (; run>255; run-=255) putc(255, out);
            putc(run-1, out);
            run=1;
          }
          else
            for (++run; run>1; --run) putc(c1, out);
          c1=c;
        } while (c!=EOF);
      }
    
      // Decompress
      else if (argv[1][0]=='d') {
    
        // Read header
        for (int i=0; i<32; ++i) {
          c=getc(in);
          for (int j=0; j<8; ++j)
            t[i*8+j]=(c>>j)&1;
        }
    
        // Decode
        while ((c=getc(in))!=EOF) {
          if (t[c]) {
            for (run=0; (c1=getc(in))==255; run+=255);
            run+=c1+1;
            for (; run>0; --run) putc(c, out);
          }
          else putc(c, out);
        }
      }
    
      // Print result
      printf("%lu -> %lu\n", ftell(in), ftell(out));
      return 0;
    }

Thanks (4): Gonzalo (6th February 2015), jibz (6th February 2015), PSHUFB (26th February 2015), surfersat (6th February 2015)

#27 - Gonzalo (Member, joined Aug 2014, Argentina)
    You didn't let me write my own
    Anyway, I wouldn't know where to start. So thank you.
    I will check it later in depth.

    BTW: just for fun, I just compressed a 1 GB file down to exactly 23 bytes. That is, about 0.0000021%. Of course the file has zero entropy...
    The method is mrle 4 times, then ccm. There are many more methods left; I tried several and picked the best. From about 1.5 years ago until now, the best was rep+slugx = 93 bytes.
    Now I'm thinking about a way to chain the two steps inside mrle to eliminate the need to make 2 passes through the data. I don't know if it is possible. Doing so, the next logical step is to make it recursive. In this extreme case, mrle kept gaining until the 5th run. Of course, in the real world there are very few samples where the run length is that big, or at least >256. The only case I saw is the beginning of an ISO image. Perhaps a solid-color bitmap too...
    Last edited by Gonzalo; 6th February 2015 at 19:12.

#28 - jibz (Member, joined Jan 2015, Denmark)
    Thanks for posting the example compressor! I have taken the liberty of modifying it to create PackBits and Tsukiyama versions to help compare. You can find them here:

    https://gist.github.com/jibsen/75275f594f2100b5df77

    https://gist.github.com/jibsen/214d1abfaf2d78515e6b

Thanks: Gonzalo (7th February 2015)

#29 - Bulat Ziganshin (Programmer, joined Mar 2007, Uzbekistan)
    Quote (Originally Posted by Gonzalo):
    BTW: just for fun, I just compressed a 1 GB file down to exactly 23 bytes. That is, about 0.0000021%. Of course the file has zero entropy...
    The method is mrle 4 times, then ccm. There are many more methods left; I tried several and picked the best. From about 1.5 years ago until now, the best was rep+slugx = 93 bytes.
    I got 81 bytes with headerless fazip:

    Code:
    M:\>fazip compress:rep:1g+tor 1gb0 1gb0.rep-tor
    100%: 1,000,000,000 -> 81: 0.00%  Cpu 315 mb/s (3.026 sec), real 373 mb/s (2.555 sec) = 118%
    
    M:\>fazip decompress:rep:1g+tor 1gb0.rep-tor nul
    100%: 81 -> 1,000,000,000: 0.00%  Cpu 2911 mb/s (0.328 sec), real 1658 mb/s (0.575 sec) = 57%
    and 64 bytes with srep:

    Code:
    M:\>srep64 1gb0 -b1g -m0o -d1m:c4 -nomd5
    SREP 3.93 beta (September 30, 2014): input size 953 mb, memory used 11364 mb, -m0o -hash- -b1gb -d1mb:h4mb:l512:c4
    100%: 1,000,000,000 -> 64: 0.00%.  Cpu 443 mb/s (2.153 sec), real 322 mb/s (2.961 sec) = 73%.  Remains 00:00
    Decompression memory is 0 mb.  1 matches = 16 bytes = 25.00% of file
    Last edited by Bulat Ziganshin; 6th February 2015 at 22:44.

#30 - Member (joined Dec 2011, Germany/Hessen)
    First of all, I like the writing style of the paper - nice to understand.

    But for the algorithm itself - sorry, I think it is rather a fail, or I understood it completely wrong.

    Let me explain what I think:

    The main point is that you apparently did not 100% understand how RLE works.

    In order to encode, and later on decode, your runs, you have to say how long your runs are. So, as you stated, you will code something like 3A 2B 5A (+ in your case the CompBitList entry).

    But since 3 or 2 or 5 are also chars "3", "2", "5", you have a problem when you want to decode your compressed file - because you never know whether it is a run length or a char, e.g. "3"A vs. 3A.

    So what you need is an escape symbol to show that a run is meant. So it basically looks like E3A E2B E5A (E standing for the escape symbol).

    The point is now:
    as your algorithm would anyway also need an escape symbol, there is no advantage to the algorithm.
    Because in classic RLE, if I do not save chars with it, I just do not put E2B - I just write BB instead - and the missing escape symbol tells me it has to be interpreted as normal chars.
    This has the same ability your algorithm has: everything incompressible is just not touched.

    But it has 2 advantages over your algorithm:
    - it does not need the "dictionary" up front
    - it is much more flexible
    e.g. AGAFAJAKALAOAAAAAA - your algorithm would say there is no saving for A, so do not use it;
    the normal approach would do AGAFAJAKALAO E6A - it is able to compress the last part.
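
    For illustration, a sketch of the escape-symbol RLE described here (one reading; an occurrence of the escape byte in the data itself would need extra escaping, which is glossed over):

    Code:
    #include <stdio.h>
    
    #define ESC 0x1B  /* hypothetical escape byte */
    
    // Escape-based RLE: runs worth encoding become ESC <count> <char>;
    // everything else is copied verbatim, no dictionary up front.
    void rle_escape(const unsigned char* s, size_t n, FILE* out) {
      size_t i = 0;
      while (i < n) {
        size_t run = 1;
        while (i + run < n && s[i + run] == s[i] && run < 255) ++run;
        if (run >= 4) {            // ESC+count+char is 3 bytes, pays off at 4+
          putc(ESC, out); putc((int)run, out); putc(s[i], out);
        } else {
          for (size_t k = 0; k < run; ++k) putc(s[i], out);
        }
        i += run;
      }
    }
    
    int main(void) {
      const unsigned char s[] = "AGAFAJAKALAOAAAAAA";  // example from this post
      rle_escape(s, sizeof s - 1, stdout);  // AGAFAJAKALAO then ESC,6,'A'
      putc('\n', stdout);
      return 0;
    }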


