
Thread: String compression for 8bit architectures.

  #1 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts


    I got interested this year in string compression, specifically for very low-resource machines, usually 8-bit CPUs with maybe 64 KB of RAM.
    The idea here is: what could I use to save text space?

    There are already modern compressors, such as LZSA and many more, optimized for 8-bit machines. But all of those are aimed at general-purpose compression of binary data.

    I wanted something that could be called with a string number and an address, and have that one string decompressed. The standard use case is a text adventure, like the ones developed by Infocom, Level-9, Magnetic Scrolls, etc. By the way, all those systems (adventure interpreters) use some kind of text compression: packing 4 chars in 3 bytes (used by Level-9, I think), while many others use a simple dictionary token substitution. I've used strings extracted from decompiled adventures on these systems for some basic testing. I used Delphi/Pascal for the tests, but that is irrelevant.
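    For reference, the 4-chars-in-3-bytes trick amounts to a 6-bit character code. A minimal sketch of the packing, in C; this is my own illustration, not code from any of those interpreters:

    Code:
    #include <stdint.h>

    /* Pack four 6-bit character codes (0..63) into three bytes, and back. */
    void pack4(const uint8_t c[4], uint8_t out[3])
    {
        out[0] = (uint8_t)((c[0] << 2) | (c[1] >> 4));
        out[1] = (uint8_t)((c[1] << 4) | (c[2] >> 2));
        out[2] = (uint8_t)((c[2] << 6) |  c[3]);
    }

    void unpack4(const uint8_t in[3], uint8_t c[4])
    {
        c[0] =  in[0] >> 2;
        c[1] = (uint8_t)(((in[0] & 0x03) << 4) | (in[1] >> 4));
        c[2] = (uint8_t)(((in[1] & 0x0F) << 2) | (in[2] >> 6));
        c[3] =  in[2] & 0x3F;
    }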

    So existing string-focused compressors, such as shoco, smaz, the trie version of smaz, etc., are not tailored to this problem.
    I've studied, as far as I can, the sources of other string compressors too, such as FemtoZip and Unishox, and came to the following conclusions:

    - The decompression code needs to be small, but it does not have to fit under 100 bytes; of course, smaller is better. Speed is not relevant, AFAIK, in a text adventure.
    - So simple coding has to be used, no arithmetic coding for example. I've read something about the possibility of AC on an 8-bit machine, but it's totally beyond my knowledge.
    - The best I could use is a token dictionary plus maybe some coding of literals and dictionary pointers. The next thing I tested was adaptive Huffman per string; it gives nice compression and is easy to use for strings, but it's impractical for this use case.

    My definition of the problem:
    - alphabet size is 64 chars, so I can use 6 bits for literals, for example
    - max total text to compress is 0xFFFF bytes
    - strings are capitalized; per-platform decompression code can capitalize or apply other transformations after unpacking
    - max 255 strings (could be more without changing the method)
    - max string size can be more than 255 bytes
    - strings are accessed by number
    - dictionary size is 512 bytes
    - max dictionary match length is 8 chars (3 bits); after some testing I found that longer matches are not produced even if I use 4 bits (the bit budget is summarized right after this list)
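    Just to keep the numbers straight for the rest of the post, here is the bit budget above written out as constants (the names are mine):

    Code:
    /* Bit budget implied by the constraints above; the names are mine. */
    enum {
        ALPHABET_SIZE    = 64,   /* -> 6-bit literal codes          */
        LITERAL_BITS     = 6,
        DICT_SIZE        = 512,  /* -> 9-bit dictionary offset      */
        DICT_OFFSET_BITS = 9,
        MATCH_LEN_BITS   = 3,    /* matches of up to 8 chars        */
        TOKEN_BITS       = DICT_OFFSET_BITS + MATCH_LEN_BITS  /* 12 */
    };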

    I came up with the following idea, which is giving me around 60% of the original size (including the dictionary, the alphabet definition, and the string pointers). I don't claim this gives huge savings or that I've invented anything useful; I just took some things from here and there and tried to apply them to this specific case.


    This is my first attempt at coding something related to compression, and I've been testing many things, so maybe you have some ideas about it. The final goal is to compress the strings on a modern machine and then output results that can be used with a small routine on machines with a Z80, a 6502, an early Intel CPU (8088, 8086), etc.

    - all strings are capitalized, concatenated, and cleaned of unknown chars

    - all n-grams up to size 8 are extracted from the cleaned text and sorted by frequency of use

    - dictionary construction is based on what I've seen in other source code; I tried to understand the Zstandard dictionary builders (even the new fastcover), but the code is too complex for me.
    And I think I don't need suffix arrays, because the maximum input size is 64 KB, so I can afford to extract every n-gram? Could anyone correct me if I am wrong about this?

    Basically, each n-gram is taken in order from the n-gram list and added to the dictionary, with the usual optimizations: an n-gram is not added if it is a substring of another one already in the dictionary, and an n-gram is packed into the dictionary if its head or tail coincides with that of another one already there, maximizing the number of coincident chars. For example, when adding "ESE" to the dictionary and finding the token "SDES" already in it, merge both, producing a bigger token, "SDESE" (a tail merge in this case).

    - my idea here is, please correct me if I am wrong, to get the maximum number of n-grams packed into the dictionary while maximizing their frequency of use. This works pretty well, usually packing hundreds of n-grams into the 512-byte dictionary. I have the feeling that there are better ways to build this small dictionary and improve compression, but I don't have a better idea. (A rough sketch of the merge step is below.)
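    This is roughly the merge step as a sketch; the helper names are mine, and a real builder would also try the symmetric head merge and keep whichever saves more bytes:

    Code:
    #include <stdio.h>
    #include <string.h>

    /* Length of the longest suffix of a that is also a prefix of b. */
    static size_t overlap(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b);
        size_t max = la < lb ? la : lb;
        for (size_t k = max; k > 0; k--)
            if (memcmp(a + la - k, b, k) == 0)
                return k;
        return 0;
    }

    /* Append b to a, sharing the overlapping characters ("tail merge").
       dst must be large enough. */
    static void merge(char *dst, const char *a, const char *b)
    {
        size_t k = overlap(a, b);
        strcpy(dst, a);
        strcat(dst, b + k);
    }

    int main(void)
    {
        char out[16];
        merge(out, "SDES", "ESE");   /* "SDES" and "ESE" share "ES" */
        printf("%s\n", out);         /* prints SDESE                */
        return 0;
    }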

    - compression is pretty simple:

    - literals use 6 bits (Huffman coded) and are stored at the end of the compressed string; the unpacking space is prefilled with 0s, so after unpacking the dictionary tokens, the 0 bytes are filled with the literals in the order they come

    - pointers into the dictionary are 12 bits: 9 for an offset into the 512-byte dictionary and 3 for the match length. I got this idea from FemtoZip, I think. I use a Huffman coder for the literal 6-bit values, and decomposing the 12-bit tokens into two halves gives 6+6, so I can use a single Huffman coder for encoding both literals and dictionary token pointers. (FemtoZip uses byte nibbles, but each one has its own Huffman table; I cannot afford that luxury.)

    - so compression works this way:

    - dictionary tokens are found for each string from the maximum possible length down to the minimum, so I first try to find 8-char matches, then down to 3

    - once no more tokens match, all the rest are literals, and these are deferred to the end of the compressed string, after all the dictionary tokens

    - I do this for each string, accumulating stats on the 6-bit values for my Huffman encoder

    - finally, Huffman is used to encode the tokens 6 bits at a time, and then all the literals are encoded and appended
    - strings are accessed via a pointer array, so no string lengths are needed, and as I said, the literals are filled in at the end

    I've not tested unpacking yet, but it should work. A rough sketch of what I have in mind for the unpacker is below.
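    The sketch works on symbols that have already been Huffman-decoded back to 6-bit values. All the names are mine, the bit layout of the token and the length bias are assumptions, and I've glossed over how the decoder knows where each dictionary match lands in the output, so treat that part as a placeholder:

    Code:
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint16_t pos;   /* destination offset in the output string
                           (my assumption: the decoder has to know this) */
        uint16_t tok;   /* 12-bit token, assuming the 9-bit dictionary
                           offset sits in the high bits                   */
    } PlacedToken;

    void unpack_string(const PlacedToken *tokens, int ntokens,
                       const uint8_t *literals,   /* 6-bit codes, in order */
                       const uint8_t *dict,       /* 512-byte dictionary   */
                       uint8_t *out, int out_len)
    {
        memset(out, 0, out_len);                  /* prefill with 0s       */

        /* Pass 1: copy each dictionary match into place. */
        for (int i = 0; i < ntokens; i++) {
            uint16_t off = tokens[i].tok >> 3;    /* 9-bit offset          */
            uint8_t  len = (uint8_t)((tokens[i].tok & 7) + 1); /* length
                                        bias of 1 is an assumption         */
            memcpy(out + tokens[i].pos, dict + off, len);
        }

        /* Pass 2: the remaining 0 bytes are filled with the deferred
           literals, in order (this assumes char code 0 never appears
           inside a dictionary match). */
        for (int i = 0; i < out_len; i++)
            if (out[i] == 0)
                out[i] = *literals++;
    }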

    Well, this is my little experiment.

    I have the feeling that there are better strategies, for example for constructing a better dictionary, or for choosing which token from the dictionary to match.

    For the latter, for example: is there a better way than taking matches from the dictionary by length (longest to shortest)?

    I mean, say I find a match of length 8, nice, and then a match of length 6; but maybe if I didn't use that first 8-char match, I could later find two matches of length 7 (improving compression), or maybe some other, better combination? Sorry, I don't know how to explain this better.

    Well, if you survived this wall of text and have any idea I could implement here, I'll try to test it. For example, I've thought of using a genetic evolver to improve the match choosing, and maybe the token packing in the dictionary (I learned about this in the old days of Core War warriors).

    Take into account that this is useless in the general sense and is only a very specific use case. Unpacking needs to be done on an 8-bit machine, the max input text size is 64 KB, plus the other minutiae I've written above.

    Thanks guys!

    Edit: I've added some debugging output if you want to have a look at some numbers (naive6 is the plain 6-bits-per-char baseline): https://pastebin.com/xa3hCQg5
    Last edited by anormal; 18th October 2020 at 00:07.


  #2 · Member · Japan · Joined Dec 2012 · Posts: 169 · Thanks: 31 · Thanked 71 Times in 44 Posts
    Maybe LittleBit solves some problems.


  #3 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts
    Interesting... I missed it while digging through GitHub looking for more info about this idea.
    I wonder how many nice projects there are on GitHub that I just can't find with their search.

    > Is it possible to get good compression ratios while using Huffman encoding? It would require a static Huffman tree and the sum of the size of the tree and the encoded data must be competitive with other compression techniques

    I was thinking the same when I started this idea.

    But, excuse my ignorance: I know what canonical Huffman is, but I found no use for it in my specific problem, so I use a standard algorithm that computes the Huffman codes from a symbol frequency table.
    Isn't it true that Huffman always builds minimal codes from the symbol frequencies, by default?
    I didn't know it was an NP problem, or that some algorithms could produce "better" (smaller total size, I suppose) codes?

    Anyway, thanks, I'll have a look at the sources; let's see if I can produce better Huffman codes.
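    For reference, my understanding is that canonical Huffman doesn't give shorter codes (the plain Huffman code lengths are already optimal for whole-bit codes); its benefit is that the decoder only needs the code lengths, so the "tree" is cheap to store on an 8-bit target. A minimal sketch of assigning canonical codes from the lengths; my own illustration, not LittleBit's code:

    Code:
    #include <stdint.h>

    /* Given code lengths for symbols 0..n-1 (lengths up to 15, as produced
       by any Huffman construction), assign canonical codes.  Only the
       lengths then need to be shipped with the compressed data. */
    void canonical_codes(const uint8_t len[], uint16_t code[], int n)
    {
        uint16_t count[16] = {0}, next[16] = {0};

        /* Count how many codes there are of each length. */
        for (int i = 0; i < n; i++)
            count[len[i]]++;
        count[0] = 0;                 /* length 0 means "no code" */

        /* First code of each length, in the canonical numbering. */
        uint16_t c = 0;
        for (int bits = 1; bits < 16; bits++) {
            c = (uint16_t)((c + count[bits - 1]) << 1);
            next[bits] = c;
        }

        /* Hand out codes in symbol order within each length. */
        for (int i = 0; i < n; i++)
            if (len[i] != 0)
                code[i] = next[len[i]]++;
    }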

  #4 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts
    @xezz you were right, similar ideas, and I learned a lot. I also found there the paper about syllable encoding; they even tried and tested a genetic evolver for building a dictionary. It really surprised me that what I thought was a silly idea when it occurred to me had actually been used and had solved problems.
    Thanks

  #5 · Member · Switzerland · Joined Jun 2015 · Posts: 916 · Thanks: 253 · Thanked 327 Times in 200 Posts
    Quote Originally Posted by anormal:
    I was interested this year in string compression, specifically for very low resource machines, usually 8bit cpus with maybe 64kb of RAM.
    Out of curiosity: Is there a practical use for this or is it for fun with retro computing?

  #6 · Member · USA · Joined Jan 2014 · Posts: 12 · Thanks: 13 · Thanked 3 Times in 2 Posts
    Hmm... an 8-bit machine from the '80s with a total of 64K RAM would have less than 64K available for use.

  #7 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts
    I don't think it is useful outside this special case. Maybe it could be scaled in a similar way, for encoding a predefined set of strings when you need to decompress one string at a time.
    But for modern processors/RAM you could use a much bigger dictionary, longer pointers into the dictionary, etc. As I said, you could check other projects such as FemtoZip or Unishox, and LittleBit as posted above (I tested many, and usually shoco and smaz are the ones suggested, but the ones I mentioned are much more advanced and mature).

    I am going to try a GA to see if there are more interesting strategies for packing the dictionary than the one I am using. If anyone could explain in a few words how the Zstd dictionary-building algorithms work, cover and fastcover (as I see them in the sources), I'd be happy.

    @danlock, yes, you are right: a standard Sinclair Spectrum has 48 KB (including screen RAM), and I think the VIC-20 had 4 KB? There are also many MSX systems with varied RAM sizes, 16, 32, 64, etc. The Spectrum 128 has pageable banks, and so on.

    Edit: I've just remembered I was looking for something similar years ago, while I was building a chess PGN repository and tried to compress each game separately in the DB. I'll try this with PGNs once I've tested more things, but with maybe a 1 MB dictionary? It could be fun to explore.

  #8 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts
    After some fixing and testing, I settled on this format:
    5 bits: number of tokens
    tokens... = 12 bits each: a 9-bit index into the dictionary + a 3-bit match length, encoded as 2 groups of 6 bits
    literals... = 6 bits each

    Each group of 6 bits is finally encoded with a standard Huffman encoder.
    I am happy, as even very small strings of 4-5 chars always compress.
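    To spell the layout out, here is roughly how one parsed string becomes the 6-bit symbol stream that then goes through the Huffman coder. The helper names are made up, and the exact length bias, bit order, and whether the 5-bit count is also Huffman coded are glossed over:

    Code:
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint16_t offset; uint8_t len; } Token;  /* dict match */

    /* Stand-in for a real symbol writer: here it just prints the values
       that would later be fed to the Huffman coder. */
    static void emit(unsigned value, unsigned bits)
    {
        printf("%u-bit value %u\n", bits, value);
    }

    /* Emit one compressed string in the layout above: 5-bit token count,
       then 12-bit tokens as two 6-bit halves, then 6-bit literals.
       The offset is assumed to sit in the high 9 bits of the token. */
    void emit_string(const Token *tok, unsigned ntok,
                     const uint8_t *lit, unsigned nlit)
    {
        emit(ntok, 5);                              /* number of tokens    */
        for (unsigned i = 0; i < ntok; i++) {
            uint16_t t = (uint16_t)((tok[i].offset << 3) | (tok[i].len - 1));
            emit(t >> 6, 6);                        /* high 6 bits         */
            emit(t & 0x3F, 6);                      /* low 6 bits          */
        }
        for (unsigned i = 0; i < nlit; i++)
            emit(lit[i], 6);                        /* 6-bit literal codes */
    }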

    Some stats, sizes are in bits:


    line:775 tokens:5(60) literals:25(150) orig:392 - naive6 : 294 - compressed no huffman:214 - HUFFMAN :202
    line:776 tokens:3(36) literals:5(30) orig:136 - naive6 : 102 - compressed no huffman:70 - HUFFMAN :60
    line:777 tokens:4(48) literals:11(66) orig:232 - naive6 : 174 - compressed no huffman:118 - HUFFMAN :110
    line:778 tokens:13(156) literals:21(126) orig:592 - naive6 : 444 - compressed no huffman:286 - HUFFMAN :301
    line:779 tokens:1(12) literals:11(66) orig:112 - naive6 : 84 - compressed no huffman:82 - HUFFMAN :69
    line:780 tokens:2(24) literals:10(60) orig:136 - naive6 : 102 - compressed no huffman:88 - HUFFMAN :73
    line:781 tokens:2(24) literals:3(18) orig:88 - naive6 : 66 - compressed no huffman:46 - HUFFMAN :46
    line:782 tokens:2(24) literals:9(54) orig:128 - naive6 : 96 - compressed no huffman:82 - HUFFMAN :74
    line:783 tokens:2(24) literals:14(84) orig:216 - naive6 : 162 - compressed no huffman:112 - HUFFMAN :92
    line:784 tokens:6(72) literals:9(54) orig:240 - naive6 : 180 - compressed no huffman:130 - HUFFMAN :120
    line:785 tokens:1(12) literals:5(30) orig:80 - naive6 : 60 - compressed no huffman:46 - HUFFMAN :42
    line:786 tokens:1(12) literals:5(30) orig:80 - naive6 : 60 - compressed no huffman:46 - HUFFMAN :42
    line:787 tokens:5(60) literals:14(84) orig:280 - naive6 : 210 - compressed no huffman:148 - HUFFMAN :139
    line:788 tokens:4(48) literals:6(36) orig:168 - naive6 : 126 - compressed no huffman:88 - HUFFMAN :98
    line:789 tokens:0(0) literals:5(30) orig:40 - naive6 : 30 - compressed no huffman:34 - HUFFMAN :28
    Final compression no huffman: 216246 / 356872 = 0.6059
    Final compression huffman: 185306 / 356872 = 0.5193

    I'll now try to build a better dictionary that provides more matches and therefore better compression. What I am using now is simply inserting and merging n-grams sorted by frequency of use.

  #9 · Member · EU · Joined Jul 2020 · Posts: 7 · Thanks: 0 · Thanked 0 Times in 0 Posts
    Your constraints are similar to what I used for https://github.com/clbr/NTC. I got ~50% compression using a simpler setup, with no dictionaries etc., though your way may be faster.

  #10 · Member · Spain · Joined May 2017 · Posts: 25 · Thanks: 22 · Thanked 4 Times in 4 Posts
    @zeit: great, another project that I didn't find while researching this. I'll have a look.

    I'll write about my experiments with the genetic evolver; maybe this is more interesting than the plain string compression.

    It really surprised me, and it was great fun to experiment with it and tune it; I am still improving it, of course.

    As you know, a GA works by generating children from an original pool; these are bred in whatever way you choose. In my case I mix every dictionary in the original pool with every other one.
    An initial pool of 60 dictionaries will produce 3600 children (yes, I mix them with themselves).
    Each parent gets n ranges of size m from the other parent (n crossover points of size m), positioned randomly. I got this idea from the paper linked from the LittleBit compressor.

    I do some mutations afterwards, overwriting portions with random n-grams, etc.

    Then I apply a fitness function to each dictionary, sort them from best to worst, and keep the best ones from this generation. You repeat while the generations keep improving; I stop after 6 generations without improvement. Tweaking the GA parameters (initial pool size, how you breed, how you mutate, etc.) will give better or faster convergence to an optimal solution, so you need to experiment a lot tuning these parameters. A rough skeleton of the loop is below.
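    In outline it looks like this; a sketch with made-up names, a stand-in fitness, and only the crossover filled in, not my actual code:

    Code:
    #include <stdlib.h>
    #include <string.h>

    #define DICT_BYTES  512
    #define POOL_SIZE   60
    #define CHILDREN    (POOL_SIZE * POOL_SIZE)  /* every dict crossed with every dict */
    #define CROSSES     4                        /* n ranges taken from the other parent */
    #define CROSS_LEN   16                       /* size m of each range */
    #define STALL_LIMIT 6                        /* stop after 6 generations with no gain */

    typedef struct { unsigned char bytes[DICT_BYTES]; long fit; } Dict;

    /* The real fitness is the compression-based one described further
       down; this stub only keeps the sketch compilable. */
    static long fitness(const Dict *d) { (void)d; return rand(); }

    /* Child = copy of a, with CROSSES randomly placed ranges of CROSS_LEN
       bytes overwritten from b (the crossover described above). */
    static void crossover(Dict *child, const Dict *a, const Dict *b)
    {
        *child = *a;
        for (int i = 0; i < CROSSES; i++) {
            int at = rand() % (DICT_BYTES - CROSS_LEN);
            memcpy(child->bytes + at, b->bytes + at, CROSS_LEN);
        }
    }

    static int by_fit_desc(const void *x, const void *y)
    {
        long d = ((const Dict *)y)->fit - ((const Dict *)x)->fit;
        return d > 0 ? 1 : d < 0 ? -1 : 0;
    }

    void evolve(Dict pool[POOL_SIZE])
    {
        static Dict kids[CHILDREN];
        long best = -1;
        int stall = 0;

        while (stall < STALL_LIMIT) {
            int k = 0;
            for (int i = 0; i < POOL_SIZE; i++)          /* breed all pairs, */
                for (int j = 0; j < POOL_SIZE; j++) {    /* including i == j */
                    crossover(&kids[k], &pool[i], &pool[j]);
                    /* mutation (overwriting with random n-grams) goes here */
                    kids[k].fit = fitness(&kids[k]);
                    k++;
                }
            qsort(kids, CHILDREN, sizeof(Dict), by_fit_desc);
            memcpy(pool, kids, POOL_SIZE * sizeof(Dict)); /* keep the best */

            if (pool[0].fit > best) { best = pool[0].fit; stall = 0; }
            else stall++;
        }
    }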

    In a test case of a 40 KB text file with 700 lines, the method I am using now converges around generation 70-80.
    The best generated dictionary improved total compression by around 4 KB compared to my own first dictionary (the construction explained in the first post).

    I could write much more about what I've learned using a GA.
    The first fitness function I used was simply bytes saved (the number of bytes that are not literals).
    This produced improvements, but after compressing, the file was bigger!
    What happened? My tokens are 12 bits.
    Yes, it was improving the bytes saved, but it was not taking into account the number of tokens (and the space they occupy).
    I am now using the fitness function below, which works pretty well. I still have many experiments to do with this function.

    fitness = bits_saved - bits_occupied_by_tokens + numTokens_size7*3 + numTokens_size8*5;

    The last two terms were needed to stop the GA from preferring many smaller tokens over bigger ones (and so producing worse compression).

    I'll publish this on GitHub in the next few months, with more explanation there and an example in 8086 asm.

    My goal now is to try to do better than 49-50% compression. I have to see whether a BWT works here. Or maybe delta-coding the tokens, starting from the first letter of the token; this should produce better Huffman coding at the end.
    I learned a lot with this project, and as it's my first one related directly to compression, I now want to learn everything about it. Thanks!
    Last edited by anormal; 9th November 2020 at 21:12.

