
Thread: String compression for 8bit architectures.



    I got interested this year in string compression, specifically for very low-resource machines: usually 8-bit CPUs with maybe 64 KB of RAM.
    The idea here is: what could I use to save text space?

    There are already modern compressors, such as LZSA and many more, optimized for 8-bit machines, but those all target general-purpose compression of binary data.

    I wanted something that could be called with a string number and an address, and that decompresses just that one string. The standard use case is a text adventure, like the ones developed by Infocom, Level-9, Magnetic Scrolls, etc. By the way, all those systems (adventure interpreters) use some kind of text compression: Level-9, I think, packs 4 characters into 3 bytes, and many others use a simple dictionary token substitution. I've used strings extracted from decompiled adventures on these systems to do some basic testing. I used Delphi/Pascal for testing, but that is irrelevant.

    So the actual string-focused compressors, such as shoco, smaz, the trie version of smaz, etc., are not tailored to this problem.
    I've studied, as far as I can, the sources of other string compressors too, such as FemtoZip and Unishox, and came to the following conclusions:


    - The decompression code needs to be small, but it doesn't have to fit under 100 bytes; of course, smaller is better. Speed is not relevant, AFAIK, in a text adventure.
    - So simple coding is needed; no arithmetic coding, for example. I've read something about the possibility of AC on an 8-bit machine, but it's totally beyond my knowledge.
    - The best I could use is a token dictionary, plus maybe some coding of literals and dictionary pointers. The next thing I tested was adaptive Huffman for each string, which gives nice compression and is easy to use for strings, but it's impractical for this use case.

    My definition of the problem:
    - Alphabet size is 64 characters, so I can use 6 bits per literal, for example.
    - Max total text to compress is 0xFFFF bytes.
    - Strings are capitalized; per-platform decompression code can capitalize or apply other transformations after unpacking.
    - Max 255 strings (could be more without changing the method).
    - Max string size can be more than 255 bytes.
    - Strings are accessed by a number.
    - Dictionary size is 512 bytes.
    - Max dictionary match length is 8 characters (3 bits); after some testing, I found that longer matches are not produced even if I use 4 bits.

    I came to the following idea, which gives me around 60% of the original size (including the dictionary, alphabet definition, and string pointers). I don't claim these are huge savings or that I've invented something useful; I just took some things from here and there and tried to apply them to this specific case.

    This is my first try at coding something related to compression, and I've been testing many things; maybe you have some ideas about it. The final goal is to compress the strings on a modern machine, then output results that can be used with a small routine on machines with a Z80, a 6502, early Intel 8088/8086, etc.

    - All strings are capitalized, concatenated, and cleaned of unknown characters.

    - All n-grams up to size 8 are extracted from the cleaned text and sorted by frequency of use.

    - Dictionary construction is based on what I've seen in other source code. I tried to understand the Zstandard dictionary builders (even the new fastcover), but the code is too complex for me.
    And I think I don't need suffix arrays, because the maximum input size is 64 KB, so I can afford to extract every n-gram. Could anyone correct me if I'm wrong about this?
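    For reference, the extraction step is nothing fancy; since the input is at most 64 KB, brute force over every window is cheap. A rough Python sketch of what I mean (my real code is Delphi/Pascal, so this is just an illustration, and the names are made up):

```python
from collections import Counter

def extract_ngrams(text: str, max_len: int = 8, min_len: int = 3) -> list:
    """Count every n-gram of length min_len..max_len with a sliding
    window over the cleaned, concatenated text."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Most frequent first; ties broken by preferring longer n-grams.
    return sorted(counts.items(), key=lambda kv: (-kv[1], -len(kv[0])))
```

    At 64 KB of input this is at most about 384K window slices, instant on a modern machine, which is why I believe no suffix array is needed.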

    Basically, each n-gram is taken in order from the n-gram list and added to the dictionary, with the usual optimizations: an n-gram is not added if it is a substring of one already in the dictionary, and an n-gram is packed into the dictionary if its head or tail coincides with another one already there, maximizing the overlapping characters. For example, when adding "ESE" to the dictionary and finding the token "SDES", merge both, producing a bigger token, "SDESE" (a tail merge in this case).

    - My idea here (please correct me if I'm wrong) is to pack the maximum number of n-grams into the dictionary while maximizing frequency of use. This works pretty well, usually packing hundreds of n-grams into the 512-byte dictionary. I have the feeling there are better ways to build this small dictionary and improve compression, but I don't have a better idea.
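    To make the merge rule concrete, here is a rough Python sketch of the packing loop (tail merge only, for brevity; the real thing also tries head merges):

```python
def pack_dictionary(ngrams, capacity: int = 512) -> str:
    """Greedy superstring packing: walk the n-grams in frequency order,
    skip any that is already a substring of the dictionary, otherwise
    append it reusing the longest overlap with the dictionary's tail."""
    d = ""
    for gram, _freq in ngrams:
        if gram in d:
            continue  # already covered by an earlier, longer n-gram
        overlap = 0
        # Longest suffix of d that is also a prefix of gram (tail merge).
        for k in range(min(len(d), len(gram) - 1), 0, -1):
            if d.endswith(gram[:k]):
                overlap = k
                break
        if len(d) + len(gram) - overlap > capacity:
            continue  # doesn't fit; a shorter n-gram may still fit later
        d += gram[overlap:]
    return d
```

    With the example from above, packing "SDES" and then "ESE" yields "SDESE".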

    - Compression is pretty simple:

    - Literals use 6 bits (Huffman-coded) and are stored at the end of the compressed string. The unpacking buffer is prefilled with 0s, so after unpacking the dictionary tokens, the remaining 0 bytes are filled with the literals in the order they come.

    - Dictionary pointers are 12 bits: 9 for an offset into the 512-byte dictionary and 3 for the match length. I got this idea from FemtoZip, I think. I use a Huffman coder to encode the 6-bit literal values, and splitting each 12-bit token into two halves produces 6+6, so I can use a single Huffman coder for
    encoding both literals and dictionary token pointers. (FemtoZip uses byte nibbles, but each with its own Huffman table; I can't afford that luxury.)
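    In code, the 6+6 split looks like this (I'm assuming here that the 3-bit field stores length minus 1; the exact bias doesn't matter for the idea):

```python
def token_to_symbols(offset: int, length: int) -> tuple:
    """Pack a dictionary reference (9-bit offset, 3-bit biased length)
    into 12 bits, then split it into two 6-bit halves so literals and
    token halves can share one 64-symbol Huffman coder."""
    assert 0 <= offset < 512 and 1 <= length <= 8
    v = (offset << 3) | (length - 1)    # 12-bit token
    return (v >> 6) & 0x3F, v & 0x3F    # high half, low half

def symbols_to_token(hi: int, lo: int) -> tuple:
    """Inverse: rebuild the 12-bit token from its two 6-bit halves."""
    v = (hi << 6) | lo
    return v >> 3, (v & 0x7) + 1        # offset, length
```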

    - So compression works this way:

    - Dictionary tokens are matched against each string from the maximum possible length down to the minimum, so I first try to find matches of 8 characters, down to 3.

    - When no more tokens match, everything left is literals, which are deferred to the end of the compressed string after all the dictionary tokens.

    - I do this for each string, accumulating statistics over the 6-bit values for my Huffman encoder.

    - Finally, Huffman is used to encode the tokens 6 bits at a time, and then all the literals are encoded and appended.
    - Strings are accessed via a pointer array, so no string lengths are needed, and, as I said, the literals are filled in at the end.
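    The greedy longest-first parse can be sketched like this (again illustrative Python, not my real code; I keep the output position with each token here just to make the literal bookkeeping explicit):

```python
def tokenize(s: str, dictionary: str, max_len: int = 8, min_len: int = 3):
    """Greedy parse: repeatedly claim the longest dictionary match
    (max_len chars down to min_len) over still-unclaimed positions;
    whatever is left over becomes literals, deferred to the end."""
    covered = [False] * len(s)
    tokens = []  # (output position, dictionary offset, match length)
    for n in range(max_len, min_len - 1, -1):
        i = 0
        while i + n <= len(s):
            if not any(covered[i:i + n]):
                off = dictionary.find(s[i:i + n])
                if off >= 0:
                    tokens.append((i, off, n))
                    covered[i:i + n] = [True] * n
                    i += n
                    continue
            i += 1
    literals = [c for c, used in zip(s, covered) if not used]
    return sorted(tokens), literals
```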

    I haven't tested unpacking, but it should work.
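    A toy model of the unpacker, to check the prefill idea: it assumes the decoder knows each token's output position (however that ends up being transmitted in the real bitstream), and uses None instead of a 0 prefill since this is a string-level model:

```python
def detokenize(tokens, literals, out_len: int, dictionary: str) -> str:
    """Model of the unpacker: prefill the output, copy each dictionary
    match to its position, then fill every still-empty slot with the
    deferred literals in the order they come."""
    out = [None] * out_len          # None stands in for the 0 prefill
    for pos, off, n in tokens:      # (output position, dict offset, length)
        out[pos:pos + n] = dictionary[off:off + n]
    lit = iter(literals)
    return "".join(c if c is not None else next(lit) for c in out)
```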

    Well, this is my little experiment.

    I have the feeling that there are better strategies, for example for constructing a better dictionary, or for choosing which token to match in the dictionary.

    For the latter case: is there a better way than taking dictionary matches first by length (longest to shortest)?

    I mean, I find a match of length 8, nice, then a match of length 6; but maybe if I didn't use that first 8-character match, I could later find two matches of length 7 (improving compression), or maybe some other better combination? Sorry, I don't know how to explain this better.

    If you survived this wall of text and have any idea worth implementing here, I'll try to test it. For example, I've thought about using a genetic evolver to improve match choosing, and maybe the token packing in the dictionary (I learned about this
    in the old days of Core War warriors).

    Take into account that this case is useless in the general sense; it is only a very specific use case. Unpacking needs to be done on an 8-bit machine, max input text size is 64 KB, plus the other minutiae I've written.

    Thanks guys!

    Edit: I've added some debugging output in case you want to have a look at some numbers (naive6 is the plain 6-bits-per-char baseline):
    Last edited by anormal; 18th October 2020 at 00:07.

  2. Thanks:

    introspec (19th October 2020)

