Thread: String compression for 8bit architectures.

  #1 (anormal, Spain)

    String compression for 8bit architectures.

    I've been interested this year in string compression, specifically for very low-resource machines, usually 8-bit CPUs with maybe 64 KB of RAM.
    The question is: what could I use to save space on text?

    There are already modern compressors such as LZSA, and many more, optimized for 8-bit machines, but all of those are aimed at general-purpose compression of binary data.

    I wanted something that could be called with a string number and an address, so that this unique string is decompressed. The standard use case is a text adventure, like the ones developed by Infocom, Level 9, Magnetic Scrolls, etc. By the way, all those systems (adventure interpreters) use some kind of text compression: packing 4 chars into 3 bytes is used by Level 9, I think, and many others use a simple dictionary token substitution. I've used strings extracted from decompiled adventures for these systems to do some basic testing. I used Delphi/Pascal for the tests, but this is irrelevant.

    So current string-focused compressors, such as shoco, smaz, the trie version of smaz, etc., are not tailored to this problem.
    I've studied, as far as I can, the sources of other string compressors too, such as FemtoZip and Unishox, and came to the following conclusions:

    - Decompression code needs to be small, but there is no need to pack it under 100 bytes; of course, smaller is better. Speed is not relevant, AFAIK, in a text adventure.
    - So simple coding needs to be used, no arithmetic coding for example. I've read something about the possibility of AC on an 8-bit machine, but it's totally beyond my knowledge.
    - The best I could use is a token dictionary and maybe some coding of literals and pointers into the dictionary. The next thing I tested was adaptive Huffman for each string; this gives nice compression and is easy to use for strings, but it's impractical for this use case.

    My definition of the problem:
    - alphabet size is 64 chars, so I can use 6 bits for literals, for example
    - max total text to compress is 0xFFFF bytes
    - strings are capitalized; per-platform decompression code can restore case or apply other transformations after unpacking
    - max 255 strings (could be more without changing the method)
    - max string size can be more than 255 bytes
    - strings are accessed by a number
    - dictionary size is 512 bytes
    - max dictionary match length is 8 chars (3 bits); after some testing, I found that longer matches are not produced even if I use 4 bits
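
    To make the bit budget concrete, these constraints could be written down as constants on the tooling side; a small illustrative header (the names are made up, not from any existing code):

        /* Illustrative constants for the constraints above (hypothetical names). */
        #define ALPHABET_SIZE   64      /* 6-bit literals                        */
        #define MAX_TOTAL_TEXT  0xFFFF  /* total uncompressed text budget        */
        #define MAX_STRINGS     255     /* strings are addressed by number       */
        #define DICT_SIZE       512     /* shared dictionary, 9-bit offsets      */
        #define MIN_MATCH       3       /* shortest dictionary match considered  */
        #define MAX_MATCH       8       /* longest match, 3-bit length field     */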

    I came up with the following idea, which is giving me around 60% final size (including the dictionary, the alphabet definition, and the string pointers). I do not claim this is a huge saving or that I've invented something useful; I just took things from here and there and tried to apply them to this specific case.


    This is my first try at coding something related to compression, and I've been testing many things; maybe you have some ideas about it. The final idea would be to compress the strings on a modern machine and then output results that can be used with a small function on machines with a Z80, a 6502, an early Intel 8088, 8086, etc.

    - all strings are capitalized, concatenated and cleaned of unknown chars

    - all n-grams up to size 8 are extracted from the cleaned text and sorted by frequency of use

    - dictionary construction is based on what I've seen in other source code; I tried to understand the Zstandard dictionary builders (even the new fastcover), but the code is too complex for me.
    And I think I don't need suffix arrays, because the maximum input size is 64 KB and so I can afford to extract every n-gram? Could anyone correct me if I am wrong about this?

    Basically each n-gram is taken in order from the n-gram list and added to the dictionary, with the usual optimizations: an n-gram is not added if it is a substring of another one already in the dictionary, and an n-gram is packed into the dictionary if its head or tail overlaps another one already there, maximizing the overlapping chars. So, for example, when adding "ESE" to the dictionary and finding the token "SDES" already there, both are merged, producing a bigger token "SDESE" (a tail merge in this case).
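
    A minimal sketch of that head/tail overlap merge, in C for illustration (the function names are made up, not from the actual code):

        #include <stdio.h>
        #include <string.h>

        /* Length of the longest suffix of a that is also a prefix of b
           (the "tail" overlap used when packing n-grams into the dictionary). */
        static size_t overlap(const char *a, const char *b) {
            size_t la = strlen(a), lb = strlen(b);
            size_t k = la < lb ? la : lb;
            for (; k > 0; k--)
                if (memcmp(a + la - k, b, k) == 0)
                    return k;
            return 0;
        }

        /* Merge b onto the tail of a, writing the packed token into out. */
        static void merge_tail(const char *a, const char *b, char *out) {
            size_t k = overlap(a, b);
            strcpy(out, a);
            strcat(out, b + k);              /* append only the non-overlapping part */
        }

        int main(void) {
            char out[32];
            merge_tail("SDES", "ESE", out);  /* overlap "ES" gives "SDESE" */
            printf("%s\n", out);
            return 0;
        }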

    - my idea here, please correct me if I am wrong, is to pack the maximum number of n-grams into the dictionary while maximizing their frequency of use. This works pretty well, usually packing hundreds of n-grams into the 512-byte dictionary. I have the feeling that there are better ways to build this small dictionary and improve compression, but I don't have a better idea.

    - compression is pretty simple:

    - literals use 6 bits (Huffman coded) and are stored at the end of the compressed string; the unpacking space is prefilled with 0s, so after unpacking the dictionary tokens, the 0 bytes are filled with literals in the order they come.

    - dictionary pointers are 12 bits: 9 bits for an offset into the 512-byte dictionary and 3 bits for the match length. I got this idea from FemtoZip, I think. I use a Huffman coder for encoding the 6-bit literal values, and decomposing the 12-bit tokens into 2 halves produces 6+6, so I can use a single Huffman coder for encoding both literals and dictionary token pointers. (FemtoZip uses byte nibbles, but each one uses its own Huffman table; I cannot afford that luxury.)
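
    A small sketch of that 12-bit split, in C; the exact bit layout (high 6 bits of the offset first) and the length bias are assumptions, not necessarily the real format:

        #include <assert.h>
        #include <stdint.h>

        /* Pack a dictionary reference: 9-bit offset (0..511) plus a 3-bit length
           field. The field stores match_len - 1 here so a length of 8 fits in
           3 bits; the real bias may differ (assumption). */
        static uint16_t make_token(unsigned offset, unsigned match_len) {
            return (uint16_t)((offset << 3) | (match_len - 1));
        }

        /* Split the 12-bit token into two 6-bit symbols for the shared Huffman coder. */
        static void split_token(uint16_t tok, uint8_t *hi6, uint8_t *lo6) {
            *hi6 = (tok >> 6) & 0x3F;
            *lo6 = tok & 0x3F;
        }

        /* Rebuild offset and match length on the decoder side. */
        static void join_token(uint8_t hi6, uint8_t lo6, unsigned *offset, unsigned *match_len) {
            uint16_t tok = (uint16_t)((hi6 << 6) | lo6);
            *offset    = tok >> 3;
            *match_len = (tok & 7u) + 1;
        }

        int main(void) {
            uint8_t hi, lo; unsigned off, len;
            split_token(make_token(345, 8), &hi, &lo);
            join_token(hi, lo, &off, &len);
            assert(off == 345 && len == 8);
            return 0;
        }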

    - so compression works this way:

    - dictionary tokens are found for each string from the maximum possible length down to the minimum, so I first try to find matches of 8 chars, going down to 3

    - when no more tokens match, all the rest are literals, which are deferred to the end of the compressed string after all the dictionary tokens.

    - I build this for each string, accumulating stats on the 6-bit values for my Huffman encoder.

    - finally, Huffman coding is used to encode the tokens 6 bits at a time, and then all the literals are encoded and appended.

    - strings are accessed via a pointer array, so no string lengths are needed, and as I said, the literals are filled in at the end.
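
    The literal back-fill step could look roughly like this on the decoder side; this is only a sketch, and it assumes the token pass has already copied the dictionary matches into the output buffer and left 0 bytes wherever literals belong:

        #include <stddef.h>
        #include <stdint.h>

        /* After the dictionary tokens have been expanded into out (prefilled with
           zeros), drop the decoded 6-bit literals into the remaining zero bytes,
           in the order they were stored. Assumes literal codes are nonzero. */
        static void fill_literals(uint8_t *out, size_t out_len,
                                  const uint8_t *literals, size_t lit_count) {
            size_t li = 0;
            for (size_t i = 0; i < out_len && li < lit_count; i++)
                if (out[i] == 0)
                    out[i] = literals[li++];
        }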

    I haven't tested unpacking yet, but it should work.

    Well, this is my little experiment.

    I have the feeling that there are better strategies, for example for constructing a better dictionary, or for choosing which token is matched in the dictionary.

    For that last case, for example: is there a better way than taking dictionary matches by length, longest to shortest?

    I mean, say I find a match of length 8, nice, then a match of length 6; but maybe if I don't use that first 8-char match, I could later find two matches of length 7 (improving compression), or maybe some other better combination? Sorry, I don't know how to explain this better.

    Well, if you survived this wall of text and have any idea to implement here, I'll try to test it. For example, I've thought of using a genetic evolver to improve match choosing, and maybe token packing in the dictionary (I learned about this in the old times of Core War warriors).

    Take into account that this is useless in the general sense and is only a very specific use case: unpacking needs to be done on an 8-bit machine, the max input text size is 64 KB, plus the other minutiae I've written above.

    Thanks guys!

    Edit: I've added some debugging output if you want to have a look at some numbers (naive6 is a plain 6-bits-per-char baseline): https://pastebin.com/xa3hCQg5
    Last edited by anormal; 18th October 2020 at 00:07.

  Thanks: introspec (19th October 2020)

  #2 (xezz, Japan)
    Maybe LittleBit solves some problems.

  Thanks: Kaw (10th November 2020)

  #3 (anormal, Spain)
    Interesting... I missed it while diving into GitHub looking for more info about this idea.
    I wonder how many nice projects there are on GitHub that I cannot find with their search.

    > Is it possible to get good compression ratios while using Huffman encoding? It would require a static Huffman tree and the sum of the size of the tree and the encoded data must be competitive with other compression techniques

    I was thinking the same when I started this idea.

    But, excuse my ignorance: I know what canonical Huffman is, but found no use for it in my specific problem, so I use a standard algorithm that computes the Huffman codes from a symbol frequency table.
    Isn't it true that Huffman always builds minimal codes from the symbol frequencies, by default?
    I didn't know it was an NP problem, or that some algorithms could produce "better" codes (smaller total size, I suppose)?

    Anyway, thanks, I'll have a look at the sources; let's see if I can produce better Huffman codes.

  #4 (anormal, Spain)
    @xezz, you were right: similar ideas, and I learnt a lot. I also found there the paper about syllable encoding; they even tried and tested a genetic evolver for building a dictionary. It really surprised me that what I thought was a silly idea when it occurred to me had actually been used and had solved problems.
    Thanks

  #5 (Member, Switzerland)
    Quote Originally Posted by anormal:
    I was interested this year in string compression, specifically for very low resource machines, usually 8bit cpus with maybe 64kb of RAM.
    Out of curiosity: Is there a practical use for this or is it for fun with retro computing?

  #6 (danlock, USA)
    Hmm... an 8-bit machine from the '80s with a total of 64K RAM would have less than 64K available for use.

  #7 (anormal, Spain)
    I don't think it's useful outside this special case. Maybe it could be scaled in a similar way, for encoding a predefined set of strings when you need to decompress any single string at a time.
    But for modern processors/RAM you could use a much bigger dictionary, longer dictionary pointers, etc. As I said, you could check other projects such as FemtoZip or Unishox, and LittleBit as posted before (I tested many; usually shoco and smaz are suggested, but the ones I mentioned are much more advanced and mature).

    I am going to try a GA to see if there are more interesting strategies for packing the dictionary than the one I am using. If anyone could explain in a few words how the Zstd dictionary-building algorithms, cover and fastcover (as I see them in the sources), work, I'd be happy.

    @danlock, yes, you are right: the standard Sinclair Spectrum has 48 KB (including screen RAM), and I think the VIC-20 had 4 KB? There are also many MSX systems with varied RAM sizes (16, 32, 64, etc.), the Spectrum 128 has pageable banks, and so on.

    Edit: I've just remembered I was looking for something similar years ago, while I was building a chess PGN repository and tried to compress each game separately in the DB. I'll try with PGNs when I've tested more things, but with maybe a 1 MB dictionary? It could be fun to explore this.

  #8 (anormal, Spain)
    After some fixing and testing, I ended up using this format:
    5 bits: number of tokens
    tokens: 12 bits each (9-bit index into the dictionary + 3-bit match length), encoded as 2 groups of 6 bits
    literals: 6 bits each

    Each group of 6 bits is finally encoded with a standard Huffman encoder.
    I am happy, as even very small strings of 4-5 chars are always compressed.
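
    Reading that stream back on the 8-bit target only needs a small bit reader plus the Huffman decoder; a generic sketch in C, assuming MSB-first packing (the actual bit order is an assumption):

        #include <stdint.h>

        typedef struct {
            const uint8_t *buf;   /* compressed string, located via the pointer array */
            uint32_t bitpos;      /* absolute bit position, MSB-first */
        } BitReader;

        /* Read the next n bits (n <= 16), most significant bit first. */
        static unsigned read_bits(BitReader *br, unsigned n) {
            unsigned v = 0;
            while (n--) {
                unsigned byte = br->buf[br->bitpos >> 3];
                unsigned bit  = (byte >> (7 - (br->bitpos & 7))) & 1u;
                v = (v << 1) | bit;
                br->bitpos++;
            }
            return v;
        }

        /* Example: the 5-bit token count at the start of a compressed string. */
        static unsigned read_token_count(BitReader *br) {
            return read_bits(br, 5);
        }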

    Some stats, sizes are in bits:


    line:775 tokens:5(60) literals:25(150) orig:392 - naive6 : 294 - compressed no huffman:214 - HUFFMAN :202
    line:776 tokens:3(36) literals:5(30) orig:136 - naive6 : 102 - compressed no huffman:70 - HUFFMAN :60
    line:777 tokens:4(48) literals:11(66) orig:232 - naive6 : 174 - compressed no huffman:118 - HUFFMAN :110
    line:778 tokens:13(156) literals:21(126) orig:592 - naive6 : 444 - compressed no huffman:286 - HUFFMAN :301
    line:779 tokens:1(12) literals:11(66) orig:112 - naive6 : 84 - compressed no huffman:82 - HUFFMAN :69
    line:780 tokens:2(24) literals:10(60) orig:136 - naive6 : 102 - compressed no huffman:88 - HUFFMAN :73
    line:781 tokens:2(24) literals:3(18) orig:88 - naive6 : 66 - compressed no huffman:46 - HUFFMAN :46
    line:782 tokens:2(24) literals:9(54) orig:128 - naive6 : 96 - compressed no huffman:82 - HUFFMAN :74
    line:783 tokens:2(24) literals:14(84) orig:216 - naive6 : 162 - compressed no huffman:112 - HUFFMAN :92
    line:784 tokens:6(72) literals:9(54) orig:240 - naive6 : 180 - compressed no huffman:130 - HUFFMAN :120
    line:785 tokens:1(12) literals:5(30) orig:80 - naive6 : 60 - compressed no huffman:46 - HUFFMAN :42
    line:786 tokens:1(12) literals:5(30) orig:80 - naive6 : 60 - compressed no huffman:46 - HUFFMAN :42
    line:787 tokens:5(60) literals:14(84) orig:280 - naive6 : 210 - compressed no huffman:148 - HUFFMAN :139
    line:788 tokens:4(48) literals:6(36) orig:168 - naive6 : 126 - compressed no huffman:88 - HUFFMAN :98
    line:789 tokens:0(0) literals:5(30) orig:40 - naive6 : 30 - compressed no huffman:34 - HUFFMAN :28
    Final compression, no Huffman: 216246 / 356872 = 0.6059
    Final compression, Huffman: 185306 / 356872 = 0.5193

    I'll now try to build a better dictionary that provides more matches and thus better compression. What I am using now simply inserts and merges n-grams sorted by frequency of use.

  #9 (zeit, EU)
    Your constraints are similar to what I used for https://github.com/clbr/NTC. I got ~50% compression using a simpler setup, no dictionaries, etc., though your way may be faster.

  #10 (anormal, Spain)
    @zeit: great, another project that I didn't find while researching this. I'll have a look.

    I'll write about my experiments with the genetic evolver; maybe this is more interesting than plain string compression.

    It really surprised me, and it was great fun to experiment with and tune it; I'm still improving it, of course.

    As you know, a GA works by generating children from an original pool; these are bred in whatever way you choose. In my example I mix every dictionary in the original pool with every other one.
    An initial pool of 60 dictionaries will produce 3600 children (yes, I mix them with themselves).
    Each parent gets n ranges of size m from the other parent (n crossover points of size m), positioned randomly. I got this idea from the paper linked in the LittleBit compressor.

    I do some mutations afterwards, overwriting portions with random n-grams, etc.

    Then I apply a fitness function to each dictionary, sort them from best to worst, and keep the best ones from this generation. You repeat while generations keep improving; I stop after 6 generations without improvement. Tweaking the GA parameters (initial pool size, how you breed, how you mutate, etc.) will produce better or faster convergence to an optimal solution, so you need to experiment a lot when tuning these parameters.
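
    A skeleton of that loop, heavily simplified; all names and the toy crossover/mutation/fitness placeholders are made up just to show the structure (all-pairs breeding, mutate, score, sort, keep the best, stop after 6 stale generations):

        #include <stdlib.h>
        #include <string.h>

        #define DICT_NGRAMS 128    /* dictionary as an array of n-gram indexes (assumption) */
        #define POOL        60
        #define STALE_LIMIT 6
        #define NGRAM_COUNT 4096   /* size of the global n-gram table (assumption) */

        typedef struct {
            int    ngram[DICT_NGRAMS];
            double fitness;
        } Dict;

        /* Toy placeholders: the real versions use range crossover, random-n-gram
           mutation and compressed-size-based scoring as described in the post. */
        static void crossover(const Dict *a, const Dict *b, Dict *child) {
            int cut = rand() % DICT_NGRAMS;
            memcpy(child->ngram, a->ngram, cut * sizeof(int));
            memcpy(child->ngram + cut, b->ngram + cut, (DICT_NGRAMS - cut) * sizeof(int));
        }
        static void mutate(Dict *d) {
            d->ngram[rand() % DICT_NGRAMS] = rand() % NGRAM_COUNT;
        }
        static double evaluate(const Dict *d) {
            (void)d;
            return rand() / (double)RAND_MAX;   /* stand-in for "bits saved" scoring */
        }

        static int by_fitness_desc(const void *x, const void *y) {
            const Dict *a = x, *b = y;
            return (a->fitness < b->fitness) - (a->fitness > b->fitness);
        }

        void evolve(Dict pool[POOL]) {
            static Dict children[POOL * POOL];        /* 60 x 60 = 3600 children */
            double best = -1.0;
            int stale = 0;

            while (stale < STALE_LIMIT) {
                int n = 0;
                for (int i = 0; i < POOL; i++)
                    for (int j = 0; j < POOL; j++) {  /* every dict bred with every other (and itself) */
                        crossover(&pool[i], &pool[j], &children[n]);
                        mutate(&children[n]);
                        children[n].fitness = evaluate(&children[n]);
                        n++;
                    }
                qsort(children, n, sizeof(Dict), by_fitness_desc);
                memcpy(pool, children, POOL * sizeof(Dict));   /* survivors of this generation */

                if (pool[0].fitness > best) { best = pool[0].fitness; stale = 0; }
                else                        { stale++; }
            }
        }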

    In a test case of a 40 KB text file with 700 lines, the method I am currently using converges around generations 70-80.
    The best generated dictionary improved total compression by around 4 KB compared to my own first dictionary (explained in the first post).

    I could write much more about what I've learned using a GA.
    The first fitness function I used was simply bytes saved (the number of bytes that are not literals).
    This produced improvements, but after compressing, the file was bigger!
    What happened? My tokens are 12 bits.
    Yes, it was saving more bytes, but it was not taking into account the number of tokens (and the size they occupy).
    Right now I am using the following fitness function, which works pretty nicely. I still have many experiments to do with it.

    fitness = bits_saved - bits_occupied_by_tokens + numTokens_size7*3 + numTokens_size8*5;

    The last two terms were needed to stop the GA from preferring many small tokens over fewer big ones (and so producing worse compression).
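
    In C, that fitness function is just a direct transcription of the formula above (the struct and field names are illustrative):

        /* Per-dictionary statistics gathered during a trial compression pass. */
        typedef struct {
            long bits_saved;               /* bits removed vs. the 6-bit baseline */
            long bits_occupied_by_tokens;  /* 12 bits per emitted token           */
            long num_tokens_size7;         /* matches of length 7                 */
            long num_tokens_size8;         /* matches of length 8                 */
        } DictStats;

        /* Reward long matches so the GA does not trade one big token
           for several small ones. */
        static long fitness(const DictStats *s) {
            return s->bits_saved
                 - s->bits_occupied_by_tokens
                 + s->num_tokens_size7 * 3
                 + s->num_tokens_size8 * 5;
        }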

    I'll publish this on GitHub in the coming months, with more explanations there and an example in 8086 asm.

    My goal now is to try to do better than 49-50% compression. I have to see if a BWT works here. Or maybe delta coding the tokens, starting from the first letter of the token; this should produce better Huffman coding at the end.
    I learned a lot with this project, and as it's my first one related directly to compression, I now want to learn everything about it. Thanks!
    Last edited by anormal; 9th November 2020 at 21:12.

  #11 (anormal, Spain)
    Hi guys, if someone is still interested...
    Note that I don't have a good theoretical base in compression, so I sometimes use terms picked up from other fields.
    If anyone feels I should use other words or terms, I'll happily learn them.

    This has been a fantastic project for me to learn a lot about compression; before this I had only read some books, algorithms, etc. But now I'm starting to understand how the different pieces click together. If you think you have some idea to try, just do it, even if it's not innovative or whatever.

    I simplified and sped up the code a lot, and improved compression (up to 44% on some files now, if I am not wrong...).

    N-gram management time was reduced to practically zero via a simple array with a hash used as the index. I am using n-grams up to length 11. I am going to encode lengths in 3 bits, but the minimal n-gram length I am interested in is 3, so I'll encode lengths starting from 3.
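
    The length bias is just an offset; a trivial sketch, assuming a bias of 3 so the eight 3-bit codes cover lengths 3 to 10 (the exact mapping is an assumption):

        /* Encode a match length >= 3 into the 3-bit field and back. */
        static unsigned encode_len(unsigned match_len) { return match_len - 3; }
        static unsigned decode_len(unsigned code3)     { return code3 + 3; }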

    Dictionaries are generated and encoded for the GA as an array of indexes into the n-gram list, and are crossed and mutated simply as integers. Before, I was using strings.

    The bigger speedup comes from the token matcher against the dictionary. For the LCS I was using a brute-force algorithm which, stupidly, worked better than some search algorithms I tried, such as an FNV hash with buckets or Boyer-Moore search (hmm, I'm now thinking I did not try a bit-parallel search...).

    This led me to learn more about suffix trees, then suffix arrays, and finally to use a suffix automaton, via minimal code.
    I don't know if there are faster ways of finding the LCS; if anyone knows some alternatives to try...
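
    For reference, a compact (unoptimized) C sketch of that approach: build a suffix automaton over the 512-byte dictionary once, then scan a string through it to find the longest substring that also occurs in the dictionary. This is the textbook construction with a plain byte alphabet, only meant to illustrate the idea:

        #include <stdio.h>
        #include <string.h>

        #define DICT_MAX 512
        #define MAXST    (2 * DICT_MAX + 2)   /* a suffix automaton has < 2n states */
        #define ALPHA    256                  /* byte alphabet, for simplicity */

        static struct { int len, link, next[ALPHA]; } st[MAXST];
        static int sz, last;

        static void sa_init(void) {
            sz = 1; last = 0;
            st[0].len = 0; st[0].link = -1;
            memset(st[0].next, -1, sizeof st[0].next);
        }

        /* Standard online construction: append one character to the automaton. */
        static void sa_extend(unsigned char c) {
            int cur = sz++;
            st[cur].len = st[last].len + 1;
            st[cur].link = -1;
            memset(st[cur].next, -1, sizeof st[cur].next);

            int p = last;
            while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; }
            if (p == -1) {
                st[cur].link = 0;
            } else {
                int q = st[p].next[c];
                if (st[p].len + 1 == st[q].len) {
                    st[cur].link = q;
                } else {
                    int clone = sz++;
                    st[clone].len = st[p].len + 1;
                    st[clone].link = st[q].link;
                    memcpy(st[clone].next, st[q].next, sizeof st[q].next);
                    while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; }
                    st[q].link = st[cur].link = clone;
                }
            }
            last = cur;
        }

        /* Longest substring of s that also appears in the dictionary.
           Returns its length; *end is the index in s where that match ends. */
        static int longest_dict_match(const unsigned char *s, int n, int *end) {
            int v = 0, l = 0, best = 0;
            *end = 0;
            for (int i = 0; i < n; i++) {
                unsigned char c = s[i];
                while (v != 0 && st[v].next[c] == -1) { v = st[v].link; l = st[v].len; }
                if (st[v].next[c] != -1) { v = st[v].next[c]; l++; }
                else                     { v = 0; l = 0; }
                if (l > best) { best = l; *end = i; }
            }
            return best;
        }

        int main(void) {
            const char *dict = "SDESETHE AND YOU ARE";   /* toy dictionary */
            const char *s    = "WHERE ARE YOU";
            sa_init();
            for (const char *p = dict; *p; p++) sa_extend((unsigned char)*p);
            int end;
            int len = longest_dict_match((const unsigned char *)s, (int)strlen(s), &end);
            printf("best match: len=%d, ending at %d\n", len, end);
            return 0;
        }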

    I also tried to parallelize these calculations via threads, but the first test was horrible, so obviously something is wrong in my code; maybe in the future...

    The simplified code gave me a faster and easier calculation of the alphabet stats for the final Huffman encoder.

    I can produce better dictionaries in each generation loop simply by adding some heuristics to the random selection of n-grams when I am building the dictionaries: basically substring elimination and other little checks. This makes each generation produce better offspring, so convergence to something optimal is faster.

    After many hand optimizations, I have a reduced set of GA parameters that seems to produce the best results. This is tested against Spanish texts, but I guess other languages should be similar. The parameters can be set by the user, so this is not a problem.

    I tried a BWT on each string, as I see the inverse BWT could easily be done on an 8-bit machine.
    But it didn't work. It surprised me, but compression was around 10% worse.

    I am now testing a nice way to store the Huffman data that makes it easy to use on an 8-bit machine with small code: probably storing it as canonical Huffman and outputting only the bit lengths.
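
    Canonical Huffman is a nice fit for that: the decoder only needs the code length of each symbol to rebuild all the codes. A small generic sketch in C of assigning canonical codes from a length table (the usual DEFLATE-style construction; the limits here are assumptions, not a specific format):

        #include <stdint.h>
        #include <stdio.h>

        #define NSYM    64   /* the 6-bit alphabet of literals / token halves */
        #define MAXBITS 15

        /* Given code lengths (0 = unused symbol), assign canonical Huffman codes:
           shorter codes first, codes of equal length consecutive and in symbol
           order. Only the lengths need to be stored for the decoder. */
        static void canonical_codes(const uint8_t len[NSYM], uint16_t code[NSYM]) {
            int count[MAXBITS + 1] = {0};
            uint16_t next[MAXBITS + 1];
            uint16_t c = 0;

            for (int s = 0; s < NSYM; s++) count[len[s]]++;
            count[0] = 0;
            for (int bits = 1; bits <= MAXBITS; bits++) {
                c = (uint16_t)((c + count[bits - 1]) << 1);
                next[bits] = c;
            }
            for (int s = 0; s < NSYM; s++)
                if (len[s])
                    code[s] = next[len[s]]++;
        }

        int main(void) {
            /* Toy length table: symbol 0 gets 1 bit, symbols 1 and 2 get 2 bits. */
            uint8_t  len[NSYM] = {1, 2, 2};
            uint16_t code[NSYM] = {0};
            canonical_codes(len, code);
            /* Expected: symbol 0 -> 0, symbol 1 -> 2 (binary 10), symbol 2 -> 3 (binary 11). */
            for (int s = 0; s < 3; s++)
                printf("sym %d: len %d code %d\n", s, len[s], code[s]);
            return 0;
        }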

    Thanks for reading!

    Edit:
    The genetic evolver I am using is pretty simple and based on what I've read in many sources.
    BEST_NGRAM_PERCENT: the n-grams are sorted by frequency in an array, and this marks the last n-gram at which the previous frequencies, summed, reach N percent of the total. I found 55% is optimal for the mutation phase.

    The fitness function is the slowest part, as I am calculating the final compressed size (except the Huffman stage).
    Each generation, depending on the GA parameters, produces many dictionaries. Right now I am generating around
    4000 dictionaries per generation, and up to 100 generations.
    So around 99% of the computing time is spent in calls to the LCS function (a suffix automaton).

    I've read in some papers that I could approximate the final compression via the frequencies of the n-grams in the dictionary,
    without needing to compress the text, but I need to re-read them and see if I can use that.

    - Build an initial dictionary; this is simply the most frequent n-grams added up until I reach the needed length.
    - This is mutated n times to produce an initial pool. The pool is sorted by fitness.
    - start of evolver
    -- crossbreed the first x dictionaries among themselves: a random n-gram from one is put into the other
    -- mutate the first y dictionaries: take a random n-gram from the n-gram array, up to BEST_NGRAM_PERCENT,
    and put it into the dictionary
    -- re-sort everything by fitness, so the dictionaries producing the best compression come first
    - loop to the start of the evolver until max generations, or until the last G generations didn't improve

    The current GA parameters that work OK for me are:
    initial pool size = 5000
    x = 40-50; this produces 1600-2500 dictionaries, with 1-2 n-grams changed
    y = 1000, with 1-2 n-grams changed
    All these numbers are related in some way; a larger population doesn't mean better compression,
    and of course it makes things slower. Ideal parameters are those that produce near-optimal compression with
    generations that converge faster.
    It would be nice to have a cluster and simply brute-force these parameters.

    The initial dictionary gives me around 55% compression; after 100 generations (10 minutes of CPU) I've seen up to 45% compression.
    It would be nice to research how this final dictionary differs from the initial one.
    Last edited by anormal; 7th December 2020 at 17:01.

  Thanks: Dresdenboy (8th December 2020)

  #12 (Member, Berlin)
    I'm a big fan of Evolutionary Computing approaches. And I find your work very interesting! With more time at hand (maybe during the holidays) I'll add more comments here.
