Looking for a compression algorithm with simple decompression
Hi,
I'm a neuromorphic hardware designer trying to implement a compressed cache system on a digital chip as part of my PhD, and it was suggested that I ask here for help. During the design, I noticed that the data I'm going to store, despite being composed of 16-bit words, have the following distribution:
minimum number of bits to represent the word | % of words
1 | 20.1%
2 | 12.7%
3 | 22.5%
4 | 29.0%
5 | 14.7%
6 | 0.9%
7 | < 0.1%
8 | < 0.1%
9 | < 0.1%
10 | < 0.1%
11 | < 0.1%
12 | < 0.1%
13 | < 0.1%
14 | < 0.1%
15 | < 0.1%
16 | < 0.1%
The physical structure in which they are going to be stored can be seen as a set of small lists of 9 words each:
L[0][0] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
L[0][1] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
L[0][2] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
...
L[0][7] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
L[1][0] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
L[1][1] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
...
L[N][7] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
At time step 0, the system reads, in parallel, the first value of each of the 8 lists composing L[K], one per list (so L[K][0][0], L[K][1][0], L[K][2][0], L[K][3][0], ...).
At time step 1, the second values of the lists composing L[K] are read (L[K][0][1], L[K][1][1], L[K][2][1], L[K][3][1], ...).
At time step 2, the third values, and so on.
This continues until all 9 values have been read from each list (L[K][0][8], L[K][1][8], L[K][2][8], L[K][3][8], ...); then the system starts reading from another L[K].
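To make the access pattern concrete, here is a small Python sketch (illustration only; the helper name read_block and the nested-list representation of L are mine, not part of the design):

```python
def read_block(L, K):
    # For each time step t = 0..8, the 8 lists composing L[K] are read
    # in parallel, one word per list (column-wise access).
    for t in range(9):
        yield [L[K][i][t] for i in range(8)]
```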
I'm trying to reduce the number of bits composing each 9-word list using some compression technique. The technique can be very slow and complex during the compression phase (it is performed offline) but must be very simple to decompress (to avoid hardware overhead).
Since it is done in HW, there are some extra "unusual" constraints:
-Since we can't change the physical size of the memory cells, the resulting total number of bits required for each list is rounded up to the nearest multiple of 8 (e.g. a compressed string of 17 bits becomes 24 bits long).
-It would be better if the lists were stored independently (without concatenating them into one huge 81-word list), unless doing so gives a really strong advantage (> 10%) in terms of compression.
-We can have shared information for the decoding (e.g. storing the number of bits of the longest word in the 8 lists).
-The decompression can't use multiplications, divisions, or large memories (e.g. no large or dynamic dictionaries); only shifts, adds, subs, ANDs, and ORs, since there are going to be 512 decoders implemented on chip.
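For reference, this is the kind of decoder I consider "simple enough": a Rice-style decode using only shifts, adds, and ANDs/ORs. This is just a Python sketch to show the operations involved; the bit conventions (unary part written as 1s terminated by a 0, MSB-first remainder) and the fixed parameter k are assumptions, not a final design.

```python
def rice_encode(values, k):
    # Offline encoder (can be arbitrarily slow/complex).
    bits = []
    for v in values:
        q, r = v >> k, v & ((1 << k) - 1)
        bits += [1] * q + [0]                                 # unary quotient
        bits += [(r >> i) & 1 for i in reversed(range(k))]    # k-bit remainder
    return bits

def rice_decode(bits, k, n):
    # Decode n values using only shifts, adds, ANDs, ORs.
    it = iter(bits)
    out = []
    for _ in range(n):
        q = 0
        while next(it) == 1:          # count the unary part
            q += 1
        r = 0
        for _ in range(k):            # read the k remainder bits
            r = (r << 1) | next(it)
        out.append((q << k) | r)      # value = q*2^k + r, via a shift
    return out
```

For example, rice_decode(rice_encode([5, 0, 12], k=2), k=2, n=3) returns [5, 0, 12].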
I have already tried several techniques, with the following resulting bits/word:
-Exp-Golomb coding: 7.76
-Fibonacci coding: 6.73
-Rice coding: 5.57
-ANS (2-bit symbols)**: 11.35 (a lot higher than I was expecting; I have checked the implementation and it seems fine...)
-ANS (1-bit symbols)**: 10.86
-Shared size*: 5.24
** Probabilities rounded to the closest power of 2 to simplify decoding in HW.
* The data are rectified (2*x if positive, -2*x-1 if negative, and a reference value X is subtracted from all of them). For each group of entries L[K][0][Y], L[K][1][Y], L[K][2][Y], L[K][3][Y], ..., the number of bits required by the longest word of the group is stored within the data.
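For completeness, here is a Python sketch of how I read the "shared size" scheme above. My interpretation (which may differ in details): subtract the reference X, apply the 2*x / -2*x-1 mapping, then store a per-group width field followed by every word of the group in that width. The 4-bit width field and the group size of 8 are assumptions.

```python
def zigzag(x):
    return (x << 1) if x >= 0 else ((-x << 1) - 1)   # 2*x / -2*x-1 mapping

def pack_group(words, ref):
    # Offline compression of one group L[K][0][Y], L[K][1][Y], ...
    vals = [zigzag(w - ref) for w in words]
    width = max(v.bit_length() for v in vals)
    # 4-bit width field (assumption: width <= 15; widen to 5 bits otherwise)
    bits = [(width >> i) & 1 for i in reversed(range(4))]
    for v in vals:
        bits += [(v >> i) & 1 for i in reversed(range(width))]   # fixed-width words
    return bits

def unpack_group(bits, ref, group_size=8):
    # Decode side: only shifts, adds, subs, ANDs, ORs.
    it = iter(bits)
    width = 0
    for _ in range(4):
        width = (width << 1) | next(it)
    out = []
    for _ in range(group_size):
        v = 0
        for _ in range(width):
            v = (v << 1) | next(it)
        x = (v >> 1) if (v & 1) == 0 else -((v + 1) >> 1)        # undo zigzag
        out.append(x + ref)
    return out
```

With this layout, one group costs 4 + 8*width bits before the multiple-of-8 rounding.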
Before proceeding with the implementation of the current best solution, I would really appreciate any suggestions for algorithms or any other useful information/papers/etc.
Thank you!