A compression algorithm is always designed for a specific form of redundancy. To give good advice, we would need to know what kind of redundancy your data has; it may even contain several forms at once. Knowing them would let us think about which algorithms can remove that redundancy with parallel processing.
For example, if your data is redundant in the sense that certain values occur more often than others, and every value fits in 1 byte, then you could use a variable-length encoding. You could look up each byte in parallel and replace it with a code of variable bit length. Something like this would be suitable if low values occur very often and high values almost never:
Code:
0000 0000 -> 0
0000 0001 -> 10
0000 0010 -> 110
0000 0011 -> 1110
...
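To make the idea concrete, here is a minimal sketch of that table-based encoding in Python. The function names are my own, and for readability the codes are built as bit strings rather than packed bits; a real implementation would pack them. Note that each byte is looked up independently, which is what makes the scheme easy to parallelize.

```python
# Sketch of the lookup-based variable-length encoding from the table
# above (a unary-style code). Names and the string representation are
# illustrative assumptions, not a finished implementation.

def build_table(num_values=256):
    # Byte value n maps to n ones followed by a terminating zero:
    # 0 -> "0", 1 -> "10", 2 -> "110", 3 -> "1110", ...
    return ["1" * n + "0" for n in range(num_values)]

def encode(data, table=None):
    # Every byte is looked up independently of its neighbours, so the
    # input can be split into chunks, encoded in parallel (one chunk
    # per thread), and the results concatenated afterwards.
    if table is None:
        table = build_table()
    return "".join(table[b] for b in data)

print(encode(bytes([0, 1, 2, 3])))  # -> 0101101110
```

As the table suggests, value 0 costs only 1 bit while value 3 already costs 4 bits, so this only compresses when the value distribution is heavily skewed toward low values.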