I'm trying to find a way to compress (losslessly) individual short strings. Specifically, the strings are URLs, they're typically 20-50 characters in size, mostly ASCII, but UTF-8 sequences may also occur.
Since the strings have to be compressed and stored separately and independently from one another, the usual compression methods can't achieve much. I think the compressor should have a built-in, pre-computed dictionary or context. I do have a reasonably large list of real URLs to train the compressor on (920,000 URLs, 60 MB).
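To make the dictionary idea concrete, here is roughly what I have been experimenting with, using zstd's dictionary builder (just a sketch; zstd itself, the 100 KB dictionary size, and compression level 19 are my guesses, not requirements):

```cpp
#include <string>
#include <vector>
#include <zdict.h>   // ZDICT_trainFromBuffer  (link with -lzstd)
#include <zstd.h>    // ZSTD_compress_usingDict

// Train a dictionary from the URL corpus. zdict wants all samples
// concatenated into one buffer plus an array of their individual sizes.
std::string train_dictionary(const std::vector<std::string>& urls,
                             size_t dict_capacity = 100 * 1024) {
    std::string samples;
    std::vector<size_t> sizes;
    for (const auto& u : urls) {
        samples += u;
        sizes.push_back(u.size());
    }
    std::string dict(dict_capacity, '\0');
    size_t n = ZDICT_trainFromBuffer(&dict[0], dict.size(),
                                     samples.data(), sizes.data(),
                                     static_cast<unsigned>(sizes.size()));
    if (ZDICT_isError(n)) return {};  // training failed
    dict.resize(n);
    return dict;
}

// Compress one URL against the trained dictionary.
std::string compress_url(const std::string& url, const std::string& dict,
                         ZSTD_CCtx* cctx) {
    std::string out(ZSTD_compressBound(url.size()), '\0');
    size_t n = ZSTD_compress_usingDict(cctx, &out[0], out.size(),
                                       url.data(), url.size(),
                                       dict.data(), dict.size(),
                                       /*compressionLevel=*/19);
    if (ZSTD_isError(n)) return {};
    out.resize(n);
    return out;
}
```

One thing I noticed with this approach: each zstd frame carries a few bytes of header, which is not negligible when the inputs are only 20-50 bytes to begin with; hence my interest in dropping headers (see the note further down).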
Since the strings are short, I don't think performance is a big consideration. Keeping RAM usage moderate would be good, though (100 MB is better than 1000 MB; I'm not talking about kilobytes, of course).
My question is twofold:
1. What compression algorithms would work best for this scenario?
2. Are there any existing C++ libraries or pieces of source code that I could use to implement this?
Of note: I don't need to decompress the strings. My idea is to use the compressed representation of a string as a variable-length, collision-free hash/ID in a database. So the compression function needs to be injective (no two inputs may produce the same output), which any deterministic lossless compressor already guarantees; it's OK to drop headers that are only needed for decompression, as long as dropping them doesn't introduce collisions.
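For illustration, here is the kind of header-free output I mean, using zlib's raw deflate with a preset dictionary (again just a sketch; zlib is an assumption, and `dict` would be a dictionary built from my URL list). With negative windowBits, zlib emits no header, no dictionary ID, and no Adler-32 trailer, so the output is just the compressed bits:

```cpp
#include <string>
#include <zlib.h>

std::string raw_deflate(const std::string& url, const std::string& dict) {
    z_stream zs{};
    // windowBits = -15 selects raw deflate: no zlib header or checksum.
    if (deflateInit2(&zs, Z_BEST_COMPRESSION, Z_DEFLATED,
                     /*windowBits=*/-15, /*memLevel=*/8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return {};
    // For raw deflate the dictionary is never written to the stream.
    deflateSetDictionary(&zs, reinterpret_cast<const Bytef*>(dict.data()),
                         static_cast<uInt>(dict.size()));
    std::string out(deflateBound(&zs, url.size()), '\0');
    zs.next_in  = reinterpret_cast<Bytef*>(const_cast<char*>(url.data()));
    zs.avail_in = static_cast<uInt>(url.size());
    zs.next_out  = reinterpret_cast<Bytef*>(&out[0]);
    zs.avail_out = static_cast<uInt>(out.size());
    int rc = deflate(&zs, Z_FINISH);  // one-shot: all input, finish stream
    deflateEnd(&zs);
    if (rc != Z_STREAM_END) return {};
    out.resize(out.size() - zs.avail_out);
    return out;
}
```

Since raw deflate with a fixed dictionary is deterministic and lossless, two different URLs can never produce the same output bytes, which is exactly the collision-free property I need, even though I never plan to run the inverse.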
Any advice is welcome. Thanks in advance!