
Thread: Achieving decent compression of individual short strings?

  1. #1 VioletGiraffe (Member)

    Achieving decent compression of individual short strings?

    I'm trying to find a way to compress (losslessly) individual short strings. Specifically, the strings are URLs; they're typically 20-50 characters long, mostly ASCII, but UTF-8 sequences may also occur.

    Since the strings have to be compressed and stored separately and independently from one another, the usual compression methods can't do much. I think the compressor should have a built-in pre-computed dictionary or context. I do have a reasonably large list of real URLs to train the compressor on (920 000 URLs, 60 megabytes).

    Since the strings are short, I don't think performance is a big consideration. Keeping RAM usage moderate would be good, though (100 MB is better than 1000 MB; I'm not talking about kilobytes, of course).
    My question is two-fold:
    1. What compression algorithms would work best for this scenario?
    2. Are there any existing C++ libraries or pieces of source code that I could use to implement this?

    Of note: I don't need to decompress the strings. My idea is to use the compressed representation of a string as a variable-length collision-free hash/ID in a database. So I do need the compression function to be bijective (each input corresponds to exactly one output, and vice versa), but it's OK to drop any headers required for decompression, if the algorithm requires them.
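
    To illustrate, here's a minimal sketch of the "compressed bytes as key" idea, using zlib's preset-dictionary API purely as a stand-in coder (any deterministic lossless coder gives an injective mapping, since two different inputs can never decode from the same output):
    Code:
    // Sketch only: zlib stands in for whatever coder ends up being chosen.
    // `dict` would be built from the URL corpus; decompression is never needed,
    // the compressed bytes themselves serve as the database key.
    #include <zlib.h>
    #include <stdexcept>
    #include <string>

    std::string compress_as_key(const std::string& url, const std::string& dict) {
        z_stream s{};
        if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
            throw std::runtime_error("deflateInit failed");
        // Preset dictionary: must be set before the first deflate() call.
        deflateSetDictionary(&s, reinterpret_cast<const Bytef*>(dict.data()),
                             static_cast<uInt>(dict.size()));
        std::string out(deflateBound(&s, static_cast<uLong>(url.size())), '\0');
        s.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(url.data()));
        s.avail_in  = static_cast<uInt>(url.size());
        s.next_out  = reinterpret_cast<Bytef*>(&out[0]);
        s.avail_out = static_cast<uInt>(out.size());
        deflate(&s, Z_FINISH);    // single shot: the whole URL fits in one call
        out.resize(s.total_out);  // keep only the bytes actually written
        deflateEnd(&s);
        return out;               // variable-length, collision-free by construction
    }
    For strings this short, raw deflate (deflateInit2 with negative windowBits) would also avoid the 6 bytes of zlib header and checksum.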

    Any advice is welcome, thanks in advance.

  2. #2 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/pmd_dict_v0.rar? (I'd find source if you want it)
    https://github.com/antirez/smaz
    https://github.com/jstepien/skrot

    In fact, you can use almost any compression algorithm for this, especially CM ones.
    Just train it on a dictionary, then enable entropy coding only when compressing your string.
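
    For smaz, a minimal usage sketch, assuming the two-function API declared in smaz.h (its built-in codebook targets English text, so for URLs the table would need regenerating from your corpus):
    Code:
    // Compress one short string with smaz; smaz_compress returns the
    // compressed length (or outlen+1 if the output buffer was too small).
    #include <cstdio>
    extern "C" {
    #include "smaz.h"
    }

    int main() {
        char in[] = "https://encode.su/threads/3025";
        char out[256];
        int clen = smaz_compress(in, (int)(sizeof(in) - 1), out, (int)sizeof(out));
        std::printf("%d -> %d bytes\n", (int)(sizeof(in) - 1), clen);
        return 0;
    }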

  3. #3 VioletGiraffe (Member)
    Quote Originally Posted by Shelwien View Post
    In fact, you can use almost any compression algorithm for this, especially CM ones. Just train it on a dictionary, then enable entropy coding only when compressing your string.
    Thanks a lot! I'm slightly ashamed that I didn't find either of the two libraries on GitHub myself.
    I'll look into the two public libraries right away, but first a couple of questions about pmd.exe, which seems to be the closest to what I want out of the box. I assume you're the author of the tool? Is there a way to "train" an optimal dictionary from a large set of strings?

  4. #4 Shelwien (Administrator)
    > Is there a way to "train" an optimal dictionary from a large set of strings?

    Dictionary optimization is mostly necessary for LZ, where unused strings still affect match distances.
    For PPM you can just put all the relevant data into the dictionary file and train the model.
    Btw, it's not really necessary to reprocess the dictionary to encode every string - you can copy the model instead.

    As for dictionary generation, you can try the one in zstd.
    Or maybe this: http://nishi.dreamhosters.com/u/defl_matches_v0.rar
    The idea there is to rely on deflate coders with optimal parsing (e.g. 7zip, kzip).
    The dec2unp utility there generates a text log of the literals and matches in a deflate stream,
    so next we can sort the match strings, count them, and keep the more popular ones for the dictionary.
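
    A sketch of that count-and-keep step (the one-match-per-line log format is an assumption; the actual dec2unp output may differ):
    Code:
    // Count match strings from a text log and emit the most valuable ones.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::ifstream log(argv[1]);
        std::unordered_map<std::string, size_t> freq;
        std::string line;
        while (std::getline(log, line))
            ++freq[line];                      // one match string per line
        std::vector<std::pair<std::string, size_t>> v(freq.begin(), freq.end());
        // Rank by total bytes covered, not raw count: a long match that
        // occurs often is worth more dictionary space than a short one.
        std::sort(v.begin(), v.end(), [](const auto& a, const auto& b) {
            return a.second * a.first.size() > b.second * b.first.size();
        });
        for (size_t i = 0; i < v.size() && i < 4096; ++i)
            std::cout << v[i].first << '\n';   // concatenate into the dictionary
    }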

    Source... maybe this? http://nishi.dreamhosters.com/u/pmd_dict_v0s.rar

  5. #5 Jyrki Alakuijala (Member)
    Quote Originally Posted by VioletGiraffe View Post
    Any advice is welcome, thanks in advance.
    Try zstd's dictionary generator, but brotli's dictionary compression. On the shortest data, brotli tends to be far denser than other good compressors such as lzma and zstd.

    If you want to go all the way, you'd probably want something like the dictionary transforms in brotli's built-in dictionary. They increase the utility of the dictionary significantly. Shared brotli includes this in the spec, but no dictionary generator yet supports finding suitable transforms.
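
    A sketch of the zstd half of that suggestion, using the regular zdict/zstd API (dictionary capacity and compression level here are arbitrary picks):
    Code:
    // Train a dictionary from the URL corpus, then compress single URLs with it.
    #include <zdict.h>
    #include <zstd.h>
    #include <string>
    #include <vector>

    std::string train_dict(const std::string& corpus,              // samples, concatenated
                           const std::vector<size_t>& sampleSizes, // one size per URL
                           size_t dictCapacity = 16 * 1024) {
        std::string dict(dictCapacity, '\0');
        size_t n = ZDICT_trainFromBuffer(&dict[0], dict.size(),
                                         corpus.data(), sampleSizes.data(),
                                         static_cast<unsigned>(sampleSizes.size()));
        if (ZDICT_isError(n)) return {};
        dict.resize(n);
        return dict;
    }

    std::string compress_url(const std::string& url, const std::string& dict) {
        ZSTD_CCtx* c = ZSTD_createCCtx();
        std::string out(ZSTD_compressBound(url.size()), '\0');
        size_t n = ZSTD_compress_usingDict(c, &out[0], out.size(),
                                           url.data(), url.size(),
                                           dict.data(), dict.size(), 19);
        ZSTD_freeCCtx(c);
        out.resize(ZSTD_isError(n) ? 0 : n);
        return out;
    }
    Each zstd frame also carries magic and header bytes, which hurts at 20-50 byte inputs; the advanced API can at least drop the magic number (ZSTD_c_format = ZSTD_f_zstd1_magicless) and the dictionary ID (ZSTD_c_dictIDFlag = 0).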

  6. #6 VioletGiraffe (Member)
    Thanks for the other pointers; I'm studying your suggestions.

    Quote Originally Posted by Shelwien View Post
    For PPM you can just put all the relevant data into the dictionary file, and train the model.
    I noticed that pmd_dict_v0 produced an output that was 1 byte smaller after I removed a lot of strings from the dictionary. Is that supposed to happen?
    Does pmd rebuild the model on every run?

    Also, what's the difference between pmd_v0 and v0s? The latter produced a MUCH smaller output for the same input and dictionary (5 bytes vs. 17 bytes for an 83-byte source string).

  7. #7 VioletGiraffe (Member)
    Quote Originally Posted by Jyrki Alakuijala View Post
    Try zstd's dictionary generator but brotli's dictionary compression.
    Interesting, thanks! I have to admit that, while I've implemented LZ78 / LZW, I could never understand how LZ77 works. Maybe now would be a good time!

  8. #8 Shelwien (Administrator)
    > I noticed that pmd_dict_v0 produced an output that was 1 byte smaller after
    > I removed a lot of strings from the dictionary. Is that supposed to happen?

    Of course the dictionary affects compression even with PPM.
    Strings in the dictionary only improve compression when they provide relevant statistics; otherwise the model allocates higher probabilities to dictionary strings that end up unused, which adds redundancy.

    There's no simple way to optimize the dictionary,
    but if it's really necessary, you can invent a metric function
    (e.g. the sum of compressed sizes of some sample strings),
    and then run iterations, disabling/enabling strings in the dictionary,
    for as long as the metric improves.
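
    A minimal sketch of that greedy loop; metric() is the invented part (e.g. rebuild the dictionary from the enabled strings and sum the compressed sizes of a held-out URL sample):
    Code:
    #include <string>
    #include <vector>

    // Assumed, provided elsewhere: rebuilds the dictionary from the enabled
    // strings and returns the summed compressed size of the test set.
    size_t metric(const std::vector<std::string>& dict,
                  const std::vector<bool>& enabled);

    std::vector<bool> optimize(const std::vector<std::string>& dict) {
        std::vector<bool> on(dict.size(), true);
        size_t best = metric(dict, on);
        bool improved = true;
        while (improved) {                        // repeat until a full pass
            improved = false;                     // makes no flip
            for (size_t i = 0; i < dict.size(); ++i) {
                on[i] = !on[i];                   // try toggling one string
                size_t m = metric(dict, on);
                if (m < best) { best = m; improved = true; }
                else          { on[i] = !on[i]; } // revert, no gain
            }
        }
        return on;
    }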

    > Does pmd rebuild the model on every run?

    Currently yes, but the PPM model is a single object, so you
    can back it up after training, then restore it for every word.
    The model size in MBs is also supplied in the args.
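
    A sketch of that backup/restore pattern, assuming the trained model occupies one contiguous arena at a fixed address (restoring in place keeps any internal pointers valid):
    Code:
    #include <cstring>
    #include <vector>

    unsigned char* g_arena;      // hypothetical: the model's memory, allocated once
    size_t         g_arena_size; // model size, as passed in the args

    std::vector<unsigned char> g_snapshot;

    void backup_model()  { g_snapshot.assign(g_arena, g_arena + g_arena_size); }
    void restore_model() { std::memcpy(g_arena, g_snapshot.data(), g_arena_size); }

    // Usage: train on the dictionary once, then backup_model();
    // per string: restore_model(), encode the string, repeat.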

    Alternatively, it's possible to log the model changes during string encoding and then undo them, or maybe just disable the changes, but I didn't implement this.
    It was just a quick hack using https://github.com/Shelwien/ppmd_sh

    > What's the difference between pmd_v0 and v0s?

    Hard to say, really... I just took the latest source and
    replaced the 128-bit rangecoder with an older one, since rc128 didn't
    work with gcc (I normally use IntelC to compile).

    > The latter produced a MUCH smaller output for the same input and dictionary
    > (5 bytes vs. 17 bytes for an 83-byte source string).

    Need to check whether it decodes correctly.
    I had some optimizations there (like a limited charset)
    which could affect the results.
    Or it could just be the rc... it's been years since I made this,
    so I don't remember the details.
    It could also be the "order" parameter: it's 6 in v0s, and I think it was 12 before.

  9. #9 VioletGiraffe (Member)
    Quote Originally Posted by Shelwien View Post
    It was just a quick hack using https://github.com/Shelwien/ppmd_sh
    You don't provide a license for that repo, is that on purpose? If there were a permissive license like MIT or Apache, I could just steal the code shamelessly. With attribution, of course.

    Quote Originally Posted by Shelwien View Post
    Need to check if it decodes correctly.
    Oh yes, good point. Of course it doesn't decompress properly :/

  10. #10 Shelwien (Administrator)
    > You don't provide a license for that repo, is that on purpose?

    I just don't really use github; it was a test.
    It's public domain anyway, and so is the original source (http://compression.ru/ds/ppmdj1.rar).

    > Oh yes, good point. Of course it doesn't decompress properly :/

    It's most likely gcc. What about the attached exe (built with VC)?
    Attached Files

  11. #11 VioletGiraffe (Member)
    Quote Originally Posted by Shelwien View Post
    I just don't really use github; it was a test.
    It's public domain anyway, and so is the original source (http://compression.ru/ds/ppmdj1.rar).
    "Public domain" means free to look at, not free to take and reuse, unless that's explicitly allowed.
    Thanks for the source; I'm studying the references from readme.txt.

    Quote Originally Posted by Shelwien View Post
    It's most likely gcc. What about the attached exe (built with VC)?
    Nope, same thing. 83 bytes compress to 5 bytes, and it doesn't decompress properly. The version that works compresses the same string to 17 bytes.

  12. #12 Shelwien (Administrator)
    https://wiki.creativecommons.org/wiki/Public_domain

    Well, I'll look into it later.
    It'd be helpful if you could provide the samples where it fails.

  13. #13 VioletGiraffe (Member)
    Hm, you're right. But I don't think everything public on Github automatically becomes public domain. I'll check later; maybe I'm wrong.

    Quote Originally Posted by Shelwien View Post
    It'd be helpful if you could provide the samples where it fails.
    Literally every URL I tried. This topic's address will do: "https://encode.su/threads/3025-Achieving-decent-compression-of-individual-short-strings"

  14. #14 Shelwien (Administrator)
    > But I don't think everything public on Github automatically becomes public domain.

    This source is not on github in any case.
    The one on github is a regular compressor, faster and without dictionary support.

    Ok, how about this: http://nishi.dreamhosters.com/u/pmd_dict_v0s.rar (I updated the same file)

    1. I tried a few different rc versions there, and seemingly found one that works.
    2. There _was_ actually a limited charset used there; see ppmd_flush.inc:11.
    I modified it so that it works with the full charset (all 256 byte values), but that hurts compression, as expected.

  15. #15 SolidComp (Member)
    Quote Originally Posted by VioletGiraffe View Post
    I'm trying to find a way to compress (losslessly) individual short strings. [...] Any advice is welcome, thanks in advance.
    Zstd's latest release is focused on optimizing compression of small data: https://github.com/facebook/zstd/releases/tag/v1.3.6

    You might be able to construct a good dictionary just by thinking it through. For example, you'd include http:// and https://, as well as .com and whatever other TLDs appear in your data. If a lot of your URLs end in a file extension, as image URLs do, then .jpg, .png, .jpeg, etc. belong in your dictionary. Common paths/subfolders are good candidates too. You'll probably want a way to encode forward slashes in much less than one byte, and possibly ?= as well.
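
    A hand-built seed dictionary along those lines might look like this (the substrings are guesses; both zlib and zstd treat the end of the dictionary as the nearest context, so the most frequent pieces go last):
    Code:
    #include <string>

    // Adjacent literals concatenate; matches may start anywhere inside.
    const std::string kUrlSeedDict =
        ".jpeg" ".png" ".html" "?id=" "/index" "/images/"
        ".org/" ".net/" "www." ".jpg" ".com/" "http://" "https://";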

    If you have any control over it, standardizing or normalizing your URLs can go a long way. It's always better to make data more compressible from the outset.

  16. #16 JamesWasil (Member)
    Depending on what the URLs point to, you may be able to use a service such as http://www.shorturl.com or http://goo.gl to preprocess them and improve compression further. By shortening the URLs to an alias before you pick a compression method, you save the space they consume just by leaning on the cloud database of those services.

    Before:
    e.g. http://www.yourdomain.com/directory/page1.php

    After:
    http://alturl.com/ctdzd

    If your database uses external web links, this can work for you, but you'd likely want to automate the process of submitting the links, since there are 920,000 of them or more.

    Google isn't going to disappear anytime soon, and ShortURL has been around since 1999. I'd keep a cache of the original URLs somewhere to be on the safe side, but for the main database, using the services can only save you space, if you're able to. Then you can compress the shorter URLs you get back to save even more.

    Beyond that, if you know that the output URL will always start with http://alturl.com/, a custom compressor can be made to represent those 18 bytes with as little as 1 bit each time.

    The compressor would then only have to compress ctdzd and the other pointers to your URLs, taking the 45-byte sample URL above down to 41 bits (1-bit header for the site prefix + 5 bytes for the actual site reference) before you even start to compress it normally.

    (You could compress it even more, to 5 bytes or less, by converting the pointer's symbols to 7-bit patterns instead of 8: 5*7 bits = 35 bits to represent ctdzd, plus the 1-bit header = 36 bits, which fits in a package of 5 bytes = 40 bits. You'd want to add an end symbol after the pointer that can't appear in a URL to mark where it stops, but even then you can keep it under 6 bytes total for this example, and several times smaller than the originals, if you decide to do it this way.)
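
    A sketch of that packing, assuming the alias suffix is plain ASCII (which short-URL aliases are):
    Code:
    // 1 flag bit for the known "http://alturl.com/" prefix, then 7 bits
    // per remaining character; pack("ctdzd", true) -> 36 bits -> 5 bytes,
    // matching the arithmetic above.
    #include <cstdint>
    #include <string>
    #include <vector>

    std::vector<uint8_t> pack(const std::string& suffix, bool hasPrefix) {
        std::vector<uint8_t> out;
        uint32_t acc = hasPrefix ? 1 : 0;       // 1-bit header
        int bits = 1;
        for (unsigned char c : suffix) {        // append 7 bits per character
            acc = (acc << 7) | (c & 0x7F);
            bits += 7;
            while (bits >= 8) {                 // flush completed bytes
                out.push_back(uint8_t(acc >> (bits - 8)));
                bits -= 8;
            }
        }
        if (bits)                               // zero-pad the final byte
            out.push_back(uint8_t(acc << (8 - bits)));
        return out;
    }
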
    Last edited by JamesWasil; 14th October 2018 at 06:06.

  17. #17 data man (Member)
    https://github.com/siara-cc/Unishox - Guaranteed Compression for Unicode Short Strings.
    Unishox is a hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes to each letter of its character set (entropy coding). It also encodes repeating letter sets separately (dictionary coding). For Unicode characters, delta coding is used. More information is available in this article.
    Code:
    en:English 29/58=50.00% Beauty is not in the face. Beauty is a light in the heart.
    ar:Arabic: 46/91=49.45% الجمال ليس في الوجه. الجمال هو النور الذي في القلب.
    bn:Bengali: 42/117=64.10% সৌন্দর্য মুখে নেই। সৌন্দর্য হৃদয় একটি আলো।
    de:German: 35/68=48.53% Schönheit ist nicht im Gesicht. Schönheit ist ein Licht im Herzen.
    es:Spanish: 38/69=44.93% La belleza no está en la cara. La belleza es una luz en el corazón.
    fa:Persian: 46/76=39.47% زیبایی در چهره نیست. زیبایی نور در قلب است.
    fr:French: 39/76=48.68% La beauté est pas dans le visage. La beauté est la lumière dans le coeur.
    hi:Hindi: 54/144=62.50% सुंदरता चेहरे में नहीं है। सौंदर्य हृदय में प्रकाश है।
    it:Italian: 35/63=44.44% La bellezza non è in faccia. La bellezza è la luce nel cuore.
    ja:Japanese: 37/60=38.33% 美は顔にありません。美は心の中の光です。
    jv:Javanese: 40/63=36.51% Beauty ora ing pasuryan. Kaendahan iku cahya ing sajroning ati.
    ko:Korean: 43/82=47.56% 아름다움은 얼굴에 없습니다。아름다움은 마음의 빛입니다。
    ms:Malay: 38/65=41.54% Kecantikan bukan di muka. Kecantikan adalah cahaya di dalam hati.
    my:Myanmar: 72/205=64.88% အလှအပမျက်နှာပေါ်မှာမဟုတ်ပါဘူး။ အလှအပစိတ်နှလုံးထဲမှာအလင်းကိုဖြစ်ပါတယ်။
    pa:Punjabi: 51/141=63.83% ਸੁੰਦਰਤਾ ਚਿਹਰੇ ਵਿੱਚ ਨਹੀਂ ਹੈ. ਸੁੰਦਰਤਾ ਦੇ ਦਿਲ ਵਿਚ ਚਾਨਣ ਹੈ.
    pl:Polish: 37/58=36.21% Piękno nie jest w twarz. Piękno jest światłem w sercu.
    pt:Portugese: 36/60=40.00% A beleza não está na cara. A beleza é a luz no coração.
    ru:Russian: 45/82=45.12% Красота не в лицо. Красота - это свет в сердце.
    ta:Tamil: 49/128=61.72% அழகு முகத்தில் இல்லை. அழகு என்பது இதயத்தின் ஒளி.
    tl:Filipino: 36/68=47.06% Ang kagandahan ay wala sa mukha. Ang kagandahan ay ang ilaw sa puso.
    tr:Turkish: 52/75=30.67% Güzellik karşısında değildir. Güzellik, kalbin içindeki ışıktır.
    ur:Urdu: 53/95=44.21% خوبصورتی چہرے میں نہیں ہے۔ خوبصورتی دل میں روشنی ہے۔
    vi:Vietnam: 57/82=30.49% Vẻ đẹp không nằm trong khuôn mặt. Vẻ đẹp là ánh sáng trong tim.
    zh:Chinese: 34/49=30.61% 美是不是在脸上。 美是心中的亮光。
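
    To illustrate the first component, a toy fixed prefix-free code table (not Unishox's actual codes): frequent letters get the short codes, and the code bits are then packed into bytes.
    Code:
    #include <cstdint>
    #include <map>
    #include <utility>

    // Hypothetical codes: {value, bit length}; no code is a prefix of another,
    // and the lengths satisfy Kraft equality, so the code is complete.
    const std::map<char, std::pair<uint8_t, int>> kToyCodes = {
        {'e', {0b00,   2}}, {'t', {0b01,   2}},
        {'a', {0b100,  3}}, {'o', {0b101,  3}},
        {'i', {0b110,  3}}, {' ', {0b1110, 4}},
        {'s', {0b1111, 4}}, // rarer symbols get longer codes, or an escape
    };
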
    Last edited by data man; 26th September 2019 at 18:39.


  18. #18 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/pmd_dict_v0s.rar
    https://encode.su/threads/3174-Finit...ll=1#post61256
    Code:
    30,515 alice_wland_chn.txt // added extra LF at the end
    10,786 0 // pmd.exe c nul alice_wland_chn.txt 0 (ppmd_dict)
    20,356 1 // unishox1.exe -c alice_wland_chn.txt 1
    22,301 3 // coder.exe c alice_wland_chn.txt 3 3_frq (bitalign)
     4,184 3_frq
    
    30,515 inp/*  // for /L %a in (000,1,787) do coder.exe d%a 3 inp\%a 3_frq
    27,846 out0/* // for /L %a in (000,1,787) do pmd.exe c nul inp\%a out0\%a
     5,753 out1/* // for /L %a in (000,1,787) do pmd.exe c alice_wland_chn.txt inp\%a out1\%a
