
Thread: Compression of Cyrillic Unicode

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts

    Compression of Cyrillic Unicode

    Georgi Marinov (@sanmayce) ran an interesting benchmark using Cyrillic Unicode content.
    We don't have enough content of this kind for benchmarking; most tests are heavily biased towards English.
    So I figured it would be interesting to share.

    See the screenshot.
    [Attached image: BRE_turbobench.png]
    Last edited by Cyan; 8th February 2016 at 16:35.

  2. Thanks:

    encode (8th February 2016)

  3. #2
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    688
    Thanks
    41
    Thanked 174 Times in 88 Posts
    I can provide you with about 200 GB of raw Russian text. These are legal databases taken from the Consultant+ information software. Originally they are packed with block-based ZLIB, but I passed them through Precomp and additionally cleaned them with Iconv. It's quite easy to convert them to Unicode with the same Iconv. If somebody needs such test material, I can cut off a piece of any size and upload it.

  4. #3
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    A useful chunk size is about 6 to 15 megabytes, such that it's as big as possible without making ppmd -m256 -o16 -r1 discard and rebuild its model. That way people can compare against a high-order, high-quality, widely-used ppm implementation that's been used as a standard of comparison in several published papers. (Without the result being unfair to ppmd's particular memory limitation.)

    Ideally, test files should be freely redistributable: public domain or copyleft or whatever. I'm wondering whether the "database" in the software in question is copyrighted, even if the original sources aren't. (I'm also wondering if there's some big book of laws or court decisions, published online by the Russian government, that would be usable.)

    I've got a couple of largish Russian files in UTF-8 Cyrillic that I *think* are freely redistributable: Dostoyevsky's Crime and Punishment and Tolstoy's War and Peace, but it would be good to have some kinds of data besides 19th-century epic novels. (Those two are 1,940,360 and 5,639,118 bytes, respectively.)

  5. #4
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I'm also wondering if it would be legal to put big chunks of the Great Russian Encyclopedia online. It's government-funded, and they sell it cheaply on CDs, so maybe they don't care about making money off of it. I haven't found it for free download, though.

  6. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    I believe sanmayce's main point is that it is important to look beyond English-only text benchmarks.

    Russian in Cyrillic is an obvious example, and it raises several questions of its own, such as which representation standard to use
    (which version of Unicode? UTF-8, UTF-16, or UTF-32? A Windows code page?).
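
    To make the representation question concrete, here is a minimal Python sketch that measures how many bytes the same short Russian string occupies under a few encodings. The sample string and the list of encodings are illustrative assumptions, not part of sanmayce's benchmark.

```python
# Hypothetical sample text; any Russian string would do.
SAMPLE = "Война и мир"

# Encodings chosen for illustration: three Unicode forms and two
# legacy 8-bit Cyrillic code pages known to Python's codec registry.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "cp1251", "koi8_r"]


def encoded_sizes(text, encodings=ENCODINGS):
    """Return {encoding name: size in bytes} for the given text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    for enc, size in encoded_sizes(SAMPLE).items():
        print(f"{enc:10s} {size:3d} bytes")
```

    The raw size differs a lot between representations, so a benchmark should state which one it used.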

    Then, one could also consider Japanese, Chinese, and so on.

    So I guess the question is:
    Is there an available corpus of texts in non-Latin scripts that could be used for compression benchmarks?
    That would be an important stepping stone.

  7. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I am constructing such a corpus, in UTF-8. That can easily be converted to other Unicode representations or (in most cases) to other common representations like Big5 (for Chinese characters) or the 8-bit extended-ASCII encodings used for Cyrillic.

    So far I've got books in English, Chinese, Arabic, Spanish, Portuguese, French, German, Japanese, and Telugu, which covers most of the most widely spoken and written languages in their appropriate scripts. (I'm still looking for Modern Standard Hindi, Bengali, and Urdu, and a better Arabic file. Then I'd have the top 10 native languages and the top 10 first-or-second languages. Mainland Chinese using Simplified Chinese characters would be good, too; right now I've only got non-mainland Chinese using Traditional Chinese characters. I don't know how much difference that makes for compressor testing.)

  8. Thanks:

    Cyan (8th February 2016)

  9. #7
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Another interesting kind of test file is the second 10,000,000 bytes of Wikipedia in various languages. That shows what happens with a reasonable mix of different scripts and Latin-1 markup. You probably shouldn't use the first 10,000,000 bytes, because there's a sizable blob of nonrepresentative stuff at the beginning. To get a conforming UTF-8 file, include an extra byte or two or three at each end, as necessary, so that you don't chop a multibyte character in the middle.

  10. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Here are pictures of the first 10 million bytes of Chinese Wikipedia (mixed Latin-1 markup etc. and Traditional Chinese characters) and War and Peace (Cyrillic). They were generated using a slightly modified version of Matt's fv program. The colors represent match lengths, plotted on a vertical log scale against first match distance. (Blue = 1-byte match length, red = 2, green = 4, black = 8.)

    The broad color bands show the distances at which you tend to get matches of different lengths.

    Notice the clear narrow stripes across the bottom. Those tell you the usual character size in bytes, because the high byte tends to match from character to character.
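
    The stripe effect can be reproduced numerically without any plotting. The Python sketch below is not fv's algorithm, just a simplified stand-in under my own assumptions: for every position it records the distance back to the most recent earlier occurrence of the same n-byte string, and counts how often each distance occurs. For natural Cyrillic UTF-8 text one would expect distance 2 to be very common at n=1, since nearly every letter starts with the same one or two lead bytes.

```python
from collections import Counter


def match_distance_histogram(data, n=1):
    """Count distances to the previous occurrence of each n-byte string.

    Returns a Counter mapping distance -> number of positions whose
    n-gram was last seen exactly that many bytes earlier.
    """
    last_seen = {}   # n-gram -> index of its most recent occurrence
    hist = Counter()
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        if gram in last_seen:
            hist[i - last_seen[gram]] += 1
        last_seen[gram] = i
    return hist


if __name__ == "__main__":
    # Hypothetical sample; a real test would read a whole file as bytes.
    data = "война и мир".encode("utf-8")
    for distance, count in sorted(match_distance_histogram(data).items()):
        print(distance, count)
```

    Running it with n=1, 2, 4, and 8 gives a rough numeric counterpart to the four colors in the pictures.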
    Attached thumbnails: warpeace.png, zhwik7.jpg

  11. Thanks (3):

    Cyan (8th February 2016),Turtle (9th February 2016),willvarfar (9th February 2016)

  12. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    One feature I should explain is the black stripe at 1 byte for the first part of War and Peace, which indicates runs of same-valued bytes. That's because the first volume was formatted differently from the rest, with leading spaces on each line to create a left margin. It also has some runs of two spaces where extra spaces have been inserted to make it left-and-right justified.

    My other files in various languages are not formatted (unlike most UTF-8 files from Gutenberg). That's because there's a trend toward flowing text formatted in the browser or reader, and because it makes a better test of basic text compression.
    Last edited by Paul W.; 8th February 2016 at 20:55.

  13. #10
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    By the way, the gedit (free, multiplatform) text editor can save UTF-8 files in a variety of different encodings, including 8 for Cyrillic, 3 for Traditional Chinese characters, 3 for Simplified Chinese characters, etc. You can usually just read the UTF-8 file into gedit, "save as...", pick an encoding, and click OK. Of course that won't work if the file has stuff not supported by the encoding you choose. gedit can handle pretty big files, like my 15MB test files. (I haven't tried loading enwik9 into it.)
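
    For files too big for an editor, the same conversion can be scripted. Here is a minimal Python sketch, assuming the input is valid UTF-8; the function names are my own, and the target names are Python codec names, not gedit's menu entries. As with gedit, the conversion fails when the text contains characters the target encoding can't represent.

```python
def transcode(data, target, source="utf-8"):
    """Re-encode bytes from `source` to `target`.

    Raises UnicodeEncodeError if the target encoding lacks a character.
    """
    return data.decode(source).encode(target)


def can_encode(text, target):
    """True if every character of `text` exists in the target encoding."""
    try:
        text.encode(target)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    russian = "мир".encode("utf-8")
    print(len(russian), len(transcode(russian, "cp1251")))
    print(can_encode("中", "cp1251"))  # a Chinese character in a Cyrillic code page
```

    For a whole file, read it in binary mode, pass the bytes to transcode(), and write the result back out in binary mode.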
