Thread: Multi-language text compression corpus?

  1. #1
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    Multi-language text compression corpus?

    Does anybody know of a good text compression corpus with fairly plain text in a variety of languages?

    If not, does anybody know of good sources of freely redistributable, large, plain texts in languages like Arabic, Hindi, Punjabi, and/or Bengali? UTF-8 preferred, but something that can be automatically converted to UTF-8 is okay.
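
A minimal sketch of what "automatically converted to UTF-8" could look like, assuming the source encoding of each file is already known. The function name `transcode_to_utf8` is made up for illustration, and nothing here tries to guess an encoding.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    # Strict decoding (the default) raises UnicodeDecodeError on bad input,
    # which is safer for a corpus than silently replacing characters.
    text = raw.decode(source_encoding)
    # Drop a leading byte-order mark if the decoder left one in place.
    return text.lstrip("\ufeff").encode("utf-8")


if __name__ == "__main__":
    sample = "مرحبا".encode("cp1256")  # Arabic text in a legacy Windows code page
    print(transcode_to_utf8(sample, "cp1256").decode("utf-8"))
```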

    I'm looking for good-sized books, if possible, at least 800KB and up to maybe 2.4 MB; should be a fairly unified work of prose written by a native speaker.

    I have been starting to put a test corpus together, starting with Project Gutenberg files like these:

    Guanchang XianXing Ji (novel "Bureaucrats"(?) by Bau Jia Li, Chinese (Sino-Tibetan, logographic)) (1.9 MB)
    Doko E (Japanese novel by Hakucho Masamune (Japonic, logographic)) (0.88 MB)
    Don Quixote (Spanish novel by Cervantes (Romance Indo-European)) (2.1 MB)
    The Descent of Man (Charles Darwin; only English text (Germanic Indo-European)) (1.8 MB)
    The Crime of Padre Amaro (Portuguese novel (Romance Indo-European)) (0.88 MB)
    Le Suicide (classic early sociology in French by Emile Durkheim (Romance Indo-European)) (1.1 MB)
    Buddenbrooks (Thomas Mann novel in German (Germanic Indo-European)) (1.6 MB)
    Kollayi Gattite Nemi (novel in Telugu by Rama Mohana Rao Mahidhara (Dravidian)) (1.4 MB)

    (One thing I like about Gutenberg is that people should be able to find the files well into the future, even if some site I put the corpus on disappears.)

    What I'm most lacking are large, free texts in Arabic, Hindi, Russian, Bengali, and Punjabi. I would like to hit all of the most-spoken (native) languages as well as the most-understood (native or not, like English and Russian). I don't want too many European and Latin-1 languages, and would like to hit more language families, character sets, etc.

    I've got a Quran in Arabic, using fairly modern orthography (0.8MB), but I'd rather have something more modern and bigger.

    Weirdly, I'm having problems finding big texts in Russian---Gutenberg has Dostoyevsky's Crime and Punishment in many languages, but not Russian. I've found versions on Russian sites, but not as a single plain text file, and I'm not sure which are legitimately redistributable, on stable sites, etc.

    Another issue I'm a bit worried about is formatting. The Gutenberg books are all formatted with line breaks, which I'd like to remove. I can easily replace simple line ends with single spaces for most alphabetic languages, but I'm not sure what's reasonable for Chinese and Japanese, or whether there are gotchas for languages that are written right-to-left or whatever.

    As I understand it, Chinese is difficult to correctly break into words at all---there are no obvious spaces or other delimiters, and words are mostly two characters but up to 5. I'm guessing that the Gutenberg texts are automatically line-broken using some heuristic algorithm, sometimes erroneously in the middle of a word. Can I just concatenate the broken lines (with no spaces) and get something reasonable, maybe even fixing errors? Or will that introduce some other kind of error? (And I know even less about Japanese, but I'd guess similar concerns apply.)
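
The line-joining idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: paragraphs are separated by blank lines, a hard line break inside a paragraph becomes a single space, and the space is omitted when the characters on both sides of the break are CJK. The helper names `is_cjk` and `unwrap` and the code point ranges are my own choices, not anything Gutenberg specifies, and no word segmentation is attempted.

```python
import re


def is_cjk(ch: str) -> bool:
    """Rough test for CJK ideographs, kana, and CJK/fullwidth punctuation."""
    cp = ord(ch)
    return (
        0x3000 <= cp <= 0x30FF      # CJK punctuation, hiragana, katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xFF00 <= cp <= 0xFFEF   # halfwidth and fullwidth forms
    )


def unwrap(text: str) -> str:
    """Remove hard line breaks inside paragraphs, keeping paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        joined = ""
        for ln in lines:
            # Insert a space unless both sides of the break are CJK characters.
            if joined and not (is_cjk(joined[-1]) and is_cjk(ln[0])):
                joined += " "
            joined += ln
        result.append(joined)
    return "\n\n".join(result)


if __name__ == "__main__":
    print(unwrap("the quick\nbrown fox\n\nnext paragraph"))
    print(unwrap("中國小\n說史略"))
```

Joining CJK lines with no separator sidesteps the segmentation question entirely: the text never had word delimiters, so a break in the middle of a word simply disappears.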

    Any thoughts/pointers would be appreciated.
    Last edited by Paul W.; 23rd November 2015 at 18:27. Reason: formatting

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,564
    Thanks
    773
    Thanked 687 Times in 372 Posts
    i have 323,845 russian books for 300 GB total, if that's not enough, we can write some more
    Last edited by Bulat Ziganshin; 14th November 2015 at 22:32.

  3. #3
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i have 323,845 russian books for 300 GB total, if that's not enough, we can write some more
    I really just need one whose provenance and freely redistributable status I'm sure of; I don't want somebody who OCR'd and line-broke it to claim it as their intellectual property and keep me from redistributing a full corpus---or worse, some paper book publisher who published an edition in 1957 that somebody then OCR'd. I trust Gutenberg in that respect... they're careful about not infringing non-PD copyrights.

    I don't want the kind of thing that happened with the Lena image in graphics compression. (Eventually Playboy OK'd redistributing the image, but for a while it was illegal to replicate people's compression experiments.)

  4. #4
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    163
    Thanks
    31
    Thanked 64 Times in 40 Posts
    link is japanese text archived by DGCA.

  5. #5
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by xezz View Post
    link is japanese text archived by DGCA.
    Is that text a better candidate for inclusion in a corpus than the Japanese novel (Doko E) I listed above? (I am not looking for particular compression or archiving algorithms, just for substantial input texts in important languages, to test various text compressors. And I'm basically looking for books, preferably a megabyte or two, not ancient, and with a clear provenance without copyright problems.)

  6. #6
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    163
    Thanks
    31
    Thanked 64 Times in 40 Posts
    Genji Monogatari is an ancient novel, so it is not a better candidate.

  7. #7
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    171
    Thanks
    20
    Thanked 61 Times in 30 Posts
    Guanchang XiaoXing Ji (novel "Bureaucrats"(?) by Bau Jia Li, Chinese (Sino-something, logographic)) (1.9 MB)
    it's Guanchang Xianxing Ji, written in ancient Chinese, not good.

  8. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by RichSelian View Post
    it's Guanchang Xianxing Ji, written in ancient Chinese, not good.
    Oops, I meant Xianxing, but I'm more used to Shaoxing (from recipes), so I guess that's what I garbled it with.

    I thought that the book was written after 1900, from reading this Wikipedia article...

    https://en.wikipedia.org/wiki/Guanchang_Xianxing_Ji

    ...which would make it more recent than Dickens, etc.

    But maybe it's written in an older style?

    Any recommendations for a better Chinese book or two for a corpus?

  9. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    After doing a little reading about Chinese literature, it sounds like anything pre-1920 is likely to seem archaic, and that what I probably want is something post-1920 with modern punctuation, numerals, etc.

    I've got the (Brief?) History of Chinese Fiction (Novels?) (中國小說史略) by Shuren Zhou (Lu Xun?), but it's only 671KB and I'm hoping to find something bigger. Unfortunately, most of what Gutenberg has is older, and mostly lots older. (As near as I can tell---there are a lot of authors they don't give dates for.)

    Now I'm wondering if there are similar gotchas with Japanese texts.
    Last edited by Paul W.; 17th November 2015 at 02:20.

  10. #10
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    171
    Thanks
    20
    Thanked 61 Times in 30 Posts
    here is a set of Chinese laws: http://pan.baidu.com/s/1o6L8Nxw, i think they are more suitable for corpus.

  11. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by RichSelian View Post
    here is a set of Chinese laws: http://pan.baidu.com/s/1o6L8Nxw, i think they are more suitable for corpus.
    Thanks. I downloaded that and it is a bunch of folders with a bunch of small files in them; I'm hoping for one big file that's easy to test compressors with.
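
For what it's worth, a folder tree like that could be flattened into one test file with something along these lines. It is a sketch under assumptions I haven't checked against the download: that the files end in `.txt` and are already in the same encoding. The function name `concatenate` and the output file name are placeholders.

```python
from pathlib import Path


def concatenate(root: str, out_path: str, pattern: str = "*.txt") -> int:
    """Join every matching file under root into out_path; return bytes written."""
    out_file = Path(out_path).resolve()
    total = 0
    with open(out_file, "wb") as out:
        # Sorted order keeps the result reproducible across machines.
        for path in sorted(Path(root).rglob(pattern)):
            if path.resolve() == out_file:
                continue  # never feed the output file back into itself
            data = path.read_bytes()
            if not data.endswith(b"\n"):
                data += b"\n"  # keep a line boundary between files
            out.write(data)
            total += len(data)
    return total


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        print(concatenate(sys.argv[1], sys.argv[2]), "bytes written")
```

A concatenation of many short legal texts would still not be the "fairly unified work of prose" I'm after, so this only helps with the file-handling side.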

    One big book Kennon found at Gutenberg is "Shi Gong Chuan" by Anonymous:

    http://www.gutenberg.org/ebooks/23825

    It's about 2x bigger than I was really looking for. (It's about 3.5MB)

    Google Translate translates the title to "Ship Construction", but when I look at samples of the text, it doesn't look like it's about building ships. It seems to be narrative of some sort, but I don't know if it's a novel, a series of stories, or something else. It seems to have modern punctuation, but that's about all I can tell about it at this point.

  12. #12
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    171
    Thanks
    20
    Thanked 61 Times in 30 Posts
    it should be "Shigong Zhuan" (施公传, "Biography of Shigong", a novel about criminal case handling, not famous at all), not "Shigong Chuan" (施工船, ship construction)

  14. #13
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Thanks. That makes a whole lot more sense. I don't think it's a problem that it's not famous, as long as it's not particularly weird for compression purposes.

  15. #14
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    171
    Thanks
    20
    Thanked 61 Times in 30 Posts
    Quote Originally Posted by Paul W. View Post
    Thanks. That makes a whole lot more sense. I don't think it's a problem that it's not famous, as long as it's not particularly weird for compression purposes.
    the text on http://www.gutenberg.org/ebooks/23825 is written in traditional Chinese (mainly used in Hong Kong and Taiwan), but mainland China uses simplified Chinese, and many characters are different. for example "中国" (simplified) and "中國" (traditional).
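
To make the difference concrete for compression testing, here is a small sketch that prints the code point and UTF-8 bytes of each character in the two spellings. The helper name `describe` is just for illustration.

```python
def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, UTF-8 bytes in hex) for each character."""
    return [(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex()) for ch in text]


if __name__ == "__main__":
    for label, word in (("simplified", "中国"), ("traditional", "中國")):
        print(label, describe(word))
```

Both spellings take three bytes per character in UTF-8, so file sizes stay comparable, but the second character is a different code point with a different byte sequence, so statistics gathered from a traditional text would not necessarily carry over to a simplified one.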
