Does anybody know of a good text compression corpus with fairly plain text in a variety of languages?
If not, does anybody know of good sources of freely redistributable, large, plain texts in languages like Arabic, Hindi, Punjabi, and/or Bengali? UTF-8 preferred, but something that can be automatically converted to UTF-8 is okay.
I'm looking for good-sized books if possible, at least 800 KB and up to maybe 2.4 MB. Each should be a fairly unified work of prose written by a native speaker.
I have started putting a test corpus together, beginning with Project Gutenberg files like these:
Guanchang XianXing Ji (novel, "Bureaucrats"(?), by Bau Jia Li; Chinese; Sino-Tibetan, logographic) (1.9 MB)
Doko E (novel by Hakucho Masamune; Japanese; Japonic, logographic) (0.88 MB)
Don Quixote (novel by Cervantes; Spanish; Romance Indo-European) (2.1 MB)
The Descent of Man (Charles Darwin; the only English text; Germanic Indo-European) (1.8 MB)
The Crime of Padre Amaro (novel; Portuguese; Romance Indo-European) (0.88 MB)
Le Suicide (classic early sociology by Emile Durkheim; French; Romance Indo-European) (1.1 MB)
Buddenbrooks (novel by Thomas Mann; German; Germanic Indo-European) (1.6 MB)
Kollayi Gattite Nemi (novel by Rama Mohana Rao Mahidhara; Telugu; Dravidian) (1.4 MB)
(One thing I like about Gutenberg is that people should be able to find the files well into the future, even if some site I put the corpus on disappears.)
What I'm most lacking are large, free texts in Arabic, Hindi, Russian, Bengali, and Punjabi. I would like to hit all of the most-spoken (native) languages as well as the most-understood (native or not, like English and Russian). I don't want too many European and Latin-1 languages, and would like to hit more language families, character sets, etc.
I've got a Quran in Arabic, using fairly modern orthography (0.8 MB), but I'd rather have something more modern and bigger.
Weirdly, I'm having trouble finding big texts in Russian. Gutenberg has Dostoyevsky's Crime and Punishment in many languages, but not in Russian. I've found versions on Russian sites, but not as a single plain text file, and I'm not sure which are legitimately redistributable, hosted on stable sites, etc.
Another issue I'm a bit worried about is formatting. The Gutenberg books are all formatted with line breaks, which I'd like to remove. I can easily replace simple line ends with single spaces for most alphabetic languages, but I'm not sure what's reasonable for Chinese and Japanese, or whether there are gotchas for languages that are written right-to-left or whatever.
As I understand it, Chinese is difficult to break into words correctly at all: there are no spaces or other obvious delimiters, and words are mostly two characters long but can run up to five. I'm guessing that the Gutenberg texts are line-broken automatically using some heuristic algorithm, sometimes erroneously in the middle of a word. Can I just concatenate the broken lines (with no spaces) and get something reasonable, maybe even fixing errors? Or will that introduce some other kind of error? (And I know even less about Japanese, but I'd guess similar concerns apply.)
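To make the line-break question concrete, here is a minimal Python sketch of the unwrapping described above. It is an illustration under assumptions, not a tested answer: it assumes the text is already decoded from UTF-8, that blank lines separate paragraphs, and that a rough code-point range check is good enough to spot Chinese and Japanese characters. The names `is_cjk` and `unwrap` are made up for this example. Lines are joined with a single space, except where both sides of the break are CJK characters, in which case they are concatenated directly.

```python
import re


def is_cjk(ch):
    """Rough range check (an assumption, not a complete list) for characters
    from scripts that are normally written without spaces between words."""
    cp = ord(ch)
    return (
        0x3000 <= cp <= 0x303F      # CJK symbols and punctuation
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and fullwidth forms
    )


def unwrap(text):
    """Remove hard line breaks inside paragraphs.

    Assumes paragraphs are separated by blank lines. Only ASCII spaces,
    tabs, and carriage returns are trimmed from line ends, so an
    ideographic space (U+3000) used as an indent is left alone.
    """
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    out = []
    for para in paragraphs:
        lines = [ln.strip(" \t\r") for ln in para.split("\n")]
        lines = [ln for ln in lines if ln]
        if not lines:
            continue
        joined = lines[0]
        for ln in lines[1:]:
            # No separator when the break falls between two CJK characters;
            # otherwise a single space stands in for the line end.
            sep = "" if is_cjk(joined[-1]) and is_cjk(ln[0]) else " "
            joined += sep + ln
        out.append(joined)
    return "\n\n".join(out)
```

Right-to-left text such as Arabic is stored in logical (reading) order, so joining lines with a space works the same way as for left-to-right text at the level of the stored characters. Whether plain concatenation is always right for Chinese and Japanese is still the open question; this sketch just does it mechanically.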
Any thoughts/pointers would be appreciated.