Thread: Canterbury Corpus

    The Canterbury Corpus was developed by Ross Arnold and Timothy Bell in 1997 at the University of Canterbury, New Zealand, as an improved version of the Calgary Corpus. The files were chosen because existing compression algorithms give typical results on them.

    The corpus was published at DCC '97 (the Data Compression Conference) in the paper "A corpus for the evaluation of lossless compression algorithms". The final files were selected from a pool of more than 800 candidate files. The paper explains how the files were chosen and why it is difficult to find "typical" files.

    There are two main editions of the Canterbury Corpus: the Standard Canterbury Corpus, consisting of 11 files (alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1), and the Large Canterbury Corpus, consisting of 3 files (bible.txt, E.coli, world192.txt).
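
    For anyone who wants to try the corpus out, here is a rough sketch of how the compression ratio of each standard file could be measured. It uses Python's built-in zlib purely as a stand-in compressor. The directory name "cantrbry" and the function name compression_ratios are my own assumptions, not part of the corpus, so adjust them to wherever you unpacked the files.

```python
import zlib
from pathlib import Path

# File names of the Standard Canterbury Corpus, as listed in the post.
STANDARD_FILES = [
    "alice29.txt", "asyoulik.txt", "cp.html", "fields.c", "grammar.lsp",
    "kennedy.xls", "lcet10.txt", "plrabn12.txt", "ptt5", "sum", "xargs.1",
]


def compression_ratios(corpus_dir, names=STANDARD_FILES, level=9):
    """Return {name: compressed_size / original_size} for files found on disk."""
    ratios = {}
    for name in names:
        path = Path(corpus_dir) / name
        if not path.is_file():
            continue  # skip files missing from the local copy
        data = path.read_bytes()
        if not data:
            continue  # an empty file has no meaningful ratio
        ratios[name] = len(zlib.compress(data, level)) / len(data)
    return ratios


if __name__ == "__main__":
    # "cantrbry" is a hypothetical local directory holding the unpacked corpus.
    for name, ratio in compression_ratios("cantrbry").items():
        print(f"{name:14s} {ratio:.3f}")
```

    A lower ratio means better compression. To compare another compressor, swap zlib.compress for its equivalent call and keep the rest unchanged.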

    The corpus is attached to this post.
