
Thread: Natural Language Processing, PPMd, PPMZ and frozen models

  1. #1
    patentsguy (Member, Netherlands; joined May 2012)

    Natural Language Processing, PPMd, PPMZ and frozen models

    Greetings to everybody! A short introduction to my first post:

    After reading Rudi Cilibrasi's PhD Thesis and the paper "On Compression-based Classification" I'm using compression in my pet project. I'm very pleased with the document models provided by PPMd.

    I'm trying to classify patent documents according to the EPC (European Classification) standard of codes. I use compression to extract meaningful n-grams and a summary of a patent's description. This could be described as "compressing the document in human-readable form". You can give it a try here: http://elcid.demon.nl/cgi-bin/viewer...s&PN=GB2221870 and here: http://elcid.demon.nl/viewer.html

    The extractive summary is just the first step, classification goes like this:

    1) each class is represented by a "super-document" formed by appending the extractive summaries of representative patents from the class.
    2) the test document is reduced to the same readably compressed form.
    3) using PPMd I estimate the conditional entropy of the test document relative to each class, then select the classes that provide the largest reduction in entropy.
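The three steps above can be sketched in a few lines of Python. PPMd itself isn't available in the standard library, so this sketch uses bz2 as a stand-in compressor, and the class names and texts are invented placeholders:

```python
import bz2

def csize(data: bytes) -> int:
    """Compressed size C(x), with bz2 standing in for PPMd."""
    return len(bz2.compress(data))

# 1) hypothetical "super-documents": concatenated summaries per class
super_docs = {
    "pumps":  b"a centrifugal pump impeller rotates to move fluid through the volute casing. " * 30,
    "optics": b"a convex lens focuses parallel light rays onto a single focal point. " * 30,
}

# 2) the (already summarized) test document
doc = b"the pump impeller rotates inside the casing to move the fluid. " * 5

# 3) pick the class whose model gives the smallest conditional size,
#    approximating C(doc | class) as C(class + doc) - C(class)
def classify(doc: bytes, super_docs: dict) -> str:
    return min(super_docs,
               key=lambda c: csize(super_docs[c] + doc) - csize(super_docs[c]))

print(classify(doc, super_docs))
```

With a real PPMd binary the same logic would shell out to the compressor instead of calling bz2; only `csize` changes.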

    The ideal way to do (3), as Charles Bloom explains here, is to pre-condition the compressor with the class's super-document, then use this "frozen" model to compress the test document without further model updates. PPMZ2 can do this (options x and c), but it's too slow for my purposes, as I want to classify several documents per minute.

    Since I'm new to this field and Charles Bloom's page hasn't been updated since 2009, I'm asking those in the know whether there's any PPM-based compressor that is as fast as PPMd and can be pre-conditioned and frozen like PPMZ2; even better if it can save and reload models for later use.

    Thank you all,

    Oscar, the patents guy.
    Last edited by patentsguy; 27th May 2012 at 18:18.

  2. #2
    patentsguy (Member, Netherlands)
    Alright, I've done some searching and found that Dmitry's programs PPMd1 and durilca can use static models (pre-built contexts, preconditioning, etc.). Apparently .MDL files can be built with Dmitry's utility PPMTrain.

    However, the documentation of this feature is cryptic; all you get is "Usage: PPMTrain [-mXX] [-oXX] [-sXX] FName", and a !PPMd.mdl file gets created. Durilca's READ_ME says: "-t2 - test tricks ... 3 - pre-built model (.MDL files are required);"

    I don't know how to proceed to use pre-built models with this family of compressors; would anyone care to shed some light?

  3. #3
    Matt Mahoney (Expert, Melbourne, Florida, USA; joined May 2008)
    For any compressor, you can measure the compressed size of B conditioned on A as C(AB) - C(A), where C(AB) is the compressed size of the concatenation of A and B. I guess that you want to avoid having to compress A over and over when comparing to other documents. I'm not aware of any compressors that have the feature of saving and reloading their state like this. You would have to modify the program to do that. For good compressors, the state would be hundreds of MB, much larger than the input, so not necessarily fast.
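The C(AB) - C(A) measurement Matt describes can be tried with any off-the-shelf compressor; a minimal sketch using Python's bz2, with made-up texts:

```python
import bz2

def csize(data: bytes) -> int:
    return len(bz2.compress(data))

def conditional_size(a: bytes, b: bytes) -> int:
    # approximate C(B | A) as C(AB) - C(A)
    return csize(a + b) - csize(a)

a = b"the quick brown fox jumps over the lazy dog. " * 50
similar = b"the quick brown fox jumps over the lazy cat. " * 5
different = b"arithmetic coding assigns short codes to frequent symbols. " * 5

# a document similar to A costs fewer extra bits on top of A
print(conditional_size(a, similar), conditional_size(a, different))
```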

  4. #4
    patentsguy (Member, Netherlands)
    Quote Originally Posted by Matt Mahoney View Post
    I guess that you want to avoid having to compress A over and over when comparing to other documents. I'm not aware of any compressors that have the feature of saving and reloading their state like this. You would have to modify the program to do that. For good compressors, the state would be hundreds of MB, much larger than the input, so not necessarily fast.
    Thank you for the insight. It might be slow to load the model, as you say. The bottom line is that I want to freeze the model on one document and use it to compress another. PPMZ does that, although the preconditioning can't be saved.

    I'm only aware of Dmitry's PPMTrain (http://compression.ru/ds/ppmtrain.rar) that can do that. It creates a tree file with extension .mdl, but I can't figure out how to use it. The README refers to PPMd1, but on his home page he refers to DURILCA.

  5. #5
    Matt Mahoney (Expert)
    In ppmd, -m selects memory usage in MB and -o selects model order. Don't know about -s.

  6. #6
    Shelwien (Administrator, Kharkov, Ukraine; joined May 2008)
    Uh, it's used like this:
    Code:
    PPMTrain.exe BOOK1 // creates !PPMd.mdl 
    PPMd1.exe e BOOK2 // uses !PPMd.mdl, its hardcoded
    As to options, the additional -s in ppmtrain defines the size of the created !PPMd.mdl -
    ppmtrain applies its tree-reduction algorithm until it reaches the required size (the same one as in ppmd -r1).

    But saving statistics in such a way is probably not the best idea.
    If it's just to measure the similarity of files, it should be OK to just compress the first file separately, then
    the two files together, and calculate the difference.
    Also in most cases the same approach actually works for creating file diffs too -
    Code:
    ppmd e book1 1
    ppmd e book1+book2 2
    diff 1 2 3
    If you really need an artificial dictionary as a reference, instead of a specific file,
    then an interesting option may be to use ppmd -r2 to compress the base data,
    then introduce an error in the "frozen" part of the compressed file - after that, on decoding,
    ppmd would generate some random data using the model's statistics.

    Anyway, the most efficient description of statistics is the compressed data -
    some snapshot of memory structures would be either huge or lose important
    information.


    Also, you might want to look at http://libbsc.com/ - its text compression is also good, and it's faster.

  7. #7
    patentsguy (Member, Netherlands)
    Quote Originally Posted by Shelwien View Post
    Uh, it's used like this:
    Code:
    PPMTrain.exe BOOK1 // creates !PPMd.mdl 
    PPMd1.exe e BOOK2 // uses !PPMd.mdl, its hardcoded
    As to options, the additional -s in ppmtrain defines the size of the created !PPMd.mdl -
    ppmtrain applies its tree-reduction algorithm until it reaches the required size (the same one as in ppmd -r1).
    Thank you for clarifying this!

    Quote Originally Posted by Shelwien View Post
    But saving statistics in such a way is probably not the best idea.
    If it's just to measure the similarity of files, it should be OK to just compress the first file separately, then
    the two files together, and calculate the difference.
    This is exactly what I'm doing. I have files representing each class, C1, C2, ..., Cn, and a file A to classify. The problem is that, when compressing Cn + A, file A is also being used to build the last part of the model, which can affect the precision of the classification if A is not much smaller than the smallest Cn.
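One standard way to reduce this sensitivity to relative file sizes is the normalized compression distance from Cilibrasi's work, which divides out the compressed sizes. A sketch, again with bz2 standing in for PPMd and invented texts:

```python
import bz2

def csize(data: bytes) -> int:
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)

cn = b"a centrifugal pump impeller rotates to move fluid. " * 40   # class file
a_similar = b"the pump impeller rotates to move the fluid. " * 10   # related doc
a_unrelated = b"a convex lens focuses parallel light rays. " * 10   # unrelated doc

print(ncd(cn, a_similar), ncd(cn, a_unrelated))
```

Lower NCD means more similar; the normalization makes scores comparable across classes of different sizes.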

    It would be nice to have an option to use one file as the model and a second file to compress (like the -x option in PPMZ2); then ppmd would be an excellent tool for classification.

    Quote Originally Posted by Shelwien View Post
    If you really need an artificial dictionary as a reference, instead of a specific file,
    then an interesting option may be to use ppmd -r2 to compress the base data,
    then introduce an error in the "frozen" part of the compressed file - after that, on decoding,
    ppmd would generate some random data using the model's statistics.
    Where does the "frozen" part of the compressed file begin?

    Quote Originally Posted by Shelwien View Post
    Anyway, the most efficient description of statistics is the compressed data -
    some snapshot of memory structures would be either huge or lose important
    information.
    So an even better option for classification applications would be to have a library of compressed files as class models, and an option to load any of them and use it to compress the file to classify.

    I'd support such a development financially if you're willing to adapt the ppmd code to do this.

  8. #8
    Shelwien (Administrator)
    > affect the precision of the classification if its size is not a
    > lot smaller than the smallest of Cn.

    Yes, but it doesn't actually matter whether it's Cn+A or A+Cn.
    It's more a matter of dictionary size.
    If there's enough memory to keep both files in the window, then
    it shouldn't be a problem.
    BWT compressors are more convenient in that sense, because their
    memory usage is more predictable (usually just 5*filesize).

    > It would be nice to have options to use one file as the model and a
    > second file to compress (like the -x option in PPMZ2), then ppmd
    > would be an excellent tool for classification.

    Well, if processing the base file is OK, then it's not hard to do.
    But ppmtrain kinda already does that - you just have to rename
    the .mdl file. Of course, it's not hard to add a new option for that either.

    >> introduce an error in the "frozen" part of compressed file - after
    >> that on decoding, ppmd would generate some random data using the
    >> model's statistics.

    > Where does the "frozen" part of the compressed file begin?

    It starts at the point where ppmd completely fills its model memory
    (so it can be controlled with the -m option).
    There's no marker in the code or anything like that.
    I guess it would be impractical for actual use... just that statistical
    data generation is interesting in itself.

    > So an even better option for classification applications would be to
    > have a library of compressed files as class models, and an option to
    > load any of them and use it to compress the file to classify.

    No, it's just the most compact representation of the statistics.
    But it would still be necessary to decode these compressed base files before
    encoding the target file.
    It's an option, but if you don't really care about storage space,
    then ppmtrain model snapshots may be a faster option.
    You'd have to choose one first based on classification results, I guess -
    the ppmtrain way also discards some statistics, so it may be worse... or not.

    > I'd support such a development financially if you're willing to adapt the ppmd code to do this.

    Maybe, it doesn't seem too troublesome.
    I'd prefer to use ppmd_sh instead, though - http://www.ctxmodel.net/files/PPMd/
