
Thread: Questions on modelling and mixing in PAQ8PX

  1. #1 — martinradev (Member)

    Hello people,

    I am interested in analyzing which models in PAQ8PX have a positive or negative influence on predictions for a particular data set of binary files I have. The binary files are 2D byte data.

    I scanned through the source code, and to my understanding the PAQ8PX compressor keeps _many_ models which give predictions for the current bit. The result of each model is stored in SIMDMixer::tx[], and the predictions are later mixed using a multi-layer neural network whose weights are updated in SIMDMixer::update(). The number of predictions in tx[] is very large: 1248, which is more than the number of available models, so I suppose several predictions come from the same model with contexts of different orders. Is this correct?
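    For reference, the core mixing step in paq-style mixers is logistic: each input is a stretched probability, and the mixer outputs squash of the weighted sum. A minimal sketch in plain floating point (paq8px itself uses scaled integers and SIMD; the names and learning rate here are illustrative):
    Code:
    #include <cmath>
    #include <vector>

    double stretch(double p) { return std::log(p / (1.0 - p)); }    // logit
    double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); } // its inverse

    // One logistic mixing step plus the online weight update (gradient descent
    // on the coding loss) -- the scheme SIMDMixer::update() implements in fixed point.
    double mix(std::vector<double>& w, const std::vector<double>& x,
               int y /* observed bit */, double lr = 0.002) {
        double dot = 0;
        for (size_t i = 0; i < w.size(); i++) dot += w[i] * x[i];
        double p = squash(dot);                 // mixed P(bit==1)
        for (size_t i = 0; i < w.size(); i++)
            w[i] += lr * (y - p) * x[i];        // update after the bit is known
        return p;
    }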

    I decided to evaluate the partial derivatives of the error from the neural network with respect to each input neuron. I did this numerically in the following manner:
    Code:
    SIMDMixer::update() {
      ...
      // store the relevant state, since SIMDMixer::p() modifies it
      err = y_actual - mp->pr[0]   // y_actual: the bit that occurred, scaled to the prediction range
      for each input neuron i
         tmp = tx[i]
         tx[i] = tmp + DELTA
         p1 = this->p()
         restore state
         tx[i] = tmp - DELTA
         p2 = this->p()
         restore state
         // d(err^2)/d(tx[i]) = -2*err * dp/d(tx[i]), with dp/d(tx[i]) ~ (p1 - p2) / (2*DELTA)
         partial_deriv[i] = -2.0 * err * (p1 - p2) / (2.0 * DELTA)
         tx[i] = tmp               // restore the unperturbed input
      ...
    }
    This should be the central-difference (two-point) method: f'(a) ≈ (f(a+h) - f(a-h)) / (2h)

    I later normalize the partial derivative based on the range of the data types for the probability and error.
    Do you think that this idea is correct? Are there better ways to accomplish this?
    Is there anything more I should be aware of?

    The motivation for this work is that I would like to figure out which models are useful for the data being compressed, and also to attempt to write a new model for PAQ8PX for this data type if it makes sense.

    Thank you for your time,
    Martin

  2. #2 — Shelwien (Administrator)
    > I suppose several predictions come from the same model with contexts of different orders. Is this correct?

    I'd say no. With some small modifications it may be possible to say that paq models only calculate
    contexts for counters of the same type, but imho it's more important that the models parse specific data structures
    to compute these contexts.

    1. A model is useless only if removing it from the mix doesn't make compression worse.
    A weaker definition would be something like:
    l(p,y) = -log2(y*p + (1-y)*(1-p))
    l(p1,bit) >= l(p2,bit) for every data bit, where p1 is the bit==1 estimate given by this model and p2 is the estimate given by some other model (see the code sketch after this list).
    But some indirect effects are also possible (like from the SSE layer), so it's not 100% true.

    2. A rough estimation of a submodel's performance can be obtained by coding with predictions from this one model only.
    If your new model on its own compresses some data better than paq8, then adding it to the paq8 mix would also improve compression.

    3. The only good metric for data compression models is Shannon's entropy (see l() defined in (1)).
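    A minimal sketch of the per-bit loss l() above, as standalone C++ (dumping each model's P(bit==1) per bit and the names used here are illustrative assumptions, not paq8px API):
    Code:
    #include <cmath>

    // Shannon code length in bits paid for predicting p = P(bit==1) when `bit` occurred.
    double l(double p, int bit) {
        return -std::log2(bit ? p : 1.0 - p);
    }

    // Summing l(p[i], bit[i]) over the whole file gives the ideal compressed size
    // for that model alone; comparing two models' sums ranks them by this metric.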

  3. #3 — martinradev (Member)
    Hello Shelwien,

    thanks for the comment.

    I decided to compute the gradients with respect to the inputs to find out how each model affects the error over the span of compressing the whole file.
    Removing a model would show changes in the compression ratio but would not give the fine granularity of showing at which point a model is useful and when not.

    However, computing the gradients didn't produce easy-to-interpret results, since there were many input neurons (>1000?).
    I decided to try out your suggestions but I do not understand the full paq8px pipeline and I am not sure whether I should do something else.

    I only modified the ContextModel class by changing how the mixer is initialized and also which frequencies are passed within ContextModel::p().
    I made it controllable via an env variable to make it easy to disable models without recompiling.
    I see changes in the compression ratio and also in speed. However, even after disabling the majority of the models, for example all but the ExeModel, I still get compression close to the original. Yet the binary data is not an exe, and I would expect the model to give bad predictions and thus significantly worse compression.

    Am I missing something?
    Can Secondary Symbol Estimation (SSE) lead to still pretty decent compression?
    Is there some other part in the code I should be modifying?

  4. #4 — Shelwien (Administrator)
    > Removing a model would show changes in the compression ratio but would not
    > give the fine granularity of showing at which point a model is useful and
    > when not.

    You can record -log2(y*p+(1-y)*(1-p)) per bit, with (l1) and without (l2) a submodel.
    Spans with a high frequency of l1>l2 are the ones where the submodel improves compression.
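    A minimal sketch of that comparison, assuming the two per-bit loss traces have already been dumped to arrays (the window size and all names are illustrative):
    Code:
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Compare per-bit losses without (l1) and with (l2) a submodel over fixed windows;
    // windows where the "with" run is cheaper are spans where the submodel helps.
    void compareSpans(const std::vector<double>& l1,
                      const std::vector<double>& l2,
                      size_t window = 4096) {
        for (size_t a = 0; a < l1.size(); a += window) {
            size_t b = std::min(a + window, l1.size());
            double saved = 0;
            for (size_t i = a; i < b; i++)
                saved += l1[i] - l2[i];   // > 0: the submodel saved bits here
            if (saved > 0)
                std::printf("bits %zu..%zu: submodel saves %.1f bits\n", a, b, saved);
        }
    }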

    > I do not understand the full paq8px pipeline and I am not sure whether I should do something else.

    You can just record these (with or without SSE):
    https://github.com/hxim/paq8px/blob/...8px.cpp#L11439

    > the binary data is not an exe, and I would expect the model to give bad predictions
    > and thus significantly worse compression.

    Well, it depends on the data type? If none of the models really fits your data,
    I suppose the exemodel can still provide some compression, since it adds several contexts,
    similar to a prefix model in a way.

    > Can Secondary Symbol Estimation (SSE) lead to still pretty decent compression?

    Actually, yes. SSE there has its own contexts, up to order 3.

  5. #5 — Gotty (Member)
    Hello martinradev!

    [In this first part I try to explain some of your experiences]

    1) The models are not pure: models may use similar contexts for different purposes - for example, both the normalmodel and the wordmodel will set up the same context when they see a letter after a space. They just set it up from a different angle: the normalmodel sees it just as two bytes, but the wordmodel will see it as the beginning of a word. This means you will need to consider the "contexts", not the "models". So your question would be: which contexts are the most useful for modeling your data?

    2) The models/contexts are mixed, and it's difficult to look at each model/context alone: you'll need to consider them together. If you turn off one of the models, other models may just "fill in" (= your experience with the exemodel). For instance, if the dmc model is turned off, the normal model, matchmodel, wordmodel, etc. may just join forces, receive somewhat larger (or different) weights from the mixer than before, and so they will model the data nearly as well as when the dmc model was also enabled. With the dmc model it's just better - the joint probability is modeled better. A good compression result is not the result of one model alone.

    I think calculating the usefulness of a model/context from intermediate results (like neuron outputs) is not the best way to go.

    You may try turning off different models/contexts to see which was the most useful - as you did. Don't expect serious losses, because of the above 2 points. Turning off models/contexts is very similar to brain damage: the roles of the damaged part may be taken over by other parts.
    If you try to find out which part of the brain is the most useful in a specific (real-life) context, and to that end you disable neurons or even groups of neurons, you may not observe serious dumbness. Other parts will replace the missing part. Not perfectly, but still quite well.

    The models in paq8px were originally built to deal with specific data structures, but none of the models is pure: they all have something in common with the others. For example, the recordmodel, sparsemodel, sparsematchmodel and wordmodel all model bytes separated by gaps or bytes in columns (so to speak). The sparsemodel does this in a fixed fashion; the others are more dynamic. The recordmodel is the most dynamic, but it models fixed record lengths. The wordmodel, however, models variable-length rows.
    All of these models were built with specific data structures in mind, but because of the similarities they have overlaps.

    Having models with similar contexts (but from different angles) is very useful for finding the joint probability. And this is the purpose of a context mixing data compressor.

  6. #6 — Gotty (Member)
    A concrete example:

    In a file, paq8px sees this: the preceding byte is a dot, and before the dot there was a number. In textual files that's a strong indicator that a number will follow.
    Which models deal with this structure (number-dot-number)? The chargroupmodel, for example, but the normalmodel, matchmodel, sparsemodel, dmcmodel, etc. will also kick in from a different angle. Their joint effort gives us really strong predictive power. So which model is useful in this case? All of them!

  7. #7 — Gotty (Member)
    [In this part let's talk about your data, and how paq8px may handle it]

    >>The binary files are 2D byte data.

    Paq8px predicts bits. In a general 2d binary data file a bit within a byte depends most heavily on its surrounding bits and bytes.

    In paq8px, compression works by processing the file left to right, one byte after the other. When we reach a byte and try to predict it, we need contexts for a good prediction.

    When processing a 2d file from left to right, the useful contexts are the preceding bytes and the bytes above. We don't have what's to the right of the byte or below it; they are unavailable. So...

    Bytes preceding the current byte are modeled by a lot of models: most models in paq8px do exactly that, just from different angles (see my previous posts).
    Bytes above the current byte are modeled only by the recordmodel and wordmodel ((and the image models, of course)). These models need to detect the record length or line length to kick in. Without detecting the record length or line length, paq8px will process your 2d data in 1d only and will not use contexts from above. In that case compression will not be (near-)optimal.

    What is the structure of your data?
    Being 2d means it has a fixed record length. How large is the record length? Do some of the columns always contain the same byte/character (like a separator character)? From the viewpoint of the recordmodel these are the two most important questions.

    If you would like to find out whether the recordmodel detects your structure, look for all the "rlen[0] =" lines and add printf("record detected with length: %d\n", rlen[0]) at each instance. There are 4 places.
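    A sketch of that debug patch (the right-hand side differs at each of the 4 sites; the cast assumes rlen[0] is an integer type):
    Code:
    rlen[0] = ...;  // one of the 4 assignment sites in the record model
    printf("record detected with length: %d\n", (int)rlen[0]);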

  8. #8 — martinradev (Member)
    Hello Gotty and Shelwien and thank you for your answers.

    The test cases in our data do not exhibit any regular pattern. However, there is correlation between neighboring bits (left and above), and there is a bias towards 0.
    We are actually not exactly sure of the dimensions of the data, but we are able to give an estimate of the width and height.

    I will try your suggestion of examining the proposed model. I could also attempt to implement a model which gives a prediction based on the neighbors (top, left, top-left). Maybe it would make sense to give multiple predictions, each based on a different context, and then mix them as is currently done in the NN.
    So, if we have the following 2D grid and want to give a prediction for x:
    abc
    def
    ghx
    Maybe we can keep track of the distributions: p(x | f), p(x | h), p(x | e), p(x | (h+e+f), (g+d+b+c)),
    where p(x | f) should handle well the cases where the top bit is important for determining the current one, p(x | h) helps when the left bit determines the current one, p(x | e) helps when the top-left bit determines the current one, and p(x | (h+e+f), (g+d+b+c)) helps give an even stronger prediction if the neighborhood happens to have a prevalence of ones or zeroes.
    Initially, all distributions are initialized according to some known bias for all of our test cases. The distributions are updated adaptively as input is processed. A sketch of this idea is below.
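    A minimal sketch of such adaptive per-context counters in standalone C++ (the update rule, table sizes and names are illustrative assumptions; in paq8px each prediction would become one mixer input):
    Code:
    #include <cstddef>
    #include <vector>

    // Adaptive estimate of P(x==1) for one context value, nudged towards each observed bit.
    struct BitCounter {
        int p = 2048;                                 // P(1) scaled to [0,4095]; put a known bias here
        void update(int bit) { p += ((bit << 12) - p) >> 5; }
    };

    struct NeighborModel {
        size_t width;                                 // known/estimated row width in bits
        std::vector<int> hist;                        // bits seen so far
        BitCounter pf[2], ph[2], pe[2], joint[8];     // p(x|f), p(x|h), p(x|e), p(x|f,h,e)

        explicit NeighborModel(size_t w) : width(w) {}

        // Out-of-range lookups (including unsigned wrap-around) fall back to 0.
        int bit(size_t i) const { return i < hist.size() ? hist[i] : 0; }

        // Neighbor bits of the next position: f = top, h = left, e = top-left.
        void ctx(int& f, int& h, int& e) const {
            size_t i = hist.size();
            f = bit(i - width); h = bit(i - 1); e = bit(i - width - 1);
        }

        void predict(int out[4]) const {              // four predictions, one per context
            int f, h, e; ctx(f, h, e);
            out[0] = pf[f].p; out[1] = ph[h].p; out[2] = pe[e].p;
            out[3] = joint[f << 2 | h << 1 | e].p;
        }

        void learn(int x) {                           // update all four counters with the observed bit
            int f, h, e; ctx(f, h, e);
            pf[f].update(x); ph[h].update(x); pe[e].update(x);
            joint[f << 2 | h << 1 | e].update(x);
            hist.push_back(x);
        }
    };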

    Does this make sense to you?

  9. #9 — Shelwien (Administrator)
    > Does this make sense to you?

    I'd suggest looking at the image models in paq; they kinda already do what you described, for all kinds of pixel types.

    Also, maybe find paq8p or an even earlier version - the paq8px source is 4x larger, but those improvements likely don't help you in any way.

    And yes, you can gather statistics for all relevant contexts, compute contextual predictions, then mix them.
    It's also relatively easy to make a context optimizer, e.g. https://sites.google.com/site/toffer..._0.6_100206.7z

    Btw, it's not a good idea to use paq to find good contexts - I mean something like adding counters with all possible contexts
    and then tracking the corresponding mixer weights. It might work, but it's not reliable, because frequently an extra context
    would require specific counter parameters to improve compression, and paq doesn't include a parametric counter.
    For example, imagine a context where only "10" or "01" combinations occur at even positions... in combination
    with a counter that simply counts how many 0 and 1 bits it sees and estimates the probability of the next bit being 1 as (n1+1)/(n0+n1+2).
    It would always produce estimates of ~0.5, which is bad for this context... but in other cases this same counter
    can be a reasonable choice.
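    A tiny standalone illustration of that failure mode (assumed demo code, not from paq): on perfectly alternating bits, the Laplace counter's estimate sits near 0.5 even though each bit is fully determined by its position.
    Code:
    #include <cstdio>

    int main() {
        int n0 = 0, n1 = 0;
        for (int i = 0; i < 1000; i++) {
            int bit = (i % 2 == 0);                   // "1010...": fully predictable from position
            double p1 = (n1 + 1.0) / (n0 + n1 + 2.0); // Laplace estimate of P(next bit == 1)
            if (i > 0 && i % 250 == 0)
                std::printf("bit %4d: P(1) = %.3f\n", i, p1);
            if (bit) n1++; else n0++;
        }
        // P(1) stays ~0.5 throughout, so this counter extracts no information here,
        // while a position-aware (parametric) counter would predict perfectly.
        return 0;
    }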

  10. #10 — Gotty (Member)
    I'm seconding Shelwien's post.

    +:
    >>We are actually not exactly sure of the dimensions of the data, but we are able to give an estimate of the width and height.

    I suppose that you do know the width for each file; it's just not really possible to detect it automatically. Is this the case?
    (You have to know the width: without the width you don't know what is "above".)
    It's a fixed width per file, isn't it? Is the file byte-oriented?

  11. #11 — martinradev (Member)
    An issue I am facing:

    I am trying to integrate CTW as a model into paq8px to check whether it improves the predictions for our test data.
    I already have the integration done, and I can see that the trace of predictions is the same as in the CTW compressor I lifted the implementation from. However, the archive size doesn't change, so I must be doing something wrong when providing the probability to the mixer.
    I also locally modified paq8px v182 to expose dynamically which models to use and to enable/disable the APM. This was done to find out which models have a negative effect or do not contribute for our data.

    For my test data, I disable all models but the ExeModel and the CTWModel, and observe that the compressed archive is about the size of the input file, which means that the predicted probability is somehow not used by the mixer, or I am doing something horribly wrong.
    I usually pick the ExeModel as it produces really poor results for our test data, and I was hoping that the newly integrated model would produce good predictions and thus lead to an overall small archive size.

    At a high level, the code looks like the following:

    Code:
    class CTWModel {
    public:
        static constexpr int MIXERINPUTS = 1;
        static constexpr int MIXERCONTEXTS = 0;
        static constexpr int MIXERCONTEXTSETS = 0;
        void mix(Mixer& m) {
            int p = get_prob();   // prediction of a 1 bit, in the range [0, 4095]
            m.add(stretch(p));
        }
    };
    Then, in the ContextModel constructor, I add the contribution from CTWModel when creating the mixer:
    Code:
    m = MixerFactory::CreateMixer(
        ... + CTWModel::MIXERINPUTS,
        ... + CTWModel::MIXERCONTEXTS,
        ... + CTWModel::MIXERCONTEXTSETS);
    Then, in ContextModel::p(), I mix in the contribution from CTWModel:

    Code:
    int ContextModel::p() {
        ...
        CTWModel& model = models.ctwModel();
        model.mix(*m);
        ...
    }
    Am I doing something wrong with MIXERCONTEXTS and MIXERCONTEXTSETS? I do not really understand the complete PAQ pipeline, so I might be misusing parts.

    I would be happy if somebody could give me some hints.

  12. #12 — martinradev (Member)
    Just wanted to bump this up.

  13. #13 — Shelwien (Administrator)
    1) Are you sure that get_prob() gives the probability of 1?

    2) Normally you'd need at least one context and context set.
    It may be simpler to insert an extra input into an existing mix.
    Just see here how the mixer works:
    https://github.com/hxim/paq8px/blob/...Mixer.hpp#L124
    https://github.com/hxim/paq8px/blob/...Mixer.hpp#L111
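    A minimal sketch of point (2), assuming the paq8px convention that m.add() pushes a stretched prediction and m.set(cx, range) selects a context within a context set (exact signatures may differ between versions):
    Code:
    // Advertise one context in one context set instead of zero...
    static constexpr int MIXERINPUTS = 1;
    static constexpr int MIXERCONTEXTS = 1;
    static constexpr int MIXERCONTEXTSETS = 1;

    void mix(Mixer& m) {
        m.add(stretch(get_prob()));  // the prediction, as before
        m.set(0, 1);                 // select context 0 out of a range of 1 for this set
    }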
