
Thread: Test & Benchmark: Incomparable

  1. #1
    Member
    Join Date
    Oct 2007
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've seen there are a lot of brand new *wonderful* detailed tests about compression; I thank all the authors for their work, because those tests already provide useful feedback to users and authors.
    Tests are done with different strategies, and that's good because they provide feedback on different aspects, features and constraints.

    However, I find it very difficult to compare them, even after deciding on one specific feature or constraint to evaluate, mostly because:
    1) The test file set is in most cases "proprietary"; it's not clear how the results apply to other cases.
    2) Speed and throughput depend on the system, mainly on the CPU and RAM type (e.g. memory bus speed, CPU cache, etc.).
    3) Measurement of memory usage depends on the PC configuration and on the method (scripts/tools) used.

    It's obvious that more accurate benchmarks can be obtained using comparable and repeatable tests.
    My ideas to achieve this are:

    A) Speed/throughput should be relative: e.g., 100% is assigned to a well-known method, for instance 7-Zip's ZIP (or any other widely used compression format), and all other speeds are computed relative to it. That way, results would be comparable across different PCs and nearly (though not completely) independent of CPU/RAM type and frequency (see the sketch after this list).

    B) The test file set should be freely downloadable, so that anyone would be able to repeat the tests and check the results on their own system. There are lots of test sets (mainly text files), but others could be defined using, for instance, SourceForge or Freshmeat projects. That solution would also provide file hosting and fast access (with lots of servers already available) for downloading very large groups of files.

    C) Optionally, if one unique tool/method were chosen to perform the tests, especially to measure speed and RAM usage, then users would be able not only to check results but also to submit their own to improve the accuracy of the tests. Obviously, differences larger than a threshold, for instance 10% for speed/RAM usage (i.e. wrong results), would be discarded. The advantage would be an averaged measure (speed/memory usage), which would give very accurate data.
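
    As a sketch of the relative-speed idea from point A: converting absolute throughput to a percentage of a reference compressor is just one division. The compressor names and figures below are invented for the example, not real benchmark data.
    Code:
    #include <cstdio>

    struct Result {
        const char* name;
        double megabytes;   // size of the test set
        double seconds;     // measured compression time on this machine
    };

    int main() {
        // Hypothetical figures: the reference does 100 MB in 10 s (10 MB/s = 100%).
        const Result reference = {"zip (reference)", 100.0, 10.0};
        const Result others[] = {
            {"compressor A", 100.0, 25.0},   // slower than the reference
            {"compressor B", 100.0,  8.0},   // faster than the reference
        };
        const double ref_speed = reference.megabytes / reference.seconds;

        std::printf("%-16s %6.1f%%\n", reference.name, 100.0);
        for (const Result& r : others) {
            double relative = 100.0 * (r.megabytes / r.seconds) / ref_speed;
            std::printf("%-16s %6.1f%%\n", r.name, relative);  // comparable across machines
        }
        return 0;
    }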

    IMHO:
    A) can be easily implemented (just by adding one column to many test tables);
    B) requires some general discussion to pick different sets of files, especially binary types, because the large text test sets are very well known (e.g., one for media files such as Vegastrike, one for executables such as MediaCoder, etc.);
    C) unfortunately I don't know of any tool able to take such a measurement (process time or speed and maximum RAM usage), especially one available for both Windows and Linux... (a possible starting point is sketched below)
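
    Regarding point C, here is a minimal Linux-only sketch of such a measuring tool: it runs a command and reports the wall-clock time and the child's peak resident memory. A real cross-platform version would still need separate Windows code (e.g. around GetProcessTimes and GetProcessMemoryInfo), so treat this only as an illustration of the idea.
    Code:
    #include <cstdio>
    #include <sys/resource.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
            return 1;
        }
        timeval start, end;
        gettimeofday(&start, nullptr);

        pid_t pid = fork();
        if (pid == 0) {                          // child: run the compressor
            execvp(argv[1], &argv[1]);
            _exit(127);                          // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);
        gettimeofday(&end, nullptr);

        rusage usage;
        getrusage(RUSAGE_CHILDREN, &usage);      // resources of the finished child
        double wall = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
        std::printf("wall time: %.3f s\n", wall);
        std::printf("peak RSS : %ld KiB\n", usage.ru_maxrss);   // kilobytes on Linux
        return WEXITSTATUS(status);
    }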

    What's your opinion about these points?
    What are your ideas?

  2. #2
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Gish
    A) Speed/throughput should be relative: e.g., 100% is assigned to a well-known method
    That's one step closer to real-life applicability; however, CPUs are a little bit different, so 40% speed with one compressor could be 35% on a different CPU... Still, it's basically the best idea available.
    Quote Originally Posted by Gish
    B) The test file set should be freely downloadable
    There's the UCLC testset and LTCB to begin with.
    Quote Originally Posted by Gish
    C) Optionally, if one unique tool/method were chosen to perform the tests
    Well, there are a lot of timers for Windows; in the worst case they could also be used through Wine on Linux... As for RAM usage, each OS should have something that works like taskmgr; that, combined with the RAM usage details provided by the program's developer, should get pretty close to real usage.

  3. #3
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    We shouldn't forget one thing (I sometimes do, too): this scene tries to get the best compression. That's the aim of PAQ. Those experiments don't claim to be useful; they claim to be the best.

    But all in all, I'm also for the useful archiver.

  4. #4
    Member
    Join Date
    Oct 2007
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Black_Fox
    There's the UCLC testset and LTCB to begin with.
    Ahem... we are talking about compression, so no wonder even website names are compressed (luckily they're not encrypted...). I had to do some searching to discover what UCLC and LTCB are...
    I suppose:
    UCLC = Ultimate Command Line Compressors ( http://uclc.info )
    LTCB = Large Text Compression Benchmark
    ( http://cs.fit.edu/~mmahoney/compression/text.html )

    UCLC has interesting files, including s.c.o.u.r.g.e., which is available at SourceForge too. I suppose all the test sets are free content and there's no usage limitation. I suggested open source to avoid problems with rights, etc., but any other free content would be fine.
    LTCB uses the Hutter Prize large text file; it's a good test but does not include binary content.

    Quote Originally Posted by Black_Fox
    Well, there are a lot of timers for Windows; in the worst case they could also be used through Wine on Linux... As for RAM usage, each OS should have something that works like taskmgr; that, combined with the RAM usage details provided by the program's developer, should get pretty close to real usage.
    A single tool to measure speed and RAM usage would be better, because collecting and ordering the data also takes a lot of time. A single tool would always provide data with the same formatting, so it would be possible to process it with some scripts.

    Quote Originally Posted by Simon Berger
    But all in all, I'm also for the useful archiver.
    What does "useful archiver" mean?
    People have different needs and opinions about that definition...
    Having all the data together will help compare the archivers using different criteria. For instance, I can decide that decompression speed is as important as compression speed, or that speed is more important than compression ratio.
    I do not need to define "a priori" rules (anyone can make their own rule); what I need is only a set of comparable (and repeatable) measures, possibly over different test cases.
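
    For instance, a rough sketch of how each user could apply their own criteria on top of the published, comparable measures: a simple weighted score. The archiver names, weights and figures below are my own invented example, not fixed rules.
    Code:
    #include <cstdio>

    // Published, comparable measures for one archiver (all figures invented).
    struct Measures {
        const char* name;
        double ratio;          // compressed size / original size (lower is better)
        double comp_speed;     // relative compression speed, % of the reference
        double decomp_speed;   // relative decompression speed, % of the reference
    };

    int main() {
        // My own weights: decompression speed matters as much as compression speed.
        const double w_ratio = 0.5, w_comp = 0.25, w_decomp = 0.25;
        const Measures archivers[] = {
            {"archiver A", 0.35, 60.0, 80.0},
            {"archiver B", 0.30, 20.0, 25.0},
        };
        for (const Measures& a : archivers) {
            // Higher is better: reward a small ratio and high relative speeds.
            double score = w_ratio * (1.0 - a.ratio) * 100.0
                         + w_comp * a.comp_speed
                         + w_decomp * a.decomp_speed;
            std::printf("%-12s score %.1f\n", a.name, score);
        }
        return 0;
    }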

  5. #5
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    < What does "useful archiver" mean?
    < People have different needs and opinions about that definition...

    That's a common phrase, but in my opinion there is only a small set of different uses here. In real use you never choose PAQ because it's too slow.
    For backup you will choose a program with fast de-/compression. For all the others you have to look for a good compression level that isn't too slow to decompress. That's all I meant. PAQ isn't useful at all. But it is the best...

  6. #6
    Member
    Join Date
    Oct 2007
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Simon Berger
    That's a common phrase, but in my opinion there is only a small set of different uses here. In real use you never choose PAQ because it's too slow.
    I agree: PAQ is the best state-of-the-art compression; at the moment its only practical purpose is to provide a way to win the Hutter Prize...

    Quote Originally Posted by Simon Berger
    For backup you will choose a program with fast de-/compression. For all the others you have to look for a good compression level that isn't too slow to decompress. That's all I meant. PAQ isn't useful at all. But it is the best...
    I believe that the increase in CPU processing power will make PAQ usable in the coming years (probably not before 5 years), but only if it becomes possible to write PAQ code in an efficient multitask/multithreaded style. CPUs are no longer getting faster; they are simply increasing the number of cores, so code must be written and optimized for parallel processing to exploit their power...
    That will be a great challenge...

  7. #7
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Gish
    we are talking about compression, so no wonder even website names are compressed
    Sorry about that

    Quote Originally Posted by Gish
    only if it becomes possible to write PAQ code in an efficient multitask/multithreaded style.
    That seems impossible for the given method; the only feasible way seems to be to cut the incoming data into blocks, reducing both the time needed and the compression gained...
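
    A minimal sketch of that block-splitting approach: cut the input into independent blocks and compress each in its own thread. The compress_block() below is only a placeholder, not PAQ; with a real coder, each block would start with empty statistics, which is exactly where the compression loss comes from.
    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>

    // Placeholder "compressor": stands in for a PAQ-style coder whose model
    // would start from scratch for every block.
    static std::string compress_block(const std::string& block) {
        return block;   // identity, for illustration only
    }

    int main() {
        const std::string input(1 << 20, 'a');   // 1 MiB of dummy data
        const std::size_t num_blocks = 4;
        const std::size_t block_size = (input.size() + num_blocks - 1) / num_blocks;

        std::vector<std::string> compressed(num_blocks);
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < num_blocks; ++i) {
            workers.emplace_back([&, i] {
                std::size_t begin = i * block_size;
                std::size_t len = std::min(block_size, input.size() - begin);
                compressed[i] = compress_block(input.substr(begin, len));
            });
        }
        for (auto& t : workers) t.join();

        std::size_t total = 0;
        for (const auto& c : compressed) total += c.size();
        std::printf("compressed %zu bytes in %zu independent blocks\n", total, num_blocks);
        return 0;
    }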

  8. #8
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    To be honest, I don't see much sense in all these detailed benchmarks, except making them just for fun. They are too rough in their results. IMHO it would be better to group compressors by lowest compression speed as the category and make a compression testset for detailed testing on the concrete machine with the user's data. There are different algorithms and archivers for different types of data - log files, raw photos, audio and video, program and game distributions, science data, etc... I think there should be a testset with the best archivers using different algorithms, and something like a wizard that asks the user some questions about the minimum speed limit, decompression speed, memory limit, necessary features of the archiver, etc. Then you choose your usually-compressed data (it must be big enough to test) and go have some coffee... or go to sleep, in rare cases. The testing software should gather some work numbers, check for correct decompression, and so on. As the result you should get a nice HTML report where you will read: "FreeArc rules! Anyway." Except for the last joke, this is some kind of proposal, and if no one does this, maybe some day I'll do it myself.

  9. #9
    Member
    Join Date
    Oct 2007
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Black_Fox
    That seems impossible for the given method; the only feasible way seems to be to cut the incoming data into blocks, reducing both the time needed and the compression gained...
    I'm really not an expert on PAQ, but surely that would require some changes; PAQ (also) uses a neural network with continuous training, so splitting the data into blocks would require dividing that network too, or having different networks... Well, it's not impossible, but it has so many implications I cannot imagine them all...

    Quote Originally Posted by nimdamsk
    The testing software should gather some work numbers, check for correct decompression, and so on. As the result you should get a nice HTML report where you will read: "FreeArc rules! Anyway." Except for the last joke, this is some kind of proposal, and if no one does this, maybe some day I'll do it myself.
    Really???
    But if you can build that wizard, you really are a sort of wizard...

  10. #10
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Quote Originally Posted by Gish
    But if you can build that wizard, you really are a sort of wizard...
    I don't think this task would be hard for a professional programmer - just another kind of GUI. But as I am not a programmer... Still, that won't stop me one day.

  11. #11
    Member
    Join Date
    May 2008
    Location
    Earth
    Posts
    115
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Gish
    Quoting Black_Fox: That seems impossible for the given method; the only feasible way seems to be to cut the incoming data into blocks, reducing both the time needed and the compression gained...
    PAQ uses a large number of independent models; I can't see why they can't work simultaneously...

  12. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by IsName
    PAQ uses a large number of independent models; I can't see why they can't work simultaneously
    I thought the same. Moreover, they really can work simultaneously!

    ... but only for compression. Decompression needs to update each model with the real data in order to keep its statistics current, fine-tuned for the concrete file being compressed. But we don't know the real data until all the models have finished, their predictions have been combined, and the next data bit has been decoded using this combined prediction!

    So, in the era of those 80-core processors, PAQ may become an ideal algorithm for fast backups.

  13. #13
    Member
    Join Date
    Oct 2007
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by IsName
    PAQ uses a large number of independent models; I can't see why they can't work simultaneously...
    I admit I understand very little of PAQ... and also of LPAQ...
    There are plenty of neural networks, and it's not really clear whether they are trained on all the input data: parallel processing would divide the input data, but can the networks be shared? And I don't even understand what happens to the weights, whether they are saved in the output archive or not...
    The weights have special properties; much depends on whether you can share them or not...
    Unfortunately I don't even know C++... sooner or later I'll have to learn it.

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Gish
    Unfortunately I don't even know C++... sooner or later I'll have to learn it.
    I never read the PAQ sources, only Matt's paper.

    Basically, it's not harder than PPM:
    1) Encoding goes one bit per cycle.

    2) Each model counts the number of 0 and 1 bits seen in previous situations that are the "same" as the current one, but their definitions of "same" differ - the same 1-8 previous bytes, the same previous full word, and so on.

    3) All the predictions are combined, just summed up in the simplest case - so we know that 0 was seen X times in the "same" situations and 1 was seen Y times, and conclude that the probability of a 0 bit is X/(X+Y).

    4) Then we encode the actual bit with an arithmetic encoder using this probability and update all the models with the actual bit.

    On decoding, we do just the same. The problem is that we need to update the models with the actual bit before making the prediction for the next one, and we can't do that until all the models for the current bit have been evaluated and the actual bit has been decoded using the probability computed in the above-mentioned way.

    So encoding may be parallelized, because all we need is to update each model with the actual bit before calculating X/Y for the next one, and that's easy - we know all the actual data in advance. But at the decoding stage, we are stalled at each step, waiting for the actual bit to be decoded.
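
    A toy sketch of that per-bit loop, not the real PAQ code: just two counting models (contexts are the last 8 and the last 16 bits of the stream), predictions combined by summing counts as in step 3, and the arithmetic coder of step 4 left out.
    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One counting model: counts the 0s and 1s seen for each value of its
    // context, where the context is simply the last `bits` bits of the stream.
    struct CountModel {
        std::vector<uint32_t> n0, n1;
        uint32_t mask;
        explicit CountModel(unsigned bits)
            : n0(1u << bits, 0), n1(1u << bits, 0), mask((1u << bits) - 1) {}
        uint32_t x(uint32_t ctx) const { return n0[ctx & mask]; }   // times 0 was seen
        uint32_t y(uint32_t ctx) const { return n1[ctx & mask]; }   // times 1 was seen
        void update(uint32_t ctx, int bit) { (bit ? n1 : n0)[ctx & mask]++; }
    };

    int main() {
        const unsigned char data[] = "abracadabra abracadabra";   // example input
        CountModel m8(8), m16(16);   // "same" = same last 8 bits / same last 16 bits
        uint32_t ctx = 0;
        double correct = 0, total = 0;

        for (unsigned char byte : data) {
            for (int i = 7; i >= 0; --i) {
                int bit = (byte >> i) & 1;
                // Step 3: combine by summing counts (+1 avoids zero probabilities).
                uint32_t X = 1 + m8.x(ctx) + m16.x(ctx);
                uint32_t Y = 1 + m8.y(ctx) + m16.y(ctx);
                double p0 = double(X) / (X + Y);   // estimated probability of a 0 bit
                // Step 4 would feed p0 and `bit` to an arithmetic encoder; here we
                // only check how often the prediction was right.
                correct += (bit == 0) ? (p0 > 0.5) : (p0 <= 0.5);
                total += 1;
                // Update every model with the actual bit, then extend the context.
                m8.update(ctx, bit);
                m16.update(ctx, bit);
                ctx = (ctx << 1) | bit;
            }
        }
        std::printf("predicted %.1f%% of bits correctly\n", 100.0 * correct / total);
        return 0;
    }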

  15. #15
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    So PAQ could be the first program in the world that compresses a lot faster than it decompresses?

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    But the CPU time would still be nearly equal.

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Black_Fox
    So PAQ could be the first program in the world that compresses a lot faster than it decompresses?
    SR2 compresses a little faster than it decompresses. It would be useful for making backups where you compress often but decompress rarely. Just the opposite of distributing files where you compress once and decompress often.

    PAQ could be made parallel, but you have to synchronize after each bit. Since all the work is done in the model, decompression speed will still be the same as compression.

    IMHO, zip is the most useful compressor. Disk space is cheap. Standardization, stable software, and speed are more important than good compression ratio. Notice I distribute PAQ as a zip file. PAQ and LTCB are research projects for AI.

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Matt Mahoney
    PAQ could be made parallel, but you have to synchronize after each bit
    Why? Each model can just output a stream consisting of an X/Y pair for each input bit. Then these X/Y streams from all the models can be combined by a separate procedure.

    I.e., we have the basic model's output: 1/1, 1/1...
    the first model's output, say: 0/0, 0/1, 1/1...
    the second model's output, say: 0/0, 0/0, 0/0, 0/1...
    and so on

    Then all this temporary data is combined. It's a bit like the block-static Huffman compression in ZIP, where we generate the code words in a first pass and then encode them with the Huffman coder in a second one.
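
    A minimal sketch of this two-pass idea on the compression side (toy counting models, nothing like real PAQ): pass 1 lets each model produce its X/Y stream in its own thread; pass 2 combines the streams sequentially, where the arithmetic encoder (omitted here) would consume the probabilities.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct Pair { uint32_t x, y; };   // counts of 0s (x) and 1s (y) for one bit position

    // One counting model: context = the last `bits` bits of the stream. It knows
    // the whole input in advance, so it can emit its X/Y stream independently.
    static std::vector<Pair> run_model(const std::vector<int>& bitstream, unsigned bits) {
        std::vector<uint32_t> n0(1u << bits, 0), n1(1u << bits, 0);
        uint32_t ctx = 0, mask = (1u << bits) - 1;
        std::vector<Pair> out;
        out.reserve(bitstream.size());
        for (int bit : bitstream) {
            out.push_back({n0[ctx & mask], n1[ctx & mask]});   // prediction for this position
            (bit ? n1 : n0)[ctx & mask]++;                     // update with the actual bit
            ctx = (ctx << 1) | bit;
        }
        return out;
    }

    int main() {
        // Example input, expanded into a bitstream (most significant bit first).
        const unsigned char text[] = "abracadabra abracadabra";
        std::vector<int> bits;
        for (unsigned char c : text)
            for (int i = 7; i >= 0; --i) bits.push_back((c >> i) & 1);

        // Pass 1: every model produces its X/Y stream in parallel.
        std::vector<Pair> s8, s16;
        std::thread t8([&] { s8 = run_model(bits, 8); });
        std::thread t16([&] { s16 = run_model(bits, 16); });
        t8.join();
        t16.join();

        // Pass 2: combine the streams sequentially; this is where the arithmetic
        // encoder would consume the probabilities.
        for (std::size_t i = 0; i < bits.size(); ++i) {
            uint32_t X = 1 + s8[i].x + s16[i].x;
            uint32_t Y = 1 + s8[i].y + s16[i].y;
            double p0 = double(X) / (X + Y);
            (void)p0;   // would be passed to the encoder together with bits[i]
        }
        std::printf("combined %zu bit predictions from 2 models\n", bits.size());
        return 0;
    }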

