
Thread: paq8px

  1. #1201
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 169 Times in 125 Posts
    Quote Originally Posted by Gotty View Post
    Could you please find out the file position where they (the compressed files on the two systems) start to differ? (paq8px_v140.exe -t A10.jpg.paq8px140)
    The i3-3240 is the computer I use at work and I can do the test only from tomorrow.
    Tests with i7-4710HQ on the files created by i3-3240:

    paq8px_v140 -t A10.jpg.paq8px140
    Comparing A10.jpg 842468 bytes -> differ at 649

    paq8px_v140 -t FlashMX.pdf.paq8px140
    Comparing FlashMX.pdf 4526946 bytes -> 100.00% then the program loops (I waited ~3 hours before stopping the program)
    Quote Originally Posted by Gotty View Post
    Could you please try compressing A10.jpg with the attached debug compile on your i3-3240?
    Yes, tomorrow.
    Quote Originally Posted by Gotty View Post
    Could you verify that you have the latest BIOS version for your motherboard (on your i3-3240 system)?
    The 3rd gen intel cpu errata is long (as usual) but a couple of bugs are fixed in newer BIOS releases.
    Ok, I'll try to verify the BIOS version.
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    What system do you use? Win 7? 32 or 64 bit?
    i3-3240: Windows 7, 64 bit.
    i7-4710HQ: Windows 8.1, 64 bit.
    Quote Originally Posted by mpais View Post
    I bet further tuning can improve results even more (the parameters used were from EMMA, after all).
    I have done a lot of tests with various parameter values.
    I also tried to make some of them dynamic (minimum rate, minimum collected info before reset, ...) and I failed (sometimes lower values are better and sometimes higher ones, but I have not found a way to learn dynamically which).
    Maybe we can improve ALR by changing the approach, but I don't know how.

  2. #1202
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,274
    Thanks
    803
    Thanked 545 Times in 415 Posts
    Some tests of v141 - my testbed and 4 corpuses with best settings. I've checked whether the "a" option gives an improvement this time and updated the "best settings" where it does.
    Generally quite nice improvements on all bigger files.
    For enwik8 -9eta I've got numbers as below:

    16'973'268 - Paq8px_v139
    16'919'232 - Paq8px_v140
    16'872'591 - Paq8px_v141

    It looks like enwik9 -9eta should be about 132'1xx'xxx bytes - it's just a simple extrapolation; however, a tuned ALR could give somewhat better results on bigger files.
    Attached thumbnails: paq8px_v141.jpg, paq8px_v141_4_corpuses.jpg

  3. Thanks (3):

    Gotty (9th April 2018),Mauro Vezzosi (9th April 2018),mpais (9th April 2018)

  4. #1203
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 169 Times in 125 Posts
    Quote Originally Posted by Gotty View Post
    Could you please try compressing A10.jpg with the attached debug compile on your i3-3240?
    i3-3240:
    Code:
    Paq8px_v140_VC_debug+asserts_on -7 A10.jpg
    paq8px archiver v140 (C) 2018, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX.
    Using SSE2 neural network functions.
    
    Creating archive A10.jpg.paq8px140 in single file mode...
    
    Filename: A10.jpg (842468 bytes)
    Block segmentation:
     0           | jpeg             |    842468 bytes [0 - 842467]
    File input size       : 842468
    File compressed size  : 629996
    -----------------------
    Total input bytes     : 842468
    Total archive size    : 630005
    
    Time 318.19 sec, used 1116 MB (1170967370 bytes) of memory
    
    ----------
    
    Paq8px_v140_VC_debug+asserts_on -t A10.jpg.paq8px140
    paq8px archiver v140 (C) 2018, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX.
    Using SSE2 neural network functions.
    
    Comparing A10.jpg 842468 bytes -> identical
    Time 305.45 sec, used 1116 MB (1170965303 bytes) of memory
    
    ----------
    
    paq8px_v140 -t A10.jpg.paq8px140 (created with Paq8px_v140_VC_debug+asserts_on)
    paq8px archiver v140 (C) 2018, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX.
    Using SSE2 neural network functions.
    
    Comparing A10.jpg 842468 bytes -> differ at 650
    Time 132.21 sec, used 1116 MB (1170965303 bytes) of memory
    Quote Originally Posted by Gotty View Post
    Could you verify that you have the latest BIOS version for your motherboard (on your i3-3240 system)?
    The 3rd gen intel cpu errata is long (as usual) but a couple of bugs are fixed in newer BIOS releases.
    At boot I don't see any message about entering the BIOS; I pressed some keys (Del, F2, F8, ...), but the BIOS setup doesn't start.
    I found the following info; I don't know if it is enough:
    Code:
    msinfo32
    Version/BIOS date   American Megatrends Inc. 0902, 25/11/2013
    SMBIOS Version      2.7
    
    wmic bios get smbiosbiosversion
    SMBIOSBIOSVersion
    0902
    
    wmic baseboard get product,Manufacturer,version,serialnumber
    Manufacturer           Product  Version
    ASUSTeK COMPUTER INC.  B75M-A   Rev X.0x
    
    regedit HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System
    SystemBiosDate      25/11/13
    SystemBiosVersion   ALASKA - 1072009 BIOS Date: 11/25/13 15:28:05 Ver: 09.02
    However, I'm not sure I want to update the BIOS version...

  5. Thanks:

    Gotty (9th April 2018)

  6. #1204
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    I have done a lot of tests with various parameter values.
    I also tried to make some of them dynamic (minimum rate, minimum collected info before reset, ...) and I failed (sometimes lower values are better and sometimes higher ones, but I have not found a way to learn dynamically which).
    Maybe we can improve ALR by changing the approach, but I don't know how.
    Did you try increasing the precision of the learning rate? I'm running a few tests, just doubling the precision seems to give a small improvement.
    I don't have a lot of time right now for coding, and I'm focusing more on Fairytale, which should hopefully be a lot more extensible than the architecture of paq8. We could use your help on it.

  7. #1205
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Aha, it seems that the debug version works correctly (producing the expected 630005 bytes).

    There is a newer BIOS version for your motherboard dated 2014/08/29 vs. your current 2013/11/25.
    I understand your concern - updating the BIOS is a dangerous operation. And it may not fix anything: the site states that "Enhance compatibility with some USB devices." - nothing about a cpu microcode update.

    However there are cpu microcode updates for your i3-3240. The latest one (2018/03/12) came out probably to fix the Spectre/Meltdown issue. But there are older ones from 2017 and 2015. That means there were some bug fixes in the last couple of years for your cpu, and they may contain a fix for what you experienced with paq8px.
    You can ignore the word "Linux" - bin-files with microcodes are appropriate for both Linux and Windows. The Windows one is installed from Windows Update automatically. ... Well... Usually....

    If you search for the word "genuineintel" in your c:\Windows folder you will probably see 4-6-8 files. On my Win 7 it is:
    Code:
    c:\Windows\winsxs\amd64_microsoft-windows-m..update-genuineintel_31bf3856ad364e35_6.1.7601.18848_none_1ac9921ec901a2ca\mcupdate_GenuineIntel.dll
    Date is: 2015/05/09. This is a microcode patch downloaded by Windows Update.
    Do you have a file similar to mine? What is its date?

  8. Thanks:

    Mauro Vezzosi (10th April 2018)

  9. #1206
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 169 Times in 125 Posts
    Quote Originally Posted by Gotty View Post
    There is a newer BIOS version for your motherboard dated 2014/08/29 vs. your current 2013/11/25.
    [...]
    However there are cpu microcode updates for your i3-3240.
    There are some newer BIOS versions and many "Linux* Processor Microcode Data File" after my 2010/11/21: I guess I can download only the latest version of both, right? (Or do I have to download all cpu microcode updates after my current version?)
    I'll see better in the coming days, thank you very much!
    Quote Originally Posted by Gotty View Post
    Do you have a file similar to mine? What is its date?
    Code:
    dir C:\Windows\*genuineintel* /s
    C:\Windows\System32
    21/11/2010  05:24           299.392 mcupdate_GenuineIntel.dll
    C:\Windows\winsxs\amd64_microsoft-windows-m..update-genuineintel_31bf3856ad364e35_6.1.7601.17514_none_1ae611d0c8ecd885
    21/11/2010  05:24           299.392 mcupdate_GenuineIntel.dll
    C:\Windows\winsxs\Backup
    21/11/2010  05:27             2.532 amd64_microsoft-windows-m..update-genuineintel_31bf3856ad364e35_6.1.7601.17514_none_1ae611d0c8ecd885.manifest
    21/11/2010  05:27           299.392 amd64_microsoft-windows-m..update-genuineintel_31bf3856ad364e35_6.1.7601.17514_none_1ae611d0c8ecd885_mcupdate_genuineintel.dll_940e6a7f
    C:\Windows\winsxs\Manifests
    21/11/2010  05:17             2.532 amd64_microsoft-windows-m..update-genuineintel_31bf3856ad364e35_6.1.7601.17514_none_1ae611d0c8ecd885.manifest

  10. #1207
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    You don't really have to update your cpu's microcode. It turned out that the "difference" problem is a gcc+paq8px problem (which I'm still investigating).

    But if you'd like to:
    You need the latest one only.
    Or better: the latest 2017 one. The latest 2018 one probably has the Spectre/Meltdown patch (I haven't verified it). And for performance reasons I would avoid that. See it here:
    https://encode.su/threads/2935-Meltd...ression-codecs
    This is how the patching mechanism works:
    https://superuser.com/questions/9352...d-to-processor
    This is how you can do the patching manually in Windows:
    https://forums.guru3d.com/threads/wi...e-bios.418806/
    or
    http://forum.notebookreview.com/thre...indows.787152/
    (I haven't done that personally, I just verified that I have the latest microcode for my home desktop - and I have.)
    Last edited by Gotty; 11th April 2018 at 04:57. Reason: Wording

  11. #1208
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts

    paq8px_v141fix1

    -- Fixes --
    - Fixed a bug in the stemmers reported by Mauro - gcc incorrectly optimized memcpys believing that the memory ranges do not overlap when the -ftree-vrp flag was 'on' (it is 'on' by default on -O2 and higher)
    - Fixed some minor bugs and typos

    -- Changes (user interface only) --
    - Print the selected vectorization info only if the user wants/needs to know.
    - In single file mode (almost) the same information was printed 3 times on screen (input file size). Now it is printed only twice (at the beginning of segmentation and in the summary) like in the old days.
    - Fixed wording (bytes vs size) - thank you, Darek.

    -- Remarks and notes --
    No model or compression improvements (archives must be binary compatible with paq8px_v141 compiled by VC++ or GCC with the -fno-tree-vrp flag or GCC with profile guided optimization)
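A minimal sketch of the kind of overlap problem named in the fix list above. This is not the paq8px stemmer code: `eraseInPlace` is a hypothetical helper that shifts the tail of a buffer left over itself, so source and destination overlap. `memcpy` has undefined behavior for overlapping ranges, which leaves an optimizer free to assume the ranges are disjoint; `memmove` is defined for this case.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not from paq8px): remove `n` characters starting at
// `pos` by shifting the tail of the string left inside the same buffer.
std::string eraseInPlace(std::string s, std::size_t pos, std::size_t n) {
    if (pos >= s.size()) return s;            // nothing to erase
    if (n > s.size() - pos) n = s.size() - pos; // clamp to the end of the string
    // Source (&s[pos + n]) and destination (&s[pos]) overlap whenever the tail
    // is longer than n. memmove handles that; memcpy would be undefined here.
    std::memmove(&s[pos], &s[pos + n], s.size() - pos - n);
    s.resize(s.size() - n);
    return s;
}
```

With this helper, `eraseInPlace("abcdef", 1, 2)` yields `"adef"`. Replacing the `memmove` with `memcpy` may appear to work in one build and fail in another, which would fit the build-dependent differences reported in this thread.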
    Last edited by Gotty; 11th April 2018 at 04:25.

  12. Thanks:

    Darek (11th April 2018)

  13. #1209
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Quote Originally Posted by mpais View Post
    and I'm focusing more on Fairytale, which should hopefully be a lot more extensible than the architecture of paq8.
    Hopefully sometime in the near future I'll come, too. I just have too much on my shoulders these days.

  14. #1210
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,274
    Thanks
    803
    Thanked 545 Times in 415 Posts
    Tests of v141 for enwik8/9:

    For enwik8 -9eta:
    16'973'268 - paq8px_v139
    16'919'232 - paq8px_v140
    16'872'591 - paq8px_v141

    132'175'153 - enwik9_1423.drt -s9[eta] - paq8px_v141 - it's close to 6th place on LTCB

  15. Thanks:

    Gotty (11th April 2018)

  16. #1211
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Quote Originally Posted by Darek View Post
    It looks like enwik9 -9eta should be about 132'1xx'xxx bytes - it's just a simple extrapolation; however, a tuned ALR could give somewhat better results on bigger files.
    It looks like you extrapolated just perfectly.
    Thank you Darek!
    Well done, Mauro!

  17. #1212
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,274
    Thanks
    803
    Thanked 545 Times in 415 Posts
    One more test of v141 for enwik8/9:

    For enwik8 -9eta:
    16'973'268 - paq8px_v139
    16'919'232 - paq8px_v140
    16'872'591 - paq8px_v141

    132'175'153 - enwik9_1423.drt -s9[eta] - paq8px_v141 - it's close to 6th place on LTCB
    137'450'799 - enwik9_1423 - paq8px_v141 - non-preprocessed enwik9 still loses about 4% to the DRT version.

  18. Thanks (2):

    Gotty (13th April 2018),Mauro Vezzosi (13th April 2018)

  19. #1213
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 169 Times in 125 Posts
    Quote Originally Posted by mpais View Post
    Did you try increasing the precision of the learning rate? I'm running a few tests, just doubling the precision seems to give a small improvement.
    Yes, it was one of the first attempts I made and it seemed worse.
    Then I also went in the opposite direction, increasing the step of the decrement.
    I did not save the tests with greater step precision; I only have those with lower precision, and the steps 7-5-3-2 and 7-5-3 seem to be quite good.
    Maybe I'll come back to ALR and do more tests.
    Code:
    v140 standard v140        v140      v140      v140      v141 standard v141      v141      v141       <- Version
    7-6-5-4-3-2-1 7-6-5-4-3-2 7-5-3-2-1 7-5-3-2   7-5-3     7-6-5-4-3-2   7-5-3-1   7-5-3-2   7-5-3      <- Rates steps
      630.005       630.005     630.005   630.005   630.005   629.978       630.005   630.005   630.005  A10.jpg
      860.718       859.370     861.200   859.569   858.635   857.948       859.849   858.879   858.275  AcroRd32.exe
      358.545       358.114     358.805   358.243   358.456   358.307       358.325   358.562   358.633  english.dic
    1.407.889     1.396.062   1.410.480 1.397.458 1.390.693 1.394.375     1.406.068 1.396.452 1.390.783  FlashMX.pdf
      238.722       237.004     238.914   237.123   236.800   237.085       238.372   237.168   236.868  FP.LOG
    1.217.204     1.216.578   1.217.489 1.216.524 1.215.980 1.214.793     1.216.309 1.215.780 1.215.469  MSO97.DLL
      467.634       467.302     467.770   467.343   467.159   467.148       467.529   467.239   467.136  ohs.doc
      505.681       501.960     505.866   502.086   500.748   501.973       506.006   502.093   500.755  rafale.bmp
      393.981       390.072     394.641   390.272   388.657   389.737       393.653   390.020   388.684  vcfiu.hlp
      325.883       325.064     326.164   325.215   325.390   325.177       325.809   325.286   325.476  world95.txt
    6.406.262     6.381.531   6.411.334 6.383.838 6.372.523 6.376.521     6.401.925 6.381.484 6.372.084  Total
    
      186.821       187.034     186.633   186.865   187.141   186.784       186.554   186.708   186.868  book1
      582.078       581.281     582.420   581.285   581.367   581.064       581.579   581.211   581.310  Calgary corpus.tar
      315.206       314.767     315.302   314.754   315.083   314.720       314.664   314.613   315.013  Canterbury corpus.tar
    6.445.221     6.414.764   6.453.353 6.417.120 6.403.218 6.408.001     6.437.341 6.414.017 6.402.617  MaxCompr.tar
        3.649         3.460       3.660     3.466     3.425     3.456         3.636     3.461     3.426  NUM.txt
      416.044       416.044     416.044   416.044   416.044   416.044       416.044   416.044   416.044  pi1000000.txt
      100.088       100.088     100.088   100.088   100.088   100.088       100.088   100.088   100.088  sharnd_challenge.dat
      314.081       314.081     314.081   314.081   314.081   313.945       314.081   314.081   314.081  test-seed000-n100000.uiq2
      312.519       312.519     312.519   312.519   312.519   312.372       312.518   312.518   312.518  test-seed001-n100000.uiq2
          106           106         106       106       106       106           106       106       106  0x00 * 256 Ki
          113           113         113       113       113       113           113       113       113  0xaad2 * 128 Ki
          109           109         109       109       109       109           109       109       109  0xff * 256 Ki
    8.676.035     8.644.366   8.684.428 8.646.550 8.633.294 8.636.802     8.666.833 8.643.069 8.632.293  Total
    Quote Originally Posted by mpais View Post
    I don't have a lot of time right now for coding, and I'm focusing more on Fairytale, which should hopefully be a lot more extensible than the architecture of paq8. We could use your help on it.
    I read Community-Archiver/Lobby every day, Fairytale is a nice project, but I haven't much time to dedicate to it.
    I'll see if I can do something occasionally in the future.
    Quote Originally Posted by Gotty View Post
    -- Remarks and notes --
    No model or compression improvements (archives must be binary compatible with paq8px_v141 compiled by VC++ or GCC with the -fno-tree-vrp flag or GCC with profile guided optimization)
    It seems to me that the compressed size is the same as the quick version, not the profiled one.
    (I compared my v141 quick and profiled build made when I released v141 and the v141fix1, see https://encode.su/threads/2885-Paq8p...ll=1#post56487 for the compressed size)
    Quote Originally Posted by Gotty View Post
    Well done, Mauro!
    I just changed some values in a piece of code written by Márcio, nothing special.
    I'm a little disappointed with Darek's results, I was hoping for something better in Calgary and Canterbury corpus.
    I slightly penalized the compression of some data (such as text) to improve the compression of other data: I chose a minimum learning rate = 2 because most of the time it is the best (in my tests); however, 1 is better for some files (such as text) and 3 for others.
    @Gotty Thank you for your help and for what you do in paq8px!

  20. Thanks:

    Gotty (13th April 2018)

  21. #1214
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    It seems to me that the compressed size is the same as the quick version, not the profiled one.
    (I compared my v141 quick and profiled build made when I released v141 and the v141fix1, see https://encode.su/threads/2885-Paq8p...ll=1#post56487 for the compressed size)
    A stripped down version of the quick build script with "-fno-tree-vrp" and the profiled one and vc++ all gave the same results on my system - this was my starting point. So I used the quick build script to alternate between "-fno-tree-vrp" (good) and the original (broken) until I found the problem. It was a bit strange that the problematic "-ftree-vrp" is included in -O2 so both the quick build and the profiled build should have been affected. But the profiled one wasn't. I didn't know why and didn't care much, I was just happy that I could narrow it down. So you had it the opposite way around...
    Strange. Probably a little difference in the combination of flags could cause the "switch". I don't know. gcc has its quirks. This could be one of them.
    Quote Originally Posted by Mauro Vezzosi View Post
    Thank you for your help and for what you do in paq8px!
    Oh, don't mention it.

  22. #1215
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts

    pull request: paq8px_v141fix2

    -- Fixes (compiler warnings and compiler compatibility) --
    - Restored clang (linux) compatibility (changed gcc pragmas to function attributes for the dot product and train functions)
    - Fixed a compiler warning in llog()
    - Fixed a compiler warning in Stretch::Stretch()
    - Fixed a compiler warning in HashTable<B>::operator[]
    - Fixed a compiler warning in ContextMap2::Find()
    - Fixed a compiler warning in TextModel::TextModelContextMap2()
    - Fixed a compiler warning in im24bitModel()
    - Fixed a compiler warning in encode_zlib()
    - Fixed a compiler warning in decompressfile()
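A rough sketch of what "function attributes instead of pragmas" can look like, assuming a GCC-compatible compiler on x86. This is not the paq8px `dot_product`: the function names, the plain multiply-accumulate and the scalar fallback are assumptions made for illustration. The point is that `__attribute__((target("sse2")))` enables the instruction set for one function only, in a syntax that both GCC and clang accept.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1

// Per-function target attribute instead of a file-level "#pragma GCC target".
__attribute__((target("sse2")))
int dotProductSse2(const int16_t* a, const int16_t* b, std::size_t n) {
    __m128i sum = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Multiply 8 pairs of 16-bit values; adjacent products are summed
        // into four 32-bit lanes, which are then accumulated.
        sum = _mm_add_epi32(sum, _mm_madd_epi16(x, y));
    }
    // Horizontal add of the four 32-bit lanes into lane 0.
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    int total = _mm_cvtsi128_si32(sum);
    for (; i < n; ++i) total += a[i] * b[i];  // scalar tail for n % 8 elements
    return total;
}
#endif

// Portable fallback with the same result.
int dotProductScalar(const int16_t* a, const int16_t* b, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += a[i] * b[i];
    return total;
}

// Picks the SSE2 version when the sketch was compiled for x86 with GCC/clang.
int dotProduct(const int16_t* a, const int16_t* b, std::size_t n) {
#ifdef SKETCH_HAS_SSE2
    return dotProductSse2(a, b, n);
#else
    return dotProductScalar(a, b, n);
#endif
}
```

The attribute travels with the function definition, so it does not depend on how a particular compiler treats GCC-specific pragmas.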

    -- Fixes (minor bugs) --
    - Fixed a bug in decode_gif() (1st part of a while condition was always true)
    - Fixed a bug in main() (value of 'options' was never EOF)

    -- Fixes (memory usage) --
    - JpegModel reserved memory always (even if this model was not used). Fixed.
    -- v141fix1: 347, 603, 1116, 2142 MBs
    -- v141fix2: 313, 551, 1028, 1982 MBs RAM use on levels 5,6,7 and 8 respectively when compressing non-jpeg files

    -- Changes in user interface --
    - Until now we didn't know the overhead (the size of the added "diff" information) of zlib recompression. Now we know.

    Example for a negligible overhead:
    v141fix1:
    Code:
    Filename: boat.png (177762 bytes)
    Block segmentation:
     0           | default          |        81 bytes [0 - 80]
     1           | png-8b-grayscale |    177665 bytes [81 - 177745] (width: 512)
     2           | default          |        16 bytes [177746 - 177761]
    -----------------------
    Total input size     : 177762
    Total archive size   : 147923
    v141fix2:
    Code:
    Filename: boat.png (177762 bytes)
    Block segmentation:
     0           | default          |        81 bytes [0 - 80]
     1           | png-8b-grayscale |    177665 bytes [81 - 177745] (width: 512)
     1->         | ->  exploded     |    262663 bytes [0 - 262662]
     1-->        | --> added header |         7 bytes [0 - 6]
     1-->        | --> data         |    262656 bytes [7 - 262662]
     2           | default          |        16 bytes [177746 - 177761]
    -----------------------
    Total input size     : 177762
    Total archive size   : 147923
    Example for a significant overhead:
    v141fix1:
    Code:
    Filename: 00080.png (423 bytes)
    Block segmentation:
     0           | default          |       261 bytes [0 - 260]
     1           | png-8b           |       146 bytes [261 - 406] (width: 20)
     2           | default          |        16 bytes [407 - 422]
    -----------------------
    Total input size     : 423
    Total archive size   : 510
    Note the archive size. Why is it so "big"?

    v141fix2:
    Code:
    Filename: 00080.png (423 bytes)
    Block segmentation:
     0           | default          |       261 bytes [0 - 260]
     1           | png-8b           |       146 bytes [261 - 406] (width: 20)
     1->         | ->  exploded     |      1042 bytes [0 - 1041]
     1-->        | --> added header |       622 bytes [0 - 621]
     1-->        | --> data         |       420 bytes [622 - 1041]
     2           | default          |        16 bytes [407 - 422]
    -----------------------
    Total input size     : 423
    Total archive size   : 510
    Answer: the added 622-byte zlib "diff" header makes it "big".

    -- Remarks and notes --
    No model or compression improvements (archives must be binary compatible with paq8px_v141fix1)

  23. Thanks (2):

    Darek (19th April 2018),Gonzalo (19th April 2018)

  24. #1216
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts

    paq8px_v141fix3

    Let's begin fixing the documentation.

    - Separated documentation from the source file (README, DOC, CHANGELOG)
    - Updated (rewrote) README
    - Gathered CHANGELOG information
    - No code changes

    What do you guys think? Am I on the right path?

    I didn't update the technical information (mostly untouched since paq8a) in the DOC file. I feel that some (or most) parts should be moved to the source file to the respective classes/functions. Some parts/chapters seem to be really outdated.

    I'm not satisfied with the name of the "DOC" file. What should we call it?

    In the README file I thanked the guys involved in the development of paq8px. Did I include everybody?

    Someone (a native English speaker perhaps) should proofread my text.

    (Not yet pull-requested.)

  25. Thanks:

    Darek (20th April 2018)

  26. #1217
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts

    paq8px_v141fix4

    Code:
    - Cosmetic changes in the documents
      - README: is now UTF8 encoded for having non-ascii characters + small changes in the text
      - DOC: added warning that its content is partially obsolete
      - source file: a bit clearer header
    - Small changes/fixes in Word, Language, TextModel and the English Stemmer classes
      - Added virtual destructors
      - Created a function to print the result of the stemmer (i.e. word stem) for debugging
      - Fixed 2 items in the Exceptions1 list (idle, gentle)
      - Fixed a bug in Step5 ( W->End+=!EndsInShortSyllable(W);  ->  W->End+=EndsInShortSyllable(W); )
      - Modified TrimStartingApostrophe() to trim more than 1 apostrophe both from the beginning and end in pairs
      - Commented some parts of the English Stemming routines
    - Finally fixed the printf formatting warnings in Linux/GCC and Linux/Clang (changed typedef shortcuts for U8..U64)
    The above fixes mean a tiny-tiny gain in files. A few files may also be a bit larger. Nothing serious.
    @Márcio, would you please validate the fixes in your excellent stemming routines? I hope I didn't break anything unintentionally.

    I do not plan to submit more fixes to the v141 series. I'm not aware of any more suspicious pieces of code.

    About the separation of the README/DOC/CHANGELOG from the source: do you guys have any suggestions? Do you agree with what you see? Is anything missing? If you have a suggestion, don't hesitate to upload an enhanced version of any of these files.

  27. Thanks (2):

    Darek (24th April 2018),Mike (24th April 2018)

  28. #1218
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    253
    Thanks
    49
    Thanked 108 Times in 55 Posts
    Quote Originally Posted by Gotty View Post
    If you have a suggestion
    Just a small comment...
    If you continue with the same naming policy, then a dozen years from now the latest version will get a name like paq8pxv235rc6beta4fix3


  29. #1219
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Yeah. I see your point - but I have a reason.

    These "fixN" subversions are to specify that a certain change is not major enough to increase the (major) version number.
    In the old days there were only major improvements so the version number increased "normally".
    Nowadays there are versions that contain fixes for rare bugs or compiler compatibility issues, readability improvements, documentation changes (most of them do not even change the compression itself). I chose to mark these subversions with a "fixN".
    Only two levels. No rc6. Promise.
    I'm open to a simpler (shorter) version naming convention however such as "paq8px_v141.4". What would you say?
    Last edited by Gotty; 24th April 2018 at 15:45.

  30. #1220
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts
    @Gotty:
    Code:
    Fixed 2 items in the Exceptions1 list (idle, gentle)
    I modified the english stemmer because on a lot of related words it would produce 2 different stem variations, one with a final "e" and another one without it. If memory serves, that is why I modified those exceptions.

    As for the documentation, it's very outdated. I left a few comments here and there for my changes, but if needed I'll try to find some time to document them.

  31. Thanks:

    Gotty (30th April 2018)

  32. #1221
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Quote Originally Posted by mpais View Post
    I modified the english stemmer because on a lot of related words it would produce 2 different stem variations, one with a final "e" and another one without it.
    The final-"e"-behavior you experienced is probably due to this bug:
    Quote Originally Posted by Gotty View Post
    Code:
    - Fixed a bug in Step5 ( W->End+=!EndsInShortSyllable(W);  ->  W->End+=EndsInShortSyllable(W); )
    It's fixed now. Some of the differences:
    Code:
    Word         Expected stem    Before the fix (v141fix3)  After the fix (v141fix4)
    bone          bone            bon                        bone
    knives        knive           kniv                       knive
    wives         wive            wiv                        wive
    lacing        lace            lac                        lace
    ache          ach             ache                       ach
    ached         ach             ach                        ach
    aches         ach             ache                       ach
    ape           ape             ap                         ape
    argue         argu            argue                      argu
    argued        argu            argu                       argu
    argues        argu            argue                      argu
    arguing       argu            argue                      argu
    careful       care            car                        care
    carefully     care            car                        care
    carefulness   care            car                        care
    cares         care            car                        care
    See the full list of differences in the attached xlsx file at the bottom of this post.

    I followed the description of the stemming algorithm on http://snowball.tartarus.org/algorit...h/stemmer.html, but I think you have an extended algorithm in paq8px. Could you please post the link for the algorithm you followed?
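A simplified sketch of the final-"e" rule from step 5 of the published Porter2 (Snowball English) description: delete a final "e" if it is in R2, or if it is in R1 and not preceded by a short syllable; delete a final "l" if it is in R2 and preceded by "l". This is not the paq8px code (which, as noted in this thread, may be an extended variant). The function names are hypothetical, "y" is always treated as a vowel, and the Porter2 special cases for R1 and consonant "Y" are left out.

```cpp
#include <cstddef>
#include <string>

// Simplification: a, e, i, o, u and y always count as vowels.
static bool isVowel(char c) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u' || c == 'y';
}

// Start of the region after the first non-vowel that follows a vowel,
// searching from `from`. Returns w.size() when there is no such region.
static std::size_t regionStart(const std::string& w, std::size_t from) {
    for (std::size_t i = from; i + 1 < w.size(); ++i)
        if (isVowel(w[i]) && !isVowel(w[i + 1])) return i + 2;
    return w.size();
}

// True if the first `len` characters of w end in a short syllable:
// non-vowel, vowel, non-vowel other than w/x at the end, or a two-letter
// prefix made of a vowel followed by a non-vowel.
static bool endsInShortSyllable(const std::string& w, std::size_t len) {
    if (len == 2) return isVowel(w[0]) && !isVowel(w[1]);
    if (len < 3) return false;
    char a = w[len - 3], b = w[len - 2], c = w[len - 1];
    return !isVowel(a) && isVowel(b) && !isVowel(c) && c != 'w' && c != 'x';
}

// Step 5 only; assumes the earlier steps have already run on `w`.
std::string step5(std::string w) {
    if (w.empty()) return w;
    const std::size_t r1 = regionStart(w, 0);
    const std::size_t r2 = regionStart(w, r1);
    const std::size_t last = w.size() - 1;
    if (w[last] == 'e') {
        const bool inR1 = last >= r1;
        const bool inR2 = last >= r2;
        // Keep the "e" when the remainder ends in a short syllable (bone, care).
        if (inR2 || (inR1 && !endsInShortSyllable(w, last))) w.pop_back();
    } else if (w[last] == 'l' && last >= r2 && last >= 1 && w[last - 1] == 'l') {
        w.pop_back();
    }
    return w;
}
```

Under these assumptions `step5("bone")` and `step5("care")` keep their final "e", while `step5("ache")` and `step5("argue")` drop it, matching the "Expected stem" column above. Negating the short-syllable test flips exactly these cases, which is the pattern in the "Before the fix" column.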

    Quote Originally Posted by mpais View Post
    As for the documentation, it's very outdated. I left a few comments here and there for my changes, but if needed I'll try to find some time to document them.
    I'd say a general overview above the definition of the classes and functions should suffice, as you, I, and other authors have already done.
    Sometimes short comments are useful where the intention could be blurry for the first time reader. We have such comments already in a couple of places.
    The information in the DOC file is a problem. What do you think about keeping only a very general overview in the DOC and moving all the useful and still valid information from the DOC into the cpp file, next to the respective functions/classes?
    The "model mixing" could stay in the DOC, but for example "The match model maintains a pointer ..." should simply go into the cpp file where the code for matchmodel is.
    I would do that; I just need the green light.
    Attached Files

  33. #1222
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    @Márcio
    An additional idea.
    class EnglishStemmer, class FrenchStemmer, class GermanStemmer, class TextModel, function im4bitModel, class exeModel, function XMLModel all have changelog information above the code in the cpp file.
    What do you think about moving these changelog pieces from the cpp to the CHANGELOG file?
    If you prefer, we can add the author's name to the CHANGELOG entries so that this information is not lost. In any case, the github repository tracks changes as well, so we know the author of every line of code starting from paq8px_v69.
    If you agree, I'd like to ask you to do the merge {cpp-->CHANGELOG} since you are the author of these and you know them the best.

  34. #1223
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts
    Yes, I saw it, nice catch. The English stemmer I used is based on my own modifications to the Porter2 stemmer, so there's no link to a description of the changes. I see you classified the exception "skies" as a noun. That is incorrect: it is also a verb form. I had also skipped classification on "andes" because I was considering removing it altogether, since in English it's just the name of a mountain range (so rarely used), but in my own language it's a common verb form.

    The documentation for the models is bad. It even still mentions the long gone "FAX model" that someone made for pic from the Calgary Corpus.
    Just a general definition of classes and functions won't be of much help to anyone trying to understand what the models are doing, which is really the most important thing in paq8.
    I mean, looking at my contexts for the image models, is it immediately clear to you what they're modelling?

  35. #1224
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    748
    Thanks
    426
    Thanked 487 Times in 261 Posts
    Quote Originally Posted by mpais View Post
    The English stemmer I used is based on my own modifications to the Porter2 stemmer, so there's no link to a description of the changes.
    Aha! An improved Porter2. Could you add some comments in the code where it is different? Then the reader would know and be able to follow more easily.

    Quote Originally Posted by mpais View Post
    I see you classified the exception "skies" as a noun. That is incorrect: it is also a verb form. I had also skipped classification on "andes" because I was considering removing it altogether, since in English it's just the name of a mountain range (so rarely used), but in my own language it's a common verb form.
    "skies": Aha, I see. And I agree. Would it work with "English::Noun|English::Plural|Verb"?
    "andes": removing it (and my "texas") is OK. If they do not help compression (I don't know that), they do not need to stay. I know that the purpose is not a morphologically correct stemmer but a stemmer that helps text compression the most. Of course, any such decision should be commented somewhere in the code, so fellow developers know why it deviates from a "standard" stemmer.

    Quote Originally Posted by mpais View Post
    The documentation for the models is bad. It even still mentions the long gone "FAX model" that someone made for pic from the Calgary Corpus.
    Yes. This is what I'm about to change. The "FAX model" paragraph would certainly go.

    Quote Originally Posted by mpais View Post
    Just a general definition of classes and functions won't be of much help to anyone trying to understand what the models are doing, which is really the most important thing in paq8.
    I mean: in the cpp, a general overview (above the class and function definitions) is a must (and enough), so that the reader knows the purpose of a class/function. Additionally, "sometimes short comments are useful where the intention could be blurry for the first time reader. We have such comments already in a couple of places." Although we have made some progress in this area, we need more comments in the code. For instance, the comments that I added to the English Stemmer in v141fix4 are enough for me to understand what's going on. What do you think: are such comments enough, or should the documentation or explanation be "deeper" or more "textual"?

    In the DOC file, however, a general definition of the concepts and the internal workings of paq8px (mainly the process flow) is missing. I don't have a very clear idea of what exactly I would include here, but I strongly feel that the "general picture" is either not in the document or not explained clearly or deeply enough. For instance, I remember that at the beginning I was struggling to understand simple things such as what exactly a "bit history" is or what an Adaptive Probability Map is for.

    What do you suggest for structuring the documentation? What should be included in the DOC and what in the cpp?
    As for the cpp file: are a general overview plus comments in the code sufficient, or would you like to see a more "textual" documentation of the functions/classes/models?

    Quote Originally Posted by mpais View Post
    I mean, looking at my contexts for the image models, is it immediately clear to you what they're modelling?
    No, it's not. I need to tear the code apart to understand what's going on. The same applies to the Stemming routines or the Jpeg model. If you wrote an overview and commented on some of the details, that would be really nice. I'd certainly appreciate your efforts.

  36. #1225
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts
    Quote Originally Posted by Gotty View Post
    Aha! An improved Porter2. Could you add some comments in the code where it is different? Then the reader would know and be able to follow more easily.
    It's mostly just that it handles superlative adjectives and prefixes. It's been a long time since I made it (it was ported from EMMA) but I'll see if I still have my notes on it.

    Quote Originally Posted by Gotty View Post
    "skies": Aha, I see. And I agree. Would it work with "English::Noun|English::Plural|Verb"?
    Sure, it's just that the TextModel relies on detecting the verbs in sentences.
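
    To make the flag combination concrete, here is a small sketch of how such a word type could be built and tested. The bit positions and the names kSkiesType and isVerb are hypothetical, chosen only for the illustration; the real values are whatever the Language/English classes in the cpp file define.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values, for illustration only.
struct Language {
    static constexpr std::uint32_t Verb = 1u << 0;
    static constexpr std::uint32_t Noun = 1u << 1;
};
struct English : Language {
    static constexpr std::uint32_t Plural = 1u << 3;
};

// "skies" is both a plural noun and a verb form, so all three bits are set.
constexpr std::uint32_t kSkiesType =
    English::Noun | English::Plural | English::Verb;

// A consumer that only cares about verbs tests a single bit.
constexpr bool isVerb(std::uint32_t type) {
    return (type & Language::Verb) != 0;
}

static_assert(isVerb(kSkiesType), "combined type is still seen as a verb");
```

    Because the flags are independent bits, adding Verb does not remove the noun/plural information; a model that looks for verbs simply sees one more candidate.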

    Quote Originally Posted by Gotty View Post
    I know that the purpose is not a morphologically correct stemmer but a stemmer that helps text compression the most. Of course, any such decision should be commented somewhere in the code, so fellow developers know why it deviates from a "standard" stemmer.
    The stemmers are just a way to get somewhat useful semantic information and act as a quantization scheme that is language-dependent. The semantic information provides a high-level overview of the current state and as such is probably mostly useful for mixer contexts. The quantized word representation helps the clustering of similar correlated concepts and is usually useful for prediction contexts.

    Quote Originally Posted by Gotty View Post
    I mean: in the cpp, a general overview (above the class and function definitions) is a must (and enough), so that the reader knows the purpose of a class/function. Additionally, "sometimes short comments are useful where the intention could be blurry for the first time reader. We have such comments already in a couple of places." Although we have made some progress in this area, we need more comments in the code. For instance, the comments that I added to the English Stemmer in v141fix4 are enough for me to understand what's going on. What do you think: are such comments enough, or should the documentation or explanation be "deeper" or more "textual"?

    In the DOC file, however, a general definition of the concepts and the internal workings of paq8px (mainly the process flow) is missing. I don't have a very clear idea of what exactly I would include here, but I strongly feel that the "general picture" is either not in the document or not explained clearly or deeply enough. For instance, I remember that at the beginning I was struggling to understand simple things such as what exactly a "bit history" is or what an Adaptive Probability Map is for.

    What do you suggest for structuring the documentation? What should be included in the DOC and what in the cpp?
    As for the cpp file: are a general overview plus comments in the code sufficient, or would you like to see a more "textual" documentation of the functions/classes/models?
    Most of the preprocessing capabilities aren't mentioned anywhere either.
    The thing is, to understand how paq8 works, a user would need a reasonably good knowledge of a lot of topics, so the documentation would need to be really thorough, and I don't think anyone is willing to write it (besides, once you understand how things work, you see how limited it is). Apart from you and Mauro, I don't think anyone even bothers checking out the code (how many easter eggs have you found?).

    Quote Originally Posted by Gotty View Post
    No, it's not. I need to tear the code apart to understand what's going on. The same applies to the Stemming routines or the Jpeg model. If you wrote an overview and commented on some of the details, that would be really nice. I'd certainly appreciate your efforts.
    The stemming routines are mostly just from the standard Porter stemmers, with additional code for semantic classification and common word recognition, which helps with language detection.

    The JPEG model is actually reasonably commented in the parsing and decoding stages; it's just the context modelling stage that may seem hard to follow. I'll see if I can make a few comments to better explain it (it isn't my model though, I just improved it).

    For the image models, it helps to understand the type of correlations one can expect for each type of image (gross oversimplification incoming, be warned).

    For 8bpp palette color-indexed images, if you have 2 consecutive pixels with values 100 and 80, you can't just use linear predictors and expect the next pixel to be 60. In this case, you need the actual neighborhood pixel values as contexts, so you have contexts of the form hash(W, N, NW). So quantized values are a big no-no, since the context (100, 50, 78) quite likely represents a completely different texture than context (50, 25, 39). Since this context uses 24 bits and we can expect that on most images only a small subset of all of its possible values will be used, we use it with the ContextMap.
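
    A minimal sketch of why exact values matter for palette images. Plain bit packing stands in for the hash here, and both function names are invented for the example; the 4-bit quantization is just one arbitrary way to show the loss of information.

```cpp
#include <cassert>
#include <cstdint>

// Pack three 8-bit palette indices into one 24-bit context value.
constexpr std::uint32_t paletteContext(std::uint8_t W, std::uint8_t N,
                                       std::uint8_t NW) {
    return (static_cast<std::uint32_t>(W) << 16) |
           (static_cast<std::uint32_t>(N) << 8) |
           static_cast<std::uint32_t>(NW);
}

// The same context after dropping the low 4 bits of every index.
// For palette images this merges indices that may map to unrelated colors.
constexpr std::uint32_t quantizedContext(std::uint8_t W, std::uint8_t N,
                                         std::uint8_t NW) {
    return paletteContext(static_cast<std::uint8_t>(W >> 4),
                          static_cast<std::uint8_t>(N >> 4),
                          static_cast<std::uint8_t>(NW >> 4));
}
```

    The exact contexts for (100, 50, 78) and (50, 25, 39) are different values, as they should be. The quantized version, however, cannot tell (100, 50, 78) from (111, 63, 79), even though indices 100 and 111 may be completely unrelated palette entries.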

    For 8bpp grayscale images and individual color planes of color images, we need to consider that we may be dealing with photographic or computer-generated (artificial) images. Among artificial images, we may have those with characteristics similar to photographic images (renders, mostly) and others quite different (clipart, drawings, etc). Non-photographic images may have hard edges and large continuous tone regions. So we need contexts that describe the current texture, like for palette color images, and contexts that represent the actual expected pixel value.
    Taking the same example as before, if the previous pixels (WW and W) are 100 and 80, the context (W*2-WW) will predict 60. However, this value can be outside the allowed [0..255] interval (consider WW=100 and W=40). So we either Clip() this value to our interval, or we Clamp() it, which restricts the possible values to the largest interval defined by the neighborhood pixels we pass as parameters. To calculate this linear prediction, we can use averages of neighboring pixels, like (W+N)/2, or use gradients, like (W+N-NW), or extrapolate, like (W*3-WW*3+WWW) [see Lagrange polynomials]. Some of the more complex contexts use Lagrange polynomials to extrapolate "future" values (S, E, EE, etc) and then use them to average/gradient/interpolate predictions. I also made contexts that use half-pel/quarter-pel/nth-pel extra/interpolation, and obviously most directions are used, not just horizontal and vertical. Most of these are well known techniques used in video-codecs.
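
    The predictors named above can be sketched in a few lines. The function names are made up for the illustration and the two-neighbour clamp is a simplification; this is not the paq8px implementation. The three-point extrapolation is written as 3*W - 3*WW + WWW, the form that continues a linear ramp through the three previous pixels.

```cpp
#include <algorithm>
#include <cassert>

// Clip(): restrict a prediction to the valid 8-bit pixel range.
inline int clipToByte(int v) { return std::min(255, std::max(0, v)); }

// Clamp(): restrict a prediction to the interval spanned by the neighbours
// passed in (two neighbours here, to keep the sketch short).
inline int clampToNeighbours(int v, int a, int b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    return std::min(hi, std::max(lo, v));
}

// (W*2 - WW): continue the slope between the last two pixels.
inline int linearExtrapolation(int W, int WW) { return 2 * W - WW; }

// (W + N) / 2: average of the left and upper neighbours.
inline int average(int W, int N) { return (W + N) / 2; }

// (W + N - NW): gradient predictor.
inline int gradient(int W, int N, int NW) { return W + N - NW; }

// Quadratic (three-point Lagrange) extrapolation along the row.
inline int quadraticExtrapolation(int W, int WW, int WWW) {
    return 3 * W - 3 * WW + WWW;
}
```

    With WW=100 and W=80 the linear extrapolation gives 60. With WW=100 and W=40 it gives -20, which Clip() turns into 0 and Clamp() over (W, WW) turns into 40.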

    On color images, besides this spatial correlation, you may have spectral correlation between the color planes (not necessarily if dealing with artificial images). So an edge in the previous plane will most likely mean an edge at the same spatial position in this plane, so we can use the magnitude of the intensity change in the previous plane to make predictions, like (W+p1-Wp1), or refine our predictions based on the residual of the prediction of the previous plane, like (W+N-NW+p1-Clip(Wp1+Np1-NWp1)).
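
    A small sketch of the two inter-plane predictors mentioned above. It assumes that p1 is the pixel at the current position in the previous color plane and that Wp1, Np1 and NWp1 are its west, north and north-west neighbours in that plane; the function names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>

inline int clipToByte(int v) { return std::min(255, std::max(0, v)); }

// (W + p1 - Wp1): carry the horizontal intensity change seen in the
// previous plane over to the current plane.
inline int planeDeltaPrediction(int W, int p1, int Wp1) {
    return W + p1 - Wp1;
}

// (W + N - NW + p1 - Clip(Wp1 + Np1 - NWp1)): gradient prediction in this
// plane, corrected by the residual the same predictor left in the previous one.
inline int residualRefinedPrediction(int W, int N, int NW,
                                     int p1, int Wp1, int Np1, int NWp1) {
    return W + N - NW + p1 - clipToByte(Wp1 + Np1 - NWp1);
}
```

    Example: if the previous plane jumps from 50 to 90 at this position (an edge of +40) and W in the current plane is 60, the first predictor expects 100. Either result may still need a Clip() before it is used as a context.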
    Last edited by mpais; 1st May 2018 at 09:16. Reason: Typos

  37. Thanks (2):

    Eppie (1st May 2018),Gotty (13th May 2018)

  38. #1226
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts

    paq8px_v142

    I fixed a memory leak in the zlib transform, and since I didn't want to release a new version with just a simple fix, I increased the precision of 2 mixer contexts in the TextModel and added a new one.

    Code:
    File: enwik8
    v141       16.872.591
    v142       16.813.905

  39. Thanks (4):

    Darek (12th May 2018),Gotty (13th May 2018),kaitz (15th May 2018),moisesmcardona (20th May 2018)

  40. #1227
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,274
    Thanks
    803
    Thanked 545 Times in 415 Posts
    Scores for my testset and 4 corpuses.
    It looks like the changes add some nice gains in almost all non-multimedia files.
    Attached Thumbnails: paq8px_v142.jpg (451.8 KB), paq8px_v142_4corpuses.jpg (606.8 KB)

  41. Thanks:

    Gotty (13th May 2018)

  42. #1228
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts

    paq8px_v143

    Just a quick modification, I added 2 mixer contexts to the exeModel.

    Code:
    File: ooffice, from Silesia
    v142 -9a         1.336.534 bytes
    v143 -9a         1.324.685 bytes
    
    File: acrord32.exe, from Maximum compression
    v142 -9ta         852.258 bytes
    v143 -9ta         847.926 bytes

  43. Thanks (3):

    Darek (21st May 2018),Gotty (20th May 2018),moisesmcardona (20th May 2018)

  44. #1229
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    277
    Thanks
    164
    Thanked 64 Times in 49 Posts
    Nice to be back in the forum and nice to also see paq8px is still in development

    I wrote a quick script that automates folder compression (it also works with individual files) and starts testing once the compression finishes.

    Script and readme can be found here:
    https://github.com/moisesmcardona/paq8px_scripts

  45. #1230
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    578
    Thanks
    220
    Thanked 834 Times in 342 Posts

    paq8px_v144

    Code:
    Changes:
    - Improved the 24/32bpp image model
    - Added another mixer context to the exeModel
    - Truncated detections are now possible
    Benchmarks:
    Code:
    LPCB with option -9:  946.423.086 bytes
    LPCB with option -9a: 945.832.025 bytes
    LPCB with option -9s: 943.203.584 bytes
    LPCB with option -9as: 942.903.925 bytes
    LPCB with best options per file: 937.115.803 bytes
    I only improved the part of the model that handles photographic images, so for computer generated images the gains will probably be small.
    Still, this version is already close to cmix on the LPCB, and choosing the best option per file will probably beat it (until I port these improvements over to cmix).

    [Edit]
    As expected, choosing the best options per file allows paq8px to beat cmix.
    Attached Files
    Last edited by mpais; 18th June 2018 at 20:55.

  46. Thanks (4):

    Darek (17th June 2018),Gotty (17th June 2018),kaitz (22nd June 2018),Mike (17th June 2018)
