
Thread: NNCP: Lossless Data Compression with Neural Networks

  1. #121
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    You can find the parameters in readme.txt here: https://bellard.org/nncp/nncp-2019-11-16-win64.zip
    And NNCP is intended to be used with its own preprocessor; without preprocessing, its results are much worse (and it is slower).

  2. #122
    Member
    Join Date
    Apr 2019
    Location
    France
    Posts
    20
    Thanks
    0
    Thanked 35 Times in 17 Posts
    A text compression demo using GPT-2 is available at: https://bellard.org/nncp/gpt2tc.html . Results are available for book1 and alice29.txt.

    As expected, the compression ratios are high but the speed is low, and the decompression program is huge because it requires the complete GPT-2 model parameters. More testing is required to confirm that the good performance is generic and does not come from the fact that the benchmarks were already included in the training data.
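
    (As a rough illustration of why a language model can act as a compressor: an ideal entropy coder driven by the model's next-character probabilities spends about -log2 p bits per character, so the compressed size is roughly the model's cross-entropy on the file. Below is a minimal sketch with a toy probability table standing in for GPT-2.)

    Code:
    import math

    # Toy stand-in for a language model: fixed next-character probabilities.
    # A real model such as GPT-2 would predict context-dependent probabilities.
    TOY_PROBS = {'a': 0.5, 'b': 0.25, 'c': 0.25}

    def ideal_compressed_bits(text):
        # An ideal arithmetic coder needs about -log2 p(char) bits per character,
        # so the total is the cross-entropy of the text under the model.
        return sum(-math.log2(TOY_PROBS[ch]) for ch in text)

    text = "aabacaba"
    bits = ideal_compressed_bits(text)
    print(f"{bits:.1f} bits, {bits / len(text):.2f} bpc, ~{bits / 8:.1f} bytes")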

  3. Thanks (2):

    Darek (13th June 2020),Mike (13th June 2020)

  4. #123
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,272
    Thanks
    802
    Thanked 545 Times in 415 Posts
    Could you provide a Windows executable binary?

    On this page it says the GPT-2 record on enwik8 is about 0.93 bits per character, which (if I calculate it properly) means about 11'625'000 bytes. Hmmm... impressive.

    https://openai.com/blog/better-language-models/
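
    (A quick check of that arithmetic, treating enwik8's 100,000,000 bytes as characters:)

    Code:
    # 0.93 bits per character over enwik8's 100,000,000 characters:
    print(100_000_000 * 0.93 / 8)  # 11625000.0 bytes, matching the estimate above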

  5. #124
    Member
    Join Date
    Apr 2019
    Location
    France
    Posts
    20
    Thanks
    0
    Thanked 35 Times in 17 Posts
    Quote Originally Posted by Darek View Post
    Could you provide a Windows executable binary?

    On this page it says the GPT-2 record on enwik8 is about 0.93 bits per character, which (if I calculate it properly) means about 11'625'000 bytes. Hmmm... impressive.

    https://openai.com/blog/better-language-models/
    A Windows executable is now available.

    Of course, you cannot directly compare the enwik8 language modelling results (e.g. from GPT-2) with the results from CMIX or NNCP because the decompression program size (which includes the model parameters) is not counted in the result. Moreover, for GPT-2, parts of enwik8 may have been included in the training data, so it is not a precise result even in the context of language modelling.

  6. Thanks:

    Darek (13th June 2020)

  7. #125
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,272
    Thanks
    802
    Thanked 545 Times in 415 Posts
    Quote Originally Posted by fab View Post
    Of course, you cannot directly compare the enwik8 language modelling results (e.g. from GPT-2) with the results from CMIX or NNCP because the decompression program size (which includes the model parameters) is not counted in the result. Moreover, for GPT-2, parts of enwik8 may have been included in the training data, so it is not a precise result even in the context of language modelling.
    Yes. As an additional, separate compressor, GPT-2 couldn't be competitive with such a big parameter model, especially for contests...

    However, if such a compression method were in the future part of Windows, Linux or another operating system (as a ZIP-like standard), and the parameter file were bundled inside every system, then such a level of compression would be real, i.e. achievable.

    R.DOC scores:
    24'435 bytes - cmix v17 - previous record
    18'701 bytes - GPT-2 117M score....

    Based on my first file scores, the time to compress my entire testbed on my laptop should be about 18 h (4 threads), which is not much longer than the latest cmix versions (15.7 h).

    Where can I find the larger parameter files?
    Last edited by Darek; 13th June 2020 at 14:26.

  8. #126
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,040
    Thanks
    104
    Thanked 420 Times in 293 Posts
    Code:
    enwik6 plus round 4k:
    197,354 bytes,   286.828 sec., paq8sk23 -x15 -w -e1,english.dic
    195,030 bytes,   111.383 sec., paq8px_v187fix2 -12
    193,386 bytes,    77.913 sec., paq8pxd_v89_NO_AVX2 -s15
    192,295 bytes, 1,100.690 sec., cmix -c
    174,901 bytes,   770.520 sec., cmix -c english.dic
    151,616 bytes, 1,697.506 sec., gpt2tc c

  9. Thanks:

    Darek (14th June 2020)

  10. #127
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,272
    Thanks
    802
    Thanked 545 Times in 415 Posts
    Was that the 117M-parameter model's score? If so, it's impressive.

  11. #128
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Why don't you test paq8 with a 100MB reference?
    Maybe just attach enwik8 at the end of paq8px.exe, then compress R.DOC with -e?

  12. #129
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    127
    Thanks
    43
    Thanked 33 Times in 22 Posts
    Quote Originally Posted by Darek View Post
    Where can I find the larger parameter files?
    It is described in readme.txt: you download them with download.sh and then run the Python script to convert them.

  13. Thanks:

    Darek (14th June 2020)

  14. #130
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    514
    Thanks
    63
    Thanked 96 Times in 75 Posts
    Quote Originally Posted by fab View Post
    A Windows executable is now available.

    Of course, you cannot directly compare the enwik8 language modelling results (e.g. from GPT-2) with the results from CMIX or NNCP because the decompression program size (which includes the model parameters) is not counted in the result. Moreover, for GPT-2, parts of enwik8 may have been included in the training data, so it is not a precise result even in the context of language modelling.
    Is it written in Python?

  15. #131
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    739
    Thanks
    424
    Thanked 486 Times in 260 Posts
    Quote Originally Posted by Shelwien View Post
    Why don't you test paq8 with a 100MB reference?
    Maybe just attach enwik8 at the end of paq8px.exe, then compress R.DOC with -e?
    A slightly better approach: create a list file whose items are your "training files" (enwik8, dickens, etc.), compress it the @listfile way, and take note of the size.
    Then add R.DOC to the end of the list file, compress again, take note of the new size, and subtract. That's all.
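
    A rough sketch of that measurement is below; the paq8px invocation and the output archive names are assumptions (based on the -12 switch and @listfile usage mentioned in this thread), not a verified recipe.

    Code:
    import os
    import subprocess

    def compressed_size(listfile, archive):
        # Hypothetical invocation: compress every file named in the list file.
        subprocess.run(["paq8px", "-12", "@" + listfile], check=True)
        return os.path.getsize(archive)

    # train.lst names the "training" files (enwik8, dickens, ...);
    # train_rdoc.lst is the same list with R.DOC appended at the end.
    base = compressed_size("train.lst", "train.lst.paq8px")
    both = compressed_size("train_rdoc.lst", "train_rdoc.lst.paq8px")

    # The difference approximates R.DOC compressed with the training files as reference.
    print(both - base, "bytes for R.DOC")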

  16. Thanks:

    msaidov (13th July 2020)

  17. #132
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,272
    Thanks
    802
    Thanked 545 Times in 415 Posts
    Quote Originally Posted by Shelwien View Post
    Why don't you test paq8 with a 100MB reference?
    Maybe just attach enwik8 at the end of paq8px.exe, then compress R.DOC with -e?
    As I wrote earlier: as long as the parameter file is "general" and not trained on a particular file, there's no issue.

    I think we need to distinguish two cases:

    a) more theoretical: if we only need to package the compressed information together with the decompression program/algorithm/idea (for sending, backups, contests, etc.), then adding a 117 MB training file makes zero sense, right.

    b) more practical: if we work in a defined, standardized environment and use our software there, then why don't we also count the Windows, Linux, Unix, Android or MacOS executables and other files needed to run in the length of the compressed file and decompressor? Without those systems/files our compressor wouldn't work either. In this case, as with the ZIP standard, such an NN approach with even a 30 GB parameter file embedded in the system environment could be treated as baseline functionality, and then such a high compression ratio would be achievable because there would be no need to share/send the parameter file. Of course, only if the parameter file is "general".

    That's of course only my opinion. I could be wrong.

  18. #133
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > as long as the parameter file is "general" and not trained on a particular file, there's no issue

    Sure, but it's unfair to compare a trained model against an untrained one.
    So I suggested a way to see paq8px's output with a comparable volume of reference data.

  19. Thanks (2):

    Darek (14th June 2020),msaidov (13th July 2020)

  20. #134
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,272
    Thanks
    802
    Thanked 545 Times in 415 Posts
    Quote Originally Posted by Shelwien View Post
    > as long as the parameter file is "general" and not trained on a particular file, there's no issue

    Sure, but it's unfair to compare a trained model against an untrained one.
    So I suggested a way to see paq8px's output with a comparable volume of reference data.
    Yes, you are absolutely right. It's not fair. It's rather a comparison of approaches than of programs.
    I'll try the method you described.

  21. Thanks:

    msaidov (13th July 2020)

  22. #135
    Member
    Join Date
    Jul 2020
    Location
    Moscow
    Posts
    1
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Hello! At the beginning of this thread there was an error when building the executable on Ubuntu:

    Code:
    /usr/bin/ld: libnc.a(libnc.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
    /usr/bin/ld: libnc.a(job.o): relocation R_X86_64_32 against `.text' can not be used when making a PIE object; recompile with -fPIE
    collect2: error: ld returned 1 exit status
    make: *** [Makefile:47: nncp] Error 1

    The point is that libnc.a is shipped already compiled, so there is no way to add the -fPIE flag to it. Could you give a hint on how to build the project properly on Ubuntu?

    If this was already answered in the thread and I missed it, I'd appreciate a pointer. Thank you.

  23. #136
    Member
    Join Date
    Apr 2019
    Location
    France
    Posts
    20
    Thanks
    0
    Thanked 35 Times in 17 Posts
    A new version of NNCP is available at https://bellard.org/nncp . It is now based on the Transformer model and outperforms CMIX v18 on enwik9. More processing power is needed, so a GPU is mandatory.

  24. Thanks (8):

    algorithm (4th January 2021),byronknoll (3rd January 2021),Darek (4th January 2021),Gotty (3rd January 2021),Mauro Vezzosi (4th January 2021),Mike (3rd January 2021),schnaader (3rd January 2021),xinix (3rd January 2021)

  25. #137
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Quote Originally Posted by fab View Post
    A new version of NNCP is available at https://bellard.org/nncp . It is now based on the Transformer model and outperforms CMIX v18 on enwik9. More processing power is needed, so a GPU is mandatory.
    Impressive, Bravo Fabrice!

    (wow, 3.9 days enc/dec time)

  26. #138
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    290
    Thanks
    120
    Thanked 168 Times in 124 Posts
    Code:
    http://mattmahoney.net/dc/text.html
    Feb 06 2021 - added nncp v2.1.
    
    nncp v2.1 was released Feb. 6, 2021. It is the same code as v2 except for a larger model and slightly different hyperparameters.
    
                Compression     Compressed size      Decompresser  Total size   Time (ns/byte)
    Program       Options      enwik8      enwik9     size (zip)   enwik9+prog  Comp   Decomp   Mem  Alg Notes
    -------       -------    ----------  -----------  -----------  -----------  ------ ------  ----- --- -----
    nncp 2019-05-08          16,791,077  125,623,896    161,133 xd 125,785,029  420168 602409   2040 LSTM 84
    nncp 2019-11-16          16,292,774  119,167,224    238,452 xd 119,405,676  826048 1156467  5360 LSTM 84
    nncp v2                  15,600,675  114,317,255     99,671 xd 114,416,926  308645 313468  17000 Transformer 88
    nncp v2.1                15,020,691  112,219,309    100,046 xd 112,319,355  508332 515401  23000 Transformer 88
    
    
                    Compression                      Compressed size      Decompresser  Total size   Time (ns/byte)
    Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp   Mem Alg Note
    -------           -------                     ----------  -----------  -----------  -----------  ----- -----   --- --- ----
    nncp v2.1                                     15,020,691  112,219,309    100,046 xd 112,319,355 508332 515401 23000 Tr  88
    cmix v18                                      14,838,332  115,714,367    208,961 s  115,923,328 602867 601569 25738 CM  83

  27. Thanks:

    Darek (9th February 2021)
