You can find the parameters in readme.txt here: https://bellard.org/nncp/nncp-2019-11-16-win64.zip
And NNCP is intended to be used with its own preprocessor; without preprocessing, its result is much worse (and it is slower).
A text compression demo using GPT-2 is available at: https://bellard.org/nncp/gpt2tc.html . Results are available for book1 and alice29.txt.
As expected, the compression ratios are high but the speed is low, and the decompression program is huge because it requires the complete GPT-2 model parameters. More testing is required to confirm that the good performance is generic and does not come from the fact that the benchmark files were already included in the training data.
Could you provide a Windows executable binary?
On this page it is written that the GPT-2 enwik8 record is about 0.93 bits per character, which means (if I calculate it properly) about 11'625'000 bytes. Hmmm... impressive.
https://openai.com/blog/better-language-models/
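For reference, the arithmetic behind that estimate: enwik8 is 100,000,000 characters, so 0.93 bits/char × 100,000,000 chars ÷ 8 bits/byte ≈ 11,625,000 bytes.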
A Windows executable is now available.
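For anyone trying it, a rough usage sketch follows; the argument order and file names are my assumptions, so check the readme bundled with the download:
Code:
# Rough usage sketch (argument order is an assumption - see the bundled readme):
gpt2tc c book1 book1.gpt2      # compress book1 with the default 117M model
gpt2tc d book1.gpt2 book1.out  # decompress; book1.out should match book1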
Of course, you cannot directly compare the enwik8 language modelling results (e.g. from GPT-2) with the results from CMIX or NNCP because the decompression program size (which includes the model parameters) is not counted in the result. Moreover, for GPT-2, parts of enwik8 may have been included in the training data, so it is not a precise result even in the context of language modelling.
Darek (13th June 2020)
Yes. As an additional, separate compressor, GPT-2 couldn't be competitive with such a big parameter model, especially for some contests...
However, if such a compression method became part of Windows, Linux or another operating system in the future (as a ZIP-like standard), and the parameter file were shipped inside every system, then such a level of compression would be real - meaning achievable.
R.DOC scores:
24'435 bytes - cmix v17 - previous record
18'701 bytes - GPT-2 117M score....
Based on my first file scores, the time to compress my entire testbed on my laptop should be about 18h (4 threads) - not much longer than the latest cmix versions at 15.7h.
Where can I find larger parameter files?
Code:
enwik6 plus round 4k:
197,354 bytes,   286.828 sec., paq8sk23 -x15 -w -e1,english.dic
195,030 bytes,   111.383 sec., paq8px_v187fix2 -12
193,386 bytes,    77.913 sec., paq8pxd_v89_NO_AVX2 -s15
192,295 bytes, 1,100.690 sec., cmix -c
174,901 bytes,   770.520 sec., cmix -c english.dic
151,616 bytes, 1,697.506 sec., gpt2tc c
Darek (14th June 2020)
Was that the score of the 117M-parameter model? If it was, it's impressive.
Why don't you test paq8 with a 100MB reference?
Maybe just attach enwik8 at the end of paq8px.exe, then compress R.DOC with -e?
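Roughly what that suggestion amounts to (a sketch only; the concatenated file name is a placeholder and the exact paq8px flag syntax may differ):
Code:
# Sketch of the idea (untested; names are placeholders, exact paq8px flags may differ):
cat paq8px.exe enwik8 > paq8px_ref.exe   # the 100 MB reference now counts toward the decompressor size
./paq8px_ref.exe -12 R.DOC               # compress R.DOC, with the -e option as suggested above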
Darek (14th June 2020)
msaidov (13th July 2020)
As I wrote earlier - as long as the parameter file is "general" and not trained on the particular test file, there's no issue.
I think we need to distinguish two cases:
a) more theoretical - if we need to define only the compressed information together with the decompression program/algorithm/idea as a package (for sending, backup, contests, etc.) - then adding a 117MB training file makes zero sense - right.
b) more practical - if we work in a defined, standardized environment and use our own software - in that case, why don't we also add to the length of the compressed file and decompressor the Windows, Linux, Unix, Android or MacOS executables and the other files needed for it to work? Without those systems/files our compressor wouldn't work either. In this case, as with the mentioned ZIP standard, such an NN idea with even a 30GB parameter file embedded in the system environment could be treated as baseline functionality, and then such a high compression ratio would be possible because there is no need to share/send the parameter file. Of course, only if the parameter file is "general".
That's of course only my opinion. I could be wrong.
> until parametric file is "general" and not trained on particular file there's no issue
Sure, but it's unfair to compare a trained model vs. an untrained one.
So I suggested a way to see paq8px output with comparable volume of reference data.
msaidov (13th July 2020)
Hello! At the beginning of this thread there was an error when building the executable on Ubuntu:
/usr/bin/ld: libnc.a(libnc.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: libnc.a(job.o): relocation R_X86_64_32 against `.text' can not be used when making a PIE object; recompile with -fPIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:47: nncp] Error 1
The point is that we get an already compiled libnc.a file, and there is no way to add the -fPIE flag to it. Could you give a hint on how to build the project properly on Ubuntu?
If there was already an answer in this thread and I missed it, I'd appreciate a pointer. Thank you.
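One possible workaround, assuming the prebuilt libnc.a simply isn't position-independent code: link the nncp binary without PIE instead of trying to recompile libnc.a with -fPIE. A sketch (the variable name used by the Makefile is an assumption):
Code:
# Pass -no-pie at link time so the non-PIC objects in libnc.a are accepted:
make LDFLAGS="-no-pie"
# or, equivalently, add -no-pie to the link command of the nncp target in the Makefile
# (the rule referenced by the "Makefile:47: nncp" error line).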
A new version of NNCP is available at https://bellard.org/nncp . It is now based on the Transformer model and outperforms CMIX v18 on enwik9. Much more processing power is needed, so a GPU is mandatory.
algorithm (4th January 2021), byronknoll (3rd January 2021), Darek (4th January 2021), Gotty (3rd January 2021), Mauro Vezzosi (4th January 2021), Mike (3rd January 2021), schnaader (3rd January 2021), xinix (3rd January 2021)
Code:
http://mattmahoney.net/dc/text.html

Feb 06 2021 - added nncp v2.1. nncp v2.1 was released Feb. 6, 2021. It is the same code as v2
except for a larger model and slightly different hyperparameters.

Compression               Compressed size          Decompresser  Total size   Time (ns/byte)
Program          Options  enwik8      enwik9       size (zip)    enwik9+prog  Comp    Decomp   Mem    Alg          Notes
-------          -------  ----------  -----------  ------------  -----------  ------  -------  -----  -----------  -----
nncp 2019-05-08           16,791,077  125,623,896  161,133 xd    125,785,029  420168   602409   2040  LSTM         84
nncp 2019-11-16           16,292,774  119,167,224  238,452 xd    119,405,676  826048  1156467   5360  LSTM         84
nncp v2                   15,600,675  114,317,255   99,671 xd    114,317,255  308645   313468  17000  Transformer  88
nncp v2.1                 15,020,691  112,219,309  100,046 xd    112,319,355  508332   515401  23000  Transformer  88

Compression               Compressed size          Decompresser  Total size   Time (ns/byte)
Program          Options  enwik8      enwik9       size (zip)    enwik9+prog  Comp    Decomp  Mem    Alg  Note
-------          -------  ----------  -----------  ------------  -----------  ------  ------  -----  ---  ----
nncp v2.1                 15,020,691  112,219,309  100,046 xd    112,319,355  508332  515401  23000  Tr   88
cmix v18                  14,838,332  115,714,367  208,961 s     115,923,328  602867  601569  25738  CM   83
Darek (9th February 2021)