Activity Stream

  • encode's Avatar
    Today, 01:07
    encode replied to a thread CHK Hash Tool in Data Compression
    Okay, it's the biggest update ever! Please welcome - CHK v3.00! https://compressme.net/ :_banana2:
    179 replies | 77020 view(s)
  • jethro's Avatar
    Yesterday, 19:13
    jethro replied to a thread Zstandard in Data Compression
    Thanks Cyan *** zstd command line interface 64-bits v1.4.2, by Yann Collet *** Win 10
    335 replies | 111859 view(s)
  • rainerzufalldererste's Avatar
    Yesterday, 17:14
    I have changed the rle8_ultra implementation to have a maximum run length of 32, which allows it to not iterate over the bytes to set at all, but rather just _mm256_storeu_si256 the symbols and move the write pointer forward by the count. This is obviously slower and less efficient for files with symbols that occur quite often, but it beats TurboRLE in terms of speed on my encoded image test file (by a tiny amount). I believe the results (both speed- and efficiency-wise) can be improved by selecting certain modes adaptively throughout the file (e.g. max length 32; use an unused symbol for the RLE symbol to not stop scanning on single occurrences of the symbol in question; etc.). Obviously TurboRLE still beats rle8 and rle8_ultra easily when not using an 8 bit RLE.
    New results:
    Encoded image file:
    Mode | Compression Rate | Compression Speed | Decompression Speed | Compression rate of result (using rans_static_32x16)
    - | 100 % | - | - | 12.861 %
    rle8 Ultra (Single Symbol Mode) | 24.1 % | 424.73 MB/s | 2564.1 MB/s | 43.793 %
    rle8 Normal (Single Symbol Mode) | 19.9 % | 444.52 MB/s | 2261.3 MB/s | 46.088 %
    rle8 Ultra | 24.2 % | 425.2 MB/s | 1559.79 MB/s | 43.681 %
    rle8 Normal | 19.9 % | 446.36 MB/s | 1473 MB/s | 45.944 %
    trle | 17.2 % | 699.13 MB/s | 1707.79 MB/s | -
    srle 0 | 18.7 % | 686.07 MB/s | 2522.70 MB/s | -
    srle 8 | 18.7 % | 983.68 MB/s | 2420.88 MB/s | -
    mrle | 19.7 % | 208.05 MB/s | 1261.91 MB/s | -
    The single run-length-encodable-symbol file I had also used previously (rle8_ultra is a lot slower here, but normal rle8 still barely beats TurboRLE):
    Mode | Compression Rate | Compression Speed | Decompression Speed | Compression rate of result (using rans_static_32x16)
    - | 100 % | - | - | 33.838 %
    rle8 Normal | 56.7 % | 528.59 MB/s | 5125.2 MB/s | 41.538 %
    rle8 Ultra | 58.9 % | 506.46 MB/s | 4417.12 MB/s | 43.64 %
    trle | 55.9 % | 307.96 MB/s | 2327.31 MB/s | -
    srle 0 | 56.5 % | 306.58 MB/s | 4975.80 MB/s | -
    srle 8 | 56.5 % | 354.67 MB/s | 4983.12 MB/s | -
    mrle | 56.7 % | 135.02 MB/s | 1837.48 MB/s | -
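    A minimal sketch of the max-32 decode idea described above (a sketch only, not the actual rle8 code; the run format and buffer handling are assumptions, and the output buffer needs at least 32 bytes of slack):
    #include <cstdint>
    #include <immintrin.h>
    // Expand one run of 'symbol' with length 1..32: always store 32 copies
    // unconditionally, then advance the write pointer by the real count.
    static uint8_t* expandRun(uint8_t* out, uint8_t symbol, uint8_t count) {
        const __m256i fill = _mm256_set1_epi8((char)symbol);
        _mm256_storeu_si256((__m256i*)out, fill); // one 32-byte store covers any run
        return out + count;                       // only 'count' bytes of it are kept
    }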
    4 replies | 328 view(s)
  • compgt's Avatar
    Yesterday, 16:31
    How good their compressors/decompressors are tells us how "intelligent" the programmers are. Understanding "context" is one measure. I wonder if investigators in other areas who are good with context would perform just as well in data compression.
    159 replies | 73032 view(s)
  • Cyan's Avatar
    Yesterday, 13:15
    Cyan replied to a thread Zstandard in Data Compression
    Hi @Jethro. The command line you present should have worked; it is a usual construction that is known and well tested. I can't explain from this snippet why it would not work for you...
    335 replies | 111859 view(s)
  • jethro's Avatar
    Yesterday, 10:10
    jethro replied to a thread Zstandard in Data Compression
    Yes, this is the dict.1 file:
    zstd --train .\train\* -o dict.1
    Trying 5 different sets of parameters
    k=1998 d=8 f=20 steps=4 split=75 accel=1
    Save dictionary of size 112640 into file dict.1
    335 replies | 111859 view(s)
  • Sportman's Avatar
    16th August 2019, 23:40
    Key Negotiation of Bluetooth Attack, Breaking Bluetooth Security: https://knobattack.com/ New Bluetooth Vulnerability Lets Attackers Spy On Encrypted Connections: https://thehackernews.com/2019/08/bluetooth-knob-vulnerability.html
    0 replies | 36 view(s)
  • Shelwien's Avatar
    16th August 2019, 23:12
    Shelwien replied to a thread Zstandard in Data Compression
    zstd dictionary file is not raw, you have to build it first:
    Dictionary builder :
    --train ## : create a dictionary from a training set of files
    --train-cover : use the cover algorithm with optional args
    --train-fastcover : use the fast cover algorithm with optional args
    --train-legacy : use the legacy algorithm with selectivity (default: 9)
    -o file : `file` is dictionary name (default: dictionary)
    --maxdict=# : limit dictionary to specified size (default: 112640)
    --dictID=# : force dictionary ID to specified value (default: random)
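    (For illustration, a typical train-then-use workflow with the zstd command line - the file names here are just examples: first build the dictionary with "zstd --train .\train\* -o dict.1", then compress with it using "zstd -D dict.1 file.csv -o file.csv.zst", and decompress with the same dictionary using "zstd -d -D dict.1 file.csv.zst -o file.csv".)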
    335 replies | 111859 view(s)
  • CompressMaster's Avatar
    16th August 2019, 22:04
    You're right. JPEG isn't random if it's decompressed. But I'm talking about already compressed data. Randomness in IT does not exist for me; it's all compressible, but only barely. I'm aware of how a JPEG REcompressor (such as StuffIt or paq) works. But my custom data preprocessing method is able to compress even random data WITHOUT recompression, i.e. it's not necessary to decompress the JPG file. And it's noticeably faster, although the compression ratio is not that good. 200KB original (preprocessed to almost 4MB - preprocessed, not decompressed, see the difference) to 174 KB lossless is possible with CMIX.
    @Gotty, what's your progress with the BSC algorithm anti-files?
    159 replies | 73032 view(s)
  • Gotty's Avatar
    16th August 2019, 19:51
    Gotty replied to a thread paq8px in Data Compression
    It's crazy, isn't it? I'll need to refresh the building instructions a bit anyway.
    1663 replies | 475146 view(s)
  • moisesmcardona's Avatar
    16th August 2019, 19:46
    moisesmcardona replied to a thread paq8px in Data Compression
    I had to add Shell32.lib to the linker options. Now it compiled :)
    1663 replies | 475146 view(s)
  • jethro's Avatar
    16th August 2019, 19:46
    jethro replied to a thread Zstandard in Data Compression
    How to use a dictionary with ZSTD? I tried:
    zstd BJ_all_Corr.csv -D dict.1 -o dict.zstd
    zstd: cannot use BJ_all_Corr.csv as an input file and dictionary
    How do I tell zstd which file is the trained dictionary (dict.1 here)?
    335 replies | 111859 view(s)
  • moisesmcardona's Avatar
    16th August 2019, 19:42
    moisesmcardona replied to a thread paq8px in Data Compression
    Hmm. If you were able to compile it with VS 2019 then I should as well... I'm using Visual Studio 2019 Community Edition too. Has the solution file changed, perhaps? Yes, it fails at the linker stage. Windows 10. Latest SDK installed.
    1663 replies | 475146 view(s)
  • Gotty's Avatar
    16th August 2019, 18:44
    Gotty replied to a thread paq8px in Data Compression
    I tested (compiled and run) successfully on/with:
    Windows 10: Visual Studio Community Edition 2017 15.9.12
    Windows 10: MinGW-w64 x86_64-7.2.0-win32-seh-rt_v5-rev1
    Windows 8.1: Visual Studio Community Edition 2019 16.1.5
    Windows 7: MinGW-w64 x86_64-7.2.0-win32-seh-rt_v5-rev1
    Lubuntu 19.04: GCC 8.3.0
    Could you try to include the following: #include <shellapi.h>
    CommandLineToArgvW is defined in that header, but I did not need to include it on my system, and I think it's not needed. It looks like you have successfully compiled the source, and got stuck at linking. Which Windows and which Visual Studio are you using?
    (Remark: command line parsing and file operations work everywhere, but character display works properly only in Linux and Windows 10. On Windows 10 you'll need to set a "good" console font (like Lucida Console) if you need to display filenames in exotic languages. The default raster font does not have enough glyphs.)
    EDIT: I've got it! In Visual Studio, in Configuration properties -> Linker -> Input -> Additional Dependencies
    You have: zlibstat.lib
    You need to have: zlibstat.lib;shell32.lib
    1663 replies | 475146 view(s)
  • moisesmcardona's Avatar
    16th August 2019, 17:12
    moisesmcardona replied to a thread paq8px in Data Compression
    Are there any changes that need to be done to compile it in Visual Studio? It fails with "unresolved external symbol __imp_CommandLineToArgvW". I have the Windows SDK installed and use the paq8px Visual Studio solution files.
    1663 replies | 475146 view(s)
  • Darek's Avatar
    16th August 2019, 11:06
    Darek replied to a thread Paq8pxd dict in Data Compression
    @kaitz -> Regarding the v67 version: for option -s15 and for some files (G.EXE, H.EXE from my filetest and AcroRd32.exe, FlashMX.pdf, MSO97.DLL from Maximum Compression) the program didn't finish the compression process and exited without a crash...
    Question about the WAV model - could you also implement the newest model from the paq8px v179/v181 version?
    And maybe a tip - if you want to optimize enwik scores then it might be reasonable to look at the Paq8pxd_v48_bwt4 version - it got 16'183'xxx bytes for the -s8 option with the model from paq8pxd v48.
    And scores for the paq8pxd v68 version -> 0.29% improvement = -30KB! Very nice gain, especially for the biggest files. That means paq8pxd v68 got a better total score on my testset than cmix v18 :) - 14KB less!
    639 replies | 253649 view(s)
  • Gotty's Avatar
    16th August 2019, 03:37
    Gotty replied to a thread paq8px in Data Compression
    - support for unicode file names
    - printing minimal feedback on screen when output is redirected (file names, file sizes, progress bar)
    Only user interface changes - no change in compression.
    1663 replies | 475146 view(s)
  • Gotty's Avatar
    16th August 2019, 03:37
    Gotty replied to a thread paq8px in Data Compression
    That's a good idea! Added.
    1663 replies | 475146 view(s)
  • Gotty's Avatar
    16th August 2019, 03:31
    Gotty replied to a thread paq8px in Data Compression
    Unfortunately it's not enough. While all i/o on Linux works perfectly with char* utf8 strings, on Windows even the command line arguments are lost when the input is not compatible with the current codepage (current = before the exe starts executing). If you try to give a cyrillic filename on a non-cyrillic locale, Windows will lose it as it can't convert it. You'll need to re-acquire the command line arguments with the CommandLineToArgvW function in wchar_t* (utf16) and convert them to utf8. The next problem comes when you try to work with functions such as fopen or stat: they can't handle multibyte (utf8) or wide characters (utf16) on Windows. You'll need to use _wfopen and _wstat (which are not available on Linux). Since they work with wide characters (utf16) only, you'll need to do some utf8 conversion as well. Here comes paq8px_v181fix1 - it contains all of the above.
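    A minimal sketch of that approach (assuming Windows; this is not the actual paq8px code, just an illustration of the API calls involved):
    #include <windows.h>
    #include <shellapi.h>   // CommandLineToArgvW - needs shell32.lib at link time
    #include <cstdio>
    #include <string>
    #include <vector>
    // Convert a UTF-16 string to UTF-8 with the Win32 API.
    static std::string utf16ToUtf8(const wchar_t* w) {
        int n = WideCharToMultiByte(CP_UTF8, 0, w, -1, nullptr, 0, nullptr, nullptr);
        if (n <= 0) return std::string();
        std::string s(n, '\0');                 // n includes the terminating '\0'
        WideCharToMultiByte(CP_UTF8, 0, w, -1, &s[0], n, nullptr, nullptr);
        s.resize(n - 1);                        // drop the terminator again
        return s;
    }
    int main() {
        // Re-acquire the arguments as UTF-16; main()'s char* argv may already have
        // lost characters that don't exist in the active codepage.
        int argc = 0;
        wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
        std::vector<std::string> utf8args;
        for (int i = 0; i < argc; i++) utf8args.push_back(utf16ToUtf8(wargv[i]));
        // fopen()/stat() can't take utf8 or utf16 paths on Windows; use the wide versions.
        if (argc > 1) {
            FILE* f = _wfopen(wargv[1], L"rb");
            if (f) fclose(f);
        }
        LocalFree(wargv);
        return 0;
    }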
    1663 replies | 475146 view(s)
  • Gotty's Avatar
    16th August 2019, 03:06
    Gotty replied to a thread paq8px in Data Compression
    I'm afraid that is not possible at the moment. There are a lot of tunable parameters in the source file, not just in the form of numbers, but in the form of algorithmic choices. For performance reasons they are all hardcoded. To be able to tune any of them you will need to grab the source file, change it, compile, compress some files and see if compression gets better... This is how we all do it. Don't forget that jpeg compression depends not only on the JpegModel but also on some more: the NormalModel, the MatchModel, the Mixer and the SSE stage.
    1663 replies | 475146 view(s)
  • moisesmcardona's Avatar
    16th August 2019, 00:37
    moisesmcardona replied to a thread paq8px in Data Compression
    You need Microsoft Visual C++ 2015 Redistributable https://www.microsoft.com/en-us/download/details.aspx?id=52685
    1663 replies | 475146 view(s)
  • LucaBiondi's Avatar
    16th August 2019, 00:12
    LucaBiondi replied to a thread paq8px in Data Compression
    Ok thanks, I found ZLIB1.DLL. But when I run Paq8px167ContextualMemory I get VCRUNTIME140_1.dll not found...
    Luca
    https://sqlserverperformace.blogspot.com/
    1663 replies | 475146 view(s)
  • Mauro Vezzosi's Avatar
    15th August 2019, 23:15
    > lstm_(25, 3, 20, 0.05)
    What implementation do you use? lstm-compress, nncp, a third-party library (which?), your own implementation, ...? Do you use it as a predictor or as a mixer? Only 25 nodes? It's fast, but rather small.
    639 replies | 253649 view(s)
  • Mauro Vezzosi's Avatar
    15th August 2019, 23:13
    Tested -time_steps and -seed on the following 3 files to check if and how much they gain.
    1.BMP
    333.003 (a) = Darek's min nncp
    332.617 (b) = (a) + -time_steps 17
    331.765 (c) = (b) + -hidden_size 192
    331.746 (d) = (c) + -adam_beta2 0.999
    No gain was found by changing -seed.
    O.APR
    6.052 (a) = Darek's min nncp
    6.045 (b) = (a) + -seed 0
    No gain was found by changing -time_steps.
    R.DOC
    34.802 (a) = Darek's min nncp
    34.692 (b) = (a) + -time_steps 17
    34.659 (c) = (b) + -seed 1
    80 replies | 7749 view(s)
  • moisesmcardona's Avatar
    15th August 2019, 22:14
    moisesmcardona replied to a thread paq8px in Data Compression
    Hmm, that's interesting. Try using the shared build. That one has the zlib1.dll file.
    1663 replies | 475146 view(s)
  • LucaBiondi's Avatar
    15th August 2019, 22:05
    LucaBiondi replied to a thread paq8px in Data Compression
    I used the static build
    1663 replies | 475146 view(s)
  • moisesmcardona's Avatar
    15th August 2019, 18:33
    moisesmcardona replied to a thread paq8px in Data Compression
    Did you use the static or the shared build? Try the second zip. I forgot to use the static zlib library in the first one.
    1663 replies | 475146 view(s)
  • LucaBiondi's Avatar
    15th August 2019, 18:27
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi moisescardona, I would like to try your version, but if I execute it with this syntax:
    Paq8px167ContextualMemory @txtlist.txt -9 -v -log paq8px__167cm.txt
    I obtain this error: Zlib1.dll not found, and then VCRUNTIME140_1...
    Where can I find this file?
    Thanks, Luca
    1663 replies | 475146 view(s)
  • kaitz's Avatar
    15th August 2019, 17:46
    kaitz replied to a thread Paq8pxd dict in Data Compression
    paq8pxd_v68 update
    im8Mode, im24model, exemodel from paq8px_179
    compressed stream /MZIP, EOL, ZLIB fail/
    linearPredictionModel from paq8px_179
    add charModel into wordModel from paq8px_179
    If I add the ppm model to v68 and an lstm model (lstm_(25, 3, 20, 0.05)), then enwik is about 16243xxx bytes with option -8
    639 replies | 253649 view(s)
  • maadjordan's Avatar
    15th August 2019, 15:11
    This is a main thread for all 7-zip addons and plugins.
    7-Zip-FL2: https://github.com/conor42/7-Zip-FL2 - 7-Zip using the Fast LZMA2 library.
    Asar7z: http://www.tc4shell.com/en/7zip/asar/ - This plugin allows you to open, create, or modify ASAR archives (used for packaging applications based on the Electron framework).
    eDecoder: http://www.tc4shell.com/en/7zip/edecoder/ - This plugin allows you to open the following email files and MHTML files in 7-Zip: MSG (Microsoft Outlook), TNEF (Microsoft Outlook, the winmail.dat or ATT0001.dat file), DBX (Outlook Express 5 and 6), MBX (Outlook Express 4), MBOX (many email clients), TBB (The Bat!), PMM (Pegasus Mail), EMLX (Apple Mail), EML, NWS, MHT, MHTML and B64, UUE and XXE, NTX (YEnc files), BIN (MacBinary files), HQX (BinHex files), WARC (Web ARChive files). It can also create B64, XXE, UUE, NTX, BIN, and HQX files. The plugin also contains eSplitter, a special codec that helps 7-Zip pack text files containing binary data encoded using base64.
    fast-lzma2: https://github.com/conor42/fast-lzma2 - adds the Fast LZMA2 match finder to 7-Zip.
    Forensic7z: http://www.tc4shell.com/en/7zip/forensic7z/ - This plugin allows you to open and browse the following disk images created by specialized software for forensic analysis: ASR Expert Witness Compression Format (.S01), EnCase Image File Format (.E01, .Ex01), EnCase Logical Image File Format (.L01, .Lx01), Advanced Forensics Format (.AFF), AccessData FTK Imager Logical Image (.AD1).
    Grit7z: http://www.tc4shell.com/en/7zip/grit7z/ - This plugin allows you to open, create, or modify PAK archives (used in Chrome or Chromium-based browsers).
    Iso7z: http://www.tc4shell.com/en/7zip/iso7z/ - This plugin allows you to open the following additional disc image formats in 7-Zip: CCD/IMG (CloneCD), CDI (DiscJuggler), CHD v4 (MAME), CSO, CUE/BIN, ECM (ECM Tool), GDI (Dreamcast Gigabyte disc images), ISZ (UltraISO), MDS/MDF (Alcohol 120%), NRG (Nero Burning ROM), Zisofs compressed files. The plugin also contains a special codec, RawSplitter, that enables 7-Zip to efficiently pack uncompressed raw disc images (CCD/IMG, CDI, CUE/BIN, GDI, MDS/MDF, or NRG) into 7z archives.
    Lzip7z: http://www.tc4shell.com/en/7zip/lzip/ - This plugin allows you to open, create, or modify LZIP archives (used on Unix-like systems, based on http://www.nongnu.org/lzip/lzip.html).
    Modern7z: http://www.tc4shell.com/en/7zip/modern7z/ - This plugin allows you to use the following additional compression methods in 7-Zip: Zstandard v1.4.2, Brotli v1.0.4, LZ4 v1.9.1, LZ5 v1.5, Lizard v1.0, Fast LZMA2 v0.9.2.
    7-Zip Zstd: https://github.com/mcmilk/7-Zip-zstd - Adds other compression methods to 7-Zip, with support for Brotli, Fast-LZMA2, Lizard, LZ4, LZ5 and Zstandard.
    Smart 7z: http://www.tc4shell.com/en/7zip/smart7z/ - This plugin provides flexible settings when packing files into a .7z archive.
    Thumbs7z: http://www.tc4shell.com/en/7zip/thumbs7z/ - This plugin allows you to open Thumbs.db and thumbcache.db files (Windows thumbnail cache files) in 7-Zip.
    Wavpack7z: http://www.tc4shell.com/en/7zip/wavpack7z/ - This plugin allows you to use the additional compression method WavPack in 7-Zip: .wav (including bwf/rf64 and Multiple Data Chunks (Legacy Audition Format) formats), .caf (Core Audio Format), .w64 (Sony Wave64), .dff (Philips DSDIFF), .dsf (Sony DSD stream), .aif, .afc (Audio Interchange File Format, including AIFF-C).
    Wincrypthashers: http://www.tc4shell.com/en/7zip/wincrypthashers/ - This plugin allows you to view additional checksums in 7-Zip: MD2, MD4, MD5, SHA-384, SHA-512. It can also create text files containing checksums.
    Mfilter: https://encode.su/threads/15-7-Zip?p=60616&viewfull=1#post60616 (UNDER DEVELOPMENT) - Currently adds JPG recompression to data streams through Lepton and Brunsli.
    Old plugins:
    LZMH: new compression method from Igor Pavlov - https://encode.su/threads/1117-LZHAM?p=22513&viewfull=1#post22513
    LZHAM: new compression method at https://github.com/richgel999/lzham_codec_devel
    Optimizers:
    7-Zip finetuner: http://krzyrio.blogspot.com/2017/01/7-zip-finetuner.html - optimizes 7z archives to a smaller size.
    Ultra7z Optimizer: http://www.ultra7z.ru/ - Optimize and convert your 7z (rar, zip, ...) into smaller 7z archives!
    More plugins: http://www.tc4shell.com/en/7zip/
    Test files:
    7zmethods: https://encode.su/threads/15-7-Zip?p=11695&viewfull=1#post11695 - a 7z file with files compressed with different methods in one file.
    Other developments:
    EcoZip: based on 7-Zip with additional compression methods - https://github.com/StephanBusch/EcoZip
    0 replies | 139 view(s)
  • SolidComp's Avatar
    15th August 2019, 14:36
    Interesting, I had heard about the P30 having some unique features but not that.
    4 replies | 290 view(s)
  • SolidComp's Avatar
    15th August 2019, 14:30
    That's a good point. I think a good use for libdeflate and zopfli is precompressing static files - for static site generators like Jekyll, SVG images, etc.
    6 replies | 188 view(s)
  • SolidComp's Avatar
    15th August 2019, 14:21
    Well, it looks like he ultimately got Cloudflare's fork working: https://community.centminmod.com/threads/enable-cloudflare-zlib-performance-library-by-default-for-nginx-zlib.14084/
    6 replies | 188 view(s)
  • SolidComp's Avatar
    15th August 2019, 14:10
    Yes, libdeflate is very good. The issue with those projects is that none are drop-in replacements for zlib, so they have very limited use. libdeflate doesn't do streaming. zlib-ng was declared by its contributors as not production ready last time I checked. Cloudflare doesn't document their forks and provides no instructions, so they usually don't work. The Intel fork didn't work either when the Centminmod project tried to use it (Cloudflare wouldn't build either).
    6 replies | 188 view(s)
  • Jyrki Alakuijala's Avatar
    15th August 2019, 10:50
    Streaming and non-streaming implementations/formats should not be mixed in benchmarking, as properly streaming implementations are often about 2x slower but lead to better overall system performance (like web pages loading faster, intermediate results shown earlier, subsequent queries issued earlier hiding overall latency). For gzip's main use (the web), a non-streaming implementation would likely be highly harmful to system performance.
    6 replies | 188 view(s)
  • Jarek's Avatar
    14th August 2019, 16:55
    Especially for small blocks, it is worth considering very approximate frequencies, like in the representation from the diagram above: with accuracies optimizing header cost + length * Kullback-Leibler. Canonical Huffman can be seen as an example of such an approximation: it represents only 1/2^k probabilities. It might also be worth considering parametric distributions, like Golomb codes for a Laplace distribution, storing/predicting only their parameters ( https://encode.su/threads/3124-LOCO-I-(JPEG-LS)-uses-365-parameters-to-choose-width-any-alternatves ).
    11 replies | 376 view(s)
  • JamesB's Avatar
    14th August 2019, 15:59
    You may also want to consider libdeflate or zlibng: https://github.com/ebiggers/libdeflate https://github.com/zlib-ng/zlib-ng There are CloudFlare and Intel optimised versions of zlib too. Edit: for block based formats libdeflate is super. I added it to htslib for the "BAM" sequence alignment format. It beat both cloudflare and intel versions and trounced the original zlib, but bizarrely all linux distributions seem to still ship with the slow code base. Libdeflate however isn't a drop-in replacement (unlike the others). https://github.com/samtools/htslib/pull/581
    6 replies | 188 view(s)
  • JamesB's Avatar
    14th August 2019, 15:57
    One thing I wrestled with was how to store frequencies for both large and small blocks. Large blocks we'll be normalising the frequencies to, say, 14 bit anyway. Small blocks we don't want to normalise them upwards as that's inventing extra precision that we have to store. We could store the actual or normalised (whatever is smaller) and always renormalise on decode, but you have to be very careful when renormalising in the decoder to ensure the maths is exact with no potential rounding differences. Instead I normalise to the maximum bit count (say 14) or the next highest bit count if the block is small. Then the decode process can either use a smaller lookup table or, for my use in Order-1 tables where everything must be the same normalised number, scale up using a simple bit shift. Storing order-1 frequencies, ie a 2D array, poses its own problems though. I wanted completely random access so couldn't use previous blocks, but I can use order-0 frequencies to aid the order-1 storing. I store a list of all the order-0 symbols followed by all the order-1 tables for those symbols only, including all the zero frequency entries. For those, I also have a run length because many order-1 frequencies have large runs of 0-frequency values (for symbols that may occur in another context). Example code here: https://github.com/jkbonfield/htscodecs/blob/master/javascript/rans4x16.js#L919
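    As a rough illustration of that normalisation step (a sketch, not the htscodecs code; the "push the rounding error onto the most frequent symbol" fix-up is a simplification):
    #include <cstdint>
    #include <vector>
    // Scale raw counts so they sum to exactly 1 << bits, keeping every nonzero
    // count at least 1 so no occurring symbol ends up with probability zero.
    std::vector<uint32_t> normalizeFreqs(const std::vector<uint32_t>& counts, int bits) {
        uint64_t total = 0;
        for (uint32_t c : counts) total += c;
        std::vector<uint32_t> freqs(counts.size(), 0);
        if (total == 0) return freqs;
        const uint32_t target = 1u << bits;
        uint64_t assigned = 0;
        size_t maxIdx = 0;
        for (size_t i = 0; i < counts.size(); i++) {
            if (counts[i] == 0) continue;
            uint32_t f = (uint32_t)(((uint64_t)counts[i] * target) / total);
            if (f == 0) f = 1;             // never drop an occurring symbol
            freqs[i] = f;
            assigned += f;
            if (counts[i] > counts[maxIdx]) maxIdx = i;
        }
        // Crude fix-up: push the rounding error onto the most frequent symbol
        // (a real coder must also keep this value >= 1).
        freqs[maxIdx] = (uint32_t)((int64_t)freqs[maxIdx] + (int64_t)target - (int64_t)assigned);
        return freqs;
    }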
    11 replies | 376 view(s)
  • jibz's Avatar
    14th August 2019, 15:57
    I believe PNG files store their deflate compressed data in zlib format.
    6 replies | 188 view(s)
  • JamesB's Avatar
    14th August 2019, 15:49
    If you're encoding block by block but not having synchronisation points for random access, then yes using the previous freqs helps. It's possible you may want to just exploit the previous symbol order though. One idea: for text we store ETAONRISHD... for the first block of letters sorted by frequency, along with their frequencies as deltas to the previous. The next block we don't store the symbol order, but do store the frequencies in that order, with delta (which may now be negative) and then sort to compute the actual frequency order - eg maybe I is now more common and has moved up a couple: ETAOINRSHD. Our delta in the second block would mean I frequency is negative when delta vs R, but zig-zag encoding can cope with that. I don't know if this helps much, but I'd guess the order of symbol frequency to vary less than the actual frequencies themselves. If so then perhaps we can use this to reduce complexity in the frequency values.
    11 replies | 376 view(s)
  • SolidComp's Avatar
    14th August 2019, 15:35
    Hi all – The Large Text Compression Benchmark led me down a winding road on tracking down the canonical gzip and zlib applications. The LTCB uses an ancient version of gzip for Windows, which is almost certainly slower than current zlib for generating gzip files, and probably slower than GNU gzip. The gzip for Windows in the LTCB was released in 2006 or possibly earlier. Using that old build is therefore misleading with respect to how gzip or zlib actually performs compared to other codecs. If you want to include gzip or zlib in your benchmarks, these are the relevant websites:
    GNU gzip:
    http://savannah.gnu.org/projects/gzip/
    https://www.gnu.org/software/gzip/
    http://www.gzip.org/
    zlib:
    https://www.zlib.net/
    https://github.com/madler/zlib
    As you know, zlib is generally used to generate gzip files, typically on web servers. zlib can also generate a zlib file format, but no one seems interested in it (is it another DEFLATE wrapper?), since browsers consume gzip. zlib is a library used for streaming applications, like the above mentioned web servers (nginx, Apache, IIS, H2O). GNU gzip is an end-user application included in most Linux and BSD distributions for compressing files as needed. (Is it called by other applications?) They have three different websites, as listed above. It might be worth including both in benchmarks (the latest versions).
    6 replies | 188 view(s)
  • Shelwien's Avatar
    14th August 2019, 14:34
    They are supported, but somehow with "|" as column separator and LF as row separator:
    A|B|C
    D|E|F
    G|H|I
    {table}A|B|C
    D|E|F
    G|H|I
    {/table}
    4 replies | 328 view(s)
  • rainerzufalldererste's Avatar
    14th August 2019, 14:11
    I've added a small optimization which enables rle8 to slightly beat TurboRLE in specific cases. TurboRLE does an awesome job even in complex circumstances, where rle8 isn't that competitive.
    Single RLE-Worthy Symbol:
    Mode | Compression Rate | Compression Speed | Decompression Speed | Compression rate of result (using rans_static_32x16)
    - | 100 % | - | - | 33.838 %
    rle8 Normal | 56.7 % | 528.593 MB/s | 5125.2 MB/s | 41.538 %
    rle8 Ultra | 56.7 % | 507.868 MB/s | 5112.2 MB/s | 42.400 %
    trle | 55.9 % | 307.96 MB/s | 2327.31 MB/s | -
    srle 0 | 56.5 % | 306.58 MB/s | 4975.80 MB/s | -
    srle 8 | 56.5 % | 354.67 MB/s | 4983.12 MB/s | -
    More Complex File:
    Mode | Compression Rate | Compression Speed | Decompression Speed | Compression rate of result (using rans_static_32x16)
    - | 100 % | - | - | 12.861 %
    rle8 Normal (Single Symbol Mode) | 20.0 % | 444.518 MB/s | 2261.307 MB/s | 46.088 %
    rle8 Ultra (Single Symbol Mode) | 20.0 % | 444.152 MB/s | 2329.594 MB/s | 53.556 %
    rle8 Normal | 19.9 % | 446.355 MB/s | 1472.995 MB/s | 45.944 %
    rle8 Ultra | 19.9 % | 444.957 MB/s | 1457.883 MB/s | 53.428 %
    trle | 17.2 % | 699.13 MB/s | 1707.79 MB/s | -
    srle 0 | 18.7 % | 686.07 MB/s | 2522.70 MB/s | -
    srle 8 | 18.7 % | 983.68 MB/s | 2420.88 MB/s | -
    4 replies | 328 view(s)
  • Jarek's Avatar
    14th August 2019, 13:47
    Alternatively you could just continue scaling (renormalization) until reaching https://en.wikipedia.org/wiki/Range_encoding Also there are lots of available implementations, see e.g. references in https://sites.google.com/site/powturbo/entropy-coder
    8 replies | 352 view(s)
  • Jarek's Avatar
    14th August 2019, 13:41
    Storing entire counts is a simple option, but not necessarily optimal - you might need a smaller total number of bits if storing approximated frequencies instead. The splitting into subranges is just the first step - top of the diagram, then below is adding successive bits until reaching chosen accuracy: higher for low probabilities.
    11 replies | 376 view(s)
  • Shelwien's Avatar
    14th August 2019, 13:36
    > Ideally we would like to optimize minimal description length
    With AC/rANS we can use all information from precise symbol counts, so precision trade-offs don't make any sense.
    > start with splitting the range into #symbols equal subranges
    My implementation recursively encodes range sums until it reaches individual symbols, but based on paq results (and James' preprocessing) there's a potential for improvement.
    11 replies | 376 view(s)
  • Jarek's Avatar
    14th August 2019, 13:29
    While Huffman has canonical coding, optimal encoding of frequencies for an accurate entropy coder is a difficult problem. I had a thread about it: https://encode.su/threads/1883-Reducing-the-header-optimal-quantization-compression-of-probability-distribution
    Ideally we would like to optimize minimal description length:
    header cost + redundancy from inaccuracies ~ cost of storing probabilities + Kullback-Leibler * length
    Kullback-Leibler(p,q) ~ sum_s (p_s - q_s)^2 / p_s
    So e.g. we should use larger accuracy for longer blocks and for lower probability symbols. Here is an old construction of mine (but I haven't worked further on it):
    - encode the CDF as points in the (0,1) range,
    - start with splitting the range into #symbols equal subranges, encode the number of points in each with unary coding - a trick from James Dow Allen which turns out nearly optimal (slides 17, 18 here),
    - then scan the symbols, adding single bits of information until reaching the chosen precision.
    11 replies | 376 view(s)
  • Shelwien's Avatar
    14th August 2019, 13:00
    > knowing how much you can get when not chopped at least hints at a baseline to aim for.
    It already works better than a similar coder with incremented freqs. And overall compression improves when I use smaller blocks.
    > You may be able to predict the upcoming frequencies based on
    > what's gone before in this block of frequencies.
    Well, I actually have a proper bitwise adaptive model for freqs there, it already produces results similar to default lzma. Just that it doesn't remember individual freqs from the previous block atm.
    > E.g. if we're encoding freq for symbol 200 and the sum of all freqs for symbols 0 to 199
    > add up to 65535 then we know there's only 1 symbol left out there.
    That's probably a rare case.
    > Depending on the cardinality of the symbols, it may also be better to sort by frequency
    > and write the symbol letters out in addition to the delta frequencies,
    That's the idea I had actually - I already sort symbols anyway. But it also doesn't take into account the similarity of symbol freqs in adjacent blocks.
    > I'm not sure how FSE works, but it's likely Yann will have put some effort
    > into finding a good solution that doesn't cost too much time.
    I think zstd just compresses the stats with the default literal coder using precomputed state tables. Jyrki Alakuijala said that it's not worth the extra complexity when headers take like 0.2% of compressed data (0.5% here). But I think that it's better to compress the headers as much as possible (since it doesn't change the processing speed anyway).
    11 replies | 376 view(s)
  • JamesB's Avatar
    14th August 2019, 12:17
    Blockwise you'd need to evaluate those compression tools by chopping the data into blocks and testing each block individually, but knowing how much you can get when not chopped at least hints at a baseline to aim for.
    You may be able to predict the upcoming frequencies based on what's gone before in this block of frequencies. E.g. if we're encoding the freq for symbol 200 and the sum of all freqs for symbols 0 to 199 adds up to 65535, then we know there's only 1 symbol left out there.
    Depending on the cardinality of the symbols, it may also be better to sort by frequency and write the symbol letters out in addition to the delta frequencies, but generally I found that didn't work too well. Some data sets it may do though. I'm not sure how FSE works, but it's likely Yann will have put some effort into finding a good solution that doesn't cost too much time.
    Edit: or perhaps there is a sweet range, expressible as 2 bytes (e.g. 32 to 122), whose symbols we encode first, and then we encode the remainder (0-31 and 123 to 255) armed with the knowledge that the total remaining frequency is very low. If you can work out which range leaves no unstored frequency of >= 256 then you know the others all fit in 1 byte, meaning no var-int encoding is necessary and freqs between 128 and 255 don't take up an additional byte (assuming a byte-based varint method instead of a bit-based one).
    11 replies | 376 view(s)
  • Shelwien's Avatar
    14th August 2019, 11:59
    Can't really rotate anything - I'm trying to make a blockwise adaptive rc to compete with rANS. So I'd like each table to be stored in the header of each block. But yeah, I guess there's more similarity in freqs of individual symbols than I expected.
    47,270,885 0.250s 0.125s // rans_static -o0
    46,715,033 1.203s 2.031s // gcc82/k8
    46,715,033 1.047s 2.016s // gcc82/native
    46,715,033 1.094s 2.234s // clang900x/k8
    46,715,033 0.750s 1.938s // IC19 AVX2
    46,502,448 0.781s 2.000s // BLKSIZE=1<<14 (was 15)
    11 replies | 376 view(s)
  • JamesB's Avatar
    14th August 2019, 11:42
    Alternatively a data rotation by 512. So byte 0, 512, 1024, 1536, ..., followed by byte 1, 513, 1025, 1537, ... etc. Maybe in 16-bit quantities instead of 8-bit too if you wish to keep high and low bytes per value together.
    Rotate by 512 | paq8px_v101 yielded 183994 bytes (179070 on 16-bit quantities) and xz gets it to 192664. That's a trivially fast transform to use. Eg hacky perl 1-liner again for data rotation with 8 and 16-bit quantities.
    @ seq3a; perl -e '$/=undef;@a=unpack("C*",<>);for ($i=0;$i<929;$i++) {for($j=0;$j<512;$j++) {$b=$a}};print pack("C*",@b)' 1.frq > 1.frq.rot1
    @ seq3a; xz < 1.frq.rot1|wc -c
    192664
    @ seq3a; perl -e '$/=undef;@a=unpack("S*",<>);for ($i=0;$i<929;$i++) {for($j=0;$j<256;$j++) {$b=$a}};print pack("S*",@b)' 1.frq > 1.frq.rot2
    @ seq3a; xz < 1.frq.rot2|wc -c
    199936 # paq8 is 179070
    11 replies | 376 view(s)
  • JamesB's Avatar
    14th August 2019, 11:37
    You could split into 512 byte chunks, delta each uint16 vs the previous uint16 (512 bytes ago), zig-zag(ish) and var-int encode, and then chuck through your favourite compression tool or entropy encoder. The logic being that each 64k block of input is likely to have some similarity to the last. Alternatively build up an average frequency and delta vs that, if we believe the frequencies are static.
    Doing the former and putting through xz gave me 211044, assuming my process was actually reversible and bug free. It harms more complex tools like paq though. I did a horrid perl 1 liner hack :-)
    perl -e '$/=undef;for($i=0;$i<256;$i++){$b=0}@a=unpack("S*",<>);$n=0;while($n < $#a){$i=0;foreach (@a) {$x=$_;$_-=$b;$_=(abs($_)*2)+($_<0);$b=$x;do {print chr(($_ & 0x7f)+($_>=128?128:0));$_>>=7;}while($_)};$n+=256}'
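    A rough sketch of that preprocessing (a sketch of the technique, not JamesB's code; the 256-entry table size and the byte-based varint layout are assumptions):
    #include <cstdint>
    #include <vector>
    // Delta each uint16 frequency against the one from the previous table
    // (256 entries / 512 bytes earlier), zig-zag encode the signed difference,
    // and emit it as a byte-based varint (7 data bits per byte, high bit = more).
    std::vector<uint8_t> packFreqTables(const std::vector<uint16_t>& freqs) {
        const size_t tableSize = 256;
        std::vector<uint8_t> out;
        for (size_t i = 0; i < freqs.size(); i++) {
            int32_t prev = (i >= tableSize) ? freqs[i - tableSize] : 0;
            int32_t delta = (int32_t)freqs[i] - prev;
            uint32_t zz = ((uint32_t)delta << 1) ^ (uint32_t)(delta >> 31); // zig-zag
            while (zz >= 0x80) {
                out.push_back((uint8_t)(zz | 0x80));
                zz >>= 7;
            }
            out.push_back((uint8_t)zz);
        }
        return out; // then feed this to xz or an entropy coder
    }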
    11 replies | 376 view(s)
  • InfiniteAC's Avatar
    14th August 2019, 10:34
    Thank you very much, I will have a look at it later :)
    8 replies | 352 view(s)
  • Shelwien's Avatar
    14th August 2019, 03:45
    Actually here it is: http://nishi.dreamhosters.com/u/bitalign_v0.rar
    I made some small changes to sh_v2f_static (like adding a flush at LF and a bitoffset index in the frq file). Surprisingly, tail cutting actually reduced the compressed data size, so the compressed file size for book1 is 433033 (but there's also the index).
    So a script like this:
    coder c book1 1 1_frq
    coder d 1 2 1_frq
    md5sum 2 book1
    echo line 0:
    coder d0 1 con 1_frq
    echo line 1:
    coder d1 1 con 1_frq
    echo line 16621:
    coder d16621 1 con 1_frq
    works like this:
    C:\9A6R-bitalign_ari\!arc\bitalign_v0>test.bat
    0a0fdbaf0589c9713bde9120cbb20199 *2
    0a0fdbaf0589c9713bde9120cbb20199 *book1
    line 0: <Y 1874>
    line 1: <A T. HARDY>
    line 16621: THE END
    8 replies | 352 view(s)
  • Shelwien's Avatar
    14th August 2019, 01:59
    http://nishi.dreamhosters.com/u/freqtable_v1.rar
    blkfrq book1 book1.frq creates a file with a "uint16_t freq[256]" freqtable repeated for each 64k block of the input file.
    frqcomp c book1.frq book1.ari compresses it (see model_freq.inc). The current model works like this:
    for( i=A,sum=0; i<B; i++ ) sum+=freq[i];
    encode( sum );
    uint h = (B-A)>>1;
    uint sum0 = freq_proc( freq,A+0,A+h, sum,0 );
    freq_proc( freq,A+h,B+0, 0, sum-sum0 );
    Even this is better than simple adaptive bytewise order0 coding:
    63,086,136 // fp32_rc
    63,001,795 // fp32_rc with model_freq
    62,826,269 // sh_v1x with model_freq
    But there's some further potential:
    475,648 1.frq // 16-bit freqtables, 512 bytes per block
    227,489 2.ari // frqcomp
    220,489 2.lzma // lzma.exe e 1 2l -fb273 -mc999 -lc2 -lp1 -pb4 -mt1
    219,133 2.pa // 7z.exe a -m0=plzma:a1:fb273:lc2:lp1:pb4 2.pa 1
    191,220 1.paq8px181 // paq8px_v181.exe -7 1
    191,102 1.paq8pxd67 // paq8pxd_v67_SSE42.exe -s7 1
    202,600 1.paq8hp12 // paq8hp12.exe c 1 2p
    Any ideas on freqtable model improvement? (The test file 1.frq can be found in the archive above.)
    11 replies | 376 view(s)
  • Shelwien's Avatar
    14th August 2019, 01:17
    > I need an AC that does not need to encode an EOF symbol, because I always know how many symbols will be encoded
    The thread I linked contains working implementations of various coders that do exactly that. The most recent version even implements self-termination (AC EOF is created by adjusting the flush code to cause a decoding error at file EOF). Modifying it to work with bit alignment should be simple enough, I can make a demo if necessary. Basically you just have to remove the byte alignment loop from rangecoder_v0e/sh_v2f.cpp:
    while( byte(bits>>24)!=0xFF ) outbit(1);
    (A rangecoder is the same thing as AC, just optimized for bytewise i/o.)
    8 replies | 352 view(s)
  • moisesmcardona's Avatar
    14th August 2019, 00:01
    moisesmcardona replied to a thread paq8px in Data Compression
    Static compile. The previous one had a shared zlib library.
    1663 replies | 475146 view(s)
  • InfiniteAC's Avatar
    13th August 2019, 23:39
    Maybe it is still unclear what I mean, so here is another take on it, at least I believe you are doing something else: At the end of my encoding process I have my interval . Now I have another implementation of AC where I simply use doubles for simplicity and do not use any scaling operations (naive AC). This works for my current problem, because I can restrict myself to short messages and no extreme probabilities. But for some future development I need an AC that does not need to encode an EOF symbol, because I always know how many symbols will be encoded (it is a fixed number known prior both by the encoder and the decoder).
    8 replies | 352 view(s)
  • SvenBent's Avatar
    13th August 2019, 22:47
    Yeah, I know AES was chosen due to speed in the final, so it would be easier/faster to integrate and could get widespread support. Rabbit and NC-256 are other winners in eSTREAM, which is the European version of AES: https://en.wikipedia.org/wiki/ESTREAM
    salsa20
    HC-128/256
    Rabbit
    But besides chacha20, the others, which are all from the same portfolio, are like forgotten myths. It's just funny to see how the AES semifinalists are very well known but the eSTREAM finalists are not getting any attention besides salsa20 (chacha20).
    -- edit --
    I wrote NC-256; the real name is HC-256.
    2 replies | 85 view(s)
  • Shelwien's Avatar
    13th August 2019, 22:42
    >> but the shortest code would be 10 or something
    > only in cases where you already know what you are going to code in the future.
    > Else you need the sufficient condition of codewordlength = ceil(-log(HIGH-LOW))+1.
    Not quite. If it's actually EOF, we can tweak i/o to return 0s or 1s when reading after EOF. E.g. a "0" flush with padding of 1s would turn into "0111(1)" and would fit into the interval. As to "10", any "10xx(x)" fits, and we'd be able to return the unnecessary bits after decoding.
    >> can't you just write a loop to find the shortest possible codeword that fits within the interval?
    > Well, I could write a loop where I test whether the decoder would decode
    > the correct message, because the problem only arises at the end which keeps
    > the complexity low in most cases.
    Yes. But it's not necessary to test full decoding - you just need to find the right sub-interval. I meant something like this:
    x = HIGH;
    n = ceil(-log2(HIGH));
    for( l=0; l<n; l++ ) {
      mask = (1<<l)-1;
      nmask = ~mask;
      xl = x & nmask;
      xh = x | mask;
      if( (xl>=LOW) && (xh<HIGH) ) break;
    }
    > To give context, I want to encode binary messages that are rather short
    > (but too long for efficient huffmancode) and the overhead of an EOF symbol
    > is crucial in this situation.
    I posted some similar coders here: https://encode.su/threads/3084-Simple-rangecoder-with-low-redundancy-on-short-strings?p=59643&viewfull=1#post59643
    Although these are byte-aligned.
    8 replies | 352 view(s)
  • InfiniteAC's Avatar
    13th August 2019, 22:23
    Thank you for your reply! I don't want to find the length, I already know how to calculate it. I need to find a way to actually derive the correct codeword that lies in the desired interval without using an EOF symbol.
    > but the shortest code would be "10" or something
    Only in cases where you already know what you are going to code in the future. Else you need the sufficient condition of codewordlength = ceil(-log(HIGH-LOW))+1.
    > can't you just write a loop to find the shortest possible codeword that fits within the interval?
    Well, I could write a loop where I test whether the decoder would decode the correct message, because the problem only arises at the end, which keeps the complexity low in most cases. But this seems like an unnecessary workaround. To give context, I want to encode binary messages that are rather short (but too long for efficient Huffman code) and the overhead of an EOF symbol is crucial in this situation.
    8 replies | 352 view(s)
  • Shelwien's Avatar
    13th August 2019, 22:20
    Here they have the requirements for AES: https://competitions.cr.yp.to/aes.html
    As you can see, it's not decided based on cryptographic strength only. But I never heard of NC-256 or Rabbit, where did you find them?
    2 replies | 85 view(s)
  • Shelwien's Avatar
    13th August 2019, 22:07
    If it's done only once, at the coder flush, can't you just write a loop to find the shortest possible codeword that fits within the interval? I don't think it's possible to just calculate the length, for example LOW=0111, HIGH=1111, -log(HIGH-LOW)+1=4, but the shortest code would be "10" or something (it can also be "1" or "0" depending on the implementation).
    8 replies | 352 view(s)
  • moisesmcardona's Avatar
    13th August 2019, 19:47
    moisesmcardona replied to a thread paq8px in Data Compression
    Compiled paq8px v167cm. I had to add #include <cmath> and also #define USE_ZLIB since it wasn't defined. Built using Visual Studio 2019. It seems the use of the dictionary is broken in this build.
    1663 replies | 475146 view(s)
  • Darek's Avatar
    13th August 2019, 19:29
    Darek replied to a thread Paq8pxd dict in Data Compression
    Scores of enwik8/9 compressed by paq8pxd v67 - slightly worse than the previously tested version, however this release (AVX2) is about 50% faster than v63!
    16'309'012 - enwik8 -s8 by Paq8pxd_v63, time 9'240s
    15'967'201 - enwik8 -s15 by Paq8pxd_v63, time 8'880s
    16'637'302 - enwik8.drt -s15 by Paq8pxd_v63, time 11'837s
    126'597'584 - enwik9_1423 -s15 by Paq8pxd_v63, time 98'387s
    16'374'223 - enwik8 -s8 by Paq8pxd_v67_AVX2, +0.40% to v63 score, time 6'431s
    16'048'070 - enwik8 -s15 by Paq8pxd_v67_AVX2, +0.51% to v63 score, time 6'643s
    16'774'998 - enwik8.drt -s15 by Paq8pxd_v67_AVX2, +0.83% to v63 score, time 8'413s
    127'063'602 - enwik9_1423 -s15 by Paq8pxd_v67_AVX2, +0.37% to v63 score, time 66'041s
    639 replies | 253649 view(s)
  • kaitz's Avatar
    13th August 2019, 19:12
    kaitz replied to a thread paq8px in Data Compression
    https://www.researchgate.net/publication/334136036_Improving_Lossless_Image_Compression_with_Contextual_Memory Github: https://github.com/AlexDorobantiu/Paq8px167ContextualMemory
    1663 replies | 475146 view(s)
  • InfiniteAC's Avatar
    13th August 2019, 18:55
    Hi, I am currently trying to figure out how to determine the correct binary value in the finite precision implementation of arithmetic coding (AC). I'd like to implement it with the side information about the number of symbols encoded, so that I do not have to use an end-of-stream symbol. I was not able to find such an implementation anywhere, so I'd like to implement it myself.
    Let's go by example: I have an alphabet with two symbols {a, b} and would like to encode the message "abab" with symbol probabilities P(a) = 0.8, P(b) = 0.2. Going by infinite precision arithmetic coding I end up at the interval , use 7 bits and get the codeword "1100000" (by using the binary representation of the lower bound truncated after the 7th bit + adding 2^-7 to make it lie within the interval), corresponding to the probability 0.75.
    Now using the usual E1-E3 scaling operations, for finite precision AC I get the following sequence of intervals:
    Message to be encoded: "abab":
    i) ceil(-log2(HIGH - LOW)) + 1 + numel(bin_seq) + E3_count; where numel(bin_seq) gives the number of bits already outputted (here: one), E3_count is the number of unresolved E3 scalings, and ceil is the round-up function.
    But I am still unsure about finding the correct codeword from the interval I ended up at. Anyone got an idea? I mean, the codeword has to correspond to a probability within [0.64, 0.8), this can be deduced from ii). Before I write unnecessary text, does anyone know how to correctly find the final codeword? That would be a great help, thank you :)
    8 replies | 352 view(s)
  • Piotr Tarsa's Avatar
    13th August 2019, 18:38
    If that's Python then you're missing indentation.
    44 replies | 2745 view(s)
  • jjh1990n's Avatar
    13th August 2019, 16:16
    I’m doing the archiver, will do soon.
    import binascii
    a=0
    b=0
    l=""
    j=0
    b=0
    m =
    while b<256:
    m+=1]
    b=b+1
    numbers =
    name = input("What is name of file? ")
    with open(name, "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    s=str(data)
    with open("tesj.mirror", "wb") as binary_filen:
    for byte in data:
    av=bin(byte)
    a=a+1
    if a<=768:
    byte=int(byte)
    m = byte
    numbers.append(byte)
    if a == 768:
    print(m)
    a=0
    del numbers
    numbers =
    m =
    b=0
    while b<256:
    m+=1]
    b=b+1
    b=0
    44 replies | 2745 view(s)