Activity Stream

  • LawCounsels's Avatar
    15th February 2021, 01:01
     Hi All: I understand Hutter has increased the prize fund 10x, to €500,000. That is at last worth the effort to claim. There is a minor snag which I hope the organisers can help overcome: is it possible to somehow verify results to a gold standard (so nothing shady is possible) without laying bare the source code and thus the algorithm? And is there a cast-iron undertaking that the .exe will not be retained after the tests (perhaps testing in the presence of the author's representative, who keeps control of the running .exe)?
    66 replies | 2210 view(s)
  • Hacker's Avatar
    17th February 2021, 00:23
     Since Squeeze Chart is not being updated anymore, I wonder what packers are actively developed, i.e. with at least one new version in the last six months? PAQ and its gazillion forks are actively developed but mostly experimental. 7-Zip and RAR go on as usual. The development of EMMA, MCM, NanoZip, PackJPG, PackRAW, Razor, UDA, ZCM and ZPAQ seems to be on hold or has been abandoned altogether. Are any other packers actively developed, please? Thank you.
    32 replies | 2394 view(s)
  • Trench's Avatar
    27th February 2021, 08:22
     I asked a while ago about compressing a binary file and got some decent programs that can compress, but they do worse on raw binary than on the same data converted to hex/ASCII and then compressed, which beat plain binary compression by around 15% in what I tested. Other programs like 7-Zip are not as good either, even though its LZMA2 handles binary better than bzip2. Maybe my tests were not done properly; some might feel the file is not big enough (32 KB), or that I did not use the proper configurations. These were just quick tests. Maybe more focus is needed on binary compression? Bad enough that, as stated before, other compressed formats like PNG are off by over 50%.
    21 replies | 616 view(s)
  • mitiko's Avatar
    27th February 2021, 22:56
     Dictionary transforms are quite useful, but do we take advantage of all they can offer when optimizing for size? LZ77/78 techniques are widely known, and optimal parsing is an easy O(n) minimal path search in a DAWG (directed acyclic weighted/word graph). In practice, optimal LZ77 algorithms don't do optimal parsing because of the big offsets it can generate. LZ algorithms also lose context when converting words to (offset, count) pairs; these pairs are harder to predict for entropy coders.
     I've been working on BWD - the pretentious name stands for Best Word Dictionary. It's a semi-adaptive dictionary, which means it computes the best/optimal dictionary (or tries to get some suboptimal approximation) for the given data and adds it to the stream. This makes decompression faster than LZ77/78 methods, as the whole dictionary can sit in memory, but compression becomes very costly. The main advantages are: optimality, easier-to-predict symbols, and a slight improvement in decompression speed. The disadvantages are clear: the dictionary has to be added to the compressed file, and compression is slow.
     This is still quite experimental and I do plan on rewriting it in C++ (it's in C# right now), for which I'll probably need your guys' help later on. I've also considered starting a new set of paq8 compressors, but that seems like getting too deep into the woods for an idea I haven't developed fully yet.
     Someone requested test results - compression takes about 45s for enwik5 and decompression is at around 20ms. You can find the exact results in https://github.com/Mitiko/BWDPerf/issues/5 I'm still making frequent changes and I'm not spending too much time testing. It generates a new sequence of symbols (representing the indexes into the dictionary) - the length is 50257 and the entropy is 5.79166; the dictionary uncompressed is 34042 bytes. This means that after an entropy coder it can go to about 325'113 bytes. (This is for enwik5 with m=32 and dictionarySize=256.) Stay tuned, I'm working on a very good optimization that should take compression time down a lot. (A toy sketch of the kind of greedy word ranking involved follows below.)
     I'll try to release more information soon on how everything works, the structure of the compressed file and some of my notes. You can read more on how the algorithm works here: https://mitiko.github.io/BWDPerf/ This is the github repo: https://github.com/Mitiko/BWDPerf/ I'll be glad to answer any and all questions here or with the new github feature of discussions: https://github.com/Mitiko/BWDPerf/discussions
     Note this project is also a pipeline for compression, which I built with the idea of being able to switch between algorithms fast. If you'd like to contribute any of your experiments, transforms or other general well-known compressors to it, to create a library for easy benchmarking, that'd be cool. I know Bulat Ziganshin made something similar, so I'm not too invested in this idea. Any cool new ideas, either on how to optimize parsing, counting words, or how to model the new symbols, are always appreciated.
    17 replies | 793 view(s)
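     A toy sketch of the kind of greedy word ranking a semi-adaptive dictionary builder might use (the (length-1)*count heuristic discussed later in this thread). The helper names are illustrative only; this is not BWD's actual code:

     #include <stdio.h>
     #include <string.h>

     /* Hypothetical helper: count non-overlapping occurrences of `word` in `data`.
        Naive O(n*m) scan just for illustration; a real implementation would use
        suffix structures to count candidates far more cheaply. */
     static long count_occurrences(const char *data, long n, const char *word) {
         long m = (long)strlen(word), count = 0, i = 0;
         while (i + m <= n) {
             if (memcmp(data + i, word, (size_t)m) == 0) { count++; i += m; }
             else i++;
         }
         return count;
     }

     /* Greedy rank: replacing `count` copies of a word of length `len` by a
        one-byte dictionary index saves roughly (len - 1) * count bytes
        (ignoring the cost of storing the word in the dictionary). */
     static long rank_word(const char *data, long n, const char *word) {
         long len = (long)strlen(word);
         return (len - 1) * count_occurrences(data, n, word);
     }

     int main(void) {
         const char *text = "the fox and the rabbit and the lion";
         const char *candidates[] = { "the ", "and ", "the fox and the rabbit and the lion" };
         for (int i = 0; i < 3; i++)
             printf("rank(\"%s\") = %ld\n", candidates[i],
                    rank_word(text, (long)strlen(text), candidates[i]));
         return 0;
     }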
  • Gonzalo's Avatar
    10th February 2021, 19:04
     Do you know of a compressor or archiver that's content-aware regarding SQLite or other databases? Just to clarify, I am NOT talking about the VACUUM command or similar, where the output is still a functional database.
    10 replies | 687 view(s)
  • Gribok's Avatar
    24th February 2021, 07:31
     libsais is my new library for fast (see Benchmarks on GitHub) linear-time suffix array and Burrows-Wheeler transform construction based on induced sorting (the same algorithm as in sais-lite by Yuta Mori). The algorithm runs in linear time (and outperforms divsufsort) using typically only ~12KB of extra memory (with 2n bytes as the absolute worst-case extra working space). Source code and benchmarks are available at: https://github.com/IlyaGrebnov/libsais (A naive illustration of the suffix-array-to-BWT relation follows below.)
    6 replies | 723 view(s)
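     Not libsais (which builds these arrays in linear time via induced sorting); just a naive illustration of how the BWT falls out of a suffix array once it is built:

     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>

     /* Naive suffix-array construction by qsort (O(n^2 log n) worst case),
        only to show the SA -> BWT relation. */
     static const char *g_text;

     static int cmp_suffix(const void *a, const void *b) {
         return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
     }

     int main(void) {
         const char *text = "banana";          /* implicit '\0' acts as sentinel */
         int n = (int)strlen(text), sa[16], i;
         g_text = text;
         for (i = 0; i < n; i++) sa[i] = i;
         qsort(sa, (size_t)n, sizeof(int), cmp_suffix);

         /* BWT[k] is the character preceding the k-th smallest suffix
            (wrapping around for the suffix that starts at position 0). */
         printf("SA:  ");
         for (i = 0; i < n; i++) printf("%d ", sa[i]);
         printf("\nBWT: ");
         for (i = 0; i < n; i++) putchar(sa[i] ? text[sa[i] - 1] : text[n - 1]);
         putchar('\n');
         return 0;
     }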
  • lz77's Avatar
    17th February 2021, 20:59
    ​What is the distance in bytes between instructions that can be executed in parallel? How many ALUs are there for this, and what instructions can they follow? I mean CPU I8700K.
    8 replies | 398 view(s)
  • lz77's Avatar
    8th February 2021, 14:41
     In Delphi 7 such a trick is possible with the calculation of the jump address in a case statement. For example, there is a case statement on prefix = 0,1,2,3 when LZ decompressing (but it breaks optimization; note the array indexes in the mov operands are missing from the snippet as posted):
     var cm0: array of dword;
     ...
     asm
       mov dword ,offset m0
       mov dword ,offset m1
       mov dword ,offset m2
       mov dword ,offset m3
     end;
     <Decompressing loop starts>
     asm
       mov eax,prefix
       jmp offset cm0+eax*4
     end;
     m0: offs:=...; goto 1000;
     m1: offs:=...; goto 1000;
     m2: offs:=...; goto 1000;
     m3: offs:=...;
     1000:
     Is it possible in C to calculate the addresses of the labels and make such a jump? Or does the compiler do it itself? (A small example with the GCC extension follows below.)
    7 replies | 406 view(s)
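     A small example of the GCC/Clang "labels as values" extension, which is the closest C analogue of the Delphi trick above; standard C has no such feature, but a switch on the prefix is usually compiled to the same kind of jump table. The table and label names here are made up for illustration:

     #include <stdio.h>

     int main(void) {
         static const int prefixes[] = { 0, 2, 1, 3, 0 };
         static void *jump_table[4];            /* filled once, like cm0[] above */
         jump_table[0] = &&m0; jump_table[1] = &&m1;   /* &&label = label address */
         jump_table[2] = &&m2; jump_table[3] = &&m3;

         for (int i = 0; i < 5; i++) {
             goto *jump_table[prefixes[i]];     /* like: jmp [cm0 + eax*4] */
     m0:     printf("prefix 0\n"); continue;
     m1:     printf("prefix 1\n"); continue;
     m2:     printf("prefix 2\n"); continue;
     m3:     printf("prefix 3\n"); continue;
         }
         return 0;
     }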
  • SolidComp's Avatar
    14th February 2021, 21:26
    Hi all – What's the situation with PPM in comparison to other approaches, in terms of ratio, complexity, and speed tradeoffs? I don't see a PPM or PPMII thread here, and I'm curious to learn more about it. Which codecs are using PPM? Also, what happened with Dmitry Shkarin's PPMII algorithm? I was intrigued by this passage in Steinruecken's 2014 thesis (PDF) on lossless data compression: How does PPMII compare to other, non-PPM approaches? (I didn't realize until now that PPMd is also Shkarin's.) (That thesis also references a 1998 paper by Charles Bloom (PDF) that focuses on PPM. I haven't read it yet.) Thanks.
    6 replies | 575 view(s)
  • mitiko's Avatar
    25th February 2021, 00:35
     Is there a way to make ANS FIFO? I'm assuming it's either impossible or no one has tried it, since we've been stuck with a LIFO fast coder since 2014. If so, what's the reason that makes it impossible?
     ANS basically maps numbers to bits, and a bigger state means more bits. This is why the state increases as we encode. Normalization saves us from needing infinite precision. Can't we just reverse this process? Instead of starting at a low state and increasing as we go, start at some high state and decrease it. Normalization would make sure we don't get to negative states - it would increase the state. In ANS we're asking the question "What is the next state that encodes all the bits we have and one more bit?". Now we can ask "What is the closest previous state that encodes the information in the current state, preceded by the new bit?"
     I'm imagining some hypothetical (for the binary alphabet). Let's say we want to encode 00110100.
     First symbol is 0. x_0 = 16 encodes 000100. The last symbol of this sequence is 0 and it matches the end of this code. We don't switch states.
     The next symbol we read is 0. The closest smaller state that matches is x_1 = 13, which encodes 00100.
     The next symbol is 1. The closest smaller state that matches 001 is x_2 = 5, which encodes 001.
     The next symbol is 1. But we can't encode anything - our state is too low. So we normalize - we output some bits and go to, let's say, state x_3 = 40. x_3 = 40 encodes 0010010001001. This matches our symbol 1.
     We read the next symbol to be 0. x_4 = 38 encodes 0100100101 and matches "01". And so on...
     That's the basic idea. I'm trying to understand more deeply how ANS works and how exactly it relates to range coding. Notice the string encoded by our state is the reverse of what we read. We're still reading in the correct direction, but the state encodes in the opposite one - we're adding data to the most significant bits of the state, not the least significant. That's just me brainstorming. Has anyone else tried this? I would find great pleasure in a proof that it doesn't work as well. Thanks! (A tiny conventional rANS sketch follows below, for reference on the LIFO behaviour.)
    4 replies | 429 view(s)
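     For reference, a deliberately tiny conventional rANS sketch (64-bit state, no renormalization, short input only) showing the state mechanics in question: the decoder pops symbols in the reverse order they were pushed, which is exactly the LIFO property being discussed. The frequencies and message are arbitrary examples:

     #include <stdio.h>
     #include <stdint.h>

     #define SCALE_BITS 4                         /* total frequency M = 16 */
     static const uint32_t freq[2]  = { 12, 4 };  /* p(0)=12/16, p(1)=4/16 */
     static const uint32_t start[2] = { 0, 12 };  /* cumulative frequencies */

     static uint64_t encode_symbol(uint64_t x, int s) {
         return ((x / freq[s]) << SCALE_BITS) + (x % freq[s]) + start[s];
     }

     static uint64_t decode_symbol(uint64_t x, int *s) {
         uint32_t slot = (uint32_t)(x & ((1u << SCALE_BITS) - 1));
         *s = slot >= start[1];                   /* which interval the slot falls in */
         return (uint64_t)freq[*s] * (x >> SCALE_BITS) + slot - start[*s];
     }

     int main(void) {
         const int msg[8] = { 0, 0, 1, 1, 0, 1, 0, 0 };
         uint64_t x = 1u << SCALE_BITS;           /* any starting state >= M works */

         for (int i = 7; i >= 0; i--)             /* encode backwards ...           */
             x = encode_symbol(x, msg[i]);
         printf("final state: %llu\n", (unsigned long long)x);

         for (int i = 0; i < 8; i++) {            /* ... so decoding comes out forwards */
             int s;
             x = decode_symbol(x, &s);
             printf("%d", s);
         }
         printf("\n");                            /* prints 00110100 */
         return 0;
     }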
  • stbrumme's Avatar
    5th February 2021, 22:00
     Hi, various Huffman-based compression formats pose an upper limit on the length of prefix codes. For example, DEFLATE and JPEG allow only up to 16 bits per symbol. Depending on how skewed a symbol distribution is, the standard Huffman algorithm may produce longer codes. Therefore I created a free C library with various length-limiting algorithms: https://github.com/stbrumme/length-limited-prefix-codes
     You can find heavily documented implementations of:
     - Package-Merge
     - JPEG's recommendation from Annex K.3
     - MiniZ's approach (similar to zlib)
     - BZip2's solution
     - two Kraft-McMillan based algorithms
     The Kraft-McMillan code was entirely written by me - though hugely inspired by Charles Bloom's blog. The other implementations are derived from other people's hard work (either actual source code or papers/specifications) and optimized for clarity and/or speed. All algorithms share the same interface so they are easily interchangeable:
     unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);
     For each algorithm there is a custom-tailored interface, too, which is often faster and requires less code but has certain restrictions regarding its input (such as a sorted histogram etc.). A basic benchmark is included so you can measure performance and compression ratio with your own data. The aforementioned Git repository has a long README.md file with tons more information.
     Major spoilers:
     - Package-Merge can be made reasonably fast
     - my Kraft Strategy B runs much faster than a good Huffman implementation (e.g. compared with Moffat's in-place version)
     - BZip2's approach is dead-simple but slow and inefficient
     (A naive Kraft-repair sketch, unrelated to the library's own code, follows below.)
     I hope you enjoy it, Stephan
    3 replies | 376 view(s)
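     A naive Kraft-sum repair heuristic, much simpler (and less optimal) than the strategies in the library, only to illustrate the idea of clamping lengths and then restoring the Kraft inequality; the function name and example lengths are made up:

     #include <stdio.h>

     /* Clamp every code length to maxLength, then, while the Kraft inequality
        (sum of 2^-len <= 1) is violated, deepen the deepest code that is still
        below the limit - the smallest possible correction step each time. */
     static void limit_lengths(unsigned char maxLength, unsigned int numCodes,
                               unsigned char codeLengths[])
     {
         unsigned long long limit = 1ULL << maxLength, kraft = 0;
         unsigned int i;

         for (i = 0; i < numCodes; i++) {
             if (codeLengths[i] > maxLength) codeLengths[i] = maxLength;
             if (codeLengths[i]) kraft += 1ULL << (maxLength - codeLengths[i]);
         }
         while (kraft > limit) {
             unsigned int best = numCodes;
             for (i = 0; i < numCodes; i++)       /* deepest code below the cap */
                 if (codeLengths[i] && codeLengths[i] < maxLength &&
                     (best == numCodes || codeLengths[i] > codeLengths[best]))
                     best = i;
             if (best == numCodes) return;        /* infeasible: too many codes */
             kraft -= 1ULL << (maxLength - codeLengths[best] - 1);
             codeLengths[best]++;                 /* lengthening halves its share */
         }
     }

     int main(void)
     {
         unsigned char len[8] = { 1, 2, 3, 4, 5, 6, 7, 7 };  /* unconstrained Huffman lengths */
         limit_lengths(4, 8, len);
         for (int i = 0; i < 8; i++) printf("%d ", len[i]);
         printf("\n");   /* 1 3 4 4 4 4 4 4 : all <= 4, Kraft sum exactly 1 */
         return 0;
     }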
  • suryakandau@yahoo.co.id's Avatar
    15th February 2021, 16:29
     I am looking for a CM compressor written in VB6 or in Turbo Pascal.
    2 replies | 228 view(s)
  • Trench's Avatar
    20th February 2021, 20:21
     I assume many here are into file compression for the greater good more than for money. But up to what ratio, as the poll shows. Also, to what extent would you be willing to screw someone else over if someone hired you to benefit from the idea? Kind of like how Xerox got screwed over by Apple, or Apple and others got screwed by MS, or the original creators of FB got screwed over, the McDonald brothers got screwed over by McDonald's, Nikola Tesla was screwed over by JP Morgan, etc., like most big corporations. So what would make you different, it seems. Think about it for a few days before you answer.
    2 replies | 318 view(s)
  • Gonzalo's Avatar
    21st February 2021, 22:32
     Fazip uses TTA but it doesn't work for compound content, like a .wav somewhere inside a TAR archive. Precomp was supposed to include WavPack like a year ago, but nothing has happened yet and it doesn't seem like it's going to. So, do you know of something like this, that detects and compresses audio chunks and makes a copy of the rest of the data? Thanks in advance!
    2 replies | 317 view(s)
  • Bulat Ziganshin's Avatar
    24th February 2021, 22:55
     I started to work on codec bindings for the new CELS framework with what I thought was the simplest one - LZ4. But it turned out to be more complex work than I expected. My main source of information is the lz4 manual, the library source code and the examples. I have not found any tutorial-style guides for using the library, and the documentation provided is centered around individual functions rather than usage scenarios. My goal is to avoid mallocs inside the library (providing all memory buffers myself), and to reuse compression contexts between operations in order to maximize performance. I will put my questions in bold, mixed with commentary describing my setup.
     Q1: It seems that the decompression context is so small and quickly initialized that keeping and reusing it between operations is pointless?
     For compression of a single buffer, I use the sequence `malloc(LZ4_sizeofState()) + LZ4_compress_fast_extState() + free()`.
     Q2: Is LZ4_compress_fast_extState guaranteed to never call malloc inside (in the current 1.9.3 version)?
     It seems that the next step to improve performance is malloc + LZ4_compress_fast_extState + LZ4_compress_fast_extState_fastReset (many times) + free?
     Q3: Am I correct that I should call LZ4_compress_fast_extState with a freshly allocated state object, but on the following operations using the same state object I can use LZ4_compress_fast_extState_fastReset to improve performance?
     Q4: In lz4.c:819 I see that the fast reset is actually used only for input blocks of less than 4 KB. Why isn't it used for larger blocks?
     Now, on to stream compression.
     Q5: The LZ4_resetStream_fast docs state that, despite having the same size, a context used for single-shot compression shouldn't be reused for stream compression and vice versa. Is that true or is it my misunderstanding?
     The proper approach to maximize speed by reusing a context with streaming compression looks like:
     1. malloc + LZ4_initStream
     2. call LZ4_compress_fast_continue multiple times, processing a single stream
     3. LZ4_resetStream_fast
     4. call LZ4_compress_fast_continue multiple times, processing a single stream
     repeat steps 3 and 4 for each subsequent stream, and finally
     5. free the memory
     Q6: Is this the correct approach to ensure maximum speed?
     Q7: The LZ4_decompress_safe_continue documentation mentions three decompression scenarios, and LZ4_compress_fast_continue mentions various scenarios as well. Does any of them have a speed advantage? In particular, I have implemented double-buffering with adjacent buffers (as in blockStreaming_doubleBuffer.c, although this contradicts Note 3 on LZ4_compress_fast_continue) - is this correct, and can I improve the speed by using another approach to compression/decompression? By default I use two buffers of >=64 KB, but I am also interested in the answer for smaller buffers.
     PS: at the end of the day, it seems that the internal LZ4 API is already well thought out for my needs - not so much the manual. (A minimal sketch of the single-shot reuse pattern from Q2/Q3 follows below.)
    1 replies | 414 view(s)
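     A sketch of the single-shot context-reuse pattern asked about in Q2/Q3, based on one reading of lz4.h. Whether the no-malloc and fastReset assumptions hold for a given version is exactly what the questions above are about, so treat this as an assumption to verify against the header, not a definitive usage:

     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include "lz4.h"

     int main(void) {
         const char *bufs[2] = { "first independent buffer", "second independent buffer" };
         void *state = malloc((size_t)LZ4_sizeofState());   /* caller-owned state buffer */
         char dst[256];
         if (!state) return 1;

         for (int i = 0; i < 2; i++) {
             int srcSize = (int)strlen(bufs[i]);
             /* first block: full reset via extState; later blocks: fast reset
                (assumed cheaper because the state was left in a known state) */
             int n = (i == 0)
                 ? LZ4_compress_fast_extState(state, bufs[i], dst, srcSize, (int)sizeof dst, 1)
                 : LZ4_compress_fast_extState_fastReset(state, bufs[i], dst, srcSize, (int)sizeof dst, 1);
             printf("block %d: %d -> %d bytes\n", i, srcSize, n);
         }
         free(state);
         return 0;
     }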
  • ron_s_s's Avatar
    10th February 2021, 23:41
     Example 1: codewords {0,01,11}
     0 is a prefix of 01 -> dangling suffix is 1. List {0,01,11,1}
     1 is a prefix of 11 -> dangling suffix is 1 - already there -> the code is UD.
     Example 2: codewords {0,01,10}
     0 is a prefix of 01 -> dangling suffix is 1. List {0,01,10,1}
     1 is a prefix of 10 -> dangling suffix is 0 - which is an original codeword! The code is not UD.
     These 2 examples are clear to me, but I have this specific case where it is not clear. In this case I add 2 dangling suffixes to my list, and one is a prefix of the other, thus producing another dangling suffix. This new dangling suffix is an original codeword - does it mean the code is not UD?
     Example: codewords {0,01,110}
     0 is a prefix of 01 -> dangling suffix is 1. List {0,01,110,1}
     1 is a prefix of 110 -> dangling suffix is 10. List {0,01,110,1,10}
     (This is where the misconception happens...) I have in this list 2 dangling suffixes {1,10} which are not original codewords, and 1 is a prefix of 10 -> 0 is a dangling suffix. 0 is also an original codeword, thus the code is not UD.
     Is it conventional by the Sardinas-Patterson algorithm rules to produce a dangling suffix from 2 other dangling suffixes? (A small implementation of the textbook formulation follows below.)
    1 replies | 265 view(s)
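     A small implementation of the textbook formulation of Sardinas-Patterson, where each new suffix set is always formed against the original codeword set (not between two dangling suffixes); the data structures are deliberately simplistic:

     #include <stdio.h>
     #include <string.h>

     #define MAXW 64
     #define MAXLEN 32

     typedef struct { char w[MAXW][MAXLEN]; int n; } Set;

     static int set_has(const Set *s, const char *x) {
         for (int i = 0; i < s->n; i++) if (!strcmp(s->w[i], x)) return 1;
         return 0;
     }
     static void set_add(Set *s, const char *x) {
         if (*x && s->n < MAXW && !set_has(s, x)) strcpy(s->w[s->n++], x);
     }

     /* If p is a proper prefix of q, add the dangling suffix of q to out. */
     static void dangle(const char *p, const char *q, Set *out) {
         size_t lp = strlen(p), lq = strlen(q);
         if (lp < lq && !strncmp(p, q, lp)) set_add(out, q + lp);
     }

     /* S1 = dangling suffixes within the code C; S(i+1) = dangling suffixes
        between C and S(i), in both directions.  The code is uniquely
        decodable iff no S(i) contains a codeword. */
     static int is_ud(const Set *code) {
         Set cur = { {{0}}, 0 }, seen = { {{0}}, 0 };
         for (int i = 0; i < code->n; i++)
             for (int j = 0; j < code->n; j++)
                 if (i != j) dangle(code->w[i], code->w[j], &cur);

         while (cur.n) {
             Set next = { {{0}}, 0 };
             for (int i = 0; i < cur.n; i++) {
                 if (set_has(code, cur.w[i])) return 0;   /* suffix equals a codeword */
                 if (set_has(&seen, cur.w[i])) continue;  /* already expanded earlier */
                 set_add(&seen, cur.w[i]);
                 for (int j = 0; j < code->n; j++) {
                     dangle(code->w[j], cur.w[i], &next); /* codeword prefixes suffix */
                     dangle(cur.w[i], code->w[j], &next); /* suffix prefixes codeword */
                 }
             }
             cur = next;
         }
         return 1;
     }

     int main(void) {
         Set a = { { "0", "01", "11" },  3 };   /* example 1 */
         Set b = { { "0", "01", "10" },  3 };   /* example 2 */
         Set c = { { "0", "01", "110" }, 3 };   /* the case asked about */
         printf("%d %d %d\n", is_ud(&a), is_ud(&b), is_ud(&c));  /* 1 0 1 */
         return 0;
     }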
  • SolidComp's Avatar
    5th March 2021, 03:29
     Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes. For a country with a max of 10 million possible and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set cover 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers), and somewhat more than that since the input address is variable length. If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses. Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks. (A truncated-hash sketch and a collision estimate follow below.)
    2 replies | 60 view(s)
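     An illustration only (FNV-1a is just a convenient stand-in for a real cryptohash): truncating any hash to 6 base-32 characters gives about 1.07e9 buckets, so by the birthday bound 10 million addresses are expected to produce roughly n^2/(2m), on the order of 47,000, collisions. Guaranteed-unique short tokens therefore need an assignment table (random or sequential IDs) instead of pure hashing, which is essentially what card tokenization services do:

     #include <stdio.h>
     #include <stdint.h>

     static uint64_t fnv1a64(const char *s) {
         uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
         while (*s) { h ^= (unsigned char)*s++; h *= 0x100000001b3ULL; }
         return h;
     }

     /* 6 characters x 5 bits = 30 bits of the hash, mapped to a base-32 alphabet. */
     static void token6(const char *address, char out[7]) {
         static const char alphabet[] = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";
         uint64_t h = fnv1a64(address);
         for (int i = 0; i < 6; i++) { out[i] = alphabet[h & 31]; h >>= 5; }
         out[6] = '\0';
     }

     int main(void) {
         char t[7];
         token6("1600 PENNSYLVANIA AVE NW, WASHINGTON, DC 20500", t);
         printf("%s\n", t);
         return 0;
     }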
  • SolidComp's Avatar
    24th February 2021, 22:32
     Hi all – I just realized that I don't know anything substantial about encryption. Am I correct in assuming that we generally can't compress encrypted data? I assume that effective encryption produces random-looking data to the naive observer, therefore the sequence of steps has to be compression first, then encryption, then decryption, then decompression, right? I've heard a bit about homomorphic encryption, which I think means a type of encryption that enables one to extract useful data from an encrypted dataset without decrypting it. At that level of description, this can mean a few different things. Homomorphic encryption might mean that data in its informative form exists at different levels of abstraction or something, some of which is not actually encrypted (?), or there might be something about the nature of the search algorithms – how you find what you're looking for without decrypting... Or it could be much more interesting than this. But the general thrust of homomorphic encryption hints at the possibility of compressing encrypted data, if the surfaced information were somehow useful to a compressor and could be acted on by a compressor, but I don't know enough about it. Has anyone worked on the theory around this? Thanks.
    1 replies | 227 view(s)
  • JamesWasil's Avatar
    25th February 2021, 20:58
    I've wanted this with 286 and 386 laptops for decades. Do you feel it will live up to the promises from the press release? https://arstechnica.com/gadgets/2021/02/framework-startup-designed-a-thin-modular-repairable-13-inch-laptop/
    1 replies | 117 view(s)
  • macarena's Avatar
    25th February 2021, 21:43
     Hello, most recent papers on image compression use the Bjontegaard metric to report average bitrate savings or PSNR/SSIM gains (BD-BR, BD-PSNR etc.). It works by finding the average difference between RD curves. Here's the original document: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc I am a little confused by the doc. The bitrate they show in the graphs at the end of the document seems to be in bits/sec. At least it is not bits per pixel (bpp), which is commonly used in RD curves. As per this site https://www.intopix.com/blogs/post/How-to-define-the-compression-rate-according-to-bpp-or-bps, converting from bpp to bits/s would require knowing the fps, which might not be known(?). I want to know whether it really matters if the bitrate is in bpp or bits/s, or does the metric give correct values no matter which one is used? Here's a Matlab implementation that seems to be recommended by JPEG: https://fr.mathworks.com/matlabcentral/fileexchange/27798-bjontegaard-metric . I ran a few experiments and it seems bpp gives plausible results, though a confirmation would be nice. Thanks! (A short note on why the unit should cancel out follows below.)
    0 replies | 172 view(s)
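     A short note on why the rate unit should cancel out, assuming the metric is computed as in the VCEG-M33 doc (a polynomial fit and integration over log-rate):

     \Delta = \frac{1}{D_H - D_L} \int_{D_L}^{D_H} \big( \log_{10} R_2(D) - \log_{10} R_1(D) \big)\, dD,
     \qquad \text{BD-rate} = 10^{\Delta} - 1 .

     Converting bpp to bits/s multiplies both rates by the same constant c = fps * width * height, and \log_{10}(c R) = \log_{10} c + \log_{10} R shifts both fitted curves by the same amount, so c cancels in the difference. The BD value is therefore unit-independent as long as both codecs' curves use the same unit (same fps and resolution); mixing units between the two curves would bias the result by \log_{10} c.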
  • Shelwien's Avatar
    5th March 2021, 14:07
     https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors
     I made a benchmark which times something like this loop (note the array indexes are missing from the snippet as posted):
     #define rndup(seed) (seed = 214013 * seed + 2531011)
     s0=1*seedinc; s1=2*seedinc; s2=3*seedinc; s3=4*seedinc;
     for( i=0; i<numreads; i++ ) {
       s0 += buf; s1 += buf; s2 += buf; s3 += buf;
       rndup(s0);rndup(s1);rndup(s2);rndup(s3);
     }
     for the number of streams 1..64 (the loop code is generated by a script) and tested it on two CPUs: i7-7820X @ 4.5GHz and i7-8650U @ 2.5GHz (see pictures). For the 7820 there's no clear result, just that timings converge after ~16 streams and there's no further speed improvement (per stream). For the 8650 it seems like there are ~16 streams for L3 and 6-8 for RAM; there's a visible speed difference after that. (A self-contained variant of the experiment follows below.)
    0 replies | 92 view(s)
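     A self-contained variant of the same experiment (buffer size, chain count and iteration count are arbitrary choices): several independent dependent-load chains walk a buffer much larger than L3. With K chains the loop issues K times as many loads, so if the elapsed time grows by less than K, the extra misses were overlapped by the memory subsystem:

     #include <stdio.h>
     #include <stdlib.h>
     #include <time.h>

     #define N (64u * 1024u * 1024u)      /* 64M uint32 = 256 MB, well beyond L3 */

     static unsigned run(const unsigned *buf, int chains, unsigned iters) {
         unsigned s[8] = {1,2,3,4,5,6,7,8}, sum = 0;
         for (unsigned i = 0; i < iters; i++)
             for (int k = 0; k < chains; k++)
                 s[k] = buf[s[k] & (N - 1)] + i;   /* each load feeds the next index */
         for (int k = 0; k < chains; k++) sum += s[k];
         return sum;                               /* checksum defeats dead-code elimination */
     }

     int main(void) {
         unsigned *buf = malloc((size_t)N * sizeof *buf);
         if (!buf) return 1;
         for (unsigned i = 0; i < N; i++) buf[i] = 214013u * i + 2531011u;

         for (int chains = 1; chains <= 8; chains *= 2) {
             clock_t t0 = clock();
             unsigned sink = run(buf, chains, 4u * 1000u * 1000u);
             double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
             printf("%d chains: %.3f s (checksum %u)\n", chains, sec, sink);
         }
         free(buf);
         return 0;
     }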
  • Sportman's Avatar
    Today, 00:26
    Sportman replied to a thread Zstandard in Data Compression
     It's if a file name contains the root folder content instead of the direct root folder content, as with the 1.4.5-1.4.8 win64 zip files.
    460 replies | 147596 view(s)
  • Sportman's Avatar
    Today, 00:18
    Sportman replied to a thread Zstandard in Data Compression
     Input: 984.801.591 bytes - data
     Output zstd v1.4.8:
     151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
     149.904.965 bytes, 3.375 sec., zstd -1 --ultra --long -T1
     147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
     146.708.829 bytes, 3.547 sec., zstd -2 --ultra --long -T1
     143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
     143.965.769 bytes, 2.141 sec., zstd -4 --ultra -T1
     142.896.262 bytes, 4.063 sec., zstd -3 --ultra --long -T1
     142.869.673 bytes, 4.125 sec., zstd -4 --ultra --long -T1
     138.447.020 bytes, 3.437 sec., zstd -5 --ultra -T1
     137.671.146 bytes, 5.187 sec., zstd -5 --ultra --long -T1
     135.307.599 bytes, 3.735 sec., zstd -6 --ultra -T1
     134.660.677 bytes, 5.625 sec., zstd -6 --ultra --long -T1
     129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
     129.552.776 bytes, 7.125 sec., zstd -7 --ultra --long -T1
     129.262.251 bytes, 6.547 sec., zstd -8 --ultra -T1
     129.090.886 bytes, 8.328 sec., zstd -8 --ultra --long -T1
     128.271.371 bytes, 7.687 sec., zstd -9 --ultra -T1
     128.210.617 bytes, 9.485 sec., zstd -9 --ultra --long -T1
     128.087.378 bytes, 8.328 sec., zstd -10 --ultra -T1
     128.056.308 bytes, 10.312 sec., zstd -10 --ultra --long -T1
     127.999.182 bytes, 9.812 sec., zstd -11 --ultra -T1
     127.940.684 bytes, 12.109 sec., zstd -11 --ultra --long -T1
     127.445.750 bytes, 14.828 sec., zstd -12 --ultra --long -T1
     127.422.579 bytes, 12.656 sec., zstd -12 --ultra -T1
     126.935.545 bytes, 28.125 sec., zstd -13 --ultra --long -T1
     126.802.723 bytes, 30.359 sec., zstd -14 --ultra --long -T1
     126.768.260 bytes, 26.766 sec., zstd -13 --ultra -T1
     126.659.580 bytes, 28.734 sec., zstd -14 --ultra -T1
     126.499.068 bytes, 37.922 sec., zstd -15 --ultra --long -T1
     126.327.122 bytes, 35.860 sec., zstd -15 --ultra -T1
     118.142.065 bytes, 1:18.344 sec., zstd -16 --ultra -T1
     117.643.674 bytes, 1:22.531 sec., zstd -16 --ultra --long -T1
     113.112.473 bytes, 1:29.234 sec., zstd -17 --ultra -T1
     112.773.080 bytes, 1:33.485 sec., zstd -17 --ultra --long -T1
     112.255.514 bytes, 2:11.281 sec., zstd -18 --ultra -T1
     111.880.990 bytes, 2:16.656 sec., zstd -18 --ultra --long -T1
     111.517.628 bytes, 2:46.016 sec., zstd -19 --ultra -T1
     110.962.579 bytes, 2:57.359 sec., zstd -19 --ultra --long -T1
     111.041.788 bytes, 2:55.297 sec., zstd -20 --ultra -T1
     110.644.739 bytes, 3:04.047 sec., zstd -20 --ultra --long -T1
     110.431.137 bytes, 3:25.172 sec., zstd -21 --ultra -T1
     110.188.109 bytes, 3:39.516 sec., zstd -21 --ultra --long -T1
     109.903.935 bytes, 3:52.750 sec., zstd -22 --ultra -T1
     109.903.935 bytes, 3:52.719 sec., zstd -22 --ultra --long -T1
     Output zstd v1.4.9:
     151.119.611 bytes, 1.344 sec., zstd -1 --ultra -T1
     149.930.990 bytes, 1.969 sec., zstd -1 --ultra --long -T1
     147.939.926 bytes, 1.547 sec., zstd -2 --ultra -T1
     146.731.879 bytes, 2.141 sec., zstd -2 --ultra --long -T1
     143.977.987 bytes, 1.969 sec., zstd -3 --ultra -T1
     143.965.769 bytes, 2.156 sec., zstd -4 --ultra -T1
     142.942.060 bytes, 2.656 sec., zstd -3 --ultra --long -T1
     142.912.214 bytes, 2.687 sec., zstd -4 --ultra --long -T1
     138.447.020 bytes, 3.453 sec., zstd -5 --ultra -T1
     137.698.236 bytes, 3.781 sec., zstd -5 --ultra --long -T1
     135.307.599 bytes, 3.781 sec., zstd -6 --ultra -T1
     134.685.348 bytes, 4.188 sec., zstd -6 --ultra --long -T1
     129.713.923 bytes, 5.297 sec., zstd -7 --ultra -T1
     129.541.732 bytes, 5.687 sec., zstd -7 --ultra --long -T1
     129.262.251 bytes, 6.563 sec., zstd -8 --ultra -T1
     129.079.778 bytes, 6.890 sec., zstd -8 --ultra --long -T1
     128.271.371 bytes, 7.703 sec., zstd -9 --ultra -T1
     128.185.107 bytes, 8.016 sec., zstd -9 --ultra --long -T1
     128.087.378 bytes, 8.344 sec., zstd -10 --ultra -T1
     128.030.897 bytes, 8.860 sec., zstd -10 --ultra --long -T1
     127.999.182 bytes, 9.859 sec., zstd -11 --ultra -T1
     127.915.469 bytes, 10.719 sec., zstd -11 --ultra --long -T1
     127.422.579 bytes, 12.703 sec., zstd -12 --ultra -T1
     127.415.784 bytes, 13.422 sec., zstd -12 --ultra --long -T1
     126.907.957 bytes, 26.703 sec., zstd -13 --ultra --long -T1
     126.775.263 bytes, 28.954 sec., zstd -14 --ultra --long -T1
     126.768.260 bytes, 26.750 sec., zstd -13 --ultra -T1
     126.659.580 bytes, 28.735 sec., zstd -14 --ultra -T1
     126.467.698 bytes, 36.609 sec., zstd -15 --ultra --long -T1
     126.327.122 bytes, 35.812 sec., zstd -15 --ultra -T1
     118.142.065 bytes, 1:18.329 sec., zstd -16 --ultra -T1
     117.640.741 bytes, 1:21.078 sec., zstd -16 --ultra --long -T1
     113.112.473 bytes, 1:29.187 sec., zstd -17 --ultra -T1
     112.767.845 bytes, 1:32.015 sec., zstd -17 --ultra --long -T1
     112.255.514 bytes, 2:10.438 sec., zstd -18 --ultra -T1
     111.878.283 bytes, 2:14.078 sec., zstd -18 --ultra --long -T1
     111.517.628 bytes, 2:45.672 sec., zstd -19 --ultra -T1
     111.041.788 bytes, 2:55.078 sec., zstd -20 --ultra -T1
     110.958.922 bytes, 2:55.203 sec., zstd -19 --ultra --long -T1
     110.691.009 bytes, 3:01.454 sec., zstd -20 --ultra --long -T1
     110.431.137 bytes, 3:24.843 sec., zstd -21 --ultra -T1
     110.190.299 bytes, 3:36.609 sec., zstd -21 --ultra --long -T1
     109.900.896 bytes, 3:49.813 sec., zstd -22 --ultra -T1
     109.900.896 bytes, 3:49.844 sec., zstd -22 --ultra --long -T1
    460 replies | 147596 view(s)
  • mitiko's Avatar
    Today, 00:04
     Oh, that type of transform... Well, static dictionaries support all types of patterns (like the ones described in the post) and I'm planning on doing this in one way or another. What they're doing with LZ77 + static dictionaries is pretty ingenious actually! Wait, that's actually pretty clever for a fast encoder!
     Anyway, that's not the point of BWD. I'm trying to preserve the context of the text, the information it holds. After LZ77 and such pattern-y words, the rest of the compression pipeline loses context and it will predict worse. But if we instead create a whole new context based on a pattern, it will compress better and the original stream will be more compressible as well.
     Patterns are great, they just seem a bit too deterministic right now - for every pattern/dictionary transform that you can imagine, you need to write its own ranking function and extraction function and then handle its serialization to the dictionary, etc. It would be amazing if we found a way to get the system to generate patterns, or find them in the dictionary after the words have been chosen. For example, if the dictionary is calculated to be something like:
     "the next"
     "the following"
     "the lion"
     "the fox"
     maybe some independent system can see the pattern and generate the word "the (*)", which will then match all of the above and also split the stream into 2 - the main stream plus the stream of the pattern. The stream of the pattern would then look something like "next;following;lion;next;fox;fox;fox;next..." which is very compressible. In practice this would suit some grammar rules well (essentially making the grammar information that text holds known, so zero bytes to transmit), like "(*)ing" or "(*)ed". With patterns, we get into this whole idea of context splitting and start to wonder whether it is possible in the general case. I have some notes, if you'd like to look at them.
    17 replies | 793 view(s)
  • Jarek's Avatar
    Yesterday, 23:26
     Thanks, but I am not taking this only "award" for ANS just to spend the rest of my life defending it from successive megacorporations. Corporations using it: instead of deploying your own "defensive" legal mines, please just prevent this minefield from being built. Especially since I haven't heard of any ANS-based compressor from Microsoft (?)
    122 replies | 91254 view(s)
  • SolidComp's Avatar
    Yesterday, 22:19
    Brotli has a seemingly novel set of 120 dictionary transforms. You can read a bit about them in this recent Cloudflare post.
    17 replies | 793 view(s)
  • JamesWasil's Avatar
    Yesterday, 19:10
     Nope. Only semantic rewording of your work and fabricated big-tech innovations for Microsoft to compete with Google, via bogus patent claims on what was intended to be free with due credit. This is why we can't have nice things, and why development and certain types of information sharing and discoveries are no longer offered in good faith like they were before. Corporations, by majority, have ruined good faith and academic contributions for the sake of exclusive profiteering, even against the inventor's wishes. It says a lot about the companies and the state of the world when they have to rip off the author, take credit and make money on work released freely, rather than developing their own UNIQUE methods, even if those would yield nearly the same performance or results. Cookie-cutter world now. Question: should Jarek make a global GoFundMe for legal help to fight this, where lawyers that Google, Microsoft, and other big tech don't yet have in their pocket might actually help? I would assume that if Marcus Hutter could throw 50k at the Hutter Prize, he and others might be willing to donate to a Jarek rANS/tANS defense fund?
    122 replies | 91254 view(s)
  • mitiko's Avatar
    Yesterday, 17:28
    I'm going to try and compete with BWD later this year. I also have some good ideas for modeling after the dictionary transform.
    78 replies | 12265 view(s)
  • mitiko's Avatar
    Yesterday, 17:13
     Thanks for looking through the code. I knew I had to eventually remove LINQ, but I don't think I'm using it in any very critical places. Right now, most of the work seems to be in counting the words and ranking. Thanks for the heads-up on foreach loops; I didn't know they were more expensive than for loops.
     The architecture is pretty OK for now; it's like LINQ but async (using IAsyncEnumerable). I've wondered whether it might be slowing me down, but it's actually not. It processes data in blocks and I'm testing it for optimality with the block size set to the full file size, so all the data travels the pipeline only once. The slow-down comes from calculating the dictionary; once we have it, parsing and serializing the dictionary is pretty fast.
     What the rank (as per the latest version) calculates is the difference in bits used to encode the string. So if the dictionary size is 256, 8 bits would be needed per word and the ranking would be analogous to (n-1)*s for n=word size, s=occurrences. (I'd rather work with l and c for length and count.) In your example, the weights are correct and the savings in bytes are the same as the rank. The whole ranking system is greedy in this way. It will prefer:
     "the" -> rank=2000
     "rabbit" -> rank=175
     "lion" -> rank=105
     "and" -> rank=70
     "fox" -> rank=70
     total = 2420
     (the ranks here only being from the top of my head). The non-greedy variation you have in mind looks something like:
     "the fox and the rabbit and the lion" -> rank=1225
     "the" -> rank=2000
     total = 3225
     I have actually thought a lot about this, and it's not a problem of the ranking function itself, at least not its primitive variation. No mathematical formula can solve the greediness; there's always going to be some example where, if we switch the order, we'll get a better result. It's a problem of choosing the best word from the ranking. Instead of greedily choosing the highest-ranked word, we can look ahead (in time) and calculate the ranks of future words, then choose the optimal branch. It becomes a look-ahead tree.
     The fragmentation problem by itself can be solved if we consider the dictionary to initially be the alphabet and then morph it into longer words. Like:
     A -- AT
     T -- A
     C -- T
     G -- C
          G
     Then we can do parsing before each new word calculation and we won't have the constraint of words not spanning multiple blocks. I've excluded this as a good idea because 1) we have to do parsing at each word and mere re-parsing doesn't seem viable, and 2) we won't have this problem for a large enough look-ahead tree. But now that you mention it, maybe the look-ahead tree can't be that big. My hope was that since we're calculating a lot of the words' ranks, they won't change under the tree, and when we choose the optimal branch we'd only have to calculate the ranks of the next layer of words. Basically ranking becomes at most x times slower if we prune and look at the top x words. The tree depth is only a concern for the first word and then updates are pretty fast, so it's pretty inexpensive to have a deeper tree compared to a more spanning tree. But for the look-ahead tree to solve the fragmentation dilemma, a deeper tree doesn't cut it; we need a more spanning tree. So maybe you're right, I should reconsider overlapping words and look into re-parsing, although I'm still pretty set on having the look-ahead tree as well.
     A couple of assumptions: right now, with the naive ranking ((n-1)*s), we don't have to rank all words and can only re-rank some. If we choose to re-rank the words, we have to store all the ranks; we can't store just the top 5, because at some point we won't be able to update the top words 5 times in a row and we won't have a top 5. My current ranking was last updated with a hashtable for ranks, so the actual calculation is only done once per group of similar words. What we can do, for sure, is only re-count some words. I'm working on that still. I have a draft; I just have to fix some minor indexing issues.
     The reason why I haven't done re-ranking is that the optimal ranking algorithm uses a state. The technique used is similar to morphing/updating the dictionary. This way the current rank of a word depends on the other words chosen, and it's closer to the actual optimal order of the dictionary, especially with the look-ahead tree. For sure, re-ranking and re-counting plus the fragmentation fix would create a great compressor with acceptable compression times. But I'm rather looking to improve BWD in the direction of optimality at the cost of compression time than to trade memory for time. When I'm done, I'll definitely try to implement one with this stack!
    17 replies | 793 view(s)
  • CompressMaster's Avatar
    Yesterday, 15:58
    CompressMaster replied to a thread WinRAR in Data Compression
     Yeah, I saw this almost 2 years ago or so... even in Windows there are vulnerabilities like that - one that was finally patched in Win10 had existed since Windows 3.11...
    192 replies | 137083 view(s)
  • Jarek's Avatar
    Yesterday, 15:40
     I was just pointed to a 2019 Microsoft patent filing for rANS: https://patents.google.com/patent/US20200413106A1 Can anybody find something novel here?
    122 replies | 91254 view(s)
  • LucaBiondi's Avatar
    Yesterday, 13:30
    LucaBiondi replied to a thread paq8px in Data Compression
     Hi guys, thanks to all for the comments. I run paq8px alongside my normal work. My PC has 16 GB of RAM and often that is not enough. This happened for 201 -10 vs. 201 -11. Luca
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 13:11
    Gotty replied to a thread paq8px in Data Compression
     Yes. In this league there is a serious "fight" for each byte gained. Each version is better by around 0.02%-0.10%, and for that tiny improvement there is a lot going on. The result is extremely "tight" and optimized already - it's not really possible to "tune" the existing models and contexts for any more gains; we (usually) need to add more models to see better gains. That costs time and memory. The gain is usually not too great, as the existing models already cover the possibilities quite well. We really do sacrifice memory and speed for compression ratio. So in this league a 3% improvement is a lot - and it costs a lot.
    2337 replies | 625128 view(s)
  • hexagone's Avatar
    Yesterday, 07:21
    hexagone replied to a thread paq8px in Data Compression
     You mean 'accurate', I assume. Accurate timings are useful. Or just compare the option with the highest compression for each release. A 3% improvement for a 3x compression time is incredibly expensive. "Worth" is in the eye of the beholder...
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    Yesterday, 04:53
     I have only two points (for performance-critical parts of your code):
     - use for loops instead of foreach loops;
     - avoid using LINQ - convert those queries to normal loops.
     (These modifications will eventually be needed anyway when you want to convert your code to C/C++ one day.) I know these will give you only a small boost; you're looking for a faster architecture. Unfortunately I don't know your architecture (I only had a peek).
     Could you try something? What if you rank words simply by their size? That is a stable metric: you don't need to recalculate the ranking after each round, just check whether there is still more than 1 occurrence in the next round. It may sound strange at first, but it could be as good as the current metric that takes the occurrence count into account too. Why? When you rank the words by (n-1)*s there is an issue: long words are polluted when their constituents are replaced. Example:
     the fox and the rabbit and the lion the chicken
     the fox and the rabbit and the lion the lion
     the fox and the rabbit and the lion the horse
     the fox and the rabbit and the lion the penguin
     the zebra
     the cat
     the worm
     ... (many more) ...
     the dog
     The weights:
     "the": (n-1)*s = 2*1000 = 2000
     "the fox and the rabbit and the lion": (n-1)*s = 3*35 = 105
     So "the" will win the first round. After the first replacement those 4 long substrings will be unnecessarily polluted by indexes of the "the"s. In this case it would be better to just replace the 4 long substrings first and then replace the "the"s. What do you think? I wonder how much difference this metric makes, and whether it is better or worse.
    17 replies | 793 view(s)
  • JamesWasil's Avatar
    Yesterday, 03:04
    JamesWasil replied to a thread WinRAR in Data Compression
     This may have been missed or not seen a year or two ago, but I found it interesting that a vulnerability like this existed in the wild for almost 20 years. Technically not WinRAR's fault, but support for WinACE and external formats was why and how the exploit came to be: https://www.theverge.com/2019/2/21/18234448/winrar-winace-19-year-old-vulnerability-patched-version-5-70-beta-1
    192 replies | 137083 view(s)
  • Alexander Rhatushnyak's Avatar
    Yesterday, 02:19
     I won't submit anything either this year or next year. Not later than late 2022, someone else will win a prize. Maybe a simplified version of cmix or nncp, maybe indeed something like paq8px with a better NLP part.
    78 replies | 12265 view(s)
  • Gotty's Avatar
    Yesterday, 00:26
    Gotty replied to a thread paq8px in Data Compression
     It looks like the timing is not reliable - look at the last 2 rows (paq8px_v201 -10 / paq8px_v201 -11). In the xml column they are the same (to the hundredth), which is probably an error. Ignoring this column, the ratio still varies between 1.29 and 4.29. That's a very large spread. Luca was probably running multiple instances at the same time or running other tasks besides paq8px. So it's best to ignore the timings; they are not really useful. If you still would like to compare timings, you may need to clean the results by scaling the outliers, but most importantly compare results obtained with the same command line options. You may compare versions along the compression level (8 to 8, 9 to 9, etc.), and especially whether lstm was used or not. I'd say the "slowdown" from v95 to v201 is probably about 3x. Yes, it's still a lot. But for 3-4% compression gain it's worth it.
    2337 replies | 625128 view(s)
  • Scope's Avatar
    5th March 2021, 22:13
    https://gitlab.com/wg1/jpeg-xl/-/commit/65825c8629f181dbe8bdd432b15576735af29701
    15 replies | 2018 view(s)
  • Fallon's Avatar
    5th March 2021, 15:39
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR - What's new in the latest version https://www.rarlab.com/download.htm Version 6.01 beta 1
    192 replies | 137083 view(s)
  • Gotty's Avatar
    5th March 2021, 12:29
     First of all, your use case is not clear to me: what is the purpose of hashing in your case? Do you need uniqueness? If yes, why? Do you need characters in the hash, or are bytes also OK?
     >> to convert a street address
     Addresses are terrible regarding their quality. What are you planning to do with addresses that are written differently but represent the same actual address? Do you plan a preprocessing stage to clean the addresses up?
    2 replies | 60 view(s)
  • Shelwien's Avatar
    5th March 2021, 06:32
     > So for example, what's the best way to convert a street address to a short fixed-length token?
     Some cryptohash (md5, sha1, sha256, ...). Faster hashes might work too, but are less reliable.
     > Say a six-character string.
     If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of a base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z
     > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
     There's obviously no perfect solution, but one idea is to modify the actual strings (e.g. add spaces) until a unique hash value is found.
     > Unless... there's some way to guide the hash so that there are no collisions
     Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
    2 replies | 60 view(s)
  • LucaBiondi's Avatar
    5th March 2021, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me @Darek please! Luca
    2337 replies | 625128 view(s)
  • Darek's Avatar
    5th March 2021, 00:04
    Darek replied to a thread paq8px in Data Compression
     @LucaBiondi - maybe I'm not a master of Excel, but I could help you make some changes.
    2337 replies | 625128 view(s)
  • hexagone's Avatar
    4th March 2021, 23:45
    hexagone replied to a thread paq8px in Data Compression
     This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement cost 5x to 15x in compression time.
    2337 replies | 625128 view(s)
  • Hacker's Avatar
    4th March 2021, 23:19
    Yup, I am aware of JPEG XL, I hope for some wider adoption, as time and time again with each hopeful JPEG successor.
    32 replies | 2394 view(s)
  • Jarek's Avatar
    4th March 2021, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    460 replies | 147596 view(s)
  • SolidComp's Avatar
    4th March 2021, 06:15
    It needs a much higher quality display to be competitive with major competitors in this space. Also it seems to be missing basic biometrics, like IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover. That's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent and I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 117 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
     Done! I hope you enjoy the graphs of size vs. time.
    2337 replies | 625128 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the excel itself (in a zip or 7zip).
    2337 replies | 625128 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
     Hi guys, I have an Excel file that contains the results of compressing my dataset starting from paq8px_v95(!). I have also plotted size vs. time for each data type. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help plot the data in a better way. Thank you, Luca
    2337 replies | 625128 view(s)
  • Darek's Avatar
    3rd March 2021, 11:16
    Darek replied to a thread paq8px in Data Compression
     No, those were different moments. It's probably something with my laptop. I suspect it could be an issue of low space on drive C: (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking the system up, paq8px quits without any message. For now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 that's about 4 days that need to be undisturbed.
    2337 replies | 625128 view(s)
  • Shelwien's Avatar
    3rd March 2021, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k". 500000*(1-116673681/122000000) = 21829 (122000000-116673681)/5000/365 = 2.92 122MB is the estimated paq8px result, cmix and nncp won't fit due to time limit.
    78 replies | 12265 view(s)
  • mitiko's Avatar
    2nd March 2021, 23:40
     Uhm, Wikipedia says it's a general-purpose LZ77 paired with Huffman coding and second-order context modeling, but looking at the code it also includes a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask. Meanwhile, I made the counting linear, but it's way slower. I tried some performance analysis tools but found nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bit vector (which I implemented myself) is inefficient. I'm considering some DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix array searches, and I'm still debating whether it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    17 replies | 793 view(s)
  • SolidComp's Avatar
    2nd March 2021, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    17 replies | 793 view(s)
  • byronknoll's Avatar
    2nd March 2021, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    78 replies | 12265 view(s)
  • Gotty's Avatar
    2nd March 2021, 14:03
     Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory"), that usually it's AB and rarely it's CDEFGH, you can go for AB first. But in this case the symbols are not equiprobable, so it's not random. And we are going for random content, so you don't have any more information about what the result usually is (what distribution the symbols usually follow). If we play the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimal way is always log2(n) questions. (A tiny binary-search illustration follows below.) Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to do it in 4096x8 questions. You can't do it in less.
     Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try finding patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! Being a random file, we know that there are no patterns there. But paq does not know that. So it tries. When it believes it has found some theory (that some symbols are more probable than others) it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. It's like trying to guess the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and to try to find the best theory, it will soon give up trying to solve the 8-symbol game in 2 questions - it will stick with 3 questions. And so it reduces its loss. So this is the reason why there is always a small loss there - because it is always trying a bit riskier at the beginning, so those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
    21 replies | 616 view(s)
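     A tiny illustration of the 3-question strategy for 8 equiprobable symbols - it is just a binary search, and the question count is exactly log2(8) = 3; the secret symbol is hard-coded for the example:

     #include <stdio.h>

     int main(void) {
         const char symbols[] = "ABCDEFGH";
         char secret = 'F';
         int lo = 0, hi = 7, questions = 0;

         while (lo < hi) {
             int mid = (lo + hi) / 2;
             questions++;
             /* "Is the secret in the first half of the remaining range?" */
             if (secret <= symbols[mid]) hi = mid; else lo = mid + 1;
         }
         printf("found %c in %d questions\n", symbols[lo], questions);  /* always 3 */
         return 0;
     }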
  • Gotty's Avatar
    2nd March 2021, 13:49
     xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people make claims or statements without (yet) fully understanding the subject, but it's OK. I sense that he's open and trying to understand it more deeply, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or a solid background, at least he can get a sense of what randomness means. You know how I started in ~2007? Funny: I tried to see if I could compress the already-compressed paq8hp result smaller. I could easy-peasy win the contest. Fortunately I have a math and programming background, and I quickly realized that it won't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). So then I deeply understood the real meaning of randomness. And I can tell: it has its beauty. Whatever you do, however you slice it, you just can't make them smaller. The result will always have the same size (in the optimal case). So I tried it, and I have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    21 replies | 616 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:21
     I'm trying to understand what's happening. So, we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages switch between significant savings and "a little smaller". At this point this is what I get: no breakthroughs; the previous attempts were false alarms. This time you can probably save a couple of bytes with not-completely-random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think it's wise, and I respect that.
    66 replies | 2210 view(s)
  • xinix's Avatar
    2nd March 2021, 06:54
     Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote that your "random" 4 KB file is much too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead, the archive format header and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him. Why are you blind? You've been told 2 times that PAQ already compresses 2 characters! (Reminder: Gotty, don't worry, I'm fine!) Well, PAQ already does what you need it to do - it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to a bit representation because PAQ does that itself! Just read: https://en.wikipedia.org/wiki/PAQ http://mattmahoney.net/dc/paq1.pdf You have a lot of time to think about it :)
     ______
     Tell me, do you program? Have you really learned any programming language?
    21 replies | 616 view(s)
  • Trench's Avatar
    2nd March 2021, 05:05
    I rushed my last reply and it was sloppy. You said it well about how conversion loses to compression, but you also answered the question by saying that nobody deals with compressing 2 digits since there is no need for it. By the same token there was no need for me to answer the puzzle, but it was a challenge for others to learn from. If everyone does the same thing, nothing new gets done and no different perspective appears. Your reply is right from a conventional standpoint, but why have you not second-guessed yourself? Because it's not needed, as you say. How do you know that you know it all? Based on experience. A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious.
Random to you is not random to me. People with super memory have a structure for remembering random numbers or phrases through association; those numbers may be random to others but not to those few. The computer does not know what random is unless a person tells it what is or is not random, which you agree with. So to say compression is impossible because the data is random is like saying it is impossible because it is being ignored. Ignoring something does not make it true; it only reflects the limitation of the person. That person puts the limitation into the code, which makes the code limited too; then it becomes a standard and people stop questioning it, since who will argue with the official stance? That does not seem healthy for innovation, since it puts up boundaries and discourages people. It is not that everyone says you cannot compress random data; it is that the field has been declared to have reached its limit by so many experts, and anyone new feels the experts know best. That is not healthy for any field of work, which is what I am trying to get at. Words make or break things, and they have restricted a lot of innovation. Even dictionaries have changed the meanings of words over time: something as simple as "leisure" now means relaxing in free time, but it used to mean using free time to learn, if we go far enough back. I am not here to change anyone's mind; I am here to offer another perspective. When everyone sees a playing card from one side it looks wide, but from another side it is thin. No one has to agree, but just like your tests, it is a way to get used to thinking, and thinking differently, to help things along.
On a side note, Tesla liked the number 3 and said the secret to everything is frequency. When he said everything, did he mean everything?
As for the ABCDEFG game, which I did not have time to read last time: 8 letters seems too small to answer in under 3 questions. 8 = you choose 4, 4 = 2, 2 = 1. The selection is too small to work with and would need other methods to get it down to 2. Sure, your problem can be solved with 2, but the odds of being wrong are greater. LOL. But if you only had 2 choices, then you would have to come up with a theory for why you chose, and understand the context of where it is coming from. Not as practical, but it is a good way to think things through.
As for the 256 game, which was answered last time: it might be good to take a guess in the first part to eliminate 2/3. I was trying to take that path rather than halving, which is the sure and steady way to get a short answer. It has a probability of being wrong and delaying things, but when it is right it has a probability of being fast, which was the purpose of the game. If I were right every time, guessing and eliminating 2/3 would be 6 turns; if wrong and I stuck to it, 13 turns at worst; and to be wrong every time the probability would be 50%, so within that 50% it would take about 8 turns as well.
So 5 at best (not likely), 13 at worst. That is how I came to my answer; I have thought it through now so I can explain the numbers better. Unless your percentage is different? So you are right, but I think I gave a better answer in speed, which was a risk. Try it with 2/3 eliminated and tell me how fast you get it: the answer is 3 and the second one is 200. Use the same method; which part you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said, I was in a rush and did not have time to think it through.
Lotto odds also have a pattern that most people don't know about, but it looks "random" to most. The point is to use probability to get better odds. If you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should come out near 50% heads or tails. Sure, it can be heads 100 times, but the odds against that keep building up, even though every single flip still has a 50% chance of being heads, whether it has come up heads 4 times in a row or 8. There are other probability factors in play that most people ignore because of what they see at face value, which tricks everyone, even most mathematicians.
You say, "Also don't expect that one day we will invent a smart method to compress random data." You already do. :) A file is a file; the program that deals with it determines whether it is too "random" for it to handle. So to call a file random just because your code cannot handle it is not fair. It is just like a lock: everything looks random except to the key you have; all locks are the same, and the key determines whether it is in order or not. If anything, the compression program is the random part, since it uses predetermined patterns. Some programs compress the same file better than others, so what is "random" is how much it gets compressed. It is not the file's fault, it is the program's, or to be fair it is both, and if it does not work the program is simply not compatible. It is not the 4k file's fault that the compression programs do not do a good job; as you said, there is no need to compress 2 characters, which is the programmers' preconceived notion. ;) That is what I am trying to say: everyone is stuck, to some degree, doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon and putting more memory and processing power into doing so. It is like throwing more money at the problem just to need a bigger processor and more memory. Again, most won't agree, and I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL.
This reply is long enough; I will read the other game another time. You can try my method if it applies and see if that works. YEP ;) So try to beat 4.0k
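For what it's worth, here is a minimal sketch (my own illustration, not anyone's method from this thread) of the trade-off being argued about, assuming a uniformly chosen number in 1..256 and yes/no questions that cover either half or a third of the remaining range. An uneven question can finish faster when it happens to hit, but on average it yields less information per answer, so the expected number of questions goes up:

#include <cmath>
#include <cstdio>

// average information (in bits) carried by a yes/no answer with P(yes) = p
static double answer_bits(double p) {
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

int main() {
    const double total_bits = 8.0;  // a uniform number in 1..256 carries 8 bits

    double even   = answer_bits(0.5);      // halve the remaining range each time
    double uneven = answer_bits(1.0 / 3);  // ask about a third, hoping to drop 2/3

    std::printf("even split:    %.3f bits/question -> at least %.2f questions on average\n",
                even, total_bits / even);
    std::printf("1/3-2/3 split: %.3f bits/question -> at least %.2f questions on average\n",
                uneven, total_bits / uneven);
    return 0;
}

It prints 1.000 bits per question (so at least 8 questions on average) for the even split and about 0.918 bits per question (so at least about 8.7 questions on average) for the 1/3-2/3 split; the lucky runs are shorter, but the average is longer.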
    21 replies | 616 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:34
    Celebrating: for the first time ever, random files have been compressed a little smaller and reconstructed exactly the same (no further details will be provided until the separate encoder and decoder are done). Subject to eventual final confirmation with the separate encoder and decoder on 2 clean PCs.
    66 replies | 2210 view(s)
  • Gotty's Avatar
    2nd March 2021, 02:26
    Random files? You mean: more? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double-check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2210 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:03
    They are the random files Sportman provided (please leave any further requests for technical details for when the separate encoder and decoder are done; none will be responded to at this time).
    66 replies | 2210 view(s)
  • Gotty's Avatar
    2nd March 2021, 01:46
    Are these inputs random files?
    66 replies | 2210 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 01:34
    Hi, I have made good progress: the compressor coding is done and I have now made the decompressor much simpler. At the moment the decompressor works directly from the compressor's internal variables and decompresses to exactly the same data. The only remaining work is to separate it into a standalone encoder and decoder. But do not underestimate the amount of 'mundane' tedious work involved; it will likely be a week before I am able to do as requested. But we can and should already be celebrating. (It can now handle any size of input as well.) LawCounsels
    66 replies | 2210 view(s)
  • Gotty's Avatar
    1st March 2021, 20:39
    Gotty replied to a thread paq8px in Data Compression
    Does it happen around the same %? How does it crash? Blue screen?
    2337 replies | 625128 view(s)
  • Shelwien's Avatar
    1st March 2021, 02:58
    > am I supposed to remove/ sanction the 2 'dog>ate' in set #2 before blend them? In a similar case in PPM (called masked contexts), symbols seen in higher-order contexts before the escape are excluded from probability estimation. But in most CM implementations, submodels (based on contexts of different orders in this case) are handled independently, so multiple contexts can provide predictions based on the same data. Overall the first approach is better, but it is also harder to implement in a CM, and a CM provides better compression than PPM even without it.
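To illustrate the exclusion idea behind the first approach, here is a minimal sketch (not paq8px or any real PPM/CM code; the symbols and counts are made up) of dropping symbols already seen in a higher-order context from a lower-order context's estimate:

#include <cstdio>
#include <map>
#include <set>

int main() {
    // hypothetical symbol counts collected in an order-1 context
    std::map<char, int> order1_counts = {{'a', 3}, {'t', 2}, {'e', 5}};
    // symbols already predicted (and escaped from) in a higher-order context
    std::set<char> masked = {'e'};

    // total count over the non-masked symbols only
    int total = 0;
    for (const auto& kv : order1_counts)
        if (!masked.count(kv.first)) total += kv.second;

    // probability estimate with exclusion: masked symbols get no share,
    // so the remaining symbols are effectively boosted
    for (const auto& kv : order1_counts) {
        if (masked.count(kv.first)) continue;
        std::printf("P(%c) = %d/%d = %.3f\n", kv.first, kv.second, total,
                    (double)kv.second / total);
    }
    return 0;
}

In a CM, by contrast, each submodel would produce a prediction from its full counts and a mixer would blend them, which is why the same evidence can end up counted by several contexts.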
    196 replies | 17169 view(s)
  • Gotty's Avatar
    1st March 2021, 01:25
    I agree: unfamiliar. Actually there is a better word. It is called "unpredictable". See the wikipedia definition of randomness: "randomness is the apparent lack of pattern or predictability in events". See? "Apparent lack". It means that the file may have a pattern, you just don't find it or see it. From your viewpoint then the file is practically random. There is no way you can compress it. I also agree: "all files are random", but you need to add "... until you find some useful patterns". Otherwise this statement in itself is not valid.
    21 replies | 616 view(s)
  • Gotty's Avatar
    1st March 2021, 01:11
    Let's do another (different) approach for the second game. Let's say that you record the answers to my questions with ones ("yes") and zeroes ("no"). Here are my questions, and your answers:
1. Is x >= 128? No
2. Is x >= 64? No
3. Is x >= 32? No
4. Is x >= 16? No
5. Is x >= 8? Yes (we are in 8..15)
6. Is x >= 12? Yes (we are in 12..15)
7. Is x >= 14? No (we are in 12..13)
8. Is x >= 13? Yes
Then it is 13. What did you record? No-no-no-no-yes-yes-no-yes -> 00001101. This looks like 8 bits. In other words, a byte... Hmmm... Let's look up this number in decimal. It's 13. Wow! Try it with other numbers. The recorded yes-no answers will always match the number's binary representation. Isn't that cool? Your yes-no answers and the binary format will be the same when you follow the above question structure. If the number is between 1 and 65536, that's 16 yes-no questions. Can you calculate the required number of questions somehow? Yes. The number of questions needed is always log2(n), where n is the number of possible symbols (the range). So log2(256) = 8 and log2(65536) = 16.
Now think with me. A bit more concentration... Let's do the opposite. You thought of 8 bits (8 numbers, each either 0 or 1). Let's stay with the same example: let them be 00001101. How can I find out which 8 bits you have? I simply need to ask about the bits one after the other:
Is the 1st bit 1? No
Is the 2nd bit 1? No
Is the 3rd bit 1? No
Is the 4th bit 1? No
Is the 5th bit 1? Yes
Is the 6th bit 1? Yes
Is the 7th bit 1? No
Is the 8th bit 1? Yes
Notice that again it's No-no-no-no-yes-yes-no-yes. It matches the bits (00001101) exactly, and that represents 13. You needed 8 yes-no questions, in other words: 8 bits.
It's even more amazing when you have probabilities for the symbols. (Actually this is the universal case; it covers the above yes-no questions as well.) You can have questions like this (guessing 13 again from 1..256):
Is it 1? (p=1/256) No
Is it 2? (p=1/255) No
Is it 3? (p=1/254) No
Is it 4? (p=1/253) No
Is it 5? (p=1/252) No
Is it 6? (p=1/251) No
Is it 7? (p=1/250) No
Is it 8? (p=1/249) No
Is it 9? (p=1/248) No
Is it 10? (p=1/247) No
Is it 11? (p=1/246) No
Is it 12? (p=1/245) No
Is it 13? (p=1/244) Yes
Got it: it's 13. In order to find out the information content, you need to sum -log2(1-p) for the "No"s and -log2(p) for the "Yes"es. Here is the result for each question:
Is it 1? (p=1/256) No 0.005646563
Is it 2? (p=1/255) No 0.00566875
Is it 3? (p=1/254) No 0.005691112
Is it 4? (p=1/253) No 0.005713651
Is it 5? (p=1/252) No 0.00573637
Is it 6? (p=1/251) No 0.005759269
Is it 7? (p=1/250) No 0.005782353
Is it 8? (p=1/249) No 0.005805622
Is it 9? (p=1/248) No 0.005829079
Is it 10? (p=1/247) No 0.005852726
Is it 11? (p=1/246) No 0.005876566
Is it 12? (p=1/245) No 0.005900601
Is it 13? (p=1/244) Yes 7.930737338
Sum those numbers and be amazed: it's again 8. (That's 13 questions with skewed probabilities carrying 8 bits of information content.) You can also start from the top: Is it 256? 255? ... 13? Even though you would ask many more questions, the answer would still be 8. Try it. No matter in what order you ask, no matter how you slice up the range, it will always be 8 bits.
See? The method or the number of symbols used is irrelevant: whether it's 8 bits or 1 byte (base 2 or base 256), it's still 256 symbols, and you always need 8 bits to represent a number from this range. Actually, if you store anything in any format it's always the same: the information content is how many yes-no questions you need to figure out the full content.
This is the number of bits. You cannot do it in less - you cannot compress these 8 bits of information into less. If you understand all the above (even partially) you will have a better grasp: no matter how you transform your random files, no matter what approach you use, no matter how many symbols you use to represent your data, the information content will always be the same. (4K in your case.)
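To make the arithmetic above easy to reproduce, here is a minimal sketch (my own illustration, not code from this thread), assuming the same setup: a uniformly chosen number in 1..256 guessed one value at a time, adding -log2(1-p) for each "No" and -log2(p) for the final "Yes":

#include <cmath>
#include <cstdio>

int main() {
    const int range = 256;   // symbols 1..256
    const int target = 13;   // the number being guessed
    double bits = 0.0;
    int remaining = range;   // symbols not yet ruled out

    for (int guess = 1; guess <= target; ++guess) {
        double p = 1.0 / remaining;           // chance this guess is the one
        if (guess == target) {
            bits += -std::log2(p);            // cost of the "Yes" answer
        } else {
            bits += -std::log2(1.0 - p);      // cost of a "No" answer
            --remaining;                      // one symbol eliminated
        }
    }
    std::printf("total information: %.6f bits\n", bits);  // prints 8.000000
    return 0;
}

The sum telescopes: the product of all the answer probabilities is exactly 1/256, so the total is log2(256) = 8 bits no matter in what order the guesses are made.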
    21 replies | 616 view(s)