Activity Stream

  • LawCounsels's Avatar
    15th February 2021, 01:01
    Hi all: I understand Hutter has increased the prize fund 10x to €500,000. This is at last worth the effort to claim. There is a minor snag which I hope the organisers can help overcome: is it possible to somehow verify results to a gold standard (ruling out anything shady) without laying bare the source code and thus the algorithm? And with a cast-iron undertaking that the .exe will not be retained after the tests (perhaps testing in the presence of the author's representative, who keeps control of the running .exe)?
    66 replies | 2072 view(s)
  • Hacker's Avatar
    17th February 2021, 00:23
    Since Squeeze Chart is not being updated anymore, I wonder what packers are actively developed, i.e. with at least one new version in the last six months? PAQ and its gazillion forks are actively developed but mostly experimental; 7-Zip and RAR go on as usual. The development of EMMA, MCM, NanoZip, PackJPG, PackRAW, Razor, UDA, ZCM and ZPAQ seems to be on hold or has been abandoned altogether. Are any other packers actively developed, please? Thank you.
    32 replies | 2221 view(s)
  • Trench's Avatar
    27th February 2021, 08:22
    I asked a while ago about compressing a binary file and got some decent programs that can compress, but they do poorly on raw binary compared to first converting the binary to hex/ASCII text and then compressing that, which beat direct binary compression by around 15% in what I tested. Other programs like 7-Zip are not as good either, even with LZMA2 being better than bzip2 when dealing with binary. Maybe my tests were not done properly; some might feel the file is not big enough (32 KB), or that I didn't use the proper configurations. These were just quick tests. Maybe more focus is needed on binary compression? Bad enough that other compressed formats, as stated before like PNG, are off by over 50%.
    21 replies | 486 view(s)
  • mitiko's Avatar
    27th February 2021, 22:56
    Dictionary transforms are quite useful, but do we take advantage of all they can offer when optimizing for size? LZ77/78 techniques are widely known, and optimal parsing is an easy O(n) minimal path search in a DAWG (directed acyclic weighted/word graph). In practice, optimal LZ77 algorithms don't do truly optimal parsing because of the big offsets it can generate. Also, LZ algorithms lose context when converting words to (offset, count) pairs; these pairs are harder to predict by entropy coders. (A small sketch of dictionary-based optimal parsing follows below this post.)
    I've been working on BWD - the pretentious name stands for Best Word Dictionary. It's a semi-adaptive dictionary, which means it computes the best/optimal dictionary (or tries to get some suboptimal approximation) for the given data and adds it to the stream. This makes decompression faster than LZ77/78 methods, as the whole dictionary can sit in memory, but compression becomes very costly. The main advantages are: optimality, easier-to-predict symbols, and a slight improvement in decompression speed. The disadvantages are clear: the dictionary has to be added to the compressed file, and compression is slow.
    This is still quite experimental and I plan to rewrite it in C++ (it's in C# right now), for which I'll probably need your help later on. I've also considered starting a new set of paq8 compressors, but that seems like getting too deep into the woods for an idea I haven't developed fully yet.
    Someone requested test results - compression takes about 45s for enwik5 and decompression is at around 20ms. You can find the exact results at https://github.com/Mitiko/BWDPerf/issues/5 I'm still making frequent changes and I'm not spending too much time testing. It generates a new sequence of symbols (representing the indexes into the dictionary) - the length is 50257 and the entropy is 5.79166; the dictionary uncompressed is 34042 bytes. This means that after an entropy coder it can go to about 325'113 bytes. (This is for enwik5 with m=32 and dictionarySize=256.) Stay tuned, I'm working on a very good optimization that should take compression time down a lot.
    I'll try to release more information soon on how everything works, the structure of the compressed file, and some of my notes. You can read more on how the algorithm works here: https://mitiko.github.io/BWDPerf/ This is the GitHub repo: https://github.com/Mitiko/BWDPerf/ I'll be glad to answer any and all questions here or with the new GitHub feature of discussions: https://github.com/Mitiko/BWDPerf/discussions
    Note this project is also a pipeline for compression, which I built with the idea of being able to switch between algorithms quickly. If you'd like to contribute any of your experiments, transforms or other general well-known compressors to it, to create a library for easy benchmarking, that'd be cool. I know Bulat Ziganshin made something similar, so I'm not too invested in this idea. Any cool new ideas, either on how to optimize parsing, counting words or how to model the new symbols, are always appreciated.
    13 replies | 579 view(s)
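A minimal sketch of the "optimal parsing as a minimal path search" idea from the post above, with a toy dictionary and made-up bit costs (this is not BWD's actual code, just an illustration of the DP):

```c
/* Illustrative only: optimal parse of `data` against a fixed word dictionary,
   as a minimal-cost path search (DP) over positions. Costs are placeholders. */
#include <stdio.h>
#include <string.h>
#include <limits.h>

#define LITERAL_COST 9   /* assumed cost (in bits) of emitting one raw byte   */
#define WORD_COST    6   /* assumed cost (in bits) of emitting one dict index */

int main(void) {
    const char *dict[] = { "the", "he", "cat", "at" };   /* toy dictionary */
    const int ndict = 4;
    const char *data = "thecat";
    int n = (int)strlen(data);

    int cost[64];
    for (int i = 1; i <= n; i++) cost[i] = INT_MAX;
    cost[0] = 0;

    for (int i = 0; i < n; i++) {
        if (cost[i] == INT_MAX) continue;
        /* relax a raw-literal step */
        if (cost[i] + LITERAL_COST < cost[i + 1])
            cost[i + 1] = cost[i] + LITERAL_COST;
        /* relax every dictionary word matching at position i */
        for (int d = 0; d < ndict; d++) {
            int len = (int)strlen(dict[d]);
            if (i + len <= n && memcmp(data + i, dict[d], len) == 0 &&
                cost[i] + WORD_COST < cost[i + len])
                cost[i + len] = cost[i] + WORD_COST;
        }
    }
    printf("optimal parse cost for \"%s\": %d bits\n", data, cost[n]);
    return 0;
}
```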
  • Gonzalo's Avatar
    10th February 2021, 19:04
    Do you know of a compressor or archiver that's content-aware regarding SQLite or other databases? Just to clarify, I am NOT talking about the VACUUM command or similar, where the output is still a functional database.
    10 replies | 670 view(s)
  • lz77's Avatar
    17th February 2021, 20:59
    What is the maximum distance in bytes between instructions that can be executed in parallel? How many ALUs are there for this, and which instructions can they execute? I mean the Intel i7-8700K CPU.
    8 replies | 390 view(s)
  • Gribok's Avatar
    24th February 2021, 07:31
    libsais is my new library for fast (see Benchmarks on GitHub) linear-time suffix array and Burrows-Wheeler transform construction based on induced sorting (the same algorithm as in sais-lite by Yuta Mori). The algorithm runs in linear time (and outperforms divsufsort), typically using only ~12KB of extra memory (with 2n bytes as the absolute worst-case extra working space). Source code and benchmarks are available at: https://github.com/IlyaGrebnov/libsais
    6 replies | 636 view(s)
  • lz77's Avatar
    8th February 2021, 14:41
    In Delphi 7 such a trick is possible by computing the jump address for the case statement. For example, there is a case statement on prefix = 0,1,2,3 when LZ decompressing (but it breaks optimization):
    var cm0: array of dword;
    ...
    asm
      mov dword [cm0],    offset m0
      mov dword [cm0+4],  offset m1
      mov dword [cm0+8],  offset m2
      mov dword [cm0+12], offset m3
    end;
    <decompressing loop starts>
    asm
      mov eax, prefix
      jmp dword [offset cm0+eax*4]
    end;
    m0: offs:=...; goto 1000;
    m1: offs:=...; goto 1000;
    m2: offs:=...; goto 1000;
    m3: offs:=...;
    1000:
    Is it possible in C to calculate the addresses of the labels and make such a jump? Or does the compiler do it by itself? (A C sketch follows below.)
    7 replies | 395 view(s)
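Regarding the C question above: GCC and Clang support a "labels as values" extension where `&&label` yields a label's address and `goto *expr` jumps to it; standard C has no portable equivalent (a plain switch usually compiles to a similar jump table anyway). A minimal sketch, with a placeholder prefix value instead of a real decoder loop:

```c
/* GNU C "labels as values" (GCC/Clang extension, not ISO C):
   build a jump table of label addresses and dispatch through it. */
#include <stdio.h>

int main(void) {
    void *cm0[4];                       /* table of label addresses                 */
    cm0[0] = &&m0; cm0[1] = &&m1; cm0[2] = &&m2; cm0[3] = &&m3;

    int prefix = 2;                     /* placeholder: would come from the stream  */
    goto *cm0[prefix];                  /* computed jump, like jmp [cm0+eax*4]      */

m0: puts("prefix 0"); goto done;
m1: puts("prefix 1"); goto done;
m2: puts("prefix 2"); goto done;
m3: puts("prefix 3");
done:
    return 0;
}
```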
  • SolidComp's Avatar
    14th February 2021, 21:26
    Hi all – What's the situation with PPM in comparison to other approaches, in terms of ratio, complexity, and speed tradeoffs? I don't see a PPM or PPMII thread here, and I'm curious to learn more about it. Which codecs are using PPM? Also, what happened with Dmitry Shkarin's PPMII algorithm? I was intrigued by this passage in Steinruecken's 2014 thesis (PDF) on lossless data compression: How does PPMII compare to other, non-PPM approaches? (I didn't realize until now that PPMd is also Shkarin's.) (That thesis also references a 1998 paper by Charles Bloom (PDF) that focuses on PPM. I haven't read it yet.) Thanks.
    6 replies | 566 view(s)
  • mitiko's Avatar
    25th February 2021, 00:35
    Is there a way to make ANS FIFO? I'm assuming it's either impossible or no one has tried it, since we've been stuck with a LIFO fast coder since 2014. If so, what's the reason that makes it impossible?
    ANS basically maps numbers to bits, and a bigger state means more bits. This is why the state increases as we encode, and normalization gives us effectively infinite precision. Can't we just reverse this process? Instead of starting at a low state and increasing as we go, start at some high state and decrease it. Normalization would make sure we don't get to negative states - it would increase the state. In ANS we're asking the question "What is the next state that encodes all the bits we have plus one more bit?". Now we can ask "What is the closest previous state that encodes the information in the current state, preceded by the new bit?"
    I'm imagining something hypothetical (for the binary alphabet): Let's say we want to encode 00110100. The first symbol is 0. x_0 = 16 encodes 000100. The last symbol of this sequence is 0 and it matches the end of this code, so we don't switch states. The next symbol we read is 0. The closest smaller state that matches is x_1 = 13, which encodes 00100. The next symbol is 1. The closest smaller state that matches 001 is x_2 = 5, which encodes 001. The next symbol is 1, but we can't encode anything - our state is too low. So we normalize - we output some bits and go to, let's say, state x_3 = 40. x_3 = 40 encodes 0010010001001, which matches our symbol 1. We read the next symbol to be 0. x_4 = 38 encodes 0100100101 and matches "01". And so on...
    That's the basic idea. I'm trying to understand more deeply how ANS works and how exactly it relates to range coding. Notice the string encoded by our state is the reverse of what we read. We're still reading in the correct direction but the state encodes in the opposite one - we're adding data to the most significant bits of the state, not the least significant. That's just me brainstorming. Has anyone else tried this? I would find great pleasure in a proof that it doesn't work as well. Thanks! (For reference, a tiny LIFO rANS round trip is sketched below.)
    4 replies | 394 view(s)
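For reference on the LIFO behaviour discussed above, a minimal rANS round trip for a binary alphabet with made-up frequencies and no renormalization (so it only works for short inputs); decoding necessarily pops symbols in the reverse order of encoding:

```c
/* Minimal rANS demo (binary alphabet, no renormalization): encoding pushes
   symbols onto the state, decoding pops them back out in reverse (LIFO). */
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

#define M 4                                         /* total frequency           */
static const uint32_t freq[2] = { 3, 1 };           /* assumed p(0)=3/4, p(1)=1/4 */
static const uint32_t cum[2]  = { 0, 3 };           /* cumulative frequencies     */

static uint64_t encode(uint64_t x, int s) {
    return (x / freq[s]) * M + cum[s] + (x % freq[s]);
}

static uint64_t decode(uint64_t x, int *s) {
    uint32_t r = (uint32_t)(x % M);
    *s = (r < cum[1]) ? 0 : 1;                      /* find symbol from slot */
    return freq[*s] * (x / M) + r - cum[*s];
}

int main(void) {
    const int bits[8] = { 0, 0, 1, 1, 0, 1, 0, 0 };
    uint64_t x = 1;                                 /* initial state */
    for (int i = 0; i < 8; i++) x = encode(x, bits[i]);
    printf("final state: %llu\n", (unsigned long long)x);

    for (int i = 7; i >= 0; i--) {                  /* LIFO: last symbol comes out first */
        int s;
        x = decode(x, &s);
        assert(s == bits[i]);
    }
    assert(x == 1);                                 /* back at the initial state */
    puts("round trip OK");
    return 0;
}
```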
  • stbrumme's Avatar
    5th February 2021, 22:00
    Hi, various Huffman-based compression formats impose an upper limit on the length of the prefix codes. For example, DEFLATE and JPEG allow only up to 16 bits per symbol. Depending on how skewed a symbol distribution is, the standard Huffman algorithm may produce longer codes. Therefore I created a free C library with various length-limiting algorithms: https://github.com/stbrumme/length-limited-prefix-codes
    You can find heavily documented implementations of: Package-Merge; JPEG's recommendation from Annex K.3; MiniZ's approach (similar to zlib); BZip2's solution; and two Kraft-McMillan based algorithms. The Kraft-McMillan code was entirely written by me - though hugely inspired by Charles Bloom's blog. The other implementations are derived from other people's hard work (either actual source code or papers/specifications) and optimized for clarity and/or speed.
    All algorithms share the same interface, so they are easily interchangeable: unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]); For each algorithm there is a custom-tailored interface, too, which is often faster and requires less code but has certain restrictions regarding its input (such as a sorted histogram etc.). A basic benchmark is included so you can measure performance and compression ratio with your own data. The aforementioned Git repository has a long README.md with tons more information.
    Major spoilers: Package-Merge can be made reasonably fast; my Kraft Strategy B runs much faster than a good Huffman implementation (e.g. compared with Moffat's in-place version); BZip2's approach is dead simple but slow and inefficient. (A small Kraft-sum check is sketched below.) I hope you enjoy it, Stephan
    3 replies | 369 view(s)
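Independent of any particular library, the Kraft-McMillan inequality is the invariant all such length-limiting algorithms have to maintain; a minimal check with made-up example code lengths:

```c
/* Check the Kraft-McMillan inequality for a set of prefix-code lengths:
   sum over symbols of 2^(-length) must be <= 1 for a valid prefix code,
   and == 1 if no code space is wasted. Uses integer arithmetic scaled
   by 2^maxLength to avoid floating point. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const unsigned char maxLength = 16;
    const unsigned char codeLengths[] = { 2, 2, 3, 4, 4, 3, 3 };  /* example lengths */
    const unsigned numCodes = sizeof(codeLengths) / sizeof(codeLengths[0]);

    uint64_t kraft = 0, limit = 1ULL << maxLength;
    for (unsigned i = 0; i < numCodes; i++)
        kraft += limit >> codeLengths[i];           /* 2^(maxLength - len) */

    printf("Kraft sum = %llu / %llu -> %s\n",
           (unsigned long long)kraft, (unsigned long long)limit,
           kraft <= limit ? "valid prefix code" : "over-subscribed");
    return 0;
}
```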
  • suryakandau@yahoo.co.id's Avatar
    15th February 2021, 16:29
    I am looking for a CM compressor written in VB6 or in Turbo Pascal.
    2 replies | 223 view(s)
  • Gonzalo's Avatar
    21st February 2021, 22:32
    Fazip uses TTA, but it doesn't work for compound content, like a .wav somewhere inside a TAR archive. Precomp was supposed to include WavPack about a year ago, but nothing has happened yet and it doesn't seem like it's going to. So, do you know of something like this, that detects and compresses audio chunks and passes the rest of the data through untouched? Thanks in advance!
    2 replies | 307 view(s)
  • Bulat Ziganshin's Avatar
    24th February 2021, 22:55
    I started to work on codec bindings for the new CELS framework with what I thought was the simplest one - LZ4. But it turned out to be more complex work than I expected. My main source of information is the lz4 manual, the library source code and the examples. I have not found any tutorial-style guides for using the library, and the documentation provided is centered around individual functions rather than usage scenarios. My goal is to avoid mallocs inside the library (providing all memory buffers myself), and to reuse compression contexts between operations in order to maximize performance. I will put my questions in bold, mixed with commentary describing my setup. (A sketch of the context-reuse pattern I have in mind follows below.)
    Q1: It seems that the decompression context is so small and quickly initialized that keeping and reusing it between operations is pointless?
    For compression of a single buffer, I use the sequence `malloc(LZ4_sizeofState()) + LZ4_compress_fast_extState() + free()`.
    Q2: Is LZ4_compress_fast_extState guaranteed to never call malloc internally (in the current 1.9.3 version)?
    It seems that the next step to improve performance is to have malloc + LZ4_compress_fast_extState + LZ4_compress_fast_extState_fastReset (many times) + free?
    Q3: Am I correct that I should call LZ4_compress_fast_extState with a freshly allocated state object, but on the following operations using the same state object I can use LZ4_compress_fast_extState_fastReset to improve performance?
    Q4: In lz4.c:819 I see that the fast reset is actually used only for input blocks smaller than 4 KB. Why isn't it used for larger blocks?
    Now, on to stream compression.
    Q5: The LZ4_resetStream_fast docs state that despite having the same size, a context used for single-shot compression shouldn't be reused for stream compression and vice versa. Is that true or is it my misunderstanding?
    The proper approach to maximize speed by reusing a context with streaming compression looks like: 1. malloc + LZ4_initStream; 2. call LZ4_compress_fast_continue multiple times, processing a single stream; 3. LZ4_resetStream_fast; 4. call LZ4_compress_fast_continue multiple times, processing a single stream; repeat steps 3 and 4 for each subsequent stream; and finally 5. free the memory.
    Q6: Is this the correct approach to ensure maximum speed?
    Q7: The LZ4_decompress_safe_continue documentation mentions three decompression scenarios, and LZ4_compress_fast_continue mentions various scenarios as well. Does any of them have a speed advantage? In particular, I have implemented double-buffering with adjacent buffers (as implemented in blockStreaming_doubleBuffer.c, although this contradicts Note 3 on LZ4_compress_fast_continue) - is that correct, and can I improve the speed by using another approach to compression/decompression? By default I use two buffers of >=64 KB, but I'm also interested in the answer for smaller buffers.
    PS: at the end of the day, it seems that the internal LZ4 API is already well thought out for my needs, not so much the manual.
    1 replies | 401 view(s)
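A minimal sketch of the single-shot extState reuse pattern from Q2/Q3 above, based on a plain reading of lz4.h; whether malloc is ever called internally and whether the fastReset variant is always legal here are exactly the open questions in the post, so treat this as an assumption rather than an answer:

```c
/* Sketch: reuse one externally allocated LZ4 state for many buffers.
   Assumes lz4 >= 1.9.x; error handling kept minimal on purpose. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int main(void) {
    void *state = malloc(LZ4_sizeofState());        /* caller-provided state */
    if (!state) return 1;

    const char *inputs[] = { "first buffer", "second buffer", "third buffer" };
    char dst[256];

    for (int i = 0; i < 3; i++) {
        int srcSize = (int)strlen(inputs[i]);
        int written = (i == 0)
            /* first use: full state initialization */
            ? LZ4_compress_fast_extState(state, inputs[i], dst, srcSize, (int)sizeof(dst), 1)
            /* later uses: cheaper reset of the same state */
            : LZ4_compress_fast_extState_fastReset(state, inputs[i], dst, srcSize, (int)sizeof(dst), 1);
        printf("block %d: %d -> %d bytes\n", i, srcSize, written);
    }
    free(state);
    return 0;
}
```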
  • Trench's Avatar
    20th February 2021, 20:21
    I assume many here are into file compression for the greater good more than for money. But up to what ratio, as the poll asks? Also, to what extent would you be willing to screw someone else over if they hired you to benefit from the idea? Kind of like how Xerox got screwed over by Apple, Apple and others got screwed by MS, the original creators of FB got screwed over, the McDonald brothers got screwed over by McDonald's, Nikola Tesla got screwed over by JP Morgan, etc., like most big corporations. So what would make you different? Think about it for a few days before you answer.
    2 replies | 299 view(s)
  • ron_s_s's Avatar
    10th February 2021, 23:41
    Example 1: codewords {0,01,11}. 0 is a prefix of 01 -> dangling suffix is 1. List: {0,01,11,1}. 1 is a prefix of 11 -> dangling suffix is 1 - already there -> the code is UD.
    Example 2: codewords {0,01,10}. 0 is a prefix of 01 -> dangling suffix is 1. List: {0,01,10,1}. 1 is a prefix of 10 -> dangling suffix is 0 - which is an original codeword! The code is not UD.
    These two examples are clear to me, but I have a specific case where it is not clear. Suppose I add two dangling suffixes to my list, and one is a prefix of the other, thus producing another dangling suffix. If this new dangling suffix is an original codeword, does that mean the code is not UD?
    Example: codewords {0,01,110}. 0 is a prefix of 01 -> dangling suffix is 1. List: {0,01,110,1}. 1 is a prefix of 110 -> dangling suffix is 10. List: {0,01,110,1,10}. (This is where the misconception happens...) I have in this list two dangling suffixes {1,10} which are not original codewords, and 1 is a prefix of 10 -> 0 is a dangling suffix. 0 is also an original codeword, thus the code is not UD. Is it conventional by the Sardinas-Patterson algorithm rules to produce a dangling suffix from two other dangling suffixes? (A small sketch of the test follows below.)
    1 replies | 259 view(s)
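A small sketch of the test in its closure form, where dangling suffixes are only ever paired against codewords (never against each other), which is the usual formulation of Sardinas-Patterson; the three example codes are the ones from the post above:

```c
/* Sardinas-Patterson test, closure form: collect dangling suffixes obtained
   by pairing codewords with codewords (first round) and then codewords with
   previously found suffixes (both directions), until no new suffix appears.
   The code is uniquely decodable iff no dangling suffix equals a codeword.
   Suffixes are not paired with other suffixes in this formulation. */
#include <stdio.h>
#include <string.h>

#define MAXS 256
#define MAXL 64

static char suf[MAXS][MAXL];
static int  nsuf;

static int has_suffix(const char *s) {
    for (int i = 0; i < nsuf; i++) if (strcmp(suf[i], s) == 0) return 1;
    return 0;
}

/* if a is a proper prefix of b, record the dangling suffix; return 1 if new */
static int add_dangling(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    if (la >= lb || strncmp(a, b, la) != 0) return 0;
    if (has_suffix(b + la)) return 0;
    strcpy(suf[nsuf++], b + la);
    return 1;
}

static int is_ud(const char *code[], int n) {
    nsuf = 0;
    for (int i = 0; i < n; i++)                     /* first round: codeword vs codeword */
        for (int j = 0; j < n; j++)
            if (i != j) add_dangling(code[i], code[j]);

    int grew = 1;
    while (grew) {                                  /* closure: codeword vs suffix, both ways */
        grew = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < nsuf; j++)
                grew |= add_dangling(code[i], suf[j]) | add_dangling(suf[j], code[i]);
    }
    for (int i = 0; i < nsuf; i++)                  /* UD iff no suffix is a codeword */
        for (int j = 0; j < n; j++)
            if (strcmp(suf[i], code[j]) == 0) return 0;
    return 1;
}

int main(void) {
    const char *c1[] = { "0", "01", "11"  };
    const char *c2[] = { "0", "01", "10"  };
    const char *c3[] = { "0", "01", "110" };
    printf("{0,01,11}:  %s\n", is_ud(c1, 3) ? "UD" : "not UD");
    printf("{0,01,10}:  %s\n", is_ud(c2, 3) ? "UD" : "not UD");
    printf("{0,01,110}: %s\n", is_ud(c3, 3) ? "UD" : "not UD");
    return 0;
}
```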
  • SolidComp's Avatar
    24th February 2021, 22:32
    Hi all – I just realized that I don't know anything substantial about encryption. Am I correct in assuming that we can't generally compress encrypted data? I assume that effective encryption produces random-looking data to the naive sentient being, therefore the sequence of steps has to be compression first, then encryption, then decryption, then decompression right? I've heard a bit about homomorphic encryption, which I think means a type of encryption that enables one to extract useful data from an encrypted dataset without decrypting the dataset. At that level of description, this can mean a few different things. Homomorphic encryption might mean that data in its informative form exists at different levels of abstraction or something, some of which is not actually encrypted (?), or there might be something about the nature of the search algorithms – how you find what you're looking for without decrypting... Or it could be much more interesting than this. But the general thrust of homomorphic encryption hints at the possibility of compressing encrypted data, if the surfaced information was somehow useful to a compressor, could be acted on by a compressor, but I don't know enough about it. Has anyone worked on the theory around this? Thanks.
    1 replies | 213 view(s)
  • JamesWasil's Avatar
    25th February 2021, 20:58
    I've wanted this with 286 and 386 laptops for decades. Do you feel it will live up to the promises from the press release? https://arstechnica.com/gadgets/2021/02/framework-startup-designed-a-thin-modular-repairable-13-inch-laptop/
    1 replies | 104 view(s)
  • SolidComp's Avatar
    Today, 03:29
    Hi all – I'm not too familiar with hashing theory, and had some questions. I want to tokenize various data in social science research. So for example, what's the best way to convert a street address to a short fixed-length token? Say a six-character string. Addresses are of course highly variable strings, and max out at around 100 characters/bytes. For a country with a max of 10 million possible current and future addresses, I assume there's no way to hash an address to a six-character string without collisions, right? Six characters from a 30-character set cover 729 million possibilities, but the inputs can be any of more than 36^100 possibilities (all caps + numbers), and variable length adds even more on top of that. If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses. Unless... there's some way to guide the hash so that there are no collisions. I don't know of anything like that that is still called a "hashing algorithm". It would be some kind of subset hashing that takes inputs that have a huge range if they're arbitrary, where the actual possibilities are a much smaller subset, and the output hash string is sized for the actual possibilities. Would you just randomly assign string tokens instead of hashing? Is that what they do with credit card tokenization? Thanks.
    1 replies | 7 view(s)
  • macarena's Avatar
    25th February 2021, 21:43
    Hello, most recent papers on image compression use the Bjontegaard metric to report average bitrate savings or PSNR/SSIM gains (BD-BR, BD-PSNR etc.). It works by finding the average difference between RD curves. Here's the original document: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc I am a little confused by the doc. The bitrate they show in the graphs at the end of the document seems to be in bits/sec. At least it is not bits per pixel (bpp), which is commonly used in RD curves. As per this site https://www.intopix.com/blogs/post/How-to-define-the-compression-rate-according-to-bpp-or-bps, converting from bpp to bits/s would require knowing the fps, which might not be known(?). I want to know whether it really matters if the bitrate is in bpp or bits/s, or does the metric give correct values no matter which one is used? Here's a Matlab implementation that seems to be recommended by JPEG: https://fr.mathworks.com/matlabcentral/fileexchange/27798-bjontegaard-metric . I ran a few experiments and it seems bpp gives plausible results, though a confirmation would be nice. Thanks!
    0 replies | 164 view(s)
  • Shelwien's Avatar
    Today, 06:32
    > So for example, what's the best way to convert a street address to a short fixed-length token?
    Some cryptohash (MD5, SHA-1, SHA-256, ...). Faster hashes might work too, but are less reliable.
    > Say a six-character string.
    If it's actually a binary blob, you can just take the first 6 bytes of the hash value or something. But if it's a restricted-charset "readable" string, you'd have to use the first 6 symbols of the base64 code of the hash value, or, ideally, a more flexible restricted-charset code like http://nishi.dreamhosters.com/u/marc_v2b.7z (a sketch of the truncation idea follows below).
    > If every hash has to be unique, six characters can't do it even if it gives us more than enough range for 10 million addresses.
    There's obviously no perfect solution, but one idea is to modify the actual strings (e.g. add spaces) until a unique hash value is found.
    > Unless... there's some way to guide the hash so that there are no collisions
    Hash functions usually contain some magic numbers that can be treated as parameters and adjusted to improve the results. This might let you get rid of a few collisions among known entries... but of course it won't work for unknown future entries.
    1 replies | 7 view(s)
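A minimal sketch of the truncate-a-hash idea above; FNV-1a stands in here for the cryptohash Shelwien suggests (a real deployment would use SHA-256 or similar), and the 31-character alphabet and example address are arbitrary choices:

```c
/* Sketch: map an arbitrary address string to a 6-character token by hashing
   and re-encoding the hash in a restricted alphabet. FNV-1a is used only as
   a stand-in for a proper cryptographic hash; collisions remain possible. */
#include <stdio.h>
#include <stdint.h>

static uint64_t fnv1a64(const char *s) {
    uint64_t h = 14695981039346656037ULL;           /* FNV offset basis */
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 1099511628211ULL;                       /* FNV prime */
    }
    return h;
}

static void token6(const char *s, char out[7]) {
    static const char alphabet[] = "ABCDEFGHJKMNPQRSTUVWXYZ23456789";  /* 31 chars, no 0/O/1/I/L */
    uint64_t h = fnv1a64(s);
    for (int i = 0; i < 6; i++) {
        out[i] = alphabet[h % 31];
        h /= 31;
    }
    out[6] = '\0';
}

int main(void) {
    char t[7];
    token6("221B BAKER STREET LONDON NW1 6XE", t);   /* made-up example address */
    printf("token: %s\n", t);
    return 0;
}
```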
  • LucaBiondi's Avatar
    Today, 00:26
    LucaBiondi replied to a thread paq8px in Data Compression
    Oh yes, help me please! Luca
    2333 replies | 624241 view(s)
  • Darek's Avatar
    Today, 00:04
    Darek replied to a thread paq8px in Data Compression
    @LucaBiondi - maybe I'm not a master of Excel, but I could help you make some changes.
    2333 replies | 624241 view(s)
  • hexagone's Avatar
    Yesterday, 23:45
    hexagone replied to a thread paq8px in Data Compression
    This is brutal. Dividing the last row of numbers by the first shows that the few percent of compression ratio improvement cost 5x to 15x in compression time.
    2333 replies | 624241 view(s)
  • Hacker's Avatar
    Yesterday, 23:19
    Yup, I am aware of JPEG XL. I hope for some wider adoption, as I have time and time again with each hopeful JPEG successor.
    32 replies | 2221 view(s)
  • Jarek's Avatar
    Yesterday, 16:44
    Jarek replied to a thread Zstandard in Data Compression
    https://www.phoronix.com/scan.php?page=news_item&px=Zstd-1.4.9-Released https://www.reddit.com/r/linux/comments/lx43x1/zstandard_v149_released_2x_faster_long_distance/
    458 replies | 147285 view(s)
  • SolidComp's Avatar
    Yesterday, 06:15
    It needs a much higher quality display to be competitive with the major competitors in this space. Also it seems to be missing basic biometrics, like an IR camera for Windows Hello facial recognition. I wouldn't buy a laptop without those features. But I love the microphone cover. That's actually more important than covering a webcam. And the PCIe 4.0 x4 slot for an SSD is excellent; I don't know of another laptop that has 4-lane PCIe 4.0.
    1 replies | 104 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 23:08
    LucaBiondi replied to a thread paq8px in Data Compression
    Done! I hope you enjoy the graphs of size vs. time.
    2333 replies | 624241 view(s)
  • Gotty's Avatar
    3rd March 2021, 21:56
    Gotty replied to a thread paq8px in Data Compression
    Hi Luca! You can upload the excel itself (in a zip or 7zip).
    2333 replies | 624241 view(s)
  • LucaBiondi's Avatar
    3rd March 2021, 18:36
    LucaBiondi replied to a thread paq8px in Data Compression
    Hi guys, I have an Excel file that contains the results of compressing my dataset starting from PAQ8PX_V95(!). I have also plotted size / time for each data type. It's not easy to attach all these images to the post. Where could I upload my Excel file? Maybe someone can help plot the data in a better way. Thank you, Luca
    2333 replies | 624241 view(s)
  • Darek's Avatar
    3rd March 2021, 11:16
    Darek replied to a thread paq8px in Data Compression
    No, there were different moments. It's probably something with my laptop. I suspect it could be an issue of low space on drive C: (system). I sometimes need to close the laptop and hibernate the system, and in some cases, after waking the system up, paq8px quits without any message. For now I've started to plan compression runs for times when the system won't need to be hibernated. For enwik9 that's about 4 days which must not be disturbed.
    2333 replies | 624241 view(s)
  • Shelwien's Avatar
    3rd March 2021, 07:33
    Which can be interpreted as "Let's wait 3 more years, then buy a bunch of otherwise useless preprocessors from Alex for $25k". 500000*(1-116673681/122000000) = 21829 (122000000-116673681)/5000/365 = 2.92 122MB is the estimated paq8px result, cmix and nncp won't fit due to time limit.
    76 replies | 12015 view(s)
  • mitiko's Avatar
    2nd March 2021, 23:40
    Uhm, Wikipedia says it's a general-purpose LZ77 paired with Huffman and second-order context modelling, but looking at the code it also includes a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c But I'm probably not the guy to ask. Meanwhile, I made the counting linear but it's way slower. I tried some performance analysis tools but found nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and looking at the bit vector (which I implemented myself) is inefficient. I'm considering some DP approach where we only count the words once, but then there would be quite a complicated process of updating the counts, involving multiple suffix array searches, and I'm still debating whether it's worth it. Correct me if I'm wrong, but mcm seems to get away with counting easily because it only ranks matches to the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
    13 replies | 579 view(s)
  • SolidComp's Avatar
    2nd March 2021, 21:55
    Mitiko, what about brotli's dictionary transforms? Are they LZ77/78?
    13 replies | 579 view(s)
  • byronknoll's Avatar
    2nd March 2021, 21:47
    A new rule has been added to Hutter Prize to make it easier: http://prize.hutter1.net/hfaq.htm#ddisc
    76 replies | 12015 view(s)
  • Gotty's Avatar
    2nd March 2021, 14:03
    Yes! You have it. Exactly: if you have no further information ("no theory" in your words) you can ask 2 questions, hoping that you will find the number quicker, but if you are unlucky, there is a higher chance that you will need to ask 4 questions. So 3 is the overall optimum: it's always 3, no risk. There is no way to solve it with 2 questions without risking. If you have some information ("theory") that it's usually AB, and rarely CDEFGH, you can go for AB first. But in that case the symbols are not equiprobable, so it's not random. And we are going for random content, so you have no extra information about what the result usually is (what distribution the symbols usually follow).
    If we play the game with 16 letters, it's the same: the optimum is 4 questions. With 32 it's 5. No matter how big your range is, the optimal number is always log2(n). Going even higher doesn't change the game: if you have 4096x8 random bits, the optimal way to find out those bits is to do it in 4096x8 questions. You can't do it in less.
    Practical compression example: try compressing random files with any compression software. The result will always be a bit bigger. In the case of paq* it will actually try to find patterns (find a "theory" behind the bits and bytes). It really does try finding patterns! Being a random file, we know that there are no patterns there. But paq does not know that. So it tries. When it believes it has found some theory (that some symbols are more probable than others) then it applies it - and experiences a loss. Ouch! So the result gets a bit bigger. Like trying to guess the 8-symbol game in 2 questions. Risky, right? Because it is programmed to lower the risk and try to find the best theory, it will soon give up trying to solve the 8-symbol game in 2 questions - it will stick with 3 questions. And so it reduces its loss. This is the reason why there is always a small loss: because it always starts out a bit riskier, those losses are inevitable (in the case of paq or any other compression software with similar internal workings).
    21 replies | 486 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:49
    xinix, you are a bit harsh. As Trench said: "upset". I know it's upsetting when people have claims or statements and they don't fully understand the subject (yet), but it's OK. I sense that he's open and he tries to understand it more deeply, so let's give it a go. He has no background in programming or information theory - it would be nice to have, but what can you do? We are here to help him (remember: he came with a question). This is what we can do: practical examples, visualizations, games - so even without the full experience or a solid background, at least he can get a sense of what randomness means. You know how I started in ~2007? Funny: I tried to see if I could compress the already-compressed paq8hp result smaller. I could easy-peasy win the contest. Fortunately I have a math and programming background, and I quickly realized that it won't work. You do have to try compressing random files in order to understand what randomness means (not with paq or any other software - they don't give you first-hand experience, but in your own way, creating your own tools). So then I deeply understood the real meaning of randomness. And I can tell you: it has its beauty. Whatever you do, however you slice it, you just can't make them smaller. The result will always have the same size (in the optimal case). So I tried it, and have actual first-hand experience. Equipped with my math background I also know the explanation. Now he can have a tiny little experience by trying to solve the ABCDEFGH puzzle in 2 questions.
    21 replies | 486 view(s)
  • Gotty's Avatar
    2nd March 2021, 13:21
    I'm trying to understand what's happening. So, we had breakthroughs in 2007, 2015, 2020, and now in 2021. The messages switch between significant savings and "a little smaller". At this point this is what I get: no breakthroughs, the previous attempts were false alarms. This time you can probably save a couple of bytes with not-completely-random files. (Which is believable.) Your request to "be patient" tells me that you are more careful this time. I think it's wise, and I respect that.
    66 replies | 2072 view(s)
  • xinix's Avatar
    2nd March 2021, 06:54
    Ahaaaa!!! There you go! Kudos to RANDOM, you had an epiphany! We also wrote to you that your "random" 4 KB file is much too small for correct tests. You need at least 1 megabyte! That would eliminate some of the overhead, archive format headers and the like. Gotty, don't worry, I'm fine! It doesn't work any other way with him. Why are you blind? You've been told twice that PAQ already compresses 2 characters! (Reminder: Gotty, don't worry, I'm fine!) Well, PAQ already does what you need it to do, it already compresses those 2 characters, only it does it better and faster, without bloating the file, and we don't have to convert the file to bit view because PAQ does it itself! Just read: https://en.wikipedia.org/wiki/PAQ http://mattmahoney.net/dc/paq1.pdf You have a lot of time to think about it :) ______ Tell me, do you program? Have you really learned any programming language?
    21 replies | 486 view(s)
  • Trench's Avatar
    2nd March 2021, 05:05
    I rushed my last reply, which was sloppy. You said it well about how conversion loses to compression, but you also answered the question by stating that no one deals with compressing 2 digits since there is no need for it. Then again, there was no need for me to answer the puzzle, but it was a challenge for others to learn from. If everyone does the same thing, nothing new gets done to provide a different perspective. Your reply is right from a conventional standpoint. But why have you not second-guessed yourself? Since it's "not needed", as you say, how do you know that you know it all? Based on experience? A lot of things are found by accident, just fooling around and trying silly things, since the obvious is not so obvious.
    Random to you is not random to me. People with super memory have a structure for remembering random numbers or phrases through association. They may be random to others but not to a few. The computer does not know what random is unless a person defines what is or is not random, which you agree with. So to say it's impossible because it is random is like saying it is impossible because it is being ignored. Ignoring something does not make it true; it's just temporary, down to the limitation of the person. And that person puts it into the code, which also makes it limited, and then it becomes a standard, and then people ignore it, since who will argue with the official stance? That does not seem healthy for innovation, since it puts up boundaries and discourages. So when everyone says you cannot compress random data, the limit was reached by so many experts, and anyone new feels the experts know best. That is not healthy for any field of work, which is what I am trying to get at. Words make or break things, and that has restricted a lot of innovation. Even dictionaries have changed the meanings of words over time. Even something as simple as the word leisure now means relaxing in free time, but it used to mean using free time to learn, if we go much further back, as a quick example. I am not here to change anyone's mind; I am here to offer another perspective on how to see things. When everyone sees a playing card from one side it may look wide, but from another side it is thin. No one has to agree, but just like your tests it's a way to get used to thinking, and thinking differently, to help things. On a side note, Tesla liked the number 3, and he said the secret to everything is frequency. When he said everything, did he mean everything?
    As for the ABCDEFGH game, which I did not have time to read last time: the 8 letters seem too small a set to answer in under 3. 8 = you choose 4, 4 = 2, 2 = 1. The selection is too small to deal with and needs other methods to get to 2. Sure, your problem can be solved with 2, but again the odds of being wrong are greater. But if you only had 2 choices then you would have to come up with a theory on why you chose, and understand the context of where it is coming from. Not as practical, but it's a good way to think things through.
    As for the 256, which was answered last time: it might be good to take a guess in the first part to eliminate 2/3. I was trying to take that path rather than 1/2, which is the sure and steady way to get a shorter answer. It has a probability of being wrong and delaying more, but when right it has a probability of being fast, which was the purpose of the game. If I were right all the time, guessing and eliminating 2/3, that would be 6 turns; if wrong and I stuck to it, then 13 turns at worst; to be wrong every time the probability would be 50%, so within that 50% it would take 8 turns as well.
    So 5 at best (not likely), 13 at worst. That is how I came to my answer, but I've thought it through now to understand and explain the numbers better. Unless your percentage is different? So you are right, but I think I gave a better answer in speed, which was a risk. You try it with 2/3 eliminated and tell me how fast you get it. The answer is 3 and the second one is 200. Use the same method; which part you eliminate first is up to you. So what are your odds? I should have put more effort in last time, but as I said I was in a rush and did not have time to think it through. Lotto odds also have a pattern which most don't know, but it looks "random" to most. The thing is to use probability to get better odds. Just like if you flip a coin it will be 50% heads or tails, and if you flip it 100 times it should be near 50% heads or tails. Sure, it can be heads 100 times, but the odds against that grow, even though every flip has a 50% chance of being heads, whether it has been heads 4 times in a row or 8; there are other probability factors in play that most ignore because of what they see at face value, and everyone is tricked, even most mathematicians.
    You say "Also don't expect that one day we will invent a smart method to compress random data." You already do. :) A file is a file. The program that deals with it determines whether it is too "random" for it to handle. So to call a file random when you know the code is not fair. It is just like a lock: it all looks random, but not to the key you have. All locks are the same; it's the key that determines whether it is in order or not. If anything, the compression program is random in using predetermined patterns. Since some programs compress the same file better than others, that is not "random", yet how much it is compressed varies. It is not the file's fault, it is the program's. Or to be fair it is both, and if it does not work the program is not compatible. It is not the 4K file's fault that the compression programs do not do a good job; as you said, there is no need to compress 2 characters, which is based on programmers' preconceived notions. ;) That is what I am trying to say: everyone is stuck to some degree doing the same thing, afraid of taking the next step. Everyone is trying to squeeze a dried-up lemon and putting more memory and processing power into doing so. It's like throwing more money at the problem, needing a bigger processor and more memory. Again, most won't agree, and I see xinix (hello) is upset with me and gives you thumbs up all the time. LOL. This reply is long enough; I will read the other game another time. You can try my method if it applies and see if it works. YEP ;) So try to beat 4.0K.
    21 replies | 486 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:34
    Celebrating the first time ever that random files have been compressed a little smaller and reconstructed exactly the same (no further details will be provided until the separate encoder and decoder are done). Subject to eventual final confirmation with a separate encoder and decoder on 2 clean PCs.
    66 replies | 2072 view(s)
  • Gotty's Avatar
    2nd March 2021, 02:26
    Random files? You mean: more than one? I thought it was just a zstd file. So you have different files, right? That's good! I just wanted to ask you to double-check the results with real random files. xinix also posted one. You also referred to random.org files in your earlier posts, so I believe you now have some test files with you. So what are the (preliminary?) results? What are we celebrating?
    66 replies | 2072 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 02:03
    Sportman provided the random files (please leave any further requests for technical details until the separate encoder and decoder are done; none will be responded to at this time).
    66 replies | 2072 view(s)
  • Gotty's Avatar
    2nd March 2021, 01:46
    Are these inputs random files?
    66 replies | 2072 view(s)
  • LawCounsels's Avatar
    2nd March 2021, 01:34
    Hi, I have made good progress: the compressor coding is done and I have now greatly simplified the decompressor. At the moment the decompressor works directly from the compressor's internal variables and reproduces the input exactly. The only work left is to separate it into an encoder and a decoder. But do not underestimate the amount of 'mundane' tedious work involved. It will likely take a week before I am able to do as requested. But we can and should already be celebrating. (It can now handle any size of input as well.) LawCounsels
    66 replies | 2072 view(s)
  • Gotty's Avatar
    1st March 2021, 20:39
    Gotty replied to a thread paq8px in Data Compression
    Does it happen around the same %? How does it crash? Blue screen?
    2333 replies | 624241 view(s)
  • Shelwien's Avatar
    1st March 2021, 02:58
    > am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them? In a similar case in PPM (called masked contexts), symbols seen in higher-order contexts before the escape are excluded from probability estimation. But in most CM implementations, submodels (based on contexts of different orders in this case) are handled independently, so multiple contexts can provide predictions based on the same data. Overall the first approach is better, but it is also harder to implement in a CM, and CM provides better compression than PPM even without it.
    196 replies | 17041 view(s)
  • Gotty's Avatar
    1st March 2021, 01:25
    I agree: unfamiliar. Actually there is a better word. It is called "unpredictable". See the wikipedia definition of randomness: "randomness is the apparent lack of pattern or predictability in events". See? "Apparent lack". It means that the file may have a pattern, you just don't find it or see it. From your viewpoint then the file is practically random. There is no way you can compress it. I also agree: "all files are random", but you need to add "... until you find some useful patterns". Otherwise this statement in itself is not valid.
    21 replies | 486 view(s)
  • Gotty's Avatar
    1st March 2021, 01:11
    Let's do another (different) approach for the second game. Let's say that you record the answers to my questions with ones ("yes") and zeroes ("no"). Here are my questions and your answers:
    1. Is x >= 128? No
    2. Is x >= 64? No
    3. Is x >= 32? No
    4. Is x >= 16? No
    5. Is x >= 8? Yes (we are in 8..15)
    6. Is x >= 12? Yes (we are in 12..15)
    7. Is x >= 14? No (we are in 12..13)
    8. Is x >= 13? Yes
    Then it is 13. What did you record? No-no-no-no-yes-yes-no-yes -> 00001101. That looks like 8 bits. In other words, a byte... Hmmm... Let's look up this number in decimal. It's 13. Wow! Try it with other numbers. The recorded yes-no answers will always match the number's binary representation. Isn't that cool? Your yes-no answers and the binary format will be the same when you follow the above question structure. If the number is between 1-65536, that's 16 yes-no questions. Can you calculate the required number of questions somehow? Yes. The number of needed answers is always log2(n), where n is the number of possible symbols (the range). So log2(256) = 8 and log2(65536) = 16.
    Now think with me. A bit more concentration... Let's do the opposite. You thought of 8 bits (8 numbers, each either 0 or 1). Let's stay with the same example; let them be 00001101. How can I find out which 8 bits you have? I simply need to ask for the bits one after the other:
    Is the 1st bit 1? No
    Is the 2nd bit 1? No
    Is the 3rd bit 1? No
    Is the 4th bit 1? No
    Is the 5th bit 1? Yes
    Is the 6th bit 1? Yes
    Is the 7th bit 1? No
    Is the 8th bit 1? Yes
    Notice that again it's No-no-no-no-yes-yes-no-yes. It matches the bits (00001101) exactly again. And that represents 13. You needed 8 yes-no questions, in other words: 8 bits.
    It's even more amazing when you have probabilities for the symbols. ((Actually this is the universal case; it covers the above yes-no questions as well.)) You can have questions like this (guessing 13 again from 1..256):
    Is it 1? (p=1/256) No
    Is it 2? (p=1/255) No
    Is it 3? (p=1/254) No
    ...
    Is it 12? (p=1/245) No
    Is it 13? (p=1/244) Yes
    Got it: it's 13. In order to find out the information content, you'll need to sum {-log2(1-p) for the "No"s and -log2(p) for the "Yes"es} over the above answers. Here is the result for each question:
    Is it 1? (p=1/256) No 0.005646563
    Is it 2? (p=1/255) No 0.00566875
    Is it 3? (p=1/254) No 0.005691112
    Is it 4? (p=1/253) No 0.005713651
    Is it 5? (p=1/252) No 0.00573637
    Is it 6? (p=1/251) No 0.005759269
    Is it 7? (p=1/250) No 0.005782353
    Is it 8? (p=1/249) No 0.005805622
    Is it 9? (p=1/248) No 0.005829079
    Is it 10? (p=1/247) No 0.005852726
    Is it 11? (p=1/246) No 0.005876566
    Is it 12? (p=1/245) No 0.005900601
    Is it 13? (p=1/244) Yes 7.930737338
    Sum those numbers and be amazed: it's again 8. (That's 13 questions with skewed probabilities carrying 8 bits of information.) You can also start from the top: Is it 256?, 255?, ... 13?. Even though you'd ask many more questions, the answer will still be 8. Try it. No matter what order you use, no matter how you slice up your range, it will always be 8 bits. See? The method or the number of symbols used is irrelevant: whether it's 8 bits or 1 byte (base 2 or base 256), it's still 256 symbols, and you always need 8 bits to represent a number from this range.
    Actually, whatever format you store anything in, it's always the same: the information content is how many yes-no questions you need to figure out the full content. This is the number of bits. You cannot do it in less - you cannot compress these 8 bits of information into less. (A small program demonstrating this is below.) If you understand all the above (even partially) you will have a better grasp: no matter how you transform your random files, no matter what approach you use, no matter how many symbols you use to represent your data, the information content will always be the same. (4K in your case.)
    21 replies | 486 view(s)
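A tiny program reproducing the arithmetic above: it sums the per-answer information both for the 8 bisection questions and for the linear "Is it 1?, Is it 2?, ... Is it 13?" questions with skewed probabilities; both come to exactly 8 bits.

```c
/* Reproduce the sums from the post: guessing one value out of 256 costs
   8 bits of information no matter how the questions are phrased. */
#include <stdio.h>
#include <math.h>

int main(void) {
    /* bisection: 8 equiprobable yes/no questions, 1 bit each */
    double bisection = 8.0 * (-log2(0.5));

    /* linear scan "Is it 1?", "Is it 2?", ... with p = 1/(symbols left) */
    double linear = 0.0;
    for (int left = 256; left > 244; left--)        /* the twelve "no" answers */
        linear += -log2(1.0 - 1.0 / left);
    linear += -log2(1.0 / 244.0);                   /* the final "yes" */

    printf("bisection: %.9f bits\n", bisection);    /* 8.000000000 */
    printf("linear:    %.9f bits\n", linear);       /* 8.000000000 */
    return 0;
}
```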
  • Self_Recursive_Data's Avatar
    1st March 2021, 00:26
    Does the following help prediction? How, if so? I tried it once but it didn't work. If we take 2 sets/orders of predictions to blend: SET1 my dog>ate my dog>ate + SET2 dog>ate dog>ate dog>jumped, am I supposed to remove/sanction the 2 'dog>ate' in set #2 before blending them? They are in the same location in the dataset, because order-3 is found in order-2 but not always the other way around. The dataset looks like (with to emphasize the matches): "], ], ". As we can see here, some of the order-3 is order-2.
    196 replies | 17041 view(s)
  • Gotty's Avatar
    1st March 2021, 00:17
    Not bad! So you figured out that you need to do halving to find the number. The correct answer is 8 actually, but your approach is good, so accepted ;-).
    21 replies | 486 view(s)
  • Gotty's Avatar
    1st March 2021, 00:16
    Stuck = no improvements. But we do have improvements from the last 10 years in every field (image, video, general), both practical and experimental. Are you following the news? :-) As time progresses, the leaps will get smaller and smaller - we are slowly but surely approaching the theoretical compression limit that is reachable in reasonable time. So don't expect anything "extraordinary". Also don't expect that one day we will invent a smart method to compress random data. I think that would count as a big leap, right? 2-digit ASCII compression would be everywhere if that were a common format for sharing data. But no one shares data like that. So it's not in focus at all. You convert your binary file to a 2-digit file when you want to visualize the bits. Other than that it doesn't seem to be a useful format - it's just a waste of space. Do you use or see 2-digit files? Look, xinix also immediately converted it to binary. Why? Because that's the natural and compact container for such content.
    21 replies | 486 view(s)
  • Trench's Avatar
    28th February 2021, 23:11
    ---- OK, I just saw post #8. As explained in another post, all files are random: you can take any file and it will look random to you at face value. The program does not see it as random, since it looks for certain patterns, and those patterns become "familiar" enough to compress. Even with a dictionary size of 128 it still has limitations; if the dictionary size were 4 it would miss out on the other 124, which you would call random when it is obviously not. It is just not familiar to the program, in terms of what to use. The computer does not know the difference between 1234 and 1z$T. You see the difference, but you are not doing it manually with your own dictionary in your head; the computer does not, unless you tell it. For people who talk technically, "random" should not be in the definition, but "unfamiliar" should be. And I stick by that, despite slipping up and saying it at times, being influenced by others. LOL
    Anyway, you are not stuck? So you made leaps and bounds on compression, or did we all hit a wall for the past 10 years in seeing major improvements to the standard model of compression? Some programs have improved by a bit, but are not practical, which is why they are not in general use. You helped explain the comprehension barrier, but the issue still remains that I do not see any effort to improve 2-digit compression, since everyone is focused on the standard, and it is something worth focusing on. People can disagree or disregard it, but it's a new door to explore. You notice how a computer has memory? Well, it tried to copy the mind, and it is hard to keep track when memory is full of other things. LOL. So I lost track of the 33,420 info. lol
    I disagree that both are not random, and I don't like that word. You cannot beat 4.0 KB; the question is whether you will try to.
    From doubling the numbers 1 2 4 8, 16 32 64, 128 256:
    1. Is it between 1-128? (2/3 off) yes
    2. Is it 8 to 32? (2/3 off) yes
    3. Is it 8-16? yes (8 9 10, 11 12 13, 14 15 16)
    4. Is it 11-13? yes
    5. Is it 11-12? no
    6. The answer is 13.
    Which would mean 6 questions, maybe with some errors. It was the first thing that came to mind, so I could be wrong, but I did it fast. Faster, but inaccurate to a smaller degree. If you had said 7 I would have lost. LOL. Will reply to the rest later.
    21 replies | 486 view(s)
  • hexagone's Avatar
    28th February 2021, 23:02
    Code is here: https://github.com/mathieuchartier/mcm/blob/master/Dict.hpp & https://github.com/mathieuchartier/mcm/blob/master/WordCounter.hpp Look at CodeWordGeneratorFast (Dict.hpp), especially WordCount::CompareSavings & Savings (WordCounter.hpp) for the ranking logic.
    13 replies | 579 view(s)
  • Shelwien's Avatar
    28th February 2021, 22:43
    There was this: https://encode.su/threads/482-Bit-guessing-game Humans are actually pretty bad at generating random numbers.
    21 replies | 486 view(s)
  • mitiko's Avatar
    28th February 2021, 22:40
    Thanks for the tests, but it's not quite optimized yet. I might have forgotten to bump up the block size, so it did 10 blocks to compress enwik8 and needs 10 dictionaries instead of 1. The LZ stage probably only generates overhead as well. I'm rewriting the ranking loop now, so it can do ranking with states and also save some CPU on hashtable lookups. I'm pretty busy with school but I'll try to drop it this week. mcm seems interesting - do you know what it uses as a ranking function? I'm a little bit lost in the code.
    13 replies | 579 view(s)
  • Darek's Avatar
    28th February 2021, 22:36
    Darek replied to a thread Paq8sk in Data Compression
    OK, I'll try. Somehow I've been having issues with my laptop recently - I've started enwik9 with paq8px v201 five times, and even after 3-4 days my computer crashes...
    228 replies | 23549 view(s)
  • Gotty's Avatar
    28th February 2021, 21:47
    Second game: Let's say that you thought of a number between 1-256. I'd like to figure out that number. It could be any number in that range. Let's say it is 13. What do you think how many yes-no questions do I need to find it out? I encourage you to really try to find out the answer. It's free tutoring :-) Take advantage of the opportunity.
    21 replies | 486 view(s)
  • Gotty's Avatar
    28th February 2021, 21:39
    Please follow the game here: Think of a letter between A-H. That is: A, B, C, D, E, F, G, H (8 different symbols). Let me find out which it is. How many yes-no questions do I need to ask you in the optimal case (= fewest possible questions)? Let's say you think of "F". My approach:
    - Is it between A-D? - No.
    - (Then it is between E-H. Let's halve that range.) Is it between E-F? - Yes.
    - (I have two choices now: either E or F.) Is it "E"? - No.
    - I got it! It's "F"!
    How many questions did I need? 3. Could I solve it with fewer - 2 questions? Is there a smarter way?
    21 replies | 486 view(s)
  • hexagone's Avatar
    28th February 2021, 21:31
    You should look at the dict encoding in https://github.com/mathieuchartier/mcm. It dynamically creates a dictionary of ranked words.
    mcm_sk64.exe -filter=dict -store enwik8
    Compressing to enwik8.mcm mode=store mem=6
    Analyzing 97656KB, 46591KB/s
    text : 1(95.3674MB)
    Analyzing took 2.098s
    Compressed metadata 14 -> 18
    Compressing text stream size=100,000,000
    Constructed dict words=32+11904+33948=45884 save=6956257+30898956+3574176=41429389 extra=0 time=0.064s
    Dictionary words=45884 size=376.186KB
    97656KB -> 59990KB 74774KB/s ratio: 0.61431
    Compressed 100,000,000 -> 61,430,639 in 1.38400s
    Compressed 100,000,000 -> 61,430,671 in 3.544s bpc=4.914
    Avg rate: 35.440 ns/B
    13 replies | 579 view(s)
  • Gotty's Avatar
    28th February 2021, 21:25
    I don't feel that I'm stuck. I'm posting improvements quite regularly. Why would you say that we are stuck? Data compression is a technical challenge - not a philosophical one. So we address the issue from a technical point of view. If you think there is a better approach... show us. But it should work ;-) Your issue is that you would like to understand why compressing a 2-digit file is worse than simply converting it to binary. This is what you asked. This is what I explained. Regarding what the issue is - I think we are on the same page.
    No, it's not. The 33,421 bytes could be compressed to 4-5K, so it's clearly not random. The 4096 bytes are random. From a (general) compression point of view these two files are totally different. In your head (and in my head) the binary file is equivalent to the 2-digit file (and that's absolutely true from our viewpoint), but they are not equal for the compression engines. Please read my explanation above again. It looks like my explanation didn't get through. I tried my best to make it clear - it looks like I was not very successful. Please read my posts again and tell me where it is difficult to understand. I'm ready to try to explain it better.
    No, they will never beat conversion. The entropy is 4096 bytes in each of your cases. That's how low you can go, no matter what. Regarding their information content, they are equal - you are right! But "technically" they are different: the 2-digit file needs to be actually compressed to get to (nearly) 4K, while the 4K file is already 4K - nothing to do from a compression point of view. You cannot beat 4K. That's the limit.
    What do you mean by "conventional excuses"? Is it that you hear similar explanations again and again that random content is not compressible? Because it's not. You experienced it yourself. If you read most of the threads in the "Random Compression" subforum, you'll notice that most people try converting some random content into different formats, hoping that it will become compressible. Like yourself. It's the most natural approach, that's true. So everyone is doing it. But everyone fails. Do you think that there is some law behind it? There is. You still have the same information content (4K) in any format you convert the file to. If you convert it to any format that is actually compressible (like the 2-digit format), you just give file compressors a hard time, because they need to do actual compression to get down to near 4K. What they see is not a random file - the 2-digit format is not random.
    You mean 256 symbols and 2 symbols (not digits). Your intuition is correct: it's easier to work with 2 symbols than with 256 symbols. This is how the paq* family works. But it does not mean that a 2-symbol file is more compressible than a 256-symbol file. From an information content point of view they are equal. It's surprising, I know. In my next post I'll give you an example; maybe it will help you grasp what information content is.
    21 replies | 486 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:19
    I'm just curious how much Silesia can be compressed using the new mixer context set in each predictor...
    228 replies | 23549 view(s)
  • suryakandau@yahoo.co.id's Avatar
    28th February 2021, 21:09
    The base of paq8sk is paq8pxd. @darek, could you test it on Silesia using the same parameters as when you test paq8pxd?
    228 replies | 23549 view(s)
  • xinix's Avatar
    28th February 2021, 20:24
    It seems compression is enabled now? I would like to test only the preprocessor.
    enwik8 lzma - 23.6 MB (24,750,737 bytes)
    enwik8.bwd lzma - 34.7 MB (36,438,877 bytes)
    ___
    enwik8 PAQ fp8 - 18.6 MB (19,566,727 bytes)
    enwik8.bwd PAQ fp8 - 29.9 MB (31,357,466 bytes)
    13 replies | 579 view(s)
  • xinix's Avatar
    28th February 2021, 19:37
    I tested this on enwik8: 95.3 MB (100,000,000 bytes). Processing took 5 hours on an AMD Ryzen 5 3600X. Result: enwik8.bwd, 83.7 MB (87,829,234 bytes).
    13 replies | 579 view(s)
  • Darek's Avatar
    28th February 2021, 18:46
    Darek replied to a thread Paq8sk in Data Compression
    @suryakandau - which version is the base of paq8sk48 - paq8px or paq8pxd? OK, I know now: paq8pxd is the base.
    228 replies | 23549 view(s)
  • Trench's Avatar
    28th February 2021, 18:26
    4096 in its converted state, which yes, I agree with. But the unconverted 33,420-byte file is also a mess, and compression fails to beat conversion. It does get much better results without the newlines. ;) But compression still loses to conversion from what I see, which is the main issue of this post. From 31.9 KB it compressed to 4.29 with bzip2, 4.11 with LZMA2, 4.39 with p12, versus 4.00 converted. I assume better results will show when the file is much bigger, since metadata is stored by the compression programs, but the issue is: can it beat conversion? Maybe not. But it should be technically equal. So overall all your comments are right, but again that's not the issue. Can you beat 4.0 KB in theory, without conventional excuses? You cannot compress 256 symbols as easily as 2 symbols. Sure, 256 is faster, but 2 is more orderly. Take food for example: try chewing an entire apple at once versus a fraction of it; the person with the fraction will do better. The topic was not about compressing files in binary format but binary digits. Everyone in this forum thinks too technically, which seems to be stuck in the Chinese finger trap, which is not helping matters and is progressing nowhere fast. But if everyone feels I am wrong... OK. Thanks again.
    21 replies | 486 view(s)
  • Gotty's Avatar
    28th February 2021, 17:09
    I'm not sure what you mean. You don't like the word "Random" when used for incompressible data?
    21 replies | 486 view(s)
  • mitiko's Avatar
    28th February 2021, 16:52
    Yes, it's a transform. The new ranking function is based on it being followed by entropy coding.
    13 replies | 579 view(s)
  • Gotty's Avatar
    28th February 2021, 16:21
    Additional explanation: unfortunately most compressors will try finding longer patterns in the ASCII files. And indeed there are patterns, and they seem to be useful. They think it's a text file and that looking for string matches is a good strategy. Unfortunately it's not. The best strategy would be to simply convert the file to binary, but examining that possibility is not programmed into them. So they go for the string matches... Why is looking for patterns not a good strategy in this case? (A simplified example follows.) Imagine that you are a compressor and you are given the binary file (4096 bytes). There are *very* short matches (max. 2 bytes only) and you'll see that these short matches won't help compressing the file. So you won't use them. Good. Or you can use them, but it still won't help compression. Now you are given the 2-digit ASCII file. And you'll see that you have, for example, quite a lot of 8-character matches (since the bits are bytes now). Oh, that's very good - you say. Instead of storing those 8 bytes I will just encode the match position and length and I'm good - encoding the position and length is cheaper (let's say it's 2 bytes). So you win 6 bytes. And this is where you were led astray. In reality you encoded 1 byte of information into 2 bytes. That's a huge loss. And you thought you did well... So an additional problem is that when seeing the ASCII files many compressors are led astray. They think finding string matches is a good strategy. They will not find, or not easily find, the "optimal" strategy (which is an order-0 model with equiprobable symbols). (A small entropy calculation illustrating this is below.)
    21 replies | 486 view(s)
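A small sketch of the same point in code: it computes the order-0 entropy of 4096 random bytes and of the same data expanded to 32768 ASCII '0'/'1' characters. Both come out close to 4096 bytes of information (the byte version a touch lower only because of the finite-sample estimate); the expanded file simply has more compression work to do to get back there. The use of rand() is just for illustration.

```c
/* Order-0 entropy of 4096 random bytes vs. the same data written out as
   32768 ASCII '0'/'1' characters: the information content is the same. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double entropy_bytes(const unsigned char *buf, int n) {
    double count[256] = { 0 }, h = 0;
    for (int i = 0; i < n; i++) count[buf[i]]++;
    for (int s = 0; s < 256; s++)
        if (count[s] > 0) h -= count[s] / n * log2(count[s] / n);
    return h * n / 8.0;                             /* total content in bytes */
}

int main(void) {
    static unsigned char bin[4096], ascii[4096 * 8];
    for (int i = 0; i < 4096; i++) bin[i] = (unsigned char)(rand() & 0xFF);
    for (int i = 0; i < 4096 * 8; i++)              /* expand each bit to '0'/'1' */
        ascii[i] = '0' + ((bin[i / 8] >> (7 - i % 8)) & 1);

    printf("binary file:  %.1f bytes of information\n", entropy_bytes(bin, 4096));
    printf("2-digit file: %.1f bytes of information\n", entropy_bytes(ascii, 4096 * 8));
    return 0;
}
```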