Activity Stream

  • Lucas's Avatar
    Today, 21:11
    I know sais isn't a new algorithm, but being able to execute a majority of it out-of-order is an impressive optimization nonetheless.
    3 replies | 174 view(s)
  • Gribok's Avatar
    Today, 20:39
    libsais is not based on net new ideas (we are still using the 10+ year-old SA-IS algorithm). I think what has changed is the hardware profile. CPU frequencies have stalled, but CPU caches and RAM keep getting bigger and faster. So something which was not practical 10 years ago becomes practical now. DDR5 is also expected to land later this year, so I think it could get even better.
    3 replies | 174 view(s)
  • Lucas's Avatar
    Today, 19:23
    It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field. Brilliant work! This certainly seems like it will become a new standard.
    3 replies | 174 view(s)
  • Bulat Ziganshin's Avatar
    Today, 18:07
    In 2017, I managed to implement the CELS framework and wrote quite long documentation for it, in two parts - the first one for codec developers, and another for application developers leveraging CELS to access all those codecs. But then the work stalled, and it remained unfinished. Now I have continued my work on this framework. Since the initial design is pretty well documented, I invite you to read the docs and give your opinions on the design. To quickly summarize the idea behind it: it provides to external (DLL) codecs the same API as the one provided to codecs inside FreeArc. It's more powerful and much easier to use than the 7-zip codec API, so I hope that other archivers will employ it, and developers of compression libraries (or 3rd-party developers) can make these libraries available for all these archivers by implementing a pretty simple API - instead of spending precious time on implementing a CLI time and again. Repository is https://github.com/Bulat-Ziganshin/CELS
    13 replies | 5079 view(s)
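    To make the "one simple API instead of a CLI per codec" idea concrete, here is a purely hypothetical sketch of a single-entry-point codec interface. It is NOT the actual CELS API (see the repository above for the real design); the struct, the function name and the toy "store" codec are made up for illustration only.

        #include <cstdint>
        #include <cstdio>
        #include <cstring>

        // Hypothetical illustration only - not the real CELS interface.
        // Idea: a codec (normally living in a DLL) exposes one entry point driven by a
        // small request struct, so any archiver speaking this convention can use it.
        struct CodecRequest {
            int operation;           // 0 = compress, 1 = decompress
            const uint8_t* in;       // input buffer
            size_t in_size;
            uint8_t* out;            // caller-provided output buffer
            size_t out_capacity;
            size_t out_size;         // filled in by the codec
        };

        // Toy "store" codec: just copies bytes, to show the calling convention.
        extern "C" int codec_entry(CodecRequest* req) {
            if (req->in_size > req->out_capacity) return -1;   // not enough room
            std::memcpy(req->out, req->in, req->in_size);
            req->out_size = req->in_size;
            return 0;                                          // success
        }

        int main() {
            const uint8_t data[] = "hello";
            uint8_t buf[16];
            CodecRequest req{0, data, sizeof(data), buf, sizeof(buf), 0};
            std::printf("codec_entry -> %d, out_size = %zu\n", codec_entry(&req), req.out_size);
        }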
  • Gribok's Avatar
    Today, 07:31
    libsais is my new library for fast (see Benchmarks on GitHub) linear-time suffix array and Burrows-Wheeler transform construction based on induced sorting (the same algorithm as in sais-lite by Yuta Mori). The algorithm runs in linear time (and outperforms divsufsort) using typically only ~12KB of extra memory (with 2n bytes as the absolute worst-case extra working space). Source code and benchmarks are available at: https://github.com/IlyaGrebnov/libsais
    3 replies | 174 view(s)
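    For readers unfamiliar with what the library computes: a suffix array is simply the lexicographic order of all suffix start positions. The naive O(n^2 log n) construction below is only a reference illustration of that output (not libsais code); SA-IS and libsais produce the same array in linear time via induced sorting.

        #include <algorithm>
        #include <cstdio>
        #include <numeric>
        #include <string>
        #include <vector>

        // Naive reference suffix array: sort suffix start positions lexicographically.
        // Illustration of the output only - SA-IS/libsais build this in O(n) time.
        std::vector<int> naive_suffix_array(const std::string& s) {
            std::vector<int> sa(s.size());
            std::iota(sa.begin(), sa.end(), 0);                 // 0, 1, ..., n-1
            std::sort(sa.begin(), sa.end(), [&](int a, int b) {
                return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
            });
            return sa;
        }

        int main() {
            for (int p : naive_suffix_array("banana"))
                std::printf("%d ", p);                          // prints: 5 3 1 0 4 2
            std::printf("\n");
        }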
  • mpais's Avatar
    Yesterday, 19:47
    Sure, but no one helped either. We've discussed this before: writing a full-blown archiver as ambitious as that, in my free time, is delusional. There's a reason most of us here write single-file compressors. An archiver needs to implement a lot more functionality (adding content to an existing archive, deleting content from an archive, partial extraction, etc.) and we wanted to do it in an OS- and ISA-independent way, i.e., it should just work out of the box on Windows, Linux, MacOS, etc., and on x86, ARM, MIPS, RISC-V, etc. That is way above my skill level. Anyway, that is off-topic and I don't want to derail the thread. Gotty is doing fine work on paq8gen; the paq8 legacy is in good hands.
    67 replies | 3598 view(s)
  • Shelwien's Avatar
    Yesterday, 18:42
    C# is troublesome, so can you provide some test results? Using some popular dataset like http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia ? Also, maybe move this to a new thread? This thread is about GLZA really.
    881 replies | 457086 view(s)
  • mitiko's Avatar
    Yesterday, 18:04
    I've been working on an optimal dictionary algorithm for a bit now, and I didn't even know this thread existed, so I'm happy to share some knowledge. The easiest way to search for strings and assign rankings is to construct the suffix array. This way all possible words to choose from are sorted and appear adjacently in the suffix array. (It's also not that costly to compute the SA - you can even optimize prefix-doubling implementations by stopping early and constructing an array of the suffixes only up to a size m - the max word size.) I've also done the math for the exact cost savings of choosing a word (given an entropy coder is used after the dictionary transform).
    A cool direction to look into is constructing a tree of possibilities when choosing a word - so that your algorithm isn't greedy about choosing the best-ranked word. Sometimes you should choose a word with a smaller ranking and see the improvements in the next iteration.
    There's also a cool idea to choose a word which is a pattern - for example "*ing". Now when the encoder has to parse "doing" it will emit the pattern as a word and then fill in the remaining characters, in this case "do". Patterns allow you to match more words, but at the cost of having more information to encode at each location. The hope is that the extracted bonus characters form a different context and can be encoded separately from the rest of the file with better models.
    My code: https://github.com/Mitiko/BWDPerf (The documentation is a bit outdated now, so don't look too much into whatever pretentious stuff I've written)
    881 replies | 457086 view(s)
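    A minimal sketch of the "cost savings of choosing a word" idea from the post above, under a deliberately simplified cost model (a fixed per-occurrence code length and a flat dictionary-entry overhead). BWDPerf's actual ranking is entropy-coder aware, so treat the constants and numbers here as illustrative only.

        #include <cstdio>
        #include <string>
        #include <vector>

        // Simplified savings model for picking dictionary words:
        //   savings = count * length              (bytes removed from the text)
        //           - count * code_len            (bytes spent on references)
        //           - (length + entry_overhead)   (cost of storing the word itself)
        struct Candidate { std::string word; long count; };

        long savings(const Candidate& c, long code_len = 2, long entry_overhead = 1) {
            return c.count * (long)c.word.size()
                 - c.count * code_len
                 - ((long)c.word.size() + entry_overhead);
        }

        int main() {
            std::vector<Candidate> cands = { {"ing", 5000}, {"the ", 9000}, {"q", 120} };
            for (const auto& c : cands)
                std::printf("%-6s count=%-6ld savings=%ld\n",
                            c.word.c_str(), c.count, savings(c));
        }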
  • spaceship9876's Avatar
    Yesterday, 15:40
    I'm not sure if this is relevant or not: https://jpeg.org/items/20201226_jpeg_dna_2nd_workshop_announcement.html
    67 replies | 3598 view(s)
  • Mega's Avatar
    Yesterday, 15:06
    Nikola Tesla worked for the Rothschilds with the understanding that they owned most of his work upon contingency for funding it. While Apple may have been screwed over by Microsoft, surely, Xerox was not screwed over by either and had no patents nor expectations on what they made before Apple or Microsoft. Facebook was not stolen, but planned. It is the current generation of CIA project Lifelock, which was immediately transitioned to "Facebook" in 2004. Only the initial look of the front end and a few basic functions were in question by the Winklevoss twins. The rest was and still partly is classified as Lifelock. It would seem people's understanding of events is rather skewed, aka "mainstreamed", to not know this?
    1 replies | 180 view(s)
  • hexagone's Avatar
    Yesterday, 04:33
    Here: https://encode.su/threads/2083-Kanzi-Java-Go-and-C-compressors?highlight=kanzi To be clear, it is a compressor, not an archiver.
    29 replies | 1378 view(s)
  • Hacker's Avatar
    Yesterday, 03:30
    data man, No executable though?
    29 replies | 1378 view(s)
  • Hacker's Avatar
    Yesterday, 03:26
    Gonzalo, Thanks, I'll give it a chance. Let me rephrase it - until they are more widely supported (e.g. in image viewers or in cameras) they are nice for experimenting, but we should have had a JPG successor for at least 20 years now and we still don't. Apple did something with HEIC but that's it, unfortunately. OK, sounds good enough to give it a try as well. Any thoughts about future compatibility? Will I be able to open the archives in ten years? Are the involved programs standalone downloadable exes or are they some sort of internal part of Windows 10?
    29 replies | 1378 view(s)
  • Hacker's Avatar
    Yesterday, 03:21
    fcorbelli, Ah, true, I forgot about the case when one already possesses the necessary knowledge.
    29 replies | 1378 view(s)
  • innar's Avatar
    Yesterday, 02:10
    mpais, I have validated the submission and updated the web page. Thanks again! My testing computer was probably a bit slower: 1036 seconds. Nothing compared to the ongoing test with cmix. I think lossy compression should not be acceptable, especially in a case where lossless compression outperforms it! Those correlations you found, and which you think Gotty's gonna catch as well, resulted in a shorter representation - they have to be meaningful. I don't have an answer yet for how to translate that back to domain knowledge, but I will continue thinking about it and talking to people.
    75 replies | 4905 view(s)
  • Gonzalo's Avatar
    22nd February 2021, 21:24
    You want to work on Fairytale? No-one is stopping you, to be fair ;) Maybe there aren't a lot of people on board just yet but you know that can change in a heartbeat. I know I for one would test the hell out of the thing. Maybe I'm not a developer, but that hasn't stopped me from collaborating on lots of projects. Just look at all the bugs I've found in precomp, for instance...
    67 replies | 3598 view(s)
  • mpais's Avatar
    22nd February 2021, 20:35
    As I said in the challenge thread, I expect paq8gen to get down to 57x.xxx, at least, so no pressure :D The thing is, I don't think anyone in that field would even use it. Maybe innar, Kirr and JamesB have a different opinion on that, but it seems people in that field just cook up their own flawed compressors (see the benchmark that Kirr put together, especially his comments on each compressor) as long as it suits their particular workflow. For archiving and transmitting the large datasets they use, paq8gen is just too slow. And judging by innar's comments, it seems the field is not convinced of the use of compression algorithms as a way to find similarities between sequences. That's why I don't see the point in continuing working on LILY for this either. I'd rather work on something fun and useful, work is boring enough as it is, which is why I proposed fairytale in the first place, it would have kept me busy writing open-source versions of LILY, EMMA, LEA, etc. Alas, that ship has sailed, so maybe it's time for a new hobby.
    67 replies | 3598 view(s)
  • mpais's Avatar
    22nd February 2021, 20:03
    Well, I wasn't really paying much attention, so I just used the time from the last run (638s) to see how much slower the decompressor was, but I was browsing and doing other stuff at the same time. I've now run a "clean" test and it seems I grossly underestimated the difference: this run took just 528s (~2.4MB/s), so the size-optimized decompressor is a whopping 57% slower. But hey, it saves a few KB. As for GeCo3, the result listed is on the original fasta file. That file contains 22.020.510 "new line" bytes (ASCII #10), so even if it was stripping all of them and all the sequence names, it's still coming up short by about 14MB of what must undoubtedly be sequence data. Now, I'm far from an expert, but how can you have "lossy" compression on the actual sequence data and deem that acceptable? It's not like these are images, where the appreciation of the importance of the lost details is subjective to every viewer.
    75 replies | 4905 view(s)
  • Gonzalo's Avatar
    22nd February 2021, 18:32
    Wow, I had completely forgotten about that one. Not exactly what I had in mind, but the idea is there. As a side effect, it allows for chunk reordering and that gives a surprising boost in ratio, even for smaller datasets. Of course it's not production-ready. More like a proof of concept. The binary is extremely slow, even for the copy operations, like restoring the original file (just a concatenation here; tested). The native version compiled with clang crashes very often, and gcc won't even compile it. Anyway, I guess a newer version of paq* could be used to make something like this. The ideal situation would be to start working on Fairytale, which was supposed to do all this and more. But personally, I'm finishing a bootcamp on Full Stack web development (JavaScript), so maybe after that I could finally get myself to learn some C++. It's not that different, fortunately. I'll just have to deal with data types and memory management, I guess.
    2 replies | 235 view(s)
  • data man's Avatar
    22nd February 2021, 03:27
    ​https://github.com/flanglet/kanzi-cpp is very active.
    29 replies | 1378 view(s)
  • Shelwien's Avatar
    22nd February 2021, 03:02
    https://encode.su/threads/1971-Detect-And-Segment-TAR-By-Headers?p=38810&viewfull=1#post38810 ? Nanozip and RZ (and some others of Christian's codecs) seem to have that kind of detection, but it's not open-source.
    2 replies | 235 view(s)
  • Shelwien's Avatar
    22nd February 2021, 02:53
    The difference between a bytewise and a bitwise model is that after the occurrence of some symbol, a bytewise model increases the probability of only this symbol while decreasing all others, whereas a bitwise model also increases the probability of some groups of other symbols with a matching prefix. So we can view the alphabet permutation as an approximation of a non-binary APM. This would mean that optimal permutations can easily be different not only for different files, but even for different contexts. (Which can even be practically implemented by building symbol probability distributions from binary counters separately for each context, then converting these distributions to a common binary tree for mixing and coding.) Btw, alphabet permutation is not limited to permutations of 0..3 even for target files with only 4 symbols. A code like {0001,0010,0100,1000} would have a different behavior than any of the 0..3 permutations.
    67 replies | 3598 view(s)
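    A minimal sketch of the prefix-sharing effect described above, for a 4-symbol alphabet coded two bits at a time through a tiny binary tree of adaptive probabilities. This is an illustration of the idea only, not paq8gen's (or any paq) actual model; the update rule and learning rate are made up for demonstration.

        #include <cstdio>

        // 4-symbol alphabet coded bitwise through a tiny binary tree of adaptive
        // probabilities: node 0 predicts the first bit, nodes 1..2 predict the second
        // bit given the first. The root node is shared by all symbols with the same
        // first bit, so an update for one symbol moves the whole prefix group.
        struct BitTreeModel {
            double p1[3] = {0.5, 0.5, 0.5};               // P(bit = 1) at each node
            void update(int sym, double rate = 0.05) {
                int b0 = (sym >> 1) & 1, b1 = sym & 1;
                p1[0]      += rate * (b0 - p1[0]);        // shared root counter
                p1[1 + b0] += rate * (b1 - p1[1 + b0]);   // per-prefix counter
            }
            double prob(int sym) const {                  // P(sym) = P(bit0) * P(bit1 | bit0)
                int b0 = (sym >> 1) & 1, b1 = sym & 1;
                return (b0 ? p1[0] : 1 - p1[0]) * (b1 ? p1[1 + b0] : 1 - p1[1 + b0]);
            }
        };

        int main() {
            BitTreeModel m;
            m.update(2);                                  // observe symbol '10'
            // Symbols 0 and 1 drop to ~0.238 each, while the prefix group {2,3}
            // now holds ~0.525 of the total mass; symbol 3 barely moved because it
            // shares the first bit with the observed symbol. A bytewise model would
            // have pushed all three other symbols down equally.
            for (int s = 0; s < 4; ++s) std::printf("P(%d) = %.3f\n", s, m.prob(s));
        }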
  • fcorbelli's Avatar
    22nd February 2021, 00:51
    I will try
    82 replies | 4757 view(s)
  • Gotty's Avatar
    21st February 2021, 23:33
    Back to the ACGT-alphabet-reordering. These are the results from the small DNA corpus: There is no clear winner which reordering is the "general best". There are some green areas though. Meanwhile I found the optimal alphabet-transform for the sars-cov-2 challenge: it's GCAT (meaning: 'G'->0; 'C'->1; 'A'->2; 'T'->3). It's followed closely by GCTA, CGAT and CGTA. Hmm... So G likes to be together with C, and A with T. Looks like we've got something - see: https://en.wikipedia.org/wiki/Complementarity_(molecular_biology) But it does not really match the results from the small DNA corpus. The difference in compression is significant. Who's got some insight?
    67 replies | 3598 view(s)
  • Gotty's Avatar
    21st February 2021, 23:19
    Oh, I thought it could be more. OK, I'm glad then!
    67 replies | 3598 view(s)
  • Gotty's Avatar
    21st February 2021, 23:17
    It's funny indeed. LILY and paq8gen are always head to head. So following that pattern, paq8gen must now go under 600K. OK. Don't know how on earth it's gonna happen yet. Feel free to add your magic there. ;-) I'm not that serious about squeezing the last bits out of the exe. (You can see it's not my priority.) Pushing compression further is what I'm into. When enhancing paq8gen I'm also checking my changes against two other corpora: what I call the small DNA corpus (https://encode.su/threads/2105-DNA-Corpus) and the large one: https://tinyurl.com/DNAcorpus (referenced from GeCo3: https://github.com/cobilab/geco3). So for me it's a threefold challenge. Paq8gen is far from a proper genomic sequence compressor yet: it should support a line-unwrapping transform and reorder sequence names and sequences automatically, for example. Or look for palindromes... Or anything I haven't even heard of ;-) If you've got ideas and have some time - feel free to join and let's make it a good entropy estimator in the genomic-compression field.
    67 replies | 3598 view(s)
  • suryakandau@yahoo.co.id's Avatar
    21st February 2021, 23:11
    I don't have a personal issue with mpais, I just see that the result of paq8gen is better than LILY's.
    67 replies | 3598 view(s)
  • Gotty's Avatar
    21st February 2021, 22:56
    Thank you, Surya, it's very kind of you! You have to know that paq8gen is a community effort; it's not "mine" - I'm standing on the shoulders of those who have laid the foundation with hard, dedicated work or supported paq8* with ideas or testing. Please don't root against LILY. Let's be fair with one another. mpais did an excellent job with LILY. Now under 600K. It's clear that you are rooting against LILY for personal reasons. You have to know that paq8px and also paq8gen would not be as strong as they are today without mpais. The models and methods that help paq8gen to be "good" all come from paq8px, and guess who helped tremendously in enhancing those models? It's mpais. Fun fact: the strength of paq8gen comes mostly from the MatchModel (which is based on the MatchModel in EMMA). So any success of paq8gen is also a success for mpais.
    67 replies | 3598 view(s)
  • Gonzalo's Avatar
    21st February 2021, 22:32
    Fazip uses TTA but it doesn't work for compound content, like a .wav somewhere inside a TAR archive. Precomp was supposed to include wavpack like a year ago, but nothing has happened yet and it doesn't seem like it's going to. So, do you know of something like this, that detects and compresses audio chunks and makes a copy of the rest of the data? Thanks in advance!
    2 replies | 235 view(s)
  • Shelwien's Avatar
    21st February 2021, 21:50
    Single-file zstd -1
    Compiling: g++ -O3 czstd.cpp -o czstd
    Usage: czstd c/d input output
    82 replies | 4757 view(s)
  • hexagone's Avatar
    21st February 2021, 21:47
    It is comparing apples to oranges and it is just too easy for the reader to miss the note. At first glance it looks like this compressor is better than other (lossless) compressors. You should really have a dedicated table for lossy compressors.
    75 replies | 4905 view(s)
  • spaceship9876's Avatar
    21st February 2021, 18:58
    When will you be releasing v0.12?
    82 replies | 4757 view(s)
  • innar's Avatar
    21st February 2021, 18:55
    Thank you! Just out of curiosity - what was the compression time?
    * The Cilibrasi/Vitanyi paper was added (later) for context and to build an argument that compression is useful in this case. AFAIK their paper is still in the review process for a journal, indicating that it is not obvious in that community that compression is useful for more purposes than 'saving storage'.
    * GeCo3 - I noticed it not being lossless only later, since I did not get that initially from the original paper. My bad and my inconsistency. But I thought there is value in not just deleting the row, but marking it somehow. Especially since the author was kind enough to tune and play with the parameters to achieve the best result.
    When testing with other sequence compressors, I keep realizing the same thing Kirr was indicating earlier in this thread - most of the algorithms in sequence compression are 'broken' in some sense, mostly lossy. When reaching out to authors, I have gotten the answer that lossless compression-decompression has not been the goal, since the goal is to compress sequences (?!). If your lossless compression algorithm is better than another, lossy algorithm, I think that makes it especially good.
    75 replies | 4905 view(s)
  • mpais's Avatar
    21st February 2021, 18:37
    From SARS-CoV-2 Coronavirus Data Compression Benchmark thread: Reordering even helped LZMA get close to the paq8l result on the original fasta file. Also, it's funny seeing how you find and tackle the same things as I do, like the low precision problem (you really went scorched earth on that one, 28 bits of precision? :D). The way you solved it was my plan B, in case the way I solved it in LILY wouldn't work. If you're serious about the sars-cov-2 benchmark, why not just create a new branch of paq8gen on the repo dedicated to it, instead of polluting the main version with a bunch of #defines?
    67 replies | 3598 view(s)
  • mpais's Avatar
    21st February 2021, 18:20
    Gotty seems to have made some tests and his data seems to confirm precision problems in cmix. I honestly haven't tried anything with cmix because it's simply unusable, it's far too slow. It shouldn't be a surprise though that it doesn't do very well on what seems like "textual data", since it was continuously tuned over the years for the LTCB. Comparing it to paq8gen would make more sense, since they're much closer in architecture, and paq8gen is progressing nicely. It's quite impressive that it's been keeping up with LILY without any special model. As for LILY, it's just about 1000 lines of badly hacked-together code, much simpler than either of those 2, as implied by the difference in speed. And since it's modelling exactly the correlations I'm choosing, I can tell exactly what they are. From the moment I decided to have a go at this, I've been focusing on a theory (call it a hunch, an intuition, if you will) and have been trying to see if I could model it. Now that I've finally found what I was doing wrong, it was quite simple to get a big gain (2 lines of code gave 55.000 bytes of gain). If anything, I'd be tempted to just start over, because I think I can get pretty much the same result with half the complexity. I must say however that it's frustrating, not being an expert, to not know if the correlations I found, which accurately predict the differences in the sequences, have any actual meaning in the grand scheme of things. They seem to validate what I thought, but I can't shake the feeling there's a higher-order structure at play here that I'm missing. I'm sure Gotty will find them too, and paq8gen will probably go at least as low as 57x.xxx bytes.
    75 replies | 4905 view(s)
  • mpais's Avatar
    21st February 2021, 18:12
    Since paq8gen is about to overtake LILY, it's time for a new entry (this time for the unwrapped fasta). The total is 613.466 bytes (decompressor "lily.zip": 13.499 bytes, payload "1": 599.967 bytes). This time the decompressor is compiled with MSVC 2019; it produces a very slightly smaller executable at the cost of about another 13.6% cut in performance, and getting the last few KB to go under 600.000 bytes accounts for roughly 30% of the relative performance loss. The decompressor is now also 30% slower than the compressor (compiled with gcc -O3). Decompression time is now 830 s. @innar This entry marks a more than 50% reduction in size relative to the paq8l baseline (1.238.330 bytes) from your paper, and that was for the 2bit version :D It will probably also be my last entry, because I honestly don't see the point in this. I skimmed over the Cilibrasi/Vitanyi paper back then and the approach used seemed valid, but I fail to see how the current format of the challenge will help. It's my understanding the methodology used is to take N previously known sequences (of Coronaviridae, in that paper), compress each one individually (to size A) and then compress the concatenation of each of them with the new sequence (to size B), and hence the lowest B/A ratio will allow you to identify the most "similar" pair of sequences, without using specific sequence alignment tools. However, for the challenge, the task is simply to take a concatenation of N sequences of SARS-COV-2 and compress them. Not for the purpose of archiving or speeding up transmission of such data (for that NAF seems much, much better suited), but in the hope that it may help on similar tasks as that described above. I posit that the task should mirror that of the Cilibrasi/Vitanyi paper, because the challenges there are much different from the one here. I also saw that you included results from GeCo3 on the original fasta file, but then you note that decompression isn't lossless and over 36MB of the original file go missing. I honestly don't see the point in including such results.
    75 replies | 4905 view(s)
  • fcorbelli's Avatar
    21st February 2021, 14:08
    In fact, no, or at least not always. Not if you need something way easier to compile. Stripping down one of those 'monsters' requires a lot of time and effort. Mini-LZO works rather well with about 150K on a single thread. Of course the ratio is not exceptional, but it doesn't really matter for VMs. Good (I hope; still developing) for a 24h background compression task (e.g. on a ghetto VMware snapshot). Therefore another question is: where will the software run?
    82 replies | 4757 view(s)
  • suryakandau@yahoo.co.id's Avatar
    21st February 2021, 13:42
    Wow, that is an amazing result - now it can beat LILY. Congrats, Gotty!!!
    67 replies | 3598 view(s)
  • Gotty's Avatar
    21st February 2021, 12:48
    coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
    paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
    paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
    paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
    paq8gen_v4.exe -12   696'424 bytes 23969.18 sec
    paq8gen_v5.exe -12   660'417 bytes 26623.40 sec
    Update:
    paq8gen_v5.exe -12   659'727 bytes <- arithmetic encoder precision increased from 20 to 28 bits
    paq8gen_v5.exe -12   646'466 bytes <- ^ + alphabet-transformation applied (only to the sequences) - see above post
    Note: The codebase is up to date with the above changes, but the alphabet-transformation is not run automatically (as it is sars-cov2-challenge-specific): you have to compile with the transform included and run it manually (before compression, and after decompression to get back the original file).
    67 replies | 3598 view(s)
  • Shelwien's Avatar
    21st February 2021, 12:28
    @fcorbelli: please test zstd: https://github.com/facebook/zstd There's currently no point in accepting a codec with worse compression than that. @warthog: The actual reason for the relative LZMA slowness on out-of-order CPUs is the serial dependency. It's possible to significantly improve LZMA decoding speed if substream interleaving is introduced... which adds some redundancy and breaks format compatibility though. Codecs like that do already exist: LZNA, LOLZ, NLZM etc. LZHAM uses huffman coding and thus should be compared to zstd and brotli instead.
    82 replies | 4757 view(s)
  • Shelwien's Avatar
    21st February 2021, 12:13
    Normally arithmetic coding is done without long division - the range is just occasionally rounded down instead. Redundancy like 1 bit per megabyte of compressed data is simply not relevant. But you can do full-precision AC with long arithmetic, yes.
    194 replies | 16450 view(s)
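    A minimal sketch of what "rounding the range down instead of dividing" looks like in a typical binary range decoder (LZMA-style layout): the split point needs only a shift and a multiply, and renormalization just shifts in one byte at a time. Constants and names here are illustrative, not taken from any specific codec, and a matching encoder is assumed, so the bits decoded in the demo are meaningless on their own.

        #include <cstdint>
        #include <cstdio>

        // Binary range decoder step, LZMA-style: no long division anywhere.
        // The range is effectively rounded down by the shift before multiplying with
        // the (11-bit here) probability - that rounding is the tiny redundancy
        // mentioned above, on the order of a bit per megabyte.
        struct RangeDecoderSketch {
            static const int PROB_BITS = 11;              // probability scale: [0, 2048)
            uint32_t range = 0xFFFFFFFFu, code = 0;
            const uint8_t* in = nullptr; const uint8_t* end = nullptr;

            uint8_t next() { return in < end ? *in++ : 0; }
            void init(const uint8_t* p, const uint8_t* e) {
                in = p; end = e;
                for (int i = 0; i < 4; ++i) code = (code << 8) | next();
            }
            int decode_bit(uint32_t p1) {                 // p1 = scaled P(bit == 1)
                uint32_t bound = (range >> PROB_BITS) * p1;
                int bit;
                if (code < bound) { bit = 1; range = bound; }
                else              { bit = 0; code -= bound; range -= bound; }
                while (range < (1u << 24))                // renormalize: shift in a byte
                    { range <<= 8; code = (code << 8) | next(); }
                return bit;
            }
        };

        int main() {
            const uint8_t dummy[8] = {0};                 // placeholder stream, no real encoder here
            RangeDecoderSketch rc; rc.init(dummy, dummy + 8);
            for (int i = 0; i < 4; ++i) std::printf("%d", rc.decode_bit(1024));
            std::printf("\n");
        }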
  • warthog's Avatar
    21st February 2021, 06:40
    Hello. Long-time lurker, first-time poster here. I've had a wishlist for a while now: LZHAM has close to LZMA(2) ratio and significantly faster decompression than LZMA(2), but very slow compression. FLZMA2, on the other hand, has LZMA(2) ratio and faster compression than LZMA(2), but being 7z/LZMA format compatible, it is stuck with slow LZMA(2) decompression. I imagine that if someone could fit FLZMA2 compression & speed into the faster-decompressing LZHAM format, that would be the best of the two worlds! (Maybe lose a bit of compression ratio in such a merge, but still... a class of its own, possibly Pareto-approaching too.) And such a combination may be the answer to the needs you seek.
    82 replies | 4757 view(s)
  • Self_Recursive_Data's Avatar
    21st February 2021, 00:03
    Will Long Division do the trick?
    194 replies | 16450 view(s)
  • fcorbelli's Avatar
    20th February 2021, 21:48
    As the compressor I suggest something like this (just a mock up for Windows, I am working on a vSphere port)
    82 replies | 4757 view(s)
  • Gotty's Avatar
    20th February 2021, 21:38
    Let's see: alphabet reordering.
    freq        char  old code  new code
    418420417   T     84        0
    388931944   A     65        1
    255577497   G     71        2
    239142742   C     67        3
    15186607    N     78        4
    18009       Y     89        5
    11622       K     75        6
    7865        R     82        7
    5764        W     87        8
    3505        M     77        9
    2381        S     83        11
    237         H     72        12
    144         D     68        13
    89          V     86        14
    59          B     66        15
                \n    10        10
    The above frequency information is extracted from the full sars-2 challenge file, ignoring the sequence names (i.e. it's only the sequences). We are lucky that there are exactly 16 distinct chars (including the newline).
                          with original  with new
                          alphabet       alphabet
    paq8gen_v5x -8 c6     8104           8056  (-0.59%)
    paq8gen_v5x -8 c7     12951          12770 (-1.39%)
    paq8gen_v5x -8 c8     73726          72318 (-1.91%)
    Where
    - c6, c7, c8 are the first 1M, 10M, 100M bytes of the sars-2 challenge file (sequence-only);
    - paq8gen_v5x is a modified version of paq8gen_v5: the arithmetic encoder precision is increased from 20 to 28 bits and line type detection is disabled.
    I didn't put much effort into finding an optimal reordering; nevertheless, the goal is fulfilled and the validity of your idea is confirmed. Thanks a lot! These preliminary results also show that the gain gets better and better as file size increases. So for the full sars-2 file it must be even greater.
    67 replies | 3598 view(s)
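    A minimal sketch of how a byte-remapping transform like the one tabulated above could be applied before compression and inverted after decompression. Only the T/A/G/C rows are filled in here; the remaining symbols would be added the same way. Note the remap is only reversible because, as noted above, the sequence data contains exactly 16 distinct byte values and none of them are the raw bytes 0..15, so the new codes never collide with existing ones.

        #include <cstdint>
        #include <cstdio>
        #include <string>

        int main() {
            // Forward and inverse 256-entry lookup tables, identity by default.
            uint8_t fwd[256], inv[256];
            for (int i = 0; i < 256; ++i) fwd[i] = inv[i] = (uint8_t)i;

            // Only the four main rows of the table above; the other IUPAC codes and
            // '\n' (which keeps code 10) would be handled the same way.
            const uint8_t old_code[4] = { 'T', 'A', 'G', 'C' };
            for (int n = 0; n < 4; ++n) { fwd[old_code[n]] = (uint8_t)n; inv[n] = old_code[n]; }

            std::string seq = "TTAAGGCT", mapped = seq;
            for (char& c : mapped)   c = (char)fwd[(uint8_t)c];   // run before compression
            std::string restored = mapped;
            for (char& c : restored) c = (char)inv[(uint8_t)c];   // run after decompression
            std::printf("round trip ok: %s\n", restored == seq ? "yes" : "no");
        }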
  • suryakandau@yahoo.co.id's Avatar
    20th February 2021, 21:05
    Paq8sk46 - improved jpeg compression using the -s8 option
    f.jpg (darek corpus): Total 112038 bytes compressed to 80087 bytes. Time 26.14 sec, used 3285 MB (3444911171 bytes) of memory
    a10.jpg (maximum compression corpus): Total 842468 bytes compressed to 616176 bytes. Time 192.61 sec, used 3287 MB (3446746213 bytes) of memory
    Source code and binary are inside the package file.
    223 replies | 22478 view(s)
  • Trench's Avatar
    20th February 2021, 20:21
    I assume many here are into file compression for the greater good more than for money. But up to what ratio, as the poll shows. Also, to what extent would you be willing to go to screw over another if someone hired you to benefit from the idea? Kind of like how Xerox got screwed over by Apple, or Apple and others got screwed by MS, or the original creator of FB got screwed over, or the McDonald brothers got screwed over by McDonald's, or Nikola Tesla was screwed over by JP Morgan, etc. - like most big corporations. So what would make you different, it seems. Think about it for a few days before you answer.
    1 replies | 180 view(s)
  • Trench's Avatar
    20th February 2021, 19:31
    1. As I stated before, all compression is random, whether people want to believe it or not. It just looks non-random when dealing with it from another perspective. You can't find new ideas when stuck with the same methods and views.
    2. But I don't think a programmer's way of thinking can achieve it, since they ignore point 1. Just like how a painter cannot understand a programmer in how they perceive things. As the saying goes, jack of all trades, master of none. And if a person is a master in programming, it is hard to understand various other perspectives, since time and memory are limited.
    3. If one does not want to reveal the code, then one might as well be anonymous as well.
    4. If point 3 applies, then you have to have a dead man's switch, as stated before.
    5. I assume if 2 people win the top prize the money will be split.
    6. The top prize is not enough compared to what it's worth, since many companies in cloud storage are getting billions, yet society is not progressing but at a standstill. So it is a matter of profit & stagnation vs ethics & progress, which should be talked about, but most won't understand.
    I assume most here have ethics, judging from the comments they make, but maybe there is a limit when a higher price comes into play, as it seems with this prize. Agree or not, it's just another view.
    60 replies | 1451 view(s)
  • Gonzalo's Avatar
    20th February 2021, 16:38
    1) The only one that would be competitive is fast-lzma2. In theory it's got just a tiny bit worse ratio, but in practice, due to multi-threading and RAM constraints, it usually is actually stronger than lzma2 (take into consideration that flzma2 is actually lzma2 with a different match finder).
    2) Yup. But it works nonetheless. And it is especially good for already-compressed data. It can compress it way faster than 7z, and for what it's worth, a little bit stronger too.
    3) AFAIK all these libraries have at least one working command line exe.
    4) Yes, it is better. There are exceptions, of course; after all, they are in the same tier. But for me the best advantage is that my script reaches the same or better level of compression while doing it 3X faster. Without deduplication, the ratio is comparable. Here you have a test I made for precomp comparing its lzma codec with flzma2.
    Summary:
    Weighted average speedup for fxz: 195%
    Weighted average speedup for fxz -e: 168%
    Weighted average ratio gain for fxz: 2.03%
    Weighted average ratio gain for fxz -e: 2.20%
    Now, you don't want to get rid of the preprocessors. They are key to improving ratio, compression speed and sometimes even decompression speed. AND they can significantly reduce RAM usage. Especially srep can make memory consumption drop, but the combination of file sorting, file deduplication, the rep filter, the srep filter and sometimes lzp can make memory requirements for lzma way lower than they would otherwise be. That's because you're actually compressing a lot less information. For large datasets you can end up compressing only a quarter of the original size (real-life example: the "dztest" folder is 24% of its original size after dedupers. Only that enters the lzma stage).
    29 replies | 1378 view(s)
  • Shelwien's Avatar
    20th February 2021, 14:04
    > This tells me that reordering probably won't help.
    If it can make compression visibly worse, there's a chance that it can also improve it. For example, maybe a unary code for "ATCG" or some such.
    > did you construct the xlt files manually?
    With a heuristic optimizer based on a specific entropy coder, but it's slow.
    67 replies | 3598 view(s)
  • Gotty's Avatar
    20th February 2021, 13:57
    Yes, it is. Earlier I measured the entropy for every bit (in a byte). It showed me that it "knows" very consistently which is "easy" to predict and which is not. Reran it again:
    Bit  Entropy (in bytes)
    0    5.3     <- "easy"
    1    38.5    <- "easy"
    2    5.3     <- "easy"
    3    3696.6
    4    155.1   <- "easy"
    5    2234.1
    6    1775.5
    7    11.1    <- "easy"
    -----------
    (measured on c6 (1M bytes of sars-2 sequence-only data))
    So five of the bits are "very easy to predict"; 3 bits carry the entropy. This tells me that reordering probably won't help. Question: did you construct the xlt files manually?
    67 replies | 3598 view(s)
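    To make the "entropy per bit position, expressed in bytes" measurement concrete: a minimal order-0 sketch that, for each of the 8 bit positions, sums the ideal code length over a (toy) byte stream and divides by 8. Gotty's figures come from paq8gen's actual context-mixing predictions, and the bit-numbering convention (MSB-first below) is an assumption, so this only reproduces the shape of the measurement, not the numbers.

        #include <cmath>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // Order-0 coding cost of each bit position over a byte stream, in bytes:
        // sum of -log2(p(observed bit)) across the stream, divided by 8.
        int main() {
            std::vector<uint8_t> data = { 'A', 'C', 'G', 'T', 'A', 'A', 'C', 'G' }; // toy input
            for (int bit = 0; bit < 8; ++bit) {
                long ones = 0;
                for (uint8_t b : data) ones += (b >> (7 - bit)) & 1;   // bit 0 = MSB (assumed)
                double p1 = (ones + 0.5) / (data.size() + 1.0);        // smoothed P(bit = 1)
                double bits = 0;
                for (uint8_t b : data) {
                    int v = (b >> (7 - bit)) & 1;
                    bits += -std::log2(v ? p1 : 1.0 - p1);
                }
                std::printf("bit %d: %.2f bytes\n", bit, bits / 8.0);
            }
        }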
  • Shelwien's Avatar
    20th February 2021, 12:58
    Btw, model is still bitwise, right? Is there an effect from alphabet reordering? http://nishi.dreamhosters.com/u/bwt_reorder_v4b.rar http://ctxmodel.net/files/BWT_reorder_v2.rar for source
    67 replies | 3598 view(s)
  • kaitz's Avatar
    20th February 2021, 12:54
    If you change the AdaptiveMap update to use the full 32 bits for prediction, like in pxd, for the main StateMap in the normal model CM after 1MB of input - will it improve?
    67 replies | 3598 view(s)
  • Gotty's Avatar
    20th February 2021, 12:35
    And so here it is: the sars-2 challenge sequences-only file, first 10 MBs - the effect of changing the final learning rate on the last layer. The numbers indicate the compressed size of each chunk (the difference measured at the last byte minus the first byte of the chunk). Since the numbers show a very small fluctuation, I didn't actually measure the file size; I computed the entropy at the arithmetic encoder instead, so we can see the fractions, too. The winner is: learning rate 6.
    67 replies | 3598 view(s)
  • Shelwien's Avatar
    20th February 2021, 12:11
    https://stackoverflow.com/questions/8883567/how-to-implement-fast-bigint-division https://stackoverflow.com/questions/9257612/division-of-large-numbers
    194 replies | 16450 view(s)
  • Gotty's Avatar
    20th February 2021, 11:49
    In the meantime, these are the learning rate results for the small DNA corpus (https://encode.su/threads/2105-DNA-Corpus) It shows us that changing the learning rate on the last layer has a very, very small effect. It doesn't tell much, since the files are tiny. But it closely matches the 1MB sars-2 test: the learning rate should be large in case of these files, too. Strangely, the best result is at learning rate 7, the worst at 6.
    67 replies | 3598 view(s)
  • Gotty's Avatar
    20th February 2021, 11:37
    For the first 1MB of sars-2 (sequence only):
    n-m: size (n: final learning rate on 1st layer, m: final learning rate on 2nd (last) layer)
    8-2: 8122
    8-3: 8113
    8-4: 8109
    8-5: 8108
    8-6: 8103 * best
    8-7: 8105
    8-8: 8106 <- max learning rate on both layers
    7-8: 8109
    6-8: 8109
    5-8: 8112
    4-8: 8118
    3-8: 8123
    2-8: 8128
    So a small decay in the last layer works, but having no max learning rate on the first layer hurts. This is what I learned. The difference is just a handful of bytes for a 1MB file, so I didn't test with smaller sizes or chunks. Let me run a test round with c7 (the first 10M bytes of sars-2) and record the compression rate after every MB.
    67 replies | 3598 view(s)
  • kaitz's Avatar
    20th February 2021, 10:39
    @Gotty, have you tested all the learning rates in the mixer on a smaller set? Is the compressed data getting smaller and smaller, say for a 200kb chunk of input or any other larger size?
    67 replies | 3598 view(s)
  • Self_Recursive_Data's Avatar
    20th February 2021, 06:14
    I haven't yet implemented the final part of the algorithm - where you convert the extremely long decimal (after arithmetic coding) to binary form, e.g. 256 becomes 11111111. I implemented it yesterday but it can't divide long numbers... how can I do it without resorting to doing chunks?
    194 replies | 16450 view(s)
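    One way to do the decimal-to-binary step asked about above without machine-word "chunks" or a bignum library: keep the number as a decimal digit string and repeatedly divide it by 2 with schoolbook long division, collecting the remainders as bits (this is the simple route; the StackOverflow links Shelwien posted above cover faster big-integer division). A minimal sketch:

        #include <algorithm>
        #include <cstdio>
        #include <string>

        // Schoolbook division of a decimal digit string by 2; returns the remainder
        // and replaces the string with the quotient.
        int div2(std::string& dec) {
            std::string q;
            int rem = 0;
            for (char c : dec) {
                int cur = rem * 10 + (c - '0');
                q += (char)('0' + cur / 2);
                rem = cur % 2;
            }
            q.erase(0, std::min(q.find_first_not_of('0'), q.size() - 1)); // trim leading zeros
            dec = q;
            return rem;
        }

        // Collect remainders (least significant bit first), then reverse.
        std::string dec_to_bin(std::string dec) {
            std::string bits;
            while (!(dec.size() == 1 && dec[0] == '0'))
                bits += (char)('0' + div2(dec));
            std::reverse(bits.begin(), bits.end());
            return bits.empty() ? std::string("0") : bits;
        }

        int main() {
            std::printf("%s\n", dec_to_bin("255").c_str());   // 11111111
            std::printf("%s\n", dec_to_bin("256").c_str());   // 100000000
        }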
  • Gotty's Avatar
    20th February 2021, 02:46
    coronavirus.fasta.preprocessed_seq_only.8 100'003'453 bytes
    paq8gen_v1.exe -8 109364 bytes
    paq8gen_v2.exe -8 107662 bytes
    paq8gen_v3.exe -8  82331 bytes
    paq8gen_v4.exe -8  78236 bytes
    paq8gen_v5.exe -8  73779 bytes
    coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
    paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
    paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
    paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
    paq8gen_v4.exe -12   696'424 bytes 23969.18 sec
    paq8gen_v5.exe -12   660'417 bytes 26623.40 sec
    67 replies | 3598 view(s)
  • Gotty's Avatar
    20th February 2021, 02:44
    - Simplified replacement strategy in Bucket16
    - Decreased final learning rate from 8 to 6 in last mixer layer
    - Removed IndirectContext and LargeIndirectContext
    - Refined SSE stage: now using 4 high-precision APMPost maps instead of 2
    - Removed not-useful contexts from LineModel, added contexts to better model codons
    - Removed not-useful contexts from MatchModel, added context to identify ambiguous matches; now using initial match lengths of multiples of 3
    - Cosmetic changes in NormalModel and Shared
    Source: https://github.com/GotthardtZ/paq8gen
    Windows binaries: https://github.com/GotthardtZ/paq8gen/releases/tag/v5
    67 replies | 3598 view(s)
  • LawCounsels's Avatar
    20th February 2021, 00:04
    >> ( Sportman's )
    60 replies | 1451 view(s)
  • Gotty's Avatar
    19th February 2021, 23:33
    LawCounsels, It's not clear which is the quoted text and which is your text. Could you fix it? Edit: Thank you! Better. But still not there. Please see all the posts above how quoting looks like. Type your messages OUTSIDE of the quoted area, please.
    60 replies | 1451 view(s)
  • LawCounsels's Avatar
    19th February 2021, 23:01
    60 replies | 1451 view(s)
  • Bulat Ziganshin's Avatar
    19th February 2021, 21:23
    Modern OoO CPUs have 100-200 real registers exactly for this reason. Btw, this is a series of posts exactly about efficient bit reading: https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far-too-many-ways-part-1/
    8 replies | 305 view(s)
  • lz77's Avatar
    19th February 2021, 20:52
    It may have happened because there are no registers left for OoO execution. :) LZ4 is ~2 times faster in decompression than mine. On CPU I5005U (Broadwell).
    8 replies | 305 view(s)
  • Lucas's Avatar
    19th February 2021, 20:44
    0.328 seconds to decode is roughly 305MB/s; however, there are codecs in the wild which can do 1-3GB/s decode. LZ4 is a good example, and it's not written in assembly at all, it's written in C; it's the compiler that does the rest of the optimizations for you. The trick to writing fast C is to keep it simple and be aware of what instructions the compiler will emit, since this will produce the best assembly code possible. Though there are edge cases when the compiler cannot optimize certain code paths, they are quite rare. The best example that comes to mind is rANS renormalization, hence why rANS static exists. Word of advice: try using godbolt to inspect the assembly output of your C/C++; it really helps you understand what the compiler thinks you're trying to do with your functions.
    8 replies | 305 view(s)
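    The rANS renormalization mentioned above, in the shape most public byte-wise rANS decoders use (ryg_rans style): the state is refilled one byte at a time by a trivial loop, which is exactly the kind of hot path worth inspecting on godbolt. Shown as a compilable fragment; the constants and layout are illustrative, not taken from any particular codec, and a matching encoder is assumed.

        #include <cstdint>

        // rANS decoder step in the common byte-wise layout: a 32-bit state x is kept
        // in [RANS_L, RANS_L * 256); after consuming a symbol it is refilled from the
        // byte stream until it is back above RANS_L. Simple enough that the compiler
        // emits very tight code.
        static const uint32_t RANS_L = 1u << 23;     // lower bound of the state interval
        static const uint32_t SCALE_BITS = 12;       // symbol frequencies sum to 1 << SCALE_BITS

        inline uint32_t rans_advance(uint32_t x, uint32_t freq, uint32_t cum_freq,
                                     const uint8_t*& ptr) {
            // remove the decoded symbol from the state...
            x = freq * (x >> SCALE_BITS) + (x & ((1u << SCALE_BITS) - 1)) - cum_freq;
            // ...then renormalize: this tiny loop is the edge case worth hand-checking
            while (x < RANS_L) x = (x << 8) | *ptr++;
            return x;
        }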
  • xinix's Avatar
    19th February 2021, 20:20
    60 replies | 1451 view(s)
  • Romul's Avatar
    19th February 2021, 20:18
    OK. Realized my mistake.
    60 replies | 1451 view(s)
  • Gotty's Avatar
    19th February 2021, 20:14
    Confirmed: The cut version looks random.
    60 replies | 1451 view(s)