Activity Stream

  • Shelwien's Avatar
    Today, 15:40
    What "no"? I looked at the debuginfo in it, and its got all the cmix classes 0001:00000500 Manager::Manager(void) 0001:0000088E Manager::UpdateHistory(void) 0001:000008D0 Manager::UpdateWords(void) 0001:00000A36 Manager::UpdateRecentBytes(void) 0001:00000A68 Manager::Perceive(int) 0001:00000B9C Manager::AddContext(std::unique_ptr<Context,std::default_delete<Context>>) 0001:00000C36 Manager::AddBitContext(std::unique_ptr<BitContext,std::default_delete<BitContext>>) 0001:00000C90 Predictor::GetNumNeurons(void) 0001:00000D46 Predictor::Predict(void) 0001:00000FE4 Predictor::Perceive(int) 0001:000011D0 Predictor::Add(Model *) 0001:00001290 Predictor::AddDMC(void) 0001:0000139C Predictor::AddByteRun(void) 0001:000019C8 Predictor::AddRunMap(void) 0001:00002004 Predictor::AddNonstationary(void) 0001:00002728 Predictor::AddMatch(void) 0001:00002FA8 Predictor::AddDoubleIndirect(void) 0001:0000386A Predictor::AddDirect(void) 0001:00003F1C Predictor::AddSparse(void) 0001:000051B2 Predictor::AddEnglish(void) 0001:000065C4 Predictor::AddByteModel(ByteModel *) 0001:00006686 Predictor::AddPPM(void) 0001:000067A6 Predictor::Add(int,Mixer *) 0001:000068CC Predictor::AddPAQ8L(void) 0001:00006A16 Predictor::AddPAQ8HP(void) 0001:00006B62 Predictor::AddPAQ8H2(void) 0001:00006CAC Predictor::AddSSE(void) 0001:000073BE Predictor::AddMixers(void) 0001:0000A3D4 Predictor::Predictor(void) https://github.com/byronknoll/cmix/blob/1ec3a39a154b4853c6b5722bf0cbe23db29342c0/src/context-manager.h#L13 https://github.com/byronknoll/cmix/blob/bc50e54a920e9b80984a4d9225556326093a0940/src/predictor.h#L26
    3 replies | 29 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 15:32
    No.
    3 replies | 29 view(s)
  • Shelwien's Avatar
    Today, 15:17
    https://www.virustotal.com/gui/file/5d1849fdcf0e10a32857c38dd64cd1afcb179c88a5a8c1ce6d99962547d3d632/detection
    Seems like a renamed cmix?
    3 replies | 29 view(s)
  • suryakandau@yahoo.co.id's Avatar
    Today, 14:43
    Could someone on this forum run it on enwik8 and enwik9, or on Darek's benchmark? It uses a neural network, so the running time can be very long. Thank you very much.
    3 replies | 29 view(s)
  • Fallon's Avatar
    Today, 13:23
    Fallon replied to a thread WinRAR in Data Compression
    WinRAR Version 5.80 beta 4 https://www.rarlab.com/download.htm
    174 replies | 120633 view(s)
  • skal's Avatar
    Today, 01:20
    + WebP v2 on the right - 254 bytes. Paper: https://arxiv.org/abs/1812.02831
    15 replies | 695 view(s)
  • Shelwien's Avatar
    Today, 00:11
    https://github.com/Shelwien/shar2
    https://github.com/Shelwien/shar2/releases/download/v0/shar2_v0.7z (there's also an x64 linux binary)
    This is an implementation of a "different" archive format. It has a solid index at the end of the archive, like .7z/.arc, but it is still able to extract from a pipe, by extracting to temp names first and then renaming once the index is reached. It also doesn't store file sizes explicitly (it uses escape sequences for terminators), so it's possible to e.g. batch-edit files like this: "shar a0 - src/ | sed -e s/foobar/foo/ | shar x - new" (a small framing sketch follows this post).
    Known problems:
    1. Wildcards are not implemented yet, so it's only possible to create an archive from a single directory.
    2. Needs newer (2015+?) VS libs for proper support of non-english names on Windows. Mingw gcc default libs don't have locale support.
    3. gcc >= 8 builds a buggy exe. Builds from gcc7 and earlier work properly, while gcc8+ don't. I don't know why yet, but probably C++ templates are again too complicated for gcc to parse.
    0 replies | 114 view(s)
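    A minimal sketch of the terminator idea in the shar2 post above. The concrete escape bytes (0x00 as escape, ESC+END as terminator, ESC+LIT as a stuffed literal) are assumptions for illustration; shar2's real escape sequences may differ.
        // Hypothetical byte-stuffed framing: no file size is stored; the payload is
        // escaped so a reserved two-byte sequence can mark its end. Illustration only.
        #include <cstdio>
        #include <cstdint>
        #include <vector>

        static const uint8_t ESC = 0x00;     // escape byte (assumed)
        static const uint8_t END = 0x01;     // ESC,END terminates a file (assumed)
        static const uint8_t LIT = 0x02;     // ESC,LIT encodes a literal ESC byte (assumed)

        void put_file(const std::vector<uint8_t>& data, FILE* out) {
            for (uint8_t b : data) {
                if (b == ESC) { fputc(ESC, out); fputc(LIT, out); }  // stuff literal escapes
                else fputc(b, out);
            }
            fputc(ESC, out); fputc(END, out);                        // terminator instead of a size field
        }

        bool get_file(FILE* in, std::vector<uint8_t>& data) {
            for (;;) {
                int b = fgetc(in);
                if (b == EOF) return false;                          // truncated stream
                if (b != ESC) { data.push_back((uint8_t)b); continue; }
                int c = fgetc(in);
                if (c == END) return true;                           // end of this file's payload
                if (c == LIT) { data.push_back(ESC); continue; }
                return false;                                        // unknown escape code
            }
        }
    Because the payload carries its own terminator rather than a stored length, a filter in the middle of a pipe (like the sed example above) can change the payload length without breaking the framing, which is what makes the batch-edit pipeline possible.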
  • Gonzalo's Avatar
    Yesterday, 21:38
    @Christian: Last night I processed and restored all my PDFs, GZs and ZIPs with mtprecomp. Some 630+ files, several GBs. Totally forgot to bit-compare the outputs, OTOH. At least we know there are no crashes :D
    -----------------------
    On a different note: are you sure?? I mean, what's the point? ZSTD is barely better than zlib. Why go through all the trouble just to end up using a weak algorithm? If you want a faster codec, I strongly suggest you look at flzma2. In my tests it is always way faster than lzma (1.5x at the slowest, up to 5x and more when multithreading) and ratios are barely 1% worse (sometimes even better than lzma - see the pdf).
    Here are the results of a recent test. The testset is a folder with Windows .ico icons (with PNGs inside). The first two pages show the Pareto-frontier methods for compression speed vs ratio (with Razor added manually, as it is probably the strongest LZ program I know of); next is everything else for comparison. I did several of these tests with other kinds of data too. The results are pretty much the same; this is just the better-formatted one.
    37 replies | 1699 view(s)
  • maadjordan's Avatar
    Yesterday, 17:27
    XnView has similar capabilities; it is also available as a separate program, XnConvert (https://www.xnview.com/en/xnconvert/), which works as a GUI to NConvert (https://www.xnview.com/en/nconvert/).
    3 replies | 75 view(s)
  • Shelwien's Avatar
    Yesterday, 16:57
    Shelwien replied to a thread NNCPa in Data Compression
    Yes, but in the end it computes probabilities of bits and encodes bits. Just with MT processing of chunks and of the bits of a symbol - e.g. one stream contains all the bit-15s, another all the bit-14s, etc. (see the sketch after this post). Basically "-batch_size 1" does what I wanted, but it's so slow that some other MT implementation would be useful.
    7 replies | 551 view(s)
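    A rough sketch of the per-bit-plane stream split described above (one stream for all bit 15s, another for all bit 14s, and so on). The 16-bit symbol width and the plain vector-of-bits representation are assumptions for illustration.
        // Split a sequence of 16-bit symbols into 16 bit streams,
        // stream k holding bit k of every symbol. Illustrative only.
        #include <cstdint>
        #include <vector>
        #include <array>

        std::array<std::vector<uint8_t>, 16> split_bitplanes(const std::vector<uint16_t>& symbols) {
            std::array<std::vector<uint8_t>, 16> planes;
            for (uint16_t s : symbols)
                for (int k = 0; k < 16; k++)
                    planes[k].push_back((s >> k) & 1);   // bit k goes to stream k
            return planes;
        }
    Each plane could then be handled by its own thread; whether the resulting coding order helps or hurts compression is exactly the trade-off discussed in the post above.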
  • Mauro Vezzosi's Avatar
    Yesterday, 16:43
    Mauro Vezzosi replied to a thread NNCPa in Data Compression
    I don't know if we can remove it; I guess we can't, because it's hidden in the library. There is only this in the source files:
    libnc.h: typedef struct NCModel NCModel;
    nncp.c:  static int nb_threads = 1;
             NCModel *m;
             m = nc_model_init(nb_threads);
             case 'T': nb_threads = atoi(optarg);
    and NCModel isn't defined anywhere :-( (its fields are not used in the NNCP *.c/h files). NNCP processes data at the symbol level and I doubt that LibNC works at the bit level. It is possible to input a sequence of bits by setting -n_symb 2 (the input data will be one bit per byte); a small expansion sketch follows this post.
    7 replies | 551 view(s)
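    For the "-n_symb 2" workaround mentioned above (feeding the data as one bit per byte), a preprocessing step could look like this sketch; the MSB-first bit order and the tool name are assumptions, not part of NNCP.
        // Expand each input byte into 8 output bytes holding 0 or 1,
        // so a symbol-level model with -n_symb 2 effectively sees bits.
        #include <cstdio>

        int main(int argc, char** argv) {
            if (argc < 3) { fprintf(stderr, "usage: bits <in> <out>\n"); return 1; }
            FILE* in  = fopen(argv[1], "rb");
            FILE* out = fopen(argv[2], "wb");
            if (!in || !out) return 1;
            int b;
            while ((b = fgetc(in)) != EOF)
                for (int k = 7; k >= 0; k--)        // MSB first (assumed ordering)
                    fputc((b >> k) & 1, out);
            fclose(in); fclose(out);
            return 0;
        }
    Usage would be something like "bits file.dat file.bits" and then compressing file.bits with -n_symb 2.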
  • Shelwien's Avatar
    Yesterday, 15:22
    See rzwrap above.
    37 replies | 1699 view(s)
  • pklat's Avatar
    Yesterday, 14:55
    but for twitter challenge, perhaps some other image samples would be more appropriate, *cough* goatse *cough*
    15 replies | 695 view(s)
  • 78372's Avatar
    Yesterday, 14:54
    Any chance to include stdio in precomp?
    37 replies | 1699 view(s)
  • Darek's Avatar
    Yesterday, 13:31
    Darek replied to a thread NNCPa in Data Compression
    >Can you tell me your best current options?
    My current best options are as in the table. Thanks for the descriptions, I'll start testing them.
    7 replies | 551 view(s)
  • Shelwien's Avatar
    Yesterday, 12:29
    Shelwien replied to a thread NNCPa in Data Compression
    > In the first version of NNCPa I tried to integrate Shelwien's SSE but I stopped almost immediately, I'll try again in the next versions.
    That SSE is not really universal, there are many tuned parameters, and the overall effect was pretty small, because I couldn't access contexts in the actual data. The basic component is the interpolated SSE class (sh_SSE1.inc), so maybe you can experiment with that instead? (It starts with a 1:1 mapping and, with a low wr, stays like that, so a tuned SSEi never hurts compression; a rough sketch follows this post.) Also I wonder if it's possible to remove NNCP's threading method and keep only normal sequential processing of bits. It provides the best compression anyway (I mean batch_size=1), and for encoding it's possible to implement a different kind of MT, e.g. via sorting bits by contexts.
    7 replies | 551 view(s)
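    A minimal sketch of an interpolated SSE stage of the kind referenced above (sh_SSE1.inc itself is not reproduced here); the bucket count, the 12-bit probability scale and the update rate are illustrative assumptions. The table starts as an identity (1:1) mapping, so with a low update rate it can only refine the input probability.
        // Interpolated SSE: refine an input probability p (0..4095) through a small
        // per-context table, interpolating between the two nearest buckets.
        #include <vector>

        class SSEi {
            static const int BUCKETS = 33;               // 32 intervals + end point (assumed)
            std::vector<int> t;                          // per-context mapping tables
            int lo, hi, w;                               // state saved between predict() and update()
        public:
            explicit SSEi(int num_ctx) : t(num_ctx * BUCKETS), lo(0), hi(0), w(0) {
                for (int c = 0; c < num_ctx; c++)
                    for (int i = 0; i < BUCKETS; i++)
                        t[c * BUCKETS + i] = i * 4096 / (BUCKETS - 1);   // identity (1:1) mapping
            }
            // p in 0..4095, ctx in 0..num_ctx-1; returns a refined probability in 0..4095
            int predict(int p, int ctx) {
                int idx = p * (BUCKETS - 1);             // bucket position scaled by 4096
                lo = ctx * BUCKETS + (idx >> 12);
                hi = lo + 1;
                w  = idx & 4095;                         // interpolation weight of the upper bucket
                return (t[lo] * (4096 - w) + t[hi] * w) >> 12;
            }
            // pull both buckets toward the observed bit, proportionally to their weights
            void update(int bit, int rate = 6) {
                int target = bit ? 4095 : 0;
                t[lo] += (target - t[lo]) * (4096 - w) / (4096 << rate);
                t[hi] += (target - t[hi]) * w / (4096 << rate);
            }
        };
    Typical use: p2 = sse.predict(p, ctx); code the bit with p2; then sse.update(bit).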
  • Mauro Vezzosi's Avatar
    Yesterday, 10:22
    I hope there are no problems if I answer.
    > My question is about -lr use: -lr 1e-4,5000000,5e-5,28000000,3e-5
    > -> does that mean it's possible to use different options for selected parts of the file?
    Yes, you can set a maximum of 3 block lengths with different learning rates. The l.r. changes proportionally according to the current position in the block; for -lr 1e-4,5000000,5e-5,28000000,3e-5 it is:
    Position 0 --> lr 0.000100 (= 1e-4)
    Position 2500000 (= (5000000 - 0) * 0.5) --> lr 0.000075 (= 0.000100 - 0.000050 * 0.5)
    Position 5000000 --> lr 0.000050 (= 5e-5)
    Position 9600000 (= 5000000 + (28000000 - 5000000) * 0.2) --> lr 0.000044 (= 0.000050 - 0.000030 * 0.2)
    Position >=28000000 --> lr 0.000030 (= 3e-5)
    The default is -lr 3e-4,5e6,1e-4. (A small helper reproducing this schedule follows this post.)
    > -> is it possible with all options?
    No, only with -lr.
    101 replies | 11753 view(s)
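    The interpolation described above can be reproduced with a small helper; this is only a sketch of the arithmetic in the worked examples, not NNCP's actual code.
        // Piecewise-linear learning-rate schedule for "-lr r0,p1,r1,p2,r2":
        // lr falls linearly from r0 to r1 over [0,p1], from r1 to r2 over [p1,p2],
        // and stays at r2 afterwards. Matches the worked examples above.
        #include <cstdio>

        double lr_at(long pos, double r0, long p1, double r1, long p2, double r2) {
            if (pos <= 0)  return r0;
            if (pos < p1)  return r0 + (r1 - r0) * (double)pos / (double)p1;
            if (pos < p2)  return r1 + (r2 - r1) * (double)(pos - p1) / (double)(p2 - p1);
            return r2;
        }

        int main() {
            // -lr 1e-4,5000000,5e-5,28000000,3e-5
            long probes[] = {0, 2500000, 5000000, 9600000, 28000000};
            for (long p : probes)
                printf("pos %9ld -> lr %f\n", p, lr_at(p, 1e-4, 5000000, 5e-5, 28000000, 3e-5));
            return 0;
        }
    Running it prints 0.000100, 0.000075, 0.000050, 0.000044 and 0.000030 for the five positions, matching the table above.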
  • Mauro Vezzosi's Avatar
    Yesterday, 10:20
    Mauro Vezzosi replied to a thread NNCPa in Data Compression
    > Scores of my testset compressed by nncpa with best options from RC_1.
    In the first version of NNCPa I tried to integrate Shelwien's SSE but I stopped almost immediately, I'll try again in the next versions. SSE improves compression but slows down considerably and needs more memory.
    Can you tell me your best current options? I used those written here and the results are worse, except for:
    284582 D.TGA
    111472 F.JPG
      6280 O.APR
     34751 R.DOC (see below)
     31992 S.DOC
     22085 V.DOC
       422 Y.CFG
    These results may now be slightly different because I don't remember when I made them (maybe I added a few extra bytes to the header).
    > However to test more options I'll need to change some parameters to speed up - with these options my testset compress time is about 7 days...
    Yes, I stopped K.WAD after 2 MB because it takes too long with your options.
    > Do you suggest to test some of the new options as first?
    Try the new weird fg_activ and og_activ options on small files (<~100/150 KB), e.g. (bold: best, underlined: better than w/o -*g_activ):
    w/o -*g_activ  -ig_activ  -fg_activ  -og_activ
       111472       111607     111442     111486   F
        44232       311619     101412      75609   J
         6280       154021      71754      11761   O
         4746         4996       4437       4658   P
        34751        35755      34584      33980   R
        31992       740475     117547      51237   S
        22409        23522      22357      22273   T
        10528        11999      10519      10490   U
        22085        23089      21754      21601   V
        16136        16349      15840      15579   W
        12857        13333      12685      12506   X
          422          459        416        425   Y
          249          258        239        250   Z
    Try -block_len. If -batch_size = 1, then set -block_len to >= the length of the file: you should save 4+ bytes every 1600000 bytes... If -batch_size > 1, then set -block_len to -batch_size * 20000-40000, e.g. if -batch_size 3 then -block_len 90000 (3 * 30000).
    Try the new -block_loop and -block_iter. I only did a few tests; just to start, set -block_len to -batch_size * 50000-60000 (regardless of the value of -batch_size), -block_loop 2 and -block_iter 1.
    -*_embed_* can be useful with -n_layer >~5-6; however, leave these options aside, not because they are less important but because they can have various combinations. All other new options are less important.
    For example, my best options for R.DOC are:
    -batch_size 1 -n_layer 3 -hidden_size 432 -full_connect 1 -lr 0.0009 -time_step 18 -og_activ 1 -seed 0 -block_len 90000 -block_loop 2 -block_iter 1
    bps=2.203 (33095 bytes)
    When you find a better option, you should re-test all/most of the options previously found, because they can be affected by the new one. Remember that small fluctuations are normal; for example, if an option improves from 1000 to 997, this does not mean that it will continue to be better even when other parameters change.
    7 replies | 551 view(s)
  • xcrh's Avatar
    Yesterday, 06:50
    I wonder if this makes sense: https://encode.su/threads/3152-LZSA ? It has proven to be a rather interesting codec; even if the focus on 8-bit antique HW can be crippling, it still seems to be fairly "dense" for a codec that isn't even bit-aligned.
    53 replies | 10722 view(s)
  • xcrh's Avatar
    Yesterday, 06:38
    xcrh replied to a thread AV1 Video Codec in Data Compression
    Uhm? WTF? Not even up-to-date reference AV1 codec results? Only some exotic and relatively unknown codec? EPIC FAIL!!! Yeah, sure, you have to build it from source, but I do think a state-university team should be able to cope with that if even mortals like me can, no?
    That said, the reference codec being "slow" in the ancient 1.0 sense and "slow" in current git are two very different things. It has improved a hell of a lot, and after fiddling with it here and there I have to say the result appears to be worth waiting for. At the very least it is able to produce a very good picture at a given bitrate - and it is already starting to be supported by browsers, which is not something H.265 is likely to see at all, and I'd call that quite an important consideration.
    I'd say this report shows a gross lack of competence on the AV1 side. Furthermore, it uses compressed video as the source, which can measure anything but codec efficiency because of the original encoding artifacts. And even so, no RAW sequence runs at all? Or am I blind? Or is this intentional?
    Even then, some unknown AV1-based encoder kept a high score in subjective quality - and I'd say that's where AV1 is really good. Somehow it is exceptionally hard to spot encoding artifacts, even at low bitrates. As a concrete example, based on (uncompressed) BigBuckBunny, x264 and x265 tend to damage the scrolling credits text at the end quite badly, with almost any settings (well, maybe except a forced constant-quality mode, which is quite a specific thing). AV1 does much better at comparable bitrate targets. I haven't tested proprietary codecs, though, as I have no use for them, yet it could be interesting to see how the current reference AV1 performs compared to those. Um, maybe it just compresses too well?
    VP9 is nice, but I'd say the main development focus of the Google video team has apparently moved to AV1; at least VP9 seems to be a far less active project for now in terms of number of commits.
    35 replies | 6770 view(s)
  • Gonzalo's Avatar
    Yesterday, 01:10
    Need any assistance with that? PM if you could use some testing or whatever
    37 replies | 1699 view(s)
  • bwt's Avatar
    17th November 2019, 16:01
    This is my project about computer vision on Android: https://youtu.be/-izclDEIctw Can it be done on a PC without EmguCV or OpenCV?
    9 replies | 468 view(s)
  • Shelwien's Avatar
    17th November 2019, 12:30
    1) Jpeg is an already compressed format. Its entropy coding is inefficient and the metainfo is left uncompressed, but still, a general-purpose compression algorithm (eg. nanozip) can't achieve good results.
    2) There're specialized compressors for the jpeg format (usually called recompressors): winzip, stuffit, brunsli, lepton, packjpg, jojpeg, rejpeg, pjpg.
    3) On a single file the best compression would be achieved by something using the recent paq8px jpeg model - maybe cmix or paq8pxd, depending on specific details. But on a large number of files the result may surprisingly be better using mfilter, because of its preprocessing of thumbnails and color profiles. So the best method might be to use mfilter with a dummy lepton exe (without compression), then paq8px or cmix.
    135,671 0067.jpg  // original
    134,291 0067.cdm  // bwtsh|cdm5
    117,573 0067.pjpg // pjpg c 0067.jpg 0067.pjpg
    114,725 0067.rjg  // rejpeg 0067.jpg
    114,499 0067.zip  // WZZIP -ez 0067.zip 0067.jpg
    113,576 0067.brn  // brunsli c 0067.jpg 0067.brn
    110,079 0067.pcf  // precomp 4.7 -cn (uses packjpg library)
    110,126 0067.lep1 // lepton 0067.jpg
    110,055 0067.pjg  // packJPG 0067.jpg
    109,838 0067.lep2 // lepton-slow 0067.jpg (based on packjpg)
    109,982 0067.7z   // 7z a -m0=mfilter:a1 0067.7z 0067.jpg (uses lepton)
    107,106 0067.pa   // 7z a -m0=jojpeg -m1=plzma 0067.pa 0067.jpg (uses jojpeg_sh4)
    106,677 0067.jo3a // jojpeg_sh3 c 0067.jpg 0067.jo3a 0067.jo3b (separate stream for metainfo)
        409 0067.jo3b
    103,974 0067.jo2  // jojpeg_sh2 c 0067.jpg 0067.jo2 (standalone paq8p model)
    103,660 0067.sitx // stuffit\pack /c -a=0067.sitx -l=16 -x=30 -b -o --jpeg-no-thumbnails --level=999 --recompression=all --recompression-level=2 0067.jpg
    1 replies | 174 view(s)
  • suryakandau@yahoo.co.id's Avatar
    17th November 2019, 07:05
    This is my little project on computer vision on Android: https://youtu.be/-izclDEIctw Can it be done on a PC without EmguCV or OpenCV?
    9 replies | 468 view(s)
  • brispuss's Avatar
    17th November 2019, 01:52
    What is the best method to maximize compression of JPGs losslessly? I've done some tests on the enclosed JPG test file.
    Original size          135,671 bytes
    Nanozip                135,433  -cO -m4g
    Precomp/Nanozip        110,250  -cn -intense + -cO -m4g
    Precomp/paq8px183fix1  110,148  -cn -intense + -9beta
    Precomp/fast paq8 v6   110,140  -cn -intense + -8
    Precomp                110,079  -cn -intense
    PackJPG v2.5k          110,055
    fast paq8 v6           101,459  -8
    paq8px183fix1          101,192  -9beta
    Ideas please?
    1 replies | 174 view(s)
  • Darek's Avatar
    16th November 2019, 23:55
    Darek replied to a thread NNCPa in Data Compression
    Scores of my testset compressed by nncpa with the best options from RC_1. There are some gains and some losses, in total 4.7 KB. However, to test more options I'll need to change some parameters to speed things up - with these options my testset compression time is about 7 days... Do you suggest testing some of the new options first?
    7 replies | 551 view(s)
  • Jyrki Alakuijala's Avatar
    16th November 2019, 23:53
    Yes. Practical applications start at 2000 bytes. (One image is worth 1000 WORDs.)
    15 replies | 695 view(s)
  • Darek's Avatar
    16th November 2019, 23:38
    @fab - I have a question about trfcp (or maybe nncp as well). In readme.txt there is an example of trfcp usage:
    ./trfcp -n_layer 6 -d_model 512 -d_pos 64 -d_inner 2048 -n_symb 16388 -lr 1e-4,5000000,5e-5,28000000,3e-5 c out.pre out.bin
    My question is about the -lr use: -lr 1e-4,5000000,5e-5,28000000,3e-5
    -> does that mean it's possible to use different options for selected parts of the file?
    -> is it possible with all options?
    101 replies | 11753 view(s)
  • fab's Avatar
    16th November 2019, 21:16
    A new version of NNCP is available (2019-11-16) at https://bellard.org/nncp . The models are not bigger, but there are several improvements:
    - the training is done every 20 symbols, but the truncated back propagation window is larger (40 symbols)
    - rms_norm is used instead of layer_norm (i.e. no zero average, just vector normalization; see the sketch after this post)
    - for enwik9, a dictionary of 32k symbols instead of 16k symbols was used
    - the Transformer model was improved with a new positional encoding scheme, but it is still not better than the LSTM one.
    Results:
    enwik8: 16292774 bytes
    enwik9: 119167224 bytes
    101 replies | 11753 view(s)
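    A minimal sketch of the difference between layer_norm and the rms_norm variant mentioned in the release notes (no mean subtraction, just scaling by the root-mean-square); the in-place interface, epsilon and the absence of learned gain/bias are simplifications for illustration.
        // layer_norm: subtract the mean, then divide by the standard deviation.
        // rms_norm:   skip the mean ("no zero average"), divide by sqrt(mean(x^2)).
        #include <cmath>
        #include <vector>

        void layer_norm(std::vector<float>& x, float eps = 1e-5f) {
            float mean = 0, var = 0;
            for (float v : x) mean += v;
            mean /= x.size();
            for (float v : x) var += (v - mean) * (v - mean);
            var /= x.size();
            float inv = 1.0f / std::sqrt(var + eps);
            for (float& v : x) v = (v - mean) * inv;
        }

        void rms_norm(std::vector<float>& x, float eps = 1e-5f) {
            float ms = 0;
            for (float v : x) ms += v * v;
            ms /= x.size();
            float inv = 1.0f / std::sqrt(ms + eps);
            for (float& v : x) v *= inv;        // pure vector normalization, no centering
        }
    The only difference is the missing centering pass, which also makes rms_norm slightly cheaper per step.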
  • Shelwien's Avatar
    16th November 2019, 16:47
    Compiled with msys after installing half of linux in it: http://nishi.dreamhosters.com/u/lrzip_20191116.7z
    It mostly works, but gets stuck on lzma levels >6 (seems like a well-known issue); those levels do seem to work with -p 1 (single thread).
    7 replies | 500 view(s)
  • schnaader's Avatar
    16th November 2019, 16:15
    No problem, nice to read something from you. Also, that discussion is very interesting and precomp is in a similar phase now, I have to decide which features to implement and where its journey should go. I think it will get kind of schizophrenic in the next versions, offering both faster modes (e.g. making zstd default instead of lzma) and modes with better compression (e.g. brute-force modes for image data trying FLIF, webp, pik, PNG filters and pure lzma). The multithreading changes also make me doubt streaming support since these two are often incompatible.
    37 replies | 1699 view(s)
  • mpais's Avatar
    16th November 2019, 15:33
    @bulat: >FTL still saves 8-byte size in the archive header, so you need to go back. >7z/arc store metainfo size in the last bytes of archive, >so archive should be decoded starting with the last few bytes That was quite discussed in the Gitter chat, and something that seemed simple actually required considering a lot of things. Most would prefer a streaming-friendly format, which was basically out of the question when other features were considered. So with that out of the way, we considered ease of data-handling versus robustness to data loss, specifically truncation. The order of the metainfo blocks is such as to maximize the possibility of data recovery in that case; the file information structure is the last in the file, so if it got truncated, only some files would be lost. Even this wasn't really much to my liking, since maybe losing info for the directory tree structure would possibly be less catastrophic, but that is why this was a draft. Placing the offset last would mean truncation would effectively render all data lost. So it was considered, and mostly in limbo since we never actually had other draft proposals, but personally I think the small inconvenience is worth the effort. Sure, if you're considering backups to tape then completely strict sequential IO would be better. And the described reasoning would only be of concern for archives that didn't use the proposed wrapping recovery format. >so, FTL doesn't support compression/encryption of metainfo, making some users unhappy. Only compression of metainfo wasn't described in the draft, and could be added at no loss of generality, there are flag bytes specifically useful for things like encryption, aside from a version number. There was discussion on the encryption method to use and whether to support several. >also, FTL employs fixed order of fixed-structure blocks. This means that if you will need to change some block structure, older programs will be screwed. As above, backward compatibility was required, but not upward, obviously. >Instead, arc metainfo is a sequence of tagged blocks. When block type 1 will become obsolete, >I will just stop including it in newer archives. Some functionality will lose upward compatibility, >but the rest will still work. This also means that 3rd-party uitilities can add blocks of their own types Only the information required to describe the block structure, the codecs used and their sequences, the directory tree and the file list are required in specific order and thus don't need tags (also, inside these, almost every other metadata is based on a variable length list of tags). If these need to change, a simple version-number change suffices let newer decoders handle it, and older (incompatible ones) skip it. Should a 3rd-party tool wish it can append their own blocks after these, tag them or not as a way of identification, since any data after the end of the file list structure is ignored (again, since we don't rely on the metainfo offset being in the last bytes of the archive). >Then, some items such as codec tag, requires more input. This means that when older version >reads an archive with newer codec, it can't continue the decoding. 
arc encodes all codec params >as asciiz string, so older programs can display archive listing that includes newer codecs and >even extract files in solid blocks that use only older codecs The way parameters for each codec are stored is completely undefined as it is codec-dependent, it is literally only stated as being X bytes, so you could even include a asciiz/utf8/other string to describe the name of the codec, so older versions could at least also inform the user what codec it isn't recognizing and what parameters were used for it. And that doesn't influence its ability to extract any other blocks, solid or not, that use codecs it does recognize. >this is the most important point. among usual file directory fields (name,size,date/time,attr,crc) >only first two are varsized. and even the size field may be encoded as fixed-sized integer - after lzma, >compressed fixed-size data for the filesize field has about the same size as compressed VLI data, >according to my tests (I think that saving each byte of the size field separately may further improve >lzma-compressed size but don't checked it :) We use the optional metadata tag structure to hold all file/directory fiels apart from the name. Even file size is optional, since it can be deduced from its block structure, but Stephan Busch rightly noted that it would make listing file sizes a lot more computationally expensive, so it could be listed with its own tag. >next, with SoA you encode field tag only once per archive rather than once per file. Correct, but since the user would have complete control of what fields to include in the archive, if the absolute smallest archive size was required it could skip all fields, in which case a single null byte would be stored per file/directory, so that would help reduce the overhead. And it allows for flexibility, the user may choose to store complex metadata for just some file types of personal interest. A photographer could even keep a full disk backup including his/her photos, and when updating it with newer photos, include personalized text comments for each, describing personal thoughts about it. >Also, you can store field size after field tag and thus skip unknown fields, >again improving forward compatibility and 3rd-party extensibility Again, straight from the draft: >finally, compression ratios are greatly improved on same-type data, >i.e. name1,name2...,attr1,attr2... is compressed better than AoS layout Agreed. But again, at least in this initial version, we weren't considering compressing the metainfo since that would almost surely mean the format would be ignored by everyone else. Do you really think 3rd-party tool creators would want to have to handle writing/using decompression routines just to be able to even read the archive structure, let alone do anything with it? I don't see anyone apart from a few enthusiasts in this forum being willing to go through all that hassle. @shelwien: >Actually I think there's too little of constructive discussion, rather than too much. A gitter chat with several feature-focused rooms was setup for discussion of all of this, so as not to pollute the forum. >What action do you want? >Bulat made a few complete archivers, freearc is still relatively popular. >I also made my share of archiver-like tools and maintain a 7z-based archive format with recompression. Exactly, there is largely more than enough know-how in this community to make something great. And when once in a while it's discussed, there are lots of suggestions and discussion. 
But when time comes to actually help out and do something, no one can be bothered to contribute. But then just as easily as the saying goes, everyone's a critic. >Do you mean participation specifically in FT project? I'm actually also waiting on your improvements to paq8px. Gotty is going through so much trouble to improve the code (and probably fix all my bugs) to make it more decoupled so others, like yourself, can contribute ideas without being hamstringed by all the clunky mess it was. >But I don't feel that its compatible with my ideas and even requirements of my existing codecs? >(mainly multi-stream output). >I do try to suggest features and open-source preprocessors that might be useful, is there anything else? Fair enough, everyone has their own ideas, and suggestions are always welcome. But why is it that those here trying to cobble-up something for free in their spare time are belittled so easily and see such criticism? This dismissive atitude some of you here have for those trying to make progress in this field, constantly reminding everyone on how you guys could do a much better job at it, is rightfully going to get you called out. Sorry, but sometimes one either puts up or shuts up. >I think it has to start with concepts which the format is supposed to support. >1) there're common features which would be necessary to compete with popular >formats, like .rar or .7z: encryption, volumes, recovery, solid modes, API for external modules; Of those, only volumes weren't considered, since that can be approached in the same way we did with the recovery: design a wrapper format for it, useful not just for our own archive. This spin-off project could allow people with different expertise to contribute and would accelerate development. This was quite discussed in the chat and the recovery format was still in limbo without anything set in stone before discussion stopped. That is not to say better ways of handling any of those couldn't be explored, it was, after all, just a response to a call for drafts. >2) some possible specialized features: compatibility with stream processing (like tar), >incremental backups (like zpaq), forward compatibility (rarvm,zpaq), >random access to files/blocks (file systems); >3) new features that would establish superiority of the new format: >recompression, multi-level dedup, smart detection, MT compression improvements >(eg. ability to reference all unlocked regions of data for all threads, >rather than dumb independent compression of blocks), file/block diffs, >virtual files, new file/block sorting modes... Sure, lots of cool and innovative features can be explored, but we need to be realistic. Even as-is the project would already be extremely complex, as proven from its still-born status. And a few of those you mentioned were proposed. >Of course, its not necessary to implement everything right away. >But it'd be bad if you don't consider eg. volumes, and end up unable >to implement it properly, eg. because block descriptors don't support >splitting of blocks between volumes. Recovery and volumes could/would be handled as lower-level IO abstraction layers. They would be independent formats (like for example .PAR), so a FTL archive with recovery info would have extension ".ftl.rec", for instance. When processing, we'd check for which IO handler to use, and it would be passed on to the codecs. >And this FT format spec looks like a cleaner version of .paq8px or .pcf. 
>Sure, it would be an improvement for these, but do you really think >that proposed format is better than .7z, for example? If it was that simple to acertain what is better we wouldn't have needed any lenghty discussions. As long there are multiple specific needs in play, each possible option with their set of pros and cons, we should remain open. Hence the call for drafts. >But your format has "chunk" descriptors which specify codecs, >and "block" descriptors which specify preprocessors? Preprocessors are just another codec, that's all. Only after all the parsers, transforms and dedupers reach a block segmentation, do we then proceed to applying codec sequences. The format just describes the segmentation structure and what codecs were used, nothing more. Each individual block can have its own side-data (like reconstruction info for deflate streams, GIF images, etc). >Nope. I was talking about multi-level dedup. >As example, I made this: https://encode.su/threads/3231-LZ-token-dedup-demo >So there's this token dedup example, where dedup ideally >has to be applied on two levels simultaneously, but a greedy approach doesn't work: >entropy-level recompression isn't always better, but sometimes it can be. >But if a duplicate fragment is removed from lzma stream, it would be impossible to decode to tokens; >and if a dup simply removed from token stream, it would become impossible to reapply entropy coding. On your posted example, if I understood it correctly: Both files passed file-level dedup, so we parse them. The LZMA parser detects the stream. Maybe we try a few common options, none decoded it, so we can't mark it as a LZMA block. We try decoding it to tokens, succeed, mark the block as a LZMA-TOKENS block. They don't match, so content-level dedup doesn't affect them. After all parsing is done, we're still free to do any reversible-operations on the blocks before calling the preprocessors and codecs. So since we have a few LZMA-TOKENS blocks, we run them through their specific deduper, which finds their similarity. block 1 is left unchanged, block 2 is now set as LZMA-TOKENS-DEDUPED, it's private info points to block 1, and its related stream is now your ".dif". Both get compressed later on by the codecs. Now when decompressing, we get to block 2. We see it was lz-deduped from block 1, so we decompress blocks 1 and 2, apply the diff and get the tokens for block 2, and apply the inverse transform to get the original lzma stream. >That's actually good, atm only .rar has explicit file deduplication, >and extending it to embedded formats is a good idea. >But how would you handle a chunked-deflate format (like http traffic or email or png)? >(there's a single deflate stream with extra data inserted into it). Specific parsers and transforms, it was clear from the start that something like PNG would require it. Even decoding the LZW-encoded data from GIF images requires a specific transform. >Or container formats with filenames, which are best handled by extracting to virtual files, >then sorting them by filetype along with all other files? Again, content-aware segmentation would then allow (if using solid compression) for a (optional) content-aware clustering stage, designed to sort blocks of the same type by ranking their similarity. So if say, a TAR parser split all files in the stream into single blocks, of which some were then already detected as a specific type, all the files could be sorted. Images could just be sorted by resolution, or date taken, or camera used, or even visual likeness. 
Texts could be sorted by language. For this stage, that TAR parser might have even included the filename as optional info in each block it detected, to better helps us sort them, since for those we might not have any specific similarity-comparer. >But I think that adding every ~100-byte CDC fragment to a block list >would be quite inefficient. Which is why I said that they would need to be dedup-worthy, we need to take into account the block segmentation overhead. No point in deduping just a few bytes if the segmentation alone is bigger. >delta(5,6,7,8, 1,2,3,4,5,6,7,8,9) -> (5,1,1,1, 1,1,1,1,1,1,1,1,1). >- better to ignore dups in original data and just compress delta output; >delta(5,6,7,8,9, 1,3,7,7, 5,6,7,8,9) -> (5,1,1,1,1, 1,2,4,0, 5,1,1,1,1). >- better to ignore delta; >delta(5,6,7,8, 2,3,7,7,7, 3,4,8,8,8,5,6,7,8) -> (5,1,1,1, 2,1,4,0,0, 3,1,4,0,0,-3,1,1,1). >- "1,4,0,0" dups in delta, but "5,6,7,8" is better in original? The last level dedup would handle most big duplications, where deduping would most likely provide better compression anyway. The codec-sequences for these default blocks (assuming fast, lz-based codecs, since most CM codecs probably wouldn't need it) could be setup with a delta preprocessor before the lz-codec or not, and the best option would be used. Not the same, sure, but if delta helped we'd still get benefits, and otherwise an lz codec should reasonably handle small duplications, especially if we cluster similar blocks together). For specific type blocks, we can make use of the information in them to sort them, to create transforms and codecs. >2) I'm mostly concerned about cases where preprocessing stops >working after dedup - especially recompression; >For example, suppose we have a deflate stream with 3 blocks: blk1,blk2,blk3. >blk2 is a duplicate of some previous block. >but with blk2 removed, blk3 can't be recompressed. >And if dedup is ignored, recompression works, but an extra parsing diff is generated for blk2. >One possible solution would be to do both, with decoding kinda like this: >- restore blk1 >- insert blk2 by reference >- recompress blk1.blk2 >- patch rec(blk1.blk2) to rec(blk1.blk2.blk3) >- restore full sequence >But this certainly would require support in archive index. >And normal diffs would too, I suppose. If blk2, as an intermediate deflate block, is a direct match for some other deflate block in another file, it's most likely a stored/raw block, so after we decompress the whole stream, the last level dedup could likely handle it (for default FTL blocks) and not much would be lost anyway. But lets assume that is not the case, it's a huffman compressed block, static or not, and luck would have it, it doesn't decompress the same. Your problem is that we'd then possibly be storing duplicated deflate diff information. Nothing in the proposed draft keeps you from doing it. Again, it simply describes a block structure and parameters on how to recreate them. You can have a deflate-diffed block type, whose individual reconstruction parameters would do just what you wanted. But if we're considering contrived examples, think of Darek's testset, with the same image in 2 different compressed formats. We could decompress both, but one is stored top-to-bottom and the other bottom-to-top. Ideally, one of the transforms would account for this, to allow for better chances of deduping, but let's say that it not the case. 
If we use the last-level dedup on non-default blocks, we will match every line of one image to the same line in the vertically-mirrored image, and we'll have to split both into single-line blocks, one will be just deduped-blocks, and the other will be split into as many 1-line "images" blocks as the height. So, for one image we'll just have to store dedup info, and get great compression. But the other image now has a lot of extra segmentation overhead, and if we're not using solid-mode (and even there, we'd need stable sorting in the clustering stage to ensure the lines would be stored in order), then we're compressing single lines without any added 2d-context, and may get sub-par compression. Would it still be better in solid mode than relying on the clustering stage to put both images consecutively so that a solid image codec could get the 2nd one almost for free anyway? Would it then make sense to have an image-specific deduper that would realize this, and just store as "diff" for the 2nd image that we should mirror it vertically from the 1st image? >In any case, I think that a new format has to be able to beat existing ones, >otherwise why not just use .7z or .arc which have a plugin interface? Sure, but why not discuss other approaches? Isn't that the idea of a call for drafts, that people can provide alternative solutions and a consensus be reached on what to keep from each? I tried to detail how to handle every feature we set out to achieve with this draft, and we discussed its pros and cons. And since no one cared, I called it a day and stopped bothering with the project. @schnaader: I'm sorry for the off-topic, it's nice to see precomp is still getting developed.
    37 replies | 1699 view(s)
  • hunman's Avatar
    16th November 2019, 15:11
    Can someone compile a Win/x64 version, please? I'm a noob.. Thank you very much!
    7 replies | 500 view(s)
  • Shelwien's Avatar
    16th November 2019, 13:37
    First you need to have something you can actually sell. Like a US patent, or startup company based on your idea. An idea is not a product. With just an idea you can get a research grant or a job contract, but nothing luxurious. Then, there's also a cost/benefit analysis. The choice is not between storing the data with your algorithm, or not storing anything at all. But rather between getting more storage vs computing power required + extra latency. For a new algorithm with a speed similar to zlib there would be some applications, but we don't see cmix or nncp getting used to save storage space, because it wouldn't be profitable.
    3 replies | 87 view(s)
  • vteromero's Avatar
    16th November 2019, 11:34
    vteromero replied to a thread VTEnc in Data Compression
    The library can compress sorted lists of unsigned integers. The API exposes two flavours of sequences: lists and sets. Lists can hold repeated values, sets cannot. An example of a list would be ..., and a set looks like this: ... . Both lists and sets work with 8, 16, 32 and 64 bit values. I don't know if this explanation answers your question.
    18 replies | 1532 view(s)
  • vteromero's Avatar
    16th November 2019, 11:24
    vteromero replied to a thread VTEnc in Data Compression
    I see, that's interesting, I'll give it some thought.
    18 replies | 1532 view(s)
  • KingAmada's Avatar
    16th November 2019, 06:08
    Don't you think 20k euros is too little money for a true general random lossless compressor?
    3 replies | 87 view(s)
  • LawCounsels's Avatar
    16th November 2019, 05:43
    .... but not great (in the zip archive posted, each individual petal looks crisp, fresh, distinct and clear). How was it reduced?
    15 replies | 695 view(s)
  • necros's Avatar
    16th November 2019, 05:33
    Still acceptable? : 4522 bytes 4187 bytes Arith JPG
    15 replies | 695 view(s)
  • Shelwien's Avatar
    16th November 2019, 05:17
    You can win 20k euro here: http://prize.hutter1.net/
    You'd have to decompress it rather than compress it, though :)
    15284944*7/12 = 8916217
    50000*(15284944-8916217)/15284944 = 20833
    (The arithmetic is restated as a formula below.)
    3 replies | 87 view(s)
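    Restating the arithmetic above as one formula (using the thread's numbers: the payout is assumed to scale with the fractional improvement over the current ~15,284,944-byte record, and 7/12 applies the 12-to-7-byte ratio claimed earlier in the thread):
        \[
          L_{\mathrm{new}} = \tfrac{7}{12}\,L_{\mathrm{old}} \approx 8916217,
          \qquad
          \mathrm{payout} \approx 50000 \cdot \frac{L_{\mathrm{old}} - L_{\mathrm{new}}}{L_{\mathrm{old}}} \approx 20833 \ \mathrm{euro}
        \]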
  • Shelwien's Avatar
    16th November 2019, 04:52
    @mpais: > This is exactly what made me decide to quit posting any stuff here. > All I see here is talk. Lots of talk, very little if any action. > It's so, so easy to just sit on the sidelines and just criticize others. Actually I think there's too little of constructive discussion, rather than too much. What action do you want? Bulat made a few complete archivers, freearc is still relatively popular. I also made my share of archiver-like tools and maintain a 7z-based archive format with recompression. Do you mean participation specifically in FT project? But I don't feel that its compatible with my ideas and even requirements of my existing codecs? (mainly multi-stream output). I do try to suggest features and open-source preprocessors that might be useful, is there anything else? >> Its full of unimportant details like specific data structures, > > It's.. a specification for a format, i.e., it should describe in detail > how to read an archive and interpret its data structures.. I think it has to start with concepts which the format is supposed to support. 1) there're common features which would be necessary to compete with popular formats, like .rar or .7z: encryption, volumes, recovery, solid modes, API for external modules; 2) some possible specialized features: compatibility with stream processing (like tar), incremental backups (like zpaq), forward compatibility (rarvm,zpaq), random access to files/blocks (file systems); 3) new features that would establish superiority of the new format: recompression, multi-level dedup, smart detection, MT compression improvements (eg. ability to reference all unlocked regions of data for all threads, rather than dumb independent compression of blocks), file/block diffs, virtual files, new file/block sorting modes... Of course, its not necessary to implement everything right away. But it'd be bad if you don't consider eg. volumes, and end up unable to implement it properly, eg. because block descriptors don't support splitting of blocks between volumes. And this FT format spec looks like a cleaner version of .paq8px or .pcf. Sure, it would be an improvement for these, but do you really think that proposed format is better than .7z, for example? >> Normally there're only two options - either we preprocess a file, or we don't. >> But here the optimal solution would be to preprocess most of the data, >> except for chunks which are duplicated in original files. > It doesn't even mention preprocessors, since those would just be codecs, > usually the first ones in a codec sequence for a block type. But your format has "chunk" descriptors which specify codecs, and "block" descriptors which specify preprocessors? > And it was specifically designed to do just what you mentioned. Nope. I was talking about multi-level dedup. As example, I made this: https://encode.su/threads/3231-LZ-token-dedup-demo Currently in most cases preprocessing is applied either blindly (for example, many archivers would apply exe filters to files preprocessed by external tools, even though that makes compression worse), or with very simple detection algorithm integrated in the preprocessor itself. So there's this token dedup example, where dedup ideally has to be applied on two levels simultaneously, but a greedy approach doesn't work: entropy-level recompression isn't always better, but sometimes it can be. 
But if a duplicate fragment is removed from lzma stream, it would be impossible to decode to tokens; and if a dup simply removed from token stream, it would become impossible to reapply entropy coding. I think that this is actually a common case for a new archiver with lots of preprocessors. But at the moment, at best, developers make specific approximate detectors for preprocessors, rather than add a part of archiver framework which would handle detection and switching of codecs and preprocessors, including external plugins. > First, deduplication occurs at the file level, then later on it's done at the content level > (if a parser splits a block, we perform deduplication on those new sub-blocks). That's actually good, atm only .rar has explicit file deduplication, and extending it to embedded formats is a good idea. But how would you handle a chunked-deflate format (like http traffic or email or png)? (there's a single deflate stream with extra data inserted into it). Or container formats with filenames, which are best handled by extracting to virtual files, then sorting them by filetype along with all other files? > After all parsing is done, the plan was to run a rep-like (or such) deduplication > stage on any non-specific blocks, so they could still be further divided. I'm not too sure about this, since atm in most cases dedup is implemented more like a long-range LZ. As example, we can try to add to archive movie1.avi and movie1.mkv, where 2nd is converted from 1st. Turns out, there're many frames smaller than 512 bytes, so even default srep doesn't properly dedup them. But I think that adding every ~100-byte CDC fragment to a block list would be quite inefficient. > So in your example, if the common code chunks were large enough to merit deduplication, > they'd be deduped before any exe-preprocessor would even be called. delta(5,6,7,8, 1,2,3,4,5,6,7,8,9) -> (5,1,1,1, 1,1,1,1,1,1,1,1,1). - better to ignore dups in original data and just compress delta output; delta(5,6,7,8,9, 1,3,7,7, 5,6,7,8,9) -> (5,1,1,1,1, 1,2,4,0, 5,1,1,1,1). - better to ignore delta; delta(5,6,7,8, 2,3,7,7,7, 3,4,8,8,8,5,6,7,8) -> (5,1,1,1, 2,1,4,0,0, 3,1,4,0,0,-3,1,1,1). - "1,4,0,0" dups in delta, but "5,6,7,8" is better in original? > Nothing in the format prevents that. Each block uses a codec-sequence. > The user could specify that, for instance, on 24bpp images, > we should try 2 sequences for every block: one that includes a color-space transform before the actual compression, > and one that doesn't; and we just keep the best. Yeah, as I said, a strict choice. 1) this is only perfect with block dedup. While with a more generic long-range-LZ-like dedup there'd be cases where locally one preprocessor is better, but globally another, because part of its output matches something. 2) I'm mostly concerned about cases where preprocessing stops working after dedup - especially recompression; For example, suppose we have a deflate stream with 3 blocks: blk1,blk2,blk3. blk2 is a duplicate of some previous block. but with blk2 removed, blk3 can't be recompressed. And if dedup is ignored, recompression works, but an extra parsing diff is generated for blk2. One possible solution would be to do both, with decoding kinda like this: - restore blk1 - insert blk2 by reference - recompress blk1.blk2 - patch rec(blk1.blk2) to rec(blk1.blk2.blk3) - restore full sequence But this certainly would require support in archive index. And normal diffs would too, I suppose. Of course, there may be an easier solution. 
For some types of preprocessing (eg. exe) it really may be enough to give the preprocessor the original position, rather than the one in the deduplicated stream. In any case, I think that a new format has to be able to beat existing ones, otherwise why not just use .7z or .arc which have a plugin interface? And thus the most important part is new features that can significantly improve compression, but would require support in format, since otherwise its better to use .7z. Sure, block dedup is one such feature, but in case of solid compression it would be always worse than long-range-LZ dedup like srep. As example, I can only use .7z for half of my codecs, because it provides an explicit multi-stream interface. I'd have to add my own stream interleaving to make reflate or mp3det/packmp3c compatible with any other format, including FT.
    37 replies | 1699 view(s)
  • KingAmada's Avatar
    16th November 2019, 04:35
    I programmed it, but it's really, really slow. It takes an hour to compress 12 bytes to 7 bytes, and you can still compress those 7 bytes again. But is the time it takes worth it? And how much would it be worth? Because theoretically it can compress all types of files, and I believe that with a faster computer and more elegant code it would take only a few seconds to compress.
    3 replies | 87 view(s)
  • Bulat Ziganshin's Avatar
    16th November 2019, 03:29
    >it will be great to have archive format that can be written strictly sequentially. this means that index to metadata block(s) should be placed at the archive end
    FTL still saves an 8-byte size in the archive header, so you need to go back. 7z/arc store the metainfo size in the last bytes of the archive, so the archive can be decoded starting with the last few bytes.
    >Every global data structure is a checksummed single block. They would be stored uncompressed
    So, FTL doesn't support compression/encryption of metainfo, making some users unhappy.
    Also, FTL employs a fixed order of fixed-structure blocks. This means that if you need to change some block structure, older programs will be screwed. Instead, arc metainfo is a sequence of tagged blocks. When block type 1 becomes obsolete, I will just stop including it in newer archives. Some functionality will lose upward compatibility, but the rest will still work. This also means that 3rd-party utilities can add blocks of their own types.
    Then, some items, such as the codec tag, require more input. This means that when an older version reads an archive with a newer codec, it can't continue the decoding. arc encodes all codec params as an asciiz string, so older programs can display an archive listing that includes newer codecs and even extract files in solid blocks that use only older codecs.
    >Since we'd be using variable-length integers, most of the benefits of SoA over AoS aren't really available
    This is the most important point. Among the usual file directory fields (name, size, date/time, attr, crc) only the first two are varsized, and even the size field may be encoded as a fixed-sized integer - after lzma, compressed fixed-size data for the filesize field has about the same size as compressed VLI data, according to my tests (I think that saving each byte of the size field separately may further improve the lzma-compressed size, but I haven't checked that :)
    Next, with SoA you encode a field tag only once per archive rather than once per file. Also, you can store the field size after the field tag and thus skip unknown fields, again improving forward compatibility and 3rd-party extensibility.
    Finally, compression ratios are greatly improved on same-type data, i.e. name1,name2...,attr1,attr2... compresses better than an AoS layout. (A small layout illustration follows this post.)
    37 replies | 1699 view(s)
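    A minimal illustration of the AoS vs SoA metainfo layouts being argued over, with made-up field names; real archive directories have more fields and tagged, variable-length encodings.
        #include <cstdint>
        #include <string>
        #include <vector>

        // AoS: one record per file, fields interleaved on disk as
        // name1,size1,crc1, name2,size2,crc2, ...
        struct FileEntryAoS {
            std::string name;
            uint64_t    size;
            uint32_t    crc;
        };
        using DirectoryAoS = std::vector<FileEntryAoS>;

        // SoA: one array per field, serialized as
        // name1,name2,..., size1,size2,..., crc1,crc2,...
        // Same-type values sit next to each other (which compresses better, e.g. under lzma),
        // and a per-field tag/size is written once per archive instead of once per file.
        struct DirectorySoA {
            std::vector<std::string> names;
            std::vector<uint64_t>    sizes;
            std::vector<uint32_t>    crcs;
        };
    With SoA, all names can be emitted as one run, then all sizes, then all crcs, which is the compression-ratio argument made at the end of the post above.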
  • JamesWasil's Avatar
    16th November 2019, 01:21
    It looks like Chaz Bono wins the internet for having an avatar that is only 347 bytes. It doesn't look like Mona Lisa at all anymore, but they're still kinda famous? Famous-ish
    15 replies | 695 view(s)
  • pklat's Avatar
    16th November 2019, 00:28
    "Normally there're only two options - either we preprocess a file, or we don't. But here the optimal solution would be to preprocess most of the data, except for chunks which are duplicated in original files." I think you should preprocess those chunks too.
    37 replies | 1699 view(s)
  • mpais's Avatar
    16th November 2019, 00:11
    I don't know why I even bother, but here goes.. @Shelwien: >I actually did look yesterday. You probably didn't look very hard. >Its full of unimportant details like specific data structures, It's.. a specification for a format, i.e., it should describe in detail how to read an archive and interpret its data structures.. :rolleyes: >but doesn't provide a solution for new features like multi-layer dedup. >(Currently we can either dedup data, then preprocess, or preprocess then dedup; >there's no option to fall back to original data when archiver detects that >preprocessing hurts compression on specific chunks of data.) >Here's a more specific example: >- We have a set of exe files >- Exe preprocessor turns relative addresses in exe files to absolute >(it usually improves compression by creating more matches within the exe) >- There're chunks of code which are the same in original files, >but become different after preprocessing. >Normally there're only two options - either we preprocess a file, or we don't. >But here the optimal solution would be to preprocess most of the data, >except for chunks which are duplicated in original files. It doesn't even mention preprocessors, since those would just be codecs, usually the first ones in a codec sequence for a block type. And it was specifically designed to do just what you mentioned. First, deduplication occurs at the file level, then later on it's done at the content level (if a parser splits a block, we perform deduplication on those new sub-blocks). After all parsing is done, the plan was to run a rep-like (or such) deduplication stage on any non-specific blocks, so they could still be further divided. So in your example, if the common code chunks were large enough to merit deduplication, they'd be deduped before any exe-preprocessor would even be called. > there's no option to fall back to original data when archiver detects > that preprocessing hurts compression on specific chunks of data.) Nothing in the format prevents that. Each block uses a codec-sequence. The user could specify that, for instance, on 24bpp images, we should try 2 sequences for every block: one that includes a color-space transform before the actual compression, and one that doesn't; and we just keep the best. Only compression would be slower, decompression would be unaffected. @Bulat: >1. overall format is very weak and doesn't include many features from 7z/arc. It was never meant to include everything and the kitchen sink, it was supposed to be efficient at allowing pratical (re)compression of many useful formats and allow the operations that 99% of users may want to do. The idea was that it would provide a skeleton that users here could use to test out their own codecs without having to worry about writing parsers, dedupers, archiving routines, etc. >I recalled that I already wrote about it and my comments were rejected. I'm gonna have to call you out on this one, sorry. In the Community Archiver thread, we discussed how you'd prefer a DFS versus my BFS parsing strategy, and how the archive itself doesn't care and both can be available to the user. It was also important to discuss how to handle the potential for exponential storage needs when parsing, and I proposed a fixed intermediary storage-pool as that seems like the only realistic solution since we can't expect the end-user to simply have infinite storage. The prototype code was a quick-hack to see if it could be done, and it could. 
I have a (non-public) completely rewritten version that solves the problem of having lots of memory allocations and temporary files, by using a pre-allocated hybrid pool of a single memory block and a single temporary file, so that much is done. In the Fairytale thread itself, you said: >basic stuff: >- everything including info you saved in the archive header should be protected by checksums As proposed. >- it will be great to have archive format that can be written strictly sequentially. this means that index to metadata block(s) should be placed at the archive end As proposed. >- in 99.9% cases, it's enough to put all metadata into single block, checksummed, compressed and encrypted as the single entity Every global data structure is a checksummed single block. They would be stored uncompressed to make for easier reading by external tools, which seems like good-practice if the format was to get any traction outside this forum. >- all metainfo should be encoded as SoA rather than AoS Since we'd be using variable-length integers, most of the benefits of SoA over AoS aren't really available. If you know you have 10 structures to read, and the first array contains 10x the size field, you can't just read 10*sizeof(..) bytes. >- allow to disable ANY metainfo field (crc, filesize...) - this requires that any field should be preceded by tag. but in order to conserve space, you can place flags for standard set of fields >into the block start, so you can still disable any, but at very small cost As proposed. >- similarly, if you have default compression/deduplication/checksumming/... algorithms, allow to encode them with just a few bits, but provide the option to replace them with arbitrary >custom algos As proposed. So how was your input ignored? >Since authors have no experience of prior archive development, we can't really expect anything else Nice ad-hominem, I get it, I'm not part of the "old-guard" of this community, so therefore, what could I possibly know? Also on that Fairytale post: >I have started to develop nextgen archive format based on all those ideas, but aborted it pretty early. Nevertheless, I recommend to use it as the basis and just fill in unfinished parts of my spec. You do realize that could just as well be a quote from me? At least I tried, and didn't make myself out to be some uber-expert that had all the answers. I said I couldn't do it alone, which was as true then as it is now. Trying to write multi-plataform/processor archive handling routines in C++ is completely daunting and I have no motivation to spend months (if I'm being optimistic) trying to learn enough to get it right. @all: This is exactly what made me decide to quit posting any stuff here. All I see here is talk. Lots of talk, very little if any action. It's so, so easy to just sit on the sidelines and just criticize others.
    37 replies | 1699 view(s)
  • Shelwien's Avatar
    15th November 2019, 22:33
    Hosting did it "for security". Should work now, I removed the redirect.
    46 replies | 2899 view(s)
  • Bulat Ziganshin's Avatar
    15th November 2019, 21:14
    on the fairytale format - I studied v1.2 and then recalled that I had already done this a year ago.
1. The overall format is very weak and doesn't include many features from 7z/arc. I recalled that I had already written about it and my comments were rejected. Since the authors have no experience of prior archive development, we can't really expect anything else.
2. The only FTL thing lacking in 7z/arc is the FTL block structure. The format they developed serves a very specific need - an archive-wide catalogue of chunks at any compression stage. It's the sort of feature that may sometimes improve compression ratios. It could be supported in 7z/arc by adding another type of meta-info block, so there's no need to introduce a new archive type.
    37 replies | 1699 view(s)
  • Mike's Avatar
    15th November 2019, 21:06
    @Shelwien, please allow non-HTTPS connections on nishi.dreamhosters.com! Not everyone needs encrypted communications...
    46 replies | 2899 view(s)
  • Shelwien's Avatar
    15th November 2019, 19:46
    Can you also do something with this https://github.com/schnaader/precomp-cpp/blob/master/precomp.cpp#L129 ? Like in the patch I made here: http://nishi.dreamhosters.com/rzwrap_precomp_v0a.rar (it's a blockwise MT wrapper for precomp) http://nishi.dreamhosters.com/u/precomp.cpp.diff Aside from overlapping names of tempfiles, there was another problem - how to stop precomp from asking questions.
    37 replies | 1699 view(s)
  • schnaader's Avatar
    15th November 2019, 18:57
    Got another development version here, this time I multithreaded reconstruction (-r) of JPG. This involved all the nasty multi-threading stuff (threads, mutexes, std::future), so there might be some bugs, especially when combined with other streams and recursion, so the preferred test case would be to throw all kinds of stuff together and try this version on it. Unfortunately, both packJPG and packMP3 use all kinds of global variables and so are not thread safe, so this only speeds up brunsli-encoded JPEGs, but not packJPG-encoded ones or MP3s. For JPG-heavy files like this MJPEG video, JPG recompression scales with the number of cores:
28.261.596 WhatBox_720x480_411_q25.avi
20.676.282 WhatBox_720x480_411_q25.avi.pcf, 13 s, 1174 JPG streams (precomp -cn)
28.261.596 WhatBox_720x480_411_q25.avi_ , 10 s (precomp048dev -r, old version)
28.261.596 WhatBox_720x480_411_q25.avi_ , 3 s (precomp048dev -r, new version)
    37 replies | 1699 view(s)
  • Shelwien's Avatar
    15th November 2019, 18:41
    For strong LZ codecs with large window, complete recompression like reflate is not really practical, at least not for automated use. Like, in case of lzma, there're plenty of versions which produce subtly different outputs, compression levels differ in hard to detect options like "fb", etc. And lzmarec experiment showed that entropy-level recompression (like what's used for jpegs, mp3s etc) is only able to gain 1-2%, mostly because tokens are already optimized for specific entropy coder. But there's still another option that I missed when writing lzmarec - deduplication.
http://nishi.dreamhosters.com/u/token_dedup_demo_0.rar
  768,771 BOOK1.2
  536,624 wcc386.1
  535,311 0.7z            // book1,wcc386
  535,140 00000000.lzma   // lzma stream extracted with lzmadump
1,462,669 00000000.rec    // token stream decoded from lzma stream
  535,194 1.7z            // wcc386,book1
  535,023 00000001.lzma
1,464,420 00000001.rec
  276,869 0+1.dif         // bsdiff.exe 00000000.rec 00000001.rec 0+1.dif
  168,564 0+1.dif.7z      // 7z a "0+1.dif.7z" "0+1.dif"
1,070,732 0.7z+1.7z.7z    // solid compression
  867,579 0.rec+1.rec.7z  // solid compression of token streams
  703,875 = 535311+168564 // 0.7z+"0+1.dif.7z"
Of course, its possible to further improve this result by integrating dedup into lzmarec, or by creating an archiver with universal support for this kind of deduplication.
    0 replies | 144 view(s)
  • JamesB's Avatar
    15th November 2019, 16:48
    Haha, one of my bits of test data I played with was indeed ONT signals. :-) I explored compression of this back in 2014, but at the time the company didn't seem interested. More recently they realised they could do far better than just relying on HDF5 by doing zigzag + streamvbyte + zstd. That still didn't match my own method, but it's not far off and was at least pretty standard.
My own approach noted two things.
1) Delta alone isn't best. Delta against a smoothed signal when the jumps are small, and just store as-is when the jump is large, because the signal tends to oscillate around a specific level and then switch to another level (with more oscillations due to errors, etc).
2) The delta doesn't fit well with standard zigzag + a naive 7-bit encoding. Indeed the distribution was heavily skewed such that maybe 99% of deltas are within 1 byte, but only maybe 70% are within +/-64 (a made up guess - I forget the actual figure). Thus I had 255 as escape for 16-bit value and 0-254 for a 1 byte delta.
I wasn't interested in minimising the delta signal in one step though, just in reducing the entropy such that rANS afterwards would give the smallest size. My encode loop was this trivial hack (input is declared as bytes, but is actually 16-bit LE ints):

#include <stdint.h>
#include <stdlib.h>

#ifndef MAXD
#define MAXD 15
#endif

unsigned char *encode(unsigned char *data, uint64_t len, uint64_t *out_len) {
    uint64_t i, j;
    signed short *in = (signed short *)data;
    unsigned char *out = malloc(len*2);
    int p = 0;

    len /= 2;
    for (i = 0, j = 0; i < len; i++) {
        int d = in[i]-p;
        if (d >= -127 && d < 127) {
            out[j++] = d;
        } else {
            out[j++] = -128;
            d = (d<0) ? -d*2+1 : d*2;
            out[j++] = d&0xff;
            out[j++] = (d>>8)&0xff;
        }

        // Slight differences imply average with previous value to
        // smooth out noise, while large differences imply a big jump
        // so use this value as predictor.
        if (d >= -MAXD && d <= MAXD)
            p = (p + in[i])>>1;
        else
            p = in[i];
    }
    *out_len = j;
    return out;
}

Edit: on the ONT signals from vbz, with MAXD=30 this comes out at 1574907 before entropy encoding and 1272300 after going through rANS. It's no better with zstd (1279381). They're all much of a muchness frankly, so speed matters more when they're all around the 1.3MB mark. My experiments aren't particularly fast given that transform, but I didn't attempt to optimise it as there was little interest at the time from the ONT community.
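For completeness, a possible decode counterpart to the encode() above - my own sketch, not from the original post - assuming the -128 escape byte and MAXD < 254, so an escaped delta always resets the predictor exactly as in the encoder:

#include <stdint.h>
#include <stdlib.h>

#ifndef MAXD
#define MAXD 15
#endif

/* Inverse of encode(): no bounds checking, mirroring the encoder's style.
   Returns the reconstructed samples; *out_len is the byte length (2 per sample). */
signed short *decode(unsigned char *comp, uint64_t comp_len, uint64_t *out_len) {
    signed short *out = malloc(comp_len * sizeof(*out)); /* worst case: 1 compressed byte per sample */
    uint64_t i = 0, j = 0;
    int p = 0;
    while (i < comp_len) {
        signed char c = (signed char)comp[i++];
        if (c != -128) {
            int d = c;                                 /* small delta, stored as-is */
            out[j] = (signed short)(p + d);
            p = (d >= -MAXD && d <= MAXD) ? (p + out[j]) >> 1 : out[j];
        } else {
            int z = comp[i] | (comp[i + 1] << 8);      /* zig-zagged 16-bit delta   */
            i += 2;
            int d = (z & 1) ? -((z - 1) >> 1) : (z >> 1);
            out[j] = (signed short)(p + d);
            p = out[j];                                /* big jump: reset predictor */
        }
        j++;
    }
    *out_len = j * 2;
    return out;
}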
    50 replies | 19791 view(s)
  • dnd's Avatar
    15th November 2019, 15:21
    The integer compression functions are best suited to being used on their own. If you want to encode integer values with entropy coding, it is better to use a variable-length encoding like "extra bits", variable byte with some context, or something similar. You can also preprocess your data with a (delta/xor) transpose and then use lz or bwt compression. If you want to use an entropy coder, then it is better to add a byte RLE after the transpose.
Summary:
- variable-length integers like extra bits, variable byte, gamma, golomb coding, ... + (context) + entropy coder
- (delta/xor) transpose + lz or bwt
- (delta/xor) transpose + (rle) + entropy coder
In this 16-bit example, you can see that transpose encodes better than "variable byte" when lz or an entropy coder is used. vbz is using streamvbyte. TurboByte is similar to streamvbyte. And as usual, better to test with your own data.
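A minimal sketch of the first chain in that summary (delta, then zigzag, then variable byte), just to make the byte layout concrete; the function names are mine and this is not streamvbyte/TurboPFor code:

#include <stdint.h>
#include <stddef.h>

/* Map signed deltas to unsigned so small magnitudes stay small: 0,-1,1,-2,2 -> 0,1,2,3,4 */
static uint32_t zigzag32(int32_t v) { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }

/* Classic variable byte: 7 data bits per byte, high bit = "more bytes follow". */
static size_t vbyte_put(uint32_t v, uint8_t *out) {
    size_t n = 0;
    while (v >= 0x80) { out[n++] = (uint8_t)(v | 0x80); v >>= 7; }
    out[n++] = (uint8_t)v;
    return n;
}

/* delta -> zigzag -> variable byte; returns the number of output bytes. */
size_t delta_zz_vbyte(const uint32_t *in, size_t n, uint8_t *out) {
    size_t o = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = (int32_t)(in[i] - prev);   /* delta against the previous value */
        o += vbyte_put(zigzag32(d), out + o);
        prev = in[i];
    }
    return o;   /* an entropy coder (or lz/bwt) would then run over out[0..o) */
}

The output bytes would typically be handed to an entropy coder, or the transpose-based chains above would be used instead when lz/bwt follows.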
    50 replies | 19791 view(s)
  • introspec's Avatar
    15th November 2019, 12:53
    While really appreciating the technical achievement here, I wish she did not look like a bearded man quite so much!
    15 replies | 695 view(s)
  • Jyrki Alakuijala's Avatar
    15th November 2019, 00:03
    Even though such extreme compression is a special use case, JPEG XL does surprisingly well. One of the specialized contestants (img2twit) on the left, JPEG XL on the right at 347 bytes.
    15 replies | 695 view(s)
  • Krishty's Avatar
    14th November 2019, 23:31
    13145 bytes using Papa’s Best Optimizer (no arithmetic coding; removing all metadata).
    15 replies | 695 view(s)
  • JamesB's Avatar
    14th November 2019, 21:02
    It may be interesting to include a trivial byte entropy column (basic order-0, but maybe a simple O1 entropy calc too). For example, streamvbyte is indeed fast but leaves a lot of potential savings on the table. TurboPFor however produces a smaller file. However if both of them are passed through a trivial fast entropy encoder (eg FSE) then you may find the differences vanish. Of course it's also extra CPU cost, but if you're looking at space-efficient packing anyway possibly you're better off with something that doesn't lose the byte-frequencies in order to make a byte-based entropy encoder work well. Such things are done quite commonly in file formats, with a variety of bit-packing methods (beta, gamma, subexponential) or byte-packing methods prior to a secondary general purpose compression tool.
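A quick order-0 estimate of the kind suggested here could be computed like this (a small sketch, the function name is mine; an O1 variant would condition the counts on the previous byte with a 256x256 table):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Order-0 (byte-frequency) entropy of a buffer, returned as total bits.
   Divide by 8 to get the ideal size in bytes for a memoryless byte coder. */
double order0_entropy_bits(const uint8_t *buf, size_t len) {
    uint64_t freq[256] = {0};
    for (size_t i = 0; i < len; i++)
        freq[buf[i]]++;
    double bits = 0.0;
    for (int s = 0; s < 256; s++)
        if (freq[s])
            bits += (double)freq[s] * log2((double)len / (double)freq[s]);
    return bits;
}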
    50 replies | 19791 view(s)
  • CompressMaster's Avatar
    14th November 2019, 20:21
    CompressMaster replied to a thread VTEnc in Data Compression
    @vteromero, is your tool able to compress repeated, but random numbers (see below)? How much? 984165748916348965149746136498416167491603648941631441034984161654974163649416341941610674984631489614896416106749641630648964160364194164741034974163489416031949641314941649741649616461649461049419649164896167489461631064894614987410631064974160361048941638794615748956110374894163541036741603574891689416301657489641631034974896461037498416498416309416389674166504894645025848769848769856923856971425032412098532047869525869845269745812579624868552832658623589753867854696325879635236898557429865748165416479741674894167416789416578946741636416541974163164789678496116414741661646467479494961641674946164161165646494964649741679264358579456894674658549164801670460234946106165106064149610164160149610364416530641603496416103461030346510304641034896413616416316746141136416465174864164146741616641361064641616410346846489237685289859846536452687625896345876578625872456387638793687632783678536876372786378368726874378654274687245374201876587436246368746345348634834563685
    18 replies | 1532 view(s)
  • Shelwien's Avatar
    14th November 2019, 19:33
    > I don't feel like 7z provides me with everything I need right now (nor FA for that matter),
> so I use other tools too.
Maybe you need to learn more about the tools you have. Like, your example script in the previous post could be implemented with freearc alone. For MT precomp you can use this:
http://nishi.dreamhosters.com/rzwrap_precomp_v0a.rar
http://nishi.dreamhosters.com/u/precomp.cpp.diff (doesn't work with unmodified precomp!)

>> f.e. rip some features from BCJ2 and xflt3
> I know we all care about precomp bc here we are, so why not make the best of it?
It doesn't really work like that in this case. A good disasm filter would always be better than BCJ2 or x64flt3. The problem with dispack is that it's outdated... there's no x64 parser (it still works because CALL/JMP code is the same) and no support for new vector instructions.

> BTW, what is cdm?? I don't think I ever heard of it until yesterday
It's this: https://encode.su/threads/2742-Compressed-data-model?p=52493&viewfull=1#post52493
Kind of a universal recompressor for huffman-based formats.

>> doesn't provide a solution for new features like multi-layer dedup.
> Actually, it does. In Fairytale, preprocessing IS deduplication.
> Everything is fingerprinted so if a jpg is in the root folder,
> also inside a zip, also inside a pdf inside a tgz, it only gets processed once.
Not what I meant. Here's a more specific example:
- We have a set of exe files
- The exe preprocessor turns relative addresses in exe files to absolute (it usually improves compression by creating more matches within the exe)
- There're chunks of code which are the same in the original files, but become different after preprocessing.
Normally there're only two options - either we preprocess a file, or we don't. But here the optimal solution would be to preprocess most of the data, except for chunks which are duplicated in the original files. In this case, maybe a specific solution can be created, like integrating the exe filter into dedup and only running it on unmatched chunks. But ideally it has to be the job of the archiver and apply not only to cases with specifically written manual workarounds, but to anything.

>> there's no option to fall back to original data when archiver detects
>> that preprocessing hurts compression on specific chunks of data.)
> I believe this depends on the implementation.
Yes, currently it is handled by preprocessors. For example, Bulat's delta filter does some rough entropy estimation to see if delta-coding of a table improves compression. But this same problem applies to basically any codec and preprocessor, so I think that adding an approximate detector (and other things like stream interleaving) to every codec is not a good solution.

> Anyway, it's a draft, so improvements can and certainly should be made.
> Márcio himself said that multiple times.
I currently think that it's too early to design an archive format. We have to start with the codec API, MT task manager and stream interleaving. There're obscure but tricky parts like progress estimation (a dedup filter quickly caches all input data, then the archiver gets stuck at 99% for a long time because it has to actually compress the data; but during encoding we also can't calculate progress from compressed size, because we don't know the total compressed size), or memory allocation (it's a bad idea to let each codec allocate whatever it wants; ideally the task manager has to collect alloc requests from all nodes in the filter tree, adjust parameters where necessary, then do a batch allocation).
Based on that we can build a few standalone composite coders (like rep1/mp3det/packmp3c/jojpeg/lzma in 7zdll), then a few based on new methods (like stegdict+xwrt+ppmd or minhash+bsdiff+lzma), and only then we'd know the actual requirements for the archive format to support that.
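A rough sketch of the batch-allocation idea mentioned above: nodes of the filter tree report requests, the manager shrinks the adjustable parts (e.g. dictionary sizes) to fit a budget, then serves everything from one arena. All names and the alloc_req structure are hypothetical, not an existing archiver API:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocation request from one node of the filter tree: 'fixed'
   bytes it cannot run without, plus a 'flexible' part (e.g. dictionary or
   window) that the manager may shrink down to 'flexible_min'. */
typedef struct {
    size_t fixed;
    size_t flexible;
    size_t flexible_min;
    void  *mem;          /* filled in by the manager */
} alloc_req;

/* Collect all requests, shrink the flexible parts proportionally if the total
   exceeds the budget, then serve every node from a single arena allocation.
   Returns the arena (freed by the caller when the pipeline finishes), or NULL
   if even the minimum sizes don't fit. */
void *batch_allocate(alloc_req *req, int n, size_t budget) {
    size_t fixed = 0, flex = 0, flex_min = 0;
    for (int i = 0; i < n; i++) {
        fixed    += req[i].fixed;
        flex     += req[i].flexible;
        flex_min += req[i].flexible_min;
    }
    if (fixed + flex_min > budget)
        return NULL;                       /* this pipeline can't run at all */

    double scale = 1.0;                    /* fraction of the optional space we can grant */
    if (fixed + flex > budget)
        scale = (double)(budget - fixed - flex_min) / (double)(flex - flex_min);

    size_t total = 0;
    for (int i = 0; i < n; i++) {
        size_t extra = req[i].flexible - req[i].flexible_min;
        req[i].flexible = req[i].flexible_min + (size_t)((double)extra * scale);
        total += req[i].fixed + req[i].flexible;   /* adjusted, e.g. a smaller dictionary */
    }

    uint8_t *arena = malloc(total);
    if (!arena)
        return NULL;
    size_t off = 0;
    for (int i = 0; i < n; i++) {          /* hand each node its slice of the arena */
        req[i].mem = arena + off;
        off += req[i].fixed + req[i].flexible;
    }
    return arena;
}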
    37 replies | 1699 view(s)
  • Gonzalo's Avatar
    14th November 2019, 17:19
    We should have a "Holly Wars" thread for this sorts of things :D On a more serious note: @Shelwien: Maybe you're right. Maybe 7z is objectively better. With this things, there isn't always an empirical way of determining if a solution is the best, bc there are so many vectors. It's not just which file size is smaller anymore, we're talking about whether a particular program or chain of programs is best suitable for a particular case, and our cases are different enough for it to matter. I don't feel like 7z provides me with everything I need right now (nor FA for that matter), so I use other tools too. ----------------- Regarding codecs, instead of full-blown archivers, I really like Bulat's suggestion: I know we all care about precomp bc here we are, so why not make the best of it? I can definitely help testing. BTW, what is cdm?? I don't think I ever heard of it until yesterday ----------------- > doesn't provide a solution for new features like multi-layer dedup. (Currently we can either dedup data, then preprocess, or preprocess then dedup; Actually, it does. In Fairytale, preprocessing IS deduplication. Everything is fingerprinted so if a jpg is in the root folder, also inside a zip, also inside a pdf inside a tgz, it only gets processed once. After that, other kinds of deduplication can be applied to blocks, chunks or the whole thing. > there's no option to fall back to original data when archiver detects that preprocessing hurts compression on specific chunks of data.) I believe this depends on the implementation. The format itself doesn't care how the data got in the archive, only that it makes sense so it can be decompressed. Anyway, it's a draft, so improvements can and certainly should be made. Márcio himself said that multiple times. ----------------- PS: I like where this is going...
    37 replies | 1699 view(s)
  • nikkho's Avatar
    14th November 2019, 15:16
    13163 bytes using FileOptimizer.
12408 bytes using FileOptimizer with arith coding enabled.
    15 replies | 695 view(s)
  • dnd's Avatar
    14th November 2019, 13:46
    dnd replied to a thread VTEnc in Data Compression
    You can probably replace the recursion by an iteration using a stack. Search: https://www.google.com/search?q=recursion+iteration+stack
Not always better, but it's faster and more elegant to use a simple 1<<i instead of a lookup table.
- Direct access: as you are using a bit trie, I think it is possible to add a fast direct-access function to get the value at index i without decompression.
- Get next value equal or greater: given a value and a start index, search and return the next (index and value) that's '>='. See file vint.c in TurboPFor.
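A generic illustration of the recursion-to-iteration suggestion, using a hypothetical binary-trie node rather than VTEnc's actual structures; the stack is sized for a maximum depth known upfront, as mentioned in the thread:

#include <stddef.h>

/* Hypothetical binary-trie node, just to illustrate the transformation;
   not VTEnc's actual data structure. */
typedef struct node {
    struct node *child[2];
    long leaf_value;
} node;

/* Recursive version: sum the values stored in the leaves. */
long sum_leaves_rec(const node *t) {
    if (!t->child[0] && !t->child[1])
        return t->leaf_value;
    long s = 0;
    if (t->child[0]) s += sum_leaves_rec(t->child[0]);
    if (t->child[1]) s += sum_leaves_rec(t->child[1]);
    return s;
}

/* Same traversal with an explicit stack: no call overhead, and the stack can
   be a fixed array when the maximum depth is known upfront (for a trie over
   b-bit integers, b+1 slots are enough). */
#define MAX_DEPTH 64
long sum_leaves_iter(const node *t) {
    const node *stack[MAX_DEPTH + 1];
    int top = 0;
    long s = 0;
    stack[top++] = t;
    while (top > 0) {
        const node *n = stack[--top];
        if (!n->child[0] && !n->child[1]) { s += n->leaf_value; continue; }
        if (n->child[1]) stack[top++] = n->child[1];   /* right child visited later */
        if (n->child[0]) stack[top++] = n->child[0];   /* left child visited next   */
    }
    return s;
}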
    18 replies | 1532 view(s)
  • Shelwien's Avatar
    14th November 2019, 13:40
    My points are these:
1. People keep talking about freearc's greatness, but actually use it as a wrapper for external codecs, for example srep, xtool and lolz. Gonzalo's case seems to be different, but even weirder, since he runs precomp outside and just uses freearc for rep/delta/mm.
2. It'd be easier for me if people learned to use 7-zip properly, because I can add stuff to 7z, but not to freearc.
    37 replies | 1699 view(s)
  • Bulat Ziganshin's Avatar
    14th November 2019, 12:46
    Eugene, I think you miss the point - 7z had some great features, rar had some great features, and even winzip had a few. FA just combined them - not all, but it had more features simultaneously than any competitor.
    37 replies | 1699 view(s)
  • Shelwien's Avatar
    14th November 2019, 10:38
    > OK, sorry Christian for going off-topic I can always move it to a new thread if it continues. For now its not even really off-topic. >> so, besides global dedup, actual compression algorithms became not really relevant. > They are if precomp is involved. PDF are really compressible and susceptible to all kinds > of preprocessing and codec tuning. PNGs too and so on. Yes, but imho precomp is more of a proof-of-concept rather than a tool for practical use. For repeated automatic use you need to always verify correct decompression and provide a fallback compression script without it if something happens (like crash). Also, all the recompression libraries used by precomp are 3rd-party unmaintained code, so we might have a problem if somebody makes a remote execution exploit for packmp3, or when x86 becomes obsolete. As to PNG/PDF... precomp is certainly the best tool atm, but there's a lot of potential for improvement, like that FLIF idea. > Aniskin's plugins are great, but he himself has said that 7z internal structure makes the symbiosis too messy. Uh, no, "7z internal structure" is unrelated here. Its not like any archiver can provide a codec API with random seek. > For example, the latest Mfilter, which I think is great, has to write a temp file > for every single thing it wants to process. Way subpar IMHO. > It's not anybody's fault, it's just 7z is not the best format, to say the least. Nope. In this case its the fault of jpeg format designers. Jpeg is a very ugly format - to start, there's no certain way to even detect it, except "start decoding from here and see if it eventually works out". There's no specific signature, ~5 different ways to embed other jpegs, and "progressive mode" which is incompatible with sequential recompression. Because of that, brunsli/lepton/packjpg prefer to load/parse the whole file first, then recompress it in a second pass. Still, there're pjpg and jojpeg that can recompress non-progressive jpegs sequentially. But tempfiles in mfilter and precomp are just a lazy workaround. Jpeg files are not large enough that they won't fit in RAM. > I also tried your 7zdll but I didn't put too much attention into it bc it > looked more like an experiment in the moment and I was also afraid of its legal status. Well, that's unfortunate. > Isn't it part of Power Archiver? Do you work for them? Is it your company? Yes, yes, no. > Bottom line, can we use the program w/o infringing copyrights? You can even use a trial of full GUI PA, I don't see any reasons for caution. Can you name a program which can't be legally used? The only real case is when you sign NDA and then leak the software. Though sure, its an experimental tool for testing, rather than a supported product. But there's no difference with precomp in that sense. I don't advertise it mainly because there's no automated codec switching - we actually have some, but its on GUI side. But 7zdll is atm the best solution for mp3 recompression, may be pretty good for pdfs with reflate/jojpeg, there's cdm, x64flt3 is better than BCJ2, etc. > Maybe now, after years of FA being dead and lots of people trying to address 7z's shortcomings. "7z's shortcomings"? I can only name recovery records, too simple volume format and lack of repair function. But I never heard of anybody trying to fix that. And who's "lots of people"? There's flzma author, Stephan's EcoZip, Aniskin and 7zdll. Am I missing something? Otherwise .7z was and is the most efficient and flexible archive format. 
Sure, .arc is close and may be better at some points, but lack of stream interleaving and multi-output API is a deal-breaker. While its quite possible to add all of freearc's codecs to .7z. > * File sorting and grouping Was introduced by rar, 7z kinda has it too. As for grouping, it was always possible to write a script to add different filetypes with different codecs by multiple 7z calls, and if you want a text config, there's now smart7z or something. But 7z always had file analysis stage, it just wasn't used for anything except for BCJ2 and maybe wavs. > * Content aware compression See above, even baseline 7z actually can detect exes by signatures, same always could be done for other filetypes. Rar is better in that sense because it has an option to switch codecs within a file, its a real problem. But afaik its a problem both for .7z and .arc too. > * Different kinds of deduplication Yes, but it was always possible to attach srep or something to 7z too, just nobody bothered. And now (since 2016 or so) there's also 7zdll with rep1. > * Dedicated lossless compressor for audio What's there, tta? Afaik its only applicable to whole files and not like Bulat made it. Then, zip has wavpack, 7z has delta, rar has its own detection and audio compression. Also nz and rz are currently much better for that anyway. > All of it automatic. It was just the perfect solution, > something both faster and stronger than the competitors. I've only seen it used as a batch wrapper for external codecs. Dedup introduction was certainly a breakthrough, but afaik there's no special support for it in .arc format, so it can be used just as well anywhere. > And this wasn't even the greatest thing about it. It was the mindset... > zip, 7z and company are too simplistic. winzip actually was the first to introduce jpeg recompression. Currently it has wzjpeg,wavpack,lzma,xz codecs, so it might beat freearc on some data sets. > It only lets you chose one and that's it. Just one per run. > I mean, really? That's soo not efficient. Unfortunately its also true for freearc (or at least as true as for 7z). Did you ever see auto-assigned 7z profile for exes? its actually something like this: "m0=BCJ2 -m1=LZMA:d25 -m2=LZMA:d19 -m3=LZMA:d19 -mb0:1 -mb0s1:2 -mb0s2:3". Of course, there's potential for improvement. For example, there's currently no option to run global dedup, then compress some files with lzma and others with ppmd. But its the same for 7z and freearc. > If you do use it with srep and precomp - that's a wrapper. > and rep is more efficient on smaller inputs. Uh, I think you should test it. Bulat added some new modes there, like "future-LZ", which actually improve compression. > $ wine arc a -m0=rep:500m+probably_some_other_filter > $ wine 7-Zip-Zstandard_7z.exe a OUT -mfb=273 -myx=9 -m0=flzma2 -mf=on -mqs=on -md=128m -mx9 -ms=on IN.arc > That's the most efficient configuration for me. Strong enough, fast enough, pareto frontier (backed by experiments). Yeah, it really doesn't seem like you need freearc here. > That's a good thing, but cspeed remains slow for lzma-like ratios. Not really, you can try faster modes in oodle/brotli/zstd. And its impossible to get fast encoding speed with larger-than-cpu-cache window. > You might want to look at Márcio's draft for the Fairytale format. I'm no expert, > I just remember a lot of knowledgeable people saying it was the closest thing to an ideal format. I actually did look yesterday. 
Its full of unimportant details like specific data structures, but doesn't provide a solution for new features like multi-layer dedup. (Currently we can either dedup data, then preprocess, or preprocess then dedup; there's no option to fall back to original data when archiver detects that preprocessing hurts compression on specific chunks of data.) > That last scheme with the diffs seems a lot like a poor man's preflate for other formats. Yes, but surprisingly its also frequently used for speed. Like, we extract a million of <=64kb chunks individually compressed with LZ4. In such a case it might be faster to run solid compression of chunk data as a single file and fix encoder output differences with a patch.
    37 replies | 1699 view(s)
  • vteromero's Avatar
    14th November 2019, 10:23
    vteromero replied to a thread VTEnc in Data Compression
    That's very useful, thanks @dnd! The tree depth is known upfront, but not the number of different paths, so not sure how I can do that... Yes, inlining functions is one of the things I had in mind. Nice trick, I'll try that. Interesting... Is it always better to use bitwise operations instead of lookup tables? I don't know what you mean here. Definitely, will do that.
    18 replies | 1532 view(s)
  • Gonzalo's Avatar
    14th November 2019, 04:38
    OK, sorry Christian for going off-topic > All the popular formats of documents and images are already compressed, so, besides global dedup, actual compression algorithms became not really relevant. They are if precomp is involved. PDF are really compressible and susceptible to all kinds of preprocessing and codec tuning. PNGs too and so on. > but I'd be more interested in your evaluation of my 7zdll and Aniskin's plugins You mean this? ​ Aniskin's plugins are great, but he himself has said that 7z internal structure makes the symbiosis too messy. For example, the latest Mfilter, which I think is great, has to write a temp file for every single thing it wants to process. Way subpar IMHO. It's not anybody's fault, it's just 7z is not the best format, to say the least. I also tried your 7zdll but I didn't put too much attention into it bc it looked more like an experiment in the moment and I was also afraid of its legal status. Isn't it part of Power Archiver? Do you work for them? Is it your company? Bottom line, can we use the program w/o infringing copyrights? > Comparing to what? vs zip,rar - yes, vs 7z - not really. > I don't see what we could take from .7z or .arc for development of state-of-art archive format. Maybe now, after years of FA being dead and lots of people trying to address 7z's shortcomings. I'll give you just a few examples of things that didn't exist at the moment on a user friendly archiver: * File sorting and grouping * Content aware compression * Different kinds of deduplication * Dedicated lossless compressor for audio All of it automatic. It was just the perfect solution, something both faster and stronger than the competitors. And this wasn't even the greatest thing about it. It was the mindset... zip, 7z and company are too simplistic. Take a file, pass it through THE codec. Write to the disk. Up until today, official 7z doesn't capitalize on the fact that it's got a lot of codecs included. It only lets you chose one and that's it. Just one per run. I mean, really? That's soo not efficient. > If you do use it with srep and precomp - that's a wrapper. No I don't. I lost interest on Srep long time ago bc I don't really need it (I don't have 10 GB archives) and rep is more efficient on smaller inputs. Regarding precomp, I always use the latest commit via shell scripts, mostly to run it on parallel > If you use only integrated codecs - freearc compression is probably neither best or fastest. Agree! Let me show you what I use it for and how: //pseudo code $ some_script_applying_precomp_multithreaded $ wine arc a -m0=rep:500m+probably_some_other_filter $ wine 7-Zip-Zstandard_7z.exe a OUT -mfb=273 -myx=9 -m0=flzma2 -mf=on -mqs=on -md=128m -mx9 -ms=on IN.arc That's the most efficient configuration for me. Strong enough, fast enough, pareto frontier (backed by experiments). ​> so only new codecs made after the change are efficient now - their compression is similar to lzma, but decoding is 10x faster. That's a good thing, but cspeed remains slow for lzma-like ratios. That's where things like rep are truly helpful, to speed up compression. > My own choice for now is to collect/write relevant new codecs and add them to .7z. > Then, once I have my own framework that can replace 7-zip's (MT task manager, stream interleaver, codec API), I can start migrating to a new format. You might want to look at Márcio's draft for the Fairytale format. I'm no expert, I just remember a lot of knowledgeable people saying it was the closest thing to an ideal format. 
About the 'repacker' thing: Fair enough, I guess I do things like that. I just don't like the label. That last scheme with the diffs seems a lot like a poor man's preflate for other formats. I did something like that for lzx streams a while ago just to see if it yielded any gains. It did, a lot...
    37 replies | 1699 view(s)
More Activity