Why does 128 look better than 256 to you?
I'd try to lengthen it still.
In some scenarios, even a 128K window might be too large (small chunk compression).
However, as a general rule, a larger window (dictionary) means better compression, especially if we are talking about byte-aligned LZ coders like LZ4 or some versions of LZO. At the same time, at some point we must start using variable-length codes for offset (match distance) coding, and IMO 64K-256K is the limit if we code the distance in a fixed number of bits. In addition, the bigger the dictionary, the slower the compressor.
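To make the fixed-bit limit concrete, here is a minimal sketch of how far back a match can reach for a given offset width. The helper name `max_distance` is made up for illustration and is not part of LZ4.

```c
#include <assert.h>
#include <stdint.h>

/* Largest match distance addressable with a fixed-width offset field.
   Hypothetical helper, for illustration only. */
static uint32_t max_distance(unsigned offset_bits)
{
    assert(offset_bits > 0 && offset_bits < 32);
    return (UINT32_C(1) << offset_bits) - 1;
}
```

With 16 offset bits (the LZ4 case) this gives 65535, i.e. a 64 KB window; with 18 bits it gives 262143, i.e. a 256 KB window.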
I have tested many LZ4 modifications, and came up with the Enhanced LZ4 - LZ4 with 256 KB window (vs 64 KB) - LZ4X or LZ4v2
It's like the Deflate vs Enhanced Deflate (Deflate64).
Actually, everything is pretty much the same as with the original LZ4 legacy frame (the same file extension .lz4), except:
New magic number - possibly "LZ4X"
Block size is 16 MB
Match distance of 0 is either unused or represents the EOF marker
The same file structure: compressed size, compressed data, etc.
The main difference - window size = 256 KB
To make this possible we modify the "Token" byte as follows:
rrr oo lll
3 highest bits represent the literal run length (7 means read one extra byte etc., 255 means read another one etc.)
2 middle bits represent the highest bits of the match distance
3 lowest bits represent the match length (7 means read one extra byte etc.)
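The "read one extra byte" rule could be sketched as follows. This assumes the same extension scheme as LZ4 (extra bytes are summed, and a byte of 255 means one more follows); the function name and signature are hypothetical, and any minimum match length bias is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a run/match length from a 3-bit token field, assuming the
   LZ4-style extension rule: a field of 7 is followed by extra bytes that
   are summed, and a byte of 255 means one more byte follows.
   `in` points at the bytes after the token; *used reports bytes consumed. */
static size_t read_length(unsigned field, const uint8_t *in, size_t avail,
                          size_t *used)
{
    size_t len = field;
    size_t i = 0;
    assert(field <= 7);
    if (field == 7) {
        uint8_t b;
        do {
            assert(i < avail);  /* truncated input is a caller error here */
            b = in[i++];
            len += b;
        } while (b == 255);
    }
    *used = i;
    return len;
}
```

For example, a field of 7 followed by the bytes 255, 3 decodes to 7 + 255 + 3 = 265 and consumes two bytes, while a field of 5 decodes to 5 and consumes none.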
With such a Token structure we can extract the fields efficiently:
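int Token=GET_BYTE();
if (Token>=32) // non-zero literal run field
{
    int LiteralRun=Token>>5;
    // ...
}
int MatchPos=Pos-((Token&0x18)<<13); // two highest bits of the distance
MatchPos-=GET_BYTE();                // low byte of the distance
MatchPos-=GET_BYTE()<<8;             // middle byte of the distance
if (MatchPos==Pos) // distance 0: EOF
    // ...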
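For the encoder side, a rough sketch of packing the token and rebuilding the 18-bit distance might look like this. The function names are illustrative and not taken from LZ4X; the byte order (low byte first) follows the decoder snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* Token layout from the post: rrr oo lll
   rrr = literal run field, oo = distance bits 17..16, lll = match length
   field. Function names are illustrative, not from LZ4X. */
static uint8_t pack_token(unsigned run, uint32_t dist, unsigned len)
{
    assert(run <= 7 && len <= 7 && dist < (UINT32_C(1) << 18));
    return (uint8_t)((run << 5) | ((dist >> 16) << 3) | len);
}

/* Rebuild the 18-bit distance from the token and the two bytes that
   follow it (low byte first). */
static uint32_t unpack_distance(uint8_t token, uint8_t lo, uint8_t hi)
{
    return ((uint32_t)(token & 0x18) << 13) | ((uint32_t)hi << 8) | lo;
}
```

For a distance of 0x2ABCD, the token carries the top two bits (binary 10) and the following bytes are 0xCD and 0xAB; `unpack_distance` puts them back together.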
Other than that, encoder shouldn't be slower at the same strength. With a greedy parser and HT match finder, speed should be the same. If you use stronger techniques and get lower efficiency than with a small dict, you're doing it wrong.
Sure, the same algorithm will be slower, but it will be stronger. And if overall efficiency drops (still talking about files that can benefit from a larger dict), the implementation is wrong, not the dict size.
As far as I'm concerned, I believe the most important thing is to ensure user expectation consistency.
If a file is branded `*.lz4`, it should be compatible with other lz4 tools, and hence respect the official interoperable format.
There is no problem creating a fork introducing a different format.
That's what inikep did with lz5. I welcome such initiatives.
But please make sure users understand it's a different format (for example, by emphasizing the X of lz4x, or using a completely different name).
Then, selecting between 128 or 256 KB window size becomes a specific format discussion.
I suspect 256 KB will usually give a better compression ratio, although this statement should be backed by tests.
256 KB is also the size of a typical L2 cache, so it should remain relatively fast to decompress.
Such a choice also expects a "large" input, such as a big file, to deal with.
From an archiver perspective, it makes sense.
From a library perspective, less so.
Why? Because most LZ4 real-world applications work on small data blocks (<= 64 KB).
Some testing results for LZ4 with 256 KB window, for future reference:
enwik8: 100000000->38324398
enwik9: 1000000000->337785382
dickens: 10192446->3845494
samba: 21606400->5760769
webster: 41458703->12381074
xml: 5345280->634061
world95.txt: 2988578->815206
book1: 768771->324652
calgary.tar: 3152896->1118982
3200.txt: 16013962->6291766
mptrack.exe: 1159172->632586
reaktor.exe: 14446592->3022800
photoshop.exe: 19533824->9186191
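To compare these results across files of different sizes, it may help to express each one as a percentage of the original size. A small sketch (the helper name `ratio_percent` is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Compressed size as a percentage of the original size. */
static double ratio_percent(uint64_t original, uint64_t compressed)
{
    assert(original > 0);
    return 100.0 * (double)compressed / (double)original;
}
```

Using the figures above, enwik8 comes out at about 38.32% of its original size and xml at about 11.86%.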
Well, I think I will keep the LZ4X as an LZ4 legacy frame compatible compressor then. Better release an improved CRUSH compressor I guess. And if you release something, I will follow you...
How about having 2 bytestreams with different window sizes and either switching after some size threshold or just indicating the choice with a flag? Lots of complexity, but it allows having both good small block and large block strength with good compression and decompression speed.
What do you mean by "advanced string parsing" to find the longest match? Do you use suffix arrays with search acceleration or have you invented something new and faster?
As I already wrote:
Normal ("c") and High ("ch") modes = Hash Chain match finder with Greedy parsing.
Extreme ("cx") mode = Binary Tree match finder with Optimal parsing (some people call this variant Dynamic Programming; in other words, it is something far more advanced than a generic Storer&Szymanski optimal parse).
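For readers unfamiliar with the terms, here is a minimal sketch of a hash-chain match finder driving a greedy parse. It is not the LZ4X implementation: the names, the hash function, `MAX_CHAIN` and the choice to only count matched bytes (instead of emitting tokens) are all assumptions made for the example, and inputs are assumed to be smaller than 2 GB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MIN_MATCH  4
#define HASH_BITS  16
#define WINDOW     (1u << 18)  /* 256 KB, as discussed in this thread */
#define MAX_CHAIN  16          /* candidates tried per position (arbitrary) */

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* Greedy parse with a hash-chain match finder. Returns the number of input
   bytes covered by matches; the rest would be coded as literals.
   Returns (size_t)-1 on allocation failure. */
static size_t greedy_matched_bytes(const uint8_t *in, size_t n)
{
    int32_t *head = malloc(sizeof(int32_t) << HASH_BITS);
    int32_t *prev = malloc(sizeof(int32_t) * (n ? n : 1));
    size_t pos = 0, matched = 0;
    if (!head || !prev) { free(head); free(prev); return (size_t)-1; }
    memset(head, 0xFF, sizeof(int32_t) << HASH_BITS);  /* -1 = empty slot */

    while (pos + MIN_MATCH <= n) {
        size_t best = 0;
        int32_t cand = head[hash4(in + pos)];
        /* Walk the chain of earlier positions with the same hash. */
        for (int depth = 0;
             cand >= 0 && pos - (size_t)cand < WINDOW && depth < MAX_CHAIN;
             depth++) {
            size_t len = 0;
            while (pos + len < n && in[(size_t)cand + len] == in[pos + len])
                len++;
            if (len > best) best = len;
            cand = prev[cand];
        }
        /* Greedy: take the longest match found right away, else a literal. */
        size_t step = best >= MIN_MATCH ? best : 1;
        if (best >= MIN_MATCH) matched += best;
        /* Insert every position we step over so later matches can find it. */
        for (size_t end = pos + step; pos < end; pos++) {
            if (pos + MIN_MATCH <= n) {
                uint32_t h = hash4(in + pos);
                prev[pos] = head[h];
                head[h] = (int32_t)pos;
            }
        }
    }
    free(head);
    free(prev);
    return matched;
}
```

On the 16-byte input "abcdabcdabcdabcd" the first four bytes go out as literals and the remaining 12 are covered by a single overlapping match at distance 4. An optimal parser would instead evaluate the cost of alternative match/literal choices before committing.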
Honestly, I planned this program expecting a notably higher compression ratio gain compared to LZ4HC...
The heuristic parse in LZ4HC is very impressive. Yann has identified very well the cases that need to be tested to find an ideal parse. Going from there to a full-fat optimal parse is generally not much win.
In more complex formats it's almost impossible to find those rules (and of course with entropy coding it's impossible).
Honestly, I planned this program expecting a notably higher compression ratio gain compared to LZ4HC...
Well, you did better than I expected was possible.
At some point I started writing an optimal LZ4 compressor (strength-wise, without any speed optimisations). I haven't finished it, I realised the output was more different from what I had than I expected and I quit. I was not willing to spend those few extra evenings.
Working really hard on a new release (v1.02) right now. What's new:
Improved compression ratio, once again. Now I keep an eye on long literal runs more precisely. New ENWIK9 result is 372,068,631 bytes
Notably faster decompression
Slightly higher compression ratio with "ch" mode
Faster default compression with some compression loss
And for parsing comparison:
book1 result is 359,284 bytes (LZ4 compatible)
With no magic number and LAST_LITERALS=0 (should we call it PAD or PADDING_LITERALS?)
-> 359,279 bytes
and if we are not storing a compressed size:
-> 359,275 bytes
I will wait some time for feedback and then decide whether to release it, under which license, and on which hosting platform (SourceForge/GitHub)
FWIW, I don't really see much reason to host on SourceForge, even now that it is under new ownership and no longer bundles adware with downloads.
I'd love to suggest GitLab; it has a good UI, an excellent feature set, and it's open-source software. However, if you want to try to encourage community involvement GitHub is the way to go… you'll almost certainly get more publicity, bug reports, patches/pull requests, etc. hosting there than anywhere else.
One thing I don't like about GitHub is monoculture. On two fronts.
First, it's nearly a monopoly among source hosting sites.
Second, (unlike its competitors) it's not VCS-agnostic. Its near-monopoly on the hosting front makes the leading VCS a near-monopoly by itself.
And monoculture is not good.
Another thing I don't like about it is git, but that's more of a personal matter.
License... you surely know the traditional arguments about strong copyleft, weak copyleft etc. Yet you still ask. I wonder what sort of answer you need.
For starters I'll give you an important, but less stressed in the press, summary of licenses based on who prefers which license.
* (L)GPL3 - Many Linux users like it. Corporations strongly don't. BSD users strongly don't. GPL2 users weakly don't.
* (L)GPL2 - Many Linux users like it. Corporations strongly don't. BSD users weakly don't. GPL3 users weakly don't.
* AGPL2 - Nobody likes it
* AGPL3 - Few Stallman supporters like it.
* Apache2 - The most preferred corporate license. Some Linux users don't like it and some BSD ones don't either, but overall there are few gripes.
* Public Domain - Non-lawyers tend to like it, aside from that see MIT below.
* MIT/3-clause BSD - Some Linux users don't like it and some corporations don't either, but overall there are very few gripes.
* Custom - depends on the text, but hardly anyone likes them
Please note that I don't mention Windows and Mac users specifically. That's because the general trend is "I don't care". Nevertheless, these ecosystems have so much mass that outliers who do care are numerous enough to matter. What do they tend to prefer? I'll tell you: I don't know.
You used to pick public domain. I think it's a great choice. If you have more specific questions, ask.
Wow, those are some pretty big generalizations (many of which I disagree with). Also, I think people should choose a license based on what the license does, not some group's opinion of it. Instead of trying to say which groups like which licenses, it would be better to just have a quick summary of each one…
GPLv2 — if you use the code then you must release your code under similar terms.
LGPLv2 — people can use the code even in proprietary software, as long as any changes they make to the code are released under similar terms.
AGPLv2 — The GPL is triggered by distribution: if you distribute GPL-licensed software to someone then you have to offer the source code, but if you never distribute the code and instead use it to create a service which people interact with over the network, you don't have to offer the source code. With the AGPL you do.
(A|L)GPLv3 — the big thing these add is a prohibition on tivoization (releasing the source code under the GPL but preventing modified versions from running). There is also some patent stuff.
Apache 2 — permissive license which allows integrating the code into proprietary software. It includes a patent license grant, which prevents people from releasing open-source software then suing users for patent infringement.
3-clause BSD — It's short, and simple enough for non-lawyers to grok; just read it. Basically, people can do whatever they want as long as they follow the rules in those clauses (keeping the copyright notice in the code, adding the full text of the license to documentation, and not using the name(s) of the copyright holders to promote the software without prior written permission).
MIT — Also short and readable. Basically, "do whatever you want, I'm not liable".
ChooseALicense.com has some good (easy to understand) information on several licenses.
Originally Posted by m^2
You used to pick public domain. I think it's a great choice. If you have more specific questions, ask.
I completely disagree with this. All the licenses you listed have very good arguments in favor of them, but there is basically no benefit to public domain over MIT. For those who don't know what public domain is: when you create something it is automatically copyrighted, and a public domain dedication is an attempt to opt-out of copyright protection. That sounds great on the surface, but there are some very big practical issues:
Legally ambiguous — Most countries don't really have the concept of opting out of copyright as part of their laws, so lawyers have a good reason to be nervous about this one. See https://creativecommons.org/about/cc0/.
Most public domain dedications don't include a disclaimer of liability (note: CC0 does), so if a bug in your code causes some company's data to be corrupted, they could conceivably sue you for damages.
If you really want to use a public domain dedication, use CC0; simply saying that you wish for the software to be in the public domain is insufficient.
ChooseALicense.com has some good (easy to understand) information on several licenses.
I can't argue with that exact phrase, but overall I consider this site a terrible resource, for inaccuracy.
In particular, implying that MIT and Apache2 users don't care about sharing improvements is so grossly wrong that I want to scream whenever I see someone advocating this site.
Originally Posted by nemequ
I completely disagree with this. All the licenses you listed have very good arguments in favor of them, but there is basically no benefit to public domain over MIT.
There are 2 benefits:
* it's the simplest and the most understandable license
* it puts less legal burden on user
Not that either of these is a big difference, but for me that makes a difference.
I can't argue with that exact phrase, but overall I consider this site a terrible resource, for inaccuracy.
In particular, implying that MIT and Apache2 users don't care about sharing improvements is so grossly wrong that I want to scream whenever I see someone advocating this site.
Looking at just the main page (which I think is what you're talking about, if not let me know), I don't think it does that at all. Certainly no more than it implies that MIT and GPL fans don't care about patents, or Apache 2/GPL don't care about simplicity or permissiveness. Maybe the wording should be changed to make it clearer that it is asking what you care about most, but TBH until you objected just now it didn't even occur to me that someone might read it as "you get to choose one of these aspects, and the others are completely ignored".
Originally Posted by m^2
There are 2 benefits:
* it's the simplest and the most understandable license
No, it's not. It may look the simplest at first glance (at least to a non-lawyer) if you just use a very simple statement that you wish for the code to be in the public domain, but the truth is vastly more complicated. A simple statement is very complicated legally, and the effects could be completely different in different jurisdictions. Quoting the "About CC0" page:
Dedicating works to the public domain is difficult if not impossible for those wanting to contribute their works for public use before applicable copyright or database protection terms expire. Few if any jurisdictions have a process for doing so easily and reliably. Laws vary from jurisdiction to jurisdiction as to what rights are automatically granted and how and when they expire or may be voluntarily relinquished. More challenging yet, many legal systems effectively prohibit any attempt by these owners to surrender rights automatically conferred by law, particularly moral rights, even when the author wishing to do so is well informed and resolute about doing so and contributing their work to the public domain.
The only PD dedication that I'm aware of which actually looks pretty solid legally (note: IANAL) is CC0, and it is much more complicated and difficult to understand than the MIT license. The "Public License Fallback" section is particularly telling; the first two sections try to place the work in the public domain, but the license is so skeptical about how successful that would be that they basically include something very MIT-like as a fallback. Even lawyers can't figure out the effects of a public domain dedication… definitely not the "simplest and most understandable license".
It's also not a license, but I'm going to assume that you're just using that word for convenience.
* it puts less legal burden on user
No, it doesn't. It leaves the user in a legally ambiguous situation, which is a huge burden.
It's also worth pointing out again that the vast majority of PD dedications don't include a disclaimer of liability. I don't know about you, but the idea of being sued for damages because something I generously gave away for free isn't absolutely perfect is a pretty big problem. Again, CC0 is an exception; if you really want to try to go PD that's definitely the way to do it.
Looking at just the main page (which I think is what you're talking about, if not let me know), I don't think it does that at all. Certainly no more than it implies that MIT and GPL fans don't care about patents, or Apache 2/GPL don't care about simplicity or permissiveness. Maybe the wording should be changed to make it clearer that it is asking what you care about most, but TBH until you objected just now it didn't even occur to me that someone might read it as "you get to choose one of these aspects, and the others are completely ignored".
Seriously? They are a website that's meant to help in a license choice. The catch phrase is "I care about sharing improvements". This clearly means that if you care about sharing, you should pick GPL. Which clearly means that if you care about sharing you should not pick the alternatives. At which point am I wrong?
The catch phrase should be the most important license property that distinguishes it from the others. Here it's something that doesn't help distinguish it at all. But it does help in making a choice, though a misinformed one.
Originally Posted by nemequ
No, it's not. It may look the simplest at first glance (at least to a non-lawyer) if you just use a very simple statement that you wish for the code to be in the public domain, but the truth is vastly more complicated. A simple statement is very complicated legally, and the effects could be completely different in different jurisdictions. Quoting the "About CC0" page:
The only PD dedication that I'm aware of which actually looks pretty solid legally (note: IANAL) is CC0, and it is much more complicated and difficult to understand than the MIT license. The "Public License Fallback" section is particularly telling; the first two sections try to place the work in the public domain, but the license is so skeptical about how successful that would be that they basically include something very MIT-like as a fallback. Even lawyers can't figure out the effects of a public domain dedication… definitely not the "simplest and most understandable license".
You know what? MIT has all the same legal issues as simple public domain dedications. In my country (Poland) it's not a valid license at all(*), so all BSD / MIT software that I use I use illegally.
But most ignore that and feel comfortable with the license, yet moan about public domain.
Being legally bullet-proof is strictly impossible. You can get very near at the cost of a huge complexity or at many points in between.
It applies to public domain as well as to any other license. But equally bullet proof license that has simpler terms will always be simpler. Strictly simpler.
Originally Posted by nemequ
It's also not a license, but I'm going to assume that you're just using that word for convenience.
I call "public domain" any work that is either:
* not copyrighted
* licensed w/out any restrictions
regardless of how strong the legalese of the license is.
Originally Posted by nemequ
No, it doesn't. It leaves the user in a legally ambiguous situation, which is a huge burden.
Yes, it does. Ambiguity is orthogonal to actual terms and PD terms are strictly less stringent. After all, BSD requires attribution and PD doesn't.
Originally Posted by nemequ
It's also worth pointing out again that the vast majority of PD dedications don't include a disclaimer of liability. I don't know about you, but the idea of being sued for damages because something I generously gave away for free isn't absolutely perfect is a pretty big problem. Again, CC0 is an exception; if you really want to try to go PD that's definitely the way to do it.
I didn't speak about how to do the PD dedication. Frankly, I don't advocate any choice as I see all of them as broken in some way. But this point made me interested. Are you aware of any open source developer that got sued by a user? The risk definitely is there, it would be best to quantify it, but it would be good to at least have *some* information on it.
Ad (*):
My country requires all copyright licenses to explicitly state all fields of use and does not consider "any use" to be a valid field. You are required to list "use, copying, storing" and a number of other fields recognised in the doctrine.
Actually this means BSD (or public domain or GPL) terms are impossible to express because the list of fields of use may be amended with the world development. It happened in our lifetime when courts decided that publishing on the internet is different from regular publishing; therefore all licenses that were unlimited before the internet suddenly got a limitation. If you think that's freaking broken, welcome to the law.
(Sorry for the delay, I forgot about this discussion, and I don't check encode.su all that often, especially when I'm busy…)
Originally Posted by m^2
Seriously? They are a website that's meant to help in a license choice. The catch phrase is "I care about sharing improvements". This clearly means that if you care about sharing, you should pick GPL. Which clearly means that if you care about sharing you should not pick the alternatives. At which point am I wrong?
You're oversimplifying. You can want multiple things, the question is where the priority lies. If your highest priority is that people share their improvements then you should probably choose the GPL.
The catch phrase should be the most important license property that distinguishes it from the others. Here it's something that doesn't help distinguish it at all. But it does help in making a choice, though a misinformed one.
Maybe a better idea would be filling out a form with a handful of questions (kind of like what Creative Commons does)… copyleft or permissive, patent grant or no, etc.
You know what? MIT has all the same legal issues as simple public domain dedications. In my country (Poland) it's not a valid license at all(*), so all BSD / MIT software that I use I use illegally.
[citation needed]. I've been doing this for a long time, and I've never heard anything like that. If that is true, I really hope someone sues the Polish government for copyright infringement just to push them to change the law.
But most ignore that and feel comfortable with the license, yet moan about public domain.
Most people don't ignore that, most people have never heard that. Most people also don't live in a country with such an absurd law^H^H^Hthat particular absurd law. The public domain has some very real problems in pretty much every country. Just because MIT may (honestly, I'm having a very hard time believing this) not work correctly in one jurisdiction doesn't put them on equal footing.
Being legally bullet-proof is strictly impossible. You can get very near at the cost of a huge complexity or at many points in between.
It applies to public domain as well as to any other license. But equally bullet proof license that has simpler terms will always be simpler. Strictly simpler.
And a simple public domain dedication is known to be extremely weak. MIT, OTOH, is a short and simple license which achieves basically the same thing that a PD dedication tries to, but is generally considered to be fairly "bullet proof".
I call "public domain" any work that is either:
* not copyrighted
* licensed w/out any restrictions
regardless of how strong the legalese of the license is.
Then you're using the term incorrectly. Licensing something without any restrictions doesn't put it in the public domain, though in practice the effect is similar. Public domain means the work is not copyrighted, and everything is copyrighted automatically these days (since the Berne Convention, IIRC) so public domain basically means the copyright has expired. In most (possibly all) jurisdictions there is no legal framework for placing something in the public domain other than waiting for it to expire, hence the problem with short public domain licenses. Saying "I place this in the public domain" doesn't mean it actually is in the public domain. Depending on jurisdiction, it may well be closer to saying "All rights reserved." (if a PD dedication has no effect, and copyright is automatic…)
Yes, it does. Ambiguity is orthogonal to actual terms and PD terms are strictly less stringent.
Let's assume for a second that a PD dedication has no effect; in that case the work is still copyrighted and you have not provided a license. If you don't have a license, what you can do is determined by copyright law, which is extremely stringent. Basically, the only things you're allowed to do are things which fall into the "fair use" category.
The ambiguity means there is a possibility that you're opening yourself up to significant liability, which is a pretty big burden.
I didn't speak about how to do the PD dedication. Frankly, I don't advocate any choice as I see all of them as broken in some way. But this point made me interested. Are you aware of any open source developer that got sued by a user? The risk definitely is there, it would be best to quantify it, but it would be good to at least have *some* information on it.
No, luckily I'm not. And I'm not about to volunteer to test it.
You're oversimplifying. You can want multiple things, the question is where the priority lies. If your highest priority is that people share their improvements then you should probably choose the GPL.
That's highly questionable. Ask Yann how his contributions have been since he switched to BSD.
Some time ago I saw this in their bug tracker; the issue has been open basically since they launched. There are many possible changes proposed, but the staff ignored the problem entirely.
Originally Posted by nemequ
And a simple public domain dedication is known to be extremely weak. MIT, OTOH, is a short and simple license which achieves basically the same thing that a PD dedication tries to, but is generally considered to be fairly "bullet proof".
(...)
Then you're using the term incorrectly. Licensing something without any restrictions doesn't put it in the public domain, though in practice the effect is similar. Public domain means the work is not copyrighted, and everything is copyrighted automatically these days (since the Berne Convention, IIRC) so public domain basically means the copyright has expired. In most (possibly all) jurisdictions there is no legal framework for placing something in the public domain other than waiting for it to expire, hence the problem with short public domain licenses. Saying "I place this in the public domain" doesn't mean it actually is in the public domain. Depending on jurisdiction, it may well be closer to saying "All rights reserved." (if a PD dedication has no effect, and copyright is automatic…)
Let's assume for a second that a PD dedication has no effect; in that case the work is still copyrighted and you have not provided a license. If you don't have a license, what you can do is determined by copyright law, which is extremely stringent. Basically, the only things you're allowed to do are things which fall into the "fair use" category.
The ambiguity means there is a possibility that you're opening yourself up to significant liability, which is a pretty big burden.
Take MIT, remove restrictions and you have what is basically a simple PD dedication, every bit as strong legally as the MIT license.
Yes, I use the term in a way that's not fully correct, yet it is:
* understandable, to the extent needed by non-lawyers
* in use by many people, yourself included (you called CC0 a PD dedication a few posts before, though in most countries it's a license)
I'd like to see a court case where someone who has placed their work in the public domain in a way that's not recognised in some country sues their users in that country for unlicensed use. It's legally possible, but an extremely contrived case, and the contradiction between stated intent and actions may be the reason for the court to recognise an implied license. Or may not.