
Thread: copy of "Source code for Razor compressor" thread

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts

    copy of "Source code for Razor compressor" thread

    I accidentally deleted the thread while attempting to move it to another subforum.
    Here's a copy for now. html: https://encode.su/thread_3264.html

    Update: No, there's no source inside :)
    Attached Files

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    779
    Thanked 687 Times in 372 Posts

  3. #3
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    156
    Thanks
    39
    Thanked 46 Times in 26 Posts
    Since I cannot answer there, just my 2c as an end-user, a developer, and often, a Linux user:

    Windows, Photoshop, etc are commercial business-oriented applications. They make most of their revenue by charging for support. Would anyone use them if they were advertised as "at your own risk"? How much confidence would you have in Photoshop if it only supported a proprietary format and the author seldom provides support or an upgrade path? Consumers for this market segment put a priority on upgrade path: if old .psd files become useless every time a new version comes out, no one will use it for any serious purposes. I used to work with image processing, and whenever I wanted to know how something was implemented in Photoshop (i.e. auto-normalization of colour channels), a quick Google search would show answers. Adobe wouldn't come after anyone for reverse engineering image processing solutions because they did not impact Photoshop sales. Even if Photoshop is never updated again, GIMP has an open-source .psd loader, which mostly works, but nothing can load most proprietary archival formats other than the original programs.

    Data compression and archival systems have a completely different purpose. If I compress an archive on Windows, and later on the original executable is no longer usable (e.g. on Linux or an incompatible version of Windows), then wouldn't that compressor be a waste? WinRAR avoided this entirely by publishing the decompressor as open source with the condition that it would not be used to reverse-engineer a rar-compatible compressor. WinACE's format is also available online. I don't have to worry that my archived files will become useless one day.

    LZMA, PAQ, balz and bzip2 have been the most useful compression sources I've read (among others, in no particular order). Had they not been taken apart and explained, with or without the original author's permission, I wouldn't have been in this field. I don't agree at all that theft of ideas is why innovation is being reduced.

    My opinion is that insights into the algorithms and details would have been useful, especially compared to the vague descriptions in the Razor thread. In this case it isn't even a complete and compatible reverse-engineering, but just a partial decompilation to show the compression and decompression routines of a closed-source, free-to-use, Windows-only compressor. While everyone is free to issue terms as they wish, if we all switched to executable-only releases, we could go back to trying to one-up everyone else with our hidden knowledge.

    If an archival program is not maintained and its file format is proprietary (use at your own risk?), it is useless for an end-user. If any details about its implementation are vague at best, it is useless for educational purposes. What would the purpose of such a non-commercial program be, other than advertising the original developer's skills in benchmarks? Has he lost anything in real terms (e.g. money, a patent on inventing ROLZ, etc.)? Can the authors of PAQ, balz, etc pursue him for not being original?

    As for me, I'll go back to my current reverse-engineering effort, and hope no one comes after me for reverse-engineering a proxy DirectDraw dll for 1990s Windows games that cannot run properly anymore.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    Here's a question: what do you think about nanozip reverse-engineering?
    Can we host sources and a public discussion thread of that?
    If yes, why? Because only Christian has a fanclub?
    Sami is supposedly dead, but maybe he faked it? We don't have any legal proof of his death.
    And even if he is, copyright can be inherited.
    Does the fact that Sami participated in ccm reverse-engineering change anything?

  5. #5
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    101
    Thanks
    93
    Thanked 18 Times in 17 Posts
    Quote Originally Posted by cade View Post
    Since I cannot answer there, just my 2c as an end-user, a developer, and often, a Linux user:

    Windows, Photoshop, etc are commercial business-oriented applications. They make most of their revenue by charging for support. Would anyone use them if they were advertised as "at your own risk"? How much confidence would you have in Photoshop if it only supported a proprietary format and the author seldom provides support or an upgrade path? Consumers for this market segment put a priority on upgrade path: if old .psd files become useless every time a new version comes out, no one will use it for any serious purposes. I used to work with image processing, and whenever I wanted to know how something was implemented in Photoshop (i.e. auto-normalization of colour channels), a quick Google search would show answers. Adobe wouldn't come after anyone for reverse engineering image processing solutions because they did not impact Photoshop sales. Even if Photoshop is never updated again, GIMP has an open-source .psd loader, which mostly works, but nothing can load most proprietary archival formats other than the original programs.

    Data compression and archival systems have a completely different purpose. If I compress an archive on Windows, and later on the original executable is no longer usable (e.g. on Linux or an incompatible version of Windows), then wouldn't that compressor be a waste? WinRAR avoided this entirely by publishing the decompressor as open source with the condition that it would not be used to reverse-engineer a rar-compatible compressor. WinACE's format is also available online. I don't have to worry that my archived files will become useless one day.

    LZMA, PAQ, balz and bzip2 have been the most useful compression sources I've read (among others, in no particular order). Had they not been taken apart and explained, with or without the original author's permission, I wouldn't have been in this field. I don't agree at all that theft of ideas is why innovation is being reduced.

    My opinion is that insights into the algorithms and details would have been useful, especially compared to the vague descriptions in the Razor thread. In this case it isn't even a complete and compatible reverse-engineering, but just a partial decompilation to show the compression and decompression routines of a closed-source, free-to-use, Windows-only compressor. While everyone is free to issue terms as they wish, if we all switched to executable-only releases, we could go back to trying to one-up everyone else with our hidden knowledge.

    If an archival program is not maintained and its file format is proprietary (use at your own risk?), it is useless for an end-user. If any details about its implementation are vague at best, it is useless for educational purposes. What would the purpose of such a non-commercial program be, other than advertising the original developer's skills in benchmarks? Has he lost anything in real terms (e.g. money, a patent on inventing ROLZ, etc.)? Can the authors of PAQ, balz, etc pursue him for not being original?

    As for me, I'll go back to my current reverse-engineering effort, and hope no one comes after me for reverse-engineering a proxy DirectDraw dll for 1990s Windows games that cannot run properly anymore.
    I was using and writing software for others back in the 80's, so my comfort level with "use at your own risk" was fine then just as much as it is now. When in doubt, I test it on a virtual machine and use a process monitor to watch every aspect of what it does if I have any feeling of discomfort with it. I would and have treated Adobe Photoshop just like Gimp, just like shareware and freeware from the days of bulletin board systems (BBS), Usenet, and even Fidonet. I'm sure several others from the same era or before did the same then and still do today.

    I've never once paid for support from Adobe, Microsoft, or any of the other major companies, and most people I know of (even those who are non-programmers) never have either. It wasn't until you had a certain type of people who were sold the notion of computers as a miracle pill for everything that you had "mandatory paid support" coming about as any type of necessity. Prior to that, we figured things out for ourselves...and although there is nothing wrong with and no shame in using or needing paid support, I would like to think that the best of us still do try to figure things out and not pay or try to buy our way out of that learning experience whenever possible.

    Because Adobe wanted to sell their products and not worry about issues with PSD file changes, they did what just about any other sensible developer does: implement backwards compatibility with previous versions. This is not rocket science, it's just common sense along with good business, and not shooting yourself in the foot with attempts at innovation that don't really even need to be there. That said, there are a lot of people who refused to use newer versions of Photoshop when they went to the "software as a service" model years later, and got along fine with Adobe Photoshop 7 and never wanted or needed to upgrade unless they were forced off of it.

    If Photoshop were never updated again, then you'd never have to worry about the standard changing or any open-source program needing to support updates to it! Why not consider that as well while we are on this? If it doesn't change and isn't changed a billion times by everyone who wants to implement their own version of it BECAUSE it is "open-sourced", you actually end up with a very streamlined and COMPATIBLE file format that isn't going to be overshadowed or mucked up a thousand times a day because everyone and their brother across the world wants to make another version of it. By NOT having everything out there for people to see, you might actually get a standard stable enough that everyone CAN be on the same page with it, which is unintentionally something provided by all of those closed-source formats over the years (even if they went public years later with the specs and algorithms for them).

    If people keep releasing pointless rehashes that are useless but "open-source" and therefore expect fanfare, the same irrelevance comes not just to the format they use, but to the very program itself, along with bugs left unfixed and nothing but a total mess no one wants to clean up. At that point, you often either have to get a consensus to choose one and fix it, paid or for free, or...you avoid that mess from the start by having a talented team of programmers make a closed source product that free ones can be based upon later. The commercial producers of software are not ALL monsters, and some WILL give you the goods for their formats. I mean heck, the WOTSIT file format archive existed for years and a lot of those specs were contributed by the programmers of the closed source programs themselves. It wasn't there so that a billion shitty versions could pop up overnight across the world and want fanfare because they were "open source", but more for the hobbyist and anyone really needing to know the format for the industry to use without issue.

    Of course Data Compression is a different animal than imaging and other productivity software, but that doesn't mean the results from community work will have any different outcome upon it. Winrar having a free decompressor was done out of necessity, as was Winace. But do you know what would happen if it were closed source and no one had a free decompressor? You'd have seen very small-scale virtual machines that emulated the closed source compressor or decompressor reverse-engineered anyway (since people don't respect copyrights for anything anymore), and people would have just transferred or converted the data to a new format with a VHD or VDSK virtual disk middle-man in a VM. That means there is STILL no point to being forced to give away your intellectual property, if that's a possibility or a result whenever people want it badly enough and you would rather keep your own plans for what you made. It also means there would never be a moment that you couldn't transfer data to another form, or that a compression archive would ever become "useless" or unreadable. If it was ever important enough to make waves, it would be on the radar and people would be actively using it, and if anyone is, there is going to be someone who makes a way to keep reading it or to get data back from it if it's ever required.

    Whenever I didn't have something on Linux or BSD, or if I didn't have something in DOS or Windows, I would just use a shared partition or (later on) an exFAT partition as a go-between, even if a virtual machine wasn't available. If I didn't have the DJGPP tools installed, I'd just shift it to another partition or another system on the shared network and do it remotely from that terminal, then copy it back if I had to do it that way even without a virtual machine. The BeBox I had (BeOS) and OS/2 Warp were two other examples of systems that needed that. If I had to grep or awk something, I was going to have to use Linux to do it, and neither OS/2 nor BeOS had enough native tools to do everything from the same OS at the time. You adapt, you figure out the answers to those puzzles, and you apply them. At least, that's what we used to do?

    It isn't up to a community to decide the merit or the intention of a programmer with their own work. No one should do that to you, to me, to Shelwien, to Bulat, to Christian, to Matt, to Michael, or to a hundred other people viewing this forum. Who are we to decide and determine what an author should or should not do with their own things? Isn't that rather pretentious and contradictory to an individual's work and desires they have for it?

    I'm still trying to wrap my head around how and why people think they are entitled to what anyone else makes if they are not freely offering it?

    At what point did we go from honorable programmers and engineers who appreciated and supported our colleagues by respecting their wishes while working together with them to achieve great things...to this Entitlement Society where everything MUST be given freely as demanded - or else?

    It's a flawed argument to assume that an archiver becomes useless just because you don't have the source to it, when you can easily virtualize the commercial version and always translate it and repack it or leave it in any form you want, or that anyone actually needs paid support for commercial graphics programs. Before I had access to the GIF format or JPEG, I saved files in BMP and used proprietary compression that was between GIF and JPEG in size anyway. Did I really NEED to be able to have GIF and JPEG file specs? No, because any graphics program free or paid worth a darn was usually able to read and save them in any format you wanted after that for your programs to use even if you didn't have access to it yet.

    Having specs at times can make things easier, at the risk of ruining greater things and directions in the future that an author has for them. Having everything I WANT with it, does not entitle me to it or make it more than I NEED to get the job done. Perhaps people have become too spoiled today and think they should have everything because they value too much the desire to have, and not enough the desire of those who made it possible TO have?

    If everything becomes free, then nothing has value and the merit and virtue of working for things and earning them or even bothering to create them becomes lost in it.

    If a person has a personal interest in discovering how things work, then reverse-engineering becomes inevitable and I know that, we all know that. But there should be SOME form of respect for not doing it blatantly beyond the desire to know, and just assuming that because it's POSSIBLE to do, that everyone should have it done and that the hard work of an individual (even if it's based upon another person's) should be tossed to the wayside.

    Furthermore, when Christian came out with CCM, I was actually jealous of it and him that it worked as well as it did. But you know what? That's a personal problem for me to have felt that way, and while I should have been cheering people on to reverse engineer it to 1UP it or have it at my fingertips, I also secretly respected and appreciated that he did the work on his own, even with support from others here. That helped me say to myself: "Do I really have a right to feel that way? Or is it only my pride or ego hurt that I didn't put out something like it publicly first whether I had it or not?" I came to the conclusion that it was pride more than anything making me see it that way. I'm saying this because I'm sure there are others who might have felt the same, but by me saying this, they never have to because it's covered.

    Now, I say that not knowing whether they copied anyone else's work or not!

    Yes, no one knows for sure if he did or didn't, and no one even knows if that has any bearing upon why he would or would not want it to be reverse-engineered. IT IS POSSIBLE THAT HE COPIED SOME OF IT OR BASED IT OFF OF OTHERS' WORKS, WHICH MIGHT INVALIDATE EXCLUSIVITY TO SOME...and perhaps even invalidate the Copyright itself if it violates other people's copyrighted works in the process.

    But that said, I think that if that WERE the case, then those who reverse-engineered it would already know for sure and would have provided those correlations to us directly in the process. No such evidence was ever given following those events nearly 12 years ago, and it has remained speculation.

    As such, the only conclusion that I can draw from this is that people just wanted it all and wanted it now, rather than respecting the author's wishes to not do that for whatever their reasons might be.
    If we're going to be able to respect ourselves ethically, we have to at least respect others' wishes enough to not trample their rights regardless how advanced or primitive the product is.

    There is a point at which people stop being engineers and start becoming thieves, and it becomes a very fine line to walk when it comes to this.

    Maybe I'm wrong, maybe you're right. But am I? Change my mind?
    Last edited by JamesWasil; 27th December 2019 at 11:33.

  6. #6
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    101
    Thanks
    93
    Thanked 18 Times in 17 Posts
    Quote Originally Posted by Shelwien View Post
    Here's a question: what do you think about nanozip reverse-engineering?
    Can we host sources and a public discussion thread of that?
    If yes, why? Because only Christian has a fanclub?
    Sami is supposedly dead, but maybe he faked it? We don't have any legal proof of his death.
    And even if he is, copyright can be inherited.
    Does the fact that Sami participated in ccm reverse-engineering change anything?
    If he did pass away, then it would be up to either his will or his next of kin to decide upon the release of what he worked on to the community. That would be fair to him and to us, I think.

    It should not be a Christian fan club thing, it should be about respecting the rights of ALL developers who have a specific preference for their work. I don't feel or believe that anything should be shrouded in mystery to keep people out, and I think you know that already, but I do think that people have a right and a say over what they make, and over declaring its sources public or not.

    You offer a very sound and interesting question at the end when you ask if that changes anything. Yes, it does. If he were directly involved in reverse-engineering someone else's work that way, then why should his works remain protected? It creates a double standard at that point. If someone willfully violates another person's copyright without regard, then I'm not quite sure their requests to have things respected should remain standing when it comes to that. Maybe people will agree or disagree with that, but that's how I personally see it.

    Another approach to all of this might be to release things in a standard form, but without the enhancements and additional work to optimize them. For years, that was done with various programs and algorithms, and it helped to increase ingenuity and creativity to reach those levels, and give people something more to try for. One of the bad things about open source is that when everything is given away, the effort to overcome challenges and the reward for doing it on one's own becomes lost to the sea of quick and easy for the taking. When everything was mostly closed source, it made me want to try harder to make my own things to get around that. Now, perhaps that too has been lost to our generation today with where technology is at?

    For the benefit of demystifying compression algorithms and making other standards visible to everyone when needed for programs...are we robbing ourselves of the drive and motivation to keep doing innovative things in the end?

  7. #7
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    156
    Thanks
    39
    Thanked 46 Times in 26 Posts
    I used Photoshop as an example of a major copyright- and patent-protected application to show that they still permit interoperability and reverse-engineering of their file format, neither of which seems to pose any risk to their copyright or business model. If they wanted to silo themselves into their own little community, they would lose users to any alternative with a friendlier open file format.

    Yes, the necessity for WinRAR and WinACE to have open-source decompressors was so that they don't fade away into obscurity. Otherwise, to protect his IP, Eugene Roshal would have given instructions on how to make a VM.

    I needed a lossy image compression method without block edge discontinuities (like JPEG's), learned about wavelets and released one as open source. It didn't get much attention because it's a rare use-case, but I didn't lose anything. I cannot claim it as my own IP either, because I used a probability coder, embedded zero-trees and some wavelet types, none of which I invented myself, only implemented in a different way. At best, only the way in which I join those techniques together could be claimed as my own.

    As far as I have seen, no one denies who the original author of anything is in this community.

    Source code is copyrighted and ideas are patented. Given the terms and techniques he used (optimal parsing, ROLZ, probability coder, etc), it is not reasonable to assume he came up with everything on his own. He implemented them himself without copying source, but it's a bit extreme to call it a copyright violation when someone reproduces an approximation of the compression algorithm from publicly available machine code. Then again, closely guarded trade secrets are generally not published on the internet for the general public.

    As for figuring out how software is implemented... https://encode.su/threads/111-NanoZi...ll=1#post44245
    Sometimes other people try to figure out how something closed-source is implemented, but there was no fanfare there.

    But my question was simply: what's the point of such a program in this community of mostly developers and enthusiasts besides self-advertisement? There is no chance of adopting a closed and top secret file format, and no interest on his side to help anyone. No one is obliged to help anyone else, but then why bother?

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    Well, presumably Christian wants to eventually make a popular commercial archiver like winrar and sell it.
    And meanwhile he needs feedback and testing for his codecs and stuff, so he posts them here.
    Also compression benchmarks are somewhat of a sport, so there's some value in good compressors (even closed-source) simply as collectibles.
    Both topics are hopelessly out of fashion recently, though.

    Problem is, reverse-engineering is also a sport, so people will always look for good targets.
    When some interesting program is not updated for years, it will most likely happen.
    Even without decompiling, there are always indirect methods that allow identifying the details of the algorithm
    (like generating special sample files and testing)... which are also considered reverse-engineering.
    Incidentally, there's usually no difference for the author, since decompiled code is mostly only used for clues -
    people very rarely do incremental updates to others' work, see https://en.wikipedia.org/wiki/Not_invented_here
    So in that sense, there's really no difference whether it's decompiled or not.
    Decompiling mostly became popular because of Ida/hexrays, which make decompiling 10x easier now.
    Posting plain executables is now the same as posting obfuscated sources - it may not be immediately readable,
    but it is easy to transform into readable form with the right tools.
    And of course there're protection tools which make it harder (like vmprotect/denuvo).
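
    To illustrate the "special sample files" kind of probing: you can, for example, write files with two copies of a random block separated by a growing gap of zeros, compress each with the target program, and watch at which gap the output size jumps by the block size - that exposes the match window / dedup range without touching the binary. A minimal generator (file names and sizes here are arbitrary):
    Code:
    // Writes probe files: a random 64 KB block, a gap of zero bytes, then the
    // same block again. Compress each with the target compressor; the gap at
    // which the output grows by ~64 KB is roughly its maximum match distance.
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>

    int main() {
        const size_t BLOCK = 64 * 1024;
        std::vector<unsigned char> block(BLOCK);
        srand(12345);
        for (size_t i = 0; i < BLOCK; i++) block[i] = (unsigned char)(rand() & 0xFF);

        std::vector<unsigned char> zeros(1 << 20, 0);              // 1 MB filler chunk
        for (size_t gap = 1u << 20; gap <= (1u << 30); gap <<= 1)  // gaps from 1 MB to 1 GB
        {
            std::string name = "probe_" + std::to_string(gap) + ".bin";
            FILE* f = std::fopen(name.c_str(), "wb");
            if (!f) return 1;
            std::fwrite(block.data(), 1, BLOCK, f);
            for (size_t done = 0; done < gap; done += zeros.size())
                std::fwrite(zeros.data(), 1, zeros.size(), f);     // gap is a multiple of 1 MB
            std::fwrite(block.data(), 1, BLOCK, f);
            std::fclose(f);
        }
        return 0;
    }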

  9. #9
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    121
    Thanks
    31
    Thanked 2 Times in 2 Posts
    That's why there should be a commitment by computer tech giants like Google, Microsoft and Facebook to acquire/buy provably superior algorithms/compressors, and not lurk in the background reverse engineering said compressors to create their own.

  10. #10
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    204
    Thanks
    172
    Thanked 110 Times in 66 Posts
    Quote Originally Posted by Shelwien View Post
    Well, presumably Christian wants to eventually make a popular commercial archiver like winrar and sell it.
    And meanwhile he needs feedback and testing for his codecs and stuff, so he posts them here.
    if that's the case, then why are people worried about christian disappearing? after all, the forum users seem to be betatesters? what would be the benefit for the community?

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    Keep in mind that it's just an optimistic interpretation (for a change).
    The realistic one is that Christian is not dumb enough
    to believe in the possibility of success of a new archive format these days,
    but rather had just seen the LZNIB/LZNA descriptions and sources,
    and wanted to show off one last time, since his old codecs became obsolete
    in light of zstd and oodle.

    > if that's the case, then why are people worried about christian disappearing?

    Because it's more interesting with him than without?
    His tools usually take top places in any benchmarks,
    because they always have integrated preprocessors,
    which automatically makes them better than "pure" codecs.

    > what would be the benefit for the community?

    More activity?

  12. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    801
    Thanks
    244
    Thanked 255 Times in 159 Posts
    While razor shares adaptive nibble EC with LZNA (REed in 2016), it seems to have essentially better performance - can it be explained with preprocessing, BWT?
    Does this outcome have much in common with nanozip?

    Can we be certain that it does not contain some patentable new concepts - some patent vultures could currently try to get 20 year exclusivity for?

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > While razor shares adaptive nibble EC with LZNA (REed in 2016),
    > it seems to have essentially better performance - can it be explained with preprocessing, BWT?

    If you mean compression, it's mostly just ROLZ.
    http://mattmahoney.net/dc/dce.html#Section_525 (description)
    http://mattmahoney.net/dc/text.html (see ROLZ results)

    ROLZ normally has stronger compression than LZ77 at the cost of
    extra hashtable maintenance during decompression.

    ROLZ vs LZ77 difference is at most (1-243835/260763)*100=6.5%
    (that's lzma vs rz on book1; the difference is smaller for binary data).
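
    To show what that extra table maintenance looks like, a toy order-1 ROLZ match finder might be something like this (just the general idea, not rz's actual implementation):
    Code:
    // Toy order-1 ROLZ: match candidates at each position are the recent
    // positions that followed the same preceding byte, so only a small slot
    // index (plus length) gets coded instead of a full LZ77 offset.
    #include <cstdint>

    const int SLOTS = 16;        // offsets kept per order-1 context
    const int MIN_MATCH = 3;

    struct RolzTable {
        uint32_t pos[256][SLOTS] = {};   // recent positions, per previous byte
        int      head[256] = {};         // ring-buffer write index per context

        void insert(uint8_t ctx, uint32_t p) {
            pos[ctx][head[ctx]] = p;
            head[ctx] = (head[ctx] + 1) % SLOTS;
        }
    };

    // Returns the best match length (0 = emit a literal) and its slot index.
    static int find_match(const uint8_t* buf, uint32_t p, uint32_t end,
                          const RolzTable& t, int* slot_out) {
        uint8_t ctx = p ? buf[p - 1] : 0;
        int best_len = 0, best_slot = -1;
        for (int s = 0; s < SLOTS; s++) {
            uint32_t cand = t.pos[ctx][s];
            if (cand == 0 || cand >= p) continue;    // empty or invalid slot
            uint32_t len = 0;
            while (p + len < end && buf[cand + len] == buf[p + len]) len++;
            if ((int)len > best_len) { best_len = (int)len; best_slot = s; }
        }
        if (best_len < MIN_MATCH) return 0;
        *slot_out = best_slot;   // the slot index is what gets entropy-coded
        return best_len;
    }

    The decoder has to call insert() for every position too, to keep its copy of the table in sync - that's the extra hashtable maintenance during decompression.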

    Razor also has a relatively strong preprocessor for x86 code (BCJ2-like),
    and delta token types which help on binary data.

    > Does this outcome have much in common with nanozip?

    Not much, except for the general idea of preprocessing.
    Nanozip also has all the similar types of preprocessors
    (plus text preprocessors which Christian's codecs don't have).
    Btw, razor has delta tokens instead of standalone delta/audio/image preprocessors,
    but I'd still consider it an equivalent of preprocessing - it's still
    explicit special handling for specific data types.

    But nanozip codecs are less interesting - LZ is subpar (worse than lzma),
    CM is some lpaq derivative afaik.

    The main "selling point" in nanozip is actually BWT
    (which was the main focus of Sami's research; afaik nanozip has something
    like Maniscalco's m03 - BWT with original context recovery during postcoding)
    and runtime codec switching (which is essential because BWT is bad on binary data).

    > Can we be certain that it does not contain some patentable new concepts -
    > some patent vultures could currently try to get 20 year exclusivity for?

    Well, it has CDC dedup, LZNA-like rANS entropy coding, order1 ROLZ and
    16 special (mostly delta) token types:
    Delta1xU8, Delta2xU8, Delta3xU8, Delta4xU8, Delta1xU16, Delta2xU16, Delta1xU32, RGB, RGBA,
    ImagePred, Mono8, Stereo8, Mono16, Stereo16, Literals32, RawBytes.
    DeltaN here means N-byte step between values and %N positional context, otherwise
    values are simply subtracted.

    But, no, I don't think there's anything worth patenting (aside from interleaved vector rANS).
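
    For illustration, the DeltaN transforms amount to something like this (my reading of the general idea above, not rz's code; the %N positional context is a modeling detail and isn't shown):
    Code:
    // Generic delta filter over little-endian W-byte values with a step of
    // N values (e.g. n=2, w=2 for 16-bit stereo). Differences wrap modulo
    // 2^(8*w), so the inverse is exact.
    #include <cstdint>
    #include <vector>

    static uint32_t get_le(const uint8_t* p, int w) {
        uint32_t v = 0;
        for (int i = 0; i < w; i++) v |= (uint32_t)p[i] << (8 * i);
        return v;
    }
    static void put_le(uint8_t* p, uint32_t v, int w) {
        for (int i = 0; i < w; i++) p[i] = (uint8_t)(v >> (8 * i));
    }

    // Forward transform, in place: value[i] -= value[i - n], back to front.
    void delta_forward(std::vector<uint8_t>& buf, int n, int w) {
        size_t step = (size_t)n * w;
        size_t last = (buf.size() / w) * w;      // ignore a trailing partial value
        if (last < step + w) return;
        for (size_t i = last - w; i >= step; i -= w)
            put_le(&buf[i], get_le(&buf[i], w) - get_le(&buf[i - step], w), w);
    }

    // Inverse transform: add the differences back, front to back.
    void delta_inverse(std::vector<uint8_t>& buf, int n, int w) {
        size_t step = (size_t)n * w;
        size_t last = (buf.size() / w) * w;
        if (last < step + w) return;
        for (size_t i = step; i + w <= last; i += w)
            put_le(&buf[i], get_le(&buf[i], w) + get_le(&buf[i - step], w), w);
    }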

  14. Thanks (2):

    JamesB (28th December 2019), Jarek (28th December 2019)

  15. #14
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    I suspect the main ideas that make Razor so good are known, and to a large extent not a new invention either. Christian's clearly done a great job of combining everything into a single compressor, which makes it strong: ROLZ, interleaved rANS, lots of preprocessing and transformations. If anything I'd say the smarts are in the compressor to determine how best to mix these technologies together and tackle the large search space. I don't think there'd be any chance of patenting interleaved vectorised rANS now as it's been in the public for a while.

    So what to gain now from having a glimpse at the internals? Probably not much tbh. I'd respect him for the work and if he wants it closed source then that's his choice.

    As for my personal preference - IMO open source gives a lot more longevity, and my experience of support on commercial vs open source has been mixed, with the commercial side sometimes being substantially poorer. Some commercial entities doing backups and/or compression will escrow their source, so in the event the company folds or the technology ceases to be maintained, the users can obtain the source code and still access their files. That's probably the right way to do things if you want to go down the secrecy route.

  16. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > I suspect the main ideas that make Razor so good are known,
    > and to a large extent not a new invention either.

    Its not "so good", its basically not used anywhere except benchmarks.

    Sure, it might be the best choice atm for software installers,
    but even game repackers mostly keep using "lolz" (independently decompiled LZNA with filters)
    within freearc framework with srep and recompressors.

    Razor's integration of dedup and preprocessing is actually inconvenient for practical use -
    for example, it's pretty hard to use a better external exe preprocessor (like dispack)
    with rz, since it would still try to apply its own exe preprocessor and hurt compression.

    Also on custom data types (like game resources, databases, logs etc) rz compression
    is not anything special (a little worse than lzma usually).
    For example, I recently made a benchmark of various codecs for 8k blocks.

    86,100,833 // zpaq -m5 estimation (header size is subtracted)
    83,786,297 // zstd -22
    83,805,502 // rz.exe estimation
    80,437,160 // glza 0.11 estimation
    76,617,735 // ppmd_sh /o16
    64,121,600 // paq8pxd69 -s7 estimation

    > Christian's clearly done a great job of combining everything into a single compressor,
    > which makes it strong: ROLZ, interleaved rANS, lots of preprocessing and transformations.

    Sure, but it'd be nice to have a standalone ROLZ/rANS without preprocessors,
    just to evaluate the algorithm performance vs LZ77,BWT,PPM.

    Or a standalone preprocessor with all these features,
    since preprocessors are mostly made by the kind of people who want to win at any cost.
    They usually also don't like helping others or disclosing their sources,
    so it's pretty hard to set up a fair compression comparison of other codecs vs razor.

    > If anything I'd say the smarts are in the compressor to determine how best to mix these
    > technologies together and tackle the large search space.

    Not really, it just has the normal dynamic programming parsing optimizer.
    So "the smarts" are in having the patience to collect relevant samples of relevant datatypes
    and implementing handlers for them.
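
    For reference, "the normal dynamic programming parsing optimizer" is just a cheapest-path parse over the block, roughly like this toy sketch (the bit costs are placeholders, not rz's actual price model):
    Code:
    // price[] = cheapest cost to reach each position; every position is
    // extended by a literal or by any match from the match finder, and the
    // chosen token lengths are recovered by backtracking from the end.
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Match { int len; int dist; };

    // matches[i] = candidate matches starting at position i.
    // Returns the chosen token length at each token start (1 = literal).
    std::vector<int> optimal_parse(size_t n,
                                   const std::vector<std::vector<Match>>& matches) {
        const double INF = std::numeric_limits<double>::infinity();
        std::vector<double> price(n + 1, INF);
        std::vector<int>    step(n + 1, 0);
        price[0] = 0;
        for (size_t i = 0; i < n; i++) {
            double lit = price[i] + 8.0;                     // placeholder literal cost
            if (lit < price[i + 1]) { price[i + 1] = lit; step[i + 1] = 1; }
            for (const Match& m : matches[i]) {
                size_t j = i + (size_t)m.len;
                if (j > n) continue;
                double c = price[i] + 6.0 + std::log2((double)m.dist + 1);  // placeholder match cost
                if (c < price[j]) { price[j] = c; step[j] = m.len; }
            }
        }
        std::vector<int> out(n, 0);
        for (size_t j = n; j > 0; j -= (size_t)step[j])      // backtrack
            out[j - (size_t)step[j]] = step[j];
        return out;
    }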

    > So what to gain now from having a glimpse at the internals?

    Mostly which preprocessors are necessary for a fair comparison vs rz.
    Though there're also some interesting data structures in the ROLZ implementation - it's not the usual ht4.

    > That's probably the right way to do things if you want to go down the secrecy route.

    There're sufficient technical means for protection against reverse-engineering,
    especially driver-based since building a working windows driver requires signing it,
    which has legal consequences. Also there's intel SGX now.

    Other countermeasures include
    1) Using a unique framework, which lacks support in RE tools.
    (Like my coroutines, or non-pthread threads, or state machines, or unique compiler)
    2) Releasing binary libraries which can be used by GUI developers etc.

    Far fewer people would test a "protected" executable, and there's some performance drop, but that's the tradeoff.

  17. Thanks:

    Jarek (28th December 2019)

  18. #16
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    801
    Thanks
    244
    Thanked 255 Times in 159 Posts
    I wasn't aware of this "lolz"; the only one I could find is this one from ProFrager:
    https://translate.google.com/transla...%2Flolz.264%2F

  19. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts

  20. #18
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    - I think rolz is a little overrated. Because it uses order1 (in offset encoding),
    it's mostly good for text, and there bwt is much better. For binaries order1 rolz is bad.
    Of course, with preprocessing you have other factors that can improve binary compression.

    - Traditionally, rolz has better compression in fast modes (zlib 6 class) than lz77 at similar speed,
    but decompression is always slower. Lz77 can also be made fast at the same compression level,
    so this single advantage of rolz also disappears.

    - Optimal parsing is hard with rolz

    - Rans as used in razor adds no value for a compressor whose main purpose is archiving.
    Here a bitwise range coder compresses better, and it is also fast enough for lz decompression.
    The bottleneck for rolz is the huge offset table in decompression anyway.
    Additionally, when it comes to speed, entropy coding is only one part of a compressor.
    The evidence: lzna is now deprecated in oodle.
    Last edited by dnd; 5th January 2020 at 00:11.

  21. Thanks:

    Shelwien (28th December 2019)

  22. #19
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    801
    Thanks
    244
    Thanked 255 Times in 159 Posts
    Quote Originally Posted by dnd View Post
    The evidence: lzna is now deprecated in oodle.
    It was too slow for games - they use tANS especially in Leviathan:
    https://encode.su/threads/2078-List-...ll=1#post56059

  23. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > it's mostly good for text, and there bwt is much better.

    I agree.

    > For binaries order1 rolz is bad.

    I guess that's why razor also has a LZ77 token type.

    > Optimal parsing is hard with rolz

    I don't think so.
    Adaptive counters and state context derived from previous token types
    are hard to support in an optimizer.

    A rolz token type, however, can be added to e.g. the lzma optimizer rather trivially,
    since it doesn't have any dynamic dependencies.

    > Rans like used in razor, adds no value for a compressor that the main purpose is archiving.

    Well, faster decoding is still good.
    Unlike tANS, rANS doesn't really have any entropy overhead.
    There's only some overhead on final states for interleaved streams, and stream sizes.
    But even that can be reduced - like, we could use RC for block header (including final states etc),
    then rANS for most of the block.
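
    For reference, a bare-bones static rANS pair in the style of ryg_rans shows where that overhead lives - just the final 32-bit state flush per stream (this is a sketch, not Razor's adaptive nibble coder):
    Code:
    // Static rANS, 32-bit state, byte renormalization. Frequencies must sum
    // to 1<<SCALE and every coded symbol needs freq > 0. Apart from the
    // 4-byte state flush, the output size tracks the model entropy.
    #include <cstdint>
    #include <vector>

    const uint32_t RANS_L = 1u << 23;   // lower bound of the normalized state
    const int SCALE = 12;               // frequency precision: sum = 1<<SCALE

    struct Sym { uint32_t start, freq; };   // cumulative start + frequency

    // Symbols are encoded in reverse; the byte stream comes out back-to-front.
    std::vector<uint8_t> rans_encode(const std::vector<uint8_t>& in, const Sym* syms) {
        std::vector<uint8_t> rev;
        uint32_t x = RANS_L;
        for (size_t i = in.size(); i-- > 0; ) {
            const Sym& s = syms[in[i]];
            uint32_t x_max = ((RANS_L >> SCALE) << 8) * s.freq;
            while (x >= x_max) { rev.push_back((uint8_t)x); x >>= 8; }      // renormalize
            x = ((x / s.freq) << SCALE) + (x % s.freq) + s.start;
        }
        for (int i = 0; i < 4; i++) { rev.push_back((uint8_t)x); x >>= 8; } // flush state
        return std::vector<uint8_t>(rev.rbegin(), rev.rend());
    }

    std::vector<uint8_t> rans_decode(const std::vector<uint8_t>& in, const Sym* syms,
                                     size_t count) {
        std::vector<uint8_t> out;
        size_t p = 0;
        uint32_t x = 0;
        for (int i = 0; i < 4; i++) x = (x << 8) | in[p++];            // read initial state
        for (size_t n = 0; n < count; n++) {
            uint32_t slot = x & ((1u << SCALE) - 1);
            uint32_t s = 0;                                            // linear symbol lookup,
            while (slot >= syms[s].start + syms[s].freq) s++;          // a table in real code
            out.push_back((uint8_t)s);
            x = syms[s].freq * (x >> SCALE) + slot - syms[s].start;
            while (x < RANS_L) x = (x << 8) | in[p++];                 // renormalize
        }
        return out;
    }

    An interleaved version just runs 2-4 such states over alternating symbols; each extra state costs one more flush, which is the "final states" overhead mentioned above.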

    > Here a bitwise range coder is compressing better
    > and it is also fast enough for lz decompression.

    Unfortunately in practice now we get asked to write something better than zstd
    (including decoding speed), and 1-2% better compression with 5-10% of decoding
    speed is not accepted.

    > Additionally, when it comes to speed, entropy coding is only a part in a compressor.

    The problem is that it slows down everything else because of dependencies.
    For example, lzma uses the previous byte (previous literal, or the last byte of the previous match)
    as context, so it's necessary to decode and look up the match before decoding the next token.
    Dropping this context hurts entropy coding, but allows for optimizations that make
    decoding 10x faster.

    > The evidence: lzna is now deprecated in oodle.

    Well, they got leviathan to the same compression ratio, so there's no point now?
    But I think they're just not interested in statistical models and entropy coding.
    Also they have to target various game consoles... I suppose tANS LUTs are more universal
    than rANS vector mul/div.

  24. #21
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    >Optimal parsing is hard with rolz
    I don't think so.
    ROLZ optimal parsing: suppose at the current position P, you have
    a match with length L. When you want to update the positions P + I (I=2 to L),
    you can only take one of the offsets from the offset table at the previous byte context P + I - 1.
    But the offset table is only built in the encoding phase (after the backward pass), so you must in theory propagate
    the whole offset table at each position.
    This is similar to the offset repeats (recent offset match) table, but that's a very small table.
    With lzma/zstd/brotli style parsing you can ignore those dependencies and assume no changes in the parsing phase,
    but this is suboptimal.

    Unfortunately in practice now we get asked to write something better than zstd
    For archiving, I'm always using 7-zip and the decompression speed
    is more than acceptable with current hardware.
    For other use cases, this is another story.

    The adaptive nibble rans is not much faster than a range coder.
    You can see in the Entropy Coding Benchmark that the simd TurboANXN is at most 2x faster than TurboRC in
    simple order-0 encoding. I've also benchmarked the lzna coder, and it's only 20% faster than a range coder.
    In lz77/rolz you have more branchy coding than in this case, so the small rans speed advantage is reduced further.
    It would be interesting to see a comparison of razor vs. lzma at decompression.

    I suppose tANS LUTs are more universal than rANS vector mul/div.
    Nibble rans is adaptive, but block based ANSs (rans or tans) are several times faster (5 to 15 times).
    See Entropy Coding Benchmark.
    Last edited by dnd; 30th December 2019 at 11:30.

  25. Thanks:

    Shelwien (28th December 2019)

  26. #22
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    336
    Thanks
    50
    Thanked 62 Times in 50 Posts
    Quote Originally Posted by Shelwien View Post
    Well, presumably Christian wants to eventually make a popular commercial archiver like winrar and sell it.
    And meanwhile he needs feedback and testing for his codecs and stuff, so he posts them here.
    Also compression benchmarks are somewhat of a sport, so there's some value in good compressors (even closed-source) simply as collectibles.
    Both topics are hopelessly out of fashion recently, though.

    Problem is, reverse-engineering is also a sport, so people will always look for good targets.
    When some interesting program is not updated for years, it will most likely happen.
    Even without decompiling, there are always indirect methods that allow identifying the details of the algorithm
    (like generating special sample files and testing)... which are also considered reverse-engineering.
    Incidentally, there's usually no difference for the author, since decompiled code is mostly only used for clues -
    people very rarely do incremental updates to others' work, see https://en.wikipedia.org/wiki/Not_invented_here
    So in that sense, there's really no difference whether it's decompiled or not.
    Decompiling mostly became popular because of Ida/hexrays, which make decompiling 10x easier now.
    Posting plain executables is now the same as posting obfuscated sources - it may not be immediately readable,
    but it is easy to transform into readable form with the right tools.
    And of course there're protection tools which make it harder (like vmprotect/denuvo).

    how about reverse engineering phda9 ?

  27. #23
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    105
    Thanks
    37
    Thanked 29 Times in 20 Posts


    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    how about reverse engineering phda9 ?
    Here you go https://cutter.re/

    The open-source IDA alternative, with the decompiler from the NSA.

  28. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > how about reverse engineering phda9 ?

    1. Releases under the "phda" name don't really comply with HP rules anymore (they use more memory).

    2. It's very unstable even without touching it. The Windows decoder didn't work when I tried it
    (likely encoded with the linux version, and there're floating-point models so it's not compatible).

    3. The Linux binary is compressed with some exepacker. It has UPX signatures, but upx -d doesn't unpack it,
    and lzma or ucl/nrv compressed data can't be found... so it's either not upx, or upx with some extra encryption.

    4. The algorithm itself is very messy; you can look at paq8hp12 and lpaq9m, which are available.
    We have most of the components (lstmcompress/cmix, mod_ppmd, transformation scripts...),
    but the models there are designed to work with transformed data in an unknown format, so it's hard to understand.

    Well, basically it's an example of how successful anti-RE protection is implemented :)

  29. #25
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    RAZOR compresses enwik9 to 173,041,176 bytes, whereas BWT can compress this to under 167,000,000 using an sr coder (symbol ranking) or QLFC and
    a simple rle range coder.


    It seems RAZOR is using lz77 for binaries.

    Using preprocessing, you can achieve significant compression savings -
    sometimes 3 to 10 times better compression, or more.
    See for example the prime number benchmark with transpose+delta:
    https://encode.su/threads/2414-Prime...ll=1#post47027

    Audio is normally incompressible by lz77,
    but you can also achieve significant savings after preprocessing with 2x16 bits delta+transpose,
    without a dedicated compressor.

    Same for image files (bmp,ppm,...) with delta 24 bits+transpose.
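
    For example, the 2x16 bits delta+transpose for stereo audio boils down to something like this (my own sketch of the idea; actual tools may order or group the steps differently):
    Code:
    // Delta each channel against its previous sample, then split the results
    // into byte planes (all low bytes, then all high bytes), so a plain
    // LZ/entropy stage sees long runs of near-zero bytes.
    #include <cstdint>
    #include <vector>

    // in: interleaved samples L0 R0 L1 R1 ...
    // out: 4 concatenated planes: L-low, L-high, R-low, R-high.
    std::vector<uint8_t> delta_transpose_2x16(const std::vector<int16_t>& in) {
        size_t frames = in.size() / 2;
        std::vector<uint8_t> out(frames * 4);
        int16_t prev[2] = {0, 0};
        for (size_t i = 0; i < frames; i++) {
            for (int ch = 0; ch < 2; ch++) {
                uint16_t d = (uint16_t)(in[2 * i + ch] - prev[ch]);    // wraps mod 2^16
                prev[ch] = in[2 * i + ch];
                out[(2 * ch + 0) * frames + i] = (uint8_t)(d & 0xFF);  // low-byte plane
                out[(2 * ch + 1) * frames + i] = (uint8_t)(d >> 8);    // high-byte plane
            }
        }
        return out;
    }

    The inverse regroups the planes into 16-bit values and prefix-sums each channel; the 24-bit image case is the same idea with three planes and a per-component delta.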

    EXE compression with BCJ2 preprocessing.
    In contrast to other people, I've found
    BCJ2 better than dispack.


    Universal preprocessing with dynamic pattern detection is also part of an algorithm,
    but detecting some popular multimedia files by reading the header is not.
    Oodle, for example, detects chunks of data that can be compressed better with delta preprocessing.
    Last edited by dnd; 29th December 2019 at 22:39.

  30. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > In contrast to other people, I've found BCJ2 better than dispack.

    1. There's more than one version of dispack.
    http://nishi.dreamhosters.com/u/dispack_v0.rar
    http://nishi.dreamhosters.com/u/dispack_arc_v0.rar

    2. Dispack is old, so it doesn't have support for x64 and newer vector extensions.
    Some x64 instructions (CALL/JMP in particular) have the same codes, so it still does something, but the results depend on luck.

    3. Dispack doesn't do anything that smart. It does disassemble the code, but then only uses that
    to implement a fancier version of basically the same E8 transform (with MTF coding of call targets) - see the sketch below.

    4. I have my own BCJ2-like exe filter: http://nishi.dreamhosters.com/u/x64flt3_v0.rar
    It should be strictly better than BCJ2, and it has explicit support for x64 instructions (x64 has relative address coding in all instructions).
    But some tests showed that some types of exes still compress better with dispack_arc. https://encode.su/threads/3223-preco...ll=1#post62110
    Although it depends on compiler and target cpu used to build the exe.
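
    For reference, the basic E8 scheme from point 3 looks like this (a bare sketch, little-endian host assumed; BCJ2, x64flt3 and dispack all add more on top - jump opcodes, x64 forms, false-positive filtering, etc):
    Code:
    // Find x86 CALL rel32 opcodes (0xE8) and replace the relative displacement
    // with an absolute target (displacement + position), so repeated calls to
    // the same function become identical byte strings that LZ can match.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    void e8_filter(std::vector<uint8_t>& buf, bool encode) {
        for (size_t i = 0; i + 5 <= buf.size(); ) {
            if (buf[i] == 0xE8) {
                uint32_t disp, pos = (uint32_t)(i + 5);            // offset of the next instruction
                std::memcpy(&disp, &buf[i + 1], 4);
                uint32_t val = encode ? disp + pos : disp - pos;   // wraps mod 2^32
                std::memcpy(&buf[i + 1], &val, 4);
                i += 5;                                            // skip the rewritten operand
            } else {
                i++;
            }
        }
    }

    Since both directions skip the 4 operand bytes after every recognized 0xE8, the decode pass sees exactly the same opcode positions, so it's an exact inverse.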

  31. Thanks:

    dnd (30th December 2019)

  32. #27
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    how about reverse engineering phda9 ?
    I've not spent any time with phda9, and probably this is already known:
    since Wikipedia is stored internally in a text format and a script is used to generate the XML dumps,
    I suppose phda9 is only compressing the text (as stored in the database) and using a script in decompression to rebuild the enwik9 XML dump.

  33. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    295
    Thanked 1,293 Times in 734 Posts
    > I suppose phda9 is only compressing the text (as stored in the database) and using a script in decompression to rebuild the enwik9 XML dump.

    Kinda yes, but that xml is 4.8M uncompressed and 237576 bytes when compressed normally with paq8px -7.
    Processing gets it down to 220k or so: https://encode.su/threads/2590-Some-...enwik8-parsing
    But in any case, it won't change anything even if you can compress this xml index to zero.
    Best open-source solutions (within HP constraints!) produce only around 16.2M, while phda9 result is 15.2M
    (well, it uses more like 5G of virtual memory with manual hdd swap).

  34. Thanks:

    dnd (30th December 2019)

  35. #29
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    62
    Thanks
    8
    Thanked 5 Times in 5 Posts
    it's also harder to hide a backdoor/virus in open source.
    and people can see what new ideas the author used, along with the ones he 'borrowed'.

  36. #30
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    801
    Thanks
    244
    Thanked 255 Times in 159 Posts
    ... however, being open source does not mean it can be safely used from a legal perspective - there could be patents filed for the methods it uses, which has recently become quite popular, especially in machine learning, e.g. word2vec, dropout, batch normalization.
