
Thread: LZA archiver

  1. #181
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Jarek View Post
    Nania, you will never get a 100% integrity guarantee with a hash - there is always a tiny probability that you will accidentally get the proper checksum for an improper file: about 2^-32 for CRC32 (less than 1 per billion).
    In ANS you have the same situation - with some small probability, you will accidentally get the proper state. For 32-bit rANS this probability is ~2^-31.95 (a bit worse due to the nonuniform probability distribution of states).
    I think most hashes are equivalent, at least in some sense, to polynomials modulo some base. (CRC32 was rationally designed using Galois fields, so its inner workings are perfectly transparent.) In principle, you could take a hash function that's based on a known polynomial and find the other polynomials that complete an orthogonal basis; then you would have a process for generating a set of hashes that together contain 100% of the information, i.e. they are invertible for any input. Another way to phrase this is that they form an invertible square matrix. My guess is that most hash functions that are adequately screened and tested basically contain the maximum amount of information possible (this is just a gut feeling, based on my own experience playing with hash functions... I suspect that if a hash function had redundancy in its bits, a tool like SMHasher would detect it).

    I am somewhat overextending what I know about math, because I never went down the matrices/linear algebra track when I was a student. But my overall hunch is that hash functions generally behave in much simpler and more predictable ways than most people think (they're cooked up by irrationally stirring bits around, which just obscures what's happening and adds a cloak of mystery). CRC32 doesn't really do anything for the sake of enhancing diffusion, like purpose-designed hash functions do. However, CRC32 fundamentally doesn't throw away any bits of information, so it could probably make a good-enough hash function if you just pass its values through a second permutation to add diffusion. My guess is that with a second integer->integer hash function applied on top, CRC32 might perform indistinguishably from other general-purpose hash functions.
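
    To sketch that idea concretely (untested, and just my illustration, not an established design): MurmurHash3's 32-bit finalizer is an invertible integer permutation, so applying it to a CRC32 value adds diffusion without discarding any of the information CRC32 kept.

    Code:
    #include <stdint.h>

    /* MurmurHash3's 32-bit finalizer: XOR-shifts and multiplies by odd
       constants. Every step is invertible, so information is only
       diffused, never lost. */
    static uint32_t fmix32(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x85ebca6bu;
        h ^= h >> 13;
        h *= 0xc2b2ae35u;
        h ^= h >> 16;
        return h;
    }

    /* Hypothetical usage, with crc32() standing in for whatever CRC-32
       routine is already at hand: */
    /* uint32_t h = fmix32(crc32(data, len)); */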

    I guess my point is that hash functions, notwithstanding being described in probabilistic terms, are actually 100% deterministic and predictable (this is almost too obvious to state), and, moreover, perhaps the majority of hash functions can be pretty easily understood with basic number theory and discrete math. Speaking of them in terms of probabilities allows you to do some useful calculations, of course, but at some point it becomes a dead end. "Random" is just a way of saying "I don't know." Sometimes randomness is too blunt a tool. Maybe sometimes we do know the answers; we're just not seeing them.
    Last edited by nburns; 31st January 2015 at 02:52.

  2. The Following User Says Thank You to nburns For This Useful Post:

    Nania Francesco (31st January 2015)

  3. #182
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.

  4. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Nania Francesco (31st January 2015)

  5. #183
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Nania Francesco View Post
    Thanks Matt, I guess I'll use AES; actually, I don't want to get involved with something I don't know well.
    For a job a couple of years ago, I ported code that implemented Microsoft's password/encryption system for .docx and .doc files. I was going to provide a link, but I'm not sure anymore which standard is the new one and which are old and outdated. The nomenclature for that web of standards is pretty obscure... I'd rather not risk linking you to something that's obsolete. But, assuming you can find the most up-to-date documentation, imitating Microsoft would not be the worst thing you could do. Their choices likely reflect the state of the art, if only because they are big and have lots of money to spend. Even if they blew it on those standards, it would probably be just as useful to examine how and why.

  6. The Following User Says Thank You to nburns For This Useful Post:

    Nania Francesco (31st January 2015)

  7. #184
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.
    Isn't CRC32 only guaranteed to detect errors when there is an odd number of errors? There are lots of combinations of two errors that can give the same CRC value.

  8. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Nania Francesco (31st January 2015)

  9. #185
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.
    I have come across that fact, although I haven't looked into how it works. My intuitive feeling is that that property is basically the opposite of diffusion, like "anti-diffusion". I.e., CRC32 reacts to an influx of information predictably, which makes it easier to get specific information back out. What would be interesting to know is whether that property affects its usefulness as a hash function, or if you could just pass it through a permutation of integers to get all the appearance of randomness you want.

  10. #186
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    You're right. That only applies to burst errors.

  11. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Nania Francesco (31st January 2015)

  12. #187
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    858
    Thanks
    450
    Thanked 255 Times in 103 Posts
    Quote Originally Posted by nburns View Post
    I have come across that fact, although I haven't looked into how it works. My intuitive feeling is that that property is basically the opposite of diffusion, like "anti-diffusion". I.e., CRC32 reacts to an influx of information predictably, which makes it easier to get specific information back out. What would be interesting to know is whether that property affects its usefulness as a hash function, or if you could just pass it through a permutation of integers to get all the appearance of randomness you want.
    The same property that makes CRC32 great at detecting a low number of bit errors (<31)
    plays against it when it comes to randomness.

    It's just a consequence of the pigeonhole principle.

    When testing the randomness of CRC32 vs. high-quality hashes,
    you'll typically get the following behavior:
    - up to 31 bits of difference: 0 collisions
    - beyond that: between 2X and 4X more collisions than random hashes


    Conclusion: as a hash function, CRC32 is not the best choice.
    It's not "horrible" either; when used in this role, it's just "a bit less efficient" than random hashes.

  13. The Following 2 Users Say Thank You to Cyan For This Useful Post:

    Bulat Ziganshin (31st January 2015),Nania Francesco (31st January 2015)

  14. #188
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    CRC32 is a pretty generic term. There are lots of options for the polynomial. I think for a hash function, you would want to pick one with a large Hamming distance. But I could be wrong; my main knowledge of CRC is from a long time ago, when we used to put CRC chips (74F401) on circuit boards because doing it in software often wasn't practical and CPLDs and FPGAs were only concepts.

  15. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Nania Francesco (1st February 2015)

  16. #189
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Cyan View Post
    The same property that makes CRC32 great at detecting a low number of bit errors (<31)
    plays against it when it comes to randomness.

    It's just a consequence of the pigeonhole principle.

    When testing the randomness of CRC32 vs. high-quality hashes,
    you'll typically get the following behavior:
    - up to 31 bits of difference: 0 collisions
    - beyond that: between 2X and 4X more collisions than random hashes


    Conclusion: as a hash function, CRC32 is not the best choice.
    It's not "horrible" either; when used in this role, it's just "a bit less efficient" than random hashes.
    I'm guessing that CRC32 most likely has exactly 31 significant bits (according to Wikipedia, the generator polynomial is 33 bits long, but the first and last bits are always 1, so only the 31 bits in between are significant).

    When I was playing with hashes and made my own hash function, I started with a simple polynomial, mod 2^64; I processed the data with this, using the data bytes as the coefficients (as in the Rabin-Karp rolling hash). The result didn't look random enough (the lowest and highest bits followed patterns), so I folded the two halves together into a hash half as long. At that point, it may have looked good enough at a glance, but SMHasher said it still wasn't random enough, so I borrowed the avalanching step from MurmurHash v2. From then on, it passed all of SMHasher's (and my own) tests with flying colors. (There were a couple of other minor details, I think, that stopped it from degenerating in some special cases tested by SMHasher.)
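
    Reconstructed from memory as a rough sketch (not my original code; the base constant and the exact fold are placeholders):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Stage 1: Rabin-Karp-style polynomial hash mod 2^64 (the natural
       overflow of uint64_t), with the data bytes as coefficients.
       Stage 2: fold the two halves together, since the extreme bits
       follow patterns. Stage 3: MurmurHash2's avalanche step. */
    static uint32_t poly_fold_hash(const uint8_t *data, size_t len)
    {
        const uint64_t B = 0x100000001b3ULL;     /* placeholder base */
        uint64_t h = 0;
        for (size_t i = 0; i < len; i++)
            h = h * B + data[i];                 /* polynomial mod 2^64 */

        uint32_t x = (uint32_t)(h ^ (h >> 32));  /* fold 64 -> 32 bits */

        x ^= x >> 13;                            /* MurmurHash2 avalanche */
        x *= 0x5bd1e995u;
        x ^= x >> 15;
        return x;
    }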

    So I think you can treat general-purpose hashing as two separate problems: First, find a function that samples the entropy of the input data down to a fixed number of bits, e.g. the size of a machine word. Second, find a function that re-hashes the output of the first, without changing the number of bits, so that the bits look sufficiently random.

    CRC32 seems more than adequate to play the first role if you don't need more than 31 bits. For the second function, you could use the same one as I did.

    P.S. Recently I was thinking about this, and I think I've fully understood why the 64-bit polynomial hash had to be folded down to 32 bits (or was it 31 bits?). Without going into details (which are not exactly formalized and publication-ready at this point... nor am I a professional mathematician), the polynomial and methods I used seem to be equivalent (give or take minor details) to taking the remainder mod a Mersenne number with half as many bits (2^k-1, k ~ 32 or 31). That would explain exactly the kind of results I observed. The beauty of mod Mersenne is that the carry bits wrap around rather than falling off the left edge. That way, the carry bits can mix with the predictable low bits, allowing them to correct each other. What I did achieved the same thing, more or less.
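
    For concreteness, here is what reduction mod a Mersenne number looks like in code (my illustration, with k = 31, so M = 2^31 - 1; the folds are exactly the carry wrap-around described above):

    Code:
    #include <stdint.h>

    #define M31 0x7FFFFFFFu  /* 2^31 - 1, a Mersenne prime */

    /* Reduce a 64-bit value mod 2^31 - 1. Since 2^31 mod M31 = 1,
       bits that would fall off the left edge wrap around and are
       added back in, mixing with the low bits. */
    static uint32_t mod_mersenne31(uint64_t x)
    {
        x = (x & M31) + (x >> 31);  /* first fold: fits in ~34 bits */
        x = (x & M31) + (x >> 31);  /* second fold: now < 2^31 + 4 */
        if (x >= M31) x -= M31;     /* final conditional subtract */
        return (uint32_t)x;
    }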

    P.P.S. I'm guessing that "random hashes" is not any kind of formal designation, and would not be possible to formalize. I bet that the more purposeless "stirring" operations one adds, the more they start to cancel each other out. That's what happened to the German Enigma cipher, which the Allies were able to crack during WWII (using incredibly primitive computers).
    Last edited by nburns; 31st January 2015 at 22:25.

  17. The Following User Says Thank You to nburns For This Useful Post:

    Nania Francesco (1st February 2015)

  18. #190
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 660 Times in 354 Posts
    Quote Originally Posted by nburns View Post
    P.P.S. I'm guessing that "random hashes" is not any kind of formal designation, and would not be possible to formalize.
    https://en.wikipedia.org/wiki/Universal_hashing

  19. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Nania Francesco (1st February 2015)

  20. #191
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    https://en.wikipedia.org/wiki/Universal_hashing
    That's a kind of hash that integrates random numbers into its formula. I don't think that really relates to how Yann was using the term... he seemed to be using it to refer to hash functions that give random-looking results (i.e., that are good), to distinguish them from hash functions that are not random enough, like CRC32. I am not sure that's a distinction worth making, because I think CRC32 could probably be corrected without adding any actual additional information/entropy, and made to look as good as the others.

    If I'm wrong about CRC32, learning why would probably be an illustrative discussion, in any case, and worth the time.
    Last edited by nburns; 31st January 2015 at 22:43.

  21. The Following User Says Thank You to nburns For This Useful Post:

    Nania Francesco (1st February 2015)

  22. #192
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I'm guessing that CRC32 most likely has exactly 31 significant bits (according to wikipedia, the generator polynomial is 33 bits long, but the first and last are not significant, they're always 1, so the generator polynomial has 31 significant bits).
    Yes, all CRC32 polynomials start with the x^32 term and end with 1 (x^0), so the new first bit is always the XOR of the old last bit and the new input bit. Only the 31 positions in between can be altered: depending on the chosen polynomial, each position either passes its bit through unmodified to the next flip-flop or XORs it with the XOR of the last bit and the new input bit.
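
    In software, a standard bit-at-a-time CRC-32 (the common reflected form of the IEEE polynomial, 0xEDB88320, shown only for illustration) is the direct equivalent of that shift register: shift, and if the bit falling off is set, XOR in the polynomial's taps.

    Code:
    #include <stdint.h>
    #include <stddef.h>

    static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;             /* standard initial value */
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                /* if the low bit is set, XOR the taps in as it shifts out */
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
        }
        return ~crc;                            /* standard final XOR */
    }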

    The authors of this paper http://users.ece.cmu.edu/~koopman/ne...02_koopman.pdf did an exhaustive search of all ~1 trillion possible CRC32 polynomials and liked the 0xBA0DC66B polynomial, which has a Hamming distance of 4 for up to 114,663 bits. I suppose for caching that's probably not the best measure, though. I'm mostly trying to make the point that the results of using CRC32 as a cache hash will vary depending on which CRC32 polynomial is used.

  23. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Nania Francesco (1st February 2015),WayneD (11th March 2015)

  24. #193
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Released latest version 0.82!
    - official console and Windows versions (size limit removed)
    - new archive format with CRC32 check!
    - new LZA core for good and fast compression!
    - not compatible with old versions

    download from:
    http://heartofcomp.altervista.org/
    Last edited by Nania Francesco; 10th March 2015 at 01:16.

  25. The Following 2 Users Say Thank You to Nania Francesco For This Useful Post:

    Gonzalo (10th March 2015),Stephan Busch (11th March 2015)

  26. #194
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Quote Originally Posted by Nania Francesco View Post
    Released latest version 0.82!
    Looks good! But if I open LZAwin.exe, choose a file, and press "Add", it says "Flashzip - compression mode". If I then close this Add window by pressing the X (top right), the program crashes.

  27. #195
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    468
    Thanks
    203
    Thanked 81 Times in 61 Posts
    Good!

    Now:

    "new lza core for good and fast compression"

    ...Can you explain?
    ANS variant? Radyx? Filters?

    Edit: I just saw the specs on your site. Sorry. Any further info besides that? Thanks!
    Last edited by Gonzalo; 10th March 2015 at 07:04.

  28. #196
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Quote Originally Posted by Gonzalo View Post
    Good!

    Now:

    "new LZA core for good and fast compression"

    ...Can you explain?
    ANS variant? Radyx? Filters?

    Edit: I just saw the specs on your site. Sorry. Any further info besides that? Thanks!
    The ANS core is the same!
    No exe or txt filters!
    No Radyx source code, in part or variation!
    LZA match search is a 2X search index mode!
    Implemented data recognition (for uncompressed ISO, TAR, RAR, and ZIP archives, etc.)

    @Matt please test !

  29. #197
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    468
    Thanks
    203
    Thanked 81 Times in 61 Posts
    What do you do when you recognize known data in the containers?
    What decisions are made?
    I.e., do you skip compressed data, since you don't have any filter to handle text or exe? Or do you rearrange the content to put similar blocks together?

    PS: Can you compile LZAwin.exe for x86? Thanks!
    Last edited by Gonzalo; 10th March 2015 at 20:19.

  30. #198
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    468
    Thanks
    203
    Thanked 81 Times in 61 Posts
    Warning: The command line parser doesn't like asterisks or ..\somefolder\
    For example, this leads to a crash:
    lza a -r -s -v -h3 -b4 -mx9 TEST.lza TEST\*
    lza a -r -s -v -h3 -b4 -mx9 ..\TEST.lza ..\TEST

  31. #199
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Quote Originally Posted by Gonzalo View Post
    Warning: The command line parser doesn't like asterisks or ..\somefolder\
    For example, this leads to a crash:
    lza a -r -s -v -h3 -b4 -mx9 TEST.lza TEST\*
    lza a -r -s -v -h3 -b4 -mx9 ..\TEST.lza ..\TEST
    Thanks Gonzalo!

    Results in the Huge Files Compression Benchmark (tested on an Intel Core i7 920 @ 2.67 GHz, 4 GB RAM)

    File Vm.dll, size = 4,244,176,896 bytes
    LZA v0.10 -mx5 -b6 -h7 -> 1,179,091,707 bytes, comp. 1154 sec., decomp. 51 sec.
    LZA v0.51 -mx5 -b6 -h7 -> 1,071,931,845 bytes, comp. 1020 sec., decomp. 50 sec.
    LZA v0.82 -mx9 -b7 -h7 -> 1,049,231,245 bytes, comp. 1517 sec., decomp. 60 sec. (CRC32 active)
    Last edited by Nania Francesco; 11th March 2015 at 02:23.

  32. #200
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  33. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Nania Francesco (11th March 2015)

  34. #201
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    864
    Thanks
    46
    Thanked 104 Times in 82 Posts
    Just a quick test of LZA 0.82win on some text files.

    FFXI logs - 2023 files - 806 MB (845,219,536 bytes)

    7-zip LZMA:30:lc8Lpb0 = 57.2 MB (60,050,692 bytes)
    LZA Method 9: LZAbuffer 1024mb:HashBuffer 256mb:Solid:recursive:ultra = 81.0 MB (84,953,506 bytes)


    Some annoyances I stumbled upon:
    - you can't copy-paste a path into the navigation pane, so I had to navigate twice to add the files
    - the default archive location is the LZA folder, not the folder you navigated to
    - it adds higher-level folders even though I only chose files

    E.g., I navigated to D:\FFXI logs\Logs (twice) and chose all the files to compress, but the default location is D:\Inbox\LZA082, and inside the finished archive I have to go through the FFXI\LOGS folders to get to my files. I don't think higher-level folders should be put into the archive; it seems illogical to me.

    Also, the program looks like it's freezing while it does its work.

    I would wish for a right-click context menu option like 7-zip's, rather than having to go through an additional navigation step.

  35. The Following User Says Thank You to SvenBent For This Useful Post:

    Nania Francesco (25th March 2015)

