1. Originally Posted by Jarek
Nania, you will never get a 100% integrity guarantee with a hash - there is always a tiny probability that an improper file will accidentally produce the proper checksum - about 2^-32 for CRC32 (less than 1 in a billion).
In ANS you have the same situation - with some small probability, you will accidentally get the proper state. For 32-bit rANS this probability is ~2^-31.95 (a bit worse due to the nonuniform probability distribution of states).
I think most hashes are equivalent, at least in some sense, to polynomials modulo some base. (CRC32 was rationally designed using Galois fields, so its inner workings are perfectly transparent.) In principle, you could take a hash function that's based on a known polynomial and find the other polynomials that complete an orthogonal basis; then you would have a process for generating a set of hashes that contains 100% of the information, i.e. they are invertible for any input. Another way to phrase this is that they can form an invertible square matrix. My guess is that most hash functions that are adequately well-screened/tested basically contain the maximum amount of information possible (this is just a gut feeling, based on my own experience playing with hash functions... I suspect that if a hash function had redundancy in its bits, a tool like SMHasher would detect it).
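To make the polynomial view concrete, here is a minimal bitwise sketch of the standard (reflected) CRC-32: each step is a conditional XOR with the generator polynomial, i.e. long division over GF(2). The check against `zlib.crc32` is just a sanity test, not part of the argument.

```python
import zlib

def crc32_bitwise(data: bytes) -> int:
    """Standard reflected CRC-32: long division over GF(2)
    using the bit-reversed generator polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                # "Subtract" (XOR) the generator polynomial.
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Matches the library implementation and the well-known check value.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
```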

I am somewhat overextending what I know about math, because I never went down the matrices/linear algebra track when I was a student. But I guess my overall hunch is that hash functions, generally, behave in much simpler and more predictable ways than most people think (they're cooked up by irrationally stirring around bits, which just obscures what's happening and adds a cloak of mystery). CRC32 doesn't really do anything for the sake of enhancing diffusion, like purpose-designed hash functions do. However, CRC32 fundamentally doesn't throw away any bits of information, so it could probably make a good-enough hash function if you just passed its values through a second permutation to add diffusion. My guess is that with a second integer->integer hash function applied to CRC32, it might perform indistinguishably from other general-purpose hash functions.

I guess my point is that hash functions, notwithstanding being described in probabilistic terms, are actually 100% deterministic and predictable (this is almost too obvious to state), and -- moreover, perhaps -- the majority of hash functions can be pretty easily understood with basic number theory and discrete math. Speaking of them in terms of probabilities allows you to do some useful calculations, of course, but at some point it becomes a dead end. "Random" is just a way of saying "I don't know." Sometimes randomness is too blunt a tool, perhaps. Maybe sometimes we do know the answers, we're just not seeing them.

2. ## Thanks:

Nania Francesco (31st January 2015)

3. CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.

4. ## Thanks:

Nania Francesco (31st January 2015)

5. Originally Posted by Nania Francesco
Thanks Matt, I guess I'll use AES; I'd rather not engage in something that I don't know well.
For a job a couple years ago, I ported code that implemented Microsoft's password/encryption system used for .docx and .doc files. I was going to provide a link, but I'm not sure anymore which standard is the new one and which are old and outdated. The nomenclature for that web of standards is pretty obscure... I'd rather not risk linking you to something that's obsolete. But, assuming you can find the most updated documentation, imitating Microsoft would not be the worst thing you could do. Their choices likely reflect the state-of-the-art -- if only because they are big and have lots of money to spend. Even if they blew it on those standards, it would probably be just as useful to examine how and why.

6. ## Thanks:

Nania Francesco (31st January 2015)

7. Originally Posted by Matt Mahoney
CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.
Isn't CRC32 only guaranteed to detect errors when there is an odd number of errors? There are lots of combinations of two errors that can give the same CRC value.

8. ## Thanks:

Nania Francesco (31st January 2015)

9. Originally Posted by Matt Mahoney
CRC32 has the special property that it is guaranteed to detect up to 31 bit errors.
I have come across that fact, although I haven't looked into how it works. My intuitive feeling is that that property is basically the opposite of diffusion, like "anti-diffusion". I.e., CRC32 reacts to an influx of information predictably, which makes it easier to get specific information back out. What would be interesting to know is whether that property affects its usefulness as a hash function, or if you could just pass it through a permutation of integers to get all the appearance of randomness you want.
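That predictability can be made concrete: over GF(2), CRC32 is affine, so for equal-length inputs the CRC of the XOR of two messages is fully determined by their individual CRCs. A quick sketch using Python's zlib (the identity follows from the linearity of the shift-register update):

```python
import zlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

a = b"hello, world"
b = b"HELLO, WORLD"
zeros = bytes(len(a))  # all-zero message of the same length

# CRC32 is affine over GF(2): for equal-length messages,
# crc(a xor b) == crc(a) xor crc(b) xor crc(zeros).
lhs = zlib.crc32(xor_bytes(a, b))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
assert lhs == rhs
```

A well-diffused hash has no such closed-form relationship between outputs; this is exactly the "anti-diffusion" being described.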

10. You're right. That only applies to burst errors.

11. ## Thanks:

Nania Francesco (31st January 2015)

12. Originally Posted by nburns
I have come across that fact, although I haven't looked into how it works. My intuitive feeling is that that property is basically the opposite of diffusion, like "anti-diffusion". I.e., CRC32 reacts to an influx of information predictably, which makes it easier to get specific information back out. What would be interesting to know is whether that property affects its usefulness as a hash function, or if you could just pass it through a permutation of integers to get all the appearance of randomness you want.
The same property that makes CRC32 great at detecting a low number of bit errors (<31) plays against it when it comes to randomness.

It's just a consequence of the pigeonhole principle.

When testing the randomness of CRC32 vs. high-quality hashes, you'll typically get the following behavior:
- up to 31 bits of difference: 0 collisions
- beyond that: between 2X and 4X more collisions than random hashes

Conclusion: for a hash function, CRC32 is not the best choice.
It's not "horrible" either; when used in this role, it's just "a bit less efficient" than random hashes.
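The "0 collisions" half of that behavior is easy to check directly, at least for small numbers of flipped bits: for a short message, every 1- and 2-bit flip is guaranteed to change the CRC (the standard polynomial keeps Hamming distance 4 out to fairly long messages). A small exhaustive sketch:

```python
import itertools
import zlib

msg = bytes(range(8))        # a 64-bit test message
base = zlib.crc32(msg)

def flip(data: bytes, *bits: int) -> bytes:
    """Return a copy of data with the given bit positions flipped."""
    out = bytearray(data)
    for b in bits:
        out[b // 8] ^= 1 << (b % 8)
    return bytes(out)

# Every 1- and 2-bit error changes the CRC: zero collisions,
# consistent with CRC-32's guaranteed error detection at short lengths.
misses = sum(zlib.crc32(flip(msg, i)) == base for i in range(64))
misses += sum(zlib.crc32(flip(msg, i, j)) == base
              for i, j in itertools.combinations(range(64), 2))
assert misses == 0
```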

13. ## Thanks (2):

Bulat Ziganshin (31st January 2015),Nania Francesco (31st January 2015)

CRC32 is a pretty generic term. There are lots of options for the polynomial. I think for a hash function, you would want to pick one with a large Hamming distance. But I could be wrong; my main knowledge of CRC is from a long time ago, when we used to put CRC semiconductors (74F401) on circuit boards because doing it in software often wasn't practical, and CPLDs and FPGAs were only concepts.

15. ## Thanks:

Nania Francesco (1st February 2015)

16. Originally Posted by Cyan
The same property that makes CRC32 great at detecting a low number of bit errors (<31) plays against it when it comes to randomness.

It's just a consequence of the pigeonhole principle.

When testing the randomness of CRC32 vs. high-quality hashes, you'll typically get the following behavior:
- up to 31 bits of difference: 0 collisions
- beyond that: between 2X and 4X more collisions than random hashes

Conclusion: for a hash function, CRC32 is not the best choice.
It's not "horrible" either; when used in this role, it's just "a bit less efficient" than random hashes.
I'm guessing that CRC32 most likely has exactly 31 significant bits (according to Wikipedia, the generator polynomial is 33 bits long, but the first and last bits are not significant - they're always 1 - so the generator polynomial has 31 significant bits).

When I was playing with hashes and made my own hash function, I started with a simple polynomial, mod 2^64; I processed the data with this, using the data bytes as the coefficients (as in the Rabin-Karp rolling hash). The result didn't look random enough (the lowest and highest bits followed patterns), so I folded the two halves into one hash half as long. At that point, it may have looked good enough at a glance, but SMHasher was saying it still wasn't random enough, so I borrowed the avalanching step from MurmurHash v2. From that point on, it passed all of SMHasher's (and my own) tests with flying colors. (There were a couple of other minor details, I think, that stopped it from degenerating in some special cases that were tested by SMHasher.)

So I think you can consider general purpose hashing as two separate problems: First, find a function that samples the entropy of the input data down to a fixed number of bits, e.g. the size of a machine word. Second, find a function that re-hashes the output of the other function, without changing the number of bits, so that the bits look sufficiently random.

CRC32 seems more than adequate to play the first role if you don't need more than 31 bits. For the second function, you could use the same one as I did.
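A sketch of that two-stage split, assuming zlib's CRC32 for the sampling stage and, as a stand-in for the MurmurHash v2 avalanche step mentioned above, MurmurHash3's 32-bit finalizer - a bijective permutation of 32-bit integers, so stage two throws away no information:

```python
import zlib

def fmix32(h: int) -> int:
    """MurmurHash3's 32-bit finalizer: an invertible permutation of
    32-bit integers (xorshifts and odd multiplications are bijections)."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def crc32_hash(data: bytes) -> int:
    # Stage 1: sample the input down to 32 bits with CRC32.
    # Stage 2: add diffusion with a fixed integer permutation.
    return fmix32(zlib.crc32(data))
```

Since `fmix32` is a bijection, two inputs collide under `crc32_hash` exactly when their CRC32s collide; the finalizer only fixes the appearance of randomness, not the collision structure.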

P.S., recently I was thinking about this and I think I fully understood the reason why the 64-bit polynomial hash had to be folded down to 32 bits (or was it 31 bits?). Without going into details (which are not exactly formalized and publication-ready, at this point... nor am I a professional mathematician), the polynomial and methods I used seem to be the equivalent (+/- minor details) of taking the remainder mod a Mersenne number with half as many bits (2^k-1, k~32 or 31). That would explain exactly the kind of results I observed. The beauty of mod Mersenne is that the carry bits wrap around rather than falling off the left edge. That way, the carry bits can mix with the predictable low bits, allowing them to correct each other. What I did achieved the same thing, more or less.
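The wrap-around idea can be sketched as reduction mod a Mersenne number: since 2^31 ≡ 1 (mod 2^31 - 1), the high bits (the "carries") fold back onto the low bits instead of falling off the edge. A minimal sketch:

```python
M31 = (1 << 31) - 1  # the Mersenne number 2^31 - 1

def mod_mersenne31(x: int) -> int:
    """Reduce x mod 2^31 - 1 by folding: because 2^31 = 1 (mod M31),
    the bits above bit 30 wrap around and are added back in."""
    while x > M31:
        x = (x & M31) + (x >> 31)
    return 0 if x == M31 else x
```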

PPS. I'm guessing that "random hashes" is not any kind of formal designation, and would not be possible to formalize. I bet that the more additional purposeless "stirring" operations one adds, the more they start to cancel each other out. That's what happened to the German Enigma cipher, which the Allies were able to crack during WWII (using extremely primitive computers).

17. ## Thanks:

Nania Francesco (1st February 2015)

18. Originally Posted by nburns
PPS. I'm guessing that "random hashes" is not any kind of formal designation, and would not be possible to formalize.
https://en.wikipedia.org/wiki/Universal_hashing
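Universal hashing does formalize "random hashes": the function is drawn at random from a family with a provable collision bound, independent of the input distribution. A minimal sketch of the multiply-shift family (an approximately universal scheme; the word size `w`, output width `m`, and seeds below are illustrative choices, not from the thread):

```python
import random

w, m = 64, 16  # word size and output bits (illustrative)

def make_hash(seed: int = 0):
    """Draw one member of the multiply-shift family:
    h_a(x) = (a*x mod 2^w) >> (w - m), for a random odd a."""
    rng = random.Random(seed)
    a = rng.getrandbits(w) | 1  # random odd multiplier
    def h(x: int) -> int:
        # The randomness lives in the choice of a,
        # not in any "stirring" of the input bits.
        return ((a * x) & ((1 << w) - 1)) >> (w - m)
    return h

h = make_hash(42)
```

The collision guarantee is a property of the random draw over the family, which is exactly the formalization the Wikipedia article describes.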

19. ## Thanks:

Nania Francesco (1st February 2015)

20. Originally Posted by Bulat Ziganshin
That's a kind of hash that integrates random numbers into its formula. I don't think that really relates to how Yann was using the term... he seemed to be using it to refer to hash functions that give random-looking results (i.e., that are good), to distinguish them from hash functions that are not random enough, like CRC32. I am not sure that's a distinction worth making, because I think CRC32 could probably be corrected without adding any actual additional information/entropy, and made to look as good as the others.

If I'm wrong about CRC32, learning why would probably make for an instructive discussion, in any case, and be worth the time.

21. ## Thanks:

Nania Francesco (1st February 2015)

22. Originally Posted by nburns
I'm guessing that CRC32 most likely has exactly 31 significant bits (according to wikipedia, the generator polynomial is 33 bits long, but the first and last are not significant, they're always 1, so the generator polynomial has 31 significant bits).
Yes, all CRC32 polynomials start with the x^32 term and end with 1 (x^0), so the new first bit is always the XOR of the old last bit and the new input bit. Only the 31 positions in between can be altered: depending on the chosen polynomial, each either sends the bit through unmodified to the next flip-flop or XORs it with the XOR of the last bit and the new input bit.

The authors of this paper http://users.ece.cmu.edu/~koopman/ne...02_koopman.pdf did an exhaustive search of the possible CRC32 polynomials and liked the 0xBA0DC66B polynomial, which has a Hamming distance of 4 for up to 114,663 bits. I suppose for caching that's probably not the best measure, though. I'm mostly trying to make the point that the results of using CRC32 for caching will vary depending on which CRC32 polynomial is used.

23. ## Thanks (2):

Nania Francesco (1st February 2015),WayneD (11th March 2015)

24. Released latest version 0.82!
- official console and Windows versions (size limit removed)
- new archive format with CRC32 control!
- new LZA core for good and fast compression!
- not compatible with old versions

http://heartofcomp.altervista.org/

25. ## Thanks (2):

Gonzalo (10th March 2015),Stephan Busch (11th March 2015)

26. Originally Posted by Nania Francesco
Released latest version 0.82!
Looks good! But if I open LZAwin.exe, choose 1 file, and press "add", it says "Flashzip - compression mode". If I then close this add window by pressing X (top right), the program crashes.

27. Good!

Now:

"new lza core for good and fast compression"

...Can you explain?

Edit: I just saw the specs on your site. Sorry. Any further info besides that? Thanks!

28. Good!

Now:

"new lza core for good and fast compression"

...Can you explain?

Edit: I just saw the specs on your site. Sorry. Any further info besides that? Thanks!
The ANS core is the same!
No exe or txt filters!
No Radyx source code, in part or variation!
LZA search match is 2X search index mode!
Implemented data recognition (for uncompressed ISO, TAR, RAR, ZIP archives, etc.)

What do you do when you recognize known data in the containers? I.e., do you skip compressed data? As you don't have any filter to handle text or exe... Or do you rearrange the content to put similar blocks together?

PS: Can you compile LZAwin.exe for x86? Thanks!

30. Warning: Command line parser doesn't like asterisks nor ..\somefolder\
For example, this leads to a crash:
lza a -r -s -v -h3 -b4 -mx9 TEST.lza TEST\*
lza a -r -s -v -h3 -b4 -mx9 ..\TEST.lza ..\TEST

Originally Posted by Gonzalo
Warning: Command line parser doesn't like asterisks nor ..\somefolder\
For example, this leads to a crash:
lza a -r -s -v -h3 -b4 -mx9 TEST.lza TEST\*
lza a -r -s -v -h3 -b4 -mx9 ..\TEST.lza ..\TEST
Thanks Gonzalo!

Results in the Huge Files Compression Benchmark (tested on an Intel Core i7 920 @ 2.67 GHz, 4 GB RAM):

File Vm.dll, size = 4.244.176.896 bytes
LZA v0.10 -mx5 -b6 -h7 -> 1179091707 bytes, comp. 1154 sec - decomp. 51 sec
LZA v0.51 -mx5 -b6 -h7 -> 1071931845 bytes, comp. 1020 sec - decomp. 50 sec
LZA v0.82 -mx9 -b7 -h7 -> 1049231245 bytes, comp. 1517 sec - decomp. 60 sec (CRC32 active)

32. ## Thanks:

Nania Francesco (11th March 2015)

33. Just a quick test of LZA 0.82win on some text files.

FFXI logs - 2023 files - 806 MB (845,219,536 bytes)

7-zip LZMA:30:lc8Lpb0 = 57.2 MB (60,050,692 bytes)
LZA Method 9: LZAbuffer 1024mb:HashBuffer 256mb:Solid:recursive:ultra = 81.0 MB (84,953,506 bytes)

Some annoyances I stumbled upon:
- You can't copy-paste a path into the navigation pane, so I had to navigate twice to reach the files.
- The default archive location is the LZA folder, not the folder you navigated to.
- It adds higher-level folders even though I only chose files.

E.g., I navigated to D:\FFXI logs\Logs (twice) and chose all the files to compress, but the default location is D:\Inbox\LZA082. Inside the finished archive, I have to go through the FFXI\LOGS folders to get to my files. I don't think higher-level folders should be put into the archive; it seems illogical to me.

Also, the program looks like it's freezing while it's doing its work.

I would wish for a right-click context menu option like 7-zip's, rather than having to go through an additional navigation step.

34. ## Thanks:

Nania Francesco (25th March 2015)
