# Thread: pcompress, a deduplication/compression utility

1. Coincidence: Haven't been visiting here for a while and just checked today and saw your question.

Yes, you are right. My approach is not exactly MinHash but a derivative of sorts. I used multiple pieces of a long hash to improve probabilistic matching. I do approximate matching depending on whether any chunk's partial hash slice within a segment matches one from another segment. In that case the two segments will have at least a few chunks in common with high probability. Even the case where two slices (of dissimilar chunks) match, but not the full hashes themselves, is considered. The assumption is that some level of similarity between the segments might exist, and it is better to check than to ignore.
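To make the slicing idea concrete, here is a minimal Python sketch of the scheme as described above (not the actual pcompress code; `SLICES`, the choice of SHA-256 and the helper names are illustrative assumptions):

```python
import hashlib

SLICES = 4  # assumed number of slices a chunk's long hash is cut into

def chunk_slices(data: bytes):
    """Split a chunk's crypto hash into fixed-size partial slices."""
    digest = hashlib.sha256(data).digest()  # 32 bytes
    step = len(digest) // SLICES
    return {digest[i * step:(i + 1) * step] for i in range(SLICES)}

def segments_similar(seg_a, seg_b):
    """Two segments are similarity candidates if any chunk's hash slice
    in one segment matches a slice from the other segment."""
    slices_a = set()
    for chunk in seg_a:
        slices_a |= chunk_slices(chunk)
    return any(s in slices_a
               for chunk in seg_b
               for s in chunk_slices(chunk))
```

Identical chunks share all their slices, so segments with common chunks match with high probability; a spurious slice collision between dissimilar chunks merely triggers an extra (cheap) check rather than a missed match, which is the "better to check than to ignore" trade-off.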

The current code is the result of a lot of experimentation with various approaches, thresholds etc. However, not every bound has been tried, so in light of your question it does make sense to run some tests varying the number of slices. I remember initially using a non-crypto hash, without slicing of course, and it yielded poorer results. However, the implementation has changed since then, so it is worth testing again.

2. Hello again, any progress re win port of pcompress - believe the eta you gave is past? john

3. Currently author works on MAC port.

4. To add to that this illustrates a point: http://matthewcasperson.blogspot.in/...r-dummies.html
Especially the following steps:

"
1. Break down the document into a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.

At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle."

So effectively these are min-wise selections from independent (pseudo)random permutations with different hash functions, i.e. each element is hashed multiple times. Looking at this crudely, I took a single large crypto hash, broke it up to represent multiple hash permutations of the same element, and then did a global k-min selection. It worked well.
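Crudely, that single-long-hash variant can be sketched like this (illustrative only; `PIECES`, `K_MIN` and the function names are assumptions, and true MinHash would use independent hash functions rather than pieces of one digest):

```python
import hashlib

PIECES = 4   # pieces one long crypto hash is split into
K_MIN = 20   # size of the global k-min similarity sketch

def segment_sketch(chunks):
    """Approximate a MinHash-style sketch: split each chunk's single
    SHA-256 digest into PIECES values, then keep the k smallest
    values across the whole segment."""
    values = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        step = len(digest) // PIECES
        for i in range(PIECES):
            values.append(digest[i * step:(i + 1) * step])
    return sorted(values)[:K_MIN]  # global k-min selection

def estimated_similarity(sk_a, sk_b):
    """Fraction of sketch values two segments have in common."""
    common = len(set(sk_a) & set(sk_b))
    return common / max(len(sk_a), 1)
```

Identical segments yield identical sketches (similarity 1.0), while unrelated segments share essentially no sketch values, which is all the dedupe stage needs to decide whether a closer comparison is worthwhile.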

5. Originally Posted by avitar
Hello again, any progress re win port of pcompress - believe the eta you gave is past? john
Would like to apologize. Yes, I was behind on many fronts, especially due to a job change, startup pressures and a few other personal things. I got the Mac (OS X) port working as it was fairly easy. I will start on the Windows port now.

6. Originally Posted by Skymmer
Currently author works on MAC port.
Was 'Thought he said win port was next' - sorry didn't read his last post - dohhhhhhhh j

7. Originally Posted by moinakg
Coincidence: Haven't been visiting here for a while and just checked today and saw your question.

Yes, you are right. My approach is not exactly MinHash but a derivative of sorts. I used multiple pieces of a long hash to improve probabilistic matching. I do approximate matching depending on whether any chunk's partial hash slice within a segment matches one from another segment. In that case the two segments will have at least a few chunks in common with high probability. Even the case where two slices (of dissimilar chunks) match, but not the full hashes themselves, is considered. The assumption is that some level of similarity between the segments might exist, and it is better to check than to ignore.
I guess we're on the same page insofar as it's not an implementation of MinHash.

If two cryptographic hashes have some similarity, but are not exact matches, all you can read into that is that the chunks differ. If similar chunks give similar hashes, that's a weakness in the hashing algorithm, and you should not see that in a cryptographic hash (except by pure chance).
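The avalanche property being described is easy to demonstrate: with a cryptographic hash, a one-bit change in the input flips roughly half of the output bits, so near-matches between digests carry no similarity signal (a small illustrative Python check, not from pcompress):

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Hamming distance between the SHA-256 digests of two inputs."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")

# Even inputs differing by a single character produce digests that
# differ in about 128 of the 256 bits, i.e. they look unrelated.
```

So partial digest similarity between two distinct chunks is pure chance, which is exactly why only exact slice matches are meaningful.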

The current code is the result of a lot of experimentation with various approaches, thresholds etc. However, not every bound has been tried, so in light of your question it does make sense to run some tests varying the number of slices. I remember initially using a non-crypto hash, without slicing of course, and it yielded poorer results. However, the implementation has changed since then, so it is worth testing again.
Since you've done a lot of empirical validation, you know if it works -- there can't be any surprises. But maybe you could simplify your algorithm further, if you thought you needed MinHash and in reality you didn't?

8. Hi again. Two Qs

1) Is there a review of features, advantages/disadvantages, plusses/minuses of pcompress vs zpaq and/or any other similar compressors eg just 10 bullet points. If not can you do one?
Assume it'll get then reviewed by others on this forum.

2) how is pcompress windows port going, eta? Assume it'll be one source file with different conditional code flavours inside (like zpaq) for win & linux so we can compare the 2?

John

9. pcompress is fast with good compression and deduplication. It beats zpaq over a narrow range of the Pareto frontier on the 10GB benchmark (near zpaq -method 3), but zpaq has a much wider range of size/speed tradeoff. Both support multi-threading and encryption. However, pcompress does not support journaling or incremental backups and only runs on Linux for now.

10. Hello see #98 2), are you there? Anyone know why he isn't answering? John

11. Hi moinakg, seen few recent commits to your pcompress github - any eta yet for win pcompress? John

12. @avitar, I was away from this forum for a while, very busy with a variety of stuff. I had a list of TODO items which I have been slowly working on. I had several ideas that I wanted to implement, the most complicated being separating data and metadata from libarchive into separate compressed streams. This I finished recently. I have also combined Dispack and the E8E9 filter from CSC 3.2 (thanks to Stephan Busch for pointing out CSC), apart from several other things (WavPack, PackPNM etc).

I am now working on the Windows port. It will take some time.

13. ## The Following 2 Users Say Thank You to moinakg For This Useful Post:

samsat1024 (26th October 2014),surfersat (28th October 2014)

14. ok, hello again, sorry for the spam!

I'm not 100% clear how pcompress compares eg with zpaq and maybe other similar compressors. Do you have this info somewhere?

Will you be trying to keep one set of sources for win & linux ie so all future improvements are in both eg like on zpaq?

John

15. Matt already summarized the differences with Zpaq. For other comparisons do you mean functionality comparison? For benchmarks you can see here: http://mattmahoney.net/dc/10gb.html

For some old benchmarks see here:
http://moinakg.github.io/pcompress/#...ion-benchmarks
These are currently out of date since many changes have gone in after these were done, but they give an idea.

Windows, OS X and Linux binaries will be built out of the same source tree. Currently the source tree supports OS X and Linux.

16. OK. Problem. It's probably me, I'm no linux expert and that is putting it very mildly. Now downloaded latest ubuntu 64 bit, ver 14.10. Got zpaq working, just followed instructions.

Then I downloaded latest pcompress pcompress-master from git, and then tried to install it, using ./config.

Starts ok, but says it can't find wavpack (what is that?) so says to use the --disable for it. However this doesn't work either. It now says that yasm 1.1 is needed, so seems as if pcompress which is 64 bit has never been installed on ubuntu 64. Surprising since it's quite a common linux I'd have thought.

Could be me, but time to ask for help & for a fix, for pcompress for ubuntu 64 please.

TIA John

17. Did you try the static Linux build from https://github.com/moinakg/pcompress...es/tag/3.0Beta ?

18. Downloaded pcompress-static into directory. Can't really understand what this pcompress does, something v clever, looks like a batch file with a .bin and a lib that runs pcompress.bin - I think!

./pcompress --help doesn't give help, but ./pcompress 2>tmp puts help into tmp
./pcompress -c filename archivename doesn't work, not sure why

PS just noticed that the text above where download is mentioned says it is for mint!
So explanation from moinakg or fix for ubuntu would be good, plus maybe help on git needs improving to show correct steps in making from source and fixing the ./config.

John

19. The git head is undergoing heavy development for a bunch of features, so things will be rough from time to time and the READMEs etc. will not be updated. Anyway, if you are starting from a fresh Ubuntu install, this is what you have to do:

1) Ensure all the needed development packages are installed: sudo apt-get install git gcc g++ libc-dev libbz2-dev libz-dev openssl libssl-dev yasm
2) Download the WavPack source (wavpack-4.70.0) and extract it into your home directory:
cd ~
3) Now git clone https://github.com/moinakg/pcompress.git
4) Run configure:
cd pcompress
./config --wavpack-dir=~/wavpack-4.70.0
5) Once config finishes successfully, run make.
6) For help just run ./pcompress (without --help)

The WavPack filter allows compressing *.wav files much better. There is now a PackPNM filter as well along with other things like Dispack+E8E9, Dict, improved detection of file types using both extension and magic numbers, improved text vs binary data detection, etc. One of the bigger features is support to compress archive metadata into a separate stream, separate from the data stream. This improves compression as well as makes listing very fast without having to decompress all the data, even though solid archiving mode is used.

To archive and compress just run:

./pcompress -a <directory> <compressed file name>

Max compression level:

./pcompress -a -l14 <directory> <compressed file name>

Decompress:

./pcompress -d <compressed archive> <optional target directory>

List archive contents:

./pcompress -i <compressed archive>

20. ## The Following User Says Thank You to moinakg For This Useful Post:

avitar (3rd November 2014)

21. Also if you want to archive multiple items then do this:

./pcompress -a <directory1/file1> <directory2/file2> <directory3/file3> ... <archive name>

The archive name comes last.

22. ## The Following User Says Thank You to moinakg For This Useful Post:

avitar (3rd November 2014)

23. Thanks - I'll see if my linux ability is adequate for these steps. Just as a thought, couldn't there be a compiled downloadable exe snapshot version on your site that'll just run on ubuntu 64 - guess that'll be the most popular platform for trying pcompress?

btw, maybe you could put this useful info in a ubuntu_make_readme?

John

24. Okay, I have done an Ubuntu 64-bit build and put the binaries here:

All you have to do is this:
1) Extract: tar xjvf pc.tar.bz2
2) Run: ./pc/pcompress

25. ## The Following User Says Thank You to moinakg For This Useful Post:

avitar (3rd November 2014)

26. Thanks, this was easy & works ok.
q1: is there any version no/date available -v or -V doesn't work
q2: I can only get help by 2>tmp then more tmp. I thought the linux norm was to std out so one can use |more

John

27. First line of help text shows version number.
Help goes to stderr since stdout may be redirected in pipe mode. You can do this:

./pc/pcompress 2>&1 | less

28. ## The Following User Says Thank You to moinakg For This Useful Post:

avitar (3rd November 2014)

29. ok

1) surely adding -v and --help or -h & outputting to std out for that parameter wouldn't be hard & would make things clearer.
2) does the ubuntu build you've done use your latest sources and will it be kept up to date as you do more mods.

J

30. I will look at adding a help option. Keeping work-in-progress binary up to date is troublesome. I will have an updated binary once I make the next dot release.

31. Have been slowly working on Pcompress improvements due to work pressures and lots of travel. No, Windows version is not ready yet. However I have done a long list of logic fixes, improvements and new features that improve both compression efficiency and performance:

1. Added a Dictionary preprocessor for text files. This one breaks words at predefined separator characters and builds a dictionary of most common words (A hash table with LRU aging is used). The words are then encoded using a base-217 number. Capital conversion is done for proper case words. Literal numbers in the text are encoded using the base-217 representation. The dictionary is emitted along with the encoded data.
2. Added a variant of the Dict preprocessor to improve compression of FASTA genome data files. This one breaks the text into 4-character words and omits the literal number encoding. It marks each encoded word using a prefix and a suffix char. If the prefix and/or suffix occurs within the text then encoding is dropped.
3. Added a Metadata Stream feature where the Libarchive metadata blocks are collected together and compressed in a stream separate from the data. This allows better compression and faster access to metadata. For example, listing the archive's table of contents is now very fast since the data blocks do not have to be decompressed. I also plan to implement fast extraction of individual archive members based on this feature.
4. Moved Dispack from raw-block to file-level processing. It detects code sections by parsing the Win32-PE headers and processes them. This is much more effective than just throwing raw blocks at Dispack. I plan to extend it to handle Elf32 and Mach-O 32-bit binaries as well.
5. For raw blocks (non Win32-PE Exes) an improved E8E9 filter is used.
6. Changed text compression algorithm to prefer Libbsc over PPMd8(LZMA SDK) for plain text files and use PPMd for files having at least some percentage of markup characters like '<' and '>'.
7. Fixed a bug where PPMd would be used for some kinds of binary data, resulting in slower and worse compression.
8. Improved pathname sorting.
9. Changed default Deduplication fragment/block size to 8KB.
10. Got a slight LZMA performance improvement by using prefetch instructions in the matchfinder.
11. Improved data analysis to better detect data characteristics.
12. Many other bug fixes and improvements.
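For item 1, the base-217 index encoding can be illustrated with a tiny sketch (a hypothetical little-endian digit layout; the actual pcompress encoding details may differ):

```python
BASE = 217  # number of byte values used as "digits" (assumed layout)

def encode_index(n: int) -> bytes:
    """Encode a dictionary word index as a little-endian base-217 number."""
    digits = []
    while True:
        digits.append(n % BASE)  # least significant digit first
        n //= BASE
        if n == 0:
            break
    return bytes(digits)

def decode_index(b: bytes) -> int:
    """Invert encode_index: rebuild the integer from its digits."""
    n = 0
    for d in reversed(b):
        n = n * BASE + d
    return n
```

The point of a base larger than needed digit-by-digit is density: the 4339 (217 squared) most common words fit in at most two bytes, so frequent long words shrink substantially before the backend compressor even runs.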

I did some quick measurements on a MacBook Pro: Core i7-4750HQ (Haswell) CPU, 2 GHz, 4 cores/8 threads, 8GB RAM and a 256 GB PCIe-based Solid State Disk (does not sit on the SATA bus). Results below. All times are in seconds.

The 10GB benchmark
Code:
```Command                                   Comp size   Compress time  Decompress time
pcompress -a -l14 10gb 10gb.pz            2939491456  596            313
pcompress -a -l14 -s60m -t5 10gb 10gb.pz  2896205146  865            328```
Compressing enwik9
Code:
```Command                                     Comp size   Compress time  Decompress time
pcompress -a -c libbsc -l14 -s1000m enwik9  163391992   182            97```

32. ## The Following 3 Users Say Thank You to moinakg For This Useful Post:

avitar (2nd February 2015),Bulat Ziganshin (2nd February 2015),Matt Mahoney (2nd February 2015)

33. Originally Posted by moinakg
Okay, I have done an Ubuntu 64-bit build and put the binaries here:
Thanks, for this update - can you please update the 'ubuntu' binary - suppose it is really a general purpose linux binary- maybe you can keep it up to date as a 'snapshot' for those of us who are not very good with linux and still want to try pcompress pending the win port.

TIA John

34. Ubuntu binary is available here:

BTW I had forgotten to mention. The changes also include addition of a WavPack filter plugin. I'll probably add a plugin for Monkey's Audio next.

35. I didn't see a pre-built binary but I was able to build in Ubuntu by downloading the wavpack Linux sources and installing with ./configure and make. However, when I run "pcompress -a -l14 10gb usb/10gb.pz" I get "illegal instruction (core dumped)". Same error trying to compress enwik9 as above. I am testing system 4 on the 10GB benchmark (Dell Latitude E6510 laptop, Core i7 M620, 2.66 GHz, 2+2 hyperthreads, 4 GB, Ubuntu). usb is a symbolic link to the external drive. I have gcc 4.8.2-19ubuntu1.

Edit: I found the binaries in buildtmp and I am testing now.

36. Ah, I figured out the problem. The Ubuntu binary I posted on the SourceForge site will crash on non-Haswell processors due to the use of AVX2 by GCC in the generated code. I will have to rebuild with lower SSE flags.

When building from source the following steps are necessary:

1. Download and unpack the WavPack source (wavpack-4.70.0) next to the pcompress directory.
2. Unpack the Pcompress source and run config in the pcompress dir like this:
./config --wavpack-dir=../wavpack-4.70.0
This assumes that WavPack was unpacked in the same parent directory as pcompress.
3. Run make
4. Run ./pcompress ...

