
Thread: pcompress, a deduplication/compression utility

  1. #91
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Coincidence: I haven't been visiting here for a while, just checked today and saw your question.

    Yes, you are right. My approach is not exactly MinHash but a derivative of sorts. I used multiple pieces of a long hash to improve probabilistic matching. I do approximate matching depending on whether any chunk's partial hash slice within a segment matches one from another segment. In that case the two segments will, with high probability, have at least a few chunks in common. Even the case where two slices match but the full hashes do not (i.e., the chunks are dissimilar) is considered. It works on the assumption that some level of similarity between the segments might exist, and it is better to check than to ignore.

    The current code is the result of a lot of experimentation with various approaches, thresholds, etc. However, not every bound has been tried. So, in light of your question, it does make sense to do some tests varying the number of slices. I remember initially using a non-crypto hash, without slicing of course; it yielded poorer results. However, the implementation has changed since then, so it is worth testing again.

  2. #92
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Hello again, any progress re the Windows port of pcompress? I believe the ETA you gave has passed. John

  3. #93
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Currently the author is working on the Mac port.

  4. #94
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    To add to that, this illustrates the point: http://matthewcasperson.blogspot.in/...r-dummies.html
    Especially the following steps:

    "
    1. Break down the document into a set of shingles.
    2. Calculate the hash value for every shingle.
    3. Store the minimum hash value found in step 2.
    4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.


    At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle."

    So effectively these are min-wise selections from independent (pseudo-)random permutations induced by different hash functions, so each element is hashed multiple times. Looking at this crudely, I took a single large crypto hash, broke it up to represent hash permutations of the same element, and then did a global k-min selection. It worked well.
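    To make the distinction concrete, here is a minimal Python sketch of both selection schemes. This is not the pcompress code; the chunk contents, slice width, and value of k are illustrative assumptions.

    Code:
    # Sketch only: classic MinHash (one min per salted hash) vs. the variant
    # above (slices of a single crypto hash pooled into a global k-min pick).
    import hashlib

    def classic_minhash(chunks, num_perms=8):
        # One minimum per independent "permutation", simulated here by salting.
        sigs = []
        for p in range(num_perms):
            salt = p.to_bytes(4, "big")
            sigs.append(min(hashlib.sha256(salt + c).digest() for c in chunks))
        return sigs

    def sliced_kmin(chunks, slice_bytes=8, k=8):
        # One SHA-256 per chunk, split into fixed-width slices; the slices of
        # all chunks in the segment are pooled and the k smallest are kept.
        slices = []
        for c in chunks:
            h = hashlib.sha256(c).digest()
            slices.extend(h[i:i + slice_bytes] for i in range(0, len(h), slice_bytes))
        return sorted(slices)[:k]

    def maybe_similar(sig_a, sig_b):
        # Segments are candidates for dedup/delta if any signature slice matches.
        return bool(set(sig_a) & set(sig_b))

    seg1 = [b"alpha", b"bravo", b"charlie", b"delta"]
    seg2 = [b"alpha", b"echo", b"charlie", b"foxtrot"]
    print(maybe_similar(classic_minhash(seg1), classic_minhash(seg2)))  # likely True
    print(maybe_similar(sliced_kmin(seg1), sliced_kmin(seg2)))  # likely True: shared chunks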

  5. #95
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Quote Originally Posted by avitar View Post
    Hello again, any progress re the Windows port of pcompress? I believe the ETA you gave has passed. John
    I would like to apologize. Yes, I was behind on many fronts, especially due to a job change, startup pressures and a few other personal things. I got the Mac (OS X) port working, as it was fairly easy. I will start on the Windows port now.

  6. #96
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by Skymmer View Post
    Currently the author is working on the Mac port.
    I thought he said the Windows port was next - sorry, didn't read his last post - dohhhhhhhh. J
    Last edited by avitar; 6th June 2014 at 19:59. Reason: dohhhhh

  7. #97
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by moinakg View Post
    Coincidence: I haven't been visiting here for a while, just checked today and saw your question.

    Yes, you are right. My approach is not exactly MinHash but a derivative of sorts. I used multiple pieces of a long hash to improve probabilistic matching. I do approximate matching depending on whether any chunk's partial hash slice within a segment matches one from another segment. In that case the two segments will, with high probability, have at least a few chunks in common. Even the case where two slices match but the full hashes do not (i.e., the chunks are dissimilar) is considered. It works on the assumption that some level of similarity between the segments might exist, and it is better to check than to ignore.
    I guess we're on the same page insofar as it's not an implementation of MinHash.

    If two cryptographic hashes have some similarity, but are not exact matches, all you can read into that is that the chunks differ. If similar chunks give similar hashes, that's a weakness in the hashing algorithm, and you should not see that in a cryptographic hash (except by pure chance).

    The current code is the result of a lot of experimentation with various approaches, thresholds, etc. However, not every bound has been tried. So, in light of your question, it does make sense to do some tests varying the number of slices. I remember initially using a non-crypto hash, without slicing of course; it yielded poorer results. However, the implementation has changed since then, so it is worth testing again.
    Since you've done a lot of empirical validation, you know whether it works -- there can't be any surprises. But maybe you could simplify your algorithm further if it turns out you thought you needed MinHash and in reality you didn't?
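    As a quick illustration of the avalanche point above (a hedged sketch, not anything from pcompress): two chunks differing in a single byte should produce digests differing in roughly half their bits, so partial-digest similarity carries no information about input similarity.

    Code:
    # Avalanche check: a one-byte input difference flips ~50% of SHA-256 bits.
    import hashlib

    a = hashlib.sha256(b"chunk-data-0").digest()
    b = hashlib.sha256(b"chunk-data-1").digest()
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    print(diff, "of", len(a) * 8, "bits differ")  # expect a value near 128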

  8. #98
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Hi again. Two questions:

    1) Is there a review of the features, advantages/disadvantages, pluses/minuses of pcompress vs zpaq and/or any other similar compressors, e.g. just 10 bullet points? If not, can you do one?
    I assume it'll then get reviewed by others on this forum.

    2) How is the pcompress Windows port going - any ETA? I assume it'll be one set of sources with different conditional code flavours inside (like zpaq) for Windows & Linux, so we can compare the two?

    John

  9. #99
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    pcompress is fast with good compression and deduplication. It beats zpaq over a narrow range of the Pareto frontier on the 10GB benchmark (near zpaq -method 3), but zpaq has a much wider range of size/speed tradeoff. Both support multi-threading and encryption. However, pcompress does not support journaling or incremental backups and only runs on Linux for now.

  10. #100
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Hello, see #98 (2) - are you there? Does anyone know why he isn't answering? John

  11. #101
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Hi moinakg, I've seen a few recent commits to your pcompress GitHub - any ETA yet for Windows pcompress? John

  12. #102
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    @avitar, I was away from this forum for a while - very busy with a variety of stuff. I had a list of TODO items which I have been slowly working through. I had several ideas that I wanted to implement, the most complicated being separating the data and metadata coming from libarchive into separate compressed streams. I finished this recently. I have also combined Dispack and the E8E9 filter from CSC 3.2 (thanks to Stephan Busch for pointing out CSC), apart from several other things (WavPack, PackPNM etc.).

    I am now working on the Windows port. It will take some time.
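    For readers unfamiliar with it, an E8E9 filter rewrites the relative displacements of x86 CALL (0xE8) and JMP (0xE9) instructions into absolute targets, so repeated calls to the same function become identical byte sequences that compress better. A minimal reversible sketch of the idea (not the actual CSC/pcompress filter, which adds heuristics against false positives):

    Code:
    # Sketch of the E8E9 transform: relative rel32 -> absolute target (mod 2^32).
    def e8e9(buf: bytes, encode: bool = True) -> bytes:
        out = bytearray(buf)
        i = 0
        while i + 5 <= len(out):
            if out[i] in (0xE8, 0xE9):  # CALL rel32 / JMP rel32 opcode
                val = int.from_bytes(out[i + 1:i + 5], "little")
                end = i + 5  # offset of the next instruction
                new = (val + end) if encode else (val - end)
                out[i + 1:i + 5] = (new & 0xFFFFFFFF).to_bytes(4, "little")
                i += 5  # skip the displacement bytes
            else:
                i += 1
        return bytes(out)

    code = bytes([0x90, 0xE8, 0x10, 0x00, 0x00, 0x00, 0x90])  # NOP; CALL +0x10; NOP
    assert e8e9(e8e9(code, True), False) == code  # round-trips losslessly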

  13. The Following 2 Users Say Thank You to moinakg For This Useful Post:

    samsat1024 (26th October 2014), surfersat (28th October 2014)

  14. #103
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    OK, hello again - sorry for the spam!

    I'm not 100% clear on how pcompress compares, e.g. with zpaq and maybe other similar compressors. Do you have this info somewhere?

    Will you be trying to keep one set of sources for Windows & Linux, i.e. so all future improvements land in both, e.g. as with zpaq?

    John

  15. #104
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Matt already summarized the differences with zpaq. For other comparisons, do you mean a functionality comparison? For benchmarks you can see here: http://mattmahoney.net/dc/10gb.html

    For some older benchmarks see here:
    http://moinakg.github.io/pcompress/#...ion-benchmarks
    These are currently out of date, since many changes have gone in after they were run, but they give an idea.

    Windows, OS X and Linux binaries will be built out of the same source tree. Currently the source tree supports OS X and Linux.

  16. #105
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    OK. Problem. It's probably me - I'm no Linux expert, and that is putting it very mildly. I have now downloaded the latest 64-bit Ubuntu, version 14.10. I got zpaq working by just following the instructions.

    Then I downloaded the latest pcompress (pcompress-master) from git and tried to build it using ./config.

    It starts OK, but says it can't find WavPack (what is that?) and suggests using the --disable option for it. However, this doesn't work either: it then says that yasm 1.1 is needed. So it seems as if pcompress, which is 64-bit, has never been built on 64-bit Ubuntu - surprising, since it's quite a common Linux, I'd have thought.

    Could be me, but time to ask for help & for a fix for pcompress on Ubuntu 64, please.

    TIA John

  17. #106
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Did you try the static Linux build from https://github.com/moinakg/pcompress...es/tag/3.0Beta ?

  18. #107
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Downloaded pcompress-static into a directory. I can't really understand what this pcompress does - something very clever; it looks like a wrapper script with a .bin and a lib that runs pcompress.bin, I think!

    ./pcompress --help doesn't give help, but ./pcompress 2>tmp puts the help into tmp
    ./pcompress -c filename archivename doesn't work, not sure why

    PS: just noticed that the text above where the download is mentioned says it is for Mint!
    So an explanation from moinakg or a fix for Ubuntu would be good; also, maybe the help on the git page needs improving to show the correct steps for building from source and to fix ./config.

    John
    Last edited by avitar; 31st October 2014 at 01:33. Reason: update

  19. #108
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    The git head is undergoing heavy development for a bunch of features, so things will be rough from time to time and the READMEs etc. will not always be up to date. Anyway, if you are starting from a fresh Ubuntu install, this is what you have to do:

    1) Ensure all the needed development packages are installed: sudo apt-get install git gcc g++ libc-dev libbz2-dev libz-dev openssl libssl-dev yasm
    2) Download the WavPack Unix sources from this link: http://www.wavpack.com/wavpack-4.70.0.tar.bz2
    3) Extract WavPack into the user's home directory:
    cd ~
    tar xvf ~/Downloads/wavpack-4.70.0.tar.bz2
    (If you download via the Firefox browser, it will put the file into the Downloads directory.)
    4) Now git clone https://github.com/moinakg/pcompress.git
    5) Run configure:
    cd pcompress
    ./config --wavpack-dir=~/wavpack-4.70.0
    6) Once config finishes successfully, run make.
    7) For help just run ./pcompress (without --help)

    The WavPack filter allows *.wav files to be compressed much better. There is now a PackPNM filter as well, along with other things like Dispack+E8E9, Dict, improved detection of file types using both extension and magic numbers, improved text vs. binary data detection, etc. One of the bigger features is support for compressing archive metadata into a separate stream, apart from the data stream. This improves compression and also makes listing very fast, since all the data does not have to be decompressed even though solid archiving mode is used.

    To archive and compress just run:

    ./pcompress -a <directory> <compressed file name>

    Max compression level:

    ./pcompress -a -l14 <directory> <compressed file name>

    Decompress:

    ./pcompress -d <compressed archive> <optional target directory>

    List archive contents:

    ./pcompress -i <compressed archive>
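    To illustrate the separate metadata stream idea mentioned above, here is a toy container layout in Python. This is not the pcompress file format (the footer, JSON metadata and zlib are assumptions purely for the sketch); it just shows why listing can avoid touching the data stream entirely.

    Code:
    # Toy archive: compressed data blobs, then one compressed metadata stream,
    # then an 8-byte footer giving the metadata length. Listing reads only the tail.
    import json, os, struct, zlib

    def write_archive(paths, out_path):
        meta = []
        with open(out_path, "wb") as out:
            for p in paths:
                data = open(p, "rb").read()
                blob = zlib.compress(data)
                meta.append({"name": p, "size": len(data), "clen": len(blob)})
                out.write(blob)
            mblob = zlib.compress(json.dumps(meta).encode())
            out.write(mblob)
            out.write(struct.pack("<Q", len(mblob)))  # footer: metadata stream length

    def list_archive(path):
        with open(path, "rb") as f:
            f.seek(-8, os.SEEK_END)
            (mlen,) = struct.unpack("<Q", f.read(8))
            f.seek(-(8 + mlen), os.SEEK_END)
            return json.loads(zlib.decompress(f.read(mlen)))  # no data blobs read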

  20. The Following User Says Thank You to moinakg For This Useful Post:

    avitar (3rd November 2014)

  21. #109
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Also, if you want to archive multiple items, do this:

    ./pcompress -a <directory1/file1> <directory2/file2> <directory3/file3> ... <archive name>

    The archive name comes last.

  22. The Following User Says Thank You to moinakg For This Useful Post:

    avitar (3rd November 2014)

  23. #110
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Thanks - I'll see if my Linux ability is adequate for these steps. Just as a thought, couldn't there be a compiled, downloadable snapshot executable on your site that'll just run on 64-bit Ubuntu? I guess that'll be the most popular platform for trying pcompress.

    BTW, maybe you could put this useful info in an Ubuntu build README?

    John
    Last edited by avitar; 3rd November 2014 at 18:09.

  24. #111
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Okay, I have done an Ubuntu 64-bit build and put the binaries here:
    http://sourceforge.net/projects/bele...r.bz2/download

    All you have to do is this:
    1) Extract: tar xjvf pc.tar.bz2
    2) Run: ./pc/pcompress

  25. The Following User Says Thank You to moinakg For This Useful Post:

    avitar (3rd November 2014)

  26. #112
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Thanks, this was easy & works OK.
    Q1: is there any version number/date available? -v or -V doesn't work.
    Q2: I can only get help via 2>tmp then more tmp. I thought the Linux norm was to write to stdout so one can use | more.

    John

  27. #113
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    The first line of the help text shows the version number.
    Help goes to stderr, since stdout may be redirected in pipe mode. You can do this:

    ./pc/pcompress 2>&1 | less

  28. The Following User Says Thank You to moinakg For This Useful Post:

    avitar (3rd November 2014)

  29. #114
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    OK.

    1) Surely adding -v and --help or -h, and printing the help to stdout for those options, wouldn't be hard & would make things clearer.
    2) Does the Ubuntu build you've done use your latest sources, and will it be kept up to date as you make more modifications?

    J

  30. #115
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I will look at adding a help option. Keeping a work-in-progress binary up to date is troublesome; I will post an updated binary once I make the next dot release.

  31. #116
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I have been working on Pcompress improvements only slowly, due to work pressures and lots of travel. No, the Windows version is not ready yet. However, I have completed a long list of logic fixes, improvements and new features that improve both compression efficiency and performance:


    1. Added a Dictionary preprocessor for text files. This one breaks words at predefined separator characters and builds a dictionary of the most common words (a hash table with LRU aging is used). The words are then encoded as base-217 numbers; a simplified sketch of the idea appears after this list. Capital conversion is done for proper-case words. Literal numbers in the text are also encoded using the base-217 representation. The dictionary is emitted along with the encoded data.
    2. Added a variant of the Dict preprocessor to improve compression of FASTA genome data files. This one breaks the text into 4-character words and omits the literal number encoding. It marks each encoded word using a prefix and a suffix character. If the prefix and/or suffix character occurs within the text itself, encoding is dropped.
    3. Added a Metadata Stream feature where the libarchive metadata blocks are collected together and compressed in a stream separate from the data. This allows better compression and faster access to metadata. For example, listing the archive's table of contents is now very fast, as the data blocks do not all have to be decompressed. I also plan to implement fast extraction of individual archive members based on this feature.
    4. Moved Dispack from raw-block to file-level processing. It detects code sections by parsing the Win32 PE headers and processes them. This is much more effective than just throwing raw blocks at Dispack. I plan to extend it to handle ELF32 and 32-bit Mach-O binaries as well.
    5. For raw blocks (non-Win32-PE executables) an improved E8E9 filter is used.
    6. Changed the text compression algorithm to prefer libbsc over PPMd8 (LZMA SDK) for plain text files, and to use PPMd for files having at least some percentage of markup characters like '<' and '>'.
    7. Fixed a bug where PPMd would be used for some kinds of binary data, resulting in slower and worse compression.
    8. Improved pathname sorting.
    9. Changed default Deduplication fragment/block size to 8KB.
    10. Got a slight LZMA performance improvement by using prefetch instructions in the matchfinder.
    11. Improved data analysis to better detect data characteristics.
    12. Many other bug fixes and improvements.
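    Here is the simplified sketch of the word-dictionary idea referenced in item 1. The real Dict filter uses LRU aging and carefully chosen digit byte ranges; the marker byte, digit offset, and space-only word splitting below are assumptions for illustration.

    Code:
    # Sketch of a word-dictionary preprocessor with base-217 index encoding.
    from collections import Counter

    MARK = 0x01               # hypothetical escape byte introducing an encoded word
    BASE, DIGIT0 = 217, 0x20  # hypothetical mapping of digits to bytes 0x20..0xF8

    def to_base217(n: int) -> bytes:
        digits = bytearray()
        while True:
            digits.append(DIGIT0 + n % BASE)
            n //= BASE
            if n == 0:
                return bytes(reversed(digits))

    def dict_encode(text: str, dict_size: int = 4096):
        words = text.split(" ")
        common = [w for w, _ in Counter(words).most_common(dict_size)]
        index = {w: i for i, w in enumerate(common)}
        out = []
        for w in words:
            if w in index:
                # Common word -> marker byte plus its index as base-217 digits.
                out.append(bytes([MARK]) + to_base217(index[w]))
            else:
                out.append(w.encode())
        # The dictionary itself is emitted along with the encoded data.
        return common, b" ".join(out)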


    I did some quick measurements on a MacBook Pro: Core i7-4750HQ (Haswell) CPU, 4 cores / 8 threads at 2 GHz, 8 GB RAM and a 256 GB PCIe solid-state disk (it does not sit on the SATA bus). Results are below; all times are in seconds.

    The 10GB benchmark
    Code:
    Command                                   Comp size   Compress time  Decompress time
    pcompress -a -l14 10gb 10gb.pz            2939491456  596            313
    pcompress -a -l14 -s60m -t5 10gb 10gb.pz  2896205146  865            328
    Compressing enwik9
    Code:
    Command                                     Comp size   Compress time  Decompress time
    pcompress -a -c libbsc -l14 -s1000m enwik9  163391992   182            97

  32. The Following 3 Users Say Thank You to moinakg For This Useful Post:

    avitar (2nd February 2015), Bulat Ziganshin (2nd February 2015), Matt Mahoney (2nd February 2015)

  33. #117
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by moinakg View Post
    Okay, I have done an Ubuntu 64-bit build and put the binaries here:
    http://sourceforge.net/projects/bele...r.bz2/download
    Thanks for this update - can you please update the 'Ubuntu' binary (I suppose it is really a general-purpose Linux binary)? Maybe you can keep it up to date as a 'snapshot' for those of us who are not very good with Linux and still want to try pcompress pending the Windows port.

    TIA John

  34. #118
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    The Ubuntu binary is available here:
    http://sourceforge.net/projects/belenix/files/pc.tar.bz2/download

    BTW, I had forgotten to mention: the changes also include the addition of a WavPack filter plugin. I'll probably add a plugin for Monkey's Audio next.

  35. #119
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I didn't see a pre-built binary but I was able to build in Ubuntu by downloading the wavpack Linux sources and installing with ./configure and make. However, when I run "pcompress -a -l14 10gb usb/10gb.pz" I get "illegal instruction (core dumped)". Same error trying to compress enwik9 as above. I am testing system 4 on the 10GB benchmark (Dell Latitude E6510 laptop, Core i7 M620, 2.66 GHz, 2+2 hyperthreads, 4 GB, Ubuntu). usb is a symbolic link to the external drive. I have gcc 4.8.2-19ubuntu1.

    Edit: I found the binaries in buildtmp and I am testing now.
    Last edited by Matt Mahoney; 3rd February 2015 at 22:22.

  36. #120
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Ah, I figured out the problem. The Ubuntu binary I posted on the SourceForge site will crash on non-Haswell processors due to GCC's use of AVX2 instructions in the generated code. I will have to rebuild with lower SSE flags.

    When building from source, the following steps are necessary:

    1. Download and unpack the WavPack source, but do not build it.
    2. Unpack the Pcompress source and run config in the pcompress dir like this:
      ./config --wavpack-dir=../wavpack-4.70.0
      This assumes that WavPack was unpacked in the same parent directory as pcompress.
    3. Run make
    4. Run ./pcompress ...
    Last edited by moinakg; 4th February 2015 at 00:11. Reason: Fix typo


