
Thread: Duplicate File Finder Engine

  1. #1
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts

    Duplicate File Finder Engine

    I'm not sure that this is the right place to put this, but I made a fast duplicate file finder and wanted to share it. It takes options from a configuration file and writes a report of all duplicates found. The different configuration commands are as follows:


    Code:
    verbose={0,1,2}      :  Sets the level of console verbosity
    recurse={true,false} :  Sets the option to recurse into subdirectories
    report={filename}    :  Sets the output filename for the report
    filter={filter}      :  Sets the filter for finding files
    alldirs={true,false} :  Sets the option to search all directories, not just the ones that match the search filter
    AddPath {path}       :  Add a path to search for duplicates.  Multiple paths may be specified in the same configuration file
    The report format should be fairly simple. Finding duplicate files occurs in several stages. First, all files from the given paths are enumerated. They are then sorted by size and divided into groups of the same size; groups with only one item are discarded. Then, all files over 256 KB have a 16-byte sample taken from the middle and are sorted and divided into groups the same way they were divided by size. Finally, all remaining files are hashed using SHA-1 and the hashes are sorted to find duplicates.
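    The staged filtering above (size groups, then a 16-byte middle sample for large files, then a full SHA-1 hash) can be sketched roughly as follows. The original program is VB.net; this is an illustrative Python sketch, and all function names and constants here are my own, not taken from the attachment:

    ```python
    import hashlib
    import os
    from collections import defaultdict

    SAMPLE_THRESHOLD = 256 * 1024  # files larger than 256 KB get a middle sample first
    SAMPLE_SIZE = 16               # bytes read from the middle of the file

    def middle_sample(path, size):
        """Read SAMPLE_SIZE bytes from the middle of the file."""
        with open(path, "rb") as f:
            f.seek((size - SAMPLE_SIZE) // 2)
            return f.read(SAMPLE_SIZE)

    def sha1_of(path):
        """Full-content SHA-1, read in chunks to avoid loading large files at once."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(paths):
        # Stage 1: group by file size; singleton groups cannot contain duplicates.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)

        candidates = []
        for size, group in by_size.items():
            if len(group) < 2:
                continue
            # Stage 2: subdivide large files by a cheap 16-byte middle sample.
            if size > SAMPLE_THRESHOLD:
                by_sample = defaultdict(list)
                for p in group:
                    by_sample[middle_sample(p, size)].append(p)
                candidates.extend(g for g in by_sample.values() if len(g) > 1)
            else:
                candidates.append(group)

        # Stage 3: confirm remaining candidates with a full SHA-1 hash.
        duplicates = []
        for group in candidates:
            by_hash = defaultdict(list)
            for p in group:
                by_hash[sha1_of(p)].append(p)
            duplicates.extend(g for g in by_hash.values() if len(g) > 1)
        return duplicates
    ```

    The point of the first two stages is that size and a tiny sample are cheap to obtain, so the expensive full hash only runs on files that could plausibly be duplicates.
    
    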

    The program is written in VB.net, so the .NET Framework is required. File enumeration goes through the native API instead of the framework classes to increase speed. If you run into any errors, please let me know.
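    For illustration, a configuration file using the commands above might look like this. The specific values and paths are made up, and anything beyond the commands listed above is an assumption on my part:

    ```
    verbose=1
    recurse=true
    report=duplicates.txt
    filter=*.*
    alldirs=true
    AddPath C:\Users\me\Documents
    AddPath D:\Backup
    ```
    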
    Attached Files

  2. #2
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Mine is written in Perl/PHP.
    http://rtfreesoft.blogspot.com/2010/...e-initial.html

  3. #3
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts

  4. #4
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts
    Good work, roytam1! It's definitely a lot more portable than mine.

    Thanks for the link to finddupe. I may also work on a utility that, given the report as input, removes duplicates or replaces them with hardlinks. The problem I run into is choosing which file to keep from each list of duplicates. A GUI might be better suited for selecting the file to keep, but I don't think most people have the time to look over a list of several thousand duplicates.
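    The hardlink-replacement idea could look roughly like this. This is a sketch, not the actual utility; keeping the first file in each group is just a placeholder for the keep/discard policy discussed above, and the function name is my own:

    ```python
    import os

    def replace_with_hardlinks(duplicate_group):
        """Keep the first file in the group and replace the rest with hard links to it.

        Choosing which file to keep is the hard part; keeping the first path is
        only a placeholder policy. All paths must be on the same filesystem,
        since hard links cannot span volumes.
        """
        keep, *others = duplicate_group
        for path in others:
            os.remove(path)      # delete the duplicate...
            os.link(keep, path)  # ...and recreate it as a hard link to the kept file
    ```

    After this runs, the duplicate names still exist but share one copy of the data on disk, so editing one in place would affect the others.
    
    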

    I also benchmarked all three programs on the 'C:\cygwin' directory on my laptop, which contains 21,832 files and 5,192 directories, totalling 463,815,672 bytes. I ran 'dir /s C:\cygwin' before the benchmark to warm the filesystem cache, so that all engines start on equal footing. The tests were run in the following order:
    1. FileSystemDedup.exe
    2. finddupe.exe
    3. finddup.pl

    All times are the global times as measured by Timer 11.00. The results are as follows:
    Code:
    FileSystemDedup.exe      119 seconds
    finddupe.exe             168 seconds
    finddup.pl               245 seconds
    I also ran benchmarks on the speed of a second sequential test of each program:
    Code:
    FileSystemDedup.exe      10 seconds
    finddupe.exe             235 seconds
    finddup.pl               34 seconds
    However, being the author of one of the programs in the benchmark, I suggest that others run their own benchmarks to confirm the reliability of these results.

  5. #5
    Member
    Join Date
    Jan 2012
    Location
    cluj
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've been using this so far:

    http://duplicatefilesdeleter.com/

    but I guess I'll try yours too!

  6. #6
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    I use http://www.joerg-rosenthal.com/en/an.../download.html on my Win32 box. It's slow, but more useful for finding pictures with duplicate content. I don't have access to a Win64 box; anyone who has such an OS might like to try it.

  7. #7
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts
    Thanks for the links. Both AntiTwin and DuplicateFilesDeleter have more robust multimedia deduplication engines, but are slower because of it. They're also more user-friendly. I guess the main difference between most deduplication programs is how they handle the tradeoff between features, user-friendliness, and speed.

    It's difficult to benchmark GUI programs, so I'm not including them in the benchmark yet. However, if I find an accurate way to automate GUI benchmarking, I will include them.

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Look at AutoHotKey.

  9. #9
    Member Menno de Ruiter's Avatar
    Join Date
    Mar 2012
    Location
    ----------------------------------------------
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts
    ----------------------------------------------
    Last edited by Menno de Ruiter; 7th October 2013 at 15:39.

  10. #10
    Member
    Join Date
    Feb 2018
    Location
    Lahore
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    An easy way to find duplicate files is to use DuplicateFilesDeleter. Please try it.

  11. #11
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    406
    Thanks
    155
    Thanked 235 Times in 127 Posts
    KZo


  12. Thanks:

    slipstream (8th June 2018)

