
Thread: FastBackup: yet another deduplication engine

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts

    FastBackup: yet another deduplication engine

I've started development of one more zpaq/exdupe clone. The very first version, which just deduplicates data in memory and doesn't write any archive, is available at http://freearc.org/download/research/fb001.zip

But I need your help: I'm bad at coming up with good program names. Can you suggest one? Generally speaking, I plan to make an all-in-one incremental backup tool, with dedup, compression, encryption, a recovery record, and so on.

  2. #2
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    191
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Ziganshin,

Would "Data Squeezer" be suitable?

    FatBit

  3. #3
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
The program is really very fast - congratulations!

- As a name for the program I would suggest "fastback" or "fastbak"
- A recovery record as in RAR sounds wonderful
- Maybe a special variant of a well-proven archive format like 7z, but with only 1 or 2 compression methods?

Best regards

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Incremental backups (at least in the zpaq/exdupe style) require a new archive format. That's why I can't just add new features to FreeArc/7-Zip. Also, the name "Data Squeezer" seems more appropriate for an archiver than for a backup tool.

For now I have the names fastback and sciback (both "scientific" and "sky"-like, suggesting "cloud backup"). More ideas, please!

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
More ideas, please!
It uses the first part of your family name, it "zigzags" files together at fast speed, and the name is not in use except for one Gmail account.

    Zigzagfast
    Last edited by Sportman; 16th May 2014 at 17:19. Reason: Added where the name idea came from

  6. #6
    Member Bloax's Avatar
    Join Date
    Feb 2013
    Location
    Dreamland
    Posts
    52
    Thanks
    11
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Sportman View Post
    Zigzagfast
    ZippyDeDup ~ Use it for backup :^)
    Last edited by Bloax; 16th May 2014 at 17:18.

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    How would you explain the difference between deduplication and compression?

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
A dedup engine splits input data into content-defined chunks and stores the SHA hash of every chunk in the archive. New data is then checked against existing chunks, and only new chunks are added to the archive.
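
To make that concrete, here is a minimal Python sketch of the store-by-hash idea. The boundary function and all constants are toy assumptions for illustration, not the actual rolling-hash scheme zpaq or FastBackup uses:

[code]
import hashlib

MIN_CHUNK = 2048            # lower bound on chunk size
MAX_CHUNK = 65536           # upper bound on chunk size
AVG_MASK = (1 << 13) - 1    # cut when the low 13 hash bits are zero (~8 KB average)

def chunks(data: bytes):
    """Yield content-defined chunks: boundaries depend on the data itself."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF          # toy rolling hash
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & AVG_MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup(data: bytes, store: dict) -> int:
    """Add only chunks whose hash is unseen; return bytes actually stored."""
    new_bytes = 0
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:                  # known chunk -> skip it
            store[digest] = chunk
            new_bytes += len(chunk)
    return new_bytes
[/code]

Real engines use a windowed rolling hash so that boundaries resynchronize shortly after an insertion; this sketch only shows the chunk-and-store-by-hash part.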

  9. #9
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    404
    Thanked 403 Times in 153 Posts
    FreeArc ... FreeBackup? Or BulletBackup (Backup from Bulat)

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
BulatBackup looks interesting...

  11. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Bloax View Post
    ZippyDeDup ~ Use it for backup :^)
I kinda like this one, but a lot of non-native English speakers probably won't get it. (A punny reference to "Zip-a-Dee-Doo-Dah", made famous by a Disney movie.) I don't know if kids these days watch "Song of the South," either.

    https://www.youtube.com/watch?v=LcxYwwIL5zQ

    The reference would be clearer if it was "ZipADeDup," but probably still not recognizable enough.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I would just call it freearc and add it as a new format.

  13. #13
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    358
    Thanks
    12
    Thanked 36 Times in 30 Posts
    sr (sleepright)
    ---
I suggest some ideas for a "definitive" backup: create two files,
one index and one (or more) data file.
The data file always grows but never changes (i.e. it is append-only: only new data is added).
This is great for rsync (with or without --append).

The index file allows very fast extraction of files (the main "limitation" of zpaq).
Insert a recovery header into the data file (for the chunks), just in case the index is lost
(or you don't need fast extraction).
    ---
Eventually, split (based on the SHAs) across more than one data file.

    Example: when adding
    file1.txt hash aba12345...
    file2.txt hash aa0...
    file3.txt hash aa3...

add the first to
ab_backup.sr

and the second and third to
aa_backup.sr

If you have a very big backup, you can easily split it (sharding) without needing to "choose" where to add.
Using 0, 1, 2 or 3 prefix characters per level (in hex) you get 1, 16, 256 or 4096 sets.
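
A minimal Python sketch of this prefix sharding (the *_backup.sr naming follows the example above; SHA-256, the two-character prefix and the output directory are assumptions for illustration):

[code]
import hashlib
import os

PREFIX_LEN = 2        # 2 hex chars -> 16**2 = 256 possible shards
BASE_DIR = "backup"   # hypothetical output directory

def shard_path(chunk: bytes) -> str:
    """Route a chunk to the data file named after its SHA-256 prefix."""
    prefix = hashlib.sha256(chunk).hexdigest()[:PREFIX_LEN]
    return os.path.join(BASE_DIR, prefix + "_backup.sr")

def append_chunk(chunk: bytes) -> None:
    """Append-only writes, as proposed: data files grow but never change."""
    os.makedirs(BASE_DIR, exist_ok=True)
    with open(shard_path(chunk), "ab") as f:
        f.write(chunk)
[/code]

Since the shard is a pure function of the hash, no central lookup is needed to decide where a chunk goes.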
    Last edited by fcorbelli; 16th May 2014 at 21:20.

  14. Thanks:

    Bulat Ziganshin (23rd May 2014)

  15. #14
    Member
    Join Date
    Oct 2009
    Location
    usa
    Posts
    60
    Thanks
    1
    Thanked 9 Times in 6 Posts
I also think you should add it to freearc as a new format, independent of .arc, or even compile it as an option into the arc.exe file.

  16. #15
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    358
    Thanks
    12
    Thanked 36 Times in 30 Posts
Here's a very early test on my source tree:
    [code]
    F:\zarc>z:\fb64.exe -t4 f:\zarc
    4+2 threads * 8mb buffers, 48b..4kb.. chunks
    Scanning: 6,238,754,776 bytes in 32,976 files
    100.0% deduplicated: 6,238,754,776 => 3,952,103,894 bytes

    F:\zarc>zpaq64 a r:\temp\kao f:\zarc\*.* -method 0
    zpaq v6.51 journaling archiver, compiled May 7 2014
    ....
    0 + (6238754776 -> 4164812809) = 4164812809
93.907 seconds (all OK)[/code]
As you can see, the deduplication seems good (better than zpaq's).

  17. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Quote Originally Posted by nburns View Post
    How would you explain the difference between deduplication and compression?
Deduplication only looks for repeated sequences above some fairly large length threshold. Compression usually also involves gathering statistics and/or looking for short repeated patterns.

  18. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
zpaq uses a 64 KB average fragment size by default. To test 4 KB fragments, use the option -fragment 2.
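
For example, the earlier test could be rerun with small fragments like this (same archive and paths as in the test above; this assumes the two options combine, with the average fragment size being 2^N KB so that -fragment 2 gives 4 KB):

[code]
zpaq64 a r:\temp\kao f:\zarc\*.* -method 0 -fragment 2
[/code]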

    Also, there is already a compressor named sr (symbol ranking).

  19. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Also, there is already a compressor named sr (symbol ranking).
    If sr is taken, the logical second choice is jr.

  20. Thanks:

    PSHUFB (14th July 2014)

  21. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
A dedup engine splits input data into content-defined chunks and stores the SHA hash of every chunk in the archive. New data is then checked against existing chunks, and only new chunks are added to the archive.
    That's pretty much what I thought, but that's just a kind of dictionary compression.

    Are you familiar with git's object database? The applications probably don't overlap perfectly, but it's probably worth studying git's design decisions and how they turned out. Backup and revision control are not thought of as the same, but there is a fair amount of common ground IMO.

  22. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Quote Originally Posted by fcorbelli View Post
As you can see, the deduplication seems good (better than zpaq's).
They are exactly the same, since I'm using zpaq's chunking algo; the difference is due to a smaller default chunk size and the lack of an archive index.

I plan to beat zpaq on speed as well as flexibility, but the deduplication efficiency will probably remain the same.
    Last edited by Bulat Ziganshin; 16th May 2014 at 23:00.

  23. #21
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    377
    Thanks
    26
    Thanked 23 Times in 16 Posts
    Backup ULtimATive

  24. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Why not just Backup ULtimATe?

  25. #23
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    377
    Thanks
    26
    Thanked 23 Times in 16 Posts
Do you also want to define a decompression language like ZPAQL, so it's possible to create your own custom compression algorithms that everyone can still decode?

  26. #24
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    377
    Thanks
    26
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
Why not just Backup ULtimATe?
If it's more correct in English, then OK; I thought "ultimative" would sound nicer.

  27. #25
    Member Bloax's Avatar
    Join Date
    Feb 2013
    Location
    Dreamland
    Posts
    52
    Thanks
    11
    Thanked 2 Times in 2 Posts
    The least cheesy solution would be to integrate it into FreeArc, yes.

  28. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
Using a fixed format like Tornado's should be faster than a decompression language like zpaq's. The disadvantage is losing compatibility when you want to improve the compression. I think this is more of a problem for high-end compression like CM, where you are more likely to make changes.

  29. #27
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    223
    Thanked 146 Times in 83 Posts
    For me ..
    FLASHDUP
    or
    STORMDUP

  30. #28
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    358
    Thanks
    12
    Thanked 36 Times in 30 Posts
    FatBackup?

  31. #29
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
bzig2 -- from your first initial and the first three letters of your last name, and "2" because it's your second backup program. Do you think it sounds original enough?

  32. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Maybe something Hollywood sounding, like: The Synchronizer.

