
I create RAW image files plus a small selection of JPEG files derived from the RAW masters. Once created, the JPEGs are very rarely edited again, but when they are, the whole file changes because it is recompressed. When editing the RAW images I use software that makes changes non-destructively: a preview file and a metadata file (XMP, <40 KB) are created, and together with a catalog they keep track of the changes.

I manage the preview and catalog file backups in a separate system, so for this question I’m only concerned with the RAWs, XMPs and JPEGs.

I want to back up all RAW, JPEG and XMP files offsite over a WAN connection, based on new and altered files on a filesystem that is scanned for changes once per day.

The de-duplication seems to work by reading portions of files and creating weak hashes to compare with those of all other portions of files. If a hash matches another, a stronger hash is created and the portions are compared again. If the portions still produce the same strong hash, the second portion isn't uploaded; instead, the backup system points the duplicated portion of the file to its previously backed-up copy.
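Roughly, that mechanism can be sketched like this. This is a minimal illustration of chunk-level de-duplication, not CrashPlan's actual implementation; the fixed 4 MiB chunk size, Adler-32 as the weak hash, SHA-256 as the strong hash, and the in-memory index are all assumptions made for the example:

```python
# Minimal sketch of chunk-based de-duplication as described above.
# Chunk size, hash choices and the in-memory index are assumptions;
# a real backup tool keeps a persistent index and may use variable-size chunks.
import hashlib
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed 4 MiB chunks (assumption)

weak_index = {}   # weak hash -> list of strong hashes seen with that weak hash
store = {}        # strong hash -> chunk bytes already "uploaded"

def backup_file(path):
    """Return a list of strong-hash references describing the file."""
    refs = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            weak = zlib.adler32(chunk)                 # cheap first-pass hash
            strong = hashlib.sha256(chunk).hexdigest() # expensive confirmation hash
            candidates = weak_index.setdefault(weak, [])
            if strong in candidates:
                # Duplicate chunk: record a pointer, upload nothing.
                refs.append(strong)
                continue
            # New chunk: "upload" it and remember both hashes for later comparisons.
            store[strong] = chunk
            candidates.append(strong)
            refs.append(strong)
    return refs
```

The CPU cost comes from hashing every chunk of every scanned file, and the RAM cost from holding the hash index, which is presumably why the process is so resource hungry.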

My question is…

  • If the RAW files don’t change and…
  • The JPEGs will rarely change and…
  • The XMP files may have portions changed and…
  • The CPU/RAM requirements for de-duplication are very high and…
  • Given that data de-duplication can reduce the amount of data transmitted…

…is it worth using de-duplication?

  • Which OS and which FS are you using or prepared to consider? I am currently working as a newbie with Btrfs, for which there is a project, [duperemove](https://github.com/markfasheh/duperemove/tree/v0.09-branch), that offers out-of-band (but online) deduplication. If you apply this to the source filesystem it should also make backups using `btrfs send` quicker (I think). By contrast, if you use `rsync` I don't think deduplication will speed up your backup. – gogoud Apr 01 '15 at 11:55
  • I'm using a Synology NAS (ext4) and uploading to CrashPlan. People who use CrashPlan commonly comment that the resource-hungry de-duplication process is a limiting factor in backup speed. Thanks for the response. – adrianlambert Apr 01 '15 at 17:58
  • This isn't a bad question. But the issue is that this is a massive headache, not just for you but for all users of digital asset management systems, and nobody can agree on the best way to handle sources versus derivative images. The "solution" really comes down to what works best for your particular process; not much else can determine that. – Giacomo1968 Apr 01 '15 at 23:03
  • I don't think you are considering my question in the way I had hoped. I'd like to establish how much de-duping rarely changing data will benefit the amount of data that can be backed up over a given period of time. I.e., which is faster: de-duplication enabled, or effectively disabled, in a system whose speed is being reduced by the de-duping process? – adrianlambert Apr 02 '15 at 07:48

0 Answers