
Our application has generated over 540k images so far. The images are kept in a directory tree that currently uses 5 million inodes.

We would like to back up the data daily to a remote off-site server. We thought of using rsync, but we're not sure it's the fastest way.

Do you guys have any recommendation for an efficient backup strategy?

raullarion
  • How much of it changes? Are there changes within files, or are only new files added (and existing stay the same)? Do you have any storage space (compression) requirements? What is your bandwidth between the source and destination? Do you have any hard caps on time taken (e.g. 2 hours) rather than just a "fastest"? – Bob Feb 01 '18 at 06:57
  • The files are subject to constant changes. Yesterday, our app generated 24k new files. We don't keep track of how many get removed on a daily basis, but it's safe to estimate that around 10k are removed daily. The app is hosted in Amsterdam; the backup server is on the US East Coast. We don't know the exact bandwidth between the servers, but we're using the same backup server for other apps and things run pretty fast. We don't have hard caps on time taken, but we don't want the backup process to generate high loads on our app server. – raullarion Feb 01 '18 at 07:09
  • 540K images, 5M inodes, lots of subdirectories? In applications that generate very many files, it is usually a good idea to store files in time-related directories to avoid directories with too many children. You then create a new directory every year/month/day/hour, depending on the creation rate. Is this already the case? If so, knowing when your last backup ran will help you focus on where to look for new files (yes, this doesn't take care of old files that have changed, but such applications also avoid changing old files, preferring to create new ones). – xenoid Feb 01 '18 at 07:39
  • Unfortunately, yes, too many subdirectories. It is a very big design flaw but I'm afraid there is not much we can do now for our existing files. – raullarion Feb 01 '18 at 10:31

1 Answer


Man, it takes such a long time to scan 5,000,000 inodes every single day to find files that changed!

What if there was a way to back up only the changes since the last backup?

Well, you can… with snapshots!

The biggest hurdle to snapshots is switching to a file system that supports them.

On Linux, two well-known snapshotting file systems are:

  • Btrfs – Designed for Linux, less battle-tested
  • ZFS – Ported to Linux, been around longer

Both are copy-on-write file systems. In practice, that means they track what has changed since the last snapshot, so when you send the latest snapshot to the backup server, only the changes are transferred. You still end up with a complete copy of every daily backup that you decide to keep.

This means that as a bonus, you have the potential of keeping more than one day of backups for not much extra space (only the disk space used by the changes each day), and you can flexibly delete the backups, keeping weekly, monthly, or yearly backups as you desire.
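As a rough sketch of the simplest retention policy (keep the newest N snapshots, list the rest for deletion), something like the following could work. The function name `snapshots_to_prune` and the sample snapshot names are made up for illustration, and it assumes snapshot names end in a timestamp that sorts chronologically. It only prints candidates; the actual deletion would be done separately with `btrfs subvolume delete` or `zfs destroy`.

```shell
# Print the snapshots that fall outside the newest KEEP entries.
# Reads one snapshot name per line on stdin. Assumes the names sort
# chronologically (e.g. they end in a date "+%Y%m%dT%H%M%S%Z" timestamp
# taken in a single time zone).
snapshots_to_prune() {
    keep=$1
    sort | awk -v keep="$keep" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Demo with hypothetical names: four daily snapshots, keep the newest two.
printf '%s\n' \
    app-data-20180201T030000UTC \
    app-data-20180203T030000UTC \
    app-data-20180202T030000UTC \
    app-data-20180204T030000UTC | snapshots_to_prune 2
```

Whatever policy you use, keep at least the newest snapshot that has already been transferred, on both servers, because the next incremental send needs it as its starting point. Weekly, monthly, or yearly retention would need extra date logic that this sketch does not attempt.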

Btrfs Incremental Backups

This is an example of commands you can run to make incremental backups and send them to your backup server:

# Make a snapshot
btrfs subvolume snapshot -r /app/data /backup/app-data-$(date "+%Y%m%dT%H%M%S%Z")

# Ensure the snapshot is saved
sync

# Find your latest snapshot, referred to as `/backup/app-data-THIS_BACKUP_TIMESTAMP` below
ls -lhtr /backup/

# Send the snapshot since the previous snapshot to the backup server
btrfs send -p /backup/app-data-LAST_BACKUP_TIMESTAMP /backup/app-data-THIS_BACKUP_TIMESTAMP | ssh BACKUP_USER@BACKUP_SERVER "btrfs receive /backup/app-data"

Note: Omit -p /backup/app-data-LAST_BACKUP_TIMESTAMP from the last command if this is the first backup, so that the full snapshot is sent.
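If you want to script the "find your latest snapshot" step instead of reading the `ls` output by eye, here is a minimal sketch. The helper names `pick_snapshots` and `btrfs_send_command` are hypothetical, and it assumes the timestamped snapshot names sort chronologically and contain no spaces. It only prints the `btrfs send` half of the pipeline so you can check it before running anything.

```shell
# Read snapshot paths (one per line) on stdin and print "PREV THIS":
# the two newest by name. PREV is "-" when only one snapshot exists.
pick_snapshots() {
    sort | tail -n 2 | awk '
        { a[NR] = $0 }
        END {
            if (NR == 2) print a[1], a[2]
            else if (NR == 1) print "-", a[1]
        }'
}

# Print the send command: incremental when a parent snapshot is given,
# full when the parent is "-" (first backup).
btrfs_send_command() {
    if [ "$1" = "-" ]; then
        echo "btrfs send $2"
    else
        echo "btrfs send -p $1 $2"
    fi
}

# Demo with sample names; on a real system you would feed in
# `ls -d /backup/app-data-*` instead of this hard-coded list.
pair=$(printf '%s\n' \
    /backup/app-data-20180201T030000UTC \
    /backup/app-data-20180131T030000UTC | pick_snapshots)
btrfs_send_command "${pair% *}" "${pair#* }"
```

The printed command would then be piped into `ssh BACKUP_USER@BACKUP_SERVER "btrfs receive /backup/app-data"` as in the manual example.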

ZFS Incremental Backups

This is an example of commands you can run to make incremental backups and send them to your backup server:

# Create a snapshot of the "data" dataset in your "app-pool" zpool
zfs snapshot app-pool/data@$(date "+%Y%m%dT%H%M%S%Z")

# Find your latest snapshot, referred to as `app-pool/data@THIS_BACKUP_TIMESTAMP` below
zfs list -rt snapshot app-pool/data

# Send the snapshot since the previous snapshot to the backup server
zfs send -i app-pool/data@LAST_BACKUP_TIMESTAMP app-pool/data@THIS_BACKUP_TIMESTAMP | ssh BACKUP_USER@BACKUP_SERVER "zfs receive backup-pool/app-data"

Note: Omit -i app-pool/data@LAST_BACKUP_TIMESTAMP from the last command if this is the first backup, so that the full snapshot is sent.
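To avoid picking the two snapshot names by hand, you could let `zfs list` order them by creation time and build the command from its output. This is only a sketch: the function name `zfs_send_pipeline` is made up, and it assumes its input is one snapshot name per line, oldest first, as `zfs list -H -o name -s creation -rt snapshot app-pool/data` should print them for a dataset without child datasets. It prints the pipeline rather than running it.

```shell
# Read snapshot names (oldest first, one per line) on stdin and print the
# send/receive pipeline for the newest one. With a single snapshot the
# output is a full send; with two or more it is incremental from the
# second-newest. BACKUP_USER, BACKUP_SERVER and backup-pool/app-data are
# the same placeholders used in the manual commands.
zfs_send_pipeline() {
    awk '
        { prev = this; this = $0 }
        END {
            if (NR == 0) exit 1
            if (NR > 1) inc = "-i " prev " "
            else inc = ""
            printf "zfs send %s%s | ssh BACKUP_USER@BACKUP_SERVER \"zfs receive backup-pool/app-data\"\n", inc, this
        }'
}

# Demo with sample names instead of real `zfs list` output.
printf '%s\n' \
    app-pool/data@20180131T030000UTC \
    app-pool/data@20180201T030000UTC | zfs_send_pipeline
```

Relying on the creation order reported by ZFS, rather than sorting by name, means the result does not depend on the timestamp format in the snapshot names.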

Deltik