Just last weekend, I set up a new (clean install) backup server for my main FreeNAS machine and started a manual full-pool backup between them. Both machines are fast enterprise hardware: the link between them is a direct 10G optical LAN (Chelsio), and both have plenty of fast NVMe ZIL/cache and 128GB of fast DDR4, with Xeon v4 CPUs on Supermicro baseboards. The pool I'm replicating/copying is 14TB of actual data, deduplicated from 35TB of referenced data (2.5x dedup). The pools are striped mirrors (4 vdevs, each a 3-way mirror of enterprise 6+TB 7200rpm disks), not RaidZ, so they don't even have parity to slow them down. Nothing else is running on the servers or their connection except the SSH sessions for the transfers. The zfs send command includes the args needed to send the data deduplicated (although, by oversight, not compressed).
Command on sender:
zfs send -vvDRLe mypool@latest_snapshot | nc -N BACKUP_IP BACKUP_PORT
Command on recipient:
nc -l BACKUP_PORT | zfs receive -vvFsd my_pool
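(If it helps with diagnosis: a minimal way to watch the actual on-wire rate is to splice pv into the sender's pipe - assuming pv is installed, e.g. from the sysutils/pv port; it shows elapsed time, total bytes, and current/average rates:)
zfs send -vvDRLe mypool@latest_snapshot | pv -trab | nc -N BACKUP_IP BACKUP_PORT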
I was expecting one of two things to happen: either it sends 14TB and finishes, or it nominally sends 35TB but the ~21TB of already-seen (deduped) data goes really fast and only 14-and-a-bit TB actually needs to be sent. Instead, it seems intent on sending all 35TB in full, and incredibly slowly at that - did I do something wrong or misunderstand something?
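In hindsight, a dry run would have answered the size question up front: zfs send -n combined with -v prints an estimated stream size without sending anything. (I believe the estimate is based on referenced data, so it wouldn't reflect any -D stream deduplication, but it shows what the command plans to walk:)
zfs send -nv -DRLe mypool@latest_snapshot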
What I don't get is that even with the snapshots/datasets being serialised, the backup server's disks have been running at almost 100% busy according to gstat, and have been for 4 full days now. The data is arriving correctly (I can mount the snapshots/datasets that have completed). But sending the entire pool looks like it will take about 7 days all-in, with almost 100% disk activity the whole time.
Transferring 14TB, or even 35TB, over a 10G link between two fast servers - whatever status info is displayed on the console - just shouldn't take that long, unless the process is incredibly inefficient, which seems unlikely.
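One way to see where the time goes is per-vdev stats on the receiver; zpool iostat breaks the activity down by vdev at a given interval (5 seconds here) and shows whether the writes are the large sequential ones I'd expect or lots of small scattered IOs:
zpool iostat -v my_pool 5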
Both systems can read/write even the spinning HDDs at almost 500 MB/s, ZFS optimises disk access, and it doesn't need to re-dedup the data since the stream is sent already deduplicated.
Why is it taking so long? Why isn't it just sending the raw blocks in the pool, one time only?
Replying to some points from comments:
- netcat (nc): netcat (nc) provides a bare, transparent, unencrypted TCP transport/tunnel to pipe data between two systems (among other uses) - a bit like ssh/VPN, but with no slowdown or repackaging beyond bare TCP handshakes on the wire. As far as zfs send/zfs receive are concerned, they are in direct communication, and beyond a tiny latency the netcat link should run at the maximum speed that send/receive can handle (see the raw-throughput baseline sketch after this list).
- Mirror disk speed: A mirror writes at the speed of its slowest disk, but ZFS stripes the data across the vdevs (4 vdevs on both systems, each vdev a mirror). With the source pool 55% full and the destination pool empty, and assuming the CPUs can keep up, ZFS should be able to read from 12 disks and write to 4 disks simultaneously, and the writes should be pretty much all sequential; there's no other IO activity. I figure the slowest disk in any mirror can write sequentially at >= 125 MB/s, which is way below the rate of a modern enterprise 7200rpm HDD, and the backup can be filled sequentially rather than with random IO. That's where I get a sustained replication rate of >> 500 MB/s (4 vdevs x 125 MB/s).
- Dedup table/RAM adequacy: The dedup table is about 40GB in RAM (from bytes per entry x total blocks in the source pool, per zdb). I've set a sysctl on both systems to reserve 85GB of RAM for the dedup table and other metadata, leaving about 35GB for cached data, before any use of L2ARC (if that's even used with send/receive). So the dedup table and metadata shouldn't be evicted from RAM on either machine (see the zdb check after this list).
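For the netcat point, here's a minimal sketch of how to baseline the bare nc path with ZFS taken out of the picture entirely (FreeBSD dd syntax; dd reports the achieved rate when it completes, and ~20GB of zeros should be enough to average out TCP ramp-up):
Command on recipient:
nc -l BACKUP_PORT > /dev/null
Command on sender:
dd if=/dev/zero bs=1m count=20000 | nc -N BACKUP_IP BACKUP_PORT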
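And for the dedup table figures, the DDT's in-core size can be read straight off zdb (the totals appear at the bottom of the histogram output), and the RAM reservation was a sysctl along these lines - I'm assuming the legacy FreeBSD ZFS knob name here; the value is 85GB expressed in bytes:
zdb -DD mypool
sysctl vfs.zfs.arc_meta_min=91268055040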
Speed and progress update:
- After 5 days of runtime, I have some updated progress stats: it's sending data at about 58 MB/sec on average. Not completely disastrous, but it still underpins the question above. I'd expect a rate about 10x that, since the disk sets can read from up to 12 HDDs at a time (almost 2 GB/sec) and write to up to 4 disks at a time (about 500 MB/sec). It doesn't have to dedup or re-dedup the data (AFAIK), it's running on 3.5 GHz 4- and 8-core Xeon v4s with tons of RAM on both systems, and the LAN can do 1 GB/sec.
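For what it's worth, the observed rate is self-consistent with the ~7-day estimate above - a quick check with bc:
echo "35 * 10^12 / (58 * 10^6) / 86400" | bc -l
# ~6.98, i.e. roughly 7 days to move 35TB at 58 MB/sec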