Just last weekend, I set up a new (clean install) backup server for my main FreeNAS machine and started a manual full-pool backup between them. Both machines are fast enterprise hardware: the link between them is a direct 10G optical LAN (Chelsio), and both have plenty of fast NVMe ZIL/cache and 128GB of fast DDR4, with Xeon v4 CPUs on Supermicro baseboards. The pool I'm replicating/copying is 14TB of actual data, deduplicated from 35TB of referenced data (2.5x dedup). The pools are striped mirrors (4 vdevs, each a 3-way mirror of enterprise 6+TB 7200rpm disks), not RaidZ, so they don't even have parity to slow them down. Nothing else is running on the servers or their connection except the SSH sessions for the transfers. The zfs send command includes the args needed to send the data deduplicated (although, by oversight, not compressed).
Command on sender:
zfs send -vvDRLe mypool@latest_snapshot | nc -N BACKUP_IP BACKUP_PORT
Command on recipient:
nc -l BACKUP_PORT | zfs receive -vvFsd my_pool
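(If it helps with diagnosis: a minimal way to watch the actual on-wire rate is to splice pv into the sender's pipe - assuming pv is installed, e.g. from the sysutils/pv port; it shows elapsed time, total bytes, and current/average rates:)
zfs send -vvDRLe mypool@latest_snapshot | pv -trab | nc -N BACKUP_IP BACKUP_PORT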
I was expecting one of two things to happen: either it sends 14TB and finishes, or it nominally sends 35TB but the ~21TB of already-seen (deduped) data goes really fast and only 14-and-a-bit TB actually needs to be sent. Instead, it seems intent on sending all 35TB in full, and incredibly slowly at that - did I do something wrong or misunderstand something?
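In hindsight, a dry run would have answered the size question up front: zfs send -n combined with -v prints an estimated stream size without sending anything. (I believe the estimate is based on referenced data, so it wouldn't reflect any -D stream deduplication, but it shows what the command plans to walk:)
zfs send -nv -DRLe mypool@latest_snapshot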
What I don't get is that even with the snapshots/datasets being serialised, the backup server's disks have been running at almost 100% busy according to gstat, and have been for 4 full days now. The data is arriving correctly (I can mount the snapshots/datasets that have completed). But sending the entire pool looks like it will take about 7 days all-in, with almost 100% disk activity the whole time.
Transferring 14TB, or even 35TB, over a 10G link between two fast servers - whatever status info is displayed on the console - just shouldn't take that long, unless the process is incredibly inefficient, which seems unlikely.
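One way to see where the time goes is per-vdev stats on the receiver; zpool iostat breaks the activity down by vdev at a given interval (5 seconds here) and shows whether the writes are the large sequential ones I'd expect or lots of small scattered IOs:
zpool iostat -v my_pool 5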
Both systems can read/write even the spinning HDDs at almost 500 MB/s, ZFS optimises disk access, and it doesn't need to re-dedup the data since the stream is sent already deduplicated.
Why is it taking so long? Why isn't it just sending the raw blocks in the pool, one time only?
Replying to some points from comments:
- netcat (nc): netcat (nc) provides a bare, transparent, unencrypted TCP transport/tunnel to pipe data between two systems (among other uses) - a bit like ssh/VPN, but with no slowdown or repackaging beyond bare TCP handshakes on the wire. As far as zfs send/zfs receive are concerned, they are in direct communication, and beyond a tiny latency the netcat link should run at the maximum speed that send/receive can handle (see the raw-throughput baseline sketch after this list).
- Mirror disk speed: A mirror writes at the speed of its slowest disk, but ZFS stripes the data across the vdevs (4 vdevs on both systems, each vdev a mirror). With the source pool 55% full and the destination pool empty, and assuming the CPUs can keep up, ZFS should be able to read from 12 disks and write to 4 disks simultaneously, and the writes should be pretty much all sequential; there's no other IO activity. I figure the slowest disk in any mirror can write sequentially at >= 125 MB/s, which is way below the rate of a modern enterprise 7200rpm HDD, and the backup can be filled sequentially rather than with random IO. That's where I get a sustained replication rate of >> 500 MB/s (4 vdevs x 125 MB/s).
- Dedup table/RAM adequacy: The dedup table is about 40GB in RAM (from bytes per entry x total blocks in the source pool, per zdb). I've set a sysctl on both systems to reserve 85GB of RAM for the dedup table and other metadata, leaving about 35GB for cached data, before any use of L2ARC (if that's even used with send/receive). So the dedup table and metadata shouldn't be evicted from RAM on either machine (see the zdb check after this list).
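For the netcat point, here's a minimal sketch of how to baseline the bare nc path with ZFS taken out of the picture entirely (FreeBSD dd syntax; dd reports the achieved rate when it completes, and ~20GB of zeros should be enough to average out TCP ramp-up):
Command on recipient:
nc -l BACKUP_PORT > /dev/null
Command on sender:
dd if=/dev/zero bs=1m count=20000 | nc -N BACKUP_IP BACKUP_PORT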
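And for the dedup table figures, the DDT's in-core size can be read straight off zdb (the totals appear at the bottom of the histogram output), and the RAM reservation was a sysctl along these lines - I'm assuming the legacy FreeBSD ZFS knob name here; the value is 85GB expressed in bytes:
zdb -DD mypool
sysctl vfs.zfs.arc_meta_min=91268055040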
Speed and progress update:
- After 5 days of runtime, I have some updated progress stats: it's sending data at about 58 MB/sec on average. Not completely disastrous, but it still underpins the question above. I'd expect a rate about 10x that, since the disk sets can read from up to 12 HDDs at a time (almost 2 GB/sec) and write to up to 4 disks at a time (about 500 MB/sec). It doesn't have to dedup or re-dedup the data (AFAIK), it's running on 3.5 GHz 4- and 8-core Xeon v4s with tons of RAM on both systems, and the LAN can do 1 GB/sec.
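For what it's worth, the observed rate is self-consistent with the ~7-day estimate above - a quick check with bc:
echo "35 * 10^12 / (58 * 10^6) / 86400" | bc -l
# ~6.98, i.e. roughly 7 days to move 35TB at 58 MB/sec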