22

Background

I ran out of space on /home/data and need to transfer /home/data/repo to /home/data2.

/home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB.

/home/data is on ext3 with dir_index enabled. /home/data2 is on ext4. Running CentOS 6.4.

I assume these approaches are slow because repo/ has 1 million dirs directly underneath it.


Attempt 1: mv is fast but gets interrupted

I could be done if this had finished:

/home/data> mv repo ../data2

But it was interrupted after 1.5TB was transferred. It was writing at about 1GB/min.

Attempt 2: rsync crawls after 8 hours of building file list

/home/data> rsync --ignore-existing -rv repo ../data2

It took several hours to build the 'incremental file list' and then it transfers at 100MB/min.

I cancel it to try a faster approach.

Attempt 3a: mv complains

Testing it on a subdirectory:

/home/data/repo> mv -f foobar ../../data2/repo/
mv: inter-device move failed: '(foobar)' to '../../data2/repo/foobar'; unable to remove target: Is a directory

I'm not sure what this error is about, but maybe cp can bail me out...

Attempt 3b: cp gets nowhere after 8 hours

/home/data> cp -nr repo ../data2

It reads the disk for 8 hours and I decide to cancel it and go back to rsync.

Attempt 4: rsync crawls after 8 hours of building file list

/home/data> rsync --ignore-existing --remove-source-files -rv repo ../data2

I used --remove-source-files thinking it might make it faster if I start cleanup now.

It takes at least 6 hours to build the file list then it transfers at 100-200MB/min.

But the server was burdened overnight and my connection closed.

Attempt 5: THERE'S ONLY 300GB LEFT TO MOVE WHY IS THIS SO PAINFUL

/home/data> rsync --ignore-existing --remove-source-files -rvW repo ../data2

Interrupted again. The -W almost seemed to make "sending incremental file list" faster, which to my understanding shouldn't make sense. Regardless, the transfer is horribly slow and I'm giving up on this one.

Attempt 6: tar

/home/data> nohup tar cf - . |(cd ../data2; tar xvfk -)

Basically attempting to re-copy everything but ignoring existing files. It has to wade thru 1.7TB of existing files but at least it's reading at 1.2GB/min.

So far, this is the only command which gives instant gratification.

Update: interrupted again, somehow, even with nohup..
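A likely culprit, in hindsight: nohup only covers the sending tar; the receiving subshell on the right side of the pipe still gets killed when the ssh session drops. An untested sketch that wraps the whole pipeline in one shell so nohup covers both ends (the log file name is just an example):

/home/data> nohup sh -c 'tar cf - . | (cd ../data2 && tar xkf -)' > tar.log 2>&1 &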

Attempt 7: harakiri

Still debating this one

Attempt 8: scripted 'merge' with mv

The destination dir had about 120k empty dirs, so I ran

/home/data2/repo> find . -type d -empty -exec rmdir {} \;

Ruby script:

SRC  = "/home/data/repo"
DEST = "/home/data2/repo"

# List the top-level directory names on each side
# (assumes the script runs from /home/data, so the relative 'missing.tmp' below is the same file)
`ls #{SRC}  --color=never > lst1.tmp`
`ls #{DEST} --color=never > lst2.tmp`
# diff lines starting with '<' are names present in SRC but not yet in DEST
`diff lst1.tmp lst2.tmp | grep '<' > /home/data/missing.tmp`

t = `cat /home/data/missing.tmp | wc -l`.to_i
puts "Todo: #{t}"

# `mv` each missing directory from SRC into DEST
File.open('missing.tmp').each do |line|
  dir = line.strip.gsub('< ', '')   # strip the leading '< ' diff marker
  puts `mv #{SRC}/#{dir} #{DEST}/`
end

DONE.

Tim
  • You are correct, it has to find and enumerate each directory, and 1 million dirs is going to be painful. – cybernard Sep 06 '13 at 16:19
  • Look at the bright side... if it were Windows, you couldn't even have a million subdirectories and still have an OS that works. :) – Jack Sep 06 '13 at 16:55
  • @Jack really? Does Windows have a limit? Is this not a relic from the FAT32 days (I haven't used Windows as a main OS since ~2001 so I am not really up to date on it)? – terdon Sep 06 '13 at 17:27
  • @Tim, why don't you just `mv` again? In theory `mv` will only delete a source file if the destination file has been completely copied so it _should_ work OK. Also, do you have physical access to the machine or is this done through an `ssh` connection? – terdon Sep 06 '13 at 17:28
  • @terdon - Windows doesn't have a limit, per se... but it has a point where it becomes unusable for all intents and purposes. Windows Explorer will take forever to display the file list, etc. – Jack Sep 06 '13 at 18:25
  • @Jack OK, but that will only affect that one directory right? Or will the entire system be affected? – terdon Sep 06 '13 at 18:27
  • @terdon - Just the one directory. See http://technet.microsoft.com/en-us/magazine/hh395477.aspx – Jack Sep 06 '13 at 18:43
  • @terdon - Wanted to use `mv -f` but tested it on a subdir and got `mv: inter-device move failed: '(foobar)' to '../../data2/repo/foobar'; unable to remove target: Is a directory`. And yes, I'm using `ssh`. – Tim Sep 06 '13 at 19:33
  • With that many files/directories you'd honestly be better off using `dd` (though for 2TB it'd take hours/days to finish) – justbrowsing Sep 06 '13 at 19:53
  • @justbrowsing - the problem now is that I need to merge/resume. Can `dd` do that? If some of the source files weren't deleted already, I'd just delete the destination dir and `mv` the source again. It would have taken only 24 hours had it not been interrupted. – Tim Sep 06 '13 at 20:20
  • No it can't. `mv` isn't forgiving; if you keep getting disconnected you could lose data and not even know it. As you said you are doing this over `ssh`, I highly recommend using `screen` and detaching. Enable logging and keep track that way. If you are using verbose it'll just take longer. Also try `iotop`. – justbrowsing Sep 06 '13 at 20:34
  • @justbrowsing - Good call on `screen`. I was wondering about verbose but I guess it's too late to restart `tar` right now. And `iotop` has been my favorite utility for the last few days :) – Tim Sep 06 '13 at 20:45
  • Is one of your directories mounted from a server? If so, I would recommend a direct link with `rsync dir1 server:dir2` or `rsync server:dir1 dir2`, depending on which server is less likely to get disconnected. Nesting this command in a `screen` session helps avoid some disconnections. – meduz Sep 10 '13 at 09:56

4 Answers

6

Ever heard of splitting large tasks into smaller tasks?

/home/data/repo contains 1M dirs, each of which contain 11 dirs and 10 files. It totals 2TB.

rsync -a /source/1/ /destination/1/
rsync -a /source/2/ /destination/2/
rsync -a /source/3/ /destination/3/
rsync -a /source/4/ /destination/4/
rsync -a /source/5/ /destination/5/
rsync -a /source/6/ /destination/6/
rsync -a /source/7/ /destination/7/
rsync -a /source/8/ /destination/8/
rsync -a /source/9/ /destination/9/
rsync -a /source/10/ /destination/10/
rsync -a /source/11/ /destination/11/

(...)

Coffee break time.
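The same idea as a loop, with a simple record of finished pieces so an interrupted run can resume where it left off. This is only a sketch; the done-list file name and the paths are illustrative:

SRC=/home/data/repo
DEST=/home/data2/repo
DONE=/home/data/done.lst          # names of top-level dirs already copied
touch "$DONE"

ls -f -1 "$SRC" | while read -r d; do
    case "$d" in .|..) continue ;; esac
    grep -qxF "$d" "$DONE" && continue              # finished in an earlier run
    rsync -a "$SRC/$d/" "$DEST/$d/" && echo "$d" >> "$DONE"
done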

  • The benefit I'm vaguely emphasizing is that *you* track the progress in small parts *manually*, so that resuming the task will take less time if some part is aborted (because you know which steps were completed successfully). – Ярослав Рахматуллин Sep 18 '13 at 01:08
  • This is basically what I ended up doing, except with `mv`. Unfortunately there is no tool meeting `mv` and `rsync` halfway. – Tim Sep 23 '13 at 20:41
4

This is what is happening:

  • Initially rsync will build the list of files.
  • Building this list is really slow, because rsync initially sorts the file list.
  • This can be avoided by using ls -f -1 and combining it with xargs to build the set of files rsync will use, or by redirecting its output to a file containing the file list.
  • Passing this list to rsync instead of the folder will make rsync start working immediately (see the sketch below).
  • This trick of ls -f -1 over folders with millions of files is perfectly described in this article: http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
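For example, a minimal sketch of this approach using rsync's --files-from (the temp file name is mine, not from the article; note that when --files-from is used, -a no longer implies -r, so -r is given explicitly):

/home/data> ls -f -1 repo | grep -vE '^\.\.?$' | sed 's|^|repo/|' > /tmp/filelist.txt
/home/data> rsync -a -r --ignore-existing --files-from=/tmp/filelist.txt /home/data/ /home/data2/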
maki
  • Can you give an example of how to use ls with rsync? I have a similar but not identical situation. On machine A I have rsyncd running and a large directory tree I want to transfer to machine B (actually, 90% of the directory is already at B). The problem is that I have to do this over an unstable mobile connection that frequently drops. Spending an hour on building the file list every time I restart is pretty inefficient. Also, B is behind a NAT that I don't control, so it is hard to connect A -> B, while B -> A is easy. – d-b Feb 04 '15 at 09:49
  • Agree with @d-b. If an example could be given, that would make this answer much more useful. – redfox05 Apr 08 '19 at 15:13
1

Even if rsync is slow (why is it slow? maybe -z will help), it sounds like you've gotten a lot of it moved over, so you could just keep trying:

If you used --remove-source-files, you could then follow up by removing the empty directories. --remove-source-files removes all the files but leaves the directories behind.

Just make sure you DO NOT use --remove-source-files with --delete to do multiple passes.

Also, for increased speed, you can use --inplace.

If you're getting kicked out because you're trying to do this remotely on a server, go ahead and run this inside a 'screen' session. At least that way you can let it run.
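A sketch of that sequence with the OP's paths (the find cleanup at the end is my addition, not part of the answer):

/home/data> screen -S move                     # start a named screen session, run the rest inside it
/home/data> rsync -a --ignore-existing --remove-source-files --inplace repo/ ../data2/repo/
/home/data> find repo -type d -empty -delete   # prune the empty dirs --remove-source-files leaves behind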

Angelo
0

Could this not have been done using rsync with the --inc-recursive switch along with cron?

Even on a gigabit connection, it would take several hours to move 2 TB without any overhead. Rsync, mv or cp will all add varying amounts of overhead to the I/O, particularly if checksums or other validation is being done.

At least with the --inc-recursive switch, the transfer can start while the list of files is still being built.

I've been taught that --inplace improves speed and reduces space required on the destination, but at a slight reduction in file integrity -- I'd be interested to hear if this is not the case.

If a cron job was then created with whatever rsync settings are appropriate (and whatever is required to mount remote volumes), it could be set to run for a max of 5:58h (using --stop-after=358) and cron could start it every 6h. This way, if it randomly stopped, it would be started again automatically. --remove-source-files could be used with rsync, and find could be used first to delete empty source directories (perhaps decreasing the rsync run time to 5:50h in order to allow find to traverse all the directories).
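A rough sketch of what that could look like (the script name, log path and schedule are illustrative; note that --stop-after takes minutes and needs a newer rsync than CentOS 6 ships):

# /etc/cron.d/repo-move: try every 6 hours, each run capped at 5h58m
0 */6 * * * root /usr/local/sbin/repo-move.sh >> /var/log/repo-move.log 2>&1

#!/bin/sh
# /usr/local/sbin/repo-move.sh: one bounded pass of the copy
find /home/data/repo -type d -empty -delete
rsync -a --ignore-existing --remove-source-files --stop-after=358 \
      /home/data/repo/ /home/data2/repo/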

I recognize that the speed of rsync was slower (as per the OP) but it seems to me that this would have a lower risk of file corruption.

(full disclosure - I'm still learning, so if I'm way off base, please try to be gentle when you let me know...)

Doc C