5

What I want to do is copy 500K files.

I want to copy them within the same server, from one location to another. It is mostly email, so lots of small files.

It's only about 23 GB, but it is taking very long (over 30 minutes and still not done), and the Linux cp command only uses 1 CPU.

So if I script it to run multiple cp processes, would that make it faster?

The system has 16 cores, 16 GB of RAM and 15K drives (15,000 RPM SATA).

What other options are there?

I believe tarring and untarring would take even longer, and it wouldn't use multiple cores either.
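To give an idea, what I mean by running multiple cp processes is something along these lines (just a rough sketch; the mail folders under /data/mail and the destination /backup/mail are made-up paths):

    #!/bin/bash
    # made-up example: split the copy across 4 background cp processes,
    # one per top-level mail folder, then wait for all of them to finish
    mkdir -p /backup/mail
    for d in 0 1 2 3; do
        cp -a "/data/mail/$d" "/backup/mail/$d" &
    done
    wait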

Oliver Salzburg
Phyo Arkar Lwin
    see my answer to this question as to why copying a lot of files requires a lot of disk I/O: http://superuser.com/questions/344534/why-does-copying-the-same-amount-of-data-take-longer-if-spread-across-many-separ/344860#344860 – sawdust Oct 22 '11 at 01:08

5 Answers

7

Your bottleneck is hard-drive speed. Using multiple cores can't speed this up.

Pubby
  • Hard drive? When tested with hdparm it returns 278 MB/s; are you sure about this? It should only take about 100 seconds to copy a 23 GB file. So using multiple cp processes in a multi-process program won't improve this either? – Phyo Arkar Lwin Oct 21 '11 at 22:48
  • 1
    No, no it won't. The bottleneck is almost certainly the read/write speed of the physical media itself unless you're using enterprise-level gear. – Shinrai Oct 21 '11 at 22:51
  • @V3ss0n I do know that hard drives are not random access, which prevents them from being accessed in parallel. – Pubby Oct 21 '11 at 22:51
  • 2
  • @Pubby8 - Umm, HDDs are random access devices (at the block/sector level). It's often compared to tape (e.g. magnetic tape), which is a sequential block device. I suspect you're trying to state that the typical device can only perform one I/O operation at a time. There is an animal called a *dual-port disk drive* that can do two operations at once, but there are filesystem issues that make this rather complicated. – sawdust Oct 22 '11 at 01:14
  • What I want to make sure of: I made a program in Python which extracts text from multiple file formats using different kinds of parsers (doc, pdf, eml, etc.) into a database for later indexing and search. At first the script was single-process, and after making it multi-process with the multiprocessing module (a high-level fork, so the same as forking) it got significantly faster. But it only works well up to 4 processes; at 6 processes I/O stalls and everything slows right down, and sometimes it even freezes the whole thing. – Phyo Arkar Lwin Oct 22 '11 at 12:16
  • So the sweet spot there is 4 processes. Should I test the copy that way? – Phyo Arkar Lwin Oct 22 '11 at 12:16
3

Copying a single large file is faster than copying lots of small files, because there is a lot of latency in the setup and teardown of each operation, and the disk and OS can do a lot of read-ahead on a single large file. So tarring first would make the copy itself quicker, though once you factor in the time taken to create the tar, it may not speed things up much overall.
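To illustrate, the single-stream idea can even be done without writing an intermediate archive, by piping one tar into another (a sketch only; /src and /dst are placeholder paths, and /dst must already exist):

    # read the source tree as one sequential stream and unpack it at the
    # destination, without creating an intermediate .tar file on disk
    tar -cf - -C /src . | tar -xf - -C /dst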

Note that you are only reading from a single disk, so parallelising your calls to the disk may actually slow things down as it tries to serve multiple files at the same time.

Paul
  • 1
    Wouldn't tarring require reading all the files, creating the tar, deleting the original files, and then creating the copy? Seems like it would definitely take longer. – Pubby Oct 21 '11 at 23:01
  • Yes for sure - I agreed with your answer, mine was just to provide some additional info. Given that the copy seems to be underway at the time the OP wrote the question, it seemed to be an information gathering exercise. There will be circumstances where tarring first may provide better overall performance. – Paul Oct 22 '11 at 03:23
0

Although this question is quite old, I think the best approach is to compress with a multi-core compressor such as lbzip2 or pbzip2, transfer the compressed archive, and then decompress it, again using multiple cores. You can find details about these commands on the Internet.
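For example, something along these lines should work (a sketch only; the paths are placeholders and pbzip2 or lbzip2 is assumed to be installed):

    # pack the source tree and compress it on all available cores with pbzip2
    tar -cf - -C /src . | pbzip2 -c > /dst/mail.tar.bz2

    # later: decompress on multiple cores and unpack
    pbzip2 -dc /dst/mail.tar.bz2 | tar -xf - -C /restore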

Dharma
0

Compression may halve the amount of data that needs to be written. If you can fully and efficiently utilize the cores and most of the compression happens in fast memory, this could (theoretically) cut your write time almost in half; writes are also usually slower than reads. Half is just a guess, of course. A lot depends on the type, size, and number of "small" files you are trying to compress: large log files tend to compress best because they are all text with lots of spaces, whereas already-compressed image files will yield little if any improvement.

Just as with compilation, any terminal I/O from the copying program is extremely slow, so its output should be redirected to a file or, for pure speed, to /dev/null using the >& sequence. Redirecting to /dev/null of course saves no error information and puts the onus on the user to make sure the files actually got copied. This works best for a few large files, unless the files can be verified afterwards by some other method.
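As a rough example of what that redirection looks like in bash (/src, /dst and copy.log are made-up names):

    # verbose copy with all terminal output sent to a log file instead of the screen
    cp -Rv /src /dst >& copy.log

    # or, for pure speed, discard the output entirely (no error information is kept)
    cp -Rv /src /dst >& /dev/null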

0

Is it all in the same directory? There is a script that starts multiple cp processes: http://www.unix.com/unix-dummies-questions-answers/128363-copy-files-parallel.html

For a tree you need to adjust it.
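One possible adjustment (a sketch only, assuming GNU cp and xargs; /src and /dst are placeholder paths, and /dst must already exist):

    # copy a whole tree with up to 4 cp processes running in parallel;
    # --parents recreates each file's directory path under /dst (GNU cp)
    cd /src
    find . -type f -print0 | xargs -0 -n 100 -P 4 cp --parents -t /dst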

ott--