
I have a large system: 128 GB of RAM, a couple of RAID0 filesystems (6 TB and 2 TB) with an SSD cache, 8 cores (16 with hyperthreading), running 64-bit Ubuntu 12.04. When I try to write a large file I get very poor performance, and iotop shows processes spending over 99% of their time in iowait:

dd if=/dev/zero of=lezz bs=1024 count=$((1024*50))
51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 3.74852 s, 14.0 MB/s

From iotop:

Total DISK READ:     185.92 K/s | Total DISK WRITE:      84.06 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
24481 be/4 arris292    0.00 B/s    0.00 B/s  0.00 % 99.99 % dd if=/dev/zero of=lezz     bs=1024 count=512000
22668 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [flush-252:0]
21532 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kworker/1:2]
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
 8196 be/4 arris292    0.00 B/s    0.00 B/s  0.00 %  0.00 % sshd: arris292@pts/22
    5 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/u:0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
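
(The [flush-252:0] entry is the kernel writeback thread for the block device with major:minor 252:0, which I believe is one of the device-mapper volumes here, since dm major numbers are assigned dynamically. To double-check which volume that is, I map the number back to a name with something like the following; these are just the standard util-linux / device-mapper tools:)

grep device-mapper /proc/devices   # confirm which block major was assigned to dm
lsblk                              # the MAJ:MIN column shows which device is 252:0
dmsetup info -c                    # device-mapper volumes with their major/minor numbers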

On a very similar system (same memory, same model, similar filesystem) I get the expected performance, and no processes spend 99% of their time waiting for IO:

dd if=/dev/zero of=lezz bs=1024 count=$((1024*50))
51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 0.111191 s, 472 MB/s
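
(For what it's worth, a 52 MB write on a machine with 128 GB of RAM should normally just land in the page cache, which is presumably why the healthy box reports 472 MB/s. When I want dd to reflect what the disks themselves can do, I add one of the standard GNU dd flags below; the larger block size is only there to reduce syscall overhead:)

dd if=/dev/zero of=lezz bs=1M count=50 conv=fdatasync   # include the final flush to disk in the timing
dd if=/dev/zero of=lezz bs=1M count=50 oflag=direct     # bypass the page cache entirely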

I've seen this before but have never really been able to get to the bottom of it. As the day goes on and more engineers start using this system for builds, overall performance drops to a crawl.

So what could be causing the incredibly high IO wait times? How can I troubleshoot this further? Could it be an SSD or disk problem, and if so, what tools can I use to diagnose it?
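
For reference, these are the sorts of checks I've been running while the problem is happening (iostat is from sysstat; the rest are standard tools, and none of this is specific to my setup):

iostat -x 1                               # per-device utilisation, queue sizes and await times
grep -E 'Dirty|Writeback' /proc/meminfo   # how much dirty data is queued for writeback
dmsetup status                            # state of the device-mapper targets (including the SSD cache, if it is dm-based)
dmesg | tail -n 50                        # recent kernel messages from the block layer / controller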

user3324033
  • On serverfault there's a nice read about this: http://serverfault.com/questions/12679/can-anyone-explain-precisely-what-iowait-is. Use `iostat`, `iotop`, `strace` and `sar` to diagnose problems. To prevent waiting periods, you can ensure you have enough free memory in your system, keep file system usage below 80% (fragmentation), tune your file system, use a battery-powered array controller and choose good buffer sizes. – Jakke Jul 02 '14 at 13:28
  • I've seen all the normal articles on this - however, having several processes waiting 99.99% of their time for IO is not a normal situation and isn't covered in those articles. As I said, I can carry out the same experiment on an essentially identical system and get reasonable results - there is something bad here. It could be a bad disk, but I'm not seeing errors from the SSD cache package, in syslog, or anywhere else. – user3324033 Jul 02 '14 at 16:15
  • try a `hdparm` on your disks to see which one is causing problems – Jakke Jul 02 '14 at 18:42
  • Neither `hdparm` nor `smartctl` shows me anything specific. `hdparm -T` and `hdparm -t` both hang indefinitely, as does (after some more research) `sync`. If I understand the man page properly, `hdparm -T` doesn't go anywhere near the physical disk, so this may not be a disk problem. – user3324033 Jul 07 '14 at 17:23
  • `hdparm -T` shows the cached speeds (so no real disk access) and `hdparm -t` shows the actual read speeds... if they both hang, I'd say there's something wrong with your disk though. You may want to try some other disk-checking tools, however. – Jakke Jul 07 '14 at 19:00
  • You could test the disk in another system and see how it performs there. Besides drive issues, you could also have issues with your controller and/or cabling. – Jakke Jul 07 '14 at 19:03
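
Following the suggestion above to test each disk on its own, this is roughly what I run per device (the device names are placeholders for the actual RAID0 members; as noted, `hdparm -t`/`-T` hang here and `smartctl` reports nothing unusual):

for d in /dev/sda /dev/sdb /dev/sdc; do   # substitute the real member disks
    echo "== $d =="
    smartctl -H -A "$d"   # overall health verdict plus the raw SMART attributes
    hdparm -T "$d"        # cached read timing (no real disk access)
    hdparm -t "$d"        # buffered device read timing (actual disk access)
done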

0 Answers