I have a server running Ubuntu Server 14.04 64-bit and I am having an "interesting" problem with it. Two users, d and m, are running programs concurrently. The processes of user "m" are not time critical, so they run at a "nice" value of 19. The processes of user "d" are time critical, so they run at the default "nice" value of 0. The problem is that the processes of user "m" are still getting more CPU time than those of user "d".

Also, despite all the CPU pressure, one of the CPUs (3 in the snapshot) is hardly getting any use.

I cannot reproduce the issue on an identical machine running Ubuntu Server 10.04 64-bit (I know, I should have updated by now).

I am attaching a snapshot of htop running to illustrate the issue. Can anyone help me with this?

htop screenshot

Thanks in advance.

PS - The screenshot gets reduced on upload and becomes too small to be readable. Here is a link to a full-sized file.

Stunts
  • Do you have "autogroup" enabled? (If `more /proc/sys/kernel/sched_autogroup_enabled` shows 1, you do.) If so, `echo 0 > /proc/sys/kernel/sched_autogroup_enabled` turns it off. Try again with that ;) PM me if it is the answer and I'll post it ;) – Rinzwind May 19 '15 at 10:46
  • `/proc/sys/kernel/sched_autogroup_enabled` was set to 1, but setting it to 0 did not solve the issue... – Stunts May 19 '15 at 10:50
  • I believe the suggestion from @Rinzwind is correct; however, it will not affect any existing session, only new ones. And yes, this is a difference introduced somewhere between 10.04 and 12.04. See [also](http://ubuntuforums.org/showthread.php?t=2061134&highlight=nice). About CPU 3, I don't know. – Doug Smythies May 19 '15 at 15:55
  • Unfortunately, even after applying the fix and rebooting the machine, the "nice" value is still not being respected... any more suggestions? – Stunts May 27 '15 at 10:27
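
For anyone who wants to script the autogroup check from the comments above, here is a minimal Python sketch. It assumes the `/proc/<pid>/autogroup` file holds a single line of the form `/autogroup-25 nice 0`, as described in the sched(7) man page; the function names are made up for illustration. With autogrouping on, a process's nice value only matters relative to other processes in the same autogroup, so it can help to see which group each process lands in.

```python
from pathlib import Path


def autogroup_enabled(path="/proc/sys/kernel/sched_autogroup_enabled"):
    """Return True/False for the sysctl, or None if the file is unavailable."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # Kernel built without autogroup support, or not a Linux system.
        return None


def parse_autogroup(text):
    """Parse a line such as '/autogroup-25 nice 0' into (name, nice)."""
    name, keyword, nice = text.split()
    if keyword != "nice":
        raise ValueError(f"unexpected autogroup line: {text!r}")
    return name, int(nice)


def autogroup_of(pid):
    """Return (autogroup name, autogroup nice) for a PID, or None if unreadable."""
    try:
        return parse_autogroup(Path(f"/proc/{pid}/autogroup").read_text())
    except OSError:
        return None
```

Calling `autogroup_enabled()` and then `autogroup_of(pid)` for one process of each user should show whether the two users' jobs sit in different autogroups.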

1 Answer

OK, so it seems I have found an answer. Running iotop made me realize what was going on: it was reporting the pyrad jobs as taking 100% I/O, which meant the jobs were effectively I/O bound, not CPU bound.
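
This is not what iotop does internally, but a rough system-wide version of the same check can be sketched in Python by sampling the aggregate `cpu` line of `/proc/stat` twice and looking at how much of the elapsed CPU time was spent in iowait. The function names are hypothetical, and the field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is taken from the proc(5) man page.

```python
import time


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a list of ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError(f"not an aggregate cpu line: {line!r}")
    return [int(value) for value in fields[1:]]


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two 'cpu' line samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Only the first eight fields are summed: guest time is already
    # included in user and nice, so adding it would count it twice.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as handle:
        return handle.readline()


def sample_iowait(interval=1.0):
    """Measure the iowait share over `interval` seconds."""
    first = read_cpu_line()
    time.sleep(interval)
    return iowait_share(first, read_cpu_line())
```

A persistently high value from `sample_iowait()` while the CPUs look busy in htop would be a hint that the jobs are waiting on disk rather than competing for CPU.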

After a trip to the server room, where no errors were being reported on the HUD, logging into the iDRAC controller revealed a degraded RAID 5 array.

Now that the array is fixed, everything is back to normal operation.

Regardless, thank you all for your suggestions and time.

Stunts