I have a Dell 5820 running Ubuntu Server 18.04 with SGE installed. I have a queue set up with 20 slots, which I fill with 20 jobs running at the same time. CPU usage goes up to about 70%, and I have 128 GB of RAM, so no problems there.

I am worried about the CPU overheating. When I monitor it with `watch sensors`, I see:

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +82.0°C  (high = +81.0°C, crit = +91.0°C)
Core 0:        +80.0°C  (high = +81.0°C, crit = +91.0°C)
Core 1:        +77.0°C  (high = +81.0°C, crit = +91.0°C)
Core 2:        +78.0°C  (high = +81.0°C, crit = +91.0°C)
Core 3:        +78.0°C  (high = +81.0°C, crit = +91.0°C)
Core 4:        +77.0°C  (high = +81.0°C, crit = +91.0°C)
Core 5:        +75.0°C  (high = +81.0°C, crit = +91.0°C)
Core 6:        +82.0°C  (high = +81.0°C, crit = +91.0°C)
Core 8:        +76.0°C  (high = +81.0°C, crit = +91.0°C)
Core 9:        +79.0°C  (high = +81.0°C, crit = +91.0°C)
Core 10:       +82.0°C  (high = +81.0°C, crit = +91.0°C)
Core 11:       +82.0°C  (high = +81.0°C, crit = +91.0°C)
Core 12:       +81.0°C  (high = +81.0°C, crit = +91.0°C)
Core 13:       +78.0°C  (high = +81.0°C, crit = +91.0°C)
Core 14:       +80.0°C  (high = +81.0°C, crit = +91.0°C)

So the CPU runs very hot (82°C).

My confusion is why the fan RPM stays so low, even after waiting for a few minutes:

dell_smm-virtual-0
Adapter: Virtual device
fan1:        1573 RPM
fan2:         722 RPM
fan3:         684 RPM

I know the fan works well because it spins up to over 2000 RPM in the BIOS diagnostics. Moreover, when I run a simple stress test in Ubuntu using `stress --cpu 8`, the fan spins up after a few seconds:

dell_smm-virtual-0
Adapter: Virtual device
fan1:        3567 RPM
fan2:         716 RPM
fan3:        3384 RPM

Is this normal? Can someone explain this? Why does the fan not trigger higher RPM when using SGE? Obviously I am worried about frying the CPU with such high temps.

  • As you are running the stress test with 8 threads, I assume you have one 8-core CPU installed. Running 20 jobs means that a lot of RAM is accessed. Unlike a stress test, the 20 jobs don't fit into the CPU cache, which means the CPU will spend a lot of time waiting for data from RAM. I would try reducing the job count and see what happens. – Robert Oct 24 '19 at 17:47
  • You'll see pretty similar behavior in many recent systems using Intel CPUs. I don't know whether it's Intel's design decision or the vendor's, but most seem to prefer to run hot and quiet over cool and loud. – Austin Hemmelgarn Oct 24 '19 at 19:28

1 Answer

If PROCHOT is 91°C (which feels a bit low these days) and Hot is 81°C (the `high` value in your output), then the fans won't really ramp up until you're in that zone; the CPU is happy to run like that all day.

Your figures show it's succeeding at staying at roughly the Hot threshold, so I don't see any issue.
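
To check this margin over time rather than by eye, a minimal Python sketch could parse the coretemp lines and report how far the hottest reading is from the `high` and `crit` values. It assumes the output format shown in the question; the function names `parse_coretemp` and `summarize` are hypothetical, and `SAMPLE` is just two lines copied from the question.

```python
import re

# Matches lines such as "Core 0:  +80.0°C  (high = +81.0°C, crit = +91.0°C)".
# Assumption: the line layout is the one shown in the question's sensors output.
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C, "
    r"crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C\)"
)

# Two readings copied from the question, plus a non-temperature line.
SAMPLE = """coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +82.0°C  (high = +81.0°C, crit = +91.0°C)
Core 0:        +80.0°C  (high = +81.0°C, crit = +91.0°C)
"""


def parse_coretemp(text):
    """Return a list of (label, temp, high, crit) tuples for temperature lines."""
    readings = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            readings.append(
                (m["label"], float(m["temp"]), float(m["high"]), float(m["crit"]))
            )
    return readings


def summarize(readings):
    """Describe the hottest reading relative to its high and crit thresholds."""
    label, temp, high, crit = max(readings, key=lambda r: r[1])
    return {
        "hottest": label,
        "temp": temp,
        "over_high": temp - high,    # positive means above the high threshold
        "below_crit": crit - temp,   # remaining margin before the crit threshold
    }
```

On a live system the text could come from `subprocess.run(["sensors"], capture_output=True, text=True).stdout` instead of `SAMPLE`.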

Tetsujin
  • Thanks, that is reassuring. It does not really explain why the `stress --cpu 8` command ramps up the fan while spooling 20 jobs does not, but I guess it might not only be the CPU temperature that triggers the fan. – Niels Janssen Oct 24 '19 at 21:09
  • `stress` likely goes to 100% CPU while your jobs peak at 70%. Also, possibly [Turbo Boost](https://www.intel.com/content/www/us/en/support/articles/000007359/processors/intel-core-processors.html) – xenoid Oct 25 '19 at 07:02
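
One way to test the utilization part of that guess is to measure overall CPU busy time under each workload and compare it with the fan readings. The following is a rough Python sketch, not a definitive tool: it assumes a Linux `/proc/stat` whose first line is the aggregate `cpu` line, and the function names are hypothetical.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat.

    Field order after the 'cpu' label: user nice system idle iowait irq
    softirq steal guest guest_nice. Guest time is already included in
    user/nice, so only the first eight fields are summed.
    """
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    total = sum(fields[:8])
    return total - idle, total


def utilization(before, after):
    """Percentage of non-idle time between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return 100.0 * busy / total if total > 0 else 0.0


def sample_proc_stat(path="/proc/stat"):
    """Read one (busy, total) sample; the aggregate line comes first."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Taking one `sample_proc_stat()` reading, sleeping for a few seconds, taking a second, and passing both to `utilization` gives the average busy percentage over that interval, which can be logged next to the `sensors` fan values while the SGE jobs or `stress` are running.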