
Recently I've noticed that one of my processes randomly crashes and becomes a zombie with a PPID of 1 (init). I've been told that the only way to fix this is to reboot the PC (or to send SIGCHLD to init, which is dicey or useless, from what I understand).

Essentially, I want to write a bash script that looks for a zombie process and, if it finds one, reboots the PC.

Currently, I use this script to monitor the process itself:

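 # Look for ethminer in the process table; drop the grep process itself.
 ps auxw | grep ethminer | grep -v grep > /dev/null

 # $? is the exit status of the last grep: non-zero means nothing matched.
 if [ $? != 0 ]
 then
    sudo reboot
 fi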

This script works fine when ethminer is either RUNNING or NOT RUNNING: it reboots the machine if it does not see ethminer in the process table, and it does nothing if it does see it.

However, from my admittedly loose understanding, there is no exit code when the process becomes a zombie, so `if [ $? != 0 ]` doesn't get any input and therefore doesn't do anything.

Is there any way I can fix or modify this script so it does what I want it to do? Or am I way off track here?

Thanks!

heemayl
cannabeatz
1 Answer


You don't have to reboot to get rid of zombie processes. Here's why:

  • A process becomes a zombie when it has finished but its parent has not yet called wait(2) to collect its exit status

  • A zombie does not consume any physical or virtual resources apart from an entry in the kernel's process table

  • Once the parent calls wait(2), the zombie is reaped and its process table entry is removed

  • If the zombie becomes an orphan, i.e. its parent dies, then init (PID 1) inherits the process and reaps it by calling wait(2)
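
The lifecycle above can be reproduced safely with a throwaway process. This is only a sketch: `make_zombie` and `child_states` are hypothetical helper names, and it assumes a procps-style `ps` that accepts the `state` field.

```shell
#!/usr/bin/env bash
# Sketch: create a short-lived zombie and inspect it.
# make_zombie and child_states are illustrative names, not standard tools.

# Start a parent that never calls wait(2) on its child.
# The inner shell launches `sleep 1` in the background and then replaces
# itself with `sleep 4`. When `sleep 1` exits, nobody reaps it, so it stays
# a zombie until the parent (`sleep 4`) exits too.
make_zombie() {
    sh -c 'sleep 1 & exec sleep 4' >/dev/null 2>&1 &
    echo "$!"   # PID of the parent process
}

# Print the state letter of every child of the given PID (Z means zombie).
child_states() {
    ps -e -o ppid= -o state= | awk -v p="$1" '$1 == p { print $2 }'
}
```

Running `parent=$(make_zombie); sleep 2; child_states "$parent"` should report a child in state Z while the parent is still alive. Once the parent exits a couple of seconds later, the child is handed to init and the same call should print nothing.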

As you can see, it is only a matter of time until wait(2) is called and the zombie is reaped. If zombies accumulate over time, treat that as a programming flaw in the parent: fix the code (or ask for it to be fixed) rather than rebooting, which is unnecessary and should not be done.


To find zombie processes, check the STATE of each process; if it is Z, the process is a zombie:

ps -eo pid,ppid,state,cmd | awk '$3=="Z"'

Here I have selected only the PID, PPID, STATE and COMMAND fields.
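
To apply this to one particular program, the same `ps` output can be filtered by command name, and the PPID column then shows which parent has not called wait(2). The following is a minimal sketch, not a finished monitor: `filter_zombies` and `report_zombies` are hypothetical names, and the name match is a plain substring test on the command field.

```shell
#!/usr/bin/env bash
# Sketch: report zombies for one command name, together with their parent.
# filter_zombies and report_zombies are illustrative names.

# Read `ps -eo pid,ppid,state,cmd` style lines on stdin and print "PID PPID"
# for each zombie whose command field contains the given name.
filter_zombies() {
    awk -v name="$1" '$3 == "Z" && index($4, name) { print $1, $2 }'
}

# Print one line per matching zombie, naming the parent that should reap it.
report_zombies() {
    local name=$1 pid ppid
    ps -eo pid,ppid,state,cmd | filter_zombies "$name" |
    while read -r pid ppid; do
        printf 'zombie %s (%s): parent is PID %s (%s)\n' \
            "$pid" "$name" "$ppid" "$(ps -o comm= -p "$ppid")"
    done
}
```

Calling `report_zombies ethminer` prints nothing when there is no matching zombie. When it does print a line, the parent it names is the process to investigate or restart, which is a much smaller step than rebooting.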

heemayl
  • OK, that makes sense. Thanks! Let me explain a bit better: the machine in question is a cryptocurrency mining rig, so ideally the miner process runs 24/7. I'm concerned about this zombie not so much because of resource utilization, but because I can't figure out how to make the process restart without removing the zombie, since it already has an entry in the process table. Init doesn't seem to be reaping the orphan zombie either; it's been two days now and the zombie still persists. – cannabeatz Sep 01 '16 at 18:48
  • So I guess the real question here is "Why is this process's parent dying?". My workaround is just sort of a band-aid. – cannabeatz Sep 01 '16 at 18:55
  • @cannabeatz Are you sure the current parent of the process is `init`, i.e. its original parent died? Get me the output of the command I have given. – heemayl Sep 01 '16 at 18:57
  • I am relatively sure of this; the PPID is 1 and the process shows up as ethminer. I'd paste in the actual output of the command, but I just rebooted the rig again and apparently it didn't restart correctly, so now I've lost remote access. I'll have to get back with more info when I'm physically in front of the machine. Thanks for the info! – cannabeatz Sep 01 '16 at 19:11
  • @cannabeatz `init` reaps its children regularly; that is by design. If it does not in your case, consider it a bug in `init` itself. What's your `init`? Also try sending `SIGCHLD`: `kill -SIGCHLD 1`. If that does not work, file a bug report. And of course, first make sure that the current parent really is `init`, because this sort of bug is very, very rare in `init`. – heemayl Sep 01 '16 at 19:15
  • Alright, I'll give that a shot. I'm inclined to believe that my issue isn't a bug and is probably the result of some sort of usage error or poor observation on my part; I'll try to collect more info and will update when I do. Thanks again! – cannabeatz Sep 02 '16 at 01:46
  • I was able to restore remote access to my machine, and I ran the command you gave me. It turns out ethminer's PPID was NOT 1; it was another process number (#6837, if it matters). This process's name is "sudo". Now I'm REALLY confused. From what I understand, when I run a command with sudo, two processes are started (sudo and whatever the command is called). So this means my parent sudo process is dying for some reason? – cannabeatz Sep 06 '16 at 18:49
  • Alright, if I kill the PPID that is listed when I run the above command, the PPID of the zombie becomes init. Very confused at this point. – cannabeatz Sep 06 '16 at 19:07
  • @cannabeatz This confirms what I said: there could not be such a major bug in `init`. What confuses you? All the info you need is in my answer. – heemayl Sep 06 '16 at 19:47
  • While I appreciate all the information, I haven't really found an answer to my issue. Allow me to explain what I am observing: running `ps -eo pid,ppid,state,cmd | awk '$3=="Z"'` to find the PPID of "ethminer" tells me ethminer's PPID is "6837". I kill that process with `sudo kill 6837`, and run `ps -eo pid,ppid,state,cmd | awk '$3=="Z"'` again to see if I successfully killed the zombie. No luck; the output of `ps -eo pid,ppid,state,cmd | awk '$3=="Z"'` is now `6840 1 Z [ethminer] `, which shows init has once again become the parent of ethminer. – cannabeatz Sep 06 '16 at 20:57