
We have a bare-metal server at a remote data center. Once in a while we can't log in via SSH; oftentimes the reason is that the SSH server has died. It just needs a reboot, but we CAN'T do that quickly (we have to make a few calls/requests to the data center) and have to wait a day or so until someone walks to the machine and presses the physical button.

So we are thinking of designing an API (with credentials, of course) running on the server. Whenever we can't SSH in, we call that API and it triggers the reboot process.
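Roughly along these lines - a minimal sketch, where socat, the port, and the token are placeholders rather than a finished design:

    # Hypothetical reboot endpoint; run as root (e.g. from a systemd unit).
    # No TLS here - in practice put it behind HTTPS or a VPN.
    socat TCP-LISTEN:8443,reuseaddr,fork \
        SYSTEM:'read line; case $line in *token=SECRET*) systemctl reboot ;; esac'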

Is that a good solution? If not, any other practices out there?

Thanks.

EyeQ Tech
  • Proper servers should have some out-of-band management tool like iLO - it might be helpful to look at what your bare metal actually is – Journeyman Geek Sep 25 '21 at 06:44
  • Can you just add a cron job bash script that checks the status of your daemon every minute and restarts it, or the server, if a down state is detected? (If so, then no special API is needed.) Also, if it's a lower-level issue and connectivity to the box is actually lost and a restart fixes it, there are network-monitoring power switches that power-cycle whatever is plugged into them when the connection is lost. Usually these are used on remote-site modems and other fragile networking gear. – sadtank Sep 25 '21 at 08:21
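A minimal sketch of the cron idea from the comment above, assuming a systemd-based distro where the SSH unit is named sshd (it is ssh on Debian-family systems):

    #!/bin/bash
    # check-sshd.sh - run every minute from cron: * * * * * /usr/local/bin/check-sshd.sh
    # Restarts sshd if it is down; escalates to a reboot if the restart fails.
    if ! systemctl is-active --quiet sshd; then
        systemctl restart sshd || systemctl reboot
    fi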

1 Answer


There are lots of solutions. Your script would work, but if SSH is not accessible while HTTP is, you should fix the underlying cause rather than resort to the hack of rebooting. (Wild guess: too much swap on a spinning disk. Try setting vm.swappiness to 5 and reduce - but don't eliminate - the swap partition size, if it's Linux and that's the cause.)
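If that guess applies, the tuning is a one-liner plus a persistence file - a minimal sketch, assuming a modern Linux with /etc/sysctl.d support (run as root):

    sysctl vm.swappiness=5                                          # apply immediately
    echo 'vm.swappiness = 5' > /etc/sysctl.d/99-swappiness.conf     # persist across reboots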

As others have said, "proper" servers usually have out-of-band management, which allows for remote reboots and a lot more (iLO for HP and iDRAC for Dell, for example).
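For example, if the BMC speaks IPMI, a power cycle from another host looks like this (the address and credentials are illustrative; ipmitool must be installed, and the BMC should only be reachable over a management network or VPN):

    ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power status   # check state
    ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power cycle    # hard power-cycle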

Another option is to get a networked power switch with the ability to toggle individual power ports remotely. They are not that expensive.
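Driving one is usually a single authenticated request; this is a hypothetical example, since the real interface varies by vendor (HTTP, SNMP, or a vendor CLI - check your model's documentation):

    # Hypothetical HTTP-controlled PDU; the URL, credentials and outlet number are made up
    curl -u admin:PASSWORD 'http://pdu.example.net/control?outlet=3&action=cycle'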

Going "more hack", look at Watchdog timer support - there are lots if variants and options here depending in hardware and OS, but the idea is the system monitors itself and if the OS does not write to a special device periodically the system reboots.

If you are a hacker (in the ethical sense), or know someone who is, you should be able to use a Raspberry Pi (or even an Arduino, if you make your own watchdog or have Wi-Fi), using the small computer to drive a relay. (As it happens, I was looking at an Arduino-compatible board with a relay for under US$10 earlier today - on AliExpress, search for "esp8266 relay".)
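A sketch of the Pi side, assuming libgpiod v1 tools, a relay wired to the server's reset header on GPIO 17, and an illustrative server IP:

    #!/bin/bash
    # Run every minute from cron on the Pi; pulses the relay if the server stops answering pings.
    if ! ping -c 3 -W 5 203.0.113.10 >/dev/null; then
        gpioset --mode=time --sec=1 gpiochip0 17=1   # hold the relay closed ~1s (press reset)
    fi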

Another solution - which worked well for me - was to convert my bare metal to a VM server (if budget is a concern, look at KVM/Proxmox), making sure not to overprovision resources, and virtualise the app. This way you can go into your hypervisor to debug and perform reboots in most cases. You can further improve things by breaking different functions out onto different VMs, so that in case of a failure you likely still have partial service - this, of course, assumes you don't just go the cloud/VM route instead.
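For example, with KVM/libvirt you can inspect and hard-reset a wedged guest from the hypervisor (the guest name is illustrative; Proxmox has equivalent qm commands):

    virsh list --all                                     # see guest states
    virsh reboot appserver                               # graceful reboot
    virsh destroy appserver && virsh start appserver     # hard power-cycle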

davidgo
  • Those management capabilities should never be exposed to the Internet. They are just too powerful… – Ramhound Sep 25 '21 at 16:47
  • @Ramhound - To the extent "exposed to the Internet" means "directly exposed", agreed. Many/most out-of-band management systems are, however, indirectly exposed - e.g. on a separately controlled network accessible via a VPN. Also, the risk depends on the capability of the management interface - if it's just rebooting, it's a lot less risky than if it can take control of the whole boot process. – davidgo Sep 25 '21 at 19:16
  • Another thought - there are a number of systems which can oversee a server/VM and force it into compliance with a specific state - I know Puppet can do this, and I am certain there are other solutions. My thinking here is more along the lines of "ensure SSH is available, and if not, do what is required to make it so" rather than relying on reboots. – davidgo Oct 20 '21 at 08:41
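A lighter-weight sketch of that "make it so" idea without pulling in Puppet, assuming systemd (unit name may be ssh on Debian-family systems): a drop-in that restarts sshd automatically whenever it exits:

    systemctl edit sshd     # opens a drop-in file; add the three lines below
    # [Service]
    # Restart=always
    # RestartSec=5
    systemctl daemon-reload && systemctl restart sshd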