
http://tuxgraphics.org/electronics




By Guido Socher

 

An ethernet network host watchdog


Abstract:

A watchdog is a piece of equipment that supervises other systems and resets them when it detects that they are failing.

Such watchdogs can make systems more reliable, and reliability is often a major cost factor. Think of remote equipment where it may take hours to get on site and service it. In some cases it is impossible to get there at all: think of satellites in outer space.

The reliability of the watchdog itself is of course also crucial. A small, independent watchdog device is therefore generally better than a software-only solution implemented in the system itself. The Linux kernel, for example, has such a software watchdog called "softdog". It can do a lot to improve the reachability of a server, but it cannot cover all possible cases because it is part of the failing system. Finally, a watchdog can never cover a total equipment failure. It is a good remedy for temporary problems that go away after a reboot.


 

The idea

The idea of a network equipment watchdog is based on the requirements and ideas of a customer who needed to improve the reliability of telecommunication equipment.

This equipment hung once in a while, and he had to monitor it manually around the clock so that he could reset it whenever it got stuck. He wanted a device that would monitor the system and recover it automatically.

Ping

A simple way to detect whether network equipment is up is to send a ping (ICMP echo request) and see if there is a reply. A watchdog could therefore simply ping the equipment it monitors.
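As an illustration of this kind of probe, here is a minimal Python sketch that could run on an ordinary Linux machine. It is not the firmware of the tuxgraphics board, which sends its ICMP echo requests itself. The function names are hypothetical, and the `-c` (count) and `-W` (timeout in seconds) options assume the Linux iputils `ping` command.

```python
import subprocess


def build_ping_command(host, timeout_s=2):
    """Build a one-shot ping command (assumes Linux iputils ping options)."""
    # -c 1: send a single echo request, -W: wait at most timeout_s for the reply
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def host_is_up(host, timeout_s=2, run=subprocess.run):
    """Return True if the host answered one ping, False otherwise.

    `run` is injectable so the logic can be exercised without a network.
    """
    result = run(
        build_ping_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with status 0 only when a reply was received
    return result.returncode == 0
```

A caller would use it as `host_is_up("192.0.2.1")`, where the address is only a placeholder from the documentation range.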

A problem, however, is a system that is "half up". Think of a web server: the network interface might be up, but the Apache web server application has died. The machine would still answer pings, but the web server would not actually work. We could poll a specific web page to catch this, but a web server is only one specific case. How can we generalize the solution to other systems? One could run a script on the server itself that executes a number of tests to see whether the system is in good shape. If everything is OK, the script sends a ping to the watchdog. In this case it is not the watchdog that originates the ping but the "health check script" on the monitored equipment, which pings the watchdog once in a while to say "I am OK".

The watchdog resets the system only if those pings are missing for a period of time.
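Such a health check script could look roughly like the following Python sketch. It is an assumption about how one might write it, not a script shipped with the watchdog: the function names, the choice of a TCP port test as the health check, and the placeholder address `192.0.2.10` for the watchdog are all hypothetical.

```python
import socket
import subprocess


def port_is_open(host, port, timeout_s=3.0):
    """Example check: can we open a TCP connection to a local service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def ping_watchdog(watchdog_ip="192.0.2.10"):
    """Send one ping to the watchdog to say "I am OK" (placeholder address)."""
    subprocess.run(
        ["ping", "-c", "1", watchdog_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def run_health_check(checks, send_alive):
    """Call send_alive() only if every check passes; return the verdict."""
    if all(check() for check in checks):
        send_alive()
        return True
    # At least one check failed: stay silent so the watchdog notices.
    return False
```

Run from cron or a timer every 30 seconds or so, a call such as `run_health_check([lambda: port_is_open("127.0.0.1", 80)], ping_watchdog)` pings the watchdog only while the web server still accepts connections.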

Time to reboot

We must pay special attention to the way systems reboot. Let's say we expect an "alive signal" (a ping or a ping reply) from the monitored network equipment every 30 seconds, and we initiate a reboot after 2 missed signals, in other words a little after 60 seconds. The system reboots, but the reboot itself might take 5 or 10 minutes. We must avoid resetting the system again during startup, otherwise it will never finish starting up.
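The numbers above can be put into a small Python illustration. This is only a toy model with hypothetical function names: it assumes a naive watchdog that resets whenever the timeout passes without an alive signal, and a system that sends no alive signal until its boot has completed.

```python
def reset_timeout_s(interval_s, allowed_misses):
    """Time without any alive signal after which a reset is initiated."""
    return interval_s * allowed_misses


def simulate_naive_watchdog(boot_time_s, timeout_s, horizon_s):
    """Count resets issued by a watchdog that keeps resetting during boot.

    Returns (number_of_resets, system_finished_booting) for the simulated
    period of horizon_s seconds. Time starts right after a reset.
    """
    resets = 0
    now = 0  # a reset has just happened; the boot starts here
    while now + timeout_s <= horizon_s:
        if boot_time_s <= timeout_s:
            # The boot completes before the deadline, so an alive signal arrives.
            return resets, True
        # Deadline reached while still booting: reset again, boot starts over.
        now += timeout_s
        resets += 1
    return resets, False
```

With a 30 second interval and 2 allowed misses the timeout is 60 seconds. For a system that needs 300 seconds to boot, `simulate_naive_watchdog(300, 60, 600)` reports ten resets in ten minutes and a system that never comes up.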

The solution is to put the watchdog into a "passive state" after a reset. In this state it continues to monitor the system but does not initiate a new reset. Only when the watchdog receives the first "I am alive" indication again does it go back into an "active state", in which it will again initiate a reboot/reset in case of a failure. This way it does not really matter how long the startup takes.
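The two states can be sketched as a small state machine in Python. This is a minimal model of the behaviour described above, not the actual firmware: the class and method names are hypothetical, and starting in the passive state at power-on is an assumption.

```python
ACTIVE = "active"
PASSIVE = "passive"


class HostWatchdog:
    """Toy model of the active/passive watchdog logic."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        # Assumption: start passive, since the monitored system may still be booting.
        self.state = PASSIVE
        self.last_alive = now
        self.reset_count = 0

    def alive(self, now):
        """Record an alive signal; the first one after a reset re-arms the watchdog."""
        self.last_alive = now
        self.state = ACTIVE

    def poll(self, now):
        """Return True if a reset must be triggered at time `now`."""
        if self.state == ACTIVE and now - self.last_alive > self.timeout_s:
            self.reset_count += 1
            self.state = PASSIVE  # do not reset again until the system reports back
            return True
        return False
```

For example, with `HostWatchdog(timeout_s=60)` and an alive signal at time 0, `poll(61)` triggers a reset and switches to the passive state. Later calls to `poll` return False, however long the boot takes, until `alive` is called again.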

The hardware


The tuxgraphics ethernet board can drive a relay from pin PD7. Relays usually have one contact that opens and one that closes. Depending on whether you want to reset the monitored equipment or disconnect it from power for a moment, you can use either of the two relay contacts. The Ethernet board simply supplies a current to the relay at the time of the reset/restart.

The hardware is therefore very simple. Just take the standard tuxgraphics ethernet board and connect a relay to it.  

The tuxgraphics host watchdog


The watchdog is configurable via its own web pages. You just point your web browser to it and you can see the state of the system, how often it had to be reset, whether the watchdog is active or passive, and so on. You can also configure whether the watchdog pings the system or the system pings the watchdog.

The watchdog has its own online help.






© Guido Socher, tuxgraphics.org

2009-03-02, generated by tuxgrparser version 2.56