WATCHDOG(8) WATCHDOG(8)
NAME
watchdog - a software watchdog daemon
SYNOPSIS
watchdog [ -f | --force ] [ -c <config file> | --config-file <config file> ] [ -v | --verbose ] [ -s | --sync ] [ -b | --softboot ] [ -q | --no-action ]
DESCRIPTION
Watchdog is a daemon that checks whether your system is still working. If
programs in user space are no longer executed, it will hard reset the
system.
The kernel provides /dev/watchdog, which, when opened, must be written to
within a minute or the machine will reboot. Each write delays the reboot
time by another minute. After a minute without a write the watchdog
hardware will cause the reset. In the case of the software watchdog the
ability to reboot will depend on the state of the machine and interrupts.
Watchdog can be stopped without causing a reboot if the device
/dev/watchdog is closed correctly, unless of course your kernel is
compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
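
The open, write and close cycle described above can be sketched as follows. This is only an illustration, not the daemon's own code. It assumes that "closed correctly" refers to the Linux watchdog driver's magic close convention (writing the character V before closing); the function names are made up for this example.

```python
import os

# Assumption: the driver follows the Linux "magic close" convention,
# where writing 'V' right before close() disarms the timer.
MAGIC_CLOSE = b"V"


def open_device(path="/dev/watchdog"):
    """Open the watchdog device; from now on it must be written regularly."""
    return os.open(path, os.O_WRONLY)


def keepalive(fd):
    """Write one byte; each write delays the reboot by another timeout."""
    os.write(fd, b"\0")


def close_correctly(fd):
    """Request a disarm, then close.

    With CONFIG_WATCHDOG_NOWAYOUT the kernel ignores the request and the
    machine still reboots once the writes stop.
    """
    os.write(fd, MAGIC_CLOSE)
    os.close(fd)
```

Passing an ordinary file path to open_device() lets the sequence be tried without arming a real watchdog.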
TESTS
Watchdog itself does several additional tests to check the system status:
Check whether the process table is full.
Check whether there is enough free memory available.
Check whether some given files are accessible.
Check whether some given files change within a given interval.
Check whether the average work load exceeds a predefined maximal value.
Check whether a file table overflow occurred.
Check whether a given process (specified by a pid file) is still running.
Check whether some given IP addresses answer to a ping message.
Check whether some given network interfaces received some traffic.
Check the temperature (if available).
Execute a user defined command to do arbitrary tests.
If any of these checks fail, watchdog will cause a shutdown. Should any
of these tests except the user defined binary last longer than one
minute, the machine will be rebooted, too.
OPTIONS
Available command line options are the following:
-v | --verbose
Set verbose mode. Only implemented if compiled with the SYSLOG
feature. This mode logs several pieces of information to LOG_DAEMON
with priority LOG_INFO. This is useful if you want to see exactly
what happened until watchdog rebooted the system. Currently it
logs the temperature (if available), the load average, the
change date of the files it checks and how often it went to
sleep.
-s | --sync
Try to synchronize the filesystem every time the process is awake. Be
aware that the system is rebooted if for any reason syncing
lasts longer than a minute.
-b | --softboot
Soft-boot the system if an error occurs during the main loop,
e.g. if the file given with -n is not accessible via the stat(2)
call. Note that this does not apply to the open(2) calls, because
/dev/watchdog and /proc/loadavg are opened before the main
loop starts.
-f | --force
Force the usage of the interval given or the maximal load average
given in the config file.
-c <config file> | --config-file <config file>
Use <config file> as config file instead of the default
/etc/watchdog.conf.
-q | --no-action
Do not reboot or halt the machine. This is for testing purposes.
All checks are executed and the results are logged as usual, but
no action is taken. Also, neither your hardware card nor the kernel
software watchdog driver is enabled. Note that temperature
checking is also disabled since this triggers the hardware
watchdog on some cards.
FUNCTION
After watchdog starts, it puts itself into the background and then tries
all checks specified in its config file in turn. Between each two tests it
will trigger the kernel device. After finishing all tests watchdog goes to
sleep for some time. The kernel driver expects a write to the
watchdog device every minute. Otherwise the system will be rebooted. As a
default watchdog will sleep for only 10 seconds so it triggers the
device early enough.
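
The cycle described above (run the checks, trigger the device in between, then sleep) might look roughly like this. It is a simplified sketch and not the daemon's implementation. The parameter names, the injectable sleep function and the cycles limit are assumptions made so the loop can be exercised without a real device.

```python
import time


def run_cycle(checks, keepalive, on_failure):
    """Run every check once, triggering the device after each one."""
    for check in checks:
        ok = check()
        keepalive()  # keep the kernel timer from expiring between tests
        if not ok:
            on_failure(check)
            return False
    return True


def main_loop(checks, keepalive, on_failure,
              interval=10, sleep=time.sleep, cycles=None):
    """Repeat run_cycle, sleeping `interval` seconds between rounds.

    `cycles` (hypothetical) limits the number of rounds; None runs forever.
    """
    done = 0
    while cycles is None or done < cycles:
        if not run_cycle(checks, keepalive, on_failure):
            break
        sleep(interval)
        done += 1
```

With the default interval of 10 seconds the device is written far more often than once per minute, which is the margin the text describes.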
Under high system load watchdog might be swapped out of memory and may
fail to make it back in time. Under these circumstances the Linux
kernel will hard reset the machine. To make sure you won't get unnecessary
reboots, make sure you have the 'realtime' variable set to yes in the
config file watchdog.conf. It adds real time support to watchdog.
Thus it will lock itself into memory and there should be no problem
even under the highest of loads.
You can also specify a maximal allowed load average. Once this
load average is reached the system is rebooted. You may specify maximal
load averages for 1 minute, 5 minutes or 15 minutes. The default is
to disable this test. Be careful not to set this parameter too low. To
set a value less than the predefined minimal value of 2, you have to
use the -f option.
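
As an illustration of the load average test, the sketch below parses a line in the format of /proc/loadavg and compares the three averages against optional limits. The function names are hypothetical, and a limit of None stands for "test disabled", mirroring the default described above.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute averages from /proc/loadavg content."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("no (or not enough) data")
    return tuple(float(f) for f in fields[:3])


def load_exceeded(load, limits):
    """True if any enabled limit is exceeded.

    `limits` is (max1, max5, max15); None disables that comparison.
    """
    return any(limit is not None and current > limit
               for current, limit in zip(load, limits))
```

For example, load_exceeded(parse_loadavg(open("/proc/loadavg").read()), (24, 18, 12)) checks all three averages on a Linux system; the limits shown are arbitrary.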
You can also specify a minimal amount of virtual memory you want to
have available as free. As soon as more virtual memory is used, action
is taken by watchdog. Note, however, that watchdog does not distinguish
between different types of memory usage. It just checks
for free virtual memory.
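
A memory test of this kind could be sketched as below. It assumes that free virtual memory can be approximated as MemFree plus SwapFree from /proc/meminfo and that the limit is given in kB. Both are assumptions for this example; the text above does not say how the daemon computes the value or which unit it uses.

```python
def free_virtual_kb(meminfo_text):
    """Sum MemFree and SwapFree (in kB) from /proc/meminfo content."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    try:
        return values["MemFree"] + values["SwapFree"]
    except KeyError as exc:
        raise ValueError("meminfo contains invalid data: %s" % exc)


def memory_low(meminfo_text, min_free_kb):
    """True if less than min_free_kb of virtual memory is free."""
    return free_virtual_kb(meminfo_text) < min_free_kb
```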
If you have a watchdog card with a temperature sensor you can specify the
maximal allowed temperature. Once this temperature is reached the
system is halted. The default value is 120. There is no unit conversion, so
make sure you use the same unit as your hardware. Watchdog will issue
warnings once the temperature reaches 90%, 95% and 98% of this
temperature.
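
The thresholds above can be worked through in a few lines. The sketch assumes that a threshold counts as reached when the reading is greater than or equal to it; the function name and return strings are invented for illustration.

```python
# Warning thresholds in percent of the maximum, highest first.
WARN_LEVELS = (98, 95, 90)


def temperature_status(current, maximum=120):
    """Classify a sensor reading; values use the hardware's own unit."""
    if current >= maximum:
        return "halt"
    for pct in WARN_LEVELS:
        # Integer-friendly form of: current >= maximum * pct / 100
        if current * 100 >= maximum * pct:
            return "warn %d%%" % pct
    return "ok"
```

With the default maximum of 120 the warnings start at 108 (90%), 114 (95%) and 117.6 (98%).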
When using file mode watchdog will try to stat(2) the given files. Errors
returned by stat will not cause a reboot. For a reboot the stat call has
to last at least one minute. This may happen if the file is
located on an NFS mounted filesystem. If your system relies on an NFS
mounted filesystem you might try this option. However, in such a
case the sync option may not work if the NFS server is not answering.
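
A file test in this spirit might be sketched as follows. The helper name, the return strings and the use of the modification time as "change" are assumptions for illustration. Note that the hard reboot described above is not something this code does: it is simply what happens when stat(2) blocks so long that the device is no longer written.

```python
import os


def check_file(path, max_age, now):
    """Classify a file as 'ok', 'unchanged' or 'stat-error'.

    `now` and the file's mtime are seconds since the epoch; `max_age`
    is the interval within which the file must have changed.
    """
    try:
        st = os.stat(path)  # may block for a long time on a dead NFS mount
    except OSError:
        return "stat-error"  # reported, but by itself no reason to reboot
    if now - st.st_mtime > max_age:
        return "unchanged"
    return "ok"
```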
If you give watchdog a pidfile it will read the pid from this file and
call kill(pid,0) to see whether the process still exists. If not, action
is taken by watchdog. So you can for instance restart the server from
your repair-binary.
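
The kill(pid,0) probe can be reproduced like this. Treating a permission error as "process exists" is a choice made for this sketch and is not taken from the text above.

```python
import os


def process_alive(pidfile):
    """Read a pid from `pidfile` and probe it with signal 0."""
    with open(pidfile) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)  # signal 0 only checks, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```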
Watchdog will try periodically to fork itself to see whether the
process table is full. This process will leave a zombie process until
watchdog wakes up again and catches it.
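
A fork probe along these lines is sketched below. Unlike the behaviour described above, this version reaps the child immediately instead of leaving a zombie until the next wake-up. That simplification is deliberate.

```python
import os


def process_table_has_room():
    """Try to fork; failure suggests the process table is full."""
    try:
        pid = os.fork()
    except OSError:
        return False  # typically EAGAIN when no process slot is left
    if pid == 0:
        os._exit(0)  # child: leave at once without running cleanup
    os.waitpid(pid, 0)  # parent: collect the child right away
    return True
```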
In ping mode watchdog tries to ping the given IP addresses. These
addresses do not have to be a single machine. It is possible to ping
a broadcast address instead to see if at least one machine in a subnet
is still living.
Do not use this broadcast ping unless your MIS person a) knows about it and b) has given you explicit permission to use it!
Watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, with <interval> being the time it goes to sleep
between two times triggering the watchdog device. Thus an unreachable
network will not cause a hard reset but a soft reboot.
You can also test passively for an unreachable network by just monitoring
a given interface for traffic. If no traffic arrives the network is
considered unreachable, causing a soft reboot or action from the repair
binary.
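
Passive monitoring can be illustrated by comparing the receive counter of an interface between two snapshots of /proc/net/dev. The function names are hypothetical, and the text above does not state which counter the daemon uses; received bytes is an assumption here.

```python
def rx_bytes(net_dev_text, iface):
    """Return the received-bytes counter for `iface`.

    The two header lines of /proc/net/dev contain no colon and are
    skipped automatically.
    """
    for line in net_dev_text.splitlines():
        name, sep, stats = line.partition(":")
        if sep and name.strip() == iface:
            return int(stats.split()[0])
    raise KeyError(iface)


def traffic_arrived(before_text, after_text, iface):
    """True if the counter changed between the two snapshots."""
    return rx_bytes(after_text, iface) != rx_bytes(before_text, iface)
```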
With an external check binary watchdog can run user defined
tests. This may last longer than the time slice defined for the kernel
device without a problem. However, note that in this case error
messages are generated into the syslog facility. If you have enabled
softboot on error, the machine will be rebooted if the binary doesn't exit
within half the time watchdog sleeps between two tries triggering the
kernel device.
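
The timing rule above (the binary must exit within half the sleep interval) can be sketched with a subprocess timeout. This illustrates the rule only. The function name is made up, and stopping the child on timeout is a behaviour of this sketch, not something the text attributes to watchdog.

```python
import subprocess


def run_check(argv, interval=10):
    """Run a check binary; return its exit status or None on timeout.

    The time limit is half of `interval`, the sleep time between two
    triggers of the kernel device.
    """
    try:
        return subprocess.run(argv, timeout=interval / 2).returncode
    except subprocess.TimeoutExpired:
        return None  # subprocess.run has already killed the child
```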
If you specify a repair binary it will be started instead of shutting
down the system. If this binary is not able to fix the problem, watchdog
will still cause a reboot afterwards.
If eventually the machine is halted, an email is sent to notify a human
that the machine is going down. Starting with version 4.4 watchdog will
also notify the human in charge if the machine is rebooted.
SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for
every error that is found. Since there might be no more processes
available, watchdog does it all by itself. That means:
1) Kill all processes with SIGTERM.
2) After a short pause kill all remaining processes with SIGKILL.
3) Record a shutdown entry in wtmp.
4) Save the random seed from /dev/urandom. If the device is non-existent
or the filename to save to is empty, this step is skipped.
5) Turn off accounting.
6) Turn off quota and swap.
7) Unmount all partitions except the root partition.
8) Remount the root partition read-only.
9) Shut down all network interfaces.
10) Finally reboot.
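
Steps 1 and 2 of the list above follow a common two-stage pattern: ask politely, wait, then force. The sketch below applies it to an explicit list of pids rather than to every process on the machine, which is a deliberate restriction for safety. The function names and the pause length are assumptions.

```python
import os
import signal
import time


def _send(pid, sig):
    """Deliver a signal, ignoring processes that are already gone."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def stop_processes(pids, pause=1.0):
    """Send SIGTERM, wait `pause` seconds, then SIGKILL what is left."""
    for pid in pids:
        _send(pid, signal.SIGTERM)
    time.sleep(pause)
    for pid in pids:
        _send(pid, signal.SIGKILL)
```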
CHECK BINARY
If the return code of the check binary is not zero, watchdog will assume
an error and reboot the system. Be careful with this if you are using
the real-time properties of watchdog, since watchdog will wait for the
return of this binary before proceeding. A positive exit code is
interpreted as a system error code (see errno.h for details). Negative
values are special to watchdog:
-1 Reboot the system. This is not exactly an error but a command to
watchdog. If the return code is -1, watchdog will not try to run
a shutdown script instead.
-2 Reset the system. This is not exactly an error but a command to
watchdog. If the return code is -2, watchdog will simply refuse
to write the kernel device again.
-3 max load average exceeded.
-4 the temperature inside is too high.
-5 /proc/loadavg contains no (or not enough) data.
-6 Given file was not changed in the given interval.
-7 /proc/meminfo contains invalid data.
-8 personal use
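
The table above can be turned into a small lookup. One detail is an assumption of this sketch and not stated in the text: a process exit status is an unsigned 8-bit value, so a binary that returns -1 is seen by its parent as 255. The helper below reinterprets the status as a signed byte before looking it up.

```python
import os

# Codes -1 to -7 from the table above.
SPECIAL_CODES = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "max load average exceeded",
    -4: "the temperature inside is too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "given file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
}


def as_signed(status):
    """Reinterpret an exit status in 0..255 as a value in -128..127."""
    return status - 256 if status > 127 else status


def describe(status):
    """Give a readable meaning for a check binary's exit status."""
    code = as_signed(status)
    if code == 0:
        return "ok"
    if code > 0:
        return os.strerror(code)  # positive values are errno codes
    return SPECIAL_CODES.get(code, "unknown special code %d" % code)
```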
REPAIR BINARY
The repair binary is started with one parameter: the error number
that caused watchdog to initiate the boot process. After
trying to repair the system the binary should exit with 0 if the
system was successfully repaired and thus there is no need to boot
anymore. A return value not equal to 0 tells watchdog to
reboot. The return code of the repair binary should be the error number
of the error causing watchdog to reboot. Be careful with
this if you are using the real-time properties of watchdog, since
watchdog will wait for the return of this binary before proceeding.
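
The contract for a repair binary (one argument in, 0 or the error number out) could be skeletonised as below. The fixers table and its contents are entirely hypothetical; a real repair program would perform whatever action suits the given error.

```python
def repair(argv, fixers):
    """Core of a repair program.

    argv[1] is the error number passed by watchdog. `fixers`
    (hypothetical) maps error numbers to callables that return True
    when the problem was fixed. The result is the process exit code:
    0 on success, otherwise the original error number.
    """
    error = int(argv[1])
    fixer = fixers.get(error)
    if fixer is not None and fixer():
        return 0
    return error
```

A script built on this would end with something like sys.exit(repair(sys.argv, FIXERS)), where FIXERS is the user's own table.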
BUGS
None known so far.
AUTHORS
The original code is an example written by Alan Cox,
the author of the kernel driver. All additions
were written by Michael Meskes. Johnie Ingram
had the idea of testing the load average. He also
took over the Debian specific work. Dave Cinege
brought up some hardware watchdog issues and helped testing this stuff.
FILES
/dev/watchdog The watchdog device
/var/run/watchdog.pid The PID of the running watchdog
SEE ALSO
watchdog.conf(5)
4th Berkeley Distribution February 1996 WATCHDOG(8)