使用 watchdog 构建高可用性的 Linux 系统及应用-janson3344-ChinaUnix博客

快乐和谐eternalunix.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

janson3344

博客访问： 402773
博文数量： 99
博客积分： 5134
博客等级：大校
技术积分： 1607
用户组：普通用户
注册时间： 2007-03-30 09:31

文章分类

全部博文（99）

技术文档（99）

DB（1）

web server（11）

shell（1）

windows（5）

常用（2）

programme（1）

python（3）

私人（26）

linux（26）

网络（10）

oracle（13）
文摘（0）

开心笑话（0）
未分配的博文（0）

文章存档

2011年（48）

2010年（40）

2009年（10）

2008年（1）

我的朋友

相关博文

使用 watchdog 构建高可用性的 Linux 系统及应用

分类： LINUX

2011-08-18 14:40:31

Watchdog HowTo (edited: 15-July-2008)
==============
Keywords: software autoreboot, autorebooting, auto-reboot, auto-rebooting, auto rebooting

Watchdog is a program that you can use to reboot your server automatically in a lot of cases.
It has been used succesfully to reboot servers in the "Unexplained Crash" problem, that can have as causes a disk queue starvation problem, or a quota/ext3 filesystem deadlock, crashing the server many times randomly. If downtime due crashes in your system is a problem, probably you must use watchdog to assure you peacefully tranquility back again.

This works in any distribution: Ensim, Plesk, CPanel, etc., in any Linux system.

As documentation in /usr/src/[your-linux-kernel]/Documentation/watchdog.txt, kernel provides watchdog timer interfaces in a device named /dev/watchdog, "which when open must be written to within a timeout or the machine will reboot. Each write delays the reboot time another timeout. In the case of the software watchdog the ability to reboot will depend on the state of the machines and interrupts. The hardware boards physically pull the machine down off their own onboard timers and will reboot from almost anything.". The timeout default is 60 seconds.

The watchdog program simply uses the /dev/watchdog device, activating the softdog module on your system, if you have support in your kernel, and writes in /dev/watchdog within 10 seconds, making several other (configurable) checks in your system. If your system crashes, or watchdog stop to working, or in any case watchdog be supposed not to write in that device in 60 seconds, but kernel remains live, it will reboot within 60 seconds.

I have acknowledgement the following RedHat kernel already comes with support to softdog module:
2.4.18-27.7.x
2.4.20-19.7
2.4.20-24.7
2.4.20-27.7
2.4.20-28.7
2.4.21-27.ELsmp (RHEL3)

The major distros already comes with softdog module support. If you don´t use any of above kernels, try to check if your version/distro come with softdog module suport, with the command "modprobe softdog", and check with "lsmod|grep softdog". If so, quickly execute "rmmod softdog", to your server not reboot automaticly. If not supported, you must compile a kernel with support for watchdog, setting these parameters:
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=m

Refer to "Kernel compile HowTo" to compile a new kernel for your system.

Installation
============
In general steps, to install watchdog it´s suffice download, install, and change a few parameters in /etc/watchdog.conf. It´s very simple. But in *NO* way experiment with watchdog !!! You can have a bad experience, and need to restore your server. Only do what you know what you are doing! Be advised. I´m a experienced network administrator (20 years IT, 11 years with hosting), and although my experience, this costed me 2 (two) restores with EV1 to learn.

Always check your backups *before* install watchdog.

Download:
=========
If you are using Ensim, download from:
# wget (Several other different versions in dag repository)

Run your rpm:
# rpm -ivh watchdog-5.2-5.dag.rh73.i386.rpm

FIRST IMPORTANT THING TO DO: Disable auto-start of watchdog (explained below the reason):
(This is for RedHat like distros. Check how to do it for another distros)
# chkconfig watchdog off

Configuration:
==============
Softdog is auto-loaded by watchdog, so you don´t need make nothing.
You need at least to change the /etc/watchdog.conf, in the following lines, uncomenting its:
Uncomment:

CODE

=================================
#file = /var/log/messages
#watchdog-device = /dev/watchdog
=================================

Turning in:

CODE

=================================
file = /var/log/messages
watchdog-device = /dev/watchdog
=================================

You can adjust any other configurations at your taste. Check too the file '/etc/sysconfig/watchdog' (for RedHat-like distros) to startup / command line configurations of watchdog. (for example, mine is: OPTIONS="-v -b" to verbose log, and soft reboot)

Create the watchdog device:
# mknod /dev/watchdog c 10 130

Check if it exists really:
# ls -alF /dev/watchdog

If ok, execute the following:
# service watchdog start

You already have watchdog working.

Check in your /var/log/messages if there are some lines like the following:

CODE

Jan 13 15:06:13 ensim kernel: Software Watchdog Timer: 0.05,
timer margin: 60 sec
Jan 13 15:06:13 ensim kernel: pcwd: v1.13 (03/06/2002) Ken Hollis
(kenji@bitgate.com)
Jan 13 15:06:13 ensim kernel: pcwd: No card detected, or port not
available
Jan 13 15:06:13 ensim kernel: WDT driver for Acquire single board
computer initialising.
Jan 13 15:06:13 ensim watchdog: watchdog startup succeeded
Jan 13 15:06:13 ensim watchdog[3130]: starting daemon (5.1): (...
long line with options...)

If so, it´s all right.
After, test watchdog, rebooting your server:
# service watchdog stop
(NOTICE 1: This is not a truely shutdown/reboot procedure! The kernel will make a hard reboot here. So, analyse the consequences, if you do not have any program writting in your disk. Close all processes first, if you have worries about, shutdowning all daemons before. Is kernel rebooting your machine, not watchdog daemon program.)

In 60 seconds your system will reboot. If not, try to check if the module are loaded, with the "lsmod" command. In more modern systems like some newer version of RedHat-like distro, the "service" command to "stop" or "restart" does nothing. This is much more secure to work with watchdog. If so, try to reboot manually your server. Your system should restart in normal time (nearly two minutes. Pray!)

If your system it´s ok, restart watchdog again (service watchdog restart), and you could include a line in the end of file /etc/rc.d/rc.local:
# echo "/sbin/service watchdog restart" >> /etc/rc.d/rc.local

If you to want, test again watchdog:
# service watchdog stop

If reboot ok, you are already protected.
If not reboot, ask for EV1 reboot in single user mode, or a different kernel, and undo yourself the changes. The main is to remove the device /dev/watchdog, with "rm -f /dev/watchdog" command. When the computer starts, and see no /dev/watchdog, the softdog do nothing, and your server stops to rebooting in next boot.

!!! CAUTION !!! CAUTION !!! CAUTION !!!
1) Never, never, never use chkconfig to make watchdog auto-restart in next boot. Redhat kill processes when changing runlevels, and when kill watchdog your system will eternally rebooting, needing a restore from EV1. Don´t experiment with watchdog in a real production machine.
2) Following rigorously the steps above worked for me, and I think can work for your. But I cannot warranty any thing to you, so you are the ultimate responsible to following them.
!!! CAUTION !!! CAUTION !!! CAUTION !!!

That steps above are secure. But before you install new kernels NEVER forget to drop the line in your /etc/rc.d/rc.local file (comment it). Test watchdog again in the new system, like showed before, but without start automatically in the next boot, commenting the start line of watchdog in rc.local. If any problem, simply ask a reboot to EV1, and all it´s ok again, allowing you know what fails, if kernel not support watchdog, if installation problem, etc.

Some succesful and unsuccesful (me) installations of watchdog in:
<>

Watchdog在实现上可以是硬件电路也可以是软件定时器，能够在系统出现故障时自动重新启动系统。在Linux 内核下, watchdog的基本工作原理是：当watchdog启动后(即/dev/watchdog 设备被打开后)，如果在某一设定的时间间隔内/dev/watchdog没有被执行写操作, 硬件watchdog电路或软件定时器就会重新启动系统。

/dev/watchdog 是一个主设备号为10，从设备号130的字符设备节点。 Linux内核不仅为各种不同类型的watchdog硬件电路提供了驱动，还提供了一个基于定时器的纯软件watchdog驱动。驱动源码位于内核源码树drivers\char\watchdog\目录下。

硬件watchdog必须有硬件电路支持, 设备节点/dev/watchdog对应着真实的物理设备，不同类型的硬件watchdog设备由相应的硬件驱动管理。软件watchdog由一内核模块softdog.ko 通过定时器机制实现，/dev/watchdog并不对应着真实的物理设备，只是为应用提供了一个与操作硬件watchdog相同的接口。
硬件watchdog比软件watchdog有更好的可靠性。软件watchdog基于内核的定时器实现，当内核或中断出现异常时，软件watchdog将会失效。而硬件watchdog由自身的硬件电路控制, 独立于内核。无论当前系统状态如何，硬件watchdog在设定的时间间隔内没有被执行写操作，仍会重新启动系统。
一些硬件watchdog卡如WDT501P 以及一些Berkshire卡还可以监测系统温度，提供了 /dev/temperature接口。

对于应用程序而言, 操作软件、硬件watchdog的方式基本相同：打开设备/dev/watchdog, 在重启时间间隔内对/dev/watchdog执行写操作。即软件、硬件watchdog对应用程序而言基本是透明的。

在任一时刻，只能有一个watchdog驱动模块被加载，管理/dev/watchdog 设备节点。如果系统没有硬件watchdog电路，可以加载软件watchdog驱动softdog.ko。

在Linux下使用watchdog开发应用之前，请确定内核已经正确地配置支持watchdog。内核源码下的drivers/char/watchdog/Kconfig文件提供了各种 watchdog配置选项的详细介绍。特别指出关于‘CONFIG_WATCHDOG_NOWAYOUT’选项的配置，从清单1软件watchdog的模块信息可以看到，nowayout参数的缺省值等于‘CONFIG_WATCHDOG_NOWAYOUT’, 如果‘CONFIG_WATCHDOG_NOWAYOUT’选项在内核配置时设为‘Y’, 缺省情况下，watchdog启动后（即/dev/watchdog被打开后），无论是执行close操作还是写入字符‘V’都不能停止watchdog 的运行。

linux-mach:~ # modinfo softdog filename: /lib/modules/2.6.16.21-0.8-smp/kernel/drivers/char/watchdog/softdog.ko author: Alan Cox description: Software Watchdog Device Driver license: GPL alias: char-major-10-130 vermagic: 2.6.16.21-0.8-smp SMP 586 REGPARM gcc-4.1 supported: yes depends: srcversion: EAE9E5688843C073B0EF5BC parm: soft_noboot:Softdog action, set to 1 to ignore reboots, 0 to reboot (default depends on ONLY_TESTING) (int) parm: nowayout:Watchdog cannot be stopped once started (default=CONFIG_WATCHDOG_NOWAYOUT) (int) parm: soft_margin:Watchdog soft_margin in seconds. (0

在开发应用之前，必须了解所用的watchdog驱动的重启时间间隔。各种硬件以及软件watchdog驱动都设有一个缺省的重启时间间隔，此重启时间间隔也可在加载驱动模块时设置。

从清单1软件watchdog的模块信息可以看到，soft_margin参数代表了softdog.ko的重启时间间隔，缺省值是60秒，可以在加载 softdog.ko 时指定重启时间间隔，如 `modprobe softdog soft_margin=100`。

各种硬件以及软件watchdog驱动都为应用提供了相同的操作方式。打开/dev/watchdog设备，watchdog将被启动。如果在指定重启时间间隔内没有对/dev/watchdog执行写操作，系统将重启。

int wdt_fd = -1; wdt_fd = open("/dev/watchdog", O_WRONLY); if (wdt_fd == -1) { // fail to open watchdog device }

如果内核配置选项‘CONFIG_WATCHDOG_NOWAYOUT’设为‘Y’, 缺省情况下watchdog启动后不能被停止。如果模块的nowayout参数设为0，往/dev/watchdog 写入字符`V’ 可以使watchdog停止工作。可以参考参考文献二2.6内核源码drivers\char\watchdog目录下各种硬件及软件watchdog驱动的write函数得到停止watchdog的逻辑, 譬如软件watchdog驱动softdog.c 中的write函数。

参考文献三watchdog daemon源码 watchdog-5.4.tar.gz的 close_all函数提供了停止watchdog运行的范例。以下是一个简单的停止watchdog的代码段范例:

if (wdt_fd != -1) { write(wdt_fd, "V", 1); close(wdt_fd); wdt_fd = -1; }

从1.6节中提到的softdog模块的write函数可以看到，在watchdog重启时间间隔内执行写操作，softdog_keepalive将被调用，增加定时器定时时间。

所以应用在启动watchdog后，必须在重启时间间隔内，周期性地对/dev/watchdog执行写操作才能使系统不被重启。

参考文献三watchdog daemon源码 watchdog-5.4.tar.gz的keep_alive函数提供了保持watchdog运行的范例。以下是一个简单的保持watchdog运行的代码段范例:

if (wdt_fd != -1) write(wdt_fd, "a", 1);

不同的系统与应用有自己的监控需求，不同的被监控对象有相应的监控方法。下文第一部分介绍了一个Linux下的开源项目 watchdog daemon。这是一个系统监控的后台应用，通过从配置文件中读取监控对象以及监控参数等信息对系统进行内存、负载、进程、网络等方面的监控，同时将各种监测信息记入系统日志，并能够在系统重启之前发email通知管理员。通过阅读这个项目的源码可以学习到：

如何在一个系统监控程序中使用watchdog，在系统出现故障时重新启动系统，以提高系统的可用性
常用的被监控对象以及相应的监控方法

当Linux用于电信、嵌入式领域，一些基于Linux操作系统开发的服务应用在可用性上有较高的要求，下文第二部分介绍了如何在这种类型的应用中加入watchdog机制以提高应用的可用性。

watchdog daemon是一个Linux下使用了watchdog的系统监控的后台应用。开发人员可以下载最新的watchdog_5.4.tar.gz源码。参见参考资源下载源码。

Watchdog daemon主程序运行流程如下：

解析命令行输入，从watchdog.conf 配置文件读取配置信息，检测各种配置参数的有效性。这些配置信息包括：
- 被监控的对象譬如网络接口eth0,　被监控进程的pid文件如/var/run/syslogd.pid
- 各种检测参数的阙值
- 用户提供的修复程序路径
- 用户提供的测试程序路径
?
如果没有在配置文件中设置某些配置信息，相应的检测将不被执行。
进行一系列初始化：
- 初始化用于检测网络状态的套接口
- 设置自身为后台进程
- 打开日志文件，往日志中记入各种配置信息
- 打开/dev/watchdog设备启动watchdog
- 打开proc/下各种状态信息文件如/proc/meminfo、 /proc/loadavg
- 禁止自身被swap出内存，设置自身的调度优先级(在系统负载较高时， watchdog 后台进程自身有可能被swap出内存，导致不能及时对/dev/watchdog 执行写操作而使系统重启)
?
进入while(1) 主循环，依次执行： ?
- 对/dev/watchdog执行写操作，表明当前系统运行良好。如果超过一分钟的时间间隔没有对/dev/watchdog执行写操作，系统将会被动重启
- 进行各种检测，下节将介绍被各种检测的对象以及相应的检测方法
- 睡眠一段时间，此睡眠时间在配置文件中设置。如果睡眠时间设置太长，则会出现不必要的重启；设置太短，会导致检测太频繁，增加系统负载。
? ?

如果检测返回错误，检测返回值显示故障可以修复，同时配置文件中也设置了修复程序，修复程序将会被调用。如果修复故障失败，主程序将会尝试执行一系列清理、记载日志、email通知管理员的操作，然后重新启动系统。必须注意的是，当检测发现错误时，主程序并不是退出while（1）循环，使/dev/watchdog在1分钟内没被执行写操作而导致系统被动重启；而是执行一系列操作后主动重启系统，这样既可以记录下日志信息，email通知管理员，也可以防止系统被动重启导致文件系统损坏。

主程序在主动重启之前，会执行以下一系列操作:

关闭所有打开的文件
如果sendemail应用存在，配置文件中提供了管理员的email地址，将发封email给管理员通知系统重启
将重启信息记入系统日志
kill掉所有进程
将重启信息记入wtmp
关闭accounting，quota，swap ?
卸载所有非根文件系统
以只读方式重新装载根文件系统
关闭网络接口
重启系统

从以上程序框架可以看到，监控应用检测出系统故障时，在系统可控的状态下，应用会尝试主动重启系统。但是通过启动watchdog，能够在系统不可控，或是监控应用自身不能正常运行时，由watchdog自动启动系统。

watchdog daemon根据配置文件对系统进行以下类型的监控，并将每次检测的结果记入系统日志：

通过打开一个文件成功与否测试文件表是否已满
通过读取/proc/loadavg 以检测1、5、15分钟内系统平均负载是否超过设定值
通过读取/proc/meminfo检测系统是否还剩下足够的空闲内存
如果一些硬件watchdog卡提供有温度传感器进行温度监控，访问/dev/temperature设备判断温度是否过高
通过调用kill (pid, 0)检测某个进程是否仍在运行，如果kill调用返回为0，则进程仍在运行, 通过从配置文件中读取pid文件如/var/run/syslogd.pid来获取被监控进程的pid
通过解析/proc/net/dev 的信息，查看指定的网络接口如eh0的收发包状况
通过往一些IP地址发包检测这些IP地址可否被ping通，或是ping广播地址检测子网中是否至少一台机器可以被ping通
通过调用fork、execl执行用户传递的测试程序

watchdog daemon提供了一些常用的很有参考价值的监控方法及源码，开发人员也可自行设计开发更丰富的监控方法对系统进行更为细致、全面的监控。

Linux因为其强大的功能已成为很多企业级应用的开发平台。这些应用本身并不是一个监控程序而是一个管理或服务程序，对可用性有较高的要求。通过在这类应用中加watchdog模块，一方面监测系统状态，另一方面监测此应用中其它模块的工作状态。系统出现故障时能自动重启，因为应用已加在系统的init.d启动服务中，应用会随着系统的启动而启动，自动提供服务。

这类服务应用通常是多线程的，甚至是多进程的。加入watchdog机制的通常作法是在应用中加入一个watchdog线程，此线程主循环的工作流程与上节介绍的系统监控应用的守护进程的主循环工作流程大致相似。Watchdog线程在进入主循环之前的初始化阶段打开/dev/watchdog 启动watchdog, 然后执行上节中提到的while(1)循环中的操作。

如果此应用是多进程的，子进程的fork可以放在watchdog线程的初始化阶段执行。 Watchdog线程可以获得这些子进程的pid，从而检测这些子进程的工作状态。

Watchdog线程除了可以参照上节watchdog daemon提供的方法对系统状态进行监控，还可以通过消息队列或套接口与应用中的其它线程进行通信，获得其它线程的工作状态。不过这需要在其它子线程中加入与watchdog线程进行通信的接口。

linux-2.6.xx\Documentation\watchdog\ 内核提供的关于watchdog的文档
各种硬件watchdog驱动以及软件watchdog 驱动源码

原文地址: http://www.ibm.com/developerworks/cn/linux/l-cn-watchdog/index.html

阅读(4142) | 评论(3) | 转发(0) |

上一篇：oracle 管理命令

下一篇：ssh反向连接和简单实现

给主人留下些什么吧！~~

liu1188662016-03-18 11:16:01

http://gdsgfhfdhgfj.ebdoor.com/
http://defrtgggdfhgfj.ebdoor.com/
http://mjhgsgffdhgfhgf.ebdoor.com/
http://frdfsdgffhgf.ebdoor.com/
http://hfdhdfhgffgjh.ebdoor.com/
http://gtfdgsgdfhh.ebdoor.com/
http://htfhfgjjhgjgh.ebdoor.com/
http://lkmnhgug.ebdoor.com/
http://frdgdfgdhgff.ebdoor.com/
http://dhfhgfgjjj.ebdoor.com/
http://nhhghgfhjhgjg.ebdoor.com/
http://ebgfhdhgfhgf.ebdoor.com/
http://devfrghfgjhjhg.ebdoor.com/
http://kjhygtfhfhg.ebdoor.com/
http://bgngfdgfdhgfj.ebdoor.com/
http://wedghffgjhgj.ebdoor.com/
http:

回复 | 举报

liu1188662016-03-18 11:15:49

http://grhgfjhjhgjk.ebdoor.com/
http://grfhchgjvjh.ebdoor.com/
http://frdgdfdfhgffg.ebdoor.com/
http://defrvfbghnfgd.ebdoor.com/
http://frbgnhhfghfg.ebdoor.com/
http://wcdvfgbfhgfj.ebdoor.com/
http://erdsgffdhgf.ebdoor.com/
http://erfvbghnfdh.ebdoor.com/
http://drgfdhhgfjhgjhg.ebdoor.com/
http://efrbgfgjhjhg.ebdoor.com/
http://evfbgfdhgfhgf.ebdoor.com/
http://rgthygfjjhg.ebdoor.com/
http://bgthfdhhgfh.ebdoor.com/
http://ergtfdhghfgjhj.ebdoor.com/
http://rtbggfdhgf.ebdoor.com/
http://wevfgtgfdhgh.ebdoor.com/
http://

回复 | 举报

liu1188662016-03-18 11:15:25

http://fdhhgfjjhg.ebdoor.com/
http://lkmnhgug.ebdoor.com/
http://bgngfdgfdhgfj.ebdoor.com/
http://bgtdssgfdgd.ebdoor.com/
http://nhmjfgjjghk.ebdoor.com/
http://mjhgsgffdhgfhgf.ebdoor.com/
http://hfdhdfhgffgjh.ebdoor.com/
http://gtfdgsgdfhh.ebdoor.com/
http://htfhfgjjhgjgh.ebdoor.com/
http://defrbgnhgffgj.ebdoor.com/
http://nhmukjyhhghf.ebdoor.com/
http://frgtgnhhghfg.ebdoor.com/
http://wedvfbgfddfh.ebdoor.com/
http://frbgnhfdhhgfhg.ebdoor.com/
http://mjnhbgghggg.ebdoor.com/
http://mjjgjfjhgjhg.ebdoor.com/
http://de

回复 | 举报

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6