Category: LINUX

2010-07-01 17:19:04

Problem

A remote machine that is hard to reach gets stuck, e.g. because all memory has been consumed by some rogue process. We want it to reboot automatically.

Solution

  • We want the kernel to reboot upon a panic. man 7 bootparam tells us that this can be achieved with the boot parameter panic=N (on Gentoo, it goes on the "kernel" line in /boot/grub/menu.lst). The kernel will then reboot N seconds after a panic.

    The same effect can be achieved at runtime (the setting does not survive a reboot) by

    echo N >/proc/sys/kernel/panic

    (I'm not sure whether that is really needed for the watchdog daemon, but it won't hurt either).
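
As a minimal sketch of the runtime route, the helpers below read the current panic timeout and write a new one. The function names get_panic_timeout and set_panic_timeout and the PANIC_FILE override are invented for this example; writing the real file requires root.

```shell
#!/bin/sh
# Hypothetical helpers; PANIC_FILE can be overridden to try them on a scratch file.
PANIC_FILE=${PANIC_FILE:-/proc/sys/kernel/panic}

# Print the current panic timeout in seconds (0 means: do not reboot on panic).
get_panic_timeout() {
    cat "$PANIC_FILE"
}

# Set the panic timeout to $1 seconds; accepts only a non-negative integer.
set_panic_timeout() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_panic_timeout SECONDS" >&2; return 1 ;;
    esac
    echo "$1" >"$PANIC_FILE"
}
```

As root, set_panic_timeout 10 followed by get_panic_timeout should then report the new value.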

  • Turn on support for the software /dev/watchdog device in the kernel. This can be done in make menuconfig: select character devices ("watchdog cards"), then turn on the main option and the software watchdog (the "softdog" driver). I don't want to bother with modules, so I built the driver into the kernel (after first testing it as a module).

    Note that activating the softdog driver does not by itself force you to run a watchdog daemon to "pet the dog": the timer is only armed once the /dev/watchdog file has been opened.

    The softdog driver takes two parameters, as the following snippet from its source shows:

    	#define TIMER_MARGIN	60		/* (secs) Default is 1 minute */
    
    	static int soft_margin = TIMER_MARGIN;	/* in seconds */
    	#ifdef ONLY_TESTING
    	static int soft_noboot = 1;
    	#else
    	static int soft_noboot = 0;
    	#endif  /* ONLY_TESTING */
    
    	MODULE_PARM(soft_margin,"i");
    	MODULE_PARM(soft_noboot,"i");
        
    Thus, without any parameters (soft_margin = 60, soft_noboot = 0), the machine will reboot if /dev/watchdog, once opened, is not written to for 60 seconds.
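
To make "petting the dog" concrete, here is a minimal shell sketch of what a watchdog daemon does at its core. The names WATCHDOG_DEV and pet_watchdog are made up for illustration. Be careful: pointed at the real device, this arms the timer, and whether the final 'V' write ("magic close") disarms it again depends on the kernel version and on CONFIG_WATCHDOG_NOWAYOUT, so try it on a scratch file or on a machine you can afford to have rebooted.

```shell
#!/bin/sh
# Sketch only: keep a watchdog device alive from the shell.
# WATCHDOG_DEV and pet_watchdog are illustrative names, not part of any package.
WATCHDOG_DEV=${WATCHDOG_DEV:-/dev/watchdog}

# pet_watchdog INTERVAL COUNT
# Opens the device once (opening is what arms the timer), writes to it COUNT
# times with INTERVAL seconds in between, then writes 'V' before the device
# is closed at the end of the redirected block.
pet_watchdog() {
    interval=$1
    count=$2
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.'          # any write resets the countdown
            i=$((i + 1))
            sleep "$interval"
        done
        printf 'V'              # assumption: driver honours "magic close"
    } >"$WATCHDOG_DEV"
}
```

For example, pet_watchdog 10 6 would keep the machine alive for about a minute; INTERVAL has to stay well below soft_margin (60 seconds by default). If softdog is built as a module, the margin can presumably be changed at load time with something like modprobe softdog soft_margin=120.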
  • Install the watchdog daemon package. I use Gentoo, so emerge watchdog does the lot. If the device /dev/watchdog does not appear automatically, you can create it using
      mknod /dev/watchdog c 10 130
      
    Since I'd like to know what was going on when the machine got stuck, I specified a user-defined test program, /usr/sbin/watchdog-user-test, in /etc/watchdog/watchdog.conf, which is reproduced below.
      #ping			= 172.31.14.1
      #ping			= 172.26.1.255
      #interface		= eth0
      #file			= /var/log/messages
      #change			= 1407
    
      # Uncomment to enable test. Setting one of these values to '0' disables it.
      # These values will hopefully never reboot your machine during normal use
      # (if your machine is really hung, the loadavg will go much higher than 25)
      #max-load-1		= 24
      #max-load-5		= 18
      #max-load-15		= 12
    
      # Note that this is the number of pages!
      # To get the real size, check how large the pagesize is on your machine.
      #min-memory		= 1
    
      repair-binary		= /usr/sbin/repair
      test-binary		= /usr/sbin/watchdog-user-test
    
    
      #watchdog-device	= /dev/watchdog
      
      # Defaults compiled into the binary
      #temperature-device	=
      #max-temperature	= 120
      
      # Defaults compiled into the binary
      #admin			= root
      #interval		= 10
      #logtick                = 1
      
      # This greatly decreases the chance that watchdog won't be scheduled before
      # your machine is really loaded
      realtime		= yes
      priority		= 1
    
      # Check if syslogd is still running by enabling the following line
      #pidfile		= /var/run/syslogd.pid   
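
The min-memory comment in the configuration above points out that the value is a page count, not a byte count. A small sketch of the conversion follows; mb_to_pages is a made-up helper name, and the page size is taken from getconf unless given explicitly.

```shell
#!/bin/sh
# mb_to_pages MEGABYTES [PAGESIZE]
# Prints how many pages make up MEGABYTES. PAGESIZE (in bytes) defaults to
# this machine's page size as reported by getconf.
mb_to_pages() {
    pagesize=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * 1024 * 1024 / pagesize ))
}
```

With 4096-byte pages, mb_to_pages 16 yields the value to use for a threshold of roughly 16 MB.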
      
    Here is /usr/sbin/watchdog-user-test:
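      #!/bin/sh
      # Run periodically by the watchdog daemon (see test-binary above).
      # Records a timestamp and the process list, overwriting the previous
      # snapshot each time. A non-zero exit status counts as a failed test.
      { date; ps -efl; } >/var/log/watchdog-user-test-output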
      
    Finally, to have the watchdog daemon start at boot time, it suffices (on Gentoo) to run rc-update add watchdog default.
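
One caveat with the test script above: it overwrites its output on every run, so once the daemon starts again after the reboot, the snapshot taken just before the hang is replaced. A possible variant, sketched here with made-up names (take_snapshot, SNAP_DIR, KEEP) and an assumed log directory, writes one file per run and prunes the oldest ones.

```shell
#!/bin/sh
# Illustrative variant of watchdog-user-test that keeps a short history.
SNAP_DIR=${SNAP_DIR:-/var/log/watchdog-snapshots}   # assumed location
KEEP=${KEEP:-30}                                    # snapshots to retain

take_snapshot() {
    mkdir -p "$SNAP_DIR" || return 1
    # One file per run: timestamp in the name, random suffix against clashes.
    out=$(mktemp "$SNAP_DIR/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
    { date; ps -efl; } >"$out"
    # Delete everything except the newest $KEEP files (sorted by mtime).
    ls -1t "$SNAP_DIR" | tail -n +"$((KEEP + 1))" | while read -r old; do
        rm -f "$SNAP_DIR/$old"
    done
    # Always report success here so a failing ps does not trigger a reboot.
    return 0
}
```

Adding a call to take_snapshot as the last line would turn this into a drop-in replacement for the test-binary. With the default interval of 10 seconds, KEEP=30 retains about five minutes of history.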

References

  • : how to upgrade openssh without rebooting
  • daemon sources and further info

Reposted from:
