Chinaunix首页 | 论坛 | 博客
  • 博客访问: 711268
  • 博文数量: 160
  • 博客积分: 8847
  • 博客等级: 中将
  • 技术积分: 1656
  • 用 户 组: 普通用户
  • 注册时间: 2010-11-25 16:46
个人简介

。。。。。。。。。。。。。。。。。。。。。。

文章分类

全部博文(160)

文章存档

2015年(1)

2013年(1)

2012年(4)

2011年(26)

2010年(14)

2009年(36)

2008年(38)

2007年(39)

2006年(1)

分类: LINUX

2007-03-07 15:57:04

用Nagios监控服务器
nagios可以对服务器进行全面的监控,包括服务(apache、mysql、ntp、dns、disk、qmail和sshd等等)的状态,服务器的状态(up、down等等)。它是一个完全GPL协议的开源软件包,包含有nagios主程序和它的各个插件,配置非常灵活,可以监视的项目很多,可以自定义shell脚本进行监控服务,非常适合大型


nagios的包含主动监控和被动监控。
主动检查是通过监控中心的主机发出请求,让运行在远程主机上的nrpe守护进程收集信息,然后报告它,它通过web接口把
显示在页面上。
它的工作原理如下:

被动监控是当远程被监控主机处于
之内的时候,只有远程主机可以访问到监控中心,之内可以设置另外一个监控中心,远程监控中心的nagios收集服务器信息以后,和nsca报告,由naca客户端报告naca的服务器端,然后报告监控中心的nagios,通过web接口显示监控结果。


nagios的功能非常强大,
是它的窝,只有e文、法文和日文,没有中文,可惜啊。

我现在引用它的一段文字进行总结一下到底什么是nagios:
What Is This?
什么是nagios?
Nagios® is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.
Nagios was originally designed to run under
, although it should work under most other unices as well.
Some of the many features of Nagios® include:
Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
Monitoring of host resources (processor load, disk usage, etc.)
Simple plugin design that allows users to easily develop their own service checks
Parallelized service checks
Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
Contact notifications when service or host problems occur and get resolved (via email, pager, or user-defined method)
Ability to define event handlers to be run during service or host events for proactive problem resolution
Automatic log file rotation
Support for implementing redundant monitoring hosts
Optional web interface for viewing current network status, notification and problem history, log file, etc.
Nagios是一个监视系统和
的应用程序。它监视你所指定主机和服务,当监视的内容变好或者变坏时发出警告。Nagios最初是被设计在平台上运行的,然而现在在其他平台上也运行良好。
Nagios的特性包括:
监视
服务(SMTP, POP3, HTTP, NNTP, PING, 等等)
监视主机资源(处理器负载、磁盘空间等)
容许用户开发自己的插件去检查自定义的项目;
通过使用“父主机”,定义
主机的分层,容许探测主机down掉或者不可到达。
可以定义在主机或服务运行期间,事件发生以后如何处理和解决方式;
自动记录错误日志;
支持冗余监视;
可选web接口,通过web页面查看当前
状态,提示和报告故障历史,日志文件等;

Nagios的系统要求:
、Unix等
apache
GD库(1.63以上)
zlib
pnglib
jpeglib
basic icons
等,其中apache的安装在blog中已经有相关的文章,搜索一下就行;gd、zlib、pnglib和jpeglib安装比较简单,步骤:
下载tarball
tar zxvf xxx.tar.gz
cd xxx
./configure
make && make install

----------------------------------------------------------------------
Nagios的安装过程(
)
----------------------------------------------------------------------
nagios的安装比较简单,复杂的是设置和配置参数的设定。不过你要放松一点,毕竟我们要搞定它,不是吗?那就开始吧:

1:获得最新的安装包,

2:以root身份登录服务器,目前最新的版本是2.5:
1)nagios,版本2.5:
fetch

or
wget


2)获得nagios插件,版本1.4.3:


3)获得图库文件:
http://dl.sf.net/nagios/imagepak-base.tar.gz

4)NRPE,版本2.5.2


5)NSCA,版本2.6


3:切换到root用户:
sudo su

4:解压缩
tar zxvf nagios-2.5.tar.gz

5:建立运行nagios的用户:
adduser nagios

6:建立安装nagios的文件夹,并使这个文件夹的所有者为nagios:nagios
mkdir /usr/local/nagios
chown nagios.nagios /usr/local/nagios

7:确认web服务器的用户
可能会通过web接口执行一些命令,必须确定web服务器以哪个用户运行的,通常为:apache:
grep "^User" /usr/local/apache2/conf/httpd.conf

8:建立命令文件组
这个新的组会包括apache的用户和nagios的用户
pw groupadd nagcmd
pw usermod apache -G nagcmd
pw usermod nagios -G nagcmd
----------------------------------
cat /etc/group
nagcmd:*:9007:apache,nagios
----------------------------------

8:运行配置脚本并安装nagios
cd nagios-2.5
./configure --prefix=/usr/local/nagios --with-gd-lib=/usr/local/lib --with-gd-inc=/usr/local/include
---------------------------------
*** Configuration summary for nagios 2.5 07-13-2006 ***:

General Options:
-------------------------
Nagios executable: nagios
Nagios user/group: nagios,nagios
Command user/group: nagios,nagios
Embedded Perl: no
Event Broker: yes
Install ${prefix}: /usr/local/nagios
Lock file: ${prefix}/var/nagios.lock
Init directory: /usr/local/etc/rc.d
Host OS: freebsd6.0

Web Interface Options:
------------------------
HTML URL:

CGI URL:
Traceroute (used by WAP): /usr/sbin/traceroute


Review the options above for accuracy. If they look okay,
type 'make all' to compile the main program and CGIs.
---------------------------------
make all
make install
make install-init
make install-commandmode
make install-config

9:安装nagios-plugins
tar zxvf nagios-plugins-1.4.3.tar.gz
cd nagios-plugins-1.4.3
./configure --prefix=/usr/local/nagios-plugins
make all
make install
安装完成以后在/usr/local/nagios-plugins-plugins会产生一个libexec的目录,将该目录全部移动到/usr/local/nagios目录下即可。
mv /usr/local/nagios-plugins-plugins/libexec/ /usr/local/nagios/

10:imagepak-base.tar.gz的安装
tar –xvzf imagepak-base.tar.gz
解压以后是base目录
mv base/ /usr/local/nagios/share/images/logos/

----------------------------------------------------------------------
现在开始配置:
----------------------------------------------------------------------
1:配置web接口
假设你已经运行了apache,如果没有,请参考:
http://localhost/upload/blog.php?do-showone-tid-18.html

vi /usr/local/apache2/conf/httpd.conf
添加如下内容:
ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin


Options ExecCGI
AllowOverride None
Order allow,deny
Allow from all
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
Require valid-user


Alias /nagios /usr/local/nagios/share


Options None
AllowOverride None
Order allow,deny
Allow from all
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
Require valid-user

修改完毕,保存文件,并重启apache:
/usr/local/apahce2/bin/apachectl restart

2:配置apache的BASIC认证:
生成认证密码:
/usr/local/apache2/bin/htpasswd –c /usr/local/nagios/etc/htpasswd.users nagios nagios
apache接口配置完成。

开始配置nagios:
cd /usr/local/nagios/etc/
在/usr/local/nagios/etc下是nagios的配置模板文件-sample,把.cfg-sample文件全部拷贝成.cfg
例如:cp nagios.cfg-sample nagios.cfg
全部拷贝完成即可.

vi minimal.cfg
注释所有command:
注释的方法是在每一个定义语句前面添加”#“
修改cgi.cfg
修改use_authentication=1为use_authentication=0,即不用验证.不然有一些页面不会显示。

现在检查配置文件是否有语法错误:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
如果正确,会显示以下结果:
Total Warnings: 0
Total Errors: 0
否则,需要根据提示进行修改配置文件。

配置文件等会再弄。现在启动nagios
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

为了使nagios异常中断,我们使用daemontools启动:
安装daemontool:
mkdir -p /package
chmod 1755 /package
cd /package
fetch

cd admin/daemontools-0.76/
package/install
检查svscan进程是否启动:
ps aux | grep svscan
root 376 0.0 0.0 1636 0 con- IW - 0:00.00 /bin/sh /command/svscanboot
root 411 0.0 0.0 1224 208 con- S 8Jul06 0:42.50 svscan /service

ok,启动正常了。
cd /service
mkdir nagios
chmod 1755 nagios
touch ./run
chmod 755 ./run
vi run
PATH=/usr/local/bin:/usr/bin:/bin
export PATH

exec env - PATH=$PATH \
/usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg

mkdir log
cd log
touch ./run
chmod 755 ./run
vi ./run
#!/bin/sh
exec setuidgid logadmin multilog t s1000000 n100 ./main

mkdir main
chmod 777 main
chown nagios.nagios main
touch status
chown nagios.nagios status

svc -u /service/nagios/
svstat /service/nagios/
root@## ps auxww | grep nagios
root 23276 0.0 0.1 1176 488 ?? I 5:00PM 0:01.71 supervise nagios
nagios 34251 0.0 0.3 2316 1552 ?? S 6:06PM 0:00.10 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
root@##

ok,现在把nagios服务做成自动启动的服务了。
通过svc命令可以启动或者停止服务。
---------------------------------------------------------------------------------
svc opts services
opts is a series of getopt-style options. services consists of any number of arguments, each argument naming a directory used by supervise.

-u: Up. If the service is not running, start it. If the service stops, restart it.
-d: Down. If the service is running, send it a TERM signal and then a CONT signal. After it stops, do not restart it.
-o: Once. If the service is not running, start it. Do not restart it if it stops.
-p: Pause. Send the service a STOP signal.
-c: Continue. Send the service a CONT signal.
-h: Hangup. Send the service a HUP signal.
-a: Alarm. Send the service an ALRM signal.
-i: Interrupt. Send the service an INT signal.
-t: Terminate. Send the service a TERM signal.
-k: Kill. Send the service a KILL signal.
-x: Exit. supervise will exit as soon as the service is down. If you use this option on a stable system, you're doing something wrong; supervise is designed to run forever.
---------------------------------------------------------------------------------
比如:
停止nagios--svc -d /service/nagios/
重启nagios--svc -t /service/nagios/
启动nagios--svc -u /service/nagios/

当然,你也可以使用inited的方式进行:
/usr/local/etc/rc.d/nagios start/stop

好了,反正daemontools很强大,以后慢慢熟悉,转入正题。
现在打开网页:

一定会让你大吃一惊,呵呵,我的服务器和服务状态都清楚的看到了。
现在我们的nagios中只有一个,那就是它自己,localhost,呵呵,等会我们添加别的主机和主机服务,ok,我们认识一下nagios的庐山真面目:

配置nagios:

1)为主机添加服务
2)添加主机并添加服务
3)停止一个服务
4)删除一台主机和服务
5)查看所有主机的故障
6)查看一台特定的主机状态
7)改变报警的时间间隔
8)改变发现故障的重试次数
9)如何在nagios中使用外部命令


1)为主机添加一个服务
为localhost主机添加qmail服务的监控,方法如下:
vi minimal.cfg
define service{
use generic-service ; Name of service template to use
host_name localhost
service_description qmail_smtp
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_smtp!20%!10%!/
}

可以直接拷贝原有的进行修改,我这个就是拷贝的原有的check_local_disk进行的。
修改host_name,service_description,check_command等

define service{
use generic-service ; Name of service template to use
host_name localhost
service_description qmail_pop3
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_pop!20%!10%!/
}
照猫画虎的进行修改,然后去修改:
vi checkcommands.cfg
#'check_qmail' command definition
define command{
command_name check_qmail
command_line $USER1$/check_smtp -H 127.0.0.1
}
define command{
command_name check_pop3
command_line $USER1$/check_pop -H 127.0.0.1
}
保存,然后检查配置文件:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
如果没有错误会显示:
Total Warnings: 0
Total Errors: 0
如果有错误,请根据提示进行错误的修正。
重启nagios
svc -d /service/nagios/ && svc -u /service/nagios/
通过web页面检查nagios的结果:

点击“Service Detail”
会出现:

2)添加主机并添加服务
我们会监控这台主机的负载、磁盘等一些没有通过端口方式启动的服务器状态,以及它的服务,比如:apache、mysql、qmail和ntp等等吧。那么没有端口的nagios直接能监控到吗?答案是不行。所以我们必须在两台主机上安装nrpe,nrpe可以启动5666端口,把的信息源源不断的传给监控中心的主机。
ok,我们把apache、mysql、qmail和ntp先加上,这回我们把监控的主机和服务新建一个文件:
cd /usr/local/nagios/etc/
touch 10_5_1_156.cfg
vi nagios.cfg
cfg_file=/usr/local/nagios/etc/10_5_1_156.cfg

vi 10_5_1_156.cfg
定义一个主机:
define host{
use generic-host ; Name of host template to use
host_name test_nrpe
alias client
address 10.5.1.156
check_command check-host-alive
max_check_attempts 1
check_period 24x7
notification_interval 120
notification_period 24x7
notification_options d,r
contact_groups admins
}

定义主机需要检查的服务:
define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_ping!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description apache
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_http!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description mysql
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_mysql!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description ntp
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_ntp!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description qmail_smtp
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_smtp!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description qmail_pop3
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_pop!100.0,20%!500.0,60%
}
现在我们象上次一样把服务也定义完了:
此时是不是多了一个主机和它下面的服务呢?那是肯定的,添加主机和服务可能出现的问题有如下情况:
1:配置参数出现问题,如果你没有检查配置就启动nagios,可能会启动成功,但是显示会不正常;
解决方法:调整配置参数
2:Connection refused
当出现这个问题的时候,我开始以为是ssh的无密码登录没有成功,但是其实我的服务器没有启动该服务造成的,启动服务即可。

但是这些是有端口的服务,没有使用端口的状态任何?
使用nrpe,ok,我们现在在服务器上安装nrpe:
一、远程主机的配置
1、安装nrpe与配置
fetch

tar zxvf nrpe-2.5.2.tar.gz
cd nrpe-2.5.2
./configure --enable-ssl --enable-command-args
make all
mkdir -p /usr/local/nagios/etc
mkdir /usr/local/nagios/bin
mkdir /usr/local/nagios/libexec
pw addgroup nagios
pw useradd nagios -g nagios -d /usr/local/nagios/ -s /sbin/nologin
chown -R nagios:nagios /usr/local/nagios
cp ./sample-config/nrpe.cfg /usr/local/nagios/etc
cp src/nrpe /usr/local/nagios/bin
2、启动nrpe,端口为5666
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
netstat -ant | grep 5666
tcp4 0 0 *.5666 *.* LISTEN

二、监控服务器上的配置
1、安装nrpe(主要是使用check_nrpe模块)
fetch

tar zxvf nrpe-2.5.2.tar.gz
cd nrpe-2.5.2
./configure --enable-ssl --enable-command-args
make all
cp src/check_nrpe /usr/local/nagios/libexec
2、nagios文件的配置
vi checkcommands.cfg
定义check_nrpe命令
# 'check_nrep' command definition
define command{
command_name check_nrpe
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
三、上面我们已经配置了一部分参数,下面是配置的最终结果:
define host{
use generic-host ; Name of host template to use
host_name test_nrpe
alias client
address 10.5.1.156
check_command check-host-alive
max_check_attempts 1
check_period 24x7
notification_interval 120
notification_period 24x7
notification_options d,r
contact_groups admins
}

# 'check_load' command definition
define command{
command_name check_load
command_line $USER1$/check_load -w $ARG1$ -c $ARG2$
}

# 'check_load' command definition
define command{
command_name check_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$
}
define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_ping!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description apache
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_http!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description mysql
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_mysql!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description ntp
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_ntp!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description qmail_smtp
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_smtp!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description qmail_pop3
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_pop!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description test_load
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_load!100.0,20%!500.0,60%
}

define service{
use generic-service ; Name of service template to use
host_name test_nrpe
service_description test_disk
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_disk!100.0,20%!500.0,60%
}

四、检查配置参数并重启nagios


9)如何在nagios中使用外部命令
vi /usr/local/nagios/etc/nagios.cfg
check_external_commands=1

mkdir /usr/local/nagios/var/rw
chown nagios.nagcmd /usr/local/nagios/var/rw
chmod u+rw /usr/local/nagios/var/rw
chmod g+rw /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw

svc -t /service/nagios/
/usr/local/apache2/bin/apachectl restart
阅读(1745) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~