To sum up what nagios actually is, I will quote a passage from its own documentation:

What Is This?

Nagios® is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

Nagios was originally designed to run under Linux, although it should work under most other unices as well.

Some of the many features of Nagios® include:

  Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
  Monitoring of host resources (processor load, disk usage, etc.)
  Simple plugin design that allows users to easily develop their own service checks
  Parallelized service checks
  Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
  Contact notifications when service or host problems occur and get resolved (via email, pager, or user-defined method)
  Ability to define event handlers to be run during service or host events for proactive problem resolution
  Automatic log file rotation
  Support for implementing redundant monitoring hosts
  Optional web interface for viewing current network status, notification and problem history, log file, etc.
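Nagios system requirements:

  Linux, Unix, etc.
  apache
  GD library (1.63 or later)
  zlib
  libpng
  libjpeg
  basic icons

Installing apache is already covered in other posts on this blog; just search for it. gd, zlib, libpng and libjpeg are simple to install. The steps are:

  download the tarball
  tar zxvf xxx.tar.gz
  cd xxx
  ./configure
  make && make install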
Web Interface Options:
------------------------
HTML URL:
CGI URL:
Traceroute (used by WAP): /usr/sbin/traceroute
Review the options above for accuracy. If they look okay, type 'make all' to compile the main program and CGIs.
---------------------------------

make all
make install
make install-init
make install-commandmode
make install-config
9: Install nagios-plugins

tar zxvf nagios-plugins-1.4.3.tar.gz
cd nagios-plugins-1.4.3
./configure --prefix=/usr/local/nagios-plugins
make all
make install

After installation, a libexec directory is created under /usr/local/nagios-plugins. Move that whole directory into /usr/local/nagios:

mv /usr/local/nagios-plugins/libexec/ /usr/local/nagios/
10: Install imagepak-base.tar.gz

tar -xvzf imagepak-base.tar.gz

Unpacking produces a base directory:

mv base/ /usr/local/nagios/share/images/logos/
OK, nagios has now been set up as a service that starts automatically. The svc command can start or stop the service.
---------------------------------------------------------------------------------
svc opts services

opts is a series of getopt-style options. services consists of any number of arguments, each argument naming a directory used by supervise.

-u: Up. If the service is not running, start it. If the service stops, restart it.
-d: Down. If the service is running, send it a TERM signal and then a CONT signal. After it stops, do not restart it.
-o: Once. If the service is not running, start it. Do not restart it if it stops.
-p: Pause. Send the service a STOP signal.
-c: Continue. Send the service a CONT signal.
-h: Hangup. Send the service a HUP signal.
-a: Alarm. Send the service an ALRM signal.
-i: Interrupt. Send the service an INT signal.
-t: Terminate. Send the service a TERM signal.
-k: Kill. Send the service a KILL signal.
-x: Exit. supervise will exit as soon as the service is down. If you use this option on a stable system, you're doing something wrong; supervise is designed to run forever.
---------------------------------------------------------------------------------
For example:

Stop nagios:    svc -d /service/nagios/
Restart nagios: svc -t /service/nagios/
Start nagios:   svc -u /service/nagios/
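The three svc calls above can be wrapped so you do not have to remember the flags. This is only a minimal sketch: svc_flag_for and nagios_ctl are hypothetical helper names of my own, and the /service/nagios/ directory is the one used in this post.

```shell
#!/bin/sh
# Map a human-readable action to the svc flag used in this post.
# svc_flag_for is a hypothetical helper, not part of daemontools.
svc_flag_for() {
    case "$1" in
        start)   echo "-u" ;;   # Up: start the service, restart it if it stops
        stop)    echo "-d" ;;   # Down: stop the service, do not restart it
        restart) echo "-t" ;;   # Terminate: supervise then restarts the service
        *)       echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Run svc against the nagios service directory (assumed to be /service/nagios/).
nagios_ctl() {
    flag=$(svc_flag_for "$1") || return 1
    svc "$flag" /service/nagios/
}
```

With this loaded, nagios_ctl restart is equivalent to svc -t /service/nagios/.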
define service{
        use                     generic-service         ; Name of service template to use
        host_name               localhost
        service_description     qmail_pop3
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_pop!20%!10%!/
        }

Copy an existing definition and adapt it in this way, then edit the command definitions:

vi checkcommands.cfg

#'check_qmail' command definition
define command{
        command_name    check_qmail
        command_line    $USER1$/check_smtp -H 127.0.0.1
        }

define command{
        command_name    check_pop3
        command_line    $USER1$/check_pop -H 127.0.0.1
        }

Save, then check the configuration files:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, it will show:

Total Warnings: 0
Total Errors: 0

If there are errors, fix them according to the messages.

Restart nagios:

svc -d /service/nagios/ && svc -u /service/nagios/

Check the nagios results through the web page:
Click "Service Detail" and you will see:
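The "check the configuration, then restart" step from this section can be scripted so that nagios is only restarted when the check is clean. The following is a rough sketch, not an official tool: config_is_clean is a hypothetical helper name, and it assumes the report ends with the "Total Warnings" and "Total Errors" lines shown earlier.

```shell
#!/bin/sh
# Read a "nagios -v" report on stdin and succeed only if both totals are 0.
# config_is_clean is a hypothetical helper, not part of Nagios.
config_is_clean() {
    report=$(cat)
    # Allow any amount of whitespace between the label and the number.
    printf '%s\n' "$report" | grep -q '^Total Warnings:[[:space:]]*0[[:space:]]*$' &&
    printf '%s\n' "$report" | grep -q '^Total Errors:[[:space:]]*0[[:space:]]*$'
}
```

One possible use is to pipe the output of /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg into config_is_clean and run the svc restart only when it succeeds.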
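2) Add a host and its services

We will monitor this host's load, disk and other server state that is not exposed through a listening port, as well as its services, such as apache, mysql, qmail and ntp. Can nagios directly monitor things that have no port? The answer is no. So we must install nrpe on both hosts; nrpe listens on port 5666 and continuously passes the information to the monitoring host.

OK, let's add apache, mysql, qmail and ntp first. This time we put the monitored host and its services into a new file:

cd /usr/local/nagios/etc/
touch 10_5_1_156.cfg
vi nagios.cfg

cfg_file=/usr/local/nagios/etc/10_5_1_156.cfg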
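If you add hosts often, the cfg_file line can be appended by a script instead of editing nagios.cfg by hand. This is a small sketch under that assumption; add_cfg_file is a hypothetical helper name, and it only adds the line when it is not already present, so it is safe to run twice.

```shell
#!/bin/sh
# Append "cfg_file=<object file>" to the main config unless that exact line exists.
# add_cfg_file is a hypothetical helper; $1 is the main config, $2 the object file.
add_cfg_file() {
    main_cfg=$1
    object_cfg=$2
    line="cfg_file=$object_cfg"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF "$line" "$main_cfg" || printf '%s\n' "$line" >> "$main_cfg"
}
```

For the host in this post that would be add_cfg_file /usr/local/nagios/etc/nagios.cfg /usr/local/nagios/etc/10_5_1_156.cfg.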
vi 10_5_1_156.cfg

Define a host:

define host{
        use                     generic-host            ; Name of host template to use
        host_name               test_nrpe
        alias                   client
        address                 10.5.1.156
        check_command           check-host-alive
        max_check_attempts      1
        check_period            24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups          admins
        }
Define the services to check on this host:

define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     PING
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ping!100.0,20%!500.0,60%
        }
define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     apache
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_http!100.0,20%!500.0,60%
        }

define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     mysql
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_mysql!100.0,20%!500.0,60%
        }

define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     ntp
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_ntp!100.0,20%!500.0,60%
        }

define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     qmail_smtp
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_smtp!100.0,20%!500.0,60%
        }
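define service{
        use                     generic-service         ; Name of service template to use
        host_name               test_nrpe
        service_description     qmail_pop3
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_pop!100.0,20%!500.0,60%
        }

Now, just like last time, the services are defined as well.

Is there now one more host, with its services listed under it? Of course there is. The problems you may run into when adding hosts and services are as follows:

1: A configuration parameter is wrong. If you start nagios without checking the configuration, it may start successfully but display incorrectly.
   Fix: correct the configuration parameters.

2: Connection refused
   When this happened, I first thought passwordless ssh login had failed, but in fact the service simply had not been started on my server. Starting the service fixed it.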
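The six service definitions for test_nrpe differ only in service_description and check_command, so they can be generated rather than copied by hand. This is only a sketch of that idea: emit_service and emit_services_for_host are hypothetical helper names, and the field values and thresholds are copied unchanged from the definitions in this post.

```shell
#!/bin/sh
# Print one service definition in the same shape as the blocks in this post.
# emit_service is a hypothetical helper; its arguments are
#   $1 host_name, $2 service_description, $3 check_command
emit_service() {
    cat <<EOF
define service{
        use                     generic-service
        host_name               $1
        service_description     $2
        is_volatile             0
        check_period            24x7
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           $3
        }

EOF
}

# Print the six definitions used in this post for the given host.
emit_services_for_host() {
    host=$1
    args='100.0,20%!500.0,60%'   # same arguments as in the post
    while read -r desc cmd; do
        emit_service "$host" "$desc" "${cmd}!${args}"
    done <<EOF
PING check_ping
apache check_http
mysql check_mysql
ntp check_ntp
qmail_smtp check_smtp
qmail_pop3 check_pop
EOF
}
```

Running emit_services_for_host test_nrpe and appending the output to 10_5_1_156.cfg would give the same six blocks. Check the result with nagios -v before restarting, as described earlier.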