2006-10-20 16:48:50
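1. Introduction to EMS
EMS (Event Monitoring Service) is a service integrated into HP-UX. It monitors the host hardware in real time and reports monitoring information to system maintainers through a configured notification method. This helps operations staff detect host faults promptly and accurately, assists in locating the fault, and improves host uptime.
EMS is managed through MRM (Monitoring Request Manager), which is used to configure the scope of monitoring, the conditions that trigger an event alarm, and the method by which event information is reported.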
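MRM is started as follows:
(1) Log in to the host as root
(2) Run /etc/opt/resmon/lbin/monconfig
(3) Configure through the Monitoring Request Manager (MRM) Main Menu
From the MRM menu you can show, check, add, modify, and delete monitoring requests, and enable or disable monitoring.
The menu looks like this: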
============================================================================
=================== Event Monitoring Service ===================
=================== Monitoring Request Manager ===================
============================================================================
EVENT MONITORING IS CURRENTLY ENABLED.
EMS Version : A.04.10
STM Version : C.46.15
============================================================================
============== Monitoring Request Manager Main Menu ==============
============================================================================
Note: Monitoring requests let you specify the events for monitors
to report and the notification methods to use.
Select:
(S)how monitoring requests configured via monconfig
(C)heck detailed monitoring status
(L)ist descriptions of available monitors
(A)dd a monitoring request
(D)elete a monitoring request
(M)odify an existing monitoring request
(E)nable Monitoring
(K)ill (disable) monitoring
(H)elp
(Q)uit
Enter selection: [s]
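The following example creates a custom monitoring request to show how MRM is configured:
(1) Log in to the system as root
(2) Run /etc/opt/resmon/lbin/monconfig to enter the MRM main menu (the one shown above)
(3) Type a and press Enter; this corresponds to (A)dd a monitoring request
(4) The hardware modules available for monitoring are listed; normally select all of them by typing a and pressing Enter
(5) Select the baseline event severity; 2) MINOR WARNING is recommended
(6) Select the criterion that triggers the alarm; choose 4) >=
(7) Select the notification method for event information; choose 6) EMAIL
(8) Enter the recipient of the event alarm mail; use whatever user name is appropriate, for example: monitor
(9) Add a comment describing this monitoring request; choose (A)dd
(10) For Client Configuration File, choose (C)lear
(11) Save the configuration; you are then returned to the main menu
(12) In the main menu, choose (S)how monitoring requests configured via monconfig to check that the new request exists
(13) Back in the MRM main menu, choose (C)heck detailed monitoring status to see all active monitoring. The result depends on the host configuration: EMS ignores hardware that is not present in the host, even if all hardware was selected in step 4
(14) Choose (E)nable Monitoring to turn on the EMS service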
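Note: With the steps above, the new monitoring request watches all hardware modules (step 4) in real time, but only events whose severity is greater than or equal to Minor Warning (steps 5 and 6) are reported, by email (step 7), to the user monitor (step 8).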
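The filtering rule chosen in steps 5 and 6 can be illustrated with a small Python sketch. This is only a model of the ">=" comparison, not EMS code. The values 2 for MINOR WARNING and 5 for CRITICAL match the menu and the sample mail in this article; the remaining numbers and all the names (`Severity`, `should_notify`) are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels; values 1, 3 and 4 are assumed, 2 and 5 come from the article."""
    INFORMATION = 1
    MINOR_WARNING = 2
    MAJOR_WARNING = 3
    SERIOUS = 4
    CRITICAL = 5


def should_notify(event_severity: Severity,
                  threshold: Severity = Severity.MINOR_WARNING) -> bool:
    """Model the ">=" criterion: report events at or above the threshold."""
    return event_severity >= threshold


if __name__ == "__main__":
    # Show which levels would trigger a mail with the recommended threshold.
    for level in Severity:
        print(f"{level.name:<14} notify={should_notify(level)}")
```

With the recommended threshold, only INFORMATION events would be filtered out; everything from MINOR WARNING upward would be mailed.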
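2. Getting information from event mail
Event alarm mail generated by EMS can be received over the internal network, with no need to configure a separate domain name server. The mail is sent to the predefined target user, monitor, and can be collected with mail client software on a PC (Outlook, for example).
Taking Outlook as an example: to receive event mail, create a new mail account in the client. The user name is the HP-UX user name specified in MRM, the password is that user's HP-UX password, and the POP3/SMTP server is the IP address of the monitored host. It is advisable to have Outlook check for new mail automatically at a set interval so that event information from EMS arrives promptly.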
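Notes:
(1) Because of HP-UX's own security mechanism, root's mail cannot be collected with client software, so specify an ordinary user as the recipient of event mail in MRM; in this example a new user, monitor, was created for the purpose
(2) The POP3 and POP2 ports, 110 and 109, must be open on the network
(3) The user that receives event mail is an HP-UX user and can also log in to the host. Change this user's HP-UX password regularly, and update the password stored in Outlook to match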
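The same mailbox can also be read from a script instead of Outlook. The sketch below uses Python's standard `poplib` module and assumes the host offers plain POP3 on port 110, as described above. The function names and the banner test are illustrative choices, and the address and password in the usage note are placeholders.

```python
import poplib


def fetch_messages(conn):
    """Return every message in the mailbox as a list of decoded strings.

    `conn` is any object that offers poplib.POP3's stat() and retr() methods.
    """
    count, _size = conn.stat()
    messages = []
    for index in range(1, count + 1):  # POP3 message numbers start at 1
        _resp, lines, _octets = conn.retr(index)
        messages.append(b"\n".join(lines).decode("utf-8", errors="replace"))
    return messages


def is_ems_notification(message: str) -> bool:
    """True if the text contains the EMS notification banner."""
    return "Event Monitoring Service Event Notification" in message


def fetch_ems_mail(host, user, password, port=110):
    """Log in over plain POP3 and return only the EMS notifications."""
    conn = poplib.POP3(host, port, timeout=30)
    try:
        conn.user(user)
        conn.pass_(password)
        return [m for m in fetch_messages(conn) if is_ems_notification(m)]
    finally:
        conn.quit()
```

A call might look like `fetch_ems_mail("192.0.2.10", "monitor", "secret")`, where all three arguments are hypothetical. Plain POP3 sends the password unencrypted, which is one more reason to follow note (3) and change it regularly.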
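The following example shows the content of an event alarm mail generated by EMS. The fault was a system error caused by deliberately pulling out a hard disk while the power was on (the notes at the ends of some lines are annotations, not part of the mail).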
>------------ Event Monitoring Service Event Notification ------------<
Notification Time: Wed Jun 8 23:26:18 2005   <-- time the event was triggered
hpux1 sent Event Monitor notification information:   <-- shows the host name
/storage/events/disks/default/0_0_1_1.15.0 is >= 2.   <-- hardware module and trigger
Its current value is CRITICAL(5).   <-- severity of the event
User Comments:
Just a test:)
Event data from monitor:
Event Time..........: Wed Jun 8 23:26:16 2005
Severity............: CRITICAL
Monitor.............: disk_em
Event #.............: 101
System..............: hpux1
Summary:
Disk at hardware path 0/0/1/1.15.0 : Device removed from monitoring
Description of Error:
The device has been removed from the list of devices being monitored by
this monitor.
Probable Cause / Recommended Action:
The device was removed from the system, has stopped responding to the
system or it has been replaced with a device that is not supported by this
monitor.
Run ioscan to determine the state and type of the device.
Check the /var/stm/data/os_decode_xref for the information indicating
which devices are supported by this monitor.
Check other monitors to determine if they are now monitoring the
device by running /etc/opt/resmon/lbin/monconfig and using the "Check
monitoring" command.
Additional Event Data:
System IP Address...: 15.85.114.14   <-- IP address of the host
Event Id............: 0x42a70e1800000000
Monitor Version.....: B.01.01
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_disk_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/A500-44   <-- host model
OS Version......................: B.11.11   <-- operating system version
STM Version.....................: A.45.00
EMS Version.....................: A.04.00
Latest information on this event:
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
Component Data:
Physical Device Path...: 0/0/1/1.15.0   <-- physical path of the failed device
Device Class...........: Disk   <-- device type
Inquiry Vendor ID......: SEAGATE   <-- device manufacturer
Inquiry Product ID.....: ST34572WC   <-- product number
Firmware Version.......: HP03
Serial Number..........: JKJ118650QPJCX   <-- serial number of the failed part
>---------- End Event Monitoring Service Event Notification ----------<
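The event mail shows the time the fault occurred, the host name, the event severity, the physical path of the failed disk, the disk's product ID, the recommended checks, the host model, the operating system version, and other information that helps find and troubleshoot host hardware faults.
However, a host hardware fault is not always a simple failure of a single component, so the Probable Cause / Recommended Action text in the event mail may not match the diagnosis that is finally confirmed. This is normal. Fault analysis often requires additional tools and methods.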
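Because the notification uses a regular "Name.....: value" layout, the fields discussed above can be pulled out with a short script, for example to feed a ticketing system. The following Python sketch is one possible approach, not an HP tool. The regular expression, the function name `parse_notification`, and the shortened `SAMPLE` text (copied from the mail above without the annotations) are assumptions made for illustration.

```python
import re

# Matches lines such as "Severity............: CRITICAL":
# a name, at least two dots, a colon, then a non-empty value.
FIELD_RE = re.compile(r"^\s*(?P<key>[^.:]+?)\.{2,}:\s*(?P<value>\S.*?)\s*$")

# Shortened excerpt of the sample mail, without the added annotations.
SAMPLE = """\
Event Time..........: Wed Jun 8 23:26:16 2005
Severity............: CRITICAL
Monitor.............: disk_em
System..............: hpux1
Summary:
Disk at hardware path 0/0/1/1.15.0 : Device removed from monitoring
Physical Device Path...: 0/0/1/1.15.0
Inquiry Product ID.....: ST34572WC
"""


def parse_notification(text: str) -> dict:
    """Collect the dotted fields and the one-line summary into a dict."""
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = FIELD_RE.match(line)
        if match:
            # Keep the first occurrence if a field name repeats.
            fields.setdefault(match.group("key").strip(), match.group("value"))
        elif line.strip() == "Summary:" and i + 1 < len(lines):
            # The summary text sits on the line after the heading.
            fields["Summary"] = lines[i + 1].strip()
    return fields


if __name__ == "__main__":
    for name, value in parse_notification(SAMPLE).items():
        print(f"{name}: {value}")
```

Fields whose value is printed on the following line, such as Client Configuration File, are skipped by this pattern and would need separate handling.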