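I have spent nearly a month installing, configuring, and testing HACMP and have run all kinds of experiments, so I believe I can now handle my company's future HACMP maintenance and troubleshooting work. This round of HACMP study should wrap up here, so as a final step I am recording the scripts I found useful. I hope to attend an HACMP training course someday to tie my scattered knowledge together and improve in a comprehensive, systematic way.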
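1. A script to simplify hacmp.out
The contents of hacmp.out are too cluttered. If you only want to see which events occurred, and in what order, this is enough:
# Month abbreviation from date, used to match lines that carry a date
month=`date | awk '{print $2}'`
# First field of the last line of cl_lsvg output, used as the resource group name
RG=`/usr/sbin/cluster/sbin/cl_lsvg | tail -1 | awk '{print $1}'`
egrep "(^$RG|^:|$month)" /tmp/hacmp.out | awk -F'[' '{print $1}' | uniq
Note: my environment has only one resource group. If you have several, the script may need slight changes.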
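For clusters with more than one resource group, one possible adjustment is sketched below. It is only an illustration: the helper name filter_events and its arguments are assumptions, not part of HACMP, and the resource group names are passed in by hand rather than discovered from cl_lsvg. grep -E is used in place of egrep, and date is run with LC_ALL=C so that the second field is an English month abbreviation.

```shell
#!/bin/sh
# filter_events LOGFILE RG1 [RG2 ...]
# Hypothetical helper: print the event lines of an hacmp.out-style log
# for every resource group named on the command line.
filter_events() {
    log=$1
    shift
    # Month abbreviation such as "Nov", taken from the C-locale date output
    month=`LC_ALL=C date | awk '{print $2}'`
    # Build an alternation such as "^rg1|^rg2|" from the remaining arguments
    pattern=""
    for rg in "$@"; do
        pattern="${pattern}^${rg}|"
    done
    # Same filter as the single-group version: keep matching lines,
    # cut each one at the first "[", and collapse adjacent duplicates
    grep -E "(${pattern}^:|${month})" "$log" | awk -F'[' '{print $1}' | uniq
}
```

It could be called as filter_events /tmp/hacmp.out rg_app rg_db, where rg_app and rg_db stand in for your own resource group names.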
2. A script to show details of every subsystem started by HACMP
for i in `cat 1 |awk -F"The " '{print $2}' |awk -F" Subsystem" '{print $1}'`; do
echo ================================
echo $i
lssrc -ls $i;
done
Note: the file named 1 holds the screen output of smit clstart; copy and paste it into the file yourself.
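A slightly more defensive variant of the loop is sketched here. It assumes, as the loop above does, that the saved output contains lines of the form "The <name> Subsystem ..."; the function names list_subsystems and show_subsystems are made up for this example. Lines that do not match are skipped instead of being split into stray words, and lssrc is only called when it is actually installed.

```shell
#!/bin/sh
# list_subsystems FILE
# Print the subsystem names found in saved "smit clstart" output,
# one per line, in the order they appear.
list_subsystems() {
    # Only lines containing "The <name> Subsystem" produce output
    sed -n 's/.*The \([^ ]*\) Subsystem.*/\1/p' "$1"
}

# show_subsystems FILE
# Run "lssrc -ls" for each name, separated by a ruler line.
show_subsystems() {
    for s in `list_subsystems "$1"`; do
        echo ================================
        echo "$s"
        # Skip the query on machines without the AIX SRC tools
        if command -v lssrc >/dev/null 2>&1; then
            lssrc -ls "$s"
        fi
    done
}
```

With the output saved in the file 1 as described above, show_subsystems 1 should print the same kind of report as the original loop.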
Below is the output on my machine. Working through it should take your understanding of HACMP up a level.
================================
portmap
0513-005 The Subsystem, portmap, only supports signal communication.
================================
inetd
Subsystem Group PID Status
inetd tcpip 73792 active
Debug Inactive
Signal Purpose
SIGALRM Establishes socket connections for failed services
SIGHUP Rereads configuration database and reconfigures services
SIGCHLD Restarts service in case the service dies abnormally
Service Command Arguments Status
godm /usr/es/sbin/cluster/godmd active
xmquery /usr/bin/xmtopas xmtopas -p3 active
telnet /usr/sbin/telnetd telnetd -a active
ftp /usr/sbin/ftpd ftpd active
================================
clsmuxpdES
SRC request not supported.
================================
topsvcs
Subsystem Group PID Status
topsvcs topsvcs 630958 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
net_ether_01_0 [ 0] 2 2 S 192.168.168.2 192.168.168.2
net_ether_01_0 [ 0] en1 0x4383d5f8 0x4383d5fa
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 858 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 1288 ICMP 0 Dropped: 0
NIM's PID: 557170
net_ether_01_1 [ 1] 2 2 S 192.168.68.2 192.168.68.2
net_ether_01_1 [ 1] en0 0x4383d5f9 0x4383d5fa
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 858 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 1290 ICMP 0 Dropped: 0
NIM's PID: 381208
rs232_0 [ 2] 2 2 S 255.255.0.1 255.255.0.1
rs232_0 [ 2] tty0 0x8383d5fa 0x8383d5fd
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 599 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 599 ICMP 0 Dropped: 0
NIM's PID: 454786
2 locally connected Clients with PIDs:
haemd(795082) hagsd(508140)
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 20 seconds
Configuration Instance = 198
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 768 KB. Static data segment size: 957 KB.
Dynamic data segment size: 3713. Number of outstanding malloc: 175
User time 0 sec. System time 0 sec.
Number of page faults: 0. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
================================
grpsvcs
Subsystem Group PID Status
grpsvcs grpsvcs 508140 active
2 locally-connected clients. Their PIDs:
795082(haemd) 647448(clstrmgr)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 3
Number of Number of local
Group name providers providers/subscribers
ha_em_peers 2 1 0
CLRESMGRD_1130856062 2 1 0
CLSTRMGR_1130856062 2 1 0
================================
emsvcs
Subsystem Group PID Status
emsvcs emsvcs 795082 active
No trace flags are set
Configuration Data Base version from local copy of CDB:
941748909,457092608,0
Daemon started on Wednesday 11/23/05 at 10:37:46
Daemon has been running 0 days, 0 hours, 13 minutes and 31 seconds
Daemon connected to group services: Yes
Daemon has joined peer group: Yes
Daemon communications enabled: Yes
Daemon security: No support
Peer count: 1
Peer group state:
941748909,457092608,0
NOSECSUPPORT
Logical Connection Information for Local Clients
LCID FD PID Start Time
0 11 647448 Wednesday 11/23/05 10:38:35
Logical Connection Information for Remote Clients
LCID FD PID Start Time
Logical Connection Information for Peers
LCID Node
Resource Monitor Information
Name Inst Type FD SHMID PID Locked
IBM.HACMP.clresmgrd 0 C -1 -1 -2 00/00 No
IBM.HACMP.clstrmgr 0 C 12 -1 -2 00/00 No
IBM.PSSP.harmpd 0 S -1 -1 -1 00/00 No
Membership 0 I -1 -1 -2 00/00 No
aixos 0 S 10 28311558 -2 00/01 No
Highest file descriptor in use is 12
Highest file descriptor allowed for client connections is 1500
Peer Daemon Status
1 I A
Internal Daemon Counters
GS init attempts = 1 GS join attempts = 1
GS resp callback = 6 CCI conn rejects = 0
RMC conn rejects = 0 HR conn rejects = 0
Retry req msg = 0 Retry rsp msg = 0
Intervl usr util = 0 Total usr util = 2
Intervl sys util = 1 Total sys util = 2
Intervl time = 12000 Total time = 72001
lccb's created = 1 lccb's freed = 0
Reg rcb's creatd = 0 Reg rcb's freed = 0
Qry rcb's creatd = 0 Qry rcb's freed = 0
vrr created = 0 vrr freed = 0
vqr created = 0 vqr freed = 0
var inst created = 168 var inst freed = 0
Events regstrd = 0 Events unregstrd = 0
Insts assigned = 0 Insts unassigned = 0
Smem vars obsrv = 0 State vars obsrv = 2
Preds evaluated = 0 Events generated = 0
Smem lck intrvl = 0 Smem lck total = 0
PRM msgs to all = 0 PRM msgs to peer = 0
PRM resp msgs = 0 PRM msgs rcvd = 0
PRM_NODATA = 0 PRM_BADMSG errs = 0
Sched q elements = 16 Free q elements = 16
xcb alloc'd = 3 xcb freed = 3
xcb freed msgfp = 0 xcb freed reqp = 0
xcb freed reqn = 0 xcb freed rspc = 1
xcb freed rspp = 0 xcb freed cmdrm = 2
xcb freed unkwn = 0 Sec enable = 0
Sec disable = 0 Sec authent = 0
Wake sec thread = 0 Wake main thread = 0
Missed sec rsps = 0 Enq sec request = 0
Deq sec request = 0 Enq sec response = 0
Deq sec response = 0
Daemon Resource Utilization Last Interval
User: 0.000 seconds 0.000%
System: 0.010 seconds 0.008%
User+System: 0.010 seconds 0.008%
Daemon Resource Utilization Total
User: 0.020 seconds 0.003%
System: 0.020 seconds 0.003%
User+System: 0.040 seconds 0.006%
Data segment size: 528K
================================
emaixos
Subsystem Group PID Status
emaixos emsvcs 725244 active
Trace Level: None
Domain Type: HACMP
Domain Name: testdb_ha
RMAPI Initialized: TRUE
Data Initialized: TRUE
Data Init. Attempts: 1
Data Init. Delay: 5
Inst. Interval: 600
Inst. Count: 2
SRC FD: 3
Server FD: 7
Class Count: 7
Variable Count: 41
================================
clstrmgrES
Current state: ST_STABLE
i_local_nodeid 1, i_local_siteid -1, my_handle 2
ml_idx[1]=0 ml_idx[2]=1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 7
sccsid = "@(#)36 1.139 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 51haes_r520, r520s006a 7/20/05 14:32:42"
local node vrmf is 5200
cluster fix level is "0"
The following timer(s) are currently active:
Current DNP values
DNP Values for NodeId - 1 NodeName - TESTDB1
PgSpFree = 127719 PvPctBusy = 0 PctTotalTimeIdle = 99.842852
DNP Values for NodeId - 2 NodeName - TESTDB2
PgSpFree = 129269 PvPctBusy = 0 PctTotalTimeIdle = 99.789802
================================
gsclvmd
Subsystem Group PID Status
gsclvmd gsclvmd 295002 active
No Active VGs.
================================
clinfoES
SRC request not supported.
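3. Besides the three failure types that HACMP monitors automatically (network adapter, network, and node failures), monitoring application failures is also necessary. I wrote a Notify Method script for that as well, and the test went smoothly.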
banner stop >>/ha52/man_fallover.log
date >>/ha52/man_fallover.log
/usr/es/sbin/cluster/utilities/clstop -grsy >>/ha52/man_fallover.log 2>&1
banner wait >>/ha52/man_fallover.log
ps -e |grep clstrmgr |grep -v grep
while [ $? = 0 ];do
date >>/ha52/man_fallover.log
echo clstrmgrES is stopping >>/ha52/man_fallover.log
sleep 15
ps -e |grep clstrmgr |grep -v grep
done
banner start >>/ha52/man_fallover.log
date >>/ha52/man_fallover.log
/usr/es/sbin/cluster/etc/rc.cluster -boot -N -i >>/ha52/man_fallover.log 2>&1
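The wait loop in the script above keeps polling for as long as clstrmgr shows up in ps. If an upper bound on the wait is wanted, something like the following could be used. This is a sketch only: wait_while and clstrmgr_running are hypothetical helper names, and the retry count and interval are arbitrary values, not HACMP recommendations.

```shell
#!/bin/sh
# wait_while MAX_TRIES INTERVAL CMD [ARGS...]
# Rerun CMD every INTERVAL seconds for as long as it succeeds.
# Returns 0 once CMD fails (the condition has cleared),
# or 1 if CMD still succeeds after MAX_TRIES checks.
wait_while() {
    max=$1
    interval=$2
    shift 2
    n=0
    while "$@"; do
        n=`expr $n + 1`
        if [ "$n" -ge "$max" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Hypothetical check: succeeds while a clstrmgr process is listed by ps
clstrmgr_running() {
    ps -e | grep clstrmgr | grep -v grep >/dev/null
}
```

In the notify script the loop could then be replaced by a call such as wait_while 40 15 clstrmgr_running, writing a message to /ha52/man_fallover.log when it returns non-zero so that a cluster manager that never stops is visible in the log.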