2007-06-18 12:01:12
Document #: 2311023F09000
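Body:
Title: How to handle DMS (deadman switch) problems in HACMP
Summary:
1. Introduction to DMS
2. Causes of a DMS timeout
3. How to check whether a DMS timeout occurred
4. Ways to avoid a DMS timeout
Description:
1. Introduction to DMS: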
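The DMS (deadman switch) is a kernel extension. When a node hangs, it brings the system down before the hang can do further harm and produces a dump file for later analysis.
The purpose of the DMS is to protect the shared external disks and their data. When a system stays hung for longer than a set time limit, the DMS automatically brings that system down so that the HACMP backup node can take over. This protects the data, keeps the business service running, and avoids potential problems, especially on external disk arrays.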
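2. Causes of a DMS timeout:
The DMS is typically triggered for one of the following reasons:
a. An application runs at a higher priority than the clstrmgr daemon, so clstrmgr cannot reset the DMS counter in time.
b. Heavy I/O activity on the system leaves the CPU no time to service the clstrmgr daemon.
c. Memory leak or memory overflow problems.
d. Heavy system error-log activity, for example a token-ring beaconing problem.
3. How to check whether a DMS timeout occurred:
One way is to analyze the dump file, for example: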
# crash /dev/lv00
Using /unix as the default namelist file.
> cpu
Selected cpu number : 0
> stat
sysname: AIX
nodename: sp13
release: 3
version: 4
machine: 00091968A400
time of crash: Sat Aug 31 04:36:52 EDT 2002
age of system: 5 day, 21 hr., 6 min.
xmalloc debug: disabled
abend code: 700
csa: 0x438eb0
exception struct:
0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
panic: HACMP for AIX dms timeout - ha
.
> status
CPU TID TSLOT PID PSLOT STOPPED PROC_NAME
0 205 2 204 2 yes wait
1 307 3 306 3 yes wait
2 409 4 408 4 yes wait
3 50b 5 50a 5 yes wait
4 60d 6 60c 6 yes wait
5 1867 24 125a 18 yes errdemon
6 811 8 810 8 yes wait
7 913 9 912 9 yes wait
> t -mk
Skipping first MST
.
MST STACK TRACE:
0x00438eb0 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=5)
IAR: .panic_trap+0 (00012678): tweq r1,r1
LR: .[ dms :dead_man_sw_handler]+18 (0171335c)
00438d40: .[ dms :timeout_end]+4c (01713b98)
00438d80: .clock+134 (0002e9a8)
00438de0: .i_softmod+2a8 (0001c3b0)
00438e70: flih_603_patch+cc (00028b74)
.
0x2ff3b400 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=11)
IAR: .waitproc_find_run_queue+c0 (000255e0): addic r3,r0,-4
LR: .waitproc+a0 (00025aa4)
2ff3b328: .waitproc+a0 (00025aa4)
2ff3b388: .procentry+14 (00098288)
2ff3b3c8: .low+0 (00000000)
.
> symptom
PIDS/5765C3403 LVLS/430 PCSS/SPI1 MS/700 FLDS/panic_tra VALU/7c810808
FLDS/[ dms :dead VALU/18
Alternatively, check the error log with errpt, for example:
errpt -a
-------
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63
Date/Time: Thu Apr 25 21:26:16
Sequence Number: 609
Machine Id: 0040613A4C00
Node Id: localhost
Class: S
Type: TEMP
Resource Name: PANIC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
ASSERT STRING
PANIC STRING
HACMP for AIX dms timeout - halting hung node
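The check against the error report above can be scripted. The following is a minimal sketch, assuming the panic string always contains the text "HACMP for AIX dms timeout" as in the sample output; the function name dms_panic_count is hypothetical and not part of AIX or HACMP.

```shell
#!/bin/sh
# Hypothetical helper: reads error-report text on stdin and prints how many
# lines contain the DMS panic string seen in the sample above.
dms_panic_count() {
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count from being reported as a failure of the function.
    grep -c 'HACMP for AIX dms timeout' || true
}

# Demonstration on text copied from the sample report.
printf 'PANIC STRING\nHACMP for AIX dms timeout - halting hung node\n' | dms_panic_count
```

On an AIX node the input would normally come from the error log itself, for example `errpt -a -J KERNEL_PANIC | dms_panic_count`. A result greater than zero suggests that at least one DMS timeout was logged.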
4. Ways to avoid a DMS timeout:
a. Tune the system I/O pacing.
For example, run # smitty chsys and adjust the high and low water marks as follows:
Maximum number of PROCESSES allowed per user [128]
Maximum number of pages in block I/O BUFFER CACHE [20]
Maximum Kbytes of real memory allowed for MBUFS [0]
Automatically REBOOT system after a crash false
Continuously maintain DISK I/O history false
HIGH water mark for pending write I/Os per file [33]
LOW water mark for pending write I/Os per file [24]
Amount of usable physical memory in Kbytes 262144
State of system keylock at boot time normal
Enable full CORE dump false
Use pre-430 style CORE dump false
Enable CPU Guard disable
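The two water marks in the screen above correspond to the maxpout (high) and minpout (low) attributes of the sys0 device, so the same change can be made from the command line with chdev. The sketch below only prints the command instead of running it; the function name io_pacing_cmd is hypothetical, and the case where both values are 0 (pacing disabled) is deliberately not handled.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the chdev command for a given
# pair of I/O pacing water marks.
io_pacing_cmd() {
    high=$1
    low=$2
    # The low water mark has to be below the high water mark.
    if [ "$low" -ge "$high" ]; then
        echo "error: low water mark ($low) must be less than high water mark ($high)" >&2
        return 1
    fi
    echo "chdev -l sys0 -a maxpout=$high -a minpout=$low"
}

# Values taken from the smitty screen above.
io_pacing_cmd 33 24
```

The current settings can be checked beforehand with `lsattr -El sys0 -a maxpout -a minpout`.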
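b. Make syncd run more often (the system default interval is 60 seconds).
If HACMP 4.4.0 or later is installed, this can be set directly in the HACMP menus.
A value of 10 seconds is suggested.
smitty cm_tuning_parms_chsyncd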
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
syncd frequency (in seconds) [60]
Esc+1=Help Esc+2=Refresh Esc+3=Cancel Esc+4=List
Esc+5=Reset Esc+6=Command Esc+7=Edit Esc+8=Image
Esc+9=Shell Esc+0=Exit Enter=Do
If the HACMP version is older, change the syncd interval in the /sbin/rc.boot file instead.
For example:
echo "Starting the sync daemon" | alog -t boot
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
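The edit to the syncd line can be sketched with sed. This is an illustrative helper only, assuming the line has the same form as the rc.boot excerpt above; the function name set_syncd_interval is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads rc.boot-style text on stdin and replaces the
# number that follows /usr/sbin/syncd with the requested interval in seconds.
set_syncd_interval() {
    sed "s|\(/usr/sbin/syncd\) [0-9][0-9]*|\1 $1|"
}

# Demonstration on the line from the excerpt above.
echo 'nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &' | set_syncd_interval 10
```

A cautious way to apply it is to write the result to a separate file, for example `set_syncd_interval 10 < /sbin/rc.boot > /tmp/rc.boot.new`, and review the difference before replacing the original. Because rc.boot runs at boot time, the new interval would be expected to take effect at the next reboot.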
c. Slow down the heartbeat failure detection rate:
smitty cm_config_networks.chg_pre.select
Change a Cluster Network Module using Predefined Values
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Network Module Name IP
New Network Module Name []
Description [Generic IP]
Failure Detection Rate Normal >> slow
d. Tune the network parameters:
# no -a
extendednetstats = 0
thewall = 6048
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
# no -o thewall=131052
# no -a
extendednetstats = 0
thewall = 131052
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
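To confirm that a tunable such as thewall actually changed, its value can be pulled out of the `no -a` listing. The sketch below assumes the "name = value" layout shown in the output above; the function name no_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "name = value" lines (as printed by no -a) on
# stdin and prints the value of the tunable named in the first argument.
no_value() {
    awk -v name="$1" '$1 == name && $2 == "=" { print $3 }'
}

# Demonstration on lines copied from the listing above.
printf 'extendednetstats = 0\nthewall = 131052\nsockthresh = 85\n' | no_value thewall
```

On a live system this would be used as `no -a | no_value thewall`. On AIX levels of that era, a value set with `no -o` generally does not survive a reboot unless the command is also added to a startup script such as /etc/rc.net, so it is worth rechecking after a restart.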
e. If HACMP is installed and a DMS timeout still occurs, check whether the machine is running power management software.
If it is, disable power management. For example:
smitty pm
Power Management
Move cursor to desired item and press Enter.
Enable / Disable Power Management State Transition
Configure / Unconfigure Power Management
System State Transition from Enable State
Change / Show Characteristics of Power Management
Power Management Timer
Display Power Management
Power Management Characteristics of Each Device
Battery