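A Brief Introduction to DMS
1. About DMS:
DMS (deadman switch) refers to a kernel extension that halts the system before it crashes and produces a dump
file for later analysis.
DMS exists to protect shared external disks and the data on them. When a node hangs for longer than a set time limit, DMS automatically halts that node
so that the HACMP standby node can take over. This protects the data, keeps the business running, and avoids potential problems, particularly with external disk arrays.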
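2. Causes of a DMS timeout:
DMS is typically triggered for one of the following reasons:
a. An application runs at a higher priority than the clstrmgr daemon, so clstrmgr cannot reset the DMS counter in time.
b. Heavy I/O activity on the system leaves the CPU no time to service the clstrmgr daemon.
c. Memory leaks or memory overflow problems.
d. Heavy system error-log activity, for example a token-ring beaconing problem.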
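The counter-reset mechanism described above can be illustrated with a toy model. This is only a sketch of the general deadman-switch idea, not HACMP's actual implementation: the function name dms_simulate, the timeout value, and the timestamps are all hypothetical.

```shell
# Toy model of a deadman switch (hypothetical, not HACMP code).
# Usage: dms_simulate TIMEOUT END RESET_TIME...
# The counter starts at time 0 and must be reset at least every TIMEOUT
# seconds until END. Prints "ok" or the time at which the switch fires.
dms_simulate() {
    timeout=$1
    end=$2
    shift 2
    last=0
    for t in "$@"; do
        if [ $((t - last)) -gt "$timeout" ]; then
            echo "fired at $((last + timeout))"
            return 1
        fi
        last=$t
    done
    if [ $((end - last)) -gt "$timeout" ]; then
        echo "fired at $((last + timeout))"
        return 1
    fi
    echo "ok"
}
```

If the resets arrive at 3, 6, 10, 14 and 18 seconds with a 5-second timeout, the switch never fires. If the process doing the resets is starved between 6 and 14 seconds, the switch fires at 11 seconds.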
3. How to check whether a DMS timeout occurred
One way is to analyze the dump file, for example:
# crash /dev/lv00
Using /unix as the default namelist file.
> cpu
Selected cpu number : 0
> stat
sysname: AIX
nodename: sp13
release: 3
version: 4
machine: 00091968A400
time of crash: Sat Aug 31 04:36:52 EDT 2002
age of system: 5 day, 21 hr., 6 min.
xmalloc debug: disabled
abend code: 700
csa: 0x438eb0
exception struct:
    0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
panic: HACMP for AIX dms timeout - ha
.
> status
CPU   TID  TSLOT   PID  PSLOT  STOPPED  PROC_NAME
  0   205      2   204      2      yes  wait
  1   307      3   306      3      yes  wait
  2   409      4   408      4      yes  wait
  3   50b      5   50a      5      yes  wait
  4   60d      6   60c      6      yes  wait
  5  1867     24  125a     18      yes  errdemon
  6   811      8   810      8      yes  wait
  7   913      9   912      9      yes  wait
> t -mk
Skipping first MST
.
MST STACK TRACE:
0x00438eb0 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=5)
IAR:      .panic_trap+0 (00012678): tweq r1,r1
LR:       .[dms:dead_man_sw_handler]+18 (0171335c)
00438d40: .[dms:timeout_end]+4c (01713b98)
00438d80: .clock+134 (0002e9a8)
00438de0: .i_softmod+2a8 (0001c3b0)
00438e70: flih_603_patch+cc (00028b74)
.
0x2ff3b400 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=11)
IAR:      .waitproc_find_run_queue+c0 (000255e0): addic r3,r0,-4
LR:       .waitproc+a0 (00025aa4)
2ff3b328: .waitproc+a0 (00025aa4)
2ff3b388: .procentry+14 (00098288)
2ff3b3c8: .low+0 (00000000)
.
> symptom
PIDS/5765C3403 LVLS/430 PCSS/SPI1 MS/700 FLDS/panic_tra VALU/7c810808
FLDS/[dms:dead VALU/18
Alternatively, check the error log with errpt, for example:
errpt -a
-------
LABEL:           KERNEL_PANIC
IDENTIFIER:      225E3B63
Date/Time:       Thu Apr 25 21:26:16
Sequence Number: 609
Machine Id:      0040613A4C00
Node Id:         localhost
Class:           S
Type:            TEMP
Resource Name:   PANIC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
ASSERT STRING
PANIC STRING
HACMP for AIX dms timeout - halting hung node
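The errpt check can be scripted by searching the detailed report for the panic string shown above. A minimal sketch, assuming the report is piped in on stdin; the function name count_dms_panics is hypothetical.

```shell
# Count lines in an `errpt -a` report (read from stdin) that carry the
# DMS panic string. The function name is hypothetical.
# On an AIX node it would be used as:  errpt -a | count_dms_panics
count_dms_panics() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is normalised with `|| true`.
    grep -c 'HACMP for AIX dms timeout' || true
}
```

A non-zero count suggests that at least one logged panic was caused by a DMS timeout. The dump analysis shown earlier can then confirm it.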
4. Ways to avoid DMS timeouts:
a. Tune the system's I/O pacing.
For example, run # smitty chsys and set the high and low water marks as follows:
Maximum number of PROCESSES allowed per user          [128]
Maximum number of pages in block I/O BUFFER CACHE     [20]
Maximum Kbytes of real memory allowed for MBUFS       [0]
Automatically REBOOT system after a crash             false
Continuously maintain DISK I/O history                false
HIGH water mark for pending write I/Os per file       [33]
LOW water mark for pending write I/Os per file        [24]
Amount of usable physical memory in Kbytes            262144
State of system keylock at boot time                  normal
Enable full CORE dump                                 false
Use pre-430 style CORE dump                           false
Enable CPU Guard                                      disable
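The same water marks can also be set from the command line with chdev on the sys0 device, using the maxpout (high) and minpout (low) attributes. The sketch below only prints the command rather than running it. The helper name io_pacing_cmd is hypothetical, and the check that the low mark is below the high mark is a sanity check added here.

```shell
# Print the chdev command that sets the I/O pacing water marks.
# Usage: io_pacing_cmd HIGH LOW   (hypothetical helper, dry run only)
io_pacing_cmd() {
    high=$1
    low=$2
    if [ "$low" -ge "$high" ]; then
        echo "low water mark must be below the high water mark" >&2
        return 1
    fi
    echo "chdev -l sys0 -a maxpout=$high -a minpout=$low"
}
```

With the values from the screen above, io_pacing_cmd 33 24 prints the command to run as root on the node.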
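b. Increase the syncd frequency (the default interval is 60 seconds).
If the customer has HACMP 4.4.0 or later installed, this can be set directly from the HACMP menus.
An interval of 10 seconds is recommended.
# smitty cm_tuning_parms_chsyncd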
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
syncd frequency (in seconds)     [60]
Esc+1=Help   Esc+2=Refresh  Esc+3=Cancel  Esc+4=List
Esc+5=Reset  Esc+6=Command  Esc+7=Edit    Esc+8=Image
Esc+9=Shell  Esc+0=Exit     Enter=Do
On older HACMP versions, edit the syncd interval in /sbin/rc.boot instead.
The relevant lines look like this (60 is the interval in seconds):
echo "Starting the sync daemon" | alog -t boot
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
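The edit to the syncd line can be sketched as a small sed filter. This is an illustration only: the function name set_syncd_interval is hypothetical, it works on text from stdin, and /sbin/rc.boot should be backed up before any real change is applied.

```shell
# Rewrite the interval argument of the syncd line read from stdin.
# Usage: set_syncd_interval SECONDS   (hypothetical helper)
set_syncd_interval() {
    sed "s|\(/usr/sbin/syncd\) [0-9][0-9]*|\1 $1|"
}
```

For example, piping the line above through set_syncd_interval 10 changes the interval from 60 to 10 seconds and leaves the rest of the line untouched.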
c. Lower the heartbeat failure detection rate (change it from Normal to Slow):
smitty cm_config_networks.chg_pre.select
Change a Cluster Network Module using Predefined Values
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
                                   [Entry Fields]
* Network Module Name              IP
  New Network Module Name          []
  Description                      [Generic IP]
  Failure Detection Rate           Normal >> Slow
d. Tune network parameters, for example raise thewall:
# no -a
extendednetstats = 0
thewall = 6048
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
# no -o thewall=131052
# no -a
extendednetstats = 0
thewall = 131052
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
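To confirm a change without reading the whole listing, a single value can be pulled out of the no -a output. A minimal sketch, assuming the "name = value" layout shown above; the function name get_tunable is hypothetical.

```shell
# Print the value of one tunable from `no -a` style output on stdin.
# Usage on an AIX node:  no -a | get_tunable thewall
# (hypothetical helper; expects lines of the form "name = value")
get_tunable() {
    awk -v name="$1" '$1 == name && $2 == "=" { print $3 }'
}
```

Run before and after the no -o command, this shows thewall going from 6048 to 131052 in the example above.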
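e. If the customer has HACMP installed and a DMS timeout has occurred, check whether the machine is running power management software. If it is, disable power management. For example: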
smitty pm
                               Power Management
Move cursor to desired item and press Enter.
Enable / Disable Power Management State Transition
Configure / Unconfigure Power Management
System State Transition from Enable State
Change / Show Characteristics of Power Management
Power Management Timer
Display Power Management
Power Management Characteristics of Each Device
Battery