Let's start with an example:
[root@node96 ~]# ps -ef |grep postgres
postgres 4076 1 0 17:17 ? 00:00:00 /data/postgresql-9.2.0/bin/postgres -D /usr/local/pgsql/data -c config_file=/usr/local/pgsql/data/postgresql.conf
postgres 4128 4076 0 17:17 ? 00:00:00 postgres: logger process
postgres 4130 4076 0 17:17 ? 00:00:00 postgres: checkpointer process
postgres 4131 4076 0 17:17 ? 00:00:00 postgres: writer process
postgres 4132 4076 0 17:17 ? 00:00:00 postgres: wal writer process
postgres 4133 4076 0 17:17 ? 00:00:00 postgres: autovacuum launcher process
postgres 4134 4076 0 17:17 ? 00:00:00 postgres: archiver process
postgres 4135 4076 0 17:17 ? 00:00:00 postgres: stats collector process
postgres 4229 4076 0 17:17 ? 00:00:00 postgres: wal sender process repl 192.168.11.95(35071) streaming 0/170131D8
root 4462 4420 0 17:17 pts/4 00:00:00 su - postgres
postgres 4463 4462 0 17:17 pts/4 00:00:00 -bash
postgres 4493 4463 0 17:17 pts/4 00:00:00 psql
postgres 4494 4076 0 17:17 ? 00:00:00 postgres: postgres postgres [local] idle
root 6115 20538 0 17:23 pts/2 00:00:00 grep postgres
From this listing, the main PostgreSQL (postmaster) process has PID 4076.
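The PID can also be read from the data directory instead of the ps listing, because PostgreSQL records the postmaster's PID on the first line of postmaster.pid. A minimal sketch, assuming the data directory from the listing above; the function name postmaster_pid is made up for illustration:

```shell
# Print the postmaster PID recorded in a PostgreSQL data directory.
# postmaster_pid is a hypothetical helper name, not part of PostgreSQL.
postmaster_pid() {
    pidfile="$1/postmaster.pid"
    # The file is normally present only while the server is running.
    [ -r "$pidfile" ] || return 1
    # The first line of postmaster.pid holds the PID.
    head -n 1 "$pidfile"
}
```

On the node above, `postmaster_pid /usr/local/pgsql/data` should print the same PID that ps reports for the process started with -D.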
First, check the resource status:
============
Last updated: Wed Sep 19 17:26:06 2012
Last change: Wed Sep 19 17:16:56 2012 via crmd on node95
Stack: openais
Current DC: node96 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
5 Resources configured.
============
Node node95: online
fence_vm96 (stonith:fence_vmware) Started
Node node96: online
ClusterIp (ocf::heartbeat:IPaddr2) Started
fence_vm95 (stonith:fence_vmware) Started
ping (ocf::pacemaker:ping) Started
postgres_res (ocf::heartbeat:pgsql) Started
Inactive resources:
Migration summary:
* Node node95:
* Node node96:
Next, we kill -9 the database's main process.
Then check the resource status again:
============
Last updated: Wed Sep 19 17:29:06 2012
Last change: Wed Sep 19 17:16:56 2012 via crmd on node95
Stack: openais
Current DC: node96 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
5 Resources configured.
============
Node node95: online
fence_vm96 (stonith:fence_vmware) Started
Node node96: online
ClusterIp (ocf::heartbeat:IPaddr2) Started
fence_vm95 (stonith:fence_vmware) Started
ping (ocf::pacemaker:ping) Started
postgres_res (ocf::heartbeat:pgsql) Started
Inactive resources:
Migration summary:
* Node node95:
* Node node96:
postgres_res: migration-threshold=1000000 fail-count=1
Failed actions:
postgres_res_monitor_30000 (node=node96, call=45, rc=7, status=complete): not running
The database resource was restarted on node96, and a failure record now appears at the bottom of the output.
Now look at the log:
Sep 19 17:29:05 node96 pgsql(postgres_res)[7479]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1 /* the database check failed */
Sep 19 17:29:05 node96 pgsql(postgres_res)[7479]: INFO: PostgreSQL is down
Sep 19 17:29:05 node96 crmd[1925]: info: process_lrm_event: LRM operation postgres_res_monitor_30000 (call=45, rc=7, cib-update=380, confirmed=false) not running
Sep 19 17:29:05 node96 crmd[1925]: info: process_graph_event: Action postgres_res_monitor_30000 arrived after a completed transition
Sep 19 17:29:05 node96 crmd[1925]: info: abort_transition_graph: process_graph_event:481 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=postgres_res_last_failure_0, magic=0:7;5:128:0:87f9cd86-767b-4162-a0c8-da0217d89baf, cib=0.115.12) : Inactive graph
Sep 19 17:29:05 node96 crmd[1925]: warning: update_failcount: Updating failcount for postgres_res on node96 after failed monitor: rc=7 (update=value++, time=1348046945) /* the cluster status is updated */
--------/ The entries below come from the policy engine
Sep 19 17:29:05 node96 crmd[1925]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 17:29:05 node96 attrd[1923]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-postgres_res (1)
Sep 19 17:29:05 node96 attrd[1923]: notice: attrd_perform_update: Sent update 133: fail-count-postgres_res=1
Sep 19 17:29:05 node96 attrd[1923]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-postgres_res (1348046945)
Sep 19 17:29:05 node96 pengine[1924]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 19 17:29:05 node96 pengine[1924]: notice: unpack_rsc_op: Operation monitor found resource postgres_res active on node95
Sep 19 17:29:05 node96 pengine[1924]: warning: unpack_rsc_op: Processing failed op postgres_res_last_failure_0 on node96: not running (7)
Sep 19 17:29:05 node96 crmd[1925]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-node96-fail-count-postgres_res, name=fail-count-postgres_res, value=1, magic=NA, cib=0.115.13) : Transient attribute: update
Sep 19 17:29:05 node96 attrd[1923]: notice: attrd_perform_update: Sent update 135: last-failure-postgres_res=1348046945
Sep 19 17:29:05 node96 pengine[1924]: notice: LogActions: Recover postgres_res#011(Started node96)
Sep 19 17:29:05 node96 crmd[1925]: info: handle_response: pe_calc calculation pe_calc-dc-1348046945-250 is obsolete
Sep 19 17:29:05 node96 crmd[1925]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-node96-last-failure-postgres_res, name=last-failure-postgres_res, value=1348046945, magic=NA, cib=0.115.14) : Transient attribute: update
Sep 19 17:29:05 node96 pengine[1924]: notice: process_pe_message: Transition 129: PEngine Input stored in: /var/lib/pengine/pe-input-285.bz2
Sep 19 17:29:05 node96 pengine[1924]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 19 17:29:05 node96 pengine[1924]: notice: unpack_rsc_op: Operation monitor found resource postgres_res active on node95
Sep 19 17:29:05 node96 pengine[1924]: warning: unpack_rsc_op: Processing failed op postgres_res_last_failure_0 on node96: not running (7)
Sep 19 17:29:05 node96 pengine[1924]: notice: common_apply_stickiness: postgres_res can fail 999999 more times on node96 before being forced off /* the failcount still leaves room for 999999 more failures */
-----------/ The entries below show the service being restarted on the local node
Sep 19 17:29:05 node96 pengine[1924]: notice: LogActions: Recover postgres_res#011(Started node96)
Sep 19 17:29:05 node96 crmd[1925]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 19 17:29:05 node96 crmd[1925]: info: do_te_invoke: Processing graph 130 (ref=pe_calc-dc-1348046945-251) derived from /var/lib/pengine/pe-input-286.bz2
Sep 19 17:29:05 node96 crmd[1925]: info: te_rsc_command: Initiating action 6: stop postgres_res_stop_0 on node96 (local)
Sep 19 17:29:05 node96 lrmd: [1922]: info: cancel_op: operation monitor[45] on ocf::pgsql::postgres_res for client 1925, its parameters: CRM_meta_depth=[0] pgdba=[postgres] pgdb=[postgres] pgdata=[/usr/local/pgsql/data] config=[/usr/local/pgsql/data/postgresql.conf] depth=[0] psql=[/usr/local/pgsql/bin/psql] pgctl=[/usr/local/pgsql/bin/pg_ctl] start_opt=[] crm_feature_set=[3.0.6] CRM_meta_on_fail=[standby] CRM_meta_name=[monitor] CRM_meta_interval=[30000] CRM_meta_timeout=[30000] cancelled
Sep 19 17:29:05 node96 lrmd: [1922]: info: rsc:postgres_res:46: stop
Sep 19 17:29:05 node96 crmd[1925]: info: process_lrm_event: LRM operation postgres_res_monitor_30000 (call=45, status=1, cib-update=0, confirmed=true) Cancelled
Sep 19 17:29:05 node96 pengine[1924]: notice: process_pe_message: Transition 130: PEngine Input stored in: /var/lib/pengine/pe-input-286.bz2
Sep 19 17:29:05 node96 pgsql(postgres_res)[7523]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1
Sep 19 17:29:05 node96 crmd[1925]: info: process_lrm_event: LRM operation postgres_res_stop_0 (call=46, rc=0, cib-update=384, confirmed=true) ok
Sep 19 17:29:05 node96 crmd[1925]: info: te_rsc_command: Initiating action 17: start postgres_res_start_0 on node96 (local)
Sep 19 17:29:05 node96 lrmd: [1922]: info: rsc:postgres_res:47: start
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: ERROR: command failed: su postgres -c cd /usr/local/pgsql/data; kill -s 0 4076 >/dev/null 2>&1
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: INFO: server starting
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: INFO: PostgreSQL start command sent.
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: psql: could not connect to server: Connection refused Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: PostgreSQL postgres isn't running
Sep 19 17:29:05 node96 pgsql(postgres_res)[7561]: WARNING: Connection error (connection to the server went bad and the session was not interactive) occurred while executing the psql command.
Sep 19 17:29:06 node96 pgsql(postgres_res)[7561]: INFO: PostgreSQL is started.
What we actually wanted was for the resource to fail over to the standby database, but it did not. The culprit is presumably this failcount.
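Pacemaker has a resource option for this: migration-threshold=N
Once the resource has failed N times on a node, it is moved to the standby node and is not allowed back on the original node, unless the administrator clears the failcount on that node. In this cluster the threshold is 1000000, so a single failure only triggers a restart on the same node.
The failcount is shown here: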
That is this part of the status output:
Migration summary:
* Node node95:
* Node node96:
postgres_res: migration-threshold=1000000 fail-count=1
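The "can fail 999999 more times" message in the log is simply migration-threshold minus fail-count. As a rough sketch, the same number can be derived from a Migration summary line; remaining_failures is a hypothetical helper written for this post, not a Pacemaker command:

```shell
# Given a line such as "postgres_res: migration-threshold=1000000 fail-count=1",
# print how many more failures are allowed before the resource is moved away.
# remaining_failures is a hypothetical helper, not a Pacemaker command.
remaining_failures() {
    # Extract the two numeric fields from the summary line.
    threshold=$(printf '%s\n' "$1" | sed -n 's/.*migration-threshold=\([0-9][0-9]*\).*/\1/p')
    failcount=$(printf '%s\n' "$1" | sed -n 's/.*fail-count=\([0-9][0-9]*\).*/\1/p')
    # Fail if either field is missing from the line.
    [ -n "$threshold" ] && [ -n "$failcount" ] || return 1
    echo $((threshold - failcount))
}
```

Called with the summary line shown above, it prints 999999, matching the policy engine's log entry.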
Clearing the fail count takes a single command:
crm resource cleanup postgres_res
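If the goal is to fail over on the very first failure, one option is to lower migration-threshold on the resource and then clear the old fail count. The following is only a sketch under that assumption: set_threshold_and_cleanup is a made-up wrapper, the CRM variable exists only so the commands can be dry-run with CRM=echo, and the crm subcommands are the crmsh `resource meta` and `resource cleanup` forms, which should be checked against the installed crmsh version.

```shell
# CRM can be overridden (for example CRM=echo) to print the commands
# instead of running them against a live cluster.
CRM="${CRM:-crm}"

# Set migration-threshold on a resource, then clear its fail count on one node.
# set_threshold_and_cleanup is a hypothetical wrapper, not part of crmsh.
# Usage: set_threshold_and_cleanup <resource> <node> <threshold>
set_threshold_and_cleanup() {
    rsc="$1"; node="$2"; threshold="$3"
    # Meta attribute: move the resource away after <threshold> failures.
    $CRM resource meta "$rsc" set migration-threshold "$threshold" || return 1
    # Reset the fail count so the node may run the resource again.
    $CRM resource cleanup "$rsc" "$node"
}
```

For the cluster in this post, the call would be `set_threshold_and_cleanup postgres_res node96 1`.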
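There is another option: failure-timeout=N
This is the time after which a recorded failure expires.
If both options are set and reaching the migration-threshold causes the resource to move to the standby, then once failure-timeout=N has elapsed the resource may move back to the original primary. Whether it actually moves back depends on two things: the stickiness and the constraint scores.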
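The expiry rule can be pictured as a comparison against the last-failure timestamp that attrd recorded in the log (last-failure-postgres_res=1348046945). This is a simplified model rather than Pacemaker's actual code, and failure_expired is a hypothetical helper; the cluster only re-evaluates failures periodically, so the real expiry can lag behind.

```shell
# Return success (0) if a recorded failure is older than failure-timeout.
# Simplified model; failure_expired is a hypothetical helper.
#   $1 = last-failure timestamp (epoch seconds)
#   $2 = failure-timeout in seconds
#   $3 = current time (epoch seconds), defaults to now
failure_expired() {
    last_failure="$1"; timeout="$2"; now="${3:-$(date +%s)}"
    # Expired once more than <timeout> seconds have passed since the failure.
    [ $((now - last_failure)) -gt "$timeout" ]
}
```

With failure-timeout=60, for instance, the failure from the log would still count 30 seconds later and would no longer count 61 seconds later.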
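Two special cases are also involved here: a start failure and a stop failure on the local node. A start failure sets the failcount straight to INFINITY, which forces a switch to the standby node.
For a stop failure, if a fence device is enabled, the node is fenced and the resource is then moved to the standby node.
If no fence device is enabled, the cluster keeps trying to stop the application, and the application is not moved to the standby node.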
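The cases described in this post can be summarized as a small decision table. The sketch below only restates the behaviour described above; failure_outcome is a hypothetical helper and does not query the cluster.

```shell
# Print the outcome described in this post for a failed operation.
# failure_outcome is a hypothetical helper.
#   $1 = monitor | start | stop
#   $2 = yes | no  (whether fencing is enabled; only used for stop)
failure_outcome() {
    case "$1" in
        monitor) echo "fail-count +1; restart locally until migration-threshold is reached" ;;
        start)   echo "fail-count set to INFINITY; move to standby node" ;;
        stop)
            if [ "$2" = "yes" ]; then
                echo "fence the node; move to standby node"
            else
                echo "keep retrying stop; resource stays on this node"
            fi ;;
        *) return 1 ;;
    esac
}
```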