As a supplement to the article below, this post describes how MHA handles a GTID-based failover.
http://blog.chinaunix.net/uid-20726500-id-5700631.html
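MHA treats a failover as GTID-based only when the following 3 conditions are all met (see the function get_gtid_status):
gtid_mode=1 on all nodes (that is, GTID mode is enabled)
Executed_Gtid_Set is non-empty on all nodes
Auto_Position=1 on at least one node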
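The three conditions can be expressed as a small predicate. This is only a sketch, not MHA's Perl code: the function name is_gtid_failover and the use of plain dicts keyed by gtid_mode, Executed_Gtid_Set and Auto_Position are assumptions made for illustration.

```python
from typing import Dict, List


def is_gtid_failover(nodes: List[Dict]) -> bool:
    """Return True only when all three conditions listed above hold.

    Each node is a hypothetical dict holding the values read from that server.
    """
    if not nodes:
        return False
    # Condition 1: GTID mode is enabled on every node.
    if not all(node.get("gtid_mode") for node in nodes):
        return False
    # Condition 2: every node reports a non-empty Executed_Gtid_Set.
    if not all(node.get("Executed_Gtid_Set") for node in nodes):
        return False
    # Condition 3: at least one node replicates with Auto_Position=1.
    return any(node.get("Auto_Position") == 1 for node in nodes)
```

If any condition fails, the predicate returns False, which corresponds to falling back to the non-GTID flow described in the linked article.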
GTID-based MHA failover
MHA::MasterFailover::main()
-> do_master_failover
Phase 1: Configuration Check Phase
-> check_settings:
check_node_version: checks the MHA version information
connect_all_and_read_server_status: verifies that the MySQL instance on each node can be connected to
get_dead_servers/get_alive_servers/get_alive_slaves: double-checks whether each node is dead or alive
start_sql_threads_if: checks whether Slave_SQL_Running is Yes and, if not, starts the SQL thread
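Phase 2: Dead Master Shutdown Phase: for us, its only effect is to stop the IO threads
-> force_shutdown($dead_master):
stop_io_thread: stops the IO thread on all slaves (the master is about to be stopped)
force_shutdown_internal (in practice this just runs the master_ip_failover_script/shutdown_script from the configuration file; if neither is set, nothing is executed):
master_ip_failover_script: if a VIP is configured, the VIP is switched first
shutdown_script: if a shutdown script is configured, it is executed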
Phase 3: Master Recovery Phase
-> Phase 3.1: Getting Latest Slaves Phase (find the latest slave)
read_slave_status: gets the binlog file/position of each slave
check_slave_status: runs "SHOW SLAVE STATUS" to collect the following fields from each slave:
Slave_IO_State, Master_Host, Master_Port, Master_User,
Slave_IO_Running, Slave_SQL_Running,
Master_Log_File, Read_Master_Log_Pos,
Relay_Master_Log_File, Last_Errno, Last_Error, Exec_Master_Log_Pos,
Relay_Log_File, Relay_Log_Pos,
Seconds_Behind_Master, Retrieved_Gtid_Set, Executed_Gtid_Set, Auto_Position,
Replicate_Do_DB, Replicate_Ignore_DB, Replicate_Do_Table,
Replicate_Ignore_Table, Replicate_Wild_Do_Table, Replicate_Wild_Ignore_Table
identify_latest_slaves:
finds the latest slaves by comparing Master_Log_File/Read_Master_Log_Pos across the slaves
identify_oldest_slaves:
finds the oldest slaves by comparing Master_Log_File/Read_Master_Log_Pos across the slaves
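The comparison used by identify_latest_slaves and identify_oldest_slaves can be sketched as follows. This is an illustration rather than MHA's implementation: the helper read_coordinate, the dict layout, and the assumption that binlog file names end in a numeric suffix (for example mysql-bin.000010) are all introduced here.

```python
from typing import Dict, List, Tuple


def read_coordinate(slave: Dict) -> Tuple[int, int]:
    """Turn (Master_Log_File, Read_Master_Log_Pos) into a sortable tuple.

    Assumes the binlog file name ends in ".<number>", e.g. "mysql-bin.000010".
    """
    file_index = int(slave["Master_Log_File"].rsplit(".", 1)[1])
    return (file_index, int(slave["Read_Master_Log_Pos"]))


def identify_latest_slaves(slaves: List[Dict]) -> List[Dict]:
    """Slaves that have received the most binlog events from the master."""
    best = max(read_coordinate(s) for s in slaves)
    return [s for s in slaves if read_coordinate(s) == best]


def identify_oldest_slaves(slaves: List[Dict]) -> List[Dict]:
    """Slaves that have received the fewest binlog events from the master."""
    worst = min(read_coordinate(s) for s in slaves)
    return [s for s in slaves if read_coordinate(s) == worst]
```

Both functions return a list because several slaves may sit at exactly the same read position.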
-> Phase 3.2: Determining New Master Phase
get_most_advanced_latest_slave: finds the slave whose (Relay_Master_Log_File, Exec_Master_Log_Pos) is the most advanced
select_new_master: selects the new master node
If a preferred node is specified, one of the active preferred nodes will be the new master.
If the latest server is too far behind (e.g. its SQL thread was stopped for online backups),
we should not use it as the new master, but we should fetch relay logs from it. Even if a preferred
master is configured, it does not become the master if it is far behind.
get_candidate_masters:
the nodes configured with candidate_master>0 in the configuration file
get_bad_candidate_masters:
# The following servers can not be master:
# - dead servers
# - Set no_master in conf files (i.e. DR servers)
# - log_bin is disabled
# - Major version is not the oldest
# - too much replication delay (the slave's binlog position is more than 100000000 behind the master's)
Searching from candidate_master slaves which have received the latest relay log events
if NOT FOUND:
Searching from all candidate_master slaves
if NOT FOUND:
Searching from all slaves which have received the latest relay log events
if NOT FOUND:
Searching from all slaves
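The four-step search order above can be sketched as a tiered lookup. This is a simplified illustration and not MHA's code: the dict fields host and candidate_master, the sets latest_hosts and bad_hosts, and the helper pick are assumptions, and any ordering or priority among servers within a tier is left out.

```python
from typing import Dict, List, Optional, Set


def pick(tier: List[Dict], bad_hosts: Set[str]) -> Optional[Dict]:
    """Return the first server in the tier that is not a bad candidate."""
    for server in tier:
        if server["host"] not in bad_hosts:
            return server
    return None


def select_new_master(slaves: List[Dict], latest_hosts: Set[str],
                      bad_hosts: Set[str]) -> Optional[Dict]:
    """Try each tier in the order listed above and stop at the first hit."""
    candidates = [s for s in slaves if s.get("candidate_master", 0) > 0]
    tiers = [
        # 1. candidate_master slaves that received the latest relay log events
        [s for s in candidates if s["host"] in latest_hosts],
        # 2. all candidate_master slaves
        candidates,
        # 3. all slaves that received the latest relay log events
        [s for s in slaves if s["host"] in latest_hosts],
        # 4. all slaves
        slaves,
    ]
    for tier in tiers:
        chosen = pick(tier, bad_hosts)
        if chosen is not None:
            return chosen
    return None  # no usable server: the failover cannot proceed
```

Here bad_hosts stands for the servers excluded by get_bad_candidate_masters (dead, no_master, log_bin disabled, and so on).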
-> Phase 3.3: New Master Recovery Phase
recover_master_gtid_internal:
wait_until_relay_log_applied
stop_slave
If the new master is not the slave holding the latest relay log:
$latest_slave->wait_until_relay_log_applied: waits until Exec_Master_Log_Pos equals Read_Master_Log_Pos on the slave holding the latest relay log
change_master_and_start_slave( $target, $latest_slave)
wait_until_in_sync( $target, $latest_slave )
save_from_binlog_server:
iterates over all binlog servers and runs save_binary_logs --command=save to fetch the remaining binlog events
apply_binlog_to_master:
applies the binlog events fetched from the binlog servers (if any)
If master_ip_failover_script is set, calls $master_ip_failover_script --command=start to bring up the VIP
Unless skip_disable_read_only is set, sets read_only=0
Phase 4: Slaves Recovery Phase
recover_slaves_gtid_internal
-> Phase 4.1: Starting Slaves in parallel
Runs change_master_and_start_slave on every slave
If wait_until_gtid_in_sync is set, waits for the slave to catch up via "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(?,0)"
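To make the slave recovery step concrete, the sketch below builds the two kinds of statements involved: a GTID-style CHANGE MASTER TO using MASTER_AUTO_POSITION=1, and the wait query quoted above. It is an assumption-laden illustration, not the exact SQL MHA emits: the function names and parameters are hypothetical, credentials are omitted, and string formatting is used only for readability.

```python
from typing import Tuple


def build_change_master_sql(host: str, port: int, user: str) -> str:
    """Build a GTID-based CHANGE MASTER TO statement (password omitted).

    With MASTER_AUTO_POSITION=1 no binlog file/position is supplied;
    the slave negotiates its start point from its GTID sets.
    """
    return (
        "CHANGE MASTER TO MASTER_HOST='{0}', MASTER_PORT={1}, "
        "MASTER_USER='{2}', MASTER_AUTO_POSITION=1".format(host, int(port), user)
    )


def build_gtid_wait_query(master_executed_gtid_set: str) -> Tuple[str, Tuple[str]]:
    """Return the wait query and its bind parameter.

    The 0 is the timeout argument taken from the statement quoted in the text.
    """
    sql = "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(%s, 0)"
    return sql, (master_executed_gtid_set,)
```

Returning the wait query with a separate parameter tuple mirrors the "?" placeholder in the original statement, so the GTID set is bound by the driver instead of being spliced into the SQL text.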
Phase 5: New master cleanup phase
reset_slave_on_new_master
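Cleaning up the new master simply means resetting its slave info, i.e. discarding its previous replication settings. At this point the whole master failover is complete.
The online switch flow with GTID enabled is the same as without GTID (the only difference is the CHANGE MASTER statement that is executed), so it is omitted here.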