About the author

Half a PostgreSQL DBA, keen on database-related technologies. My PPT shares: https://pan.baidu.com/s/1eRQsdAa https://github.com/chenhuajun https://chenhuajun.github.io


Category: Mysql/postgreSQL

2016-08-24 00:25:49

As a supplement to the following article, this post describes the workflow of MHA's GTID-based failover.
http://blog.chinaunix.net/uid-20726500-id-5700631.html

MHA treats a failover as GTID-based only when all three of the following conditions hold (see the function get_gtid_status):
gtid_mode=1 on all nodes
Executed_Gtid_Set is non-empty on all nodes
Auto_Position=1 on at least one node
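The three conditions can be sketched as a small predicate over each node's status. This is an illustrative Python sketch, not MHA's actual Perl code; the field names simply mirror the system-variable and SHOW SLAVE STATUS names:

```python
def is_gtid_based_failover(nodes):
    """Mimic MHA's get_gtid_status decision: return True only when
    the failover can be performed using GTIDs (simplified sketch)."""
    # 1. gtid_mode must be enabled on every node
    if not all(n["gtid_mode"] == 1 for n in nodes):
        return False
    # 2. Executed_Gtid_Set must be non-empty on every node
    if not all(n["Executed_Gtid_Set"] for n in nodes):
        return False
    # 3. at least one node must replicate with MASTER_AUTO_POSITION=1
    return any(n["Auto_Position"] == 1 for n in nodes)

nodes = [
    {"gtid_mode": 1, "Executed_Gtid_Set": "uuid:1-100", "Auto_Position": 1},
    {"gtid_mode": 1, "Executed_Gtid_Set": "uuid:1-90",  "Auto_Position": 0},
]
print(is_gtid_based_failover(nodes))  # True
```

If any condition fails, MHA falls back to the classic binlog-position-based failover described in the article linked above.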


GTID-based MHA failover


MHA::MasterFailover::main()
    ->do_master_failover
        Phase 1: Configuration Check Phase
            -> check_settings:
                check_node_version: check the MHA version
                connect_all_and_read_server_status: confirm that the MySQL instance on each node is reachable
                get_dead_servers/get_alive_servers/get_alive_slaves: double-check the dead/alive status of each node
                start_sql_threads_if: check whether Slave_SQL_Running is Yes; if not, start the SQL thread

        Phase 2: Dead Master Shutdown Phase: for our purposes, its only effect is stopping the IO threads
            -> force_shutdown($dead_master)
                stop_io_thread: stop the IO thread on every slave (the master is about to be shut down)
                force_shutdown_internal (simply runs the master_ip_failover_script/shutdown_script from the configuration file; skipped if unset)
                    master_ip_failover_script: if a VIP is configured, switch the VIP away first
                    shutdown_script: if a shutdown script is configured, run it

        Phase 3: Master Recovery Phase
            -> Phase 3.1: Getting Latest Slaves Phase (identify the latest slaves)
                read_slave_status: obtain each slave's binlog file/position
                    check_slave_status: run "SHOW SLAVE STATUS" to collect the following fields from each slave:
                         Slave_IO_State, Master_Host,
                         Master_Port, Master_User,
                         Slave_IO_Running, Slave_SQL_Running,
                         Master_Log_File, Read_Master_Log_Pos,
                         Relay_Master_Log_File, Last_Errno,
                         Last_Error, Exec_Master_Log_Pos,
                         Relay_Log_File, Relay_Log_Pos,
                         Seconds_Behind_Master, Retrieved_Gtid_Set,
                         Executed_Gtid_Set, Auto_Position,
                         Replicate_Do_DB, Replicate_Ignore_DB, Replicate_Do_Table,
                         Replicate_Ignore_Table, Replicate_Wild_Do_Table,
                         Replicate_Wild_Ignore_Table
                identify_latest_slaves:
                    find the latest slaves by comparing Master_Log_File/Read_Master_Log_Pos across all slaves
                identify_oldest_slaves:
                    find the oldest slaves by comparing Master_Log_File/Read_Master_Log_Pos across all slaves

            -> Phase 3.2: Determining New Master Phase
                get_most_advanced_latest_slave: find the slave whose (Relay_Master_Log_File, Exec_Master_Log_Pos) is furthest ahead

                select_new_master: choose the new master node
                    If a preferred node is specified, one of the active preferred nodes will be the new master.
                    If the latest server is too far behind (e.g. its SQL thread was stopped for online backups),
                    we should not use it as the new master; we only fetch relay logs from it. Even if a preferred
                    master is configured, it does not become the master if it is far behind.
                    get_candidate_masters:
                        the nodes configured with candidate_master>0 in the configuration file
                    get_bad_candidate_masters:
                        # The following servers can not be master:
                        # - dead servers
                        # - Set no_master in conf files (i.e. DR servers)
                        # - log_bin is disabled
                        # - Major version is not the oldest
                        # - too much replication delay (the slave's binlog position is more than 100000000 behind the master's)
                    Search the candidate_master slaves that have received the latest relay log events
                    if NOT FOUND:
                        Search all candidate_master slaves
                        if NOT FOUND:
                            Search all slaves that have received the latest relay log events
                            if NOT FOUND:
                                Search all slaves

            -> Phase 3.3: New Master Recovery Phase
                recover_master_gtid_internal:
                    wait_until_relay_log_applied
                    stop_slave
                    if the new master is not the slave with the latest relay log:
                        $latest_slave->wait_until_relay_log_applied: wait until Exec_Master_Log_Pos equals Read_Master_Log_Pos on the slave with the latest relay log
                        change_master_and_start_slave($target, $latest_slave)
                        wait_until_in_sync($target, $latest_slave)
                    save_from_binlog_server:
                        iterate over all binlog servers, running save_binary_logs --command=save to fetch any trailing binlog
                    apply_binlog_to_master:
                        apply the binlog fetched from the binlog servers (if any)
                if master_ip_failover_script is set, call $master_ip_failover_script --command=start to bring up the VIP
                if skip_disable_read_only is not set, set read_only=0

        Phase 4: Slaves Recovery Phase
            recover_slaves_gtid_internal
            -> Phase 4.1: Starting Slaves in parallel
                run change_master_and_start_slave on every slave
                if wait_until_gtid_in_sync is set, wait for the slaves to catch up via "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(?,0)"

        Phase 5: New master cleanup phase
            reset_slave_on_new_master
                cleaning up the new master just means resetting its slave info, i.e. clearing the old replication metadata. At this point the whole master failover is complete.
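Phases 3.1 and 3.2 above boil down to ordering slaves by how much of the master's binlog they have received, then searching for a candidate in four passes from most to least preferred. A simplified Python sketch of that logic (illustrative only; MHA itself is written in Perl, and the bad-candidate checks here cover only two of the listed conditions):

```python
def binlog_pos(slave):
    # order first by binlog file name (e.g. "mysql-bin.000012"), then by position;
    # MySQL binlog file names sort correctly as strings thanks to zero-padding
    return (slave["Master_Log_File"], slave["Read_Master_Log_Pos"])

def identify_latest_slaves(slaves):
    """Slaves that have received the most of the dead master's binlog."""
    top = max(binlog_pos(s) for s in slaves)
    return [s for s in slaves if binlog_pos(s) == top]

def select_new_master(slaves, latest):
    """Four-pass search mirroring MHA's select_new_master fallback order."""
    candidates = [s for s in slaves if s.get("candidate_master")]
    bad = {s["host"] for s in slaves
           if s.get("no_master") or not s.get("log_bin", True)}
    for pool in (
        [s for s in candidates if s in latest],  # candidates with the latest relay log
        candidates,                              # any candidate_master slave
        latest,                                  # any slave with the latest relay log
        slaves,                                  # any slave at all
    ):
        for s in pool:
            if s["host"] not in bad:
                return s
    return None

slaves = [
    {"host": "a", "Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 500},
    {"host": "b", "Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 700,
     "candidate_master": True},
    {"host": "c", "Master_Log_File": "mysql-bin.000011", "Read_Master_Log_Pos": 900},
]
latest = identify_latest_slaves(slaves)
print(select_new_master(slaves, latest)["host"])  # b
```

Note how the fallback order guarantees a new master is found whenever any non-bad slave exists, while still preferring configured candidates that already hold the latest relay log.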
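The waits in Phases 3.3 and 4 (wait_until_in_sync, WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS) both reduce to a GTID-set containment check, the relation MySQL exposes as GTID_SUBSET(): a slave is in sync once the reference GTID set is contained in its Executed_Gtid_Set. A simplified pure-Python illustration of that containment; real GTID sets keep compact interval lists, whereas this sketch expands them for clarity:

```python
def parse_gtid_set(s):
    """Parse a GTID set like 'uuid1:1-5:7,uuid2:1-3' into
    {uuid: set of transaction numbers} (simplified, expanded form)."""
    out = {}
    for part in s.replace("\n", "").split(","):
        uuid, *ranges = part.split(":")
        txns = out.setdefault(uuid, set())
        for r in ranges:
            lo, _, hi = r.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
    return out

def gtid_subset(a, b):
    """True if every transaction in set a is also in set b --
    the condition a slave must satisfy before it counts as in sync."""
    pa, pb = parse_gtid_set(a), parse_gtid_set(b)
    return all(txns <= pb.get(uuid, set()) for uuid, txns in pa.items())

print(gtid_subset("u1:1-3", "u1:1-5"))  # True: the slave has applied 1-3
print(gtid_subset("u1:1-6", "u1:1-5"))  # False: transaction 6 is still missing
```

The server-side function does the same check without expansion, so it stays cheap even for sets spanning millions of transactions.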



The online switchover flow with GTID enabled is the same as without GTID (the only difference being the CHANGE MASTER statement that gets executed), so it is omitted here.
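For reference, the difference just mentioned: in the GTID case change_master_and_start_slave issues CHANGE MASTER TO ... MASTER_AUTO_POSITION=1, while the non-GTID case must name explicit binlog coordinates. A hypothetical sketch that builds both statement forms as strings (replication credentials omitted for brevity):

```python
def change_master_stmt(host, port, gtid, log_file=None, log_pos=None):
    """Build the CHANGE MASTER TO statement for the new master."""
    base = f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={port}"
    if gtid:
        # GTID-based: the slave negotiates its own start point automatically
        return base + ", MASTER_AUTO_POSITION=1"
    # position-based: exact binlog coordinates must be supplied
    return base + f", MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}"

print(change_master_stmt("newmaster", 3306, gtid=True))
print(change_master_stmt("newmaster", 3306, gtid=False,
                         log_file="mysql-bin.000003", log_pos=120))
```

This auto-positioning is also why the GTID path needs no equivalent of MHA's binlog-diff application between slaves during online switchover.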