HA方案之MySQL半复制+MHA+Keepalived+Atlas+LVS

简介

目前Mysql高可用的方案有好多,比如MMM,heartbeat+drbd,Cluster等,还有percona的Galera Cluster等。这些高可用软件各有优劣。在进行高可用方案选择时,主要是看业务还有对数据一致性方面的要求。本文探讨的是MHA这种方案。
至于DAL层,也有很多方案,现在主流的一是在应用层写数据库路由,当然这个效率也是最快的,但是最大的缺陷就是运维难度大,技术难度也大,需要有强大的技术团队支持。第二个方案是proxy中间件,现在开源的中间件很多,比如TDDL、Cobar、Atlas、MyCat、Mysql Proxy、Oceanus等等,中间件的最大的好处是对数据层解耦,减轻了运维难度,当然在服务器层与数据库层加了一层proxy,使得效率明显不如直接访问数据库。按业务实际需求选择合适的方案,在本篇文章中,选的是360开源的DAL中间件Atlas,详细信息官方文档(https://github.com/Qihoo360/Atlas)以描述的很清楚,这里不做过多的描述。
现在继续讨论MHA,MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。
该软件由两部分组成:MHA Manager(管理节点)和MHA Node(数据节点)。MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上。MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。
在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

我们自己使用其实也可以使用1主1从,但是master主机宕机后无法切换,以及无法补全binlog。master的mysqld进程crash后,还是可以切换成功,以及补全binlog的。

架构图

2015-06-18_172304
现在说下架构图的层级关系,最底层的是mysql主从,其中有从机作为备选的主机,备选主机与主机之间由keepalived维护着心跳。每个mysql server都是MHA node,由MHA Manager监控着Mysql的节点,当MHA Manager监控(keepalived的心跳检测)到主机出现问题,会切换到备选主机,并将备选主机提升为新主机,其他的从机成为新主机的从机,并且VIP也漂移到新主机,这样就实现了Mysql的HA。db上面的一层是Atlas Proxy,Atlas主要的功能是读写分离、从库的负载均衡、自动分表、自动摘除宕机的DB、DBA可平滑的上下线DB及IP过滤,其中读写分离及自动摘除宕机的DB是当初选择该中间件的最主要原因,自动摘除宕机的从机保障了整个架构系统持续性,而不影响业务的正常运行。后期根据数据量的增大可以切换到Mycat这个方案,支持分库分表、主从切换等等,并且社区活跃度也很高,感兴趣的可以查看下官方文档(https://github.com/MyCATApache)。回到正题,这里有两个Atlas做HA,上层的LVS做Atlas的负载均衡,两个LVS之间利用keepalived心跳实现热主备模式,防止出现单点故障。再上层的就是应用server,这里就不在讨论了。

角色

  1. 角色 ip地址 主机名
  2. MHA Manger 192.168.3.123 B-dev23
  3. Mysql Master 192.168.3.110 B-dev10
  4. Mysql slave1 192.168.3.115 B-dev15
  5. Mysql slave2 192.168.3.117 B-dev17
  6. Candicate master 192.168.3.118 B-dev18
  7. Atlas1 192.168.3.124 B-dev24
  8. Atlas2 192.168.3.125 B-dev25
  9. LVS Master 192.168.3.119 B-dev19
  10. LVS Backup 192.168.3.120 B-dev20

MHA

1.MHA的基本原理

(1)从宕机崩溃的master保存二进制日志事件(binlog events);

(2)识别含有最新更新的slave;

(3)应用差异的中继日志(relay log)到其他的slave;

(4)应用从master保存的二进制日志事件(binlog events);

(5)提升一个slave为新的master;

(6)使其他的slave连接新的master进行复制;

MHA软件由两部分组成,Manager工具包和Node工具包,具体的说明如下。

Manager工具包主要包括以下几个工具:

  1. masterha_check_ssh 检查MHASSH配置状况
  2. masterha_check_repl 检查MySQL复制状况
  3. masterha_manger 启动MHA
  4. masterha_check_status 检测当前MHA运行状态
  5. masterha_master_monitor 检测master是否宕机
  6. masterha_master_switch 控制故障转移(自动或者手动)
  7. masterha_conf_host 添加或删除配置的server信息

Node工具包(这些工具通常由MHA Manager的脚本触发,无需人为操作)主要包括以下几个工具:

  1. save_binary_logs 保存和复制master的二进制日志
  2. apply_diff_relay_logs 识别差异的中继日志事件并将其差异的事件应用于其他的slave
  3. filter_mysqlbinlog 去除不必要的ROLLBACK事件(MHA已不再使用这个工具)
  4. purge_relay_logs 清除中继日志(不会阻塞SQL线程)

注意:

为了尽可能的减少主库硬件损坏宕机造成的数据丢失,因此在配置MHA的同时建议配置成MySQL 5.5的半同步复制,关于半同步复制原理及安装流程下面会详细介绍。

2.部署MHA

接下来部署MHA,具体的搭建环境如下:
其中master对外提供写服务,slave提供相关的读服务,一旦master宕机,将会把备选master提升为新的master,slave指向新的master。
(1)在所有节点安装MHA node所需的perl模块(DBD:mysql),安装脚本如下:
  1. [root@192.168.3.110 ~]# cat install.sh
  2. #!/bin/bash
  3. wget http://xrl.us/cpanm --no-check-certificate
  4. mv cpanm /usr/bin
  5. chmod 755 /usr/bin/cpanm
  6. cat > /root/list << EOF
  7. install DBD::mysql
  8. EOF
  9. for package in `cat /root/list`
  10. do
  11. cpanm $package
  12. done
  13. [root@192.168.3.110 ~]#

如果有安装epel源,也可以使用yum安装

  1. yum install perl-DBD-MySQL -y

(2)在所有的节点安装mha node:

  1. wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.53.tar.gz
  2. tar xf mha4mysql-node-0.53.tar.gz
  3. cd mha4mysql-node-0.53
  4. perl Makefile.PL
  5. make && make install

安装完成后会在/usr/local/bin目录下生成以下脚本文件:

  1. [root@192.168.3.110 ~]# pwd
  2. /usr/local/bin
  3. [root@192.168.3.110 bin]# ll
  4. total 40
  5. -r-xr-xr-x 1 root root 15498 Jun 20 10:05 apply_diff_relay_logs
  6. -r-xr-xr-x 1 root root 4807 Jun 20 10:05 filter_mysqlbinlog
  7. -r-xr-xr-x 1 root root 7401 Jun 20 10:05 purge_relay_logs
  8. -r-xr-xr-x 1 root root 7263 Jun 20 10:05 save_binary_logs
  9. [root@192.168.3.110 bin]#

关于上面脚本的功能,上面已经介绍过了,这里不再重复了。

3.安装MHA Manager

MHA Manager中主要包括了几个管理员的命令行工具,例如master_manger,master_master_switch等。MHA Manger也依赖于perl模块,具体如下:

(1)安装MHA Node软件包之前需要安装依赖。我这里使用yum完成,没有epel源的可以使用上面提到的脚本(epel源安装也简单)。注意:在MHA Manager的主机也是需要安装MHA Node。

  1. rpm -ivh http://dl.fedorJunoject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
  2. yum install perl-DBD-MySQL -y

安装MHA Node软件包,和上面的方法一样,如下:

  1. wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.53.tar.gz
  2. tar xf mha4mysql-node-0.53.tar.gz
  3. cd mha4mysql-node-0.53perl Makefile.PL
  4. make && make install

(2)安装MHA Manager。首先安装MHA Manger依赖的perl模块(我这里使用yum安装):

  1. yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y

安装MHA Manager软件包:

  1. wget http://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.53.tar.gz
  2. tar xf mha4mysql-manager-0.53.tar.gz
  3. cd mha4mysql-manager-0.53
  4. perl Makefile.PL
  5. make && make install

安装完成后会在/usr/local/bin目录下面生成以下脚本文件,前面已经说过这些脚本的作用,这里不再重复。

  1. [root@192.168.3.123 bin]# pwd
  2. /usr/local/bin
  3. [root@192.168.3.123 bin]# ll
  4. total 76
  5. -r-xr-xr-x 1 root root 15498 Jun 20 10:58 apply_diff_relay_logs
  6. -r-xr-xr-x 1 root root 4807 Jun 20 10:58 filter_mysqlbinlog
  7. -r-xr-xr-x 1 root root 1995 Jun 20 11:33 masterha_check_repl
  8. -r-xr-xr-x 1 root root 1779 Jun 20 11:33 masterha_check_ssh
  9. -r-xr-xr-x 1 root root 1865 Jun 20 11:33 masterha_check_status
  10. -r-xr-xr-x 1 root root 3201 Jun 20 11:33 masterha_conf_host
  11. -r-xr-xr-x 1 root root 2517 Jun 20 11:33 masterha_manager
  12. -r-xr-xr-x 1 root root 2165 Jun 20 11:33 masterha_master_monitor
  13. -r-xr-xr-x 1 root root 2373 Jun 20 11:33 masterha_master_switch
  14. -r-xr-xr-x 1 root root 3749 Jun 20 11:33 masterha_secondary_check
  15. -r-xr-xr-x 1 root root 1739 Jun 20 11:33 masterha_stop
  16. -r-xr-xr-x 1 root root 7401 Jun 20 10:58 purge_relay_logs
  17. -r-xr-xr-x 1 root root 7263 Jun 20 10:58 save_binary_logs

复制相关脚本到/usr/local/bin目录

  1. [root@192.168.3.123 scripts]# pwd
  2. /root/mha4mysql-manager-0.53/samples/scripts
  3. [root@192.168.3.123 scripts]# ll
  4. total 32
  5. -rwxr-xr-x 1 root root 3443 Jun 8 2015 master_ip_failover #自动切换时vip管理的脚本,不是必须,如果我们使用keepalived的,我们可以自己编写脚本完成对vip的管理,比如监控mysql,如果mysql异常,我们停止keepalived就行,这样vip就会自动漂移
  6. -rwxr-xr-x 1 root root 9186 Jun 8 2015 master_ip_online_change #在线切换时vip的管理,不是必须,同样可以可以自行编写简单的shell完成
  7. -rwxr-xr-x 1 root root 11867 Jun 8 2015 power_manager #故障发生后关闭主机的脚本,不是必须
  8. -rwxr-xr-x 1 root root 1360 Jun 8 2015 send_report #因故障切换后发送报警的脚本,不是必须,可自行编写简单的shell完成。
  9. [root@192.168.3.123 scripts]# cp * /usr/local/bin/

4.配置SSH免密码登录(key模式)

这里需要注意的是所有的mysql server节点两两之间必须配成SSH无密码免登录双向模式,否则后面会报错。SSH免密码登录配置的文档可以查看Linux(使用ssh-keygen)设置SSH免密码登录

5.搭建主从复制环境

搭建MySQL主从复制的文档可以查看linux(Ubuntu)下mysql安装及主从复制,MySQL半复制的文档可以查看Mysql半复制浅谈,这里不再赘述。

6.配置MHA

(1)创建MHA的工作目录,并且创建相关配置文件(在软件包解压后的目录里面有样例配置文件)。

  1. [root@192.168.3.123 ~]# mkdir -p /etc/masterha
  2. [root@192.168.3.123 ~]# cp mha4mysql-manager-0.53/samples/conf/app1.cnf /etc/masterha/

修改app1.cnf配置文件,修改后的文件内容如下(注意,配置文件中的注释需要去掉,我这里是为了解释清楚):

  1. [root@192.168.3.123 ~]# cat /etc/masterha/app1.cnf
  2.  
  3. [server default]
  4. manager_workdir=/var/log/masterha/app1.log //设置manager的工作目录
  5. manager_log=/var/log/masterha/app1/manager.log //设置manager的日志
  6. master_binlog_dir=/data/mysql //设置master 保存binlog的位置,以便MHA可以找到master的日志,我这里的也就是mysql的数据目录
  7. #master_ip_failover_script= /usr/local/bin/master_ip_failover //设置自动failover时候的切换脚本,这里先注释掉后面集成keepalived后再开启
  8. master_ip_online_change_script= /usr/local/bin/master_ip_online_change //设置手动切换时候的切换脚本
  9. password=123456 //设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码
  10. user=root 设置监控用户root
  11. ping_interval=1 //设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover
  12. remote_workdir=/tmp //设置远端mysql在发生切换时binlog的保存位置
  13. repl_password=123456 //设置复制用户的密码
  14. repl_user=repl //设置复制环境中的复制用户名
  15. report_script=/usr/local/send_report //设置发生切换后发送的报警的脚本
  16. secondary_check_script= /usr/local/bin/masterha_secondary_check -s B-dev18 -s B-dev10 --user=root --master_host=B-dev10 --master_ip=192.168.3.110 --master_port=3306 //一旦MHA到B-dev10的监控之间出现问题,MHA Manager将会尝试从B-dev18登录到B-dev10
  17. shutdown_script="" //设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用)
  18. ssh_user=root //设置ssh的登录用户名
  19.  
  20. [server1]
  21. hostname=192.168.3.110
  22. port=3306
  23.  
  24. [server2]
  25. hostname=192.168.3.118
  26. port=3306
  27. candidate_master=1 //设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave
  28. check_repl_delay=0 //默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
  29.  
  30. [server3]
  31. hostname=192.168.3.115
  32. port=3306
  33.  
  34. [server4]
  35. hostname=192.168.3.117
  36. port=3306
  37.  
  38. [root@192.168.3.123 ~]#

(2)设置relay log的清除方式(在每个slave节点上):

  1. [root@192.168.3.115 ~]# mysql -e 'set global relay_log_purge=0'
  2. [root@192.168.3.117 ~]# mysql -e 'set global relay_log_purge=0'
  3. [root@192.168.3.118 ~]# mysql -e 'set global relay_log_purge=0'

注意:

MHA在发生切换的过程中,从库的恢复过程中依赖于relay log的相关信息,所以这里要将relay log的自动清除设置为OFF,采用手动清除relay log的方式。在默认情况下,从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中,这些中继日志在恢复其他从服务器时可能会被用到,因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下,删除大的文件需要一定的时间,会导致严重的复制延时。为了避免复制延时,需要暂时为中继日志创建硬链接,因为在linux系统中通过硬链接删除大文件速度会很快。(在mysql数据库中,删除大表时,通常也采用建立硬链接的方式)

MHA节点中包含了pure_relay_logs命令工具,它可以为中继日志创建硬链接,执行SET GLOBAL relay_log_purge=1,等待几秒钟以便SQL线程切换到新的中继日志,再执行SET GLOBAL relay_log_purge=0。

pure_relay_logs脚本参数如下所示:

  1. --user mysql 用户名
  2. --password mysql 密码
  3. --port 端口号
  4. --workdir 指定创建relay log的硬链接的位置,默认是/var/tmp,由于系统不同分区创建硬链接文件会失败,故需要执行硬链接具体位置,成功执行脚本后,硬链接的中继日志文件被删除
  5. --disable_relay_log_purge 默认情况下,如果relay_log_purge=1,脚本会什么都不清理,自动退出,通过设定这个参数,当relay_log_purge=1的情况下会将relay_log_purge设置为0。清理relay log之后,最后将参数设置为OFF

(3)设置定期清理relay脚本(三台slave服务器)

  1. [root@192.168.3.115 ~]# cat purge_relay_log.sh
  2. #!/bin/bash
  3. user=root
  4. passwd=123456
  5. port=3306
  6. log_dir='/data/masterha/log'
  7. work_dir='/data'
  8. purge='/usr/local/bin/purge_relay_logs'
  9.  
  10. if [ ! -d $log_dir ]
  11. then
  12. mkdir $log_dir -p
  13. fi
  14.  
  15. $purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
  16. [root@192.168.3.115 ~]#

添加到crontab定期执行

  1. [root@192.168.3.115 ~]# crontab -l
  2. 0 4 * * * /bin/bash /root/purge_relay_log.sh
  3. [root@192.168.3.115 ~]#

purge_relay_logs脚本删除中继日志不会阻塞SQL线程。下面我们手动执行看看什么情况。

  1. [root@192.168.3.115 ~]# purge_relay_logs --user=root --password=123456 --port=3306 -disable_relay_log_purge --workdir=/data/
  2. 2015-06-20 15:47:24: purge_relay_logs script started.
  3. Found relay_log.info: /data/mysql/relay-log.info
  4. Removing hard linked relay log files B-dev15-relay-bin* under /data/.. done.
  5. Current relay log file: /data/mysql/B-dev15-relay-bin.000002
  6. Archiving unused relay log files (up to /data/mysql/B-dev15-relay-bin.000001) ...
  7. Creating hard link for /data/mysql/B-dev15-relay-bin.000001 under /data//B-dev15-relay-bin.000001 .. ok.
  8. Creating hard links for unused relay log files completed.
  9. Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
  10. Removing hard linked relay log files B-dev15-relay-bin* under /data/.. done.
  11. 2015-06-20 15:47:27: All relay log purging operations succeeded.
  12. [root@192.168.3.115 ~]#

7.检查SSH配置

检查MHA Manger到所有MHA Node的SSH连接状态:

  1. [root@192.168.3.123 ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
  2. Wed Jun 24 02:04:24 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
  3. Wed Jun 24 02:04:24 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
  4. Wed Jun 24 02:04:24 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
  5. Wed Jun 24 02:04:24 2015 - [info] Starting SSH connection tests..
  6. Wed Jun 24 02:04:26 2015 - [debug]
  7. Wed Jun 24 02:04:24 2015 - [debug] Connecting via SSH from root@192.168.3.110(192.168.3.110:22) to root@192.168.3.118(192.168.3.118:22)..
  8. Wed Jun 24 02:04:24 2015 - [debug] ok.
  9. Wed Jun 24 02:04:24 2015 - [debug] Connecting via SSH from root@192.168.3.110(192.168.3.110:22) to root@192.168.3.115(192.168.3.115:22)..
  10. Wed Jun 24 02:04:25 2015 - [debug] ok.
  11. Wed Jun 24 02:04:25 2015 - [debug] Connecting via SSH from root@192.168.3.110(192.168.3.110:22) to root@192.168.3.117(192.168.3.117:22)..
  12. Wed Jun 24 02:04:26 2015 - [debug] ok.
  13. Wed Jun 24 02:04:27 2015 - [debug]
  14. Wed Jun 24 02:04:24 2015 - [debug] Connecting via SSH from root@192.168.3.118(192.168.3.118:22) to root@192.168.3.110(192.168.3.110:22)..
  15. Wed Jun 24 02:04:25 2015 - [debug] ok.
  16. Wed Jun 24 02:04:25 2015 - [debug] Connecting via SSH from root@192.168.3.118(192.168.3.118:22) to root@192.168.3.115(192.168.3.115:22)..
  17. Wed Jun 24 02:04:26 2015 - [debug] ok.
  18. Wed Jun 24 02:04:26 2015 - [debug] Connecting via SSH from root@192.168.3.118(192.168.3.118:22) to root@192.168.3.117(192.168.3.117:22)..
  19. Wed Jun 24 02:04:27 2015 - [debug] ok.
  20. Wed Jun 24 02:04:28 2015 - [debug]
  21. Wed Jun 24 02:04:25 2015 - [debug] Connecting via SSH from root@192.168.3.115(192.168.3.115:22) to root@192.168.3.110(192.168.3.110:22)..
  22. Wed Jun 24 02:04:25 2015 - [debug] ok.
  23. Wed Jun 24 02:04:25 2015 - [debug] Connecting via SSH from root@192.168.3.115(192.168.3.115:22) to root@192.168.3.118(192.168.3.118:22)..
  24. Wed Jun 24 02:04:27 2015 - [debug] ok.
  25. Wed Jun 24 02:04:27 2015 - [debug] Connecting via SSH from root@192.168.3.115(192.168.3.115:22) to root@192.168.3.117(192.168.3.117:22)..
  26. Wed Jun 24 02:04:28 2015 - [debug] ok.
  27. Wed Jun 24 02:04:28 2015 - [debug]
  28. Wed Jun 24 02:04:25 2015 - [debug] Connecting via SSH from root@192.168.3.117(192.168.3.117:22) to root@192.168.3.110(192.168.3.110:22)..
  29. Wed Jun 24 02:04:26 2015 - [debug] ok.
  30. Wed Jun 24 02:04:26 2015 - [debug] Connecting via SSH from root@192.168.3.117(192.168.3.117:22) to root@192.168.3.118(192.168.3.118:22)..
  31. Wed Jun 24 02:04:27 2015 - [debug] ok.
  32. Wed Jun 24 02:04:27 2015 - [debug] Connecting via SSH from root@192.168.3.117(192.168.3.117:22) to root@192.168.3.115(192.168.3.115:22)..
  33. Wed Jun 24 02:04:28 2015 - [debug] ok.
  34. Wed Jun 24 02:04:28 2015 - [info] All SSH connection tests passed successfully.

可以看见各个节点ssh验证都是ok的。

8.检查整个复制环境状况

通过masterha_check_repl脚本查看整个集群的状态

  1. [root@192.168.123 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
  2. Wed Jun 24 01:51:17 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
  3. Wed Jun 24 01:51:17 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
  4. Wed Jun 24 01:51:17 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
  5. Wed Jun 24 01:51:17 2015 - [info] MHA::MasterMonitor version 0.53.
  6. Wed Jun 24 01:51:18 2015 - [info] Dead Servers:
  7. Wed Jun 24 01:51:18 2015 - [info] Alive Servers:
  8. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.110(192.168.3.110:3306)
  9. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.118(192.168.3.118:3306)
  10. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.115(192.168.3.115:3306)
  11. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.117(192.168.3.117:3306)
  12. Wed Jun 24 01:51:18 2015 - [info] Alive Slaves:
  13. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.118(192.168.3.118:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  14. Wed Jun 24 01:51:18 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  15. Wed Jun 24 01:51:18 2015 - [info] Primary candidate for the new Master (candidate_master is set)
  16. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.115(192.168.3.115:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  17. Wed Jun 24 01:51:18 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  18. Wed Jun 24 01:51:18 2015 - [info] 192.168.3.117(192.168.3.117:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  19. Wed Jun 24 01:51:18 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  20. Wed Jun 24 01:51:18 2015 - [info] Current Alive Master: 192.168.3.110(192.168.3.110:3306)
  21. Wed Jun 24 01:51:18 2015 - [info] Checking slave configurations..
  22. Wed Jun 24 01:51:18 2015 - [info] read_only=1 is not set on slave 192.168.3.118(192.168.3.118:3306).
  23. Wed Jun 24 01:51:18 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.118(192.168.3.118:3306).
  24. Wed Jun 24 01:51:18 2015 - [info] read_only=1 is not set on slave 192.168.3.115(192.168.3.115:3306).
  25. Wed Jun 24 01:51:18 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.115(192.168.3.115:3306).
  26. Wed Jun 24 01:51:18 2015 - [info] read_only=1 is not set on slave 192.168.3.117(192.168.3.117:3306).
  27. Wed Jun 24 01:51:18 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.117(192.168.3.117:3306).
  28. Wed Jun 24 01:51:18 2015 - [info] Checking replication filtering settings..
  29. Wed Jun 24 01:51:18 2015 - [info] binlog_do_db= , binlog_ignore_db=
  30. Wed Jun 24 01:51:18 2015 - [info] Replication filtering check ok.
  31. Wed Jun 24 01:51:18 2015 - [info] Starting SSH connection tests..
  32. Wed Jun 24 01:51:23 2015 - [info] All SSH connection tests passed successfully.
  33. Wed Jun 24 01:51:23 2015 - [info] Checking MHA Node version..
  34. Wed Jun 24 01:51:24 2015 - [info] Version check ok.
  35. Wed Jun 24 01:51:24 2015 - [info] Checking SSH publickey authentication settings on the current master..
  36. Wed Jun 24 01:51:25 2015 - [info] HealthCheck: SSH to 192.168.3.110 is reachable.
  37. Wed Jun 24 01:51:26 2015 - [info] Master MHA Node version is 0.53.
  38. Wed Jun 24 01:51:26 2015 - [info] Checking recovery script configurations on the current master..
  39. Wed Jun 24 01:51:26 2015 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.53 --start_file=mysql-bin.000002
  40. Wed Jun 24 01:51:26 2015 - [info] Connecting to root@192.168.3.110(192.168.3.110)..
  41. Creating /tmp if not exists.. ok.
  42. Checking output directory is accessible or not..
  43. ok.
  44. Binlog found at /var/lib/mysql, up to mysql-bin.000002
  45. Wed Jun 24 01:51:26 2015 - [info] Master setting check done.
  46. Wed Jun 24 01:51:26 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
  47. Wed Jun 24 01:51:26 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.118 --slave_ip=192.168.3.118 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  48. Wed Jun 24 01:51:26 2015 - [info] Connecting to root@192.168.3.118(192.168.3.118:22)..
  49. Checking slave recovery environment settings..
  50. Opening /var/lib/mysql/relay-log.info ... ok.
  51. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000004
  52. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000004
  53. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  54. done.
  55. Testing mysqlbinlog output.. done.
  56. Cleaning up test file(s).. done.
  57. Wed Jun 24 01:51:27 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.115 --slave_ip=192.168.3.115 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  58. Wed Jun 24 01:51:27 2015 - [info] Connecting to root@192.168.3.115(192.168.3.115:22)..
  59. Checking slave recovery environment settings..
  60. Opening /var/lib/mysql/relay-log.info ... ok.
  61. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000004
  62. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000004
  63. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  64. done.
  65. Testing mysqlbinlog output.. done.
  66. Cleaning up test file(s).. done.
  67. Wed Jun 24 01:51:29 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.117 --slave_ip=192.168.3.117 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  68. Wed Jun 24 01:51:29 2015 - [info] Connecting to root@192.168.3.117(192.168.3.117:22)..
  69. Checking slave recovery environment settings..
  70. Opening /var/lib/mysql/relay-log.info ... ok.
  71. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000003
  72. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000003
  73. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  74. done.
  75. Testing mysqlbinlog output.. done.
  76. Cleaning up test file(s).. done.
  77. Wed Jun 24 01:51:31 2015 - [info] Slaves settings check done.
  78. Wed Jun 24 01:51:31 2015 - [info]
  79. 192.168.3.110 (current master)
  80. +--192.168.3.118
  81. +--192.168.3.115
  82. +--192.168.3.117
  83.  
  84. Wed Jun 24 01:51:31 2015 - [info] Checking replication health on 192.168.3.118..
  85. Wed Jun 24 01:51:31 2015 - [info] ok.
  86. Wed Jun 24 01:51:31 2015 - [info] Checking replication health on 192.168.3.115..
  87. Wed Jun 24 01:51:31 2015 - [info] ok.
  88. Wed Jun 24 01:51:31 2015 - [info] Checking replication health on 192.168.3.117..
  89. Wed Jun 24 01:51:31 2015 - [info] ok.
  90. Wed Jun 24 01:51:31 2015 - [info] Checking master_ip_failover_script status:
  91. Wed Jun 24 01:51:31 2015 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.3.110 --orig_master_ip=192.168.3.110 --orig_master_port=3306
  92.  
  93.  
  94. IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
  95.  
  96. Checking the Status of the script.. OK
  97. Wed Jun 24 01:51:31 2015 - [info] OK.
  98. Wed Jun 24 01:51:31 2015 - [warning] master_ip_failover is not defined.
  99. Wed Jun 24 01:51:31 2015 - [warning] shutdown_script is not defined.
  100. Wed Jun 24 01:51:31 2015 - [info] Got exit code 0 (Not master dead).
  101.  
  102. MySQL Replication Health is OK.

没有报错信息,只有两个警告而已,复制也显示正常了。

9.检查MHA Manager的状态

通过master_check_status脚本查看Manager的状态:

  1. [root@192.168.3.123 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
  2. app1 is stopped(2:NOT_RUNNING).
  3. [root@192.168.3.123 ~]#

注意:如果正常,会显示”PING_OK”,否则会显示”NOT_RUNNING”,这代表MHA监控没有开启。

10.开启MHA Manager监控

  1. [root@192.168.3.123 ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
  2. [1] 30867
  3. [root@192.168.3.123 ~]#

启动参数介绍:

  • –remove_dead_master_conf      该参数代表当发生主从切换后,老的主库的ip将会从配置文件中移除。
  • –manger_log                            日志存放位置
  • –ignore_last_failover                 在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为–ignore_last_failover。

查看MHA Manager监控是否正常:

  1. [root@192.168.3.123 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
  2. app1 (pid:20386) is running(0:PING_OK), master:192.168.3.110
  3. [root@192.168.3.123 ~]#

可以看见已经在监控了,而且master的主机为192.168.3.110

11.查看启动日志

  1. [root@192.168.3.123 ~]# tail -f /var/log/masterha/app1/manager.log
  2. Sun Jun 20 19:12:01 2015 - [info] Connecting to root@192.168.3.117(192.168.3.117:22)..
  3. Checking slave recovery environment settings..
  4. Opening /data/mysql/relay-log.info ... ok.
  5. Relay log found at /data/mysql, up to B-dev17-relay-bin.000002
  6. Temporary relay log file is /data/mysql/B-dev17-relay-bin.000002
  7. Testing mysql connection and privileges.. done.
  8. Testing mysqlbinlog output.. done.
  9. Cleaning up test file(s).. done.
  10. Sun Jun 20 19:12:01 2015 - [info] Connecting to root@192.168.3.115(192.168.3.117:22)..
  11. Checking slave recovery environment settings..
  12. Opening /data/mysql/relay-log.info ... ok.
  13. Relay log found at /data/mysql, up to B-dev15-relay-bin.000002
  14. Temporary relay log file is /data/mysql/B-dev15-relay-bin.000002
  15. Testing mysql connection and privileges.. done.
  16. Testing mysqlbinlog output.. done.
  17. Cleaning up test file(s).. done.
  18. Sun Jun 20 19:12:01 2015 - [info] Slaves settings check done.
  19. Sun Jun 20 19:12:01 2015 - [info]
  20. 192.168.3.110 (current master)
  21. +--192.168.3.115
  22. +--192.168.3.117
  23. +--192.168.3.123
  24. Sun Apr 20 19:12:01 2015 - [warning] master_ip_failover_script is not defined.
  25. Sun Apr 20 19:12:01 2015 - [warning] shutdown_script is not defined.
  26. Sun Apr 20 19:12:01 2015 - [info] Set master ping interval 1 seconds.
  27. Sun Apr 20 19:12:01 2015 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s B-dev18 -s B-dev10 --user=root --master_host=B-dev10 --master_ip=192.168.3.110 --master_port=3306
  28. Sun Apr 20 19:12:01 2015 - [info] Starting ping health check on 192.168.3.123(192.168.3.123:3306)..
  29. Sun Apr 20 19:12:01 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
  30.  
  31. [root@192.168.3.123 ~]#

其中”Ping(SELECT) succeeded, waiting until MySQL doesn’t respond..”说明整个系统已经开始监控了。

12.关闭MHA Manage监控

关闭很简单,使用masterha_stop命令完成。

  1. [root@192.168.3.123 ~]# masterha_stop --conf=/etc/masterha/app1.cnf
  2. Stopped app1 successfully.
  3. [1]+ Exit 1 nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover --manager_log=/data/mamanager.log
  4. [root@192.168.3.123 ~]#

13.配置VIP

vip配置可以采用两种方式,一种通过keepalived的方式管理虚拟ip的漂移;另外一种通过脚本方式启动虚拟ip的方式(即不需要keepalived或者heartbeat类似的软件)。这里采用的是keepalived的方式管理vip的漂移。

keepalived方式管理虚拟ip,keepalived配置方法如下:

(1)下载软件进行并进行安装(两台master,准确的说一台是master,另外一台是备选master,在没有切换以前是slave):

  1. [root@192.168.3.110 ~]# wget http://www.keepalived.org/software/keepalived-1.2.17.tar.gz
  2. tar xf keepalived-1.2.17.tar.gz
  3. cd keepalived-1.2.17
  4. ./configure --prefix=/usr/local/keepalived
  5. make && make install
  6. cp /usr/local/keepalived/etc/rc.d/init.d/keepalived /etc/init.d/
  7. cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
  8. mkdir /etc/keepalived
  9. cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
  10. cp /usr/local/keepalived/sbin/keepalived /usr/sbin/

(2)配置keepalived的配置文件,在master上配置(192.168.3.110)

  1. [root@192.168.3.110 ~]# cat /etc/keepalived/keepalived.conf
  2. ! Configuration File for keepalived
  3.  
  4. global_defs {
  5. notification_email {
  6. acassen@firewall.loc
  7. failover@firewall.loc
  8. sysadmin@firewall.loc
  9. }
  10. notification_email_from Alexandre.Cassen@firewall.loc
  11. smtp_server 127.0.0.1
  12. smtp_connect_timeout 30
  13. router_id LVS_DEVEL
  14. }
  15.  
  16. vrrp_instance VI_1 {
  17. state BACKUP
  18. interface eth0
  19. virtual_router_id 51
  20. priority 80
  21. advert_int 1
  22. nopreempt
  23. authentication {
  24. auth_type PASS
  25. auth_pass 1111
  26. }
  27. virtual_ipaddress {
  28. 192.168.3.250
  29. }
  30. }
  31.  
  32. [root@192.168.3.110 ~]#

其中router_id MySQL HA表示设定keepalived组的名称,将192.168.3.250这个虚拟ip绑定到该主机的eth0网卡上,并且设置了状态为backup模式,将keepalived的模式设置为非抢占模式(nopreempt),priority 80表示设置的优先级为80。下面的配置略有不同,但是都是一个意思。

(3)在候选master上配置(192.168.3.118)

  1. [root@192.168.3.118 ~]# cat /etc/keepalived/keepalived.conf
  2. ! Configuration File for keepalived
  3.  
  4. global_defs {
  5. notification_email {
  6. acassen@firewall.loc
  7. failover@firewall.loc
  8. sysadmin@firewall.loc
  9. }
  10. notification_email_from Alexandre.Cassen@firewall.loc
  11. smtp_server 127.0.0.1
  12. smtp_connect_timeout 30
  13. router_id LVS_DEVEL
  14. }
  15.  
  16. vrrp_instance VI_1 {
  17. state BACKUP
  18. interface eth0
  19. virtual_router_id 51
  20. priority 60
  21. advert_int 1
  22. nopreempt
  23. authentication {
  24. auth_type PASS
  25. auth_pass 1111
  26. }
  27. virtual_ipaddress {
  28. 192.168.3.250
  29. }
  30. }
  31.  
  32. [root@192.168.3.118 ~]#

(4)启动keepalived服务,在master上启动并查看日志

  1. [root@192.168.3.110 ~]# /etc/init.d/keepalived start
  2. Starting keepalived: [ OK ]
  3. [root@192.168.3.110 ~]# tail -f /var/log/messages
  4. Jun 20 20:22:16 192 Keepalived_healthcheckers[15334]: Opening file '/etc/keepalived/keepalived.conf'.
  5. Jun 20 20:22:16 192 Keepalived_healthcheckers[15334]: Configuration is using : 7231 Bytes
  6. Jun 20 20:22:16 192 kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
  7. Jun 20 20:22:16 192 kernel: IPVS: ipvs loaded.
  8. Jun 20 20:22:16 192 Keepalived_healthcheckers[15334]: Using LinkWatch kernel netlink reflector...
  9. Jun 20 20:22:19 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Transition to MASTER STATE
  10. Jun 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Entering MASTER STATE
  11. Jun 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) setting protocol VIPs.
  12. Jun 20 20:22:20 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.3.250
  13. Jun 20 20:22:20 192 Keepalived_healthcheckers[15334]: Netlink reflector reports IP 192.168.3.250 added
  14. Jun 20 20:22:25 192 Keepalived_vrrp[15335]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.3.250

发现已经将虚拟ip 192.168.3.250绑定了网卡eth0上。

(5)查看绑定情况

  1. [root@192.168.3.110 ~]# ip addr | grep eth0
  2. 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
  3. inet 192.168.3.110/24 brd 192.168.3.255 scope global eth0
  4. inet 192.168.3.250/32 scope global eth0
  5. [root@192.168.3.110 ~]#

在另外一台服务器,候选master上启动keepalived服务,并观察

  1.  [root@192.168.3.118 ~]# /etc/init.d/keepalived start
  2. Starting keepalived: [ OK ]
  3. [root@192.168.3.110 ~]# tail -f /var/log/messages
  4. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: Registering gratuitous ARP shared channel
  5. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: Opening file '/etc/keepalived/keepalived.conf'.
  6. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: Configuration is using : 62976 Bytes
  7. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: Using LinkWatch kernel netlink reflector...
  8. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: VRRP_Instance(VI_1) Entering BACKUP STATE
  9. Jun 20 20:26:18 192 Keepalived_vrrp[9472]: VRRP sockpool: [ifindex(3), proto(112), unicast(0), fd(10,11)]
  10. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP 192.168.3.250 added
  11. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP 192.168.3.118 added
  12. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP fe80::20c:29ff:fe9d:6a9e added
  13. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Netlink reflector reports IP fe80::20c:29ff:fe9d:6aa8 added
  14. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Registering Kernel netlink reflector
  15. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Registering Kernel netlink command channel
  16. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Opening file '/etc/keepalived/keepalived.conf'.
  17. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Configuration is using : 7231 Bytes
  18. Jun 20 20:26:18 192 kernel: IPVS: Registered protocols (TCP, UDP, AH, ESP)
  19. Jun 20 20:26:18 192 kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
  20. Jun 20 20:26:18 192 kernel: IPVS: ipvs loaded.
  21. Jun 20 20:26:18 192 Keepalived_healthcheckers[9471]: Using LinkWatch kernel netlink reflector...

从上面的信息可以看到keepalived已经配置成功。

注意:

上面两台服务器的keepalived都设置为了BACKUP模式,在keepalived中2种模式,分别是master->backup模式和backup->backup模式。这两种模式有很大区别。在master->backup模式下,一旦主库宕机,虚拟ip会自动漂移到从库,当主库修复后,keepalived启动后,还会把虚拟ip抢占过来,即使设置了非抢占模式(nopreempt)抢占ip的动作也会发生。在backup->backup模式下,当主库宕机后虚拟ip会自动漂移到从库上,当原主库恢复和keepalived服务启动后,并不会抢占新主的虚拟ip,即使是优先级高于从库的优先级别,也不会发生抢占。为了减少ip漂移次数,通常是把修复好的主库当做新的备库。

(6)MHA引入keepalived

要想把keepalived服务引入MHA,我们只需要修改切换是触发的脚本文件master_ip_failover即可,在该脚本中添加在master发生宕机时对keepalived的处理。

编辑脚本/usr/local/bin/master_ip_failover,修改后如下,这里完整贴出该脚本(192.168.3.123)。

在MHA Manager修改脚本修改后的内容如下:

  1. #!/usr/bin/env perl
  2.  
  3. use strict;
  4. use warnings FATAL => 'all';
  5.  
  6. use Getopt::Long;
  7.  
  8. my (
  9. $command, $ssh_user, $orig_master_host, $orig_master_ip,
  10. $orig_master_port, $new_master_host, $new_master_ip, $new_master_port
  11. );
  12.  
  13. my $vip = '192.168.3.250';
  14. my $ssh_start_vip = "/etc/init.d/keepalived start";
  15. my $ssh_stop_vip = "/etc/init.d/keepalived stop";
  16.  
  17. GetOptions(
  18. 'command=s' => \$command,
  19. 'ssh_user=s' => \$ssh_user,
  20. 'orig_master_host=s' => \$orig_master_host,
  21. 'orig_master_ip=s' => \$orig_master_ip,
  22. 'orig_master_port=i' => \$orig_master_port,
  23. 'new_master_host=s' => \$new_master_host,
  24. 'new_master_ip=s' => \$new_master_ip,
  25. 'new_master_port=i' => \$new_master_port,
  26. );
  27.  
  28. exit &main();
  29.  
  30. sub main {
  31.  
  32. print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
  33.  
  34. if ( $command eq "stop" || $command eq "stopssh" ) {
  35.  
  36. my $exit_code = 1;
  37. eval {
  38. print "Disabling the VIP on old master: $orig_master_host \n";
  39. &stop_vip();
  40. $exit_code = 0;
  41. };
  42. if ($@) {
  43. warn "Got Error: $@\n";
  44. exit $exit_code;
  45. }
  46. exit $exit_code;
  47. }
  48. elsif ( $command eq "start" ) {
  49.  
  50. my $exit_code = 10;
  51. eval {
  52. print "Enabling the VIP - $vip on the new master - $new_master_host \n";
  53. &start_vip();
  54. $exit_code = 0;
  55. };
  56. if ($@) {
  57. warn $@;
  58. exit $exit_code;
  59. }
  60. exit $exit_code;
  61. }
  62. elsif ( $command eq "status" ) {
  63. print "Checking the Status of the script.. OK \n";
  64. #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;
  65. exit 0;
  66. }
  67. else {
  68. &usage();
  69. exit 1;
  70. }
  71. }
  72.  
  73. # A simple system call that enable the VIP on the new mastersub start_vip() {
  74. `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
  75. }
  76. # A simple system call that disable the VIP on the old_master
  77. sub stop_vip() {
  78. return 0 unless ($ssh_user);
  79. `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
  80. }
  81.  
  82. sub usage {
  83. print
  84. "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  85. }

现在已经修改这个脚本了,我们现在打开在上面提到过的参数,再检查集群状态,看是否会报错。

  1. [root@192.168.3.123 ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf
  2. master_ip_failover_script= /usr/local/bin/master_ip_failover
  3. [root@192.168.3.123 ~]#
  1. [root@192.168.3.123 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
  2. Wed Jun 24 02:02:20 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
  3. Wed Jun 24 02:02:20 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
  4. Wed Jun 24 02:02:20 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
  5. Wed Jun 24 02:02:20 2015 - [info] MHA::MasterMonitor version 0.53.
  6. Wed Jun 24 02:02:20 2015 - [info] Dead Servers:
  7. Wed Jun 24 02:02:20 2015 - [info] Alive Servers:
  8. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.110(192.168.3.110:3306)
  9. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.118(192.168.3.118:3306)
  10. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.115(192.168.3.115:3306)
  11. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.117(192.168.3.117:3306)
  12. Wed Jun 24 02:02:20 2015 - [info] Alive Slaves:
  13. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.118(192.168.3.118:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  14. Wed Jun 24 02:02:20 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  15. Wed Jun 24 02:02:20 2015 - [info] Primary candidate for the new Master (candidate_master is set)
  16. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.115(192.168.3.115:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  17. Wed Jun 24 02:02:20 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  18. Wed Jun 24 02:02:20 2015 - [info] 192.168.3.117(192.168.3.117:3306) Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  19. Wed Jun 24 02:02:20 2015 - [info] Replicating from 192.168.3.110(192.168.3.110:3306)
  20. Wed Jun 24 02:02:20 2015 - [info] Current Alive Master: 192.168.3.110(192.168.3.110:3306)
  21. Wed Jun 24 02:02:20 2015 - [info] Checking slave configurations..
  22. Wed Jun 24 02:02:20 2015 - [info] read_only=1 is not set on slave 192.168.3.118(192.168.3.118:3306).
  23. Wed Jun 24 02:02:20 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.118(192.168.3.118:3306).
  24. Wed Jun 24 02:02:20 2015 - [info] read_only=1 is not set on slave 192.168.3.115(192.168.3.115:3306).
  25. Wed Jun 24 02:02:20 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.115(192.168.3.115:3306).
  26. Wed Jun 24 02:02:20 2015 - [info] read_only=1 is not set on slave 192.168.3.117(192.168.3.117:3306).
  27. Wed Jun 24 02:02:20 2015 - [warning] relay_log_purge=0 is not set on slave 192.168.3.117(192.168.3.117:3306).
  28. Wed Jun 24 02:02:20 2015 - [info] Checking replication filtering settings..
  29. Wed Jun 24 02:02:20 2015 - [info] binlog_do_db= , binlog_ignore_db=
  30. Wed Jun 24 02:02:20 2015 - [info] Replication filtering check ok.
  31. Wed Jun 24 02:02:20 2015 - [info] Starting SSH connection tests..
  32. Wed Jun 24 02:02:25 2015 - [info] All SSH connection tests passed successfully.
  33. Wed Jun 24 02:02:25 2015 - [info] Checking MHA Node version..
  34. Wed Jun 24 02:02:26 2015 - [info] Version check ok.
  35. Wed Jun 24 02:02:26 2015 - [info] Checking SSH publickey authentication settings on the current master..
  36. Wed Jun 24 02:02:27 2015 - [info] HealthCheck: SSH to 192.168.3.110 is reachable.
  37. Wed Jun 24 02:02:27 2015 - [info] Master MHA Node version is 0.53.
  38. Wed Jun 24 02:02:27 2015 - [info] Checking recovery script configurations on the current master..
  39. Wed Jun 24 02:02:27 2015 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.53 --start_file=mysql-bin.000002
  40. Wed Jun 24 02:02:27 2015 - [info] Connecting to root@192.168.3.110(192.168.3.110)..
  41. Creating /tmp if not exists.. ok.
  42. Checking output directory is accessible or not..
  43. ok.
  44. Binlog found at /var/lib/mysql, up to mysql-bin.000002
  45. Wed Jun 24 02:02:28 2015 - [info] Master setting check done.
  46. Wed Jun 24 02:02:28 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
  47. Wed Jun 24 02:02:28 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.118 --slave_ip=192.168.3.118 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  48. Wed Jun 24 02:02:28 2015 - [info] Connecting to root@192.168.3.118(192.168.3.118:22)..
  49. Checking slave recovery environment settings..
  50. Opening /var/lib/mysql/relay-log.info ... ok.
  51. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000004
  52. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000004
  53. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  54. done.
  55. Testing mysqlbinlog output.. done.
  56. Cleaning up test file(s).. done.
  57. Wed Jun 24 02:02:29 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.115 --slave_ip=192.168.3.115 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  58. Wed Jun 24 02:02:29 2015 - [info] Connecting to root@192.168.3.115(192.168.3.115:22)..
  59. Checking slave recovery environment settings..
  60. Opening /var/lib/mysql/relay-log.info ... ok.
  61. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000004
  62. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000004
  63. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  64. done.
  65. Testing mysqlbinlog output.. done.
  66. Cleaning up test file(s).. done.
  67. Wed Jun 24 02:02:30 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user=root --slave_host=192.168.3.117 --slave_ip=192.168.3.117 --slave_port=3306 --workdir=/tmp --target_version=5.6.25-log --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
  68. Wed Jun 24 02:02:30 2015 - [info] Connecting to root@192.168.3.117(192.168.3.117:22)..
  69. Checking slave recovery environment settings..
  70. Opening /var/lib/mysql/relay-log.info ... ok.
  71. Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000003
  72. Temporary relay log file is /var/lib/mysql/mysqld-relay-bin.000003
  73. Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
  74. done.
  75. Testing mysqlbinlog output.. done.
  76. Cleaning up test file(s).. done.
  77. Wed Jun 24 02:02:31 2015 - [info] Slaves settings check done.
  78. Wed Jun 24 02:02:31 2015 - [info]
  79. 192.168.3.110 (current master)
  80. +--192.168.3.118
  81. +--192.168.3.115
  82. +--192.168.3.117
  83.  
  84. Wed Jun 24 02:02:31 2015 - [info] Checking replication health on 192.168.3.118..
  85. Wed Jun 24 02:02:31 2015 - [info] ok.
  86. Wed Jun 24 02:02:31 2015 - [info] Checking replication health on 192.168.3.115..
  87. Wed Jun 24 02:02:31 2015 - [info] ok.
  88. Wed Jun 24 02:02:31 2015 - [info] Checking replication health on 192.168.3.117..
  89. Wed Jun 24 02:02:31 2015 - [info] ok.
  90. Wed Jun 24 02:02:31 2015 - [info] Checking master_ip_failover_script status:
  91. Wed Jun 24 02:02:31 2015 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.3.110 --orig_master_ip=192.168.3.110 --orig_master_port=3306
  92.  
  93.  
  94. IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
  95.  
  96. Checking the Status of the script.. OK
  97. Wed Jun 24 02:02:31 2015 - [info] OK.
  98. Wed Jun 24 02:02:31 2015 - [warning] shutdown_script is not defined.
  99. Wed Jun 24 02:02:31 2015 - [info] Got exit code 0 (Not master dead).
  100.  
  101. MySQL Replication Health is OK.
可以看见没有报错。 /usr/local/bin/master_ip_failover添加或者修改的内容意思是当主库数据库发生故障时,会触发MHA切换,MHA Manager会停掉主库上的keepalived服务,触发虚拟ip漂移到备选从库,从而完成切换。当然可以在keepalived里面引入脚本,这个脚本监控mysql是否正常运行,如果不正常,则调用该脚本杀掉keepalived进程。
keepalived的vip漂移原理及配置项说明可以查看文档小谈keepalived vip漂移原理与VRRP协议

14.MHA测试

到此为止,基本MHA集群已经配置完毕。接下来就是实际的测试环节了。通过一些测试来看一下MHA到底是如何进行工作的。下面将从MHA自动failover、手动failover两种方式来介绍MHA的工作情况。

14.1. 自动Failover(必须先启动MHA Manager,否则无法自动切换,当然手动切换不需要开启MHA Manager监控)

自动failover模拟测试的操作步骤如下:

(1)使用sysbench生成测试数据(使用yum快速安装),sysbench的使用说明可以查看文档sysbench使用详解

  1. yum install sysbench -y

在主库(192.168.3.110)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。

  1. [root@192.168.3.110 ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --mysql-db=sps --oltp-table-name=sbtest1 --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/var/lib/mysql/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex perpare

(2)停掉slave sql线程,模拟主从延时(192.168.3.118)

  1. mysql> stop slave io_thread;
  2. Query OK, 0 rows affected (0.08 sec)
  3.  
  4. mysql>

另外两台slave我们没有停止io线程,所以还在继续接收日志。

(3)模拟sysbench压力测试

在主库上(192.168.3.110)进行压力测试,持续时间为3分钟,产生大量的binlog。

  1. [root@192.168.3.110 ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --mysql-db=sps --oltp-table-name=sbtest1 --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/var/lib/mysql/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex run
  2. sysbench 0.4.12: multi-threaded system evaluation benchmark
  3.  
  4. Running the test with following options:
  5. Number of threads: 16
  6. Initializing random number generator from timer.
  7.  
  8.  
  9. Doing OLTP test.
  10. Running mixed OLTP test
  11. Using Uniform distribution
  12. Using "BEGIN" for starting transactions
  13. Using auto_inc on the id column
  14. Threads started!
  15. Time limit exceeded, exiting...
  16. (last message repeated 15 times)
  17. Done.
  18.  
  19. OLTP test statistics:
  20. queries performed:
  21. read: 15092
  22. write: 5390
  23. other: 2156
  24. total: 22638
  25. transactions: 1078 (5.92 per sec.)
  26. deadlocks: 0 (0.00 per sec.)
  27. read/write requests: 20482 (112.56 per sec.)
  28. other operations: 2156 (11.85 per sec.)
  29.  
  30. Test execution summary:
  31. total time: 181.9728s
  32. total number of events: 1078
  33. total time taken by event execution: 2910.4518
  34. per-request statistics:
  35. min: 934.29ms
  36. avg: 2699.86ms
  37. max: 7679.95ms
  38. approx. 95 percentile: 4441.47ms
  39.  
  40. Threads fairness:
  41. events (avg/stddev): 67.3750/1.49
  42. execution time (avg/stddev): 181.9032/0.11

(4)开启slave(192.168.3.118)上的IO线程,追赶落后于master的binlog。

  1. mysql> start slave io_thread;
  2. Query OK, 0 rows affected (0.00 sec)
  3. mysql>

(5)杀掉主库mysql进程,模拟主库发生故障,进行自动failover操作。

  1. [root@192.168.3.110 ~]# pkill -9 mysqld

(6)查看MHA切换日志,了解整个切换过程,在192.168.3.123上查看日志(注意:由于测试多次切换过程,下面摘录的是192.168.3.118为主机,192.168.3.110为候选主机的切换log):

  1. [root@192.168.3.123 ~]# tail -f /var/log/masterha/app1/manager.log
  2.  
  3. IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
  4.  
  5. Checking the Status of the script.. OK
  6. Thu Jun 11 13:52:09 2015 - [info]  OK.
  7. Thu Jun 11 13:52:09 2015 - [warning] shutdown_script is not defined.
  8. Thu Jun 11 13:52:09 2015 - [info] Set master ping interval 1 seconds.
  9. Thu Jun 11 13:52:09 2015 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s B-dev10 -s B-dev18  --user=root --master_host=B-dev18 --master_ip=192.168.3.118 --master_port=3306
  10. Thu Jun 11 13:52:09 2015 - [info] Starting ping health check on 192.168.3.118(192.168.3.118:3306)..
  11. Thu Jun 11 13:52:09 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
  12. Thu Jun 11 13:54:01 2015 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query)
  13. Thu Jun 11 13:54:02 2015 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.53 --binlog_prefix=mysql-bin
  14. Thu Jun 11 13:54:02 2015 - [info] Executing seconary network check script: /usr/local/bin/masterha_secondary_check -s B-dev10 -s B-dev18  --user=root --master_host=B-dev18 --master_ip=192.168.3.118 --master_port=3306  --user=root  --master_host=192.168.3.118  --master_ip=192.168.3.118  --master_port=3306
  15. Thu Jun 11 13:54:02 2015 - [info] HealthCheck: SSH to 192.168.3.118 is reachable.
  16. Thu Jun 11 13:54:02 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
  17. Thu Jun 11 13:54:02 2015 - [warning] Connection failed 1 time(s)..
  18. Monitoring server B-dev10 is reachable, Master is not reachable from B-dev10. OK.
  19. Thu Jun 11 13:54:03 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
  20. Thu Jun 11 13:54:03 2015 - [warning] Connection failed 2 time(s)..
  21. Monitoring server B-dev18 is reachable, Master is not reachable from B-dev18. OK.
  22. Thu Jun 11 13:54:04 2015 - [info] Master is not reachable from all other monitoring servers. Failover should start.
  23. Thu Jun 11 13:54:04 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
  24. Thu Jun 11 13:54:04 2015 - [warning] Connection failed 3 time(s)..
  25. Thu Jun 11 13:54:04 2015 - [warning] Master is not reachable from health checker!
  26. Thu Jun 11 13:54:04 2015 - [warning] Master 192.168.3.118(192.168.3.118:3306) is not reachable!
  27. Thu Jun 11 13:54:04 2015 - [warning] SSH is reachable.
  28. Thu Jun 11 13:54:04 2015 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
  29. Thu Jun 11 13:54:04 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
  30. Thu Jun 11 13:54:04 2015 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
  31. Thu Jun 11 13:54:04 2015 - [info] Reading server configurations from /etc/masterha/app1.cnf..
  32. Thu Jun 11 13:54:13 2015 - [info] Dead Servers:
  33. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.118(192.168.3.118:3306)
  34. Thu Jun 11 13:54:13 2015 - [info] Alive Servers:
  35. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.110(192.168.3.110:3306)
  36. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.115(192.168.3.115:3306)
  37. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.117(192.168.3.117:3306)
  38. Thu Jun 11 13:54:13 2015 - [info] Alive Slaves:
  39. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.110(192.168.3.110:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  40. Thu Jun 11 13:54:13 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  41. Thu Jun 11 13:54:13 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
  42. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.115(192.168.3.115:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  43. Thu Jun 11 13:54:13 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  44. Thu Jun 11 13:54:13 2015 - [info]   192.168.3.117(192.168.3.117:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  45. Thu Jun 11 13:54:13 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  46. Thu Jun 11 13:54:13 2015 - [info] Checking slave configurations..
  47. Thu Jun 11 13:54:13 2015 - [info]  read_only=1 is not set on slave 192.168.3.110(192.168.3.110:3306).
  48. Thu Jun 11 13:54:13 2015 - [warning]  relay_log_purge=0 is not set on slave 192.168.3.110(192.168.3.110:3306).
  49. Thu Jun 11 13:54:13 2015 - [info]  read_only=1 is not set on slave 192.168.3.115(192.168.3.115:3306).
  50. Thu Jun 11 13:54:13 2015 - [info]  read_only=1 is not set on slave 192.168.3.117(192.168.3.117:3306).
  51. Thu Jun 11 13:54:13 2015 - [info] Checking replication filtering settings..
  52. Thu Jun 11 13:54:13 2015 - [info]  Replication filtering check ok.
  53. Thu Jun 11 13:54:13 2015 - [info] Master is down!
  54. Thu Jun 11 13:54:13 2015 - [info] Terminating monitoring script.
  55. Thu Jun 11 13:54:13 2015 - [info] Got exit code 20 (Master dead).
  56. Thu Jun 11 13:54:13 2015 - [info] MHA::MasterFailover version 0.53.
  57. Thu Jun 11 13:54:13 2015 - [info] Starting master failover.
  58. Thu Jun 11 13:54:13 2015 - [info]
  59. Thu Jun 11 13:54:13 2015 - [info] * Phase 1: Configuration Check Phase..
  60. Thu Jun 11 13:54:13 2015 - [info]
  61. Thu Jun 11 13:54:19 2015 - [info] Dead Servers:
  62. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.118(192.168.3.118:3306)
  63. Thu Jun 11 13:54:19 2015 - [info] Checking master reachability via mysql(double check)..
  64. Thu Jun 11 13:54:19 2015 - [info]  ok.
  65. Thu Jun 11 13:54:19 2015 - [info] Alive Servers:
  66. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.110(192.168.3.110:3306)
  67. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.115(192.168.3.115:3306)
  68. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.117(192.168.3.117:3306)
  69. Thu Jun 11 13:54:19 2015 - [info] Alive Slaves:
  70. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.110(192.168.3.110:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  71. Thu Jun 11 13:54:19 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  72. Thu Jun 11 13:54:19 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
  73. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.115(192.168.3.115:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  74. Thu Jun 11 13:54:19 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  75. Thu Jun 11 13:54:19 2015 - [info]   192.168.3.117(192.168.3.117:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  76. Thu Jun 11 13:54:19 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  77. Thu Jun 11 13:54:22 2015 - [info] ** Phase 1: Configuration Check Phase completed.
  78. Thu Jun 11 13:54:22 2015 - [info]
  79. Thu Jun 11 13:54:22 2015 - [info] * Phase 2: Dead Master Shutdown Phase..
  80. Thu Jun 11 13:54:22 2015 - [info]
  81. Thu Jun 11 13:54:22 2015 - [info] Forcing shutdown so that applications never connect to the current master..
  82. Thu Jun 11 13:54:22 2015 - [info] Executing master IP deactivatation script:
  83. Thu Jun 11 13:54:22 2015 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=192.168.3.118 --orig_master_ip=192.168.3.118 --orig_master_port=3306 --command=stopssh --ssh_user=root 
  84.  
  85.  
  86. IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
  87.  
  88. Disabling the VIP on old master: 192.168.3.118
  89. Thu Jun 11 13:54:23 2015 - [info]  done.
  90. Thu Jun 11 13:54:23 2015 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
  91. Thu Jun 11 13:54:23 2015 - [info] * Phase 2: Dead Master Shutdown Phase completed.
  92. Thu Jun 11 13:54:23 2015 - [info]
  93. Thu Jun 11 13:54:23 2015 - [info] * Phase 3: Master Recovery Phase..
  94. Thu Jun 11 13:54:23 2015 - [info]
  95. Thu Jun 11 13:54:23 2015 - [info] * Phase 3.1: Getting Latest Slaves Phase..
  96. Thu Jun 11 13:54:23 2015 - [info]
  97. Thu Jun 11 13:54:23 2015 - [info] The latest binary log file/position on all slaves is mysql-bin.000001:145958297
  98. Thu Jun 11 13:54:23 2015 - [info] Latest slaves (Slaves that received relay log files to the latest):
  99. Thu Jun 11 13:54:23 2015 - [info]   192.168.3.110(192.168.3.110:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  100. Thu Jun 11 13:54:23 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  101. Thu Jun 11 13:54:23 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
  102. Thu Jun 11 13:54:23 2015 - [info] The oldest binary log file/position on all slaves is mysql-bin.000001:142466719
  103. Thu Jun 11 13:54:23 2015 - [info] Oldest slaves:
  104. Thu Jun 11 13:54:23 2015 - [info]   192.168.3.115(192.168.3.115:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  105. Thu Jun 11 13:54:23 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  106. Thu Jun 11 13:54:23 2015 - [info]
  107. Thu Jun 11 13:54:23 2015 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
  108. Thu Jun 11 13:54:23 2015 - [info]
  109. Thu Jun 11 13:54:24 2015 - [info] Fetching dead master's binary logs..
  110. Thu Jun 11 13:54:24 2015 - [info] Executing command on the dead master 192.168.3.118(192.168.3.118:3306): save_binary_logs --command=save --start_file=mysql-bin.000001  --start_pos=145958297 --binlog_dir=/var/lib/mysql --output_file=/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53
  111.   Creating /tmp if not exists..    ok.
  112. Concat binary/relay logs from mysql-bin.000001 pos 145958297 to mysql-bin.000001 EOF into /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog ..
  113.   Dumping binlog format description event, from position 0 to 120.. ok.
  114.   Dumping effective binlog data from /var/lib/mysql/mysql-bin.000001 position 145958297 to tail(145969603).. ok.
  115. Concat succeeded.
  116. Thu Jun 11 13:54:26 2015 - [info] scp from root@192.168.3.118:/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog succeeded.
  117. Thu Jun 11 13:54:26 2015 - [info] HealthCheck: SSH to 192.168.3.110 is reachable.
  118. Thu Jun 11 13:54:27 2015 - [info] HealthCheck: SSH to 192.168.3.115 is reachable.
  119. Thu Jun 11 13:54:28 2015 - [info] HealthCheck: SSH to 192.168.3.117 is reachable.
  120. Thu Jun 11 13:54:29 2015 - [info]
  121. Thu Jun 11 13:54:29 2015 - [info] * Phase 3.3: Determining New Master Phase..
  122. Thu Jun 11 13:54:29 2015 - [info]
  123. Thu Jun 11 13:54:29 2015 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
  124. Thu Jun 11 13:54:29 2015 - [info] Checking whether 192.168.3.110 has relay logs from the oldest position..
  125. Thu Jun 11 13:54:29 2015 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000001 --latest_rmlp=145958297 --target_mlf=mysql-bin.000001 --target_rmlp=142466719 --server_id=110 --workdir=/tmp --timestamp=20150611135413 --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info  --relay_dir=/var/lib/mysql/  :
  126.     Opening /var/lib/mysql/relay-log.info ... ok.
  127.     Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000003
  128. Fast relay log position search succeeded.
  129. Target relay log file/position found. start_file:mysqld-relay-bin.000003, start_pos:79548.
  130. Target relay log FOUND!
  131. Thu Jun 11 13:54:30 2015 - [info] OK. 192.168.3.110 has all relay logs.
  132. Thu Jun 11 13:54:30 2015 - [info] Searching new master from slaves..
  133. Thu Jun 11 13:54:30 2015 - [info]  Candidate masters from the configuration file:
  134. Thu Jun 11 13:54:30 2015 - [info]   192.168.3.110(192.168.3.110:3306)  Version=5.6.25-log (oldest major version between slaves) log-bin:enabled
  135. Thu Jun 11 13:54:30 2015 - [info]     Replicating from 192.168.3.118(192.168.3.118:3306)
  136. Thu Jun 11 13:54:30 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
  137. Thu Jun 11 13:54:30 2015 - [info]  Non-candidate masters:
  138. Thu Jun 11 13:54:30 2015 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
  139. Thu Jun 11 13:54:30 2015 - [info] New master is 192.168.3.110(192.168.3.110:3306)
  140. Thu Jun 11 13:54:30 2015 - [info] Starting master failover..
  141. Thu Jun 11 13:54:30 2015 - [info]
  142. From:
  143. 192.168.3.118 (current master)
  144. +--192.168.3.110
  145. +--192.168.3.115
  146. +--192.168.3.117
  147.  
  148. To:
  149. 192.168.3.110 (new master)
  150. +--192.168.3.115
  151. +--192.168.3.117
  152. Thu Jun 11 13:54:30 2015 - [info]
  153. Thu Jun 11 13:54:30 2015 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
  154. Thu Jun 11 13:54:30 2015 - [info]
  155. Thu Jun 11 13:54:30 2015 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
  156. Thu Jun 11 13:54:30 2015 - [info] Sending binlog..
  157. Thu Jun 11 13:54:31 2015 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog to root@192.168.3.110:/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog succeeded.
  158. Thu Jun 11 13:54:31 2015 - [info]
  159. Thu Jun 11 13:54:31 2015 - [info] * Phase 3.4: Master Log Apply Phase..
  160. Thu Jun 11 13:54:31 2015 - [info]
  161. Thu Jun 11 13:54:31 2015 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
  162. Thu Jun 11 13:54:31 2015 - [info] Starting recovery on 192.168.3.110(192.168.3.110:3306)..
  163. Thu Jun 11 13:54:31 2015 - [info]  Generating diffs succeeded.
  164. Thu Jun 11 13:54:31 2015 - [info] Waiting until all relay logs are applied.
  165. Thu Jun 11 13:59:46 2015 - [info]  done.
  166. Thu Jun 11 13:59:46 2015 - [info] Getting slave status..
  167. Thu Jun 11 13:59:46 2015 - [info] This slave(192.168.3.110)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000001:145958297). No need to recover from Exec_Master_Log_Pos.
  168. Thu Jun 11 13:59:46 2015 - [info] Connecting to the target slave host 192.168.3.110, running recover script..
  169. Thu Jun 11 13:59:46 2015 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.3.110 --slave_ip=192.168.3.110  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog --workdir=/tmp --target_version=5.6.25-log --timestamp=20150611135413 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
  170. Thu Jun 11 13:59:49 2015 - [info]
  171. MySQL client version is 5.6.25. Using --binary-mode.
  172. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog on 192.168.3.110:3306. This may take long time...
  173. Applying log files succeeded.
  174. Thu Jun 11 13:59:49 2015 - [info]  All relay logs were successfully applied.
  175. Thu Jun 11 13:59:49 2015 - [info] Getting new master's binlog name and position..
  176. Thu Jun 11 13:59:49 2015 - [info]  mysql-bin.000001:11760
  177. Thu Jun 11 13:59:49 2015 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.3.110', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=11760, MASTER_USER='repl', MASTER_PASSWORD='xxx';
  178. Thu Jun 11 13:59:49 2015 - [info] Executing master IP activate script:
  179. Thu Jun 11 13:59:49 2015 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.3.118 --orig_master_ip=192.168.3.118 --orig_master_port=3306 --new_master_host=192.168.3.110 --new_master_ip=192.168.3.110 --new_master_port=3306 
  180.  
  181.  
  182. IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===
  183.  
  184. Enabling the VIP - 192.168.3.250 on the new master - 192.168.3.110
  185. Thu Jun 11 13:59:49 2015 - [info]  OK.
  186. Thu Jun 11 13:59:49 2015 - [info] ** Finished master recovery successfully.
  187. Thu Jun 11 13:59:49 2015 - [info] * Phase 3: Master Recovery Phase completed.
  188. Thu Jun 11 13:59:49 2015 - [info]
  189. Thu Jun 11 13:59:49 2015 - [info] * Phase 4: Slaves Recovery Phase..
  190. Thu Jun 11 13:59:49 2015 - [info]
  191. Thu Jun 11 13:59:49 2015 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
  192. Thu Jun 11 13:59:49 2015 - [info]
  193. Thu Jun 11 13:59:49 2015 - [info] -- Slave diff file generation on host 192.168.3.115(192.168.3.115:3306) started, pid: 17896. Check tmp log /var/log/masterha/app1/192.168.3.115_3306_20150611135413.log if it takes time..
  194. Thu Jun 11 13:59:49 2015 - [info] -- Slave diff file generation on host 192.168.3.117(192.168.3.117:3306) started, pid: 17897. Check tmp log /var/log/masterha/app1/192.168.3.117_3306_20150611135413.log if it takes time..
  195. Thu Jun 11 13:59:52 2015 - [info]
  196. Thu Jun 11 13:59:52 2015 - [info] Log messages from 192.168.3.117 ...
  197. Thu Jun 11 13:59:52 2015 - [info]
  198. Thu Jun 11 13:59:49 2015 - [info] Server 192.168.3.117 received relay logs up to: mysql-bin.000001:142482387
  199. Thu Jun 11 13:59:49 2015 - [info] Need to get diffs from the latest slave(192.168.3.110) up to: mysql-bin.000001:145958297 (using the latest slave's relay logs)
  200. Thu Jun 11 13:59:50 2015 - [info] Connecting to the latest slave host 192.168.3.110, generating diff relay log files..
  201. Thu Jun 11 13:59:50 2015 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.3.117 --latest_mlf=mysql-bin.000001 --latest_rmlp=145958297 --target_mlf=mysql-bin.000001 --target_rmlp=142482387 --server_id=110 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog --workdir=/tmp --timestamp=20150611135413 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info  --relay_dir=/var/lib/mysql/
  202. Thu Jun 11 13:59:52 2015 - [info]
  203.     Opening /var/lib/mysql/relay-log.info ... ok.
  204.     Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000003
  205. Fast relay log position search succeeded.
  206. Target relay log file/position found. start_file:mysqld-relay-bin.000003, start_pos:95216.
  207. Concat binary/relay logs from mysqld-relay-bin.000003 pos 95216 to mysqld-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog ..
  208.   Dumping binlog format description event, from position 0 to 283.. ok.
  209.   Dumping effective binlog data from /var/lib/mysql/mysqld-relay-bin.000003 position 95216 to tail(3571126).. ok.
  210. Concat succeeded.
  211. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog .
  212. scp B-dev10:/tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog to root@192.168.3.117(22) succeeded.
  213. Thu Jun 11 13:59:52 2015 - [info]  Generating diff files succeeded.
  214. Thu Jun 11 13:59:52 2015 - [info] End of log messages from 192.168.3.117.
  215. Thu Jun 11 13:59:52 2015 - [info] -- Slave diff log generation on host 192.168.3.117(192.168.3.117:3306) succeeded.
  216. Thu Jun 11 13:59:53 2015 - [info]
  217. Thu Jun 11 13:59:53 2015 - [info] Log messages from 192.168.3.115 ...
  218. Thu Jun 11 13:59:53 2015 - [info]
  219. Thu Jun 11 13:59:49 2015 - [info] Server 192.168.3.115 received relay logs up to: mysql-bin.000001:142466719
  220. Thu Jun 11 13:59:49 2015 - [info] Need to get diffs from the latest slave(192.168.3.110) up to: mysql-bin.000001:145958297 (using the latest slave's relay logs)
  221. Thu Jun 11 13:59:50 2015 - [info] Connecting to the latest slave host 192.168.3.110, generating diff relay log files..
  222. Thu Jun 11 13:59:50 2015 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.3.115 --latest_mlf=mysql-bin.000001 --latest_rmlp=145958297 --target_mlf=mysql-bin.000001 --target_rmlp=142466719 --server_id=110 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog --workdir=/tmp --timestamp=20150611135413 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info  --relay_dir=/var/lib/mysql/
  223. Thu Jun 11 13:59:53 2015 - [info]
  224.     Opening /var/lib/mysql/relay-log.info ... ok.
  225.     Relay log found at /var/lib/mysql, up to mysqld-relay-bin.000003
  226. Fast relay log position search succeeded.
  227. Target relay log file/position found. start_file:mysqld-relay-bin.000003, start_pos:79548.
  228. Concat binary/relay logs from mysqld-relay-bin.000003 pos 79548 to mysqld-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog ..
  229.   Dumping binlog format description event, from position 0 to 283.. ok.
  230.   Dumping effective binlog data from /var/lib/mysql/mysqld-relay-bin.000003 position 79548 to tail(3571126).. ok.
  231. Concat succeeded.
  232. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog .
  233. scp B-dev10:/tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog to root@192.168.3.115(22) succeeded.
  234. Thu Jun 11 13:59:53 2015 - [info]  Generating diff files succeeded.
  235. Thu Jun 11 13:59:53 2015 - [info] End of log messages from 192.168.3.115.
  236. Thu Jun 11 13:59:53 2015 - [info] -- Slave diff log generation on host 192.168.3.115(192.168.3.115:3306) succeeded.
  237. Thu Jun 11 13:59:53 2015 - [info] Generating relay diff files from the latest slave succeeded.
  238. Thu Jun 11 13:59:53 2015 - [info]
  239. Thu Jun 11 13:59:53 2015 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
  240. Thu Jun 11 13:59:53 2015 - [info]
  241. Thu Jun 11 13:59:53 2015 - [info] -- Slave recovery on host 192.168.3.115(192.168.3.115:3306) started, pid: 17910. Check tmp log /var/log/masterha/app1/192.168.3.115_3306_20150611135413.log if it takes time..
  242. Thu Jun 11 13:59:53 2015 - [info] -- Slave recovery on host 192.168.3.117(192.168.3.117:3306) started, pid: 17911. Check tmp log /var/log/masterha/app1/192.168.3.117_3306_20150611135413.log if it takes time..
  243. Thu Jun 11 14:14:14 2015 - [info]
  244. Thu Jun 11 14:14:14 2015 - [info] Log messages from 192.168.3.117 ...
  245. Thu Jun 11 14:14:14 2015 - [info]
  246. Thu Jun 11 13:59:53 2015 - [info] Sending binlog..
  247. Thu Jun 11 13:59:53 2015 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog to root@192.168.3.117:/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog succeeded.
  248. Thu Jun 11 13:59:53 2015 - [info] Starting recovery on 192.168.3.117(192.168.3.117:3306)..
  249. Thu Jun 11 13:59:53 2015 - [info]  Generating diffs succeeded.
  250. Thu Jun 11 13:59:53 2015 - [info] Waiting until all relay logs are applied.
  251. Thu Jun 11 13:59:53 2015 - [info]  done.
  252. Thu Jun 11 13:59:54 2015 - [info] Getting slave status..
  253. Thu Jun 11 13:59:54 2015 - [info] This slave(192.168.3.117)'s Exec_Master_Log_Pos(mysql-bin.000001:142481989) does not equal to Read_Master_Log_Pos(mysql-bin.000001:142482387). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos.
  254. Thu Jun 11 13:59:54 2015 - [info] Saving local relay logs from exec pos to read pos on 192.168.3.117: from mysqld-relay-bin.000002:142482152 to the end of the relay log..
  255. Thu Jun 11 13:59:54 2015 - [info] Executing command : save_binary_logs --command=save --start_file=mysqld-relay-bin.000002  --start_pos=142482152 --output_file=/tmp/relay_from_exec_to_read_192.168.3.117_3306_20150611135413.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info  --binlog_dir=/var/lib/mysql/
  256. Thu Jun 11 13:59:54 2015 - [info]
  257.   Creating /tmp if not exists..    ok.
  258. Concat binary/relay logs from mysqld-relay-bin.000002 pos 142482152 to mysqld-relay-bin.000002 EOF into /tmp/relay_from_exec_to_read_192.168.3.117_3306_20150611135413.binlog ..
  259.   Dumping binlog format description event, from position 0 to 283.. ok.
  260.   Dumping effective binlog data from /var/lib/mysql/mysqld-relay-bin.000002 position 142482152 to tail(142482550).. ok.
  261. Concat succeeded.
  262. Thu Jun 11 13:59:54 2015 - [info] Connecting to the target slave host 192.168.3.117, running recover script..
  263. Thu Jun 11 13:59:54 2015 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.3.117 --slave_ip=192.168.3.117  --slave_port=3306 --apply_files=/tmp/relay_from_exec_to_read_192.168.3.117_3306_20150611135413.binlog,/tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog,/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog --workdir=/tmp --target_version=5.6.25-log --timestamp=20150611135413 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
  264. Thu Jun 11 14:14:12 2015 - [info]
  265. Concat all apply files to /tmp/total_binlog_for_192.168.3.117_3306.20150611135413.binlog ..
  266. Copying the first binlog file /tmp/relay_from_exec_to_read_192.168.3.117_3306_20150611135413.binlog to /tmp/total_binlog_for_192.168.3.117_3306.20150611135413.binlog.. ok.
  267.   Dumping binlog head events (rotate events), skipping format description events from /tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog.. dumped up to pos 283. ok.
  268. /tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog has effective binlog events from pos 283.
  269.   Dumping effective binlog data from /tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog position 283 to tail(3476193).. ok.
  270.   Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog.. dumped up to pos 120. ok.
  271. /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog has effective binlog events from pos 120.
  272.   Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog position 120 to tail(11426).. ok.
  273. Concat succeeded.
  274. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.3.117_3306.20150611135413.binlog .
  275. MySQL client version is 5.6.25. Using --binary-mode.
  276. Applying differential binary/relay log files /tmp/relay_from_exec_to_read_192.168.3.117_3306_20150611135413.binlog,/tmp/relay_from_read_to_latest_192.168.3.117_3306_20150611135413.binlog,/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog on 192.168.3.117:3306. This may take long time...
  277. Applying log files succeeded.
  278. Thu Jun 11 14:14:12 2015 - [info]  All relay logs were successfully applied.
  279. Thu Jun 11 14:14:12 2015 - [info]  Resetting slave 192.168.3.117(192.168.3.117:3306) and starting replication from the new master 192.168.3.110(192.168.3.110:3306)..
  280. Thu Jun 11 14:14:14 2015 - [info]  Executed CHANGE MASTER.
  281. Thu Jun 11 14:14:14 2015 - [info]  Slave started.
  282. Thu Jun 11 14:14:14 2015 - [info] End of log messages from 192.168.3.117.
  283. Thu Jun 11 14:14:14 2015 - [info] -- Slave recovery on host 192.168.3.117(192.168.3.117:3306) succeeded.
  284. Thu Jun 11 14:14:37 2015 - [info]
  285. Thu Jun 11 14:14:37 2015 - [info] Log messages from 192.168.3.115 ...
  286. Thu Jun 11 14:14:37 2015 - [info]
  287. Thu Jun 11 13:59:53 2015 - [info] Sending binlog..
  288. Thu Jun 11 13:59:53 2015 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog to root@192.168.3.115:/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog succeeded.
  289. Thu Jun 11 13:59:53 2015 - [info] Starting recovery on 192.168.3.115(192.168.3.115:3306)..
  290. Thu Jun 11 13:59:53 2015 - [info]  Generating diffs succeeded.
  291. Thu Jun 11 13:59:53 2015 - [info] Waiting until all relay logs are applied.
  292. Thu Jun 11 13:59:53 2015 - [info]  done.
  293. Thu Jun 11 13:59:53 2015 - [info] Getting slave status..
  294. Thu Jun 11 13:59:53 2015 - [info] This slave(192.168.3.115)'s Exec_Master_Log_Pos(mysql-bin.000001:142466642) does not equal to Read_Master_Log_Pos(mysql-bin.000001:142466719). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos.
  295. Thu Jun 11 13:59:53 2015 - [info] Saving local relay logs from exec pos to read pos on 192.168.3.115: from mysqld-relay-bin.000002:142466805 to the end of the relay log..
  296. Thu Jun 11 13:59:53 2015 - [info] Executing command : save_binary_logs --command=save --start_file=mysqld-relay-bin.000002  --start_pos=142466805 --output_file=/tmp/relay_from_exec_to_read_192.168.3.115_3306_20150611135413.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --relay_log_info=/var/lib/mysql/relay-log.info  --binlog_dir=/var/lib/mysql/
  297. Thu Jun 11 13:59:54 2015 - [info]
  298.   Creating /tmp if not exists..    ok.
  299. Concat binary/relay logs from mysqld-relay-bin.000002 pos 142466805 to mysqld-relay-bin.000002 EOF into /tmp/relay_from_exec_to_read_192.168.3.115_3306_20150611135413.binlog ..
  300.   Dumping binlog format description event, from position 0 to 283.. ok.
  301.   Dumping effective binlog data from /var/lib/mysql/mysqld-relay-bin.000002 position 142466805 to tail(142466882).. ok.
  302. Concat succeeded.
  303. Thu Jun 11 13:59:54 2015 - [info] Connecting to the target slave host 192.168.3.115, running recover script..
  304. Thu Jun 11 13:59:54 2015 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.3.115 --slave_ip=192.168.3.115  --slave_port=3306 --apply_files=/tmp/relay_from_exec_to_read_192.168.3.115_3306_20150611135413.binlog,/tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog,/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog --workdir=/tmp --target_version=5.6.25-log --timestamp=20150611135413 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
  305. Thu Jun 11 14:14:35 2015 - [info]
  306. Concat all apply files to /tmp/total_binlog_for_192.168.3.115_3306.20150611135413.binlog ..
  307. Copying the first binlog file /tmp/relay_from_exec_to_read_192.168.3.115_3306_20150611135413.binlog to /tmp/total_binlog_for_192.168.3.115_3306.20150611135413.binlog.. ok.
  308.   Dumping binlog head events (rotate events), skipping format description events from /tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog.. dumped up to pos 283. ok.
  309. /tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog has effective binlog events from pos 283.
  310.   Dumping effective binlog data from /tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog position 283 to tail(3491861).. ok.
  311.   Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog.. dumped up to pos 120. ok.
  312. /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog has effective binlog events from pos 120.
  313.   Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog position 120 to tail(11426).. ok.
  314. Concat succeeded.
  315. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.3.115_3306.20150611135413.binlog .
  316. MySQL client version is 5.6.25. Using --binary-mode.
  317. Applying differential binary/relay log files /tmp/relay_from_exec_to_read_192.168.3.115_3306_20150611135413.binlog,/tmp/relay_from_read_to_latest_192.168.3.115_3306_20150611135413.binlog,/tmp/saved_master_binlog_from_192.168.3.118_3306_20150611135413.binlog on 192.168.3.115:3306. This may take long time...
  318. Applying log files succeeded.
  319. Thu Jun 11 14:14:35 2015 - [info]  All relay logs were successfully applied.
  320. Thu Jun 11 14:14:35 2015 - [info]  Resetting slave 192.168.3.115(192.168.3.115:3306) and starting replication from the new master 192.168.3.110(192.168.3.110:3306)..
  321. Thu Jun 11 14:14:37 2015 - [info]  Executed CHANGE MASTER.
  322. Thu Jun 11 14:14:37 2015 - [info]  Slave started.
  323. Thu Jun 11 14:14:37 2015 - [info] End of log messages from 192.168.3.115.
  324. Thu Jun 11 14:14:37 2015 - [info] -- Slave recovery on host 192.168.3.115(192.168.3.115:3306) succeeded.
  325. Thu Jun 11 14:14:37 2015 - [info] All new slave servers recovered successfully.
  326. Thu Jun 11 14:14:37 2015 - [info]
  327. Thu Jun 11 14:14:37 2015 - [info] * Phase 5: New master cleanup phease..
  328. Thu Jun 11 14:14:37 2015 - [info]
  329. Thu Jun 11 14:14:37 2015 - [info] Resetting slave info on the new master..
  330. Thu Jun 11 14:14:37 2015 - [info]  192.168.3.110: Resetting slave info succeeded.
  331. Thu Jun 11 14:14:37 2015 - [info] Master failover to 192.168.3.110(192.168.3.110:3306) completed successfully.
  332. Thu Jun 11 14:14:37 2015 - [info] Deleted server1 entry from /etc/masterha/app1.cnf .
  333. Thu Jun 11 14:14:37 2015 - [info]
  334.  
  335. ----- Failover Report -----
  336.  
  337. app1: MySQL Master failover 192.168.3.118 to 192.168.3.110 succeeded
  338.  
  339. Master 192.168.3.118 is down!
  340.  
  341. Check MHA Manager logs at B-dev23:/var/log/masterha/app1/manager.log for details.
  342.  
  343. Started automated(non-interactive) failover.
  344. Invalidated master IP address on 192.168.3.118.
  345. The latest slave 192.168.3.110(192.168.3.110:3306) has all relay logs for recovery.
  346. Selected 192.168.3.110 as a new master.
  347. 192.168.3.110: OK: Applying all logs succeeded.
  348. 192.168.3.110: OK: Activated master IP address.
  349. 192.168.3.117: Generating differential relay logs up to 192.168.3.110 succeeded.
  350. 192.168.3.115: Generating differential relay logs up to 192.168.3.110 succeeded.
  351. Generating relay diff files from the latest slave succeeded.
  352. 192.168.3.117: OK: Applying all logs succeeded. Slave started, replicating from 192.168.3.110.
  353. 192.168.3.115: OK: Applying all logs succeeded. Slave started, replicating from 192.168.3.110.
  354. 192.168.3.110: Resetting slave info succeeded.
  355. Master failover to 192.168.3.110(192.168.3.110:3306) completed successfully.
看到最后的Master failover to 192.168.3.110(192.168.3.110:3306) completed successfully.说明备选master现在已经上位了。

从上面的输出可以看出整个MHA的切换过程,共包括以下的步骤:

  • 连接主机三次失败确认主机宕机,摘除宕机的vip
  • 检查从库中使用的最新binlog file/pos
  • 检查从库中使用的最老binlog file/pos
  • 获取宕掉主机的binlog:根据步骤1获取的binlog位置开始保存未同步到从机的binlog(导出binlog描述事件,导出影响的binlog数据);将导出的binlog发送到mha-manage服务器指定目录下
  • healthcheck:监控ssh到其他的机器是否正常
  • 检查最新binlog file/pos的从机是否有需要恢复其他从机所需的所有relay log
  • 如果其他从机跟最新从机pos不一致,需要最新从机的relay log进行一致性恢复
  • 检查是否设置候选主机
  • 开始候选主机切换:先检查候选主机是否有所有的relay log,没有首先从最新从机恢复,然后将mha收到宕机主机缺省的binlog发送给候选主机,识别差异的relay log并将差异的事件应用所有的relay log,执行vip漂移到候选主机,候选主机恢复结束
  • 其他从机relay log恢复:先检查从机是否有所有的relay log,没有首先从最新从机恢复,然后将mha收到宕机主机缺省的binlog发送给从机,识别差异的relay log并将差异的事件应用所有的relay log
  • 将所有从机指向为候选主机
启动MHA Manger监控,查看集群里面现在谁是master。
(7)检查切换后数据是主从数据是否一致
在主机和候选主机上分别安装percona-toolkit,安装流程及使用文档请查看percona-toolkit 简明教程
这里简单说下pt-table-checksum 的工作原理: pt-table-checksum 在主上执行检查语句在线检查mysql复制的一致性, 生成replace 语句,然后通过复制传递到从,再通过update更新master_src 的值。通过检测从上this_src 和master_src 的值从而判断复制是否一致。注意: 使用的时候选择业务地峰的时候运行,因为运行的时候会造成表的部分记录锁定。 使用–max-load 来指定最大的负载情况,如果达到那个负载这个暂停运行。 如果发现有不一致的数据,可以使用pt-table-sync 工具来修复。
在切换后的新主机上使用命令:
  1. [root@192.168.3.110 ~]# pt-table-checksum --nocheck-replication-filters --chunk-size=1000000 --no-check-binlog-format --replicate=sps.checksums --databases=sps h=127.0.0.1,u=root,p=123456,P=3306
  2. TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
  3. 06-23T18:06:38 0 0 0 1 0 3.885 sps.sps_account
  4. 06-23T18:06:42 0 0 3 1 0 3.118 sps.sps_activiti_goods
  5. 06-23T18:06:46 0 0 0 1 0 4.613 sps.sps_activiti_prop
  6. 06-23T18:06:49 0 0 2 1 0 3.013 sps.sps_activiti_re
  7. 06-23T18:06:52 0 0 3 1 0 2.858 sps.sps_activity
  8. 06-23T18:06:55 0 0 1 1 0 3.359 sps.sps_app_version
  9. 06-23T18:06:58 0 0 4 1 0 2.894 sps.sps_brand
  10. 06-23T18:07:02 0 0 3272 1 0 3.367 sps.sps_district
  11. 06-23T18:07:05 0 0 16 1 0 2.935 sps.sps_goods
  12. 06-23T18:07:08 0 0 6 1 0 3.761 sps.sps_goods_brand_cat
  13. 06-23T18:07:12 0 0 332 1 0 3.418 sps.sps_goods_category
  14. 06-23T18:07:16 0 0 0 1 0 4.175 sps.sps_goods_category_template
  15. 06-23T18:07:20 0 0 5 1 0 4.061 sps.sps_goods_comment
  16. 06-23T18:07:24 0 0 9 1 0 3.784 sps.sps_goods_desc
  17. 06-23T18:07:27 0 0 4 1 0 3.487 sps.sps_goods_favorite
  18. 06-23T18:07:33 0 0 14 1 0 5.847 sps.sps_goods_media
  19. 06-23T18:07:37 0 0 2 1 0 3.811 sps.sps_goods_recommend
  20. 06-23T18:07:41 0 0 2 1 0 3.623 sps.sps_goods_recommend_group
  21. 06-23T18:07:44 0 0 4 1 0 3.402 sps.sps_goods_recommend_group_cate
  22. 06-23T18:07:48 0 0 12 1 0 3.662 sps.sps_goods_sku_attribute_value
  23. 06-23T18:07:52 0 0 19 1 0 3.951 sps.sps_goods_stock
  24. 06-23T18:07:55 0 0 6 1 0 3.557 sps.sps_goods_template
  25. 06-23T18:07:58 0 0 2 1 0 3.226 sps.sps_goods_tmpl_sku_attribute_line
  26. 06-23T18:08:03 0 0 5 1 0 4.183 sps.sps_goods_tmpl_sku_attribute_line_value
  27. 06-23T18:08:08 0 0 208733 1 0 5.422 sps.sps_member
  28. 06-23T18:08:11 0 0 2 1 0 3.304 sps.sps_member_account_seq
  29. 06-23T18:08:15 0 0 27721 1 0 3.575 sps.sps_member_address
  30. 06-23T18:08:18 0 0 2 1 0 3.116 sps.sps_member_baby_birthday
  31. 06-23T18:08:21 0 0 1 1 0 2.876 sps.sps_member_grade_change_info
  32. 06-23T18:08:24 0 0 0 1 0 3.563 sps.sps_member_grade_dict
  33. 06-23T18:08:28 0 0 16 1 0 3.848 sps.sps_member_points_seq
  34. 06-23T18:08:31 0 0 0 1 0 2.958 sps.sps_menu
  35. 06-23T18:08:35 0 0 0 1 0 3.687 sps.sps_message
  36. 06-23T18:08:39 0 0 5 1 0 3.589 sps.sps_message_text
  37. 06-23T18:08:41 0 0 746 1 0 2.848 sps.sps_order
  38. 06-23T18:08:46 0 0 94 1 0 4.529 sps.sps_order_goods_detail
  39. 06-23T18:08:49 0 0 23 1 0 3.030 sps.sps_order_status
  40. 06-23T18:08:52 0 0 0 1 0 3.390 sps.sps_payment
  41. 06-23T18:08:57 0 0 13 1 0 4.559 sps.sps_promotion
  42. 06-23T18:09:00 0