Category: Linux
2013-06-08 10:31:57
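What is NewStart HA? What does it do? How do you set it up, and how do you use it? New technology always raises questions like these, so let's explore with them in mind.
HA stands for High Availability. NewStart HA is two-node cluster software that implements high availability in order to keep services running continuously. It is commonly used by enterprises with strict requirements for round-the-clock (N*24 hours) service continuity, such as companies in the telecommunications industry. With those basics covered, the rest of this article explains in detail how to build and use a two-node cluster on Linux.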
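I. Preparation
A craftsman who wants to do good work must first sharpen his tools: to set up and use NewStart HA efficiently on Linux, the groundwork has to be in place.
1. Some concepts: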
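- Node: a computer running the high-availability two-node cluster software.
- Work link: the link over which the cluster provides service to the outside, that is, the link from the server to the switch.
- Heartbeat link: a link that maintains the internal interconnection of the HA cluster software and carries heartbeat messages.
- Service: a set of resources related to a user application, usually comprising an application script that manages the user's processes (application), network resources, and storage resources. For example, for a user's Oracle database, the service includes the script that manages Oracle (for starting, stopping, and monitoring), an IP address, and the disk that needs to be mounted. A service may combine some or all of these resource types.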
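2. Hardware (two physical machines, both configured as follows):
- Three network cards: two are bonded to form the work link, and one serves as a heartbeat link (the total number of heartbeat links must be at least 2).
- Serial port: used for a serial heartbeat link; together with the network heartbeat link above, this brings the total to 2.
- Disk array: holds the shared data. It is advisable to set aside a 30-50 MB partition for the lock disk (an arbitration mechanism that protects data safety; optional but recommended; /dev/sdb1 here).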
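Before moving on, it can help to confirm that the work-link bond really has two member NICs. The following is a minimal sketch that counts the "Slave Interface:" entries in the Linux bonding status file; the bond name bond0 matches this article, while the function names are illustrative and not part of NewStart HA.

```shell
#!/bin/sh
# Count the slave interfaces listed in a Linux bonding status file.
# The default path assumes the bond is named bond0, as in this article.
count_bond_slaves() {
    status_file="${1:-/proc/net/bonding/bond0}"
    [ -r "$status_file" ] || return 1
    grep -c '^Slave Interface:' "$status_file"
}

# Succeed only when the bond has at least two slave interfaces.
bond_is_redundant() {
    n=$(count_bond_slaves "$1") || return 1
    [ "$n" -ge 2 ]
}
```

Calling bond_is_redundant with no argument on each server checks bond0; a non-zero exit status suggests the bond is missing or has fewer than two members.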
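3. Software:
- Operating system: SLES 11. All mainstream platforms are supported, such as SLES 9/10/11, Red Hat 5/6, and CGSL V3/V4.
- HA version 3.0.1.07, obtained from the NewStart official website; it was the latest version at the time of writing.
- Database: Oracle 10g
- Middleware: Tomcat 6.0
PS: Installing, configuring, and debugging the operating system, database, and middleware listed above is not covered here; plenty of reference material is available online. Before starting the steps below, all services had been tested on both servers and ran normally on each. Next, let's look at installing NewStart HA.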
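II. Installing NewStart HA
The installer downloaded from the website is an ISO file. Upload it to the server's /home directory using binary (bin) transfer mode, then mount it on /mnt: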
# mount -o loop /home/xxxx.iso /mnt
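Since the ISO has to be transferred in binary mode, one way to confirm that the upload was not corrupted is to compare digests of the local and uploaded copies. This is a sketch only, assuming sha256sum from GNU coreutils is available on both machines; the function names are illustrative.

```shell
#!/bin/sh
# Print the SHA-256 digest of a file (first field of sha256sum output).
file_digest() {
    sha256sum "$1" | awk '{print $1}'
}

# Succeed when the file's digest equals the expected digest string.
verify_digest() {
    [ "$(file_digest "$1")" = "$2" ]
}
```

For example, compute the digest on the machine where the ISO was downloaded, then run verify_digest on the server with the uploaded file and that digest. A mismatch usually means the transfer was made in text (ASCII) mode.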
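Installation process:
Run the installation script to start the installation, and choose 3 to install all components (main program + command-line management tool + web management tool):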
# /mnt/install
HA Version:
1)New Version:3.0.1.07
2)Cancel
please select Version [1-2]?1
NewStart HA Installation Program
Version: 3.0.1.07
Support email: ha-support@gd-linux.com
1)NewStart HA Server Program and CLI Administrative Tool
2)Web-based Administrative Tool (options)(version: 20121101)
3)All components
4)Cancel
select the components to be installed [1-4]? 3
Checking NewStart HA ... NOT running
Installing ...
Installing the /mnt/nsha/x86/sles9/newstartha-3.0.1.07-20130107.i586.rpm ...
Preparing... ########################################### [100%]
1:newstartha ########################################### [100%]
newstartha 0:off 1:off 2:off 3:on 4:off 5:on 6:off
Installing liblvm2clusterlock.so ok.
please enter the SN: 00TB24-FC0TCF-629A1H-B00D46    (enter the product license number; the one shown is a trial SN)
Make /etc/ha.d/lic/newstartha.key succeeded. [OK]
web-based administrative tool install, deploying, please wait...
jdk installed ok!
tomcat installed ok!
web-based administrative tool installed ok!
Create keys(/usr/lib/newstartha/keystore.exp 1), please wait...
Create tomcat.keystore OK.
Do you want to start web-based administrative tool automatically as a system service? y(es) or n(o)? y    (whether to start the web management tool automatically at boot)
Starting Web-based Administrative Tool Service ... [OK]
Please remember to change the default web password immediately!
The component(s) is installed completely.
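The HA program is now installed. Repeat the steps above on the other server; once both servers are done, continue below.
Applying for a license
After installation, apply for a license. On startup, HA verifies that the key and license files are valid, and it will not start if they are not. Steps:
1. Package the /etc/ha.d/lic/newstartha.key files from both servers (give them distinct names, such as newstartha.key_node1 and newstartha.key_node2, and download them in binary (bin) mode), then send them by email to apply for the license file.
2. Rename the license file you receive to newstartha.lic, upload it to the server in binary (bin) mode, and place it in the /etc/ha.d/lic/ directory.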
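Both servers produce a key file with the same name, so it is easy to mix them up when packaging. The sketch below copies a key to a name that carries a node label, following the newstartha.key_node1 / newstartha.key_node2 naming suggested above; the helper name and the output directory are illustrative choices.

```shell
#!/bin/sh
# Copy a node's key file to <out_dir>/newstartha.key_<label>.
# Arguments: key file path, node label, output directory.
collect_key() {
    key_file="$1"
    label="$2"
    out_dir="$3"
    [ -f "$key_file" ] || return 1
    cp "$key_file" "$out_dir/newstartha.key_$label"
}
```

On the first server this could be run as collect_key /etc/ha.d/lic/newstartha.key node1 /tmp, and on the second with the label node2.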
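Writing the HA scripts that control the services (Oracle and Tomcat)
An HA script defines how to start, stop, force-stop, and check a service program. NewStart HA provides script templates for mainstream applications such as Apache, Tomcat, and Oracle for reference. They are located in the /etc/ha.d/resource.d directory and are named in the form xxxx_example.ps.
To write the Oracle and Tomcat HA scripts, go to that directory, copy the oracle_example.ps and tomcat_example.ps templates, rename them oracle.ps and tomcat.ps respectively, and copy them to /home/script/. Then edit the few variables at the top of each script to match your environment, as follows: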
# vi /home/script/oracle.ps
#The following three variables should be set to proper values
ORACLE_HOME="/home/oracle_home"
ORACLE_SID="orcl"
ALERTLOG="${ORACLE_HOME}/admin/${ORACLE_SID}/bdump/alert_${ORACLE_SID}.log"
…
# vi /home/script/tomcat.ps
#The following variables should be set correctly
PORT=80 # tomcat listen port
BINPWD=/opt/NewStartHA/web/tomcat/bin # tomcat bin path
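To illustrate the four operations an HA script has to provide (start, stop, forced stop, check), here is a hypothetical skeleton. It is not the vendor template: the action names passed to ha_dispatch, the PID-file approach, and the sleep process standing in for the real application are all assumptions made for illustration. The templates in /etc/ha.d/resource.d define the real calling convention.

```shell
#!/bin/sh
# Hypothetical HA-style control script skeleton (NOT the vendor template).
PIDFILE="${PIDFILE:-/tmp/demo_app.pid}"

# check: succeed when the process recorded in PIDFILE is alive.
app_check() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

# start: idempotent, so starting an already running app still succeeds.
app_start() {
    app_check && return 0
    sleep 300 >/dev/null 2>&1 &    # stand-in for the real application
    echo $! > "$PIDFILE"
}

# stop: idempotent, so stopping an already stopped app still succeeds.
app_stop() {
    app_check && kill "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
}

# forcedstop: like stop, but uses SIGKILL.
app_forcedstop() {
    app_check && kill -9 "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
}

# Map an action name (assumed, not confirmed by the vendor) to a function.
ha_dispatch() {
    case "$1" in
        start)      app_start ;;
        stop)       app_stop ;;
        forcedstop) app_forcedstop ;;
        check)      app_check ;;
        *) echo "usage: start|stop|forcedstop|check" >&2; return 2 ;;
    esac
}
```

The design point worth copying is idempotence: the check-script run later in this article starts a resource that is already running and stops one that is already stopped, and expects both to pass.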
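III. Configuring NewStart HA
Configuration consists of two steps, cluster initialization and service initialization, which must be performed in that order. HA can be configured with either the command-line (cli) or the web management tool; the cli procedure is shown below.
Before configuring, confirm the following:
1. The host names of the two servers;
2. The NIC names used for the heartbeat and work links correspond and are identical on both servers, and every NIC has a static IP configured;
3. The floating IPs used to access Oracle and Tomcat;
4. The location of the HA scripts (/home/script/oracle.ps and /home/script/tomcat.ps);
5. The mount directory for the disk array (created when Oracle was installed; /home/db here);
6. Third-party IP list: optional. Configuring 3 to 5 IPs is recommended. These IPs must be on the same network segment as the work NIC and must not be the IPs of the two servers themselves; each node uses them to check whether its own network is working.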
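The two rules for third-party IPs (same network segment as the work NIC, and not one of the node IPs) can be checked mechanically before running the configuration. The following is a minimal sketch in plain shell arithmetic for IPv4; the function names are illustrative and not part of NewStart HA.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when two addresses fall in the same subnet for the given netmask.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Arguments: candidate IP, netmask, then the node IPs.
# Succeed when the candidate differs from every node IP and shares their subnet.
valid_third_party_ip() {
    cand="$1"
    mask="$2"
    shift 2
    for node in "$@"; do
        [ "$cand" = "$node" ] && return 1
        same_subnet "$cand" "$node" "$mask" || return 1
    done
    return 0
}
```

With the addresses used later in this article, valid_third_party_ip 192.168.1.19 255.255.255.0 192.168.1.92 192.168.1.93 succeeds, while passing 192.168.1.92 as the candidate fails.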
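Cluster initialization, command: cluster-init
Run the cli command at the shell prompt to enter the cli management tool, then run cluster-init. One more reminder before starting: throughout the cluster configuration that follows, the values entered after each prompt depend on your own environment, and the short annotations beside them are explanations; an annotation saying to just press Enter means the default is the recommended setting.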
cli:~>cluster-init
======================================
Cluster Initialization Utility
======================================
This utility sets up the initialization information of a 2-node cluster.
It prompts you for the following information:
- Hostname
- Information about the heartbeat channels
- How long between heartbeat
- How long to declare heartbeat fails
- Watchdog configuration
- Lock disk configuration
Please input cluster name:cluster_ora    (user-defined cluster name)
Input the first node name and IP:suse11-1 192.168.1.92
Input the second node name and IP:suse11-2 192.168.1.93
How long between heartbeats(in seconds)[1]:    (press Enter)
How long to declare heartbeat has broken(in seconds)[60]:    (press Enter)
Do you want to enable watchdog device ? (yes/no)[no]:    (press Enter)
Please choose multicast heartbeat channel:
0) eth0
1) bond0
Select a multicast heartbeat channel [0, 1]:0
Another multicast heartbeat channel? (yes/no)[yes]:no
Do you want to add a serial heartbeat channel? (yes/no)[yes]:    (press Enter)
Input serial heartbeat channel[/dev/ttyS0]:    (press Enter)
Another serial heartbeat channel? (yes/no)[yes]:no
Do you want to enable worklink_hb ? (yes/no)[yes]:    (press Enter)
Do you want to add third-party ip list ? [recommended 3-5 ip] (yes/no)[yes]:    (press Enter)
Please input a third-party ip address:192.168.1.19
Another thirdpart ip address? (yes/no)[yes]:    (press Enter)
Please input a third-party ip address:192.168.1.20
Another thirdpart ip address? (yes/no)[yes]:    (press Enter)
Please input a third-party ip address:192.168.1.21
Another thirdpart ip address? (yes/no)[yes]:no
Do you want to add a lock disk(recommend) ? (yes/no)[yes]:    (press Enter)
Please input the partition name (/dev/sdb):/dev/sdb1    (lock disk)
Warning:All data in /dev/sdb1 will be destroyed, sure to format it? (yes/no)[no]:yes
Do you want to enable kernel panic ? (yes/no)[no]:    (press Enter)
Please run service-init to initialize you services.
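Because cluster-init formats the lock disk partition and destroys its contents, it is worth confirming beforehand that the partition is not mounted anywhere. A minimal sketch, assuming the standard Linux /proc/mounts layout in which the device is the first field of each line (the function name is illustrative):

```shell
#!/bin/sh
# Succeed when the device appears as a mounted source in the mounts file.
# Arguments: device path, optional mounts file (defaults to /proc/mounts).
device_is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

For example, if device_is_mounted /dev/sdb1 succeeds on either server, stop and double-check the partition name before answering yes to the format prompt.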
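Cluster initialization is complete. Next comes service initialization.
Service initialization, command: service-init
Two services are configured here: first the Oracle database, then Tomcat. Run service-init in the cli management tool to initialize the services.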
cli:~>service-init
======================================
Service Initialization Utility
======================================
This utility sets up the initialization information of the service in the HA system.
It prompts you for the following information:
- Service information
- Application resource information
- Public net work interface information
- Floating IP address information.
- Block Disk information
- Mount information
- Raw Disk information
Input service name:oracle    (user-defined service name: oracle)
Is it enabled?(yes/no)[yes]:
Do you want to configure preferred node ? (yes/no)[no]:yes
Please choose preferred node:
0) suse11-1
1) suse11-2
Select a node: [0, 1]:0
Input start time out[60]:    (press Enter)
Input stop time out[120]:    (press Enter)
Input check interval[30]:    (press Enter)
Input check time out[60]:    (press Enter)
Input max error count[1]:    (press Enter)
Restart after check result is failed?(yes/no)[no]:    (press Enter)
Start service anyway when float IP exist?(yes/no)[no]:    (press Enter)
Do you want to add a application? (yes/no)[yes]:    (press Enter)
====== Application ======
Input name of application[oracle_app_0]:    (press Enter)
Input script of application [/etc/ha.d/resource.d/oracle]:/home/script/oracle.ps    (script that controls Oracle)
Is resource critical?[yes]:    (press Enter)
Is resource enable?[yes]:    (press Enter)
Add another application? (yes/no)[no]:    (press Enter)
Do you want to add a pubnic? (yes/no)[yes]:    (press Enter)
====== PubNIC ======
Input PubNIC name[oracle_net_card_0]:    (press Enter)
Is resource critical?[yes]:    (press Enter)
Please choose network device:
0) eth0
1) bond0
Select a network device [0, 1]:1
Add another pubnic? (yes/no)[no]:    (press Enter)
====== IP ======
Input IP name[oracle_ip_0]:    (press Enter)
Input IP address:192.168.1.96    (floating/service IP)
Input netmask[255.255.255.0]:
PubNIC of service:
0) oracle_net_card_0 suse11-1:bond0 suse11-2:bond0
Select a PubNIC: [0, 0]:0
Is resource critical?[yes]:    (press Enter)
Add another IP? (yes/no)[no]:    (press Enter)
Do you want to add a raw disk? (yes/no)[no]:    (press Enter)
Do you want to add a diskmount? (yes/no)[no]:yes
====== diskmount ======
Input diskmount name[oracle_diskmount_1]:    (press Enter)
Is resource critical?[yes]:    (press Enter)
Is resource enable?[yes]:    (press Enter)
0) disk    (ordinary block device)
1) nfs    (NFS device)
2) lvm    (logical volume device)
3) cancel
please choose a disk type? [0, 3]:0
Input block disk device[/dev/hda1]:/dev/sdb2    (device holding the shared data)
Input mountpoint:/home/db    (mount directory)
Input type of file system[ext3]:    (press Enter)
Input user[root]:oracle    (user that operates on the mount directory)
Input group[root]:oinstall    (group of that user)
Input mode[755]:    (press Enter)
Input options[rw]:    (press Enter)
Input the quota of the device[90]:    (press Enter)
do you want to stop service when the disk is readonly?[yes]:    (press Enter)
Add another diskmount? (yes/no)[no]:    (press Enter)
Add another service? (yes/no)[no]: yes
Input service name:tomcat    (user-defined service name: tomcat)
Is it enabled?(yes/no)[yes]:
Do you want to configure preferred node ? (yes/no)[no]:yes
Please choose preferred node:
0) suse11-1
1) suse11-2
Select a node: [0, 1]:1
Input start time out[60]:    (press Enter)
Input stop time out[120]:    (press Enter)
Input check interval[30]:    (press Enter)
Input check time out[60]:    (press Enter)
Input max error count[1]:    (press Enter)
Restart after check result is failed?(yes/no)[no]:    (press Enter)
Start service anyway when float IP exist?(yes/no)[no]:    (press Enter)
Do you want to add a application? (yes/no)[yes]:    (press Enter)
====== Application ======
Input name of application[tomcat_app_0]:    (press Enter)
Input script of application [/etc/ha.d/resource.d/tomcat]:/home/script/tomcat.ps    (script that controls Tomcat)
Is resource critical?[yes]:    (press Enter)
Is resource enable?[yes]:    (press Enter)
Add another application? (yes/no)[no]:    (press Enter)
Do you want to add a pubnic? (yes/no)[yes]:    (press Enter)
====== PubNIC ======
Input PubNIC name[tomcat_net_card_0]:    (press Enter)
Is resource critical?[yes]:    (press Enter)
Please choose network device:
0) eth0
1) bond0
Select a network device [0, 1]:1
Add another pubnic? (yes/no)[no]:    (press Enter)
====== IP ======
Input IP name[oracle_ip_0]:    (press Enter)
Input IP address:192.168.1.97    (floating/service IP)
Input netmask[255.255.255.0]:
PubNIC of service:
0) tomcat_net_card_0 suse11-1:bond0 suse11-2:bond0
Select a PubNIC: [0, 0]:0
Is resource critical?[yes]:    (press Enter)
Add another IP? (yes/no)[no]:    (press Enter)
Do you want to add a raw disk? (yes/no)[no]:    (press Enter)
Do you want to add a diskmount? (yes/no)[no]:    (press Enter)
Add another service? (yes/no)[no]:    (press Enter)
Please run cluster-start to start the HA system, or run cluster-restart to restart the HA system.
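Service initialization is complete. Do not start the cluster yet; leave it as it is. The reason is explained next.
Checking the HA scripts
The Oracle and Tomcat scripts were written earlier, but in a real environment you still need to verify that they can fully control the applications. For this, HA provides the check-script tool as a quick way to validate them. Before running it, confirm that the cluster is stopped; you can check with cluster-stat.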
cli:~>cluster-stat
The HA system is not running now.
cli:~>check-script
Current service:
0) name: oracle
1) name: tomcat
2) cancel
Select a(n) service [0, 2]:0
Current Application:
0) script: /home/script/oracle.ps
1) cancel
Select a(n) Application [0, 1]:0
Begin to test resource script......
Start resource oracle.ps: pass
Check resource oracle.ps when running: pass
Start resource oracle.ps when running: pass
Check resource oracle.ps when running: pass
Stop resource oracle.ps when running: pass
Check resource oracle.ps when stopped: pass
Stop resource oracle.ps when stopped: pass
Check resource oracle.ps when stopped: pass
Start resource oracle.ps: pass
Forcedstop resource oracle.ps when running: pass
Check resource oracle.ps when stopped: pass
Forcedstop resource oracle.ps when stopped: pass
Check resource oracle.ps when stopped: pass
End to test resource
(The Oracle script check succeeds: every item is pass, no problems.)
cli:~>check-script
Current service:
0) name: oracle
1) name: tomcat
2) cancel
Select a(n) service [0, 2]:1
Current Application:
0) script: /home/script/tomcat.ps
1) cancel
Select a(n) Application [0, 1]:0
Begin to test resource script......
Start resource tomcat.ps: pass
Check resource tomcat.ps when running: pass
Start resource tomcat.ps when running: pass
Check resource tomcat.ps when running: pass
Stop resource tomcat.ps when running: pass
Check resource tomcat.ps when stopped: pass
Stop resource tomcat.ps when stopped: pass
Check resource tomcat.ps when stopped: pass
Start resource tomcat.ps: pass
Forcedstop resource tomcat.ps when running: pass
Check resource tomcat.ps when stopped: pass
Forcedstop resource tomcat.ps when stopped: pass
Check resource tomcat.ps when stopped: pass
End to test resource
(The Tomcat script check succeeds: every item is pass, no problems.)
IV. Starting the cluster and checking its status
1. Start the cluster:
Enter the cli and start the cluster with the command cluster-start:
cli:~>cluster-start
[suse11-1]Starting High-Availability services:
Configuration file checked ok.
..done
Configuration file checked ok.
[suse11-2]Starting High-Availability services:
..done
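2. Check the cluster status:
The cluster status covers the nodes, the heartbeat links, the work links, and the services. Enter the cli and run cluster-stat (the display refreshes periodically) to view it.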
cli:~>cluster-stat
Press Ctrl-C or 'Q' to exit
Date: Fri Apr 26 09:45:13 2013

Member         status
suse11-1       UP
suse11-2       UP

WorkLink       suse11-1     suse11-2
bond0          ONLINE       ONLINE

HeartbeatLink  suse11-1     suse11-2     status
network        eth0         eth0         ONLINE
serial         /dev/ttyS0   /dev/ttyS0   ONLINE
LockDisk       /dev/sdb1    /dev/sdb1    ONLINE

ServiceName    suse11-1     suse11-2     Enable
*oracle        running      stopped      YES
tomcat         stoped       running      YES
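Reading the status display: both nodes (Member) are "UP" (normal); the work link (WorkLink) bond0 is "ONLINE" (normal) on both nodes; all heartbeat links (HeartbeatLink) are "ONLINE" (normal); the oracle service is currently running on suse11-1, and the tomcat service is currently running on suse11-2.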
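When the status needs to be checked from a script rather than by eye, the service table can be parsed. The sketch below assumes the saved cluster-stat text has one service per row under a ServiceName header row, with one column per node followed by the Enable column; that layout is an assumption about the real output, and the function name is illustrative.

```shell
#!/bin/sh
# Read cluster-stat-like text on stdin and print the node on which the
# named service is "running". A leading "*" on the service name is ignored.
service_node() {
    awk -v svc="$1" '
        $1 == "ServiceName" { for (i = 2; i < NF; i++) node[i] = $i; next }
        { name = $1; sub(/^\*/, "", name) }
        name == svc { for (i = 2; i < NF; i++) if ($i == "running") print node[i] }
    '
}
```

Fed the service table shown above, service_node oracle prints suse11-1 and service_node tomcat prints suse11-2.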
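V. Testing the cluster
The main goal is to verify that services can switch over properly. Only then can a service be taken over and keep running when the cluster suffers a failure (for example, one server goes down or a running service stops unexpectedly). The test procedure is as follows:
1. Check the cluster status: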
cli:~>cluster-stat
Press Ctrl-C or 'Q' to exit
Date: Fri Apr 26 11:45:13 2013

Member         status
suse11-1       UP
suse11-2       UP

WorkLink       suse11-1     suse11-2
bond0          ONLINE       ONLINE

HeartbeatLink  suse11-1     suse11-2     status
network        eth0         eth0         ONLINE
serial         /dev/ttyS0   /dev/ttyS0   ONLINE
LockDisk       /dev/sdb1    /dev/sdb1    ONLINE

ServiceName    suse11-1     suse11-2     Enable
*oracle        running      stopped      YES
tomcat         stoped       running      YES
The oracle service is currently running on node suse11-1, and tomcat on node suse11-2.
2. Migrate the services, command: service-migrate
cli:~>service-migrate
Select service to migrate:
Current service:
0) oracle
1) tomcat
2) cancel
Select a service [0, 2]:0    (migrate the oracle service)
Select the destination node:
Current node:
0) suse11-2
1) cancel
Select a node [0, 1]:0
Send message to migrate service oracle from suse11-1 to suse11-2.
cli:~>service-migrate
Select service to migrate:
Current service:
0) oracle
1) tomcat
2) cancel
Select a service [0, 2]:1    (migrate the tomcat service)
Select the destination node:
Current node:
0) suse11-1
1) cancel
Select a node [0, 1]:0
Send message to migrate service tomcat from suse11-2 to suse11-1.
3. Check the result of the migration:
cli:~>cluster-stat
Press Ctrl-C or 'Q' to exit
Date: Fri Apr 26 11:46:20 2013

Member         status
suse11-1       UP
suse11-2       UP

WorkLink       suse11-1     suse11-2
bond0          ONLINE       ONLINE

HeartbeatLink  suse11-1     suse11-2     status
network        eth0         eth0         ONLINE
serial         /dev/ttyS0   /dev/ttyS0   ONLINE
LockDisk       /dev/sdb1    /dev/sdb1    ONLINE

ServiceName    suse11-1     suse11-2     Enable
oracle         stoped       running      YES
*tomcat        running      stoped       YES
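Both services migrated successfully: oracle now runs on suse11-2 and tomcat on suse11-1. Perform the migration above at least once on each server. It is also advisable to simulate some common failures, for example to see whether HA starts automatically and rejoins the cluster after a node reboots, and whether services switch over to the standby machine when the active host reboots or shuts down.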
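During failure drills it is handy to measure whether a service comes back within an acceptable time. The helper below is a generic sketch that retries an arbitrary probe command until it succeeds or a timeout expires. The probe itself (for example a connection test against the floating IP) is left to the reader and is an assumption, not something NewStart HA provides.

```shell
#!/bin/sh
# Retry a command every <interval> seconds until it succeeds or roughly
# <timeout> seconds have elapsed. Arguments: timeout, interval, command...
wait_until() {
    timeout_s="$1"
    interval_s="$2"
    shift 2
    waited=0
    while ! "$@"; do
        [ "$waited" -ge "$timeout_s" ] && return 1
        sleep "$interval_s"
        waited=$(( waited + interval_s ))
    done
    return 0
}
```

For instance, after shutting down the active node, wait_until 120 5 followed by a probe command of your choice reports through its exit status whether the service became reachable again within about two minutes.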
That concludes this tour of NewStart HA. Enjoy it.