How to Deploy RocketMQ with High Reliability Across 2 IDCs
1. Deployment Background
2. Deployment Plan
- 2m-noslave: two masters, no slaves
- 2m-2s-sync: two masters, two slaves, synchronous replication
- 2m-2s-async: two masters, two slaves, asynchronous replication
None of these stock deployment templates supports automatic master/slave failover, so they cannot be used here. Instead, we rely on the controller's Raft-based election so that brokers elect a master automatically.
2.1. Configuring and starting namesrv and controller
We run namesrv and controller together in the same process. Prepare one configuration file on each machine; controllerDLegerSelfId must match that machine's own entry (IP) in controllerDLegerPeers.
# cat conf/controller/cluster-3n-namesrv-plugin/namesrv-n1.conf
#Namesrv config
listenPort = 9876
enableControllerInNamesrv = true
#controller config
controllerDLegerGroup = group1
controllerDLegerPeers = n0-192.168.150.130:9999;n1-192.168.150.128:9999;n2-192.168.150.129:9999
# Note: controllerDLegerSelfId differs in each namesrv/controller config and must match the IP list above
controllerDLegerSelfId = n1
# Controller data store path; it must be created in advance, otherwise the controller cannot work
controllerStorePath = /opt/rmq-data/controller
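The selfId/peers pairing above is easy to get wrong when copying configs between machines. A short script can sanity-check it (a sketch; parse_peers is a hypothetical helper that mirrors the controllerDLegerPeers format):

```python
# Sketch: verify that controllerDLegerSelfId appears in controllerDLegerPeers
# and recover the address this node is expected to bind.

def parse_peers(peers: str) -> dict:
    """Parse 'id-host:port;id-host:port;...' into {id: 'host:port'}."""
    result = {}
    for entry in peers.split(";"):
        node_id, addr = entry.split("-", 1)
        result[node_id] = addr
    return result

# Values copied from the config above
peers = parse_peers(
    "n0-192.168.150.130:9999;n1-192.168.150.128:9999;n2-192.168.150.129:9999"
)
self_id = "n1"

assert self_id in peers, f"controllerDLegerSelfId {self_id} not found in peers"
print(peers[self_id])  # -> 192.168.150.128:9999, the address this node listens on
```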
Start namesrv (with the embedded controller) on each of the 3 machines:
#192.168.150.130
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n0.conf &
#192.168.150.128
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n1.conf &
#192.168.150.129
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n2.conf &
After startup, check the namesrv log under /root/logs/rocketmqlogs/namesrv.log (or the configured path), for example:
2025-08-09 23:23:27 INFO DLedgerControllerRoleChangeHandler_1 - Controller n0 change role to Follower, leaderId:n2
Query the controller election state; here n2 is the Leader:
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999
#ControllerGroup group1
#ControllerLeaderId n2
#ControllerLeaderAddress 192.168.150.129:9999
#Peer: n0:192.168.150.130:9999
#Peer: n1:192.168.150.128:9999
#Peer: n2:192.168.150.129:9999
2.2. Configuring and starting the brokers
On each broker machine, prepare two configuration files: broker-a.conf and broker-b.conf.
# cat conf/broker-a.conf
brokerClusterName=DefaultCluster
# In the other file this becomes broker-b
brokerName=broker-a
# Must be -1, meaning the brokerId is assigned automatically via election
brokerId=-1
deleteWhen=04
fileReservedTime=48
# Must be SLAVE; the controller promotes one replica to MASTER
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
namesrvAddr=192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876
defaultTopicQueueNums=4
listenPort=10800
# Create these directories in advance
storePathRootDir=/opt/rmq-data/store
storePathCommitLog=/opt/rmq-data/store/commitlog
storePathConsumeQueue=/opt/rmq-data/store/consumequeue
storePathIndex=/opt/rmq-data/store/index
storeCheckpoint=/opt/rmq-data/store/checkpoint
abortFile=/opt/rmq-data/store/abort
enableControllerMode=true
controllerAddr=192.168.150.130:9999;192.168.150.128:9999;192.168.150.129:9999
# The proxy must be enabled
enableProxy=true
proxyListenPort=6666
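Since the store directories must exist before startup, they can be created up front. A minimal prep sketch (the subpaths below follow the usual RocketMQ store layout and the paths in the configs above):

```shell
# Create the controller and broker store directories referenced above.
ROOT="${ROOT:-/opt/rmq-data}"
mkdir -p "$ROOT/controller" \
         "$ROOT/store/commitlog" \
         "$ROOT/store/consumequeue" \
         "$ROOT/store/index"
ls -d "$ROOT/store"
```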
Start broker-a and broker-b on n1 (192.168.150.128):
bin/mqbroker -c conf/broker-a-n1.conf > log/broker-a.log 2>&1 &
bin/mqbroker -c conf/broker-b-n1.conf > log/broker-b.log 2>&1 &
Start broker-a and broker-b on n2 (192.168.150.129):
bin/mqbroker -c conf/broker-a-n2.conf > log/broker-a.log 2>&1 &
bin/mqbroker -c conf/broker-b-n2.conf > log/broker-b.log 2>&1 &
Note: if the machine has limited memory (<16GB), lower the JVM heap settings in runbroker.sh, runserver.sh, and tools.sh; here they were reduced to a maximum of 2GB.
[root@localhost rocketmq-all-5.3.2-bin-release]# grep -E "Xms|Xmx|Xmn" bin/*.sh
bin/runbroker.sh: JAVA_OPT="${JAVA_OPT} -Xmn2g -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:-UseParNewGC"
bin/runbroker.sh:JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g"
bin/runserver.sh: JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g -Xmn1g -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=320m"
bin/runserver.sh: JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=320m"
bin/tools.sh:JAVA_OPT="${JAVA_OPT} -server -Xms1g -Xmx1g -Xmn256m -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=128m"
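A quick way to apply such a change is sed over the JAVA_OPT lines. A sketch, demonstrated on a sample file rather than a live install (back up the real scripts first; the regex assumes flags shaped like the grep output above):

```shell
# Sketch: lower -Xms/-Xmx/-Xmn in runbroker.sh-style JAVA_OPT lines.
# Demonstrated on /tmp copies; on a real install, point it at bin/runbroker.sh etc.
cat > /tmp/runbroker-sample.sh <<'EOF'
JAVA_OPT="${JAVA_OPT} -server -Xms8g -Xmx8g"
JAVA_OPT="${JAVA_OPT} -Xmn4g -XX:+UseConcMarkSweepGC"
EOF

# -i.bak keeps a backup copy alongside the edited file
sed -i.bak -E 's/-Xms[0-9]+[gm]/-Xms2g/; s/-Xmx[0-9]+[gm]/-Xmx2g/; s/-Xmn[0-9]+[gm]/-Xmn1g/' \
  /tmp/runbroker-sample.sh
grep -E "Xms|Xmx|Xmn" /tmp/runbroker-sample.sh
```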
3. Testing
3.1. Basic functionality
Query the cluster info: broker-a and broker-b on n2 (192.168.150.129) are both in ACTIVATED=true state, i.e. they are the masters and accept writes.
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #Timer(Progress) #PCWait(ms) #Hour #SPACE #ACTIVATED
DefaultCluster broker-a 0 192.168.150.129:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 2.95 0.5500 true
DefaultCluster broker-a 2 192.168.150.128:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 3-0(0.0w, 0.0, 0.0) 0 2.95 0.5600 false
DefaultCluster broker-b 0 192.168.150.129:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 2.95 0.5500 true
DefaultCluster broker-b 2 192.168.150.128:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 2-0(0.0w, 0.0, 0.0) 0 2.95 0.5600 false
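The ACTIVATED column (together with BID 0) is what identifies the current master of each broker set. A small parser over output like the above (a sketch; the rows are trimmed to the relevant columns, and the layout follows the header shown):

```python
# Sketch: pick out the active master (BID 0, ACTIVATED true) of each broker
# set from mqadmin clusterList rows. Columns trimmed to
# cluster / name / BID / addr / version / ACTIVATED for readability.

rows = """\
DefaultCluster broker-a 0 192.168.150.129:10900 V5_3_2 true
DefaultCluster broker-a 2 192.168.150.128:10800 V5_3_2 false
DefaultCluster broker-b 0 192.168.150.129:10800 V5_3_2 true
DefaultCluster broker-b 2 192.168.150.128:10900 V5_3_2 false
"""

masters = {}
for line in rows.splitlines():
    cluster, name, bid, addr, _version, activated = line.split()
    if bid == "0" and activated == "true":
        masters[name] = addr

print(masters)
# -> {'broker-a': '192.168.150.129:10900', 'broker-b': '192.168.150.129:10800'}
```

Both masters live on 192.168.150.129 (n2), matching the transcript above.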
Create and query a topic
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin updateTopic -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -c DefaultCluster
create topic to 192.168.150.129:10900 success.
create topic to 192.168.150.129:10800 success.
TopicConfig [topicName=demo1, readQueueNums=8, writeQueueNums=8, perm=RW-, topicFilterType=SINGLE_TAG, topicSysFlag=0, order=false, attributes={}]
# List topics; the last one is the topic just created
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin topicList -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -c DefaultCluster
#Cluster Name #Topic #Consumer Group
DefaultCluster RMQ_SYS_TRANS_HALF_TOPIC
DefaultCluster BenchmarkTest
DefaultCluster OFFSET_MOVED_EVENT
DefaultCluster TBW102
DefaultCluster rmq_sys_REVIVE_LOG_DefaultCluster
DefaultCluster SELF_TEST_TOPIC
DefaultCluster DefaultCluster
DefaultCluster SCHEDULE_TOPIC_XXXX
DefaultCluster DefaultCluster_REPLY_TOPIC
DefaultCluster rmq_sys_wheel_timer
DefaultCluster rmq_sys_SYNC_BROKER_MEMBER_broker-b
DefaultCluster rmq_sys_SYNC_BROKER_MEMBER_broker-a
DefaultCluster RMQ_SYS_TRANS_OP_HALF_TOPIC
DefaultCluster broker-b
DefaultCluster broker-a
DefaultCluster demo1
Produce and query messages
# Produce with a key, then query by key
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -p "hello, 5" -b broker-b -k key
#Broker Name #QID #Send Result #MsgId
broker-b 1 SEND_OK AC11000138602F0E140B2C0B56300000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -p "hello, 6" -b broker-a -k key
#Broker Name #QID #Send Result #MsgId
broker-a 1 SEND_OK AC11000138942F0E140B2C0B74D70000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -k key
#Message ID #QID #Offset
AC1100012EA02F0E140B2B7DC4DA0000 0 0
AC1100012ED52F0E140B2B7DDFD80000 3 1
AC1100012FA02F0E140B2B83D1080000 2 0
AC11000130082F0E140B2B848C0E0000 5 0
AC110001303D2F0E140B2B84B1940000 6 0
AC11000130722F0E140B2B84CF370000 0 1
AC11000138602F0E140B2C0B56300000 1 1
AC11000138942F0E140B2C0B74D70000 1 0
So far the 2-node RocketMQ cluster works correctly; next we simulate failures.
3.2. Simulating an n1 (slave) failure
Shut down the n1 machine (192.168.150.128) and check the cluster state: the 2 slaves on n1 disappear, while the masters on n2 remain healthy.
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #Timer(Progress) #PCWait(ms) #Hour #SPACE #ACTIVATED
DefaultCluster broker-a 0 192.168.150.129:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.05 0.5500 true
DefaultCluster broker-b 0 192.168.150.129:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.05 0.5500 true
namesrv.log shows the broker-a slave failure being detected:
2025-08-10 01:17:05 WARN DefaultBrokerHeartbeatManager_scheduledService_1 - The broker channel [id: 0xfcc366d9, L:/192.168.150.130:9999 ! R:/192.168.150.128:56116] expired, brokerInfo BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-a', brokerId=2}, expired 10000ms
2025-08-10 01:17:05 INFO DefaultBrokerHeartbeatManager_executorService_2 - Controller Manager received broker inactive event, clusterName: DefaultCluster, brokerName: broker-a, brokerId: 2
2025-08-10 01:17:05 WARN DefaultBrokerHeartbeatManager_executorService_2 - The broker with brokerId: 2 in broker-set: broker-a has been inactive
Production and queries continue to work normally:
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -p "hello, 7" -b broker-a -k key
#Broker Name #QID #Send Result #MsgId
broker-a 3 SEND_OK AC11000139C12F0E140B2C0DCA4D0000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -p "hello, 8" -b broker-b -k key
#Broker Name #QID #Send Result #MsgId
broker-a 2 SEND_OK AC11000139F62F0E140B2C0DF09F0000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -k key
#Message ID #QID #Offset
AC11000138942F0E140B2C0B74D70000 1 0
AC11000139C12F0E140B2C0DCA4D0000 3 0
Start n1 again to restore the cluster to its normal state; n2 remains the master.
# After restoring 128, the logs show:
2025-08-10 01:22:31 INFO ControllerRequestExecutorThread_2 - new broker registered, BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-a', brokerId=2}, brokerId:2
2025-08-10 01:22:31 INFO RemotingExecutorThread_6 - new broker registered, BrokerIdentityInfo [clusterName=DefaultCluster, brokerAddr=192.168.150.128:10800] HAService: 192.168.150.128:10801
2025-08-10 01:22:31 INFO ControllerRequestExecutorThread_4 - new broker registered, BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-b', brokerId=2}, brokerId:2
2025-08-10 01:22:31 INFO RemotingExecutorThread_8 - new broker registered, BrokerIdentityInfo [clusterName=DefaultCluster, brokerAddr=192.168.150.128:10900] HAService: 192.168.150.128:10901
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #Timer(Progress) #PCWait(ms) #Hour #SPACE #ACTIVATED
DefaultCluster broker-a 0 192.168.150.129:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.15 0.5500 true
DefaultCluster broker-a 2 192.168.150.128:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 1-0(0.0w, 0.0, 0.0) 0 3.15 0.5600 false
DefaultCluster broker-b 0 192.168.150.129:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.15 0.5500 true
DefaultCluster broker-b 2 192.168.150.128:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 1-0(0.0w, 0.0, 0.0) 0 3.15 0.5600 false
The restored 128 has re-synced its data from n2; queries now also return the messages written during the outage:
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -k key
#Message ID #QID #Offset
AC11000138942F0E140B2C0B74D70000 1 0
AC11000139C12F0E140B2C0DCA4D0000 3 0
AC11000139F62F0E140B2C0DF09F0000 2 1
AC1100012EA02F0E140B2B7DC4DA0000 0 0
AC1100012ED52F0E140B2B7DDFD80000 3 1
AC1100012FA02F0E140B2B83D1080000 2 0
AC11000130082F0E140B2B848C0E0000 5 0
AC110001303D2F0E140B2B84B1940000 6 0
AC11000130722F0E140B2B84CF370000 0 1
AC11000138602F0E140B2C0B56300000 1 1
3.3. Simulating an n2 (master) failure
Shut down the n2 machine (192.168.150.129); n1 is elected master.
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999
#ControllerGroup group1
#ControllerLeaderId n1
#ControllerLeaderAddress 192.168.150.128:9999
#Peer: n0:192.168.150.130:9999
#Peer: n1:192.168.150.128:9999
#Peer: n2:192.168.150.129:9999
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #Timer(Progress) #PCWait(ms) #Hour #SPACE #ACTIVATED
DefaultCluster broker-a 0 192.168.150.128:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.18 0.5600 true
DefaultCluster broker-b 0 192.168.150.128:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.18 0.5600 true
n1 (128), the newly elected master, produces data normally:
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.128:9876;192.168.150.130:9876" -t demo1 -p "hello, 9" -b broker-b -k key
#Broker Name #QID #Send Result #MsgId
broker-b 7 SEND_OK AC1100013B662F0E140B2C14C0860000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.128:9876;192.168.150.130:9876" -t demo1 -p "hello, 10" -b broker-a -k key
#Broker Name #QID #Send Result #MsgId
broker-b 0 SEND_OK AC1100013B9B2F0E140B2C14F2850000
So far we have verified that, with the controller's help, the 2 broker nodes elect a new master correctly whether the failed node is the master or the slave.
Start n2 (129) to restore the cluster.
3.4. Simulating an n0 failure (the arbitrating namesrv and controller)
Shut down n0 (130), which hosts no broker: n1 and n2 keep their roles, and the whole cluster continues to read and write with no functional impact.
3.5. Simulating simultaneous failures of n1 (master) and n0 (controller)
Now also shut down n1 (128), the current master, while n0 is still down: the cluster fails. n2 (129) stays in the SLAVE role, and the ControllerLeader is null.
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999
#ControllerGroup group1
#ControllerLeaderId null
#ControllerLeaderAddress null
#Peer: n0:192.168.150.130:9999
#Peer: n1:192.168.150.128:9999
#Peer: n2:192.168.150.129:9999
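The null leader follows directly from Raft majority rules: with a group of 3 controllers, electing a leader requires at least 2 live votes. A minimal sketch of that arithmetic:

```python
# Sketch: DLedger (Raft) can elect a leader only while a strict majority
# of the controller group is alive.

def has_quorum(total: int, alive: int) -> bool:
    """True if the alive nodes form a strict majority of the group."""
    return alive > total // 2

# The 3-controller group from this deployment:
assert has_quorum(3, 3)      # all up: leader electable
assert has_quorum(3, 2)      # only n0 down (section 3.4): still electable
assert not has_quorum(3, 1)  # n0 and n1 down (section 3.5): no leader -> null
```

This is also why brokers cannot fail over here: without a controller leader, no replica can be promoted to master, so writes stop while reads from the surviving slave still work.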
Queries still work, but writes fail:
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876" -t demo1 -k key
#Message ID #QID #Offset
AC11000138942F0E140B2C0B74D70000 1 0
AC11000139C12F0E140B2C0DCA4D0000 3 0
AC11000139F62F0E140B2C0DF09F0000 2 1
AC1100012EA02F0E140B2B7DC4DA0000 0 0
AC1100012ED52F0E140B2B7DDFD80000 3 1
AC1100012FA02F0E140B2B83D1080000 2 0
AC11000130082F0E140B2B848C0E0000 5 0
AC110001303D2F0E140B2B84B1940000 6 0
AC11000130722F0E140B2B84CF370000 0 1
AC11000138602F0E140B2C0B56300000 1 1
AC1100013B662F0E140B2C14C0860000 7 1
AC1100013B9B2F0E140B2C14F2850000 0 2
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 11" -k key
java.lang.RuntimeException: SendMessageCommand command failed
at org.apache.rocketmq.tools.command.message.SendMessageCommand.execute(SendMessageCommand.java:137)
at org.apache.rocketmq.tools.command.MQAdminStartup.main0(MQAdminStartup.java:181)
at org.apache.rocketmq.tools.command.MQAdminStartup.main(MQAdminStartup.java:131)
Caused by: org.apache.rocketmq.client.exception.MQClientException: No route info of this topic: demo1
3.6. Simulating the recovery of n0 (130)
Start the n0 machine (130); the cluster recovers quickly, a master is elected, and normal operation resumes.
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #Timer(Progress) #PCWait(ms) #Hour #SPACE #ACTIVATED
DefaultCluster broker-a 0 192.168.150.129:10900 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.37 0.5500 true
DefaultCluster broker-b 0 192.168.150.129:10800 V5_3_2 0.00(0,0ms) 0.00(0,0ms|0,0ms) 0-0(0.0w, 0.0, 0.0) 0 3.37 0.5500 true
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999
#ControllerGroup group1
#ControllerLeaderId n0
#ControllerLeaderAddress 192.168.150.130:9999
#Peer: n0:192.168.150.130:9999
#Peer: n1:192.168.150.128:9999
#Peer: n2:192.168.150.129:9999
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 12" -k key -b broker-a
#Broker Name #QID #Send Result #MsgId
broker-a 7 SEND_OK AC1100010B9B2F0E140B2C1F86620000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 13" -k key -b broker-b
#Broker Name #QID #Send Result #MsgId
broker-a 0 SEND_OK AC1100010BCF2F0E140B2C1FA4060000
Conclusion
With 3 controllers providing arbitration, RocketMQ 5.x can achieve a highly available 2-AZ deployment.