How RocketMQ Achieves a Highly Reliable Deployment Across 2 IDCs

Xiao_Chuan, posted 2025/08/10 11:04:30
[Abstract] With three controllers acting as arbiters, RocketMQ 5.X can be deployed for high availability across 2 AZs.

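1. Deployment background

There are three IDCs. The two IDCs in city A are less than 10 km apart, with latency below 2 ms; the latency between the IDC in city B and the IDCs in city A is above 10 ms.
The workloads in city A need RocketMQ deployed across IDCs for high reliability: when one IDC fails, the RocketMQ cluster must keep serving reads and writes normally, with RPO = 0.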

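2. Deployment plan

RocketMQ 5.X supports the standard 2-AZ master-slave layouts:
  • 2m-noslave    two masters, no slaves
  • 2m-2s-sync    two masters, two slaves, synchronous replication
  • 2m-2s-async  two masters, two slaves, asynchronous replication

None of these layouts can fail over automatically between master and slave, so they cannot be used here. Instead, the controller's Raft election mechanism is used to elect the broker master automatically.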
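Why three controllers? A Raft-style election needs a strict majority of the configured peers, so a group of three tolerates the loss of any single member but not two. The sketch below only illustrates that arithmetic. The ids n0/n1/n2 are the controller ids used in this article's configuration; `has_quorum` and `survivable_failures` are hypothetical helper names, not RocketMQ APIs.

```python
from itertools import combinations

PEERS = ["n0", "n1", "n2"]  # controller ids, as listed in controllerDLegerPeers


def has_quorum(alive, peers=PEERS):
    """Return True if the alive controllers form a strict majority of peers."""
    alive_count = len(set(alive) & set(peers))
    return alive_count > len(peers) // 2


def survivable_failures(peers=PEERS):
    """List every set of failed controllers that still leaves a majority."""
    result = []
    for k in range(len(peers) + 1):
        for failed in combinations(peers, k):
            alive = [p for p in peers if p not in failed]
            if has_quorum(alive, peers):
                result.append(set(failed))
    return result


if __name__ == "__main__":
    for failed in survivable_failures():
        print("failed:", sorted(failed) or "none", "-> quorum kept")
```

With three peers, only the empty set and the three single-node failures keep a quorum, which is why one machine outside the two broker sites is enough to arbitrate.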


[Figure: rocketmq-2az.png]

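2.1 Configuring and starting namesrv and controller

The namesrv and the controller run in the same process. Prepare one configuration file on each machine; the controllerDLegerSelfId in it must match the entry for that machine's IP in controllerDLegerPeers.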

# cat conf/controller/cluster-3n-namesrv-plugin/namesrv-n1.conf
#Namesrv config
listenPort = 9876
enableControllerInNamesrv = true
 
#controller config
controllerDLegerGroup = group1
controllerDLegerPeers = n0-192.168.150.130:9999;n1-192.168.150.128:9999;n2-192.168.150.129:9999
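# Note: controllerDLegerSelfId differs in each namesrv/controller config and must match this machine's IP address in the list above
controllerDLegerSelfId = n1
# Controller data directory; it must be created in advance, otherwise the controller cannot work properly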
controllerStorePath = /opt/rmq-data/controller
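A mismatch between controllerDLegerSelfId and the peer list is easy to introduce when copying the file between machines. The following is a minimal sketch of a pre-start check, assuming the `id-host:port;...` peer format shown above; `parse_peers` and `check_self_id` are hypothetical helpers written for this article, not part of RocketMQ.

```python
PEERS = "n0-192.168.150.130:9999;n1-192.168.150.128:9999;n2-192.168.150.129:9999"


def parse_peers(peers):
    """Parse 'n0-host:port;n1-host:port' into {id: (host, port)}."""
    result = {}
    for entry in peers.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The id is everything before the first '-', the port follows the last ':'.
        node_id, _, addr = entry.partition("-")
        host, _, port = addr.rpartition(":")
        result[node_id] = (host, int(port))
    return result


def check_self_id(peers, self_id, local_ip):
    """Return a list of problems; an empty list means the pairing is consistent."""
    table = parse_peers(peers)
    if self_id not in table:
        return [f"{self_id} is not listed in controllerDLegerPeers"]
    host, _ = table[self_id]
    if host != local_ip:
        return [f"{self_id} maps to {host}, but this machine is {local_ip}"]
    return []


if __name__ == "__main__":
    # namesrv-n1.conf belongs on 192.168.150.128 in this article's layout.
    print(check_self_id(PEERS, "n1", "192.168.150.128") or "ok")
```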

Start namesrv (with the embedded controller) on each of the three machines:

#192.168.150.130
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n0.conf &

#192.168.150.128
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n1.conf &

#192.168.150.129
nohup ./bin/mqnamesrv -c conf/controller/cluster-3n-namesrv-plugin/namesrv-n2.conf &

After startup, the namesrv log can be found in /root/logs/rocketmqlogs/namesrv.log (or the configured path), for example:

2025-08-09 23:23:27 INFO DLedgerControllerRoleChangeHandler_1 - Controller n0 change role to Follower, leaderId:n2

Query the controller election state with the following command; n2 is the Leader:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999

#ControllerGroup	group1
#ControllerLeaderId	n2
#ControllerLeaderAddress	192.168.150.129:9999
#Peer:	n0:192.168.150.130:9999
#Peer:	n1:192.168.150.128:9999
#Peer:	n2:192.168.150.129:9999
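When scripting the failure tests it helps to read the leader out of this output automatically. Here is a small sketch that parses the text printed above; it assumes the `#Key<whitespace>value` layout seen in this article's transcript (which may differ in other versions), and `parse_controller_metadata` is a hypothetical helper name.

```python
SAMPLE = """\
#ControllerGroup\tgroup1
#ControllerLeaderId\tn2
#ControllerLeaderAddress\t192.168.150.129:9999
#Peer:\tn0:192.168.150.130:9999
#Peer:\tn1:192.168.150.128:9999
#Peer:\tn2:192.168.150.129:9999
"""


def parse_controller_metadata(text):
    """Parse `mqadmin getControllerMetaData` output into a dict."""
    meta = {"peers": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue
        parts = line[1:].split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "Peer:":
            # Peer values look like 'n0:192.168.150.130:9999'.
            peer_id, _, addr = value.partition(":")
            meta["peers"][peer_id] = addr
        else:
            # The tool prints the literal string 'null' when no leader exists.
            meta[key] = None if value == "null" else value
    return meta


if __name__ == "__main__":
    meta = parse_controller_metadata(SAMPLE)
    print("leader:", meta["ControllerLeaderId"], meta["ControllerLeaderAddress"])
```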


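2.2 Configuring and starting the brokers

On each machine that hosts brokers, prepare two configuration files, broker-a.conf and broker-b.conf (the start commands below use per-machine copies named broker-a-n1.conf, broker-b-n1.conf, and so on). The two brokers on one machine need different listenPort values and store directories; the cluster listing in section 3.1 shows ports 10800 and 10900.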

# cat conf/broker-a.conf

brokerClusterName=DefaultCluster
# Use broker-b in the other config file
brokerName=broker-a
# Must be -1, meaning the broker ID is assigned automatically
brokerId=-1
deleteWhen=04
fileReservedTime=48
# Must be SLAVE; the controller elects the MASTER
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
namesrvAddr=192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876
defaultTopicQueueNums=4
listenPort=10800
# Create the directories in advance
storePathRootDir=/opt/rmq-data/store
storePathCommitLog=/opt/rmq-data/store/commitlog
storePathConsumerQueue=/opt/rmq-data/store/consumequeue
storePathIndex=/opt/rmq-data/store/index
storeCheckpoint=/opt/rmq-data/store/checkpoint
abortFile=/opt/rmq-data/store/abort
enableControllerMode=true
controllerAddr=192.168.150.130:9999;192.168.150.128:9999;192.168.150.129:9999
# proxy must be enabled
enableProxy=true
proxyListenPort=6666
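Because a single wrong value (for example a fixed brokerRole, or two store paths pointing at one directory) can keep the broker from joining the election, a quick lint of the file before startup can save time. The sketch below checks only the settings this article calls out as required, plus a generic path-collision sanity check; `parse_properties` and `check_broker_conf` are hypothetical helpers, and the rule set is an assumption drawn from this article rather than an official checklist.

```python
SAMPLE = """\
brokerName=broker-a
brokerId=-1
brokerRole=SLAVE
enableControllerMode=true
controllerAddr=192.168.150.130:9999;192.168.150.128:9999;192.168.150.129:9999
storePathRootDir=/opt/rmq-data/store
storePathCommitLog=/opt/rmq-data/store/commitlog
storePathIndex=/opt/rmq-data/store/index
"""

# Values this article says are required in controller mode.
EXPECTED = {"enableControllerMode": "true", "brokerId": "-1", "brokerRole": "SLAVE"}


def parse_properties(text):
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_broker_conf(text):
    """Return a list of human-readable problems found in a broker config."""
    props = parse_properties(text)
    problems = []
    for key, want in EXPECTED.items():
        if props.get(key) != want:
            problems.append(f"{key} should be {want}, found {props.get(key)!r}")
    if not props.get("controllerAddr"):
        problems.append("controllerAddr is missing")
    # Sanity check: separate store paths should not share one directory.
    seen = {}
    for key, value in props.items():
        if key.startswith("storePath") and key != "storePathRootDir":
            if value in seen:
                problems.append(f"{key} and {seen[value]} both point at {value}")
            seen.setdefault(value, key)
    return problems


if __name__ == "__main__":
    print(check_broker_conf(SAMPLE) or "ok")
```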

Start broker-a and broker-b on n1 (192.168.150.128):

bin/mqbroker -c conf/broker-a-n1.conf  > log/broker-a.log 2>&1 &
bin/mqbroker -c conf/broker-b-n1.conf  > log/broker-b.log 2>&1 &

Start broker-a and broker-b on n2 (192.168.150.129):

bin/mqbroker -c conf/broker-a-n2.conf  > log/broker-a.log 2>&1 &
bin/mqbroker -c conf/broker-b-n2.conf  > log/broker-b.log 2>&1 &

Note: if the machine is short on memory (<16 GB), lower the Java heap settings in runbroker.sh, runserver.sh, and tools.sh. Here the maximum is set to 2 GB.

[root@localhost rocketmq-all-5.3.2-bin-release]# grep -E "Xms|Xmx|Xmn" bin/*.sh
bin/runbroker.sh:      JAVA_OPT="${JAVA_OPT} -Xmn2g -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=8 -XX:-UseParNewGC"
bin/runbroker.sh:JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g"
bin/runserver.sh:      JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g -Xmn1g -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=320m"
bin/runserver.sh:      JAVA_OPT="${JAVA_OPT} -server -Xms2g -Xmx2g -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=320m"
bin/tools.sh:JAVA_OPT="${JAVA_OPT} -server -Xms1g -Xmx1g -Xmn256m -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=128m"


3. Testing

3.1 Basic functionality

Query the cluster information: broker-a and broker-b on n2 (192.168.150.129) both show ACTIVATED=true, meaning they are masters and accept writes.

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.129:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  2.95          0.5500          true
DefaultCluster          broker-a                2     192.168.150.128:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  3-0(0.0w, 0.0, 0.0)               0  2.95          0.5600         false
DefaultCluster          broker-b                0     192.168.150.129:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  2.95          0.5500          true
DefaultCluster          broker-b                2     192.168.150.128:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  2-0(0.0w, 0.0, 0.0)               0  2.95          0.5600         false
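To track which node holds the master role during the failure tests, the listing can be reduced to a broker-to-master map. This is a rough sketch that assumes the column layout shown in this transcript (cluster, broker name, BID, address first, ACTIVATED last); `parse_cluster_list` and `masters` are hypothetical helpers, and the sample rows are copied from the output above.

```python
SAMPLE = """\
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.129:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  2.95          0.5500          true
DefaultCluster          broker-a                2     192.168.150.128:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  3-0(0.0w, 0.0, 0.0)               0  2.95          0.5600         false
DefaultCluster          broker-b                0     192.168.150.129:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  2.95          0.5500          true
DefaultCluster          broker-b                2     192.168.150.128:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  2-0(0.0w, 0.0, 0.0)               0  2.95          0.5600         false
"""


def parse_cluster_list(text):
    """Extract the leading columns and the ACTIVATED flag from each data row."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header, shell prompts, and anything without a numeric BID.
        if len(tokens) < 5 or not tokens[2].isdigit():
            continue
        rows.append({
            "cluster": tokens[0],
            "broker": tokens[1],
            "bid": int(tokens[2]),
            "addr": tokens[3],
            # The Timer column contains spaces, so read ACTIVATED from the end.
            "activated": tokens[-1] == "true",
        })
    return rows


def masters(rows):
    """Map each broker name to the address of its active master (BID 0)."""
    return {r["broker"]: r["addr"] for r in rows if r["bid"] == 0 and r["activated"]}


if __name__ == "__main__":
    print(masters(parse_cluster_list(SAMPLE)))
```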

Create a topic and list topics:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin updateTopic -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -c DefaultCluster
create topic to 192.168.150.129:10900 success.
create topic to 192.168.150.129:10800 success.
TopicConfig [topicName=demo1, readQueueNums=8, writeQueueNums=8, perm=RW-, topicFilterType=SINGLE_TAG, topicSysFlag=0, order=false, attributes={}]

# List topics; the last one is the topic just created
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin topicList -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -c DefaultCluster
#Cluster Name         #Topic                                            #Consumer Group
DefaultCluster        RMQ_SYS_TRANS_HALF_TOPIC
DefaultCluster        BenchmarkTest
DefaultCluster        OFFSET_MOVED_EVENT
DefaultCluster        TBW102
DefaultCluster        rmq_sys_REVIVE_LOG_DefaultCluster
DefaultCluster        SELF_TEST_TOPIC
DefaultCluster        DefaultCluster
DefaultCluster        SCHEDULE_TOPIC_XXXX
DefaultCluster        DefaultCluster_REPLY_TOPIC
DefaultCluster        rmq_sys_wheel_timer
DefaultCluster        rmq_sys_SYNC_BROKER_MEMBER_broker-b
DefaultCluster        rmq_sys_SYNC_BROKER_MEMBER_broker-a
DefaultCluster        RMQ_SYS_TRANS_OP_HALF_TOPIC
DefaultCluster        broker-b
DefaultCluster        broker-a
DefaultCluster        demo1

Produce and query messages:

# Produce and query with a key
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -p "hello, 5" -b broker-b -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-b                          1     SEND_OK                 AC11000138602F0E140B2C0B56300000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -p "hello, 6" -b broker-a -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-a                          1     SEND_OK                 AC11000138942F0E140B2C0B74D70000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876;192.168.150.128:9876" -t demo1 -k key
#Message ID                                        #QID                                  #Offset
AC1100012EA02F0E140B2B7DC4DA0000                      0                                        0
AC1100012ED52F0E140B2B7DDFD80000                      3                                        1
AC1100012FA02F0E140B2B83D1080000                      2                                        0
AC11000130082F0E140B2B848C0E0000                      5                                        0
AC110001303D2F0E140B2B84B1940000                      6                                        0
AC11000130722F0E140B2B84CF370000                      0                                        1
AC11000138602F0E140B2C0B56300000                      1                                        1
AC11000138942F0E140B2C0B74D70000                      1                                        0

So far the two RocketMQ broker nodes work normally. Next, simulate failures to verify the setup.

3.2 Simulating a failure of n1 (slave)

Shut down the n1 (192.168.150.128) machine and check the cluster state: the two slaves on n1 disappear, and the masters on n2 remain healthy.

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.129:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.05          0.5500          true
DefaultCluster          broker-b                0     192.168.150.129:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.05          0.5500          true

namesrv.log shows that the broker-a slave failure was detected:

2025-08-10 01:17:05 WARN DefaultBrokerHeartbeatManager_scheduledService_1 - The broker channel [id: 0xfcc366d9, L:/192.168.150.130:9999 ! R:/192.168.150.128:56116] expired, brokerInfo BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-a', brokerId=2}, expired 10000ms
2025-08-10 01:17:05 INFO DefaultBrokerHeartbeatManager_executorService_2 - Controller Manager received broker inactive event, clusterName: DefaultCluster, brokerName: broker-a, brokerId: 2
2025-08-10 01:17:05 WARN DefaultBrokerHeartbeatManager_executorService_2 - The broker with brokerId: 2 in broker-set: broker-a has been inactive
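To timestamp each failure during a test run, the "broker inactive event" lines can be pulled out of namesrv.log. The sketch below is one possible way to do it, assuming the log line format shown above stays the same; `inactive_events` is a hypothetical helper and the sample line is copied from this article's log.

```python
import re

SAMPLE = (
    "2025-08-10 01:17:05 INFO DefaultBrokerHeartbeatManager_executorService_2 - "
    "Controller Manager received broker inactive event, clusterName: DefaultCluster, "
    "brokerName: broker-a, brokerId: 2\n"
)

# Timestamp at the start of the line, then the three fields of the event.
EVENT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*broker inactive event, "
    r"clusterName: ([^,\s]+), brokerName: ([^,\s]+), brokerId: (\d+)"
)


def inactive_events(log_text):
    """Return (timestamp, cluster, broker_name, broker_id) for each inactive event."""
    events = []
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if match:
            ts, cluster, broker, broker_id = match.groups()
            events.append((ts, cluster, broker, int(broker_id)))
    return events


if __name__ == "__main__":
    for event in inactive_events(SAMPLE):
        print(event)
```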

Producing and querying still work:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -p "hello, 7" -b broker-a -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-a                          3     SEND_OK                 AC11000139C12F0E140B2C0DCA4D0000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -p "hello, 8" -b broker-b -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-a                          2     SEND_OK                 AC11000139F62F0E140B2C0DF09F0000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -k key
#Message ID                                        #QID                                  #Offset
AC11000138942F0E140B2C0B74D70000                      1                                        0
AC11000139C12F0E140B2C0DCA4D0000                      3                                        0

Start n1 to bring the cluster back to its normal state; n2 remains master.

# After 128 is brought back, the log shows:
2025-08-10 01:22:31 INFO ControllerRequestExecutorThread_2 - new broker registered, BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-a', brokerId=2}, brokerId:2
2025-08-10 01:22:31 INFO RemotingExecutorThread_6 - new broker registered, BrokerIdentityInfo [clusterName=DefaultCluster, brokerAddr=192.168.150.128:10800] HAService: 192.168.150.128:10801
2025-08-10 01:22:31 INFO ControllerRequestExecutorThread_4 - new broker registered, BrokerIdentityInfo{clusterName='DefaultCluster', brokerName='broker-b', brokerId=2}, brokerId:2
2025-08-10 01:22:31 INFO RemotingExecutorThread_8 - new broker registered, BrokerIdentityInfo [clusterName=DefaultCluster, brokerAddr=192.168.150.128:10900] HAService: 192.168.150.128:10901
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.129:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.15          0.5500          true
DefaultCluster          broker-a                2     192.168.150.128:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  1-0(0.0w, 0.0, 0.0)               0  3.15          0.5600         false
DefaultCluster          broker-b                0     192.168.150.129:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.15          0.5500          true
DefaultCluster          broker-b                2     192.168.150.128:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  1-0(0.0w, 0.0, 0.0)               0  3.15          0.5600         false

The recovered 128 has synchronized its data from n2, and a query now returns the messages written during the outage:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876;192.168.150.130:9876" -t demo1 -k key
#Message ID                                        #QID                                  #Offset
AC11000138942F0E140B2C0B74D70000                      1                                        0
AC11000139C12F0E140B2C0DCA4D0000                      3                                        0
AC11000139F62F0E140B2C0DF09F0000                      2                                        1
AC1100012EA02F0E140B2B7DC4DA0000                      0                                        0
AC1100012ED52F0E140B2B7DDFD80000                      3                                        1
AC1100012FA02F0E140B2B83D1080000                      2                                        0
AC11000130082F0E140B2B848C0E0000                      5                                        0
AC110001303D2F0E140B2B84B1940000                      6                                        0
AC11000130722F0E140B2B84CF370000                      0                                        1
AC11000138602F0E140B2C0B56300000                      1                                        1
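One way to make the RPO = 0 claim checkable is to compare the message IDs acknowledged with SEND_OK against the IDs returned by queryMsgByKey after recovery. The following sketch does that on captured command output; `acked_ids`, `queried_ids`, and `missing_after_recovery` are hypothetical helpers, and the sample text is copied from the transcripts in sections 3.1 and 3.2.

```python
import re

MSG_ID = re.compile(r"\b[0-9A-F]{32}\b")

SENT = """\
broker-b                          1     SEND_OK                 AC11000138602F0E140B2C0B56300000
broker-a                          1     SEND_OK                 AC11000138942F0E140B2C0B74D70000
broker-a                          3     SEND_OK                 AC11000139C12F0E140B2C0DCA4D0000
broker-a                          2     SEND_OK                 AC11000139F62F0E140B2C0DF09F0000
"""

QUERIED = """\
AC11000138942F0E140B2C0B74D70000                      1                                        0
AC11000139C12F0E140B2C0DCA4D0000                      3                                        0
AC11000139F62F0E140B2C0DF09F0000                      2                                        1
AC1100012EA02F0E140B2B7DC4DA0000                      0                                        0
AC1100012ED52F0E140B2B7DDFD80000                      3                                        1
AC1100012FA02F0E140B2B83D1080000                      2                                        0
AC11000130082F0E140B2B848C0E0000                      5                                        0
AC110001303D2F0E140B2B84B1940000                      6                                        0
AC11000130722F0E140B2B84CF370000                      0                                        1
AC11000138602F0E140B2C0B56300000                      1                                        1
"""


def acked_ids(send_output):
    """Collect message IDs from sendMessage lines that report SEND_OK."""
    ids = set()
    for line in send_output.splitlines():
        if "SEND_OK" in line:
            ids.update(MSG_ID.findall(line))
    return ids


def queried_ids(query_output):
    """Collect every message ID that appears in queryMsgByKey output."""
    return set(MSG_ID.findall(query_output))


def missing_after_recovery(send_output, query_output):
    """Return acknowledged IDs that the later query did not return."""
    return acked_ids(send_output) - queried_ids(query_output)


if __name__ == "__main__":
    print("missing:", sorted(missing_after_recovery(SENT, QUERIED)) or "none")
```

An empty result means every acknowledged message was found after the slave rejoined.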

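3.3 Simulating a failure of n2 (master)

Shut down the n2 (192.168.150.129) machine; n1 is elected master. The controller leader, previously n2, also moves to n1.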

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999

#ControllerGroup	group1
#ControllerLeaderId	n1
#ControllerLeaderAddress	192.168.150.128:9999
#Peer:	n0:192.168.150.130:9999
#Peer:	n1:192.168.150.128:9999
#Peer:	n2:192.168.150.129:9999

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.128:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.18          0.5600          true
DefaultCluster          broker-b                0     192.168.150.128:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.18          0.5600          true
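A failover shows up as a change in the master address of each broker name between two cluster listings. A minimal sketch of that comparison, using the addresses read from the listings in this article; `master_changes` is a hypothetical helper and the two maps are typed in by hand rather than fetched from the cluster.

```python
# Master address per broker name, read from the clusterList output
# before and after n2 was shut down.
BEFORE = {"broker-a": "192.168.150.129:10900", "broker-b": "192.168.150.129:10800"}
AFTER = {"broker-a": "192.168.150.128:10800", "broker-b": "192.168.150.128:10900"}


def master_changes(before, after):
    """Return {broker: (old_master, new_master)} for every broker whose master moved."""
    changes = {}
    for broker in sorted(set(before) | set(after)):
        old, new = before.get(broker), after.get(broker)
        if old != new:
            changes[broker] = (old, new)
    return changes


if __name__ == "__main__":
    for broker, (old, new) in master_changes(BEFORE, AFTER).items():
        print(f"{broker}: {old} -> {new}")
```

A broker that has no master at all after the failure appears with `None` as its new address.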

n1 (128), now elected master, accepts writes normally:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.128:9876;192.168.150.130:9876" -t demo1 -p "hello, 9" -b broker-b -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-b                          7     SEND_OK                 AC1100013B662F0E140B2C14C0860000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.128:9876;192.168.150.130:9876" -t demo1 -p "hello, 10" -b broker-a -k key
#Broker Name                      #QID  #Send Result            #MsgId
broker-b                          0     SEND_OK                 AC1100013B9B2F0E140B2C14F2850000

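At this point the test shows that, with the controller, the two broker nodes elect a master correctly regardless of whether the failed node is the master or the slave.

Start n2 (129) to restore the cluster.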


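3.4 Simulating a failure of n0 (arbiter namesrv and controller)

Shut down n0 (130), which hosts no broker: the roles of n1 and n2 do not change, and the whole cluster keeps reading and writing normally with no functional impact.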


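3.5 Simulating simultaneous failure of n1 (master) and n0 (controller)

With n0 still down, also shut down n1 (128), which currently holds the master role. The cluster fails: n2 (129) stays in the slave role and the ControllerLeader is null, because only one of the three controllers is left and a Raft majority (two of three) is no longer available.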

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999

#ControllerGroup	group1
#ControllerLeaderId	null
#ControllerLeaderAddress	null
#Peer:	n0:192.168.150.130:9999
#Peer:	n1:192.168.150.128:9999
#Peer:	n2:192.168.150.129:9999

Queries still work, but writes fail:

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin queryMsgByKey -n "192.168.150.129:9876" -t demo1 -k key
#Message ID                                        #QID                                  #Offset
AC11000138942F0E140B2C0B74D70000                      1                                        0
AC11000139C12F0E140B2C0DCA4D0000                      3                                        0
AC11000139F62F0E140B2C0DF09F0000                      2                                        1
AC1100012EA02F0E140B2B7DC4DA0000                      0                                        0
AC1100012ED52F0E140B2B7DDFD80000                      3                                        1
AC1100012FA02F0E140B2B83D1080000                      2                                        0
AC11000130082F0E140B2B848C0E0000                      5                                        0
AC110001303D2F0E140B2B84B1940000                      6                                        0
AC11000130722F0E140B2B84CF370000                      0                                        1
AC11000138602F0E140B2C0B56300000                      1                                        1
AC1100013B662F0E140B2C14C0860000                      7                                        1
AC1100013B9B2F0E140B2C14F2850000                      0                                        2
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 11" -k key
java.lang.RuntimeException: SendMessageCommand command failed
	at org.apache.rocketmq.tools.command.message.SendMessageCommand.execute(SendMessageCommand.java:137)
	at org.apache.rocketmq.tools.command.MQAdminStartup.main0(MQAdminStartup.java:181)
	at org.apache.rocketmq.tools.command.MQAdminStartup.main(MQAdminStartup.java:131)
Caused by: org.apache.rocketmq.client.exception.MQClientException: No route info of this topic: demo1

3.6 Recovering n0 (130)

Start the n0 machine; the cluster recovers quickly, elects a master, and resumes normal operation.

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin clusterList -n localhost:9876
#Cluster Name           #Broker Name            #BID  #Addr                  #Version              #InTPS(LOAD)                   #OutTPS(LOAD)  #Timer(Progress)        #PCWait(ms)  #Hour         #SPACE    #ACTIVATED
DefaultCluster          broker-a                0     192.168.150.129:10900  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.37          0.5500          true
DefaultCluster          broker-b                0     192.168.150.129:10800  V5_3_2                 0.00(0,0ms)               0.00(0,0ms|0,0ms)  0-0(0.0w, 0.0, 0.0)               0  3.37          0.5500          true
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin getControllerMetaData -a localhost:9999

#ControllerGroup	group1
#ControllerLeaderId	n0
#ControllerLeaderAddress	192.168.150.130:9999
#Peer:	n0:192.168.150.130:9999
#Peer:	n1:192.168.150.128:9999
#Peer:	n2:192.168.150.129:9999

[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 12" -k key -b broker-a
#Broker Name                      #QID  #Send Result            #MsgId
broker-a                          7     SEND_OK                 AC1100010B9B2F0E140B2C1F86620000
[root@localhost rocketmq-all-5.3.2-bin-release]# ./bin/mqadmin sendMessage -n "192.168.150.129:9876" -t demo1 -p "hello, 13" -k key -b broker-b
#Broker Name                      #QID  #Send Result            #MsgId
broker-a                          0     SEND_OK                 AC1100010BCF2F0E140B2C1FA4060000

Conclusion

With three controllers acting as arbiters, RocketMQ 5.X can be deployed for high availability across 2 AZs.

