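GaussDB T Distributed Cluster: Recovering and Rebuilding a DN
A GaussDB distributed cluster reports one of four values for Cluster_state: Normal, Unavailable, Degraded, and Abnormal.
Normal: the cluster is available; all CNs and all primary DNs are online.
Unavailable: the cluster is unavailable. A Group has no primary DN; all CNs are offline; or the number of online nodes in a Group is less than or equal to half of that Group's total node count (Passive nodes are not counted in the total).
Degraded: the cluster is available, but the data has no redundant copy. Either a Group's DN is running unprotected (only the primary is running, and all standbys are stopped and cannot be started), or the CN is running unprotected (only one CN is online).
Abnormal: some CN or DN machine in the cluster is not in the online state.
The cluster status below shows that the cluster is in the abnormal state Degraded. A closer look shows that on node Gauss2 the standby instance DB1_2 has the status NEED_REPAIR. The DN in group_1 is running without a working standby and needs to be repaired.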
[omm@Gauss1 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Degraded
balanced : true
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:Gauss1 AZ:AZ1 STATUS:ONLINE IP:192.168.10.11
HOST:Gauss2 AZ:AZ1 STATUS:ONLINE IP:192.168.10.12
HOST:Gauss3 AZ:AZ1 STATUS:ONLINE IP:192.168.10.13
HOST:Gauss4 AZ:AZ1 STATUS:ONLINE IP:192.168.10.14
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:Gauss1 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:Gauss2 ID:602
INSTANCE:CM3 ROLE:slave STATUS:ONLINE HOST:Gauss3 ID:603
INSTANCE:CM4 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:Gauss1 ID:701 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD2 ROLE:leader STATUS:ONLINE HOST:Gauss2 ID:702 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD3 ROLE:follower STATUS:ONLINE HOST:Gauss3 ID:703 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:Gauss1 ID:401 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:Gauss2 ID:402 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:Gauss3 ID:403 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:Gauss4 ID:404 PORT:8000 DataDir:/opt/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:primary STATUS:ONLINE HOST:Gauss1 ID:1 PORT:40000 DataDir:/opt/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:standby STATUS:NEED_REPAIR HOST:Gauss2 ID:2 PORT:40021 DataDir:/opt/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:Gauss2 ID:3 PORT:40000 DataDir:/opt/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:Gauss3 ID:4 PORT:40021 DataDir:/opt/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:5 PORT:40000 DataDir:/opt/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:Gauss4 ID:6 PORT:40021 DataDir:/opt/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:Gauss1 ID:8 PORT:40021 DataDir:/opt/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:7 PORT:40000 DataDir:/opt/gaussdb/data/dn4
--------------------------------------------------Manage IP--------------------------------------------------
HOST:Gauss1 IP:192.168.10.11
HOST:Gauss2 IP:192.168.10.12
HOST:Gauss3 IP:192.168.10.13
HOST:Gauss4 IP:192.168.10.14
[omm@Gauss1 ~]$
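Finding the one unhealthy instance in this output by eye is easy to get wrong on a larger cluster. The following is a minimal sketch that filters the text, assuming the status output keeps the layout shown above. The function names `cluster_state` and `instances_not_online` are hypothetical helpers and are not part of the GaussDB tooling.

```shell
# Hypothetical helpers (not GaussDB commands). Both read status text on stdin.

# Print the value of the "cluster_state : ..." line.
cluster_state() {
    awk -F':' '/^[ \t]*cluster_state[ \t]*:/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the value
        print $2
    }'
}

# Print "INSTANCE HOST STATUS" for every instance whose STATUS is not ONLINE.
# Fields are matched by their "KEY:" prefix rather than by position, because
# the CN lines contain "ROLE:no role", which shifts the field positions.
instances_not_online() {
    awk '/^[ \t]*INSTANCE:/ {
        name = host = status = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^INSTANCE:/)    name   = substr($i, 10)
            else if ($i ~ /^HOST:/)   host   = substr($i, 6)
            else if ($i ~ /^STATUS:/) status = substr($i, 8)
        }
        if (status != "ONLINE") print name, host, status
    }'
}

# Demo on three lines copied from the output above.
sample=$(cat <<'EOF'
cluster_state : Degraded
INSTANCE:DB1_1 ROLE:primary STATUS:ONLINE HOST:Gauss1 ID:1 PORT:40000 DataDir:/opt/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:standby STATUS:NEED_REPAIR HOST:Gauss2 ID:2 PORT:40021 DataDir:/opt/gaussdb/data/dn1
EOF
)
printf '%s\n' "$sample" | cluster_state          # prints: Degraded
printf '%s\n' "$sample" | instances_not_online   # prints: DB1_2 Gauss2 NEED_REPAIR
```

On a real cluster the same filters could be fed from the status command, for example `gs_om -t status | instances_not_online`, assuming the command writes the same text when its output is piped.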
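GaussDB offers two ways to repair this situation: manual rebuild and automatic rebuild. Both perform the same operations internally. To show the rebuild process, the walkthrough below uses the manual rebuild. The automatic rebuild, which needs only a single command, is covered at the end of the article.
Start the rebuild:
1> Stop the instance to be recovered
[omm@Gauss2 ~]$ gs_om -t stop -h Gauss2 -I DB1_2
2> Delete the files that belong to this instance
[omm@Gauss2 ~]$ cd /opt/gaussdb/data/dn1/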
[omm@Gauss2 dn1]$ rm -rf archive_log/*
[omm@Gauss2 dn1]$ rm -rf data/*
[omm@Gauss2 ~]$ cd /opt/gaussdb/arch_log/db_group_1/archive_log
[omm@Gauss2 archive_log]$ ls
arch_0_1.arc arch_0_2.arc arch_0_3.arc arch_1_4.arc arch_1_5.arc arch_1_6.arc arch_1_7.arc
[omm@Gauss2 archive_log]$ rm -rf ./*
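Because `rm -rf` with a wildcard deletes whatever the current directory happens to be, a failed or mistyped `cd` can remove the wrong files. The following is a minimal sketch of a more defensive cleanup. The function name `empty_dir` is a hypothetical helper, not a GaussDB command, and the paths in the comments are the ones used in this walkthrough.

```shell
# Hypothetical helper (not a GaussDB command): delete everything inside a
# directory, but only if the path is absolute, is not "/", and exists.
empty_dir() {
    local dir=$1
    case "$dir" in
        /)  echo "refusing to empty /" >&2; return 1 ;;
        /*) ;;   # absolute path: continue
        *)  echo "refusing relative path: $dir" >&2; return 1 ;;
    esac
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # -mindepth 1 keeps the directory itself and removes only its contents,
    # including dot-files that "rm -rf ./*" would leave behind.
    find "$dir" -mindepth 1 -delete
}

# Usage for the three directories cleaned in this step (run on Gauss2 as omm):
# empty_dir /opt/gaussdb/data/dn1/archive_log
# empty_dir /opt/gaussdb/data/dn1/data
# empty_dir /opt/gaussdb/arch_log/db_group_1/archive_log
```

Passing the full path to every call removes the dependency on the current working directory, which is the main risk in the interactive version.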
3> Start the instance in the nomount state
[omm@Gauss2 ~]$ zengine nomount -D /opt/gaussdb/data/dn1
4> Run the rebuild command
[omm@Gauss2 ~]$ zsql / as sysdba -D /opt/gaussdb/data/dn1
Warning: SSL connection to server without CA certificate is insecure. Continue anyway? (y/n):y
connected.
SQL> build database;
Succeed.
SQL> shutdown immediate;
Succeed.
SQL> exit
[omm@Gauss2 ~]$
5> After the rebuild completes, start the problem instance
[omm@Gauss2 ~]$ gs_om -t start -h Gauss2 -I DB1_2
6> The rebuild is complete; check the cluster status.
[omm@Gauss2 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Normal
balanced : true
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:Gauss1 AZ:AZ1 STATUS:ONLINE IP:192.168.10.11
HOST:Gauss2 AZ:AZ1 STATUS:ONLINE IP:192.168.10.12
HOST:Gauss3 AZ:AZ1 STATUS:ONLINE IP:192.168.10.13
HOST:Gauss4 AZ:AZ1 STATUS:ONLINE IP:192.168.10.14
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:Gauss1 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:Gauss2 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:Gauss4 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:Gauss1 ID:701 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD2 ROLE:leader STATUS:ONLINE HOST:Gauss2 ID:702 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD3 ROLE:follower STATUS:ONLINE HOST:Gauss3 ID:703 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:Gauss1 ID:401 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:Gauss2 ID:402 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:Gauss3 ID:403 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:Gauss4 ID:404 PORT:8000 DataDir:/opt/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:primary STATUS:ONLINE HOST:Gauss1 ID:1 PORT:40000 DataDir:/opt/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:standby STATUS:ONLINE HOST:Gauss2 ID:2 PORT:40021 DataDir:/opt/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:Gauss2 ID:3 PORT:40000 DataDir:/opt/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:Gauss3 ID:4 PORT:40021 DataDir:/opt/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:5 PORT:40000 DataDir:/opt/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:Gauss4 ID:6 PORT:40021 DataDir:/opt/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:Gauss1 ID:8 PORT:40021 DataDir:/opt/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:7 PORT:40000 DataDir:/opt/gaussdb/data/dn4
--------------------------------------------------Manage IP--------------------------------------------------
HOST:Gauss1 IP:192.168.10.11
HOST:Gauss2 IP:192.168.10.12
HOST:Gauss3 IP:192.168.10.13
HOST:Gauss4 IP:192.168.10.14
[omm@Gauss2 ~]$
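The cluster has now been repaired: the cluster state has returned from Degraded to Normal, and the problem instance DB1_2 is back to normal (ONLINE). This completes the manual rebuild of a damaged DN in the cluster.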
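If the status is not yet Normal on the first check, the check can be repeated in a loop rather than by hand. The following is a minimal sketch, assuming `gs_om -t status` prints a `cluster_state : ...` line as shown above. The function name `wait_until_normal` and its two parameters are hypothetical and not part of the GaussDB tooling.

```shell
# Hypothetical helper (not a GaussDB command): poll "gs_om -t status" until
# cluster_state is Normal.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts.
wait_until_normal() {
    local attempts=$1 interval=$2 state i=1
    while [ "$i" -le "$attempts" ]; do
        # Extract the value of the "cluster_state : ..." line.
        state=$(gs_om -t status | awk -F':' '/^[ \t]*cluster_state[ \t]*:/ {
            gsub(/[ \t]/, "", $2); print $2
        }')
        if [ "$state" = "Normal" ]; then
            echo "cluster_state is Normal after $i attempt(s)"
            return 0
        fi
        echo "attempt $i: cluster_state is ${state:-unknown}" >&2
        sleep "$interval"
        i=$((i + 1))
    done
    return 1   # still not Normal after the last attempt
}

# Usage (the values are arbitrary): up to 30 checks, 10 seconds apart.
# wait_until_normal 30 10 || echo "cluster did not reach Normal" >&2
```

The function returns a non-zero exit code when the limit is reached, so it can gate later steps in a script.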
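Extension:
Automatic rebuild:
cm ctl build -H Gauss2 -I DB1_2
-H: host name of the instance to repair
-I: name of the instance to repair
As the steps above show, recovering a damaged DN in a large-scale distributed GaussDB deployment takes only a few commands, which keeps operations efficient.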
Reposted from Modb (墨天轮).