GaussDB T Distributed Cluster: Recovering and Rebuilding a DN

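A GaussDB distributed cluster reports one of four values for Cluster_state: Normal, Unavailable, Degraded, and Abnormal.
Normal: the cluster is available, and all CNs and DN primaries are online.
Unavailable: the cluster is not available. Either some Group's DN has no primary and all CNs are offline, or the number of online nodes in some Group is less than or equal to half of that Group's total node count (Passive nodes are not counted in the total).
Degraded: the cluster is available, but the data has no redundant copy. Either some Group's DN is running unprotected (only one primary is running in the Group, and all standbys are stopped and cannot be started), or the CN is running unprotected (only one CN is online).
Abnormal: the state of some CN or DN machine in the cluster is not online.
In the cluster status output below, the cluster state is Degraded. A closer look shows that on node Gauss2 the standby instance DB1_2 has the status NEED_REPAIR. The DN in group_1 is therefore running unprotected and needs to be repaired.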
[omm@Gauss1 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Degraded
balanced : true
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:Gauss1 AZ:AZ1 STATUS:ONLINE IP:192.168.10.11
HOST:Gauss2 AZ:AZ1 STATUS:ONLINE IP:192.168.10.12
HOST:Gauss3 AZ:AZ1 STATUS:ONLINE IP:192.168.10.13
HOST:Gauss4 AZ:AZ1 STATUS:ONLINE IP:192.168.10.14
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:Gauss1 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:Gauss2 ID:602
INSTANCE:CM3 ROLE:slave STATUS:ONLINE HOST:Gauss3 ID:603
INSTANCE:CM4 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:Gauss1 ID:701 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD2 ROLE:leader STATUS:ONLINE HOST:Gauss2 ID:702 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD3 ROLE:follower STATUS:ONLINE HOST:Gauss3 ID:703 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:Gauss1 ID:401 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:Gauss2 ID:402 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:Gauss3 ID:403 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:Gauss4 ID:404 PORT:8000 DataDir:/opt/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:primary STATUS:ONLINE HOST:Gauss1 ID:1 PORT:40000 DataDir:/opt/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:standby STATUS:NEED_REPAIR HOST:Gauss2 ID:2 PORT:40021 DataDir:/opt/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:Gauss2 ID:3 PORT:40000 DataDir:/opt/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:Gauss3 ID:4 PORT:40021 DataDir:/opt/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:5 PORT:40000 DataDir:/opt/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:Gauss4 ID:6 PORT:40021 DataDir:/opt/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:Gauss1 ID:8 PORT:40021 DataDir:/opt/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:7 PORT:40000 DataDir:/opt/gaussdb/data/dn4
--------------------------------------------------Manage IP--------------------------------------------------
HOST:Gauss1 IP:192.168.10.11
HOST:Gauss2 IP:192.168.10.12
HOST:Gauss3 IP:192.168.10.13
HOST:Gauss4 IP:192.168.10.14
[omm@Gauss1 ~]$
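On a cluster with many groups, picking out the faulty instance by eye gets tedious. The following is a minimal sketch of a filter for that, assuming gs_om -t status prints one record per line with whitespace-separated KEY:VALUE fields. The function name find_unhealthy_instances is a hypothetical helper for illustration, not part of GaussDB.

```shell
#!/bin/sh
# Hypothetical helper (not a GaussDB tool): read status output on stdin and
# print "name host status" for every instance whose STATUS is not ONLINE.
# Assumes one record per line with whitespace-separated KEY:VALUE fields.
find_unhealthy_instances() {
    awk '
        $1 ~ /^INSTANCE:/ {
            name = ""; host = ""; status = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^INSTANCE:/) name   = substr($i, 10)  # text after "INSTANCE:"
                if ($i ~ /^HOST:/)     host   = substr($i, 6)   # text after "HOST:"
                if ($i ~ /^STATUS:/)   status = substr($i, 8)   # text after "STATUS:"
            }
            if (status != "ONLINE") print name, host, status
        }
    '
}

# Intended usage on a cluster node:
#   gs_om -t status | find_unhealthy_instances
```

Fed the status output shown in this article, the filter would print a single line naming DB1_2 on Gauss2 with status NEED_REPAIR.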
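GaussDB provides two ways to repair this situation: manual rebuild and automatic rebuild. Both perform the same operations internally. To show what the rebuild actually does, the walkthrough below uses the manual method. Automatic rebuild, which needs only a single command, is covered at the end of the article.
Start the rebuild:
1> Stop the instance that needs to be recovered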
[omm@Gauss2 ~]$ gs_om -t stop -h Gauss2 -I DB1_2
2> Delete the files belonging to this instance
[omm@Gauss2 ~]$ cd /opt/gaussdb/data/dn1/
[omm@Gauss2 dn1]$ rm -rf archive_log/*
[omm@Gauss2 dn1]$ rm -rf data/*
[omm@Gauss2 ~]$ cd /opt/gaussdb/arch_log/db_group_1/archive_log
[omm@Gauss2 archive_log]$ ls
arch_0_1.arc arch_0_2.arc arch_0_3.arc arch_1_4.arc arch_1_5.arc arch_1_6.arc arch_1_7.arc
[omm@Gauss2 archive_log]$ rm -rf ./*
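Running rm -rf with a relative glob is risky: if the preceding cd fails, the delete runs in whatever directory the shell happens to be in. The following is a minimal sketch of a more defensive way to empty the same directories. The helper name clear_dir_contents is hypothetical and not part of GaussDB; the paths in the usage comment are the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper (not part of GaussDB): delete everything inside a
# directory, but only when the path is absolute, is not "/", and exists.
# Like the rm -rf dir/* commands above, it leaves dotfiles untouched.
clear_dir_contents() {
    dir=$1
    case "$dir" in
        /)  echo "refusing to clear /" >&2; return 1 ;;
        /*) ;;  # absolute path, continue
        *)  echo "refusing non-absolute path: $dir" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # ${dir:?} makes the shell fail instead of expanding to /* if dir is empty.
    rm -rf "${dir:?}"/*
}

# Intended usage on the node that hosts the damaged instance:
#   clear_dir_contents /opt/gaussdb/data/dn1/archive_log
#   clear_dir_contents /opt/gaussdb/data/dn1/data
#   clear_dir_contents /opt/gaussdb/arch_log/db_group_1/archive_log
```

Passing absolute paths removes the dependency on the current directory, so a mistyped or missing directory makes the helper return an error instead of deleting files somewhere else.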
3> Start the instance in nomount state
[omm@Gauss2 ~]$ zengine nomount -D /opt/gaussdb/data/dn1
4> Run the rebuild command
[omm@Gauss2 ~]$ zsql / as sysdba -D /opt/gaussdb/data/dn1
Warning: SSL connection to server without CA certificate is insecure. Continue anyway? (y/n):y
connected.
SQL> build database;
Succeed.
SQL> shutdown immediate;
Succeed.
SQL> exit
[omm@Gauss2 ~]$
5> After the rebuild finishes, start the repaired instance
[omm@Gauss2 ~]$ gs_om -t start -h Gauss2 -I DB1_2
6> With the rebuild complete, check the cluster status.
[omm@Gauss2 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Normal
balanced : true
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:Gauss1 AZ:AZ1 STATUS:ONLINE IP:192.168.10.11
HOST:Gauss2 AZ:AZ1 STATUS:ONLINE IP:192.168.10.12
HOST:Gauss3 AZ:AZ1 STATUS:ONLINE IP:192.168.10.13
HOST:Gauss4 AZ:AZ1 STATUS:ONLINE IP:192.168.10.14
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:Gauss1 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:Gauss2 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:Gauss4 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:Gauss1 ID:701 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD2 ROLE:leader STATUS:ONLINE HOST:Gauss2 ID:702 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
INSTANCE:ETCD3 ROLE:follower STATUS:ONLINE HOST:Gauss3 ID:703 PORT:2379 DataDir:/opt/huawei/gaussdb/data/etcd/data_etcd1
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:Gauss1 ID:401 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:Gauss2 ID:402 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:Gauss3 ID:403 PORT:8000 DataDir:/opt/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:Gauss4 ID:404 PORT:8000 DataDir:/opt/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:primary STATUS:ONLINE HOST:Gauss1 ID:1 PORT:40000 DataDir:/opt/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:standby STATUS:ONLINE HOST:Gauss2 ID:2 PORT:40021 DataDir:/opt/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:Gauss2 ID:3 PORT:40000 DataDir:/opt/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:Gauss3 ID:4 PORT:40021 DataDir:/opt/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:Gauss3 ID:5 PORT:40000 DataDir:/opt/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:Gauss4 ID:6 PORT:40021 DataDir:/opt/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:Gauss1 ID:8 PORT:40021 DataDir:/opt/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:Gauss4 ID:7 PORT:40000 DataDir:/opt/gaussdb/data/dn4
--------------------------------------------------Manage IP--------------------------------------------------
HOST:Gauss1 IP:192.168.10.11
HOST:Gauss2 IP:192.168.10.12
HOST:Gauss3 IP:192.168.10.13
HOST:Gauss4 IP:192.168.10.14
[omm@Gauss2 ~]$
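Instead of rereading the full status output after each step, the cluster_state line can be extracted and polled. The following is a minimal sketch, assuming the output contains a line of the form "cluster_state : VALUE" as in this article. Both function names, the default of 30 attempts, and the 10-second sleep are hypothetical choices for illustration, not GaussDB defaults.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "cluster_state" line
# from status output supplied on stdin.
get_cluster_state() {
    awk -F':' '/^[ \t]*cluster_state/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Hypothetical helper: poll gs_om -t status until the cluster reports Normal.
# $1 = maximum number of attempts (assumed default: 30), 10 seconds apart.
wait_for_normal() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        state=$(gs_om -t status | get_cluster_state)
        [ "$state" = "Normal" ] && return 0
        sleep 10
        tries=$((tries - 1))
    done
    return 1
}

# Intended usage after starting the rebuilt instance:
#   wait_for_normal 30 && echo "cluster is Normal"
```

Keeping the parsing in get_cluster_state separate from the polling loop means the parser can be checked against saved status output without touching a live cluster.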
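At this point the cluster has been repaired: the cluster state has gone from Degraded back to Normal, and the faulty instance DB1_2 is ONLINE again. This completes the manual rebuild of a damaged DN in the cluster.
Extension:
Automatic rebuild: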
cm ctl build -H Gauss2 -I DB1_2
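-H: host name of the instance to repair
-I: name of the instance to repair
As the steps above show, recovering a damaged DN in a large-scale distributed GaussDB deployment is straightforward: six short manual steps, or a single command with automatic rebuild, which keeps operations efficient.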
Reposted from 墨天轮 (Modb).