GaussDB T 1.0.1.SPC2.B003: Fixing a Cluster Start Failure After Shutdown and Reboot

社会主义的一块砖, posted 2020/02/12 16:41:24
【Summary】 How to fix a cluster restart failure on GaussDB T 1.0.1

【Test environment】

OS version: Red Hat 7.5; database version: GaussDB T 1.0.1.SPC2.B003

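【Symptoms】

Steps: manually stop the cluster, shut down all four nodes, reboot the four nodes, and then start the cluster. The start fails.
This sequence can also leave some nodes damaged and in need of repair.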

[omm@gaussdb11 ~]$ gs_om -t status

[GAUSS-50219] : Failed to obtain cluster status. Error:
time="2020-02-06T14:48:40+08:00" level=error msg="Error: cann't connect to etcdStore, error: (context deadline exceeded)"
time="2020-02-06T14:48:40+08:00" level=error msg="can't get NewEtcdStore"
time="2020-02-06T14:48:40+08:00" level=fatal msg="can't get Control data"
[omm@gaussdb11 ~]$ gs_om -t start
Starting cluster
=========================================
ERRO[0020] Error: cann't connect to etcdStore, error: (context deadline exceeded)
ERRO[0020] can't get NewEtcdStore
FATA[0020] can't get Control data
[GAUSS-51607] : Failed to start cluster.

【Analysis】

Both errors point at the etcd service process, so check its log:

cat /opt/gaussdb/log/omm/etcd/etcd20301.log
2020-02-06 14:34:40.851989 W | rafthttp: health check for peer 4ede7092bff5df23 could not connect: dial tcp 192.168.100.11:20301: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-02-06 14:34:41.963772 W | etcdserver: read-only range request "key:\"/clusters/layouts/AZ1/gaussdb14\" " with result "range_response_count:1 size:1584" took too long (258.818065ms) to execute
2020-02-06 14:34:41.963825 W | etcdserver: read-only range request "key:\"/clusters/layouts/AZ1/gaussdb13\" " with result "range_response_count:1 size:2087" took too long (193.903983ms) to execute
2020-02-06 14:34:42.810373 W | etcdserver: read-only range request "key:\"/clusters/inst_status/AZ1/gaussdb12/cn_402\" " with result "range_response_count:1 size:72" took too long (774.446332ms) to execute
2020-02-06 14:34:43.172806 W | etcdserver: read-only range request "key:\"/clusters/Globalinfo\" " with result "range_response_count:1 size:483" took too long (1.134431963s) to execute
2020-02-06 14:34:43.173043 W | etcdserver: request "header:<ID:1857295036222071595 username:\"root\" auth_revision:8 > lease_grant:<ttl:30-second id:19c66ffebd1ce32a>" with result "size:40" took too long (219.638715ms) to execute
2020-02-06 14:34:47.419445 W | wal: sync duration of 3.639853362s, expected less than 1s
2020-02-06 14:34:47.419698 W | etcdserver: read-only range request "key:\"/clusters/host_status/AZ1/gaussdb12\" " with result "range_response_count:1 size:74" took too long (4.604805258s) to execute
2020-02-06 14:34:47.420248 W | etcdserver: request "header:<ID:12427524836637247293 username:\"root\" auth_revision:8 > put:<key:\"/clusters/host_status/AZ1/gaussdb14\" value_size:6 lease:1857295036222071592 >" with result "size:6" took too long (3.641278887s) to execute
2020-02-06 14:34:47.421336 W | rafthttp: health check for peer 4ede7092bff5df23 could not connect: dial tcp 192.168.100.11:20301: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
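
The warnings above fall into three groups: a peer that cannot be reached, requests that took too long, and a slow WAL sync. For a longer log, a small script can tally them. The following Python sketch is only an illustration: the function name `summarize_etcd_log` and the regular expressions are assumptions derived from the line shapes quoted above, not part of etcd or GaussDB T tooling, and durations in units other than `ms` or `s` are ignored.

```python
import re
from collections import Counter

# Assumed patterns, based only on the log lines quoted in this article.
PEER_RE = re.compile(
    r"health check for peer (\w+) could not connect: dial tcp ([\d.]+:\d+):"
)
SLOW_RE = re.compile(r"took too long \(([\d.]+)(ms|s)\) to execute")
WAL_RE = re.compile(r"wal: sync duration of ([\d.]+)(ms|s)")


def _to_seconds(value, unit):
    """Convert a duration given in ms or s to seconds."""
    return float(value) / 1000.0 if unit == "ms" else float(value)


def summarize_etcd_log(lines):
    """Tally unreachable peers, slow requests, and the worst WAL sync time."""
    unreachable = Counter()
    slow_requests = 0
    slowest_request_s = 0.0
    max_wal_sync_s = 0.0
    for line in lines:
        m = PEER_RE.search(line)
        if m:
            # Key on (peer id, address) so repeated warnings are counted.
            unreachable[(m.group(1), m.group(2))] += 1
            continue
        m = SLOW_RE.search(line)
        if m:
            slow_requests += 1
            slowest_request_s = max(
                slowest_request_s, _to_seconds(m.group(1), m.group(2))
            )
            continue
        m = WAL_RE.search(line)
        if m:
            max_wal_sync_s = max(
                max_wal_sync_s, _to_seconds(m.group(1), m.group(2))
            )
    return {
        "unreachable_peers": dict(unreachable),
        "slow_requests": slow_requests,
        "slowest_request_s": slowest_request_s,
        "max_wal_sync_s": max_wal_sync_s,
    }
```

The function accepts any iterable of lines, so an open file handle for the etcd log can be passed to it directly.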

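The log shows the etcd peer at 192.168.100.11:20301 refusing connections and ordinary requests taking up to several seconds, which is consistent with gs_om timing out when it tries to connect to etcd. I also went through the other related logs and compared the cluster state before and during the shutdown, which led to the method below.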
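
One way to confirm the "connection refused" symptom from the etcd log before trying a fix is to probe the peer port directly. The sketch below is a hedged illustration using only the Python standard library; the helper names `port_is_open` and `probe_peers` are hypothetical, and the address to probe (for example 192.168.100.11 port 20301, taken from the log above) has to be supplied by the reader.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def probe_peers(peers, timeout=2.0):
    """Map each (host, port) pair to whether it accepts TCP connections."""
    return {peer: port_is_open(peer[0], peer[1], timeout) for peer in peers}
```

A result of `False` only says that nothing accepted the connection. It does not distinguish a stopped etcd process from a firewall or network problem, so it complements the log rather than replacing it.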

【Solution】

Method: restart the AZ
1. Start the AZ.

[omm@gaussdb12 ~]$ gs_om -t start --az=AZ1
Starting az
=========================================
Starting specified az in the cluster.
Clean old cm and etcd for specified AZ.
Successfully clean old cm and etcd for specified AZ.
Restart etcd for specified AZ.
Successfully restart etcd for specified AZ.
Checking the etcd status.
Successfully checked the etcd status.
Changing the instances config for specified az.
Successfully changed the instances config for specified az.
Restart cm agent for specified AZ.
Successfully restart cm agent for specified AZ.
start the specified az in the cluster.
90s
======================================================================
Finish to start the specified az in the cluster.
Successfully starting specified az in the cluster.

2. Check the cluster status

[omm@gaussdb11 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state :      single_az
cluster_state : Degraded
balanced :      true
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1                ROLE:primary            STATUS:ONLINE      
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:gaussdb11        AZ:AZ1                  STATUS:ONLINE       IP:192.168.100.11
HOST:gaussdb12        AZ:AZ1                  STATUS:ONLINE       IP:192.168.100.12
HOST:gaussdb13        AZ:AZ1                  STATUS:ONLINE       IP:192.168.100.13
HOST:gaussdb14        AZ:AZ1                  STATUS:ONLINE       IP:192.168.100.14
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM2          ROLE:primary            STATUS:ONLINE       HOST:gaussdb11        ID:602
INSTANCE:CM1          ROLE:slave              STATUS:ONLINE       HOST:gaussdb12        ID:601
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1        ROLE:leader             STATUS:ONLINE       HOST:gaussdb11        ID:701      PORT:20300        DataDir:/gaussdb/data/data_etcd
INSTANCE:ETCD2        ROLE:follower           STATUS:ONLINE       HOST:gaussdb12        ID:702      PORT:20300        DataDir:/gaussdb/data/data_etcd
INSTANCE:ETCD3        ROLE:follower           STATUS:ONLINE       HOST:gaussdb13        ID:703      PORT:20300        DataDir:/gaussdb/data/data_etcd
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401       ROLE:no role            STATUS:ONLINE       HOST:gaussdb11        ID:401      PORT:8000         DataDir:/gaussdb/data/data_cn
INSTANCE:cn_402       ROLE:no role            STATUS:ONLINE       HOST:gaussdb12        ID:402      PORT:8000         DataDir:/gaussdb/data/data_cn
----------------------------------------------------------------------GTS Status----------------------------------------------------------------------
INSTANCE:GTS1         ROLE:primary            STATUS:ONLINE       HOST:gaussdb11        ID:441      PORT:13000        DataDir:/gaussdb/data/data_gts
INSTANCE:GTS2         ROLE:standby            STATUS:ONLINE       HOST:gaussdb12        ID:442      PORT:13000        DataDir:/gaussdb/data/data_gts
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_2        ROLE:primary            STATUS:ONLINE       HOST:gaussdb11        ID:2        PORT:40000        DataDir:/gaussdb/data/data_dn
INSTANCE:DB1_1        ROLE:standby            STATUS:NEED_REPAIR  HOST:gaussdb13        ID:1        PORT:40000        DataDir:/gaussdb/data/data_dn
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_4        ROLE:primary            STATUS:ONLINE       HOST:gaussdb12        ID:4        PORT:40000        DataDir:/gaussdb/data/data_dn
INSTANCE:DB2_3        ROLE:standby            STATUS:NEED_REPAIR  HOST:gaussdb14        ID:3        PORT:40000        DataDir:/gaussdb/data/data_dn
-----------------------------------------------------------------------Manage IP----------------------------------------------------------------------
HOST:gaussdb11        IP:192.168.100.11
HOST:gaussdb12        IP:192.168.100.12
HOST:gaussdb13        IP:192.168.100.13
HOST:gaussdb14        IP:192.168.100.14
-------------------------------------------------------------------Query Action Info------------------------------------------------------------------
HOSTNAME: gaussdb11     TIME: 2020-02-07 23:48:08.930061
------------------------------------------------------------------------Float Ip------------------------------------------------------------------
HOST:gaussdb12    DB2_4:192.168.100.12    IP:
HOST:gaussdb11    DB1_2:192.168.100.11    IP:

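The cluster is now running, although cluster_state is still Degraded. What remains is to repair the two standby instances reported as STATUS:NEED_REPAIR: DB1_1 on gaussdb13 and DB2_3 on gaussdb14. With that, the start failure is resolved.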
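
On a larger cluster it is easy to miss a NEED_REPAIR entry in the status listing. The following Python sketch pulls those entries out of the text printed by `gs_om -t status`. It is only an illustration: the function name `instances_needing_repair` is hypothetical, and the regular expression assumes the `INSTANCE: ROLE: STATUS: HOST:` column layout shown in the output above. It lists the instances but does not repair them.

```python
import re

# Assumed layout of an instance line, taken from the status output above.
# ROLE may contain a space (for example "no role"), hence the lazy match.
INSTANCE_RE = re.compile(
    r"^INSTANCE:(\S+)\s+ROLE:(.+?)\s+STATUS:(\S+)\s+HOST:(\S+)"
)


def instances_needing_repair(status_text):
    """Return (instance, role, host) for every instance in NEED_REPAIR."""
    found = []
    for line in status_text.splitlines():
        m = INSTANCE_RE.match(line.strip())
        if m and m.group(3) == "NEED_REPAIR":
            found.append((m.group(1), m.group(2), m.group(4)))
    return found
```

Applied to the status output above, this would return the DB1_1 and DB2_3 standby instances on gaussdb13 and gaussdb14.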


Reposted from 墨天轮
