How to resolve a GaussDB T 1.0.1.SPC2.B003 cluster failing to restart after shutdown
【Summary】A fix for a GaussDB T 1.0.1 cluster that fails to restart after being shut down.
【Environment】
OS: Red Hat 7.5; database: GaussDB T 1.0.1.SPC2.B003
【Symptom】
Steps that triggered the fault: manually stop the cluster, shut down all four nodes, reboot the four nodes, then start the cluster; the start fails.
The procedure also leaves some instances in a state that needs repair.
[omm@gaussdb11 ~]$ gs_om -t status
[GAUSS-50219] : Failed to obtain cluster status. Error:
time="2020-02-06T14:48:40+08:00" level=error msg="Error: cann't connect to etcdStore, error: (context deadline exceeded)"
time="2020-02-06T14:48:40+08:00" level=error msg="can't get NewEtcdStore"
time="2020-02-06T14:48:40+08:00" level=fatal msg="can't get Control data"

[omm@gaussdb11 ~]$ gs_om -t start
Starting cluster
=========================================
ERRO[0020] Error: cann't connect to etcdStore, error: (context deadline exceeded)
ERRO[0020] can't get NewEtcdStore
FATA[0020] can't get Control data
[GAUSS-51607] : Failed to start cluster.
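The "context deadline exceeded" error means gs_om could not reach the etcd store at all, so a quick first check is whether the etcd ports even accept TCP connections. A minimal sketch, assuming bash with its /dev/tcp pseudo-device and the `timeout` utility; the hosts and the etcd client port 20300 are taken from this cluster's status output, so adjust them for your own deployment:

```shell
# Probe a host:port pair; bash's /dev/tcp redirection only succeeds
# if a TCP connection can actually be established.
probe() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null \
    && echo "$1:$2 reachable" \
    || echo "$1:$2 UNREACHABLE"
}

# etcd client port 20300 on the three etcd hosts of this cluster
for host in 192.168.100.11 192.168.100.12 192.168.100.13; do
  probe "$host" 20300
done
```

If all three report UNREACHABLE, the problem is the etcd service itself (not started, crashed, or blocked) rather than gs_om.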
【Analysis】
The errors show that the etcd service is the problem, so examine its log:
cat /opt/gaussdb/log/omm/etcd/etcd20301.log
2020-02-06 14:34:40.851989 W | rafthttp: health check for peer 4ede7092bff5df23 could not connect: dial tcp 192.168.100.11:20301: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-02-06 14:34:41.963772 W | etcdserver: read-only range request "key:\"/clusters/layouts/AZ1/gaussdb14\" " with result "range_response_count:1 size:1584" took too long (258.818065ms) to execute
2020-02-06 14:34:41.963825 W | etcdserver: read-only range request "key:\"/clusters/layouts/AZ1/gaussdb13\" " with result "range_response_count:1 size:2087" took too long (193.903983ms) to execute
2020-02-06 14:34:42.810373 W | etcdserver: read-only range request "key:\"/clusters/inst_status/AZ1/gaussdb12/cn_402\" " with result "range_response_count:1 size:72" took too long (774.446332ms) to execute
2020-02-06 14:34:43.172806 W | etcdserver: read-only range request "key:\"/clusters/Globalinfo\" " with result "range_response_count:1 size:483" took too long (1.134431963s) to execute
2020-02-06 14:34:43.173043 W | etcdserver: request "header:<ID:1857295036222071595 username:\"root\" auth_revision:8 > lease_grant:<ttl:30-second id:19c66ffebd1ce32a>" with result "size:40" took too long (219.638715ms) to execute
2020-02-06 14:34:47.419445 W | wal: sync duration of 3.639853362s, expected less than 1s
2020-02-06 14:34:47.419698 W | etcdserver: read-only range request "key:\"/clusters/host_status/AZ1/gaussdb12\" " with result "range_response_count:1 size:74" took too long (4.604805258s) to execute
2020-02-06 14:34:47.420248 W | etcdserver: request "header:<ID:12427524836637247293 username:\"root\" auth_revision:8 > put:<key:\"/clusters/host_status/AZ1/gaussdb14\" value_size:6 lease:1857295036222071592 >" with result "size:6" took too long (3.641278887s) to execute
2020-02-06 14:34:47.421336 W | rafthttp: health check for peer 4ede7092bff5df23 could not connect: dial tcp 192.168.100.11:20301: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
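Three warning patterns in this log carry the diagnosis: WAL syncs exceeding etcd's one-second expectation (slow disk), range requests that "took too long" (the knock-on effect), and peer health checks that "could not connect" (the refused/timed-out node). A minimal triage sketch, assuming a POSIX shell with grep; the sample lines are copied from the log above, and in practice you would point LOG at the real file instead of the heredoc:

```shell
# Count the three diagnostic warning patterns in an etcd log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2020-02-06 14:34:47.419445 W | wal: sync duration of 3.639853362s, expected less than 1s
2020-02-06 14:34:47.419698 W | etcdserver: read-only range request "key:\"/clusters/host_status/AZ1/gaussdb12\" " with result "range_response_count:1 size:74" took too long (4.604805258s) to execute
2020-02-06 14:34:40.851989 W | rafthttp: health check for peer 4ede7092bff5df23 could not connect: dial tcp 192.168.100.11:20301: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
EOF

wal_syncs=$(grep -c 'wal: sync duration' "$LOG")    # slow fsync => slow disk
slow_reqs=$(grep -c 'took too long' "$LOG")         # requests delayed by it
dead_peers=$(grep -c 'could not connect' "$LOG")    # unreachable etcd peer

echo "slow WAL syncs: $wal_syncs, slow requests: $slow_reqs, unreachable-peer checks: $dead_peers"
rm -f "$LOG"
```

A high count of unreachable-peer checks against one address (here 192.168.100.11:20301) points at the member that failed to rejoin after the reboot.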
After reviewing the related logs and comparing the cluster's state before and during the shutdown, the following recovery method was worked out.
【Solution】
Method: restart the AZ.
1. Start the AZ:
[omm@gaussdb12 ~]$ gs_om -t start --az=AZ1
Starting az
=========================================
Starting specified az in the cluster.
Clean old cm and etcd for specified AZ.
Successfully clean old cm and etcd for specified AZ.
Restart etcd for specified AZ.
Successfully restart etcd for specified AZ.
Checking the etcd status.
Successfully checked the etcd status.
Changing the instances config for specified az.
Successfully changed the instances config for specified az.
Restart cm agent for specified AZ.
Successfully restart cm agent for specified AZ.
start the specified az in the cluster.
90s
======================================================================
Finish to start the specified az in the cluster.
Successfully starting specified az in the cluster.
2. Check the cluster status:
[omm@gaussdb11 ~]$ gs_om -t status
Set output to terminal.
--------------------Cluster Status--------------------
az_state : single_az
cluster_state : Degraded
balanced : true
--------------------AZ Status--------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
--------------------Host Status--------------------
HOST:gaussdb11 AZ:AZ1 STATUS:ONLINE IP:192.168.100.11
HOST:gaussdb12 AZ:AZ1 STATUS:ONLINE IP:192.168.100.12
HOST:gaussdb13 AZ:AZ1 STATUS:ONLINE IP:192.168.100.13
HOST:gaussdb14 AZ:AZ1 STATUS:ONLINE IP:192.168.100.14
--------------------Cluster Manager Status--------------------
INSTANCE:CM2 ROLE:primary STATUS:ONLINE HOST:gaussdb11 ID:602
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:gaussdb12 ID:601
--------------------ETCD Status--------------------
INSTANCE:ETCD1 ROLE:leader STATUS:ONLINE HOST:gaussdb11 ID:701 PORT:20300 DataDir:/gaussdb/data/data_etcd
INSTANCE:ETCD2 ROLE:follower STATUS:ONLINE HOST:gaussdb12 ID:702 PORT:20300 DataDir:/gaussdb/data/data_etcd
INSTANCE:ETCD3 ROLE:follower STATUS:ONLINE HOST:gaussdb13 ID:703 PORT:20300 DataDir:/gaussdb/data/data_etcd
--------------------CN Status--------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:gaussdb11 ID:401 PORT:8000 DataDir:/gaussdb/data/data_cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gaussdb12 ID:402 PORT:8000 DataDir:/gaussdb/data/data_cn
--------------------GTS Status--------------------
INSTANCE:GTS1 ROLE:primary STATUS:ONLINE HOST:gaussdb11 ID:441 PORT:13000 DataDir:/gaussdb/data/data_gts
INSTANCE:GTS2 ROLE:standby STATUS:ONLINE HOST:gaussdb12 ID:442 PORT:13000 DataDir:/gaussdb/data/data_gts
--------------------Instances Status in Group (group_1)--------------------
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gaussdb11 ID:2 PORT:40000 DataDir:/gaussdb/data/data_dn
INSTANCE:DB1_1 ROLE:standby STATUS:NEED_REPAIR HOST:gaussdb13 ID:1 PORT:40000 DataDir:/gaussdb/data/data_dn
--------------------Instances Status in Group (group_2)--------------------
INSTANCE:DB2_4 ROLE:primary STATUS:ONLINE HOST:gaussdb12 ID:4 PORT:40000 DataDir:/gaussdb/data/data_dn
INSTANCE:DB2_3 ROLE:standby STATUS:NEED_REPAIR HOST:gaussdb14 ID:3 PORT:40000 DataDir:/gaussdb/data/data_dn
--------------------Manage IP--------------------
HOST:gaussdb11 IP:192.168.100.11
HOST:gaussdb12 IP:192.168.100.12
HOST:gaussdb13 IP:192.168.100.13
HOST:gaussdb14 IP:192.168.100.14
--------------------Query Action Info--------------------
HOSTNAME: gaussdb11 TIME: 2020-02-07 23:48:08.930061
--------------------Float Ip--------------------
HOST:gaussdb12 DB2_4:192.168.100.12 IP:
HOST:gaussdb11 DB1_2:192.168.100.11 IP:
The cluster is up again. All that remains is to repair the two standby DN instances reported as STATUS:NEED_REPAIR on gaussdb13 and gaussdb14; once they are rebuilt, the problem is resolved.
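On a larger cluster it helps to pull the instances that still need repair out of the status output mechanically rather than by eye. A minimal sketch, assuming a POSIX shell with grep and awk; the sample lines in the heredoc are copied from the status output above, and in practice you would feed it `gs_om -t status` output saved to a file:

```shell
# Extract instance name and host for every NEED_REPAIR entry.
STATUS=$(mktemp)
cat > "$STATUS" <<'EOF'
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gaussdb11 ID:2 PORT:40000 DataDir:/gaussdb/data/data_dn
INSTANCE:DB1_1 ROLE:standby STATUS:NEED_REPAIR HOST:gaussdb13 ID:1 PORT:40000 DataDir:/gaussdb/data/data_dn
INSTANCE:DB2_4 ROLE:primary STATUS:ONLINE HOST:gaussdb12 ID:4 PORT:40000 DataDir:/gaussdb/data/data_dn
INSTANCE:DB2_3 ROLE:standby STATUS:NEED_REPAIR HOST:gaussdb14 ID:3 PORT:40000 DataDir:/gaussdb/data/data_dn
EOF

# Field 1 is INSTANCE:<name>, field 4 is HOST:<name> in this layout.
broken=$(grep 'STATUS:NEED_REPAIR' "$STATUS" | awk '{print $1, $4}')
echo "$broken"
rm -f "$STATUS"
```

Here it lists DB1_1 on gaussdb13 and DB2_3 on gaussdb14, the two standbys that need a rebuild.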
Reposted from 墨天轮 (Modb).
【Copyright】This article was reposted by a Huawei Cloud community user. If you find suspected plagiarism in this community, please report it by email with supporting evidence; once verified, the infringing content will be removed immediately. Report mailbox: cloudbbs@huaweicloud.com