GaussDB(DWS) Instance Build Failure Troubleshooting Guide

Posted by 上官寒雨 on 2022/12/12 16:32:20

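Problem description: While a cluster at a live production site is running, an instance enters the build failed state, or a manually triggered build of the instance does not succeed. This guide describes how to troubleshoot such cases.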

Troubleshooting procedure:

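Step 1: On the node running the build, go to the $GAUSSLOG/bin/gs_ctl directory and grep for the instance name (dn_xxx_xxx) to identify the build thread ID. The thread ID is the second bracketed field of each log line (2102134 in the lines below).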

[2021-09-11 15:34:46.653][2102134][dn_6009_6010][gs_rewind]: connected to server: host=10.18.20.144 port=25331 dbname=postgres application_name=gs_rewind connect_timeout=5

[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.

[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: send base backup command: BASE_BACKUP LABEL 'gs_ctl full build' FAST NOWAIT

[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: get the xlog start position

[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: request WAL start point: [0/3000028]

[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: building from checkpoint at 0/3000188 on timeline 1

[2021-09-11 15:34:46.758][2102134][dn_6009_6010][gs_rewind]: start getting chang
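The lookup in Step 1 can also be scripted. The following is a minimal sketch, assuming every log line follows the four-bracket layout seen in the sample above (timestamp, thread ID, instance name, tool name). The function name `build_thread_ids` and the regular expression are illustrative and not part of any GaussDB tool.

```python
import re

# Layout assumed from the sample lines:
# [timestamp][thread id][instance][tool]: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]*)\]\[(?P<tid>\d+)\]\[(?P<inst>[^\]]*)\]\[(?P<tool>[^\]]*)\]: ?(?P<msg>.*)$"
)


def build_thread_ids(lines, instance):
    """Return the distinct thread IDs that logged for `instance`, in first-seen order."""
    seen = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("inst") == instance and m.group("tid") not in seen:
            seen.append(m.group("tid"))
    return seen


# Two lines copied from the Step 1 sample output.
sample = [
    "[2021-09-11 15:34:46.653][2102134][dn_6009_6010][gs_rewind]: connected to server: "
    "host=10.18.20.144 port=25331 dbname=postgres application_name=gs_rewind connect_timeout=5",
    "[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.",
]

print(build_thread_ids(sample, "dn_6009_6010"))  # prints ['2102134']
```

If several builds of the same instance appear in one file, the list contains one ID per build, and the last entry usually corresponds to the most recent attempt.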

 

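Step 2: Search the whole build log for the build thread ID: grep 2102134 (the log below is from a normal incremental backup that completed successfully)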

[2021-09-11 15:34:46.604][2102134][][gs_ctl]: gs_ctl full build, datadir is -D "/srv/BigData/mppdb/data2/slave1"

[2021-09-11 15:34:46.604][2102134][][gs_ctl]: killing gaussdb by force ...

[2021-09-11 15:34:46.604][2102134][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/srv/BigData/mppdb/data2/slave1") print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/srv/BigData/mppdb/data2/slave1]

[2021-09-11 15:34:46.646][2102134][][gs_ctl]: server stopped

[2021-09-11 15:34:46.653][2102134][dn_6009_6010][gs_rewind]: connected to server: host=10.18.20.144 port=25331 dbname=postgres application_name=gs_rewind connect_timeout=5

[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.

[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: send base backup command: BASE_BACKUP LABEL 'gs_ctl full build' FAST NOWAIT

[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: get the xlog start position

[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: request WAL start point: [0/3000028]

[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: building from checkpoint at 0/3000188 on timeline 1

[2021-09-11 15:34:46.758][2102134][dn_6009_6010][gs_rewind]: start getting changed filemap

[2021-09-11 15:34:46.760][2102134][dn_6009_6010][gs_rewind]: get local filemap success

[2021-09-11 15:34:46.768][2102134][dn_6009_6010][gs_rewind]: fetch remote filemap success

[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: connected to server for checksum, parallel thread number: 4.

[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 0 starts successfully, pid 281464864406592.

[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 1 starts successfully, pid 281464855952448.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 2 starts successfully, pid 281464641911872.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 3 starts successfully, pid 281464776129600.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 0 starts successfully, pid 281464767675456.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 1 starts successfully, pid 281464759221312.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 2 starts successfully, pid 281464750767168.

[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 3 starts successfully, pid 281464742313024.

[2021-09-11 15:34:46.793][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 2 cost 9ms

[2021-09-11 15:34:46.795][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 1 cost 11ms

[2021-09-11 15:34:46.796][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 3 cost 12ms

[2021-09-11 15:34:46.796][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 0 cost 12ms

[2021-09-11 15:34:46.825][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 2 cost 41ms

[2021-09-11 15:34:46.828][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 0 cost 45ms

[2021-09-11 15:34:46.831][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 1 cost 47ms

[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 3 cost 49ms

[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: source files checksum threads return successfully.

[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: target files checksum threads return successfully.

[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: get changed filemap successfully, cost 77ms

[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: need to copy 81MB (total source directory size is 82MB)

[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: receiving and unpacking files...

[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: connected to server to fetch files, parallel thread number: 4.

[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 0 starts successfully, pid 281464742313024.

[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 1 starts successfully, pid 281464750767168.

[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 2 starts successfully, pid 281464759221312.

[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 3 starts successfully, pid 281464767675456.

[2021-09-11 15:34:47.014][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 3 cost 163ms

[2021-09-11 15:34:47.063][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 0 cost 212ms

[2021-09-11 15:34:47.068][2102134][dn_6009_6010][gs_rewind]: pg_xlog type 1.

[2021-09-11 15:34:47.068][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 2 cost 217ms

[2021-09-11 15:34:47.075][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 1 cost 224ms

[2021-09-11 15:34:47.075][2102134][dn_6009_6010][gs_rewind]: execute file map threads return successfully.

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: get WAL end point: [0/3000230]

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check checkpoint redo (0/3000028) success.

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check checkpoint rec (0/3000188) success.

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: find max lsn rec (0/3000188) success.

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check lsn after checkpoint redo is continus to (0/3000188)

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: identify system with primary success

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: request WALs success with query START_REPLICATION 0/3000000

[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: begin to receive WALs at [0/3000000]

[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: receiving WALs success from [0/3000000] to [0/3800000]

[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: write slot file /srv/BigData/mppdb/data2/slave1/pg_replslot/dn_6009_6010/state

[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: set minRecoveryPoint success

[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_rewind]: create backup label success

[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_ctl]: build completed(/srv/BigData/mppdb/data2/slave1).

[2021-09-11 15:34:53.914][2102134][dn_6009_6010][gs_ctl]: waiting for server to start...

[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: done

[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: server started (/srv/BigData/mppdb/data2/slave1)
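For long logs it can help to filter by thread ID programmatically and check whether the run reached the end of the sequence shown above. The sketch below assumes the same four-bracket line layout and treats the messages "build completed" and "server started" from the sample as success markers. That choice of markers is an assumption drawn from this one log, and the helper names are illustrative.

```python
import re

# Fields: [timestamp][thread id][instance][tool]: message
LINE_RE = re.compile(r"^\[([^\]]*)\]\[(\d+)\]\[([^\]]*)\]\[([^\]]*)\]: ?(.*)$")


def lines_for_thread(lines, thread_id):
    """Return the message text of every line logged by `thread_id`."""
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group(2) == thread_id:
            out.append(m.group(5))
    return out


def build_succeeded(messages):
    """True if 'build completed' appears and 'server started' appears after it."""
    done = [i for i, msg in enumerate(messages) if msg.startswith("build completed")]
    started = [i for i, msg in enumerate(messages) if msg.startswith("server started")]
    return bool(done) and bool(started) and done[0] < started[-1]


sample = [
    "[2021-09-11 15:34:46.604][2102134][][gs_ctl]: killing gaussdb by force ...",
    # Made-up line from a different thread, only to show that the filter drops it.
    "[2021-09-11 15:34:50.000][9999999][dn_6011_6012][gs_ctl]: waiting for server to start...",
    "[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_ctl]: build completed(/srv/BigData/mppdb/data2/slave1).",
    "[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: server started (/srv/BigData/mppdb/data2/slave1)",
]

messages = lines_for_thread(sample, "2102134")
print(len(messages), build_succeeded(messages))  # prints 3 True
```

When `build_succeeded` returns False, the last few entries of `messages` are usually the place to start reading for the error.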

Step 3: Use the error messages in the build log to determine why the build failed. For example:

[2022-01-13 11:28:24.093][41230][dn_6009_6010][gs_rewind]: could not open target file "/srv/BigData/mppdb/data2/master2/pg_tblspc/534701960/PG_9.2_201611171_dn_6179_6180/16681/535722869": No such file or directory

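Fix: If the failure is confirmed to be caused by a network fault, retry with a full build to repair the instance. If the failure is caused by files that do not exist on the peer, keep the error messages and contact Huawei engineers.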
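The decision described above can be sketched as a small classifier. Only the "No such file or directory" pattern comes from the Step 3 sample. The network patterns are hypothetical placeholders, not confirmed gs_ctl or gs_rewind output, and should be replaced with the messages seen in your own logs. The sketch also assumes that the Step 3 sample belongs to the missing-file case.

```python
import re

# Pattern taken from the Step 3 sample line.
MISSING_FILE = re.compile(r"No such file or directory")
# Hypothetical network patterns: placeholders, not confirmed tool output.
NETWORK_HINTS = re.compile(
    r"could not connect|timed out|connection (reset|refused|closed)",
    re.IGNORECASE,
)


def suggest_action(message):
    """Map one error message to the follow-up recommended in this guide."""
    if MISSING_FILE.search(message):
        return "keep the error output and contact Huawei engineers"
    if NETWORK_HINTS.search(message):
        return "retry with a full build"
    return "unclassified: inspect the log manually"


# Error line copied from Step 3.
error_line = (
    '[2022-01-13 11:28:24.093][41230][dn_6009_6010][gs_rewind]: could not open target file '
    '"/srv/BigData/mppdb/data2/master2/pg_tblspc/534701960/PG_9.2_201611171_dn_6179_6180/16681/535722869": '
    "No such file or directory"
)

print(suggest_action(error_line))  # prints keep the error output and contact Huawei engineers
```

The classifier is deliberately conservative: anything it does not recognize is returned as unclassified rather than mapped to a retry.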

https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=93755

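[Copyright notice] This article is original content by a Huawei Cloud community user. When reposting, you must credit the source (Huawei Cloud community), the article link, and the author; otherwise the author and the community reserve the right to pursue liability. If you find content in this community suspected of plagiarism, report it by email with supporting evidence; once verified, the community will promptly remove the infringing content. Report email: cloudbbs@huaweicloud.com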