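GaussDB(DWS) Instance Build Failure Troubleshooting Guide
Problem description: While a cluster at a live site is running, an instance enters the build failed state, or a manually triggered build of the instance does not succeed. This guide describes how to troubleshoot such problems.
Procedure: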
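Step 1: On the node where the build runs, grep for the instance name (dn_xxx_xxx) in the $GAUSSLOG/bin/gs_ctl directory to identify the build thread ID (the second bracketed field of each log line, 2102134 in the sample below).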
[2021-09-11 15:34:46.653][2102134][dn_6009_6010][gs_rewind]: connected to server: host=10.18.20.144 port=25331 dbname=postgres application_name=gs_rewind connect_timeout=5
[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.
[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: send base backup command: BASE_BACKUP LABEL 'gs_ctl full build' FAST NOWAIT
[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: get the xlog start position
[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: request WAL start point: [0/3000028]
[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: building from checkpoint at 0/3000188 on timeline 1
[2021-09-11 15:34:46.758][2102134][dn_6009_6010][gs_rewind]: start getting chang
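The lookup in Step 1 can also be scripted when the log directory is large. The following is a minimal sketch, not part of the original procedure: it assumes every log line follows the layout seen in the sample above ([timestamp][thread ID][instance][component]: message), and the function name build_thread_ids is hypothetical.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the sample log above:
# [timestamp][thread ID][instance][component]: message
LOG_LINE = re.compile(r"^\[([^\]]*)\]\[(\d+)\]\[([^\]]*)\]\[([^\]]*)\]: (.*)$")


def build_thread_ids(lines, instance):
    """Count log lines per thread ID for the given instance name."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) == instance:
            counts[match.group(2)] += 1
    return counts


# Sample lines copied from the logs in this guide.
sample = [
    "[2021-09-11 15:34:46.604][2102134][][gs_ctl]: server stopped",
    "[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.",
    "[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: get the xlog start position",
]

for thread_id, hits in build_thread_ids(sample, "dn_6009_6010").items():
    print(thread_id, hits)
```

If several builds of the same instance were attempted, more than one thread ID is reported; the one with the most recent timestamps is normally the build to investigate.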
Step 2: Search the entire build log for the build thread ID: grep 2102134 (the log below is from a normal incremental backup)
[2021-09-11 15:34:46.604][2102134][][gs_ctl]: gs_ctl full build, datadir is -D "/srv/BigData/mppdb/data2/slave1"
[2021-09-11 15:34:46.604][2102134][][gs_ctl]: killing gaussdb by force ...
[2021-09-11 15:34:46.604][2102134][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/srv/BigData/mppdb/data2/slave1") print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/srv/BigData/mppdb/data2/slave1]
[2021-09-11 15:34:46.646][2102134][][gs_ctl]: server stopped
[2021-09-11 15:34:46.653][2102134][dn_6009_6010][gs_rewind]: connected to server: host=10.18.20.144 port=25331 dbname=postgres application_name=gs_rewind connect_timeout=5
[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.
[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: send base backup command: BASE_BACKUP LABEL 'gs_ctl full build' FAST NOWAIT
[2021-09-11 15:34:46.655][2102134][dn_6009_6010][gs_rewind]: get the xlog start position
[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: request WAL start point: [0/3000028]
[2021-09-11 15:34:46.757][2102134][dn_6009_6010][gs_rewind]: building from checkpoint at 0/3000188 on timeline 1
[2021-09-11 15:34:46.758][2102134][dn_6009_6010][gs_rewind]: start getting changed filemap
[2021-09-11 15:34:46.760][2102134][dn_6009_6010][gs_rewind]: get local filemap success
[2021-09-11 15:34:46.768][2102134][dn_6009_6010][gs_rewind]: fetch remote filemap success
[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: connected to server for checksum, parallel thread number: 4.
[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 0 starts successfully, pid 281464864406592.
[2021-09-11 15:34:46.783][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 1 starts successfully, pid 281464855952448.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 2 starts successfully, pid 281464641911872.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: source files checksum thread 3 starts successfully, pid 281464776129600.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 0 starts successfully, pid 281464767675456.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 1 starts successfully, pid 281464759221312.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 2 starts successfully, pid 281464750767168.
[2021-09-11 15:34:46.784][2102134][dn_6009_6010][gs_rewind]: target files checksum thread 3 starts successfully, pid 281464742313024.
[2021-09-11 15:34:46.793][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 2 cost 9ms
[2021-09-11 15:34:46.795][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 1 cost 11ms
[2021-09-11 15:34:46.796][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 3 cost 12ms
[2021-09-11 15:34:46.796][2102134][dn_6009_6010][gs_rewind]: calc and verify target files checksum successfully, thread 0 cost 12ms
[2021-09-11 15:34:46.825][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 2 cost 41ms
[2021-09-11 15:34:46.828][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 0 cost 45ms
[2021-09-11 15:34:46.831][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 1 cost 47ms
[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: calc source files checksum successfully, thread 3 cost 49ms
[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: source files checksum threads return successfully.
[2021-09-11 15:34:46.833][2102134][dn_6009_6010][gs_rewind]: target files checksum threads return successfully.
[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: get changed filemap successfully, cost 77ms
[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: need to copy 81MB (total source directory size is 82MB)
[2021-09-11 15:34:46.835][2102134][dn_6009_6010][gs_rewind]: receiving and unpacking files...
[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: connected to server to fetch files, parallel thread number: 4.
[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 0 starts successfully, pid 281464742313024.
[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 1 starts successfully, pid 281464750767168.
[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 2 starts successfully, pid 281464759221312.
[2021-09-11 15:34:46.851][2102134][dn_6009_6010][gs_rewind]: fetch remote file thread 3 starts successfully, pid 281464767675456.
[2021-09-11 15:34:47.014][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 3 cost 163ms
[2021-09-11 15:34:47.063][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 0 cost 212ms
[2021-09-11 15:34:47.068][2102134][dn_6009_6010][gs_rewind]: pg_xlog type 1.
[2021-09-11 15:34:47.068][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 2 cost 217ms
[2021-09-11 15:34:47.075][2102134][dn_6009_6010][gs_rewind]: fetch remote file successfully, thread 1 cost 224ms
[2021-09-11 15:34:47.075][2102134][dn_6009_6010][gs_rewind]: execute file map threads return successfully.
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: get WAL end point: [0/3000230]
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check checkpoint redo (0/3000028) success.
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check checkpoint rec (0/3000188) success.
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: find max lsn rec (0/3000188) success.
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: check lsn after checkpoint redo is continus to (0/3000188)
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: identify system with primary success
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: request WALs success with query START_REPLICATION 0/3000000
[2021-09-11 15:34:48.867][2102134][dn_6009_6010][gs_rewind]: begin to receive WALs at [0/3000000]
[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: receiving WALs success from [0/3000000] to [0/3800000]
[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: write slot file /srv/BigData/mppdb/data2/slave1/pg_replslot/dn_6009_6010/state
[2021-09-11 15:34:53.881][2102134][dn_6009_6010][gs_rewind]: set minRecoveryPoint success
[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_rewind]: create backup label success
[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_ctl]: build completed(/srv/BigData/mppdb/data2/slave1).
[2021-09-11 15:34:53.914][2102134][dn_6009_6010][gs_ctl]: waiting for server to start...
[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: done
[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: server started (/srv/BigData/mppdb/data2/slave1)
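As a rough aid for Step 2, the sketch below filters a log by thread ID and checks whether the two closing messages of the normal log above ("build completed" and "server started") are present. It is an illustration under assumptions: the function name build_outcome is hypothetical, and treating these two messages as the success criterion is inferred from the sample log only.

```python
def build_outcome(lines, thread_id):
    """Return (succeeded, last_line) for the lines of one build thread."""
    # Match the bracketed field so that the ID is not found inside other numbers.
    tag = "[" + thread_id + "]"
    own = [line.rstrip("\n") for line in lines if tag in line]
    completed = any("build completed" in line for line in own)
    started = any("server started" in line for line in own)
    return completed and started, (own[-1] if own else None)


# Sample lines copied from the logs in this guide.
sample = [
    "[2021-09-11 15:34:46.604][2102134][][gs_ctl]: killing gaussdb by force ...",
    "[2021-09-11 15:34:53.882][2102134][dn_6009_6010][gs_ctl]: build completed(/srv/BigData/mppdb/data2/slave1).",
    "[2021-09-11 15:34:56.136][2102134][dn_6009_6010][gs_ctl]: server started (/srv/BigData/mppdb/data2/slave1)",
    "[2022-01-13 11:28:24.093][41230][dn_6009_6010][gs_rewind]: could not open target file",
]

ok, last = build_outcome(sample, "2102134")
print("build succeeded:", ok)
print("last line:", last)
```

When the build did not succeed, the last line of the thread is usually the best starting point for Step 3.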
Step 3: Use the error messages in the build log to determine the cause of the failure.
[2022-01-13 11:28:24.093][41230][dn_6009_6010][gs_rewind]: could not open target file "/srv/BigData/mppdb/data2/master2/pg_tblspc/534701960/PG_9.2_201611171_dn_6179_6180/16681/535722869": No such file or directory
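Errors of the kind shown above can be picked out of a long build log automatically. The sketch below only recognizes the exact message format of the sample error line; other failure messages (for example network errors) are not covered, and the function name missing_target_files is hypothetical.

```python
import re

# Pattern taken from the sample error line above; other wordings are not matched.
MISSING_FILE = re.compile(
    r'could not open target file "([^"]+)": No such file or directory'
)


def missing_target_files(lines):
    """Return the paths of target files reported as missing."""
    paths = []
    for line in lines:
        match = MISSING_FILE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


# Sample lines copied from the logs in this guide.
sample = [
    "[2021-09-11 15:34:46.654][2102134][dn_6009_6010][gs_rewind]: connected to server success.",
    '[2022-01-13 11:28:24.093][41230][dn_6009_6010][gs_rewind]: could not open target file "/srv/BigData/mppdb/data2/master2/pg_tblspc/534701960/PG_9.2_201611171_dn_6179_6180/16681/535722869": No such file or directory',
]

for path in missing_target_files(sample):
    print("missing:", path)
```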
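Fix: If the failure is confirmed to be caused by a network exception, retry with a full build to repair the instance. If the build fails because some files do not exist on the peer, keep the error messages and contact Huawei engineers.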
https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=93755