记一次虚机强制断电 K8s 集群 etcd pod 挂掉快照丢失(没有备份)问题处理
写在前面
-
不小心拔错电源了,虚机强制关机,开机后集群死掉了 -
记录下解决方案 -
断电导致etcd 快照数据丢失,没有备份.基本上是没办法处理 -
可以找专业的 DBA来处理数据看有没有可能恢复 -
这篇博文的解决办法是删除了 etcd 数据目录中的部分文件。 -
集群可以启动,但是 部署的环境数据都丢失了,包括CNI, 集群自带的 DNS 组件也丢了。 -
理解不足小伙伴帮忙指正 -
不管是生产还是测试, k8s集群 ETCD 一定要备份,ETCD 一定要备份,ETCD 一定要备份 ,重要的话说三遍。
我所渴求的,無非是將心中脫穎語出的本性付諸生活,為何竟如此艱難呢 ------赫尔曼·黑塞《德米安》
当前集群的状态
┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?
重启 docke 和 kubelet 尝试启动
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart docker
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart kubelet.service
还是不行,查看下 maser 节点的 kubelet 日志信息
┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl -u kubelet.service -f
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066 11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
利用 docker 查看下当前存在的 pod 信息
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d9d6471ce936 b51ddc1014b0 "kube-scheduler --au…" 17 minutes ago Up 17 minutes k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6 5425bcbd23c5 "kube-controller-man…" 17 minutes ago Up 17 minutes k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up About a minute k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2 registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6
etcd 没有启动, apiservice 也没有启动。
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
尝试重新启动 etcd
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker restart b5e18722315b
b5e18722315b
查看启动状态
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs b5e18722315b
看一下 etcd
对应的日志
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45
"msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","
"msg": "从快照恢复v3后台失败", "error": "未能找到数据库快照文件(snap: 快照文件不存在)","
断电照成数据文件损坏了,它希望从快照中恢复,但是没有快照。
额,这里没有备份,所以基本上是没有办法修复了。只能通过 kubeadm 重置集群了。
一些补救措施
如果说你希望通过一些其他的方式来启动集群,来获取一些当前集群的配置信息,下面的方式可以尝试,但是我的集群使用了下面的方法,所有的 pods 数据都丢失了,没办法最后重置集群了。
如果你想使用下面的方式,一定要备份删除的 etcd 数据文件
etcd
在 master
是一个静态 pod
,所以我们看下 yaml 文件,配置的数据文件中什么位置
┌──[root@vms81.liruilongs.github.io]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
- --data-dir=/var/lib/etcd
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
- --advertise-client-urls=https://192.168.26.81:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://192.168.26.81:2380
- --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://192.168.26.81:2380
- --name=vms81.liruilongs.github.io
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
对应的数据文件,可以尝试对数据文件进行修复,如果希望集群可以快速启动,可以
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│ ├── 0000000000000058-00000000019a0ba7.snap
│ ├── 0000000000000058-00000000019a32b8.snap
│ ├── 0000000000000058-00000000019a59c9.snap
│ ├── 0000000000000058-00000000019a80da.snap
│ ├── 0000000000000058-00000000019aa7eb.snap
│ └── db
└── wal
├── 0000000000000014-0000000000185aba.wal.broken
├── 0000000000000142-0000000001963c0e.wal
├── 0000000000000143-0000000001977bbe.wal
├── 0000000000000144-0000000001986aa6.wal
├── 0000000000000145-0000000001995ef6.wal
├── 0000000000000146-00000000019a544d.wal
└── 1.tmp
2 directories, 13 files
备份一下数据文件
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member member.tar
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$mv member.tar /tmp/
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/snap/*.snap
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/wal/*.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
重新启动 docker 对应的镜像,或者重新启动 kubectl。
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 2 minutes ago Exited (2) 2 minutes ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af 004811815584 "etcd --advertise-cl…" 3 seconds ago Up 2 seconds k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 3 minutes ago Exited (2) 3 seconds ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
查看 Node 状态
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms155.liruilongs.github.io Ready <none> 76s v1.22.2
vms81.liruilongs.github.io Ready <none> 76s v1.22.2
vms82.liruilongs.github.io Ready <none> 76s v1.22.2
vms83.liruilongs.github.io Ready <none> 76s v1.22.2
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
查看集群当前所有的 Pod 。
┌──[root@vms81.liruilongs.github.io]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME READY STATUS RESTARTS AGE
etcd-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h53m
kube-apiserver-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h51m
kube-controller-manager-vms81.liruilongs.github.io 1/1 Running 17 (3h35m ago) 3h51m
kube-scheduler-vms81.liruilongs.github.io 1/1 Running 16 (3h35m ago) 3h52m
网络相关的 pod 都不在了,而且 k8s 的 dns 组件也没有起来, 这里需要 重新配置网络,有点麻烦,正常情况下如果, 网络相关的组件没有起来, 所有节点应该都是未就绪状态。感觉有点妖。。。时间关系,我需要集群来做实验,所以通过 kubeadm
重置了
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$kubectl apply -f calico.yaml
博文参考
https://github.com/etcd-io/etcd/issues/11949
- 点赞
- 收藏
- 关注作者
评论(0)