Troubleshooting a K8s Cluster After a Forced VM Power-Off: etcd Pod Down, Snapshot Lost (No Backup)

Posted by 山河已无恙 on 2023/03/03 15:04:37

Foreword

  • I pulled the wrong power cable by accident; the VM was forcibly shut down, and after powering it back on the cluster was dead
  • This post records how I handled it
  • The power loss corrupted the etcd snapshot data, and with no backup there is essentially no clean way to recover
  • You can hand the data files to a professional DBA to see whether recovery is possible
  • The workaround in this post was to delete some of the files in the etcd data directory
  • The cluster came back up, but everything deployed in it was lost, including the CNI and even the cluster's built-in DNS component
  • My understanding is limited; corrections from readers are welcome
  • Whether in production or in test, always back up your K8s cluster's etcd. Back up etcd. Back up etcd. Important things are worth saying three times (a minimal backup sketch follows this list).
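
For reference, a minimal snapshot backup with etcdctl looks like the sketch below. The endpoint and certificate paths match the default kubeadm layout (they also appear in the etcd.yaml manifest later in this post); the /backup directory is a placeholder of mine.

# Minimal backup sketch, assuming the default kubeadm certificate layout
# and etcd listening on 127.0.0.1:2379; /backup is a placeholder directory.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Dropping a line like this into cron (or a Kubernetes CronJob) turns it into a scheduled backup.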

What I wanted was only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian


Current state of the cluster

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?
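
The "connection refused" suggests the apiserver itself is down rather than a kubeconfig problem. A quick probe sketch to confirm (the IP is the apiserver address from the error above):

# Probe sketch: is anything listening on the apiserver port?
ss -tlnp | grep 6443 || echo "nothing listening on 6443"
curl -k https://192.168.26.81:6443/healthz   # expect 'connection refused' while it is down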

Restart docker and kubelet to try to bring things up

┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart docker
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart kubelet.service

Still no luck. Check the kubelet logs on the master node

┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl  -u kubelet.service -f
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066   11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to see which pods currently exist

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS              PORTS     NAMES
d9d6471ce936   b51ddc1014b0                                        "kube-scheduler --au…"   17 minutes ago   Up 17 minutes                 k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6   5425bcbd23c5                                        "kube-controller-man…"   17 minutes ago   Up 17 minutes                 k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up About a minute             k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd has not started, and neither has the apiserver (which cannot come up without etcd).

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Try restarting etcd

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it stayed up

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs b5e18722315b

Take a look at the etcd container's logs

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

"msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","

"msg": "从快照恢复v3后台失败", "error": "未能找到数据库快照文件(snap: 快照文件不存在)","

The power loss corrupted the data files; etcd wants to recover from a snapshot, but the snapshot file it needs does not exist.

And since there is no backup here, there is basically no way to repair this properly; the only real option left is to rebuild the cluster with kubeadm.
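
For completeness: had a snapshot backup existed, the recovery path would look roughly like the sketch below. The node name and peer URL come from the etcd.yaml manifest shown later; the snapshot path and restored data dir are placeholders.

# Hypothetical restore sketch, assuming a snapshot saved at /backup/etcd-snapshot.db.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=vms81.liruilongs.github.io \
  --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380 \
  --initial-advertise-peer-urls=https://192.168.26.81:2380 \
  --data-dir=/var/lib/etcd-restored
# Then swap the restored directory into place (or point the manifest's
# --data-dir / hostPath at it) and let kubelet recreate the etcd pod.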

Some remedial measures

If you want to bring the cluster up by some other means to salvage some of its current configuration, you can try the approach below. Be warned: after I used it, all the pod data was gone, and in the end I had to reset the cluster anyway.

If you do try it, make sure to back up the etcd data files before deleting anything.

etcd on the master runs as a static pod, so look at its YAML manifest to find where the data directory is configured.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

The data directory is configured by the flag --data-dir=/var/lib/etcd:

┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
    - --advertise-client-urls=https://192.168.26.81:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.26.81:2380
    - --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.81:2380
    - --name=vms81.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
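
One thing worth knowing before touching /var/lib/etcd: kubelet watches /etc/kubernetes/manifests and keeps recreating the static pod. A sketch of how to stop etcd cleanly while working on the data directory (general static-pod behavior, not a step I took in this session):

# Sketch: stop the static etcd pod while working on its data directory.
# kubelet removes the pod when the manifest disappears and recreates it
# when the file is put back.
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# ... inspect / back up / prune /var/lib/etcd here ...
mv /tmp/etcd.yaml /etc/kubernetes/manifests/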

These are the corresponding data files. In principle you can try to repair them (an optional integrity check sketch follows the listing below); if you just want the cluster to start quickly, you can back everything up and then delete the .snap and .wal files, which is what I did.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
    ├── 0000000000000014-0000000000185aba.wal.broken
    ├── 0000000000000142-0000000001963c0e.wal
    ├── 0000000000000143-0000000001977bbe.wal
    ├── 0000000000000144-0000000001986aa6.wal
    ├── 0000000000000145-0000000001995ef6.wal
    ├── 0000000000000146-00000000019a544d.wal
    └── 1.tmp

2 directories, 13 files
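
Before deleting anything, the backend db can be inspected. A hedged sketch, assuming the etcdutl binary (shipped with etcd 3.5; it may need to be copied out of the etcd image) is available on the host:

# Optional integrity check sketch: report the hash/revision/key count of the
# backend db. Run it only while etcd is stopped.
etcdutl snapshot status /var/lib/etcd/member/snap/db --write-out=table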

Back up the data files, then delete the snapshot and WAL files

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member  member.tar
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$mv member.tar  /tmp/
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf  member/snap/*.snap
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf  member/wal/*.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

Restart the corresponding docker container, or restart kubelet.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   2 minutes ago   Exited (2) 2 minutes ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af   004811815584                                        "etcd --advertise-cl…"   3 seconds ago   Up 2 seconds                          k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   3 minutes ago   Exited (2) 3 seconds ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

Check the node status

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
vms155.liruilongs.github.io   Ready    <none>   76s   v1.22.2
vms81.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms82.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms83.liruilongs.github.io    Ready    <none>   76s   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

List all the pods currently in the cluster.

┌──[root@vms81.liruilongs.github.io]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME                                                 READY   STATUS    RESTARTS         AGE
etcd-vms81.liruilongs.github.io                      1/1     Running   48 (3h35m ago)   3h53m
kube-apiserver-vms81.liruilongs.github.io            1/1     Running   48 (3h35m ago)   3h51m
kube-controller-manager-vms81.liruilongs.github.io   1/1     Running   17 (3h35m ago)   3h51m
kube-scheduler-vms81.liruilongs.github.io            1/1     Running   16 (3h35m ago)   3h52m

All the networking pods are gone, and the cluster's built-in DNS component has not come up either, so the network has to be configured again, which is a hassle. Normally, if the networking components are down, every node should be NotReady, yet here they all show Ready, which feels odd. Due to time constraints (I needed the cluster for experiments), I rebuilt it with kubeadm in the end.
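
If a full reset is not acceptable, the built-in addons can in principle be re-deployed individually. A hedged sketch (note that the kubeadm-config ConfigMap was lost along with the etcd data, so kubeadm may need the original cluster configuration passed back in via --config):

# Hedged sketch: re-create the kube-proxy and CoreDNS addons without a reset.
kubeadm init phase addon kube-proxy --kubeconfig /etc/kubernetes/admin.conf
kubeadm init phase addon coredns --kubeconfig /etc/kubernetes/admin.conf

The CNI itself is re-applied the usual way: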

┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$kubectl apply -f calico.yaml
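
For the record, the reset path I ended up taking looks roughly like this sketch (the image repository mirrors the aliyuncs registry this cluster already uses; the pod network CIDR is an assumption and must match the one in calico.yaml):

# Rebuild sketch for a single-control-plane kubeadm cluster.
kubeadm reset -f
kubeadm init --image-repository registry.aliyuncs.com/google_containers \
  --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube && cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl apply -f calico.yaml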

References


https://github.com/etcd-io/etcd/issues/11949
