Notes on Fixing Partially NotReady K8s Cluster Nodes Caused by Disk Corruption After a Forced VM Power-Off

Posted by 山河已无恙 on 2023/03/01 19:33:39

Preface


  • I ran into this in my own lab environment; this post shares how I resolved it
  • My understanding is limited; corrections from readers are welcome

I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian


What went wrong

At lunchtime I managed to lock my keys inside and had to rush home to wait for a locksmith. Since I also needed to take the office NUC home, I force-powered-off my own NUC, and when I got back, the K8s cluster running in VMs on it would not come up. It was not a total disaster: at least the master was still alive, a small mercy. A previous forced shutdown had gone worse: the whole cluster failed to start, the etcd pod was down, there was no backup, and in the end I had to reset the cluster with kubeadm.
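
In hindsight, a periodic etcd snapshot would have made that kubeadm reset avoidable. Below is a minimal sketch for a kubeadm cluster, assuming etcdctl is installed on the control-plane node; the certificate paths are the kubeadm defaults and the backup path is just an example.

# take a snapshot of etcd (run on the control-plane node; etcdctl must be installed)
mkdir -p /var/lib/etcd-backup
ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd-backup/snap-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# sanity-check that the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd-backup/snap-$(date +%F).db -w table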

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

Some of the cluster nodes were NotReady, and their VMs would not boot; instead they dropped straight into emergency mode with the messages below.

[    9.800336] XFS (sda1): Metadata corruption detected at xfs_agf_read_verify+0x78/0x120 [xfs], xfs_agf block 0x4b00001
[    9.808356] XFS (sda1): Unmount and run xfs_repair
[    9.808376] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[    9.808395] ffff88803610a400: 58 41 47 46 ...  XAGF............
               ... (remaining hex dump truncated)
[    9.808515] XFS (sda1): metadata I/O error: block 0x4b00001 ("xfs_trans_read_buf_map") error 117 numblks 1
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report.
:/#

The disk was corrupted and needed repair. What a pain.

How I fixed it

After looking up how to recover the filesystem, the steps were:

  1. Boot the VM and press e at the GRUB menu to edit the default boot entry.
  2. Append rd.break to the end of the line beginning with linux16.
  3. Press Ctrl+X to boot into the rescue (initramfs) shell and run xfs_repair -L /dev/sda1; here sda1 is the damaged partition reported in the boot messages above (see the sketch after this list).
  4. Run reboot.
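
For reference, the commands in the rescue shell look roughly like this. A sketch only: the device name comes from the boot messages above, and -L forces zeroing of the XFS log, which can discard the most recent metadata changes, so it is a last resort.

# inside the initramfs rescue shell
xfs_repair -L /dev/sda1   # force repair with log zeroing; only when a plain xfs_repair refuses to run
reboot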

OK. After repairing the disks one by one and booting the VMs, I checked the nodes again: one node had recovered, but two were still NotReady.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

At first I suspected a kubelet problem, but after checking its status and logs I found nothing wrong.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 二 2023-01-17 20:53:02 CST; 1min 18s ago
   ....
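
Besides systemctl status, the kubelet log itself can be scanned through journalctl; a minimal sketch using standard commands, with an arbitrary time window:

journalctl -u kubelet.service --since "1 hour ago" --no-pager | grep -iE "error|fail" | tail -n 20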

Then, among the cluster events, I found messages such as "Is the docker daemon running?" and "Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get events | grep -i error
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "kube-proxy-htg7t": error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/create?name=k8s_POD_kube-proxy-htg7t_kube-system_85fe510d-d713-4fe6-b852-dd1655d37fff_15": EOF
44m         Warning   FailedKillPod            pod/skooner-5b65f884f8-9cs4k                          error killing pod: failed to "KillPodSandbox" for "eb888be0-5f30-4620-a4a2-111f14bb092d" with KillPodSandbo
Error: "rpc error: code = Unknown desc = [networkPlugin cni failed to teardown pod \"skooner-5b65f884f8-9cs4k_kube-system\" network: error getting ClusterInformation: Get \"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]"
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

So it looked like docker might not be running on some nodes, and I checked the docker status on one of the NotReady nodes.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl  status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com

1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:30 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
......

Sure enough, docker had failed to start, and the message said one of its dependencies had failed. Let's list docker's forward dependencies, i.e. the units that are started before docker:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl list-dependencies docker.service
docker.service
● ├─containerd.service
● ├─docker.socket
● ├─system.slice
● ├─basic.target
● │ ├─microcode.service
● │ ├─rhel-autorelabel-mark.service
● │ ├─rhel-autorelabel.service
● │ ├─rhel-configure.service
● │ ├─rhel-dmesg.service
● │ ├─rhel-loadmodules.service
● │ ├─selinux-policy-migrate-local-changes@targeted.service
● │ ├─paths.target
● │ ├─slices.target
● │ │ ├─-.slice
● │ │ └─system.slice
● │ ├─sockets.target
............................

Next, look at the first dependency, containerd.service. It turned out that it had not started successfully either:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since 二 2023-01-17 21:14:58 CST; 4s ago
     Docs: https://containerd.io
  Process: 6494 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 6491 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 6494 (code=exited, status=2)

1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Failed to start containerd container runtime.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Unit containerd.service entered failed state.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: containerd.service failed.
┌──[root@vms82.liruilongs.github.io]-[~]
└─$

There was no further detail beyond the fact that the start had failed, so let's try restarting it:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl restart containerd.service
Job for containerd.service failed because the control process exited with error code. See "systemctl status containerd.service" and "journalctl -xe" for details.

Check the containerd service logs, starting with the error entries:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$journalctl -u  containerd | grep -i error -m 3
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203387028+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203699262+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
1月 17 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.204050775+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
┌──[root@vms82.liruilongs.github.io]-[~]
└─$

The logs gave the messages below. My guess was that the disk corruption had damaged containerd's on-disk state, so the plan was to back up the /var/lib/containerd/ directory and then clear it out to see whether containerd would start (a sketch of that backup-then-clear step follows the log excerpt).

aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.
path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin
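
A minimal sketch of that backup-then-clear step, assuming containerd is stopped first; the archive path is only an example:

systemctl stop containerd
# keep a copy of the old state in case anything needs to be recovered later
tar -czf /root/containerd-backup-$(date +%F).tar.gz -C /var/lib containerd
rm -rf /var/lib/containerd/*
systemctl start containerd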

Delete everything under the directory:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
io.containerd.content.v1.content/       io.containerd.runtime.v1.linux/         io.containerd.snapshotter.v1.native/    tmpmounts/
io.containerd.metadata.v1.bolt/         io.containerd.runtime.v2.task/          io.containerd.snapshotter.v1.overlayfs/
┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$ls

After deleting, try starting containerd again:

┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since 二 2023-01-17 21:25:13 CST; 51s ago
     Docs: https://containerd.io
  Process: 8180 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 8182 (containerd)
   Memory: 146.8M
   ...........

OK, it started successfully, and the node went back to Ready as well.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

The remaining nodes were handled the same way, one by one (a scripted version of the same clean-up is sketched below).
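
If there were many workers to clean up, the same steps could be looped over ssh; a rough sketch, with the node list and backup path purely illustrative:

for node in 192.168.26.155 192.168.26.82; do
  ssh root@"$node" '
    systemctl stop kubelet docker containerd
    tar -czf /root/containerd-backup-$(date +%F).tar.gz -C /var/lib containerd
    rm -rf /var/lib/containerd/*
    systemctl start containerd docker kubelet
  '
done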

On 192.168.26.155, docker had also failed to start, but the symptom was different: after clearing /var/lib/containerd and starting containerd, the docker service did not come back on its own, and its log contained error-level entries.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$ssh root@192.168.26.155
Last login: Mon Jan 16 02:26:43 2023 from 192.168.26.81
┌──[root@vms155.liruilongs.github.io]-[~]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since 二 2023-01-17 20:20:03 CST; 1h 31min ago
     Docs: https://docs.docker.com
  Process: 2030 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 2030 (code=exited, status=0/SUCCESS)

1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796621853+08:00" level=error msg="712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e37724402f10d3045 cleanup: failed to de...
1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796669296+08:00" level=error msg="Handler for POST /v1.41/containers/712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e377...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.285529266+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containe...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.783878143+08:00" level=info msg="Processing signal 'terminated'"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Stopping Docker Application Container Engine...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.784550238+08:00" level=info msg="Daemon shutdown complete"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: start request repeated too quickly for docker.service
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Failed to start Docker Application Container Engine.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Unit docker.service entered failed state.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

The docker service had hit systemd's start limit (Result: start-limit above) and would not restart on its own, so restart it manually:

┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl restart docker
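
If systemd had refused the restart because of the start limit, clearing the unit's failed state first usually helps; a small sketch using standard systemd commands:

systemctl reset-failed docker.service   # clear the start-limit / failed state
systemctl restart docker.service
systemctl is-active docker              # expect: active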

Check the node status again: all nodes are Ready.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES                  AGE    VERSION
vms155.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready    control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

References


https://blog.csdn.net/qq_35022803/article/details/109287086
