How to recover a highly available K8s cluster when etcd is down on every master node
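Before we start
- This post is a demo of restoring from a backup file after every etcd member in the cluster has gone down
- If I have misunderstood anything, corrections are welcome :) Keep going, everyone
Don't dwell too much on the present or worry too much about the future; once you have lived through certain things, the scenery before you is no longer what it was. (Haruki Murakami)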
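The prerequisite is an etcd backup file. Without an etcd snapshot or some other form of backup, the cluster is probably beyond saving.
This post assumes etcdctl is already installed wherever it is needed.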
Sharing the backup script
Here is a backup script:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash
#@File : etcd_back.sh
#@Time : 2023/01/27 23:00:27
#@Author : Li Ruilong
#@Version : 1.0
#@Desc : etcd backup
#@Contact : 1224965096@qq.com
if [ ! -d /root/back/ ];then
mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)
ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
snapshot save /root/back/snap-${STR_DATE}.db
ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db
sudo chmod o-w,u-w,g-w /root/back/snap-${STR_DATE}.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
How to run it
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 | 5999 | 88 MB |
+----------+----------+------------+------------+
This produces the corresponding backup file
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月 5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月 5 11:45 /root/back/snap-202406051145.db
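When it is time to restore, you usually want the most recent snapshot. Below is a minimal sketch of how that could be picked automatically. The function name `latest_snapshot` is made up for illustration, and it assumes the `snap-YYYYmmddHHMM.db` naming used by the backup script above, so that sorting by name is the same as sorting by time.

```shell
# Hypothetical helper (not part of the original script): print the newest
# snapshot in a backup directory. Assumes names like snap-YYYYmmddHHMM.db,
# so the lexically last file name is also the most recent one.
latest_snapshot() {
    ls_dir="${1:-/root/back}"
    ls_newest=""
    for ls_file in "$ls_dir"/snap-*.db; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$ls_file" ] || continue
        # Globs expand in sorted order, so the last match wins.
        ls_newest="$ls_file"
    done
    [ -n "$ls_newest" ] || return 1
    printf '%s\n' "$ls_newest"
}
```

For example, `latest_snapshot /root/back` would print the path to pass to the copy and restore steps, and it returns a non-zero status if the directory holds no snapshot.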
The script can be wrapped in a systemd service unit
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target
[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
[Install]
WantedBy=multi-user.target
The main benefit is that logs are easy to view and the job is easy to manage
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 | 3753 | 88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月 5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Then use a timer unit to run it on a schedule
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"
[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service
[Install]
WantedBy=multi-user.target
Its logs can be viewed the same way
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.timer
-- No entries --
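The post does not show the timer being enabled, and an empty journal may simply mean it has not been started yet. Here is a rough sketch of the commands that would normally do that. The `run` wrapper and the `DRY_RUN` variable are my own additions: by default the commands are only printed, and nothing is executed unless `DRY_RUN=0` is set.

```shell
# Dry-run wrapper (an assumption for this sketch): print each command
# unless DRY_RUN=0 is set, in which case the command is executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Reload unit files, enable and start the timer, then show its next run.
enable_backup_timer() {
    run systemctl daemon-reload
    run systemctl enable --now etcd-backup.timer
    run systemctl list-timers etcd-backup.timer
}

enable_backup_timer
```

Running it with `DRY_RUN=0` on the backup host would apply the changes for real.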
Failure handling and recovery
Symptom: the whole cluster has collapsed, and etcd and the apiserver are dead on every master
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?
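Move the static Pod YAML files for etcd and the apiserver out of the manifests directory; the kubelet stops a static Pod once its manifest is gone. How static Pods work is not covered here; interested readers can check the official documentation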
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "mv /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml} /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
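The same step can be sketched as a small function that can safely be run more than once. Everything here is illustrative: the name `stash_manifests` and the idea of passing a dedicated stash directory instead of `/tmp` are assumptions, not something the original procedure uses.

```shell
# Hypothetical helper: move the etcd and kube-apiserver manifests out of
# a manifests directory into a stash directory.
# $1 = manifests directory, $2 = stash directory (created if missing).
stash_manifests() {
    sm_src="$1"
    sm_dst="$2"
    mkdir -p "$sm_dst" || return 1
    for sm_name in etcd.yaml kube-apiserver.yaml; do
        if [ -f "$sm_src/$sm_name" ]; then
            mv "$sm_src/$sm_name" "$sm_dst/" || return 1
        else
            # Already moved (or never there): report and carry on.
            echo "skip: $sm_src/$sm_name not found" >&2
        fi
    done
}
```

On a master this might be called as `stash_manifests /etc/kubernetes/manifests /root/manifests-stash`, where the second path is just an example.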
Clear the current cluster's etcd data files and subdirectories
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Copy the backup file to every etcd node in the cluster
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
"state": "file",
"uid": 0
}
192.168.26.101 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
"state": "file",
"uid": 0
}
192.168.26.102 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
"state": "file",
"uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Confirm that the backup file was copied
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
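Beyond checking that the file exists, its content can be compared against the `checksum` field in the copy output above, which is a SHA-1 digest. A minimal sketch, assuming `sha1sum` is available on the node; the function name `verify_sha1` is made up for this example.

```shell
# Hypothetical helper: compare a file's SHA-1 digest with an expected value,
# for example the "checksum" field printed by the ansible copy module.
# $1 = file path, $2 = expected SHA-1 in hex.
verify_sha1() {
    vs_file="$1"
    vs_expected="$2"
    vs_actual=$(sha1sum "$vs_file" 2>/dev/null | awk '{print $1}')
    if [ -n "$vs_actual" ] && [ "$vs_actual" = "$vs_expected" ]; then
        echo "OK $vs_file"
    else
        echo "MISMATCH $vs_file: got '$vs_actual'" >&2
        return 1
    fi
}
```

In this walkthrough the call would look like `verify_sha1 /root/snap-202403270000.db d8927d8fa47b1e162cb2326ddd968d8227a0555d`, run on each master.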
Run the restore command on one of the nodes
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists
The error says the data directory already exists, so the directory itself has to be deleted as well
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
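If deleting the old data outright feels too risky, a more cautious variant is to rename the directory so that the restore still sees no `/var/lib/etcd`. This is only a sketch of that alternative, not what the walkthrough does; the function name `set_aside_dir` and the `.bak-<timestamp>` suffix are assumptions.

```shell
# Hypothetical helper: rename a directory out of the way instead of deleting
# it, and print the new name. A rename on the same filesystem takes no
# extra disk space.
# $1 = directory to set aside, e.g. /var/lib/etcd
set_aside_dir() {
    sa_dir="${1%/}"   # drop a trailing slash, if any
    if [ ! -d "$sa_dir" ]; then
        echo "nothing to move: $sa_dir" >&2
        return 0
    fi
    sa_bak="${sa_dir}.bak-$(date +%Y%m%d%H%M%S)"
    mv "$sa_dir" "$sa_bak" && printf '%s\n' "$sa_bak"
}
```

The set-aside copy can be removed once the restored cluster has been confirmed healthy.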
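The restore command. The --cert, --key, --cacert and --endpoints flags are carried over from the backup script; snapshot restore only reads the local snapshot file, so they should not be required here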
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
--name vms100.liruilongs.github.io \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--endpoints="https://127.0.0.1:2379" \
--initial-advertise-peer-urls="https://192.168.26.100:2380" \
--initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
--data-dir=/var/lib/etcd
Run it again, and the restore succeeds
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
To restore on the other etcd nodes, change two things in the script:
- --name vms100.liruilongs.github.io
- --initial-advertise-peer-urls="https://192.168.26.100:2380"
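Since only those two values differ between nodes, the command can be generated instead of edited by hand. Below is a minimal sketch that just prints the command for each member. The function name `restore_cmd` is hypothetical, and the TLS and `--endpoints` flags are left out on the assumption that restore works purely on local files.

```shell
# Hypothetical helper: print the restore command for one etcd member.
# $1 = member name, $2 = member IP, $3 = snapshot path.
# The --initial-cluster value is copied from this post's three masters.
restore_cmd() {
    rc_name="$1"
    rc_ip="$2"
    rc_snap="$3"
    rc_cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380"
    rc_cluster="$rc_cluster,vms101.liruilongs.github.io=https://192.168.26.101:2380"
    rc_cluster="$rc_cluster,vms102.liruilongs.github.io=https://192.168.26.102:2380"
    printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name %s --initial-advertise-peer-urls=https://%s:2380 --initial-cluster=%s --data-dir=/var/lib/etcd\n' \
        "$rc_snap" "$rc_name" "$rc_ip" "$rc_cluster"
}

# Print one command per master; nothing is executed here.
for n in 100 101 102; do
    restore_cmd "vms$n.liruilongs.github.io" "192.168.26.$n" /root/snap-202403270000.db
done
```

Each printed line can then be reviewed and run on the matching node.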
Run on node 192.168.26.101
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Run on node 192.168.26.102
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Move the static Pod YAML files back to bring the etcd and apiserver Pods up again
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
Confirm that the static Pod manifests are back in place
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Check the status of the etcd cluster members
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
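For a scripted check, the table above can be piped through a small filter that counts members whose STATUS column reads `started`. This is a sketch that parses the `-w table` layout shown here; the function name `count_started` is invented for the example.

```shell
# Hypothetical helper: read `etcdctl member list -w table` output on stdin
# and print how many members have STATUS "started".
count_started() {
    awk -F'|' '
        NF >= 6 {
            status = $3             # columns: "", ID, STATUS, NAME, ...
            gsub(/ /, "", status)   # strip the padding spaces
            if (status == "started") n++
        }
        END { print n + 0 }         # print 0 when no row matched
    '
}
```

Appending `| count_started` to the `member list -w table` command should print 3 for this three-master cluster.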
Confirm that the cluster has recovered
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms100.liruilongs.github.io Ready control-plane 495d v1.25.1
vms101.liruilongs.github.io Ready control-plane 495d v1.25.1
vms102.liruilongs.github.io Ready control-plane 495d v1.25.1
vms103.liruilongs.github.io Ready <none> 495d v1.25.1
vms105.liruilongs.github.io Ready <none> 495d v1.25.1
vms106.liruilongs.github.io Ready <none> 495d v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
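Reading the node list by eye works for six nodes; for a larger cluster, the same output can be filtered for anything that is not plainly `Ready`. A minimal sketch, with `not_ready_nodes` as a made-up name:

```shell
# Hypothetical helper: read `kubectl get nodes` output (header included)
# on stdin and print the names of nodes whose STATUS is not exactly "Ready".
# Note that a cordoned node ("Ready,SchedulingDisabled") is listed too.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

`kubectl get nodes | not_ready_nodes` prints nothing when every node is healthy.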
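© Copyright of any referenced content linked in this post belongs to its original authors; please let me know of any infringement :)
© 2018-2024 liruilonger@gmail.com, Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)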