Hands-on: Backing Up and Restoring an etcd Cluster in a Kubernetes Environment
In this article we back up etcd, the core component of a Kubernetes cluster, and then restore that backup on a Kubernetes cluster with one master node and one worker node. The steps and verification of the results follow below.
Step 1: Install the etcd client
Install the etcdctl CLI client, which is used to manage the etcd cluster. Here we install it on Ubuntu.
apt install etcd-client
Step 2: Create an Nginx Deployment
We create an nginx Deployment with multiple replicas, which we will later use to verify that the etcd data has been restored.
kubectl create deployment nginx --image=nginx --replicas=5
Verify that the newly deployed Pods are in the Running state:
controlplane $ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-77b4fdf86c-6m8gl 1/1 Running 0 50s
nginx-77b4fdf86c-bfcsr 1/1 Running 0 50s
nginx-77b4fdf86c-bqmqk 1/1 Running 0 50s
nginx-77b4fdf86c-nkh7j 1/1 Running 0 50s
nginx-77b4fdf86c-x946x 1/1 Running 0 50s
Step 3: Back Up the etcd Cluster
Create a directory for the etcd backup:
mkdir etcd-backup
Then run the following command to take the etcd backup.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save ./etcd-backup/etcdbackup.db
Note that you do not need to memorize the certificate paths used in the command above; you can get them from the etcd Pod running in the kube-system namespace. First, list the Pods:
controlplane $ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-784cc4bcb7-xk6q7 1/1 Running 4 38d
canal-9nszc 2/2 Running 0 42m
canal-brzd7 2/2 Running 0 42m
coredns-5d769bfcf4-5mwkn 1/1 Running 0 38d
coredns-5d769bfcf4-w4xs7 1/1 Running 0 38d
etcd-controlplane 1/1 Running 0 38d
kube-apiserver-controlplane 1/1 Running 2 38d
kube-controller-manager-controlplane 1/1 Running 3 (41m ago) 38d
kube-proxy-5b8sx 1/1 Running 0 38d
kube-proxy-5qlc5 1/1 Running 0 38d
kube-scheduler-controlplane 1/1 Running 3 (41m ago) 38d
Now run the following get pods -o yaml command to inspect the etcd Pod's container command, which shows all the certificate paths.
kubectl get pods etcd-controlplane -o yaml -n kube-system
containers:
- command:
- etcd
- --advertise-client-urls=https://172.30.1.2:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://172.30.1.2:2380
- --initial-cluster=controlplane=https://172.30.1.2:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://172.30.1.2:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://172.30.1.2:2380
- --name=controlplane
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
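Since these flags come from etcd's static Pod manifest on the control-plane node, another quick way to pull out just the paths you need for etcdctl is a grep over the manifest file. This sketch assumes the default kubeadm manifest location:

```shell
# Extract only the certificate and data-dir flags from the etcd static
# Pod manifest; the path assumes a kubeadm-managed control plane.
grep -E -- '--(trusted-ca-file|cert-file|key-file|data-dir)=' \
  /etc/kubernetes/manifests/etcd.yaml
```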
Step 4: Verify the Backup Data
Run the following command to get the key count and other details from the new backup:
controlplane $ ETCDCTL_API=3 etcdctl --write-out=table snapshot status ./etcd-backup/etcdbackup.db
+---------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+---------+----------+------------+------------+
| cb4c04c | 4567 | 1346 | 6.0 MB |
+---------+----------+------------+------------+
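Separately from etcdctl's built-in integrity hash, it can also be useful to record a plain file checksum at backup time, so that a snapshot corrupted in transit (for example, while being copied to remote storage) is caught before you attempt a restore. A minimal sketch:

```shell
# Record a checksum next to the snapshot, then verify it later
# (sha256sum -c prints "...: OK" and exits 0 when the file is intact).
sha256sum ./etcd-backup/etcdbackup.db > ./etcd-backup/etcdbackup.db.sha256
sha256sum -c ./etcd-backup/etcdbackup.db.sha256
```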
Step 5: Restore the Backup to the Cluster
Here we delete the nginx Deployment created earlier and then restore the backup, which should bring the nginx Deployment back.
A. Delete the nginx Deployment
controlplane $ kubectl delete deploy nginx
deployment.apps "nginx" deleted
B. Restore the data from the backup
ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db
This creates a folder named default.etcd in the current directory. While restoring the backup, you may hit an error like the following:
controlplane $ ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db
Error: expected sha256 [253 81 3 207 182 43 249 52 218 166 71 135 221 106 6 216 216 21 183 250 36 126 187 251 171 98 91 69 113 40 229 2], got [63 25 34 167 139 91 18 135 249 179 157 115 214 138 237 35 161 237 175 12 61 31 141 130 204 146 143 177 132 241 193 15]
To avoid this, add the --skip-hash-check=true flag to the restore command above, and the default.etcd folder should then be created in the current path without issue:
controlplane $ ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db --skip-hash-check=true
2023-06-28 15:35:36.180956 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
controlplane $ ls
default.etcd etcd-backup filesystem
C. Now we need to stop all running Kubernetes control-plane components so we can swap in the restored etcd data. The static Pod manifests for these components live in the /etc/kubernetes/manifests/ folder; we temporarily move these files out of that path, and the kubelet automatically removes the corresponding Pods.
controlplane $ ls /etc/kubernetes/manifests/
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
controlplane $ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-784cc4bcb7-xk6q7 1/1 Running 4 38d
canal-5lxjg 2/2 Running 0 28m
canal-zv77t 2/2 Running 0 28m
coredns-5d769bfcf4-5mwkn 1/1 Running 0 38d
coredns-5d769bfcf4-w4xs7 1/1 Running 0 38d
etcd-controlplane 1/1 Running 0 38d
kube-apiserver-controlplane 1/1 Running 2 38d
kube-controller-manager-controlplane 1/1 Running 2 38d
kube-proxy-5b8sx 1/1 Running 0 38d
kube-proxy-5qlc5 1/1 Running 0 38d
kube-scheduler-controlplane 1/1 Running 2 38d
controlplane $ mkdir temp_yaml_files
controlplane $ mv /etc/kubernetes/manifests/* temp_yaml_files/
controlplane $ kubectl get pods -n kube-system
The connection to the server 172.30.1.2:6443 was refused - did you specify the right host or port?
As you can see above, once the files are removed from the manifest path, the api-server Pod is terminated and you can no longer reach the cluster. You can also check whether the containers for these components have been killed or are still running. Before the files are moved, the containers are running:
controlplane $ crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
6a2bce359c15b 6f6e73fa8162b 3 seconds ago Running kube-apiserver 0 fe1be6aa651dd kube-apiserver-controlplane
a26534b2e6244 c6b5118178229 4 seconds ago Running kube-controller-manager 0 38fb48a4ebb62 kube-controller-manager-controlplane
58ac164968ec3 86b6af7dd652c 4 seconds ago Running etcd 0 170af0e603a02 etcd-controlplane
e98ef4185206b 6468fa8f98696 4 seconds ago Running kube-scheduler 0 0bd26fd661a2c kube-scheduler-controlplane
7a03436be6ce6 f9c3c1813269c 23 seconds ago Running calico-kube-controllers 7 6da32eed5e939 calico-kube-controllers-784cc4bcb7-xk6q7
1edf2a857f1d4 e6ea68648f0cd 31 minutes ago Running kube-flannel 0 3dac4c0c5960d canal-5lxjg
e249d3e4b2b51 75392e3500e36 31 minutes ago Running calico-node 0 3dac4c0c5960d canal-5lxjg
039999604ba8c ead0a4a53df89 5 weeks ago Running coredns 0 f8b31a08b4907 coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9 1780fa6665ff0 5 weeks ago Running local-path-provisioner 0 1913e8d9cb757 local-path-provisioner-bf548cc96-fchvw
c86359e6bf649 fbe39e5d66b6a 5 weeks ago Running
Once the files are moved, the containers are terminated:
controlplane $ mv /etc/kubernetes/manifests/* temp_yaml_files/
controlplane $ crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
7a03436be6ce6 f9c3c1813269c 2 minutes ago Running calico-kube-controllers 7 6da32eed5e939 calico-kube-controllers-784cc4bcb7-xk6q7
1edf2a857f1d4 e6ea68648f0cd 34 minutes ago Running kube-flannel 0 3dac4c0c5960d canal-5lxjg
e249d3e4b2b51 75392e3500e36 34 minutes ago Running calico-node 0 3dac4c0c5960d canal-5lxjg
039999604ba8c ead0a4a53df89 5 weeks ago Running coredns 0 f8b31a08b4907 coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9 1780fa6665ff0 5 weeks ago Running local-path-provisioner 0 1913e8d9cb757 local-path-provisioner-bf548cc96-fchvw
c86359e6bf649 fbe39e5d66b6a 5 weeks ago Running kube-proxy 0 d69f1cd083173 kube-proxy-5b8sx
D. Now that the api-server, controller-manager, and kube-scheduler are terminated, we move the data from the default.etcd folder into etcd's data-dir. We found that path in Step 3, where the etcd Pod's command set --data-dir=/var/lib/etcd.
controlplane $ cd default.etcd/
controlplane $ ls
member
controlplane $ ls /var/lib/etcd
member
We back up the existing member folder in the /var/lib/etcd/ directory to /var/lib/etcd/member.bak, then move the member folder from the restored backup into /var/lib/etcd/:
controlplane $ cd default.etcd/
controlplane $ ls
member
controlplane $ mv /var/lib/etcd/member/ /var/lib/etcd/member.bak
controlplane $ mv member/ /var/lib/etcd/
controlplane $ ls /var/lib/etcd
member member.bak
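The swap above can be written as one short sequence. DATA_DIR matches the --data-dir from the etcd manifest in this lab, and RESTORE_DIR is wherever etcdctl snapshot restore wrote its output:

```shell
# Swap the restored data into etcd's data directory, keeping the old
# data as member.bak so the change can be rolled back if needed.
DATA_DIR=/var/lib/etcd        # from --data-dir in the etcd manifest
RESTORE_DIR=./default.etcd    # created by `etcdctl snapshot restore`
mv "$DATA_DIR/member" "$DATA_DIR/member.bak"
mv "$RESTORE_DIR/member" "$DATA_DIR/"
```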
E. Now that our data is restored, we stop the kubelet service and then move the YAML files back into the manifests folder.
controlplane $ systemctl stop kubelet
controlplane $ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Wed 2023-06-28 16:03:32 UTC; 6s ago
Docs: https://kubernetes.io/docs/home/
Process: 25011 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, stat>
Main PID: 25011 (code=exited, status=0/SUCCESS)
Jun 28 16:03:30 controlplane kubelet[25011]: E0628 16:03:30.524978 25011 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"htt>
Jun 28 16:03:31 controlplane kubelet[25011]: I0628 16:03:31.195933 25011 status_manager.go:809] "Failed to get status for pod" podUID=4ad6dc12-6828-45>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.196843 25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197110 25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197392 25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197721 25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:32 controlplane systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Jun 28 16:03:32 controlplane kubelet[25011]: I0628 16:03:32.098579 25011 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bun>
Jun 28 16:03:32 controlplane systemd[1]: kubelet.service: Succeeded.
Jun 28 16:03:32 controlplane systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
controlplane $ mv temp_yaml_files/* /etc/kubernetes/manifests/
controlplane $ ls /etc/kubernetes/manifests/
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
Once the files are back in place, we start the kubelet service so that it picks them up and redeploys the components.
controlplane $ systemctl start kubelet
controlplane $ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2023-06-28 16:05:56 UTC; 3s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 60741 (kubelet)
Tasks: 9 (limit: 2339)
Memory: 70.5M
CGroup: /system.slice/kubelet.service
└─60741 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/>
Jun 28 16:05:57 controlplane kubelet[60741]: W0628 16:05:57.729886 60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: E0628 16:05:57.729952 60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: W0628 16:05:57.831598 60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: E0628 16:05:57.832204 60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: W0628 16:05:58.130322 60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.130397 60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.274435 60741 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"htt>
Jun 28 16:05:58 controlplane kubelet[60741]: I0628 16:05:58.360755 60741 kubelet_node_status.go:70] "Attempting to register node" node="controlplane"
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.361160 60741 kubelet_node_status.go:92] "Unable to register node with API server" err="Pos>
Jun 28 16:05:59 controlplane kubelet[60741]: I0628 16:05:59.962674 60741 kubelet_node_status.go:70] "Attempting to register node" node="controlplane"
You can now see the containers running again; it may take a few minutes before kubectl commands start working.
crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
688cfa2890b4f f9c3c1813269c 23 seconds ago Running calico-kube-controllers 12 6da32eed5e939 calico-kube-controllers-784cc4bcb7-xk6q7
db1797e3e2e83 6468fa8f98696 28 seconds ago Running kube-scheduler 0 307a1600b4346 kube-scheduler-controlplane
1dc176c2a599e c6b5118178229 28 seconds ago Running kube-controller-manager 0 f9efc6c4c8d91 kube-controller-manager-controlplane
f70e2103ec1e0 6f6e73fa8162b 29 seconds ago Running kube-apiserver 0 32f49c141ea69 kube-apiserver-controlplane
2e274f5176656 86b6af7dd652c 29 seconds ago Running etcd 0 9c561113f9fcd etcd-controlplane
1edf2a857f1d4 e6ea68648f0cd 47 minutes ago Running kube-flannel 0 3dac4c0c5960d canal-5lxjg
e249d3e4b2b51 75392e3500e36 47 minutes ago Running calico-node 0 3dac4c0c5960d canal-5lxjg
039999604ba8c ead0a4a53df89 5 weeks ago Running coredns 0 f8b31a08b4907 coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9 1780fa6665ff0 5 weeks ago Running local-path-provisioner 0 1913e8d9cb757 local-path-provisioner-bf548cc96-fchvw
c86359e6bf649 fbe39e5d66b6a 5 weeks ago Running
You can now verify that our nginx Deployment, which we deleted after taking the backup, has been restored by running the get pods command:
controlplane $ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-77b4fdf86c-8n7kg 1/1 Running 0 40m
nginx-77b4fdf86c-gmbjm 1/1 Running 0 40m
nginx-77b4fdf86c-pjpnr 1/1 Running 0 40m
nginx-77b4fdf86c-qjxmd 1/1 Running 0 40m
nginx-77b4fdf86c-zhvnv 1/1 Running 0 40m
Congratulations! You have now successfully restored the etcd data.
This article is reproduced from the WeChat official account DevOps云学堂.