Kubernetes Orchestration, Part 8: Managing Stateful Applications with an Operator
Author | 朱晋君
Source | 君哥聊技术 (ID: gh_1f109b82d301)
An operator is a Kubernetes extension that uses Custom Resources to manage applications and their components, following Kubernetes conventions. At its core, you write your own controller, paired with a CRD, to automate operational tasks that Kubernetes' built-in controllers and resource types cannot express on their own.
The automation needs that operators address fall roughly into the following categories:
- Deploying an application on demand
- Taking or restoring a backup of an application's state
- Upgrading application code together with associated changes such as databases or configuration
- Publishing a Service so that applications that do not support the Kubernetes API can discover it
- Simulating failures in the cluster to test its resilience
- Choosing a leader for a distributed application
The list above comes from the official Kubernetes documentation, and it shows that an operator can use custom resources to orchestrate stateful applications.
In this article I walk through the orchestration an operator provides by using etcd-operator to deploy an etcd cluster on Kubernetes.
Setting up the etcd cluster
Download the source code from:
https://github.com/coreos/etcd-operator
Before installing the operator, run the following script:
[root@master etcd-operator]# cd example/rbac/
[root@master rbac]# ./create_role.sh
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=default
clusterrole.rbac.authorization.k8s.io/etcd-operator created
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=default
clusterrolebinding.rbac.authorization.k8s.io/etcd-operator created
As the output shows, this script sets up RBAC permissions for the operator: it creates a ClusterRole (etcd-operator) and a ClusterRoleBinding (etcd-operator). Both objects are cluster-scoped; the NAMESPACE=default in the output refers to the ServiceAccount that the binding targets. If you are not yet familiar with RBAC, see Kubernetes Orchestration, Part 6: RBAC Access Control.
With this in place, etcd-operator has permission to access the apiserver. To see exactly which permissions, inspect the etcd-operator ClusterRole:
[root@master rbac]# kubectl describe ClusterRole etcd-operator
Name: etcd-operator
Labels: <none>
Annotations: <none>
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [*]
events [] [] [*]
persistentvolumeclaims [] [] [*]
pods [] [] [*]
services [] [] [*]
customresourcedefinitions.apiextensions.k8s.io [] [] [*]
deployments.apps [] [] [*]
etcdbackups.etcd.database.coreos.com [] [] [*]
etcdclusters.etcd.database.coreos.com [] [] [*]
etcdrestores.etcd.database.coreos.com [] [] [*]
secrets [] [] [get]
As you can see, its permissions are very broad: apart from secrets, where it only has get, it has every verb on all the listed API objects. Now let's look at the ClusterRoleBinding:
[root@master rbac]# kubectl describe ClusterRoleBinding etcd-operator
Name: etcd-operator
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: etcd-operator
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount default default
As you can see, the subject is the ServiceAccount named default in the default namespace.
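To spot-check what this grants, you can query the authorization layer directly with kubectl auth can-i, impersonating the bound ServiceAccount (a quick sanity check I am adding here, not part of the original setup):
# Expected "yes": the ClusterRole allows all verbs on pods
kubectl auth can-i create pods --as=system:serviceaccount:default:default
# Expected "no": secrets are restricted to get
kubectl auth can-i create secrets --as=system:serviceaccount:default:default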
With RBAC created, let's look at the etcd-operator YAML file (note: the YAML in the source repo does not match my Kubernetes version, v1.17.3, so I modified it and marked the changes):
[root@master example]# cat deployment.yaml
apiVersion: apps/v1 # modified
kind: Deployment
metadata:
  name: etcd-operator
spec:
  selector: # added this selector
    matchLabels:
      app: etcd-operator
  replicas: 1
  template:
    metadata:
      labels:
        app: etcd-operator # key changed from name to app
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.4
        command:
        - etcd-operator
        # Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
        #- -cluster-wide
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
As you can see, it is simply a Deployment with replicas set to 1. Now create the object, i.e. create etcd-operator:
[root@master example]# kubectl create -f deployment.yaml
deployment.apps/etcd-operator created
After a short while, the pod enters the Running state:
[root@master rbac]# kubectl get pod
NAME READY STATUS RESTARTS AGE
etcd-operator-84cf6bc5d5-gfwzn 1/1 Running 0 105s
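If the pod does not come up cleanly, the operator's own logs are the first place to look (the log content will vary with your environment):
kubectl logs deployment/etcd-operator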
When the operator starts, it registers a CRD (Custom Resource Definition); this is exactly what makes it an operator. Let's look at that CRD:
[root@master rbac]# kubectl get crd
NAME CREATED AT
etcdclusters.etcd.database.coreos.com 2020-08-29T06:51:14Z
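The new API group can now be queried like any built-in one; kubectl api-resources lists the kind and the etcd short name registered by the CRD:
kubectl api-resources --api-group=etcd.database.coreos.com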
Next, look at the CRD's details:
[root@master rbac]# kubectl describe crd etcdclusters.etcd.database.coreos.com
Name: etcdclusters.etcd.database.coreos.com
Namespace:
Labels: <none>
Annotations: <none>
API Version: apiextensions.k8s.io/v1
Kind: CustomResourceDefinition
Metadata:
Creation Timestamp: 2020-08-29T06:51:14Z
Generation: 1
Resource Version: 3553
Self Link: /apis/apiextensions.k8s.io/v1/customresourcedefinitions/etcdclusters.etcd.database.coreos.com
UID: e71b69e9-1c55-4f38-a970-dad8484265ea
Spec:
Conversion:
Strategy: None
Group: etcd.database.coreos.com
Names:
Kind: EtcdCluster
List Kind: EtcdClusterList
Plural: etcdclusters
Short Names:
etcd
Singular: etcdcluster
Preserve Unknown Fields: true
Scope: Namespaced
Versions:
Name: v1beta2
Served: true
Storage: true
Status:
Accepted Names:
Kind: EtcdCluster
List Kind: EtcdClusterList
Plural: etcdclusters
Short Names:
etcd
Singular: etcdcluster
Conditions:
Last Transition Time: 2020-08-29T06:51:14Z
Message: no conflicts found
Reason: NoConflicts
Status: True
Type: NamesAccepted
Last Transition Time: 2020-08-29T06:51:14Z
Message: the initial names have been accepted
Reason: InitialNamesAccepted
Status: True
Type: Established
Stored Versions:
v1beta2
Events: <none>
The group defined in this CRD is etcd.database.coreos.com and the kind is EtcdCluster. With this CRD in place, the operator can act as the controller for custom resources of this type.
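Concretely, the apiserver now serves a brand-new REST endpoint for this group/version/kind. You can hit it directly; before any cluster is created it returns an empty EtcdClusterList (shown purely to illustrate the mapping):
kubectl get --raw /apis/etcd.database.coreos.com/v1beta2/namespaces/default/etcdclusters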
Now let's create the cluster. First, the YAML file:
[root@master example]# cat example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  ## Adding this annotation make this cluster managed by clusterwide operators
  ## namespaced operators ignore it
  # annotations:
  #   etcd.database.coreos.com/scope: clusterwide
spec:
  size: 2 # modified; the original value is 3
  version: "3.3.25" # modified; the original value is 3.2.13
The definition is very simple: the cluster size is 2, the etcd version is 3.3.25, and the kind is our custom type EtcdCluster, so this object is a concrete instance of the CRD, i.e. a CR (Custom Resource). Create the cluster:
[root@master example]# kubectl create -f example-etcd-cluster.yaml
etcdcluster.etcd.database.coreos.com/example-etcd-cluster created
After a short while, check the pods; the cluster's 2 members have been created:
NAME READY STATUS RESTARTS AGE
example-etcd-cluster-4t886mhnwv 1/1 Running 0 2m43s
example-etcd-cluster-jkclxffwf5 1/1 Running 0 2m52s
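Besides the pods, the operator creates Services for the cluster: by etcd-operator convention, a headless example-etcd-cluster Service for peer traffic and an example-etcd-cluster-client Service for clients. Assuming the operator puts the same etcd_cluster label on its Services as on the pods (a plain kubectl get svc works regardless), you can list them with:
kubectl get svc -l etcd_cluster=example-etcd-cluster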
Now exec into one of the pods and run some etcd operations; as the output shows, the cluster is already usable:
[root@master ~]# kubectl exec -it example-etcd-cluster-4t886mhnwv -- /bin/sh
/ # etcdctl set test "123456"
123456
/ # etcdctl get test
123456
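Note that etcdctl set/get above use etcd's legacy v2 API, the default for the etcdctl bundled in this image. To go through the v3 API instead (the same API the pods' liveness and readiness probes use, as we will see below), set ETCDCTL_API=3, where put/get replace set/get:
# Still inside the pod's shell
ETCDCTL_API=3 etcdctl put test3 "123456"
ETCDCTL_API=3 etcdctl get test3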
Notes:
1. My local environment is limited to a 2-node Kubernetes cluster, one master and one worker node, and by default pods are not scheduled onto the master. There are two ways around this: one is to tolerate the taint (Taint), as the DaemonSets I covered earlier do; the simpler one is to remove the taint with the following command:
# "master" below is the name of the master node
kubectl taint node master node-role.kubernetes.io/master-
To restore the taint on the master node, run:
kubectl taint node master node-role.kubernetes.io/master=:NoSchedule
2. When I first deployed the cluster, the pods kept failing to start. The docker logs showed that nslookup of the domain name below was failing against the DNS server 10.96.0.10; each pod in the etcd cluster has a check-dns init container that performs exactly this lookup:
example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc
I then recalled that the default service cluster CIDR configured on the kube-apiserver is:
--service-cluster-ip-range=10.96.0.0/12
Changing it to the address below fixed the problem:
# in /etc/kubernetes/manifests/kube-apiserver.yaml
- --service-cluster-ip-range=10.244.0.0/16
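To verify that cluster DNS resolves the per-member names, you can run the same lookup the check-dns container performs from a throwaway pod (a diagnostic step I am adding, not part of the original procedure):
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28.0-glibc -- nslookup example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc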
How the cluster is assembled
First, look at the details of the 2 pods:
[root@master example]# kubectl describe pod example-etcd-cluster-jkclxffwf5
Name: example-etcd-cluster-jkclxffwf5
Namespace: default
Priority: 0
Node: worker1/192.168.59.141
Start Time: Sat, 29 Aug 2020 03:03:57 -0400
Labels: app=etcd
etcd_cluster=example-etcd-cluster
etcd_node=example-etcd-cluster-jkclxffwf5
Annotations: etcd.version: 3.3.25
Status: Running
IP: 10.244.1.3
IPs:
IP: 10.244.1.3
Controlled By: EtcdCluster/example-etcd-cluster
Init Containers:
check-dns:
Container ID: docker://c45f53ff2aa08527fe969bb92dee474d1286b8279418c0493ea94c6fdac2635d
Image: busybox:1.28.0-glibc
Image ID: docker-pullable://busybox@sha256:0b55a30394294ab23b9afd58fab94e61a923f5834fba7ddbae7f8e0c11ba85e6
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
TIMEOUT_READY=0
while ( ! nslookup example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc )
do
# If TIMEOUT_READY is 0 we should never time out and exit
TIMEOUT_READY=$(( TIMEOUT_READY-1 ))
if [ $TIMEOUT_READY -eq 0 ];
then
echo "Timed out waiting for DNS entry"
exit 1
fi
sleep 1
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 29 Aug 2020 03:03:57 -0400
Finished: Sat, 29 Aug 2020 03:03:58 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts: <none>
Containers:
etcd:
Container ID: docker://71f35e088edf6633c8f5edad81cc157b278644c19ab21107e9f3b1facd55f72d
Image: quay.io/coreos/etcd:v3.3.25
Image ID: docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
Ports: 2380/TCP, 2379/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-jkclxffwf5
--initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--initial-cluster-state=new
--initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
State: Running
Started: Sat, 29 Aug 2020 03:03:59 -0400
Ready: True
Restart Count: 0
Liveness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
Readiness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/var/etcd from etcd-data (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
etcd-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/example-etcd-cluster-jkclxffwf5 to worker1
Normal Pulled 15m kubelet, worker1 Container image "busybox:1.28.0-glibc" already present on machine
Normal Created 15m kubelet, worker1 Created container check-dns
Normal Started 15m kubelet, worker1 Started container check-dns
Normal Pulled 15m kubelet, worker1 Container image "quay.io/coreos/etcd:v3.3.25" already present on machine
Normal Created 15m kubelet, worker1 Created container etcd
Normal Started 15m kubelet, worker1 Started container etcd
This pod is the seed member of the etcd cluster, started with --initial-cluster-state=new; members added afterwards have this flag set to existing, as shown below:
[root@master example]# kubectl describe pod example-etcd-cluster-4t886mhnwv
Name: example-etcd-cluster-4t886mhnwv
Namespace: default
Priority: 0
Node: master/192.168.59.132
Start Time: Sat, 29 Aug 2020 03:04:05 -0400
Labels: app=etcd
etcd_cluster=example-etcd-cluster
etcd_node=example-etcd-cluster-4t886mhnwv
Annotations: etcd.version: 3.3.25
Status: Running
IP: 10.244.0.4
IPs:
IP: 10.244.0.4
# other output omitted
Containers:
etcd:
Container ID: docker://495b349704d5f4199b200913b4fb4d5d682258e486490e2ab64da1f7b55a5945
Image: quay.io/coreos/etcd:v3.3.25
Image ID: docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
Ports: 2380/TCP, 2379/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-4t886mhnwv
--initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--initial-cluster-state=existing
State: Running
Started: Sat, 29 Aug 2020 03:04:37 -0400
Ready: True
Restart Count: 0
Liveness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
Readiness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
# other output omitted
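From inside either pod you can confirm that the two members joined the same cluster; the member IDs will differ on every deployment:
[root@master ~]# kubectl exec -it example-etcd-cluster-4t886mhnwv -- /bin/sh
/ # ETCDCTL_API=3 etcdctl member list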
For contrast, here is how an etcd cluster is assembled without etcd operator, i.e. as a static cluster.
First, start one node with the initial-cluster-state parameter set to new and a fixed IP, say 10.244.1.3, so its initial-cluster parameter is master=http://10.244.1.3:2380.
Then start a second etcd node, say with IP 10.244.0.4, with initial-cluster-state set to existing, and join it to the first node. Once it joins, the cluster has 2 members, so the initial-cluster parameter becomes master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380.
Every further node joins the first node in the same way, and the static cluster is built up step by step.
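A minimal sketch of that manual bootstrap, assuming two hosts with the IPs above and the etcd binary on the PATH (the names master/worker1 are illustrative):
# On 10.244.1.3: the seed member bootstraps a brand-new cluster
etcd --name master --data-dir /var/etcd/data \
  --initial-advertise-peer-urls http://10.244.1.3:2380 --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.244.1.3:2379 \
  --initial-cluster master=http://10.244.1.3:2380 --initial-cluster-state new
# Register the second member with the running cluster first
ETCDCTL_API=3 etcdctl --endpoints http://10.244.1.3:2379 member add worker1 --peer-urls=http://10.244.0.4:2380
# On 10.244.0.4: join with state "existing" and the full member list
etcd --name worker1 --data-dir /var/etcd/data \
  --initial-advertise-peer-urls http://10.244.0.4:2380 --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.244.0.4:2379 \
  --initial-cluster master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380 --initial-cluster-state existing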
The advantage of etcd operator is that it automates exactly this static assembly. In the describe output of the two pods above, each pod has a Command; that command is how etcd operator builds the cluster. Here are the two commands again:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-jkclxffwf5
--initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--initial-cluster-state=new
--initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-4t886mhnwv
--initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--initial-cluster-state=existing
Driven by size=2, etcd operator assembles the cluster dynamically: it starts the first member with the first command above, and once that member is up, starts the second member with the second command, joining it to the first member's cluster; this repeats until the number of pods reaches size. From the initial-cluster fields you can also see that the operator-built cluster addresses members by DNS name rather than by fixed IP.
This cluster-building process is precisely the operator's control loop over the CRD: the operator reconciles the actual number of EtcdCluster member pods against the size declared in the EtcdCluster object, which you can watch in action below.
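For example, bump size on the live object and the operator should grow the cluster to match (a quick experiment of mine, not from the original walkthrough):
# Grow the cluster from 2 to 3 members, then watch the new pod appear
kubectl patch etcdcluster example-etcd-cluster --type merge -p '{"spec":{"size":3}}'
kubectl get pod -l etcd_cluster=example-etcd-cluster -w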
etcd operator defines not only cluster creation but also cluster backup and restore; the YAML files live in the two directories below, which I won't go into here:
/root/k8s/etcd-operator/example/etcd-backup-operator
/root/k8s/etcd-operator/example/etcd-restore-operator
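For reference, a backup request follows the same custom-resource pattern. Here is a minimal sketch based on the examples in that repo, assuming the etcd backup operator from the first directory is running, and that my-bucket/etcd.backup and the aws-secret Secret are placeholders you replace with your own S3 path and AWS credentials:
cat <<'EOF' | kubectl create -f -
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: example-etcd-cluster-backup
spec:
  etcdEndpoints:
    - http://example-etcd-cluster-client:2379
  storageType: S3
  s3:
    path: my-bucket/etcd.backup
    awsSecret: aws-secret
EOF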
Summary
In essence, an operator creates a CRD and then runs a controller that drives the lifecycle of resources of that type. Unlike StatefulSet orchestration, which pins the topology by binding stable ordinal numbers to pods, the operator in this article does no such binding. The reason is that etcd operator's orchestration amounts to adding new members to the cluster and removing surplus ones; etcd maintains the cluster topology internally, so bound ordinals would add nothing.